## Usage

1. **Data Fetching**: Use the `fetch_gutendex` function in `fetch_gutendex.py` to retrieve book data from the Gutendex API.
2. **Downloading Texts**: Use the `download_texts` function to download the plain-text file for each book.
3. **Data Enrichment**: Enrich the book data using the functions in `ol_enrichment.py` to gather additional information from the Open Library API.
4. **Preprocessing**: Clean and normalize the downloaded text data with functions in `preprocessing.py`.
5. **Analysis**: Perform vectorization and k-NN analysis using `vectorize_and_knn.py` to find similar books.
6. **Graph Construction**: Build semantic and subject-based graphs with the functions in `graph_builder.py`.

## 0) Configuration

In [6]:
# CONFIG
TOPIC          = "romance"      # Gutenberg category (bookshelf/subject)
LANGS          = "en"           # languages accepted
COPYRIGHT      = "false"        # only get non-copyrighted books
MAX_BOOKS      = 100            # max number of books to download
RATE_LIMIT_S   = 0.15           # delay between requests to not overload servers

# Locate project directories
import pathlib
import sys

from IPython.display import display

# assume notebook is in social-graphs/notebooks -> project root is parent
NOTEBOOK_DIR = pathlib.Path.cwd().resolve()
PROJECT_ROOT = NOTEBOOK_DIR.parent
CODE_DIR = PROJECT_ROOT / "code"


## 1) Getting Gutenberg data

In [8]:
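# Use fetch_gutendex to get metadata of books in the specified topic.
# Import via CODE_DIR rather than `code.fetch_gutendex`: a top-level package
# named `code` can collide with Python's standard-library `code` module.
if str(CODE_DIR) not in sys.path:
    sys.path.insert(0, str(CODE_DIR))
from fetch_gutendex import fetch_gutendex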

metadata_df = fetch_gutendex(
    query=TOPIC,
    limit=MAX_BOOKS,
    languages={LANGS},
    min_downloads=0,
    only_plain=True,
)

print("Fetched rows:", len(metadata_df))
display(metadata_df.head())

Gutendex search: 'romance': 100%|██████████| 100/100 [00:05<00:00, 17.90it/s]

Fetched rows: 100





| | pg_id | title | authors | languages | download_count | subjects | bookshelves | text_url |
|---|---|---|---|---|---|---|---|---|
| 0 | 30254 | The Romance of Lust: A classic Victorian eroti... | [Anonymous] | [en] | 14063 | [Corporal punishment -- Fiction, Erotic fictio... | [Category: Sexuality & Erotica] | https://www.gutenberg.org/ebooks/30254.txt.utf-8 |
| 1 | 35664 | Titan: A Romance. v. 1 (of 2) | [Jean Paul] | [en] | 7736 | [Fiction] | [Category: German Literature, Category: Novels] | https://www.gutenberg.org/ebooks/35664.txt.utf-8 |
| 2 | 5230 | The Invisible Man: A Grotesque Romance | [Wells, H. G. (Herbert George)] | [en] | 6533 | [Mentally ill -- Fiction, Psychological fictio... | [Category: British Literature, Category: Novel... | https://www.gutenberg.org/ebooks/5230.txt.utf-8 |
| 3 | 23997 | Eugene Oneguine [Onegin]: A Romance of Russian... | [Pushkin, Aleksandr Sergeevich] | [en] | 5659 | [Novels in verse, Russia -- Social life and cu... | [Category: Novels, Category: Russian Literatur... | https://www.gutenberg.org/ebooks/23997.txt.utf-8 |
| 4 | 14568 | Sir Gawayne and the Green Knight: An Alliterat... | [] | [en, enm] | 5350 | [Arthurian romances, Gawain (Legendary charact... | [Category: British Literature, Category: Class... | https://www.gutenberg.org/ebooks/14568.txt.utf-8 |
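
The body of `fetch_gutendex` is not shown in this notebook, so the following is only a rough sketch of how a Gutendex response could be turned into rows like the ones above. The helper names (`build_search_url`, `pick_plain_text_url`, `book_to_row`) are hypothetical, and the use of the `search` query parameter is an assumption based on the progress-bar label; the real function may query by `topic` instead.

```python
from urllib.parse import urlencode

GUTENDEX_BOOKS_URL = "https://gutendex.com/books"


def build_search_url(query, languages=("en",), copyright_flag="false"):
    """Build a Gutendex query URL (hypothetical helper, not the project's code)."""
    params = {
        "search": query,                    # assumption: free-text search
        "languages": ",".join(languages),   # comma-separated language codes
        "copyright": copyright_flag,        # "false" keeps public-domain books
    }
    return f"{GUTENDEX_BOOKS_URL}?{urlencode(params)}"


def pick_plain_text_url(formats):
    """Return the first plain-text download URL from a book's `formats` map."""
    for mime_type, url in formats.items():
        if mime_type.startswith("text/plain"):
            return url
    return None


def book_to_row(book):
    """Flatten one entry of the API's `results` list into a metadata row."""
    return {
        "pg_id": book["id"],
        "title": book["title"],
        "authors": [a["name"] for a in book.get("authors", [])],
        "languages": book.get("languages", []),
        "download_count": book.get("download_count", 0),
        "subjects": book.get("subjects", []),
        "bookshelves": book.get("bookshelves", []),
        "text_url": pick_plain_text_url(book.get("formats", {})),
    }
```

Rows built this way can be passed to `pandas.DataFrame(rows)` to get a frame with the same columns as `metadata_df`; books whose `text_url` is `None` would be the ones an `only_plain=True` style filter drops.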


## 2) Downloading texts
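
The signature of the project's `download_texts` function is not shown here, so this is a minimal sketch of the idea under stated assumptions: iterate over `(pg_id, text_url)` pairs, skip files that already exist, and sleep between requests (as `RATE_LIMIT_S` suggests). The names `download_texts_sketch` and `fetch_bytes` are hypothetical.

```python
import pathlib
import time
import urllib.request


def fetch_bytes(url, timeout=30):
    """Download one URL and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_texts_sketch(rows, out_dir, delay_s=0.15, fetch=fetch_bytes):
    """Save each (pg_id, url) pair as <out_dir>/<pg_id>.txt.

    `fetch` is injectable so the loop can be exercised without network access.
    """
    out_dir = pathlib.Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = {}
    for pg_id, url in rows:
        target = out_dir / f"{pg_id}.txt"
        if not target.exists():          # skip books downloaded earlier
            target.write_bytes(fetch(url))
            time.sleep(delay_s)          # be polite to the server
        saved[pg_id] = target
    return saved
```

With the metadata frame from step 1, the pairs could come from `zip(metadata_df["pg_id"], metadata_df["text_url"])`.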

## 3) Use Open Library to add attributes
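
The functions in `ol_enrichment.py` are not shown, so the sketch below only illustrates one plausible approach: query the Open Library search endpoint by title and author, then keep a few fields from the first hit. The helper names and the choice of fields (`first_publish_year`, `subject`) are assumptions, not the project's actual enrichment logic.

```python
from urllib.parse import urlencode

OL_SEARCH_URL = "https://openlibrary.org/search.json"


def build_ol_search_url(title, author=None, limit=1):
    """Build an Open Library search URL (hypothetical helper)."""
    params = {
        "title": title,
        "limit": limit,
        "fields": "key,title,author_name,first_publish_year,subject",
    }
    if author:
        params["author"] = author
    return f"{OL_SEARCH_URL}?{urlencode(params)}"


def extract_ol_attributes(response_json):
    """Pick enrichment attributes from the first search hit, if any."""
    docs = response_json.get("docs", [])
    if not docs:
        return {"ol_key": None, "first_publish_year": None, "ol_subjects": []}
    doc = docs[0]
    return {
        "ol_key": doc.get("key"),
        "first_publish_year": doc.get("first_publish_year"),
        "ol_subjects": doc.get("subject", []),
    }
```

Taking only the first hit is a simplification; titles with several editions or translations may need fuzzier matching.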

## 4) Normalize and clean actual texts
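
As an illustration of what this step might involve (the real logic lives in `preprocessing.py` and is not shown), the sketch below strips the Project Gutenberg header and footer using the usual `*** START OF ... ***` and `*** END OF ... ***` markers, lowercases the text, and collapses whitespace. The function names and the exact cleaning rules are assumptions.

```python
import re

# Marker lines that typically wrap the actual book text in Gutenberg files.
START_RE = re.compile(
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*",
    re.IGNORECASE | re.DOTALL,
)
END_RE = re.compile(
    r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK",
    re.IGNORECASE,
)


def strip_gutenberg_boilerplate(text):
    """Keep only the text between the START and END markers, when present."""
    start = START_RE.search(text)
    if start:
        text = text[start.end():]
    end = END_RE.search(text)
    if end:
        text = text[:end.start()]
    return text


def normalize_text(text):
    """Lowercase, drop everything except a-z, apostrophes and whitespace."""
    text = strip_gutenberg_boilerplate(text).lower()
    text = re.sub(r"[^a-z\s']", " ", text)   # simplification: ASCII letters only
    text = re.sub(r"\s+", " ", text)         # collapse runs of whitespace
    return text.strip()
```

Dropping non-ASCII letters is a shortcut that suits English-only corpora (`LANGS = "en"`); it would damage texts in other languages.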

## 5) Find similarities in texts to construct network
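
The usage list mentions vectorization and k-NN analysis; the contents of `vectorize_and_knn.py` are not shown, so the following is a dependency-free stand-in rather than the project's method. It builds L2-normalized TF-IDF vectors and ranks neighbours by cosine similarity. The smoothed IDF formula and all function names are assumptions chosen for illustration; a library such as scikit-learn would be the more likely choice for 100 full-length books.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse, L2-normalized TF-IDF vector (dict) per document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vec = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({term: w / norm for term, w in vec.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (a dot product)."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(term, 0.0) for term, w in u.items())


def knn(vectors, k):
    """Map each document index to its k most similar (index, similarity) pairs."""
    neighbours = {}
    for i, u in enumerate(vectors):
        sims = [(cosine(u, v), j) for j, v in enumerate(vectors) if j != i]
        sims.sort(key=lambda pair: (-pair[0], pair[1]))  # best first, ties by index
        neighbours[i] = [(j, sim) for sim, j in sims[:k]]
    return neighbours
```

The brute-force comparison is quadratic in the number of books, which is acceptable at `MAX_BOOKS = 100` but would not scale to the whole catalogue.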

## 6) Construct network/graph
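
The usage list describes both semantic and subject-based graphs; `graph_builder.py` itself is not shown, so this sketch only illustrates the two edge definitions with plain dictionaries. It assumes semantic edges come from k-NN neighbour lists of `(index, similarity)` pairs and subject edges link books that share at least one subject. The function names and the weighting choices are hypothetical.

```python
from itertools import combinations


def semantic_edges(neighbours, min_sim=0.0):
    """Undirected edges {(a, b): similarity} from k-NN lists, keeping sim > min_sim."""
    edges = {}
    for i, nbrs in neighbours.items():
        for j, sim in nbrs:
            if sim > min_sim:
                key = (min(i, j), max(i, j))         # one key per unordered pair
                edges[key] = max(sim, edges.get(key, 0.0))
    return edges


def subject_edges(subjects_by_book):
    """Undirected edges {(a, b): n_shared} between books sharing subjects."""
    edges = {}
    for a, b in combinations(sorted(subjects_by_book), 2):
        shared = set(subjects_by_book[a]) & set(subjects_by_book[b])
        if shared:
            edges[(a, b)] = len(shared)
    return edges
```

Either dictionary can be loaded into a graph library afterwards, for example by adding each key as an edge with its value as the weight.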