# Prelude

Contains the following:
1. Imports and helper function definitions
2. Scraping philosopher pages from Wikipedia and downloading them
3. Building the network from either
    - (A) Downloaded files
    - (B) Local pickle file (created the last time A was run)
4. Simple preliminary data analysis of the network

## 1. Imports

In [None]:
import shutil
import os
import networkx as nx
import pickle
from wiki_utils import getJsonResponse, findLinks, build_graph_from_files

# Directory where the downloaded pages are stored
DOWNLOADS_DIR = "downloads"

## 2. Scraping data
Fetches the wikitext of each philosopher's Wikipedia page and saves it as `{philosopher_name}.txt` in the `downloads/` directory.

>**NOTES**
> 1. This takes a while to run.
> 2. It deletes all previous content in `downloads/`.
> 3. Every listed page is downloaded except pages with *no content* and *redirects*, which are skipped.

In [None]:
wiki_links = ["List of philosophers (A–C)", "List of philosophers (D–H)", "List of philosophers (I–Q)", "List of philosophers (R–Z)"]
title_links = []

verbose = False # Debug output during loops
invalid_links = []  # Track titles that could not be saved
redirect_links = []  # Track titles that are redirects

# Delete and recreate the downloads directory
if os.path.exists(DOWNLOADS_DIR):
    shutil.rmtree(DOWNLOADS_DIR)  # Delete the directory and all its contents
os.makedirs(DOWNLOADS_DIR, exist_ok=True)  # Recreate the directory


for wiki_link in wiki_links:
  wiki_markup = getJsonResponse(wiki_link)
  title_links.extend(findLinks(wiki_markup))

# Remove irrelevant links (every occurrence, in case a title is listed more than once)
unwanted_links = {
    "List_of_philosophers",
    "Philosopher",
    "Stanford_Encyclopedia_of_Philosophy",
    "Encyclopedia_of_Philosophy",
    "Routledge_Encyclopedia_of_Philosophy",
    "The_Cambridge_Dictionary_of_Philosophy",
    "The_Oxford_Companion_to_Philosophy",
}
title_links = [link for link in title_links if link not in unwanted_links]

# Writing to files (warning this takes a while)
for title_link in title_links:
  all_wikitext = getJsonResponse(title_link)
  if not all_wikitext:
    if verbose: print(f"Skipping '{title_link}' as it has no content.")
    invalid_links.append(title_link)  # Track invalid pages without modifying the list directly
    continue
  
  # Skip redirects (MediaWiki matches the #REDIRECT keyword case-insensitively)
  if all_wikitext.lstrip().upper().startswith("#REDIRECT"):
    if verbose: print(f"Skipping '{title_link}' as it is a redirect.")
    redirect_links.append(title_link)  # Track redirect pages
    continue
  
  filename = os.path.join(DOWNLOADS_DIR, f"{title_link}.txt")
  with open(filename, "w", encoding="utf-8") as file:
    file.write(all_wikitext) # save all the wikitext into one file

# Keep only the titles that were actually saved
skipped_links = set(invalid_links) | set(redirect_links)
title_links = [link for link in title_links if link not in skipped_links]
print(f"Downloaded {len(title_links)} pages.")
print(f"Skipped {len(invalid_links)} pages with no content.")
print(f"Skipped {len(redirect_links)} redirect pages.")

## 3. Building the network 

### (A) Create from scratch 
From the `downloads/` directory (saves a local pickle file for later use)

In [3]:
S = build_graph_from_files(DOWNLOADS_DIR)
# Save a local copy for later use in (B); the with-block closes the file reliably
with open("graph.pkl", "wb") as f:
    pickle.dump(S, f)
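
The source of `build_graph_from_files` is not shown in this notebook, so the following is only a minimal sketch of how such a step could work, not the helper's actual implementation. It assumes one node per downloaded file, a directed edge whenever a page's wikitext links to another downloaded page, and page titles written with underscores (as in the unwanted-links list above). The names `sketch_build_graph` and `WIKILINK` are hypothetical.

```python
import os
import re

import networkx as nx

# Matches [[Target]], [[Target|label]] and [[Target#Section]]; captures only the target.
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def sketch_build_graph(directory):
    """Hypothetical stand-in for build_graph_from_files; the real helper may differ."""
    # One node per downloaded page, named after the file without its extension.
    titles = {
        os.path.splitext(name)[0]
        for name in os.listdir(directory)
        if name.endswith(".txt")
    }
    graph = nx.DiGraph()
    graph.add_nodes_from(titles)

    for title in titles:
        path = os.path.join(directory, f"{title}.txt")
        with open(path, encoding="utf-8") as fh:
            text = fh.read()
        for target in WIKILINK.findall(text):
            # Assumption: file names use underscores where wikitext uses spaces.
            target = target.strip().replace(" ", "_")
            # Keep only links between downloaded pages and ignore self-links.
            if target in titles and target != title:
                graph.add_edge(title, target)
    return graph
```

Links to pages outside `downloads/` (for example `File:` or category links) are dropped under these assumptions because their targets are not in the set of downloaded titles.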


### (B) OR use local version 
From the `pickle` file created the last time you ran (A)

In [2]:
# Load the graph saved by (A)
with open("graph.pkl", "rb") as f:
    S = pickle.load(f)

## 4. Preliminary data analysis

In [4]:
print(f"Number of nodes: {S.number_of_nodes()}")
print(f"Number of edges: {S.number_of_edges()}")

# Calculating total data size
download_size = sum(os.path.getsize(os.path.join(DOWNLOADS_DIR, f)) for f in os.listdir(DOWNLOADS_DIR) if f.endswith(".txt"))
download_size_mb = download_size / (1024 * 1024)  # Convert bytes to MB
print(f"Size of downloaded data: {download_size_mb:.2f} MB")

Number of nodes: 1366
Number of edges: 10850
Size of downloaded data: 48.80 MB
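
As a possible next step beyond the node and edge counts, a few more summary statistics can be read off the graph with standard `networkx` calls. This is a small sketch, not part of the original analysis; `summarize_graph` is a hypothetical helper, and it works whether `S` is directed or undirected (for a directed graph, the degree counts incoming plus outgoing edges).

```python
import networkx as nx


def summarize_graph(graph):
    """Return basic size and degree statistics for a networkx graph."""
    n = graph.number_of_nodes()
    m = graph.number_of_edges()
    degrees = [degree for _, degree in graph.degree()]
    return {
        "nodes": n,
        "edges": m,
        "density": nx.density(graph),  # share of possible edges that exist
        "mean_degree": sum(degrees) / n if n else 0.0,
        "isolated": sum(1 for degree in degrees if degree == 0),
    }
```

Calling `summarize_graph(S)` returns a dictionary that can be printed alongside the counts above.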
