<a href="https://colab.research.google.com/github/jacomyma/mapping-controversies/blob/main/notebooks/Wikipedia_articles_to_hyperlinks_network_slow_and_clean.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🍣 Wikipedia articles to hyperlinks network (slow and clean)



**Input:** a list of Wikipedia articles (CSV).

**Output:** a network of Wikipedia articles connected by hyperlinks (GEXF).

This script queries Wikipedia for each article in the input list and retrieves the hyperlinks found in each article. It then outputs a directed network where the nodes are exactly the articles of the input list, and the edges are the hyperlinks between these articles (if any).
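The filtering idea described above can be sketched on toy data. This is only an illustration: the article names and the `keep_internal_links` helper are hypothetical and are not part of the notebook.

```python
# Toy data (hypothetical): the links found on each input article.
links_by_article = {
    "A": ["B", "C", "X"],  # "X" is not in the input list
    "B": ["A", "Y"],       # "Y" is not in the input list
    "C": [],
}

def keep_internal_links(links_by_article):
    """Return directed (source, target) pairs whose target is also an input article."""
    articles = set(links_by_article)
    return [
        (source, target)
        for source, targets in links_by_article.items()
        for target in targets
        if target in articles
    ]

edges = keep_internal_links(links_by_article)
print(edges)
```

Links pointing outside the input list ("X" and "Y" here) are dropped, so the network never grows beyond the articles you provided.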

*Note 1: some articles of the list might be redirects to other articles of the list, so the output may have fewer articles than the input.*

*Note 2: this version of the script is dubbed "slow and clean" because it only takes the hyperlinks from the actual content of the article. It requires more parsing and is thus slower, but the links are more relevant because it does not include the article footer. If you do not have the time, [use the quick and dirty version instead](https://colab.research.google.com/github/jacomyma/mapping-controversies/blob/main/notebooks/Wikipedia_articles_to_hyperlinks_network_quick_and_dirty.ipynb).*

## How to use

1. Put your input file in the same folder as the notebook
1. Edit the settings if needed
1. Run all the cells
1. Take the output file from the notebook folder
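If you do not have an input file yet, the sketch below writes a small one with the standard library. The file name `wikipedia-articles-sample.csv` and the three article titles are assumptions chosen for illustration; the `Article` header matches the default column name in the settings, and text fields are quoted because the notebook reads the file with `csv.QUOTE_NONNUMERIC`.

```python
import csv

# Hypothetical sample file name and article titles, for illustration only.
sample_path = "wikipedia-articles-sample.csv"
rows = [
    ["Article"],          # header: must match article_name_column
    ["Climate change"],
    ["Greenhouse gas"],
    ["Carbon dioxide"],
]

# Quote every text field, since the notebook reads with QUOTE_NONNUMERIC.
with open(sample_path, "w", newline="", encoding="utf8") as f:
    writer = csv.writer(f, quoting=csv.QUOTE_NONNUMERIC)
    writer.writerows(rows)

# Read the file back to check that it round-trips.
with open(sample_path, newline="", encoding="utf8") as f:
    rows_back = list(csv.reader(f, quoting=csv.QUOTE_NONNUMERIC))

print(rows_back)
```

To use such a file, set `input_file` to its name in the settings.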

# SETTINGS

In [None]:
# Input file
input_file = "wikipedia-articles.csv"

# Which column contains the article title?
article_name_column = "Article"

# Output file
output_file = "Wikipedia-articles-hyperlinks-network.GEXF"

# SCRIPT

### Install and import libraries
This notebook relies on existing libraries, which this cell installs and imports.
You can ignore the output.

In [None]:
# Install (if needed)
!pip install wikipedia-api
!pip install pandas
!pip install networkx

# Import
import wikipediaapi
import pandas as pd
import networkx as nx
import csv

print("Done.")

### Read the input file

In [None]:
article_df = pd.read_csv(input_file, quotechar='"', encoding='utf8', doublequote=True, quoting=csv.QUOTE_NONNUMERIC, dtype=object)
print("Preview of the article list:")
article_df
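Before harvesting, it can help to check that the configured column exists and to spot duplicate titles. The sketch below does this with the standard library on in-memory CSV text; the `check_article_column` helper and the sample data are hypothetical and not part of the notebook.

```python
import csv
import io
from collections import Counter

def check_article_column(csv_text, column):
    """Return (titles, duplicates) for `column`, or raise KeyError if it is missing."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if column not in (reader.fieldnames or []):
        raise KeyError("Column not found: " + column)
    titles = [row[column] for row in reader]
    duplicates = sorted(t for t, n in Counter(titles).items() if n > 1)
    return titles, duplicates

# Hypothetical input with one repeated title.
sample_csv = '"Article"\n"Climate change"\n"Greenhouse gas"\n"Climate change"\n'
titles, duplicates = check_article_column(sample_csv, "Article")
print(titles)
print(duplicates)
```

Duplicates are harmless here because the harvesting step skips titles it has already seen, but a missing column would stop the script.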

### Harvest Wikipedia

In [None]:
# This is an object we use to connect to the API.
# Note that we configure it to use the English Wikipedia.
wiki_wiki = wikipediaapi.Wikipedia(
  language='en',
  extract_format = wikipediaapi.ExtractFormat.WIKI,
  user_agent='ControversyMapping/0.0 (https://jacomyma.github.io/mapping-controversies/)'
)

seen = set() # A set makes the "already harvested?" test fast
network = {}
total = len(article_df.index)
print("Harvesting all links from "+str(total)+" Wikipedia pages. This might take a while...")
count = 0

# Harvest each article one by one
for title in article_df[article_name_column]:
  if title not in seen: # Do not harvest the same article twice
    seen.add(title)
    try:
      page = wiki_wiki.page(title)
      # page.links maps the title of each linked article to a page object
      text_links = sorted(page.links.keys())
      network[title] = text_links
    except Exception:
      print('SKIPPED: '+str(title)+' (an error occurred)')
  # Report progress after the page has actually been processed
  count = count+1
  if count % 50 == 0:
    print("All links harvested from "+str(count)+" pages out of "+str(total)+". Continuing...")

print("Done.")
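The harvesting pattern (visit each title once, keep going when a lookup fails) can be tried offline. In this sketch, `harvest`, `fake_fetch` and `FAKE_WIKI` are hypothetical stand-ins: no request is sent to Wikipedia, and the real notebook calls `wiki_wiki.page(title).links` instead.

```python
def harvest(titles, fetch_links):
    """Collect sorted links for each unique title; record titles whose lookup fails."""
    network = {}
    skipped = []
    for title in dict.fromkeys(titles):  # unique titles, input order preserved
        try:
            network[title] = sorted(fetch_links(title))
        except Exception:
            skipped.append(title)
    return network, skipped

# Hypothetical offline stand-in for the Wikipedia lookup.
FAKE_WIKI = {"A": {"C", "B"}, "B": {"A"}}

def fake_fetch(title):
    return FAKE_WIKI[title]  # raises KeyError for unknown titles

network, skipped = harvest(["A", "B", "A", "Missing"], fake_fetch)
print(network)
print(skipped)
```

Passing the lookup as a function keeps the loop testable without network access.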

### Build network

In [None]:
print("Building network...")

# Build the nodes, with attributes
nodes = []
for index, row in article_df.iterrows():
  nodes.append((row[article_name_column], {**row, 'label':row[article_name_column]}))

# Build edges (only between source nodes)
edges = []
source_node_set = set(network)
for source in network:
  for target in network[source]:
    # Keep a link only if it points to another harvested article
    if target in source_node_set:
      edges.append((source, target))

# Assemble
G = nx.DiGraph()
G.add_nodes_from(nodes)
G.add_edges_from(edges)

print("Saving network...")
nx.write_gexf(G, output_file)

print('Done.')
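A quick sanity check of the output is to count its nodes and edges. GEXF is an XML format, so the sketch below counts `node` and `edge` elements with the standard library. The `count_gexf_elements` helper and the inline sample are hypothetical; the sample is a hand-written approximation of a GEXF file, not actual output from this notebook.

```python
import xml.etree.ElementTree as ET

def count_gexf_elements(gexf_text):
    """Return (node_count, edge_count) for GEXF text, ignoring the XML namespace."""
    root = ET.fromstring(gexf_text)

    def local(tag):
        # "{namespace}node" -> "node"
        return tag.rsplit("}", 1)[-1]

    nodes = sum(1 for el in root.iter() if local(el.tag) == "node")
    edges = sum(1 for el in root.iter() if local(el.tag) == "edge")
    return nodes, edges

# Hand-written sample (assumption), small enough to check by eye.
SAMPLE_GEXF = """
<gexf xmlns="http://www.gexf.net/1.2draft" version="1.2">
  <graph defaultedgetype="directed">
    <nodes>
      <node id="A" label="A"/>
      <node id="B" label="B"/>
      <node id="C" label="C"/>
    </nodes>
    <edges>
      <edge id="0" source="A" target="B"/>
      <edge id="1" source="B" target="A"/>
    </edges>
  </graph>
</gexf>
""".strip()

node_count, edge_count = count_gexf_elements(SAMPLE_GEXF)
print(node_count, edge_count)
```

For a real file, read its text first (for example with `open(output_file, encoding="utf8").read()`) and pass it to the helper. The node count should match the number of rows in your input list.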