<a href="https://colab.research.google.com/github/jacomyma/mapping-controversies/blob/main/notebooks/Wikipedia_articles_to_co_reference_network.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Wikipedia articles to co-reference network


**Input:** a list of Wikipedia articles (CSV).

**Output:** a network of Wikipedia articles connected when they share one or more references (GEXF).

This script queries Wikipedia for each article in the input list and retrieves its references. It then outputs a weighted, undirected network whose nodes are exactly the articles of the input list. Two articles are connected by an edge when they have references in common, and the edge weight is the number of shared references divided by the size of the smaller of the two reference lists.

Note: some articles in the list might redirect to other articles in the list, so the output may have fewer articles than the input.

## How to use

1. Put your input file in the same folder as the notebook
1. Edit the settings if needed
1. Run all the cells
1. Take the output file from the notebook folder
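
For reference, the input file could look like the text produced by the sketch below. This is only an illustration: the article titles are arbitrary examples, and the column name assumes the default value of `article_name_column` in the settings. Text fields are quoted because the notebook reads the file with `csv.QUOTE_NONNUMERIC`.

```python
import csv
import io

# Hypothetical article titles, chosen only for illustration
sample_titles = ["Climate change", "Greenhouse gas", "Carbon tax"]

# Write to an in-memory buffer so that no file on disk is touched
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")

# Header row: must match `article_name_column` in the settings
writer.writerow(["Article"])
for title in sample_titles:
    writer.writerow([title])

csv_text = buffer.getvalue()
print(csv_text)
```

Saving this text as `wikipedia-articles.csv` next to the notebook would give a file that matches the default settings.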

# SETTINGS

In [None]:
# Input file
input_file = "wikipedia-articles.csv"

# Which column contains the article title?
article_name_column = "Article"

# Output file
output_file = "wikipedia-articles-coreference-network.GEXF"

# SCRIPT

### Install and import libraries
This notebook relies on existing libraries, which are installed and imported here.
You can ignore the output of this cell.

In [None]:
# Install (if needed)
!pip install wikipedia
!pip install pandas
!pip install networkx

# Import
import wikipedia
import pandas as pd
import networkx as nx
import csv

print("Done.")

### Read the input file

In [None]:
article_df = pd.read_csv(input_file, quotechar='"', encoding='utf8', doublequote=True, quoting=csv.QUOTE_NONNUMERIC, dtype=object)
print("Preview of the article list:")
article_df

### Harvest Wikipedia

In [None]:
article_dict={}
article_list=[]
print("Harvesting references from "+str(len(article_df.index))+" Wikipedia pages. This might take a while...")
count=1
for title in article_df[article_name_column]:
  if count % 50 == 0:
    print("References harvested from "+str(count)+" pages out of "+str(len(article_df.index))+". Continuing...")
  count=count+1
  try:
    page = wikipedia.page(title,auto_suggest=False)
  except wikipedia.exceptions.DisambiguationError:
    print("Wikipedia thinks "+title+" is ambiguous (returns several candidate pages). Trying again with only the first letter capitalized")
    try:
      page = wikipedia.page(title.capitalize(),auto_suggest=False)
      print("Success! "+title+" is no longer ambiguous")
    except wikipedia.exceptions.DisambiguationError:
      print("Wikipedia still thinks "+title+" is ambiguous (returns several candidate pages). Trying again with all lowercase letters")
      try:
        page = wikipedia.page(title.lower(),auto_suggest=False)
        print("Success! "+title+" is no longer ambiguous")
      except wikipedia.exceptions.DisambiguationError:
        print("Wikipedia still thinks "+title+" is ambiguous (returns several candidate pages). Skipping page...")
        continue
  except wikipedia.exceptions.PageError:
    print("The page "+title+" could not be found. Skipping page...")
    continue
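  except Exception as e:
    # Any other error: report it and skip, since `page` would be undefined or left over from a previous title
    print("Unexpected error for "+str(title)+": "+str(e)+". Skipping page...")
    continue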
      
  try:
    refs = page.references
    article_dict[title]={"references":refs}
    article_list.append(title)

  except KeyError:
    print("Could not retrieve references for "+title+". Skipping page...")
    continue
  
print("Successfully retrieved references from "+str(len(article_dict))+" out of "+str(len(article_df.index))+" Wikipedia pages!")
print("Done.")
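
The nested disambiguation retries in the cell above can also be written as a loop over title variants. The sketch below only illustrates that idea: `fetch_first_unambiguous`, `AmbiguousTitle` and `fake_fetch` are hypothetical names that are not part of this notebook or of the `wikipedia` library, and the stand-in fetcher replaces `wikipedia.page` so that the example runs offline.

```python
class AmbiguousTitle(Exception):
    """Hypothetical stand-in for wikipedia.exceptions.DisambiguationError."""


def fetch_first_unambiguous(title, fetch):
    """Try the title as given, then capitalized, then lowercased.

    Returns the first page that `fetch` resolves, or None if every
    variant is reported as ambiguous.
    """
    variants = [title, title.capitalize(), title.lower()]
    # dict.fromkeys drops duplicate variants while keeping their order
    for variant in dict.fromkeys(variants):
        try:
            return fetch(variant)
        except AmbiguousTitle:
            continue
    return None


def fake_fetch(title):
    # Offline stand-in: only the all-lowercase spelling resolves
    if title != "mercury":
        raise AmbiguousTitle(title)
    return {"title": title, "references": ["https://example.org/a"]}


page = fetch_first_unambiguous("MERCURY", fake_fetch)
print(page["title"])
```

In the notebook itself, `fetch` would correspond to a call such as `wikipedia.page(t, auto_suggest=False)`, and the exception caught would be `wikipedia.exceptions.DisambiguationError`.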

### Build network

In [None]:
print("Building network...")

# Build the nodes, with attributes
nodes = []
for index, row in article_df.iterrows():
  nodes.append((row[article_name_column], {**row, 'label':row[article_name_column]}))

# Build edges (only between source nodes)
edges = []
for i,source in enumerate(article_list):
  source_refs = article_dict[source]["references"]
  if len(source_refs)>0:
    for target in article_list[i+1:]:
      if target==source:
        continue
      target_refs=article_dict[target]["references"]
      if len(target_refs)>0:
        overlap = len(set(source_refs).intersection(target_refs))
        if overlap>0:
          # Normalize by the length of the shorter reference list
          norm_overlap_by_smallest = overlap / min(len(source_refs), len(target_refs))
          edge = (source,target,{'overlap':overlap,
                                 'norm_overlap_by_smallest':norm_overlap_by_smallest,
                                 'weight':norm_overlap_by_smallest})
          edges.append(edge)

print("Network has been generated. Saving...")
G = nx.Graph()
G.add_nodes_from(nodes)
G.add_edges_from(edges)
nx.write_gexf(G, output_file)
print("Done.")
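
To make the edge weight concrete, here is a small worked example on made-up reference lists (the URLs are placeholders, not real references). It mirrors the computation in the cell above: the overlap is the number of shared references, and the weight divides it by the length of the shorter list.

```python
# Placeholder reference lists for two hypothetical articles
refs_a = [
    "https://example.org/1",
    "https://example.org/2",
    "https://example.org/3",
    "https://example.org/4",
]
refs_b = [
    "https://example.org/2",
    "https://example.org/3",
    "https://example.org/5",
]

# Number of references that appear in both lists
overlap = len(set(refs_a).intersection(refs_b))

# Normalize by the length of the shorter list (3 references here)
norm_overlap_by_smallest = overlap / min(len(refs_a), len(refs_b))

print(overlap, round(norm_overlap_by_smallest, 3))  # prints: 2 0.667
```

With this normalization, a weight of 1 means that every reference of the shorter list also appears in the longer one.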