<a href="https://colab.research.google.com/github/jacomyma/mapping-controversies/blob/main/notebooks/Wikipedia_articles_to_articles_and_editors_network.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🍄 Wikipedia articles to articles and editors network

**Input:** a list of Wikipedia articles (CSV).

**Output:** a bipartite network of articles and editors connected when the editor edited that article (GEXF).

This script queries Wikipedia for the list of edits (i.e. revisions) of each article in the input list. It then collects the editors behind those edits and builds a network with two types of nodes: articles and editors. You may set a time range; revisions outside that range are ignored, so editors who only contributed outside it will not appear in the network.
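
As a minimal sketch of the structure described above, the snippet below builds a tiny bipartite graph with NetworkX from made-up data. The editor names, article titles and edit counts are hypothetical; the `type` and `count` attribute names mirror those used later in this notebook.

```python
import networkx as nx

# Hypothetical edit counts: editor -> {article title: number of edits}
edits = {
    "EditorA": {"Mushroom": 3, "Fungus": 1},
    "EditorB": {"Fungus": 2},
}

G = nx.Graph()
for editor, articles in edits.items():
    G.add_node(editor, type="editor")
    for title, n_edits in articles.items():
        G.add_node(title, type="article")
        # One edge per (editor, article) pair, carrying the number of edits
        G.add_edge(editor, title, count=n_edits)

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
```

Edges only ever link an editor to an article, never two nodes of the same type, which is what makes the network bipartite.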

## How to use

1. Put your input file in the same folder as the notebook
1. Edit the settings if needed
1. Run all the cells
1. Take the output file from the notebook folder
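
If you do not have an input file yet, one way to create it is sketched below. The article titles are placeholders; the file name and the `Article` column match the default settings of this notebook. Text fields are written with quotes, on the assumption that this is the safest match for the way the notebook reads the file. Note that running this overwrites any existing file with the same name.

```python
import csv
import pandas as pd

# Hypothetical article titles; replace them with your own
titles = ["Mushroom", "Fungus", "Mycelium"]

# "Article" matches the default value of article_name_column in the settings
sample_df = pd.DataFrame({"Article": titles})

# Quote every text field, including the header
sample_df.to_csv(
    "wikipedia-articles.csv",
    index=False,
    quoting=csv.QUOTE_NONNUMERIC,
    encoding="utf8",
)
```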

# SETTINGS

In [None]:
# Input file
input_file = "wikipedia-articles.csv"

# Which column contains the article title?
article_name_column = "Article"

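# Start date (YYYY-MM-DD); revisions made before 00:00 UTC on this day are ignored
start_date = "2000-01-01"

# End date (YYYY-MM-DD); revisions made after 00:00 UTC on this day are ignored
end_date = "2100-01-01"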

# Output file
output_file = "wikipedia-articles-editors-network.gexf"
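
The script later appends `T00:00:00Z` to both dates before sending them to Wikipedia, so a malformed date only surfaces during the harvest. The sketch below shows one way to check the dates up front. The helper `to_api_timestamp` and the `example_*` names are hypothetical and are not used elsewhere in this notebook.

```python
from datetime import datetime

def to_api_timestamp(date_string):
    """Validate a YYYY-MM-DD date and return it as an ISO 8601 UTC timestamp."""
    # strptime raises ValueError when the string does not match the format
    parsed = datetime.strptime(date_string, "%Y-%m-%d")
    return parsed.strftime("%Y-%m-%dT00:00:00Z")

# Same values as the defaults in the settings above
example_start = to_api_timestamp("2000-01-01")
example_end = to_api_timestamp("2100-01-01")

# Timestamps in this fixed format sort chronologically as plain strings
if example_start >= example_end:
    raise ValueError("start_date must be earlier than end_date")

print(example_start, example_end)
```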

# SCRIPT

### Install and import libraries
This notebook draws on existing code.
The cell below imports the libraries it needs and installs any that are missing.
You can ignore the output.

In [None]:
# In this cell Jupyter checks whether you have the right libraries installed

try:  # First, Jupyter tries to import a library
  import requests
  print("Requests library has been imported")
except ImportError:  # If it fails, it will try to install the library
  print("Requests library not found. Installing...")
  !pip install requests
  try:  # ... and try to import it again
    import requests
  except ImportError:  # If the import still fails, report the problem
    print("Something went wrong in the installation of the requests library. Please check your internet connection and consult the installation output in this cell")
try:
  import networkx as nx
  print("NetworkX library has been imported")
except ImportError:
  print("NetworkX library not found. Installing...")
  !pip install networkx

  try:
    import networkx as nx
  except ImportError:
    print("Something went wrong in the installation of the NetworkX library. Please check your internet connection and consult the installation output in this cell")


# Install (if needed)
!pip install pandas

# Import
import pandas as pd
import csv

### Read the input file

In [None]:
article_df = pd.read_csv(input_file, quotechar='"', encoding='utf8', doublequote=True, quoting=csv.QUOTE_NONNUMERIC, dtype=object)
print("Preview of the article list:")
article_df
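
The harvest sends one request series per row of the input, so blank titles waste requests and a title listed twice has its edits counted twice. The sketch below shows one possible way to tidy the title column first. The helper `clean_titles` and the example data are hypothetical and not part of the original script.

```python
import pandas as pd

def clean_titles(df, column):
    """Return the unique, non-empty titles found in df[column], in input order."""
    if column not in df.columns:
        raise KeyError(f"Column {column!r} not found. Available columns: {list(df.columns)}")
    # Drop missing values, then trim surrounding whitespace
    titles = df[column].dropna().astype(str).str.strip()
    # Remove empty strings and repeated titles
    titles = titles[titles != ""]
    return titles.drop_duplicates().tolist()

# Hypothetical input with a padded title, a missing value, a duplicate and a blank
example_df = pd.DataFrame({"Article": ["Mushroom", " Fungus ", None, "Mushroom", ""]})
example_titles = clean_titles(example_df, "Article")
print(example_titles)
```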

### Harvest the list of edits

In [None]:
# Language
lan = "en"

print("Harvesting data from "+str(len(article_df.index))+" input pages...")
S = requests.Session()
count=1
data_index = {}
for title in article_df[article_name_column]:
  Revisions = []
  URL = "https://"+lan+".wikipedia.org/w/api.php"
  if count % 50 == 0:
    print("Data harvested from "+str(count)+" articles out of "+str(len(article_df.index))+". Continuing harvest...")
  PARAMS = {
    "action": "query",
    "prop": "revisions",
    "titles": title,
    "rvlimit": "500",
    "rvprop": "user|timestamp",
    "rvdir": "newer",
    "rvstart": start_date+"T00:00:00Z",
    "rvend": end_date+"T00:00:00Z",
    "formatversion": "2",
    "format": "json"
  }

  R = S.get(url=URL, params=PARAMS)
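  if R.status_code != 200:
    print("Request for \""+str(title)+"\" failed with HTTP status "+str(R.status_code)+". Skipping...")
    count=count+1
    continue
  DATA = R.json()
  for each in DATA.get('query', {}).get('pages', []):
    # The API answers with HTTP 200 even for unknown titles and flags them as "missing"
    if each.get("missing"):
      print("The page \""+str(title)+"\" does not exist. Skipping...")
    Revisions.append(each)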

  # Wikipedia returns at most 500 revisions per request. While the response
  # contains a 'continue' block, merge its values into the original parameters
  # and request the next batch.
  while 'continue' in DATA:
    R = S.get(url=URL, params={**PARAMS, **DATA['continue']})
    DATA = R.json()
    for each in DATA.get('query', {}).get('pages', []):
      Revisions.append(each)

  for each in Revisions:
    if "revisions" in each:
      for every in each["revisions"]:
        if "user" in every:
          user = every["user"]
          if user not in data_index:
            data_index[user] = {}
          if title not in data_index[user]:
            data_index[user][title] = 0
          data_index[user][title] += 1
  count=count+1

print("Done.")
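
The continuation logic in the cell above can be tried out without network access. The sketch below separates the paging loop from the HTTP call and feeds it canned responses. The functions `fetch_all_pages`, `count_edits_per_user` and `fake_fetch`, as well as the response contents, are hypothetical; only the `continue`, `query`, `pages`, `revisions` and `user` keys follow the responses handled by the script.

```python
def fetch_all_pages(fetch, params):
    """Call fetch(params) repeatedly, following 'continue' blocks, and collect the pages."""
    pages = []
    request_params = dict(params)
    while True:
        data = fetch(request_params)
        pages.extend(data.get("query", {}).get("pages", []))
        if "continue" not in data:
            return pages
        # Merge the continuation values into the original parameters
        request_params = {**params, **data["continue"]}

def count_edits_per_user(pages):
    """Count revisions per user across all collected page batches."""
    counts = {}
    for page in pages:
        for revision in page.get("revisions", []):
            if "user" in revision:
                counts[revision["user"]] = counts.get(revision["user"], 0) + 1
    return counts

def fake_fetch(params):
    """Hypothetical stand-in for the API: answers in two batches."""
    if "rvcontinue" not in params:
        return {
            "continue": {"rvcontinue": "token-1", "continue": "||"},
            "query": {"pages": [{"title": "Mushroom",
                                 "revisions": [{"user": "EditorA"}, {"user": "EditorB"}]}]},
        }
    return {"query": {"pages": [{"title": "Mushroom",
                                 "revisions": [{"user": "EditorA"}]}]}}

pages = fetch_all_pages(fake_fetch, {"titles": "Mushroom"})
edit_counts = count_edits_per_user(pages)
print(edit_counts)
```

To run the same loop against Wikipedia, `fake_fetch` could be swapped for something like `lambda p: S.get(url=URL, params=p).json()`, using the session and URL defined in the cell above.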

### Build bipartite network

In [None]:
print("Building network...")

# Build the nodes
nodes = []
for index, row in article_df.iterrows():
  nodes.append((row[article_name_column], {**row, 'label':row[article_name_column], 'type':'article'}))
for user in data_index.keys():
  nodes.append((user, {'label':user, 'type':'editor'}))

# Build edges
edges = []
for user in data_index.keys():
  for title in data_index[user].keys():
    edge = (user,title,{"count":data_index[user][title]})
    edges.append(edge)

print("Network has been generated. Saving...")
G = nx.Graph()
G.add_nodes_from(nodes)
G.add_edges_from(edges)
nx.write_gexf(G, output_file)
print("Done.")
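
To check the result without opening a graph editor, the file can be read back with NetworkX. The sketch below does this on a hypothetical miniature network written to a temporary file, then ranks editors by the number of distinct articles they edited. To inspect your own output, the temporary path could be replaced with `output_file`.

```python
import os
import tempfile
import networkx as nx

# Hypothetical miniature network using the same attributes as the script
G_example = nx.Graph()
G_example.add_node("Mushroom", label="Mushroom", type="article")
G_example.add_node("Fungus", label="Fungus", type="article")
G_example.add_node("EditorA", label="EditorA", type="editor")
G_example.add_node("EditorB", label="EditorB", type="editor")
G_example.add_edge("EditorA", "Mushroom", count=3)
G_example.add_edge("EditorA", "Fungus", count=1)
G_example.add_edge("EditorB", "Fungus", count=2)

# Write to a temporary GEXF file, then read it back
example_path = os.path.join(tempfile.mkdtemp(), "example.gexf")
nx.write_gexf(G_example, example_path)
H = nx.read_gexf(example_path)

# Separate the two node types using the 'type' attribute
editors = [n for n, data in H.nodes(data=True) if data.get("type") == "editor"]
articles = [n for n, data in H.nodes(data=True) if data.get("type") == "article"]

# An editor's degree is the number of distinct articles they edited
ranking = sorted(editors, key=lambda n: H.degree(n), reverse=True)
print(len(articles), "articles,", len(editors), "editors")
print("Editors by number of articles edited:", ranking)
```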