<a href="https://colab.research.google.com/github/jacomyma/mapping-controversies/blob/main/notebooks/Wikipedia_words_and_articles_to_edit_list_with_words.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🍱 Wikipedia words and articles to edit list with words

**Inputs:**
* a SMALL list of Wikipedia articles (CSV)
* a small list of words, about a dozen (CSV)

**Outputs:**
* a list of term-revision pairs, with article and timestamp (CSV)

This notebook reports which words appear in which revisions of which articles, and when.
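
The core check can be sketched as a case-insensitive substring test applied to the text of each revision. The sketch below is only an illustration under stated assumptions: `find_terms` is a hypothetical helper name, and the two sample revisions are made up rather than taken from Wikipedia.

```python
def find_terms(revisions, terms):
    """Return (revid, timestamp, term) for every term found in a revision's text."""
    pairs = []
    for revision in revisions:
        # Some revisions may carry no 'content' key; skip them.
        content = revision.get("content")
        if content is None:
            continue
        text = content.lower()
        for term in sorted(terms):
            # Plain substring test: "ftc" would also match inside a longer word.
            if term.lower() in text:
                pairs.append((revision["revid"], revision["timestamp"], term))
    return pairs


# Two made-up revisions, for illustration only.
sample_revisions = [
    {"revid": 1, "timestamp": "2020-01-01T00:00:00Z", "content": "Google and Yahoo log queries."},
    {"revid": 2, "timestamp": "2020-01-02T00:00:00Z", "content": "The FTC issued guidance."},
]
pairs = find_terms(sample_revisions, {"google", "yahoo", "ftc"})
for pair in pairs:
    print(pair)
```

Because the test is a substring match and not a whole-word match, short terms can produce false positives, which is worth keeping in mind when reading the output.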

## How to use

1. Put your input files in the same folder as the notebook.
1. Edit the settings if needed.
1. Run all the cells.
1. Collect the output file from the notebook folder.

# SETTINGS

In [1]:
# Input file 1: Wikipedia articles
input_file_articles = "wikipedia-articles.csv"
# Which column contains the article title?
article_name_column = "Article"

# Input file 2: small list of words
input_file_words = "words-small-list.csv"
# Which column contains the words?
words_text_column = "text"

# Output files
output_file = "terms-and-revisions.csv"
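
If you do not have input files yet, the following sketch writes two tiny ones with the column names used in the settings above (`Article` and `text`). It is an assumption, based on the `quoting=csv.QUOTE_NONNUMERIC` option used when the files are read, that text fields should be wrapped in double quotes. The sample rows are placeholders, and the files go to a temporary folder so nothing existing is overwritten.

```python
import csv
import os
import tempfile

# Hypothetical sample rows; replace them with your own articles and words.
articles = [["Article"], ["Search engine privacy"], ["Spam blog"]]
words = [["text"], ["google"], ["the european union"]]

# Write to a temporary folder so that no existing file is overwritten.
folder = tempfile.mkdtemp()
articles_path = os.path.join(folder, "wikipedia-articles.csv")
words_path = os.path.join(folder, "words-small-list.csv")

for path, rows in [(articles_path, articles), (words_path, words)]:
    with open(path, "w", newline="", encoding="utf8") as f:
        # QUOTE_NONNUMERIC wraps every text field in double quotes.
        csv.writer(f, quoting=csv.QUOTE_NONNUMERIC).writerows(rows)

# Read one file back to check that the column can be found by name.
with open(words_path, newline="", encoding="utf8") as f:
    read_back = [row["text"] for row in csv.DictReader(f)]
print(read_back)
```

To use the files with the notebook, copy them next to it or point `input_file_articles` and `input_file_words` at their paths.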

# SCRIPT

### Install and import libraries
This notebook relies on existing libraries (pandas and requests).
You can ignore the output of this cell.

In [2]:
# Install (if needed)
!pip install pandas
!pip install requests

# Import
import csv
import pandas as pd
import requests

print("Done.")

Done.


### Read the input file 1 (articles)

In [3]:
article_df = pd.read_csv(input_file_articles, quotechar='"', encoding='utf8', doublequote=True, quoting=csv.QUOTE_NONNUMERIC, dtype=object)
print("Preview of the article list:")
article_df

Preview of the article list:


Unnamed: 0,Article
0,Search engine privacy
1,Member Berries
2,Real-name system
3,CSipSimple
4,Spam blog
5,Worldwide Protests for Free Expression in Bang...
6,FTC fair information practice
7,Spam mass


### Read the input file 2 (words)

In [4]:
word_df = pd.read_csv(input_file_words, quotechar='"', encoding='utf8', doublequote=True, quoting=csv.QUOTE_NONNUMERIC, dtype=object)
print("Preview of the word list:")
word_df

Preview of the word list:


Unnamed: 0,text,type,count-occurences-total,count-documents
0,congress,ORG,5,3
1,the united states,GPE,4,3
2,yahoo,ORG,19,2
3,europe,LOC,2,2
4,the european union,ORG,2,2
5,google,ORG,28,1
6,ftc,ORG,10,1
7,aol,ORG,7,1
8,doubleclick,PERSON,5,1
9,oecd,ORG,5,1


### Harvest Wikipedia

In [5]:
# Index the terms
terms = set()
for index, row in word_df.iterrows():
  terms.add(row[words_text_column])

# Intermediate results are dumped to this file as a safeguard against interruptions
dump_filename = "dump-data.csv"

# Define an empty dataframe for the output datafile
df = pd.DataFrame(columns=['Page','OldRevision_Url','Time','Term'])

# Iterate over the list of pages
for title in article_df[article_name_column]:
  URL = "https://en.wikipedia.org/w/api.php" # we are going to call the API for English Wikipedia
  S = requests.Session()

  # Below are some parameters for the API query. We are getting the ID, timestamp and content of each revision.
  PARAMS = {
    "action": "query",
    "prop": "revisions",
    "titles": title,
    "rvlimit": "500",
    "rvprop": "timestamp|ids|content",
    "rvdir": "newer",
    "rvstart": "2001-01-01"+"T00:00:00Z",
    "formatversion": "2",
    "format": "json"
  }

  R = S.get(url=URL, params=PARAMS)
  DATA = R.json()
  for each in DATA['query']['pages']:
    # The API flags a nonexistent page with a 'missing' key rather than an HTTP 404
    if each.get('missing') or 'revisions' not in each:
      print("No revisions found for " + title + " (the page may not exist)")
      continue
    for revision in each['revisions']:
      for term in terms:
        if 'content' in revision.keys():
          row = [title,'https://en.wikipedia.org/w/index.php?title='+title+'&oldid='+str(revision['revid']),revision['timestamp']]
          # Search for the term
          if term.lower() in revision['content'].lower():
            # and add a result to the data output if the term is found
            row.append(term)
            df.loc[len(df)] = row

    # Dump the latest version of the results
    df.to_csv(dump_filename)
    print('Queried another 500 revisions for ' + title + ' until '+revision['timestamp'])
  
  # When one response does not contain all the revisions, the API returns a 'continue' key and we keep paging through the revisions.
  while 'continue' in DATA.keys():
    PARAMS = {
      "action": "query",
      "prop": "revisions",
      "titles": title,
      "rvlimit": "500",
      "rvprop": "timestamp|ids|content",
      "rvdir": "newer",
      "rvstart": "2001-01-01"+"T00:00:00Z",
      "formatversion": "2",
      "format": "json",
      "rvcontinue": DATA['continue']['rvcontinue']
    }

    R = S.get(url=URL, params=PARAMS)
    DATA = R.json()
    for each in DATA['query']['pages']:
      for revision in each['revisions']:
        for term in terms:
          if 'content' in revision.keys():
            row = [title,'https://en.wikipedia.org/w/index.php?title='+title+'&oldid='+str(revision['revid']),revision['timestamp']]
            #search for the term
            if term.lower() in revision['content'].lower():
              #and add a result to the data output if the term is found
              row.append(term)
              df.loc[len(df)] = row

    # Dump the latest version of the results
    df.to_csv(dump_filename)
    print('Queried another 500 revisions for ' + title + ' until '+revision['timestamp'])

print('Done.')

Queried another 500 revisions for Search engine privacy until 2020-04-18T06:11:36Z
Queried another 500 revisions for Search engine privacy until 2021-12-01T17:34:12Z
Queried another 500 revisions for Member Berries until 2016-09-15T03:13:34Z
Queried another 500 revisions for Member Berries until 2016-09-18T19:58:16Z
Queried another 500 revisions for Member Berries until 2017-01-01T16:39:30Z
Queried another 500 revisions for Member Berries until 2022-01-27T18:22:38Z
Queried another 500 revisions for Real-name system until 2016-08-06T18:55:15Z
Queried another 500 revisions for Real-name system until 2019-04-16T15:13:54Z
Queried another 500 revisions for Real-name system until 2022-01-28T23:41:47Z
Queried another 500 revisions for CSipSimple until 2016-04-17T21:51:35Z
Queried another 500 revisions for CSipSimple until 2022-01-17T07:17:17Z
Queried another 500 revisions for Spam blog until 2005-12-08T08:01:17Z
Queried another 500 revisions for Spam blog until 2006-04-10T20:22:47Z
Queried an
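
The harvesting cell repeats the same request and matching code for the first call and for every continuation. One way to express the paging alone is a generator that keeps requesting until the response no longer contains a `continue` key. This is a sketch, not the notebook's own code: `iter_revisions` and `fetch` are hypothetical names, and the canned responses stand in for the API so the example runs offline.

```python
def iter_revisions(fetch, title):
    """Yield every revision of `title`, following the API's continuation tokens.

    `fetch` is a hypothetical callable that takes a dict of query parameters
    and returns the decoded JSON response as a dict.
    """
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvlimit": "500",
        "rvprop": "timestamp|ids|content",
        "rvdir": "newer",
        "rvstart": "2001-01-01T00:00:00Z",
        "formatversion": "2",
        "format": "json",
    }
    while True:
        data = fetch(params)
        for page in data.get("query", {}).get("pages", []):
            # A missing page has no 'revisions' key, so fall back to an empty list.
            yield from page.get("revisions", [])
        if "continue" not in data:
            break
        # Merge the continuation values into the next request.
        params = {**params, **data["continue"]}


# Canned responses standing in for the API, for illustration only.
fake_responses = [
    {"continue": {"rvcontinue": "token-1", "continue": "||"},
     "query": {"pages": [{"title": "Spam blog", "revisions": [{"revid": 1}, {"revid": 2}]}]}},
    {"query": {"pages": [{"title": "Spam blog", "revisions": [{"revid": 3}]}]}},
]
calls = []

def fake_fetch(params):
    # Record each request and answer with the next canned response.
    calls.append(params)
    return fake_responses[len(calls) - 1]

revids = [rev["revid"] for rev in iter_revisions(fake_fetch, "Spam blog")]
print(revids)
```

With the session from the cell above, `fetch` could be a small function that returns `S.get(url=URL, params=params).json()`; the term matching would then run once over the revisions the generator yields.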

### Save the CSV

In [7]:
try:
  df.to_csv(output_file, index = False, encoding='utf-8')
  print('Done.')
except IOError:
  print("/!\\ Error while writing the output file")

Done.
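
As a possible next step, the output file can be reduced to the earliest revision in which each term appears on each page. The sketch below assumes the column names defined for the output dataframe (`Page`, `OldRevision_Url`, `Time`, `Term`); the rows are made up for illustration and the URL values are placeholders.

```python
import csv
import io

# Made-up rows in the layout of the output file, for illustration only.
sample_output = io.StringIO(
    "Page,OldRevision_Url,Time,Term\n"
    "Spam blog,url-a,2006-03-01T10:00:00Z,google\n"
    "Spam blog,url-b,2005-12-08T08:00:00Z,google\n"
    "Spam blog,url-b,2005-12-08T08:00:00Z,yahoo\n"
)

first_seen = {}
for row in csv.DictReader(sample_output):
    key = (row["Page"], row["Term"])
    # Timestamps in this ISO 8601 format sort chronologically as plain strings.
    if key not in first_seen or row["Time"] < first_seen[key]:
        first_seen[key] = row["Time"]

for (page, term), time in sorted(first_seen.items()):
    print(page, term, time)
```

To run it on real results, replace `sample_output` with the opened output file.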
