<a href="https://colab.research.google.com/github/jacomyma/mapping-controversies/blob/main/notebooks/Enrich_Wikipedia_articles_with_API_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Enrich Wikipedia articles with API data

**Input:** a list of Wikipedia articles (CSV).

**Output:** a list of Wikipedia articles with additional columns (CSV).

This script queries Wikipedia for each article in the input list, retrieves the additional information selected in the settings, and saves the enriched list.

*NOTE: THE SCRIPT IS SLOW. It queries Wikipedia one article at a time.*

## How to use

1. Put your input file in the same folder as the notebook
1. Edit the settings if needed
1. Run all the cells
1. Take the output file from the notebook folder
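
For step 1, a minimal input file could be produced as sketched below. The file name `wikipedia-articles.csv` and the column name `Article` match the default settings; the article titles are placeholders chosen for illustration. Because the reading cell passes `quoting=csv.QUOTE_NONNUMERIC`, the sketch quotes every text field to match.

```python
import csv

# Placeholder titles, for illustration only
titles = ["Climate change", "Greenhouse gas", "Carbon tax"]

# Default file name and column name from the settings
with open("wikipedia-articles.csv", "w", newline="", encoding="utf8") as f:
    writer = csv.writer(f, quoting=csv.QUOTE_NONNUMERIC)  # quotes all text fields
    writer.writerow(["Article"])  # header row
    for title in titles:
        writer.writerow([title])  # one article per row
```

Extra columns in the input file are carried over unchanged to the output file.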

## List of available additional information



* Correct title
* URL
* Summary (text)
* Images (number and/or list)
* Links (number and/or list)
* Categories (number and/or list)
* References (number and/or list)

# SETTINGS

In [None]:
# Input file
input_file = "wikipedia-articles.csv"

# Which column contains the article title?
article_name_column = "Article"

# Output file
output_file = "wikipedia-articles-enriched.csv"

# Which information do you need? The *_list options store whole lists in one cell: keep your file size in check!
get_correct_title = True
get_URL = True
get_summary = True
get_images_count = True
get_images_list = False
get_links_count = True
get_links_list = False
get_categories_count = True
get_categories_list = False
get_references_count = True
get_references_list = False

# SCRIPT

### Install and import libraries
This notebook relies on the `wikipedia` and `pandas` packages.
The cell below installs and imports them; you can ignore its output.

In [None]:
# Install (if needed)
!pip install wikipedia
!pip install pandas

# Import
import wikipedia
import pandas as pd
import io
import csv

print("Done.")

### Read the input file

In [None]:
article_df = pd.read_csv(input_file, quotechar='"', encoding='utf8', doublequote=True, quoting=csv.QUOTE_NONNUMERIC, dtype=object)
print("Preview of the article list:")
article_df

### Harvest Wikipedia

In [None]:
# Serialize a list of values as one CSV-formatted line (string)
def csvize(csvdata):
  output = io.StringIO()
  writer = csv.writer(output, quoting=csv.QUOTE_MINIMAL)
  writer.writerow(csvdata)
  return output.getvalue()

# Declare the data harvesting function
def harvest(title):
  try:
    page = wikipedia.page(title,auto_suggest=False)

  except wikipedia.exceptions.DisambiguationError:
    print("Wikipedia thinks "+title+" is ambiguous (returns several candidate pages). Trying again with only the first letter capitalized")
    try:
      page = wikipedia.page(title.capitalize(),auto_suggest=False)
      print("Success! "+title+" is no longer ambiguous")
    except wikipedia.exceptions.DisambiguationError:
      print("Wikipedia still thinks "+title+" is ambiguous (returns several candidate pages). Trying again with all lowercase letters")
      try:
        page = wikipedia.page(title.lower(),auto_suggest=False)
        print("Success! "+title+" is no longer ambiguous")
      except wikipedia.exceptions.DisambiguationError:
        print("Wikipedia still thinks "+title+" is ambiguous (returns several candidate pages). Skipping page...")
        return {} # Nothing harvested: the row is kept without additional columns
  except wikipedia.exceptions.PageError:
    print("The page "+title+" could not be found. Skipping page...")
    return {}

  except Exception:
    print("The page "+title+" failed for an unknown reason. Skipping...")
    print("")
    return {}
  
  data = {}
  if get_correct_title:
    data['correct-title'] = page.title
  if get_URL:
    data['url'] = page.url
  if get_summary:
    data['summary'] = page.summary
  if get_images_count:
    data['images-count'] = len(page.images)
  if get_images_list:
    data['images'] = csvize(page.images)
  if get_links_count:
    data['links-count'] = len(page.links)
  if get_links_list:
    data['links'] = csvize(page.links)
  if get_categories_count:
    data['categories-count'] = len(page.categories)
  if get_categories_list:
    data['categories'] = csvize(page.categories)
  if get_references_count:
    data['references-count'] = len(page.references)
  if get_references_list:
    data['references'] = csvize(page.references)
  return data
  
# Harvest each article one by one
enriched_article_list = []
seen = set()
print("Harvesting information from "+str(len(article_df.index))+" Wikipedia pages. This might take a while...")
count = 1
for index, row in article_df.iterrows():
  title = row[article_name_column]
  if count % 10 == 0:
    print("Additional information harvested from "+str(count)+" pages out of "+str(len(article_df.index))+". Continuing...")
  if title not in seen: # Do not harvest the same article twice
    seen.add(title)
    try:
      data = harvest(title)
      new_row = {**row, **data}
      enriched_article_list.append(new_row)
    except Exception:
      print('SKIPPED: '+str(title)+' (an error occurred)')
  count = count+1

enriched_articles_df = pd.DataFrame(enriched_article_list)
print("Done.")
print("Preview of the enriched article list:")
enriched_articles_df
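
The disambiguation fallback in `harvest` amounts to trying a few spellings of the title in order. Below is a minimal sketch of that idea, without network access. `lookup`, `fake_lookup` and `AmbiguousTitle` are hypothetical stand-ins for `wikipedia.page` and `wikipedia.exceptions.DisambiguationError`, and `first_unambiguous` is a name made up for this illustration. Note that `str.capitalize()` upper-cases the first character and lower-cases all the others.

```python
class AmbiguousTitle(Exception):
    """Hypothetical stand-in for wikipedia.exceptions.DisambiguationError."""


def first_unambiguous(title, lookup):
    """Return lookup() for the first spelling that resolves, or None."""
    # Same order as in harvest(): as given, capitalized, lowercase
    for variant in (title, title.capitalize(), title.lower()):
        try:
            return lookup(variant)
        except AmbiguousTitle:
            continue  # try the next spelling
    return None


def fake_lookup(variant):
    """Hypothetical lookup: only the all-lowercase spelling resolves."""
    if variant != variant.lower():
        raise AmbiguousTitle(variant)
    return "page:" + variant


print(first_unambiguous("New York", fake_lookup))
```

Here the spellings `New York` and `New york` are rejected by the fake lookup, so the result comes from the lowercase spelling.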

### Save as CSV

In [None]:
try:
  enriched_articles_df.to_csv(output_file, index = False, encoding='utf-8')
  print("Done.")
except IOError:
  print("/!\\ Error while writing the output file")
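
When one of the `*_list` options is enabled, each list is stored in a single cell as one CSV-formatted line produced by `csvize`. Below is a minimal sketch of how such a cell could be turned back into a Python list. `decode_cell` is a name made up for this illustration, and the sample values are placeholders.

```python
import csv
import io


def csvize(csvdata):
    """Same logic as the notebook's csvize: one CSV-formatted line."""
    output = io.StringIO()
    writer = csv.writer(output, quoting=csv.QUOTE_MINIMAL)
    writer.writerow(csvdata)
    return output.getvalue()


def decode_cell(cell):
    """Parse a cell written by csvize back into a list of strings."""
    rows = list(csv.reader(io.StringIO(cell, newline="")))
    return rows[0] if rows else []


cell = csvize(["Paris, France", "Berlin"])  # placeholder values
print(decode_cell(cell))
```

Empty cells, which pandas reads back as NaN, would need to be filtered out before decoding.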