<a href="https://colab.research.google.com/github/jacomyma/mapping-controversies/blob/main/notebooks/Wikipedia_articles_extract_text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🍾 Wikipedia articles extract text

**Input:** a list of Wikipedia articles (CSV).

**Output:** a list of Wikipedia articles with text (CSV).

This script queries Wikipedia for each article in the input list, retrieves its text content, and saves the enriched list. It is slow, because every article requires its own request to Wikipedia.

## How to use

1. Put your input file in the same folder as the notebook
1. Edit the settings if needed
1. Run all the cells
1. Take the output file from the notebook folder
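
For step 1, here is a minimal sketch of how an input file could be produced. It assumes the default settings of this notebook (a column named `Article`) and quotes every text field, which matches the `csv.QUOTE_NONNUMERIC` option used when the file is read. The helper name `write_article_list` is hypothetical and not part of the notebook.

```python
import csv

def write_article_list(path, titles, column="Article"):
    """Write a one-column CSV of article titles (hypothetical helper).

    Every text field is quoted, to match the quoting option the
    notebook uses when it reads the input file.
    """
    with open(path, "w", newline="", encoding="utf8") as f:
        writer = csv.writer(
            f, quotechar='"', doublequote=True, quoting=csv.QUOTE_NONNUMERIC
        )
        writer.writerow([column])   # Header row
        for title in titles:
            writer.writerow([title])  # One article title per row
```

For example, `write_article_list("wikipedia-articles.csv", ["Climate change", "Global warming controversy"])` would create a file that the default settings can read.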

# SETTINGS

In [None]:
# Input file
input_file = "wikipedia-articles.csv"

# Which column contains the article title?
article_name_column = "Article"

# Output file
output_file = "wikipedia-articles-with-text.csv"

# SCRIPT

### Install and import libraries
This notebook relies on existing libraries (`wikipedia-api` and `pandas`), which are installed if needed.
You can ignore the output.

In [None]:
try:
  import wikipediaapi
  print("Wikipedia API library has been imported")
except ImportError:
  print("Wikipedia API library not found. Installing...")
  !pip install wikipedia-api

  try:
    import wikipediaapi
  except ImportError:
    print("Something went wrong in the installation of the Wikipedia API library. Please check your internet connection and consult the installation output in this cell.")

# Install (if needed)
!pip install pandas

# Import
import pandas as pd
import csv

print("Done.")

### Read the input file

In [None]:
article_df = pd.read_csv(input_file, quotechar='"', encoding='utf8', doublequote=True, quoting=csv.QUOTE_NONNUMERIC, dtype=object)
print("Preview of the article list:")
article_df

### Harvest Wikipedia

In [None]:
# Language
lan = "en"

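# Query object
# Note: recent releases of wikipedia-api require a user agent.
# The value below is a placeholder; replace it with your own project name and contact.
wiki_wiki = wikipediaapi.Wikipedia(
  user_agent="mapping-controversies-notebook (replace with your contact info)",
  language=lan,
  extract_format=wikipediaapi.ExtractFormat.WIKI
)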

# Harvest
article_text_list = []
seen = set() # Titles already processed
total = len(article_df.index)
print("Harvesting text from "+str(total)+" Wikipedia pages. This might take a while...")
count = 1
for index, row in article_df.iterrows():
  title = row[article_name_column]
  if title not in seen: # Do not harvest the same article twice
    seen.add(title)
    try:
      p_wiki = wiki_wiki.page(title)
      page_text = p_wiki.text.lower() # Note: the text is stored in lowercase
      new_row = {**row, 'Text': page_text}
      article_text_list.append(new_row)
    except Exception:
      print('SKIPPED: '+str(title)+' (an error occurred)')
  if count % 10 == 0:
    print("Processed "+str(count)+" pages out of "+str(total)+". Continuing...")
  count = count+1

text_articles_df = pd.DataFrame(article_text_list)
print("Done.")
print("Preview of the article list with text:")
text_articles_df
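
The harvesting logic above (skip duplicates, skip on error, store lowercase text) can be sketched as a small function that does not depend on the network. This is an illustration under stated assumptions, not the notebook's actual code: `harvest_texts` and `fetch_text` are hypothetical names, and treating an empty text as a skipped page is an extra check that the cell above does not perform.

```python
def harvest_texts(titles, fetch_text):
    """Return (results, skipped) for a list of article titles.

    fetch_text is any callable that maps a title to its text.
    Both names are hypothetical, used here for illustration only.
    """
    seen = set()
    results = []
    skipped = []
    for title in titles:
        if title in seen:
            continue  # Do not harvest the same article twice
        seen.add(title)
        try:
            text = fetch_text(title)
        except Exception:
            skipped.append(title)  # An error occurred while fetching
            continue
        if not text:
            skipped.append(title)  # Assumption: empty text means nothing to keep
            continue
        results.append({"Article": title, "Text": text.lower()})
    return results, skipped
```

In this notebook, `fetch_text` could be something like `lambda t: wiki_wiki.page(t).text`. Passing the fetcher as an argument makes it possible to try the logic on a few fake pages before running the slow harvest.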

### Save as CSV

In [None]:
try:
  text_articles_df.to_csv(output_file, index = False, encoding='utf-8')
  print("Done.")
except IOError:
  print("/!\\ Error while writing the output file")
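
Once the file is saved, it may be worth checking which articles ended up without any text. The sketch below assumes the output has the default `Article` column plus the `Text` column added by this notebook, and that a page which could not be found is saved with an empty text rather than being skipped. The helper name `titles_without_text` is hypothetical.

```python
import csv

def titles_without_text(path, title_column="Article", text_column="Text"):
    """List the titles whose text is empty in the output CSV (hypothetical helper)."""
    # Article texts can exceed the csv module's default field size limit
    csv.field_size_limit(2**31 - 1)
    with open(path, newline="", encoding="utf-8") as f:
        return [
            row[title_column]
            for row in csv.DictReader(f)
            if not (row.get(text_column) or "").strip()
        ]
```

Calling `titles_without_text(output_file)` after the last cell would return the titles to double-check, for instance because of a typo in the input list.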