# DIGI405 Week 3 - Getting text data: Example corpus-building using an API

For your corpus building project you can collect data using an API. This notebook collects some data using [Wikipedia's API](https://www.mediawiki.org/wiki/API:Main_page), introduces the way Jupyter mixes code and text, and shows that with a few simple edits you can collect new data.

Wikipedia articles are often used for language modelling. However, Wikipedia represents a specific form of discourse that attempts to be factual, so its articles are not a good source for corpus-assisted discourse analysis, where we are often interested in different points of view. What is interesting about Wikipedia is that it has "Talk" pages where Wikipedia editors can discuss specific articles.

**Take a quick look at the [Talk:COVID-19](https://en.wikipedia.org/wiki/Talk:COVID-19) pages. Note: there are also archived conversations.**

These are public pages for Wikipedia editors/users to discuss changes to Wikipedia articles.  

These pages can be studied to understand how Wikipedia articles are produced and specific points of uncertainty, debate, disagreement or controversy about knowledge production.

**Read through the text and comments below. Run each of the cells in turn.**

## Import Python Libraries

In [1]:
# import required code libraries
# requests literally allows us to do web requests
import requests 

# BeautifulSoup is used here to convert HTML to text (more on its use in scraping next lab)
from bs4 import BeautifulSoup

# time library allows us to pause between requests below
import time

# library to do things with the operating system - in this case with the file system
import os

# library used to zip your corpus
import zipfile

## Settings (you can change these!)

This cell allows you to change the wiki page to collect, and other settings related to data collection with the wiki.

Perhaps most importantly you can change the Wikipedia Talk pages to collect and where you are saving your corpus. 

Leave settings as they are the first time you run the notebook.

In [2]:
# copy the page slug for the wiki page you are interested in 
# e.g. for wiki talk pages about https://en.wikipedia.org/wiki/COVID-19, page var should be 'COVID-19'
page = 'COVID-19'

# directory to save corpus - if it doesn't exist it will be created by the next cell
corpus_path = 'talk-corpus/'

# set api path - you could change this to another language wiki
api_url = 'https://en.wikipedia.org/w/api.php'

# set headers for our requests so that wikipedia knows where we are coming from as per their policy here:
# https://meta.wikimedia.org/wiki/User-Agent_policy
# the course page is in there as a contact so don't abuse their API!
headers = {'user-agent': 'DIGI405 Class Exercise/0.1-dev (https://www.canterbury.ac.nz/courseinfo/GetCourseDetails.aspx?course=DIGI405&occurrence=21S2(C)&year=2021)'}

# seconds to pause between requests - set this to conservative number
# don't set this to zero unless you want the class banned from wikipedia
sleep_seconds = 2

## Create the directory to save your corpus

In [3]:
# if corpus path doesn't exist create it (does nothing if it already exists)
os.makedirs(corpus_path, exist_ok=True)

## Define some functions

In [4]:
# convert wikipedia HTML output to text - note more cleanup could be done here
# ideally wikipedia would output text from their API, but they only output HTML or wikitext,
# and both need to be cleaned up to get text.
def wiki_html_to_txt(html):
    # this uses BeautifulSoup - we will work with this next week
    # name the parser explicitly so the result does not depend on which parsers are installed
    soup = BeautifulSoup(html, 'html.parser')
    
    # cleanup the markup a little (removing tags not related to text)
    # more cleanup is possible with the Wiki markup 
    selectors = ['.mw-references-wrap', 'sup.reference', 'style', '.mw-editsection', '#toc', '.tmbox']
    for selector in selectors:
        for s in soup.select(selector):
            s.extract()

    return soup.get_text().strip()
    
# rough and ready function to output readable filenames with filesystem-safe characters
def url_to_filename(url):
    url = url.replace('https://', '').replace('http://', '')
    safe = []
    for x in url:
        if x.isalnum():
            safe.append(x)
        else:
            safe.append('-')
    filename = "".join(safe)

    if len(filename) > 100: # keep filenames to about 100 characters - note this could create a conflict of filenames so check
        filename = filename[:50] + '___' + filename[-50:]
    
    return filename
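
The comment in `url_to_filename` warns that shortening long names could make two different pages share a filename. The sketch below is one possible way to reduce that risk, not part of the original notebook: it appends a short hash of the full name whenever the name is shortened. The function name `safe_filename` and the `max_length` parameter are hypothetical and chosen only for illustration.

```python
import hashlib

def safe_filename(name, max_length=100):
    # hypothetical helper: replace anything that is not a letter or digit with a hyphen
    cleaned = ''.join(ch if ch.isalnum() else '-' for ch in name)
    if len(cleaned) <= max_length:
        return cleaned

    # a short digest of the full original name keeps shortened names distinct
    digest = hashlib.sha1(name.encode('utf8')).hexdigest()[:8]

    # keep equal amounts from the start and the end, leaving room for the digest
    keep = (max_length - len(digest) - 2) // 2
    return cleaned[:keep] + '-' + digest + '-' + cleaned[-keep:]

print(safe_filename('Talk:COVID-19/Archive_1'))
```

Short names pass through with only the character replacement. Long names keep their start and end, so they stay readable, while the digest in the middle differs whenever the original names differ.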

## Collect a corpus of Wikipedia Talk pages

Running the next cell will do some API calls, retrieve Wikipedia's HTML markup, convert this to text and save it.

In [5]:
# initiate a queue of Talk pages related to the page
talk_pages = ['Talk:' + page]

# build URL for request to Wikipedia API to retrieve archived talk pages
# Documentation of API call: https://en.wikipedia.org/w/api.php?action=help&modules=query%2Ballpages
# Note: max 100 results specified via aplimit, apnamespace=1 specifies that we want Talk pages only
url = api_url + '?action=query&list=allpages&apnamespace=1&apprefix='+ page +'/Archive&aplimit=100'
print('Requesting', url) 
print('Click link above for human-readable view of json')

# request url
response = requests.get(url + '&format=json', headers=headers)

# decode the json received
data = response.json()

# loop through the list of pages retrieved and add them to the talk_pages queue
for archive_page in data['query']['allpages']:
    print('Add page:', archive_page['title'])
    talk_pages.append(archive_page['title'].replace(' ','_'))
    
# loop through talk_pages queue
for talk_page in talk_pages:
    # create url for API call to get markup of talk page
    url = api_url + '?action=parse&prop=text&page=' + talk_page
    print('Requesting', url) 
    
    # request url
    response = requests.get(url + '&format=json', headers=headers)

    # decode the json received
    data = response.json()
    
    # convert page title to filename
    filename = url_to_filename(talk_page) + '.txt'
            
    # convert html from the api to txt
    txt = wiki_html_to_txt(data['parse']['text']['*'])
            
    # save txt to file ...
    with open(corpus_path + filename, 'w', encoding='utf8') as f:
        f.write(txt)
            
    # rest before another request
    time.sleep(sleep_seconds)
    
print('All done.')

Requesting https://en.wikipedia.org/w/api.php?action=query&list=allpages&apnamespace=1&apprefix=COVID-19/Archive&aplimit=100
Click link above for human-readable view of json
Add page: Talk:COVID-19/Archive 1
Add page: Talk:COVID-19/Archive 10
Add page: Talk:COVID-19/Archive 11
Add page: Talk:COVID-19/Archive 12
Add page: Talk:COVID-19/Archive 13
Add page: Talk:COVID-19/Archive 14
Add page: Talk:COVID-19/Archive 15
Add page: Talk:COVID-19/Archive 16
Add page: Talk:COVID-19/Archive 17
Add page: Talk:COVID-19/Archive 18
Add page: Talk:COVID-19/Archive 19
Add page: Talk:COVID-19/Archive 2
Add page: Talk:COVID-19/Archive 3
Add page: Talk:COVID-19/Archive 4
Add page: Talk:COVID-19/Archive 5
Add page: Talk:COVID-19/Archive 6
Add page: Talk:COVID-19/Archive 7
Add page: Talk:COVID-19/Archive 8
Add page: Talk:COVID-19/Archive 9
Requesting https://en.wikipedia.org/w/api.php?action=parse&prop=text&page=Talk:COVID-19
Requesting https://en.wikipedia.org/w/api.php?action=parse&prop=text&page=Talk:COV
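
The `allpages` request above asks for at most 100 results. When more results are available, the MediaWiki API includes a `continue` object in its response, and adding those values to the next request fetches the next batch. The sketch below shows that loop under some assumptions: `list_archive_titles`, `fetch_json` and `fake_fetch` are hypothetical names, and `fake_fetch` returns made-up responses so the example runs without touching the network.

```python
def list_archive_titles(fetch_json, page, limit=100):
    # fetch_json is a hypothetical callable: it takes a dict of query
    # parameters and returns the decoded JSON response as a dict
    params = {
        'action': 'query',
        'list': 'allpages',
        'apnamespace': 1,                 # Talk pages only
        'apprefix': page + '/Archive',
        'aplimit': limit,
        'format': 'json',
    }
    titles = []
    while True:
        data = fetch_json(params)
        titles.extend(p['title'] for p in data['query']['allpages'])
        if 'continue' not in data:
            break
        # merge the continuation values into the next request
        params = {**params, **data['continue']}
    return titles

# stand-in for the real API so the sketch runs offline (responses are made up)
def fake_fetch(params):
    if 'apcontinue' not in params:
        return {
            'continue': {'apcontinue': 'Example/Archive_2', 'continue': '-||'},
            'query': {'allpages': [{'title': 'Talk:Example/Archive 1'}]},
        }
    return {'query': {'allpages': [{'title': 'Talk:Example/Archive 2'}]}}

print(list_archive_titles(fake_fetch, 'Example'))
```

In the notebook, `fetch_json` could wrap `requests.get(api_url, params=params, headers=headers).json()`, ideally with a `time.sleep(sleep_seconds)` pause before each call. Passing a dict via `params` also lets `requests` handle URL encoding of page names that contain characters such as `&`.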

## Zip your corpus

The next cell zips your corpus so you can download it and use it in software like AntConc. Tick the checkbox next to the .zip file and click the Download button to save a copy to your computer.

In [6]:
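# name the zip after the corpus directory, e.g. 'talk-corpus/' becomes 'talk-corpus.zip'
zip_filename = os.path.normpath(corpus_path) + '.zip'

zipf = zipfile.ZipFile(zip_filename, 'w', zipfile.ZIP_DEFLATED)

# walk the corpus directory and add each file to the zip (zipf is the zipfile handle)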
for root, dirs, files in os.walk(corpus_path):
    for file in files:
        zipf.write(os.path.join(root, file), os.path.relpath(os.path.join(root, file), os.path.join(corpus_path, '..')))
      
zipf.close()
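
Before downloading, it can be worth checking what actually went into the archive. The sketch below is a self-contained variation on the cell above, not a replacement for it: the function name `zip_directory` and the `demo-corpus` directory are hypothetical, and it works in a temporary directory so it does not touch your real corpus.

```python
import os
import tempfile
import zipfile

def zip_directory(directory, zip_filename):
    # hypothetical helper: zip a directory, then return the names stored in the zip
    directory = os.path.normpath(directory)
    parent = os.path.dirname(os.path.abspath(directory))

    # the with-statement closes the zip even if something goes wrong part way
    with zipfile.ZipFile(zip_filename, 'w', zipfile.ZIP_DEFLATED) as zipf:
        for root, dirs, files in os.walk(directory):
            for file in files:
                full_path = os.path.join(root, file)
                # store paths relative to the parent, so files sit inside a folder in the zip
                zipf.write(full_path, os.path.relpath(full_path, parent))

    # reopen the finished zip and list what it contains
    with zipfile.ZipFile(zip_filename) as zipf:
        return sorted(zipf.namelist())

# try it on a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    corpus = os.path.join(tmp, 'demo-corpus')
    os.makedirs(corpus)
    for name in ['a.txt', 'b.txt']:
        with open(os.path.join(corpus, name), 'w', encoding='utf8') as f:
            f.write('example text')
    names = zip_directory(corpus, os.path.join(tmp, 'demo-corpus.zip'))

print(names)
```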

## What to do next?

### Try a little analysis

Load your corpus in AntConc. Run a Keyword comparison against the BNC to derive lists of over-represented words. Can you group these? Some will be related to features of a Wikipedia Talk page and the editors who are leaving comments. Other words will be related to the topics of discussion within the Talk pages.

### A better corpus?

We could probably improve this corpus. There is some cleaning being done when the Wikipedia HTML is converted to text. More cleaning is possible. Choices about what to leave in and remove (or "clean") could be important to our analysis.

Another way we could improve our corpus would be to segment the Talk pages. Currently we are collecting each Talk page and related archived Talk pages and saving each page as a separate file. However, Talk pages represent different topics of discussion and specific comments by users. This is beyond the scope of this lab, but splitting up the Talk pages by topic sections or by user comments could be useful for specific kinds of research. Why might this be useful?
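
As a starting point for thinking about segmentation, here is a minimal sketch that splits a page into topic sections. It assumes you have the page as wikitext (for example by requesting `prop=wikitext` instead of `prop=text` from the parse API), where each top-level discussion topic starts with a line like `== Topic title ==`. The function name `split_wikitext_by_topic` and the sample text are made up for illustration, and real Talk pages will have edge cases this does not handle.

```python
import re

# a level-2 heading: a line that starts and ends with exactly two equals signs
HEADING = re.compile(r'^==([^=].*?)==[ \t]*$', re.MULTILINE)

def split_wikitext_by_topic(wikitext):
    # hypothetical helper: return a list of (title, text) pairs, one per topic
    sections = []
    matches = list(HEADING.finditer(wikitext))

    # anything before the first heading is kept as a lead section
    lead_end = matches[0].start() if matches else len(wikitext)
    lead = wikitext[:lead_end].strip()
    if lead:
        sections.append(('(lead)', lead))

    for i, match in enumerate(matches):
        # a section runs until the next level-2 heading or the end of the page
        end = matches[i + 1].start() if i + 1 < len(matches) else len(wikitext)
        sections.append((match.group(1).strip(), wikitext[match.end():end].strip()))
    return sections

sample = (
    "Intro note\n"
    "== First topic ==\n"
    "Comment A\n"
    "=== Sub ===\n"
    "Reply\n"
    "== Second topic ==\n"
    "Comment B\n"
)
for title, text in split_wikitext_by_topic(sample):
    print(title, '|', len(text), 'characters')
```

Subsections (`=== ... ===`) stay inside their parent topic here. Each `(title, text)` pair could then be saved as its own file, using the same approach as the collection cell.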

### Other APIs

What other APIs are there for collecting texts? Search the web and see what you can find. You will find code others have written to build datasets of texts on GitHub, Kaggle and elsewhere.

### Collect another corpus

After you have collected a Talk:COVID-19 corpus, search Wikipedia for another Talk page to collect. Look for an article that has a few archived pages of talk (10-20 pages is ideal).

When you have decided on a page go back to the Settings cell and change the page and corpus_path and collect the other corpus. 
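
The `page` setting expects the page slug, which is the part of the article URL after `/wiki/`. If you would rather paste a full URL, a small helper like the one sketched below could pull the slug out for you. The function name `page_slug_from_url` is hypothetical, and the sketch assumes a standard `/wiki/` article URL.

```python
from urllib.parse import urlparse

def page_slug_from_url(url):
    # hypothetical helper: return the part of a Wikipedia article URL after /wiki/
    path = urlparse(url).path
    prefix = '/wiki/'
    if not path.startswith(prefix):
        raise ValueError('Not a /wiki/ article URL: ' + url)
    return path[len(prefix):]

page = page_slug_from_url('https://en.wikipedia.org/wiki/COVID-19_pandemic_in_the_United_Kingdom')
print(page)
```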

#### Ideas:
1. You could collect the Talk pages for articles that address pandemic response in specific countries:  
https://en.wikipedia.org/wiki/COVID-19_pandemic_in_the_United_Kingdom  
https://en.wikipedia.org/wiki/COVID-19_pandemic_in_the_United_States  
You could compare these with the first corpus we created.

2. You can browse Wikipedia for controversial pages here: 
https://en.wikipedia.org/wiki/Wikipedia:List_of_controversial_issues

3. What interests you? What do you think might be controversial?