> **DO NOT EDIT IF INSIDE `computational_analysis_of_big_data_2018_spring` folder** 

# Week 3: Getting data—scraping and APIs.

*Thursday, February 1, 2018*

This week is about getting data from the big ol' Internet, with Wikipedia as our guinea pig. The main task today is to retrieve the Wikipedia pages of **all Marvel characters** using the MediaWiki **API**. There are three parts to this exercise.

* Learn the basics of how to retrieve data from Wiki sites using the MediaWiki API
* Download all Marvel character Wikipedia articles
* Explore the data

You will be working with the data you acquire today for the remainder of the course. Try to get as far as possible, structure the data nicely, and write your code so that it still makes sense to you in the coming weeks.

Also, there's an **important practice** you should start getting used to—which matters when we grade assignments. 
1. Openly reflect on how you solve a problem. It can be code comments, or markup below/above the code cell, just as long as you share your thoughts. 
2. Comment on your results, discussing:
    * Whether they make sense
    * If they look somewhat as you expected, and if not, what the reasons for this difference may be
    * What—interesting or not—insight they reveal about the given system you analyze
    
    *Note: of course you can't always say something profound about every little thing, so rest assured, I will only expect explanations in your assignments when* it makes sense *that there should be one.*

**Feedback:** Send me anonymous feedback about the exercises, lectures and course in general at http://ulfaslak.com/vent.

## Exercises

**Why use an API?** You could just go ahead and scrape the HTML from a Wikipedia page, as simply as:

    import requests as rq
    rq.get("https://en.wikipedia.org/wiki/Batman").text
    
Well... navigating data in HTML format is not always easy. Therefore, MediaWiki offers its users direct access to its API. To load the MediaWiki markup using the API, you would do something like:

    rq.get("https://en.wikipedia.org/w/api.php?format=json&action=query&titles=Batman&prop=revisions&rvprop=content").json()
    
This returns a JSON object inside which you can find all sorts of information about the page, including the latest revision of the Batman page markup.

**Helpful code to display JSON object as a tree**

In [1]:
def print_json_tree(d, indent=0):
    """Print tree of keys in JSON object.
    
    Prints the different levels of nested keys in a JSON object. When there
    are no more dictionaries to key into, prints the object's type and the
    length of its string representation.

    Input
    -----
    d : dict
    """
    for key, value in d.iteritems():
        print '    ' * indent + unicode(key),
        if isinstance(value, dict):
            print; print_json_tree(value, indent+1)
        else:
            print ":", str(type(d[key])).split("'")[1], "-", str(len(unicode(d[key])))
            
# Example
import requests as rq
data = rq.get("https://en.wikipedia.org/w/api.php?format=json&action=query&titles=Batman&prop=revisions&rvprop=content").json()
print_json_tree(data)

batchcomplete : unicode - 0
query
    pages
        4335
            ns : int - 1
            pageid : int - 4
            revisions : list - 140431
            title : unicode - 6


### Part 0: Learn to access Wikipedia data with Python

Figure out how Wikipedia markup works. You'll need to know a bit about the formatting of MediaWiki pages so that you can parse the markup that you retrieve from Wikipedia. See http://www.mediawiki.org/wiki/Help:Formatting. In particular, look into how links work and how tables work, and make sure you can answer the following questions.

>**Ex. 3.0.1**: How do you link to another Wikipedia page from within a Wikipedia page, using the MediaWiki markup? Write down a simple example that links to a specific section in another page.

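`[[Fox Broadcasting Company|Fox]]`: in the finished article, the text will read 'Fox' and link to the article called 'Fox Broadcasting Company'. To link to a specific section, append `#` and the section heading to the page name, e.g. `[[Fox Broadcasting Company#History|Fox]]` (assuming the target page has a section titled 'History').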
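
When you parse the markup later on, links of this form can be pulled out with a regular expression. The following is a minimal sketch in Python 3, not a full MediaWiki parser: the helper name `parse_wiki_links` and the sample string are made up for illustration, and nested links or file/image links are not handled.

```python
import re

# Matches [[target]], [[target|label]] and [[target#section|label]].
WIKI_LINK = re.compile(r"\[\[([^\[\]|#]*)(?:#([^\[\]|]*))?(?:\|([^\[\]]*))?\]\]")


def parse_wiki_links(markup):
    """Return a list of (target, section, label) tuples, one per wiki link."""
    links = []
    for target, section, label in WIKI_LINK.findall(markup):
        # Without a pipe there is no separate label, so fall back to the target.
        links.append((target, section or None, label or target))
    return links


# Made-up sample text; the section name is purely illustrative.
sample = ("Aired on [[Fox Broadcasting Company|Fox]], see "
          "[[Batman#Publication history|history]] and [[Robin]].")
for link in parse_wiki_links(sample):
    print(link)
```

Returning the section separately keeps the page title clean, which is convenient if you later want to match link targets against your list of character names.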

> **Ex. 3.0.2**: What is the MediaWiki markup to create a simple table like the one below?

>| True Positive  | False Positive |
| -------------- |:--------------:|
| False Negative | True Negative  |

> **Ex. 3.0.3**: Figure out how to download pages from Wikipedia. Familiarize yourself with [the API](http://www.mediawiki.org/wiki/API:Main_page) and learn how to extract the markup. The API query that returns the markup of the Batman page is:
    
>`api.php?format=json&action=query&titles=Batman&prop=revisions&rvprop=content`

>1. Explain the structure of this query. What are the parameters and arguments and what do they mean? What happens if you remove `rvprop=content`?
2. Download the Batman page data from the API. Extract the markup from the JSON object and save it to a file called "batman.txt".

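`format`: the format you want the data returned in (here JSON).<br>
`action`: what the request should do (e.g. get data, as opposed to writing or removing it).<br>
`titles`: which pages to get (multiple titles are delimited by `|`).<br>
`prop`: which properties of the page to get (could also have been images, links, etc.).<br>
`rvprop`: which properties of each revision to get (only relevant when `prop=revisions`). Without `rvprop=content`, the revisions in the response should contain only metadata (such as timestamp and user) and not the markup itself.<br>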
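
To make the structure of the query concrete, the sketch below (Python 3, standard library only) splits the URL into its endpoint and parameters, rebuilds it for two titles, and extracts the markup from a response. It assumes the response layout shown in the tree above; `extract_markup` and `sample_response` are hypothetical names, and the sample response is hand-written rather than downloaded.

```python
import os
import tempfile
from urllib.parse import parse_qs, urlencode, urlsplit

url = ("https://en.wikipedia.org/w/api.php?format=json&action=query"
       "&titles=Batman&prop=revisions&rvprop=content")

# Split the URL into the API endpoint and its parameter/argument pairs.
parts = urlsplit(url)
endpoint = "{}://{}{}".format(parts.scheme, parts.netloc, parts.path)
params = {key: values[0] for key, values in parse_qs(parts.query).items()}
for key in sorted(params):
    print("{} = {}".format(key, params[key]))

# The same query for two pages: titles are joined with "|".
two_pages = dict(params, titles="|".join(["Batman", "Robin"]))
print(endpoint + "?" + urlencode(two_pages))


def extract_markup(data):
    """Return the markup of the first page in a response shaped like the tree above."""
    page = next(iter(data["query"]["pages"].values()))
    return page["revisions"][0]["*"]


# Hand-written stand-in for a real response (the markup text is made up).
sample_response = {
    "batchcomplete": "",
    "query": {"pages": {"4335": {
        "pageid": 4335, "ns": 0, "title": "Batman",
        "revisions": [{"*": "'''Batman''' is a fictional superhero."}],
    }}},
}

markup = extract_markup(sample_response)
# Written to a temporary folder here; in the exercise you would save to "batman.txt".
path = os.path.join(tempfile.mkdtemp(), "batman.txt")
with open(path, "w", encoding="utf-8") as fp:
    fp.write(markup)
```

With a real download, `sample_response` would be replaced by the result of `rq.get(url).json()`.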

### Part 1: Get data (main part)

For a good part of this course we will be working with data from Wikipedia. Today, your objective is to crawl a large dataset of good and bad characters from the Marvel universe.

>**Ex. 3.1.1**: From the Wikipedia API, get a list of all Marvel superheroes and another list of all Marvel supervillains. Use 'Category:Marvel_Comics_supervillains' and 'Category:Marvel_Comics_superheroes' to get the characters in each category.
1. How many superheroes are there? How many supervillains?
2. How many characters are both heroes and villains? What is the Jaccard similarity between the two groups?

>*Hint: Google something like "get list all pages in category wikimedia api" if you're struggling with the query.*
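
If you get stuck, the two steps of Ex. 3.1.1 can be sketched as follows (Python 3). The sketch assumes the `list=categorymembers` query module with its `cmtitle`, `cmlimit` and `cmcontinue` parameters, and that a response carrying a `continue` entry means more results are available. The function names and the injected `get_json` callable are hypothetical; `fake_get_json` returns made-up data so the example runs without network access.

```python
def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


def category_members(category, get_json):
    """Collect (title, pageid) for every member of a category.

    get_json is any callable that takes a dict of query parameters and
    returns the decoded JSON response as a dict.
    """
    params = {
        "action": "query",
        "format": "json",
        "list": "categorymembers",
        "cmtitle": category,
        "cmlimit": "500",
    }
    members = []
    while True:
        data = get_json(params)
        for member in data["query"]["categorymembers"]:
            members.append((member["title"], member["pageid"]))
        if "continue" not in data:
            return members
        # Feed the continuation values back into the next request.
        params = dict(params, **data["continue"])


def fake_get_json(params):
    """Stand-in for the API: two made-up batches of results."""
    if "cmcontinue" not in params:
        return {"continue": {"cmcontinue": "batch2", "continue": "-||"},
                "query": {"categorymembers": [
                    {"pageid": 1, "ns": 0, "title": "Hero A"}]}}
    return {"query": {"categorymembers": [
        {"pageid": 2, "ns": 0, "title": "Hero B"}]}}


heroes = category_members("Category:Marvel_Comics_superheroes", fake_get_json)
print(heroes)
print(jaccard({"A", "B", "C"}, {"B", "C", "D"}))
```

For real data, `get_json` could be something like `lambda p: rq.get("https://en.wikipedia.org/w/api.php", params=p).json()`. Storing characters as `(title, pageid)` tuples lets you download by page id later, which avoids problems with special characters in titles.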

>**Ex. 3.1.2**: Using these lists, you now want to download all the data you can about each character. However, because this is potentially Big Data, you cannot store it in your computer's memory. Therefore, you have to store it on your hard drive somehow.
* Create three folders on your computer, one for *heroes*, one for *villains*, and one for *ambiguous*.
* For each character, download the markup on their pages and save in a new file in the corresponding hero/villain/ambiguous folder.

>*Hint: Some of the characters have funky names. The first problem you may encounter is encoding. To solve that, you can call `.encode('utf-8')` on your markup string. Another problem you may encounter is that some characters have a slash in their names. You should just replace it with some other meaningful character.*

In [None]:
import os
import requests as rq

# superheroes and supervillains are lists of (title, pageid) tuples
for c in set(superheroes) | set(supervillains):
    
    # Find the right folder for the character
    if c in superheroes and c in supervillains:
        folder = "ambiguous"
    elif c in superheroes:
        folder = "heroes"
    elif c in supervillains:
        folder = "villains"
    
    # Replace slash with dash so the name is a valid filename
    filename = c[0].replace("/", "-") + ".txt"
    
    # Only download new pages
    if filename in os.listdir('../../data/%s' % folder):
        continue
    
    # Get the data
    data = rq.get(
        "https://en.wikipedia.org/w/api.php?prop=revisions&rvprop=content&action=query&pageids=%d&format=json" % c[1]
    ).json()
    
    # Get the markup
    markup = list(data['query']['pages'].values())[0]['revisions'][0]['*']
    
    # Save it as UTF-8 encoded bytes
    with open("../../data/%s/%s" % (folder, filename), 'wb') as fp:
        fp.write(markup.encode('utf-8'))

### Part 2: Explore data

#### Page lengths

>**Ex. 3.2.1**: Extract the length of the page of each character, and plot the distribution of this variable for each class (heroes/villains/ambiguous). Can you say anything about the popularity of characters in the Marvel universe based on your visualization?

>*Hint: The simplest thing is to make a probability mass function, i.e. a normalized histogram. Use `plt.hist` on a list of page lengths, with the argument `normed=True` (in newer versions of matplotlib, this argument is called `density=True`). Other distribution plots are fine too, though.*
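
One way to approach this, sketched in Python 3: collect the page lengths per class from the folders created in Part 1, then overlay one normalized histogram per class. The function names are hypothetical, page length is taken to mean the number of characters in the markup, and `density=True` is used in place of `normed=True`, which newer matplotlib versions no longer accept. The demo at the bottom runs on a throwaway folder with made-up files.

```python
import os
import tempfile


def page_lengths(data_dir, classes=("heroes", "villains", "ambiguous")):
    """Map each class to the list of character counts of its .txt files."""
    lengths = {}
    for cla in classes:
        folder = os.path.join(data_dir, cla)
        lengths[cla] = []
        for name in sorted(os.listdir(folder)):
            if not name.endswith(".txt"):
                continue
            with open(os.path.join(folder, name), encoding="utf-8") as fp:
                lengths[cla].append(len(fp.read()))
    return lengths


def plot_length_distributions(lengths, bins=30):
    """Overlay one normalized histogram per class."""
    # Imported here so that page_lengths can be used without matplotlib.
    import matplotlib.pyplot as plt

    for cla, values in lengths.items():
        plt.hist(values, bins=bins, density=True, alpha=0.5, label=cla)
    plt.xlabel("Page length (characters)")
    plt.ylabel("Density")
    plt.legend()
    plt.show()


# Tiny demo on a throwaway folder tree with made-up content.
demo_dir = tempfile.mkdtemp()
for cla, text in [("heroes", "abc"), ("villains", "abcdef"), ("ambiguous", "")]:
    os.makedirs(os.path.join(demo_dir, cla))
    with open(os.path.join(demo_dir, cla, "Example.txt"), "w", encoding="utf-8") as fp:
        fp.write(text)
print(page_lengths(demo_dir))
```

On the real data you would call `plot_length_distributions(page_lengths("../../data"))`.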

>**Ex. 3.2.2**: Find the 10 characters from each class with the longest Wikipedia pages. Visualize their page lengths with bar charts. Comment on the result.
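
Picking the longest pages is a small selection problem. A minimal sketch, assuming you already hold a dictionary from character name to page length for one class (the helper name and the example values are made up):

```python
import heapq


def longest_pages(lengths_by_name, n=10):
    """Return the n (name, length) pairs with the largest length, longest first."""
    return heapq.nlargest(n, lengths_by_name.items(), key=lambda item: item[1])


# Made-up example values.
example = {"Character A": 120, "Character B": 45000,
           "Character C": 980, "Character D": 15000}
for name, length in longest_pages(example, n=3):
    print(name, length)
```

The resulting names and lengths can be passed directly to a bar chart function such as `plt.barh`. Sorting the whole list in descending order and slicing off the first ten entries works just as well.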

#### Timeline

>**Ex. 3.2.3**: We are interested in knowing if there is a time-trend in the debut of characters.
* Extract into three lists, debut years of heroes, villains, and ambiguous characters.
* Do all pages have a debut year? Do some have multiple? How do you handle these inconsistencies?
* For each class, visualize the number of characters introduced over time. You choose how you want to visualize this data, but please comment on your choice. Also comment on the outcome of your analysis.

>*Hint: The debut year is given on the debut row in the info table of a character's Wiki-page. There are many ways that you can extract this variable. You should try to have a go at it yourself, but if you are short on time, you can use this horribly ugly regular expression code:*

>*`re.findall(r"\d{4}\)", re.findall(r"debut.+?\n", markup_text)[0])[0][:-1]`*
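
The one-liner above raises an `IndexError` when a page has no debut row or no year on it, and it silently keeps only the first year when there are several. A slightly more defensive sketch built on the same regular expressions could look like this (Python 3; the function names are hypothetical, the sample markup strings are made up, and taking the earliest year is just one possible way to handle multiple debuts):

```python
import re


def debut_years(markup):
    """Return all four-digit years on the debut row (empty list if none)."""
    rows = re.findall(r"debut.+?\n", markup)
    if not rows:
        return []
    # Same pattern as in the hint: a year directly followed by ")".
    return [int(year) for year in re.findall(r"(\d{4})\)", rows[0])]


def first_debut_year(markup):
    """Return the earliest debut year, or None if the page has none."""
    years = debut_years(markup)
    return min(years) if years else None


# Made-up infobox fragments covering the three cases.
single = "| debut = ''Example Comics'' #1 (May 1939)\n| creators = Someone\n"
multiple = "| debut = '''As X:''' ''A'' #4 (Mar. 1964)<br>'''As Y:''' ''B'' #10 (1975)\n"
missing = "| publisher = Marvel Comics\n"

for markup in (single, multiple, missing):
    print(debut_years(markup), first_debut_year(markup))
```

Keeping the full list of years around makes it easy to count how many pages have zero, one, or several debut years before you decide how to treat them.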

#### Alliances

>**Ex. 3.2.4**: In this exercise you want to find out what the biggest alliances in the Marvel universe are. The data you need for doing this is in the *alliances*-field of the markup of each character. Below I suggest steps you can take to solve the problem if you get stuck.
* Write a regex that extracts the *alliances*-field of a character's markup.
* Write a regex that extracts each team from the *alliances*-field.
* Count the number of members for each team (hint: use a `defaultdict`).
* Inspect your team names. Are there any that result from inconsistencies in the information on the pages? How do you deal with this?
* **Print the 10 largest alliances and their number of members.**

In [19]:
from collections import defaultdict
import os, re

team_characters = defaultdict(list)

def populate_team_characters(cla):
    for c in os.listdir("../../data/" + cla):
        
        # Load character markup
        with open("../../data/%s/%s" % (cla, c)) as fp:
            markup = fp.read()
    
        # Get alliance field
        alliances_field = re.findall(r"alliances[\w\W]+?\n", markup)
        if alliances_field == []:
            continue
        
        # Extract teams from alliance field
        teams = re.findall(r"\[\[.+?[\]\|]", alliances_field[0][10:])
        
        # Append the character name to the team key in team_characters
        for t in teams:
            team_characters[t[2:-1]].append(c[:-4])

populate_team_characters('heroes')
populate_team_characters('villains')
populate_team_characters('ambiguous')

In [31]:
sorted([(team, len(members)) for team, members in team_characters.items()], key=lambda element: element[1], reverse=True)[:10]

[('X-Men', 99),
 ('Avengers (comics)', 89),
 ('Thunderbolts (comics)', 76),
 ('Masters of Evil', 75),
 ('Defenders (comics)', 62),
 ('S.H.I.E.L.D.', 60),
 ('Brotherhood of Mutants', 49),
 ('New Warriors', 44),
 ('Sinister Six', 44),
 ('Hellfire Club (comics)', 42)]