# Week 3: Getting data—scraping and APIs.

This week is about getting data from the big ol' Internet, with the Wikipedia as our guinea pig. The main task today is to retrieve the Wikipedia pages of **all Marvel characters** using the MediaWiki **API**. There are three parts to this exercise.

* Learn the basics of how to retrieve data from Wiki sites using the MediaWiki API
* Download all Marvel character Wikipedia articles
* Explore the data

The data you acquire today, you will be working with for the remainder of the course. Try to get as far as possible, structure the data nicely and write your code so that it makes sense to you in the coming weeks.

Also, there's an **important practice** you should start getting used to—which matters when we grade assignments. 
1. Openly reflect on how you solve a problem. It can be code comments, or markup below/above the code cell, just as long as you share your thoughts. 
2. Comment on your results, discussing:
    * Whether they make sense
    * If they look somewhat as you expected, and if not, what the reasons for this difference may be
    * What—interesting or not—insight they reveal about the given system you analyze
    
    *Note: of course you can't always say something profound about every little thing, so rest assured, I will only expect explanations in your assignments when *it makes sense* that there should be one.*

**[Feedback](http://ulfaslak.com/vent)**

## Exercises

**Why use an API?** You could just go ahead and scrape the HTML from a Wikipedia page as simple as:

    import requests as rq
    rq.get("https://en.wikipedia.org/wiki/Batman").text
    
Well... to navigate data in XML format is not always easy. Therefore, MediaWiki offers its users direct use of its API. To load the MediaWiki markup using the API, one would do something like:

    rq.get("https://en.wikipedia.org/w/api.php?format=json&action=query&titles=Batman&prop=revisions&rvprop=content").json()
    
This returns a JSON object inside which you can find all sorts of information about the page, including the latest revision of the Batman page markup.

**Helpful code to display JSON object as a tree**

In [1]:
def print_json_tree(d, indent=0):
    """Print tree of keys in JSON object.
    
    Prints the different levels of nested keys in a JSON object. When there
    are no more dictionaries to key into, prints objects type and byte-size.

    Input
    -----
    d : dict
    """
    for key, value in d.items():
        print('    ' * indent + str(key), end=' ')
        if isinstance(value, dict):
            print(); print_json_tree(value, indent+1)
        else:
            print(":", str(type(d[key])).split("'")[1], "-", str(len(str(d[key]))))
            
# Example
import requests as rq
data = rq.get("https://en.wikipedia.org/w/api.php?format=json&action=query&titles=Batman&prop=revisions&rvprop=content").json()
print_json_tree(data)

batchcomplete : str - 0
    main 
        * : str - 267
    revisions 
        * : str - 163
query 
    pages 
        4335 
            pageid : int - 4
            ns : int - 1
            title : str - 6
            revisions : list - 144042


### Part 0: Learn to access Wikipedia data with Python

Figure out how Wikipedia markup works. You'll need to know a bit about formatting of MediaWiki pages so that you can parse the markup that you retrieve from wikipedia. See http://www.mediawiki.org/wiki/Help:Formatting. In particular, look into how links work and how tables work and make sure you can answer the following questions.

>**Ex. 3.0.1**: How do you link to another Wikipedia page from within a Wikipedia-page, using the wikimedia markup? Write down a simple example that links to a specific section in another page.


[[Help:Contents:]]


> **Ex. 3.0.2**: What is the MediaWiki markup to create a simple table like the one below?

>| True Positive  | False Positive |
| -------------- |:--------------:|
| False Negative | True Negative  |

{| class="wikitable"
|+Food complements
|-
|Orange
|Apple
|-
|Bread
|Pie
|-
|Butter
|Ice cream 
|}

> **Ex. 3.0.3**: Figure out how to download pages from Wikipedia. Familiarize yourself with [the API](http://www.mediawiki.org/wiki/API:Main_page) (there's a nice little [tutorial](https://www.mediawiki.org/wiki/API:Tutorial)) and learn how to extract the markup. The API query that returns the markup of the Batman page is:
    
>`api.php?format=json&action=query&titles=Batman&prop=revisions&rvprop=content`

>1. Explain the structure of this query. What are the parameters and arguments and what do they mean? What happens if you remove `rvprop=content`? 
2. Download the Batman page data from the API. Extract the markup from the JSON object and save it to a file called "batman.txt".

rvprop indicates what part of the revision you want to retrive. Other options would be user, comment, tags

In [8]:
import requests as rq
import json

query = "https://en.wikipedia.org/w/api.php?format=json&action=query&titles=Batman&prop=revisions&rvprop=content"
response = rq.get(query)
result = response.json()
print_json_tree(result)
print(result)

with open("batman.txt", 'w') as f:
    json.dump(result, f)

batchcomplete : str - 0
    main 
        * : str - 267
    revisions 
        * : str - 163
query 
    pages 
        4335 
            pageid : int - 4
            ns : int - 1
            title : str - 6
            revisions : list - 144042


### Part 1: Get data (main part)

For a good part of this course we will be working with data from Wikipedia. Today, your objective is to crawl a large dataset with good and bad characters from the Marvel characters.

>**Ex. 3.1.1**: From the Wikipedia API, get a list of all Marvel superheroes and another list of all Marvel supervillains. Use 'Category:Marvel_Comics_supervillains' and 'Category:Marvel_Comics_superheroes' to get the characters in each category.
1. How many superheroes are there? How many supervillains?
2. How many characters are both heroes and villains? What is the Jaccard similarity between the two groups?

>*Hint: Google something like "get list all pages in category wikimedia api" if you're struggling with the query.*

In [15]:
S = rq.Session()

def get_them(category):
    URL = "https://en.wikipedia.org/w/api.php"
    PARAMS = {
        "action":"query",
        "list": "categorymembers",
        "cmtitle": category,
        "format":"json",
        "cmlimit": '500',
    }
    R = S.get(url=URL, params=PARAMS)
    DATA = R.json()
    json.dumps(DATA, indent=4)
    joe = 0
    current_villains = len(DATA['query']['categorymembers'])
    total_villains = len(DATA['query']['categorymembers'])
    char_list = []
    id_list = []

    while(current_villains > 0):
        try:
            PARAMS['cmcontinue']= DATA['continue']['cmcontinue']
            R2 = S.get(url=URL, params=PARAMS)
            DATA = R2.json()
            current_villains = len(DATA['query']['categorymembers'])
            total_villains += current_villains
            ls = DATA['query']['categorymembers']
            for unit in ls:
                char_list.append(unit['title'])
                id_list.append(unit['pageid'])
        except:
            break
    return [total_villains, char_list, id_list]
#There are 1168 villains
#There are 912 heroes
hero_data = get_them("Category:Marvel_Comics_superheroes")
villain_data = get_them("Category:Marvel_Comics_supervillains")
hero_id = hero_data[2]
villain_id = villain_data[2]

def jaccard(a, b):
    a = set(a)
    b = set(b)
    c = a.intersection(b)
    return c
    #return float(len(c)) / (len(a) + len(b) - len(c))

amb = (jaccard(hero_data[2], villain_data[2]))

print(amb)

{162177, 1035779, 5279748, 709255, 3055880, 397704, 2215946, 3664011, 6501386, 52784146, 2696728, 5545113, 13776794, 504988, 1056926, 417310, 5972515, 60963, 322085, 13537188, 17324073, 3699882, 310570, 2114226, 4975794, 8095799, 5356664, 4860350, 3441215, 12953024, 5113666, 303427, 1360328, 1345992, 6073034, 4596172, 2148940, 1911500, 2306128, 12044496, 37963346, 728659, 840148, 1775316, 17744982, 701912, 17303257, 313050, 2198491, 5955676, 1310812, 608862, 2670944, 5602145, 734304, 33527139, 49125, 10574694, 3215719, 3188327, 47433579, 4921452, 2378350, 43413743, 3701104, 1783407, 411375, 2504433, 5085039, 10312693, 2451704, 1454201, 36526076, 37874941, 362238}


>**Ex. 3.1.2**: Using this list you now want to download all data you can about each character. However, because this is potentially Big Data, you cannot store it your computer's memory. Therefore, you have to store it in your harddrive somehow. 
* Create three folders on your computer, one for *heroes*, one for *villains*, and one for *ambiguous*.
* For each character, download the markup on their pages and save in a new file in the corresponding hero/villain/ambiguous folder.

>*Hint: Some of the characters have funky names. The first problem you may encounter is problems with encoding. To solve that you can call `.encode('utf-8')` on your markup string. Another problem you may encounter is that characters have a slash in their names. This, you should just replace with some other meaningful character.*

In [28]:
def get_markup(title):  
    URL = "https://en.wikipedia.org/w/api.php"
    PARAMS = {
            "action":"query",
            "format":"json",
            "prop": "revisions",
            "rvprop": "content",
            "pageids": title
    }
    R = S.get(url=URL, params=PARAMS)
    DATA = R.json()
    name = "/Users/nathanshirley/large-data/my-exercises/villains/" + str(title) + ".txt"
    with open(name, 'w') as f:
        json.dump(result, f)

for guy in villain_id:
    get_markup(guy)

### Part 2: Explore data

#### Page lengths

>**Ex. 3.2.1**: Extract the length of the page of each character, and plot the distribution of this variable for each class (heroes/villains/ambiguous). Can you say anything about the popularity of characters in the Marvel universe based on your visualization?

>*Hint: The simplest thing is to make a probability mass function, i.e. a normalized histogram. Use `plt.hist` on a list of page lengths, with the argument `density=True`. Other distribution plots are fine too, though.*

In [30]:
def get_lengths(category):
    S = rq.Session()
    URL = "https://en.wikipedia.org/w/api.php"
    PARAMS = {
        "action":"query",
        "list": "categorymembers",
        "cmtitle": category,
        "format":"json",
        "cmlimit": '500',
    }
    R = S.get(url=URL, params=PARAMS)
    DATA = R.json()
    print(DATA)
    json.dumps(DATA, indent=4)
    joe = 0
    current_villains = len(DATA['query']['categorymembers'])
    total_villains = len(DATA['query']['categorymembers'])
    char_list = []
    id_list = []

    while(current_villains > 0):
        try:
            PARAMS['cmcontinue']= DATA['continue']['cmcontinue']
            R2 = S.get(url=URL, params=PARAMS)
            DATA = R2.json()
            current_villains = len(DATA['query']['categorymembers'])
            total_villains += current_villains
            ls = DATA['query']['categorymembers']
            
            for unit in ls:
                char_list.append(unit['title'])
                id_list.append(unit['pageid'])
        except:
            break
    return [total_villains, char_list, id_list]

get_lengths("Category:Marvel_Comics_superheroes")

{'batchcomplete': '', 'continue': {'cmcontinue': 'page|43292f392904532943042f594331011201dcbddc07|55511277', 'continue': '-||'}, 'query': {'categorymembers': [{'pageid': 2388262, 'ns': 0, 'title': 'Abigail Brand'}, {'pageid': 707597, 'ns': 0, 'title': 'Abyss (comics)'}, {'pageid': 2994013, 'ns': 0, 'title': 'Adam X the X-Treme'}, {'pageid': 26201432, 'ns': 0, 'title': 'Adept (comics)'}, {'pageid': 5026758, 'ns': 0, 'title': 'Agent (comics)'}, {'pageid': 407707, 'ns': 0, 'title': 'Agent X (Marvel Comics)'}, {'pageid': 5181968, 'ns': 0, 'title': 'El Aguila'}, {'pageid': 8926815, 'ns': 0, 'title': 'Ahura (comics)'}, {'pageid': 52186543, 'ns': 0, 'title': 'Aikku Jokinen'}, {'pageid': 1619971, 'ns': 0, 'title': 'Air-Walker'}, {'pageid': 8743328, 'ns': 0, 'title': 'Francis Fanny'}, {'pageid': 5603862, 'ns': 0, 'title': 'Alaris (comics)'}, {'pageid': 7062448, 'ns': 0, 'title': 'Alice Nugent'}, {'pageid': 1103984, 'ns': 0, 'title': 'Liz Allan'}, {'pageid': 37157943, 'ns': 0, 'title': 'Alpha (M

[912,
 ['Nadia van Dyne',
  'Nahrees',
  'Namor',
  'Namora',
  'Namorita',
  'Nemesis (Alpha Flight)',
  'Lilandra Neramani',
  'Night Thrasher (Dwayne Taylor)',
  'Nightcrawler (comics)',
  'Nighthawk (Marvel Comics)',
  'Nightmask',
  'Nightwatch (comics)',
  'Nikki (comics)',
  'Nocturne (Talia Wagner)',
  'Noh-Varr',
  'Nomad (comics)',
  'Dakota North (comics)',
  'David North (comics)',
  'Northstar (Marvel Comics)',
  'Nova (Frankie Raye)',
  'Nova (Richard Rider)',
  'Nova (Sam Alexander)',
  'Nuklo',
  'Aleta Ogord',
  "Eric O'Grady",
  'Ogre (Marvel Comics)',
  'Okoye (comics)',
  'Old Lace (comics)',
  'Omega the Unknown',
  'Onyxx',
  'Orrgo',
  'Harry Osborn',
  'Outlaw (comics)',
  'Outlaw Kid',
  'Oya (comics)',
  'Paladin (comics)',
  'Paragon (comics)',
  'Patriot (comics)',
  'Paydirt (Marvel Comics)',
  'Penance (X-Men)',
  'Peni Parker',
  'Peregrine (comics)',
  'Perun (comics)',
  'Phantom Rider',
  'Phaser (comics)',
  'Phobos (Marvel Comics)',
  'Phoenix Force 

>**Ex. 3.2.2**: Find the 10 characters from each class with the longest Wikipedia pages. Visualize their page lengths with bar charts. Comment on the result.

#### Timeline

>**Ex. 3.2.3**: We are interested in knowing if there is a time-trend in the debut of characters.
* Extract into three lists, debut years of heroes, villains, and ambiguous characters.
* Do all pages have a debut year? Do some have multiple? How do you handle these inconsistencies?
* For each class, visualize the amount of characters introduced over time. You choose how you want to visualize this data, but please comment on your choice. Also comment on the outcome of your analysis.

>*Hint: The debut year is given on the debut row in the info table of a character's Wiki-page. There are many ways that you can extract this variable. You should try to have a go at it yourself, but if you are short on time, you can use this horribly ugly regular expression code:*

>*`re.findall(r"\d{4}\)", re.findall(r"debut.+?\n", markup_text)[0])[0][:-1]`*

#### Alliances

>**Ex. 3.2.4**: In this exercise you want to find out what the biggest alliances in the Marvel universe are. The data you need for doing this is in the *alliances*-field of the markup of each character. Below I suggest steps you can take to solve the problem if you get stuck.
* Write a regex that extracts the *alliances*-field of a character's markup.
* Write a regex that extracts each team from the *alliance*-field.
* Count the number of members for each team (hint: use a `defaultdict`).
* Inspect your team names. Are there any that result from inconsistencies in the information on the pages? How do you deal with this?
* **Print the 10 largest alliances and their number of members.**