# Wiki Networking - Text Mining Tools

### Introduction

This notebook contains text mining and web crawling functions for working with Wikipedia articles. The first cell contains imports and settings for crawling Wikipedia.

In [46]:
from pyquery import PyQuery
import time


print "OK"

OK


### `filter_links`

This function accepts a PyQuery object and returns a list of Wikipedia article links from the article's main body text. It will not return links that are redirects or links that are anchors within other pages. Optionally, you can specify a DOM selector for the type of element containing links you wish to retrieve.

In [2]:
def filter_links(page, selector=""):
    result = []
    # Restrict the search to the article body, optionally narrowed by `selector`
    subchildren = PyQuery(page("#mw-content-text " + selector))
    for child in subchildren:
        links = PyQuery(child)("a")
        for link in links:
            linkQuery = PyQuery(link)
            # Skip links that are marked as redirects
            if not linkQuery.hasClass("mw-redirect"):
                href = linkQuery.attr("href")
                # Keep article paths only; skip missing hrefs and section anchors
                if href and "/wiki/" in href and "#" not in href:
                    result.append(href)
    return result
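
The filtering rule inside `filter_links` can be isolated as a small predicate on the `href` string and the link's CSS classes. The sketch below is an illustration rather than part of the notebook: `is_article_link` is a hypothetical helper, and the sample hrefs are made up for the demonstration.

```python
def is_article_link(href, css_classes=()):
    """Return True for links that the rule in filter_links would keep."""
    # Anchors without an href attribute carry nothing to follow
    if not href:
        return False
    # Links marked as redirects are skipped
    if "mw-redirect" in css_classes:
        return False
    # Keep article paths, but not anchors pointing into a page section
    return "/wiki/" in href and "#" not in href

# Hypothetical sample links: (href, CSS classes)
candidates = [
    ("/wiki/Tony_Stark", ()),
    ("/wiki/Iron_Man#Powers", ()),
    ("/wiki/Shellhead", ("mw-redirect",)),
    ("https://example.org/page", ()),
    (None, ()),
]
kept = [href for href, classes in candidates if is_article_link(href, classes)]
```

Only the first sample link survives the filter. Note that the `"/wiki/" in href` test also accepts paths in other namespaces, such as `/wiki/File:...`, so a stricter crawl may want to exclude those separately.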


In this example, we create a PyQuery object for the [Iron Man](https://en.wikipedia.org/wiki/Iron_Man) article on Wikipedia, retrieve its links with `filter_links` and output the number of links we retrieved.

In [16]:
iron_man_page = PyQuery(url="https://en.wikipedia.org/wiki/Iron_Man")
iron_man_links = filter_links(iron_man_page)
print len(iron_man_links), "links retrieved"

1208 links retrieved


In this example, we retrieve links from a [list of Hip Hop musicians](https://en.wikipedia.org/wiki/List_of_hip_hop_musicians). Each link on that page sits inside an `li` list item tag, so we pass `li` as the selector.

In [26]:
hip_hop_page = PyQuery(url="https://en.wikipedia.org/wiki/List_of_hip_hop_musicians")
hip_hop_links = filter_links(hip_hop_page, "li")
print len(hip_hop_links), "links retrieved"

1115 links retrieved


In this example, we do the same thing for [Harry Potter Characters](https://en.wikipedia.org/wiki/List_of_Harry_Potter_characters). Notice that many links are omitted - these links are redirects or links to sections within larger articles.

In [19]:
harry_potter_page = PyQuery(url="https://en.wikipedia.org/wiki/List_of_Harry_Potter_characters")
harry_potter_links = filter_links(harry_potter_page, "li")
print len(harry_potter_links), "links retrieved"

144 links retrieved


In this example, we are retrieving a list of [Marvel Comics Characters beginning with the letter D](https://en.wikipedia.org/wiki/List_of_Marvel_Comics_characters:_D). In this case, we specify the DOM selector `.hatnote`. This is because we only want links for Marvel characters that have a dedicated _*Main article*_. The _*Main article*_ elements all have the `.hatnote` CSS class, so it is an easy way of selecting these articles. In this example, we simply print the contents of the list.

In [23]:
marvel_d_page = PyQuery(url="https://en.wikipedia.org/wiki/List_of_Marvel_Comics_characters:_D")
marvel_d_links = filter_links(marvel_d_page, ".hatnote")
print marvel_d_links

['/wiki/Daken', '/wiki/Dakimh_the_Enchanter', '/wiki/Randall_Darby', '/wiki/Daredevil_(Marvel_Comics_character)', '/wiki/Dark_Beast', '/wiki/Dark_Mother', '/wiki/Darkdevil', '/wiki/Darkhawk', '/wiki/Darkoth', '/wiki/Darkstar_(comics)', '/wiki/Darwin_(comics)', '/wiki/Amanda_Sefton', '/wiki/Dazzler', '/wiki/Deacon_(comics)', '/wiki/Deadpool', '/wiki/Death_(Marvel_Comics)', '/wiki/Death_Adder_(comics)', '/wiki/Death_Metal_(comics)', '/wiki/Death-Stalker', '/wiki/Death%27s_Head', '/wiki/Deathbird', '/wiki/Deathlok', '/wiki/Death_Locket', '/wiki/Deathurge', '/wiki/Deathwatch_(comics)', '/wiki/Debrii', '/wiki/Valentina_Allegra_de_Fontaine', '/wiki/Marco_Delgado_(comics)', '/wiki/Marvel_Comics', '/wiki/DC_Comics', '/wiki/Demogoblin', '/wiki/Demolition_Man_(comics)', '/wiki/Demon_Bear', '/wiki/Desak', '/wiki/Destiny_(Irene_Adler)', '/wiki/Destiny_(Marvel_Comics_personification)', '/wiki/Destroyer_(Thor)', '/wiki/Devastator_(comics)', '/wiki/Devil_Dinosaur', '/wiki/Devos_the_Devastator', '/wik

### `retrieve_multipage_category`

This function retrieves links from a multi-page list. On Wikipedia, particularly long lists are often split into one page per section, and those pages tend to share a URL pattern with the section name appended. The function requests each section page in turn, pausing one second before each request, and combines the links returned by `filter_links`.

In [32]:
def retrieve_multipage_category(pattern, sections, selector="", verbose=False):
    result = []
    for section in sections:
        # Pause before each request to avoid overloading Wikipedia
        time.sleep(1)
        if verbose:
            print "Retrieving " + pattern + section
        # Fetch the section page and collect its filtered links
        result.extend(filter_links(PyQuery(url=(pattern + section)), selector))
    return result
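
The URL construction used by `retrieve_multipage_category` is plain string concatenation of the shared pattern and each section name. A minimal sketch of that step on its own, with no network access; `section_urls` is a hypothetical helper and the pattern below is a made-up placeholder:

```python
import string

def section_urls(pattern, sections):
    # Each page URL is the shared pattern with the section name appended
    return [pattern + section for section in sections]

# Hypothetical URL pattern, for illustration only
pattern = "https://en.wikipedia.org/wiki/List_of_Example_items:_"

# One section per capital letter, plus a final section for digits
sections = list(string.ascii_uppercase) + ["0-9"]
urls = section_urls(pattern, sections)
```

Building the list of URLs first makes it easy to inspect what will be requested before any page is fetched.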

In this example, we are retrieving lists from all subpages from the [List of Marvel Comics Characters](https://en.wikipedia.org/wiki/List_of_Marvel_Comics_characters). The URL pattern for these articles is `https://en.wikipedia.org/wiki/List_of_Marvel_Comics_characters:_` followed by a section name.

In [34]:
sections = [letter for letter in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ']
sections.append('0-9')
print "Sections:", sections

character_links = retrieve_multipage_category("https://en.wikipedia.org/wiki/List_of_Marvel_Comics_characters:_", sections, ".hatnote", True)
print len(character_links), "links retrieved"

Sections: ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '0-9']
Retrieving https://en.wikipedia.org/wiki/List_of_Marvel_Comics_characters:_A
Retrieving https://en.wikipedia.org/wiki/List_of_Marvel_Comics_characters:_B
Retrieving https://en.wikipedia.org/wiki/List_of_Marvel_Comics_characters:_C
Retrieving https://en.wikipedia.org/wiki/List_of_Marvel_Comics_characters:_D
Retrieving https://en.wikipedia.org/wiki/List_of_Marvel_Comics_characters:_E
Retrieving https://en.wikipedia.org/wiki/List_of_Marvel_Comics_characters:_F
Retrieving https://en.wikipedia.org/wiki/List_of_Marvel_Comics_characters:_G
Retrieving https://en.wikipedia.org/wiki/List_of_Marvel_Comics_characters:_H
Retrieving https://en.wikipedia.org/wiki/List_of_Marvel_Comics_characters:_I
Retrieving https://en.wikipedia.org/wiki/List_of_Marvel_Comics_characters:_J
Retrieving https://en.wikipedia.org/wiki/List_of_Marvel_Comics_characters:_K
Retri

### `write_list` and `read_list`

These are convenience functions for writing and reading list data.

In [35]:
def write_list(data, filename):
    # Write one item per line
    with open(filename, "w") as f:
        for item in data:
            f.write(item)
            f.write("\n")

def read_list(filename):
    # Read the items back, stripping surrounding whitespace (including the newline)
    with open(filename, "r") as f:
        return [item.strip() for item in f.readlines()]

In this example, we write all of the Marvel character links we retrieved to a text file, and verify that we can read the data from the file.

In [36]:
write_list(character_links, "all_marvel_chars.txt")
print len(read_list("all_marvel_chars.txt")), "read"

2035 read
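
The round trip can also be checked without crawling anything. The sketch below restates the two helpers so that it runs on its own, and writes to a temporary directory; the sample links and the file name are assumptions chosen for illustration.

```python
import os
import tempfile

def write_list(data, filename):
    # One item per line
    with open(filename, "w") as f:
        for item in data:
            f.write(item + "\n")

def read_list(filename):
    # strip() removes the trailing newline and any surrounding whitespace
    with open(filename, "r") as f:
        return [line.strip() for line in f]

# Hypothetical sample data
links = ["/wiki/Deadpool", "/wiki/Dazzler", "/wiki/Deathlok"]
path = os.path.join(tempfile.mkdtemp(), "links.txt")

write_list(links, path)
restored = read_list(path)
```

Because `read_list` strips each line, items that begin or end with whitespace, or that contain a newline, would not survive the round trip unchanged. Wikipedia URL fragments do not normally contain either, so this is unlikely to matter here.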


### `intersection`

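This function returns a list of the elements that appear in both provided lists. Because it converts the input to a `set`, duplicates are dropped and the order of the result is not guaranteed.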

In [40]:
def intersection(list1, list2):
    return list(set(list1).intersection(list2))
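
A small offline illustration of the set-based behavior, assuming made-up link lists (`/wiki/Some_Song` is a hypothetical entry): duplicates collapse to a single element, and the result is sorted here only to make the order predictable.

```python
def intersection(list1, list2):
    # Set intersection keeps each shared element once, in arbitrary order
    return list(set(list1).intersection(list2))

# Hypothetical sample data; note the duplicate in the first list
musician_links = ["/wiki/Jadakiss", "/wiki/Nas", "/wiki/Jadakiss"]
award_page_links = ["/wiki/Nas", "/wiki/Some_Song", "/wiki/Jadakiss"]

common = sorted(intersection(musician_links, award_page_links))
```

Here `common` holds the two links present in both lists, with the song link excluded.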

The `intersection` function is useful for cross-referencing links from one Wikipedia page with another. For example, the [List of Hip Hop Musicians](https://en.wikipedia.org/wiki/List_of_hip_hop_musicians) contains many musicians, but we may not have the time or resources to crawl every single artist. The [BET Hip Hop Awards page](https://en.wikipedia.org/wiki/BET_Hip_Hop_Awards) has links to artists that have won an award and may be significant for our search, but it also contains links to songs and other media that are not articles about individual Hip Hop artists. By taking the `intersection` of both, we get a list containing only links to Hip Hop artists that have won a BET Hip Hop Award.

In [54]:
hip_hop_page = PyQuery(url="https://en.wikipedia.org/wiki/List_of_hip_hop_musicians")
hip_hop_links = filter_links(hip_hop_page, "li")
print len(hip_hop_links), "links retrieved from List of Hip Hop Musicians"

bet_hip_hop_awards_page = PyQuery(url="https://en.wikipedia.org/wiki/BET_Hip_Hop_Awards")
bet_hip_hop_awards_links = filter_links(bet_hip_hop_awards_page, "li")
print len(bet_hip_hop_awards_links), "links retrieved from BET Hip Hop Awards"

award_winner_links = intersection(hip_hop_links, bet_hip_hop_awards_links)

print "BET Hip Hop Award winners:", award_winner_links

1115 links retrieved from List of Hip Hop Musicians
705 links retrieved from BET Hip Hop Awards
BET Hip Hop Award winners: ['/wiki/Jadakiss', '/wiki/Problem_(rapper)', '/wiki/Rapsody', '/wiki/B.o.B', '/wiki/Juicy_J', '/wiki/Schoolboy_Q', '/wiki/Gucci_Mane', '/wiki/Tech_N9ne', '/wiki/Jim_Jones_(rapper)', '/wiki/Black_Thought', '/wiki/Joe_Budden', '/wiki/Lil_Yachty', '/wiki/Unk', '/wiki/Keith_Murray_(rapper)', '/wiki/Angel_Haze', '/wiki/50_Cent', '/wiki/B.G._(rapper)', '/wiki/Stat_Quo', '/wiki/Jay_Electronica', '/wiki/Mystikal', '/wiki/Birdman_(rapper)', '/wiki/Snow_Tha_Product', '/wiki/Desiigner', '/wiki/Rich_Homie_Quan', '/wiki/Metro_Boomin', '/wiki/Shawty_Lo', '/wiki/Buckshot_(rapper)', '/wiki/Swizz_Beatz', '/wiki/Mac_Miller', '/wiki/Lil_Uzi_Vert', '/wiki/Big_Sean', '/wiki/Rahzel', '/wiki/Vic_Mensa', '/wiki/Kardinal_Offishall', '/wiki/21_Savage', '/wiki/Daveed_Diggs', '/wiki/Andy_Mineo', '/wiki/Mike_Zombie', '/wiki/Cory_Gunz', '/wiki/The_Game_(rapper)', '/wiki/Tiffany_Foxx', '/wiki/Ch

### `crawl`

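This function crawls articles breadth-first, with a two-second delay between page retrievals. It requires a starting URL fragment and an `accept` list of URL fragments the crawler is allowed to follow; an optional `reject` list names fragments to skip. The crawl stops once `max_articles` pages have been retrieved or the queue is empty, and links found on pages deeper than `max_depth` are not queued. It returns a dictionary with each URL fragment as a key and its crawl data (`links`, `title` and `depth`) as the value.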

In [59]:
def crawl(start, max_articles=200, max_depth=3, accept=(), reject=(), host="https://en.wikipedia.org"):
    from collections import deque
    result = dict()
    crawl_queue = deque()
    crawl_queue.append(start)
    count = 0
    while count < max_articles and len(crawl_queue) > 0:
        count = count + 1
        current = crawl_queue.popleft()
        print "{}: Retrieving {}, ({} left in queue)".format(count, current, len(crawl_queue))
        page = PyQuery(url=host+current)
        
        # Make sure page exists in structure
        if current not in result:
            result[current] = dict()
            result[current]["depth"] = 0
        
        # Save page data
        result[current]["title"] = page("#firstHeading").text()
        result[current]["links"] = [link for link in filter_links(page) if link in accept and link not in reject]
        
        # Check current depth; we don't want to go too deep!
        if result[current]["depth"] <= max_depth: 
            for link in result[current]["links"]:
                if link not in result and link not in crawl_queue:
                    crawl_queue.append(link)
                    result[link] = dict()
                    result[link]["depth"] = result[current]["depth"] + 1
                    
        # Important: pause between requests to avoid overloading Wikipedia
        time.sleep(2)

    return result
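
The queue and depth bookkeeping can be exercised offline by replacing page retrieval with a lookup in an in-memory link graph. The sketch below is a simplified illustration under that assumption: `crawl_graph` and the toy graph are hypothetical, and titles, the `reject` list and the delay are left out.

```python
from collections import deque

def crawl_graph(graph, start, max_articles=200, max_depth=3, accept=()):
    """Breadth-first crawl over a dict mapping each page to its outgoing links."""
    result = {start: {"depth": 0}}
    queue = deque([start])
    count = 0
    while count < max_articles and queue:
        count += 1
        current = queue.popleft()
        # Stand-in for fetching the page and filtering its links
        links = [link for link in graph.get(current, []) if link in accept]
        result[current]["links"] = links
        if result[current]["depth"] <= max_depth:
            for link in links:
                # Every queued page already has an entry in result
                if link not in result:
                    result[link] = {"depth": result[current]["depth"] + 1}
                    queue.append(link)
    return result

# Hypothetical toy graph
toy_graph = {
    "/wiki/A": ["/wiki/B", "/wiki/C", "/wiki/X"],
    "/wiki/B": ["/wiki/C", "/wiki/A"],
    "/wiki/C": [],
}
allowed = ["/wiki/A", "/wiki/B", "/wiki/C"]

full = crawl_graph(toy_graph, "/wiki/A", accept=allowed)
partial = crawl_graph(toy_graph, "/wiki/A", max_articles=1, accept=allowed)
```

In `full`, `/wiki/X` is never recorded because it is not in the accept list. In `partial`, the crawl stops after one page, so `/wiki/B` and `/wiki/C` have a `depth` but no `links`; code that consumes crawl data should expect such incomplete entries whenever the article limit is reached.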


In this example, we start a crawl at the article for [Jadakiss](https://en.wikipedia.org/wiki/Jadakiss). As the `accept` list of URLs that the crawler is allowed to follow, we use the `award_winner_links` list from earlier. This restricts the crawl to [artists who have won BET Hip Hop Awards](https://en.wikipedia.org/wiki/BET_Hip_Hop_Awards), so the resulting social network consists only of BET Hip Hop Award winners. The crawl data is stored in the `bet_winner_crawl` dictionary.

In [60]:
bet_winner_crawl = crawl("/wiki/Jadakiss", accept=award_winner_links)

1: Retrieving /wiki/Jadakiss, (0 left in queue)
2: Retrieving /wiki/DMX_(rapper), (28 left in queue)
3: Retrieving /wiki/Pharrell_Williams, (30 left in queue)
4: Retrieving /wiki/Swizz_Beatz, (38 left in queue)
5: Retrieving /wiki/Yo_Gotti, (40 left in queue)
6: Retrieving /wiki/Rick_Ross, (42 left in queue)
7: Retrieving /wiki/Fat_Joe, (49 left in queue)
8: Retrieving /wiki/DJ_Khaled, (55 left in queue)
9: Retrieving /wiki/Nicki_Minaj, (65 left in queue)
10: Retrieving /wiki/Styles_P, (65 left in queue)
11: Retrieving /wiki/Sean_Combs, (64 left in queue)
12: Retrieving /wiki/Snoop_Dogg, (63 left in queue)
13: Retrieving /wiki/Eminem, (69 left in queue)
14: Retrieving /wiki/Common_(rapper), (80 left in queue)
15: Retrieving /wiki/Nas, (83 left in queue)
16: Retrieving /wiki/Beanie_Sigel, (84 left in queue)
17: Retrieving /wiki/Freeway_(rapper), (84 left in queue)
18: Retrieving /wiki/Fabolous, (83 left in queue)
19: Retrieving /wiki/Ludacris, (83 left in queue)
20: Retrieving /wiki/Bus

In [61]:
print bet_winner_crawl

{'/wiki/Jadakiss': {'depth': 0, 'links': ['/wiki/DMX_(rapper)', '/wiki/Pharrell_Williams', '/wiki/Swizz_Beatz', '/wiki/Yo_Gotti', '/wiki/Rick_Ross', '/wiki/Fat_Joe', '/wiki/DJ_Khaled', '/wiki/Nicki_Minaj', '/wiki/DMX_(rapper)', '/wiki/Styles_P', '/wiki/Sean_Combs', '/wiki/DMX_(rapper)', '/wiki/Snoop_Dogg', '/wiki/Swizz_Beatz', '/wiki/Styles_P', '/wiki/Eminem', '/wiki/Styles_P', '/wiki/Common_(rapper)', '/wiki/Nas', '/wiki/Beanie_Sigel', '/wiki/DJ_Khaled', '/wiki/Freeway_(rapper)', '/wiki/Beanie_Sigel', '/wiki/Swizz_Beatz', '/wiki/Fabolous', '/wiki/Ludacris', '/wiki/Busta_Rhymes', '/wiki/Twista', '/wiki/Ace_Hood', '/wiki/Fat_Joe', '/wiki/Bun_B', '/wiki/Waka_Flocka_Flame', '/wiki/The-Dream', '/wiki/Young_Jeezy', '/wiki/Wiz_Khalifa', '/wiki/Rick_Ross', '/wiki/Lil_Wayne', '/wiki/Yo_Gotti', '/wiki/Fabolous', '/wiki/Future_(rapper)', '/wiki/Styles_P', '/wiki/Fat_Joe', '/wiki/50_Cent', '/wiki/Styles_P'], 'title': 'Jadakiss'}, '/wiki/Problem_(rapper)': {'depth': 3, 'links': ['/wiki/E-40', '/wi

### `flatten_crawl`

This function flattens the raw crawl data into something a little more readable. It returns a dictionary keyed by article title; each value is a dictionary that maps the title of a linked article to the number of times the article links to it.

In [67]:
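def flatten_crawl(data):
    result = dict()
    for item in data:
        # Pages that were queued but never retrieved have no title or links
        if "title" not in data[item]:
            continue
        current_title = data[item]["title"]
        result[current_title] = dict()
        for link in data[item]["links"]:
            # Skip links to pages the crawl never retrieved
            if link not in data or "title" not in data[link]:
                continue
            link_title = data[link]["title"]
            if link_title not in result[current_title]:
                result[current_title][link_title] = data[item]["links"].count(link)
    return result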
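
The same flattening can be sketched with `collections.Counter`, which tallies repeated links in one pass. This is an alternative formulation for illustration, not the notebook's function: `flatten_with_counter` and the toy crawl data are hypothetical, and entries without a `title` (pages that were queued but not retrieved) are skipped.

```python
from collections import Counter

def flatten_with_counter(data):
    flat = {}
    for entry in data.values():
        # Skip pages that were queued but never retrieved
        if "title" not in entry:
            continue
        # Count how often each retrieved page is linked, keyed by its title
        counts = Counter(
            data[link]["title"]
            for link in entry["links"]
            if "title" in data.get(link, {})
        )
        flat[entry["title"]] = dict(counts)
    return flat

# Hypothetical crawl data; "/wiki/D" was queued but never retrieved
raw = {
    "/wiki/A": {"title": "A", "depth": 0,
                "links": ["/wiki/B", "/wiki/B", "/wiki/C", "/wiki/D"]},
    "/wiki/B": {"title": "B", "depth": 1, "links": ["/wiki/A"]},
    "/wiki/C": {"title": "C", "depth": 1, "links": []},
    "/wiki/D": {"depth": 1},
}
flat = flatten_with_counter(raw)
```

Article A links to B twice and to C once, so `flat["A"]` records those counts, while the unretrieved page D appears neither as a key nor as a linked title.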

In [68]:
flattened_data = flatten_crawl(bet_winner_crawl)
print flattened_data

{'Mannie Fresh': {'Lil Wayne': 2, 'B.G. (rapper)': 1, 'Mos Def': 1, 'Juvenile (rapper)': 1, 'Rick Ross': 1, 'Birdman (rapper)': 3, 'Lil Jon': 1, 'T.I.': 1, 'Mystikal': 1}, 'Beanie Sigel': {'Kanye West': 1, 'Jadakiss': 3, 'DMX (rapper)': 1, '50 Cent': 2, 'Raekwon': 1, 'Freeway (rapper)': 4, 'Omillio Sparks': 2, 'Meek Mill': 1, 'Scarface (rapper)': 2}, 'Nipsey Hussle': {'Problem (rapper)': 1, 'Dom Kennedy': 3, 'DJ Drama': 1, 'Jay Rock': 1, 'YG (rapper)': 3, 'Rick Ross': 3, 'Freeway (rapper)': 1, 'Big Sean': 1, 'Dr. Dre': 1, 'DJ Mustard': 1, 'Tyga': 1, 'Snoop Dogg': 2}, 'Nicki Minaj': {'Lil Wayne': 6, 'Kanye West': 3, 'Busta Rhymes': 1, 'Remy Ma': 3, 'Ludacris': 1, 'Jadakiss': 1, 'Eminem': 2, 'Birdman (rapper)': 1, 'Yo Gotti': 2, 'Meek Mill': 3, 'Big Sean': 1, 'Ice Cube': 1, 'Tyga': 1, 'Gucci Mane': 1, "Lil' Kim": 3}, 'Charles Hamilton (rapper)': {'Ace Hood': 1, 'Cory Gunz': 1, 'Kanye West': 1, 'Dr. Dre': 2, 'B.o.B': 2, 'Wale (rapper)': 1, 'DMX (rapper)': 1, 'MC Lyte': 2, '50 Cent': 1, 'E

### `save_dict` and `load_dict`

These functions save and load Python dictionaries. This is convenient when you want to save crawl data.

In [69]:
def save_dict(data, filename):
    import json
    with open(filename, 'w') as f:
        json.dump(data, f)
        
def load_dict(filename):
    import json
    with open(filename, 'r') as f:
        data = json.load(f)
        return data
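
A quick offline check of the JSON round trip, using a temporary file and a made-up network (the data and file name are assumptions for illustration):

```python
import json
import os
import tempfile

# Hypothetical flattened network
network = {"A": {"B": 2, "C": 1}, "C": {}}
path = os.path.join(tempfile.mkdtemp(), "network.json")

with open(path, "w") as f:
    json.dump(network, f)

with open(path, "r") as f:
    restored = json.load(f)

# JSON object keys are always strings, so non-string keys change type
int_keyed = json.loads(json.dumps({1: "x"}))
```

Nested dictionaries with string keys and integer counts survive unchanged, which is the shape `flatten_crawl` produces. Two details are worth keeping in mind: keys that are not strings come back as strings, and under Python 2 `json.load` returns `unicode` strings, so reloaded keys print with a `u` prefix.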

In [70]:
save_dict(flattened_data, "bet_network.json")
print load_dict("bet_network.json")

{u'Mannie Fresh': {u'Lil Wayne': 2, u'B.G. (rapper)': 1, u'Rick Ross': 1, u'Juvenile (rapper)': 1, u'Mos Def': 1, u'Birdman (rapper)': 3, u'Lil Jon': 1, u'T.I.': 1, u'Mystikal': 1}, u'Beanie Sigel': {u'Kanye West': 1, u'Jadakiss': 3, u'DMX (rapper)': 1, u'50 Cent': 2, u'Raekwon': 1, u'Freeway (rapper)': 4, u'Omillio Sparks': 2, u'Meek Mill': 1, u'Scarface (rapper)': 2}, u'Nipsey Hussle': {u'Problem (rapper)': 1, u'Dom Kennedy': 3, u'Big Sean': 1, u'Jay Rock': 1, u'YG (rapper)': 3, u'Rick Ross': 3, u'Freeway (rapper)': 1, u'DJ Drama': 1, u'Dr. Dre': 1, u'DJ Mustard': 1, u'Tyga': 1, u'Snoop Dogg': 2}, u'Nicki Minaj': {u'Lil Wayne': 6, u'Kanye West': 3, u'Busta Rhymes': 1, u'Remy Ma': 3, u'Ludacris': 1, u'Jadakiss': 1, u'Eminem': 2, u'Birdman (rapper)': 1, u'Yo Gotti': 2, u'Meek Mill': 3, u'Big Sean': 1, u'Ice Cube': 1, u'Tyga': 1, u'Gucci Mane': 1, u"Lil' Kim": 3}, u'Charles Hamilton (rapper)': {u'Ace Hood': 1, u'Cory Gunz': 1, u'Kanye West': 1, u'Dr. Dre': 2, u'B.o.B': 2, u'Wale (rapper

### Example

In this example, we will use all of the functions above to retrieve and save a social network of characters from the Marvel Cinematic Universe.

In [71]:
sections = [letter for letter in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ']
sections.append('0-9')
print "Sections:", sections

character_links = retrieve_multipage_category("https://en.wikipedia.org/wiki/List_of_Marvel_Comics_characters:_", sections, ".hatnote", True)
print len(character_links), "links retrieved"

mcu_page = PyQuery(url="https://en.wikipedia.org/wiki/Marvel_Cinematic_Universe")
mcu_links = filter_links(mcu_page)

mcu_characters = intersection(character_links, mcu_links)
print len(mcu_characters), "Marvel Cinematic Universe characters"

mcu_crawl_data = crawl("/wiki/Iron_Man", accept=mcu_characters)
flattened_mcu_data = flatten_crawl(mcu_crawl_data)

save_dict(flattened_mcu_data, "mcu_network.json")
print load_dict("mcu_network.json")

Sections: ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', '0-9']
Retrieving https://en.wikipedia.org/wiki/List_of_Marvel_Comics_characters:_A
Retrieving https://en.wikipedia.org/wiki/List_of_Marvel_Comics_characters:_B
Retrieving https://en.wikipedia.org/wiki/List_of_Marvel_Comics_characters:_C
Retrieving https://en.wikipedia.org/wiki/List_of_Marvel_Comics_characters:_D
Retrieving https://en.wikipedia.org/wiki/List_of_Marvel_Comics_characters:_E
Retrieving https://en.wikipedia.org/wiki/List_of_Marvel_Comics_characters:_F
Retrieving https://en.wikipedia.org/wiki/List_of_Marvel_Comics_characters:_G
Retrieving https://en.wikipedia.org/wiki/List_of_Marvel_Comics_characters:_H
Retrieving https://en.wikipedia.org/wiki/List_of_Marvel_Comics_characters:_I
Retrieving https://en.wikipedia.org/wiki/List_of_Marvel_Comics_characters:_J
Retrieving https://en.wikipedia.org/wiki/List_of_Marvel_Comics_characters:_K
Retri