# Six Degrees of Kevin Bacon

![](images/Kevin_Bacon.jpg)

This activity is motivated by the text **Web Scraping with Python** by Ryan Mitchell, available through O'Reilly [here](http://shop.oreilly.com/product/0636920078067.do).  The book covers common web scraping tasks in Python in much greater depth, using a range of libraries, and I highly recommend it.  We will focus on moving from a base page to further pages by following its links.

In [1]:
import requests
from bs4 import BeautifulSoup

### Scraping Links

Below, we take the Wikipedia page on the Six Degrees of Kevin Bacon game.  Our goal is to extract links to other pages that we will subsequently pass to `requests`.  Recall that a link lives in an `<a>` tag, with its target stored in the `href` attribute.  For example, the tag

```HTML
<a href="/wiki/Six_degrees_of_separation" title="Six degrees of separation">six degrees of separation</a>
```

references the Six Degrees of Separation article.  Note that this is a relative URL pointing to another page within Wikipedia.  We can isolate these internal Wikipedia references.  To begin, let's inspect the link content.

In [2]:
response = requests.get('https://en.wikipedia.org/wiki/Six_Degrees_of_Kevin_Bacon')

In [3]:
soup = BeautifulSoup(response.text, 'html.parser')

In [6]:
soup.find_all('a')[:10]

[<a id="top"></a>,
 <a class="mw-jump-link" href="#mw-head">Jump to navigation</a>,
 <a class="mw-jump-link" href="#p-search">Jump to search</a>,
 <a class="image" href="/wiki/File:Kevin_Bacon.jpg"><img alt="" class="thumbimage" data-file-height="461" data-file-width="369" height="275" src="//upload.wikimedia.org/wikipedia/commons/thumb/d/d2/Kevin_Bacon.jpg/220px-Kevin_Bacon.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/d/d2/Kevin_Bacon.jpg/330px-Kevin_Bacon.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/d/d2/Kevin_Bacon.jpg 2x" width="220"/></a>,
 <a class="internal" href="/wiki/File:Kevin_Bacon.jpg" title="Enlarge"></a>,
 <a class="mw-redirect" href="/wiki/Parlor_game" title="Parlor game">parlor game</a>,
 <a href="/wiki/Six_degrees_of_separation" title="Six degrees of separation">six degrees of separation</a>,
 <a href="/wiki/Kevin_Bacon" title="Kevin Bacon">Kevin Bacon</a>,
 <a href="/wiki/Hollywood" title="Hollywood">Hollywood</a>,
 <a href="/wiki/Kevin_Bacon" ti

In [7]:
for link in soup.find_all('a')[:10]:
    print(link.attrs)

{'id': 'top'}
{'class': ['mw-jump-link'], 'href': '#mw-head'}
{'class': ['mw-jump-link'], 'href': '#p-search'}
{'href': '/wiki/File:Kevin_Bacon.jpg', 'class': ['image']}
{'href': '/wiki/File:Kevin_Bacon.jpg', 'class': ['internal'], 'title': 'Enlarge'}
{'href': '/wiki/Parlor_game', 'class': ['mw-redirect'], 'title': 'Parlor game'}
{'href': '/wiki/Six_degrees_of_separation', 'title': 'Six degrees of separation'}
{'href': '/wiki/Kevin_Bacon', 'title': 'Kevin Bacon'}
{'href': '/wiki/Hollywood', 'title': 'Hollywood'}
{'href': '/wiki/Kevin_Bacon', 'title': 'Kevin Bacon'}


In [4]:
for link in soup.find_all('a')[:10]:
    if 'href' in link.attrs:
        print(link.attrs['href'])

#mw-head
#p-search
/wiki/File:Kevin_Bacon.jpg
/wiki/File:Kevin_Bacon.jpg
/wiki/Parlor_game
/wiki/Six_degrees_of_separation
/wiki/Kevin_Bacon
/wiki/Hollywood
/wiki/Kevin_Bacon


Okay, it seems there are other links mixed in with the internal wiki links.  However, the wiki links all begin with `/wiki/`, contain no colons, and sit within the body of the page.  Exploiting these facts, we can write a regular expression

```
^(/wiki/)((?!:).)*$
```

that will match only the wiki links.  

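**regular expression**
* `^(/wiki/)` means the string starts with `/wiki/`
* `[^A]` (not used here, shown for comparison) matches any single character other than `A`
* `(?!:)` is a negative lookahead: the next character must not be `:`
* `.` matches any single character, so `((?!:).)*` matches any run of characters that contains no `:`
* `$` marks the end of the string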
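
To check the pattern against the kinds of `href` values printed earlier, here is a small sketch using only the standard `re` module. The sample list is copied from the output above; the second pattern, `^/wiki/[^:]*$`, is my own simplification using a negated character class and is assumed to behave the same way on these single-line strings.

```python
import re

# The pattern from the text: starts with /wiki/, then any characters except ':'
wiki_pattern = re.compile(r'^(/wiki/)((?!:).)*$')

# A simpler alternative (an assumption, not from the original): a negated character class
simple_pattern = re.compile(r'^/wiki/[^:]*$')

# Sample hrefs taken from the first ten links on the page
hrefs = [
    '#mw-head',
    '#p-search',
    '/wiki/File:Kevin_Bacon.jpg',
    '/wiki/Parlor_game',
    '/wiki/Six_degrees_of_separation',
    '/wiki/Kevin_Bacon',
]

wiki_links = [h for h in hrefs if wiki_pattern.match(h)]
simple_links = [h for h in hrefs if simple_pattern.match(h)]

print(wiki_links)                  # only the colon-free /wiki/ links survive
print(simple_links == wiki_links)  # the two patterns agree on this sample
```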

In [5]:
import re

In [11]:
soup.find('div', {'id': 'bodyContent'})

<div class="mw-body-content" id="bodyContent">
<div class="noprint" id="siteSub">From Wikipedia, the free encyclopedia</div> <div id="contentSub"></div>
<div id="jump-to-nav"></div> <a class="mw-jump-link" href="#mw-head">Jump to navigation</a>
<a class="mw-jump-link" href="#p-search">Jump to search</a>
<div class="mw-content-ltr" dir="ltr" id="mw-content-text" lang="en"><div class="mw-parser-output"><div class="thumb tright">
<div class="thumbinner" style="width:222px;"><a class="image" href="/wiki/File:Kevin_Bacon.jpg"><img alt="" class="thumbimage" data-file-height="461" data-file-width="369" height="275" src="//upload.wikimedia.org/wikipedia/commons/thumb/d/d2/Kevin_Bacon.jpg/220px-Kevin_Bacon.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/d/d2/Kevin_Bacon.jpg/330px-Kevin_Bacon.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/d/d2/Kevin_Bacon.jpg 2x" width="220"/></a>
<div class="thumbcaption">
<div class="magnify"><a class="internal" href="/wiki/File:Kevin_Bacon.jpg

In [8]:
for link in soup.find('div', {'id': 'bodyContent'}).find_all('a', href = re.compile('^(/wiki/)((?!:).)*$'))[:10]:
    if 'href' in link.attrs:
        print(link.attrs['href'])

/wiki/Parlor_game
/wiki/Six_degrees_of_separation
/wiki/Kevin_Bacon
/wiki/Hollywood
/wiki/Kevin_Bacon
/wiki/Kevin_Bacon
/wiki/Charitable_organization
/wiki/SixDegrees.org
/wiki/Premiere_(magazine)
/wiki/Kevin_Bacon


### A Function for Links

Now, let's write a function that extracts the links from any Wikipedia page.  We can reuse the idea that the links we care about sit in the same place as in our Six Degrees example: inside the `bodyContent` div.

In [12]:
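def get_wikilinks(url):
    """Return the internal /wiki/ link tags found in the body of a Wikipedia page."""
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Search only the article body, keeping hrefs that start with /wiki/ and contain no colon
    body = soup.find('div', {'id': 'bodyContent'})
    links = list(body.find_all('a', href=re.compile('^(/wiki/)((?!:).)*$')))
    return links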
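
One possible refinement, sketched here under assumptions rather than taken from the original: separate parsing from downloading so that the link extraction can be tried on a small HTML string, and return an empty list when the `bodyContent` div is missing instead of raising an `AttributeError`. The name `extract_wikilinks` and the sample HTML are hypothetical.

```python
import re
from bs4 import BeautifulSoup

WIKI_LINK = re.compile('^(/wiki/)((?!:).)*$')

def extract_wikilinks(html):
    """Return the internal /wiki/ link tags inside the bodyContent div of an HTML string."""
    soup = BeautifulSoup(html, 'html.parser')
    body = soup.find('div', {'id': 'bodyContent'})
    if body is None:
        # No bodyContent div (not a standard article page): nothing to return
        return []
    return list(body.find_all('a', href=WIKI_LINK))

# Hypothetical miniature page used only for illustration
sample_html = '''
<div id="mw-head"><a href="/wiki/Main_Page">Main page</a></div>
<div id="bodyContent">
  <a href="#p-search">Jump to search</a>
  <a href="/wiki/File:Kevin_Bacon.jpg">image</a>
  <a href="/wiki/Footloose_(1984_film)">Footloose</a>
  <a href="/wiki/Kyra_Sedgwick">Kyra Sedgwick</a>
</div>
'''

hrefs = [link['href'] for link in extract_wikilinks(sample_html)]
print(hrefs)
print(extract_wikilinks('<p>no body div here</p>'))
```

With this split, a function like `get_wikilinks` would only need to download the page and hand `response.text` to the parser.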

In [13]:
links = get_wikilinks('https://en.wikipedia.org/wiki/Kevin_Bacon')

In [14]:
links[:10]

[<a class="mw-disambig" href="/wiki/Kevin_Bacon_(disambiguation)" title="Kevin Bacon (disambiguation)">Kevin Bacon (disambiguation)</a>,
 <a href="/wiki/San_Diego_Comic-Con" title="San Diego Comic-Con">San Diego Comic-Con</a>,
 <a href="/wiki/Philadelphia" title="Philadelphia">Philadelphia</a>,
 <a href="/wiki/Pennsylvania" title="Pennsylvania">Pennsylvania</a>,
 <a href="/wiki/Kyra_Sedgwick" title="Kyra Sedgwick">Kyra Sedgwick</a>,
 <a href="/wiki/Sosie_Bacon" title="Sosie Bacon">Sosie Bacon</a>,
 <a href="/wiki/Edmund_Bacon_(architect)" title="Edmund Bacon (architect)">Edmund Bacon</a>,
 <a href="/wiki/Michael_Bacon_(musician)" title="Michael Bacon (musician)">Michael Bacon</a>,
 <a href="/wiki/Footloose_(1984_film)" title="Footloose (1984 film)">Footloose</a>,
 <a href="/wiki/JFK_(film)" title="JFK (film)">JFK</a>]

In [17]:
type(links[0])

bs4.element.Tag
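
Since each element is a `Tag` rather than a plain string, the URL still has to be read from its `href` attribute. The sketch below rebuilds one tag from the output above and shows a few equivalent ways to get at it; joining with `urllib.parse.urljoin` is my suggestion for forming the absolute URL, not something the original does.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# One of the link tags shown above, parsed on its own for illustration
html = '<a href="/wiki/Kyra_Sedgwick" title="Kyra Sedgwick">Kyra Sedgwick</a>'
tag = BeautifulSoup(html, 'html.parser').find('a')

href = tag.attrs['href']      # dictionary of attributes, as used in this notebook
same_href = tag['href']       # shorthand for the same lookup
missing = tag.get('class')    # .get returns None when the attribute is absent
text = tag.get_text()         # the visible link text

# Turn the relative href into an absolute URL
full_url = urljoin('https://en.wikipedia.org', href)
print(full_url)
```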

### Connecting Pages

Now, we want to follow these references, gather more URLs, and repeat.  So that the walk does not run indefinitely, I cut it short by continuing only while the current page has more than 100 links.  To keep following links until we land on a page with no internal links at all, we would simply change the

```python
while len(links) > 100:
```

to 

```python
while len(links) > 0:
```
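
A walk that only stops on a page with no links could run for a very long time and may revisit pages. The following is a minimal sketch of one way to bound it, assuming a step cap and a set of visited pages; `random_walk`, `max_steps`, and the tiny `toy_links` dictionary (which stands in for real calls to `get_wikilinks`) are all hypothetical names introduced for illustration.

```python
import random

# Hypothetical link graph standing in for get_wikilinks: page -> pages it links to
toy_links = {
    'A': ['B', 'C'],
    'B': ['A', 'C', 'D'],
    'C': ['A'],
    'D': [],
}

def random_walk(start, get_links, max_steps=10, rng=random):
    """Follow random unvisited links from start, for at most max_steps hops."""
    path = [start]
    visited = {start}
    current = start
    for _ in range(max_steps):
        # Only consider pages we have not already seen
        candidates = [page for page in get_links(current) if page not in visited]
        if not candidates:
            break  # dead end: no new pages reachable from here
        current = rng.choice(candidates)
        visited.add(current)
        path.append(current)
    return path

path = random_walk('A', lambda page: toy_links[page], max_steps=5)
print(path)
```

With the real crawler, `get_links` would wrap `get_wikilinks` and return `href` strings; pausing briefly between requests (for example with `time.sleep`) is also a common courtesy to the server.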

In [19]:
import random

In [20]:
links = get_wikilinks('https://en.wikipedia.org/wiki/Kevin_Bacon')

In [21]:
len(links)

353

In [22]:
while len(links) > 100:
    # Pick one of the current page's links at random and follow it
    newArticle = 'https://en.wikipedia.org' + random.choice(links).attrs['href']
    print(newArticle)
    links = get_wikilinks(newArticle)

https://en.wikipedia.org/wiki/People_(American_magazine)
https://en.wikipedia.org/wiki/Allrecipes.com


### Problem

Write a function to retrieve a list of albums in any area you are interested in, using Wikipedia's Lists of albums page: https://en.wikipedia.org/wiki/Lists_of_albums.

In [23]:
response = requests.get('https://en.wikipedia.org/wiki/Lists_of_fastest-selling_albums')
soup = BeautifulSoup(response.text, 'html.parser')

In [53]:
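# Section headings and data tables, both taken from the article content
h2s = soup.find('div', {'id': 'mw-content-text'}).find_all('span', {'class': 'mw-headline'})
tables = soup.find('div', {'id': 'mw-content-text'}).find_all('table', {'class': 'wikitable'})

# Pair each table with the heading at the same position; this assumes exactly one table per heading
for i, table in enumerate(tables):
    print(h2s[i].text)
    rows = table.find('tbody').find_all('tr')

    for row in rows:
        # Replace the newlines between cells with spaces so each row prints on one line
        print(re.sub(r'\n', ' ', row.text).strip())
    print("")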

Canada
Rank  Year  Title  Artist  Sales  Reference
1  2015  25  Adele  305,928  [5]
2  1997  Let's Talk About Love  Céline Dion  230,212  [6][7]
3  1999  Millennium  Backstreet Boys  191,791  [8][9]
4  2003  Star Académie  Star Académie  173,000  [10]
5  2000  Black & Blue  Backstreet Boys  156,307  [8]
6  2002  A New Day Has Come  Céline Dion  151,600  [11]
7  2002  Up!  Shania Twain  150,000  [12]
8  2008  Black Ice  AC/DC  119,000  [13]
9  2006  I Think of You  Gregory Charles  109,000  [14]
10  2014  1989  Taylor Swift  107,000  [15]

Germany
Rank  Year  Title  Artist  Sales  Reference
1  2007  12  Herbert Grönemeyer  More than 263,000  [16]
2  2015  25  Adele  263,000  [16]

Ireland
Rank  Year  Title  Artist  Sales  Reference
1  2001  Distance  Hikaru Utada  3,002,720  [19][20]
2  2001  A Best  Ayumi Hamasaki  2,874,870  [19][20]
3  1998  The Best "Pleasure"  B'z  2,709,530  [20][21]
4  1998  The Best "Treasure"  B'z  2,500,120  [20][22]
5  2002  Deep River  Hikaru Utada  2,350,17
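
In the output above, the rows listed under "Ireland" appear to be Japanese releases, which suggests the headings and tables drifted out of step when paired by position. One way to make the pairing more robust, sketched here on a hypothetical HTML fragment rather than the live page, is to ask each table for the nearest heading that precedes it using BeautifulSoup's `find_previous`.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the second heading has no table, which breaks positional pairing
html = '''
<h2><span class="mw-headline">Canada</span></h2>
<table class="wikitable"><tr><td>25</td><td>Adele</td></tr></table>
<h2><span class="mw-headline">Ireland</span></h2>
<p>No table in this section.</p>
<h2><span class="mw-headline">Japan</span></h2>
<table class="wikitable"><tr><td>Distance</td><td>Hikaru Utada</td></tr></table>
'''

soup = BeautifulSoup(html, 'html.parser')

labelled = {}
for table in soup.find_all('table', {'class': 'wikitable'}):
    # Nearest heading span that appears before this table in the document
    heading = table.find_previous('span', {'class': 'mw-headline'})
    rows = [[cell.get_text() for cell in row.find_all('td')]
            for row in table.find_all('tr')]
    labelled[heading.get_text()] = rows

print(labelled)
```

Here the table of Japanese releases is filed under "Japan" even though "Ireland" sits between the two tables, whereas indexing headings by table position would have labelled it "Ireland".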