# CS 109A/STAT 121A/AC 209A/CSCI E-109A: Homework 1



**Harvard University**<br/>
**Fall 2017**<br/>
**Instructors**: Pavlos Protopapas, Kevin Rader, Rahul Dave, Margo Levine

---

### INSTRUCTIONS

**WARNING**: There is web page scraping in this homework. It takes about 40 minutes. **Do not wait till the last minute** to do this homework.

- To submit your assignment, follow the instructions given in Canvas.
- Restart the kernel and run the whole notebook again before you submit. There is an important CAVEAT to this. DO NOT run the web-page fetching cells again. (We have provided hints like `# DO NOT RERUN THIS CELL WHEN SUBMITTING` on some of the cells where we provide the code). Instead load your data structures from the JSON files we will ask you to save below. Otherwise you will be waiting for a long time. (Another reason to not wait until the last moment to submit.)

- Do not include your name in the notebook.

---

# Homework 1: Rihanna or Mariah?

Billboard Magazine puts out a top 100 list of "singles" every week. Information from this list, as well as that from music sales, radio, and other sources is used to determine a top-100 "singles" of the year list. A **single** is typically one song, but sometimes can be two songs which are on one "single" record.

In this homework you will: 

1. Scrape Wikipedia to obtain information about the best singers and groups from each year (distinguishing between the two groups) as determined by the Billboard top 100 charts. You will have to clean this data. Along the way you will learn how to save data in JSON files to avoid repeated scraping. 
2. Scrape Wikipedia to obtain information on these singers. You will have to scrape the web pages, this time using a cache to guard against network timeouts (or your laptop going to sleep). You will again clean the data, and save it to a json file.
3. Use pandas to represent these two datasets and merge them.
4. Use the individual and merged datasets to visualize the performance of the artists and their songs. We have kept the amount of analysis limited here for reasons of time; but you might enjoy exploring music genres and other aspects of the music business you can find on these wikipedia pages at your own leisure.

You should have worked through Lab 0 and Lab 1, and Lecture 2.  Lab 2 will help as well.

As usual, first we import the necessary libraries.  In particular, we use [Seaborn](http://stanford.edu/~mwaskom/software/seaborn/) to give us a nicer default color palette, with our plots being of large (`poster`) size and with a white-grid background. 

In [14]:
%matplotlib inline
import functools
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
import time
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns

## Q1. Scraping Wikipedia for Billboard Top 100.

In this question you will scrape Wikipedia for the Billboard's top 100 singles.

### Scraping Wikipedia for Billboard singles

We'll be using [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/), and suggest that you use the `requests` library (a third-party package, not part of Python's standard library) to fetch the web page.

#### 1.1 Parsing the Billboard Wikipedia page for 1970

Obtain the web page at http://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_1970 using an HTTP GET request. From this web page we'll extract the top 100 singles and their rankings. Create a list of dictionaries, 100 of them to be precise, with entries like 

`{'url': '/wiki/Sugarloaf_(band)', 'ranking': 30, 'band_singer': 'Sugarloaf', 'title': 'Green-Eyed Lady'}`. 

If you look at that web page, you'll see a link for every song, from which you can get the `url` of the singer or band. We will use these links later to scrape information about the singer or band. From the listing we can also get the band or singer name `band_singer`, and `title` of the song.

*HINT: look for a table with class `wikitable`.*

You should get something similar to this (where songs is the aforementioned list):

```
songs[2:4]
```

```
[{'band_singer': 'The Guess Who',
  'ranking': 3,
  'title': '"American Woman"',
  'url': '/wiki/The_Guess_Who'},
 {'band_singer': 'B.J. Thomas',
  'ranking': 4,
  'title': '"Raindrops Keep Fallin\' on My Head"',
  'url': '/wiki/B.J._Thomas'}]
```
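
As a minimal sketch of the table-parsing step, the cell below runs BeautifulSoup on a tiny hand-written HTML table instead of the live Wikipedia page. The markup, the song and artist names, and the variable `demo_songs` are made up for illustration; the real page may lay out its columns differently.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the Wikipedia table (not the real page markup)
html = """
<table class="wikitable sortable">
<tr><th>No.</th><th>Title</th><th>Artist(s)</th></tr>
<tr><td>1</td><td>"<a href="/wiki/Song_A" title="Song A">Song A</a>"</td><td><a href="/wiki/Artist_A">Artist A</a></td></tr>
<tr><td>2</td><td>"<a href="/wiki/Song_B" title="Song B">Song B</a>"</td><td><a href="/wiki/Artist_B">Artist B</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# class_ matches any table whose class list contains "wikitable"
table = soup.find("table", class_="wikitable")

demo_songs = []
# Skip the header row, then read one dictionary per data row
for row in table.find_all("tr")[1:]:
    cells = row.find_all("td")
    demo_songs.append({
        "ranking": int(cells[0].text),
        "title": cells[1].text,
        "band_singer": cells[2].text,
        "url": cells[2].find("a").get("href"),
    })

print(demo_songs)
```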

In [15]:
import requests
from bs4 import BeautifulSoup
from IPython.display import IFrame, HTML
import time

In [16]:
# Get Wiki page for 1970's top 100
req = requests.get("http://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_1970")
page = req.text
# Use Beautiful Soup to convert the HTML
soup = BeautifulSoup(page, 'html.parser')
# The classes of all tables that have a class attribute set on them
[t["class"] for t in soup.find_all("table") if t.get("class")]
# Find the table with class-types 'sortable' and 'wikitable'
dfinder = lambda tag: tag.name=='table' and tag.get('class') == ['wikitable', 'sortable']
table_songs = soup.find_all(dfinder)

In [17]:
# Extract rows from table_songs
rows = [row for row in table_songs[0].find_all("tr")]

# Insert table data into dictionary
songs = []
for row in rows[1:]:
    entries = row.find_all("td")
    if entries[1].find("a"): 
        songEntry = {'band_singer': entries[2].text, 'ranking': int(entries[0].text), 
                 'title': ('"' + entries[1].find("a").get("title") + '"'), 'url': entries[2].find("a").get("href")}
    # if song doesn't have URL
    else:
        songEntry = {'band_singer': entries[2].text, 'ranking': int(entries[0].text), 
                 'title': ("'" + entries[1].text + "'"), 'url': entries[2].find("a").get("href")}
    songs.append(songEntry)
songs

[{'band_singer': 'Simon & Garfunkel',
  'ranking': 1,
  'title': '"Bridge over Troubled Water (song)"',
  'url': '/wiki/Simon_%26_Garfunkel'},
 {'band_singer': 'The Carpenters',
  'ranking': 2,
  'title': '"(They Long to Be) Close to You"',
  'url': '/wiki/The_Carpenters'},
 {'band_singer': 'The Guess Who',
  'ranking': 3,
  'title': '"American Woman"',
  'url': '/wiki/The_Guess_Who'},
 {'band_singer': 'B.J. Thomas',
  'ranking': 4,
  'title': '"Raindrops Keep Fallin\' on My Head"',
  'url': '/wiki/B.J._Thomas'},
 {'band_singer': 'Edwin Starr',
  'ranking': 5,
  'title': '"War (The Temptations song)"',
  'url': '/wiki/Edwin_Starr'},
 {'band_singer': 'Diana Ross',
  'ranking': 6,
  'title': '"Ain\'t No Mountain High Enough"',
  'url': '/wiki/Diana_Ross'},
 {'band_singer': 'The Jackson 5',
  'ranking': 7,
  'title': '"I\'ll Be There (The Jackson 5 song)"',
  'url': '/wiki/The_Jackson_5'},
 {'band_singer': 'Rare Earth',
  'ranking': 8,
  'title': '"Get Ready (The Temptations song)"',
  'u

#### 1.2 Generalize the previous: scrape Wikipedia from 1992 to 2014

By visiting the urls similar to the ones for 1970, we can obtain the billboard top 100 for the years 1992 to 2014. (We choose these later years rather than 1970 as you might find music from this era more interesting.) Download these using Python's `requests` module and store the text from those requests in a dictionary called `yearstext`. This dictionary ought to have as its keys the years (as integers from 1992 to 2014), and as values corresponding to these keys the text of the page being fetched.

You ought to sleep a second (look up `time.sleep` in Python) at the very least in-between fetching each web page: you do not want Wikipedia to think you are a marauding bot attempting to mount a denial-of-service attack.

*HINT: you might find `range` and string interpolation useful to construct the URLs.*
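
One way to read the hint is sketched below: build the URLs with `range` and `str.format`, then fetch them with a pause between requests. The helper `fetch_all` and its `fetch` and `delay` parameters are hypothetical names introduced here so the loop can be tried without touching the network; in the homework `fetch` would wrap `requests.get`.

```python
import time

# URL pattern taken from the 1970 page above, with the year interpolated
base = "http://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_{}"
urls = {year: base.format(year) for year in range(1992, 2015)}

def fetch_all(urls, fetch, delay=1.0):
    """Call fetch(url) for every entry, sleeping `delay` seconds after each call."""
    pages = {}
    for key, url in urls.items():
        pages[key] = fetch(url)
        time.sleep(delay)  # be polite to the server
    return pages

# Dry run with a fake fetcher and no delay; a real run might use
# fetch=lambda u: requests.get(u).text with the default delay
demo_pages = fetch_all(urls, fetch=lambda u: "page for " + u, delay=0)
print(len(demo_pages), urls[1992])
```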


In [18]:
years = range(1992, 2015)
# Map each year to the text of its Wikipedia page
yearstext = {}
for year in years:
    req = requests.get("http://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_" + str(year))
    yearstext[year] = req.text
    # Pause between requests so we do not hammer Wikipedia
    time.sleep(1)

#### 1.3 Parse and Clean data 

Remember the code you wrote to get data from 1970, which produces a list of dictionaries, one corresponding to each single.  Now write a function `parse_year(the_year, yeartext_dict)` which takes the year, prints it out, gets the text for the year from the just-created `yearstext` dictionary, and returns a list of dictionaries for that year, with one dictionary for each single. Store this list in the variable `yearinfo`.

The dictionaries **must** be of this form:

```
{'band_singer': ['Brandy', 'Monica'],
  'ranking': 2,
  'song': ['The Boy Is Mine'],
  'songurl': ['/wiki/The_Boy_Is_Mine_(song)'],
  'titletext': '" The Boy Is Mine "',
  'url': ['/wiki/Brandy_Norwood', '/wiki/Monica_(entertainer)']}
```

The spec of this function is provided below:

In [19]:
"""
Function
--------
parse_year

Inputs
------
the_year: the year you want the singles for
yeartext_dict: a dictionary with keys as integer years and values the downloaded web pages 
    from wikipedia for that year.
   
Returns
-------

a list of dictionaries, each of which corresponds to a single and has the
following data:

Eg:

{'band_singer': ['Brandy', 'Monica'],
  'ranking': 2,
  'song': ['The Boy Is Mine'],
  'songurl': ['/wiki/The_Boy_Is_Mine_(song)'],
  'titletext': '" The Boy Is Mine "',
  'url': ['/wiki/Brandy_Norwood', '/wiki/Monica_(entertainer)']}
  
A dictionary with the following data:
    band_singer: a list of bands or singers on this single
    song: a list of the titles of songs on this single
    songurl: a list of the same size as song which has urls for the songs on the single 
        (see the helpful notes below)
    ranking: ranking of the single
    titletext: the contents of the table cell
    url: a list of wikipedia singer/band urls on this single: only put in the part 
        of the url from /wiki onwards
    

Notes
-----
See description and example above.
"""
def parse_year(the_year, yeartext_dict):
    page =  yeartext_dict[the_year]
    soup = BeautifulSoup(page, 'html.parser')
    # Find the table with class-types 'sortable' and 'wikitable'
    table_songs = soup.find_all(dfinder)
    # Extract rows from table_songs
    rows = [row for row in table_songs[0].find_all("tr")]
    # Insert table data into dictionary
    songs = []
    counter = 0
    for row in rows[1:]:
        counter += 1
        entries = row.find_all("td")
        # if song has URL
        if entries[0].find("a"):
            anchors = entries[0].find_all("a")
            songurl = [a.get("href") for a in anchors]
            song = [a.text for a in anchors]
            # Join the quoted link titles of all songs on the single
            titletext = " / ".join('"' + a.get("title") + '"' for a in anchors)
        else: 
            songurl = [None]
            song = [entries[0].text.replace('"', '')]
            # titletext is a string (the cell contents), not a list
            titletext = entries[0].text
        # if artist has URL     
        if entries[1].find("a"):
            url = list(map(lambda x: x.get("href"), entries[1].find_all("a")))
            band_singer = list(map(lambda x: x.text, entries[1].find_all("a")))
        else: 
            url = [None]
            band_singer = [entries[1].text]
        songEntry = {'band_singer': band_singer,'ranking': counter,'song': song,'songurl': songurl,'titletext': titletext,'url': url}
        songs.append(songEntry)
    return songs
    
yearinfo = []

for year in years:
    info = parse_year(year, yearstext)
    yearinfo.append(info)
    


#### Helpful notes

Notice that some singles might have multiple songs:

```
{'band_singer': ['Jewel'],
  'ranking': 2,
  'song': ['Foolish Games', 'You Were Meant for Me'],
  'songurl': ['/wiki/Foolish_Games',
   '/wiki/You_Were_Meant_for_Me_(Jewel_song)'],
  'titletext': '" Foolish Games " / " You Were Meant for Me "',
  'url': ['/wiki/Jewel_(singer)']}
```

And some singles don't have a song URL:

```
{'band_singer': [u'Nu Flavor'],
  'ranking': 91,
  'song': [u'Heaven'],
  'songurl': [None],
  'titletext': u'"Heaven"',
  'url': [u'/wiki/Nu_Flavor']}
```

Thus there are some issues this function must handle:

1. There can be more than one  `band_singer` as can be seen above (sometimes with a comma, sometimes with "featuring" in between). The best way to parse these is to look for the urls.
2. There can be two songs in a single, because of the way the industry works: there are two-sided singles. See https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_1997 for an example. You can find other examples in 1998 and 1999.
3. The `titletext` is the contents of the table cell, and retains the quotes that Wikipedia puts on the single.
4. If no song anchor is found (see the 24th song in the above url), assume there is one song in the single, set `songurl` to `[None]` and the song name to the contents of the table cell with the quotes stripped (i.e. `song` is a one-element list containing the `titletext` stripped of its quotes).
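
The sketch below illustrates points 2 to 4 on two hand-written table cells, one two-sided single and one title without a link. The helper name `parse_title_cell` and the cell markup are assumptions made for this example, not the markup of the real page.

```python
from bs4 import BeautifulSoup

def parse_title_cell(cell):
    """Return song, songurl and titletext for one title cell (illustrative helper)."""
    anchors = cell.find_all("a")
    titletext = cell.text  # cell contents, quotes included
    if anchors:
        # One entry per linked song; two-sided singles have two anchors
        song = [a.text for a in anchors]
        songurl = [a.get("href") for a in anchors]
    else:
        # No link: a single song named by the cell text without its quotes
        song = [titletext.replace('"', '').strip()]
        songurl = [None]
    return {"song": song, "songurl": songurl, "titletext": titletext}

two_sided = BeautifulSoup(
    '<td>"<a href="/wiki/Foolish_Games">Foolish Games</a>" / '
    '"<a href="/wiki/You_Were_Meant_for_Me_(Jewel_song)">You Were Meant for Me</a>"</td>',
    "html.parser").td
no_link = BeautifulSoup('<td>"Heaven"</td>', "html.parser").td

print(parse_title_cell(two_sided))
print(parse_title_cell(no_link))
```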

As a check, we can do this for 1997. We'll print the first 5 outputs: `parse_year(1997, yearstext)[:5]`

This should give the following. Notice that the year 1997 exercises the edge cases we talked about earlier.

```
[{'band_singer': ['Elton John'],
  'ranking': 1,
  'song': ['Something About the Way You Look Tonight',
   'Candle in the Wind 1997'],
  'songurl': ['/wiki/Something_About_the_Way_You_Look_Tonight',
   '/wiki/Candle_in_the_Wind_1997'],
  'titletext': '" Something About the Way You Look Tonight " / " Candle in the Wind 1997 "',
  'url': ['/wiki/Elton_John']},
 {'band_singer': ['Jewel'],
  'ranking': 2,
  'song': ['Foolish Games', 'You Were Meant for Me'],
  'songurl': ['/wiki/Foolish_Games',
   '/wiki/You_Were_Meant_for_Me_(Jewel_song)'],
  'titletext': '" Foolish Games " / " You Were Meant for Me "',
  'url': ['/wiki/Jewel_(singer)']},
 {'band_singer': ['Puff Daddy', 'Faith Evans', '112'],
  'ranking': 3,
  'song': ["I'll Be Missing You"],
  'songurl': ['/wiki/I%27ll_Be_Missing_You'],
  'titletext': '" I\'ll Be Missing You "',
  'url': ['/wiki/Sean_Combs', '/wiki/Faith_Evans', '/wiki/112_(band)']},
 {'band_singer': ['Toni Braxton'],
  'ranking': 4,
  'song': ['Un-Break My Heart'],
  'songurl': ['/wiki/Un-Break_My_Heart'],
  'titletext': '" Un-Break My Heart "',
  'url': ['/wiki/Toni_Braxton']},
 {'band_singer': ['Puff Daddy', 'Mase'],
  'ranking': 5,
  'song': ["Can't Nobody Hold Me Down"],
  'songurl': ['/wiki/Can%27t_Nobody_Hold_Me_Down'],
  'titletext': '" Can\'t Nobody Hold Me Down "',
  'url': ['/wiki/Sean_Combs', '/wiki/Mase']}]
```

In [20]:
parse_year(1997, yearstext)[:5]

[{'band_singer': ['Elton John'],
  'ranking': 1,
  'song': ['Something About the Way You Look Tonight',
   'Candle in the Wind 1997'],
  'songurl': ['/wiki/Something_About_the_Way_You_Look_Tonight',
   '/wiki/Candle_in_the_Wind_1997'],
  'titletext': '"Something About the Way You Look Tonight" / "Candle in the Wind 1997"',
  'url': ['/wiki/Elton_John']},
 {'band_singer': ['Jewel'],
  'ranking': 2,
  'song': ['Foolish Games', 'You Were Meant for Me'],
  'songurl': ['/wiki/Foolish_Games',
   '/wiki/You_Were_Meant_for_Me_(Jewel_song)'],
  'titletext': '"Foolish Games" / "You Were Meant for Me (Jewel song)"',
  'url': ['/wiki/Jewel_(singer)']},
 {'band_singer': ['Puff Daddy', 'Faith Evans', '112'],
  'ranking': 3,
  'song': ["I'll Be Missing You"],
  'songurl': ['/wiki/I%27ll_Be_Missing_You'],
  'titletext': '"I\'ll Be Missing You"',
  'url': ['/wiki/Sean_Combs', '/wiki/Faith_Evans', '/wiki/112_(band)']},
 {'band_singer': ['Toni Braxton'],
  'ranking': 4,
  'song': ['Un-Break My Heart'],
 

#### Save a json file of information from the scraped files

We do not want to lose all this work, so let's save the last data structure we created to disk. That way if you need to re-run from here, you don't need to redo all these requests and parsing. 

DO NOT RERUN THE HTTP REQUESTS TO WIKIPEDIA WHEN SUBMITTING.

*We **DO NOT** need to see these JSON files in your submission!*

In [21]:
import json

In [22]:
# DO NOT RERUN THIS CELL WHEN SUBMITTING
fd = open("yearinfo.json","w")
json.dump(yearinfo, fd)
fd.close()
del yearinfo

Now let's reload our JSON file into the yearinfo variable, just to be sure everything is working.

In [23]:
# RERUN WHEN SUBMITTING
# Another way to deal with files. Has the advantage of closing the file for you.
with open("yearinfo.json", "r") as fd:
    yearinfo = json.load(fd)
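
A small round trip shows what survives the save and reload. The sketch below writes a toy structure to a temporary file (the file name `demo_yearinfo.json` is made up) and reads it back; note that JSON object keys always come back as strings, so a dictionary keyed by integer years needs its keys converted after loading.

```python
import json
import os
import tempfile

# Toy structure keyed by an integer year
demo = {1997: [{"ranking": 1, "song": ["Candle in the Wind 1997"]}]}

path = os.path.join(tempfile.mkdtemp(), "demo_yearinfo.json")
with open(path, "w") as fd:
    json.dump(demo, fd)
with open(path) as fd:
    reloaded = json.load(fd)

# The integer key 1997 was written as the string "1997"
print(list(reloaded.keys()))

# Convert the keys back to integers
restored = {int(year): singles for year, singles in reloaded.items()}
print(restored == demo)
```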

#### 1.4 Construct a year-song-singer dataframe from the yearly information

Let's construct a dataframe `flatframe` from the `yearinfo`. The frame should be similar to the frame below.  Each row of the frame represents a song, and carries with it the chief properties of year, song, singer, and ranking.

![](https://raw.githubusercontent.com/cs109/a-2017/master/hwassets/images/HW1SC1.png)

To construct the dataframe, we'll need to iterate over the years and the singles per year. Notice how, above, the dataframe is ordered by ranking and then year. While the exact order is up to you, note that you will have to come up with a scheme to order the information.

Check that the dataframe has sensible data types. You will also likely find that the year field has become an "object" (Pandas treats strings as generic objects): this is due to the conversion to and back from JSON. Such conversions need special care. Fix any data type issues with `flatframe`. (See Pandas [astype](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.astype.html) function.) 
We will use this `flatframe` in the next question. 

(As an aside, we used the name `flatframe` to indicate that this dataframe is flattened from a hierarchical dictionary structure with the keys being the years.)
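
As a rough sketch of the flattening and the data type fix, the cell below builds a small frame from a toy dictionary keyed by string years (as it would be after a JSON round trip) and converts the `year` column with `astype`. The names `demo_yearinfo` and `demo_frame` are illustrative, and the layout of your own `yearinfo` may differ.

```python
import pandas as pd

# Two singles from the 1997 example above, keyed by a string year
demo_yearinfo = {
    "1997": [
        {"ranking": 1, "band_singer": ["Elton John"],
         "song": ["Something About the Way You Look Tonight", "Candle in the Wind 1997"],
         "url": ["/wiki/Elton_John"]},
        {"ranking": 5, "band_singer": ["Puff Daddy", "Mase"],
         "song": ["Can't Nobody Hold Me Down"],
         "url": ["/wiki/Sean_Combs", "/wiki/Mase"]},
    ]
}

rows = []
for year, singles in demo_yearinfo.items():
    for single in singles:
        # One row per singer on the single, paired with that singer's url
        for singer, url in zip(single["band_singer"], single["url"]):
            rows.append({"year": year, "band_singer": singer,
                         "ranking": single["ranking"],
                         "song": single["song"][0], "url": url})

demo_frame = pd.DataFrame(rows)
# The year column holds strings here; convert it to integers
demo_frame["year"] = demo_frame["year"].astype(int)
print(demo_frame.dtypes)
```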

In [32]:
# Rebuild yearinfo as a list of list of dictionaries that can be easily read into a data frame
new_yearinfo = []
# loop through every singer, for every song, of every year
for year in range(len(yearinfo)): 
    yearEntry = []
    for rank in range(len(yearinfo[year])):
        counter = 0
        for singer in yearinfo[year][rank]['band_singer']:
            # rebuild dictionary entry in usable format for dataframe
            date = year + 1992
            ranking = rank + 1
            song = yearinfo[year][rank]['song'][0]
            url = yearinfo[year][rank]['url'][counter]
            counter += 1
            entry = {'year': date, 'band_singer': singer, 'song': song, 'url': url, 'ranking' : ranking}
            yearEntry.append(entry)
    new_yearinfo.append(yearEntry)

# build one dataframe per year and concatenate them
# (DataFrame.append was removed in pandas 2.0, so we use pd.concat)
flatframe = pd.concat([pd.DataFrame(entries) for entries in new_yearinfo], ignore_index=True)
    
# sort by ranking and year, and re-index and change column order
flatframe = flatframe.sort_values(by=['ranking', 'year']).reset_index(drop=True)[
    ['year', 'band_singer', 'ranking', 'song', 'url']]

### Who are the highest quality singers?

Here we show the highest quality singers and plot them on a bar chart.

#### 1.5 Find highest quality singers according to how prolific they are

What do we mean by highest quality? This is of course open to interpretation, but let's define "highest quality" here as the number of times a singer appears in the top 100 over this time period. If a singer appears twice in a year (for different songs), this is counted as two appearances, not one. 

Make a bar plot of the most prolific singers. Singers on this chart should have appeared more than 15 times. (HINT: look at the docs for the pandas method `value_counts`.)
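
To see what `value_counts` returns before applying it to `flatframe`, here is a toy example. The series `appearances` and the threshold of 1 are made up for illustration; the homework uses the `band_singer` column and a threshold of 15.

```python
import pandas as pd

# Each entry stands for one appearance of an artist in a year-end list
appearances = pd.Series(["A", "B", "A", "C", "A", "B"])

# value_counts returns counts indexed by value, sorted in descending order
counts = appearances.value_counts()

# Boolean indexing keeps only artists above the threshold
frequent = counts[counts > 1]
print(frequent)
```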


In [30]:
singers = flatframe.band_singer.value_counts()
best_singers = singers[singers > 15]

sns.reset_defaults()
with sns.plotting_context("poster"):
    ax = plt.gca()
    best_singers.plot(kind="barh")
    plt.grid(axis = 'x', color ='white', linestyle='-')
    ax.tick_params(axis='both', which='both',length=0)
    sns.despine(left=True, bottom=True)
    plt.title("Top Singers from 1992 to 2014")
    plt.ylabel("Artist")
    plt.xlabel("# of Top 100 Hits")

#### 1.6 What if we used a different metric?

What we would like to capture is this: a singer should be scored higher if the singer appears higher in the rankings. So we'd say that a singer who appeared once at a higher and once at a lower ranking is a "higher quality" singer than one who appeared twice at a lower ranking. 

To do this, group all of a singer's songs together and assign each song a score `101 - ranking`. Order the singers by their total score and make a bar chart for the top 20.
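
A short worked example of the score, using two made-up artists: "A" appears once high (rank 5) and once low (rank 90), while "B" appears twice low (ranks 90 and 95). The frame `demo` and the column name `score` are illustrative.

```python
import pandas as pd

demo = pd.DataFrame({
    "band_singer": ["A", "A", "B", "B"],
    "ranking": [5, 90, 90, 95],
})

# Rank 1 is worth 100 points, rank 100 is worth 1 point
demo["score"] = 101 - demo["ranking"]

# Total score per artist, best first
totals = demo.groupby("band_singer")["score"].sum().sort_values(ascending=False)
print(totals)
```

Here A scores 96 + 11 = 107 and B scores 11 + 6 = 17, so A ranks above B even though both appear twice.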


In [26]:
flatframe['rankings101'] = 101 - flatframe['ranking']

# Sum only the score column, so that text columns are not aggregated
top20 = flatframe.groupby('band_singer')['rankings101'].sum().sort_values(ascending=False)[:20]

sns.reset_defaults()

with sns.plotting_context("poster"):
    ax = plt.gca()
    top20.plot(kind="barh")
    plt.grid(axis = 'x', color ='white', linestyle='-')
    ax.tick_params(axis='both', which='both',length=0)
    sns.despine(left=True, bottom=True)
    plt.title("Top 20 Singers from 1992 to 2014")
    plt.ylabel("Artist")
    plt.xlabel("Ranking Score")

#### 1.7 Do you notice any major differences when you change the metric?

How have the singers at the top shifted places? Why do you think this happens?

Several artists shift position when the metric changes. This happens because some artists release many songs that reach the top 100 without those songs ranking near the top for their year. Some artists, like Rihanna, who is at the top of both charts, release many songs that rank near the top. On the other hand, R. Kelly and T.I. place many songs on the top 100 list, but not all of them rank highly. Finally, artists like Bruno Mars have a few very highly ranked songs, so they rank well under the second metric, but they do not have enough songs to rank well under the first. 

## Q2. Scraping and Constructing: Information about Artists, Bands and Genres from Wikipedia

Our next job is to use those band/singer urls we collected under `flatframe.url` and get information about singers and/or bands.

### Scrape information about artists from wikipedia

We wish to fetch information about the singers or groups for all the winning songs in a list of years.

Here we show a function that fetches information about a singer or group from their url on Wikipedia. We create a cache object `urlcache` that will avoid redundant HTTP requests (e.g. an artist might have multiple singles in a single year, or be on the list over a span of years). Once we have fetched information about an artist, we don't need to do it again. The caching also helps if the network goes down, or the target website is having some problems. You simply need to run the `get_page` function below again, and the `urlcache` dictionary will continue to be filled.

If the request gets an HTTP return code different from 200 (such as a 404 Not Found or 500 Internal Server Error), the cache entry for that URL will have a value of 1; and if the request completely fails (e.g. no network connection), the entry will have a value of 2. This will allow you to analyse the failed requests.

Notice that we have wrapped the call in what's called _an exception block_. We try to make the request. If it fails entirely, we set the status to 2; if it returns an HTTP code that's not 200, we set the status to 1.

In [9]:
urlcache={}


In [10]:
def get_page(url):
    # Check if URL has already been visited.
    if (url not in urlcache) or (urlcache[url]==1) or (urlcache[url]==2):
        time.sleep(1)
        # try/except blocks are used whenever the code could generate an exception (e.g. division by zero).
        # In this case we don't know if the page really exists, or even if it does, if we'll be able to reach it.
        try:
            r = requests.get("http://en.wikipedia.org%s" % url)

            if r.status_code == 200:
                urlcache[url] = r.text
            else:
                urlcache[url] = 1
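        # Catch only request failures, so that Ctrl-C (KeyboardInterrupt) still stops the loop
        except requests.exceptions.RequestException: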
            urlcache[url] = 2
    return urlcache[url]


We sort the `flatframe` by year, ascending, first. Think why.

In [13]:
flatframe=flatframe.sort_values('year')
flatframe.head()

| | year | band_singer | ranking | song | url | rankings101 |
|---|---|---|---|---|---|---|
| 0 | 1992 | Boyz II Men | 1 | End of the Road | /wiki/Boyz_II_Men | 100 |
| 1879 | 1992 | Karyn White | 63 | The Way I Feel About You | /wiki/Karyn_White | 38 |
| 1849 | 1992 | Genesis | 62 | Hold on My Heart | /wiki/Genesis_(band) | 39 |
| 1824 | 1992 | CeCe Peniston | 61 | Keep on Walkin' | /wiki/CeCe_Peniston | 40 |
| 187 | 1992 | En Vogue | 7 | My Lovin' (You're Never Gonna Get It) | /wiki/En_Vogue | 94 |


#### Pulling and saving the data

In [None]:
# DO NOT RERUN THIS CELL WHEN SUBMITTING
# Here we are populating the url cache
# subsequent calls to this cell should be very fast, since Python won't
# need to fetch the page from the web server.
# NOTE this function will take quite some time to run (about 30 mins for me), since we sleep 1 second before
# making a request. If you run it again it will be almost instantaneous, save requests that might have failed
# (you will need to run it again if requests fail..see cell below for how to test this)
flatframe["url"].apply(get_page)

You may have to run this function again and again, in case there were network problems. Note that, because there is a "global" cache, it will take less time each time you run it. Also note that this function is designed to be run again and again: it attempts to make sure that there are no unresolved pages remaining. Let us make sure of this: *the sum below should be 0, and the boolean True.*

In [None]:
# DO NOT RERUN THIS CELL WHEN SUBMITTING
print("Number of bad requests:",np.sum([(urlcache[k]==1) or (urlcache[k]==2) for k in urlcache])) # no 1's or 2's
print("Did we get all urls?", len(flatframe.url.unique())==len(urlcache)) # we got all of the urls
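
If the count of bad requests is not zero, it can help to see which URLs failed and how. The sketch below does this on a toy cache; `demo_cache` and its entries are made up, and with the real data you would iterate over `urlcache` instead.

```python
# Toy cache: page text on success, 1 for a non-200 status, 2 for a failed request
demo_cache = {
    "/wiki/Artist_A": "<html>...</html>",
    "/wiki/Artist_B": 1,
    "/wiki/Artist_C": 2,
}

# Keep only the entries whose value is one of the two failure codes
failed = {url: status for url, status in demo_cache.items() if status in (1, 2)}

for url, status in failed.items():
    reason = "bad HTTP status" if status == 1 else "request failed"
    print(url, reason)
```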

Let's save the `urlcache` to disk, just in case we need it again.

In [None]:
# DO NOT RERUN THIS CELL WHEN SUBMITTING
with open("artistinfo.json","w") as fd:
    json.dump(urlcache, fd)
del urlcache

In [33]:
# RERUN WHEN SUBMITTING
with open("artistinfo.json") as json_file:
    urlcache = json.load(json_file)

#### 2.1 Extract information about singers and bands

From each page we collected about a singer or a band, extract the following information:

1. If the page has the text "Born" in the sidebar on the right, extract the element with the class `.bday`. If the page doesn't contain "Born", store `False`.  Store either of these into the variable `born`. We want to analyze the artist's age.

2. If the text "Years active" is found, but no "born", assume a band. Store into the variable `ya` the value of the next table cell corresponding to this, or `False` if the text is not found.

Put this all into a function `singer_band_info` which takes the singer/band url as argument and returns a dictionary `dict(url=url, born=born, ya=ya)`.
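
A minimal sketch of the two lookups, run on hand-written infobox fragments rather than real Wikipedia pages. The helper `extract_born_ya`, the markup, and the "Years active" value are assumptions for illustration; real infoboxes carry more classes and rows and may need extra care.

```python
from bs4 import BeautifulSoup

# Hypothetical infobox fragments for a singer and for a band
singer_html = ('<table class="infobox"><tr><th>Born</th>'
               '<td><span class="bday">1976-11-12</span></td></tr></table>')
band_html = ('<table class="infobox"><tr><th>Years active</th>'
             '<td>1985-present</td></tr></table>')

def extract_born_ya(html):
    soup = BeautifulSoup(html, "html.parser")
    # Singers: the birthday sits in an element with class "bday"
    bday = soup.find(class_="bday")
    born = bday.text if bday else False
    ya = False
    if not born:
        # Bands: take the cell next to the "Years active" header
        for row in soup.find_all("tr"):
            header = row.find("th")
            cell = row.find("td")
            if header and cell and "Years active" in header.text:
                ya = cell.text.strip()
                break
    return {"born": born, "ya": ya}

print(extract_born_ya(singer_html))
print(extract_born_ya(band_html))
```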

The information can be found on the sidebar on each such wikipedia page, as the example here shows:

![sandg](https://raw.githubusercontent.com/cs109/a-2017/master/hwassets/images/sandg.png).



Write the function `singer_band_info` according to the following specification:

In [56]:
"""
Function
--------
singer_band_info

Inputs
------
url: the url
page_text: the text associated with the url
   
Returns
-------
A dictionary with the following data:
    url: copy the input argument url into this value
    born: the artist's birthday
    ya: years active variable

Notes
-----
See description above. Also note that some of the genres urls might require a 
bit of care and special handling.
"""

def singer_band_info(url, page_text):
    # Default to False for both fields, as the specification asks
    born = False
    ya = False
    # Failed fetches are stored in the cache as 1 (bad status) or 2 (no response)
    if page_text in (1, 2):
        return {'url': url, 'born': born, 'ya': ya}
    soup = BeautifulSoup(page_text, 'html.parser')
    bday = soup.find_all(lambda tag: tag.name=='span' and tag.get('class') == ['bday'])
    if bday:
        born = bday[0].text
    else: 
        # No birthday found: look for "Years active" in the infobox and assume a band
        table = soup.find_all(lambda tag: tag.name=='table' and tag.get('class') == ['infobox', 'vcard', 'plainlist'])
        if table: 
            yearsActive = [row.find('td').text for row in table[0].find_all(lambda tag: tag.name=='tr' and 'Years active' in tag.text)]
            if yearsActive:
                ya = yearsActive[0]
    # Always return a dictionary so every url yields a row in the dataframe
    return {'url': url, 'born': born, 'ya': ya}


#### 2.2  Merging this information in

Iterate over the items in the singer-group dictionary cache `urlcache`, run the above function, and create a dataframe from there with columns `url`, `born`, and `ya`. Merge this dataframe on the `url` key with `flatframe`, creating a rather wide dataframe that we shall call `largedf`. It should look something like this:

![](https://raw.githubusercontent.com/cs109/a-2017/master/hwassets/images/HW1SC3.png)

Notice how the `born` and `ya` and `url` are repeated every time a different song from a given band is represented in a row.
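
The merge itself can be sketched on two toy frames. The frames `songs` and `artists` and all their values are made up; in the homework the left frame is `flatframe` and the right one is built from `singer_band_info`. A left merge is one reasonable choice here, since it keeps every song row even when no artist information was found.

```python
import pandas as pd

# One row per (song, artist), as in flatframe
songs = pd.DataFrame({
    "band_singer": ["Artist A", "Artist A", "Band B"],
    "song": ["Song 1", "Song 2", "Song 3"],
    "url": ["/wiki/Artist_A", "/wiki/Artist_A", "/wiki/Band_B"],
})

# One row per artist url, as produced from the cache
artists = pd.DataFrame({
    "url": ["/wiki/Artist_A", "/wiki/Band_B"],
    "born": ["1980-05-01", False],
    "ya": [False, "1985-present"],
})

# Artist columns are repeated on every song row that shares the url
merged = songs.merge(artists, on="url", how="left")
print(merged)
```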

In [57]:
data = []
for key in urlcache.keys():
    data.append(singer_band_info (key, urlcache[key]))
    


In [66]:
# Build the artist dataframe in one step from the `data` list parsed above,
# rather than re-parsing every page and appending row by row.
# Entries for which singer_band_info returned None are skipped.
df = pd.DataFrame([d for d in data if d is not None])
df

In [62]:
# A single dictionary of scalar values must be wrapped in a list (or given an
# index) before pandas can build a one-row dataframe from it
pd.DataFrame([data[1]])

In [63]:
data[1]

{'born': '1976-11-12', 'url': '/wiki/Tevin_Campbell', 'ya': False}

In [49]:
BeautifulSoup(urlcache['/wiki/3T'], 'html.parser')

#### 2.3 What is the age at which singers achieve their top ranking?

Plot a histogram of the age at which singers achieve their top ranking. What conclusions can you draw from this distribution of ages?

*HINT: You will need to do some manipulation of the `born` column, and find the song for which a band or an artist achieves their top ranking. You will then need to put these rows together into another dataframe or array to make the plot.*
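
One possible shape for this calculation is sketched below on a toy frame. The frame `demo` and its values are made up, and the age is approximated as chart year minus birth year, which ignores the month and day of birth.

```python
import pandas as pd

demo = pd.DataFrame({
    "band_singer": ["A", "A", "B"],
    "year": [2000, 2005, 2010],
    "ranking": [10, 2, 7],
    "born": ["1980-05-01", "1980-05-01", "1990-01-15"],
})

# For each artist, pick the row with the best (numerically lowest) ranking
best = demo.loc[demo.groupby("band_singer")["ranking"].idxmin()].copy()

# Approximate age at that point: chart year minus birth year
best["age"] = best["year"] - pd.to_datetime(best["born"]).dt.year
print(best[["band_singer", "year", "ranking", "age"]])
```

The `age` column can then be passed to a histogram, for example with `best["age"].plot(kind="hist")`.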

#### 2.4 At what year since inception do bands reach their top rankings?

Make a similar calculation to plot a histogram of the years since inception at which bands reach their top ranking. What conclusions can you draw?
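
The `ya` values are free text, so one simple approach is to take the first four-digit number as the year of inception. The helper `first_year` and the sample strings below are illustrative assumptions; real "Years active" cells may need more cleaning.

```python
import re

def first_year(ya):
    """Return the first four-digit number in a 'Years active' string, or None."""
    match = re.search(r"\d{4}", ya)
    return int(match.group()) if match else None

print(first_year("1985-present"))
print(first_year("1991-2004, 2010"))

# Years since inception for a band whose top-ranked song charted in 1997
print(1997 - first_year("1991-2004, 2010"))
```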