<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Practice Using APIs

_Authors: Dave Yerrington (SF), Sam Stack (DC)_

---

In this lab, we'll practice using some popular APIs to retrieve and store data.

In [1]:
# Imports at the top.
import json
import re
import urllib

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests

%matplotlib inline

## Exercise 1: Have Fun in All 50 States

---

Go back to the Data USA site: [https://datausa.io/](https://datausa.io/)

Find the "Data" page, then click over to the GitHub page for the full API documentation.

### 1.A Get Population Data by State

Run a "query" from Python by passing an appropriately formatted URL. Your query should receive a "response" containing population data for all of the states. Sort this data so that you can determine which states are the most and least populated.

Note that the state name may not be immediately evident. It will likely be encoded with its census ID, which you may need to track down elsewhere on the interwebs.
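As a rough illustration of what an "appropriately formatted URL" looks like, the sketch below assembles a query string from keyword parameters using only the standard library. The parameter names (`show`, `sumlevel`, `required`, `year`) are the ones used in this notebook's population query; the helper name `build_query_url` is hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode

# Base endpoint as used in this notebook's population query.
BASE_URL = "http://api.datausa.io/api/"


def build_query_url(base_url, **params):
    """Return base_url with params appended as a URL-encoded query string."""
    return base_url + "?" + urlencode(params)


# Build (but do not send) the query for state-level population data.
population_url = build_query_url(
    BASE_URL, show="geo", sumlevel="state", required="pop", year="latest"
)
print(population_url)
```

In practice, passing a `params` dictionary to `requests.get` performs the same encoding for you.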

In [59]:
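# Attribute endpoint that maps a census geo ID to its metadata.
query = 'https://api.datausa.io/attrs/geo/{}'

def geo_lookup(geo):
    """Return the state name for a census geo ID such as '04000US25'."""
    state_info = requests.get(query.format(geo)).json()
    # Here the first field of the first data row is the state name.
    state = state_info['data'][0][0]
    return state

geo_lookup('04000US25')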

'massachusetts'

In [62]:
url = "http://api.datausa.io/api/?show=geo&sumlevel=state&required=pop&year=latest"
response = requests.get(url).json()
df = pd.DataFrame(response['data'], columns=response['headers']).sort_values('pop')
df['geo'] = df['geo'].apply(geo_lookup)
# Print the most and least populated rows separately so the output stays aligned.
print(df.tail(1))
print(df.head(1))

   year         geo       pop
4  2015  california  38421464
    year      geo     pop
51  2015  wyoming  579679
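The response used above separates column names (`headers`) from row values (`data`). As a minimal sketch of how that shape maps onto records without pandas, the example below zips the two together and picks out the extremes. The sample payload is an assumption that only echoes the two rows shown in the output, and the helper name `to_records` is hypothetical.

```python
# Sample payload in the headers/data shape; only the two rows shown in the output.
sample_response = {
    "headers": ["year", "geo", "pop"],
    "data": [
        [2015, "04000US06", 38421464],  # California
        [2015, "04000US56", 579679],    # Wyoming
    ],
}


def to_records(response):
    """Pair each data row with the header names to get a list of dicts."""
    return [dict(zip(response["headers"], row)) for row in response["data"]]


records = to_records(sample_response)
most_populated = max(records, key=lambda r: r["pop"])
least_populated = min(records, key=lambda r: r["pop"])
print(most_populated["geo"], least_populated["geo"])
```

With the full response, `pd.DataFrame(response['data'], columns=response['headers'])` does the same pairing and adds sorting on top.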


### 1.B Get Data on Your Favorite State

The Data USA site already has some great EDA stats and visuals that you can view by state.

Pick your favorite state and do some basic EDA on a feature/attribute of your choosing.  You will probably need to inspect the GitHub docs to figure out how to pull out the data you are looking for.

In [56]:
url = "http://api.datausa.io/api/"
params = {
    'show': 'geo',
    'sumlevel': 'all',
    'year': 'latest',
    'geo': '04000US25',  # Massachusetts
}
response = requests.get(url, params=params).json()
pd.DataFrame(response['data'], columns=response['headers'])

|   | num_records | avg_age | avg_wage | num_ppl | avg_age_moe | avg_wage_moe | num_ppl_moe | avg_age_ft | avg_age_pt | avg_wage_ft | ... | avg_hrs_ft | avg_hrs_pt | avg_hrs_moe | avg_hrs_ft_moe | avg_hrs_pt_moe | gini | gini_ft | gini_pt | year | geo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 33041 | 41.8878 | 58715.5 | 3342095 | 0.111717 | 639.855 | 22150.0 | 42.9792 | 38.5251 | 71519.4 | ... | 43.9143 | 21.3047 | 0.14254 | 0.110277 | 0.192216 | 0.492001 | 0.420307 | 0.564124 | 2015 | 04000US25 |


### 1.C Get Creative (Bonus)

As you were scrolling the docs or the Data USA website, did something jump out at you or catch your attention?  Go back and explore it now.  Get creative with your questions or with your Python skills.  Be curious!!

In [None]:
# A:

## Exercise 2: IMDb TV Shows

---

Sometimes an API doesn't provide all of the information we'd like and we need to get creative.

Here we'll use a combination of scraping and API calls to find the ratings and networks of famous television shows.

### 2.A Get the Top TV Shows

IMDb contains data about movies and TV shows. Unfortunately, it doesn't have a public API.

The page http://www.imdb.com/chart/toptv/?ref_=nv_tp_tv250_2 contains the list of the top 250 television shows of all time. Retrieve the page using the `requests` library, then parse the HTML to obtain a list of the `television_ids` for these shows. You can parse it with a regular expression or with a library like `BeautifulSoup`.

> **Hint:** television_ids look like this: `tt2582802`.
> _Everything after "/title/" and before "/?"_
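To make the hint concrete, here is a small sketch that pulls IDs out of a string with a regular expression and removes duplicates while keeping page order. The HTML fragment is a made-up stand-in (the real page markup may differ), and the helper name `extract_title_ids` is hypothetical.

```python
import re

# Hypothetical HTML fragment; the markup on the real page may differ.
sample_html = """
<td class="titleColumn"><a href="/title/tt2582802/?ref_=chttvtp_tt_1">Show A</a></td>
<td class="posterColumn"><a href="/title/tt2582802/?ref_=chttvtp_i_1"><img></a></td>
<td class="titleColumn"><a href="/title/tt0903747/?ref_=chttvtp_tt_2">Show B</a></td>
"""


def extract_title_ids(html):
    """Return unique IDs found between '/title/' and the next '/', in page order."""
    ids = re.findall(r"/title/(tt\d+)/", html)
    # dict.fromkeys drops duplicates while preserving first-seen order.
    return list(dict.fromkeys(ids))


print(extract_title_ids(sample_html))
```

Matching `tt\d+` instead of a fixed character count avoids assuming that every ID has the same length.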

In [63]:
from bs4 import BeautifulSoup

In [67]:
url = 'http://www.imdb.com/chart/toptv/?ref_=nv_tp_tv250_2'
response = requests.get(url)
html = response.text

In [84]:
soup = BeautifulSoup(html, 'lxml')
# The chart rows live inside the <tbody class="lister-list"> element.
lister = soup.find_all('tbody', {'class': 'lister-list'})

In [100]:
p = lister[0].find_all('a')

In [154]:
# Each link contains 'title/' followed by a 9-character ID, e.g. title/tt2582802.
exp = r'title/.{9}'
titles = []
for t in p:
    titles.append(re.findall(exp, str(t))[0])

In [164]:
# Drop the 6-character 'title/' prefix, then de-duplicate the IDs.
titles_new = [x[6:] for x in titles]
final = list(set(titles_new))
final

['tt2395695',
 'tt1486217',
 'tt0103359',
 'tt0187664',
 'tt4295140',
 'tt0088509',
 'tt2802850',
 'tt0994314',
 'tt1641384',
 'tt0804503',
 'tt0080297',
 'tt3358020',
 'tt0302199',
 'tt0460681',
 'tt0112084',
 'tt5712554',
 'tt0362192',
 'tt0388629',
 'tt0112159',
 'tt5516154',
 'tt0118266',
 'tt0075520',
 'tt0094517',
 'tt1883092',
 'tt0047708',
 'tt0080306',
 'tt1494191',
 'tt0161952',
 'tt2384811',
 'tt0367279',
 'tt0386676',
 'tt1734135',
 'tt0098936',
 'tt2433738',
 'tt0141842',
 'tt5288312',
 'tt0807832',
 'tt0092337',
 'tt5290382',
 'tt0071075',
 'tt0318871',
 'tt1492966',
 'tt0286486',
 'tt1695360',
 'tt1518542',
 'tt0436992',
 'tt1355642',
 'tt5189670',
 'tt2303687',
 'tt5555260',
 'tt1632701',
 'tt1910272',
 'tt0083466',
 'tt3428912',
 'tt0387199',
 'tt0903747',
 'tt0773262',
 'tt1298820',
 'tt0203082',
 'tt3671754',
 'tt0098833',
 'tt4742876',
 'tt0075572',
 'tt1474684',
 'tt0475784',
 'tt0384766',
 'tt0090509',
 'tt0098769',
 'tt0434706',
 'tt2707408',
 'tt1839578',
 'tt18

### 2.B Get Data on the Top TV Shows

Although IMDb doesn't have a public API, an open API exists at http://www.tvmaze.com/api.

Use this API to retrieve information about each of the 250 TV shows you extracted in the previous step.

- Check the documentation of TVmaze's API to learn how to request show data by ID.
- Define a function that returns a Python object with select information for a given ID:
    - Show name.
    - Rating (avg).
    - Genre(s).
    - Network name.
    - Premiere date.
    - Status.
- Store the gathered information in a Pandas DataFrame.

> Tip: The JSON object can easily be converted into a Python dictionary.

Because the target information is in a JSON format, you'll need `json.loads(res.text)` in order to gather it.
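As a minimal sketch of that conversion, the example below parses a JSON string into a dictionary and reads the nested fields defensively. The sample body is an assumption: it is trimmed to the fields this exercise asks for and shaped like the show records the solution reads, with `network` set to `null` to show why the fallback is needed. The helper name `extract_show_fields` is hypothetical.

```python
import json

# Hypothetical response body, trimmed to the fields this exercise needs.
sample_body = """
{"name": "Chef's Table", "genres": ["Food"], "status": "Running",
 "premiered": "2015-04-26", "rating": {"average": 8.8}, "network": null}
"""


def extract_show_fields(body):
    """Parse a JSON show record and return only the selected fields."""
    show = json.loads(body)
    # 'rating' and 'network' can be null, so fall back to an empty dict.
    rating = show.get("rating") or {}
    network = show.get("network") or {}
    return {
        "name": show.get("name"),
        "average_rating": rating.get("average"),
        "genre": show.get("genres"),
        "network": network.get("name"),
        "premiere_date": show.get("premiered"),
        "status": show.get("status"),
    }


fields = extract_show_fields(sample_body)
print(fields)
```

Parsing the body once and reusing the resulting dictionary avoids repeating `json.loads` for every field.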

In [181]:
res = requests.get('http://api.tvmaze.com/lookup/shows?imdb=tt0353049')

In [208]:
def list_of_movies(lis):
    """Look up each IMDb ID on TVmaze and return the selected fields as a DataFrame."""
    url = 'http://api.tvmaze.com/lookup/shows?imdb='
    shows = []
    for show in lis:
        res = requests.get(url + show)
        # Skip IDs that TVmaze cannot resolve.
        if res.status_code != 200:
            continue
        data = json.loads(res.text)  # parse the response body once
        # 'rating' and 'network' may be null, so fall back to an empty dict.
        rating = data.get('rating') or {}
        network = data.get('network') or {}
        shows.append({
            'name': data.get('name'),
            'average_rating': rating.get('average'),
            'genre': data.get('genres'),
            'network': network.get('name'),
            'premiere_date': data.get('premiered'),
            'status': data.get('status'),
        })
    return pd.DataFrame(shows)

In [209]:
df = list_of_movies(final)

In [210]:
df

|   | average_rating | genre | name | network | premiere_date | status |
|---|---|---|---|---|---|---|
| 0 | 9.3 | [] | Cosmos | FOX | 2014-03-09 | Running |
| 1 | 8.9 | [Comedy, Action, Adult] | Archer | FXX | 2009-09-17 | Running |
| 2 | 9.3 | [Action, Adventure, Science-Fiction] | Batman: The Animated Series | FOX | 1992-09-05 | Ended |
| 3 | 9.4 | [Comedy] | Spaced | Channel 4 | 1999-09-24 | Ended |
| 4 | 8.8 | [Food] | Chef's Table |  | 2015-04-26 | Running |
| 5 | 9.0 | [Drama, Crime] | Fargo | FX | 2014-04-15 | Running |
| 6 | 9.0 | [Drama, Action, Anime, Science-Fiction] | Code Geass | MBS | 2006-10-05 | Ended |
| 7 | 9.1 | [Action, Adventure, Science-Fiction] | Young Justice | Cartoon Network | 2010-11-26 | Running |
| 8 | 8.2 | [Drama] | Mad Men | AMC | 2007-07-19 | Ended |
| 9 | 9.5 | [Drama, Espionage] | Tinker Tailor Soldier Spy | BBC One | 1979-09-10 | Ended |
