
# Practice Using APIs


---

In this lab we will practice using APIs: we scrape show ids from IMDb and look them up through the TVmaze API.

In [14]:
import re
import requests
import json
from bs4 import BeautifulSoup
import pandas as pd

## IMDB TV Shows

---

Sometimes an API doesn't provide all the information we would like to get and we need to be creative.

Here we will use a combination of scraping and API calls to find the ratings and networks of famous television shows.

### Get the top TV Shows

The Internet Movie Database contains data about movies and TV shows. Unfortunately it does not have a public API.

The page http://www.imdb.com/chart/toptv/?ref_=nv_tp_tv250_2 contains the list of the top 250 TV shows of all time. Retrieve the page using the `requests` library and then parse the HTML to obtain a list of the `movie_ids` for these shows. You can parse it with a regular expression or with a library like `BeautifulSoup`.

> **Hint:** movie_ids look like this: `tt2582802`
> _Everything after "/title/" and before "/?"_
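
The hint can be tried offline before touching the real page. The snippet below is a minimal sketch: the `sample_hrefs` values, the `ID_PATTERN` name, and the `extract_id` helper are illustrative assumptions, not part of the lab.

```python
import re

# Hypothetical href values, shaped like the links described in the hint.
sample_hrefs = [
    "/title/tt2582802/?ref_=chttvtp_tt_1",
    "/title/tt0944947/?ref_=chttvtp_tt_2",
    "/chart/toptv/",
]

# "tt" followed by digits, sitting between "/title/" and the next "/".
ID_PATTERN = re.compile(r"/title/(tt\d+)/")

def extract_id(href):
    """Return the id embedded in href, or None if the link has none."""
    match = ID_PATTERN.search(href)
    return match.group(1) if match else None

ids = [extract_id(h) for h in sample_hrefs]
print(ids)
```

Requiring the `tt` prefix and digits is a design choice: it keeps links that merely contain `/title/` followed by something else out of the result.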

In [15]:
# regex version
def get_top_250():
    response = requests.get('http://www.imdb.com/chart/toptv/?ref_=nv_tp_tv250_2')
    html = response.text
    # non-greedy match: capture everything after "/title/" up to the next forward slash in each <a href> element
    entries = re.findall(r"<a href.*?/title/(.*?)/", html)
    # de-duplicate the ids, since an id can appear in more than one link (set() does not keep the chart order)
    return list(set(entries))

In [16]:
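# beautiful soup version
def get_top_250():
    response = requests.get('http://www.imdb.com/chart/toptv/?ref_=nv_tp_tv250_2')
    html = response.text
    soup = BeautifulSoup(html, 'lxml')
    link_list = []
    for a in soup.find_all('a'):
        try:
            # href looks like "/title/tt2582802/?..."; the id is the third "/"-separated piece
            if 'title/tt' in a['href']:
                link_list.append(a['href'].split('/')[2])
        except (KeyError, IndexError):
            # <a> tag without an href, or an href too short to hold an id
            pass
    # de-duplicate the ids (set() does not keep the chart order)
    return list(set(link_list))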
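
Both versions above return `list(set(...))`, which removes duplicates but also discards the order in which the ids appeared on the page. If the chart ranking matters, an order-preserving de-duplication is one alternative. This is a minimal sketch, assuming Python 3.7 or later (where dicts keep insertion order); the `dedupe_keep_order` helper and the `scraped` list are hypothetical.

```python
def dedupe_keep_order(items):
    """Remove duplicates while keeping the first occurrence of each item."""
    # dict keys are unique and remember insertion order (Python 3.7+)
    return list(dict.fromkeys(items))

# Hypothetical scrape result with repeated ids.
scraped = ["tt0903747", "tt0903747", "tt0944947", "tt0185133", "tt0944947"]
unique_ids = dedupe_keep_order(scraped)
print(unique_ids)
```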

In [17]:
entries = get_top_250()

In [18]:
len(entries)

250

In [19]:
entries[0]

'tt0185133'

### Get data on the top TV shows

Although the Internet Movie Database does not have a public API, an open API exists at http://www.tvmaze.com/api.

Use this API to retrieve information about each of the 250 TV shows you have extracted in the previous step.
1. Check the documentation of TVmaze's API to learn how to request show data by id.
2. Define a function that returns a Python object with select information for a given id:
    - Show name
    - Rating (avg)
    - Genre(s)
    - Network name
    - Premiere date
    - Status

    > Tip: the JSON object can easily be converted into a Python dictionary.

3. Store the gathered information in a pandas DataFrame.
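
For the last step, one common pattern is to collect one dictionary per show and build the DataFrame in a single call, which is generally cheaper than growing a DataFrame row by row. The sketch below uses hand-written sample records; the `records` and `sample_df` names are assumptions for illustration.

```python
import pandas as pd

# Hand-written sample records with the fields listed above.
records = [
    {"show_name": "Sherlock", "rating_avg": 9.2,
     "genres": ["Drama", "Crime", "Mystery"], "network": "BBC One",
     "premiere_date": "2010-07-25", "status": "Running"},
    {"show_name": "Community", "rating_avg": 8.5,
     "genres": ["Comedy"], "network": None,
     "premiere_date": "2009-09-17", "status": "Ended"},
]

columns = ["show_name", "rating_avg", "genres", "network", "premiere_date", "status"]

# Each dict becomes one row; `columns` fixes the column order.
sample_df = pd.DataFrame(records, columns=columns)
print(sample_df.shape)
```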


As the target information is in JSON format, you will need `json.loads(res.text)` to parse it into a Python object.
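
To see what the parsing step does without making a request, here is a minimal offline sketch. The `sample_text` string is a trimmed, hand-written stand-in for the text of a response, not real API output.

```python
import json

# Trimmed, hand-written stand-in for the text of an API response.
sample_text = (
    '{"name": "Game of Thrones", '
    '"rating": {"average": 9.3}, '
    '"genres": ["Drama", "Adventure", "Fantasy"]}'
)

show = json.loads(sample_text)      # JSON object -> Python dict
print(type(show).__name__)
print(show["name"])
print(show["rating"]["average"])    # nested objects become nested dicts
```

Response objects from `requests` also provide a `.json()` method that performs the same parsing in one call.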

Here's an example of the information and how we can interact with it.

In [20]:
# example url
res = requests.get('http://api.tvmaze.com/lookup/shows?imdb=tt0944947')

# status code
print(res.status_code)

# just the contents of the name element
print(json.loads(res.text).get('name'))

# entire contents
print(json.loads(res.text))

200
Game of Thrones
{u'status': u'Running', u'rating': {u'average': 9.3}, u'genres': [u'Drama', u'Adventure', u'Fantasy'], u'weight': 100, u'updated': 1502190750, u'name': u'Game of Thrones', u'language': u'English', u'schedule': {u'days': [u'Sunday'], u'time': u'21:00'}, u'url': u'http://www.tvmaze.com/shows/82/game-of-thrones', u'officialSite': u'http://www.hbo.com/game-of-thrones', u'externals': {u'thetvdb': 121361, u'tvrage': 24493, u'imdb': u'tt0944947'}, u'premiered': u'2011-04-17', u'summary': u'<p>Based on the bestselling book series <i>A Song of Ice and Fire</i> by George R.R. Martin, this sprawling new HBO drama is set in a world where summers span decades and winters can last a lifetime. From the scheming south and the savage eastern lands, to the frozen north and ancient Wall that protects the realm from the mysterious darkness beyond, the powerful families of the Seven Kingdoms are locked in a battle for the Iron Throne. This is a story of duplicity and treachery, nobility

In [21]:
# function to pull information from the API using .get() on the parsed JSON
def get_entry(entry):
    res = requests.get('http://api.tvmaze.com/lookup/shows?imdb=' + entry)
    if res.status_code == 200:
        # parse the response once instead of once per field
        results = json.loads(res.text)

        # .get() returns None for a missing key; calling .get() on None raises AttributeError
        try:
            status = results.get('status')
        except AttributeError:
            status = 'NA'

        try:
            rating = results.get('rating').get('average')
        except AttributeError:
            rating = 'NA'

        try:
            network = results.get('network').get('name')
        except AttributeError:
            network = 'NA'

        try:
            title = results.get('name')
        except AttributeError:
            title = 'NA'

        try:
            premier = results.get('premiered')
        except AttributeError:
            premier = 'NA'

        try:
            genres = results.get('genres')
        except AttributeError:
            genres = 'NA'

        # append the collected fields as a new row of the global shows_df
        shows_df.loc[len(shows_df)] = [title, rating, genres, network, premier, status]

In [22]:
# function to pull information from the API, converting the JSON into a Python dictionary
def get_entry(entry):
    res = requests.get('http://api.tvmaze.com/lookup/shows?imdb=' + entry)
    if res.status_code == 200:
        results = json.loads(res.text)

        # indexing raises KeyError for a missing key and TypeError when indexing into None
        try:
            status = results['status']
        except (KeyError, TypeError):
            status = 'NA'
        try:
            rating = results['rating']['average']
        except (KeyError, TypeError):
            rating = 'NA'
        try:
            network = results['network']['name']
        except (KeyError, TypeError):
            network = 'NA'
        try:
            title = results['name']
        except (KeyError, TypeError):
            title = 'NA'
        try:
            genres = results['genres']
        except (KeyError, TypeError):
            genres = 'NA'
        try:
            premier = results['premiered']
        except (KeyError, TypeError):
            premier = 'NA'
        # append the collected fields as a new row of the global shows_df
        shows_df.loc[len(shows_df)] = [title, rating, genres, network, premier, status]

In [23]:
# Both functions look up specific elements. A missing element raises an error (AttributeError when
# chaining .get() on None, KeyError or TypeError when indexing), hence the try/except statements.
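
The repeated try/except blocks can also be folded into one small helper. This is only a sketch of an alternative, not the approach used in this lab; the `dig` function and the `sample` dictionary are hypothetical, and the keyword-only `default` argument assumes Python 3.

```python
def dig(data, *keys, default="NA"):
    """Follow keys into nested dicts; return default if any step is missing."""
    current = data
    for key in keys:
        # Stop as soon as we hit a non-dict (e.g. None) or an absent key.
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    # Treat an explicit null value the same as a missing one.
    return default if current is None else current

# Hand-written sample with a null network and no "status" key.
sample = {"name": "Community", "rating": {"average": 8.5}, "network": None}

print(dig(sample, "name"))
print(dig(sample, "rating", "average"))
print(dig(sample, "network", "name"))
print(dig(sample, "status"))
```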

In [24]:
shows_df = pd.DataFrame(columns=['show_name', 'rating_avg', 'genres', 'network', 'premiere_date', 'status'])

for entry in entries:
    get_entry(entry)
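
Because `get_entry` does nothing when the status code is not 200, ids that fail the lookup disappear silently. A variant of the loop that records them can make the gap visible. The sketch below runs offline: `collect_shows`, `fake_lookup`, and `fake_catalog` are hypothetical names, and the optional `pause` between calls is an assumption added as a courtesy to the API, not a documented requirement.

```python
import time

def collect_shows(ids, lookup, pause=0.0):
    """Call lookup(id) for each id; return (rows, missing_ids)."""
    rows, missing = [], []
    for show_id in ids:
        result = lookup(show_id)   # expected: a dict, or None on failure
        if result is None:
            missing.append(show_id)
        else:
            rows.append(result)
        time.sleep(pause)          # optional delay between requests
    return rows, missing

# Hypothetical stand-in for a real API call, so the sketch runs offline.
fake_catalog = {"tt0944947": {"name": "Game of Thrones"}}

def fake_lookup(show_id):
    return fake_catalog.get(show_id)

rows, missing = collect_shows(["tt0944947", "tt0000000"], fake_lookup)
print(len(rows), missing)
```

In real use, `lookup` would wrap the `requests.get` call and return `None` for any non-200 response.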

In [25]:
shows_df.head()

| | show_name | rating_avg | genres | network | premiere_date | status |
|---|---|---|---|---|---|---|
| 0 | Yu Yu Hakusho: Ghost Files | | [Comedy, Action, Anime, Supernatural] | Fuji TV | 1992-10-10 | Ended |
| 1 | Only Fools and Horses | 8.6 | [Comedy] | BBC One | 1981-09-08 | Ended |
| 2 | Sherlock | 9.2 | [Drama, Crime, Mystery] | BBC One | 2010-07-25 | Running |
| 3 | Leyla ile Mecnun | | [Drama, Comedy, Action, Adventure] | TRT1 | 2011-02-09 | Ended |
| 4 | Community | 8.5 | [Comedy] | | 2009-09-17 | Ended |


In [26]:
shows_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 236 entries, 0 to 235
Data columns (total 6 columns):
show_name        236 non-null object
rating_avg       230 non-null float64
genres           236 non-null object
network          236 non-null object
premiere_date    236 non-null object
status           236 non-null object
dtypes: float64(1), object(5)
memory usage: 12.9+ KB
