# APIs Lab
In this lab we will practice using APIs to retrieve and store data.

In [132]:
# Imports at the top
import json
import urllib
import pandas as pd
import numpy as np
import requests
import json
import re
import matplotlib.pyplot as plt
%matplotlib inline

## Exercise 1: IMDB Movies

Sometimes an API doesn't provide all the information we would like to get and we need to be creative.
Here we will use a combination of scraping and API calls to investigate the ratings and gross earnings of famous movies.

## 1.a Get top movies

The Internet Movie Database contains data about movies. Unfortunately it does not have a public API.

The page http://www.imdb.com/chart/top contains the list of the top 250 movies of all times. Retrieve the page using the requests library and then parse the html to obtain a list of the `movie_ids` for these movies. You can parse it with regular expression or using a library like `BeautifulSoup`.

**Hint:** movie_ids look like this: `tt2582802`

In [133]:
from bs4 import BeautifulSoup
import urllib2

page=urllib2.urlopen('http://www.imdb.com/chart/top', 'lxml')
soup = BeautifulSoup(page.read())

tables = soup.find_all('table', class_="chart full-width")
print len(tables)

            

1


In [184]:
movie_ids = []

for idx, table in enumerate(tables):
    for idx2, row in enumerate(table.find_all('tr')):
        if idx2 > 0:

            for idx3, column, in enumerate(row.find_all(['td'])):
                if idx3 == 0:
                    a = column.find_all('a', href = True)
                    movie_ids += [a[0]['href'].split('/')[2]]
#                 if idx3 == 1:
#                     dicty['titles'] += [column.text.split('\n')[2]]
#                     dicty['years'] += [column.text.split('\n')[3][1:5]]
#                 if idx3 == 2:
#                     dicty['ratings'] += re.findall('\w*.\w*', column.text)


## 1.b Get top movies data

Although the Internet Movie Database does not have a public API, an open API exists at http://www.omdbapi.com.

Use this API to retrieve information about each of the 250 movies you have extracted in the previous step.
- Check the documentation of omdbapi.com to learn how to request movie data by id
- Define a function that returns a python object with all the information for a given id
- Iterate on all the IDs and store the results in a list of such objects
- Create a Pandas Dataframe from the list

In [277]:
info = pd.DataFrame()
for i in movie_ids:
    soup = BeautifulSoup(urllib2.urlopen('http://www.omdbapi.com/?i='+ i), "lxml")
    parsed_json = json.loads(soup.text)
    info = pd.concat((info, (pd.DataFrame([parsed_json]))), axis = 0)
    



In [278]:
movie_data = info.drop(['Response','Type'], axis = 1)
#info = info.drop('Response', axis = 1)
movie_data.head()

Unnamed: 0,Actors,Awards,Country,Director,Genre,Language,Metascore,Plot,Poster,Rated,Released,Runtime,Title,Writer,Year,imdbID,imdbRating,imdbVotes
0,"Tim Robbins, Morgan Freeman, Bob Gunton, Willi...",Nominated for 7 Oscars. Another 18 wins & 30 n...,USA,Frank Darabont,"Crime, Drama",English,80.0,Two imprisoned men bond over a number of years...,https://images-na.ssl-images-amazon.com/images...,R,14 Oct 1994,142 min,The Shawshank Redemption,"Stephen King (short story ""Rita Hayworth and S...",1994,tt0111161,9.3,1711064
0,"Marlon Brando, Al Pacino, James Caan, Richard ...",Won 3 Oscars. Another 23 wins & 27 nominations.,USA,Francis Ford Coppola,"Crime, Drama","English, Italian, Latin",100.0,The aging patriarch of an organized crime dyna...,https://images-na.ssl-images-amazon.com/images...,R,24 Mar 1972,175 min,The Godfather,"Mario Puzo (screenplay), Francis Ford Coppola ...",1972,tt0068646,9.2,1177812
0,"Al Pacino, Robert Duvall, Diane Keaton, Robert...",Won 6 Oscars. Another 10 wins & 20 nominations.,USA,Francis Ford Coppola,"Crime, Drama","English, Italian, Spanish, Latin, Sicilian",80.0,The early life and career of Vito Corleone in ...,https://images-na.ssl-images-amazon.com/images...,R,20 Dec 1974,202 min,The Godfather: Part II,"Francis Ford Coppola (screenplay), Mario Puzo ...",1974,tt0071562,9.0,807734
0,"Christian Bale, Heath Ledger, Aaron Eckhart, M...",Won 2 Oscars. Another 146 wins & 142 nominations.,"USA, UK",Christopher Nolan,"Action, Crime, Drama","English, Mandarin",82.0,When the menace known as the Joker wreaks havo...,https://images-na.ssl-images-amazon.com/images...,PG-13,18 Jul 2008,152 min,The Dark Knight,"Jonathan Nolan (screenplay), Christopher Nolan...",2008,tt0468569,9.0,1699835
0,"Martin Balsam, John Fiedler, Lee J. Cobb, E.G....",Nominated for 3 Oscars. Another 16 wins & 8 no...,USA,Sidney Lumet,"Crime, Drama",English,,A jury holdout attempts to prevent a miscarria...,https://images-na.ssl-images-amazon.com/images...,APPROVED,01 Apr 1957,96 min,12 Angry Men,"Reginald Rose (story), Reginald Rose (screenplay)",1957,tt0050083,8.9,455987


## 1.c Get gross data

The OMDB API is great, but it does not provide information about Gross Revenue of the movie. We'll revert back to scraping for this.

- Write a function that retrieves the gross revenue from the entry page at imdb.com
- The function should handle the exception of when the page doesn't report gross revenue
- Retrieve the gross revenue for each movie and store it in a separate dataframe

In [220]:

gross_urls = []

for idx, col in enumerate(table.find_all('tr')):
    for idx2, row in enumerate(col.find_all('a')):
        gross_urls += ['http://www.imdb.com' + row['href']]
        
#print gross_urls[0]

def gross(url):
    page=urllib2.urlopen(url, 'lxml')
    soup = BeautifulSoup(page.read())
    gross = 0.0
    for idx, div in enumerate(soup.find_all('div', id='titleDetails')):
        for idx2, div2 in enumerate(div.find_all('div', class_='txt-block')):
            if 'Gross' in div2.text:
                gross = float((div2.text.split('$')[1]).split(' ')[0].replace(',', ''))
    return gross

grosses = []
for url in gross_urls:
    grosses += [gross(url)]
    


In [322]:
import unicodedata

grosses = pd.Series(grosses)
grosses.head()


0     28341469.0
1     28341469.0
2    134821952.0
3    134821952.0
4     57300000.0
dtype: float64

## 1.d Data munging

- Now that you have movie information and gross revenue information, let's clean the two datasets.
- Check if there are null values. Be careful they may appear to be valid strings.
- Convert the columns to the appropriate formats. In particular handle:
    - Released
    - Runtime
    - year
    - imdbRating
    - imdbVotes
- Merge the data from the two datasets into a single one

In [396]:
#len(info['Actors'].value_counts())
temp = []
# Runtime strip 'min' and make float
# for x in movie_data['Runtime']:
#     temp.append(float(x.split(' ')[0]))
#movie_data['Runtime'] = temp

# To find and drop the one "n/a" value
#movie_data[movie_data['Released'] == 'N/A']['imdbID']
movie_data = movie_data[movie_data['imdbID'] != 'tt0015864']


for x in movie_data['Released']:
    temp.append(pd.to_datetime(x))
movie_data['Released'] = temp

movie_data['Year'] = pd.to_numeric(movie_data['Year'])
movie_data['imdbRating'] = pd.to_numeric(movie_data['imdbRating'])
# temp = []
# for x in movie_data['imdbVotes']:
#     temp.append(float(x.replace(',','')))
    
#movie_data['imdbVotes'] = temp
#movie_data['imdbVotes'] = pd.to_numeric(movie_data['imdbVotes'])
movie_data['Gross'] = grosses
movie_data.index = range(249)
movie_data.head()

Unnamed: 0,Actors,Awards,Country,Director,Genre,Language,Metascore,Plot,Poster,Rated,Released,Runtime,Title,Writer,Year,imdbID,imdbRating,imdbVotes,Gross
0,"Tim Robbins, Morgan Freeman, Bob Gunton, Willi...",Nominated for 7 Oscars. Another 18 wins & 30 n...,USA,Frank Darabont,"Crime, Drama",English,80.0,Two imprisoned men bond over a number of years...,https://images-na.ssl-images-amazon.com/images...,R,1994-10-14,142.0,The Shawshank Redemption,"Stephen King (short story ""Rita Hayworth and S...",1994,tt0111161,9.3,1711064.0,28341469.0
1,"Marlon Brando, Al Pacino, James Caan, Richard ...",Won 3 Oscars. Another 23 wins & 27 nominations.,USA,Francis Ford Coppola,"Crime, Drama","English, Italian, Latin",100.0,The aging patriarch of an organized crime dyna...,https://images-na.ssl-images-amazon.com/images...,R,1972-03-24,175.0,The Godfather,"Mario Puzo (screenplay), Francis Ford Coppola ...",1972,tt0068646,9.2,1177812.0,28341469.0
2,"Al Pacino, Robert Duvall, Diane Keaton, Robert...",Won 6 Oscars. Another 10 wins & 20 nominations.,USA,Francis Ford Coppola,"Crime, Drama","English, Italian, Spanish, Latin, Sicilian",80.0,The early life and career of Vito Corleone in ...,https://images-na.ssl-images-amazon.com/images...,R,1974-12-20,202.0,The Godfather: Part II,"Francis Ford Coppola (screenplay), Mario Puzo ...",1974,tt0071562,9.0,807734.0,28341469.0
3,"Christian Bale, Heath Ledger, Aaron Eckhart, M...",Won 2 Oscars. Another 146 wins & 142 nominations.,"USA, UK",Christopher Nolan,"Action, Crime, Drama","English, Mandarin",82.0,When the menace known as the Joker wreaks havo...,https://images-na.ssl-images-amazon.com/images...,PG-13,2008-07-18,152.0,The Dark Knight,"Jonathan Nolan (screenplay), Christopher Nolan...",2008,tt0468569,9.0,1699835.0,28341469.0
4,"Martin Balsam, John Fiedler, Lee J. Cobb, E.G....",Nominated for 3 Oscars. Another 16 wins & 8 no...,USA,Sidney Lumet,"Crime, Drama",English,,A jury holdout attempts to prevent a miscarria...,https://images-na.ssl-images-amazon.com/images...,APPROVED,1957-04-01,96.0,12 Angry Men,"Reginald Rose (story), Reginald Rose (screenplay)",1957,tt0050083,8.9,455987.0,28341469.0


## 1.e Text vectorization

There are several columns in the data that contain a comma separated list of items, for example the Genre column and the Actors column. Let's transform those to binary columns using the count vectorizer from scikit learn.

Append these columns to the merged dataframe.

**Hint:** In order to get the actors name right, you'll have to modify the `token_pattern` in the `CountVectorizer`.

In [419]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features = 1000,
                             tokenizer=lambda x:x.split(', '),
                             stop_words='english',
                             binary=True)

vectorizer.fit(movie_data['Actors'])
z = vectorizer.transform(movie_data['Actors']).todense()
text = pd.DataFrame(z, columns = vectorizer.get_feature_names())

merged = movie_data.merge(text, how = 'inner', left_index = True, right_index = True)
merged.head(1)

Unnamed: 0,Actors,Awards,Country,Director,Genre,Language,Metascore,Plot,Poster,Rated,...,yacef saadi,yesim salkim,yoshiko shinohara,yukiko shimazaki,yves montand,yôko tsukasa,zach grenier,zoe saldana,álvaro guerrero,çetin tekindor
0,"Tim Robbins, Morgan Freeman, Bob Gunton, Willi...",Nominated for 7 Oscars. Another 18 wins & 30 n...,USA,Frank Darabont,"Crime, Drama",English,80,Two imprisoned men bond over a number of years...,https://images-na.ssl-images-amazon.com/images...,R,...,0,0,0,0,0,0,0,0,0,0


In [453]:
vectorizer = CountVectorizer(max_features = 1000,
                             tokenizer=lambda x:x.split(', '),
                             stop_words='english',
                             binary=True)

vectorizer.fit(movie_data['Genre'])
blah = vectorizer.transform(movie_data['Actors']).todense()
genre_text = pd.DataFrame(blah, columns = vectorizer.get_feature_names())

merged = merged.merge(genre_text, how = 'inner', left_index = True, right_index = True)
merged.head(1)


TypeError: string indices must be integers

In [451]:
merged.to_csv('imdb.csv', sep='\t', encoding='utf-8')


## Exercise 2: Showing Our Data

We did all that hard work. Let's show it to the world! First save your .csv file.

```
df.to_csv('imdb.csv', sep='\t', encoding='utf-8')
```

## 2.a Add model

Add a new model to the models.py file in your website folder. It should contain each movie's meta data such as rank, summary, genre, and rating. Here is an example of a movies model.

```
class TopMovies(models.Model):
    name = models.CharField(max_length=200)
    rank = models.IntegerField(default=-1)
    created = models.DateTimeField(auto_now_add=True, blank=True)
```

## 2.b Add Url
Add the following path to your url.py file.

```
url(r'^imdb/top/$', views.imdb, name='imdb')
```

## 2.c Create View

Create your view. Remember you need to name it imdb! Your view should contain the following logic:

- If your imdb database has no values in it
    - Open up your imdb.csv file and populate the model
- Retrieve model and show the data on your webpage

## 2.d Create Template

Link a new template to your view. Your template must do the following things:

- Use a table, rows, and columns to organize your data
- Have a different background color besides white
- Show your image poster links as images

## 2.e Submit Url

Paste your new url below

In [None]:
#https://salty-reaches-92407.herokuapp.com/site/imdb/top/