# APIs Lab
In this lab we will practice using APIs to retrieve and store data.

In [1]:
# Imports at the top
import json
import urllib
import pandas as pd
import numpy as np
import requests
import re
import matplotlib.pyplot as plt
%matplotlib inline

## Exercise 1: Get Data From Sheetsu

[Sheetsu](https://sheetsu.com/) is an online service that exposes any Google spreadsheet through an API. This is a convenient way to share a dataset with colleagues, and it can serve as a small centralized data store that is simpler to edit than a database.

A Google Spreadsheet with Wine data can be found [here]().

It can be accessed through the Sheetsu API at this endpoint: https://sheetsu.com/apis/v1.0/dab55afd

Questions:

1. Use the requests library to access the document. Inspect the response text. What kind of data is it?

> Answer: it's a JSON string
- Check the status code of the response object. What code is it?
> 200
- Use the appropriate libraries and read functions to read the response into a Pandas DataFrame.
> Possible answers include: `pd.read_json`, or `json.loads` followed by `pd.DataFrame`
- Once you've imported the data into a DataFrame, check the value of the 5th row: what's the price? (It should be 6.)

In [7]:
# Selector is used later (Exercise 5) to parse HTML with XPath
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

webpath = 'https://sheetsu.com/apis/v1.0/dab55afd'
response = requests.get(webpath)
JSON = response.text  # raw response body: a JSON string

print(response.headers)
print(response.status_code)
print(JSON)



{'Status': '200 OK', 'Content-Length': '19756', 'Vary': 'Origin', 'X-Request-Id': '28f87501-ee12-48bb-8d47-4eb83e8f8466', 'Server': 'nginx', 'Connection': 'keep-alive', 'ETag': 'W/"530b6398e17572e0df3096ae06f4bf1d"', 'Cache-Control': 'max-age=0, private, must-revalidate', 'Date': 'Mon, 29 Aug 2016 23:30:58 GMT', 'X-Runtime': '1.259784', 'Content-Type': 'application/json;charset=UTF-8'}
200
[{"Color":"W","Region":"Portugal","Country":"Portugal","Vintage":"2013","Vinyard":"Vinho Verde","Name":"","Grape":"","Consumed In":"2015","Score":"4","Price":""},{"Color":"W","Region":"France","Country":"France","Vintage":"2013","Vinyard":"Peyruchet","Name":"","Grape":"","Consumed In":"2015","Score":"3","Price":"17.8"},{"Color":"W","Region":"Oregon","Country":"Oregon","Vintage":"2013","Vinyard":"Abacela","Name":"","Grape":"","Consumed In":"2015","Score":"3","Price":"20"},{"Color":"W","Region":"Spain","Country":"Spain","Vintage":"2012","Vinyard":"Ochoa","Name":"","Grape":"chardonay","Consumed In":"201

In [9]:
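# Newer pandas releases deprecate passing a raw JSON string;
# there, wrap it instead: pd.read_json(io.StringIO(JSON))
df = pd.read_json(JSON)
print(df)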

    Color Consumed In   Country  \
0       W        2015  Portugal   
1       W        2015    France   
2       W        2015    Oregon   
3       W        2015     Spain   
4       R        2015        US   
5       R        2015        US   
6       R        2015        US   
7       R        2015    France   
8       R        2015    France   
9       R        2015        US   
10      R        2015     Italy   
11      R        2015             
12      R        2015        US   
13      R        2015     Italy   
14      W        2013    France   
15      R        2013        US   
16      R        2013    France   
17      R        2013    France   
18      W        2014        US   
19      W        2014        US   
20      R        2014     Italy   
21      P        2014        US   
22      W        2014        US   
23      R        2014    France   
24      W        2015    France   
25      W        2015             
26      W        2015  Portugal   
27      W        201
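
The other route mentioned in the answers, `json.loads` followed by `pd.DataFrame`, can be sketched as follows. To stay runnable offline, the sample string reuses the first three records printed above instead of calling the live endpoint; `records_to_frame` is a hypothetical helper name, not part of any library.

```python
import json

import pandas as pd

# First three records of the response printed above (not a live call).
SAMPLE_JSON = (
    '[{"Color":"W","Region":"Portugal","Country":"Portugal","Vintage":"2013",'
    '"Vinyard":"Vinho Verde","Name":"","Grape":"","Consumed In":"2015",'
    '"Score":"4","Price":""},'
    '{"Color":"W","Region":"France","Country":"France","Vintage":"2013",'
    '"Vinyard":"Peyruchet","Name":"","Grape":"","Consumed In":"2015",'
    '"Score":"3","Price":"17.8"},'
    '{"Color":"W","Region":"Oregon","Country":"Oregon","Vintage":"2013",'
    '"Vinyard":"Abacela","Name":"","Grape":"","Consumed In":"2015",'
    '"Score":"3","Price":"20"}]'
)


def records_to_frame(text):
    """Parse a JSON array of records into a DataFrame (hypothetical helper)."""
    records = json.loads(text)  # list of dicts, one per spreadsheet row
    return pd.DataFrame(records)


sample_df = records_to_frame(SAMPLE_JSON)

# Rows are positional: .iloc[1] is the 2nd row, so the 5th row of the
# full dataset would be df.iloc[4].
print(sample_df.iloc[1]['Price'])
```

Note that `json.loads` keeps every value as a string, so the price comes back as text until it is converted explicitly.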

## Exercise 2: Post Data to Sheetsu
Now that we've learned how to read data, it'd be great if we could also write data. For this we will need to use a _POST_ request.

1. Use the post command to add the following data to the spreadsheet:

In [11]:
post_data = {
    'Grape': '',
    'Name': 'My wonderful wine',
    'Color': 'R',
    'Country': 'US',
    'Region': 'Sonoma',
    'Vinyard': '',
    'Score': '10',
    'Consumed In': '2015',
    'Vintage': '1973',
    'Price': '200',
}

1. What status did you get? How can you check that you actually added the data correctly?
- In this exercise, your classmates are adding data to the same spreadsheet. What happens because of this? Is it a problem? How could you mitigate it?

In [12]:
requests.post('https://sheetsu.com/apis/v1.0/dab55afd', json=post_data)

<Response [201]>
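
A `201` status means the server reports that a resource was created. One way to confirm that the row really landed is to fetch the sheet again and look for the record. The sketch below assumes the rows have already been fetched; the `rows` list stands in for `requests.get(...).json()` so that the snippet runs offline, and `contains_record` is a hypothetical helper.

```python
def contains_record(rows, record):
    """Return True if some row matches `record` on every key of `record`."""
    return any(
        all(row.get(key) == value for key, value in record.items())
        for row in rows
    )


# Stand-in for requests.get(endpoint).json(); illustrative rows only.
rows = [
    {'Name': '', 'Color': 'W', 'Country': 'Portugal', 'Price': ''},
    {'Name': 'My wonderful wine', 'Color': 'R', 'Country': 'US', 'Price': '200'},
]
new_wine = {'Name': 'My wonderful wine', 'Color': 'R', 'Price': '200'}

print(contains_record(rows, new_wine))
```

Because several people post the same record to the shared sheet, a match only shows that at least one copy exists. Counting the matching rows, or adding a field that identifies the author, would help tell the copies apart.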

## Exercise 3: Data munging

Go back to the DataFrame you created at the beginning. Let's do some data munging:

1. Search for missing data
    - Is there any missing data? How do you deal with it?
    - Is there any data you can just remove?
    - Are the data types appropriate?
- Summarize the data
    - Try using `describe`, `min`, `max`, `mean`, `var`

In [36]:
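# Empty strings mark missing values in the sheet; turn them into NaN
df_null = df.replace(to_replace='', value=np.nan)
# Normalize column names: 'Consumed In' -> 'consumed_in'
df_null.columns = [x.replace(' ', '_').lower() for x in df_null.columns]

# Uncomment to inspect the missing values per column:
# print(df_null.isnull().sum())
# print(df_null.info())

# The grape and vinyard fields are missing too often, so drop those columns.
# For the remaining columns, drop every row that has a missing value.

keep_columns = [x for x in df_null.columns if x not in ['grape', 'vinyard']]
df_clean = df_null[keep_columns].copy()
df_clean.dropna(inplace=True)
print(df_clean.info())  # info() prints its report and returns None, hence the trailing None
df_clean.describe()  # not displayed: a notebook shows only the cell's last expression
df_clean.head()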

<class 'pandas.core.frame.DataFrame'>
Int64Index: 101 entries, 6 to 116
Data columns (total 8 columns):
color          101 non-null object
consumed_in    101 non-null object
country        101 non-null object
name           101 non-null object
price          101 non-null object
region         101 non-null object
score          101 non-null object
vintage        101 non-null object
dtypes: object(8)
memory usage: 7.1+ KB
None


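|    | color | consumed_in | country | name | price | region | score | vintage |
|----|-------|-------------|---------|------|-------|--------|-------|---------|
| 6  | R | 2015 | US | #14 | 21 | Oregon | 2.5 | 2013 |
| 10 | R | 2015 | Italy | Rosso Dei Poggi | 12 | Tuscany | 3.0 | 2012 |
| 13 | R | 2015 | Italy | Rosso Di Montalcino | 15 | Tuscany | 3.5 | 2012 |
| 14 | W | 2013 | France | Sancerre Cuvee Des Moulins Bales | 12 | Loire | 3.0 | 2012 |
| 15 | R | 2013 | US | Meiomi | 13 | Napa | 3.0 | 2012 |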
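
The `info()` report above lists every column as `object`, so `describe`, `mean` and `var` cannot yet treat price or score as numbers. A minimal sketch of one way to fix the types, using `pd.to_numeric` on two rows copied from the cleaned table (the choice of which columns count as numeric is an assumption):

```python
import pandas as pd

# Two rows copied from the cleaned table above, with every value as a string.
sample = pd.DataFrame({
    'price': ['21', '12'],
    'score': ['2.5', '3.0'],
    'vintage': ['2013', '2012'],
    'consumed_in': ['2015', '2015'],
    'color': ['R', 'R'],
})

numeric_columns = ['price', 'score', 'vintage', 'consumed_in']
for column in numeric_columns:
    # errors='coerce' turns anything unparseable into NaN instead of raising
    sample[column] = pd.to_numeric(sample[column], errors='coerce')

print(sample.dtypes)
print(sample[['price', 'score']].describe())
```

With `errors='coerce'`, stray text in a numeric column shows up as a new missing value, so it is worth re-running `isnull().sum()` after the conversion.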


## Exercise 4: Feature Extraction

We would like to use a regression tree to predict the score of a wine. In order to do that, we first need to select and engineer appropriate features.

1. Set the target to be the Score column and drop the rows with no score.
- Use `pd.get_dummies` to create dummy features for all the text columns.
- Fill the NaN values in the numerical columns, using an appropriate method.
- Train a decision tree regressor on the score, using a train/test split:
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
- Plot the test values, the predicted values and the residuals.
- Calculate the R^2 score.
- Discuss your findings.
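
The steps above can be sketched end to end as follows. This is not a reference solution: the data is synthetic and generated only to make the snippet runnable, the column names mirror the cleaned wine table, and median imputation is just one possible choice for filling gaps.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the wine data (assumption: NOT the real sheet).
rng = np.random.RandomState(0)
n = 40
wines = pd.DataFrame({
    'color': rng.choice(['R', 'W'], size=n),
    'country': rng.choice(['US', 'France', 'Italy'], size=n),
    'price': rng.uniform(5, 50, size=n).round(1),
})
wines['score'] = (2 + wines['price'] / 25 + rng.normal(0, 0.3, size=n)).round(1)

y = wines['score']
# One 0/1 column per category value for the text columns
X = pd.get_dummies(wines.drop(columns='score'),
                   columns=['color', 'country'], dtype=float)
# Median imputation for numeric gaps (none occur in this toy data)
X = X.fillna(X.median())

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = DecisionTreeRegressor(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

residuals = y_test - y_pred  # what a residual plot would show
r2 = r2_score(y_test, y_pred)
print(r2)
```

An unconstrained tree fits the training set almost perfectly, so the test R^2 is the number to discuss. Limiting `max_depth` is a common way to reduce overfitting on a dataset this small.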


## Exercise 5: IMDB Movies

Sometimes an API doesn't provide all the information we would like to get and we need to be creative.
Here we will use a combination of scraping and API calls to investigate the ratings and gross earnings of famous movies.

## 5.a Get top movies

The Internet Movie Database contains data about movies. Unfortunately it does not have a public API.

The page http://www.imdb.com/chart/top contains the list of the top 250 movies of all time. Retrieve the page using the requests library and then parse the HTML to obtain a list of the `movie_ids` for these movies. You can parse it with a regular expression or with a library like `BeautifulSoup`.

**Hint:** movie_ids look like this: `tt2582802`

In [37]:
webpath='http://www.imdb.com/chart/top'
response = requests.get(webpath)
HTML = response.text  


In [58]:

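# Each chart row is a <tr>; the movie id sits in a div's data-titleid attribute
rows = Selector(text=HTML).xpath('//tr').extract()
data = [Selector(text=r).xpath('//div/@data-titleid').extract() for r in rows]

# Keep the first id of each row and skip rows without one (e.g. the header)
data = [str(x[0]) for x in data if len(x) > 0]

print(data)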
 

['tt0111161', 'tt0068646', 'tt0071562', 'tt0468569', 'tt0108052', 'tt0050083', 'tt0110912', 'tt0167260', 'tt0060196', 'tt0137523', 'tt0120737', 'tt0080684', 'tt0109830', 'tt1375666', 'tt0167261', 'tt0073486', 'tt0099685', 'tt0133093', 'tt0047478', 'tt0076759', 'tt0317248', 'tt0114369', 'tt0102926', 'tt0038650', 'tt0114814', 'tt0118799', 'tt0110413', 'tt0064116', 'tt0245429', 'tt0120815', 'tt0120586', 'tt0034583', 'tt0816692', 'tt0054215', 'tt0021749', 'tt0082971', 'tt0027977', 'tt1675434', 'tt0120689', 'tt0047396', 'tt0103064', 'tt0407887', 'tt0253474', 'tt0088763', 'tt2582802', 'tt0172495', 'tt0209144', 'tt0078788', 'tt0482571', 'tt0110357', 'tt0057012', 'tt0043014', 'tt0078748', 'tt0032553', 'tt0405094', 'tt0095765', 'tt0050825', 'tt1853728', 'tt0081505', 'tt0095327', 'tt0910970', 'tt1345836', 'tt0169547', 'tt0119698', 'tt0090605', 'tt0364569', 'tt0033467', 'tt0087843', 'tt0082096', 'tt0053125', 'tt0052357', 'tt0051201', 'tt0086190', 'tt0022100', 'tt0105236', 'tt0112573', 'tt0211915'
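
The regular-expression route mentioned in the task can be sketched like this. The HTML fragment is hypothetical and only imitates the `data-titleid` attribute used above; `extract_movie_ids` is an illustrative helper name.

```python
import re


def extract_movie_ids(html):
    """Return IMDb-style ids (tt + digits) in first-seen order, no repeats."""
    seen = set()
    ids = []
    for movie_id in re.findall(r'tt\d{7,}', html):
        if movie_id not in seen:
            seen.add(movie_id)
            ids.append(movie_id)
    return ids


# Hypothetical fragment; the real page markup may differ.
sample_html = (
    '<tr><td><a href="/title/tt0111161/">The Shawshank Redemption</a>'
    '<div data-titleid="tt0111161"></div></td></tr>'
    '<tr><td><div data-titleid="tt0068646"></div></td></tr>'
)
print(extract_movie_ids(sample_html))
```

Unlike the XPath version, the pattern picks up every id anywhere in the page, including links outside the chart, so it is worth checking that the result has exactly 250 entries.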

## 5.b Get top movies data

Although the Internet Movie Database does not have a public API, an open API exists at http://www.omdbapi.com.

Use this API to retrieve information about each of the 250 movies you have extracted in the previous step.
1. Check the documentation of omdbapi.com to learn how to request movie data by id.
- Define a function that returns a Python object with all the information for a given id.
- Iterate over all the ids and store the results in a list of such objects.
- Create a Pandas DataFrame from the list.

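In [None]:
# Example query by title: http://www.omdbapi.com/?t=avengers&y=2015&plot=short&r=json
# Poster API (requires a key): http://img.omdbapi.com/?apikey=[yourkey]&
# Note: the data API may also require an apikey parameter on each request.

def get_movie(movie_id):
    """Return the OMDb record for one movie id as a Python dict."""
    webpath = 'http://www.omdbapi.com/?i=' + movie_id + '&r=json'
    response = requests.get(webpath)
    return response.json()

# One dict per movie, then one DataFrame row per dict
datalist = [get_movie(y) for y in data]
movies_df = pd.DataFrame(datalist)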


## 5.c Get gross data

The OMDb API is great, but it does not provide the gross revenue of a movie. We'll go back to scraping for this.

1. Write a function that retrieves the gross revenue from the entry page at imdb.com
- The function should handle the case where the page doesn't report gross revenue.
- Retrieve the gross revenue for each movie and store it in a separate DataFrame.
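
A minimal sketch of the parsing half of such a function, separated from the download so it can be tried offline. The pattern assumes the page shows the figure as text like `Gross: $1,234,567`; that layout is a guess and has to be adapted to the actual markup. `parse_gross` is a hypothetical helper.

```python
import re


def parse_gross(html):
    """Return gross revenue in dollars as an int, or None if not reported.

    Assumes (unverified) that the word 'Gross' is followed, within about
    100 characters, by a dollar amount such as '$1,234,567'.
    """
    match = re.search(r'Gross.{0,100}?\$\s*(\d[\d,]*)', html, flags=re.DOTALL)
    if match is None:
        return None  # the page does not report gross revenue
    return int(match.group(1).replace(',', ''))


# Made-up fragments, not real IMDb markup or real figures.
print(parse_gross('<h4>Gross:</h4> $1,234,567 (USA)'))
print(parse_gross('<p>No box office information</p>'))
```

Returning `None` instead of raising keeps the loop over all 250 movies running, and the missing values then surface as NaN once the results are placed in a DataFrame.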

## 5.d Data munging

1. Now that you have movie information and gross revenue information, let's clean the two datasets.
- Check if there are null values. Be careful: they may appear to be valid strings.
- Convert the columns to the appropriate formats. In particular handle:
    - Released
    - Runtime
    - Year
    - imdbRating
    - imdbVotes
- Merge the data from the two datasets into a single one.
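
The conversions listed above might look like the sketch below. The sample values only imitate the style OMDb returns (dates such as `14 Oct 1994`, runtimes such as `142 min`, and `N/A` for missing fields); they are assumptions for illustration, not real API output.

```python
import numpy as np
import pandas as pd

# Illustrative values in the style OMDb returns; not real API output.
movies = pd.DataFrame({
    'Released': ['14 Oct 1994', 'N/A'],
    'Runtime': ['142 min', '175 min'],
    'Year': ['1994', '1972'],
    'imdbRating': ['9.3', 'N/A'],
    'imdbVotes': ['1,234,567', '987,654'],
})

# 'N/A' is a valid string, so isnull() would miss it; make it a real NaN first
movies = movies.replace('N/A', np.nan)

movies['Released'] = pd.to_datetime(movies['Released'], format='%d %b %Y')
movies['Runtime'] = pd.to_numeric(
    movies['Runtime'].str.replace(' min', '', regex=False))
movies['Year'] = pd.to_numeric(movies['Year'])
movies['imdbRating'] = pd.to_numeric(movies['imdbRating'])
movies['imdbVotes'] = pd.to_numeric(
    movies['imdbVotes'].str.replace(',', '', regex=False))

print(movies.dtypes)
```

Once both frames share a clean id column, `pd.merge` on that column combines them.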

## 5.e Text vectorization

There are several columns in the data that contain a comma-separated list of items, for example the Genre column and the Actors column. Let's transform those to binary columns using the `CountVectorizer` from scikit-learn.

Append these columns to the merged dataframe.

**Hint:** In order to get the actors' names right, you'll have to modify the `token_pattern` in the `CountVectorizer`.
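
One possible `token_pattern` for this, as a sketch: treat everything between commas as a single token, so that a full name stays together. The two sample rows are illustrative, and the pattern assumes items are separated by a comma followed by a space.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative rows; they only demonstrate the tokenization.
actors = pd.Series([
    'Tim Robbins, Morgan Freeman',
    'Morgan Freeman, Brad Pitt',
])

vectorizer = CountVectorizer(
    token_pattern=r'[^,\s][^,]*',  # from a non-space character up to the next comma
    lowercase=False,               # keep the names' capitalization
    binary=True,                   # 1 if present, regardless of repeats
)
matrix = vectorizer.fit_transform(actors)

# Column names ordered by their index in the fitted vocabulary
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
actor_df = pd.DataFrame(matrix.toarray(), columns=columns, index=actors.index)
print(actor_df)
```

Because `actor_df` keeps the original index, `pd.concat([merged, actor_df], axis=1)` would append the new columns to a merged frame with the same index.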

## Bonus: Final Questions:

1. What are the top 10 grossing movies?
- Who are the 10 actors that appear in the most movies?
- What is the average gross of the movies in which each of these actors appears?
- What genre is the oldest movie?
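
The first three questions map onto a few pandas idioms, sketched here on a made-up frame. The column names `Title`, `Actors` and `Gross` are assumptions about how the merged dataset is laid out, and every value is hypothetical.

```python
import pandas as pd

# Hypothetical merged frame; titles, casts and amounts are made up.
merged = pd.DataFrame({
    'Title': ['Movie A', 'Movie B', 'Movie C'],
    'Actors': ['Actor X, Actor Y', 'Actor X', 'Actor Y, Actor Z'],
    'Gross': [300.0, 100.0, 200.0],
})

# Top grossing movies (10 at most)
top_grossing = merged.nlargest(10, 'Gross')[['Title', 'Gross']]

# One row per (movie, actor) pair, then count and average per actor
cast = merged.assign(Actor=merged['Actors'].str.split(', ')).explode('Actor')
per_actor = cast.groupby('Actor')['Gross'].agg(['count', 'mean'])
busiest = per_actor.sort_values('count', ascending=False).head(10)

print(top_grossing)
print(busiest)
```

`explode` requires pandas 0.25 or later; on older versions the binary actor columns from the vectorization step can be summed instead.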
