In [4]:
'''
CLASS: Getting Data from APIs

What is an API?
- Application Programming Interface
- Structured way to expose specific functionality and data access to users
- Web APIs usually follow the "REST" architectural style

How to interact with an API:
- Make a "request" to a specific URL (an "endpoint"), and get the data back in a "response"
- Most relevant request method for us is GET (other methods: POST, PUT, DELETE)
- Response is often in JSON format
- Web console is sometimes available (allows you to explore an API)
'''


In [1]:
import pandas as pd
import requests

In [2]:
# read IMDb data into a DataFrame: we want a year column!
movies = pd.read_csv('../data/imdb_1000.csv')
movies.head()

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."
4,8.9,Pulp Fiction,R,Crime,154,"[u'John Travolta', u'Uma Thurman', u'Samuel L...."


In [3]:
print movies.shape
movies.describe()

(979, 6)


Unnamed: 0,star_rating,duration
count,979.0,979.0
mean,7.889785,120.979571
std,0.336069,26.21801
min,7.4,64.0
25%,7.6,102.0
50%,7.8,117.0
75%,8.1,134.0
max,9.3,242.0


In [4]:
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 979 entries, 0 to 978
Data columns (total 6 columns):
star_rating       979 non-null float64
title             979 non-null object
content_rating    976 non-null object
genre             979 non-null object
duration          979 non-null int64
actors_list       979 non-null object
dtypes: float64(1), int64(1), object(4)
memory usage: 46.0+ KB


In [12]:
movies.title.value_counts()

The Girl with the Dragon Tattoo                           2
Dracula                                                   2
Les Miserables                                            2
True Grit                                                 2
Kung Fu Hustle                                            1
The Bridge on the River Kwai                              1
Donnie Brasco                                             1
Running Scared                                            1
The Evil Dead                                             1
My Left Foot                                              1
Escape from Alcatraz                                      1
The Matrix                                                1
Brokeback Mountain                                        1
Hachi: A Dog's Tale                                       1
The Visitor                                               1
Elizabeth                                                 1
The Meaning of Life                     

In [8]:
###### exercise #######

# Is the title column unique? If not, which titles appear more than once?



# answer below

In [13]:
from collections import Counter
for title, count in Counter(movies['title']).items():
    if count > 1:
        print title

The Girl with the Dragon Tattoo
Les Miserables
True Grit
Dracula


In [14]:
# use the requests library to send a GET request to the OMDb API (http://www.omdbapi.com)
r = requests.get('http://www.omdbapi.com/?t=the shawshank redemption&r=json&type=movie')

In [15]:
# check the status: 200 means success, 4xx or 5xx means error
r.status_code

200
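The request URLs in this notebook are assembled by hand, which works for simple titles but gets fragile once a title contains characters such as `&`. Below is a minimal sketch of building the same query string with the standard library. `build_omdb_url` is a hypothetical helper name, and the parameters are simply the ones used in the request above.

```python
try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2

def build_omdb_url(title):
    # hypothetical helper: encode the query parameters instead of concatenating strings
    params = [('t', title), ('r', 'json'), ('type', 'movie')]
    return 'http://www.omdbapi.com/?' + urlencode(params)

url = build_omdb_url('the shawshank redemption')
print(url)
```

requests can also do this encoding for you if the parameters are passed as a dictionary, e.g. `requests.get('http://www.omdbapi.com/', params={'t': title, 'r': 'json', 'type': 'movie'})`.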

In [16]:
# view the raw response text
r.text

u'{"Title":"The Shawshank Redemption","Year":"1994","Rated":"R","Released":"14 Oct 1994","Runtime":"142 min","Genre":"Crime, Drama","Director":"Frank Darabont","Writer":"Stephen King (short story \\"Rita Hayworth and Shawshank Redemption\\"), Frank Darabont (screenplay)","Actors":"Tim Robbins, Morgan Freeman, Bob Gunton, William Sadler","Plot":"Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.","Language":"English","Country":"USA","Awards":"Nominated for 7 Oscars. Another 16 wins & 20 nominations.","Poster":"http://ia.media-imdb.com/images/M/MV5BODU4MjU4NjIwNl5BMl5BanBnXkFtZTgwMDU2MjEyMDE@._V1_SX300.jpg","Metascore":"80","imdbRating":"9.3","imdbVotes":"1,684,836","imdbID":"tt0111161","Type":"movie","Response":"True"}'

In [17]:
# decode the JSON response body into a dictionary
r.json()

{u'Actors': u'Tim Robbins, Morgan Freeman, Bob Gunton, William Sadler',
 u'Awards': u'Nominated for 7 Oscars. Another 16 wins & 20 nominations.',
 u'Country': u'USA',
 u'Director': u'Frank Darabont',
 u'Genre': u'Crime, Drama',
 u'Language': u'English',
 u'Metascore': u'80',
 u'Plot': u'Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency.',
 u'Poster': u'http://ia.media-imdb.com/images/M/MV5BODU4MjU4NjIwNl5BMl5BanBnXkFtZTgwMDU2MjEyMDE@._V1_SX300.jpg',
 u'Rated': u'R',
 u'Released': u'14 Oct 1994',
 u'Response': u'True',
 u'Runtime': u'142 min',
 u'Title': u'The Shawshank Redemption',
 u'Type': u'movie',
 u'Writer': u'Stephen King (short story "Rita Hayworth and Shawshank Redemption"), Frank Darabont (screenplay)',
 u'Year': u'1994',
 u'imdbID': u'tt0111161',
 u'imdbRating': u'9.3',
 u'imdbVotes': u'1,684,836'}

In [18]:
# extracting the year from the dictionary
r.json()['Year']

u'1994'
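Calling `r.json()` is roughly equivalent to running `json.loads` on `r.text`. Note that every value in this response arrives as a string, so numbers need an explicit conversion. A small sketch using an abbreviated copy of the raw text shown above (the variable names are only for illustration):

```python
import json

# abbreviated copy of the raw response text shown above
raw_text = '{"Title":"The Shawshank Redemption","Year":"1994","Runtime":"142 min","Response":"True"}'

movie = json.loads(raw_text)                        # JSON object -> Python dictionary
year = int(movie['Year'])                           # '1994' -> 1994
runtime_minutes = int(movie['Runtime'].split()[0])  # '142 min' -> 142

print(year)
print(runtime_minutes)
```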

In [19]:
# what happens if the movie name is not recognized?
r = requests.get('http://www.omdbapi.com/?t=thebestmovieevermade&r=json&type=movie')
print r.status_code
r.json()

200


{u'Error': u'Movie not found!', u'Response': u'False'}

In [20]:
# another example
movie_title = 'finding dory'
r = requests.get('http://www.omdbapi.com/?t='+movie_title+'&r=json&type=movie')
print r.status_code
r.json()

200


{u'Actors': u"Ellen DeGeneres, Albert Brooks, Ed O'Neill, Kaitlin Olson",
 u'Awards': u'2 nominations.',
 u'Country': u'USA',
 u'Director': u'Andrew Stanton, Angus MacLane',
 u'Genre': u'Animation, Adventure, Comedy',
 u'Language': u'English, Indonesian',
 u'Metascore': u'77',
 u'Plot': u'The friendly but forgetful blue tang fish begins a search for her long-lost parents, and everyone learns a few things about the real meaning of family along the way.',
 u'Poster': u'http://ia.media-imdb.com/images/M/MV5BNzg4MjM2NDQ4MV5BMl5BanBnXkFtZTgwMzk3MTgyODE@._V1_SX300.jpg',
 u'Rated': u'PG',
 u'Released': u'17 Jun 2016',
 u'Response': u'True',
 u'Runtime': u'97 min',
 u'Title': u'Finding Dory',
 u'Type': u'movie',
 u'Writer': u'Andrew Stanton (original story by), Andrew Stanton (screenplay), Victoria Strouse (screenplay), Bob Peterson (additional screenplay material by), Angus MacLane (additional story material by)',
 u'Year': u'2016',
 u'imdbID': u'tt2277860',
 u'imdbRating': u'7.8',
 u'imdbVot

In [25]:
temp_dict = r.json()
print temp_dict

{u'Plot': u'The friendly but forgetful blue tang fish begins a search for her long-lost parents, and everyone learns a few things about the real meaning of family along the way.', u'Rated': u'PG', u'Response': u'True', u'Language': u'English, Indonesian', u'Title': u'Finding Dory', u'Country': u'USA', u'Writer': u'Andrew Stanton (original story by), Andrew Stanton (screenplay), Victoria Strouse (screenplay), Bob Peterson (additional screenplay material by), Angus MacLane (additional story material by)', u'Metascore': u'77', u'imdbRating': u'7.8', u'Director': u'Andrew Stanton, Angus MacLane', u'Released': u'17 Jun 2016', u'Actors': u"Ellen DeGeneres, Albert Brooks, Ed O'Neill, Kaitlin Olson", u'Year': u'2016', u'Genre': u'Animation, Adventure, Comedy', u'Awards': u'2 nominations.', u'Runtime': u'97 min', u'Type': u'movie', u'Poster': u'http://ia.media-imdb.com/images/M/MV5BNzg4MjM2NDQ4MV5BMl5BanBnXkFtZTgwMzk3MTgyODE@._V1_SX300.jpg', u'imdbVotes': u'51,956', u'imdbID': u'tt2277860'}


In [26]:
print temp_dict['Year']

2016


In [42]:
##### Exercise #####

# define a function to return the year of release of a given movie title, return None if no movie found
def get_movie_year2(title):
    r = requests.get('http://www.omdbapi.com/?t='+title+'&r=json&type=movie')
    container = r.json()
    print container['Year']
    if container['Year']:
        return container['Year']
    else:
        return None







In [43]:
get_movie_year2('asdfsadf')

KeyError: 'Year'

In [44]:
def get_movie_year(title):
    # OMDb includes an 'Error' key in the response when the title is not found
    response = requests.get('http://www.omdbapi.com/?t='+title+'&r=json&type=movie').json()
    if 'Error' not in response:
        return response['Year']
    return None


In [45]:
# test the function
print get_movie_year('finding dory')
print get_movie_year('blahblahblah')

2016
None
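`get_movie_year` relies on the 'Error' key being present when a lookup fails. A slightly stricter sketch also checks the status code and the 'Response' field seen in the outputs above. `year_from_response` is a hypothetical helper that works on values that have already been fetched, so it can be tried without calling the API.

```python
def year_from_response(status_code, payload):
    # hypothetical helper: treat HTTP errors and "not found" payloads the same way
    if status_code != 200:
        return None
    if payload.get('Response') != 'True':
        return None
    # .get returns None instead of raising KeyError when 'Year' is absent
    return payload.get('Year')

found = {'Title': 'Finding Dory', 'Year': '2016', 'Response': 'True'}
missing = {'Error': 'Movie not found!', 'Response': 'False'}

print(year_from_response(200, found))
print(year_from_response(200, missing))
print(year_from_response(500, {}))
```

In the notebook this would be called as `year_from_response(r.status_code, r.json())`.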


In [51]:
# create a smaller DataFrame for testing
# the copy method returns an independent copy, so changes to top_movies do not affect movies
top_movies = movies.head().copy()

In [52]:
# write a for loop to build a list of years
from time import sleep # timey wimey stuff
years = []
for title in top_movies.title:
    years.append(get_movie_year(title))
    sleep(1)
    
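# the sleep spaces out our requests so that we don't hit the API too often
# many APIs enforce "rate limiting": a cap on how many requests you can make in a given period
# if you exceed the limit, your requests may be rejected or your key blocked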
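The pause-between-calls pattern can be pulled into a reusable function. The sketch below is one possible shape, not a standard recipe: `throttled_map` is a hypothetical name, and the `sleep` argument exists only so the pause can be replaced with a stand-in when experimenting.

```python
import time

def throttled_map(func, items, delay=1.0, sleep=time.sleep):
    # call func on each item, pausing `delay` seconds between consecutive calls
    results = []
    for position, item in enumerate(items):
        if position > 0:
            sleep(delay)
        results.append(func(item))
    return results

# stand-in for get_movie_year so the sketch runs without network access
fake_years = {'The Godfather': '1972', 'Pulp Fiction': '1994'}
pauses = []   # records each requested pause instead of actually sleeping
sample_years = throttled_map(fake_years.get,
                             ['The Godfather', 'Pulp Fiction', 'Unknown'],
                             delay=1.0, sleep=pauses.append)
print(sample_years)
print(pauses)
```

With the real function this would read `years = throttled_map(get_movie_year, top_movies.title)`, which sleeps one second between requests.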

In [49]:
# assert will throw an error if the value inside is NOT True

assert(3==4)

AssertionError: 

In [53]:
# check that the DataFrame and the list of years are the same length
assert(len(top_movies) == len(years))

In [54]:
# save that list as a new column
top_movies['year'] = years
top_movies

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list,year
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt...",1994
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']",1972
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv...",1974
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E...",2008
4,8.9,Pulp Fiction,R,Crime,154,"[u'John Travolta', u'Uma Thurman', u'Samuel L....",1994


In [None]:
'''
Bonus content: Updating the DataFrame as part of a loop
'''

# enumerate allows you to access the item location while iterating
letters = ['a', 'b', 'c']
for index, letter in enumerate(letters):
    print index, letter

In [None]:
# iterrows method for DataFrames is similar
for index, row in top_movies.iterrows():
    print index, row.title

In [None]:
# create a new column and set a default value
movies['year'] = None
movies.head()

In [None]:
# loc method allows you to access a DataFrame element by 'label'
movies.loc[0, 'year'] = 1994
movies.head()

In [None]:
# write a for loop to update the year for the first three movies
for index, row in movies.iterrows():
    if index < 3:
        movies.loc[index, 'year'] = get_movie_year(row.title)
        sleep(1)
    else:
        break
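The loop above makes one API call per row. Because a few titles occur more than once in this dataset, one way to avoid repeated calls is to remember the answers already fetched. This is a minimal sketch under that assumption: `make_cached_lookup` and `fake_lookup` are hypothetical names, and the years in `fake_lookup` are illustrative values rather than API output.

```python
def make_cached_lookup(lookup):
    # wrap any title -> year function so each distinct title is looked up only once
    cache = {}
    def cached(title):
        if title not in cache:
            cache[title] = lookup(title)
        return cache[title]
    return cached

calls = []
def fake_lookup(title):
    # stand-in for get_movie_year: records each call instead of hitting the API
    calls.append(title)
    return {'True Grit': '2010', 'Dracula': '1931'}.get(title)

cached_lookup = make_cached_lookup(fake_lookup)
titles = ['True Grit', 'Dracula', 'True Grit']
cached_years = [cached_lookup(t) for t in titles]
print(cached_years)
print(len(calls))
```

Keep in mind that repeated titles may be different films that share a name. A lookup by title alone returns the same year for both, with or without the cache.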

In [30]:
movies.head()

Unnamed: 0,star_rating,title,content_rating,genre,duration,actors_list
0,9.3,The Shawshank Redemption,R,Crime,142,"[u'Tim Robbins', u'Morgan Freeman', u'Bob Gunt..."
1,9.2,The Godfather,R,Crime,175,"[u'Marlon Brando', u'Al Pacino', u'James Caan']"
2,9.1,The Godfather: Part II,R,Crime,200,"[u'Al Pacino', u'Robert De Niro', u'Robert Duv..."
3,9.0,The Dark Knight,PG-13,Action,152,"[u'Christian Bale', u'Heath Ledger', u'Aaron E..."
4,8.9,Pulp Fiction,R,Crime,154,"[u'John Travolta', u'Uma Thurman', u'Samuel L...."


In [55]:
# this is my own personal twitter api information
# if you could be so kind as to sign up yourself on both twitter and mashape that'd be great :)
# It's FREEEEEEE
api_key = 'g5uPIpw80nULQI1gfklv2zrh4'
api_secret = 'cOWvNWxYvPmEZ0ArZVeeVVvJu41QYHdUS2GpqIKtSQ1isd5PJy'
access_token = '49722956-TWl8J0aAS6KTdcbz3ppZ7NfqZEmrwmbsb9cYPNELG'
access_secret = '3eqrVssF3ppv23qyflyAto8wLEiYRA8sXEPSghuOJWTub'

# Mashape Key
mashape_key = '0CLvblsJOxmshWkaep4szo9CILOMp1PM3hhjsnDi4k8g8ME14o'

In [57]:
# more complicated request

# HEADERS carry metadata about the request, such as the content type and (here) the access key
# DATA is the body of the request: the information the API needs to do its job
# it is up to whoever wrote the API how access keys are passed in

url = "https://japerk-text-processing.p.mashape.com/sentiment/"
headers = {
    "X-Mashape-Key": mashape_key,
    "Content-Type": "application/x-www-form-urlencoded"
}
data = {
    "language": "english",
    "text": "this sucks"      # change the text here
}

print requests.post(url, headers = headers, data = data).json()

{u'probability': {u'neg': 0.7705685298387661, u'neutral': 0.3944811241378215, u'pos': 0.22943147016123394}, u'label': u'neg'}
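With the `application/x-www-form-urlencoded` content type, the `data` dictionary is sent as a form-encoded request body. The sketch below shows roughly what that body looks like and how the label and its probability can be read from the response. The response dictionary is copied from the output above, and the variable names are only for illustration.

```python
try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2

# the same fields as the `data` dictionary above, written as ordered pairs
form_fields = [('language', 'english'), ('text', 'this sucks')]
body = urlencode(form_fields)
print(body)

# response copied from the output above
result = {'probability': {'neg': 0.7705685298387661,
                          'neutral': 0.3944811241378215,
                          'pos': 0.22943147016123394},
          'label': 'neg'}
label = result['label']
confidence = result['probability'][label]   # probability reported for the chosen label
print('%s (%.2f)' % (label, confidence))
```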


In [5]:
'''
Example of API WITH WRAPPER
tweepy is a Python wrapper for the Twitter API
'''


In [58]:
import tweepy       # python wrapper for twitter api
import json
import time

In [59]:
tag = 'donald trump'

# Documentation is your friend! http://docs.tweepy.org/en/v3.1.0/
auth = tweepy.OAuthHandler(api_key, api_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth) # creates the API object we use to make requests (auth is the authorization handler)
tweets = api.search(q=tag)

In [60]:
# let's take a look at the first one
tweets[0]

Status(contributors=None, truncated=False, text=u"RT @davidsirota: REVEALED: Trump gave Christie's GOP group $170K just before Christie admin slashed Trump's tax bill by $25 million https:/\u2026", is_quote_status=False, in_reply_to_status_id=None, id=766114709590384640, favorite_count=0, _api=<tweepy.api.API object at 0x116d99e50>, author=User(follow_request_sent=False, has_extended_profile=False, profile_use_background_image=False, _json={u'follow_request_sent': False, u'has_extended_profile': False, u'profile_use_background_image': False, u'default_profile_image': False, u'id': 2706897924, u'profile_background_image_url_https': u'https://abs.twimg.com/images/themes/theme1/bg.png', u'verified': False, u'profile_text_color': u'000000', u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/755910153069821954/tZ2KAUEj_normal.jpg', u'profile_sidebar_fill_color': u'000000', u'entities': {u'description': {u'urls': []}}, u'followers_count': 291, u'profile_sidebar_border_color':

In [61]:
# wrappers expose the response as Python objects with built-in attributes and methods!
print tweets[0].created_at
print tweets[0].text

2016-08-18 03:29:15
RT @davidsirota: REVEALED: Trump gave Christie's GOP group $170K just before Christie admin slashed Trump's tax bill by $25 million https:/…


In [62]:
# the author is an object in and of itself
tweets[0].author

User(follow_request_sent=False, has_extended_profile=False, profile_use_background_image=False, _json={u'follow_request_sent': False, u'has_extended_profile': False, u'profile_use_background_image': False, u'default_profile_image': False, u'id': 2706897924, u'profile_background_image_url_https': u'https://abs.twimg.com/images/themes/theme1/bg.png', u'verified': False, u'profile_text_color': u'000000', u'profile_image_url_https': u'https://pbs.twimg.com/profile_images/755910153069821954/tZ2KAUEj_normal.jpg', u'profile_sidebar_fill_color': u'000000', u'entities': {u'description': {u'urls': []}}, u'followers_count': 291, u'profile_sidebar_border_color': u'000000', u'id_str': u'2706897924', u'profile_background_color': u'000000', u'listed_count': 58, u'is_translation_enabled': False, u'utc_offset': None, u'statuses_count': 18331, u'description': u'Fighting for truth, liberty and justice for all.', u'friends_count': 307, u'location': u'', u'profile_link_color': u'0009B3', u'profile_image_ur

In [63]:
# the author's handle
print tweets[0].author.screen_name
print tweets[0].author.profile_image_url

JoZPina
http://pbs.twimg.com/profile_images/755910153069821954/tZ2KAUEj_normal.jpg



In [21]:
'''
THE BELOW CODE IS OPTIONAL
It creates a stream of a given tag!
'''

# This is the listener, responsible for receiving data
# I will not be covering this in class
class StdOutListener(tweepy.StreamListener):
    def on_data(self, data):
        # Twitter returns data in JSON format - we need to decode it first
        decoded = json.loads(data)
        #print decoded
        time_ = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(int(decoded['timestamp_ms']) / 1000))
        handle = decoded['user']['screen_name']
        # convert the text to ASCII, dropping any characters that cannot be encoded
        tweet_text = decoded['text'].encode('ascii', 'ignore')
        num_followers = int(decoded['user']['followers_count'])
        print '@%s at %s: %s with %d followers' % (handle, time_, tweet_text, num_followers)
        print ''
        return True

    def on_error(self, status):
        print status

def begin_live_feed(tags_to_follow):
    print "beginning live feed...."
    l = StdOutListener()
    auth = tweepy.OAuthHandler(api_key, api_secret)
    auth.set_access_token(access_token, access_secret)
    stream = tweepy.Stream(auth, l)
    stream.filter(track=tags_to_follow)

begin_live_feed(['donald trump', 'hillary clinton'])
# this is an example use, if you create a list of words and phrases, 
# a live stream of tweets about them will show up

# INTERRUPT THE KERNEL TO STOP THE MADNESS

beginning live feed....
@El_Chuco_TX at 2016-08-16 19:46:56: RT @_Makada_: I'm voting for Donald Trump for POTUS, no more career politicians!!!#TRUMP2016 #MakeAmericaGreatAgain #TrumpTrain https://t.c with 154 followers

@ljsharbono at 2016-08-16 19:46:56: RT @SpecialKMB1969: Trump Rally Breaks Elton Johns Attendance Record At River City Arena
U won't hear this #CNN
https://t.co/kIGYniuFh6 ht with 184 followers

@JessesLaw at 2016-08-16 19:46:56: RT @ddale8: Donald Trump: "I'm going to break up the gangs, the cartels, and the criminal syndicates terrorizing our neighbourhoods." How? with 213 followers

@HMaewest at 2016-08-16 19:46:56: RT @realDonaldTrump: "@DallasVercillo: Boys from the hood call me Black Donald Trump #facts" Great. with 1270 followers

@bapasphotos at 2016-08-16 19:46:56: Often-confused Hillary Clinton found unresponsive amid questions about FBI interview notes https://t.co/24FXbXmnng with 201 followers

@DawnSwe12515208 at 2016-08-16 19:46:56: RT @KazmierskiR: "In m

KeyboardInterrupt: 
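The parsing done inside `on_data` can be tried without a live stream by moving it into a plain function. This is a sketch, not part of tweepy: `format_tweet` is a hypothetical helper and `sample` is a made-up message containing only the fields the listener reads. It uses `time.gmtime` (UTC) rather than `time.localtime` so the output is the same on every machine.

```python
import json
import time

def format_tweet(raw):
    # decode one JSON message and build the same summary line the listener prints
    decoded = json.loads(raw)
    seconds = int(decoded['timestamp_ms']) // 1000   # milliseconds -> whole seconds
    stamp = time.strftime('%Y-%m-%d %H:%M:%S', time.gmtime(seconds))
    user = decoded['user']
    return '@%s at %s: %s with %d followers' % (
        user['screen_name'], stamp, decoded['text'], int(user['followers_count']))

# made-up message with only the fields used above
sample = json.dumps({
    'timestamp_ms': '1471376816000',
    'text': 'hello world',
    'user': {'screen_name': 'example_user', 'followers_count': 42},
})
print(format_tweet(sample))
```

Keeping the parsing separate from the printing also makes it easier to store the tweets (for example in a list) instead of writing them to the screen.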

In [40]:
'''
Other considerations when accessing APIs:
- Most APIs require you to have an access key (which you should store outside your code)
- Most APIs limit the number of API calls you can make (per day, hour, minute, etc.)
- Not all APIs are free
- Not all APIs are well-documented
- Pay attention to the API version

Python wrapper is another option for accessing an API:
- Set of functions that "wrap" the API code for ease of use
- Potentially simplifies your code
- But, wrapper could have bugs or be out-of-date or poorly documented
'''

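One common way to store access keys outside your code is to read them from environment variables. The sketch below assumes four variable names chosen for illustration (`TWITTER_API_KEY` and so on); they are not required by Twitter or tweepy, and `load_credentials` is a hypothetical helper.

```python
import os

# assumed variable names, chosen for this example only
REQUIRED = ('TWITTER_API_KEY', 'TWITTER_API_SECRET',
            'TWITTER_ACCESS_TOKEN', 'TWITTER_ACCESS_SECRET')

def load_credentials(env=None):
    # read the credentials from the environment and fail early if any are missing
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError('missing environment variables: ' + ', '.join(missing))
    return dict((name, env[name]) for name in REQUIRED)

# demonstration with a fake environment instead of real secrets
fake_env = dict((name, 'placeholder') for name in REQUIRED)
credentials = load_credentials(fake_env)
print(sorted(credentials))
```

In the notebook, `api_key = load_credentials()['TWITTER_API_KEY']` would then replace the hard-coded string, and the notebook could be shared without exposing the keys.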