Before we do a full pull from OMDB, we'll need to ping the OMDB API for any shows for which we couldn't pick up an IMDB ID from TVMaze (which generally has better search matching than OMDB). To do this, we need to be careful about request timing. I've written a retry wrapper and a request function that sleep and try again when a request fails (for example, when it times out due to server overload), so the dataset does not need to be gathered piecemeal.

In [1]:
import pandas as pd
import numpy as np
import requests
import time
import re
from functools import wraps
import yaml
shows = pd.read_pickle("ismyshowcancelled_final.pkl")
tvmaze = pd.read_csv("tvmaze_tmp_3.csv",index_col=0)
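# safe_load avoids executing arbitrary YAML tags; the context manager closes the file
with open('omdb.yaml') as f:
    auth = yaml.safe_load(f)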
key = auth['OMDB']['Key']

There are about 170 shows that don't have an IMDB ID according to TVMaze. I suspect, though, that this is a shortcoming in their database and not an actual indication that IMDB has no record of these shows. Let's first get a list of the ones we want to search for.

In [17]:
# This dataframe is going to house the shows with missing IMDB IDs, 
# as well as the IDs that we eventually (hopefully) will return
missing_imdb = pd.DataFrame(tvmaze[tvmaze['imdb'].isnull()]['name'])
# None gives object-dtype columns, so string IDs and titles can be written in later
missing_imdb['found_id'] = None
missing_imdb['found_name'] = None
missing_imdb.head()

Unnamed: 0,name,found_id,found_name
301,Crash Course Engineering,,
625,Ink Master: Redemption,,


In [16]:
# A wrapper to retry the request function many times, with a sleep in between
def retry_multi(max_retries):
    """ Retry a function `max_retries` times. """
    def retry(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            num_retries = 0 
            while num_retries <= max_retries:
                try:
                    ret = func(*args, **kwargs)
                    break
                except Exception:  # unlike a bare except, this still lets Ctrl-C stop the loop
                    if num_retries == max_retries:
                        raise
                    num_retries += 1
                    time.sleep(15)
            return ret 
        return wrapper
    return retry
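
A wrapper like this is easy to sanity-check without touching the network. The sketch below is not the notebook's wrapper: `retry_with_delay` is a hypothetical variant that takes the sleep length as a parameter (so the check can run with no waiting), and `flaky_lookup` is a made-up stand-in for a request that fails twice before succeeding.

```python
import time
from functools import wraps


def retry_with_delay(max_retries, delay):
    """Hypothetical variant of retry_multi with a configurable sleep (seconds)."""
    def retry(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # One initial attempt plus up to max_retries retries
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    # Out of retries: re-raise the last error to the caller
                    if attempt == max_retries:
                        raise
                    time.sleep(delay)
        return wrapper
    return retry


calls = {"count": 0}


@retry_with_delay(max_retries=3, delay=0)
def flaky_lookup(show):
    """Stand-in for a network call: fails twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated overload")
    return "found " + show


result = flaky_lookup("Cardinal")
print(result, "after", calls["count"], "attempts")
```

Returning directly from inside the `try` removes the need for a separate `ret` variable and `break`; the behaviour is otherwise meant to mirror the wrapper above.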

In [17]:
# Building a requester for the IMDB IDs from OMDB (note that this differs from our tvmaze equivalent function)
@retry_multi(10)
def request_info(show):
    
    # proper formatting
    show = '-'.join(show.lower().split()).replace(':','')
    
    # define params for json request
    params = {'apikey':key,'t':show,'type':'series'}

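    # send request via requests; the timeout makes a stalled connection raise
    # (and trigger a retry) instead of hanging indefinitely
    response = requests.get('http://www.omdbapi.com/',params=params,timeout=30)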
    
    return response
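
To see what the title formatting does before any request goes out, the query string can be built offline. This is only a sketch: `format_title` and `build_query` are hypothetical helper names, `DEMO_KEY` is a placeholder rather than a real key, and `urllib.parse.urlencode` stands in for the encoding that `requests` performs on `params` (the two can differ in small details such as how spaces are encoded).

```python
from urllib.parse import urlencode


def format_title(show):
    """Same normalisation as request_info: lowercase, hyphen-join, drop colons."""
    return '-'.join(show.lower().split()).replace(':', '')


def build_query(show, api_key):
    """Build the OMDB title-search URL without sending it."""
    params = {'apikey': api_key, 't': format_title(show), 'type': 'series'}
    return 'http://www.omdbapi.com/?' + urlencode(params)


url = build_query('Ink Master: Redemption', 'DEMO_KEY')
print(url)
```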

In [18]:
# Building functions to extract just the IMDB IDs and names (just to cross-check that the result wasn't erroneous)
# Later, we'll return to scrape all the information we want once we have a dataframe filled with necessary IDs
def get_json(response):
    
    # Parse the body once; OMDB signals success with the string field 'Response'.
    # Anything other than 'True' (including a missing field) counts as a failure,
    # so `success` is always defined.
    data = response.json()
    success = data.get('Response') == 'True'
        
    # Return a dict indicating success/failure of request, with json
    return {'json':data, 'success':success}

In [19]:
def extract_imdb(json_response):
    
    # get the id, if successful pull (implicitly returns None otherwise)
    if json_response['success']:
        return json_response['json']['imdbID']

def extract_title(json_response):
    
    # get the name, if successful pull. this is just for cross-checking and we'll delete it later
    if json_response['success']:
        return json_response['json']['Title']
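
The parse-then-extract flow can be exercised on hand-written payloads instead of live responses. In this sketch, `parse_omdb` and `extract_field` are hypothetical names, and the two dictionaries are illustrative stand-ins that only assume the `Response`, `Title` and `imdbID` fields used above; the error text is made up.

```python
def parse_omdb(payload):
    """Hypothetical helper: takes the already-decoded JSON dict."""
    success = payload.get("Response") == "True"
    return {"json": payload, "success": success}


def extract_field(parsed, field):
    """Return the requested field on success, otherwise None."""
    if parsed["success"]:
        return parsed["json"].get(field)
    return None


# Illustrative payloads, not captured API output
hit = parse_omdb({"Response": "True", "Title": "The Orville", "imdbID": "tt5691552"})
miss = parse_omdb({"Response": "False", "Error": "placeholder error text"})

print(extract_field(hit, "imdbID"), extract_field(miss, "imdbID"))
```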

In [20]:
# Now writing a loop to ping for requests!
for show,index in zip(missing_imdb.name,missing_imdb.index):
    
    # Request the show from OMDB
    r = request_info(show)
    json_response = get_json(r)
    
    # Getting imdb id and title (will not collect anything if unsuccessful json response)
    imdb = extract_imdb(json_response)
    title = extract_title(json_response)
    
    # Doing a double check (also in function) that we got a good response, then inputting info back into our df
    if json_response['success'] is True:
        
        # Letting us know what happened
        print('Match for {}: {}. ID is {}'.format(show,title,imdb))
        
        # Adding this result back to our dataframe (index labels are unique, so .loc[index, col] is enough)
        missing_imdb.loc[index,'found_name'] = title
        missing_imdb.loc[index,'found_id'] = imdb
        
    else:
        print('No match found for {}'.format(show))
    
    time.sleep(2)

Match for 500 Questions: 500 Questions. ID is tt4591316
Match for All My Children: All My Children. ID is tt0065272
Match for American Grit: American Grit. ID is tt5546352
Match for American Vandal: American Vandal. ID is tt6877772
Match for Battle of the Network Stars: Battle of the Network Stars. ID is tt7115526
Match for Beat Shazam: Beat Shazam. ID is tt6917254
Match for Bellevue: Bellevue. ID is tt6082618
No match found for Ben and Lauren: Happily Ever After
No match found for Bethenny and Fredrik
Match for Better Late Than Never: Better Late Than Never. ID is tt5020352
Match for Big Star Little Star: Big Star Little Star. ID is tt7174362
Match for Boy Band: Shut Up Flower Boy Band. ID is tt2942950
No match found for Brandi & Jarrod: Married to the Job
Match for Breaking Boston: Breaking Boston. ID is tt3600966
Match for The Capture: The Capture of the Green River Killer. ID is tt1100911
Match for Cardinal: Cardinal. ID is tt5583512
Match for Celebrity Big Brother: Celebrity Big B

Match for The Million Second Quiz: The Million Second Quiz. ID is tt3038492
Match for The Orville: The Orville. ID is tt5691552
Match for The Outpost: Star Trek: The Romulan Wars - The Outpost. ID is tt5331942
Match for The Queen Latifah Show: The Queen Latifah Show. ID is tt2099467
Match for The Ricki Lake Show: The Ricki Lake Show. ID is tt1908423
Match for The Tick: The Tick. ID is tt5540054
Match for The Toy Box: The Toy Box. ID is tt6392176
Match for The Wall: Against the Wall. ID is tt1836237
Match for The Winner Is...: The Winner Is. ID is tt2341735
Match for There's... Johnny!: There's... Johnny!. ID is tt6483198
Match for This is Not Happening: This Is Not Happening. ID is tt3528306
Match for Three Rivers: Three Rivers. ID is tt1440346
Match for Too Close to Home: Too Close to Home. ID is tt5596646
Match for Tori & Dean: sTORIbook Weddings: Tori & Dean: Storibook Weddings. ID is tt1705812
Match for Tracks: Tracks. ID is tt0310534
Match for Truth & Iliza: Truth & Iliza. ID is t

In [21]:
missing_imdb.tail(30)

Unnamed: 0,name,found_id,found_name
1305,The Hasselhoffs,tt1705811,The Hasselhoffs
1308,The Insider,tt0430836,The Insider
1311,The Jeff Probst Show,tt1978332,The Jeff Probst Show
1313,The Jim Jefferies Show,tt6987966,The Jim Jefferies Show
1324,The Last Tycoon,tt3390892,The Last Tycoon
1327,The Letter,tt0285289,The Letter People
1331,The Long Road Home,tt1210820,The Long Road Home
1348,The Million Second Quiz,tt3038492,The Million Second Quiz
1370,The Orville,tt5691552,The Orville
1371,The Outpost,tt5331942,Star Trek: The Romulan Wars - The Outpost


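Success! We got a good number of matches, and the leftovers are few enough that I can check the remainder by hand and write those into the CSV. Most of the matches look accurate, though a few returned titles differ from the query (for example, "The Outpost" came back as "Star Trek: The Romulan Wars - The Outpost"), so the found_name column is worth a scan before trusting every ID.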
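
Since found_name exists for cross-checking, one rough way to use it is to compare normalised titles and flag the pairs that disagree. The sketch below uses a handful of (query, returned title) pairs copied from the output above; `normalize` is a hypothetical helper, and stripping everything except letters and digits is just one possible definition of "same title".

```python
import re


def normalize(title):
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r'[^a-z0-9]', '', title.lower())


# (query, title returned by OMDB) pairs taken from the loop output
pairs = [
    ('The Orville', 'The Orville'),
    ('The Outpost', 'Star Trek: The Romulan Wars - The Outpost'),
    ('The Wall', 'Against the Wall'),
    ('The Winner Is...', 'The Winner Is'),
    ('This is Not Happening', 'This Is Not Happening'),
    ('Boy Band', 'Shut Up Flower Boy Band'),
]

# Keep only the pairs whose normalised titles differ
suspect = [(query, found) for query, found in pairs
           if normalize(query) != normalize(found)]

for query, found in suspect:
    print('Check by hand: {} -> {}'.format(query, found))
```

Differences in case or punctuation alone are not flagged, so only the pairs that look like different shows are left for manual review.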

For now, we'll add these IMDB IDs back into the original dataframe and go from there!

In [22]:
for show in missing_imdb['name']:
    tvmaze.loc[tvmaze['name'] == show,'imdb'] = missing_imdb.loc[missing_imdb['name'] == show,'found_id']
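
Because missing_imdb was sliced out of tvmaze, the two frames share index labels, so the same write-back can also be done in one index-aligned assignment rather than a name-by-name loop. This is a sketch on toy frames (`tvmaze_demo` and `found` are made-up stand-ins with a few rows borrowed from the tables in this notebook), not a drop-in replacement tested on the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for tvmaze and missing_imdb, sharing index labels
tvmaze_demo = pd.DataFrame(
    {"name": ["Caprica", "The Hasselhoffs", "Vegas Cakes"],
     "imdb": ["tt0799862", np.nan, np.nan]},
    index=[234, 1305, 1509],
)
found = pd.DataFrame(
    {"name": ["The Hasselhoffs", "Vegas Cakes"],
     "found_id": ["tt1705811", np.nan]},
    index=[1305, 1509],
)

# Keep only the rows where the lookup actually returned an ID
hits = found["found_id"].dropna()

# Index-aligned write-back: rows are matched by label, not by show name
tvmaze_demo.loc[hits.index, "imdb"] = hits

print(tvmaze_demo)
```

Matching on the index also sidesteps any ambiguity if two shows happen to share a name.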

Doing some spot checking...

In [23]:
tvmaze.loc[tvmaze['name'] == 'The Hasselhoffs','imdb']

1305    tt1705811
Name: imdb, dtype: object

In [24]:
tvmaze.loc[tvmaze['name'] == 'The Outpost']

Unnamed: 0,name,tv_id,imdb,prem_date,rating,runtime,ep_day
1371,The Outpost,34419.0,tt5331942,2018-07-10,,60.0,Tuesday


In [25]:
tvmaze.isnull().sum()

name           5
tv_id          5
imdb           8
prem_date     12
rating       466
runtime       13
ep_day         5
dtype: int64

In [26]:
tvmaze[tvmaze['imdb'].isnull()]

Unnamed: 0,name,tv_id,imdb,prem_date,rating,runtime,ep_day
141,Ben and Lauren: Happily Ever After,20882.0,,2016-10-11,,60.0,Tuesday
151,Bethenny and Fredrik,27702.0,,2018-02-06,,30.0,Tuesday
204,Brandi & Jarrod: Married to the Job,15352.0,,2014-08-12,,30.0,Tuesday
301,Crash Course Engineering,36786.0,,,,,Multiple
625,Ink Master: Redemption,3097.0,,2015-09-08,,30.0,Tuesday
808,"Me, Myself & I",20660.0,,2017-09-25,6.5,30.0,Monday
872,Nate and Jeremiah by Design,25890.0,,2017-04-08,,60.0,Saturday
1509,Vegas Cakes,33303.0,,2017-11-05,,30.0,Multiple


Awesome, we've greatly reduced the number of missing IDs. Let's save this to CSV and then I'll fill in the remainder myself.

In [28]:
tvmaze.to_csv("tvmaze_tmp_4_more_ids.csv")

I found matches for all but 2: Ink Master: Redemption (which IMDB treats as the same show as Ink Master, which we already have in our dataset), and Crash Course Engineering (this is a YouTube series). We'll probably delete both once we get into the analysis and modeling phases.

In [61]:
tvmaze[tvmaze['tvmaze_name']=='Caprica']

Unnamed: 0,tvmaze_name,tvmaze_tv_id,imdb_id,tvmaze_prem_date,tvmaze_rating,tvmaze_runtime,tvmaze_ep_day
234,Caprica,433.0,tt0799862,2010-01-22,7.4,60.0,Tuesday


In [4]:
tvmaze = pd.read_csv("tvmaze_tmp_5_more_ids_MORE ADDITIONS.csv",index_col=0)

In [5]:
tvmaze.isnull().sum()

name           5
tv_id          6
imdb           2
prem_date     12
rating       466
runtime       13
ep_day         5
dtype: int64

Awesome! Only one thing remains for now. Let's merge this with our shows df. Luckily, they rely on the exact same indices.

In [6]:
# Prefix every column so the source is obvious after the merge
l = ['tvmaze_'+col for col in tvmaze.columns]

In [7]:
l

['tvmaze_name',
 'tvmaze_tv_id',
 'tvmaze_imdb',
 'tvmaze_prem_date',
 'tvmaze_rating',
 'tvmaze_runtime',
 'tvmaze_ep_day']

In [8]:
l[2] = 'imdb_id'

In [9]:
l

['tvmaze_name',
 'tvmaze_tv_id',
 'imdb_id',
 'tvmaze_prem_date',
 'tvmaze_rating',
 'tvmaze_runtime',
 'tvmaze_ep_day']

In [10]:
tvmaze.columns = l
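
The prefix-then-patch steps above can also be written as one chained expression with pandas' `add_prefix` and `rename`. The sketch uses a one-row toy frame (`demo`) with a subset of the real columns, so it only illustrates the renaming.

```python
import pandas as pd

# One-row toy frame with a subset of the tvmaze columns
demo = pd.DataFrame(
    {"name": ["Caprica"], "tv_id": [433.0], "imdb": ["tt0799862"]},
    index=[234],
)

# Prefix every column, then give the IMDB column its final name
renamed = demo.add_prefix("tvmaze_").rename(columns={"tvmaze_imdb": "imdb_id"})

print(list(renamed.columns))
```

Renaming by label rather than by position (`l[2]`) keeps working if the column order ever changes.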

In [11]:
tvmaze.head()

Unnamed: 0,tvmaze_name,tvmaze_tv_id,imdb_id,tvmaze_prem_date,tvmaze_rating,tvmaze_runtime,tvmaze_ep_day
0,$#*! My Dad Says,1986.0,tt1612578,9/23/2010,6.2,30.0,Thursday
1,100 Code,3953.0,tt3515512,3/11/2015,8.1,60.0,Wednesday
2,101 Ways to Leave a Gameshow,12166.0,tt1702030,7/10/2010,,60.0,Saturday
3,12 Monkeys,614.0,tt3148266,1/16/2015,7.9,60.0,Friday
4,13 Reasons Why,7194.0,tt1837492,3/31/2017,8.2,60.0,Friday


In [12]:
tvmaze.isnull().sum()

tvmaze_name           5
tvmaze_tv_id          6
imdb_id               2
tvmaze_prem_date     12
tvmaze_rating       466
tvmaze_runtime       13
tvmaze_ep_day         5
dtype: int64

In [13]:
tvmaze.iloc[234:]

Unnamed: 0,tvmaze_name,tvmaze_tv_id,imdb_id,tvmaze_prem_date,tvmaze_rating,tvmaze_runtime,tvmaze_ep_day
234,Caprica,433.0,tt0799862,1/22/2010,7.4,60.0,Tuesday
235,The Capture,35512.0,tt1100911,,,60.0,Multiple
236,Cardinal,18613.0,tt5583512,1/25/2017,7.9,60.0,Thursday
237,Cash Cab,8327.0,tt0465321,6/13/2005,,30.0,Monday
238,Castle,68.0,tt1219024,3/9/2009,8.4,60.0,Monday
239,Castlevania,25242.0,tt6517102,7/7/2017,7.1,30.0,Friday
240,Casual,3036.0,tt4577466,10/7/2015,7.4,30.0,Tuesday
241,Catastrophe,3004.0,tt4374208,1/19/2015,8.2,35.0,Tuesday
242,Catch a Contractor,2287.0,tt3108438,3/9/2014,,60.0,Sunday
243,Catfish: The TV Show,997.0,tt2498968,11/12/2012,7.3,60.0,Wednesday


In [14]:
shows = shows.join(tvmaze)
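
The join relies on both frames sharing the same index, so it can be worth confirming that assumption explicitly rather than trusting it. A minimal sketch on toy frames (`shows_demo` and `tvmaze_demo` are made-up stand-ins with two rows taken from the tables above):

```python
import pandas as pd

shows_demo = pd.DataFrame(
    {"title": ["12 Monkeys", "13 Reasons Why"]},
    index=[3, 4],
)
tvmaze_demo = pd.DataFrame(
    {"tvmaze_name": ["12 Monkeys", "13 Reasons Why"],
     "imdb_id": ["tt3148266", "tt1837492"]},
    index=[3, 4],
)

# Fail loudly if the two indices are not identical
assert shows_demo.index.equals(tvmaze_demo.index)

# DataFrame.join aligns rows on the index by default
joined = shows_demo.join(tvmaze_demo)

# Rows where the two sources disagree on the show name deserve a second look
name_mismatch = joined[joined["title"] != joined["tvmaze_name"]]
print(joined.shape, len(name_mismatch))
```

On the real data, rows with a missing tvmaze_name would also show up in `name_mismatch`, which is a reasonable set to inspect by hand.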

In [15]:
shows.head()

Unnamed: 0,genre,link,network,status,tagline,title,years,start_year,end_year,synopsis,...,Horror,Legal,Medical,tvmaze_name,tvmaze_tv_id,imdb_id,tvmaze_prem_date,tvmaze_rating,tvmaze_runtime,tvmaze_ep_day
0,Comedy,http://www.ismyshowcancelled.com/show/2010/ble...,CBS,Cancelled,"A sitcom based on the Twitter feed ""S*** My Da...",$#*! My Dad Says,2010 - 2011,2010,2011.0,Ed is an opinionated and divorced 72-year-old ...,...,0,0,0,$#*! My Dad Says,1986.0,tt1612578,9/23/2010,6.2,30.0,Thursday
1,Drama / Crime,http://www.ismyshowcancelled.com/show/2018/100...,WGN America,Coming Soon,A thriller following an New York cop who trave...,100 Code,2018 - Present,2018,,NYPD Detective Tommy Conley travels to Sweden ...,...,0,0,0,100 Code,3953.0,tt3515512,3/11/2015,8.1,60.0,Wednesday
2,Game Show,http://www.ismyshowcancelled.com/show/2011/101...,ABC,Cancelled,A game show competition where contestants are ...,101 Ways to Leave a Gameshow,2011 - 2011,2011,2011.0,"In 101 Ways to Leave a Game Show, contestants ...",...,0,0,0,101 Ways to Leave a Gameshow,12166.0,tt1702030,7/10/2010,,60.0,Saturday
3,Drama / Sci-fi,http://www.ismyshowcancelled.com/show/2015/12-...,Syfy,On Air,A drama following a man sent back in time to p...,12 Monkeys,2015 - Present,2015,,"By the year 2043, a deadly virus has wiped out...",...,0,0,0,12 Monkeys,614.0,tt3148266,1/16/2015,7.9,60.0,Friday
4,Drama,http://www.ismyshowcancelled.com/show/2017/13-...,Netflix,On Air,A drama following the revelation of why a youn...,13 Reasons Why,2017 - Present,2017,,Hannah Baker is a teenage girl who takes her o...,...,0,0,0,13 Reasons Why,7194.0,tt1837492,3/31/2017,8.2,60.0,Friday


Saving to pkl and csv...

In [16]:
shows.to_csv("full_shows_tmp_1_tvmaze_plus_imdbids.csv")

In [17]:
shows.to_pickle("full_shows_tmp_1_tvmaze_plus_imdbids.pkl")

In [18]:
shows.shape

(1575, 32)

In [19]:
shows.isnull().sum()

genre                  0
link                   0
network                0
status                 0
tagline                0
title                  0
years                  0
start_year             0
end_year             595
synopsis               0
primary_genre          0
secondary_genre     1070
Comedy                 0
Drama                  0
Game Show              0
Reality                0
Sci-fi                 0
Talk                   0
Crime                  0
Action                 0
Fantasy                0
Animated               0
Horror                 0
Legal                  0
Medical                0
tvmaze_name            5
tvmaze_tv_id           6
imdb_id                2
tvmaze_prem_date      12
tvmaze_rating        466
tvmaze_runtime        13
tvmaze_ep_day          5
dtype: int64