Now that we've got IMDB IDs for nearly every show in our original scrape, let's pull some additional info from IMDB that TVmaze couldn't provide. This includes actors, writers, and the IMDB score, which is likely more meaningful than the TVmaze rating given IMDB's larger user base. We'll again use the custom retry wrapper and request function to avoid overloading the server and breaking the pull.

In [8]:
import pandas as pd
import numpy as np
import requests
import time
import re
import yaml
import scrapy
import pickle
from functools import wraps
shows = pd.read_pickle("full_shows_tmp_1_tvmaze_plus_imdbids.pkl")
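# safe_load avoids the Loader argument that newer PyYAML versions require for yaml.load
with open('omdb.yaml') as f:
    auth = yaml.safe_load(f)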
key = auth['OMDB']['Key']

In [9]:
# A wrapper to retry the request function many times, with a sleep in between
def retry_multi(max_retries):
    """ Retry a function `max_retries` times. """
    def retry(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            num_retries = 0 
            while num_retries <= max_retries:
                try:
                    ret = func(*args, **kwargs)
                    break
                # Catch Exception (not a bare except) so Ctrl-C can still stop the loop
                except Exception:
                    if num_retries == max_retries:
                        raise
                    num_retries += 1
                    time.sleep(15)
            return ret 
        return wrapper
    return retry
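
To make the wrapper's behaviour concrete, here is a minimal sketch of the same retry pattern applied to a function that fails twice before succeeding. The names `retry_with_pause` and `flaky` are hypothetical, and the pause is set to zero so the demo runs instantly; the cell above sleeps 15 seconds between attempts.

```python
import time
from functools import wraps

def retry_with_pause(max_retries, pause=0.0):
    """Hypothetical variant of retry_multi with a configurable pause."""
    def retry(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    # Out of retries: re-raise the last error to the caller
                    if attempt == max_retries:
                        raise
                    time.sleep(pause)
        return wrapper
    return retry

calls = {'count': 0}

@retry_with_pause(max_retries=3, pause=0.0)
def flaky():
    # Simulated request: fails on the first two calls, then succeeds
    calls['count'] += 1
    if calls['count'] < 3:
        raise ConnectionError('simulated network failure')
    return 'ok'

result = flaky()
print(result, calls['count'])
```

The caller only sees the final result; the two failed attempts are absorbed by the decorator.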

In [10]:
# Bringing in our requester from the previous step (OMDB Initial Supplementation)
# The ratelimit decorators are left commented out; retry_multi handles the retries
#@sleep_and_retry
#@limits(calls=10,period=10)
@retry_multi(10)
def request_info(imdb_id):
    
    # define params for json request
    params = {'apikey':key,'i':imdb_id,'plot':'full'}

    # send request via requests; a timeout or connection error propagates
    # to retry_multi, which sleeps and then tries again
    response = requests.get('http://www.omdbapi.com/?',params=params,timeout=10)
    time.sleep(0.25)
    
    # a non-200 status also raises, so it gets retried the same way
    if response.status_code != 200:
        raise Exception('API response: {}'.format(response.status_code))
    
    return response

In [11]:
def get_json(response):
    
    # Parse the body once instead of calling response.json() repeatedly
    data = response.json()
    
    # The search was successful only if 'Response' is the string 'True';
    # anything else (including a missing key) counts as a failure
    success = data.get('Response') == 'True'
        
    # Return a dict indicating success/failure of request, with json
    return {'json':data,'success':success}
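
The check above relies on the API reporting success through a string-valued `'Response'` field rather than a boolean. A minimal sketch of that behaviour, using a hypothetical `FakeResponse` stand-in so no network call is needed (the payloads are illustrative, not real API output):

```python
class FakeResponse:
    """Hypothetical stand-in for requests.Response, for illustration only."""
    def __init__(self, payload):
        self._payload = payload

    def json(self):
        return self._payload

def parse_response(response):
    # Parse the body once and treat anything other than 'True' as a failure
    payload = response.json()
    return {'json': payload, 'success': payload.get('Response') == 'True'}

# Illustrative payloads: one hit, one miss
found = parse_response(FakeResponse({'Title': '12 Monkeys', 'Response': 'True'}))
missing = parse_response(FakeResponse({'Response': 'False', 'Error': 'illustrative error message'}))
print(found['success'], missing['success'])
```

Comparing against the string `'True'` matters here: a non-empty string such as `'False'` is truthy in Python, so a plain truthiness test would misreport failures as successes.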

Looping through our results...

In [12]:
# This will be a list of the JSONs we receive
feed = []

In [13]:
for show in shows['imdb_id']:
    r = request_info(show)
    # 'result' rather than 'json', to avoid shadowing the standard-library module name
    result = get_json(r)
    if result['success']:
        print('Success: {}'.format(result['json']['Title']))
        feed.append(result['json'])
    else:
        print('Request worked, but show not found.')

Success: $#*! My Dad Says
Success: 100 Code
Success: 101 Ways to Leave a Gameshow
Success: 12 Monkeys
Success: 13 Reasons Why
Success: 16 and Pregnant
Success: 1600 Penn
Success: 18 to Life
Success: 17 Kids and Counting
Success: 2 Broke Girls
Success: 24
Success: 24: Legacy
Success: 24: Live Another Day
Success: 3%
Success: 30 Rock
Success: 4th and Loud
Success: 500 Questions
Success: 60 Days In
Success: 666 Park Avenue
Success: 7 Little Johnstons
Success: 9-1-1
Success: 90210
Success: 9JKL
Success: Midnight, Texas
Success: A Gifted Man
Success: A Series of Unfortunate Events
Success: A to Z
Success: A Wicked Offer
Success: A.D. The Bible Continues
Success: A.P. Bio
Success: Abby's Studio Rescue
Success: About a Boy
Success: Absentia
Success: Accidentally on Purpose
Success: According to Jim
Success: Ace of Cakes
Success: Adam Devine's House Party
Success: Adventure Time
Success: After Lately
Success: Aftermath
Success: Against the Wall
Success: Agent X
Success: Alaska State Troopers
...

Success: Dear White People
Success: Death Valley
Success: Deception
Success: Deception
Success: Defiance
Success: Defying Gravity
Success: Designated Survivor
Success: Desperate Housewives
Success: Destination Truth
Success: Detroit Steel
Success: Devious Maids
Success: Dexter
Success: Dice
Success: Dietland
Success: Difficult People
Success: Dig
Success: Dirk Gently's Holistic Detective Agency
Success: Dirty Jobs
Success: Dirty Sexy Money
Success: Disjointed
Success: Divorce
Success: Do No Harm
Success: Documentary Now!
Success: Does Someone Have to Go?
Success: Dog the Bounty Hunter
Success: Dog with a Blog
Success: Dogs in the City
Success: Dollhouse
Success: Dominion
Success: Don't Trust the B---- in Apartment 23
Success: Donny!
Success: Doubt
Success: Downton Abbey
Success: Downward Dog
Success: Dr. Ken
Success: Dracula
Success: Drop Dead Diva
Success: Drop the Mic
Success: Drunk History
Success: Duck Dynasty
Success: Duets
Success: Dynasty
Success: Eagleheart
Success: East Los Hi

Success: Law & Order True Crime
Success: Law & Order: Criminal Intent
Success: Law & Order: LA
Success: Law & Order: Special Victims Unit
Success: Leah Remini: It's All Relative
Success: Leah Remini: Scientology and the Aftermath
Success: Legend of the Seeker
Success: Legends
Success: Legion
Success: Legit
Success: Let's Get Physical
Success: Let's Stay Together
Success: Lethal Weapon
Success: Leverage
Success: Liar
Success: Lie to Me
Success: Life
Success: Life in Pieces
Success: Life Sentence
Success: Life Unexpected
Success: Lights Out
Success: Lilyhammer
Success: Limitless
Success: Lip Sync Battle
Success: Lipstick Jungle
Success: Little Big Shots
Success: Little People, Big World
Success: Little Women: LA
Success: Little Women: NY
Success: Liv and Maddie
Success: Live to Dance
Success: Living Biblically
Success: Lizard Lick Towing
Success: Loaded
Success: Lone Star Law
Success: Long Island Medium
Success: Longmire
Success: Looking
Success: *Loosely Exactly Nicole
Success: Lopez
...

Success: Say Yes to the Dress: Bridesmaids
Success: Scandal
Success: Scare Tactics
Success: Schitt's Creek
Success: School of Rock
Success: Scorpion
Success: Scoundrels
Success: Scream Queens
Success: Scream Queens
Success: Scrubs
Success: SEAL Team
Success: Sean Saves the World
Success: Search Party
Success: Second Chance
Success: Secret Millionaire
Success: Secrets and Lies
Success: See Dad Run
Success: Seed
Success: Selfie
Success: Selling New York
Success: Sense8
Success: Seven Seconds
Success: Sex&Drugs&Rock&Roll
Success: Sexy Beasts
Success: Shades of Blue
Success: Shadowhunters: The Mortal Instruments
Success: Shahs of Sunset
Success: Shake It Up
Success: Shameless
Success: Shaq vs
Success: Shark Tank Australia
Success: Sharp Objects
Success: She's Gotta Have It
Success: Shedding for the Wedding
Success: Sherlock
Success: Shooter
Success: Shots Fired
Success: Showtime at the Apollo
Success: Shut Eye
Success: Siberia
Success: Siesta Key
Success: Significant Mother
Success: Silico

Success: The Real Housewives of Atlanta
Success: The Real Housewives of Beverly Hills
Success: The Real Housewives of D.C.
Success: The Real Housewives of Dallas
Success: The Real Housewives of Miami
Success: The Real Housewives of New Jersey
Success: The Real Housewives of New York City
Success: The Real Housewives of Orange County
Success: The Real Housewives of Potomac
Success: The Real L Word: Los Angeles
Success: The Real O'Neals
Success: Real World
Success: The Red Road
Success: The Resident
Success: The Returned
Success: The Ricki Lake Show
Success: The Ricky Gervais Show
Success: The River
Success: The Royals
Success: The Sarah Silverman Program.
Success: The Secret Circle
Success: The Secret Life of the American Teenager
Success: The Shannara Chronicles
Success: The Simpsons
Success: The Sing-Off
Success: The Sinner
Success: The Sisterhood
Success: The Slap
Success: The Son
Success: The Sopranos
Success: The Soul Man
Success: The Soup
Success: The Strain
Success: The Taste
...

In [14]:
len(feed)

1573

Hooray! Let's save the list for safekeeping.

In [15]:
print('Checkpoint: {} items'.format(len(feed)))
with open('feed_tmp_3.pkl', 'wb') as f:
    pickle.dump(feed, f)

Checkpoint: 1573 items


Now let's go ahead and try to wrangle this data into a dataframe!

In [16]:
imdb_data = pd.DataFrame(feed)
imdb_data.head()

Unnamed: 0,Actors,Awards,BoxOffice,Country,DVD,Director,Genre,Language,Metascore,Plot,...,Runtime,Title,Type,Website,Writer,Year,imdbID,imdbRating,imdbVotes,totalSeasons
0,"William Shatner, Jonathan Sadowski, Nicole Sul...",1 win.,,USA,,,Comedy,English,,This show is about Ed Goodson a very old fashi...,...,30 min,$#*! My Dad Says,series,,"Justin Halpern, David Kohan, Max Mutchnick, Pa...",2010–2011,tt1612578,6.3,4647,1.0
1,"Michael Nyqvist, Dominic Monaghan, Felice Jank...",1 nomination.,,"Sweden, Germany",,,Crime,"Swedish, English",,"New York, USA. Stockholm, Sweden. Over the pas...",...,60 min,100 Code,series,,Bobby Moresco,2015–,tt3515512,7.4,2101,1.0
2,"Steve Jones, Nemone",,,"UK, Argentina",,,Game-Show,English,,People must answere questions correctly and if...,...,,101 Ways to Leave a Gameshow,series,,,2010–,tt1702030,5.7,48,
3,"Aaron Stanford, Amanda Schull, Barbara Sukowa,...",4 wins & 8 nominations.,,USA,,,"Adventure, Drama, Mystery",English,,Follows the journey of a time traveler from th...,...,42 min,12 Monkeys,series,,"Travis Fickett, Terry Matalas",2015–,tt3148266,7.6,29731,4.0
4,"Nina Cheek, Michael Sadler, Cassie Hendry, Ke'...",Nominated for 1 Golden Globe. Another 2 wins &...,,USA,,,"Drama, Mystery",English,,"Thirteen Reasons Why, based on the best-sellin...",...,60 min,13 Reasons Why,series,,Brian Yorkey,2017–,tt1837492,8.3,163320,2.0


In [19]:
# Renaming columns...
newcols = []
for col in imdb_data.columns:
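    if col.startswith('imdb'):
        # str.strip('imdb') strips the characters i/m/d/b from both ends,
        # not the prefix, so slice the prefix off instead
        col = col[len('imdb'):]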
    col = 'imdb_'+col
    col = col.lower()
    newcols.append(col)
    
imdb_data.columns = newcols
print(imdb_data.columns)

Index(['imdb_actors', 'imdb_awards', 'imdb_boxoffice', 'imdb_country',
       'imdb_dvd', 'imdb_director', 'imdb_genre', 'imdb_language',
       'imdb_metascore', 'imdb_plot', 'imdb_poster', 'imdb_production',
       'imdb_rated', 'imdb_ratings', 'imdb_released', 'imdb_response',
       'imdb_runtime', 'imdb_title', 'imdb_type', 'imdb_website',
       'imdb_writer', 'imdb_year', 'imdb_id', 'imdb_rating', 'imdb_votes',
       'imdb_totalseasons'],
      dtype='object')
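
A note on prefix removal in the renaming step: `str.strip` takes a set of characters, not a prefix, and removes them from both ends of the string. It happens to give the right answer for `imdbID`, `imdbRating`, and `imdbVotes`, but a minimal sketch with a made-up column name shows where it would go wrong. Both `drop_prefix` and `imdbSeasonId` are hypothetical and used only for illustration.

```python
def drop_prefix(name, prefix='imdb'):
    # Slice off the prefix only when the name actually starts with it
    return name[len(prefix):] if name.startswith(prefix) else name

# 'imdbSeasonId' is a made-up column name used only to show the difference
stripped = 'imdbSeasonId'.strip('imdb')   # also eats the trailing 'd'
sliced = drop_prefix('imdbSeasonId')
print(stripped, sliced)

# Names without the prefix pass through unchanged
print(drop_prefix('imdbRating'), drop_prefix('Title'))
```

Slicing is used here instead of `str.removeprefix` so the sketch also runs on Python versions older than 3.9.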


In [20]:
shows.columns

Index(['genre', 'link', 'network', 'status', 'tagline', 'title', 'years',
       'start_year', 'end_year', 'synopsis', 'primary_genre',
       'secondary_genre', 'Comedy', 'Drama', 'Game Show', 'Reality', 'Sci-fi',
       'Talk', 'Crime', 'Action', 'Fantasy', 'Animated', 'Horror', 'Legal',
       'Medical', 'tvmaze_name', 'tvmaze_tv_id', 'imdb_id', 'tvmaze_prem_date',
       'tvmaze_rating', 'tvmaze_runtime', 'tvmaze_ep_day'],
      dtype='object')

In [21]:
shows = shows.merge(imdb_data,how='left',on='imdb_id')

In [22]:
shows.columns

Index(['genre', 'link', 'network', 'status', 'tagline', 'title', 'years',
       'start_year', 'end_year', 'synopsis', 'primary_genre',
       'secondary_genre', 'Comedy', 'Drama', 'Game Show', 'Reality', 'Sci-fi',
       'Talk', 'Crime', 'Action', 'Fantasy', 'Animated', 'Horror', 'Legal',
       'Medical', 'tvmaze_name', 'tvmaze_tv_id', 'imdb_id', 'tvmaze_prem_date',
       'tvmaze_rating', 'tvmaze_runtime', 'tvmaze_ep_day', 'imdb_actors',
       'imdb_awards', 'imdb_boxoffice', 'imdb_country', 'imdb_dvd',
       'imdb_director', 'imdb_genre', 'imdb_language', 'imdb_metascore',
       'imdb_plot', 'imdb_poster', 'imdb_production', 'imdb_rated',
       'imdb_ratings', 'imdb_released', 'imdb_response', 'imdb_runtime',
       'imdb_title', 'imdb_type', 'imdb_website', 'imdb_writer', 'imdb_year',
       'imdb_rating', 'imdb_votes', 'imdb_totalseasons'],
      dtype='object')

In [23]:
shows[shows['imdb_id'].isnull()]

Unnamed: 0,genre,link,network,status,tagline,title,years,start_year,end_year,synopsis,...,imdb_response,imdb_runtime,imdb_title,imdb_type,imdb_website,imdb_writer,imdb_year,imdb_rating,imdb_votes,imdb_totalseasons
305,Game Show,http://www.ismyshowcancelled.com/show/2009/cra...,ABC,Cancelled,A reality show where teams get behind the driv...,Crash Course,2009 - 2013,2009,2013.0,"In each episode of Crash Course, five teams of...",...,,,,,,,,,,
632,Reality,http://www.ismyshowcancelled.com/show/2015/ink...,Paramount Network,On Air,A spin-off series offering former Ink Master t...,Ink Master: Redemption,2015 - Present,2015,,What happens to the disgruntled human canvases...,...,,,,,,,,,,


The two rows above are the shows that didn't have good matches on IMDB (as discussed in the OMDB step). They still appear to have merged fine, just with empty IMDB columns.

Let's checkpoint to pickle...

In [24]:
shows.to_pickle("full_shows_tmp_2_tvmaze+imdb.pkl")

In [25]:
shows = pd.read_pickle("full_shows_tmp_2_tvmaze+imdb.pkl")

In [26]:
len(shows)

1603

In [27]:
shows['title'].value_counts()

Deception                            4
The Bridge                           4
Star-Crossed                         2
Face Off                             2
@midnight                            2
Million Dollar Listing New York      2
Still Star-Crossed                   2
Alexa & Katie                        2
Million Dollar Listing               2
The X-Files                          2
The Wall                             2
The X-Files (2016)                   2
Midnight, Texas                      2
Being Mary Jane                      2
Against The Wall                     2
Love & Hip Hop: Atlanta              2
Love & Hip Hop                       2
Pitch                                2
Katie                                2
Mary + Jane                          2
Missing                              2
The Face                             2
The Pitch                            2
Scream                               2
Scream Queens                        2
The Missing              

Looks like we did have a few odd instances where a show appears multiple times. Let's clear those up.
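
One plausible source of the repeats is the left merge itself: when the same `imdb_id` appears more than once on both sides, every pairing becomes its own row. The sketch below reproduces that effect on toy data (the IDs and ratings are made up) and shows how `merge`'s `validate` argument could flag the problem up front.

```python
import pandas as pd

# Toy data: the same (made-up) ID appears twice on each side
left = pd.DataFrame({'title': ['Deception', 'Deception', 'Dexter'],
                     'imdb_id': ['tt0000001', 'tt0000001', 'tt0000002']})
right = pd.DataFrame({'imdb_id': ['tt0000001', 'tt0000001', 'tt0000002'],
                      'imdb_rating': [7.0, 7.0, 8.0]})

# 2 left rows x 2 right rows = 4 'Deception' rows, plus 1 'Dexter' row
merged = left.merge(right, how='left', on='imdb_id')
print(len(left), len(merged))

# validate='many_to_one' raises MergeError when right-hand keys are not unique
try:
    left.merge(right, how='left', on='imdb_id', validate='many_to_one')
    unique_right_keys = True
except pd.errors.MergeError:
    unique_right_keys = False
print(unique_right_keys)
```

Deduplicating the right-hand frame on `imdb_id` before merging would avoid the blow-up; cleaning up afterwards, as done here, also works.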

In [28]:
cleaner = shows.copy()
cleaner['ones'] = 1
shows['title_cumsum'] = cleaner.groupby('title')['ones'].transform(np.cumsum)
shows['title_dupes'] = cleaner.groupby('title')['ones'].transform('count')
# Note: despite their names, the imdbid_* helper columns are also grouped by 'title'
shows['imdbid_dupes'] = cleaner.groupby('title')['ones'].transform('count')
shows['imdbid_cumsum'] = cleaner.groupby('title')['ones'].transform(np.cumsum)

In [29]:
shows = shows[(shows['imdbid_cumsum'] == 1) | (shows['imdbid_cumsum'].isnull())]
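
As an aside, a first-occurrence filter of this kind can also be sketched without helper columns by using `Series.duplicated`. This is an alternative under the assumption that `imdb_id` is the key of interest, not the method used in this notebook, and the toy frame below is made up.

```python
import numpy as np
import pandas as pd

# Toy frame: one repeated ID and two rows with no ID at all
df = pd.DataFrame({'title': ['A', 'A', 'B', 'C', 'D'],
                   'imdb_id': ['tt1', 'tt1', 'tt2', np.nan, np.nan]})

# duplicated() flags every repeat after the first, and it treats NaN keys as
# equal to each other, so rows with a missing ID are kept explicitly
keep = ~df['imdb_id'].duplicated(keep='first') | df['imdb_id'].isnull()
deduped = df[keep]
print(deduped['title'].tolist())
```

The explicit `isnull()` clause plays the same role as the one in the cell above: shows without an ID are distinct shows and should not be collapsed into one row.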

In [30]:
shows.shape

(1573, 61)

Looks like we're back to the right number. Let's drop the helper columns and pickle again. Next up: data cleaning & exploration!

In [31]:
shows = shows.drop(['title_cumsum','title_dupes','imdbid_dupes','imdbid_cumsum'],axis=1)
shows.columns

Index(['genre', 'link', 'network', 'status', 'tagline', 'title', 'years',
       'start_year', 'end_year', 'synopsis', 'primary_genre',
       'secondary_genre', 'Comedy', 'Drama', 'Game Show', 'Reality', 'Sci-fi',
       'Talk', 'Crime', 'Action', 'Fantasy', 'Animated', 'Horror', 'Legal',
       'Medical', 'tvmaze_name', 'tvmaze_tv_id', 'imdb_id', 'tvmaze_prem_date',
       'tvmaze_rating', 'tvmaze_runtime', 'tvmaze_ep_day', 'imdb_actors',
       'imdb_awards', 'imdb_boxoffice', 'imdb_country', 'imdb_dvd',
       'imdb_director', 'imdb_genre', 'imdb_language', 'imdb_metascore',
       'imdb_plot', 'imdb_poster', 'imdb_production', 'imdb_rated',
       'imdb_ratings', 'imdb_released', 'imdb_response', 'imdb_runtime',
       'imdb_title', 'imdb_type', 'imdb_website', 'imdb_writer', 'imdb_year',
       'imdb_rating', 'imdb_votes', 'imdb_totalseasons'],
      dtype='object')

In [32]:
shows.to_pickle("full_shows_tmp_6_tvmaze+imdb.pkl")

In [33]:
shows = pd.read_pickle("full_shows_tmp_6_tvmaze+imdb.pkl")