In [1]:
import requests
import json
import pandas as pd
import re
from bs4 import BeautifulSoup
from datetime import datetime
from timeit import default_timer as timer

___
# Netflix Ratings
## Johnny Barrett
### Feb 2020
This notebook builds a dataframe of all programs available on Netflix in a given region, with their corresponding IMDb and Rotten Tomatoes scores.

___
## 1. Building Netflix DataFrame from uNoGS API

Please note: to use this notebook you need an API key from https://rapidapi.com/unogs/api/unogs, which grants 100 free requests per day. Each request returns at most 100 results (programs), so to get the full list of programs for your region you will need to:
- Head to https://unogs.com 
- Select only the region(s) you are interested in
- Observe how many pages of results there are (there are 100 results per page)
- Set the variable `PAGES` to this number  

The `get_catalogue` function then performs one request per page of 100 results until all results have been collected.
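If you already know the total number of results for your region, the page count and the daily quota check follow from the two limits stated above. The following is a minimal sketch; the helper names `pages_needed` and `fits_daily_quota` are hypothetical, and 5,519 is used only as an example total.

```python
import math

RESULTS_PER_PAGE = 100     # each request returns at most 100 results
DAILY_REQUEST_LIMIT = 100  # free requests per day granted by one API key


def pages_needed(total_results, per_page=RESULTS_PER_PAGE):
    '''Number of requests (pages) needed to collect total_results programs.'''
    return math.ceil(total_results / per_page)


def fits_daily_quota(pages, limit=DAILY_REQUEST_LIMIT):
    '''True if all pages can be fetched within one day's free requests.'''
    return pages <= limit


pages = pages_needed(5519)  # example total, not read from the API
print(pages, fits_daily_quota(pages))  # 56 True
```

The resulting number is what you would assign to `PAGES`.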

In [2]:
RAPIDAPI_KEY = "YOUR_RAPIDAPI_KEY"  # Replace with your own key from RapidAPI
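
As an alternative to pasting the key into the notebook, it could be read from an environment variable so that it never ends up in a shared copy. This is only a sketch: the helper `load_api_key` and the variable name `RAPIDAPI_KEY` are assumptions, not part of the original notebook.

```python
import os


def load_api_key(var_name='RAPIDAPI_KEY'):
    '''Return the API key stored in the environment variable var_name.

    Raises RuntimeError if the variable is missing or blank.
    '''
    key = os.environ.get(var_name, '').strip()
    if not key:
        raise RuntimeError(f'Set the {var_name} environment variable to your RapidAPI key')
    return key
```

With the variable set in your shell, the cell above would become `RAPIDAPI_KEY = load_api_key()`.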

In [3]:
url = "https://unogs-unogs-v1.p.rapidapi.com/aaapi.cgi"
COUNTRY_CODE = '23'  # Australia country code
PAGES = 56  # Number of pages for Australia region
HEADERS = {
    'x-rapidapi-host': "unogs-unogs-v1.p.rapidapi.com",
    'x-rapidapi-key': RAPIDAPI_KEY
    }

In [8]:
def get_catalogue(pages, country_code, headers):
    '''Builds pandas DataFrame of all available Netflix programs in given country_code region.
    Check unogs.com for the appropriate value for variable pages for your desired region.
    '''
    shows_data = []
    for i in range(1, pages+1):
        querystring = {'cl':country_code, 't':'ns', 'st':'adv', 'p': str(i)}
        response = requests.get(url, headers=headers, params=querystring)
        response.raise_for_status()  # Fail early on HTTP errors instead of parsing an error page
        shows_data.extend(response.json()['ITEMS'])
        print('-', end='')
    
    df = pd.DataFrame(shows_data)
    df.columns = ['NetflixID', 'Title', 'Image', 'Synopsis',
                  'IMDbRating', 'Type', 'ReleaseYear', 'Runtime',
                  'LargeImage', 'uNoGSDate', 'IMDbID', 'Download']
    df.set_index('NetflixID', inplace=True)
    
    print('\nCatalogue received successfully')
    
    return df

In [9]:
df = get_catalogue(PAGES, COUNTRY_CODE, HEADERS)

--------------------------------------------------------
Catalogue received successfully


In [10]:
df.shape

(5519, 11)

In [11]:
df.head()

NetflixID,Title,Image,Synopsis,IMDbRating,Type,ReleaseYear,Runtime,LargeImage,uNoGSDate,IMDbID,Download
81212488,ZZ TOP: THAT LITTLE OL&#39; BAND FROM TEXAS,https://occ-0-299-300.1.nflxso.net/dnm/api/v6/...,This documentary delves into the mystique behi...,7.3,movie,2019,1h30m,,2020-03-01,tt9015306,0
81229511,Velvet Colecci&oacute;n: Grand Finale,https://occ-0-2773-2774.1.nflxso.net/dnm/api/v...,"During Christmas 1969, the impending sale of V...",,movie,2020,1h22m,,2020-03-01,,0
81244455,"Pop, Lock &#39;n Roll",https://occ-0-299-300.1.nflxso.net/dnm/api/v6/...,An ambitious hip-hop dancer jeopardizes his ri...,0.0,movie,2017,1h27m,,2020-03-01,tt2336014,0
70019062,Nausica&auml; of the Valley of the Wind,https://occ-0-2773-2774.1.nflxso.net/dnm/api/v...,Facing the destruction of her planet&#39;s nat...,8.1,movie,1984,1h57m,,2020-03-01,tt0087544,0
81237761,Calico Critters: Everyone&#39;s Big Dream Flyi...,https://occ-0-2705-2433.1.nflxso.net/dnm/api/v...,"In the Hazelnut Chipmunk Family, Dominic is a ...",,movie,2019,11m,,2020-02-29,,0


___
## 2. Data pre-processing
Exploring the data and resolving some formatting issues

In [12]:
def fix_punc(df, code, punc):
    '''Replace an HTML entity code with its punctuation mark in Title and Synopsis (in place)'''
    df['Title'] = df['Title'].str.replace(code, punc, regex=False)
    df['Synopsis'] = df['Synopsis'].str.replace(code, punc, regex=False)

In [13]:
fix_punc(df, "&#39;", "'")
fix_punc(df, "&rsquo;", "'")
fix_punc(df, "&ndash;", "-")
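
The sample output above also contains entities such as `&auml;` and `&oacute;`, which these three calls do not cover. One possible alternative, sketched here as an assumption rather than as the notebook's method, is to decode every entity at once with the standard library's `html.unescape`. The helper name `unescape_text` is hypothetical.

```python
import html


def unescape_text(value):
    '''Decode HTML entities in a string; return non-strings (e.g. NaN) unchanged.'''
    if isinstance(value, str):
        return html.unescape(value)
    return value


print(unescape_text('Nausica&auml; of the Valley of the Wind'))
print(unescape_text('ZZ TOP: THAT LITTLE OL&#39; BAND FROM TEXAS'))
```

It could be applied with `df['Title'] = df['Title'].map(unescape_text)` and likewise for `Synopsis`. Note that, unlike `fix_punc`, this decodes `&rsquo;` and `&ndash;` to the typographic characters ’ and – rather than to a plain apostrophe and hyphen.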

In [14]:
# Ensuring all Type == 'movie' are lowercase
df['Type'] = df['Type'].str.replace('Movie', 'movie')

In [15]:
def search(df, title):
    '''Simple title search function'''
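    # Case-insensitive literal match; titles containing regex characters are safe to search
    mask = df['Title'].str.contains(title, case=False, regex=False, na=False)
    return df[mask].sort_values('IMDbRating', ascending=False)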

In [16]:
search(df, 'avatar')  # example

NetflixID,Title,Image,Synopsis,IMDbRating,Type,ReleaseYear,Runtime,LargeImage,uNoGSDate,IMDbID,Download
70142405,Avatar: The Last Airbender,https://occ-0-2717-360.1.nflxso.net/dnm/api/v6...,Siblings Katara and Sokka wake young Aang from...,9.2,series,2005,,http://cdn1.nflximg.net/images/5561/11315561.jpg,2015-04-14,tt0417299,0
81175361,The King's Avatar,https://occ-0-768-769.1.nflxso.net/dnm/api/v6/...,When an elite gamer is forced out of his profe...,0.0,series,2019,,,2019-08-16,tt10736726,0


___
## 3. Scraping Rotten Tomatoes Ratings
Next, we use the Title and Type of each program to build a Rotten Tomatoes URL and attempt to scrape its ratings. Whenever we cannot acquire a rating from RT, we simply continue (leaving a NaN value). This can happen for a range of reasons, for instance:
- Some shows may simply not be listed on Rotten Tomatoes
- Some shows might have a Critic rating but not an Audience rating, or vice versa
- Some movies and shows, predominantly remakes, have the year of release after their name in the URL. The URL builder used here does not account for that
- Special characters in the movie title can affect the URL building function
- The URL building function could be improved, with more focus on edge cases  

First we build the functions we need to search through the HTML of each program's review site:

In [17]:
def get_critic(soup):
    '''If found, return Critic score for given title from Rotten Tomatoes.
    Return None otherwise.'''
    critic_badge = soup.find('div', {'class': 'mop-ratings-wrap__half'})
    try:
        critic_score = int(critic_badge.find('span', {'class': 'mop-ratings-wrap__percentage'}).contents[0].strip().strip('%'))
    except (AttributeError, IndexError, ValueError):  # If score not found or not numeric
        return None
    return critic_score

In [18]:
def get_audience(soup):
    '''If found, return Audience score for given title from Rotten Tomatoes.
    Return None otherwise.'''
    audience_badge = soup.find('div', {'class': 'mop-ratings-wrap__half audience-score'})
    if audience_badge:
        try:
            audience_score = int(audience_badge.find('span', {'class': 'mop-ratings-wrap__percentage'}).contents[0].strip().strip('%'))
        except (AttributeError, IndexError, ValueError):  # If score not found or not numeric
            return None
        return audience_score

And some other useful functions:

In [19]:
def percent_na(df):
    '''View the percentage of missing values per column'''
    return round(df.isnull().mean() * 100, 2)

In [20]:
def process_time(seconds):
    '''Transform seconds into hh:mm:ss'''
    mins, secs = divmod(seconds, 60)
    hrs, mins = divmod(mins, 60)
    return f'{int(hrs):02d}:{int(mins):02d}:{int(secs):02d}'

The `get_rt_scores` function builds a Rotten Tomatoes URL for each title and scrapes the page for its ratings. This can take some time (for the Australia region, roughly 2 hours).

In [21]:
def get_rt_scores(df):
    '''For each title in DataFrame, build Rotten Tomatoes URL,
    attempt to retrieve scores and add to DataFrame.'''
    
    start = timer()
    tomato_base_url = 'https://www.rottentomatoes.com/'
    type_paths = {'movie': 'm/', 'series': 'tv/'}

    for ID in df.index:
        type_path = type_paths.get(df.loc[ID, 'Type'])
        if type_path is None:  # Unknown Type: no URL can be built, scores stay NaN
            continue
        # Drop commas and colons; turn apostrophes and spaces into underscores
        slug = re.sub("[' ]", '_', re.sub('[,:]', '', df.loc[ID, 'Title'])).lower()
        tomato_url = tomato_base_url + type_path + slug
        soup = BeautifulSoup(requests.get(tomato_url).text, 'html.parser')

        df.loc[ID, 'RT Critic'] = get_critic(soup)
        df.loc[ID, 'RT Audience'] = get_audience(soup)
        
        print('.', end='')
    
    end = timer()
    print(f'\nCompleted in {process_time(end-start)}')
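
The title-to-URL step inside `get_rt_scores` could also be pulled out into its own function, which would make some of the edge cases listed earlier easier to handle. The following is a rough sketch only: `build_rt_url`, its optional `year` parameter, and the accent-stripping step are illustrative additions, and the slug rules are a guess at Rotten Tomatoes' URL conventions rather than a documented scheme. Unlike the loop above, it collapses runs of underscores and drops all remaining punctuation.

```python
import re
import unicodedata

RT_BASE_URL = 'https://www.rottentomatoes.com/'
TYPE_PATHS = {'movie': 'm/', 'series': 'tv/'}


def build_rt_url(title, program_type, year=None):
    '''Build a candidate Rotten Tomatoes URL, or return None for an unknown type.'''
    type_path = TYPE_PATHS.get(program_type)
    if type_path is None:
        return None
    # Strip accents so that e.g. 'ä' becomes 'a'
    ascii_title = unicodedata.normalize('NFKD', title).encode('ascii', 'ignore').decode('ascii')
    # Apostrophes and spaces become underscores, as in get_rt_scores
    slug = re.sub(r"[' ]+", '_', ascii_title.lower())
    # Drop any other punctuation, then tidy repeated or stray underscores
    slug = re.sub(r'[^a-z0-9_]', '', slug)
    slug = re.sub(r'_+', '_', slug).strip('_')
    if year is not None:
        slug += f'_{year}'  # Some pages carry the release year as a suffix
    return RT_BASE_URL + type_path + slug


print(build_rt_url('Avatar: The Last Airbender', 'series'))
```

A caller could try the plain URL first and fall back to the `year` variant when no score is found, at the cost of a second request per miss.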

In [None]:
get_rt_scores(df)

### Saving new df to csv for later use

In [None]:
date_today = str(datetime.now().date())
df.to_csv(f'Netflix_Aus_Ratings_{date_today}.csv')

___
## 4. Exploring our new DataFrame

In [27]:
df.head()

NetflixID,Title,Image,Synopsis,IMDbRating,Type,ReleaseYear,Runtime,LargeImage,uNoGSDate,IMDbID,Download,RT Critic,RT Audience
81194454,She Did That,https://occ-0-299-300.1.nflxso.net/dnm/api/v6/...,"Go inside the lives of extraordinary, black fe...",,movie,2019,1h10m,,2020-02-04,,0.0,,
81055398,"Faith, Hope and Love",https://occ-0-299-300.1.nflxso.net/dnm/api/v6/...,"After shattering losses, a recent divorc&eacut...",,movie,2019,1h46m,,2020-02-04,,0.0,80.0,96.0
81176204,Thambi,https://occ-0-299-300.1.nflxso.net/dnm/api/v6/...,As a tip leads a local politician to his long-...,7.4,movie,2019,2h27m,,2020-02-02,tt10468636,0.0,,57.0
81234400,Tempted,https://occ-0-2851-1432.1.nflxso.net/dnm/api/v...,Love is just a game for a chaebol heir who agr...,,series,2018,,,2020-02-01,,0.0,,
81234382,Extraordinary You,https://occ-0-2851-1432.1.nflxso.net/dnm/api/v...,A teen seeks to change the fate that's been se...,8.2,series,2019,,,2020-02-01,tt10826102,0.0,,


First, let's select only the columns we want:

In [29]:
df2 = df[['Title', 'Type', 'ReleaseYear', 'Synopsis', 'Runtime', 'Image', 'IMDbRating', 'RT Critic', 'RT Audience']]
df2.head()

NetflixID,Title,Type,ReleaseYear,Synopsis,Runtime,Image,IMDbRating,RT Critic,RT Audience
81194454,She Did That,movie,2019,"Go inside the lives of extraordinary, black fe...",1h10m,https://occ-0-299-300.1.nflxso.net/dnm/api/v6/...,,,
81055398,"Faith, Hope and Love",movie,2019,"After shattering losses, a recent divorc&eacut...",1h46m,https://occ-0-299-300.1.nflxso.net/dnm/api/v6/...,,80.0,96.0
81176204,Thambi,movie,2019,As a tip leads a local politician to his long-...,2h27m,https://occ-0-299-300.1.nflxso.net/dnm/api/v6/...,7.4,,57.0
81234400,Tempted,series,2018,Love is just a game for a chaebol heir who agr...,,https://occ-0-2851-1432.1.nflxso.net/dnm/api/v...,,,
81234382,Extraordinary You,series,2019,A teen seeks to change the fate that's been se...,,https://occ-0-2851-1432.1.nflxso.net/dnm/api/v...,8.2,,


How many ratings did we get? The figures below are the percentage of missing values in each column:

In [31]:
percent_na(df2)

Title           0.00
Type            0.00
ReleaseYear     0.00
Synopsis        0.02
Runtime        34.49
Image           0.00
IMDbRating     17.75
RT Critic      73.02
RT Audience    58.17
dtype: float64

In [33]:
df2.sort_values(['RT Audience', 'RT Critic'], ascending=False).head(20)

NetflixID,Title,Type,ReleaseYear,Synopsis,Runtime,Image,IMDbRating,RT Critic,RT Audience
80213588,The Confession Killer,series,2019,Henry Lee Lucas rose to infamy when he confess...,,https://occ-0-1952-2433.1.nflxso.net/dnm/api/v...,,100.0,100.0
81021633,The Last Animals,movie,2017,This sobering documentary highlights the poach...,1h31m,https://occ-0-753-1360.1.nflxso.net/dnm/api/v6...,8.4,100.0,100.0
80190097,November 13: Attack on Paris,series,2018,Survivors and first responders share personal ...,,https://occ-0-1952-2433.1.nflxso.net/dnm/api/v...,,100.0,100.0
80213655,The Honeymoon Stand Up Special,series,2018,Impending parenthood does funny things to Nata...,,https://occ-0-2851-1432.1.nflxso.net/dnm/api/v...,,100.0,100.0
80175348,Kantaro: The Sweet Tooth Salaryman,series,2017,Elite publishing sales rep Kantaro wraps up hi...,,https://occ-0-1952-2433.1.nflxso.net/dnm/api/v...,,100.0,100.0
80058424,John Mulaney: The Comeback Kid,movie,2015,"Armed with boyish charm and a sharp wit, the f...",1h1m,http://occ-0-1952-2433.1.nflxso.net/dnm/api/v6...,7.9,100.0,100.0
80043049,Anthony Jeselnik: Thoughts and Prayers,movie,2015,There's no subject too dark as the comedian sk...,59m,http://occ-0-1952-2433.1.nflxso.net/dnm/api/v6...,7.8,100.0,100.0
70157495,Black Books,series,2000,A misanthropic bookshop owner named Bernard Bl...,,https://occ-0-753-1360.1.nflxso.net/dnm/api/v6...,8.5,100.0,100.0
80209796,Meditation Park,movie,2017,An elderly immigrant matriarch from Hong Kong ...,1h34m,http://occ-0-753-1360.1.nflxso.net/dnm/api/v6/...,7.2,89.0,100.0
80093103,Palm Trees in the Snow,movie,2015,"Finding a tantalizing clue in an old letter, a...",2h42m,https://occ-0-768-769.1.nflxso.net/dnm/api/v6/...,7.4,86.0,100.0


___
## 5. Future Work & Extensions
There is a range of ways we could extend this project, including:
- Devising an improved URL building function
- Building a web-based front-end for the new dataframe
- Developing a metric that takes into consideration the number of reviews that contribute to a rating (this count is available on RT). For example, we might consider a program rated 85 from 100,000 reviews a more significant result than a program rated 90 from 2 reviews
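
One candidate for such a review-weighted metric is a Bayesian average, which pulls a rating towards a prior mean when it rests on few reviews. This is a sketch of one option, not a method used in this notebook. The function name and the values chosen for `prior_mean` and `prior_weight` are hypothetical and would need tuning against real data.

```python
def weighted_rating(rating, n_reviews, prior_mean, prior_weight):
    '''Bayesian average of a rating.

    Behaves as if prior_weight extra reviews had voted prior_mean, so a rating
    backed by few reviews is pulled strongly towards prior_mean.
    '''
    if n_reviews < 0 or prior_weight <= 0:
        raise ValueError('n_reviews must be >= 0 and prior_weight must be > 0')
    return (n_reviews * rating + prior_weight * prior_mean) / (n_reviews + prior_weight)


# Hypothetical prior: an average score of 60, worth 50 reviews
popular = weighted_rating(85, 100_000, prior_mean=60, prior_weight=50)
obscure = weighted_rating(90, 2, prior_mean=60, prior_weight=50)
print(round(popular, 2), round(obscure, 2))  # 84.99 61.15
```

With these example values, the program rated 85 from 100,000 reviews ranks above the one rated 90 from 2 reviews, matching the intuition in the list above.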