# Project PTO - Predicting the Oscars


![Who will be the Oscar nominees and winners?](./oscars.jpg)

This notebook covers my progress on building an algorithm to predict Oscar results: not just the winners among the nominees in each category, but also which movies will be nominated out of all the movies released that year. The project repository can be cloned from https://github.com/j0hnk1m/predict-the-oscars if you're interested.

This mini-project only lasted a month, so it may not be perfect. Hopefully, you guys can use this as a starting point to get even better predictions. Let's dig in! I'll be using Anaconda Python 3.7. You may need to install a few packages yourself, however, using conda install {package}.

There are 4 major sections to cover:
1. Data Collection
2. Data Organization/Manipulation
3. The Algorithm
4. Algorithm Optimization

## 1. Data Collection

### 1.1 The BIGML dataset

First things first: the general imports.

In [31]:
# collect_data.py
import pandas as pd
import numpy as np
import os

I came across this dataset from https://bigml.com/user/academy_awards/gallery/dataset/5c6886e1eba31d73070017f5, which contained a list of movies from 2000~2018 and their details, including release year, movie_id (IMDB), certificate, duration, genre, IMDB rating, etc.

Here's what it looks like:

In [32]:
bigml = pd.read_csv('./data/bigml.csv')
print(bigml.shape)
bigml.head(5)

(1235, 119)


Unnamed: 0,year,movie,movie_id,certificate,duration,genre,rate,metascore,synopsis,votes,...,New_York_Film_Critics_Circle_nominated,New_York_Film_Critics_Circle_nominated_categories,Los_Angeles_Film_Critics_Association_won,Los_Angeles_Film_Critics_Association_won_categories,Los_Angeles_Film_Critics_Association_nominated,Los_Angeles_Film_Critics_Association_nominated_categories,release_date.year,release_date.month,release_date.day-of-month,release_date.day-of-week
0,2001,Kate & Leopold,tt0035423,PG-13,118,Comedy|Fantasy|Romance,6.4,44.0,An English Duke from 1876 is inadvertedly drag...,66660,...,0,,0,,0,,2001.0,12.0,25.0,2.0
1,2000,Chicken Run,tt0120630,G,84,Animation|Adventure|Comedy,7.0,88.0,When a cockerel apparently flies into a chicke...,144475,...,1,Best Animated Film,1,Best Animation,1,Best Animation,2000.0,6.0,23.0,5.0
2,2005,Fantastic Four,tt0120667,PG-13,106,Action|Adventure|Family,5.7,40.0,A group of astronauts gain superpowers after a...,273203,...,0,,0,,0,,2005.0,7.0,8.0,5.0
3,2002,Frida,tt0120679,R,123,Biography|Drama|Romance,7.4,61.0,"A biography of artist Frida Kahlo, who channel...",63852,...,0,,0,,0,,2002.0,11.0,22.0,5.0
4,2001,The Lord of the Rings: The Fellowship of the Ring,tt0120737,PG-13,178,Adventure|Drama|Fantasy,8.8,92.0,A meek Hobbit from the Shire and eight compani...,1286275,...,0,,1,Best Music,2,Best Music|Best Production Design,2001.0,12.0,19.0,3.0


Each of the 1235 movies in this dataset has a lot of variables (119). If we print the column names:

In [33]:
print(bigml.columns)

Index(['year', 'movie', 'movie_id', 'certificate', 'duration', 'genre', 'rate',
       'metascore', 'synopsis', 'votes',
       ...
       'New_York_Film_Critics_Circle_nominated',
       'New_York_Film_Critics_Circle_nominated_categories',
       'Los_Angeles_Film_Critics_Association_won',
       'Los_Angeles_Film_Critics_Association_won_categories',
       'Los_Angeles_Film_Critics_Association_nominated',
       'Los_Angeles_Film_Critics_Association_nominated_categories',
       'release_date.year', 'release_date.month', 'release_date.day-of-month',
       'release_date.day-of-week'],
      dtype='object', length=119)


This is a good start, but there are some issues. First, there are more variables than necessary, and many of them can be condensed or grouped together, especially the award vectors. Second, there are NaN values. Third, there's not enough data to work with if the goal is to predict which movies will be Oscar NOMINEES and in which category.

The first two issues are easily fixable later. To fix the third issue, let's start scraping data.
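
Before moving on, here is a rough sketch of how the first two issues could be inspected with pandas. It runs on a tiny made-up stand-in for bigml.csv (the column names come from the real dataset, the values do not), and the `total_nominated` / `total_won` columns are hypothetical names for one possible way of grouping the award vectors, not necessarily what the project ends up doing.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for bigml.csv; the values are invented for illustration only.
toy = pd.DataFrame({
    'movie': ['A', 'B', 'C'],
    'metascore': [44.0, np.nan, 61.0],
    'New_York_Film_Critics_Circle_nominated': [0, 1, 0],
    'Los_Angeles_Film_Critics_Association_nominated': [0, 1, 2],
    'Los_Angeles_Film_Critics_Association_won': [0, 1, 1],
})

# Issue 2: count missing values per column and keep only columns that have any.
nan_counts = toy.isna().sum()
nan_counts = nan_counts[nan_counts > 0]

# Issue 1: collapse the per-ceremony counters into two totals (hypothetical column names).
nominated_cols = [c for c in toy.columns if c.endswith('_nominated')]
won_cols = [c for c in toy.columns if c.endswith('_won')]
toy['total_nominated'] = toy[nominated_cols].sum(axis=1)
toy['total_won'] = toy[won_cols].sum(axis=1)

print(nan_counts)
print(toy[['movie', 'total_nominated', 'total_won']])
```

Running the same two steps on the real `bigml` dataframe should give a quick picture of which columns need cleaning.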

### 1.2 Web-scraping movie lists from IMDB

We can create a new Python file called "collect_data.py" and add the following imports and function. If you're confused, refer to the files in the GitHub repository.

The following function, "imdb_feature_film", takes a year from 2000~2018 and returns a dataframe of 350 movies scraped from IMDB feature film lists (7 pages of 50 movies each), the year they were released, and their respective IMDB movie ids. I chose 350 movies per year simply because it looked like a good balance between having enough data and filtering out really, really weird + indie movies not likely to win Oscars anytime soon. You can change the number, however, by replacing 7 in the for loop with a different number of pages.


In [57]:
import numpy as np
import pandas as pd
import requests
import re

def imdb_feature_film(year):
    """
    Scrapes data of movie titles from IMDB
    :param year: any year from 2000~2018
    :return: a dataframe of movies, their respective IMDB IDs, and release years.
    """
    # Example link where this function scrapes data from: https://www.imdb.com/year/2018/

    print(year)
    html = requests.get("https://www.imdb.com/year/" + str(year)).text

    movies = np.zeros((0, 2))
    for i in range(0, 7):  # 7 pages of 50 movies each = 350 top movies
        movies = np.concatenate([movies, np.flip(np.array(re.findall(r'<a href="/title/([^:?%]+?)/"[\r\n]+> <img alt="([^%]+?)"[\r\n]+', html)))])
        nextLink = "https://www.imdb.com" + re.findall(r'<a href="(/search/title\?title_type=feature&year=(?:.*)&start=(?:.*))"[\r\n]+class="lister-page-next next-page"', html)[0]
        html = requests.get(nextLink).text

    df = pd.DataFrame(movies, columns=['movie', 'movie_id'])
    df.insert(0, 'year', [year]*movies.shape[0], True)
    return df
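
If it's unclear why the columns come out as 'movie' then 'movie_id' even though the regex captures the id first, here is a small offline sketch. The HTML snippet is made up to fit the regex (the real IMDB markup may differ). It shows that `re.findall` with two groups returns (id, title) tuples, and that `np.flip` without an axis reverses both the columns and the row order.

```python
import re
import numpy as np

# Hypothetical two-movie snippet shaped to fit the regex used in imdb_feature_film.
sample_html = (
    '<a href="/title/tt3606756/"\n> <img alt="Incredibles 2"\n'
    '<a href="/title/tt2798920/"\n> <img alt="Annihilation"\n'
)

pattern = r'<a href="/title/([^:?%]+?)/"[\r\n]+> <img alt="([^%]+?)"[\r\n]+'
matches = re.findall(pattern, sample_html)  # list of (movie_id, title) tuples

arr = np.array(matches)
flipped_all = np.flip(arr)           # flips rows AND columns: (title, id), page order reversed
flipped_cols = np.flip(arr, axis=1)  # flips columns only: (title, id), page order kept

print(matches)
print(flipped_all)
print(flipped_cols)
```

So the function above ends up with (title, id) pairs, with each page's rows in reverse order. Passing `axis=1` would be one option if the original page order matters to you.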

Let's see the pandas dataframe of web-scraped movies from 2018 (It will take some time).

In [35]:
df_2018 = imdb_feature_film(2018)
print(df_2018.shape)
df_2018.head(10)

2018
(350, 3)


Unnamed: 0,year,movie,movie_id
0,2018,Incredibles 2,tt3606756
1,2018,Annihilation,tt2798920
2,2018,Overlord,tt4530422
3,2018,Bad Times at the El Royale,tt6628394
4,2018,Goosebumps 2: Haunted Halloween,tt5664636
5,2018,A Simple Favor,tt7040874
6,2018,Hotel Mumbai,tt5461944
7,2018,All Is True,tt9206798
8,2018,Mission: Impossible - Fallout,tt4912910
9,2018,The Professor,tt6865690


Awesome, now we can move onto the next step, movie tags.

### 1.3 Web-scraping movie tags

Now that we have the movies and their IMDB movie ids, we can create a function that uses that information and regex to scrape important tags that we may need to build an accurate prediction algorithm. The tags, which are the same as the ones in the BigML dataset, are listed as comments in the function.

In [58]:
def movie_tags(id):
    """
    Scrapes data of movie tags/details/variables from IMDB based on the movie IDs
    :param id: movie id (IMDB)
    :return: list of its tags/variables to be used as input variables.
    """
    html = requests.get("https://www.imdb.com/title/" + id).text
    # ---------------TAGS---------------
    # certificate
    # duration
    # genre
    # rate
    # metascore
    # synopsis
    # votes
    # gross
    # user reviews
    # critic reviews
    # popularity
    # awards wins
    # awards nominations

    # Raw strings keep regex escapes such as \s and \d from being treated as string escapes
    genre = re.findall(r'"genre": ([\s\S]+),\n[\s\S]+"contentRating":', html)
    certificate = re.findall(r'"contentRating": "(.*)",\n[\s\S]+<strong', html)
    rate = re.findall('<strong title="(.*) based on ', html)
    votes = re.findall('based on ([,0-9]+) user ratings">', html)
    user_reviews = re.findall('<span itemprop="reviewCount">([,0-9]+) user</span>', html)
    critic_reviews = re.findall('<span itemprop="reviewCount">([,0-9]+) critic</span>', html)
    duration = re.findall(r'<time datetime="PT(\d+)M">\n', html)
    # keywords holds the synopsis text; fall back to '' instead of raising IndexError when it is missing
    keywords = re.findall(r'<div class="summary_text">\n(.*)\n', html)
    keywords = keywords[0].strip() if keywords else ''
    metascore = re.findall(r'<div class="metacriticScore score_[\w]+ titleReviewBarSubItem">\n<span>([0-9]+)<', html)

    if len(genre) == 0 or len(certificate) == 0 or len(rate) == 0 or len(votes) == 0 or len(user_reviews) == 0 or len(critic_reviews) == 0 or len(duration) == 0 or len(metascore) == 0:
        return None
    genre = ' '.join(genre[0].split()).replace('"', '').replace('[ ', '').replace(' ]', '')
    certificate = certificate[0]
    rate = float(rate[0])
    votes = int(votes[0].replace(',', ''))
    user_reviews = int(user_reviews[0].replace(',', ''))
    critic_reviews = int(critic_reviews[0].replace(',', ''))
    duration = int(duration[0].replace(',', ''))
    metascore = int(metascore[0])

    popularity = re.findall(r'titleReviewBarSubItem">\n<span>[0-9]+<[\s\S]+ ([,0-9]+)\n[\s\S]+\(<span class="titleOverviewSprite popularity', html)
    if len(popularity) == 0:
        popularity = -1
    else:
        popularity = int(popularity[0].replace(',', ''))

    awards_wins = re.findall(r'<span class="awards-blurb">[\s\S]+ (\d+) wins', html)
    if len(awards_wins) == 0:
        awards_wins = 0
    else:
        awards_wins = int(awards_wins[0])

    awards_nominations = re.findall(r'<span class="awards-blurb">[\s\S]+ (\d+) nominations', html)
    if len(awards_nominations) == 0:
        awards_nominations = 0
    else:
        awards_nominations = int(awards_nominations[0])

    gross = re.findall(r'Gross USA:</h4> \$([,0-9]+)', html)
    if len(gross) == 0:
        gross = -1
    else:
        gross = int(gross[0].replace(',', ''))

    tags = [certificate, duration, genre, rate, metascore, keywords, votes, gross, user_reviews, critic_reviews,
            popularity, awards_wins, awards_nominations]
    return tags
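
The function above repeats the same "find, fall back to a default, strip commas, cast to int" steps for popularity, awards, and gross. As a possible cleanup, here is a minimal sketch of a helper that does that in one place. `first_int` is a hypothetical name that is not in the repository, and the sample HTML is a made-up fragment rather than real IMDB markup.

```python
import re

def first_int(pattern, html, default):
    """Return the first regex capture as an int (commas removed), or `default` when nothing matches."""
    found = re.findall(pattern, html)
    if not found:
        return default
    return int(found[0].replace(',', ''))

# Made-up HTML fragment for illustration; the real page layout may differ.
sample_html = 'Gross USA:</h4> $85,080,171\n<span class="awards-blurb">Another 50 wins'

gross = first_int(r'Gross USA:</h4> \$([,0-9]+)', sample_html, default=-1)
wins = first_int(r'<span class="awards-blurb">[\s\S]+ (\d+) wins', sample_html, default=0)
# No "nominations" text in the sample, so this falls back to the default.
nominations = first_int(r'<span class="awards-blurb">[\s\S]+ (\d+) nominations', sample_html, default=0)

print(gross, wins, nominations)
```

The defaults mirror the ones used in movie_tags: -1 for a missing gross or popularity, 0 for missing award counts.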

Again, let's check that the tags are correct using the movie 'Green Book', which won the Oscar for Best Picture this year (definitely recommend watching it).

![Green Book](https://s20352.pcdn.co/wp-content/uploads/2018/11/green-book-GBK_Tsr1Sheet_RGB_3SM_rgb.jpg)

In [37]:
tags = movie_tags('tt6966692')
print(tags)

['PG-13', 130, 'Biography, Comedy, Drama, Music', 8.3, 69, 'A working-class Italian-American bouncer becomes the driver of an African-American classical pianist on a tour of venues through the 1960s American South.', 191836, 85080171, 1041, 361, 77, 50, 89]
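
A bare list like this is hard to read, so here is a small sketch that pairs each value with its tag name. The names follow the order of the `tags` list in movie_tags; `TAG_NAMES` and `labeled` are just illustrative names, and the values are copied from the output above rather than scraped again.

```python
# Tag names in the same order as the list returned by movie_tags.
TAG_NAMES = ['certificate', 'duration', 'genre', 'rate', 'metascore', 'synopsis', 'votes',
             'gross', 'user_reviews', 'critic_reviews', 'popularity', 'awards_wins',
             'awards_nominations']

# Values copied from the Green Book output above.
tags = ['PG-13', 130, 'Biography, Comedy, Drama, Music', 8.3, 69,
        'A working-class Italian-American bouncer becomes the driver of an African-American '
        'classical pianist on a tour of venues through the 1960s American South.',
        191836, 85080171, 1041, 361, 77, 50, 89]

labeled = dict(zip(TAG_NAMES, tags))
for name, value in labeled.items():
    print(f'{name}: {value}')
```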


Oh baby, we're in business now.

### 1.4 Web-scraping award ceremony results

Now that we have the movie tags, all that's left for data collection is the awards each movie won and was nominated for. Unfortunately, it's not so simple. Let's see why.

First, we need to choose which award ceremonies' data we want to use for input. The Oscars results are what we want to predict, so they will be used for output labels. There are 14 ceremonies used in total: Golden Globe, BAFTA, Screen Actors Guild, Directors Guild, Producers Guild, Art Directors Guild, Writers Guild, Costume Designers Guild, Online Film Television Association, Online Film Critics Society, Critics Choice, London Critics Circle Film, American Cinema Editors, and Academy Awards/Oscars.
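
To make the link between these ceremonies and the scraper explicit, here is a sketch that pairs each ceremony with the IMDB event ID found in `scrape_movie_awards`. The pairing assumes the IDs there are listed in the same order as the ceremonies above, and `EVENT_IDS` and `event_url` are hypothetical names that are not in the repository. Note that the URL uses the year after the film's release, since ceremonies are held the following year.

```python
# Ceremony name -> IMDB event ID, assuming the same ordering as the list in scrape_movie_awards.
EVENT_IDS = {
    'Golden Globe': 'ev0000292',
    'BAFTA': 'ev0000123',
    'Screen Actors Guild': 'ev0000598',
    'Directors Guild': 'ev0000212',
    'Producers Guild': 'ev0000531',
    'Art Directors Guild': 'ev0000618',
    'Writers Guild': 'ev0000710',
    'Costume Designers Guild': 'ev0000190',
    'Online Film Television Association': 'ev0002704',
    'Online Film Critics Society': 'ev0000511',
    'Critics Choice': 'ev0000133',
    'London Critics Circle Film': 'ev0000403',
    'American Cinema Editors': 'ev0000017',
    'Academy Awards': 'ev0000003',
}

def event_url(event_id, film_year):
    """Build the results URL for the ceremony honoring films released in `film_year`."""
    return "https://www.imdb.com/event/" + event_id + "/" + str(film_year + 1) + "/1?ref_=ttawd_ev_1"

for ceremony, event_id in EVENT_IDS.items():
    print(ceremony, event_url(event_id, 2018))
```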

Functions to scrape the award winners + nominees from these award ceremonies:

In [61]:
def scrape_movie_awards(year):
    """
    Given a year, scrapes data off of IMDB for the results of 14 different award ceremonies and the categories involved.
    :param year: integer year from 2000~2018
    :return: 13 ceremonies' award categories, 13 ceremonies' award results, Oscar categories, Oscar results
    """
    events = ['ev0000292', 'ev0000123', 'ev0000598', 'ev0000212', 'ev0000531', 'ev0000618', 'ev0000710',
              'ev0000190', 'ev0002704', 'ev0000511', 'ev0000133', 'ev0000403', 'ev0000017', 'ev0000003']

    htmls = []
    for e in events:
        htmls.append(requests.get("https://www.imdb.com/event/" + e + "/" + str(year + 1) + "/1?ref_=ttawd_ev_1").text)
    # ---------------AWARDS---------------
    # 1. Golden Globe
    # 2. BAFTA
    # 3. Screen Actors Guild
    # 4. Directors Guild
    # 5. Producers Guild
    # 6. Art Directors Guild
    # 7. Writers Guild
    # 8. Costume Designers Guild
    # 9. Online Film Television Association
    # 10. Online Film Critics Society
    # 11. Critics Choice
    # 12. London Critics Circle Film
    # 13. American Cinema Editors

    # 14. Oscar

    gg_categories = [i for i in re.findall('"categoryName":"([^"]*?)","nominations"', htmls[0]) if 'Television' not in i][:14]
    gg = []
    for c in gg_categories:
        if 'Actor' in c or 'Actress' in c or 'Director' in c or (year == 2014 and 'Original Score' in c):
            gg.append(re.findall(re.escape(c) + '",(?:.*?)"secondaryNominees":\[{"name":"([^"]*?)","note":null', htmls[0])[:-1])
        else:
            gg.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[0])[:-1])
    gg_categories, gg = id_categories('gg', gg_categories, gg)

    bafta_categories = [i for i in re.findall('"categoryName":"([^"]*?)","nominations"', htmls[1]) if 'British' not in i and 'Best' in i and 'Series' not in i and 'Television' not in i and 'Features' not in i][:19]
    bafta = []
    for c in bafta_categories:
        if 'Actor' in c or 'Actress' in c or 'Director' in c:
            bafta.append(re.findall(re.escape(c) + '",(?:.*?)"secondaryNominees":\[{"name":"([^"]*?)","note":null', htmls[1])[:5])
        else:
            bafta.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[1])[:-1])
    bafta_categories, bafta = id_categories('bafta', bafta_categories, bafta)

    sag_categories = [i for i in re.findall('"categoryName":"([^"]*?)","nominations"', htmls[2]) if 'Series' not in i and 'Motion Picture' not in i and 'Stunt' not in i and 'Cast' not in i][:4]
    sag = []
    for c in sag_categories:
        if 'Actor' in c or 'Actress' in c or 'Director' in c:
            sag.append(re.findall(re.escape(c) + '",(?:.*?)"secondaryNominees":\[{"name":"([^"]*?)","note":null', htmls[2])[:-1])
        else:
            sag.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[2])[:-1])
    sag_categories, sag = id_categories('sag', sag_categories, sag)

    dg_categories = [i for i in re.findall('"categoryName":"([^"]*?)","nominations"', htmls[3]) if ('Feature Film' in i or 'Motion' in i or 'Documentary' in i) and 'First' not in i][:2]
    dg = []
    for c in dg_categories:
        dg.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[3])[:-1])
    dg_categories, dg = id_categories('dg', dg_categories, dg)

    pg_categories = [i for i in re.findall('"categoryName":"([^"]*?)","nominations"', htmls[4]) if 'Producer of' in i and 'Theatrical Motion Pictures' in i][:3]
    pg = []
    for c in pg_categories:
        if year >= 2004:
            pg.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[4])[:-1])
        else:
            pg.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[4]))
    pg_categories, pg = id_categories('pg', pg_categories, pg)

    adg_categories = [i for i in re.findall('"categoryName":"([^"]*)","nominations"', htmls[5]) if 'Film' in i][:4]
    adg = []
    for c in adg_categories:
        if year == 2001 and c == 'Fantasy Film':
            adg.append(['A.I. Artificial Intelligence'])
        else:
            adg.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[5])[:-1])
    adg_categories, adg = id_categories('adg', adg_categories, adg)

    wg_categories = [i for i in re.findall('"categoryName":"([^"]*?)","nominations"', htmls[6]) if 'Original Screenplay'
                     in i or 'Adapted Screenplay' in i or i == 'Documentary Screenplay'][:3]
    wg = []
    for c in wg_categories:
        wg.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[6])[:-1])
    wg_categories, wg = id_categories('wg', wg_categories, wg)

    cdg_categories  = [i for i in re.findall('"categoryName":"([^"]*?)","nominations"', htmls[7]) if 'Contemporary Film' in i
                      or 'Period Film' in i or 'Fantasy Film' in i][:3]
    cdg = []
    for c in cdg_categories:
        cdg.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[7])[:-1])
    cdg_categories, cdg = id_categories('cdg', cdg_categories, cdg)

    ofta_categories = [i for i in re.findall('"categoryName":"([^"]*?)","nominations"', htmls[8]) if 'Series' not in i and 'Ensemble' not in i
                       and 'Television' not in i and 'Actors and Actresses' not in i and 'Creative' not in i and 'Program' not in i
                       and 'Behind' not in i and 'Debut' not in i and 'Poster' not in i and 'Trailer' not in i and 'Stunt' not in i and
                       'Sequence' not in i and 'Voice-Over' not in i and 'Youth' not in i and 'Cinematic' not in i and 'Casting' not in i and 'Acting' not in i][:23]
    ofta = []
    for c in ofta_categories:
        if 'Actor' in c or 'Actress' in c or 'Director' in c:
            ofta.append(re.findall(re.escape(c) + '",(?:.*?)"secondaryNominees":\[{"name":"([^"]*?)","note":null', htmls[8])[:-1])
        else:
            ofta.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[8])[:-1])
    ofta_categories, ofta = id_categories('ofta', ofta_categories, ofta)

    ofcs_categories = [i for i in re.findall('"categoryName":"([^"]*?)","nominations"', htmls[9]) if 'Debut' not in i
                       and 'Stunt' not in i and 'Television' not in i and 'Series' not in i][:18]
    ofcs = []
    for c in ofcs_categories:
        if 'Actor' in c or 'Actress' in c or 'Director' in c:
            ofcs.append(re.findall(re.escape(c) + '",(?:.*?)"secondaryNominees":\[{"name":"([^"]*?)","note":null', htmls[9])[:-1])
        else:
            ofcs.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[9])[:-1])
    ofcs_categories, ofcs = id_categories('ofcs', ofcs_categories, ofcs)

    cc_categories = [i for i in re.findall('"categoryName":"([^"]*?)","nominations"', htmls[10]) if 'Series' not in i
                     and 'Young' not in i and 'Ensemble' not in i and 'TV' not in i and 'Television' not in i and 'Show' not in i][:23]
    cc = []
    for c in cc_categories:
        if 'Actor' in c or 'Actress' in c or 'Director' in c:
            cc.append(re.findall(re.escape(c) + '",(?:.*?)"secondaryNominees":\[{"name":"([^"]*?)","note":null', htmls[10])[:-1])
        else:
            cc.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[10])[:-1])
    cc_categories, cc = id_categories('cc', cc_categories, cc)

    lccf_categories = [i for i in re.findall('"categoryName":"([^"]*?)","nominations"', htmls[11]) if 'British' not in i
                       and 'Technical' not in i and 'Screenwriter' not in i and 'Television' not in i][:8]
    lccf = []
    for c in lccf_categories:
        if 'Actor' in c or 'Actress' in c or 'Director' in c:
            lccf.append(re.findall(re.escape(c) + '",(?:.*?)"secondaryNominees":\[{"name":"([^"]*?)","note":null', htmls[11])[:-1])
        else:
            lccf.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[11])[:-1])
    lccf_categories, lccf = id_categories('lccf', lccf_categories, lccf)

    ace_categories = [i for i in re.findall('"categoryName":"([^"]*?)","nominations"', htmls[12]) if 'Series' not in i
                      and 'Non-Theatrical' not in i and 'Television' not in i and 'Student' not in i][:4]
    ace = []
    for c in ace_categories:
        if 'Actor' in c or 'Actress' in c or 'Director' in c:
            ace.append(re.findall(re.escape(c) + '",(?:.*?)"secondaryNominees":\[{"name":"([^"]*?)","note":null', htmls[12])[:-1])
        else:
            ace.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[12])[:-1])
    ace_categories, ace = id_categories('ace', ace_categories, ace)

    oscar_categories = [i for i in re.findall('"categoryName":"([^"]*?)","nominations"', htmls[13])][:24]
    oscar = []
    for c in oscar_categories:
        if c == oscar_categories[-1]:
            if 'Actor' in c or 'Actress' in c or 'Director' in c:
                oscar.append(re.findall(re.escape(c) + '",(?:.*?)"secondaryNominees":\[{"name":"([^"]*?)","note":null', htmls[13]))
            else:
                oscar.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[13]))
        else:
            if 'Actor' in c or 'Actress' in c or 'Director' in c:
                oscar.append(re.findall(re.escape(c) + '",(?:.*?)"secondaryNominees":\[{"name":"([^"]*?)","note":null', htmls[13])[:-1])
            else:
                oscar.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[13])[:-1])
    oscar_categories, oscar = id_categories('oscar', oscar_categories, oscar)

    return [gg_categories, bafta_categories, sag_categories, dg_categories, pg_categories, adg_categories, wg_categories, cdg_categories, ofta_categories, ofcs_categories, cc_categories, lccf_categories, ace_categories],\
           [gg, bafta, sag, dg, pg, adg, wg, cdg, ofta, ofcs, cc, lccf, ace], oscar_categories, oscar



def id_categories(name, cs, aw):
    """
    This function is specifically called by scrape_movie_awards to link similar categories across award ceremonies by
    tagging them with IDs.
    :param name: award ceremony id/name
    :param cs: list of categories
    :param aw: list of award winners/nominees
    :return: list of categories ids (0~23) and list of award winners based on the available categories
    """
    if name == 'gg':
        replace = [next((s for s in cs if 'Best Motion Picture' in s and 'Drama' in s), None),
                   next((s for s in cs if 'Best Motion Picture' in s and 'Comedy' in s), None),
                   next((s for s in cs if 'Actor' in s and 'Drama' in s and 'Supporting' not in s), None),
                   next((s for s in cs if 'Actor' in s and 'Comedy' in s and 'Supporting' not in s), None),
                   next((s for s in cs if 'Actress' in s and 'Drama' in s and 'Supporting' not in s), None),
                   next((s for s in cs if 'Actress' in s and 'Comedy' in s and 'Supporting' not in s), None),
                   next((s for s in cs if 'Actor' in s and 'Supporting' in s), None),
                   next((s for s in cs if 'Actress' in s and 'Supporting' in s), None),
                   next((s for s in cs if 'Animated' in s), None),
                   next((s for s in cs if 'Director' in s), None),
                   next((s for s in cs if 'Foreign' in s), None),
                   next((s for s in cs if 'Original Score' in s), None),
                   next((s for s in cs if 'Original Song' in s), None),
                   next((s for s in cs if 'Screenplay' in s), None)]
        id = [0, 0, 1, 1, 2, 2, 3, 4, 5, 8, 12, 14, 15, 22]
    elif name == 'bafta':
        replace = [next((s for s in cs if 'Best Film' in s), None),
             next((s for s in cs if 'Actor' in s and 'Supporting' not in s), None),
             next((s for s in cs if 'Actress' in s and 'Supporting' not in s), None),
             next((s for s in cs if 'Actor' in s and 'Supporting' in s), None),
             next((s for s in cs if 'Actress' in s and 'Supporting' in s), None),
             next((s for s in cs if 'Animated' in s and 'Short' not in s), None),
             next((s for s in cs if 'Cinematography' in s), None),
             next((s for s in cs if 'Costume Design' in s), None),
             next((s for s in cs if 'Documentary' in s), None),
             next((s for s in cs if 'Editing' in s), None),
             next((s for s in cs if 'Not' in s and 'English' in s), None),
             next((s for s in cs if 'Make Up' in s or 'Hair' in s), None),
             next((s for s in cs if 'Production Design' in s), None),
             next((s for s in cs if 'Short' in s and 'Animat' in s), None),
             next((s for s in cs if 'Short' in s and 'Film' in s), None),
             next((s for s in cs if 'Sound' in s), None),
             next((s for s in cs if 'Visual Effects' in s), None),
             next((s for s in cs if 'Screenplay' in s and 'Adapted' in s), None),
             next((s for s in cs if 'Screenplay' in s and 'Original' in s), None)]
        id = [0, 1, 2, 3, 4, 5, 6, 7, 9, 11, 12, 13, 16, 17, 18, 19, 21, 22, 23]
    elif name == 'sag':
        replace = [next((s for s in cs if 'Male' in s and 'Supporting' not in s), None),
             next((s for s in cs if 'Female' in s and 'Supporting' not in s), None),
             next((s for s in cs if 'Male' in s and 'Supporting' in s), None),
             next((s for s in cs if 'Female' in s and 'Supporting' in s), None)]
        id = [1, 2, 3, 4]
    elif name == 'dg':
        replace = [next((s for s in cs if 'Feature' in s), None),
             next((s for s in cs if 'Documentary' in s), None),]
        id = [0, 9]
    elif name == 'pg':
        replace = [next((s for s in cs if 'Producer of Theatrical' in s), None),
             next((s for s in cs if 'Animated' in s), None),
             next((s for s in cs if 'Documentary' in s), None)]
        id = [0, 5, 9]
    elif name == 'adg':
        replace = [next((s for s in cs if 'Period' in s), None),
             next((s for s in cs if 'Fantasy' in s), None),
             next((s for s in cs if 'Contemporary' in s), None),
             next((s for s in cs if 'Animated' in s), None)]
        id = [0, 0, 0, 5]
    elif name == 'wg':
        replace = [next((s for s in cs if 'Documentary' in s), None),
            next((s for s in cs if 'Adapted' in s), None),
             next((s for s in cs if 'Original' in s), None)]
        id = [9, 22, 23]
    elif name == 'cdg':
        replace = [next((s for s in cs if 'Period' in s), None),
             next((s for s in cs if 'Fantasy' in s), None),
             next((s for s in cs if 'Contemporary' in s), None)]
        id = [0, 0, 0]
    elif name == 'ofta':
        replace = [next((s for s in cs if 'Best Picture' in s), None),
             next((s for s in cs if 'Best Actor' in s), None),
             next((s for s in cs if 'Breakthrough' in s and 'Male' in s), None),
             next((s for s in cs if 'Best Actress' in s), None),
             next((s for s in cs if 'Breakthrough' in s and 'Female' in s), None),
             next((s for s in cs if 'Actor' in s and 'Supporting' in s), None),
             next((s for s in cs if 'Actress' in s and 'Supporting' in s), None),
             next((s for s in cs if 'Animated' in s), None),
             next((s for s in cs if 'Cinematography' in s), None),
             next((s for s in cs if 'Costume Design' in s), None),
             next((s for s in cs if 'Director' in s), None),
             next((s for s in cs if 'Documentary' in s), None),
             next((s for s in cs if 'Film Editing' in s), None),
             next((s for s in cs if 'Foreign' in s), None),
             next((s for s in cs if 'Makeup' in s or 'Hair' in s), None),
             next((s for s in cs if 'Original Score' in s), None),
             next((s for s in cs if 'Original Song' in s), None),
             next((s for s in cs if 'Production Design' in s), None),
             next((s for s in cs if 'Sound' in s and 'Editing' in s), None),
             next((s for s in cs if 'Sound' in s and 'Mixing' in s), None),
             next((s for s in cs if 'Visual Effects' in s), None),
             next((s for s in cs if 'Screenplay' in s and 'Another' in s), None),
             next((s for s in cs if 'Screenplay' in s and 'Directly' in s), None)]
        id = [0, 1, 1, 2, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 16, 19, 20, 21, 22, 23]
    elif name == 'ofcs':
        replace = [next((s for s in cs if 'Best Picture' in s), None),
             next((s for s in cs if 'Actor' in s and 'Supporting' not in s), None),
             next((s for s in cs if 'Actress' in s and 'Supporting' not in s), None),
             next((s for s in cs if 'Actor' in s and 'Supporting' in s), None),
             next((s for s in cs if 'Actress' in s and 'Supporting' in s), None),
             next((s for s in cs if 'Animated' in s), None),
             next((s for s in cs if 'Cinematography' in s), None),
             next((s for s in cs if 'Costume Design' in s), None),
             next((s for s in cs if 'Director' in s), None),
             next((s for s in cs if 'Documentary' in s), None),
             next((s for s in cs if 'Editing' in s), None),
             next((s for s in cs if 'Not' in s and 'English' in s), None),
             next((s for s in cs if 'Original Score' in s), None),
             next((s for s in cs if 'Original Song' in s), None),
             next((s for s in cs if 'Sound' in s), None),
             next((s for s in cs if 'Visual Effects' in s), None),
             next((s for s in cs if 'Screenplay' in s and 'Adapted' in s), None),
             next((s for s in cs if 'Screenplay' in s and 'Original' in s), None)]
        id = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 14, 15, 19, 21, 22, 23]
    elif name == 'cc':
        replace = [next((s for s in cs if 'Best Picture' in s), None),
             next((s for s in cs if 'Best Action Movie' in s), None),
             next((s for s in cs if 'Best Comedy' in s), None),
             next((s for s in cs if 'Best Sci-Fi' in s or 'Best Horror' in s), None),
             next((s for s in cs if 'Actor' in s and 'Comedy' not in s and 'Supporting' not in s), None),
             next((s for s in cs if 'Actor' in s and 'Comedy' in s and 'Supporting' not in s), None),
             next((s for s in cs if 'Actress' in s and 'Comedy' not in s and 'Supporting' not in s), None),
             next((s for s in cs if 'Actress' in s and 'Comedy' in s and 'Supporting' not in s), None),
             next((s for s in cs if 'Actor' in s and 'Supporting' in s), None),
             next((s for s in cs if 'Actress' in s and 'Supporting' in s), None),
             next((s for s in cs if 'Animated' in s), None),
             next((s for s in cs if 'Cinematography' in s), None),
             next((s for s in cs if 'Costume Design' in s), None),
             next((s for s in cs if 'Director' in s), None),
             next((s for s in cs if 'Editing' in s), None),
             next((s for s in cs if 'Foreign' in s), None),
             next((s for s in cs if 'Makeup' in s or 'Hair' in s), None),
             next((s for s in cs if 'Score' in s), None),
             next((s for s in cs if 'Song' in s), None),
             next((s for s in cs if 'Production Design' in s), None),
             next((s for s in cs if 'Visual Effects' in s), None),
             next((s for s in cs if 'Adapted Screenplay' in s), None),
             next((s for s in cs if 'Original Screenplay' in s), None)]
        id = [0, 0, 0, 0, 1, 1, 2, 2, 3, 4, 5, 6, 7, 8, 11, 12, 13, 14, 15, 16, 21, 22, 23]
    elif name == 'lccf':
        replace = [next((s for s in cs if 'Film' in s), None),
             next((s for s in cs if 'Actor' in s and 'Supporting' not in s), None),
             next((s for s in cs if 'Actress' in s and 'Supporting' not in s), None),
             next((s for s in cs if 'Actor' in s and 'Supporting' in s), None),
             next((s for s in cs if 'Actress' in s and 'Supporting' in s), None),
             next((s for s in cs if 'Director' in s), None),
             next((s for s in cs if 'Documentary' in s), None),
             next((s for s in cs if 'Foreign' in s), None)]
        id = [0, 1, 2, 3, 4, 8, 9, 12]
    elif name == 'ace':
        replace = [next((s for s in cs if 'Feature Film' in s and 'Drama' in s), None),
             next((s for s in cs if 'Feature Film' in s and 'Comedy' in s), None),
             next((s for s in cs if 'Animated' in s), None),
             next((s for s in cs if 'Documentary' in s), None)]
        id = [0, 0, 5, 9]
    else:  # Oscars
        replace = [next((s for s in cs if 'Picture' in s), None),
             next((s for s in cs if 'Actor' in s and 'Leading' in s), None),
             next((s for s in cs if 'Actress' in s and 'Leading' in s), None),
             next((s for s in cs if 'Actor' in s and 'Supporting' in s), None),
             next((s for s in cs if 'Actress' in s and 'Supporting' in s), None),
             next((s for s in cs if 'Animated' in s and 'Short' not in s), None),
             next((s for s in cs if 'Cinematography' in s), None),
             next((s for s in cs if 'Costume Design' in s), None),
             next((s for s in cs if 'Direct' in s), None),
             next((s for s in cs if 'Documentary' in s), None),
             next((s for s in cs if 'Documentary' in s and 'Short' in s), None),
             next((s for s in cs if 'Film Editing' in s), None),
             next((s for s in cs if 'Foreign' in s), None),
             next((s for s in cs if 'Makeup' in s or 'Hair' in s), None),
             next((s for s in cs if 'Original Score' in s), None),
             next((s for s in cs if 'Original Song' in s), None),
             next((s for s in cs if 'Production Design' in s), None),
             next((s for s in cs if 'Short' in s and 'Animat' in s), None),
             next((s for s in cs if 'Short' in s and 'Live' in s), None),
             next((s for s in cs if 'Sound' in s and 'Editing' in s), None),
             next((s for s in cs if 'Sound' in s and 'Mixing' in s), None),
             next((s for s in cs if 'Visual Effects' in s), None),
             next((s for s in cs if 'Screenplay' in s and 'Adapted' in s), None),
             next((s for s in cs if 'Screenplay' in s and 'Original' in s), None)]
        id = list(range(0, 24))

    none_index = [i for i in replace if i is None]
    cs = [c for c in cs if c not in none_index]
    aw = [a for a in aw if a not in none_index]

    for i, c in enumerate(cs):
        cs[i] = str(id[i])
    return cs, aw

The functions above would have been far more concise (with fewer if/else branches and list comprehensions) if the category names and the listing of winners/nominees had stayed consistent over the years.

Here are just a few of the countless exceptions/difficulties that needed to be dealt with:
1. Movie titles switched with the Actor/Actress/Director/etc:

1 | 2
- | - 
![Best Picture](./ex1.jpg) | ![Best Actor](./ex2.jpg)


2. Or how about this case when the composer and movie title for Best Original Score were switched ONLY in 2015?
![Best Original Score](./ex3.jpg)

3. Award categories changing every year

4. TV shows and other unwanted categories (e.g. Best British/Canadian-only film producer, Student Award, Best Youth Award, etc.) mixed in

5. Regex in general

## 2. Data Organization/Manipulation

### 2.1 Combining BIGML with our Data

As the title suggests, we should combine the BIGML dataset with the data we just scraped to get a larger dataset, while avoiding duplicate movies. Time to work on main.py now. Imports first, including the collect_data.py we created.

In [39]:
# main.py
import pandas as pd
import numpy as np
import collect_data
import os
import pickle

The function below, extract_movie_data(), calls the web-scraping functions from collect_data.py and returns a dataframe of movies and their tags, excluding any movies already present in the BIGML dataset.

Note: the GitHub repository doesn't include imdb.csv because the final dataset (combined.csv) makes it redundant.

In [62]:
def extract_movie_data():
	"""
	Extracts the movie titles from years 2000~2018 and their respective tags/details and outputs them in the form of
	a dataframe in similar format to the BIGML dataset.
	:return: dataframe of web-scraped movies and their tags
	"""
	if os.path.exists('./data/imdb.csv'):
		imdb = pd.read_csv('./data/imdb.csv', index_col=0)
	else:
		# Scrapes one dataframe per year and stacks them (pd.concat also works on pandas versions without DataFrame.append)
		imdb = pd.concat([collect_data.imdb_feature_film(y) for y in range(2000, 2019)])

		# Removes movies whose titles already appear in the BIGML dataset
		df = pd.read_csv('./data/bigml.csv')
		imdb = imdb.loc[~imdb['movie'].isin(df['movie']), ['year', 'movie', 'movie_id']].reset_index(drop=True)

		tags = []
		for index, row in imdb.iterrows():
			print(str(index) + '. ' + row['movie'])
			id = row['movie_id']
			extra = collect_data.movie_tags(id)

			if extra is not None:
				tags.append([row['year'], row['movie'], row['movie_id']] + extra)

		# The default integer index is kept so that read_csv(..., index_col=0) above can read the file back
		imdb = pd.DataFrame(tags, columns=['year', 'movie', 'movie_id', 'certificate', 'duration', 'genre', 'rate',
										   'metascore', 'synopsis', 'votes', 'gross', 'user_reviews', 'critic_reviews',
										   'popularity', 'awards_wins', 'awards_nominations'])
		imdb.to_csv('./data/imdb.csv')

	return imdb

Now we have two dataframes (BIGML and our IMDB scrape) in similar formats. First, let's delete the unnecessary columns from BIGML, specifically release_date and every column after awards_nominations, which can be done with the drop function like so:

In [41]:
bigml.drop([bigml.columns[10]], axis=1, inplace=True)
bigml.drop(bigml.columns[16:], axis=1, inplace=True)
bigml.to_csv('./data/bigml.csv')
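
One detail worth spelling out: the first drop shifts every later column one position to the left, so the slice in the second call is evaluated against the already-shortened frame. Here is a minimal sketch on a toy frame (the column names are made up for illustration and are not the BIGML columns):

```python
import pandas as pd

# Hypothetical 6-column frame standing in for bigml
toy = pd.DataFrame([[1, 2, 3, 4, 5, 6]], columns=list('abcdef'))

# Drop the column at position 2 ('c'); 'd', 'e' and 'f' each shift left by one
toy = toy.drop(columns=[toy.columns[2]])

# Position 3 now points at 'e', so this removes 'e' and 'f'
toy = toy.drop(columns=toy.columns[3:])

remaining = list(toy.columns)
print(remaining)
```
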

Then, the only step left is to combine the two: read BIGML in, fill the NaN values in the gross and popularity columns with -1 to allow the model to train better, read in the IMDB dataset, append it, sort the result by year and movie, and reset the indices. The export to combined.csv happens later, at the end of add_award_points. Done!

In [63]:
def combine_datasets():
	"""
	Combines the BIGML dataset and our IMDB-scraped dataset. Also fills the NaN values in the gross and popularity
	columns with -1 for better model training and sorts the data based on year and movie.
	:return: combined dataset
	"""
	bigml = pd.read_csv('./data/bigml.csv', index_col=0)
	bigml = bigml.fillna(value={'gross': -1, 'popularity': -1})
	imdb = extract_movie_data()
	dataframe = pd.concat([bigml, imdb], sort=False, ignore_index=True)
	dataframe.sort_values(['year', 'movie'], axis=0, ascending=True, inplace=True)
	dataframe = dataframe.reset_index(drop=True)

	return dataframe
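
To make the steps concrete, here is a small sketch of the same fill, combine and sort sequence on toy data. The two frames and their values are invented for illustration, and pd.concat is used because newer pandas versions no longer provide DataFrame.append:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the BIGML frame and the IMDB scrape
bigml_toy = pd.DataFrame({'year': [2001, 2000], 'movie': ['B', 'C'],
                          'gross': [np.nan, 10.0], 'popularity': [5.0, np.nan]})
imdb_toy = pd.DataFrame({'year': [2000], 'movie': ['A'],
                         'gross': [3.0], 'popularity': [7.0]})

# Fill only the two columns of interest, leaving other NaN values untouched
bigml_toy = bigml_toy.fillna(value={'gross': -1, 'popularity': -1})

# Stack the frames, then order by year and title with a fresh 0..n-1 index
combined = pd.concat([bigml_toy, imdb_toy], sort=False, ignore_index=True)
combined = combined.sort_values(['year', 'movie']).reset_index(drop=True)
print(combined)
```
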

### 2.2 Splitting Genres

Instead of passing strings into the model, we should pass integers (IDs) that represent the individual genres. I chose to pass 3 genres per movie as input features. Movies with fewer than 3 genres listed are padded with -1, and those with more are cut off after the third. Here's a function to do that:

In [64]:
def split_genres(dataframe):
	"""
	Extracts the genre column from the final (combined) dataset, splits it into lists, and converts them into IDs
	based on genreID below.
	:param dataframe: the final dataframe
	:return: an edited dataframe with split genres
	"""
	# ID dictionary of all the genres
	genreID = {'Action': 0, 'Adult': 1, 'Adventure': 2, 'Animation': 3, 'Biography': 4, 'Comedy': 5, 'Crime': 6,
			   'Documentary': 7, 'Drama': 8, 'Family': 9, 'Fantasy': 10, 'Film': 11, 'Noir': 12, 'Game - Show': 13,
			   'History': 14, 'Horror': 15, 'Musical': 16, 'Music': 17, 'Mystery': 18, 'News': 19, 'Reality - TV': 20,
			   'Romance': 21, 'SciFi': 22, 'Short': 23, 'Sport': 24, 'Talk - Show': 25, 'Thriller': 26, 'War': 27,
			   'Western': 28}

	# Splits the first 3 genres of each movie into 3 different lists. If a movie only has 1 or 2 genre(s), then the empty spot is filled with -1
	genre = [i.replace('|', ', ') for i in list(dataframe.genre)]
	genre1 = []
	genre2 = []
	genre3 = []
	for i in genre:
		multipleGenres = [g.replace(',', '').replace('Sci-Fi', 'SciFi') for g in i.split()]

		if len(multipleGenres) <= 3:
			multipleGenres += [-1] * (3 - len(multipleGenres))
		genre1.append(multipleGenres[0])
		genre2.append(multipleGenres[1])
		genre3.append(multipleGenres[2])

	# Replaces the genres with IDs from genreID
	genre1 = [str(genreID.get(word, word)) for word in genre1]
	genre2 = [str(genreID.get(word, word)) for word in genre2]
	genre3 = [str(genreID.get(word, word)) for word in genre3]

	# Deletes the original genre column and inserts the 3 new genre columns
	dataframe.drop('genre', axis=1, inplace=True)
	dataframe.insert(5, 'genre1', genre1, True)
	dataframe.insert(6, 'genre2', genre2, True)
	dataframe.insert(7, 'genre3', genre3, True)
	return dataframe
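
For a quick feel of what happens to a single genre string, here is a simplified sketch. It is not the function above: it uses only a small subset of genreID, splits on the delimiters directly, and maps unknown genres to -1 (an assumption made for this example; split_genres leaves unknown names as they are):

```python
import re

# Small illustrative subset of the genre-to-ID mapping (not the full table)
GENRE_ID = {'Adventure': 2, 'Comedy': 5, 'Drama': 8, 'Family': 9, 'Romance': 21, 'SciFi': 22}


def genres_to_ids(genre_string, slots=3):
    """Split a '|' or ',' separated genre string and return exactly `slots` IDs."""
    names = [g.strip().replace('Sci-Fi', 'SciFi')
             for g in re.split(r'[|,]', genre_string) if g.strip()]
    # Keep at most `slots` genres; unknown names become -1 in this sketch
    ids = [GENRE_ID.get(name, -1) for name in names[:slots]]
    # Pad with -1 when fewer than `slots` genres are listed
    return ids + [-1] * (slots - len(ids))


print(genres_to_ids('Adventure|Comedy|Family'))
print(genres_to_ids('Comedy, Romance'))
```

Both delimiter styles appear in the combined dataset ("Adventure|Comedy|Family" from BIGML and "Comedy, Romance" from the IMDB scrape), which is why the sketch accepts either one.
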

### 2.3 Category IDs
If you've been following the function code above, you might wonder what id_categories(name, cs, aw) does. To understand it, we have to look at awards_categories.csv (which I created manually):

In [43]:
award_categories = pd.read_csv('./data/awards_categories.csv')
award_categories.head(10)

Unnamed: 0,Oscars,Golden globe,bafta,screen actors guild,directors guild,producers guild,art directors guild,writers guild,costume designers guild,online film television association,online film critics society,critics choice,london critics circle film,american cinema editors
0,picture 0,best motion picture – drama,film,,feature film,Outstanding Producer of Theatrical Motion Pict...,period feature film,,,best picture,best picture,best picture,film of the year,feature film (dramatic)
1,,Best motion picture - musical/comedy,,,,,fantasy feature film,,,,,best action movie,,feature film (comedy)
2,,,,,,,contemporary feature film,,,,,best comedy,,
3,,,,,,,,,,,,best sci-fi/horror movie,,
4,,,,,,,,,,,,,,
5,actor 1,best performance by an actor in a motion pictu...,Leading actor,outstanding performance by a male actor in a l...,,,,,,best actor,best actor,best actor,actor of the year,
6,,best performance by an actor in a motion pictu...,,,,,,,,best breakthrough performance: male,,best actor in a comedy,,
7,actress 2,best performance by an actress in a motion pic...,Leading actress,outstanding performance by a female actor in a...,,,,,,best actress,best actress,best actress,actress of the year,
8,,best performance by an actress in a motion pic...,,,,,,,,best breakthrough performance: female,,best actress in a comedy,,
9,supporting actor 3,best performance by an actor in a supporting r...,supporting actor,outstanding performance by a male actor in a s...,,,,,,best supporting actor,best supporting actor,best supporting actor,supporting actor of the year,


As shown, all of the categories from all the award ceremonies are linked together with similar ones by tagging them with IDs. Of course, these IDs (0~23) all ultimately refer to the 24 Oscar categories we want to predict.

A few key points here:
1. Some Oscar categories are linked to MULTIPLE categories (ex: Oscar best picture = Golden Globe best motion picture drama & best motion picture musical/comedy).
2. NaN values exist, meaning that not every ceremony has all 24 needed categories. This is fine.
3. Because the IDs are assigned manually per category, the algorithm starts out biased towards the idea that winning a certain category doesn't directly affect winning another category, which is a fine assumption. Maybe the algorithm will learn that it does as it trains.

So let's go back to the function id_categories. After retrieving all the category titles, the category names are replaced with ID integers ONLY if they exist, which is what the built-in next() calls (with None as the default) are responsible for.
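
The next() pattern is easier to see in isolation. Below is a minimal sketch with made-up category titles; pairing the matches with IDs through a dictionary is just one way to illustrate the idea and is not how id_categories itself stores them:

```python
# Hypothetical category titles scraped for one ceremony in one year
titles = ['Best Film', 'Best Actor', 'Best Supporting Actor', 'Best Director']

# next() returns the first title satisfying the condition,
# or the default (None) when no title matches
lead_actor = next((s for s in titles if 'Actor' in s and 'Supporting' not in s), None)
supporting_actor = next((s for s in titles if 'Actor' in s and 'Supporting' in s), None)
documentary = next((s for s in titles if 'Documentary' in s), None)

# Keep only the (title, Oscar ID) pairs that actually matched
matches = [(lead_actor, 1), (supporting_actor, 3), (documentary, 9)]
title_to_id = {title: oscar_id for title, oscar_id in matches if title is not None}
print(title_to_id)
```

Since this ceremony has no documentary category, that lookup falls back to None and is simply skipped.
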

### 2.4 Award points

Now that we have IDs for all categories from all of the award ceremonies, we can give "award points" in order to specify to the algorithm which movie won and which movie got nominated. For example, if we give winners 1 point and nominees 0.5 points:

| Result | Points |
| --- | --- |
| Winner | 1 |
| Nominee | 0.5 |
| Nothing | 0 |

Seems intuitive, right? But let's think about this more. Why should a nominee be given half the winner's points? There's 1 winner, but usually around 5 nominees. So how about this:

| Result | Points |
| --- | --- |
| Winner | 1 |
| Nominee | 1/(nominee_count) |
| Nothing | 0 |

Better. Below is the function to do the magic.

In [65]:
def add_award_points(dataframe):
	"""
	Adds points to movies in categories that it won / was nominated in from all 14 award ceremonies. 1 point for winner,
	1/(number of nominees) points for nominee, and 0 points for neither.
	:param dataframe: final (combined) dataset
	:return: edited dataset with points added in
	"""
	if os.path.exists('./data/categories') and os.path.exists('./data/awards') and os.path.exists('./data/oscar_cs') and os.path.exists('./data/oscar_aw'):
		with open('./data/categories', 'rb') as f:
			categories = pickle.load(f)
		with open('./data/awards', 'rb') as f:
			awards = pickle.load(f)
		with open('./data/oscar_cs', 'rb') as f:
			oscar_cs = pickle.load(f)
		with open('./data/oscar_aw', 'rb') as f:
			oscar_aw = pickle.load(f)
	else:
		categories = []
		awards = []
		oscar_cs = []
		oscar_aw = []
		for y in range(2000, 2019):
			print(y)
			results = collect_data.scrape_movie_awards(y)
			categories.append(results[0])
			awards.append(results[1])
			oscar_cs.append(results[2])
			oscar_aw.append(results[3])
		with open('./data/categories', 'wb') as f:
			pickle.dump(categories, f)
		with open('./data/awards', 'wb') as f:
			pickle.dump(awards, f)
		with open('./data/oscar_cs', 'wb') as f:
			pickle.dump(oscar_cs, f)
		with open('./data/oscar_aw', 'wb') as f:
			pickle.dump(oscar_aw, f)

	# Adds points to all of the movies that have won/been nominated for awards in all categories (except Oscar)
	start = dataframe.columns.get_loc('best_picture')

	# Ensures that all movies' award points start at 0
	for i in dataframe.columns[start:]:
		dataframe[i] = 0

	for i, year in enumerate(categories):
		for j, event in enumerate(year):
			for k, award in enumerate(event):
				for l, movie in enumerate(awards[i][j][k]):
					index = dataframe.index[(dataframe.movie == movie)&((dataframe.year == 2000 + i)|(dataframe.year == 2000 + i + 1)|(dataframe.year == 2000 + i - 1))]
					if len(index) != 0:
						print(str(i) + ", " + str(j) + ", " + str(k) + ", " + str(l))
						print(movie + '\n')
						if l == 0: points = 1
						else: points = 1.0/len(awards[i][j][k])
						dataframe.loc[index[0], dataframe.columns[start + int(award)]] += points

	# Oscar points for data labels
	oscar_start = dataframe.columns.get_loc('oscar_best_picture')
	for i, year in enumerate(oscar_cs):
		for j, award in enumerate(year):
			for l, movie in enumerate(oscar_aw[i][j]):
				index = dataframe.index[(dataframe.movie == movie)&((dataframe.year == 2000 + i)|(dataframe.year == 2000 + i + 1)|(dataframe.year == 2000 + i - 1))]
				if len(index) != 0:
					print(str(i) + ", " + str(j) + ", " + str(l))
					print(movie)
					print(str(dataframe.loc[index[0], dataframe.columns[oscar_start + int(award)]]))
					if l == 0: points = 1
					else: points = 1.0/len(oscar_aw[i][j])
					dataframe.loc[index[0], dataframe.columns[oscar_start + int(award)]] = points
					print(str(dataframe.loc[index[0], dataframe.columns[oscar_start + int(award)]]) + '\n')

	# Averages the award points by dividing each category's total by the number of award ceremonies that include it
	N = [11, 7, 7, 7, 7, 8, 4, 4, 5, 8, 1, 4, 6, 3, 4, 4, 3, 1, 1, 3, 1, 4, 6, 5]
	# Uses start (the position of 'best_picture') rather than a hard-coded column number
	for i, col in enumerate(dataframe.columns[start:oscar_start]):
		dataframe[col] /= N[i]

	dataframe.to_csv('./data/combined.csv')
	return dataframe
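
The scoring rule itself is small enough to check by hand. Here is a sketch of it on one invented category, assuming (as in the loops above) that the winner is the first entry of each listing and that the nominee share is 1 divided by the full length of that listing. The helper name award_points and the movie names are hypothetical:

```python
def award_points(listing):
    """Return {movie: points} for one category listing, assuming the winner comes first."""
    if not listing:
        return {}
    share = 1.0 / len(listing)  # nominee share, counting the winner in the total
    return {movie: (1.0 if position == 0 else share)
            for position, movie in enumerate(listing)}


# Hypothetical category: one winner followed by four other nominees
points = award_points(['Winner', 'Nominee A', 'Nominee B', 'Nominee C', 'Nominee D'])

# A movie that won at two ceremonies and was a nominee at a third (made-up numbers)
total = 1.0 + 1.0 + points['Nominee A']
average = total / 11  # 11 is the first entry of N, used for the best picture column
print(points)
print(round(average, 3))
```
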

Slick! After running add_award_points, let's quickly check what our final dataset looks like before we move on.

In [54]:
df = pd.read_csv('./data/combined.csv', index_col=0)
print(df.shape)
df.head(3)

(5232, 64)


Unnamed: 0,year,movie,movie_id,certificate,duration,genre,rate,metascore,synopsis,votes,...,oscar_best_music_score,oscar_best_music_song,oscar_best_production_design,oscar_best_short_animated,oscar_best_short_live,oscar_best_sound_editing,oscar_best_sound_mixing,oscar_best_visual_effects,oscar_best_writing_adapted,oscar_best_writing_original
0,2000,101 Reykjavík,tt0237993,Not Rated,88,"Comedy, Romance",6.9,68.0,Will the 30 y.o. Hlynur ever move out of his m...,8989,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2000,102 Dalmatians,tt0211181,G,100,Adventure|Comedy|Family,4.8,35.0,Cruella DeVil gets out of prison and goes afte...,27364,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2000,28 Days,tt0191754,PG-13,103,"Comedy, Drama",6.0,46.0,A big-city newspaper columnist is forced to en...,40466,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## 3. The Algorithm

