# Project PTO - Predicting the Oscars


![Who will be the Oscar nominees and winners?](./imgs/oscars.jpg)

This notebook documents my progress on building an algorithm to predict Oscar results: not just the winners among the nominees in each category, but also which movies will be nominated out of all the movies released that year. The project repository can be cloned from https://github.com/j0hnk1m/predict-the-oscars if you're interested.

This mini-project only lasted a month, so it isn't perfect. Hopefully, you can use it as a starting point for even better predictions. Let's dig in! I'll be using Anaconda Python 3.7. You may need to install a few packages yourself with `conda install {package}`.

There are 4 major sections to cover:
1. Data Collection
2. Data Organization/Manipulation
3. The Algorithm
4. Algorithm Optimization

## 1. Data Collection

### 1.1 The BIGML dataset

First things first: the general imports.

In [32]:
# collect_data.py
import pandas as pd
import numpy as np
import os
import requests
import re

I came across this dataset at https://bigml.com/user/academy_awards/gallery/dataset/5c6886e1eba31d73070017f5. It contains a list of movies from 2000 to 2018 and their details, including release year, movie_id (IMDB), certificate, duration, genre, IMDB rating, etc.

Here's what it looks like:

In [33]:
bigml = pd.read_csv('./data/bigml.csv')
print(bigml.shape)
bigml.head(5)

(1235, 17)


Unnamed: 0.1,Unnamed: 0,year,movie,movie_id,certificate,duration,genre,rate,metascore,synopsis,votes,release_date,user_reviews,critic_reviews,popularity,awards_wins,awards_nominations
0,0,2001,Kate & Leopold,tt0035423,PG-13,118,Comedy|Fantasy|Romance,6.4,44.0,An English Duke from 1876 is inadvertedly drag...,66660,2001-12-25,318.0,125.0,2363.0,1,4
1,1,2000,Chicken Run,tt0120630,G,84,Animation|Adventure|Comedy,7.0,88.0,When a cockerel apparently flies into a chicke...,144475,2000-06-23,361.0,186.0,2859.0,5,11
2,2,2005,Fantastic Four,tt0120667,PG-13,106,Action|Adventure|Family,5.7,40.0,A group of astronauts gain superpowers after a...,273203,2005-07-08,1008.0,278.0,1876.0,0,0
3,3,2002,Frida,tt0120679,R,123,Biography|Drama|Romance,7.4,61.0,"A biography of artist Frida Kahlo, who channel...",63852,2002-11-22,272.0,126.0,2508.0,2,12
4,4,2001,The Lord of the Rings: The Fellowship of the Ring,tt0120737,PG-13,178,Adventure|Drama|Fantasy,8.8,92.0,A meek Hobbit from the Shire and eight compani...,1286275,2001-12-19,5078.0,296.0,204.0,26,67


The original BIGML dataset has a lot of variables (119) for each of its 1235 movies. The file loaded here has 17 columns, as the shape above shows. If we print them out:

In [34]:
print(bigml.columns)

Index(['Unnamed: 0', 'year', 'movie', 'movie_id', 'certificate', 'duration',
       'genre', 'rate', 'metascore', 'synopsis', 'votes', 'release_date',
       'user_reviews', 'critic_reviews', 'popularity', 'awards_wins',
       'awards_nominations'],
      dtype='object')


This is a good start, but there are some issues. First, there's an unnecessary number of variables that can be condensed or grouped together, especially the award vectors. Second, there are NaN values. Third, there isn't enough data to work with if the goal is to predict which movies will be Oscar NOMINEES and in which category.
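A quick way to see how bad the NaN problem is: count the missing values per column. The sketch below uses a made-up three-row CSV rather than the real bigml.csv (the titles, numbers, and gaps are placeholders). It also shows `index_col=0`, which keeps a saved index from coming back as an `Unnamed: 0` column.

```python
import io
import pandas as pd

# Made-up sample shaped like a saved dataframe: the first, unnamed column is the old index.
sample_csv = io.StringIO(
    ",year,movie,metascore,popularity\n"
    "0,2001,Movie A,44.0,2363.0\n"
    "1,2000,Movie B,,2859.0\n"
    "2,2005,Movie C,40.0,\n"
)

# index_col=0 uses the unnamed first column as the index instead of
# loading it as an extra 'Unnamed: 0' column.
sample = pd.read_csv(sample_csv, index_col=0)

# Count missing values per column to see which ones need filling later.
nan_counts = sample.isna().sum()
print(list(sample.columns))
print(nan_counts[nan_counts > 0])
```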

The first two issues are easily fixable later. To fix the third issue, let's start data scraping. 

### 1.2 Web-scraping movie lists from IMDB

We can create a new Python file called "collect_data.py" containing the imports above and the following function. If you're confused, refer to the files in the GitHub repository.

The following function, "imdb_feature_film", takes a year from 2000 to 2018 and returns a dataframe of movies scraped from IMDB's feature film lists, along with the year they were released and their IMDB movie IDs. Each page lists 50 movies, so the loop as written (5 pages) collects 250 movies per year, which is what the output below shows. For the full dataset I chose 350 movies per year (7 pages) simply because it looked like a good balance between having enough data and filtering out really, really weird + indie movies not likely to win Oscars anytime soon. You can change the number by changing the upper bound of `range` in the for loop.


In [35]:
def imdb_feature_film(year):
    """
    Scrapes data of movie titles from IMDB
    :param year: any year from 2000~2018
    :return: a dataframe of movies, their respective IMDB IDs, and release years.
    """
    # Example link where this function scrapes data from: https://www.imdb.com/year/2018/

    print(year)
    html = requests.get("https://www.imdb.com/year/" + str(year)).text

    movies = np.zeros((0, 2))
    for i in range(0, 5):  # _ pages of 50 movies each
        movies = np.concatenate([movies, np.flip(np.array(re.findall(r'<a href="/title/([^:?%]+?)/"[\r\n]+> <img alt="([^%]+?)"[\r\n]+', html)))])
        nextLink = "https://www.imdb.com" + re.findall(r'<a href="(/search/title\?title_type=feature&year=(?:.*)&start=(?:.*))"[\r\n]+class="lister-page-next next-page"', html)[0]
        html = requests.get(nextLink).text

    df = pd.DataFrame(movies, columns=['movie', 'movie_id'])
    df.insert(0, 'year', [year]*movies.shape[0], True)
    return df
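
To see what the regex and `np.flip` are doing, here's a sketch on a made-up HTML fragment (a simplified stand-in for the real page, with placeholder IDs and titles). Note that `np.flip` with no `axis` argument reverses both the rows and the columns, which is why the title ends up in the first column and the page order comes out reversed.

```python
import re
import numpy as np

# Made-up, simplified stand-in for two entries of an IMDB list page.
sample_html = (
    '<a href="/title/tt0000001/"\n> <img alt="First Movie"\nclass="loadlate"/>\n'
    '<a href="/title/tt0000002/"\n> <img alt="Second Movie"\nclass="loadlate"/>\n'
)

# Same pattern as in imdb_feature_film: group 1 is the movie ID, group 2 the title.
pattern = r'<a href="/title/([^:?%]+?)/"[\r\n]+> <img alt="([^%]+?)"[\r\n]+'
pairs = np.array(re.findall(pattern, sample_html))  # rows of (movie_id, title)

flipped_all = np.flip(pairs)    # reverses rows AND columns
swapped_cols = pairs[:, ::-1]   # swaps columns only, keeps page order

print(flipped_all)
print(swapped_cols)
```

If you'd rather keep IMDB's ordering, swapping only the columns with `[:, ::-1]` is one option.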

Let's see the pandas dataframe of web-scraped movies from 2018 (It will take some time).

In [36]:
df_2018 = imdb_feature_film(2018)
print(df_2018.shape)
df_2018.head(10)

2018
(250, 3)


Unnamed: 0,year,movie,movie_id
0,2018,12 Strong,tt1413492
1,2018,Mary Queen of Scots,tt2328900
2,2018,Annihilation,tt2798920
3,2018,Tag,tt2854926
4,2018,Robin Hood,tt4532826
5,2018,Replicas,tt4154916
6,2018,Bad Times at the El Royale,tt6628394
7,2018,Dragged Across Concrete,tt6491178
8,2018,Mid90s,tt5613484
9,2018,Mortal Engines,tt1571234


Awesome, now we can move on to the next step: movie tags.

### 1.3 Web-scraping movie tags

Now that we have the movies and their IMDB movie IDs, we can create a function that uses that information and regex to scrape the tags we may need to build an accurate prediction algorithm. The tags, which are the same as the ones in the BIGML dataset, are listed as comments in the function.

In [37]:
def movie_tags(id):
    """
    Scrapes data of movie tags/details/variables from IMDB based on the movie IDs
    :param id: movie id (IMDB)
    :return: list of its tags/variables to be used as input variables.
    """
    html = requests.get("https://www.imdb.com/title/" + id).text
    # ---------------TAGS---------------
    # certificate
    # duration
    # genre
    # rate
    # metascore
    # synopsis
    # votes
    # gross
    # user reviews
    # critic reviews
    # popularity
    # awards wins
    # awards nominations

    genre = re.findall('"genre": ([\s\S]+),\\n[\s\S]+"contentRating":', html)
    certificate = re.findall('"contentRating": "(.*)",\\n[\s\S]+<strong', html)
    rate = re.findall('<strong title="(.*) based on ', html)
    votes = re.findall('based on ([,0-9]+) user ratings">', html)
    user_reviews = re.findall('<span itemprop="reviewCount">([,0-9]+) user</span>', html)
    critic_reviews = re.findall('<span itemprop="reviewCount">([,0-9]+) critic</span>', html)
    duration = re.findall('<time datetime="PT(\d+)M">\\n', html)
    keywords = re.findall('<div class="summary_text">\\n(.*)\\n', html)
    metascore = re.findall('<div class="metacriticScore score_[\w]+ titleReviewBarSubItem">\\n<span>([0-9]+)<', html)

    # Skip the movie if any required tag is missing. The synopsis is checked here too,
    # so a page without one returns None instead of raising an IndexError.
    required = [genre, certificate, rate, votes, user_reviews, critic_reviews, duration, keywords, metascore]
    if any(len(r) == 0 for r in required):
        return None
    keywords = keywords[0].strip()
    genre = ' '.join(genre[0].split()).replace('"', '').replace('[ ', '').replace(' ]', '')
    certificate = certificate[0]
    rate = float(rate[0])
    votes = int(votes[0].replace(',', ''))
    user_reviews = int(user_reviews[0].replace(',', ''))
    critic_reviews = int(critic_reviews[0].replace(',', ''))
    duration = int(duration[0].replace(',', ''))
    metascore = int(metascore[0])

    popularity = re.findall('titleReviewBarSubItem">\\n<span>[0-9]+<[\s\S]+ ([,0-9]+)\\n[\s\S]+\(<span class="titleOverviewSprite popularity', html)
    if len(popularity) == 0:
        popularity = -1
    else:
        popularity = int(popularity[0].replace(',', ''))

    awards_wins = re.findall('<span class="awards-blurb">[\s\S]+ (\d+) wins', html)
    if len(awards_wins) == 0:
        awards_wins = 0
    else:
        awards_wins = int(awards_wins[0])

    awards_nominations = re.findall('<span class="awards-blurb">[\s\S]+ (\d+) nominations', html)
    if len(awards_nominations) == 0:
        awards_nominations = 0
    else:
        awards_nominations = int(awards_nominations[0])

    gross = re.findall('Gross USA:</h4> \$([,0-9]+)', html)
    if len(gross) == 0:
        gross = -1
    else:
        gross = int(gross[0].replace(',', ''))

    tags = [certificate, duration, genre, rate, metascore, keywords, votes, gross, user_reviews, critic_reviews,
            popularity, awards_wins, awards_nominations]
    return tags
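
The four `if len(...) == 0` blocks at the end of `movie_tags` all follow the same pattern, which could be folded into one helper. A sketch, where `first_int` is a hypothetical name (not in the repository) and the HTML fragment is made up for illustration:

```python
import re

def first_int(matches, default):
    """Return the first regex match as an int (commas removed), or `default` if nothing matched."""
    if not matches:
        return default
    return int(matches[0].replace(',', ''))

# Made-up fragment of a title page, for illustration only.
sample_html = 'Gross USA:</h4> $85,080,171'
gross_pattern = r'Gross USA:</h4> \$([,0-9]+)'

gross = first_int(re.findall(gross_pattern, sample_html), default=-1)
missing = first_int(re.findall(gross_pattern, 'no gross listed'), default=-1)
print(gross, missing)
```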

Again, let's check that the tags are correct using the movie 'Green Book', which won the Oscar for Best Picture this year (definitely recommend watching it).

![Green Book](./imgs/green_book.jpg)

In [38]:
tags = movie_tags('tt6966692')
print(tags)

['PG-13', 130, 'Biography, Comedy, Drama, Music', 8.3, 69, 'A working-class Italian-American bouncer becomes the driver of an African-American classical pianist on a tour of venues through the 1960s American South.', 200532, 85080171, 1065, 362, 76, 50, 90]


Oh baby, we're in business now.
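
Since `movie_tags` returns a bare list, it can help to label each value while checking it. A minimal sketch, assuming the order of the function's return statement and the column names used later for imdb.csv; `TAG_NAMES` is a name introduced here for illustration:

```python
# Names in the order movie_tags() returns them (its `keywords` variable holds the synopsis).
TAG_NAMES = ['certificate', 'duration', 'genre', 'rate', 'metascore', 'synopsis', 'votes',
             'gross', 'user_reviews', 'critic_reviews', 'popularity', 'awards_wins',
             'awards_nominations']

# The list printed above for 'Green Book', copied here so the snippet runs without a network call.
tags = ['PG-13', 130, 'Biography, Comedy, Drama, Music', 8.3, 69,
        'A working-class Italian-American bouncer becomes the driver of an African-American '
        'classical pianist on a tour of venues through the 1960s American South.',
        200532, 85080171, 1065, 362, 76, 50, 90]

# Pair every value with its name so mistakes in the ordering are easy to spot.
labeled = dict(zip(TAG_NAMES, tags))
for name, value in labeled.items():
    print(name + ': ' + str(value))
```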

### 1.4 Web-scraping award ceremony results

Now that we have the movie tags, all that's left for data collection is the awards it won and was nominated for. Unfortunately, it's not so simple. Let's see why.

First, we need to choose which award ceremonies' data we want to use for input. The Oscars results are what we want to predict, so they will be used for output labels. There are 14 ceremonies in total: Golden Globe, BAFTA, Screen Actors Guild, Directors Guild, Producers Guild, Art Directors Guild, Writers Guild, Costume Designers Guild, Online Film Television Association, Online Film Critics Society, Critics Choice, London Critics Circle Film, American Cinema Editors, and Academy Awards/Oscars.
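
As a small aside, here is a sketch of how the event pages are addressed. The event IDs are the ones used in `scrape_movie_awards`; pairing them with ceremony names by list order is an assumption, and `CEREMONY_EVENTS` and `event_url` are hypothetical names, not part of the repository.

```python
# Ceremony names paired with the event IDs from scrape_movie_awards.
# The pairing is inferred from the order of the two lists, so treat it as an assumption.
CEREMONY_EVENTS = {
    'Golden Globe': 'ev0000292',
    'BAFTA': 'ev0000123',
    'Screen Actors Guild': 'ev0000598',
    'Directors Guild': 'ev0000212',
    'Producers Guild': 'ev0000531',
    'Art Directors Guild': 'ev0000618',
    'Writers Guild': 'ev0000710',
    'Costume Designers Guild': 'ev0000190',
    'Online Film Television Association': 'ev0002704',
    'Online Film Critics Society': 'ev0000511',
    'Critics Choice': 'ev0000133',
    'London Critics Circle Film': 'ev0000403',
    'American Cinema Editors': 'ev0000017',
    'Academy Awards': 'ev0000003',
}

def event_url(event_id, film_year):
    """Build the event URL for films released in `film_year`.

    The scraper requests film_year + 1, presumably because each ceremony
    takes place the year after the films were released.
    """
    return "https://www.imdb.com/event/" + event_id + "/" + str(film_year + 1) + "/1?ref_=ttawd_ev_1"

print(event_url(CEREMONY_EVENTS['Academy Awards'], 2018))
```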

Functions to scrape the award winners + nominees from these award ceremonies:

In [39]:
def scrape_movie_awards(year):
    """
    Given a year, scrapes data off of IMDB for the results of 14 different award ceremonies and the categories involved.
    :param year: integer year from 2000~2018
    :return: 12 ceremonies' award categories, 12 ceremonies' award results, Oscar categories, Oscar results
    """
    events = ['ev0000292', 'ev0000123', 'ev0000598', 'ev0000212', 'ev0000531', 'ev0000618', 'ev0000710',
              'ev0000190', 'ev0002704', 'ev0000511', 'ev0000133', 'ev0000403', 'ev0000017', 'ev0000003']

    htmls = []
    for e in events:
        htmls.append(requests.get("https://www.imdb.com/event/" + e + "/" + str(year + 1) + "/1?ref_=ttawd_ev_1").text)
    # ---------------AWARDS---------------
    # 1. Golden Globe
    # 2. BAFTA
    # 3. Screen Actors Guild
    # 4. Directors Guild
    # 5. Producers Guild
    # 6. Art Directors Guild
    #    (Writers Guild, htmls[6], is downloaded but not parsed below)
    # 7. Costume Designers Guild
    # 8. Online Film Television Association
    # 9. Online Film Critics Society
    # 10. Critics Choice
    # 11. London Critics Circle Film
    # 12. American Cinema Editors

    # 13. Oscar

    gg_categories = [i for i in re.findall('"categoryName":"([^"]*?)","nominations"', htmls[0]) if 'Television' not in i][:14]
    gg = []
    for c in gg_categories:
        if 'Actor' in c or 'Actress' in c or 'Director' in c or (year == 2014 and 'Original Score' in c):
            gg.append(re.findall(re.escape(c) + '",(?:.*?)"secondaryNominees":\[{"name":"([^"]*?)","note":null', htmls[0])[:-1])
        else:
            gg.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[0])[:-1])
    gg_categories, gg = id_categories('gg', gg_categories, gg)

    bafta_categories = [i for i in re.findall('"categoryName":"([^"]*?)","nominations"', htmls[1]) if 'British' not in i and 'Best' in i and 'Series' not in i and 'Television' not in i and 'Features' not in i][:19]
    bafta = []
    for c in bafta_categories:
        if 'Actor' in c or 'Actress' in c or 'Director' in c:
            bafta.append(re.findall(re.escape(c) + '",(?:.*?)"secondaryNominees":\[{"name":"([^"]*?)","note":null', htmls[1])[:5])
        else:
            bafta.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[1])[:-1])
    bafta_categories, bafta = id_categories('bafta', bafta_categories, bafta)

    sag_categories = [i for i in re.findall('"categoryName":"([^"]*?)","nominations"', htmls[2]) if 'Series' not in i and 'Motion Picture' not in i and 'Stunt' not in i and 'Cast' not in i][:4]
    sag = []
    for c in sag_categories:
        if 'Actor' in c or 'Actress' in c or 'Director' in c:
            sag.append(re.findall(re.escape(c) + '",(?:.*?)"secondaryNominees":\[{"name":"([^"]*?)","note":null', htmls[2])[:-1])
        else:
            sag.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[2])[:-1])
    sag_categories, sag = id_categories('sag', sag_categories, sag)

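    # Keep feature film / motion picture / documentary categories, excluding first-time awards.
    # Each keyword needs its own `in i` test: a bare string such as 'Motion' is always truthy.
    dg_categories = [i for i in re.findall('"categoryName":"([^"]*?)","nominations"', htmls[3]) if ('Feature Film' in i or 'Motion' in i or 'Documentary' in i) and 'First' not in i][:2]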
    dg = []
    for c in dg_categories:
        dg.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[3])[:-1])
    dg_categories, dg = id_categories('dg', dg_categories, dg)

    pg_categories = [i for i in re.findall('"categoryName":"([^"]*?)","nominations"', htmls[4]) if 'Producer of' in i and 'Theatrical Motion Pictures' in i][:3]
    pg = []
    for c in pg_categories:
        if year >= 2004:
            pg.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[4])[:-1])
        else:
            pg.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[4]))
    pg_categories, pg = id_categories('pg', pg_categories, pg)

    adg_categories = [i for i in re.findall('"categoryName":"([^"]*)","nominations"', htmls[5]) if 'Film' in i][:4]
    adg = []
    for c in adg_categories:
        if year == 2001 and c == 'Fantasy Film':
            adg.append(['A.I. Artificial Intelligence'])
        else:
            adg.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[5])[:-1])
    adg_categories, adg = id_categories('adg', adg_categories, adg)

    cdg_categories  = [i for i in re.findall('"categoryName":"([^"]*?)","nominations"', htmls[7]) if 'Contemporary Film' in i
                      or 'Period Film' in i or 'Fantasy Film' in i][:3]
    cdg = []
    for c in cdg_categories:
        cdg.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[7])[:-1])
    cdg_categories, cdg = id_categories('cdg', cdg_categories, cdg)

    ofta_categories = [i for i in re.findall('"categoryName":"([^"]*?)","nominations"', htmls[8]) if 'Series' not in i and 'Ensemble' not in i
                       and 'Television' not in i and 'Actors and Actresses' not in i and 'Creative' not in i and 'Program' not in i
                       and 'Behind' not in i and 'Debut' not in i and 'Poster' not in i and 'Trailer' not in i and 'Stunt' not in i and
                       'Sequence' not in i and 'Voice-Over' not in i and 'Youth' not in i and 'Cinematic' not in i and 'Casting' not in i and 'Acting' not in i][:23]
    ofta = []
    for c in ofta_categories:
        if 'Actor' in c or 'Actress' in c or 'Director' in c:
            ofta.append(re.findall(re.escape(c) + '",(?:.*?)"secondaryNominees":\[{"name":"([^"]*?)","note":null', htmls[8])[:-1])
        else:
            ofta.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[8])[:-1])
    ofta_categories, ofta = id_categories('ofta', ofta_categories, ofta)

    ofcs_categories = [i for i in re.findall('"categoryName":"([^"]*?)","nominations"', htmls[9]) if 'Debut' not in i
                       and 'Stunt' not in i and 'Television' not in i and 'Series' not in i][:18]
    ofcs = []
    for c in ofcs_categories:
        if 'Actor' in c or 'Actress' in c or 'Director' in c:
            ofcs.append(re.findall(re.escape(c) + '",(?:.*?)"secondaryNominees":\[{"name":"([^"]*?)","note":null', htmls[9])[:-1])
        else:
            ofcs.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[9])[:-1])
    ofcs_categories, ofcs = id_categories('ofcs', ofcs_categories, ofcs)

    cc_categories = [i for i in re.findall('"categoryName":"([^"]*?)","nominations"', htmls[10]) if 'Series' not in i
                     and 'Young' not in i and 'Ensemble' not in i and 'TV' not in i and 'Television' not in i and 'Show' not in i][:23]
    cc = []
    for c in cc_categories:
        if 'Actor' in c or 'Actress' in c or 'Director' in c:
            cc.append(re.findall(re.escape(c) + '",(?:.*?)"secondaryNominees":\[{"name":"([^"]*?)","note":null', htmls[10])[:-1])
        else:
            cc.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[10])[:-1])
    cc_categories, cc = id_categories('cc', cc_categories, cc)

    lccf_categories = [i for i in re.findall('"categoryName":"([^"]*?)","nominations"', htmls[11]) if 'British' not in i
                       and 'Technical' not in i and 'Screenwriter' not in i and 'Television' not in i][:8]
    lccf = []
    for c in lccf_categories:
        if 'Actor' in c or 'Actress' in c or 'Director' in c:
            lccf.append(re.findall(re.escape(c) + '",(?:.*?)"secondaryNominees":\[{"name":"([^"]*?)","note":null', htmls[11])[:-1])
        else:
            lccf.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[11])[:-1])
    lccf_categories, lccf = id_categories('lccf', lccf_categories, lccf)

    ace_categories = [i for i in re.findall('"categoryName":"([^"]*?)","nominations"', htmls[12]) if 'Series' not in i
                      and 'Non-Theatrical' not in i and 'Television' not in i and 'Student' not in i][:4]
    ace = []
    for c in ace_categories:
        if 'Actor' in c or 'Actress' in c or 'Director' in c:
            ace.append(re.findall(re.escape(c) + '",(?:.*?)"secondaryNominees":\[{"name":"([^"]*?)","note":null', htmls[12])[:-1])
        else:
            ace.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[12])[:-1])
    ace_categories, ace = id_categories('ace', ace_categories, ace)

    oscar_categories = [i for i in re.findall('"categoryName":"([^"]*?)","nominations"', htmls[13])][:24]
    oscar = []
    for c in oscar_categories:
        if c == oscar_categories[-1]:
            if 'Actor' in c or 'Actress' in c or 'Director' in c:
                oscar.append(re.findall(re.escape(c) + '",(?:.*?)"secondaryNominees":\[{"name":"([^"]*?)","note":null', htmls[13]))
            else:
                oscar.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[13]))
        else:
            if 'Actor' in c or 'Actress' in c or 'Director' in c:
                oscar.append(re.findall(re.escape(c) + '",(?:.*?)"secondaryNominees":\[{"name":"([^"]*?)","note":null', htmls[13])[:-1])
            else:
                oscar.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[13])[:-1])
    oscar_categories, oscar = id_categories('oscar', oscar_categories, oscar)

    return [gg_categories, bafta_categories, sag_categories, dg_categories, pg_categories, adg_categories, cdg_categories, ofta_categories, ofcs_categories, cc_categories, lccf_categories, ace_categories],\
           [gg, bafta, sag, dg, pg, adg, cdg, ofta, ofcs, cc, lccf, ace], oscar_categories, oscar


def id_categories(name, cs, aw):
    """
    This function is specifically called by scrape_movie_awards to link similar categories across award ceremonies by
    tagging them with IDs.
    :param name: award ceremony id/name
    :param cs: list of categories
    :param aw: list of award winners/nominees
    :return: list of categories ids (0~23) and list of award winners based on the available categories
    """
    if name == 'gg':
        replace = [next((s for s in cs if 'Best Motion Picture' in s and 'Drama' in s), None),
                   next((s for s in cs if 'Best Motion Picture' in s and 'Comedy' in s), None),
                   next((s for s in cs if 'Actor' in s and 'Drama' in s and 'Supporting' not in s), None),
                   next((s for s in cs if 'Actor' in s and 'Comedy' in s and 'Supporting' not in s), None),
                   next((s for s in cs if 'Actress' in s and 'Drama' in s and 'Supporting' not in s), None),
                   next((s for s in cs if 'Actress' in s and 'Comedy' in s and 'Supporting' not in s), None),
                   next((s for s in cs if 'Actor' in s and 'Supporting' in s), None),
                   next((s for s in cs if 'Actress' in s and 'Supporting' in s), None),
                   next((s for s in cs if 'Animated' in s), None)]
        id = [0, 0, 1, 1, 2, 2, 3, 4, 5]
    elif name == 'bafta':
        replace = [next((s for s in cs if 'Best Film' in s), None),
             next((s for s in cs if 'Actor' in s and 'Supporting' not in s), None),
             next((s for s in cs if 'Actress' in s and 'Supporting' not in s), None),
             next((s for s in cs if 'Actor' in s and 'Supporting' in s), None),
             next((s for s in cs if 'Actress' in s and 'Supporting' in s), None),
             next((s for s in cs if 'Animated' in s and 'Short' not in s), None)]
        id = [0, 1, 2, 3, 4, 5]
    elif name == 'sag':
        replace = [next((s for s in cs if 'Male' in s and 'Supporting' not in s), None),
             next((s for s in cs if 'Female' in s and 'Supporting' not in s), None),
             next((s for s in cs if 'Male' in s and 'Supporting' in s), None),
             next((s for s in cs if 'Female' in s and 'Supporting' in s), None)]
        id = [1, 2, 3, 4]
    elif name == 'dg':
        replace = [next((s for s in cs if 'Feature' in s), None)]
        id = [0]
    elif name == 'pg':
        replace = [next((s for s in cs if 'Producer of Theatrical' in s), None),
             next((s for s in cs if 'Animated' in s), None)]
        id = [0, 5]
    elif name == 'adg':
        replace = [next((s for s in cs if 'Period' in s), None),
             next((s for s in cs if 'Fantasy' in s), None),
             next((s for s in cs if 'Contemporary' in s), None),
             next((s for s in cs if 'Animated' in s), None)]
        id = [0, 0, 0, 5]
    elif name == 'cdg':
        replace = [next((s for s in cs if 'Period' in s), None),
             next((s for s in cs if 'Fantasy' in s), None),
             next((s for s in cs if 'Contemporary' in s), None)]
        id = [0, 0, 0]
    elif name == 'ofta':
        replace = [next((s for s in cs if 'Best Picture' in s), None),
             next((s for s in cs if 'Best Actor' in s), None),
             next((s for s in cs if 'Breakthrough' in s and 'Male' in s), None),
             next((s for s in cs if 'Best Actress' in s), None),
             next((s for s in cs if 'Breakthrough' in s and 'Female' in s), None),
             next((s for s in cs if 'Actor' in s and 'Supporting' in s), None),
             next((s for s in cs if 'Actress' in s and 'Supporting' in s), None),
             next((s for s in cs if 'Animated' in s), None)]
        id = [0, 1, 1, 2, 2, 3, 4, 5]
    elif name == 'ofcs':
        replace = [next((s for s in cs if 'Best Picture' in s), None),
             next((s for s in cs if 'Actor' in s and 'Supporting' not in s), None),
             next((s for s in cs if 'Actress' in s and 'Supporting' not in s), None),
             next((s for s in cs if 'Actor' in s and 'Supporting' in s), None),
             next((s for s in cs if 'Actress' in s and 'Supporting' in s), None),
             next((s for s in cs if 'Animated' in s), None)]
        id = [0, 1, 2, 3, 4, 5]
    elif name == 'cc':
        replace = [next((s for s in cs if 'Best Picture' in s), None),
             next((s for s in cs if 'Best Action Movie' in s), None),
             next((s for s in cs if 'Best Comedy' in s), None),
             next((s for s in cs if 'Best Sci-Fi' in s or 'Best Horror' in s), None),
             next((s for s in cs if 'Actor' in s and 'Comedy' not in s and 'Supporting' not in s), None),
             next((s for s in cs if 'Actor' in s and 'Comedy' in s and 'Supporting' not in s), None),
             next((s for s in cs if 'Actress' in s and 'Comedy' not in s and 'Supporting' not in s), None),
             next((s for s in cs if 'Actress' in s and 'Comedy' in s and 'Supporting' not in s), None),
             next((s for s in cs if 'Actor' in s and 'Supporting' in s), None),
             next((s for s in cs if 'Actress' in s and 'Supporting' in s), None),
             next((s for s in cs if 'Animated' in s), None)]
        id = [0, 0, 0, 0, 1, 1, 2, 2, 3, 4, 5]
    elif name == 'lccf':
        replace = [next((s for s in cs if 'Film' in s), None),
             next((s for s in cs if 'Actor' in s and 'Supporting' not in s), None),
             next((s for s in cs if 'Actress' in s and 'Supporting' not in s), None),
             next((s for s in cs if 'Actor' in s and 'Supporting' in s), None),
             next((s for s in cs if 'Actress' in s and 'Supporting' in s), None)]
        id = [0, 1, 2, 3, 4]
    elif name == 'ace':
        replace = [next((s for s in cs if 'Feature Film' in s and 'Drama' in s), None),
             next((s for s in cs if 'Feature Film' in s and 'Comedy' in s), None),
             next((s for s in cs if 'Animated' in s), None)]
        id = [0, 0, 5]
    else:  # Oscars
        replace = [next((s for s in cs if 'Picture' in s), None),
             next((s for s in cs if 'Actor' in s and 'Leading' in s), None),
             next((s for s in cs if 'Actress' in s and 'Leading' in s), None),
             next((s for s in cs if 'Actor' in s and 'Supporting' in s), None),
             next((s for s in cs if 'Actress' in s and 'Supporting' in s), None),
             next((s for s in cs if 'Animated' in s and 'Short' not in s), None)]
        id = list(range(0, 6))


    noneIndices = [i for i, r in enumerate(replace) if r is None]
    replace = [i for i in replace if i is not None]
    id = [i for h, i in enumerate(id) if h not in noneIndices]

    temp = []
    for i, r in enumerate(replace):
        temp.append(aw[cs.index(r)])

    return id, temp


These functions above would have been a lot more concise and simple (with less if else statements and list comprehensions) if the format for categories and the listing of winners/nominees stayed consistent throughout the years.

Here are just a few of the countless exceptions/difficulties that needed to be dealt with:
1. Movie titles switched with the Actor/Actress/Director/etc:

1 | 2
- | - 
![Best Picture](./imgs/ex1.jpg) | ![Best Actor](./imgs/ex2.jpg)


2. Or how about this case, where the composer and movie title for Best Original Score were switched ONLY in the 2015 ceremony (`year == 2014` in the code, since the event URL uses `year + 1`)?
![Best Original Score](./imgs/ex3.jpg)

3. Award categories changing every year

4. TV shows and other unwanted categories (e.g. Best British/Canadian-only film producer, Student Award, Best Youth Award, etc.) mixed in

5. Regex in general
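
Most of the per-ceremony blocks in `scrape_movie_awards` repeat two steps: filter the category names by keyword, then pick which nominee field holds the movie title. A sketch of how that could be factored out, assuming a made-up, heavily simplified page source; `filter_categories` and `nominee_field` are hypothetical helpers, not functions from the repository:

```python
import re

def filter_categories(source, exclude=(), limit=None):
    """Pull category names out of the page source and drop any that
    contain one of the `exclude` keywords."""
    names = re.findall(r'"categoryName":"([^"]*?)","nominations"', source)
    kept = [n for n in names if not any(word in n for word in exclude)]
    return kept[:limit]  # limit=None keeps everything

def nominee_field(category):
    """Person-centred categories keep the movie title under secondaryNominees;
    every other category keeps it under primaryNominees."""
    if any(word in category for word in ('Actor', 'Actress', 'Director')):
        return 'secondaryNominees'
    return 'primaryNominees'

# Made-up, simplified stand-in for the JSON embedded in an event page.
sample_source = (
    '{"categoryName":"Best Motion Picture - Drama","nominations":[]},'
    '{"categoryName":"Best Television Series - Drama","nominations":[]},'
    '{"categoryName":"Best Director - Motion Picture","nominations":[]}'
)

film_categories = filter_categories(sample_source, exclude=('Television',))
for c in film_categories:
    print(c, '->', nominee_field(c))
```

This doesn't remove the year-specific exceptions, but it would keep them in one place instead of spreading them across a dozen list comprehensions.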

## 2. Data Organization/Manipulation

### 2.1 Combining BIGML with our Data

As the title suggests, we should combine the BIGML dataset with the data we just scraped to get more data. However, we should try to avoid duplicates. Time to work on main.py now. First come the imports, including the collect_data.py we created.

In [40]:
# main.py
import pandas as pd
import numpy as np
np.random.seed(1)
import collect_data as cd
import os
os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'
import pickle

The function below, extract_movie_data(), calls the web-scraping functions from collect_data.py and returns a dataframe of movies and their tags, excluding movies that already appear in the BIGML dataset.

Note: the GitHub repository doesn't include imdb.csv because it's redundant once the final dataset (combined.csv) exists.

In [41]:
def extract_movie_data():
	"""
	Extracts the movie titles from years 2000~2018 and their respective tags/details and outputs them in the form of
	a dataframe in similar format to the BIGML dataset.
	:return: dataframe of web-scraped movies and their tags
	"""
	if os.path.exists('./data/imdb.csv'):
		imdb = pd.read_csv('./data/imdb.csv', index_col=0)
	else:
		# Scrape every year and stack the results; pd.concat is used because
		# DataFrame.append was removed in pandas 2.0
		imdb = pd.concat([cd.imdb_feature_film(y) for y in range(2000, 2019)], ignore_index=True)

		# Removes duplicate movies
		df = pd.read_csv('./data/bigml.csv')
		temp = []
		for index, row in imdb.iterrows():
			if row['movie'] not in list(df['movie']):
				temp.append([row['year'], row['movie'], row['movie_id']])
		imdb = pd.DataFrame(temp, columns=['year', 'movie', 'movie_id'])

		tags = []
		for index, row in imdb.iterrows():
			print(str(index) + '. ' + row['movie'])
			id = row['movie_id']
			extra = cd.movie_tags(id)

			if extra is not None:
				tags.append([row['year'], row['movie'], row['movie_id']] + extra)

		imdb = pd.DataFrame(tags, columns=['year', 'movie', 'movie_id', 'certificate', 'duration', 'genre', 'rate',
										   'metascore', 'synopsis', 'votes', 'gross', 'user_reviews', 'critic_reviews',
										   'popularity', 'awards_wins', 'awards_nominations'])
		imdb.to_csv('./data/imdb.csv')

	return imdb
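
The duplicate filter in the loop can also be written as a single vectorized step with `isin`. A sketch on miniature frames built from titles shown earlier in this notebook, plus one made-up BIGML row with a placeholder ID. Matching on `movie_id` instead of the title is an alternative that may be safer for different films sharing a name; it is not what the function above does.

```python
import pandas as pd

# Miniature stand-ins for the two tables. 'tt0000000' is a placeholder ID
# for a hypothetical older film that shares its title with a 2018 release.
bigml_small = pd.DataFrame({
    'movie': ['Frida', 'Robin Hood'],
    'movie_id': ['tt0120679', 'tt0000000'],
})
scraped_small = pd.DataFrame({
    'year': [2002, 2018, 2018],
    'movie': ['Frida', 'Annihilation', 'Robin Hood'],
    'movie_id': ['tt0120679', 'tt2798920', 'tt4532826'],
})

# Same rule as the loop: keep scraped rows whose title is absent from BIGML.
new_by_title = scraped_small[~scraped_small['movie'].isin(bigml_small['movie'])]

# Alternative: compare IMDB IDs, so same-titled but different films are both kept.
new_by_id = scraped_small[~scraped_small['movie_id'].isin(bigml_small['movie_id'])]

print(new_by_title['movie'].tolist())
print(new_by_id['movie'].tolist())
```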

Now we have two dataframes (BIGML and our IMDB scrape) in similar formats. First, let's delete the unnecessary columns from BIGML, specifically release_date and every column after awards_nominations, which can be done with the drop function like so:

In [42]:
bigml.drop([bigml.columns[10]], axis=1, inplace=True)
bigml.drop(bigml.columns[16:], axis=1, inplace=True)
bigml.to_csv('./data/bigml.csv')
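
Dropping by position is fragile if the column order ever changes, so here is a sketch of the same idea using column names. The frame is a made-up miniature: `Oscar_Best_Picture_won` is the column name used in `combine_datasets`, while `other_award_column` is a placeholder standing in for the remaining award columns. Writing with `index=False` also avoids the `Unnamed: 0` columns seen earlier.

```python
import io
import pandas as pd

# Made-up one-row frame with the same kinds of columns as the BIGML dataset.
df = pd.DataFrame({
    'movie': ['Frida'],
    'release_date': ['2002-11-22'],
    'awards_nominations': [12],
    'Oscar_Best_Picture_won': ['No'],
    'other_award_column': ['No'],  # placeholder for the many award columns
})

# Drop release_date plus everything from the first award column onwards.
first_award = df.columns.get_loc('Oscar_Best_Picture_won')
to_drop = ['release_date'] + list(df.columns[first_award:])
trimmed = df.drop(columns=to_drop)

# index=False keeps the row index out of the file, so no 'Unnamed: 0' on reload.
buffer = io.StringIO()
trimmed.to_csv(buffer, index=False)
reloaded = pd.read_csv(io.StringIO(buffer.getvalue()))
print(list(reloaded.columns))
```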

Then, the only step left to do is to combine the two. Read BIGML in first, fill the NaN values in the gross and popularity columns with -1 to allow the model to train better, read in the IMDB dataset, append them, sort the dataset by year and movie, then reset the indices, and finally export our final dataset as combined.csv. Done!

In [43]:
def combine_datasets():
	"""
	Combines the BIGML dataset and our IMDB-scraped dataset
	:return: combined dataset
	"""
	bigml = pd.read_csv('./data/bigml.csv')
	bigml = bigml.drop(list(bigml.columns[bigml.columns.get_loc('Oscar_Best_Picture_won'):])+['release_date'], axis=1)
	imdb = extract_movie_data()
	# pd.concat is used because DataFrame.append was removed in pandas 2.0
	dataframe = pd.concat([bigml, imdb], sort=False, ignore_index=True)
	dataframe.sort_values(['year', 'movie'], axis=0, ascending=True, inplace=True)
	dataframe = dataframe.reset_index(drop=True)
	dataframe.to_csv('./data/combined.csv')

	return dataframe
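
The NaN-filling step described above isn't shown in `combine_datasets`, so here is a sketch of the whole sequence on miniature frames. The rows reuse values printed earlier in this notebook, except the missing popularity for Chicken Run, which is made up so there is something to fill. Filling after the concatenation is an assumption about where the step belongs.

```python
import numpy as np
import pandas as pd

# Miniature BIGML frame: no 'gross' column, one made-up missing popularity.
bigml_small = pd.DataFrame({
    'year': [2002, 2000],
    'movie': ['Frida', 'Chicken Run'],
    'popularity': [2508.0, np.nan],
})
# Miniature scraped frame, which does carry 'gross'.
imdb_small = pd.DataFrame({
    'year': [2018],
    'movie': ['Green Book'],
    'popularity': [76],
    'gross': [85080171],
})

# Stack the two frames; columns missing on one side become NaN.
combined = pd.concat([bigml_small, imdb_small], sort=False, ignore_index=True)

# Use -1 as the "unknown" marker, matching the defaults in movie_tags.
combined[['gross', 'popularity']] = combined[['gross', 'popularity']].fillna(-1)

# Sort by year and title, then renumber the rows.
combined = combined.sort_values(['year', 'movie']).reset_index(drop=True)
print(combined)
```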

### 2.2 Category IDs
Now, if you've been paying attention to the code above, you might wonder what id_categories(name, cs, aw) does. To understand this, we have to look at awards_categories.csv (which I created manually):

In [44]:
award_categories = pd.read_csv('./data/awards_categories.csv')
award_categories.head(10)

Unnamed: 0,Oscars,Golden globe,bafta,screen actors guild,directors guild,producers guild,art directors guild,writers guild,costume designers guild,online film television association,online film critics society,critics choice,london critics circle film,american cinema editors
0,picture 0,best motion picture – drama,film,,feature film,Outstanding Producer of Theatrical Motion Pict...,period feature film,,,best picture,best picture,best picture,film of the year,feature film (dramatic)
1,,Best motion picture - musical/comedy,,,,,fantasy feature film,,,,,best action movie,,feature film (comedy)
2,,,,,,,contemporary feature film,,,,,best comedy,,
3,,,,,,,,,,,,best sci-fi/horror movie,,
4,,,,,,,,,,,,,,
5,actor 1,best performance by an actor in a motion pictu...,Leading actor,outstanding performance by a male actor in a l...,,,,,,best actor,best actor,best actor,actor of the year,
6,,best performance by an actor in a motion pictu...,,,,,,,,best breakthrough performance: male,,best actor in a comedy,,
7,actress 2,best performance by an actress in a motion pic...,Leading actress,outstanding performance by a female actor in a...,,,,,,best actress,best actress,best actress,actress of the year,
8,,best performance by an actress in a motion pic...,,,,,,,,best breakthrough performance: female,,best actress in a comedy,,
9,supporting actor 3,best performance by an actor in a supporting r...,supporting actor,outstanding performance by a male actor in a s...,,,,,,best supporting actor,best supporting actor,best supporting actor,supporting actor of the year,


As shown, all of the categories from all the award ceremonies are linked together with similar ones by tagging them with IDs. Of course, these IDs (0~23) all ultimately refer to the 24 Oscar categories we want to predict.

A few key points here:
1. Some Oscar categories are linked to MULTIPLE categories (ex: Oscar best picture = Golden Globe best motion picture drama & best motion picture musical/comedy).
2. NaN values exist, meaning that not all ceremonies contain the 24 needed categories. This is fine.
3. Because the IDs are assigned manually, the algorithm starts out biased towards the idea that winning a certain category doesn't directly affect winning another category, which is a fine assumption. Maybe the algorithm will learn that it does as it trains.

So let's go back to the function id_categories. After retrieving all the category titles, the category names are replaced with ID integers ONLY if they exist in the table, which is what the built-in next() calls are responsible for.
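
To make the next() trick concrete, here is a minimal sketch of the idea. The lookup table, the helper names, and the exact-match rule are all assumptions made for illustration; the real id_categories reads its mapping from awards_categories.csv and may match titles differently.

```python
# Hypothetical lookup: lower-cased category title -> Oscar category ID.
# In the notebook this information lives in awards_categories.csv.
CATEGORY_IDS = {
    "best motion picture - drama": 0,
    "best motion picture - musical/comedy": 0,
    "leading actor": 1,
    "leading actress": 2,
    "supporting actor": 3,
}


def to_category_id(title, lookup=CATEGORY_IDS):
    """Return the ID of the first known category equal to `title`, else None."""
    title = title.strip().lower()
    # next() with a default returns None instead of raising StopIteration
    return next((cid for name, cid in lookup.items() if name == title), None)


def id_categories_sketch(titles, nominees):
    """Keep only the categories that map to an ID, together with their nominees."""
    pairs = [(to_category_id(t), n) for t, n in zip(titles, nominees)]
    pairs = [(cid, n) for cid, n in pairs if cid is not None]
    return [cid for cid, _ in pairs], [n for _, n in pairs]


titles = ["Best Motion Picture - Drama", "Best Original Song", "Leading Actor"]
nominees = [["Movie A", "Movie B"], ["Movie C"], ["Movie D", "Movie E"]]
ids, kept = id_categories_sketch(titles, nominees)
print(ids)
print(kept)
```

Categories with no Oscar equivalent (the song category in this toy input) simply drop out, which matches the "ONLY if they exist" behavior described above.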

### 2.3 Award points

Now that we have IDs for all categories from all of the award ceremonies, we can give "award points" in order to specify to the algorithm which movie won and which movie got nominated. For example, if we give winners 1 point and nominees 0.5 points:

| Result | Points |
| --- | --- |
| Winner | 1 |
| Nominee | 0.5 |
| Nothing | 0 |

Seems intuitive, right? But let's think about this more. Why should nominees be given half the winner's points? There's 1 winner, but usually around 5 nominees, so a single nomination shouldn't be worth that much. So how about this:

| Result | Points |
| --- | --- |
| Winner | 1 |
| Nominee | 1/(nominee_count) |
| Nothing | 0 |

Better. Below is the function to do the magic.

In [45]:
import pickle  # needed to cache/load the scraped award results below

def add_award_points(dataframe):
	"""
	Adds points to movies in categories that it won / was nominated in from all 14 award ceremonies. 1 point for winner,
	1/(number of nominees) points for nominee, and 0 points for neither.
	:param dataframe: final (combined) dataset
	:return: edited dataset with points added in
	"""
	if os.path.exists('./data/categories') and os.path.exists('./data/awards') and os.path.exists('./data/oscarCategories') and os.path.exists('./data/oscarAwards'):
		with open('./data/categories', 'rb') as f:
			categories = pickle.load(f)
		with open('./data/awards', 'rb') as f:
			awards = pickle.load(f)
		with open('./data/oscarCategories', 'rb') as f:
			oscarCategories = pickle.load(f)
		with open('./data/oscarAwards', 'rb') as f:
			oscarAwards = pickle.load(f)
	else:
		categories = []
		awards = []
		oscarCategories = []
		oscarAwards = []
		for y in range(2000, 2019):
			print(y)
			results = cd.scrape_movie_awards(y)
			categories.append(results[0])
			awards.append(results[1])
			oscarCategories.append(results[2])
			oscarAwards.append(results[3])
		with open('./data/categories', 'wb') as f:
			pickle.dump(categories, f)
		with open('./data/awards', 'wb') as f:
			pickle.dump(awards, f)
		with open('./data/oscarCategories', 'wb') as f:
			pickle.dump(oscarCategories, f)
		with open('./data/oscarAwards', 'wb') as f:
			pickle.dump(oscarAwards, f)

	categoryNames = ['best_picture', 'actor', 'actress', 'supporting_actor', 'supporting_actress', 'animated']
	for category in categoryNames:
		dataframe[category] = np.nan
	for category in categoryNames:
		dataframe['oscar_' + category] = np.nan

	# Adds points to all of the movies that have won/been nominated for awards in all categories (except Oscar)
	start = dataframe.columns.get_loc('best_picture')

	# Ensures that all movies' award points start at 0
	for i in dataframe.columns[start:]:
		dataframe[i] = 0

	for i, year in enumerate(categories):
		for j, event in enumerate(year):
			for k, award in enumerate(event):
				for l, movie in enumerate(awards[i][j][k]):
					index = dataframe.index[(dataframe.movie == movie)&((dataframe.year == 2000 + i)|(dataframe.year == 2000 + i + 1)|(dataframe.year == 2000 + i - 1))]
					if len(index) != 0:
						print(str(i) + ", " + str(j) + ", " + str(k) + ", " + str(l))
						print(movie + '\n')
						if l == 0: points = 1
						else: points = 1.0/len(awards[i][j][k])
						dataframe.loc[index[0], dataframe.columns[start + int(award)]] += points

	# Oscar points for data labels
	oscarStart = dataframe.columns.get_loc('oscar_best_picture')
	for i, year in enumerate(oscarCategories):
		for j, award in enumerate(year):
			for l, movie in enumerate(oscarAwards[i][j]):
				index = dataframe.index[(dataframe.movie == movie)&((dataframe.year == 2000 + i)|(dataframe.year == 2000 + i + 1)|(dataframe.year == 2000 + i - 1))]
				if len(index) != 0:
					print(str(i) + ", " + str(j) + ", " + str(l))
					print(movie)
					print(str(dataframe.loc[index[0], dataframe.columns[oscarStart + int(award)]]))
					if l == 0: points = 1
					else: points = 1.0/len(oscarAwards[i][j])
					dataframe.loc[index[0], dataframe.columns[oscarStart + int(award)]] = points
					print(str(dataframe.loc[index[0], dataframe.columns[oscarStart + int(award)]]) + '\n')

	# Computes average sum by dividing the award points by the number of award ceremonies the movie could have won in
	N = [10, 7, 7, 7, 7, 8]
	for i, col in enumerate(dataframe.columns[start:oscarStart]):
		dataframe[col] /= N[i]

	dataframe.to_csv('./data/combined.csv')
	return dataframe
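
The scoring rule buried in those nested loops can be pulled out into a tiny helper, which makes it easier to sanity-check by hand. This is only a sketch: award_points and averaged_score are hypothetical names that don't exist in the repo, and the example movie is made up.

```python
def award_points(position, nominee_count):
    """1 point for the winner (listed first), 1/nominee_count for any other nominee."""
    if nominee_count <= 0:
        return 0.0
    return 1.0 if position == 0 else 1.0 / nominee_count


def averaged_score(results, ceremony_count):
    """Sum the points over ceremonies, then divide by the number of ceremonies
    that offer this category (the N list in add_award_points).

    `results` holds one (position, nominee_count) pair per ceremony
    in which the movie appeared.
    """
    total = sum(award_points(pos, n) for pos, n in results)
    return total / ceremony_count


# Made-up movie: won at one ceremony, was a plain nominee at two others
# (5 nominees each), with 10 ceremonies offering a best-picture equivalent.
score = averaged_score([(0, 5), (2, 5), (4, 5)], ceremony_count=10)
print(round(score, 2))  # 0.14
```

Here the movie collects 1 + 0.2 + 0.2 = 1.4 points, which averages out to 0.14 over 10 ceremonies.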

Slick! After running add_award_points, let's quickly check what our final dataset looks like before we move on.

In [46]:
df = pd.read_csv('./data/combined.csv', index_col=0)
print(df.shape)
df.head(3)

(4132, 28)


Unnamed: 0,year,movie,movie_id,certificate,duration,genre,rate,metascore,synopsis,votes,...,actress,supporting_actor,supporting_actress,animated,oscar_best_picture,oscar_actor,oscar_actress,oscar_supporting_actor,oscar_supporting_actress,oscar_animated
0,2000,101 Reykjavík,tt0237993,Not Rated,88,"Comedy, Romance",6.9,68.0,Will the 30 y.o. Hlynur ever move out of his m...,9002,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2000,102 Dalmatians,tt0211181,G,100,Adventure|Comedy|Family,4.8,35.0,Cruella DeVil gets out of prison and goes afte...,27364,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2000,28 Days,tt0191754,PG-13,103,"Comedy, Drama",6.0,46.0,A big-city newspaper columnist is forced to en...,40573,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## 3. The Algorithm

Before we do anything, it's important to make sure that the data preprocessing is all done. Below are the relevant lines from the main function:

In [47]:
from sklearn.preprocessing import StandardScaler  # used for input scaling further down

# df = combine_datasets()
df = pd.read_csv('./data/combined.csv', index_col=0)
# df.fillna(-1, inplace=True)
# df = df.drop(df[~df['certificate'].isin(['G', 'PG', 'PG-13', 'R', 'Not Rated'])].index)
# df = add_award_points(df)
# df.to_csv('./data/combined.csv')

# Data preprocessing/encoding
df = df.drop(['movie', 'movie_id', 'synopsis', 'genre'], axis=1)
df['popularity'] = 1/np.array(df['popularity']) * 100
df = pd.get_dummies(df, columns=['certificate'])
cols = df.columns.tolist()
cols = cols[df.columns.get_loc('oscar_animated') + 1:] + cols[:df.columns.get_loc('oscar_animated') + 1]
df = df[cols]
df = df.reset_index(drop=True)
splitIndex = df.index[df['year'] == 2018][0]
df = df.drop(['year'], axis=1)

# Splits data into training and testing sets
oscarStart = df.columns.get_loc('oscar_best_picture')
x = df.iloc[:, :oscarStart].values
y = df.iloc[:, oscarStart:].values
y[(y > 0) & (y < 1)] = 0.5  # winner is 1, nominee is 0.5, nothing is 0
xTrain, xTest = x[:splitIndex], x[splitIndex:]
yTrain, yTest = y[:splitIndex], y[splitIndex:]

# Checks how imbalanced the data is
unique, counts = np.unique(yTrain, return_counts=True)
print(dict(zip(unique, counts)))

# Scales inputs to avoid one variable having more weight than another
sc = StandardScaler()
xTrain = sc.fit_transform(xTrain)
xTest = sc.transform(xTest)

modelType = 'neuralnetwork'
predictCategory = True

# One hot encoding for softmax activation function
trainTargets = [[] for i in range(0, 6)]
for i in yTrain:
    for idx, j in enumerate(i):
        if j == 1:  # winner
            trainTargets[idx].append([1, 0, 0])
        elif j == 0.5:  # nominee
            trainTargets[idx].append([0, 1, 0])
        else:  # loser/nothing
            trainTargets[idx].append([0, 0, 1])
yTrain = [np.array(i) for i in trainTargets]
testTargets = [[] for i in range(0, 6)]
for i in yTest:
    for idx, j in enumerate(i):
        if j == 1:  # winner
            testTargets[idx].append([1, 0, 0])
        elif j == 0.5:  # nominee
            testTargets[idx].append([0, 1, 0])
        else:  # loser/nothing
            testTargets[idx].append([0, 0, 1])
yTest = [np.array(i) for i in testTargets]

{0.0: 22948, 0.5: 439, 1.0: 103}
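
The train and test encoding loops in the cell above are identical, so they could be folded into one helper. A minimal sketch in plain Python follows; one_hot_targets is a hypothetical name, not something from the repo.

```python
ONE_HOT = {1.0: [1, 0, 0], 0.5: [0, 1, 0]}  # winner, nominee
LOSER = [0, 0, 1]                            # everything else


def one_hot_targets(labels, n_outputs=6):
    """Convert rows of per-category labels (1, 0.5 or 0) into one list of
    one-hot rows per output head."""
    targets = [[] for _ in range(n_outputs)]
    for row in labels:
        for idx, value in enumerate(row):
            # list(...) makes a copy so rows never share the same list object
            targets[idx].append(list(ONE_HOT.get(float(value), LOSER)))
    return targets


# Two toy movies with labels for the 6 Oscar categories
toy_labels = [
    [1, 0, 0.5, 0, 0, 0],
    [0, 0.5, 0, 0, 0, 1],
]
encoded = one_hot_targets(toy_labels)
print(encoded[0])  # targets for the first category
```

In the notebook it would be called once per split, e.g. `yTrain = [np.array(t) for t in one_hot_targets(yTrain)]` and the same for yTest.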


Okay, now onto the actual algorithm. Because the problem is both multiclass (each label is winner, nominee, or loser) and multi-output (one label for each of the 6 Oscar categories), there are only a few algorithms/classifiers we can implement. Neural networks offer the most flexibility in terms of input and output, so here we go.

We use the Keras Functional API to create a neural network (a single hidden layer here) with multiple outputs. There are 6 Dense output layers, one for each of the 6 Oscar categories we want to predict. Softmax seems to be the most reasonable activation function here because being a winner, a nominee, or a loser are mutually exclusive outcomes.

In [48]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from keras.models import Sequential, Model, load_model
from keras.layers import Dense, Activation, Dropout, Input, BatchNormalization
from keras.optimizers import Adam

if os.path.exists('best.h5'):
    model = load_model('best.h5')
else:
    input = Input(shape=(xTrain.shape[1],))
    x = Dense(128, activation='relu')(input)
    x = BatchNormalization()(x)
    x = Dropout(0.2)(x)
    output1 = Dense(3, activation='softmax')(x)
    output2 = Dense(3, activation='softmax')(x)
    output3 = Dense(3, activation='softmax')(x)
    output4 = Dense(3, activation='softmax')(x)
    output5 = Dense(3, activation='softmax')(x)
    output6 = Dense(3, activation='softmax')(x)
    model = Model(inputs=input, outputs=[output1, output2, output3, output4, output5, output6])
    model.compile(optimizer=Adam(lr=0.01), loss='categorical_crossentropy')

    classWeights = {0: counts.sum()/counts[2], 1: counts.sum()/counts[1], 2: counts.sum()/counts[0]}
    model.fit(xTrain, yTrain, epochs=512, batch_size=32, class_weight=classWeights)
    # model.save('best.h5')
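
For a feel of how lopsided the classWeights end up, here is the same inverse-frequency calculation done by hand with the label counts printed earlier. The helper name is hypothetical; the counts come from the output of the preprocessing cell.

```python
def inverse_frequency_weights(counts_by_label):
    """Weight each label by total / count, so rare labels weigh more."""
    total = sum(counts_by_label.values())
    return {label: total / count for label, count in counts_by_label.items()}


# Counts printed by the preprocessing cell: {0.0: 22948, 0.5: 439, 1.0: 103}
label_counts = {"loser": 22948, "nominee": 439, "winner": 103}
weights = inverse_frequency_weights(label_counts)

# Map onto the one-hot positions used in the encoding: 0 = winner, 1 = nominee, 2 = loser
class_weights = {0: weights["winner"], 1: weights["nominee"], 2: weights["loser"]}
for idx, weight in class_weights.items():
    print(idx, round(weight, 2))
```

A winner ends up weighted a couple of hundred times more than a loser. How Keras applies a class_weight dict to a model with several outputs seems to depend on the Keras version, so it's worth verifying that the weights actually take effect in your setup.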

Finally, let's compute the accuracies of our model. Because our dataset is imbalanced, meaning that nearly all (97.7%) of the movies did not win or get nominated, a dumb model predicting that every movie is a loser would also get a "good" accuracy. While we could use a different evaluation metric such as F1 score or precision and recall, we can simply compute the accuracy FOR EACH CLASS (which is the same thing as per-class recall).

The function below, compute_model_accuracies, takes in input data and compares the model's predictions with the label outputs.

In [49]:
def compute_model_accuracies(predCategory, printout, m, x, y, split):
	categoryNames = ['best_picture', 'best_actor', 'best_actress', 'best_supporting_actor', 'best_supporting_actress', 'best_animated']
	df = pd.read_csv('./data/combined.csv', index_col=0)
	df = df.reset_index(drop=True)

	if not predCategory:
		yPred = m.predict_classes(x)

		totalAccuracy = accuracy_score(y, yPred)

		winnerIdx = [i for i, h in enumerate(y) if h == 0]
		winnerTrain = [y[i] for i in winnerIdx]
		winnerPred = [yPred[i] for i in winnerIdx]
		winnerAccuracy = accuracy_score(winnerTrain, winnerPred)

		nomineeIdx = [i for i, h in enumerate(y) if h == 1]
		nomineeTrain = [y[i] for i in nomineeIdx]
		nomineePred = [yPred[i] for i in nomineeIdx]
		nomineeAccuracy = accuracy_score(nomineeTrain, nomineePred)

		loserIdx = [i for i, h in enumerate(y) if h == 2]
		loserTrain = [y[i] for i in loserIdx]
		loserPred = [yPred[i] for i in loserIdx]
		loserAccuracy = accuracy_score(loserTrain, loserPred)

		print(printout + ' Total accuracy: ' + str(totalAccuracy))
		print('   ' + printout + ' Accuracy for predicting winners: ' + str(winnerAccuracy))
		print('   ' + printout + ' Accuracy for predicting nominees: ' + str(nomineeAccuracy))
		print('   ' + printout + ' Accuracy for predicting losers: ' + str(loserAccuracy))

		# Print the names of the predicted winners/nominees
		if printout == '(TESTING)':
			# yPred already holds the predicted classes (0 = winner, 1 = nominee, 2 = loser)
			for i, pred in enumerate(yPred):
				if pred == 0 or pred == 1:
					print(df.iloc[split + i].movie)
	else:
		yPred = m.predict(x)

		totalAccuracy = 0
		winnerAccuracy = 0
		nomineeAccuracy = 0
		loserAccuracy = 0
		for i in range(0, 6):
			true = y[i].argmax(axis=-1)
			pred = yPred[i].argmax(axis=-1)

			totalAccuracy += accuracy_score(true, pred)

			winnerIdx = [a for a, h in enumerate(true) if h == 0]
			winnerTrain = [true[a] for a in winnerIdx]
			winnerPred = [pred[a] for a in winnerIdx]
			winnerAccuracy += accuracy_score(winnerTrain, winnerPred)

			nomineeIdx = [a for a, h in enumerate(true) if h == 1]
			nomineeTrain = [true[a] for a in nomineeIdx]
			nomineePred = [pred[a] for a in nomineeIdx]
			nomineeAccuracy += accuracy_score(nomineeTrain, nomineePred)

			loserIdx = [a for a, h in enumerate(true) if h == 2]
			loserTrain = [true[a] for a in loserIdx]
			loserPred = [pred[a] for a in loserIdx]
			loserAccuracy += accuracy_score(loserTrain, loserPred)

		totalAccuracy /= 6; winnerAccuracy /= 6; nomineeAccuracy /= 6; loserAccuracy /= 6
		print(printout + ' Total accuracy: ' + str(totalAccuracy))
		print('   ' + printout + ' Accuracy for predicting winners: ' + str(winnerAccuracy))
		print('   ' + printout + ' Accuracy for predicting nominees: ' + str(nomineeAccuracy))
		print('   ' + printout + ' Accuracy for predicting losers: ' + str(loserAccuracy))
		print()

		# Print the names of the predicted winners/nominees
		if printout == '(TESTING)':
			yPred = [i.argmax(axis=-1) for i in yPred]
			# Regroup from one array per category to one row per movie: shape (n_samples, 6)
			yPred = np.stack(yPred, axis=1)

			for i, pred in enumerate(yPred):
				movie = df.iloc[split + i].movie
				winnerCategories = [categoryNames[a] for a, b in enumerate(pred) if b == 0]
				nomineeCategories = [categoryNames[a] for a, b in enumerate(pred) if b == 1]

				if winnerCategories and nomineeCategories:
					print(movie + ': Won ' + '|'.join(winnerCategories) + ', Nominated for ' + '|'.join(nomineeCategories))
				elif winnerCategories and not nomineeCategories:
					print(movie + ': Won ' + '|'.join(winnerCategories))
				elif not winnerCategories and nomineeCategories:
					print(movie + ': Nominated for ' + '|'.join(nomineeCategories))

Then, we run these two lines of code. The first line feeds the training data back in to check whether the model has been trained well. The second line feeds in the testing data (movies from 2018) to check how well the model predicts the 2019 Oscars.

In [50]:
# Training accuracy (put training data back in) and testing accuracy
compute_model_accuracies(predictCategory, '(TRAINING)', model, xTrain, yTrain, splitIndex)
compute_model_accuracies(predictCategory, '(TESTING)', model, xTest, yTest, splitIndex)

(TRAINING) Total accuracy: 0.9983822903363132
   (TRAINING) Accuracy for predicting winners: 0.9470588235294116
   (TRAINING) Accuracy for predicting nominees: 0.9372351206928321
   (TRAINING) Accuracy for predicting losers: 0.9997381307628915

(TESTING) Total accuracy: 0.9838709677419355
   (TESTING) Accuracy for predicting winners: 0.5
   (TESTING) Accuracy for predicting nominees: 0.5158730158730159
   (TESTING) Accuracy for predicting losers: 0.9960652795631887

A Star Is Born: Nominated for best_actor
BlacKkKlansman: Nominated for best_picture|best_actor|best_supporting_actor
Bohemian Rhapsody: Nominated for best_actor
Can You Ever Forgive Me?: Nominated for best_actress|best_supporting_actor
Crazy Rich Asians: Nominated for best_picture
First Man: Nominated for best_supporting_actress
Green Book: Won best_supporting_actor, Nominated for best_picture
If Beale Street Could Talk: Nominated for best_supporting_actress
Incredibles 2: Won best_animated
Isle of Dogs: Nominated for best_

## 4. Algorithm Optimization

I won't go into too much detail here because it's not something I spent a lot of time on, but a few parameters were optimized. You can try tweaking:

- Feature vector
- Hidden layer size
- Activation functions
- Batch normalization
- Dropout
- Optimizer
- Custom loss function
- Class weights (important because of imbalanced data)
- Epochs, batch size, etc
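
If you want to try several of these knobs systematically, one simple option is to enumerate every combination and train one model per configuration. The sketch below only generates the configurations; the search space values are assumptions for illustration (128 units, 0.2 dropout and a 0.01 learning rate are the settings used above, the rest are made up), and the training step is left to you.

```python
from itertools import product

# Hypothetical search space
search_space = {
    "hidden_units": [64, 128, 256],
    "dropout": [0.2, 0.5],
    "learning_rate": [0.01, 0.001],
}


def iter_configs(space):
    """Yield one dict per combination of hyperparameter values."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


configs = list(iter_configs(search_space))
print(len(configs))  # 12
print(configs[0])
```

Each config could then be used to build and fit a model, keeping the one with the best per-class accuracy on held-out years rather than the best total accuracy.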

Well, that's it for this month-long project. Hope you enjoyed reading!

