# PTO - Predicting the Oscars


![Who will be the Oscar nominees and winners?](./oscars.jpg)

This notebook covers my progress on building an algorithm to predict Oscar results: not just the winners among the nominees in each category, but also which movies will be nominated out of all the movies released that year. The project repository can be cloned from https://github.com/j0hnk1m/predict-the-oscars if you're interested.

This mini-project only lasted a month, so it may not be perfect. Hopefully, you guys can use this as a platform to get even better predictions. Let's dig in! I'll be using Anaconda Python 3.7. You may need to install a few packages yourself, however, using conda install {package}.

There are 4 major sections to cover:
1. Data Collection
2. Data Organization/Manipulation
3. The Algorithm
4. Algorithm Optimization

## 1. Data Collection

### 1.1 The BIGML dataset

First things first: the general imports.

In [18]:
import pandas as pd
import numpy as np
import os

I came across this dataset at https://bigml.com/user/academy_awards/gallery/dataset/5c6886e1eba31d73070017f5, which contains a list of movies from 2000 to 2018 and their details, including release year, movie_id (IMDB), certificate, duration, genre, IMDB rating, etc.

Here's what it looks like:

In [19]:
bigml = pd.read_csv('./data/BigML_Dataset_5ccb3e8ddb8b1d0886002d22.csv')
print(bigml.shape)
bigml.head(5)

(1235, 119)


Unnamed: 0,year,movie,movie_id,certificate,duration,genre,rate,metascore,synopsis,votes,...,New_York_Film_Critics_Circle_nominated,New_York_Film_Critics_Circle_nominated_categories,Los_Angeles_Film_Critics_Association_won,Los_Angeles_Film_Critics_Association_won_categories,Los_Angeles_Film_Critics_Association_nominated,Los_Angeles_Film_Critics_Association_nominated_categories,release_date.year,release_date.month,release_date.day-of-month,release_date.day-of-week
0,2001,Kate & Leopold,tt0035423,PG-13,118,Comedy|Fantasy|Romance,6.4,44.0,An English Duke from 1876 is inadvertedly drag...,66660,...,0,,0,,0,,2001.0,12.0,25.0,2.0
1,2000,Chicken Run,tt0120630,G,84,Animation|Adventure|Comedy,7.0,88.0,When a cockerel apparently flies into a chicke...,144475,...,1,Best Animated Film,1,Best Animation,1,Best Animation,2000.0,6.0,23.0,5.0
2,2005,Fantastic Four,tt0120667,PG-13,106,Action|Adventure|Family,5.7,40.0,A group of astronauts gain superpowers after a...,273203,...,0,,0,,0,,2005.0,7.0,8.0,5.0
3,2002,Frida,tt0120679,R,123,Biography|Drama|Romance,7.4,61.0,"A biography of artist Frida Kahlo, who channel...",63852,...,0,,0,,0,,2002.0,11.0,22.0,5.0
4,2001,The Lord of the Rings: The Fellowship of the Ring,tt0120737,PG-13,178,Adventure|Drama|Fantasy,8.8,92.0,A meek Hobbit from the Shire and eight compani...,1286275,...,0,,1,Best Music,2,Best Music|Best Production Design,2001.0,12.0,19.0,3.0


Each of the 1235 movies in this dataset has a lot of variables (119). Printing the column names:

In [20]:
print(bigml.columns)

Index(['year', 'movie', 'movie_id', 'certificate', 'duration', 'genre', 'rate',
       'metascore', 'synopsis', 'votes',
       ...
       'New_York_Film_Critics_Circle_nominated',
       'New_York_Film_Critics_Circle_nominated_categories',
       'Los_Angeles_Film_Critics_Association_won',
       'Los_Angeles_Film_Critics_Association_won_categories',
       'Los_Angeles_Film_Critics_Association_nominated',
       'Los_Angeles_Film_Critics_Association_nominated_categories',
       'release_date.year', 'release_date.month', 'release_date.day-of-month',
       'release_date.day-of-week'],
      dtype='object', length=119)


This is a good start, but there are some issues. First, there are more variables than necessary, and many of them can be condensed or grouped together, especially the award vectors. Second, there are NaN values. Third, there's not enough data to work with if the goal is to predict which movies will be Oscar nominees and in which category.

The first two issues are easily fixable later. To fix the third issue, let's start scraping data.
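
As a rough illustration of the first issue, the award columns printed above appear to come in groups that share a ceremony prefix and differ only in their suffix. The sketch below groups column names by that prefix. The helper name `group_award_columns` and the suffix list are assumptions based only on the column names visible in the output above; the full dataset may use other suffixes.

```python
from collections import defaultdict

# Suffixes seen in the printed column names (assumption: the dataset may contain others).
SUFFIXES = ("_won_categories", "_nominated_categories", "_won", "_nominated")

def group_award_columns(columns):
    """Map each ceremony prefix to the list of its award columns (hypothetical helper)."""
    groups = defaultdict(list)
    for col in columns:
        for suffix in SUFFIXES:
            if col.endswith(suffix):
                groups[col[:-len(suffix)]].append(col)
                break
    return dict(groups)

# A handful of column names copied from the output above.
columns = [
    "year", "movie", "release_date.year",
    "New_York_Film_Critics_Circle_nominated",
    "New_York_Film_Critics_Circle_nominated_categories",
    "Los_Angeles_Film_Critics_Association_won",
    "Los_Angeles_Film_Critics_Association_won_categories",
    "Los_Angeles_Film_Critics_Association_nominated",
    "Los_Angeles_Film_Critics_Association_nominated_categories",
]
groups = group_award_columns(columns)
for ceremony, cols in groups.items():
    print(ceremony, len(cols))
```

On the real dataframe, the same function could be called with `bigml.columns` to see how many ceremonies there are before deciding how to merge them.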

### 1.2 Web-scraping movie lists from IMDB

We can create a new Python file called "collect_data.py" and add the following imports and function. If you're confused, refer to the files in the GitHub repository.

The following function, "imdb_feature_film", takes a year from 2000 to 2018 and returns a dataframe of 350 movies scraped from IMDB's feature film lists, along with the year they were released and their respective IMDB movie ids. I chose 350 movies per year simply because it looked like a good balance between having enough data and filtering out really obscure indie movies that are not likely to win Oscars anytime soon. You can change the number, however, by replacing the 7 in the for loop with a different number (each page lists 50 movies).
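
If you want a different number of movies, the page count follows from the 50-movies-per-page layout mentioned in the loop comment. A minimal sketch of that arithmetic, where the helper name `pages_for` is made up for illustration:

```python
import math

MOVIES_PER_PAGE = 50  # taken from the comment in the scraping loop

def pages_for(n_movies, per_page=MOVIES_PER_PAGE):
    """Number of list pages needed to cover at least n_movies (hypothetical helper)."""
    return math.ceil(n_movies / per_page)

print(pages_for(350))
print(pages_for(500))
```

The result is the value to use in place of 7 in the for loop.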


In [31]:
import numpy as np
import pandas as pd
import requests
import re

def imdb_feature_film(year):
	"""
	Given a specific year (from 2000~2018), returns a dataframe of movies, their respective IMDB IDs, and release years.
	Example link where this function scrapes data from: https://www.imdb.com/year/2018/
	"""
	print(year)
	html = requests.get("https://www.imdb.com/year/" + str(year)).text

	movies = np.zeros((0, 2))
	for i in range(0, 7):  # 7 pages of 50 movies each = 350 top movies
		movies = np.concatenate([movies, np.flip(np.array(re.findall(r'<a href="/title/([^:?%]+?)/"[\r\n]+> <img alt="([^%]+?)"[\r\n]+', html)))])
		nextLink = "https://www.imdb.com" + re.findall(r'<a href="(/search/title\?title_type=feature&year=(?:.*)&start=(?:.*))"[\r\n]+class="lister-page-next next-page"', html)[0]
		html = requests.get(nextLink).text

	df = pd.DataFrame(movies, columns=['movie', 'movie_id'])
	df.insert(0, 'year', [year]*movies.shape[0], True)
	return df
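
To see what the regular expression in this function captures, here is a small sketch that runs the same pattern on a hand-written HTML fragment. The fragment is an assumption shaped to fit the pattern; the real IMDB markup may differ or may have changed since this was written.

```python
import re

# Hand-written fragment that mimics the structure the pattern expects (assumption).
SAMPLE_HTML = (
    '<a href="/title/tt5814060/"\n'
    '> <img alt="The Nun"\n'
    'class="loadlate"\n'
    '<a href="/title/tt3606756/"\n'
    '> <img alt="Incredibles 2"\n'
    'class="loadlate"\n'
)

# Same pattern as in imdb_feature_film: group 1 is the movie id, group 2 is the title.
PATTERN = r'<a href="/title/([^:?%]+?)/"[\r\n]+> <img alt="([^%]+?)"[\r\n]+'

matches = re.findall(PATTERN, SAMPLE_HTML)

# Swap each (movie_id, title) pair so the order matches the ['movie', 'movie_id'] columns.
rows = [(title, movie_id) for movie_id, title in matches]
print(rows)
```

Note that `np.flip` without an `axis` argument reverses every axis, so in the function above it swaps the two columns and also reverses the row order within each page. Swapping the tuple fields as in this sketch keeps the page order.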


Let's see the pandas dataframe of web-scraped movies from 2018 (It will take some time).

In [22]:
df_2018 = imdb_feature_film(2018)
print(df_2018.shape)
df_2018.head(10)

2018
(350, 3)


Unnamed: 0,year,movie,movie_id
0,2018,The Nun,tt5814060
1,2018,Incredibles 2,tt3606756
2,2018,Tell It to the Bees,tt7241926
3,2018,Bad Times at the El Royale,tt6628394
4,2018,Mission: Impossible - Fallout,tt4912910
5,2018,A Simple Favor,tt7040874
6,2018,Robin Hood,tt4532826
7,2018,Under the Silver Lake,tt5691670
8,2018,Goosebumps 2: Haunted Halloween,tt5664636
9,2018,Hereditary,tt7784604


Awesome, now we can move on to the next step: movie tags.

### 1.3 Web-scraping movie tags

Now that we have the movies and their IMDB movie ids, we can create a function that uses that information and regex to scrape important tags that we may need to build an accurate prediction algorithm. The tags, which are the same as the ones in the BigML dataset, are listed as comments in the function.

In [38]:
def movie_tags(id):
	"""
	Given a specific movie id (IMDB), returns a list of its tags/variables to be used as input variables.
	"""
	html = requests.get("https://www.imdb.com/title/" + id).text
	# ---------------TAGS---------------
	# certificate
	# duration
	# genre
	# rate
	# metascore
	# synopsis
	# votes
	# gross
	# user reviews
	# critic reviews
	# popularity
	# awards wins
	# awards nominations

	genre = re.findall('"genre": ([\s\S]+),\\n[\s\S]+"contentRating":', html)
	certificate = re.findall('"contentRating": "(.*)",\\n[\s\S]+<strong', html)
	rate = re.findall('<strong title="(.*) based on ', html)
	votes = re.findall('based on ([,0-9]+) user ratings">', html)
	user_reviews = re.findall('<span itemprop="reviewCount">([,0-9]+) user</span>', html)
	critic_reviews = re.findall('<span itemprop="reviewCount">([,0-9]+) critic</span>', html)
	duration = re.findall('<time datetime="PT(\d+)M">\\n', html)
	keywords = re.findall('<div class="summary_text">\\n(.*)\\n', html)  # synopsis text; validated below
	metascore = re.findall('<div class="metacriticScore score_[\w]+ titleReviewBarSubItem">\\n<span>([0-9]+)<', html)

	# Return None if any required tag is missing, instead of raising an IndexError on an empty match list
	required = [genre, certificate, rate, votes, user_reviews, critic_reviews, duration, keywords, metascore]
	if any(len(r) == 0 for r in required):
		return None
	keywords = keywords[0].strip()
	genre = ' '.join(genre[0].split()).replace('"', '').replace('[ ', '').replace(' ]', '')
	certificate = certificate[0]
	rate = float(rate[0])
	votes = int(votes[0].replace(',', ''))
	user_reviews = int(user_reviews[0].replace(',', ''))
	critic_reviews = int(critic_reviews[0].replace(',', ''))
	duration = int(duration[0].replace(',', ''))
	metascore = int(metascore[0])

	popularity = re.findall('titleReviewBarSubItem">\\n<span>[0-9]+<[\s\S]+ ([,0-9]+)\\n[\s\S]+\(<span class="titleOverviewSprite popularity', html)
	if len(popularity) == 0:
		popularity = -1
	else:
		popularity = int(popularity[0].replace(',', ''))

	awards_wins = re.findall('<span class="awards-blurb">[\s\S]+ (\d+) wins', html)
	if len(awards_wins) == 0:
		awards_wins = 0
	else:
		awards_wins = int(awards_wins[0])

	awards_nominations = re.findall('<span class="awards-blurb">[\s\S]+ (\d+) nominations', html)
	if len(awards_nominations) == 0:
		awards_nominations = 0
	else:
		awards_nominations = int(awards_nominations[0])

	gross = re.findall('Gross USA:</h4> \$([,0-9]+)', html)
	if len(gross) == 0:
		gross = -1
	else:
		gross = int(gross[0].replace(',', ''))

	tags = [certificate, duration, genre, rate, metascore, keywords, votes, gross, user_reviews, critic_reviews,
			popularity, awards_wins, awards_nominations]
	return tags
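
The optional tags (popularity, awards wins, awards nominations, gross) all follow the same pattern: search, fall back to a default if nothing is found, otherwise strip the commas and convert to an integer. A hedged sketch of how that pattern could be factored out is shown below. The helper name `first_int` is hypothetical and the sample strings are hand-written, not real IMDB pages.

```python
import re

def first_int(pattern, html, default):
    """Return the first captured group as an int (commas removed), or `default` if nothing matches."""
    found = re.findall(pattern, html)
    if not found:
        return default
    return int(found[0].replace(',', ''))

GROSS_PATTERN = r'Gross USA:</h4> \$([,0-9]+)'  # same pattern as in movie_tags

# Hand-written snippets for illustration only.
gross = first_int(GROSS_PATTERN, 'Gross USA:</h4> $85,080,171', -1)
missing = first_int(GROSS_PATTERN, '<html></html>', -1)
print(gross, missing)
```

With a helper like this, each optional tag would become a single line, at the cost of one extra function to maintain.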

Again, let's check that the tags are correct using the movie 'Green Book', which won the Oscar for Best Picture this year (definitely recommend :))

![Green Book](https://s20352.pcdn.co/wp-content/uploads/2018/11/green-book-GBK_Tsr1Sheet_RGB_3SM_rgb.jpg)

In [39]:
tags = movie_tags('tt6966692')
print(tags)

['PG-13', 130, 'Biography, Comedy, Drama, Music', 8.3, 69, 'A working-class Italian-American bouncer becomes the driver of an African-American classical pianist on a tour of venues through the 1960s American South.', 191157, 85080171, 1037, 362, 64, 50, 89]
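
Since the function returns a plain list, it can be easier to read the result with names attached. The sketch below pairs the printed values with names in the order of the `tags` list returned by `movie_tags`. The `TAG_NAMES` list is my own labeling of that order (the `keywords` variable holds the synopsis).

```python
# Names in the same order as the list returned by movie_tags (labels are an assumption).
TAG_NAMES = ['certificate', 'duration', 'genre', 'rate', 'metascore', 'synopsis', 'votes',
             'gross', 'user_reviews', 'critic_reviews', 'popularity', 'awards_wins',
             'awards_nominations']

# Values copied from the output printed above.
tags = ['PG-13', 130, 'Biography, Comedy, Drama, Music', 8.3, 69,
        'A working-class Italian-American bouncer becomes the driver of an African-American '
        'classical pianist on a tour of venues through the 1960s American South.',
        191157, 85080171, 1037, 362, 64, 50, 89]

tagged = dict(zip(TAG_NAMES, tags))
for name, value in tagged.items():
    print(name, '=', value)
```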


Now that we have the movie tags, all that's left for data collection is the awards each movie won and was nominated for. Unfortunately, it's not so simple. Let's see why.

First, we need to choose which award ceremonies' data we want to use as input. The Oscars results are what we want to predict, so they will be used as the output labels. In total, 14 ceremonies are used: Golden Globe, BAFTA, Screen Actors Guild, Directors Guild, Producers Guild, Art Directors Guild, Writers Guild, Costume Designers Guild, Online Film Television Association, Online Film Critics Society, Critics Choice, London Critics Circle Film, American Cinema Editors, and Academy Awards/Oscars.
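
Each ceremony has its own IMDB event id, and the ceremony for movies released in a given year is held the following year, which is why the scraper requests `year + 1`. The sketch below shows the URL construction on its own. The id-to-name pairing follows the order of the `events` list and the numbered comments in `scrape_movie_awards`; the helper name `event_url` is hypothetical.

```python
# Event ids paired with ceremony names, in the order used by scrape_movie_awards.
EVENTS = {
    'ev0000292': 'Golden Globe',
    'ev0000123': 'BAFTA',
    'ev0000598': 'Screen Actors Guild',
    'ev0000212': 'Directors Guild',
    'ev0000531': 'Producers Guild',
    'ev0000618': 'Art Directors Guild',
    'ev0000710': 'Writers Guild',
    'ev0000190': 'Costume Designers Guild',
    'ev0002704': 'Online Film Television Association',
    'ev0000511': 'Online Film Critics Society',
    'ev0000133': 'Critics Choice',
    'ev0000403': 'London Critics Circle Film',
    'ev0000017': 'American Cinema Editors',
    'ev0000003': 'Oscar',
}

def event_url(event_id, film_year):
    """Build the event page URL for movies released in film_year (ceremony is one year later)."""
    return "https://www.imdb.com/event/" + event_id + "/" + str(film_year + 1) + "/1?ref_=ttawd_ev_1"

print(event_url('ev0000003', 2018))
```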

Functions to scrape the award winners + nominees from these award ceremonies:

In [41]:
def scrape_movie_awards(year):
	events = ['ev0000292', 'ev0000123', 'ev0000598', 'ev0000212', 'ev0000531', 'ev0000618', 'ev0000710',
			  'ev0000190', 'ev0002704', 'ev0000511', 'ev0000133', 'ev0000403', 'ev0000017', 'ev0000003']

	htmls = []
	for e in events:
		htmls.append(requests.get("https://www.imdb.com/event/" + e + "/" + str(year + 1) + "/1?ref_=ttawd_ev_1").text)
	# ---------------AWARDS---------------
	# 1. Golden Globe
	# 2. BAFTA
	# 3. Screen Actors Guild
	# 4. Directors Guild
	# 5. Producers Guild
	# 6. Art Directors Guild
	# 7. Writers Guild
	# 8. Costume Designers Guild
	# 9. Online Film Television Association
	# 10. Online Film Critics Society
	# 11. Critics Choice
	# 12. London Critics Circle Film
	# 13. American Cinema Editors

	# 14. Oscar

	gg_categories = [i for i in re.findall('"categoryName":"([^"]*?)","nominations"', htmls[0]) if 'Television' not in i][:14]
	gg = []
	for c in gg_categories:
		if 'Actor' in c or 'Actress' in c or 'Director' in c or (year == 2014 and 'Original Score' in c):
			gg.append(re.findall(re.escape(c) + '",(?:.*?)"secondaryNominees":\[{"name":"([^"]*?)","note":null', htmls[0])[:-1])
		else:
			gg.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[0])[:-1])
	gg_categories, gg = id_categories('gg', gg_categories, gg)

	bafta_categories = [i for i in re.findall('"categoryName":"([^"]*?)","nominations"', htmls[1]) if 'British' not in i and 'Best' in i and 'Series' not in i and 'Television' not in i and 'Features' not in i][:19]
	bafta = []
	for c in bafta_categories:
		if 'Actor' in c or 'Actress' in c or 'Director' in c:
			bafta.append(re.findall(re.escape(c) + '",(?:.*?)"secondaryNominees":\[{"name":"([^"]*?)","note":null', htmls[1])[:5])
		else:
			bafta.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[1])[:-1])
	bafta_categories, bafta = id_categories('bafta', bafta_categories, bafta)

	sag_categories = [i for i in re.findall('"categoryName":"([^"]*?)","nominations"', htmls[2]) if 'Series' not in i and 'Motion Picture' not in i and 'Stunt' not in i and 'Cast' not in i][:4]
	sag = []
	for c in sag_categories:
		if 'Actor' in c or 'Actress' in c or 'Director' in c:
			sag.append(re.findall(re.escape(c) + '",(?:.*?)"secondaryNominees":\[{"name":"([^"]*?)","note":null', htmls[2])[:-1])
		else:
			sag.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[2])[:-1])
	sag_categories, sag = id_categories('sag', sag_categories, sag)

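	# Parenthesized so that 'Motion' is actually tested against the category name (a bare 'Motion' is always truthy)
	dg_categories = [i for i in re.findall('"categoryName":"([^"]*?)","nominations"', htmls[3]) if ('Feature Film' in i or 'Motion' in i or 'Documentary' in i) and 'First' not in i][:2]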
	dg = []
	for c in dg_categories:
		dg.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[3])[:-1])
	dg_categories, dg = id_categories('dg', dg_categories, dg)

	pg_categories = [i for i in re.findall('"categoryName":"([^"]*?)","nominations"', htmls[4]) if 'Producer of' in i and 'Theatrical Motion Pictures' in i][:3]
	pg = []
	for c in pg_categories:
		if year >= 2004:
			pg.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[4])[:-1])
		else:
			pg.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[4]))
	pg_categories, pg = id_categories('pg', pg_categories, pg)

	adg_categories = [i for i in re.findall('"categoryName":"([^"]*)","nominations"', htmls[5]) if 'Film' in i][:4]
	adg = []
	for c in adg_categories:
		if year == 2001 and c == 'Fantasy Film':
			adg.append(['A.I. Artificial Intelligence'])
		else:
			adg.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[5])[:-1])
	adg_categories, adg = id_categories('adg', adg_categories, adg)

	wg_categories = [i for i in re.findall('"categoryName":"([^"]*?)","nominations"', htmls[6]) if 'Original Screenplay'
					 in i or 'Adapted Screenplay' in i or i == 'Documentary Screenplay'][:3]
	wg = []
	for c in wg_categories:
		wg.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[6])[:-1])
	wg_categories, wg = id_categories('wg', wg_categories, wg)

	cdg_categories  = [i for i in re.findall('"categoryName":"([^"]*?)","nominations"', htmls[7]) if 'Contemporary Film' in i
					  or 'Period Film' in i or 'Fantasy Film' in i][:3]
	cdg = []
	for c in cdg_categories:
		cdg.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[7])[:-1])
	cdg_categories, cdg = id_categories('cdg', cdg_categories, cdg)

	ofta_categories = [i for i in re.findall('"categoryName":"([^"]*?)","nominations"', htmls[8]) if 'Series' not in i and 'Ensemble' not in i
					   and 'Television' not in i and 'Actors and Actresses' not in i and 'Creative' not in i and 'Program' not in i
					   and 'Behind' not in i and 'Debut' not in i and 'Poster' not in i and 'Trailer' not in i and 'Stunt' not in i and
					   'Sequence' not in i and 'Voice-Over' not in i and 'Youth' not in i and 'Cinematic' not in i and 'Casting' not in i and 'Acting' not in i][:23]
	ofta = []
	for c in ofta_categories:
		if 'Actor' in c or 'Actress' in c or 'Director' in c:
			ofta.append(re.findall(re.escape(c) + '",(?:.*?)"secondaryNominees":\[{"name":"([^"]*?)","note":null', htmls[8])[:-1])
		else:
			ofta.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[8])[:-1])
	ofta_categories, ofta = id_categories('ofta', ofta_categories, ofta)

	ofcs_categories = [i for i in re.findall('"categoryName":"([^"]*?)","nominations"', htmls[9]) if 'Debut' not in i
					   and 'Stunt' not in i and 'Television' not in i and 'Series' not in i][:18]
	ofcs = []
	for c in ofcs_categories:
		if 'Actor' in c or 'Actress' in c or 'Director' in c:
			ofcs.append(re.findall(re.escape(c) + '",(?:.*?)"secondaryNominees":\[{"name":"([^"]*?)","note":null', htmls[9])[:-1])
		else:
			ofcs.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[9])[:-1])
	ofcs_categories, ofcs = id_categories('ofcs', ofcs_categories, ofcs)

	cc_categories = [i for i in re.findall('"categoryName":"([^"]*?)","nominations"', htmls[10]) if 'Series' not in i
					 and 'Young' not in i and 'Ensemble' not in i and 'TV' not in i and 'Television' not in i and 'Show' not in i][:23]
	cc = []
	for c in cc_categories:
		if 'Actor' in c or 'Actress' in c or 'Director' in c:
			cc.append(re.findall(re.escape(c) + '",(?:.*?)"secondaryNominees":\[{"name":"([^"]*?)","note":null', htmls[10])[:-1])
		else:
			cc.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[10])[:-1])
	cc_categories, cc = id_categories('cc', cc_categories, cc)

	lccf_categories = [i for i in re.findall('"categoryName":"([^"]*?)","nominations"', htmls[11]) if 'British' not in i
					   and 'Technical' not in i and 'Screenwriter' not in i and 'Television' not in i][:8]
	lccf = []
	for c in lccf_categories:
		if 'Actor' in c or 'Actress' in c or 'Director' in c:
			lccf.append(re.findall(re.escape(c) + '",(?:.*?)"secondaryNominees":\[{"name":"([^"]*?)","note":null', htmls[11])[:-1])
		else:
			lccf.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[11])[:-1])
	lccf_categories, lccf = id_categories('lccf', lccf_categories, lccf)

	ace_categories = [i for i in re.findall('"categoryName":"([^"]*?)","nominations"', htmls[12]) if 'Series' not in i
					  and 'Non-Theatrical' not in i and 'Television' not in i and 'Student' not in i][:4]
	ace = []
	for c in ace_categories:
		if 'Actor' in c or 'Actress' in c or 'Director' in c:
			ace.append(re.findall(re.escape(c) + '",(?:.*?)"secondaryNominees":\[{"name":"([^"]*?)","note":null', htmls[12])[:-1])
		else:
			ace.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[12])[:-1])
	ace_categories, ace = id_categories('ace', ace_categories, ace)

	oscar_categories = [i for i in re.findall('"categoryName":"([^"]*?)","nominations"', htmls[13])][:24]
	oscar = []
	for c in oscar_categories:
		if c == oscar_categories[-1]:
			if 'Actor' in c or 'Actress' in c or 'Director' in c:
				oscar.append(re.findall(re.escape(c) + '",(?:.*?)"secondaryNominees":\[{"name":"([^"]*?)","note":null', htmls[13]))
			else:
				oscar.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[13]))
		else:
			if 'Actor' in c or 'Actress' in c or 'Director' in c:
				oscar.append(re.findall(re.escape(c) + '",(?:.*?)"secondaryNominees":\[{"name":"([^"]*?)","note":null', htmls[13])[:-1])
			else:
				oscar.append(re.findall(re.escape(c) + '",(?:.*?)"primaryNominees":\[{"name":"([^"]*?)","note":null', htmls[13])[:-1])
	oscar_categories, oscar = id_categories('oscar', oscar_categories, oscar)

	return [gg_categories, bafta_categories, sag_categories, dg_categories, pg_categories, adg_categories, wg_categories, cdg_categories, ofta_categories, ofcs_categories, cc_categories, lccf_categories, ace_categories],\
		   [gg, bafta, sag, dg, pg, adg, wg, cdg, ofta, ofcs, cc, lccf, ace], oscar_categories, oscar


def id_categories(name, cs, aw):
	if name == 'gg':
		replace = [next((s for s in cs if 'Best Motion Picture' in s and 'Drama' in s), None),
				   next((s for s in cs if 'Best Motion Picture' in s and 'Comedy' in s), None),
				   next((s for s in cs if 'Actor' in s and 'Drama' in s and 'Supporting' not in s), None),
				   next((s for s in cs if 'Actor' in s and 'Comedy' in s and 'Supporting' not in s), None),
				   next((s for s in cs if 'Actress' in s and 'Drama' in s and 'Supporting' not in s), None),
				   next((s for s in cs if 'Actress' in s and 'Comedy' in s and 'Supporting' not in s), None),
				   next((s for s in cs if 'Actor' in s and 'Supporting' in s), None),
				   next((s for s in cs if 'Actress' in s and 'Supporting' in s), None),
				   next((s for s in cs if 'Animated' in s), None),
				   next((s for s in cs if 'Director' in s), None),
				   next((s for s in cs if 'Foreign' in s), None),
				   next((s for s in cs if 'Original Score' in s), None),
				   next((s for s in cs if 'Original Song' in s), None),
				   next((s for s in cs if 'Screenplay' in s), None)]
		id = [0, 0, 1, 1, 2, 2, 3, 4, 5, 8, 12, 14, 15, 22]
	elif name == 'bafta':
		replace = [next((s for s in cs if 'Best Film' in s), None),
			 next((s for s in cs if 'Actor' in s and 'Supporting' not in s), None),
			 next((s for s in cs if 'Actress' in s and 'Supporting' not in s), None),
			 next((s for s in cs if 'Actor' in s and 'Supporting' in s), None),
			 next((s for s in cs if 'Actress' in s and 'Supporting' in s), None),
			 next((s for s in cs if 'Animated' in s and 'Short' not in s), None),
			 next((s for s in cs if 'Cinematography' in s), None),
			 next((s for s in cs if 'Costume Design' in s), None),
			 next((s for s in cs if 'Documentary' in s), None),
			 next((s for s in cs if 'Editing' in s), None),
			 next((s for s in cs if 'Not' in s and 'English' in s), None),
			 next((s for s in cs if 'Make Up' in s or 'Hair' in s), None),
			 next((s for s in cs if 'Production Design' in s), None),
			 next((s for s in cs if 'Short' in s and 'Animat' in s), None),
			 next((s for s in cs if 'Short' in s and 'Film' in s), None),
			 next((s for s in cs if 'Sound' in s), None),
			 next((s for s in cs if 'Visual Effects' in s), None),
			 next((s for s in cs if 'Screenplay' in s and 'Adapted' in s), None),
			 next((s for s in cs if 'Screenplay' in s and 'Original' in s), None)]
		id = [0, 1, 2, 3, 4, 5, 6, 7, 9, 11, 12, 13, 16, 17, 18, 19, 21, 22, 23]
	elif name == 'sag':
		replace = [next((s for s in cs if 'Male' in s and 'Supporting' not in s), None),
			 next((s for s in cs if 'Female' in s and 'Supporting' not in s), None),
			 next((s for s in cs if 'Male' in s and 'Supporting' in s), None),
			 next((s for s in cs if 'Female' in s and 'Supporting' in s), None)]
		id = [1, 2, 3, 4]
	elif name == 'dg':
		replace = [next((s for s in cs if 'Feature' in s), None),
			 next((s for s in cs if 'Documentary' in s), None),]
		id = [0, 9]
	elif name == 'pg':
		replace = [next((s for s in cs if 'Producer of Theatrical' in s), None),
			 next((s for s in cs if 'Animated' in s), None),
			 next((s for s in cs if 'Documentary' in s), None)]
		id = [0, 5, 9]
	elif name == 'adg':
		replace = [next((s for s in cs if 'Period' in s), None),
			 next((s for s in cs if 'Fantasy' in s), None),
			 next((s for s in cs if 'Contemporary' in s), None),
			 next((s for s in cs if 'Animated' in s), None)]
		id = [0, 0, 0, 5]
	elif name == 'wg':
		replace = [next((s for s in cs if 'Documentary' in s), None),
			next((s for s in cs if 'Adapted' in s), None),
			 next((s for s in cs if 'Original' in s), None)]
		id = [9, 22, 23]
	elif name == 'cdg':
		replace = [next((s for s in cs if 'Period' in s), None),
			 next((s for s in cs if 'Fantasy' in s), None),
			 next((s for s in cs if 'Contemporary' in s), None)]
		id = [0, 0, 0]
	elif name == 'ofta':
		replace = [next((s for s in cs if 'Best Picture' in s), None),
			 next((s for s in cs if 'Best Actor' in s), None),
			 next((s for s in cs if 'Breakthrough' in s and 'Male' in s), None),
			 next((s for s in cs if 'Best Actress' in s), None),
			 next((s for s in cs if 'Breakthrough' in s and 'Female' in s), None),
			 next((s for s in cs if 'Actor' in s and 'Supporting' in s), None),
			 next((s for s in cs if 'Actress' in s and 'Supporting' in s), None),
			 next((s for s in cs if 'Animated' in s), None),
			 next((s for s in cs if 'Cinematography' in s), None),
			 next((s for s in cs if 'Costume Design' in s), None),
			 next((s for s in cs if 'Director' in s), None),
			 next((s for s in cs if 'Documentary' in s), None),
			 next((s for s in cs if 'Film Editing' in s), None),
			 next((s for s in cs if 'Foreign' in s), None),
			 next((s for s in cs if 'Makeup' in s or 'Hair' in s), None),
			 next((s for s in cs if 'Original Score' in s), None),
			 next((s for s in cs if 'Original Song' in s), None),
			 next((s for s in cs if 'Production Design' in s), None),
			 next((s for s in cs if 'Sound' in s and 'Editing' in s), None),
			 next((s for s in cs if 'Sound' in s and 'Mixing' in s), None),
			 next((s for s in cs if 'Visual Effects' in s), None),
			 next((s for s in cs if 'Screenplay' in s and 'Another' in s), None),
			 next((s for s in cs if 'Screenplay' in s and 'Directly' in s), None)]
		id = [0, 1, 1, 2, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 16, 19, 20, 21, 22, 23]
	elif name == 'ofcs':
		replace = [next((s for s in cs if 'Best Picture' in s), None),
			 next((s for s in cs if 'Actor' in s and 'Supporting' not in s), None),
			 next((s for s in cs if 'Actress' in s and 'Supporting' not in s), None),
			 next((s for s in cs if 'Actor' in s and 'Supporting' in s), None),
			 next((s for s in cs if 'Actress' in s and 'Supporting' in s), None),
			 next((s for s in cs if 'Animated' in s), None),
			 next((s for s in cs if 'Cinematography' in s), None),
			 next((s for s in cs if 'Costume Design' in s), None),
			 next((s for s in cs if 'Director' in s), None),
			 next((s for s in cs if 'Documentary' in s), None),
			 next((s for s in cs if 'Editing' in s), None),
			 next((s for s in cs if 'Not' in s and 'English' in s), None),
			 next((s for s in cs if 'Original Score' in s), None),
			 next((s for s in cs if 'Original Song' in s), None),
			 next((s for s in cs if 'Sound' in s), None),
			 next((s for s in cs if 'Visual Effects' in s), None),
			 next((s for s in cs if 'Screenplay' in s and 'Adapted' in s), None),
			 next((s for s in cs if 'Screenplay' in s and 'Original' in s), None)]
		id = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 14, 15, 19, 21, 22, 23]
	elif name == 'cc':
		replace = [next((s for s in cs if 'Best Picture' in s), None),
			 next((s for s in cs if 'Best Action Movie' in s), None),
			 next((s for s in cs if 'Best Comedy' in s), None),
			 next((s for s in cs if 'Best Sci-Fi' in s or 'Best Horror' in s), None),
			 next((s for s in cs if 'Actor' in s and 'Comedy' not in s and 'Supporting' not in s), None),
			 next((s for s in cs if 'Actor' in s and 'Comedy' in s and 'Supporting' not in s), None),
			 next((s for s in cs if 'Actress' in s and 'Comedy' not in s and 'Supporting' not in s), None),
			 next((s for s in cs if 'Actress' in s and 'Comedy' in s and 'Supporting' not in s), None),
			 next((s for s in cs if 'Actor' in s and 'Supporting' in s), None),
			 next((s for s in cs if 'Actress' in s and 'Supporting' in s), None),
			 next((s for s in cs if 'Animated' in s), None),
			 next((s for s in cs if 'Cinematography' in s), None),
			 next((s for s in cs if 'Costume Design' in s), None),
			 next((s for s in cs if 'Director' in s), None),
			 next((s for s in cs if 'Editing' in s), None),
			 next((s for s in cs if 'Foreign' in s), None),
			 next((s for s in cs if 'Makeup' in s or 'Hair' in s), None),
			 next((s for s in cs if 'Score' in s), None),
			 next((s for s in cs if 'Song' in s), None),
			 next((s for s in cs if 'Production Design' in s), None),
			 next((s for s in cs if 'Visual Effects' in s), None),
			 next((s for s in cs if 'Adapted Screenplay' in s), None),
			 next((s for s in cs if 'Original Screenplay' in s), None)]
		id = [0, 0, 0, 0, 1, 1, 2, 2, 3, 4, 5, 6, 7, 8, 11, 12, 13, 14, 15, 16, 21, 22, 23]
	elif name == 'lccf':
		replace = [next((s for s in cs if 'Film' in s), None),
			 next((s for s in cs if 'Actor' in s and 'Supporting' not in s), None),
			 next((s for s in cs if 'Actress' in s and 'Supporting' not in s), None),
			 next((s for s in cs if 'Actor' in s and 'Supporting' in s), None),
			 next((s for s in cs if 'Actress' in s and 'Supporting' in s), None),
			 next((s for s in cs if 'Director' in s), None),
			 next((s for s in cs if 'Documentary' in s), None),
			 next((s for s in cs if 'Foreign' in s), None)]
		id = [0, 1, 2, 3, 4, 8, 9, 12]
	elif name == 'ace':
		replace = [next((s for s in cs if 'Feature Film' in s and 'Drama' in s), None),
			 next((s for s in cs if 'Feature Film' in s and 'Comedy' in s), None),
			 next((s for s in cs if 'Animated' in s), None),
			 next((s for s in cs if 'Documentary' in s), None)]
		id = [0, 0, 5, 9]
	else:  # Oscars
		replace = [next((s for s in cs if 'Picture' in s), None),
			 next((s for s in cs if 'Actor' in s and 'Leading' in s), None),
			 next((s for s in cs if 'Actress' in s and 'Leading' in s), None),
			 next((s for s in cs if 'Actor' in s and 'Supporting' in s), None),
			 next((s for s in cs if 'Actress' in s and 'Supporting' in s), None),
			 next((s for s in cs if 'Animated' in s and 'Short' not in s), None),
			 next((s for s in cs if 'Cinematography' in s), None),
			 next((s for s in cs if 'Costume Design' in s), None),
			 next((s for s in cs if 'Direct' in s), None),
			 next((s for s in cs if 'Documentary' in s), None),
			 next((s for s in cs if 'Documentary' in s and 'Short' in s), None),
			 next((s for s in cs if 'Film Editing' in s), None),
			 next((s for s in cs if 'Foreign' in s), None),
			 next((s for s in cs if 'Makeup' in s or 'Hair' in s), None),
			 next((s for s in cs if 'Original Score' in s), None),
			 next((s for s in cs if 'Original Song' in s), None),
			 next((s for s in cs if 'Production Design' in s), None),
			 next((s for s in cs if 'Short' in s and 'Animat' in s), None),
			 next((s for s in cs if 'Short' in s and 'Live' in s), None),
			 next((s for s in cs if 'Sound' in s and 'Editing' in s), None),
			 next((s for s in cs if 'Sound' in s and 'Mixing' in s), None),
			 next((s for s in cs if 'Visual Effects' in s), None),
			 next((s for s in cs if 'Screenplay' in s and 'Adapted' in s), None),
			 next((s for s in cs if 'Screenplay' in s and 'Original' in s), None)]
		id = list(range(0, 24))

	none_index = [i for i in replace if i is None]
	cs = [c for c in cs if c not in none_index]
	aw = [a for a in aw if a not in none_index]

	for i, c in enumerate(cs):
		cs[i] = str(id[i])
	return cs, aw
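
The lookup used throughout `id_categories` is `next()` over a generator with a default of `None`: it returns the first category name that satisfies the keyword test, or `None` if that category does not exist for the year. A minimal sketch of the idea, with a hypothetical helper `find_category` and made-up category names:

```python
def find_category(categories, required, excluded=()):
    """Return the first category containing all `required` words and none of `excluded`, else None."""
    return next(
        (c for c in categories
         if all(word in c for word in required) and not any(word in c for word in excluded)),
        None,
    )

# Made-up category names for illustration.
sample = [
    'Best Performance by an Actor in a Supporting Role',
    'Best Performance by an Actor in a Leading Role',
    'Best Motion Picture of the Year',
]

lead_actor = find_category(sample, ['Actor'], ['Supporting'])
editing = find_category(sample, ['Film Editing'])
print(lead_actor)
print(editing)
```

The `None` result is what signals a category that is missing in a given year.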

The functions above would have been a lot more concise and simple (with fewer if/else statements and list comprehensions) if the format of the categories and the listing of winners/nominees had stayed consistent throughout the years.
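
One possible way to shorten the long chains of `'X' not in i and 'Y' not in i` in the category filters is to keep the excluded words in a tuple and test them with `any()`. This is only a sketch of that alternative, not the code used in the project, and the category names are made up.

```python
# Words that mark categories we do not want (a subset chosen for illustration).
EXCLUDE = ('Series', 'Television', 'Stunt', 'Debut')

# Made-up category names.
sample = ['Best Picture', 'Best Drama Series', 'Best Stunt Coordination',
          'Best Director', 'Best Television Movie']

# Keep a category only if none of the excluded words appear in its name.
kept = [c for c in sample if not any(word in c for word in EXCLUDE)]
print(kept)
```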

Here are just a few of the countless exceptions/difficulties that needed to be dealt with:
1. The positions of the movie title and the actor/actress/director/etc. are switched in some categories
![Best Picture](./ex1.jpg)
![Best Actor](./ex2.jpg)
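
In the scraping code, this first exception is handled by reading the movie title from a different field depending on the category: for person-centered categories the title is taken from `secondaryNominees`, otherwise from `primaryNominees`. The rule can be sketched on its own as follows, where the helper name `movie_title_field` is hypothetical and the category names are examples only:

```python
PERSON_KEYWORDS = ('Actor', 'Actress', 'Director')

def movie_title_field(category):
    """Pick the field that holds the movie title for a given category name."""
    if any(word in category for word in PERSON_KEYWORDS):
        return 'secondaryNominees'  # the person is listed first, the movie second
    return 'primaryNominees'        # the movie itself is listed first

print(movie_title_field('Best Performance by an Actor in a Leading Role'))
print(movie_title_field('Best Motion Picture of the Year'))
```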