# Analysis of Oscar Genre Bias

This is the script we used to analyze discrepancies between the rate at which different genres receive Best Picture nominations and the rate at which movies in those genres are produced.

## Step One: Read the data

We're going to start by reading the IMDb data and Oscar data we collected into memory. 

In [1]:
import collections
import pandas as pd
import numpy as np
IMDB_FILE = "data/imdb_movie_data.csv"
OSCAR_FILE = "data/merged_major_categories.csv"
oscar_data = pd.read_csv(OSCAR_FILE)
imdb_data = pd.read_csv(IMDB_FILE, low_memory=False)
oscar_data.groupby("award").describe()["start_year"]

award,count,mean,std,min,25%,50%,75%,max
ACTOR,232.0,1952.086207,13.763978,1927.0,1941.0,1952.0,1964.0,1975.0
ACTOR IN A LEADING ROLE,215.0,1997.0,12.438634,1976.0,1986.0,1997.0,2008.0,2018.0
ACTOR IN A SUPPORTING ROLE,415.0,1977.0,23.987215,1936.0,1956.0,1977.0,1998.0,2018.0
ACTRESS,236.0,1951.728814,13.923903,1927.0,1940.0,1952.0,1964.0,1975.0
ACTRESS IN A LEADING ROLE,215.0,1997.0,12.438634,1976.0,1986.0,1997.0,2008.0,2018.0
ACTRESS IN A SUPPORTING ROLE,415.0,1977.0,23.987215,1936.0,1956.0,1977.0,1998.0,2018.0
BEST MOTION PICTURE,90.0,1952.5,5.217193,1944.0,1948.0,1952.5,1957.0,1961.0
BEST PICTURE,324.0,1992.783951,17.224153,1962.0,1978.0,1994.0,2009.0,2018.0
DIRECTING,444.0,1973.984234,25.856388,1928.0,1952.0,1974.0,1996.0,2018.0
DIRECTING (Comedy Picture),2.0,1927.0,0.0,1927.0,1927.0,1927.0,1927.0,1927.0


In [2]:
best_picture_categories = {
    "BEST MOTION PICTURE", 
    "BEST PICTURE", 
    "OUTSTANDING MOTION PICTURE", 
    "OUTSTANDING PICTURE", 
    "OUTSTANDING PRODUCTION"
}
oscar_data.head()

Unnamed: 0,year,award,won,nominee,nominated_film,addl_notes,special_citation,start_year,oscar_id,title_lower,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,title
0,1927/28,ACTOR,False,Richard Barthelmess,The Noose,,,1927,0,the noose,tt0019217,movie,The Noose,The Noose,0,1928,,65.0,Drama,the noose
1,1927/28,ACTOR,True,Emil Jannings,The Last Command,,,1927,1,the last command,tt0019071,movie,The Last Command,The Last Command,0,1928,,88.0,"Drama,History,Romance",the last command
2,1927/28,ACTRESS,False,Louise Dresser,A Ship Comes In,,,1927,2,a ship comes in,tt0018389,movie,A Ship Comes In,A Ship Comes In,0,1928,,70.0,Drama,a ship comes in
3,1927/28,ACTRESS,True,Janet Gaynor,7th Heaven,,,1927,3,7th heaven,tt0018379,movie,7th Heaven,7th Heaven,0,1927,,110.0,"Drama,Romance",7th heaven
4,1927/28,ACTRESS,False,Gloria Swanson,Sadie Thompson,,,1927,4,sadie thompson,tt0019344,movie,Sadie Thompson,Sadie Thompson,0,1928,,97.0,Drama,sadie thompson


## Step Two: Filtering the Data

From here, we're going to filter out films that are not feature-length fictional films or that were not released between 1930 and 2018 (the period our Oscar data covers). We're also going to collect the genres for each of these films.

In [3]:
best_picture = oscar_data[
    (oscar_data.award.isin(best_picture_categories)) &
    (oscar_data.start_year >= 1930)
].copy()
genres = best_picture.genres.str.split(",").apply(collections.Counter)
oscar_data_g = genres.sum().most_common()
oscar_data_g

[('Drama', 477),
 ('Romance', 167),
 ('Comedy', 117),
 ('Biography', 108),
 ('History', 64),
 ('Adventure', 63),
 ('Crime', 61),
 ('War', 38),
 ('Action', 32),
 ('Thriller', 32),
 ('Musical', 29),
 ('Fantasy', 27),
 ('Family', 26),
 ('Mystery', 25),
 ('Music', 22),
 ('Western', 14),
 ('Film-Noir', 13),
 ('Sport', 10),
 ('Sci-Fi', 10),
 ('Animation', 3),
 ('Horror', 2)]
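
The counts above come from summing one `collections.Counter` per film. As a hedged alternative, the same tally can be sketched with `Series.explode` and `value_counts`, which avoids building a Counter for every row. The `count_genres` helper and the toy data are assumptions made for illustration, not part of the original analysis. Unlike the cell above, this sketch silently skips rows with a missing genre string.

```python
import pandas as pd

def count_genres(genres: pd.Series) -> pd.Series:
    """Count how many films list each genre in a comma-separated column."""
    return (
        genres.dropna()      # skip films with no genre string
        .str.split(",")      # "Drama,Romance" -> ["Drama", "Romance"]
        .explode()           # one row per (film, genre) pair
        .value_counts()      # occurrences per genre, most common first
    )

# Toy data for illustration only (not the real Oscar file).
toy = pd.Series(["Drama", "Drama,History,Romance", "Drama,Romance", None])
toy_counts = count_genres(toy)
print(toy_counts)
```

On the real data this would be called as `count_genres(best_picture.genres)`, assuming the column holds comma-separated strings as shown in the `head()` output.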

In [4]:
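# Keep only rows whose start year is numeric (missing years are stored as "\N").
imdb_data = imdb_data[imdb_data.startYear.str.match(r"[0-9]+")].copy()
imdb_data["year"] = imdb_data.startYear.apply(int)
# .copy() avoids SettingWithCopyWarning when columns are added or rows dropped later.
imdb_data = imdb_data[(imdb_data.year >= 1930) & (imdb_data.year <= 2018)].copy()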
imdb_data.drop(imdb_data[
    (imdb_data.genres == "\\N") | 
    (imdb_data.genres.str.contains("Short")) | 
    (imdb_data.genres.str.contains("Documentary")) |
    (imdb_data.genres.str.contains("Reality-TV")) | 
    (imdb_data.genres.str.contains("News")) |
    (imdb_data.genres.str.contains("Game-Show")) |
    (imdb_data.genres.str.contains("Talk-Show")) |
    (imdb_data.genres.str.contains("Adult"))
].index, inplace=True)
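
The chain of `str.contains` checks above can also be written as a set test on the split genre list. This is a minimal sketch under the assumption that every genre string is either `\N` or a comma-separated list of genre names. The `EXCLUDED_GENRES` set, the `is_fiction_feature` helper, and the toy rows are hypothetical names introduced here. One behavioral difference: this version matches whole genre names, whereas `str.contains` matches substrings.

```python
import pandas as pd

# Same exclusion list as the cell above.
EXCLUDED_GENRES = {
    "Short", "Documentary", "Reality-TV", "News",
    "Game-Show", "Talk-Show", "Adult",
}

def is_fiction_feature(genres: str) -> bool:
    """Return True if the genre string is present and has no excluded genre."""
    if genres == "\\N":  # marker used in the data for a missing value
        return False
    return EXCLUDED_GENRES.isdisjoint(genres.split(","))

# Toy rows for illustration only; titles are placeholders.
toy = pd.DataFrame({
    "primaryTitle": ["A", "B", "C", "D"],
    "genres": ["Drama,Romance", "Documentary,History", "\\N", "Comedy,Short"],
})
kept = toy[toy.genres.apply(is_fiction_feature)]
print(kept)
```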

In [5]:
imdb_genres = imdb_data.genres.str.split(",").apply(collections.Counter)
imdb_genre_g = imdb_genres.sum().most_common()
imdb_genre_g

[('Drama', 144790),
 ('Comedy', 74009),
 ('Action', 35043),
 ('Romance', 33843),
 ('Crime', 24440),
 ('Thriller', 23571),
 ('Horror', 19297),
 ('Adventure', 17272),
 ('Family', 10925),
 ('Mystery', 10479),
 ('Fantasy', 9005),
 ('Musical', 8017),
 ('Sci-Fi', 6752),
 ('War', 5919),
 ('Music', 5180),
 ('History', 5122),
 ('Animation', 4835),
 ('Biography', 4509),
 ('Western', 4369),
 ('Sport', 3170),
 ('Film-Noir', 778)]

In [6]:
{ genre:f"{value/len(best_picture) * 100:.2f}%" for genre, value in oscar_data_g }

{'Drama': '88.17%',
 'Romance': '30.87%',
 'Comedy': '21.63%',
 'Biography': '19.96%',
 'History': '11.83%',
 'Adventure': '11.65%',
 'Crime': '11.28%',
 'War': '7.02%',
 'Action': '5.91%',
 'Thriller': '5.91%',
 'Musical': '5.36%',
 'Fantasy': '4.99%',
 'Family': '4.81%',
 'Mystery': '4.62%',
 'Music': '4.07%',
 'Western': '2.59%',
 'Film-Noir': '2.40%',
 'Sport': '1.85%',
 'Sci-Fi': '1.85%',
 'Animation': '0.55%',
 'Horror': '0.37%'}

In [7]:
{ genre:f"{(value/len(imdb_data)) * 100:.2f}%" for genre, value in imdb_genre_g }

{'Drama': '53.24%',
 'Comedy': '27.21%',
 'Action': '12.89%',
 'Romance': '12.44%',
 'Crime': '8.99%',
 'Thriller': '8.67%',
 'Horror': '7.10%',
 'Adventure': '6.35%',
 'Family': '4.02%',
 'Mystery': '3.85%',
 'Fantasy': '3.31%',
 'Musical': '2.95%',
 'Sci-Fi': '2.48%',
 'War': '2.18%',
 'Music': '1.90%',
 'History': '1.88%',
 'Animation': '1.78%',
 'Biography': '1.66%',
 'Western': '1.61%',
 'Sport': '1.17%',
 'Film-Noir': '0.29%'}

Here, we check how the share of dramas changed over the decades, both among Best Picture nominees and among all films produced. Neither changed much over this period, so we did not include this data in our final analysis.

In [8]:
best_picture["decade"] = best_picture.start_year.apply(str).str.extract(r"(^[0-9]{3})").apply(lambda x: x + "0s")
best_picture["drama_count"] = best_picture.genres.str.split(",").apply(collections.Counter).apply(lambda x: x["Drama"])
best_picture.groupby("decade").sum()["drama_count"] / best_picture.groupby("decade").count()["drama_count"] * 100

decade
1930s    75.862069
1940s    88.571429
1950s    90.000000
1960s    82.000000
1970s    90.000000
1980s    96.000000
1990s    98.000000
2000s    90.909091
2010s    89.873418
Name: drama_count, dtype: float64

In [9]:
imdb_data["decade"] = imdb_data.startYear.apply(str).str.extract(r"(^[0-9]{3})").apply(lambda x: x + "0s")
imdb_data["drama_count"] = imdb_data.genres.str.split(",").apply(collections.Counter).apply(lambda x: x["Drama"])
imdb_data.groupby("decade").sum()["drama_count"] / imdb_data.groupby("decade").count()["drama_count"]

decade
1930s    0.575566
1940s    0.537697
1950s    0.569901
1960s    0.523474
1970s    0.536685
1980s    0.548082
1990s    0.523921
2000s    0.538633
2010s    0.512566
Name: drama_count, dtype: float64

In [10]:
best_picture.groupby("decade").count().drama_count

decade
1930s    87
1940s    70
1950s    50
1960s    50
1970s    50
1980s    50
1990s    50
2000s    55
2010s    79
Name: drama_count, dtype: int64
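
The decade labels above are built by extracting the first three digits of the year with a regular expression. A possible alternative, sketched here with integer arithmetic, gives the same labels for four-digit years. The `decade_label` helper and the toy values are assumptions for illustration. Because `drama_count` is 0 or 1 per film, the group mean equals the sum divided by the count used above.

```python
import pandas as pd

def decade_label(year: int) -> str:
    """Map a year such as 1994 to the label '1990s'."""
    return f"{(int(year) // 10) * 10}s"

# Toy values for illustration only (not the real nominee list).
toy = pd.DataFrame({
    "start_year": [1936, 1939, 1944, 1994, 2018],
    "drama_count": [1, 0, 1, 1, 1],
})
toy["decade"] = toy.start_year.apply(decade_label)

# The mean of a 0/1 column is the share of dramas, scaled here to a percentage.
drama_share = toy.groupby("decade").drama_count.mean() * 100
print(drama_share)
```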

## Overall Genre Rates

Now, we're going to find the rate at which each of several genres has received Best Picture nominations and compare those to the rate at which movies of those genres have been produced.

In [11]:
best_picture.drama_count.sum() / len(best_picture)

0.8817005545286506

In [12]:
imdb_data.drama_count.sum() / len(imdb_data)

0.5324062157570766

In [13]:
imdb_data.genres.str.split(",").apply(collections.Counter).apply(lambda x: x["Action"]).sum() / len(imdb_data)

0.12885635070636947

In [14]:
best_picture.genres.str.split(",").apply(collections.Counter).apply(lambda x: x["Action"]).sum() / len(best_picture)

0.059149722735674676

In [15]:
imdb_data.genres.str.split(",").apply(collections.Counter).apply(lambda x: x["Comedy"]).sum() / len(imdb_data)

0.2721379350919641

In [16]:
best_picture.genres.str.split(",").apply(collections.Counter).apply(lambda x: x["Comedy"]).sum() / len(best_picture)

0.21626617375231053
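
Cells 11 through 16 repeat the same calculation for different genres and data frames. One way to factor that out is a small helper; this is a sketch, with `genre_rate` and the toy frame being hypothetical names that do not appear in the original notebook. It assumes each film lists a genre at most once, so checking membership is equivalent to the Counter lookup used above.

```python
import pandas as pd

def genre_rate(df: pd.DataFrame, genre: str) -> float:
    """Fraction of rows in df whose comma-separated genres include genre."""
    has_genre = df.genres.str.split(",").apply(lambda names: genre in names)
    return has_genre.sum() / len(df)

# Toy data for illustration only.
toy = pd.DataFrame({"genres": ["Drama", "Action,Drama", "Comedy", "Comedy,Romance"]})
rates = {g: genre_rate(toy, g) for g in ["Drama", "Action", "Comedy"]}
print(rates)
```

With the real data, `genre_rate(best_picture, "Action")` and `genre_rate(imdb_data, "Action")` would stand in for cells 14 and 13 respectively.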

## Rate Differences

Now, we'll see which genres disproportionately receive or disproportionately do not receive Best Picture nominations. In the end, we focused on cases where both the underlying percentage of production *and* the difference were particularly striking.

In [17]:
best_picture_pcts = { genre: value/len(best_picture) * 100 for genre, value in oscar_data_g }
imdb_pcts = { genre: (value/len(imdb_data)) * 100 for genre, value in imdb_genre_g }
genres_df = pd.concat(
    [pd.Series(best_picture_pcts), pd.Series(imdb_pcts)],
    axis=1,
    sort=False,
    keys=["best_picture", "imdb"]
).fillna(0)
genres_df["difference"] = genres_df.best_picture - genres_df.imdb
genres_df.sort_values(by="difference", ascending=False).head()

genre,best_picture,imdb,difference
Drama,88.170055,53.240622,34.929434
Romance,30.868762,12.444384,18.424378
Biography,19.963031,1.658001,18.30503
History,11.829945,1.883407,9.946538
Adventure,11.645102,6.351074,5.294028


In [18]:
genres_df.sort_values(by="difference").head(6)

genre,best_picture,imdb,difference
Action,5.914972,12.885635,-6.970663
Horror,0.369686,7.095685,-6.726
Comedy,21.626617,27.213794,-5.587176
Thriller,5.914972,8.667275,-2.752302
Animation,0.554529,1.777874,-1.223346
Sci-Fi,1.848429,2.482773,-0.634344
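
The notebook does not state how the final genres were picked from these tables. As a rough sketch of one way to shortlist them, the filter below keeps genres whose production share and absolute difference both exceed a cutoff. The two thresholds are hypothetical values chosen only for illustration, and the frame is rebuilt from a few rows of the tables above so the example is self-contained.

```python
import pandas as pd

# Percentages copied from the two tables above (a subset of genres).
genres_df = pd.DataFrame(
    {
        "best_picture": [88.170055, 19.963031, 5.914972, 0.369686, 1.848429],
        "imdb": [53.240622, 1.658001, 12.885635, 7.095685, 2.482773],
    },
    index=["Drama", "Biography", "Action", "Horror", "Sci-Fi"],
)
genres_df["difference"] = genres_df.best_picture - genres_df.imdb

# Hypothetical cutoffs, not taken from the original analysis.
MIN_IMDB_SHARE = 5.0  # genre makes up at least 5% of films produced
MIN_ABS_DIFF = 5.0    # gap of at least 5 percentage points, in either direction

notable = genres_df[
    (genres_df.imdb >= MIN_IMDB_SHARE)
    & (genres_df.difference.abs() >= MIN_ABS_DIFF)
]
print(notable.index.tolist())
```

Under these assumed cutoffs, Biography drops out despite its large difference because it accounts for under 2% of films produced.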
