# COGS 108 - Final Project (change this to your project's title)

# Permissions

Place an `X` in the appropriate bracket below to specify if you would like your group's project to be made available to the public. (Note that student names will be included, but PIDs will be scraped from any groups who include their PIDs.)

* [  ] YES - make available
* [  ] NO - keep private

# Names

- Dhanashree Kulkarni
- Neela Kolte
- Krystal Chai
- Kira Nguyen
- Curtis Chen

# Abstract

Please write one to four paragraphs that give a very brief overview of why you did this, how you did it, and the major findings and conclusions.

# Research Question

-  Include a specific, clear data science question.
-  Make sure what you're measuring (variables) to answer the question is clear

What is your research question? Include the specific question you're setting out to answer. This question should be specific, answerable with data, and clear. A general question with specific subquestions is permitted. (1-2 sentences)



## Background and Prior Work


The Oscars is an awards ceremony that honors notable films in 24 categories, including Best Supporting Actress, Best Visual Effects, and Best Director. Nominees in each category are chosen by the Academy, whose membership is predominantly white (reportedly 19% nonwhite in 2022). Because Academy members also choose the winners, the lack of diversity among nominees and the perception of prejudiced voting have drawn much controversy.

The Oscars has historically been accused of racism and sexism, and of often failing to represent the opinions of its viewers. Non-white artists still receive far fewer nominations, and the award winners remain unrepresentative. From 1989 to 2015, 98.9% of winners of the Best Actress award and 93.2% of winners of the Best Actor award were white.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) Since the first Academy Awards in 1929, only about 17% of the nominees have been women. The Academy has claimed to take big steps to combat racism and bias in the proceedings of the awards, but its efforts to combat sexism have been far less pronounced.

Nonetheless, there have been improvements. In a panicked response to the public backlash against this inequity in 2015, which trended under the hashtag #OscarsSoWhite, the Academy doubled the number of women and tripled the number of members of color on its board.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) The overall share of nominations going to underrepresented communities has also increased from about 9.5% to 17% since the hashtag made its rounds.<a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3) However, The Inclusion List, an ongoing data analysis project that promotes inclusivity in the entertainment industry, corroborates the need for even greater initiatives. Although its analysis, broken down by gender and ethnicity for every award category, shows an overall uptick, its visualizations often highlight the remaining inequality (e.g., a pie chart with the text “<2% of nominees for Best Director were women”).<a name="cite_ref-4"></a>[<sup>4</sup>](#cite_note-4) Our project covers a similar but narrower domain, and aims to contextualize the analysis through recent trends in film and, if possible, further demographic information.

1. <a name="cite_note-1"></a> [^](#cite_ref-1) Chen, S. (13 May 2022) The Discriminatory Bias of Award Shows. *The Spectator*. https://stuyspec.com/article/the-discriminatory-bias-of-award-shows 
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Latif, L. (14 April 2021) Has the Oscars really faced up to its race problem? *BBC*. https://www.bbc.com/culture/article/20210414-has-the-oscars-really-fixed-its-race-problem 
3. <a name="cite_note-3"></a> [^](#cite_ref-3) Smith, S. (2025) Oscars So White *The Inclusion List*. https://www.inclusionlist.org/oscars/oscars-so-white 
4. <a name="cite_note-4"></a> [^](#cite_ref-4) Smith, S. (2025) Best Director *The Inclusion List*. https://www.inclusionlist.org/oscars/director 


# Hypothesis



We hypothesize that an Academy Award nominee’s likelihood of winning (win rate) is lower if they are not white and/or not male, and that race and gender intersect in this effect. We control for the type of film by 1) mainly focusing on categories that do not take film type into account, and 2) aggregating all types of films for the writing categories. We intend to test for statistical evidence that disparities by race and gender hinder nominees’ likelihood of winning.

Additionally, we want to include birthplace as a proxy for industry connection. We expect a positive correlation between industry connection and nomination rate, which would further expand on prior work.
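
To make "win rate" concrete, the sketch below computes it for each race-by-gender group on a small made-up table. The rows are illustrative only and are not values from our datasets; the column names follow the cleaned `data1` used later in this notebook.

```python
import pandas as pd

# Hypothetical toy nominations (NOT real Oscar data); columns mirror the cleaned data1.
toy = pd.DataFrame({
    'Race':   ['White', 'White', 'White', 'White', 'Black', 'Black', 'Asian', 'Asian'],
    'Gender': ['Male', 'Male', 'Female', 'Female', 'Male', 'Female', 'Male', 'Female'],
    'Winner': [True, False, True, False, False, True, False, False],
})

# Win rate = wins / nominations within each Race x Gender group.
win_rate = (
    toy.groupby(['Race', 'Gender'])['Winner']
       .agg(nominations='size', wins='sum')
       .assign(win_rate=lambda t: t['wins'] / t['nominations'])
)
print(win_rate)
```

With real data, groups with very few nominations will have unstable win rates, so the `nominations` column should be read alongside `win_rate`.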



# Data

## Data overview

For each dataset include the following information
- Dataset #1
  - Denoted as oscars_data.csv in the repo.
  - Dataset Name: Academy awards dataset (oscars)
  - Link to the dataset: https://www.kaggle.com/datasets/dharmikdonga/academy-awards-dataset-oscars%20/data
  - Number of observations: 10396
  - Number of variables: 9
- Dataset #2
  - Denoted as birthplace_data.csv in the repo.
  - Dataset Name: Oscars - Best Actors and Actresses
  - Link to the dataset: http://jse.amstat.org/datasets/oscars.dat.txt
  - Number of observations: 155
  - Number of variables: 10 in the raw file (4 kept after cleaning)

Dataset #1 builds on another Kaggle dataset, https://www.kaggle.com/datasets/unanimad/the-oscar-award/data, and expands on it by adding columns for race and gender. It contains the gender, race, category, ceremony number, and ceremony year for all Oscar winners and nominees from 1927 to 2019. It also includes the film name and the year the film was made. The variables in the dataset are of object (string), Boolean, and integer type. The wrangling process for this dataset included keeping only the award categories relevant to our research, dropping the columns we did not need (film title and film year), and standardizing the category labels into four groups (Best Actor, Best Actress, Best Director, and Best Screenplay). After cleaning the .csv file, we did not find any null values.

Dataset #2 was curated by the Journal of Statistics Education. It has data on the winners of the Best Actor and Best Actress awards from 1929 to 2005, including their birthplace, year of birth, birth month, birth day, and the age at which they won the Oscar. For our purposes we extracted only the columns for the name of the actor/actress, their gender, the number of the ceremony in which they won, and their birthplace. Birthplace is defined as the U.S. state of birth for those born in the U.S., or the country of birth for those born elsewhere. All the variables in this dataset are of object (string) data type. The wrangling process for this dataset consisted of keeping the relevant columns and dropping the rest. There is no missing data, but the dataset covers only a subset of the years covered by Dataset #1.
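
Because Dataset #2 covers only part of the period in Dataset #1, it helps to check how many winners actually have a birthplace match before analyzing the two together. The sketch below does this on made-up rows. Joining on `Ceremony` and `Name` is our assumption about a workable key (it requires names to be spelled identically in both files), not something the datasets guarantee.

```python
import pandas as pd

# Hypothetical rows (NOT real data); columns mirror the cleaned data1 and data2.
winners = pd.DataFrame({
    'Ceremony': [1, 1, 2, 90],
    'Name': ['Actor A', 'Actress B', 'Actor C', 'Actor D'],
    'Winner': [True, True, True, True],
})
birthplaces = pd.DataFrame({
    'Ceremony': [1, 1, 2],
    'Name': ['Actor A', 'Actress B', 'Actor C'],
    'State/Country': ['California', 'England', 'New York'],
})

# A left join keeps every winner; `_merge` records whether a birthplace was found.
merged = winners.merge(birthplaces, on=['Ceremony', 'Name'], how='left', indicator=True)
coverage = (merged['_merge'] == 'both').mean()
print(merged[['Ceremony', 'Name', 'State/Country', '_merge']])
print(f'Share of winners with a birthplace match: {coverage:.2f}')
```

Rows flagged `left_only` are winners with no birthplace record, which in the real data would include ceremonies outside the range covered by Dataset #2.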



In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1.inset_locator import inset_axes
import folium
from folium.plugins import HeatMap
from geopy.geocoders import Nominatim
from geopy.exc import GeocoderTimedOut
import time
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import statsmodels.formula.api as smf

## Dataset #1 (oscars_data.csv)

In [None]:
data1 = pd.read_csv('oscars_data.csv')
print(data1.head())
print('Shape of data1: ', data1.shape)

In [None]:
# Keeping only the award categories that we want to focus on in our project.
categories_of_interest = [
    'Best Actor', 'Best Actress',
    'DIRECTING (Comedy Picture)', 'DIRECTING (Dramatic Picture)', 'DIRECTING',
    'WRITING (Adapted Screenplay)', 'WRITING (Original Story)', 'WRITING (Title Writing)',
    'WRITING (Original Screenplay)', 'WRITING (Original Motion Picture Story)',
    'WRITING (Motion Picture Story)', 'WRITING (Story and Screenplay)',
    'WRITING (Screenplay--Original)',
    'WRITING (Story and Screenplay--written directly for the screen)',
    'WRITING (Story and Screenplay--based on material not previously published or produced)',
    'WRITING (Story and Screenplay--based on factual material or material not previously published or produced)',
    'WRITING (Screenplay Written Directly for the Screen--based on factual material or on story material not previously published or produced)',
    'WRITING (Screenplay Written Directly for the Screen)',
]
data1 = data1[data1['Category'].isin(categories_of_interest)]
# Dropping the columns that we are not using in our analysis.
data1 = data1.drop(columns=['film', 'year_film'])
# Renaming the remaining columns of data1 (names are assigned by position).
data1.columns = ['Year_ceremony', 'Ceremony', 'Category', 'Gender', 'Name', 'Race', 'Winner']
print(data1.head())
print('Revised shape of data1: ', data1.shape)

In [None]:
# Seeing if all values of our revised DataFrame are applicable to our subsequent function.
list(data1['Category'].unique())

In [None]:
# Creating a function that uses simple substring checks to map each raw category label to one of four groups.
def standardize_categories(category):
    # Lowercasing and stripping surrounding whitespace so the checks are case-insensitive.
    category = category.strip().lower()
    if 'actor' in category:
        return 'Best Actor'
    if 'actress' in category:
        return 'Best Actress'
    if 'directing' in category:
        return 'Best Director'
    if 'writing' in category:
        return 'Best Screenplay'
    # Any other label is outside the scope of our analysis.
    return None

In [None]:
# Applying the transformation to our revised dataset.
data1['Category'] = data1['Category'].apply(standardize_categories)
# Verifying that the function works and is applied to our new dataset.
list(data1['Category'].unique())

In [None]:
# Checking for missing data.
data1.isnull().sum()

In [None]:
print(data1['Race'].value_counts())
print(data1['Gender'].value_counts())

In [None]:
# Visualizing the distribution of race entries in our dataset.
sns.countplot(x = data1['Race'])
plt.show()

In [None]:
# Visualizing the distribution of male and female entries in our dataset.
sns.countplot(x = data1['Gender'])
plt.show()

## Dataset #2 (birthplace_data.csv)

In [None]:
data2 = pd.read_csv('birthplace_data.csv')
# Naming the columns of data2.
data2.columns = ['Gender','Ceremony','Year_of_award','Name','Film','Age_when_won','State/Country','Birth_month','Birth_day','Birth_year']
# Dropping the columns that we are not using in our analysis.
data2.drop(['Year_of_award','Film','Age_when_won','Birth_month','Birth_day','Birth_year'], axis=1, inplace=True)
print(data2.head())
print('Shape of data2: ', data2.shape)

In [None]:
# Checking for missing data.
print(data2.isnull().sum())

In [None]:
# Visualizing the distribution of male and female entries in our dataset.
print(data2['Gender'].value_counts())
sns.countplot(x = data2['Gender'])
plt.show()

# Results

## Exploratory Data Analysis



## Visualizing Trends in Oscar Winners by Race and Gender Over Time



In [None]:
## YOUR CODE HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION
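
One way to fill in this section is to compute, for each decade, the share of winners in each race or gender group, and then plot that table. The sketch below builds such a table from made-up rows. The `Decade` column and the toy values are assumptions for illustration, while the other column names follow the cleaned `data1`.

```python
import pandas as pd

# Hypothetical winners (NOT real data); columns mirror the cleaned data1.
toy = pd.DataFrame({
    'Year_ceremony': [1995, 1996, 1999, 2003, 2005, 2008, 2012, 2015],
    'Gender': ['Male', 'Male', 'Female', 'Male', 'Female', 'Female', 'Male', 'Female'],
    'Winner': [True] * 8,
})

winners = toy[toy['Winner']].copy()
# Bucketing ceremony years into decades, e.g. 1995 -> 1990.
winners['Decade'] = (winners['Year_ceremony'] // 10) * 10

# Rows: decade; columns: gender; values: share of that decade's winners.
share = pd.crosstab(winners['Decade'], winners['Gender'], normalize='index')
print(share)
# share.plot(kind='bar', stacked=True) would draw the trend with matplotlib.
```

Swapping `'Gender'` for `'Race'` gives the corresponding table by race.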

## Inferential Statistics 



In [None]:
# Converting the Boolean Winner column to 1s and 0s for the logistic regression.
data1['Winner'] = data1['Winner'].astype(int)

In [None]:
# Running a logistic regression statistics model.
mod = smf.logit('Winner ~ C(Gender, Treatment(reference="Male")) + C(Race, Treatment(reference="White")) + C(Category) + Year_ceremony', data1).fit()
print(mod.summary())
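
Logit coefficients are on the log-odds scale, so exponentiating them gives odds ratios relative to the reference group, which are easier to read against our hypothesis. The sketch below fits the same kind of model on synthetic data and tabulates odds ratios with 95% confidence intervals. The simulated effect size is an arbitrary assumption, not an estimate from our data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic nominees (NOT real data); the effect size below is an arbitrary assumption.
rng = np.random.default_rng(0)
n = 2000
sim = pd.DataFrame({'Gender': rng.choice(['Male', 'Female'], size=n)})
log_odds = (-1.4 - 0.5 * (sim['Gender'] == 'Female')).to_numpy()
sim['Winner'] = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))

fit = smf.logit('Winner ~ C(Gender, Treatment(reference="Male"))', sim).fit(disp=0)

# exp(coefficient) = odds ratio versus the reference level (Male here).
ci = fit.conf_int()
odds = pd.DataFrame({
    'odds_ratio': np.exp(fit.params),
    'ci_low': np.exp(ci[0]),
    'ci_high': np.exp(ci[1]),
})
print(odds)
```

An odds ratio below 1 whose interval excludes 1 would indicate lower odds of winning than the reference group. The same transformation can be applied to `mod.params` and `mod.conf_int()` from the model above.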

## Predictive Model



In [None]:
# THIS IS THE PREPROCESSING FOR THE MODEL.
# Defining the predictors and target variables.
X = data1[['Category', 'Race', 'Gender', 'Year_ceremony']]
y= data1['Winner']

In [None]:
# Encoding the categorical independent variables using the OneHotEncoder.
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
X_encoded = encoder.fit_transform(X[['Race', 'Gender', 'Category']])

In [None]:
# Creating a new dataframe from the encoded values, keeping the index of X so rows stay aligned with y.
X_encoded_df = pd.DataFrame(X_encoded, columns=encoder.get_feature_names_out(['Race', 'Gender', 'Category']), index=X.index)
# Adding the Year_ceremony values to the independent variables.
X_encoded_df['Year_ceremony'] = X['Year_ceremony']

In [None]:
# Splitting data into training and test data and shuffling.
X_train, X_test, y_train, y_test = train_test_split(X_encoded_df, y, test_size=0.5, random_state=42, shuffle=True)

In [None]:
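# Scaling the Year_ceremony column using StandardScaler.
# The scaler is fit on the training data only and then reused on the test data,
# so that no information from the test set leaks into preprocessing.
sc = StandardScaler()
X_train = X_train.copy()
X_test = X_test.copy()
X_train['Year_ceremony'] = sc.fit_transform(X_train[['Year_ceremony']]).ravel()
X_test['Year_ceremony'] = sc.transform(X_test[['Year_ceremony']]).ravel()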

In [None]:
# Training Random Forest Classifier based on n_estimators that we determined through cross-validation scores.
rf = RandomForestClassifier(n_estimators=9, random_state=0)
rf.fit(X_train, y_train)
# Making predictions, and printing the classification report and confusion matrix.
y_pred = rf.predict(X_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
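
The value of `n_estimators` above came from cross-validation scores. A minimal sketch of that kind of search is shown here on synthetic data. The candidate grid, the `balanced_accuracy` scoring choice, and the simulated rows are assumptions for illustration, not the exact procedure we ran. Wrapping the encoder in a pipeline means each fold's preprocessing is fit only on that fold's training split.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic nominees (NOT real data); roughly one winner per five nominees.
rng = np.random.default_rng(0)
n = 400
X_sim = pd.DataFrame({
    'Race': rng.choice(['White', 'Black', 'Asian'], size=n),
    'Gender': rng.choice(['Male', 'Female'], size=n),
    'Year_ceremony': rng.integers(1930, 2020, size=n),
})
y_sim = rng.binomial(1, 0.2, size=n)

# One-hot encode the categorical columns; Year_ceremony passes through unscaled.
encode = ColumnTransformer(
    [('onehot', OneHotEncoder(handle_unknown='ignore'), ['Race', 'Gender'])],
    remainder='passthrough',
)

# Mean 5-fold balanced accuracy for each candidate forest size (hypothetical grid).
candidates = [5, 9, 25, 50]
scores = {}
for k in candidates:
    model = make_pipeline(encode, RandomForestClassifier(n_estimators=k, random_state=0))
    scores[k] = cross_val_score(model, X_sim, y_sim, cv=5, scoring='balanced_accuracy').mean()

best_n = max(scores, key=scores.get)
print(scores, best_n)
```

Because winners are a minority class, plain accuracy can look high for a model that never predicts a win, which is why this sketch scores on balanced accuracy instead.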

# Ethics & Privacy

- Thoughtful discussion of ethical concerns included
- Ethical concerns consider the whole data science process (question asked, data collected, data being used, the bias in data, analysis, post-analysis, etc.)
- How your group handled bias/ethical concerns clearly described

Acknowledge and address any ethics & privacy related issues of your question(s), proposed dataset(s), and/or analyses. Use the information provided in lecture to guide your group discussion and thinking. If you need further guidance, check out [Deon's Ethics Checklist](http://deon.drivendata.org/#data-science-ethics-checklist). In particular:

- Are there any biases/privacy/terms of use issues with the data you proposed?
- Are there potential biases in your dataset(s), in terms of who it composes, and how it was collected, that may be problematic in terms of it allowing for equitable analysis? (For example, does your data exclude particular populations, or is it likely to reflect particular human biases in a way that could be a problem?)
- How will you set out to detect these specific biases before, during, and after/when communicating your analysis?
- Are there any other issues related to your topic area, data, and/or analyses that are potentially problematic in terms of data privacy and equitable impact?
- How will you handle issues you identified?

# Discussion and Conclusion

Wrap it all up here.  Somewhere between 3 and 10 paragraphs roughly.  A good time to refer back to your Background section and review how this work extended the previous stuff. 


# Team Contributions

Specify who did what.  This should be pretty granular, perhaps bullet points, no more than a few sentences per person.