# Data Processing

The Hugging Face dataset [Movie-O-Label](https://huggingface.co/datasets/Francis2003/Movie-O-Label) accurately labels which movies earned an Oscar nomination, but its labels for which movies won are not accurate. Fortunately, one of Movie-O-Label's [reference datasets](https://github.com/DLu/oscar_data) does contain this information.

This notebook corrects the `winner` labels in the Hugging Face dataset by cross-referencing the reference dataset's `FilmId` against the dataset's `imdb_id`, then saves each split as a `parquet` file. Only wins in the `Writing` class are counted.

In [None]:

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from datasets import load_dataset

import scipy
import sklearn 
import statsmodels

import os

## Load Data

In [None]:
raw_dir = os.path.join('..','data', 'raw')
processed_dir = os.path.join('..','data', 'processed')

In [None]:
# Load the dataset from the local copy stored under raw_dir
# (the path points at a directory on disk, not at a Hub repository ID).
ds = load_dataset(os.path.join(raw_dir, 'Movie-O-Label'))
print(ds)

DatasetDict({
    train: Dataset({
        features: ['movie_name', 'imdb_id', 'title', 'year', 'summary', 'script', 'script_plain', 'script_clean', 'nominated', 'winner'],
        num_rows: 1320
    })
    validation: Dataset({
        features: ['movie_name', 'imdb_id', 'title', 'year', 'summary', 'script', 'script_plain', 'script_clean', 'nominated', 'winner'],
        num_rows: 440
    })
    test: Dataset({
        features: ['movie_name', 'imdb_id', 'title', 'year', 'summary', 'script', 'script_plain', 'script_clean', 'nominated', 'winner'],
        num_rows: 440
    })
})


In [3]:
df_train = ds['train'].to_pandas()
df_val   = ds['validation'].to_pandas()
df_test  = ds['test'].to_pandas()

df_train.head()

Unnamed: 0,movie_name,imdb_id,title,year,summary,script,script_plain,script_clean,nominated,winner
0,Above the Law_1988,tt0094602,Above the Law,1988,"Sergeant Nico Toscani, a native of Palermo, Si...",<script>\n <scene>\n <stage_direction>ABOV...,\n \n ABOVE THE LAW \n TITLES SEQUE...,ABOVE THE LAW\nTITLES SEQUENCE - MONTAGE WITH ...,0,0
1,Fracture_2007,tt0488120,Fracture,2007,"Theodore ""Ted"" Crawford (Anthony Hopkins), a w...",<script>\n <scene>\n <stage_direction>FRAC...,\n \n FRACTURE \n CREDITS SEQUENCE ...,FRACTURE\nCREDITS SEQUENCE : EXTREME CLOSE - U...,0,0
2,She Said_2022,tt11198810,She Said,2022,"In 2017, New York Times reporter Jodi Kantor r...",<script>\n <scene>\n <character>SHE SAID</...,\n \n SHE SAID \n Screenplay by \n ...,SHE SAID\nScreenplay by\nRebecca Lenkiewicz Ba...,0,0
3,Unbroken_2014,tt1809398,Unbroken,2014,During an April 1943 bombing mission against t...,<script>\n <scene>\n <character>UNBROKEN</...,\n \n UNBROKEN \n Screenplay by \n ...,UNBROKEN\nScreenplay by\nJoel Coen &amp; Ethan...,0,0
4,The Bonfire of the Vanities_1990,tt0099165,The Bonfire of the Vanities,1990,Sherman McCoy is a Wall Street bond trader who...,<script>\n <scene>\n <stage_direction>EXT....,\n \n EXT. MANHATTAN SKYLINE - NIGHT \n...,EXT. MANHATTAN SKYLINE - NIGHT\nMOVING IN FAST...,0,0


In [4]:
dfs = [df_train, df_val, df_test]

In [None]:
# Despite the .csv extension, the file is tab-separated.
csv_df = pd.read_csv(os.path.join(raw_dir, 'oscar_data', 'oscars.csv'), sep='\t')

In [6]:
df_filter = (csv_df['Class'] == 'Writing') & (csv_df['Winner'] == True)
oscar_wins_df = csv_df[df_filter]

oscar_wins_df.head()

Unnamed: 0,Ceremony,Year,Class,CanonicalCategory,Category,Film,FilmId,Name,Nominees,NomineeIds,Winner,Detail,Note,Citation
26,1,1927/28,Writing,WRITING (Adapted Screenplay),WRITING (Adaptation),7th Heaven,tt0018379,Benjamin Glazer,Benjamin Glazer,nm0322227,True,,,
28,1,1927/28,Writing,WRITING (Original Story),WRITING (Original Story),Underworld,tt0018526,Ben Hecht,Ben Hecht,nm0372942,True,,,
30,1,1927/28,Writing,WRITING (Title Writing),WRITING (Title Writing),,,Joseph Farnham,Joseph Farnham,nm0267868,True,,NOTE: This award was not associated with any s...,
69,2,1928/29,Writing,WRITING (Adapted Screenplay),WRITING,The Patriot,tt0019257,Hans Kraly,Hans Kraly,nm0473134,True,,NOTE: THIS IS NOT AN OFFICIAL NOMINATION. Ther...,
110,3,1929/30,Writing,WRITING (Adapted Screenplay),WRITING,The Big House,tt0020686,Frances Marion,Frances Marion,nm0547966,True,,,


## Clean Data

For every record in `oscar_wins_df`, check whether its `FilmId` appears as an `imdb_id` in any split of the Hugging Face dataset. If so, mark that row as a winner in the corresponding DataFrame.

In [10]:
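# Skip awards that have no associated film (missing FilmId).
for film_id in oscar_wins_df['FilmId'].dropna():
    for df in dfs:
        # Rows in this split whose imdb_id matches the winning film
        match = df['imdb_id'] == str(film_id)
        if match.any():
            if match.sum() > 1:
                print(f"Weird! Multiple matches found for film_id: {film_id}")
                print(df[match])
                print()
            # dfs holds references to df_train/df_val/df_test, so this updates them in place
            df.loc[match, 'winner'] = 1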
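
The same cross-reference can also be written without the nested loop. The following is a minimal sketch, not the method used above: it runs on small made-up frames (`oscar_wins` and `split` are hypothetical stand-ins for `oscar_wins_df` and one of the splits, and the IDs are invented), and it uses `Series.isin` to build the match mask in one step.

```python
import pandas as pd

# Toy stand-ins for the notebook's frames (hypothetical rows, not real data).
oscar_wins = pd.DataFrame({"FilmId": ["tt0000001", None, "tt0000003"]})
split = pd.DataFrame({
    "imdb_id": ["tt0000001", "tt0000002", "tt0000003"],
    "winner": [0, 0, 0],
})

# Collect the winning IDs once, skipping awards that have no FilmId.
winning_ids = set(oscar_wins["FilmId"].dropna().astype(str))

# Boolean mask of rows whose imdb_id is a winning film, then flag them in place.
is_winner = split["imdb_id"].isin(winning_ids)
split.loc[is_winner, "winner"] = 1

print(split["winner"].tolist())
```

Like the loop, this only ever sets `winner` to 1; it does not report duplicate `imdb_id` matches, so the loop's "multiple matches" warning would need a separate check such as `split["imdb_id"].duplicated()`.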

In [11]:
df_train['winner'].unique()

array([0, 1])

In [12]:
df_val['winner'].unique()

array([0, 1])

In [13]:
df_test['winner'].unique()

array([0, 1])
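
Checking `unique()` confirms that both label values are present but not how many rows carry each. A slightly stronger sanity check might look like the sketch below, which again uses a made-up frame (`split` and its rows are hypothetical). It counts rows per label and lists winners that are not marked as nominees, on the assumption that `nominated` and `winner` refer to the same award category.

```python
import pandas as pd

# Hypothetical split with the same label columns as the notebook's frames.
split = pd.DataFrame({
    "imdb_id": ["tt0000001", "tt0000002", "tt0000003", "tt0000004"],
    "nominated": [1, 1, 0, 0],
    "winner": [1, 0, 0, 0],
})

# Count rows per label value instead of only listing the distinct values.
winner_counts = split["winner"].value_counts().sort_index()
print(winner_counts.to_dict())

# Winners that are not flagged as nominees would be worth inspecting,
# assuming the two columns describe the same award category.
suspicious = split[(split["winner"] == 1) & (split["nominated"] == 0)]
print(len(suspicious))
```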

In [None]:
df_train.to_parquet(os.path.join(processed_dir,'train_clean.parquet'))

In [None]:
df_val.to_parquet(os.path.join(processed_dir,'val_clean.parquet'))

In [None]:
df_test.to_parquet(os.path.join(processed_dir,'test_clean.parquet'))