# Data Processing

The Hugging Face dataset [Movie-O-Label](https://huggingface.co/datasets/Francis2003/Movie-O-Label) does not accurately label which movies won an Oscar; its labels are reliable only for which movies earned a nomination. Fortunately, one of Movie-O-Label's [reference datasets](https://github.com/DLu/oscar_data) does contain this information.

This notebook corrects the winner labels in the Hugging Face dataset by matching `FilmId` from the reference dataset against `imdb_id` in the Hugging Face dataset, then saves the corrected splits as `parquet` files.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from datasets import load_dataset

import scipy
import sklearn 
import statsmodels

import os

## Load Data

In [None]:
raw_dir = os.path.join('..','data', 'raw')
processed_dir = os.path.join('..','data', 'processed')

In [None]:
# Load the Movie-O-Label dataset from the local copy under data/raw
# (this reads from disk; nothing is downloaded from the Hugging Face Hub here).
ds = load_dataset(os.path.join(raw_dir, 'Movie-O-Label'))
print(ds)

In [None]:
df_train = ds['train'].to_pandas()
df_val   = ds['validation'].to_pandas()
df_test  = ds['test'].to_pandas()

df_train.head()

In [None]:
dfs = [df_train, df_val, df_test]

In [None]:
# Despite the .csv extension, the file is read as tab-separated.
csv_df = pd.read_csv(os.path.join(raw_dir, 'oscar_data', 'oscars.csv'), sep='\t')

In [None]:
# Keep only winning entries in the 'Writing' class.
df_filter = (csv_df['Class'] == 'Writing') & (csv_df['Winner'] == True)
oscar_wins_df = csv_df[df_filter]

oscar_wins_df.head()

## Clean Data

For every record in `oscar_wins_df`, check whether its `FilmId` appears as an `imdb_id` in any split of the Hugging Face dataset. If it does, mark the matching rows as winners in that split's DataFrame.

In [None]:
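for film_id in oscar_wins_df['FilmId']:
    for df in dfs:
        # Boolean mask of rows in this split whose imdb_id matches the winning film.
        df_filter = df['imdb_id'] == str(film_id)
        if df_filter.any():
            # A film should appear at most once per split; report duplicates.
            if df_filter.sum() > 1:
                print(f"Weird! Multiple matches found for film_id: {film_id}")
                print(df[df_filter])
                print()
            # Label every matching row as a winner (modifies the split in place).
            df.loc[df_filter, 'winner'] = 1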
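
The same cross-reference can be expressed without nested loops by using `Series.isin`. The following is a minimal sketch on toy data, not the notebook's actual cell: `mark_winners`, `toy_split`, and `toy_wins` are hypothetical names introduced for illustration. Unlike the loop above, it skips missing IDs instead of converting them to strings, and it returns a copy rather than modifying the split in place.

```python
import pandas as pd


def mark_winners(df: pd.DataFrame, winning_ids) -> pd.DataFrame:
    """Return a copy of df with 'winner' set to 1 where imdb_id is a winning ID."""
    out = df.copy()
    # Normalize IDs to strings and drop missing values (an assumption of this sketch).
    ids = {str(i) for i in winning_ids if pd.notna(i)}
    out.loc[out["imdb_id"].isin(ids), "winner"] = 1
    return out


# Toy stand-ins for one dataset split and for the FilmId column of the winners table.
toy_split = pd.DataFrame(
    {"imdb_id": ["tt0000001", "tt0000002", "tt0000003"], "winner": [0, 0, 0]}
)
toy_wins = pd.Series(["tt0000002", None])

labeled = mark_winners(toy_split, toy_wins)
print(labeled)
```

Applied to the real splits, this would amount to something like `df_train = mark_winners(df_train, oscar_wins_df['FilmId'])`, with the same call repeated for the validation and test splits. It does not report duplicate matches the way the loop does.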

In [None]:
df_train['winner'].unique()

In [None]:
df_val['winner'].unique()

In [None]:
df_test['winner'].unique()
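
Beyond checking the unique values, it can help to see how many rows in each split ended up labeled as winners before saving. The sketch below assumes a `winner` column holding 0/1 values, as in the cells above. `winner_summary` and the toy frames are hypothetical and only illustrate the idea.

```python
import pandas as pd


def winner_summary(splits: dict) -> pd.DataFrame:
    """Return one row per split with its total row count and number of winners."""
    return pd.DataFrame(
        {
            name: {"rows": len(df), "winners": int((df["winner"] == 1).sum())}
            for name, df in splits.items()
        }
    ).T


# Toy stand-ins for the three splits.
toy_splits = {
    "train": pd.DataFrame({"winner": [0, 1, 0]}),
    "validation": pd.DataFrame({"winner": [0, 0]}),
    "test": pd.DataFrame({"winner": [1]}),
}

summary = winner_summary(toy_splits)
print(summary)
```

With the notebook's variables, the call would look like `winner_summary({'train': df_train, 'validation': df_val, 'test': df_test})`.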

In [None]:
df_train.to_parquet(os.path.join(processed_dir,'train_clean.parquet'))

In [None]:
df_val.to_parquet(os.path.join(processed_dir,'val_clean.parquet'))

In [None]:
df_test.to_parquet(os.path.join(processed_dir,'test_clean.parquet'))