In [49]:
import pandas as pd
import numpy as np

In [50]:
oscar_df = pd.read_csv('the_oscar_award.csv')
metadata_df = pd.read_csv('movies_metadata.csv')



In [51]:
metadata_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

In [52]:
oscar_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10395 entries, 0 to 10394
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   year_film      10395 non-null  int64 
 1   year_ceremony  10395 non-null  int64 
 2   ceremony       10395 non-null  int64 
 3   category       10395 non-null  object
 4   name           10395 non-null  object
 5   film           10091 non-null  object
 6   winner         10395 non-null  bool  
dtypes: bool(1), int64(3), object(3)
memory usage: 497.5+ KB


As per the Academy Awards rules, films nominated in a given year must have been exhibited between March and December of the previous year. The year_film and year_ceremony columns are therefore redundant, since for every data point we expect:
year_film = year_ceremony - 1
We check that this is the case: the cell below selects the rows that violate this relation, and the result is empty.

In [53]:
mask = oscar_df.year_film == (oscar_df.year_ceremony-1)
oscar_df[~mask]

Unnamed: 0,year_film,year_ceremony,ceremony,category,name,film,winner


The same redundancy argument applies to the ceremony column, which numbers the ceremonies consecutively and is therefore determined by year_ceremony. We drop year_film and ceremony, keeping only year_ceremony.
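
The redundancy of the ceremony column can also be checked rather than assumed. The following is a minimal sketch on a made-up toy frame (the `toy` data and the variable names are illustrative, not taken from the dataset): it verifies that each year_ceremony value maps to exactly one ceremony number.

```python
import pandas as pd

# Toy stand-in for oscar_df; the values are invented for illustration.
toy = pd.DataFrame({
    "year_ceremony": [1928, 1928, 1929, 1930, 1930],
    "ceremony":      [1,    1,    2,    3,    3],
})

# Number of distinct ceremony values observed for each year_ceremony.
ceremonies_per_year = toy.groupby("year_ceremony")["ceremony"].nunique()

# True when every year corresponds to exactly one ceremony number.
ceremony_is_redundant = bool((ceremonies_per_year == 1).all())
print(ceremony_is_redundant)
```

Running the same two lines on oscar_df before the drop would confirm (or refute) the claim for the real data.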

In [54]:
oscar_df = oscar_df.drop(['year_film','ceremony'], axis=1)

Some data points have null values in the 'film' column. We explore these rows to find out in which categories this happens.

In [55]:
oscar_df[oscar_df.film.isnull()].category.unique()

array(['ENGINEERING EFFECTS', 'WRITING (Title Writing)', 'SPECIAL AWARD',
       'SOUND RECORDING', 'ASSISTANT DIRECTOR',
       'IRVING G. THALBERG MEMORIAL AWARD',
       'SPECIAL FOREIGN LANGUAGE FILM AWARD',
       'HONORARY FOREIGN LANGUAGE FILM AWARD', 'HONORARY AWARD',
       'JEAN HERSHOLT HUMANITARIAN AWARD', 'SPECIAL ACHIEVEMENT AWARD'],
      dtype=object)

Some of these categories are special or honorary awards, which are not necessarily tied to a specific film: we are going to ignore these rows.

In [56]:
# Honorary and special award categories, not tied to a specific film.
honorary_categories = [
    'HONORARY AWARD',
    'SPECIAL AWARD',
    'IRVING G. THALBERG MEMORIAL AWARD',
    'JEAN HERSHOLT HUMANITARIAN AWARD',
    'SPECIAL ACHIEVEMENT AWARD',
    'HONORARY FOREIGN LANGUAGE FILM AWARD',
    'SPECIAL FOREIGN LANGUAGE FILM AWARD',
]
mask = oscar_df.category.isin(honorary_categories)

oscar_df = oscar_df[~mask]
oscar_df[oscar_df.film.isnull()].head()

Unnamed: 0,year_ceremony,category,name,film,winner
16,1928,ENGINEERING EFFECTS,Ralph Hammeras,,False
18,1928,ENGINEERING EFFECTS,Nugent Slaughter,,False
31,1928,WRITING (Title Writing),Joseph Farnham,,True
32,1928,WRITING (Title Writing),"George Marion, Jr.",,False
145,1931,SOUND RECORDING,Samuel Goldwyn - United Artists Studio Sound D...,,False


As for the other 30 rows, a manual check shows that the film is also missing on the official Oscars website: it is possible, for example, that the people nominated in these categories worked on several films in the same year.
For this reason we are going to drop these rows as well.
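
Before dropping, it can help to see how the remaining missing films are distributed over the categories. The sketch below uses a small invented frame (`toy`, with made-up names and films) in place of oscar_df; it only illustrates the pattern of counting the null rows per category and then removing them.

```python
import pandas as pd

# Toy stand-in for oscar_df after the honorary categories were removed.
toy = pd.DataFrame({
    "category": ["ENGINEERING EFFECTS", "ENGINEERING EFFECTS",
                 "SOUND RECORDING", "ACTOR"],
    "name": ["A", "B", "C", "D"],
    "film": [None, None, None, "Some Film"],
})

# Rows that still have no film, and how many there are per category.
missing = toy[toy["film"].isnull()]
n_missing = len(missing)
missing_per_category = missing["category"].value_counts()
print(missing_per_category)

# Same call as in the notebook: drop the rows without a film.
cleaned = toy.dropna(subset=["film"])
```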

In [57]:
oscar_df = oscar_df.dropna(subset=['film'])
oscar_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10091 entries, 0 to 10390
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   year_ceremony  10091 non-null  int64 
 1   category       10091 non-null  object
 2   name           10091 non-null  object
 3   film           10091 non-null  object
 4   winner         10091 non-null  bool  
dtypes: bool(1), int64(1), object(3)
memory usage: 404.0+ KB


In [58]:
metadata_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

In [59]:
metadata_df.video.value_counts()

False    45367
True        93
Name: video, dtype: int64

In [60]:
metadata_df.status.value_counts()

Released           45014
Rumored              230
Post Production       98
In Production         20
Planned               15
Canceled               2
Name: status, dtype: int64

We now continue with the exploration and cleaning of the metadata dataset. As we can see, the dataset has 24 columns, many of which are not useful for the scope of this project:
- 'belongs_to_collection': indicates whether the film belongs to one of the listed collections
- 'homepage': a link to the film's website
- 'id' and 'imdb_id': identifiers of the film in the database
- 'overview': a summary of the film's plot
- 'tagline': a sentence the film was advertised with
- 'poster_path': the name of a file containing the poster for the film

The 'video' column, on the other hand, is a boolean that distinguishes films from other types of video content. In this project we are only interested in films, so we drop the rows where video is True and then drop the column itself.
As the value counts above show, the vast majority of rows (45367 of the 45460 with a non-null value) have False in this column.

We do the same for the 'status' column, which indicates whether a film has been released: films that have not been released are not eligible for an Oscar nomination. Once again, dropping these rows costs few data points, since 45014 of the 45379 rows with a known status are 'Released'.
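
To make the size of the loss explicit, the two filters can be combined into one boolean mask whose mean is the fraction of rows kept. This is a sketch on an invented frame (`toy` and its values are assumptions for illustration). Note that a missing video value compares as not equal to False, so such rows are discarded as well.

```python
import numpy as np
import pandas as pd

# Toy stand-in for metadata_df; the values are invented for illustration.
toy = pd.DataFrame({
    "title":  ["a", "b", "c", "d", "e"],
    "video":  [False, False, True, False, np.nan],
    "status": ["Released", "Rumored", "Released", "Released", "Released"],
})

# Rows that survive both filters: not a video and already released.
keep = (toy["video"] == False) & (toy["status"] == "Released")

# Fraction of the original rows that is kept.
kept_fraction = keep.mean()
print(f"kept {keep.sum()} of {len(toy)} rows ({kept_fraction:.0%})")

filtered = toy[keep].drop(["video", "status"], axis=1)
```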

In [61]:
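# Drop the columns listed above as not useful ('tagline' included, to match the list).
metadata_df = metadata_df.drop(['belongs_to_collection','homepage','id','imdb_id','overview','tagline','poster_path'], axis=1)
# Keep only released films; rows with a missing video or status value are discarded too.
metadata_df = metadata_df[metadata_df.video == False]
metadata_df = metadata_df[metadata_df.status == 'Released']
# Both columns are now constant, so they carry no information.
metadata_df = metadata_df.drop(['video','status'], axis=1)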