## Data cleaning of movie data set

### Reducing the data set from its 18,000+ rows

#### We do this by dropping movies without a release_date and movies missing both budget and revenue, since filling in two unknown values would be inaccurate and such rows carry little meaning. We also drop movies whose release date is before 1927 or after 2022, because the Oscars cover movies from 1927 to the present.

#### The assumption is that movies missing both budget and revenue add little value to our data set.

In [2]:
import pandas as pd
import numpy as np
from tqdm import tqdm
import math
from sklearn.impute import SimpleImputer

Importing csv data and viewing first 5 rows


In [3]:
data = pd.read_csv("MovieData.csv")
data.head()

Unnamed: 0,Index,id,original_title,popularity,budget,revenue,release_date,popularity.1,vote_average,runtime,top_casts_popularity_avg,casts_popularity_sum,top_cast_popularity,top_crews_popularity_avg,crews_popularity_sum,top_crew_popularity
0,0,2,Ariel,9.553,,,1988.0,10.656,7.053,73.0,2.515714,43.183,3.574,2.277857,36.045,2.936
1,1,3,Varjoja paratiisissa,9.228,,,1986.0,8.276,7.183,74.0,2.672714,40.189,3.469,2.158429,28.203,2.936
2,2,5,Four Rooms,18.254,4000000.0,4257354.0,1995.0,22.784,5.744,98.0,32.033857,362.055,36.681,15.715714,225.92,27.939
3,3,6,Judgment Night,11.309,21000000.0,12136938.0,1993.0,11.53,6.543,109.0,14.834857,124.008,23.049,3.600143,32.165,5.436
4,4,11,Star Wars,87.513,11000000.0,775398007.0,1977.0,86.624,8.207,121.0,17.367714,297.868,37.206,7.543,121.872,10.994


In [4]:
data.info()
data.shape[0]


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18350 entries, 0 to 18349
Data columns (total 16 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Index                     18350 non-null  int64  
 1   id                        18350 non-null  int64  
 2   original_title            18350 non-null  object 
 3   popularity                18350 non-null  float64
 4   budget                    8064 non-null   float64
 5   revenue                   8417 non-null   float64
 6   release_date              18319 non-null  float64
 7   popularity.1              18349 non-null  float64
 8   vote_average              17894 non-null  float64
 9   runtime                   18063 non-null  float64
 10  top_casts_popularity_avg  18154 non-null  float64
 11  casts_popularity_sum      18154 non-null  float64
 12  top_cast_popularity       18154 non-null  float64
 13  top_crews_popularity_avg  18256 non-null  float64
 14  crews_

18350

There are 18350 rows

Resetting the index in case it is not in order

In [5]:
data = data.reset_index(drop=True)  # reset_index returns a new frame, so assign it back
data

Unnamed: 0,Index,id,original_title,popularity,budget,revenue,release_date,popularity.1,vote_average,runtime,top_casts_popularity_avg,casts_popularity_sum,top_cast_popularity,top_crews_popularity_avg,crews_popularity_sum,top_crew_popularity
0,0,2,Ariel,9.553,,,1988.0,10.656,7.053,73.0,2.515714,43.183,3.574,2.277857,36.045,2.936
1,1,3,Varjoja paratiisissa,9.228,,,1986.0,8.276,7.183,74.0,2.672714,40.189,3.469,2.158429,28.203,2.936
2,2,5,Four Rooms,18.254,4000000.0,4257354.0,1995.0,22.784,5.744,98.0,32.033857,362.055,36.681,15.715714,225.920,27.939
3,3,6,Judgment Night,11.309,21000000.0,12136938.0,1993.0,11.530,6.543,109.0,14.834857,124.008,23.049,3.600143,32.165,5.436
4,4,11,Star Wars,87.513,11000000.0,775398007.0,1977.0,86.624,8.207,121.0,17.367714,297.868,37.206,7.543000,121.872,10.994
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18345,18345,1101747,Despierta,10.543,,,2023.0,3.430,2.000,11.0,1.848500,3.697,2.937,1.064000,1.064,1.064
18346,18346,1101748,I am Poem,10.543,,,2023.0,3.391,,15.0,0.600000,2.400,0.600,0.600000,0.600,0.600
18347,18347,1101750,El olvido,10.543,,,2023.0,4.587,2.000,8.0,0.600000,1.800,0.600,0.600000,0.600,0.600
18348,18348,1101848,Gaia,10.543,,,2023.0,5.061,,9.0,0.600000,1.200,0.600,0.600000,1.200,0.600


Dropping rows where both "budget" and "revenue" are null, since it would be difficult to fill in these values with strategies such as the median when we do not know how inflation affects them

In [113]:
# A single call is enough: dropna already scans every row, so no loop is needed
data.dropna(subset=["budget","revenue"],how="all",inplace=True,axis =0)


The data now has 9983 entries
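
To make the `how="all"` behaviour concrete, here is a minimal sketch on a toy frame. The titles and amounts are made up for illustration and are not taken from MovieData.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical rows covering the four possible cases
toy = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "budget": [np.nan, 100.0, np.nan, 50.0],
    "revenue": [np.nan, np.nan, 200.0, 75.0],
})

# how="all" drops a row only when every column in `subset` is missing,
# so rows with just one of the two values survive
kept = toy.dropna(subset=["budget", "revenue"], how="all")
print(kept["original_title"].tolist())  # ['B', 'C', 'D']
```

Only row "A", which lacks both values, is removed.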

In [6]:
print(data.shape)
print(data["release_date"].isnull().value_counts())

(18350, 16)
False    18319
True        31
Name: release_date, dtype: int64


From the values, 31 movies have a missing release_date. They are removed by the year filter below, because comparisons against NaN evaluate to False.

Excluding movies whose release date is before 1927 or after 2022, since the first Oscars were for movies released in 1927 and the latest are for movies released in 2022

In [14]:
data = pd.DataFrame(data[(data["release_date"]<=2022) & (data["release_date"]>=1927)])
data.shape

(17656, 16)
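
A short sketch of how this kind of range filter treats boundary years and missing values. The `years` series is a made-up example, not the real column.

```python
import numpy as np
import pandas as pd

# Hypothetical release years, including both boundaries and one missing value
years = pd.Series([1926.0, 1927.0, 2022.0, 2023.0, np.nan], name="release_date")

# Same shape of condition as the cell above; NaN compares as False on both sides
mask = (years >= 1927) & (years <= 2022)
print(mask.tolist())  # [False, True, True, False, False]

# Series.between is inclusive on both ends by default and gives the same mask
same_mask = years.between(1927, 2022)
```

Rows with a missing release_date therefore drop out of the filtered frame without a separate `dropna` call.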

In [146]:
data2 = data.drop(["Index"],axis=1)
data2=data2.reset_index(drop=True)

Obtaining the percentage revenue relative to budget, (revenue/budget)*100%, so that we compare percentages rather than absolute values to account for inflation

In [148]:
# Vectorised form of the row-by-row loop; it covers every row of data2 instead of a hard-coded count.
# NaN in either "revenue" or "budget" propagates to the result.
percentage_revenue = (data2["revenue"] / data2["budget"] * 100).round(2).tolist()

In [149]:
print(len(percentage_revenue))
p_revenue={}
p_revenue["percentage_revenue"]=percentage_revenue
percent_revenue = pd.DataFrame(p_revenue)
print(percent_revenue)
data3 = pd.concat([data2,percent_revenue],axis=1)


9882
      percentage_revenue
0                 106.43
1                  57.79
2                7049.07
3                1000.36
4                1231.61
...                  ...
9877                 NaN
9878                 NaN
9879                 NaN
9880                 NaN
9881                 NaN

[9882 rows x 1 columns]
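
As a worked check of the formula, the sketch below recomputes the ratio for the budget and revenue figures shown in the `data.head()` output earlier. The `sample` frame is only an illustration, not part of the pipeline.

```python
import numpy as np
import pandas as pd

# Budget and revenue copied from the head() output above; Ariel has neither value
sample = pd.DataFrame({
    "original_title": ["Four Rooms", "Judgment Night", "Star Wars", "Ariel"],
    "budget": [4000000.0, 21000000.0, 11000000.0, np.nan],
    "revenue": [4257354.0, 12136938.0, 775398007.0, np.nan],
})

# (revenue / budget) * 100, rounded to 2 decimals; missing inputs give NaN
sample["percentage_revenue"] = (sample["revenue"] / sample["budget"] * 100).round(2)
print(sample[["original_title", "percentage_revenue"]])
```

The first three results (106.43, 57.79 and 7049.07) agree with the first rows of the printed `percentage_revenue` column.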


Importing csv file which contains all best picture winning movies from 1927 to 2022

In [10]:
winners = pd.read_csv("someWinners.csv")

In [11]:
winners.head()
# name refers to the company who won

Unnamed: 0,Index,year_film,film,winner
0,0,1927,The Last Command,True
1,1,1927,7th Heaven,True
2,2,1927,Wings,True
3,3,1928,In Old Arizona,True
4,4,1928,Coquette,True


Dropping Index and year_film so that only the film title and the winner flag remain

In [12]:
col = ["year_film","Index"]
winners=winners.drop(col,axis =1)

Removing any duplicate movies

In [15]:
data4 = data.drop_duplicates(subset=["original_title"],keep='first')
data4 = data4.reset_index(drop=True)
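
One thing to keep in mind is that de-duplicating on the title alone also merges different films that happen to share a name. The sketch below uses made-up rows to show the effect, and shows a stricter key (title plus year) as one possible alternative, not as what this notebook does.

```python
import pandas as pd

# Hypothetical rows: two different films share the title "Gaia"
toy = pd.DataFrame({
    "original_title": ["Gaia", "Wings", "Gaia"],
    "release_date": [2021.0, 1927.0, 2023.0],
})

# Same call as above: keeps the first row per title
deduped = toy.drop_duplicates(subset=["original_title"], keep="first").reset_index(drop=True)
print(deduped)

# Assumed alternative: treat title + year as the identity of a film
by_title_and_year = toy.drop_duplicates(subset=["original_title", "release_date"], keep="first")
```

With the title-only key the 2023 "Gaia" is discarded; with the two-column key all three rows are kept.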


In [16]:
print(winners.shape) #keeping track of the dimensions
print(data4.shape) 
print(winners.head())
print(data4.head())

(243, 2)
(17060, 16)
               film  winner
0  The Last Command    True
1        7th Heaven    True
2             Wings    True
3    In Old Arizona    True
4          Coquette    True
   Index  id        original_title  popularity      budget      revenue  \
0      0   2                 Ariel       9.553         NaN          NaN   
1      1   3  Varjoja paratiisissa       9.228         NaN          NaN   
2      2   5            Four Rooms      18.254   4000000.0    4257354.0   
3      3   6        Judgment Night      11.309  21000000.0   12136938.0   
4      4  11             Star Wars      87.513  11000000.0  775398007.0   

   release_date  popularity.1  vote_average  runtime  \
0        1988.0        10.656         7.053     73.0   
1        1986.0         8.276         7.183     74.0   
2        1995.0        22.784         5.744     98.0   
3        1993.0        11.530         6.543    109.0   
4        1977.0        86.624         8.207    121.0   

   top_casts_popularity

Combining the two data sets

In [20]:
# Titles of the award-winning films
name_list = winners["film"].tolist()

# Flag every movie in data4 whose title appears in name_list
Win = data4["original_title"].isin(name_list).tolist()

# Row positions of the matched movies
index_list = [b for b, won in enumerate(Win) if won]

print(len(index_list)) #number of movies in data4 matched to a winning title

Win_df = pd.DataFrame({"Won": Win})
data5 = pd.concat([data4,Win_df],axis=1)

191
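
The printed count (191) is lower than the 243 rows in `winners`, which may mean some winning titles have no exact match in the movie data. A minimal sketch of how the unmatched titles could be listed, using toy frames with hypothetical contents:

```python
import pandas as pd

# Hypothetical stand-ins for `winners` and `data4`
winners_toy = pd.DataFrame({
    "film": ["Wings", "7th Heaven", "Coquette"],
    "winner": [True, True, True],
})
movies_toy = pd.DataFrame({"original_title": ["Wings", "Star Wars", "Coquette"]})

# Flag movies whose title appears among the winners
movies_toy["Won"] = movies_toy["original_title"].isin(winners_toy["film"])

# Winning titles that never matched a movie (exact string comparison)
unmatched = winners_toy.loc[
    ~winners_toy["film"].isin(movies_toy["original_title"]), "film"
].tolist()
print(unmatched)  # ['7th Heaven']
```

Inspecting such a list would show whether the gaps come from spelling differences, translated titles, or films that are simply absent.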


In [177]:
#checking the contents
print(data5.head())
print(data5.shape)

   id  original_title  popularity      budget      revenue  release_date  \
0   5      Four Rooms      18.254   4000000.0    4257354.0        1995.0   
1   6  Judgment Night      11.309  21000000.0   12136938.0        1993.0   
2  11       Star Wars      87.513  11000000.0  775398007.0        1977.0   
3  12    Finding Nemo      99.249  94000000.0  940335536.0        2003.0   
4  13    Forrest Gump      69.080  55000000.0  677387716.0        1994.0   

   popularity.1  vote_average  runtime  top_casts_popularity_avg  \
0        22.784         5.744     98.0                 32.033857   
1        11.530         6.543    109.0                 14.834857   
2        86.624         8.207    121.0                 17.367714   
3        99.340         7.824    100.0                 23.006429   
4        68.192         8.481    142.0                 32.837571   

   casts_popularity_sum  top_cast_popularity  top_crews_popularity_avg  \
0               362.055               36.681                

Filling NaN (null) values using SimpleImputer with the median strategy.
The median is chosen because every column with missing values is numerical.
"budget" and "revenue" have been replaced by "percentage_revenue".

Removing the first "popularity" column.
Both "popularity" and "popularity.1" refer to the same quantity; the data was simply taken on different days.

In [181]:
data6=data5.drop(["popularity"],axis=1)
print(data6.head())

   id  original_title      budget      revenue  release_date  popularity.1  \
0   5      Four Rooms   4000000.0    4257354.0        1995.0        22.784   
1   6  Judgment Night  21000000.0   12136938.0        1993.0        11.530   
2  11       Star Wars  11000000.0  775398007.0        1977.0        86.624   
3  12    Finding Nemo  94000000.0  940335536.0        2003.0        99.340   
4  13    Forrest Gump  55000000.0  677387716.0        1994.0        68.192   

   vote_average  runtime  top_casts_popularity_avg  casts_popularity_sum  \
0         5.744     98.0                 32.033857               362.055   
1         6.543    109.0                 14.834857               124.008   
2         8.207    121.0                 17.367714               297.868   
3         7.824    100.0                 23.006429               467.094   
4         8.481    142.0                 32.837571               526.057   

   top_cast_popularity  top_crews_popularity_avg  crews_popularity_sum  \


In [184]:
impute = SimpleImputer(missing_values=np.nan,strategy="median")
imputer= impute.fit(data6.iloc[:,5:15])
data6.iloc[:,5:15]=imputer.transform(data6.iloc[:,5:15])
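
To illustrate what median imputation does, here is a small sketch on a toy frame. The column names mirror the data set, but the values are invented.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical values with one gap per column
toy = pd.DataFrame({
    "runtime": [90.0, np.nan, 120.0, 100.0],
    "vote_average": [6.0, 7.0, np.nan, 8.0],
})

cols = ["runtime", "vote_average"]
imputer = SimpleImputer(missing_values=np.nan, strategy="median")

# Each NaN is replaced by the median of the observed values in its own column
toy[cols] = imputer.fit_transform(toy[cols])
print(toy)
```

The missing runtime becomes 100.0 (the median of 90, 100 and 120) and the missing vote_average becomes 7.0.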

Export a cleaned movie data csv

In [187]:
data6.to_csv("CleanedMovieData.csv")
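
By default `to_csv` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again (the same kind of column appears in the header of MovieData.csv above). The sketch below shows the difference on a toy frame. Passing `index=False` is an optional variation, not what the cell above does.

```python
import io
import pandas as pd

toy = pd.DataFrame({"id": [5, 6], "original_title": ["Four Rooms", "Judgment Night"]})

# to_csv() without a path returns the CSV text, so no file is needed here
with_index = pd.read_csv(io.StringIO(toy.to_csv()))
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(with_index.columns.tolist())     # ['Unnamed: 0', 'id', 'original_title']
print(without_index.columns.tolist())  # ['id', 'original_title']
```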