# Data Cleaning 
The following demonstrates how we processed our data to make it analysis-ready.

## Our Python imports

In [1]:
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import duckdb

Here we import our title.basics.tsv file in its raw form.

In [2]:

movies = pd.read_csv('IMBDData/title.basics.tsv', delimiter="\t", low_memory=False)
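The raw IMDb files mark missing values with the literal string `\N`. As an alternative to filtering that marker out afterwards, `pd.read_csv` can treat it as missing at load time through its `na_values` parameter. The sketch below uses a tiny in-memory sample instead of the real file, so the rows are illustrative only.

```python
import io
import pandas as pd

# Tiny stand-in for title.basics.tsv (illustrative rows, not the real file)
sample_tsv = (
    "tconst\tstartYear\truntimeMinutes\n"
    "tt0000001\t1894\t1\n"
    "tt0000002\t\\N\t5\n"
    "tt0000003\t2001\t\\N\n"
)

# na_values tells pandas to read the literal string \N as a missing value
sample = pd.read_csv(io.StringIO(sample_tsv), delimiter="\t", na_values="\\N")

# Count the missing values that were recognized in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

Loading this way would change the dtypes and the later filtering steps, so it is shown only as an option, not as what this notebook does.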

Here we look at the first 10 rows of our dataset to decide how we are going to clean it.

In [3]:
print(movies.shape)
movies.head(10)

(11288257, 9)


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Poor Pierrot,Pauvre Pierrot,0,1892,\N,5,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"
5,tt0000006,short,Chinese Opium Den,Chinese Opium Den,0,1894,\N,1,Short
6,tt0000007,short,Corbett and Courtney Before the Kinetograph,Corbett and Courtney Before the Kinetograph,0,1894,\N,1,"Short,Sport"
7,tt0000008,short,Edison Kinetoscopic Record of a Sneeze,Edison Kinetoscopic Record of a Sneeze,0,1894,\N,1,"Documentary,Short"
8,tt0000009,movie,Miss Jerry,Miss Jerry,0,1894,\N,45,Romance
9,tt0000010,short,Leaving the Factory,La sortie de l'usine Lumière à Lyon,0,1895,\N,1,"Documentary,Short"


In the following cell we filter out much of what is irrelevant to our research question. We drop the `isAdult` and `endYear` columns and remove rows whose `startYear` is missing (marked as "\N" in the original dataset). Since our focus is on movies, we also filter out short films, TV series, and the other title types in the original dataset. Finally, we limit our analysis to movies released between 2000 and 2024 so that it is more relevant to our generation.

Since our dataset is huge, we needed several filtering and cleaning steps to get a clean and much smaller dataset that we can easily analyze in later parts of this project. We save the filtered dataset in a CSV file that we titled "cleaned_data", and we will refer to it later when we do more analysis and when we do combined analysis with the movie ratings dataset.


In [4]:
# Remove columns we don't need for our research question:
columns_to_remove = ['isAdult', 'endYear']
movies = movies.drop(columns=columns_to_remove)

# Filter out rows where startYear is '\N'
df = movies[movies['startYear'] != '\\N'].copy()

# Convert startYear from string to int
df.loc[:, 'startYear'] = df['startYear'].astype(int)

# Keep only movies (drops shorts, TV series, and other title types)
df = df[df['titleType'] == "movie"]

# Filter out years before 2000
df = df[df['startYear'] >= 2000]

# Filter out years after 2024
df = df[df['startYear'] <= 2024]

cleaned_df = df


Here, we display a summary of our cleaned data. With 339600 entries, the cleaned data is much smaller than the original 11288257 rows.

In [5]:
cleaned_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 339600 entries, 11632 to 11288207
Data columns (total 7 columns):
 #   Column          Non-Null Count   Dtype 
---  ------          --------------   ----- 
 0   tconst          339600 non-null  object
 1   titleType       339600 non-null  object
 2   primaryTitle    339598 non-null  object
 3   originalTitle   339598 non-null  object
 4   startYear       339600 non-null  object
 5   runtimeMinutes  339600 non-null  object
 6   genres          339600 non-null  object
dtypes: object(7)
memory usage: 20.7+ MB
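The summary lists `startYear` as `object` even though the cleaning cell converts it to `int`. That is consistent with recent pandas versions keeping a column's existing dtype when values are written through `df.loc[:, col] = ...`. A minimal sketch on toy data, assuming plain column assignment is acceptable here, shows one way to end up with a true integer column:

```python
import pandas as pd

# Toy frame standing in for the filtered data (values are illustrative)
toy = pd.DataFrame({"tconst": ["tt1", "tt2", "tt3"],
                    "startYear": ["1999", "2005", "2024"]})

# Assigning to the column directly replaces it, so the new dtype is kept
toy["startYear"] = toy["startYear"].astype(int)

# With a numeric column, both year bounds fit in one inclusive between() call
recent = toy[toy["startYear"].between(2000, 2024)]
print(toy["startYear"].dtype, len(recent))
```

The year comparisons in the cleaning cell still work on the `object` column because its values are integers; the numeric dtype mainly helps with memory use and later aggregation.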


Here, we display the first 10 entries of our new cleaned dataset to visualize it and make sure we did our cleaning right:

In [6]:
cleaned_df.head(10)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,startYear,runtimeMinutes,genres
11632,tt0011801,movie,Tötet nicht mehr,Tötet nicht mehr,2019,\N,"Action,Crime"
15172,tt0015414,movie,La tierra de los toros,La tierra de los toros,2000,60,\N
34795,tt0035423,movie,Kate & Leopold,Kate & Leopold,2001,118,"Comedy,Fantasy,Romance"
61105,tt0062336,movie,The Tango of the Widower and Its Distorting Mi...,El tango del viudo y su espejo deformante,2020,70,Drama
66398,tt0067758,movie,"Simón, contamos contigo","Simón, contamos contigo",2015,81,"Comedy,Drama"
67657,tt0069049,movie,The Other Side of the Wind,The Other Side of the Wind,2018,122,Drama
69144,tt0070596,movie,Socialist Realism,El realismo socialista,2023,78,Drama
76044,tt0077684,movie,Histórias de Combóios em Portugal,Histórias de Combóios em Portugal,2022,46,Documentary
80540,tt0082328,movie,Embodiment of Evil,Encarnação do Demônio,2008,94,Horror
86782,tt0088751,movie,The Naked Monster,The Naked Monster,2005,100,"Comedy,Horror,Sci-Fi"
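The preview shows that `runtimeMinutes` and `genres` can still hold the `\N` marker, since only `startYear` was filtered on it. A quick way to see how much is left, sketched here on made-up rows, is to count the marker per column:

```python
import pandas as pd

# Made-up rows shaped like the cleaned table
toy = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "runtimeMinutes": ["\\N", "60", "118"],
    "genres": ["Action,Crime", "\\N", "Comedy"],
})

# Element-wise comparison with the marker, then count the matches per column
remaining = (toy == "\\N").sum()
print(remaining)

# errors="coerce" turns anything non-numeric, including \N, into NaN
runtime = pd.to_numeric(toy["runtimeMinutes"], errors="coerce")
print(runtime)
```

Whether to drop or keep such rows depends on the analysis; a movie with an unknown runtime can still be useful for a ratings-by-genre question.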


Here we import another file, which has the ratings and the number of votes for each movie.

In [7]:
ratings_df = pd.read_csv("IMBDData/title.ratings.tsv", delimiter = "\t")
print(ratings_df.shape)

(1508183, 3)


Here we combine both tables using SQL so that further analysis can use the ratings of each movie as well as the details in the previous cleaned_df.

In [8]:

joined_df = duckdb.sql("SELECT * FROM cleaned_df LEFT JOIN ratings_df ON cleaned_df.tconst = ratings_df.tconst").to_df()
joined_df.head(10)
print(joined_df.shape)

(339600, 10)
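Because `SELECT *` keeps the key column from both tables, the result has 7 + 3 = 10 columns. If a single `tconst` column is preferred, the same left join can be sketched in pandas with `merge`. The toy frames below are illustrative; the column names `averageRating` and `numVotes` are assumed from the IMDb ratings file and are not shown in this notebook.

```python
import pandas as pd

# Illustrative stand-ins for cleaned_df and ratings_df
movies_toy = pd.DataFrame({"tconst": ["tt1", "tt2", "tt3"],
                           "primaryTitle": ["A", "B", "C"]})
ratings_toy = pd.DataFrame({"tconst": ["tt1", "tt3"],
                            "averageRating": [7.1, 5.4],
                            "numVotes": [1200, 85]})

# Left join keeps every movie; validate raises if a tconst repeats in ratings
merged = movies_toy.merge(ratings_toy, on="tconst", how="left",
                          validate="many_to_one")

# Movies without a rating get NaN in the rating columns
unrated = merged["averageRating"].isna().sum()
print(merged.shape, unrated)
```

Counting the unrated rows after a left join is a cheap check on how many movies would be lost if later analysis requires a rating.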


Here, we save our cleaned and joined data to a CSV file:

In [9]:
joined_df.to_csv('joined_data.csv', index=False)