# Data Cleaning
The following demonstrates how we processed our data to make it analysis-ready.

## Our Python imports

In [1]:
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import duckdb

Here we import our title.basics.tsv file in its raw form.

In [3]:

movies = pd.read_csv('IMBDData/title.basics.tsv', delimiter="\t", low_memory=False)

FileNotFoundError: [Errno 2] No such file or directory: 'IMBDData/title.basics.tsv'
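As a hedged aside, pandas can also treat the `\N` placeholder as missing at load time through the `na_values` parameter of `read_csv`, which avoids string comparisons later on. This is a minimal sketch and not the approach used in this notebook. It reads a tiny in-memory sample whose rows are invented for illustration instead of the real file.

```python
import io
import pandas as pd

# Hypothetical three-row sample mimicking the layout of title.basics.tsv;
# the values are made up for illustration, not taken from the real dataset.
sample_tsv = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\n"
    "tt0000001\tmovie\tExample A\t2005\n"
    "tt0000002\tshort\tExample B\t\\N\n"
    "tt0000003\tmovie\tExample C\t1999\n"
)

# na_values tells pandas to read the literal string \N as NaN.
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N")

# Because startYear now contains a NaN, pandas stores the column as float64.
missing_years = sample["startYear"].isna().sum()
print(missing_years)
```

With this approach, rows with an unknown year could be dropped with `dropna(subset=["startYear"])` rather than by comparing against a string.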

Here we look at the first 10 rows of our dataset to decide how we are going to clean it.

In [None]:
print(movies.shape)
movies.head(10)

In the following cell we drop the columns that are irrelevant to our research question and remove rows whose release year is missing (marked as "\N" in the original dataset). Since our focus is on movies, we also filter out short films, TV series, and the other title types in the original dataset. Finally, we limit our analysis to movies released from 2000 through 2024 so that it is more relevant to our generation.

Because the dataset is huge, we apply several filtering and cleaning steps to get a clean and much smaller dataset that we can easily analyze in later parts of this project. We save the filtered dataset to a CSV file named "cleaned_data.csv", which we refer to later for further analysis and for the combined analysis with the movie ratings dataset.


In [None]:
# Remove columns we don't need for our research question:
columns_to_remove = ['isAdult', 'endYear']
movies = movies.drop(columns=columns_to_remove)

# Filter out rows where startYear is '\N'
df = movies[movies['startYear'] != '\\N'].copy()

# Convert startYear from string to int (plain column assignment so the new dtype is kept)
df['startYear'] = df['startYear'].astype(int)

# Keep only movies (drops shorts, TV series, and other title types)
df = df[df['titleType'] == "movie"]

# Filter out years before 2000
df = df[df['startYear'] >= 2000]

# Filter out years after 2024
df = df[df['startYear'] <= 2024]

cleaned_df = df
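The filtering steps above can be sketched as a single reusable function. This is an illustrative variant under stated assumptions, not the notebook's own code: the helper name `filter_movies` and its parameters are hypothetical, the toy rows are invented, and it swaps in `pd.to_numeric(..., errors="coerce")` to handle missing years instead of the string comparison used above.

```python
import pandas as pd

def filter_movies(frame, first_year=2000, last_year=2024):
    """Keep rows of type 'movie' whose startYear is known and within range."""
    out = frame.copy()
    # errors="coerce" turns non-numeric entries (such as the \N marker) into NaN.
    out["startYear"] = pd.to_numeric(out["startYear"], errors="coerce")
    # between() is inclusive on both ends and evaluates to False for NaN.
    keep = (out["titleType"] == "movie") & out["startYear"].between(first_year, last_year)
    out = out[keep].copy()
    # No NaN values remain at this point, so the cast to int is safe.
    out["startYear"] = out["startYear"].astype(int)
    return out

# Invented toy rows covering each case: kept, wrong type, missing year, too old.
toy = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4"],
    "titleType": ["movie", "tvSeries", "movie", "movie"],
    "startYear": ["2010", "2015", "\\N", "1995"],
})
filtered_toy = filter_movies(toy)
print(filtered_toy)
```

Only the first toy row survives, since the others fail the type, missing-year, or year-range check.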


Here we save our cleaned data to cleaned_data.csv and load it back into a new DataFrame that we will use in the following steps.

In [None]:
df.to_csv('cleaned_data.csv', index=False)
cleaned_movies_df = pd.read_csv('cleaned_data.csv')

Here, we display a summary of our cleaned data. With 338244 entries, the cleaned dataset is much smaller than the original one.

In [None]:
cleaned_movies_df.info()

Here, we display the first 10 entries of our cleaned dataset to inspect it and make sure the cleaning worked as intended:

In [None]:
cleaned_movies_df.head(10)

Here we import another file, which has the rating and the number of votes for each movie.

In [None]:
ratings_df = pd.read_csv("IMBDData/title.ratings.tsv", delimiter = "\t")

Here we combine both tables using SQL so that later analysis can use each movie's ratings together with the details in cleaned_movies_df.

In [None]:

joined_df = duckdb.sql("SELECT * FROM cleaned_movies_df LEFT JOIN ratings_df ON cleaned_movies_df.tconst = ratings_df.tconst").to_df()
joined_df.head(10)
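For readers without DuckDB installed, a left join on `tconst` can also be sketched with `DataFrame.merge`. This is a swapped-in pandas technique and not what the cell above runs. The toy rows are invented, and the column names `averageRating` and `numVotes` are assumed to match the ratings file.

```python
import pandas as pd

# Invented toy tables; tt2 deliberately has no rating.
toy_movies = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "primaryTitle": ["A", "B", "C"],
})
toy_ratings = pd.DataFrame({
    "tconst": ["tt1", "tt3"],
    "averageRating": [7.1, 5.4],
    "numVotes": [120, 45],
})

# how="left" keeps every movie; movies without a rating get NaN in the rating columns.
toy_joined = toy_movies.merge(toy_ratings, on="tconst", how="left")

# Count movies that had no match in the ratings table.
unrated = toy_joined["averageRating"].isna().sum()
print(toy_joined)
print(unrated)
```

A left join keeps unrated movies in the result, so any later analysis of ratings may need to drop or otherwise handle those NaN rows.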

Here, we save our cleaned data into CSV files:

In [None]:
df.to_csv('cleaned_data.csv', index=False)
joined_df.to_csv('joined_data.csv', index=False)
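One way to check that a save like this round-trips cleanly is to reload the file and compare it with the DataFrame in memory. The sketch below does this for an invented toy DataFrame written to a temporary directory. The file name `toy_cleaned.csv` is hypothetical and no project file is touched.

```python
import os
import tempfile
import pandas as pd

# Invented toy data standing in for the cleaned DataFrame.
toy_df = pd.DataFrame({"tconst": ["tt1", "tt2"], "startYear": [2001, 2020]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_cleaned.csv")
    # index=False avoids writing the row index as an extra unnamed column.
    toy_df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# equals() compares shape, column labels, dtypes, and values.
same = reloaded.equals(toy_df)
print(same)
```

Without `index=False`, the reloaded frame would gain an extra `Unnamed: 0` column and the comparison would fail.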