In this notebook, I will walk you through the steps for cleaning the IMDb TV series data from Kaggle: https://www.kaggle.com/datasets/suraj520/imdb-tv-series-data
The dataset contains information about TV series from IMDb, including details such as title, IMDb ID, release year, genre, cast, synopsis, rating, runtime, certificate, number of votes, and gross revenue. The data is scraped from the IMDb website using web scraping techniques and is organized into separate CSV files for each genre.

Importing Required Libraries:
* glob: This library is used to retrieve file paths matching specified patterns.
* pandas: It provides data manipulation and analysis capabilities.
* os: This library provides a way to interact with the operating system, including file and directory operations.
* zipfile: This library offers tools to create, read, write, and extract files from ZIP archives.

In [1]:
import glob
import pandas as pd
import os
from zipfile import ZipFile

The IMDb data comes as a ZIP file called archive.zip. To find it, we start with an empty list called zip_files, which will store the paths of the ZIP files we identify.
We walk through the directory structure using os.walk("/home/anees/projects/data_cleaning_projects").
For each directory, we examine the files within it.
If a file has a ".zip" extension, we add its path to the zip_files list.
We continue until all directories have been traversed.
Finally, we print the zip_files list to display the paths of all identified ZIP files.

In [2]:
zip_files = []
for root, dirs, files in os.walk("/home/anees/projects/data_cleaning_projects"):
    for file in files:
        if file.endswith(".zip"):
            zip_files.append(os.path.join(root, file))

print(zip_files)

['/home/anees/projects/data_cleaning_projects/.venv/lib/python3.8/site-packages/importlib_resources/tests/zipdata02/ziptestdata.zip', '/home/anees/projects/data_cleaning_projects/.venv/lib/python3.8/site-packages/importlib_resources/tests/zipdata01/ziptestdata.zip', '/home/anees/projects/data_cleaning_projects/raw_data/imdb/archive.zip']
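
Picking the archive by its position in zip_files works only as long as archive.zip happens to be listed last, and os.walk does not promise any particular order. As a minimal sketch, the search can target the file name directly and skip folders such as .venv. The helper find_files_named and its skip_dirs parameter are hypothetical names used for illustration, and the demo runs on a temporary directory rather than on the real project folder.

```python
import os
import tempfile


def find_files_named(root, filename, skip_dirs=(".venv", ".git")):
    """Return the sorted paths of files called `filename` under `root`."""
    matches = []
    for current_dir, dirs, files in os.walk(root):
        # Prune folders we never want to search. Editing `dirs` in place
        # tells os.walk (top-down mode) not to descend into them.
        dirs[:] = [d for d in dirs if d not in skip_dirs]
        if filename in files:
            matches.append(os.path.join(current_dir, filename))
    return sorted(matches)


# Demonstration on a throwaway directory tree (not the real project).
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "raw_data", "imdb"))
    os.makedirs(os.path.join(tmp, ".venv", "lib"))
    for parts in (
        ("raw_data", "imdb", "archive.zip"),
        (".venv", "lib", "archive.zip"),
        (".venv", "lib", "other.zip"),
    ):
        open(os.path.join(tmp, *parts), "w").close()
    found = find_files_named(tmp, "archive.zip")
    found_relative = [os.path.relpath(path, tmp) for path in found]

print(found_relative)
```

Only the copy outside .venv is returned, so the result does not depend on the order in which directories are visited.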


Next, we open the last ZIP file in the zip_files list in read mode (in the output above, "archive.zip" is the last entry).
We then extract all the files in "archive.zip" to the /home/anees/projects/data_cleaning_projects/raw_data/imdb/imdb_files directory.
After that, we use the glob module to find all the CSV files in the "imdb_files" directory.
Finally, we print the names of all the CSV files.

In [3]:
with ZipFile(zip_files[-1], "r") as file:
    file.extractall(path="/home/anees/projects/data_cleaning_projects/raw_data/imdb/imdb_files")

csv_files = glob.glob("/home/anees/projects/data_cleaning_projects/raw_data/imdb/imdb_files/*.csv")
for csv_file in csv_files:
    print(csv_file)

/home/anees/projects/data_cleaning_projects/raw_data/imdb/imdb_files/adventure_series.csv
/home/anees/projects/data_cleaning_projects/raw_data/imdb/imdb_files/drama_series.csv
/home/anees/projects/data_cleaning_projects/raw_data/imdb/imdb_files/fantasy_series.csv
/home/anees/projects/data_cleaning_projects/raw_data/imdb/imdb_files/history_series.csv
/home/anees/projects/data_cleaning_projects/raw_data/imdb/imdb_files/biography_series.csv
/home/anees/projects/data_cleaning_projects/raw_data/imdb/imdb_files/crime_series.csv
/home/anees/projects/data_cleaning_projects/raw_data/imdb/imdb_files/sci-fi_series.csv
/home/anees/projects/data_cleaning_projects/raw_data/imdb/imdb_files/music_series.csv
/home/anees/projects/data_cleaning_projects/raw_data/imdb/imdb_files/western_series.csv
/home/anees/projects/data_cleaning_projects/raw_data/imdb/imdb_files/action_series.csv
/home/anees/projects/data_cleaning_projects/raw_data/imdb/imdb_files/thriller_series.csv
/home/anees/projects/data_cleaning_
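
The extract-then-list step can also be wrapped in a small function. The sketch below is one possible shape, assuming a helper name (extract_csvs) that is not part of the original notebook; it sorts the result so the file order is the same on every run. The demo builds a tiny archive in a temporary directory instead of using the real archive.zip.

```python
import tempfile
from pathlib import Path
from zipfile import ZipFile


def extract_csvs(zip_path, dest_dir):
    """Extract `zip_path` into `dest_dir` and return the CSV paths, sorted."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with ZipFile(zip_path, "r") as archive:
        archive.extractall(path=dest)
    # glob order is not guaranteed, so sort for reproducibility.
    return sorted(dest.glob("*.csv"))


with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    demo_zip = tmp_path / "archive.zip"
    # Made-up archive contents, for illustration only.
    with ZipFile(demo_zip, "w") as archive:
        archive.writestr("drama_series.csv", "Title\nA\n")
        archive.writestr("action_series.csv", "Title\nB\n")
        archive.writestr("readme.txt", "not a csv")
    csv_names = [p.name for p in extract_csvs(demo_zip, tmp_path / "imdb_files")]

print(csv_names)
```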

To process the extracted CSV files and create a consolidated DataFrame, we can iterate through the list of CSV file paths, read each CSV file using pd.read_csv(), and append the resulting DataFrames to a list called dataframes. Finally, we use pd.concat() to concatenate the DataFrames into a single DataFrame called imdb_df.

In [14]:
dataframes = []
for csv_file in csv_files:
    df = pd.read_csv(csv_file)
    dataframes.append(df)
imdb_df = pd.concat(dataframes, ignore_index=True)
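
Because the data is split by genre, a title that belongs to several genres can appear in more than one file. As a hedged variation on the loop above, each DataFrame can be tagged with the file it came from before concatenating. The source_file column is a hypothetical addition, and the two in-memory "files" below contain made-up rows standing in for the real CSVs.

```python
import io

import pandas as pd

# Stand-ins for two genre files; real code would iterate over csv_files.
fake_files = {
    "adventure_series.csv": io.StringIO(
        "Title,IMDb ID\nLost,tt0411008\nFUBAR,tt13064902\n"
    ),
    "drama_series.csv": io.StringIO("Title,IMDb ID\nLost,tt0411008\n"),
}

frames = [
    # assign() adds the (hypothetical) source_file column to every row.
    pd.read_csv(handle).assign(source_file=name)
    for name, handle in fake_files.items()
]
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

With such a column in place, it is easy to see later that repeated IMDb IDs come from different genre files.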

To retrieve information about the structure and summary of the "imdb_df", we can use the info() method.

In [15]:
imdb_df.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 236828 entries, 0 to 236827
Data columns (total 11 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   Title            236828 non-null  object 
 1   IMDb ID          236828 non-null  object 
 2   Release Year     236819 non-null  object 
 3   Genre            236828 non-null  object 
 4   Cast             235956 non-null  object 
 5   Synopsis         236828 non-null  object 
 6   Rating           236828 non-null  float64
 7   Runtime          216983 non-null  object 
 8   Certificate      169091 non-null  object 
 9   Number of Votes  236828 non-null  object 
 10  Gross Revenue    45611 non-null   object 
dtypes: float64(1), object(10)
memory usage: 19.9+ MB


The imdb_df.info() command provides us with information about the structure and summary of the dataframe.
The output provides the following details:
* The DataFrame has a RangeIndex with 236,828 entries, ranging from index 0 to index 236,827.
* There are 11 columns in the DataFrame.
* Each column is listed along with its non-null count and data type.
* The DataFrame contains a mix of data types, with 10 columns being of type object and 1 column being of type float64.
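* The memory usage of the DataFrame is reported as 19.9+ MB. The plus sign marks this as a lower bound, because by default pandas does not measure the full size of the strings stored in object columns.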

To make the DataFrame easier to work with, we can standardize the column names.
We make the column names consistent by replacing any spaces with underscores (_) and converting them to lowercase. We then use the info() method again to verify that the changes have taken effect.

In [16]:
imdb_df.columns = imdb_df.columns.str.replace(' ', '_').str.lower()
imdb_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 236828 entries, 0 to 236827
Data columns (total 11 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   title            236828 non-null  object 
 1   imdb_id          236828 non-null  object 
 2   release_year     236819 non-null  object 
 3   genre            236828 non-null  object 
 4   cast             235956 non-null  object 
 5   synopsis         236828 non-null  object 
 6   rating           236828 non-null  float64
 7   runtime          216983 non-null  object 
 8   certificate      169091 non-null  object 
 9   number_of_votes  236828 non-null  object 
 10  gross_revenue    45611 non-null   object 
dtypes: float64(1), object(10)
memory usage: 19.9+ MB
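
The same renaming can be packaged as a reusable function. This is only a sketch: normalize_columns is a hypothetical helper, and it goes slightly beyond the cell above by also trimming stray whitespace and collapsing repeated spaces.

```python
import pandas as pd


def normalize_columns(df):
    """Return a copy of `df` with lowercase, underscore-separated column names."""
    cleaned = (
        df.columns
        .str.strip()                           # drop leading/trailing spaces
        .str.lower()
        .str.replace(r"\s+", "_", regex=True)  # any run of whitespace -> "_"
    )
    result = df.copy()
    result.columns = cleaned
    return result


demo = pd.DataFrame(columns=["Title", "IMDb ID", " Number of  Votes "])
normalized_names = list(normalize_columns(demo).columns)
print(normalized_names)
```

Returning a copy leaves the original DataFrame untouched, which makes the step safe to re-run in a notebook.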


To preview the data in the DataFrame, we can use the head(), tail(), and sample() methods. By default, the head() method shows the top 5 rows of the DataFrame, the tail() method shows the bottom 5 rows, and the sample() method returns a single random row. Each of these methods accepts an argument specifying the number of rows to return.
For example, imdb_df.head(10) will display the top 10 rows, imdb_df.tail(3) will display the last 3 rows and imdb_df.sample(6) randomly selects and returns 6 rows from the DataFrame.

In [7]:
imdb_df.head(3)

Unnamed: 0,title,imdb_id,release_year,genre,cast,synopsis,rating,runtime,certificate,number_of_votes,gross_revenue
0,The Little Mermaid,tt5971474,I) (2023,"Adventure, Family, Fantasy","Director:, Rob Marshall, | , Stars:, Halle...",A young mermaid makes a deal with a sea witch ...,7.2,135 min,PG,69638,
1,Spider-Man: Across the Spider-Verse,tt9362722,2023,"Animation, Action, Adventure","Directors:, Joaquim Dos Santos, , Kemp Powers,...","Miles Morales catapults across the Multiverse,...",9.1,140 min,PG,71960,
2,FUBAR,tt13064902,2023–,"Action, Adventure, Thriller","Stars:, Arnold Schwarzenegger, , Monica Barbar...",A C.I.A. operative on the edge of retirement d...,6.5,,TV-MA,15422,


In [8]:
imdb_df.tail(2)

Unnamed: 0,title,imdb_id,release_year,genre,cast,synopsis,rating,runtime,certificate,number_of_votes,gross_revenue
236826,Scream,tt0117571,1996,"Horror, Mystery","Director:, Wes Craven, | , Stars:, Neve Ca...","A year after the murder of her mother, a teena...",7.4,111 min,R,364584,103046663
236827,Evil Dead,tt1288558,2013,Horror,"Director:, Fede Alvarez, | , Stars:, Jane ...","Five friends head to a remote cabin, where the...",6.5,91 min,R,187215,54239856


In [9]:
imdb_df.sample(5)

Unnamed: 0,title,imdb_id,release_year,genre,cast,synopsis,rating,runtime,certificate,number_of_votes,gross_revenue
121459,Dhaakad,tt10598156,2022,"Action, Thriller","Director:, Razneesh Ghai, | , Stars:, Kang...","Agent Agni, a highly trained and deadly field ...",3.9,135 min,,11843,
40418,Jai Bolo Telangana,tt1831688,2011,History,"Director:, Nimmala Shankar, | , Stars:, Sr...",Varshit is the son of a Telangana freedom figh...,7.0,141 min,,69,
49749,The Horde,tt2331066,2012,"Biography, Drama, History","Director:, Andrey Proshkin, | , Stars:, Ma...","In the 14th century, a Russian bishop is force...",6.2,129 min,Not Rated,1062,
52153,Mi hija Hildegart,tt0076390,1977,"Biography, Drama, History","Director:, Fernando Fernán Gómez, | , Star...",Based in the novel Aurora de Sangre (Eduardo d...,6.5,109 min,,214,
187843,American Ripper,tt7132032,2017,"Documentary, History, Mystery","Stars:, Jeff Mudgett, , Amaryllis Fox, , Conor...",H.H. Holmes was America's first serial killer ...,6.2,43 min,TV-14,716,


One thing we have to decide very early in a data cleaning process is which column to use as the index. An index column provides labels for the rows in the DataFrame and allows us to uniquely identify and access specific rows based on their index values. If we don't specify an index column, pandas assigns a default integer index starting from 0. However, setting a meaningful and appropriate index can make data analysis and manipulation easier.

From the outputs of our examination of the contents of the dataframe earlier, it seems that the "imdb_id" column could potentially serve as an appropriate choice for an index column. We can check this further with the following code chunks.

In [10]:
imdb_df["imdb_id"].sample(10)

34865     tt15441160
129029     tt0027925
10696      tt2015381
227316     tt2396224
41510     tt10955020
213163    tt11145118
92472      tt0045737
24843      tt0200208
64523     tt13114942
125677     tt0065195
Name: imdb_id, dtype: object

The output looks promising. To ensure that the column can serve as a unique identifier for each row in the DataFrame, we need to check it for duplicate values. The following code counts the values in the column that repeat an earlier value.

In [12]:
imdb_df["imdb_id"].duplicated().sum()

127631
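
To make clear what this number counts, here is a small sketch on a made-up Series of ids. By default duplicated() flags only the second and later occurrences of a value, while keep=False flags every member of a repeated group. It follows that the count above plus the number of distinct ids equals the number of rows, so 236,828 - 127,631 = 109,197 distinct ids should remain in our data.

```python
import pandas as pd

# Made-up ids: "tt1" appears three times, "tt2" once.
ids = pd.Series(["tt1", "tt2", "tt1", "tt1"])

later_copies = ids.duplicated()          # default keep="first"
all_copies = ids.duplicated(keep=False)  # flag every row of a repeated id

print(later_copies.tolist())
print(all_copies.tolist())
print(later_copies.sum(), ids.nunique(), len(ids))
```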

There are 127,631 duplicated values in the "imdb_id" column. We still need to determine whether these are genuinely duplicate records or whether the same "imdb_id" is attached to different records. As a first look, the following code shows, for the first few rows, whether each imdb_id has already appeared earlier in the column: True if it has, False otherwise.

In [13]:
imdb_df["imdb_id"].duplicated().head()

0    False
1    False
2    False
3    False
4    False
Name: imdb_id, dtype: bool

Every value is False, so none of the first five imdb_id values repeats an earlier one. This is expected, because duplicated() never flags the first occurrence of a value. Next, we can use the tail() method to examine the last few rows, where any True value marks an imdb_id that has already appeared earlier in the column.

In [None]:
imdb_df["imdb_id"].duplicated().tail()

The tail() method revealed that some records at the end of the DataFrame are duplicated. We use the iloc indexer to select and inspect the last record in the DataFrame.

In [17]:
last_row = imdb_df.iloc[-1]
print(last_row)

title                                                      Evil Dead
imdb_id                                                    tt1288558
release_year                                                    2013
genre                                                         Horror
cast               Director:, Fede Alvarez, | ,     Stars:, Jane ...
synopsis           Five friends head to a remote cabin, where the...
rating                                                           6.5
runtime                                                       91 min
certificate                                                        R
number_of_votes                                               187215
gross_revenue                                             54,239,856
Name: 236827, dtype: object


Using the returned information, we then look up every occurrence of imdb_id "tt1288558" to confirm whether the rows are duplicates.

In [18]:
imdb_df[imdb_df["imdb_id"] == "tt1288558"]

Unnamed: 0,title,imdb_id,release_year,genre,cast,synopsis,rating,runtime,certificate,number_of_votes,gross_revenue
225299,Evil Dead,tt1288558,2013,Horror,"Director:, Fede Alvarez, | , Stars:, Jane ...","Five friends head to a remote cabin, where the...",6.5,91 min,R,187215,54239856
234759,Evil Dead,tt1288558,2013,Horror,"Director:, Fede Alvarez, | , Stars:, Jane ...","Five friends head to a remote cabin, where the...",6.5,91 min,R,187215,54239856
234806,Evil Dead,tt1288558,2013,Horror,"Director:, Fede Alvarez, | , Stars:, Jane ...","Five friends head to a remote cabin, where the...",6.5,91 min,R,187215,54239856
234853,Evil Dead,tt1288558,2013,Horror,"Director:, Fede Alvarez, | , Stars:, Jane ...","Five friends head to a remote cabin, where the...",6.5,91 min,R,187215,54239856
234900,Evil Dead,tt1288558,2013,Horror,"Director:, Fede Alvarez, | , Stars:, Jane ...","Five friends head to a remote cabin, where the...",6.5,91 min,R,187215,54239856
234947,Evil Dead,tt1288558,2013,Horror,"Director:, Fede Alvarez, | , Stars:, Jane ...","Five friends head to a remote cabin, where the...",6.5,91 min,R,187215,54239856
234994,Evil Dead,tt1288558,2013,Horror,"Director:, Fede Alvarez, | , Stars:, Jane ...","Five friends head to a remote cabin, where the...",6.5,91 min,R,187215,54239856
235041,Evil Dead,tt1288558,2013,Horror,"Director:, Fede Alvarez, | , Stars:, Jane ...","Five friends head to a remote cabin, where the...",6.5,91 min,R,187215,54239856
235088,Evil Dead,tt1288558,2013,Horror,"Director:, Fede Alvarez, | , Stars:, Jane ...","Five friends head to a remote cabin, where the...",6.5,91 min,R,187215,54239856
235135,Evil Dead,tt1288558,2013,Horror,"Director:, Fede Alvarez, | , Stars:, Jane ...","Five friends head to a remote cabin, where the...",6.5,91 min,R,187215,54239856


From the result we can confirm that the records are actually duplicates. To be safe, let's repeat the process for a different record.

In [19]:
second_last_row = imdb_df.iloc[-2]
print(second_last_row)

title                                                         Scream
imdb_id                                                    tt0117571
release_year                                                    1996
genre                                                Horror, Mystery
cast               Director:, Wes Craven, | ,     Stars:, Neve Ca...
synopsis           A year after the murder of her mother, a teena...
rating                                                           7.4
runtime                                                      111 min
certificate                                                        R
number_of_votes                                               364584
gross_revenue                                            103,046,663
Name: 236826, dtype: object


In [20]:
imdb_df[imdb_df["imdb_id"] == 'tt0117571']

Unnamed: 0,title,imdb_id,release_year,genre,cast,synopsis,rating,runtime,certificate,number_of_votes,gross_revenue
181618,Scream,tt0117571,1996,"Horror, Mystery","Director:, Wes Craven, | , Stars:, Neve Ca...","A year after the murder of her mother, a teena...",7.4,111 min,R,364588,103046663
225298,Scream,tt0117571,1996,"Horror, Mystery","Director:, Wes Craven, | , Stars:, Neve Ca...","A year after the murder of her mother, a teena...",7.4,111 min,R,364584,103046663
234758,Scream,tt0117571,1996,"Horror, Mystery","Director:, Wes Craven, | , Stars:, Neve Ca...","A year after the murder of her mother, a teena...",7.4,111 min,R,364584,103046663
234805,Scream,tt0117571,1996,"Horror, Mystery","Director:, Wes Craven, | , Stars:, Neve Ca...","A year after the murder of her mother, a teena...",7.4,111 min,R,364584,103046663
234852,Scream,tt0117571,1996,"Horror, Mystery","Director:, Wes Craven, | , Stars:, Neve Ca...","A year after the murder of her mother, a teena...",7.4,111 min,R,364584,103046663
234899,Scream,tt0117571,1996,"Horror, Mystery","Director:, Wes Craven, | , Stars:, Neve Ca...","A year after the murder of her mother, a teena...",7.4,111 min,R,364584,103046663
234946,Scream,tt0117571,1996,"Horror, Mystery","Director:, Wes Craven, | , Stars:, Neve Ca...","A year after the murder of her mother, a teena...",7.4,111 min,R,364584,103046663
234993,Scream,tt0117571,1996,"Horror, Mystery","Director:, Wes Craven, | , Stars:, Neve Ca...","A year after the murder of her mother, a teena...",7.4,111 min,R,364584,103046663
235040,Scream,tt0117571,1996,"Horror, Mystery","Director:, Wes Craven, | , Stars:, Neve Ca...","A year after the murder of her mother, a teena...",7.4,111 min,R,364584,103046663
235087,Scream,tt0117571,1996,"Horror, Mystery","Director:, Wes Craven, | , Stars:, Neve Ca...","A year after the murder of her mother, a teena...",7.4,111 min,R,364584,103046663


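The result is similar: every row describes the same title, and the only difference is the vote count in the first row (364588 versus 364584), which suggests that the same title was scraped more than once. We can therefore drop the duplicate records, keeping the first occurrence of each imdb_id, with the following code: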

In [21]:
imdb_df = imdb_df.drop_duplicates(subset="imdb_id")

We then verify if the duplicates have been dropped.

In [22]:
imdb_df["imdb_id"].duplicated().sum()

0

This time we got zero, meaning there are no duplicate values left in the imdb_id column. Now we can safely use it as our index column.

In [23]:
imdb_df.set_index("imdb_id", inplace=True)
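
The deduplication and indexing steps can be rehearsed on a toy DataFrame. The sketch below is an illustration rather than the notebook's own procedure: it first lists the ids whose rows disagree in at least one column, then drops later copies, and finally passes verify_integrity=True to set_index so that pandas raises a ValueError if any label is still repeated. The toy rows reuse the two titles inspected above.

```python
import pandas as pd

toy = pd.DataFrame({
    "imdb_id": ["tt0117571", "tt0117571", "tt1288558", "tt1288558"],
    "title": ["Scream", "Scream", "Evil Dead", "Evil Dead"],
    "number_of_votes": [364588, 364584, 187215, 187215],
})

# Distinct values per column within each id; a count above 1 means the
# rows sharing that id are not identical.
distinct_per_id = toy.groupby("imdb_id").nunique()
conflicting_ids = distinct_per_id[(distinct_per_id > 1).any(axis=1)].index.tolist()

# Keep the first row of every id, then index by id with a uniqueness check.
deduped = toy.drop_duplicates(subset="imdb_id", keep="first")
indexed = deduped.set_index("imdb_id", verify_integrity=True)

# On the raw toy data the same call fails, because the ids repeat.
try:
    toy.set_index("imdb_id", verify_integrity=True)
    raised = False
except ValueError:
    raised = True

print(conflicting_ids, indexed.index.tolist(), raised)
```

Listing the conflicting ids first shows which records lose information when only the first occurrence is kept.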

Moving on, we can continue our cleaning process by dropping any columns that are not needed. Here we drop the "synopsis" column.

In [24]:
imdb_df.drop("synopsis", axis=1, inplace=True)
imdb_df.head()

imdb_id,title,release_year,genre,cast,rating,runtime,certificate,number_of_votes,gross_revenue
tt5971474,The Little Mermaid,I) (2023,"Adventure, Family, Fantasy","Director:, Rob Marshall, | , Stars:, Halle...",7.2,135 min,PG,69638,
tt9362722,Spider-Man: Across the Spider-Verse,2023,"Animation, Action, Adventure","Directors:, Joaquim Dos Santos, , Kemp Powers,...",9.1,140 min,PG,71960,
tt13064902,FUBAR,2023–,"Action, Adventure, Thriller","Stars:, Arnold Schwarzenegger, , Monica Barbar...",6.5,,TV-MA,15422,
tt5433140,Fast X,2023,"Action, Adventure, Crime","Director:, Louis Leterrier, | , Stars:, Vi...",6.3,141 min,PG-13,39326,
tt6791350,Guardians of the Galaxy Vol. 3,2023,"Action, Adventure, Comedy","Director:, James Gunn, | , Stars:, Chris P...",8.2,150 min,PG-13,160447,


By now, we might have noticed that the "release_year" column contains special characters. Since we know what type of characters are supposed to be in the column (digits), we can clean the column by removing all characters that are not digits. The following code accomplishes this.
The code uses the str.replace() method with the regular expression pattern r'[^\d]' to match any character that is not a digit. The regex=True parameter ensures that the pattern is treated as a regular expression. By replacing the matched characters with an empty string, we effectively remove all non-digit characters from the 'release_year' column.

In [25]:
imdb_df['release_year'] = imdb_df['release_year'].str.replace(r'[^\d]', '', regex=True)
imdb_df.head()

imdb_id,title,release_year,genre,cast,rating,runtime,certificate,number_of_votes,gross_revenue
tt5971474,The Little Mermaid,2023,"Adventure, Family, Fantasy","Director:, Rob Marshall, | , Stars:, Halle...",7.2,135 min,PG,69638,
tt9362722,Spider-Man: Across the Spider-Verse,2023,"Animation, Action, Adventure","Directors:, Joaquim Dos Santos, , Kemp Powers,...",9.1,140 min,PG,71960,
tt13064902,FUBAR,2023,"Action, Adventure, Thriller","Stars:, Arnold Schwarzenegger, , Monica Barbar...",6.5,,TV-MA,15422,
tt5433140,Fast X,2023,"Action, Adventure, Crime","Director:, Louis Leterrier, | , Stars:, Vi...",6.3,141 min,PG-13,39326,
tt6791350,Guardians of the Galaxy Vol. 3,2023,"Action, Adventure, Comedy","Director:, James Gunn, | , Stars:, Chris P...",8.2,150 min,PG-13,160447,


After running the code, the 'release_year' column contains only digits.
However, removing the non-digit characters also eliminated the en dash (–) that separates the two years in some entries. We can confirm this by selecting all records whose release_year is longer than four characters.

In [26]:
imdb_df[imdb_df["release_year"].str.len() > 4]

imdb_id,title,release_year,genre,cast,rating,runtime,certificate,number_of_votes,gross_revenue
tt0944947,Game of Thrones,20112019,"Action, Adventure, Drama","Stars:, Emilia Clarke, , Peter Dinklage, , Kit...",9.2,57 min,TV-MA,2166804,
tt3107288,The Flash,20142023,"Action, Adventure, Drama","Stars:, Grant Gustin, , Candice Patton, , Dani...",7.5,43 min,TV-PG,357769,
tt6111130,S.W.A.T.,20172023,"Action, Adventure, Crime","Stars:, Shemar Moore, , Alex Russell, , Kenny ...",7.2,43 min,TV-14,27212,
tt2560140,Attack on Titan,20132023,"Animation, Action, Adventure","Stars:, Josh Grelle, , Bryce Papenbrook, , Yûk...",9.1,24 min,TV-MA,413124,
tt0411008,Lost,20042010,"Adventure, Drama, Fantasy","Stars:, Jorge Garcia, , Josh Holloway, , Yunji...",8.3,44 min,TV-14,568788,
...,...,...,...,...,...,...,...,...,...
tt2016058,Ssshhhh... Phir Koi Hai,20062009,Horror,"Stars:, Vicky Ahuja, , Amrapali Gupta, , Shami...",6.8,,,530,
tt6739248,Infected,20162018,"Drama, Horror, Thriller","Stars:, Anthony Lucero, , Eric Martinez, , Bri...",5.0,30 min,TV-MA,31,
tt2089578,We're Alive,20092014,"Drama, Horror","Stars:, Jim Gleason, , Nate Geez, , Elisa Elio...",9.3,,,903,
tt1178757,Creature Features,19701976,Horror,"Stars:, Carl Greyson, , Marty McNeely",9.2,,,79,


To address this problem, we can use regular expressions to insert the hyphen at the desired position. Here's how:
In the following code, we use the str.replace() method with a regular expression pattern (\d{4})(\d{4}) to match the string of eight digits representing two years. The pattern captures the first four digits and the next four digits separately. Then, we replace the match with \1-\2, which inserts a hyphen '-' between the two captured groups.

In [27]:
imdb_df['release_year'] = imdb_df['release_year'].str.replace(r'(\d{4})(\d{4})', r'\1-\2', regex=True)
imdb_df[imdb_df["release_year"].str.len() > 4]

imdb_id,title,release_year,genre,cast,rating,runtime,certificate,number_of_votes,gross_revenue
tt0944947,Game of Thrones,2011-2019,"Action, Adventure, Drama","Stars:, Emilia Clarke, , Peter Dinklage, , Kit...",9.2,57 min,TV-MA,2166804,
tt3107288,The Flash,2014-2023,"Action, Adventure, Drama","Stars:, Grant Gustin, , Candice Patton, , Dani...",7.5,43 min,TV-PG,357769,
tt6111130,S.W.A.T.,2017-2023,"Action, Adventure, Crime","Stars:, Shemar Moore, , Alex Russell, , Kenny ...",7.2,43 min,TV-14,27212,
tt2560140,Attack on Titan,2013-2023,"Animation, Action, Adventure","Stars:, Josh Grelle, , Bryce Papenbrook, , Yûk...",9.1,24 min,TV-MA,413124,
tt0411008,Lost,2004-2010,"Adventure, Drama, Fantasy","Stars:, Jorge Garcia, , Josh Holloway, , Yunji...",8.3,44 min,TV-14,568788,
...,...,...,...,...,...,...,...,...,...
tt2016058,Ssshhhh... Phir Koi Hai,2006-2009,Horror,"Stars:, Vicky Ahuja, , Amrapali Gupta, , Shami...",6.8,,,530,
tt6739248,Infected,2016-2018,"Drama, Horror, Thriller","Stars:, Anthony Lucero, , Eric Martinez, , Bri...",5.0,30 min,TV-MA,31,
tt2089578,We're Alive,2009-2014,"Drama, Horror","Stars:, Jim Gleason, , Nate Geez, , Elisa Elio...",9.3,,,903,
tt1178757,Creature Features,1970-1976,Horror,"Stars:, Carl Greyson, , Marty McNeely",9.2,,,79,
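
An alternative to stripping the dash and re-inserting it is to pull the years straight out of the raw strings. The sketch below is an assumption-laden illustration, not the method used in this notebook: it introduces two hypothetical columns, start_year and end_year, and the sample strings are modelled on raw values seen earlier (such as "I) (2023" and "2023–").

```python
import pandas as pd

raw_years = pd.Series(["I) (2023", "2023–", "2011–2019", "1996", None])

# Group 1: the first run of four digits (start year).
# Group 2: four digits that follow an en dash or hyphen (end year), if any.
parts = raw_years.str.extract(r"(\d{4})(?:\s*[–-]\s*(\d{4}))?")

years = pd.DataFrame({
    "start_year": pd.to_numeric(parts[0], errors="coerce").astype("Int64"),
    "end_year": pd.to_numeric(parts[1], errors="coerce").astype("Int64"),
})
print(years)
```

Both columns end up as nullable integers, so they can be compared or subtracted directly. One limitation of this sketch is that a single year ("1996") and an open-ended range ("2023–") both produce a missing end_year.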


Moving on, we can rename the "runtime" column to "runtime(in_minutes)", remove every occurrence of "min" from the column, and change its datatype from object to numeric. This makes the unit explicit and allows us to perform calculations on the runtime values.

In [29]:
imdb_df = imdb_df.rename(columns={"runtime":"runtime(in_minutes)"})
imdb_df.head()

imdb_id,title,release_year,genre,cast,rating,runtime(in_minutes),certificate,number_of_votes,gross_revenue
tt5971474,The Little Mermaid,2023,"Adventure, Family, Fantasy","Director:, Rob Marshall, | , Stars:, Halle...",7.2,135 min,PG,69638,
tt9362722,Spider-Man: Across the Spider-Verse,2023,"Animation, Action, Adventure","Directors:, Joaquim Dos Santos, , Kemp Powers,...",9.1,140 min,PG,71960,
tt13064902,FUBAR,2023,"Action, Adventure, Thriller","Stars:, Arnold Schwarzenegger, , Monica Barbar...",6.5,,TV-MA,15422,
tt5433140,Fast X,2023,"Action, Adventure, Crime","Director:, Louis Leterrier, | , Stars:, Vi...",6.3,141 min,PG-13,39326,
tt6791350,Guardians of the Galaxy Vol. 3,2023,"Action, Adventure, Comedy","Director:, James Gunn, | , Stars:, Chris P...",8.2,150 min,PG-13,160447,


In [30]:
imdb_df["runtime(in_minutes)"] = imdb_df["runtime(in_minutes)"].str.replace("min", "")
imdb_df.sample(5)

imdb_id,title,release_year,genre,cast,rating,runtime(in_minutes),certificate,number_of_votes,gross_revenue
tt2547584,The Light Between Oceans,2016,"Drama, Romance","Director:, Derek Cianfrance, | , Stars:, M...",7.2,133.0,PG-13,58432,12545979.0
tt0098774,The Crystal Maze,1990-2020,"Adventure, Comedy, Family","Stars:, Richard O'Brien, , Richard Ayoade, , E...",8.1,60.0,,1789,
tt15120326,Sidhu Moose Wala: Goat,2021,Music,"Director:, Sukh Sanghera, | , Stars:, Juma...",9.8,4.0,,6,
tt0306065,Shagird,1967,"Comedy, Musical, Romance","Director:, Samir Ganguly, | , Stars:, Joy ...",7.1,,,142,
tt0904208,Californication,2007-2014,"Comedy, Drama","Stars:, David Duchovny, , Natascha McElhone, ,...",8.3,28.0,TV-MA,184539,


In [31]:
imdb_df["runtime(in_minutes)"] = imdb_df["runtime(in_minutes)"].astype("int64")
imdb_df["runtime(in_minutes)"].dtype

ValueError: cannot convert float NaN to integer

The error message "ValueError: cannot convert float NaN to integer" occurred because we tried to convert a NaN value to an integer. Since NaN represents a missing or undefined value, it cannot be directly converted to an integer.
To handle this situation, we can deal with the NaN values before converting the column to integers. One approach is to use the fillna() method to replace the NaN values with a suitable value. Another is to use the dropna() method to remove the rows that contain them, which is what we do here.

In [32]:
imdb_df.dropna(subset=["runtime(in_minutes)"], inplace=True)

Let's try the conversion again.

In [33]:
imdb_df["runtime(in_minutes)"] = imdb_df["runtime(in_minutes)"].astype("int64")
imdb_df["runtime(in_minutes)"].dtype

ValueError: invalid literal for int() with base 10: '1,256 '

Another error message. This time it occurs because we are trying to convert the string '1,256 ' to an integer.
The issue is that the string contains a comma and a trailing space, which are not valid characters in an integer representation. To resolve this, we remove every non-digit character from the column, using the regular expression \D, before converting its datatype to an integer.

In [34]:
imdb_df["runtime(in_minutes)"] = imdb_df["runtime(in_minutes)"].str.replace(r'\D', '', regex=True)

In [35]:
imdb_df["runtime(in_minutes)"] = imdb_df["runtime(in_minutes)"].astype("Int64")
imdb_df["runtime(in_minutes)"].dtype

Int64Dtype()
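
For reference, the runtime clean-up can also be written as a single pass that keeps the rows with missing runtimes instead of dropping them. This is a sketch of an alternative, not what the cells above do: pd.to_numeric with errors="coerce" turns anything unparseable into NaN, and the nullable "Int64" dtype can hold those gaps. The sample values are made up in the style of the column.

```python
import pandas as pd

raw_runtime = pd.Series(["135 min", None, "1,256 min", "43 min"])

# Strip everything that is not a digit ("min", spaces, thousands commas).
digits_only = raw_runtime.str.replace(r"\D", "", regex=True)

# Parse to numbers, then store as nullable integers so missing values survive.
runtime_minutes = pd.to_numeric(digits_only, errors="coerce").astype("Int64")
print(runtime_minutes)
```

Whether to keep or drop rows without a runtime depends on the analysis; the notebook's dropna() approach is simpler but removes those titles entirely.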

The conversion succeeded this time. Note that we used "Int64" (with a capital I), which is pandas' nullable integer type, rather than NumPy's "int64". Speaking of datatypes, to see the datatypes of all our columns, we can use the dtypes attribute of the DataFrame.

In [36]:
imdb_df.dtypes

title                   object
release_year            object
genre                   object
cast                    object
rating                 float64
runtime(in_minutes)      Int64
certificate             object
number_of_votes         object
gross_revenue           object
dtype: object

Most columns are currently of object dtype, which is roughly analogous to str in native Python. It covers any field that does not fit neatly as numerical or categorical data. We would like the columns that hold numeric data (number_of_votes and gross_revenue) to have an appropriate numeric data type, which allows for proper manipulation and handling.

In [37]:
imdb_df["number_of_votes"] = imdb_df["number_of_votes"].astype("Int64")

In [38]:
imdb_df["gross_revenue"] = imdb_df["gross_revenue"].astype("Int64")

ValueError: invalid literal for int() with base 10: '190,241,310'

The conversion of the "number_of_votes" column was successful, but there seems to be an error with the "gross_revenue" column. We can further investigate the content of the column with head(), tail(), and sample() methods.

In [39]:
imdb_df["gross_revenue"].head()

imdb_id
tt5971474    NaN
tt9362722    NaN
tt5433140    NaN
tt6791350    NaN
tt2906216    NaN
Name: gross_revenue, dtype: object

In [40]:
imdb_df["gross_revenue"].tail()

imdb_id
tt0070445     NaN
tt1971558     NaN
tt13964002    NaN
tt5532370     NaN
tt15485388    NaN
Name: gross_revenue, dtype: object

In [41]:
imdb_df["gross_revenue"].sample(10)

imdb_id
tt8755868            NaN
tt18115352           NaN
tt11995882           NaN
tt0086308            NaN
tt5478478     29,819,114
tt0928413         25,526
tt5582392            NaN
tt0021377            NaN
tt0321059            NaN
tt0168052            NaN
Name: gross_revenue, dtype: object
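
The earlier ValueError came from the thousands separators in values such as '190,241,310'. If the column were to be kept, a conversion along the following lines should work; this is a sketch on a few sample values copied from the outputs above, not a step the notebook itself performs.

```python
import pandas as pd

raw_gross = pd.Series(["29,819,114", None, "25,526", "190,241,310"])

# Remove the commas as plain text (regex=False), then parse the digits.
without_commas = raw_gross.str.replace(",", "", regex=False)
gross = pd.to_numeric(without_commas, errors="coerce").astype("Int64")
print(gross)
```

pd.read_csv also accepts a thousands="," argument, which may let pandas parse such columns as numbers at load time.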

We observed a considerable number of NaN values in the "gross_revenue" column. To assess the extent of missing data in the entire DataFrame, we can calculate the percentage of NaN values for each column.

In [42]:
imdb_df.isnull().mean() * 100

title                   0.000000
release_year            0.003064
genre                   0.000000
cast                    0.264507
rating                  0.000000
runtime(in_minutes)     0.000000
certificate            42.084193
number_of_votes         0.000000
gross_revenue          86.468269
dtype: float64
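
The percentages are easier to act on when they sit next to the raw counts and are sorted. The following sketch builds such a summary on a small made-up DataFrame; the column names missing_count and missing_pct are hypothetical.

```python
import pandas as pd

# Made-up data, for illustration only.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "certificate": ["PG", None, "R", None],
    "gross_revenue": [None, None, None, "25,526"],
})

missing = pd.DataFrame({
    "missing_count": toy.isna().sum(),
    "missing_pct": (toy.isna().mean() * 100).round(2),
}).sort_values("missing_pct", ascending=False)
print(missing)
```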

About 86 percent of the values in the "gross_revenue" column are NaN. With so little usable data, a reasonable course of action is to drop the column from the dataset completely, so that the missing data does not negatively impact the analysis.

In [43]:
imdb_df.drop("gross_revenue", axis=1, inplace=True)

While the "certificate" column also contains a notable share of NaN values (about 42 percent), its percentage is not as high as that of the "gross_revenue" column. Instead of dropping the column entirely, we can use forward fill to handle the missing values. With this method, each NaN in the "certificate" column is replaced with the most recent valid value above it in the column.

In [44]:
# Series.ffill() is equivalent to fillna(method="ffill"), which newer pandas versions deprecate.
imdb_df["certificate"] = imdb_df["certificate"].ffill()


In [45]:
imdb_df.isnull().mean() * 100

title                  0.000000
release_year           0.003064
genre                  0.000000
cast                   0.264507
rating                 0.000000
runtime(in_minutes)    0.000000
certificate            0.000000
number_of_votes        0.000000
dtype: float64
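
To see what forward fill does to a column like "certificate", the sketch below compares it with another common option, filling with an explicit placeholder. The label "Unknown" is a made-up choice for illustration; the notebook itself uses forward fill.

```python
import pandas as pd

certificate = pd.Series(["PG", None, None, "TV-MA", None], name="certificate")

# Forward fill: each gap copies the last valid value above it.
forward_filled = certificate.ffill()

# Placeholder fill: each gap is marked with an explicit label instead.
labelled = certificate.fillna("Unknown")

print(pd.DataFrame({"ffill": forward_filled, "placeholder": labelled}))
```

Forward fill leaves no gaps as long as the first row is valid, while the placeholder keeps the filled rows distinguishable from genuine ratings.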

With the dataframe now cleaned, missing and duplicate values handled, we are ready to save it as a CSV file for future use or analysis.

In [46]:
imdb_df.to_csv("/home/anees/projects/data_cleaning_projects/cleaned_data/imdb.csv", index=True)
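
A CSV file does not store dtypes or the index, so both have to be restored when the file is read back. The sketch below shows one way to do that, assuming the column names used in this notebook; it writes to an in-memory buffer instead of the real output path, and the two rows are copied from the outputs above.

```python
import io

import pandas as pd

cleaned = pd.DataFrame(
    {
        "title": ["Lost", "Game of Thrones"],
        "runtime(in_minutes)": pd.array([44, 57], dtype="Int64"),
    },
    index=pd.Index(["tt0411008", "tt0944947"], name="imdb_id"),
)

buffer = io.StringIO()  # stands in for the CSV file on disk
cleaned.to_csv(buffer, index=True)
buffer.seek(0)

reloaded = pd.read_csv(
    buffer,
    index_col="imdb_id",                     # restore the index
    dtype={"runtime(in_minutes)": "Int64"},  # restore the nullable integer dtype
)
print(reloaded.dtypes)
```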