In this notebook, I walk through the steps to clean and carry out exploratory data analysis on the IMDb TV series data from Kaggle: https://www.kaggle.com/datasets/suraj520/imdb-tv-series-data
The dataset contains information about TV series from IMDb, including title, IMDb ID, release year, genre, cast, synopsis, rating, runtime, certificate, number of votes, and gross revenue. The data was scraped from the IMDb website and is organized into a separate CSV file for each genre.

Importing Required Libraries:
* glob: This library is used to retrieve file paths matching specified patterns.
* pandas: It provides data manipulation and analysis capabilities.
* os: This library provides a way to interact with the operating system, including file and directory operations.
* zipfile: This library offers tools to create, read, write, and extract files from ZIP archives.

In [2]:
import glob
import pandas as pd
import os
from zipfile import ZipFile

The IMDb data is stored in a ZIP file called archive.zip. To find it, start with an empty list called zip_files, which will hold the paths of the ZIP files you find.
Begin walking through the directory structure using os.walk("/home/anees/projects/EDA_and_data_cleaning").
For each directory, examine the files within it.
If a file has a ".zip" extension, add its path to the zip_files list.
Continue the process until all directories have been traversed.
Finally, print the zip_files list to display the paths of all identified ZIP files.

In [3]:
zip_files = []
for root, dirs, files in os.walk("/home/anees/projects/EDA_and_data_cleaning"):
    for file in files:
        if file.endswith(".zip"):
            zip_files.append(os.path.join(root, file))

print(zip_files)

['/home/anees/projects/EDA_and_data_cleaning/venv/lib/python3.8/site-packages/importlib_resources/tests/zipdata02/ziptestdata.zip', '/home/anees/projects/EDA_and_data_cleaning/venv/lib/python3.8/site-packages/importlib_resources/tests/zipdata01/ziptestdata.zip', '/home/anees/projects/EDA_and_data_cleaning/raw_data/imdb/archive.zip']


After obtaining the list of ZIP file paths, you will notice that "archive.zip" is the last entry in this run. The other two entries are test fixtures inside the virtual environment (venv). Note that the order returned by os.walk() depends on the file system, so the position may differ on another machine. Using the zipfile library, extract the contents of this file to a folder named "imdb_files" inside the "raw_data/imdb" directory. You can then verify the extraction by printing the paths of the extracted CSV files.
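
Because the position of "archive.zip" in the list is not guaranteed, it can be safer to select the file by name. The following is a minimal sketch, not part of the original notebook: find_zip_by_name is a hypothetical helper, and the demo builds a throwaway directory with placeholder ZIP files so that it runs anywhere.

```python
import os
import tempfile
from zipfile import ZipFile


def find_zip_by_name(root, name):
    """Return every path under `root` whose file name equals `name` (hypothetical helper)."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if name in filenames:
            matches.append(os.path.join(dirpath, name))
    return matches


# Demo on a temporary directory; the layout only imitates the project structure.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "raw_data", "imdb"))
    os.makedirs(os.path.join(tmp, "venv", "tests"))
    for rel in ("raw_data/imdb/archive.zip", "venv/tests/ziptestdata.zip"):
        with ZipFile(os.path.join(tmp, rel), "w") as zf:
            zf.writestr("placeholder.txt", "demo")

    found = find_zip_by_name(tmp, "archive.zip")
    found_rel = [os.path.relpath(path, tmp) for path in found]
    print(found_rel)
```

In the notebook, the first element of the returned list could then be passed to ZipFile in place of zip_files[-1].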

In [4]:
with ZipFile(zip_files[-1], "r") as file:
    file.extractall(path="/home/anees/projects/EDA_and_data_cleaning/raw_data/imdb/imdb_files")

csv_files = glob.glob("/home/anees/projects/EDA_and_data_cleaning/raw_data/imdb/imdb_files/*.csv")
for csv_file in csv_files:
    print(csv_file)

/home/anees/projects/EDA_and_data_cleaning/raw_data/imdb/imdb_files/adventure_series.csv
/home/anees/projects/EDA_and_data_cleaning/raw_data/imdb/imdb_files/drama_series.csv
/home/anees/projects/EDA_and_data_cleaning/raw_data/imdb/imdb_files/fantasy_series.csv
/home/anees/projects/EDA_and_data_cleaning/raw_data/imdb/imdb_files/history_series.csv
/home/anees/projects/EDA_and_data_cleaning/raw_data/imdb/imdb_files/biography_series.csv
/home/anees/projects/EDA_and_data_cleaning/raw_data/imdb/imdb_files/crime_series.csv
/home/anees/projects/EDA_and_data_cleaning/raw_data/imdb/imdb_files/sci-fi_series.csv
/home/anees/projects/EDA_and_data_cleaning/raw_data/imdb/imdb_files/music_series.csv
/home/anees/projects/EDA_and_data_cleaning/raw_data/imdb/imdb_files/western_series.csv
/home/anees/projects/EDA_and_data_cleaning/raw_data/imdb/imdb_files/action_series.csv
/home/anees/projects/EDA_and_data_cleaning/raw_data/imdb/imdb_files/thriller_series.csv
/home/anees/projects/EDA_and_data_cleaning/raw

To build a consolidated DataFrame from the extracted CSV files, iterate through the list of CSV file paths, read each file with pd.read_csv(), and append the resulting DataFrame to a list called dataframes. Then use pd.concat() to combine them into a single DataFrame called imdb_df.
Passing ignore_index=True gives imdb_df a fresh index starting at 0 instead of repeating each file's own index.

In [5]:
dataframes = []
for csv_file in csv_files:
    df = pd.read_csv(csv_file)
    dataframes.append(df)
imdb_df = pd.concat(dataframes, ignore_index=True)
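
Once the files are concatenated, there is no record of which genre file a row came from, and a title may appear in more than one file. The sketch below shows one way to keep that information. It is an illustration on made-up in-memory CSV text, and the source_file column name is an assumption, not something the dataset provides.

```python
import io

import pandas as pd

# Stand-ins for the per-genre CSV files (hypothetical contents).
fake_files = {
    "action_series.csv": "Title,Rating\nShow A,6.5\nShow B,6.3\n",
    "horror_series.csv": "Title,Rating\nShow C,8.5\n",
}

frames = []
for name, text in fake_files.items():
    df = pd.read_csv(io.StringIO(text))
    # Tag every row with the file it was read from (assumed column name).
    df["source_file"] = name
    frames.append(df)

combined = pd.concat(frames, ignore_index=True)
print(combined)
```

With real files, os.path.basename(csv_file) could supply the value for the extra column.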

To retrieve information about the structure and summary of the consolidated DataFrame imdb_df, you can use the info() method.
Here's the code:

In [6]:
imdb_df.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 236828 entries, 0 to 236827
Data columns (total 11 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   Title            236828 non-null  object 
 1   IMDb ID          236828 non-null  object 
 2   Release Year     236819 non-null  object 
 3   Genre            236828 non-null  object 
 4   Cast             235956 non-null  object 
 5   Synopsis         236828 non-null  object 
 6   Rating           236828 non-null  float64
 7   Runtime          216983 non-null  object 
 8   Certificate      169091 non-null  object 
 9   Number of Votes  236828 non-null  object 
 10  Gross Revenue    45611 non-null   object 
dtypes: float64(1), object(10)
memory usage: 19.9+ MB


The imdb_df.info() command provides you with information about the structure and summary of the consolidated DataFrame imdb_df. 
The output provides the following details:
* The DataFrame has a RangeIndex with 236,828 entries, ranging from index 0 to index 236,827.
* There are 11 columns in the DataFrame.
* Each column is listed along with its non-null count and data type.
* The DataFrame contains a mix of data types, with 10 columns being of type object and 1 column being of type float64.
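* The memory usage is reported as 19.9+ MB. The trailing + means this is a lower bound: pandas does not measure the strings held in object columns unless you call info(memory_usage="deep").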
This information helps you understand the composition and structure of the DataFrame, including the number of entries, data types, and missing values for each column.
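
The non-null counts in the info() output can also be turned into an explicit missing-value summary. Here is a minimal sketch on a small made-up DataFrame; the column names mirror the dataset, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with a few gaps (values are invented).
toy = pd.DataFrame(
    {
        "title": ["Show A", "Show B", "Show C", "Show D"],
        "runtime": [np.nan, "117 min", "102 min", "111 min"],
        "gross_revenue": [np.nan, 1000.0, 2000.0, np.nan],
    }
)

# Count missing values per column and express them as a percentage of rows.
missing = toy.isna().sum()
missing_pct = (toy.isna().mean() * 100).round(1)
summary = pd.DataFrame({"missing": missing, "missing_pct": missing_pct})
print(summary)
```

Applied to imdb_df, the same two lines would show at a glance which columns have the most missing entries.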

To make the DataFrame easier to work with, you can adjust the index and the column names.

To update the index, add 1 to the existing index values so that it starts at 1 instead of 0. To make the column names consistent, replace spaces with underscores (_) and convert the names to lowercase. You can then call info() again to verify that the changes took effect.

Consider the following code:

In [7]:
imdb_df.index = imdb_df.index + 1
imdb_df.columns = imdb_df.columns.str.replace(' ', '_').str.lower()
imdb_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 236828 entries, 1 to 236828
Data columns (total 11 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   title            236828 non-null  object 
 1   imdb_id          236828 non-null  object 
 2   release_year     236819 non-null  object 
 3   genre            236828 non-null  object 
 4   cast             235956 non-null  object 
 5   synopsis         236828 non-null  object 
 6   rating           236828 non-null  float64
 7   runtime          216983 non-null  object 
 8   certificate      169091 non-null  object 
 9   number_of_votes  236828 non-null  object 
 10  gross_revenue    45611 non-null   object 
dtypes: float64(1), object(10)
memory usage: 19.9+ MB
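
If the same renaming is needed for several DataFrames, it could be wrapped in a small function. The sketch below is one possible version: normalize_columns is a hypothetical helper, and it additionally trims stray whitespace around names, which the cell above does not do.

```python
import pandas as pd


def normalize_columns(df):
    """Return a copy with lowercase, underscore-separated column names (hypothetical helper)."""
    out = df.copy()
    out.columns = (
        out.columns.str.strip()
        .str.lower()
        .str.replace(r"\s+", "_", regex=True)
    )
    return out


# Empty toy frame; only the column names matter here.
toy = pd.DataFrame(columns=["Title", "IMDb ID", " Release Year", "Number of Votes"])
clean_names = list(normalize_columns(toy).columns)
print(clean_names)
```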


To preview the data in the DataFrame, you can use the head() and tail() methods. By default, head() shows the first 5 rows of the DataFrame and tail() shows the last 5 rows. Both methods accept an argument specifying the number of rows to preview.
For example, imdb_df.head(10) will display the top 10 rows, while imdb_df.tail(3) will display the last 3 rows of the DataFrame.

By utilizing these methods, you can preview the data in the DataFrame and get a sense of its contents and structure.

In [8]:
imdb_df.head()

Unnamed: 0,title,imdb_id,release_year,genre,cast,synopsis,rating,runtime,certificate,number_of_votes,gross_revenue
1,The Little Mermaid,tt5971474,I) (2023,"Adventure, Family, Fantasy","Director:, Rob Marshall, | , Stars:, Halle...",A young mermaid makes a deal with a sea witch ...,7.2,135 min,PG,69638,
2,Spider-Man: Across the Spider-Verse,tt9362722,2023,"Animation, Action, Adventure","Directors:, Joaquim Dos Santos, , Kemp Powers,...","Miles Morales catapults across the Multiverse,...",9.1,140 min,PG,71960,
3,FUBAR,tt13064902,2023–,"Action, Adventure, Thriller","Stars:, Arnold Schwarzenegger, , Monica Barbar...",A C.I.A. operative on the edge of retirement d...,6.5,,TV-MA,15422,
4,Fast X,tt5433140,2023,"Action, Adventure, Crime","Director:, Louis Leterrier, | , Stars:, Vi...",Dom Toretto and his family are targeted by the...,6.3,141 min,PG-13,39326,
5,Guardians of the Galaxy Vol. 3,tt6791350,2023,"Action, Adventure, Comedy","Director:, James Gunn, | , Stars:, Chris P...","Still reeling from the loss of Gamora, Peter Q...",8.2,150 min,PG-13,160447,


In [9]:
imdb_df.tail()

Unnamed: 0,title,imdb_id,release_year,genre,cast,synopsis,rating,runtime,certificate,number_of_votes,gross_revenue
236824,Hannibal,tt2243973,2013–2015,"Crime, Drama, Horror","Stars:, Hugh Dancy, , Mads Mikkelsen, , Caroli...",Explores the early relationship between renown...,8.5,44 min,TV-MA,262494,
236825,Alien,tt0078748,1979,"Horror, Sci-Fi","Director:, Ridley Scott, | , Stars:, Sigou...",The crew of a commercial spacecraft encounter ...,8.5,117 min,R,904145,78900000.0
236826,Tusk,tt3099498,I) (2014,"Comedy, Horror","Director:, Kevin Smith, | , Stars:, Justin...",A brash and arrogant podcaster gets more than ...,5.3,102 min,R,58435,1821983.0
236827,Scream,tt0117571,1996,"Horror, Mystery","Director:, Wes Craven, | , Stars:, Neve Ca...","A year after the murder of her mother, a teena...",7.4,111 min,R,364584,103046663.0
236828,Evil Dead,tt1288558,2013,Horror,"Director:, Fede Alvarez, | , Stars:, Jane ...","Five friends head to a remote cabin, where the...",6.5,91 min,R,187215,54239856.0


From the result, we can spot our first issue: the release_year column should contain only a four-digit year, but some values include special characters (parentheses and en dashes) and Roman numerals such as "I".
To understand how widespread these problems are, examine the column's unique values. The code below assigns them to a variable and prints them:

In [10]:
unique_values = imdb_df['release_year'].unique()
print(unique_values)
# Optionally, loop through the array and print each unique value:
# for value in unique_values:
#     print(value)

['I) (2023' '2023' '2023– ' ... '2012 Podcast Series' 'II) (2018 TV Movie'
 'I) (1970–1976']
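
The printed array is truncated, so it can help to list only the values that are not already a plain four-digit year. The following is a sketch on a handful of sample strings taken from the output above; it assumes a pandas version that provides Series.str.fullmatch (1.1 or later).

```python
import numpy as np
import pandas as pd

# Sample values copied from the outputs above, plus one missing entry.
years = pd.Series(
    ["I) (2023", "2023", "2023– ", "2013–2015", "2012 Podcast Series", np.nan, "1979"]
)

# True only where the whole string is exactly four digits; NaN counts as False.
is_plain_year = years.str.fullmatch(r"\d{4}", na=False)

# Keep the non-missing values that need cleaning and count each pattern.
irregular = years[~is_plain_year & years.notna()]
print(irregular.value_counts())
```

On the full column, the resulting counts would show which irregular formats are common enough to deserve their own cleaning rule.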


When addressing the task of cleaning the release_year column, you have two viable approaches. The first option entails directly cleaning the column itself, while the second option involves cleaning the unique_values and subsequently replacing the release_year column with the cleaned list.

The choice between these options largely depends on the specific requirements and constraints of the data analysis task at hand. Cleaning the column directly allows for immediate modifications within the DataFrame, which can be advantageous when there is a need to retain the original structure and integrity of the dataset. Conversely, opting to clean the unique_values list independently can offer the advantage of decoupling the cleaning process from the DataFrame, facilitating analysis and transformations on a reduced and sanitized dataset.

We will go with the first approach, but you can opt for the second if you prefer.
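
For reference, the second approach could look like the sketch below: each distinct value is cleaned once, and the results are mapped back onto the column. This is an illustration on toy data, and the release_year_clean column name is an assumption introduced here.

```python
import re

import numpy as np
import pandas as pd

toy = pd.DataFrame({"release_year": ["I) (2023", "2023", "2023– ", "I) (2023", np.nan]})

# Clean every distinct non-missing value exactly once.
unique_values = toy["release_year"].dropna().unique()
lookup = {raw: re.sub(r"[^\d]", "", raw) for raw in unique_values}

# Map the cleaned values back; missing entries stay missing.
toy["release_year_clean"] = toy["release_year"].map(lookup)
print(toy)
```

Writing the result to a separate column keeps the raw values available for comparison.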

You can clean the 'release_year' column by removing every character that is not a digit.
The code below uses the str.replace() method with the regular expression r'[^\d]', which matches any character that is not a digit. The regex=True parameter tells pandas to treat the pattern as a regular expression rather than a literal string. Replacing each match with an empty string removes all non-digit characters from the column.

After executing the code, the 'release_year' values contain only digits, although the column is still stored as strings (object dtype).

In [11]:
imdb_df['release_year'] = imdb_df['release_year'].str.replace(r'[^\d]', '', regex=True)
imdb_df.head()

Unnamed: 0,title,imdb_id,release_year,genre,cast,synopsis,rating,runtime,certificate,number_of_votes,gross_revenue
1,The Little Mermaid,tt5971474,2023,"Adventure, Family, Fantasy","Director:, Rob Marshall, | , Stars:, Halle...",A young mermaid makes a deal with a sea witch ...,7.2,135 min,PG,69638,
2,Spider-Man: Across the Spider-Verse,tt9362722,2023,"Animation, Action, Adventure","Directors:, Joaquim Dos Santos, , Kemp Powers,...","Miles Morales catapults across the Multiverse,...",9.1,140 min,PG,71960,
3,FUBAR,tt13064902,2023,"Action, Adventure, Thriller","Stars:, Arnold Schwarzenegger, , Monica Barbar...",A C.I.A. operative on the edge of retirement d...,6.5,,TV-MA,15422,
4,Fast X,tt5433140,2023,"Action, Adventure, Crime","Director:, Louis Leterrier, | , Stars:, Vi...",Dom Toretto and his family are targeted by the...,6.3,141 min,PG-13,39326,
5,Guardians of the Galaxy Vol. 3,tt6791350,2023,"Action, Adventure, Comedy","Director:, James Gunn, | , Stars:, Chris P...","Still reeling from the loss of Gamora, Peter Q...",8.2,150 min,PG-13,160447,


The head of the DataFrame now shows clean four-digit years.
However, the tail of the DataFrame reveals a new issue: removing every non-digit character also removed the en dash (–) that separates the two years in range entries, so "2013–2015" became "20132015".

In [12]:
imdb_df.tail()

Unnamed: 0,title,imdb_id,release_year,genre,cast,synopsis,rating,runtime,certificate,number_of_votes,gross_revenue
236824,Hannibal,tt2243973,20132015,"Crime, Drama, Horror","Stars:, Hugh Dancy, , Mads Mikkelsen, , Caroli...",Explores the early relationship between renown...,8.5,44 min,TV-MA,262494,
236825,Alien,tt0078748,1979,"Horror, Sci-Fi","Director:, Ridley Scott, | , Stars:, Sigou...",The crew of a commercial spacecraft encounter ...,8.5,117 min,R,904145,78900000.0
236826,Tusk,tt3099498,2014,"Comedy, Horror","Director:, Kevin Smith, | , Stars:, Justin...",A brash and arrogant podcaster gets more than ...,5.3,102 min,R,58435,1821983.0
236827,Scream,tt0117571,1996,"Horror, Mystery","Director:, Wes Craven, | , Stars:, Neve Ca...","A year after the murder of her mother, a teena...",7.4,111 min,R,364584,103046663.0
236828,Evil Dead,tt1288558,2013,Horror,"Director:, Fede Alvarez, | , Stars:, Jane ...","Five friends head to a remote cabin, where the...",6.5,91 min,R,187215,54239856.0


To address this problem, you can use a regular expression to insert a hyphen between the two years.
The following code uses the str.replace() method with the pattern (\d{4})(\d{4}), which matches eight consecutive digits and captures the first four and the last four as separate groups. The replacement \1-\2 writes the two groups back with a hyphen (-) between them. Values with only four digits do not match the pattern and are left unchanged.

In [18]:
imdb_df['release_year'] = imdb_df['release_year'].str.replace(r'(\d{4})(\d{4})', r'\1-\2', regex=True)
imdb_df.tail()

Unnamed: 0,title,imdb_id,release_year,genre,cast,synopsis,rating,runtime,certificate,number_of_votes,gross_revenue
236824,Hannibal,tt2243973,2013-2015,"Crime, Drama, Horror","Stars:, Hugh Dancy, , Mads Mikkelsen, , Caroli...",Explores the early relationship between renown...,8.5,44 min,TV-MA,262494,
236825,Alien,tt0078748,1979,"Horror, Sci-Fi","Director:, Ridley Scott, | , Stars:, Sigou...",The crew of a commercial spacecraft encounter ...,8.5,117 min,R,904145,78900000.0
236826,Tusk,tt3099498,2014,"Comedy, Horror","Director:, Kevin Smith, | , Stars:, Justin...",A brash and arrogant podcaster gets more than ...,5.3,102 min,R,58435,1821983.0
236827,Scream,tt0117571,1996,"Horror, Mystery","Director:, Wes Craven, | , Stars:, Neve Ca...","A year after the murder of her mother, a teena...",7.4,111 min,R,364584,103046663.0
236828,Evil Dead,tt1288558,2013,Horror,"Director:, Fede Alvarez, | , Stars:, Jane ...","Five friends head to a remote cabin, where the...",6.5,91 min,R,187215,54239856.0
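
Stripping all non-digits and then re-inserting a hyphen works for these rows, but it treats an ongoing series such as "2023– " the same as a single-year entry and relies on range entries having exactly eight digits. An alternative, shown here only as a sketch on sample strings from the outputs above, is to extract the start and end years from the raw text in one step. The start_year and end_year column names are assumptions introduced for this example.

```python
import numpy as np
import pandas as pd

# Raw-style values as they appear before any cleaning.
raw = pd.Series(
    ["I) (2023", "2023– ", "2013–2015", "II) (2018 TV Movie", "I) (1970–1976", np.nan],
    name="release_year",
)

# The first four-digit run is the start year; a second one that follows
# an en dash, if present, is the end year.
pattern = r"(?P<start_year>\d{4})(?:\s*–\s*(?P<end_year>\d{4}))?"
parts = raw.str.extract(pattern)

# Convert the extracted strings to nullable integers so missing years stay missing.
parts = parts.apply(pd.to_numeric).astype("Int64")
print(parts)
```

Two numeric columns are easier to filter and aggregate than a single string such as "2013-2015", at the cost of changing the DataFrame's shape.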
