In this notebook, I walk through the steps for cleaning and carrying out exploratory data analysis on the IMDb TV series data from Kaggle: https://www.kaggle.com/datasets/suraj520/imdb-tv-series-data
The dataset contains information about TV series from IMDb, including details such as title, IMDb ID, release year, genre, cast, synopsis, rating, runtime, certificate, number of votes, and gross revenue. The data is scraped from the IMDb website using web scraping techniques and is organized into separate CSV files for each genre.

Importing Required Libraries:
* glob: This library is used to retrieve file paths matching specified patterns.
* pandas: It provides data manipulation and analysis capabilities.
* os: This library provides a way to interact with the operating system, including file and directory operations.
* zipfile: This library offers tools to create, read, write, and extract files from ZIP archives.

In [None]:
import glob
import pandas as pd
import os
from zipfile import ZipFile

The IMDb data comes as a ZIP file called archive.zip. To find it, start with an empty list called zip_files to store the paths of any ZIP files found.
Walk through the directory structure using os.walk("/home/anees/projects/EDA_and_data_cleaning").
For each directory, examine the files within it.
If a file has a ".zip" extension, add its path to the zip_files list.
Continue the process until all directories have been traversed.
Finally, print the zip_files list to display the paths of all identified ZIP files.

In [None]:
zip_files = []
for root, dirs, files in os.walk("/home/anees/projects/EDA_and_data_cleaning"):
    for file in files:
        if file.endswith(".zip"):
            zip_files.append(os.path.join(root, file))

print(zip_files)

After obtaining the list of ZIP file paths, you will notice that, in this run, the "archive.zip" file is at the last index of the list. Using the zipfile library, you can extract the contents of this ZIP file to a folder named "imdb_files" within the "raw_data/imdb" directory. Once extracted, you can verify the extraction by printing the paths of the extracted CSV files.

In [None]:
with ZipFile(zip_files[-1], "r") as file:
    file.extractall(path="/home/anees/projects/EDA_and_data_cleaning/raw_data/imdb/imdb_files")

csv_files = glob.glob("/home/anees/projects/EDA_and_data_cleaning/raw_data/imdb/imdb_files/*.csv")
for csv_file in csv_files:
    print(csv_file)
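
Relying on the last index is fragile, because the order in which os.walk visits files depends on the file system. Below is a minimal sketch of selecting the archive by file name instead, assuming the file is called archive.zip as stated above. The helper name find_archive and the example paths are hypothetical and used only for illustration.

```python
import os

def find_archive(paths, name="archive.zip"):
    """Return the first path whose file name equals `name`, or None if absent."""
    for path in paths:
        if os.path.basename(path) == name:
            return path
    return None

# Hypothetical paths, for illustration only
example_paths = ["/data/other.zip", "/data/raw/archive.zip", "/data/backup.zip"]
print(find_archive(example_paths))
```

In the notebook, the result of find_archive(zip_files) could be passed to ZipFile in place of zip_files[-1].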

To process the extracted CSV files and create a consolidated DataFrame, you can iterate through the list of CSV file paths, read each CSV file using pd.read_csv(), and append the resulting DataFrames to a list called dataframes. Finally, you can use pd.concat() to concatenate the DataFrames into a single DataFrame called imdb_df.
By executing this code, you will obtain the consolidated DataFrame imdb_df, which contains the data from all the extracted CSV files.

In [None]:
dataframes = []
for csv_file in csv_files:
    df = pd.read_csv(csv_file)
    dataframes.append(df)
imdb_df = pd.concat(dataframes, ignore_index=True)

To retrieve information about the structure and summary of the consolidated DataFrame imdb_df, you can use the info() method.
Here's the code:

In [None]:
imdb_df.info() 

The output of imdb_df.info() provides the following details:
* The DataFrame has a RangeIndex with 236,828 entries, ranging from index 0 to index 236,827.
* There are 11 columns in the DataFrame.
* Each column is listed along with its non-null count and data type.
* The DataFrame contains a mix of data types, with 10 columns being of type object and 1 column being of type float64.
* The memory usage of the DataFrame is reported as 19.9+ MB. The "+" indicates a lower bound, because the real size of the object columns is not measured by default.
This information helps you understand the composition and structure of the DataFrame, including the number of entries, data types, and missing values for each column.
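
The non-null counts reported by info() can also be turned into an explicit count of missing values per column. This is a minimal sketch on a toy DataFrame standing in for imdb_df. The values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Toy frame standing in for imdb_df (values are illustrative, not from the dataset)
toy_df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "rating": [7.2, None, 8.1],
    "runtime": ["135 min", None, None],
})

# Number of missing values in each column
missing_counts = toy_df.isna().sum()
print(missing_counts)
```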

To enhance the DataFrame's structure, you can modify the index and column names to adhere to professional conventions.

To update the index, you can set it to start at 1 instead of 0 by adding 1 to the existing index values. Additionally, you can make the column names consistent by replacing any spaces with underscores (_) and converting them to lowercase. You can then call the info() method again to verify that the changes have taken effect.

Consider the following code:

In [None]:
imdb_df.index = imdb_df.index + 1
imdb_df.columns = imdb_df.columns.str.replace(' ', '_').str.lower()
imdb_df.info()

To preview the data in the DataFrame, you can use the head() and tail() methods. By default, the head() method shows the top 5 rows of the DataFrame, while the tail() method shows the bottom 5 rows. Both methods can take an argument specifying the number of rows to preview.
For example, imdb_df.head(10) will display the top 10 rows, while imdb_df.tail(3) will display the last 3 rows of the DataFrame.

By utilizing these methods, you can preview the data in the DataFrame and get a sense of its contents and structure.
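
As a quick illustration of the row-count argument, here is a small sketch on a toy DataFrame with 20 rows. The frame and its column name n are made up for the example.

```python
import pandas as pd

# Toy frame with 20 rows, numbered 1 to 20
toy_df = pd.DataFrame({"n": range(1, 21)})

first_ten = toy_df.head(10)   # top 10 rows
last_three = toy_df.tail(3)   # bottom 3 rows

print(len(first_ten), len(last_three))
print(last_three["n"].tolist())
```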

In [None]:
imdb_df.head()

In [None]:
imdb_df.tail()

From the result, we can spot our first issue. The release_year column should contain only four digits signifying a year, but as you can see, it contains special characters (parentheses and dashes), the Roman numeral "I", and some rows with year ranges instead of single years.
To get a fuller picture of how clean the release_year column is, we can examine its unique values by assigning them to a variable and printing them. Here's the code to accomplish this:

In [49]:
unique_values = imdb_df['release_year'].unique()
print(unique_values)

['I ' '' '– ' ... '2012 Podcast Series' 'II 2018 TV Movie' 'I 1970–1976']
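
One possible way to handle values like these is to keep only the first four-digit number in each string. The sketch below runs on a few of the values printed above. It is not the approach taken in the cells that follow, and it treats the start of a range such as 1970–1976 as the release year, which is an assumption.

```python
import pandas as pd

# Sample values taken from the output above
sample = pd.Series(["I ", "", "– ", "2012 Podcast Series", "II 2018 TV Movie", "I 1970–1976"])

# Capture the first run of four digits; rows without one become NaN
first_year = sample.str.extract(r"(\d{4})", expand=False)

# Convert the captured text to numbers, leaving NaN where nothing was captured
years = pd.to_numeric(first_year, errors="coerce")
print(years.tolist())
```

Values with no year at all end up as NaN, so they can be counted or dropped explicitly later.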


In [38]:
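# regex=True is needed because pandas 2.0+ treats the pattern as literal text by default
imdb_df['release_year'] = imdb_df['release_year'].str.replace(r'\(|\)', '', regex=True)
# remove hyphens, en dashes and em dashes
imdb_df['release_year'] = imdb_df['release_year'].str.replace(r'[-–—]', '', regex=True)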

imdb_df.head()

Unnamed: 0,title,imdb_id,release_year,genre,cast,synopsis,rating,runtime,certificate,number_of_votes,gross_revenue
2,The Little Mermaid,tt5971474,I 2023,"Adventure, Family, Fantasy","Director:, Rob Marshall, | , Stars:, Halle...",A young mermaid makes a deal with a sea witch ...,7.2,135 min,PG,69638,
3,Spider-Man: Across the Spider-Verse,tt9362722,2023,"Animation, Action, Adventure","Directors:, Joaquim Dos Santos, , Kemp Powers,...","Miles Morales catapults across the Multiverse,...",9.1,140 min,PG,71960,
4,FUBAR,tt13064902,2023–,"Action, Adventure, Thriller","Stars:, Arnold Schwarzenegger, , Monica Barbar...",A C.I.A. operative on the edge of retirement d...,6.5,,TV-MA,15422,
5,Fast X,tt5433140,2023,"Action, Adventure, Crime","Director:, Louis Leterrier, | , Stars:, Vi...",Dom Toretto and his family are targeted by the...,6.3,141 min,PG-13,39326,
6,Guardians of the Galaxy Vol. 3,tt6791350,2023,"Action, Adventure, Comedy","Director:, James Gunn, | , Stars:, Chris P...","Still reeling from the loss of Gamora, Peter Q...",8.2,150 min,PG-13,160447,


In [46]:
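# remove every character that is not a digit (regex=True is needed in pandas 2.0+)
imdb_df['release_year'] = imdb_df['release_year'].str.replace(r'[^0-9]', '', regex=True)
# caution: this deletes the literal text '2023', so a value of '2023' becomes an empty string
imdb_df['release_year'] = imdb_df['release_year'].str.replace('2023', '', regex=False)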


In [None]:
# change columns to numeric datatypes
imdb_df[["release_year", "number_of_votes", "gross_revenue"]] = \
imdb_df[["release_year", "number_of_votes", "gross_revenue"]].apply(pd.to_numeric)

# verify the changes have taken effect
imdb_df.dtypes

This raised a ValueError: "ValueError: Unable to parse string "I) (2023" at position 0"
Let's strip the leading "I" from the values in the release_year column.

In [None]:
# strip one leading "I"; non-string values such as NaN are left unchanged
imdb_df["release_year"] = imdb_df["release_year"].apply(lambda x: x[1:] if isinstance(x, str) and x.startswith('I') else x)
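
To check which values would still fail before re-running the conversion, pd.to_numeric can be called with errors="coerce", which returns NaN for unparseable strings instead of raising. This is a minimal sketch on illustrative values. Only the string "I) (2023" comes from the error message above.

```python
import pandas as pd

# Illustrative values; "I) (2023" is the string from the error message above
raw = pd.Series(["2023", "I) (2023", "1999", None])

# errors="coerce" turns unparseable strings into NaN instead of raising
parsed = pd.to_numeric(raw, errors="coerce")

# Values that were present but could not be parsed
unparsed = raw[parsed.isna() & raw.notna()]
print(unparsed.tolist())
```

Printing the unique unparsed values shows which patterns still need cleaning.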

In [None]:
# change columns to numeric datatypes
imdb_df[["release_year", "number_of_votes", "gross_revenue"]] = \
imdb_df[["release_year", "number_of_votes", "gross_revenue"]].apply(pd.to_numeric)

# verify the changes have taken effect
imdb_df.dtypes

In [None]:
# count duplicate values in the synopsis column
imdb_df["synopsis"].duplicated().sum()


In [None]:
# check for duplicates in the title column, as each title is expected to be unique
imdb_df["title"].duplicated().sum()
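
Because the per-genre CSV files were concatenated, the same series may plausibly appear more than once. Below is a minimal sketch, on toy data, of inspecting and dropping such repeats. The genre_file column is hypothetical. Deduplicating on imdb_id rather than title is an assumption, made because different series can share a title.

```python
import pandas as pd

# Toy rows (illustrative): the same title listed under two genres
toy_df = pd.DataFrame({
    "title": ["FUBAR", "Fast X", "FUBAR"],
    "imdb_id": ["tt13064902", "tt5433140", "tt13064902"],
    "genre_file": ["action", "action", "thriller"],  # hypothetical column
})

# duplicated() flags repeats after the first occurrence; keep=False flags every copy
n_repeats = toy_df["title"].duplicated().sum()
all_copies = toy_df[toy_df["title"].duplicated(keep=False)]

# Keep one row per imdb_id (the first occurrence)
deduped = toy_df.drop_duplicates(subset="imdb_id")
print(n_repeats, len(all_copies), len(deduped))
```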