# Load the Dataset
The Python code imports the **pandas** library and reads the CSV file "netflix_titles.csv" into a DataFrame named `df`, which every later step works on.

In [2]:
import pandas as pd
df = pd.read_csv("netflix_titles.csv")

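# 1. Identify Missing Values
This step counts the missing entries (NaN or None) in each column. `isnull()` does not flag empty or whitespace-only strings, so blank text needs a separate check. Note that `director`, `cast`, and `country` already report 0 in the output below, which suggests the fill from step 2 had already been run in the same session when this output was captured.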

In [17]:
# Check for missing values in each column
print(df.isnull().sum())

show_id          0
type             0
title            0
director         0
cast             0
country          0
date_added      10
release_year     0
rating           4
duration         3
listed_in        0
description      0
dtype: int64
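
Since `isnull()` only sees true nulls, a check for blank strings has to be written separately. The following is a minimal sketch on a hypothetical three-row frame (the `sample` data is invented for illustration, not taken from the Netflix file):

```python
import pandas as pd

# Hypothetical toy data: one real name, one true null, one whitespace-only string
sample = pd.DataFrame({"director": ["Jane Doe", None, "   "]})

# isnull() counts only the true null
null_count = int(sample["director"].isnull().sum())

# Treat empty or whitespace-only strings as missing too
blank_or_null = sample["director"].fillna("").str.strip().eq("")
missing_count = int(blank_or_null.sum())

print("nulls:", null_count)
print("nulls plus blanks:", missing_count)
```

If the two counts differ on the real data, the blanks could be converted to nulls first so that later steps treat both the same way.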


# 2. Handle Missing Values

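This step replaces missing values in `director`, `cast`, and `country` with fixed labels ("Unknown" or "Not Available"). The other columns are left alone, so `date_added`, `rating`, and `duration` still report 10, 4, and 3 missing entries in the output below.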

In [18]:
df['director'] = df['director'].fillna("Unknown")
df['cast'] = df['cast'].fillna("Not Available")
df['country'] = df['country'].fillna("Unknown")

print("Missing values after filling:")
print(df.isnull().sum())

Missing values after filling:
show_id          0
type             0
title            0
director         0
cast             0
country          0
date_added      10
release_year     0
rating           4
duration         3
listed_in        0
description      0
dtype: int64
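
The remaining nulls could be handled the same way. As a sketch only, assuming a placeholder label is acceptable for `rating` and `duration` (the notebook does not say how these columns should be treated), `fillna` accepts a dictionary that maps column names to fill values. The `sample` frame is hypothetical:

```python
import pandas as pd

# Hypothetical toy data with one null in each column
sample = pd.DataFrame({
    "rating": ["PG-13", None],
    "duration": [None, "90 min"],
    "date_added": ["September 25, 2021", None],
})

# Fill several columns in one call; columns not named in the dict are untouched
filled = sample.fillna({"rating": "Unknown", "duration": "Unknown"})

remaining = filled.isnull().sum()
print(remaining)
```

Leaving `date_added` unfilled is deliberate in this sketch: a text placeholder in a date column would fail to parse later.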


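# 3. Remove Duplicate Rows

This step removes rows that are exact duplicates across all columns and keeps the first occurrence of each. The row count is 8807 both before and after, so this dataset contains no fully duplicated rows.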

In [20]:
print("Before:", len(df))
df = df.drop_duplicates()
print("After:", len(df))

Before: 8807
After: 8807
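
An unchanged row count is expected whenever an ID column such as `show_id` is unique, because no two rows can then match on every column. A hedged sketch of a narrower check, using `subset` to compare only selected columns (the choice of `title` and `release_year` as the key is an assumption for illustration, and the `sample` rows are invented):

```python
import pandas as pd

# Hypothetical toy data: rows s1 and s2 differ only in show_id
sample = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "title": ["Example A", "Example A", "Example B"],
    "release_year": [2020, 2020, 2021],
})

# Comparing all columns finds nothing, because show_id is unique
exact = sample.drop_duplicates()

# Comparing only the chosen key columns catches the repeated title
by_key = sample.drop_duplicates(subset=["title", "release_year"], keep="first")

print("All columns:", len(exact))
print("Key columns:", len(by_key))
```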


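# 4. Standardize Text Values

This step makes text values consistent. The cell shown strips leading and trailing whitespace from `country`; giving `country`, `rating`, or `type` a uniform case would need additional string methods such as `str.title()` or `str.upper()`, which this cell does not call.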

In [22]:
# Standardize the 'country' column by stripping whitespace
df['country'] = df['country'].str.strip()
print(df['country'].head(10))

0                                        United States
1                                         South Africa
2                                              Unknown
3                                              Unknown
4                                                India
5                                              Unknown
6                                              Unknown
7    United States, Ghana, Burkina Faso, United Kin...
8                                       United Kingdom
9                                        United States
Name: country, dtype: object
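
Case normalization can be sketched as follows. The choice of title case for `type` and upper case for `rating` is an assumption, and the `sample` values are invented for illustration:

```python
import pandas as pd

# Hypothetical toy data with inconsistent case and stray whitespace
sample = pd.DataFrame({
    "type": [" movie", "TV SHOW "],
    "rating": ["pg-13", " tv-ma"],
})

# Strip whitespace first, then apply one case convention per column
sample["type"] = sample["type"].str.strip().str.title()
sample["rating"] = sample["rating"].str.strip().str.upper()

print(sample)
```

Note that `str.title()` turns "TV SHOW" into "Tv Show", so any later comparison against the spelling "TV Show" will not match.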


# 5. Convert Date Formats

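This step parses `date_added` into datetime values and then formats them back into text in a uniform dd-mm-yyyy style. With `errors='coerce'`, any value that cannot be parsed becomes missing instead of raising an error. After `strftime` the column holds strings (dtype `object`), not datetimes.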

In [24]:
df['date_added'] = pd.to_datetime(df['date_added'], errors='coerce')
df['date_added'] = df['date_added'].dt.strftime("%d-%m-%Y")

print(df['date_added'].head())

0    25-09-2021
1    24-09-2021
2    24-09-2021
3    24-09-2021
4    24-09-2021
Name: date_added, dtype: object
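
If the dates are needed for sorting or date arithmetic, one option is to keep a parsed datetime column and store the formatted text separately. This is a sketch on invented data, and the column names `date_added_dt` and `date_added_text` are hypothetical:

```python
import pandas as pd

# Hypothetical toy data in a "Month day, year" style, with one missing value
sample = pd.DataFrame({
    "date_added": ["September 25, 2021", "August 14, 2020", None],
})

# Parse once; unparseable or missing values become NaT
parsed = pd.to_datetime(sample["date_added"], errors="coerce")

# Keep the datetime column for calculations and a text copy for display
sample["date_added_dt"] = parsed
sample["date_added_text"] = parsed.dt.strftime("%d-%m-%Y")

print(sample.dtypes)
```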


# 6. Rename Column Headers

This step cleans the column names for a uniform style. All headers become lowercase with underscores instead of spaces.

In [25]:
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

print(df.columns)

Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')
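
The headers in this file were already clean, so the output above is unchanged. The effect is easier to see on messy names; the headers below are invented for illustration:

```python
import pandas as pd

# Hypothetical messy headers: stray spaces and mixed case
sample = pd.DataFrame(columns=[" Show ID", "Release Year ", "listed_in"])

# Same chain as above: trim, lowercase, then replace inner spaces
sample.columns = sample.columns.str.strip().str.lower().str.replace(" ", "_")

print(list(sample.columns))
```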


# 7. Check Data Types

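This step checks the data type of each column. It shows whether values are stored as integers, strings, or dates, which is vital for correct data handling. Here every column except `release_year` is `object`, including `date_added`, because step 5 formatted the dates back into strings.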

In [26]:
print(df.dtypes)

show_id         object
type            object
title           object
director        object
cast            object
country         object
date_added      object
release_year     int64
rating          object
duration        object
listed_in       object
description     object
dtype: object


# 8. Fix Data Types

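This step casts the release year to integers. The column was already `int64` in step 7, so the cast changes little here; the `int32` in the output most likely reflects a platform where the default integer is 32 bits wide (for example, Windows with NumPy 1.x). Either way, the column is ready for numeric operations such as calculations or comparisons.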

In [27]:
df['release_year'] = df['release_year'].astype(int)

print(df['release_year'].head())

0    2020
1    2021
2    2021
3    2021
4    2021
Name: release_year, dtype: int32
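
To get the same integer width on every platform, the target dtype can be named explicitly. A small sketch, assuming years that arrive as text (the `sample` values are invented):

```python
import pandas as pd

# Hypothetical toy data: years stored as strings
sample = pd.DataFrame({"release_year": ["2020", "2021"]})

# Naming "int64" avoids the platform-dependent width of plain int
sample["release_year"] = sample["release_year"].astype("int64")

# Numeric comparisons now work as expected
recent = int((sample["release_year"] >= 2021).sum())
print(sample["release_year"].dtype, recent)
```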


# 9. Split Duration Column

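This step separates numeric values from text in the duration column. One column stores the number and the other stores the unit, such as "min", "Season", or "Seasons". The numbers print as floats (90.0) because rows with a missing duration become NaN, which an ordinary integer column cannot hold.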

In [28]:
df[['duration_value', 'duration_unit']] = df['duration'].str.extract(r'(\d+)\s*(\w+)')
df['duration_value'] = pd.to_numeric(df['duration_value'], errors='coerce')

print(df[['duration', 'duration_value', 'duration_unit']].head())

    duration  duration_value duration_unit
0     90 min            90.0           min
1  2 Seasons             2.0       Seasons
2   1 Season             1.0        Season
3   1 Season             1.0        Season
4  2 Seasons             2.0       Seasons
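
If whole numbers are preferred, pandas' nullable `Int64` dtype can hold integers alongside missing values. The following is a sketch on invented data, not the notebook's own method:

```python
import pandas as pd

# Hypothetical toy data with one missing duration
sample = pd.DataFrame({"duration": ["90 min", "2 Seasons", None]})

# Same regular expression as above: digits, optional space, then the unit
parts = sample["duration"].str.extract(r"(\d+)\s*(\w+)")

# Nullable Int64 keeps 90 and 2 as integers and stores the gap as <NA>
sample["duration_value"] = pd.to_numeric(parts[0], errors="coerce").astype("Int64")
sample["duration_unit"] = parts[1]

print(sample)
```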


# 10. Convert a Categorical Column into a Numerical One
This step transforms categorical text into a numerical format, which many machine learning models require. The `type` column, which labels each title as a movie or a TV show, is converted into 0s and 1s. Because string comparison is case-sensitive, the stored spelling of the label matters: the output shows `Tv Show`, not `TV Show`.

In [29]:
# Convert 'type' to a binary flag; compare case-insensitively because the
# column holds 'Tv Show' at this point, so an exact match on 'TV Show' fails
df['is_tv_show'] = df['type'].str.strip().str.lower().eq('tv show').astype(int)

print(df[['type', 'is_tv_show']].head())

      type  is_tv_show
0    Movie           0
1  Tv Show           1
2  Tv Show           1
3  Tv Show           1
4  Tv Show           1
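
An alternative sketch uses `map` with an explicit dictionary. Unlike an if/else rule, `map` leaves any unexpected label as NaN instead of silently assigning it 0, which makes spelling mismatches visible. The `sample` values and the lowercase keys are assumptions for illustration:

```python
import pandas as pd

# Hypothetical toy data with two spellings of the same label
sample = pd.DataFrame({"type": ["Movie", "TV Show", "Tv Show"]})

# Normalize case and whitespace before looking up the code
normalized = sample["type"].str.strip().str.lower()
sample["is_tv_show"] = normalized.map({"movie": 0, "tv show": 1})

# Any label missing from the dictionary would show up here as a null
unmapped = int(sample["is_tv_show"].isnull().sum())
print(sample)
print("Unmapped labels:", unmapped)
```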


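# 11. Group and Aggregate Data
This step counts how many titles each `country` value has, using `value_counts()`. It gives a quick summary of the most frequent contributing countries. Two caveats apply: "Unknown" (the placeholder from step 2) ranks third, and a title that lists several countries in one comma-separated string is counted under that combined string rather than under each country.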

In [30]:
# Group by 'country' and count the number of titles
country_counts = df['country'].value_counts()

print("Top 5 countries by title count:")
print(country_counts.head())

Top 5 countries by title count:
United States     2818
India              972
Unknown            831
United Kingdom     419
Japan              245
Name: country, dtype: int64
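
To credit each country in a multi-country entry, the strings can be split and expanded before counting. A minimal sketch on invented data, assuming a comma is the only separator:

```python
import pandas as pd

# Hypothetical toy data: some titles list several countries
sample = pd.DataFrame({
    "country": ["United States", "United States, India", "India, Japan", "Unknown"],
})

# Split on commas, give each country its own row, trim spaces, then count
per_country = (
    sample["country"]
    .str.split(",")
    .explode()
    .str.strip()
    .value_counts()
)

print(per_country)
```

With this approach a co-produced title counts once for every country it lists, so the totals add up to more than the number of titles.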


# 12. Save Cleaned Dataset

This final step writes the cleaned dataset to a new CSV file. The final file is ready for subsequent analysis or for use in other applications.

In [32]:
df.to_csv("netflix_data_cleaned.csv", index=False)

print("Dataset successfully saved to netflix_data_cleaned.csv")

Dataset successfully saved to netflix_data_cleaned.csv
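
CSV files store text only, so column types such as datetimes are not preserved on reload. The sketch below shows the round trip on invented data, using an in-memory buffer in place of a real file:

```python
import io

import pandas as pd

# Hypothetical toy data with a true datetime column
sample = pd.DataFrame({
    "title": ["Example A"],
    "date_added": pd.to_datetime(["2021-09-25"]),
})

# Write to an in-memory buffer instead of disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# A plain reload returns the dates as text
buffer.seek(0)
plain = pd.read_csv(buffer)

# Asking read_csv to parse the column restores the datetime dtype
buffer.seek(0)
typed = pd.read_csv(buffer, parse_dates=["date_added"])

print(plain.dtypes)
print(typed.dtypes)
```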
