# Day 1 – Data Cleaning and Preparation (Netflix Dataset)



## Netflix Movies & TV Shows – Data Cleaning

This notebook focuses on cleaning the Netflix Movies & TV Shows dataset using Python and pandas.

### Objectives
- Handle missing values
- Standardize column names
- Clean text-based fields
- Convert data types
- Prepare a clean dataset for analysis




In [12]:
import pandas as pd

In [13]:
# Load the CSV and keep the loaded frame untouched for the before/after comparison
raw_titles = pd.read_csv("/content/netflix_cleaned.csv")
titles = raw_titles.copy()

In [14]:
# Rename columns to shorter, consistent names (keys not present in the frame are ignored)
titles = titles.rename(columns={
    "release_year": "year",
    "age_certification": "age_rating",
    "imdb_score": "imdb_rating",
    "imdb_votes": "imdb_votes_count"
})
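
The rename above targets four specific columns. For files with less regular headers, a generic normalizer can help. The following is a minimal sketch, not part of the original notebook; `to_snake_case` and the `demo` frame are hypothetical names used only for illustration.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Lowercase a column name and replace runs of non-alphanumerics with '_'."""
    name = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return name.strip("_").lower()


# Hypothetical frame with inconsistent headers (not taken from the Netflix file)
demo = pd.DataFrame(columns=["Release Year", " IMDB-Score ", "imdb_votes"])
demo.columns = [to_snake_case(c) for c in demo.columns]

print(list(demo.columns))
```

Running the normalizer before an explicit `rename` keeps the rename dictionary short, since only the semantic renames remain.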

In [15]:
# Drop rows where title is missing
titles = titles.dropna(subset=["title"])

# Fill text columns
text_cols = ["description", "age_rating", "genres", "production_countries"]

for col in text_cols:
    titles[col] = titles[col].fillna("Unknown")
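
To see what this cell does in isolation, the same two steps can be run on a tiny frame. This is an illustrative sketch; the `demo` data is made up and only mirrors the column names used above.

```python
import pandas as pd

# Made-up rows: one missing title, one missing description, one missing genres
demo = pd.DataFrame({
    "title": ["A", None, "C"],
    "description": ["text", "text", None],
    "genres": [None, "['drama']", "['comedy']"],
})

# Rows without a title cannot be identified, so they are dropped
demo = demo.dropna(subset=["title"])

# Remaining gaps in text columns get an explicit placeholder
text_cols = ["description", "genres"]
demo[text_cols] = demo[text_cols].fillna("Unknown")

print(demo)
```

Filling several columns in one assignment is equivalent to the loop above and avoids repeating the column lookup.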

In [16]:
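# Cast numeric columns to explicit types.
# Note: astype(int) raises an error if the column still contains NaN.
titles["year"] = titles["year"].astype(int)
titles["runtime"] = titles["runtime"].astype(int)
titles["imdb_rating"] = titles["imdb_rating"].astype(float)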
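
The casts above work because `year` and `runtime` have no missing values. For a column that does, such as `seasons`, one option is pandas' nullable `Int64` dtype. The sketch below uses made-up data and is an alternative approach, not what the notebook itself does.

```python
import pandas as pd

# Made-up data: a non-numeric year string and a missing seasons value
demo = pd.DataFrame({
    "year": ["2019", "2020", "n/a"],
    "seasons": [1.0, None, 3.0],
})

# errors="coerce" turns unparseable strings into NaN instead of raising
demo["year"] = pd.to_numeric(demo["year"], errors="coerce").astype("Int64")

# Nullable Int64 keeps whole numbers as integers and missing values as <NA>
demo["seasons"] = demo["seasons"].astype("Int64")

print(demo.dtypes)
```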

In [17]:
# Remove rows that are exact duplicates across all columns
titles = titles.drop_duplicates()
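
`drop_duplicates()` with no arguments only removes rows that match in every column. If each row carries a unique `id`, the same title can still appear twice. The sketch below, on made-up data, counts both kinds of duplicates; treating `title` plus `year` as the key is an assumption for illustration.

```python
import pandas as pd

# Made-up rows: same title and year under two different ids
demo = pd.DataFrame({
    "id": ["t1", "t2", "t3"],
    "title": ["Alpha", "Alpha", "Beta"],
    "year": [2020, 2020, 2021],
})

# Exact duplicates across all columns
exact_dupes = int(demo.duplicated().sum())

# Duplicates on an assumed business key
key_dupes = int(demo.duplicated(subset=["title", "year"]).sum())

# Keep the first occurrence per key and rebuild a clean index
deduped = (
    demo.drop_duplicates(subset=["title", "year"], keep="first")
    .reset_index(drop=True)
)

print(exact_dupes, key_dupes, len(deduped))
```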

In [18]:
# Strip list brackets and quotes so "['drama', 'crime']" becomes "drama, crime"
titles["genres"] = (
    titles["genres"]
    .str.replace("[", "", regex=False)
    .str.replace("]", "", regex=False)
    .str.replace("'", "", regex=False)
)
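
One side effect of plain character stripping is that an empty list string `"[]"` becomes an empty string rather than `"Unknown"`. A possible alternative is to parse the list with `ast.literal_eval`. The helper `clean_genres` below is hypothetical and assumes the column holds Python-style list strings.

```python
import ast

import pandas as pd


def clean_genres(value: str) -> str:
    """Turn "['a', 'b']" into "a, b"; map empty lists to "Unknown"."""
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        # Not a Python literal (for example the "Unknown" placeholder)
        return value
    if isinstance(parsed, list):
        return ", ".join(str(g) for g in parsed) if parsed else "Unknown"
    return value


demo = pd.Series(["['drama', 'crime']", "[]", "Unknown"])
cleaned = demo.apply(clean_genres)

print(cleaned.tolist())
```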

In [19]:
def age_group(rating):
    """Map an age rating to a coarse audience group; unlisted ratings map to "Unknown"."""
    if rating in ["G", "PG"]:
        return "Kids"
    elif rating in ["PG-13", "TV-14"]:
        return "Teens"
    elif rating in ["R", "TV-MA"]:
        return "Adults"
    else:
        return "Unknown"

titles["audience_group"] = titles["age_rating"].apply(age_group)
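
The same grouping can be expressed as a dictionary lookup, which is easier to extend. In the sketch below the entries marked as assumptions (`TV-Y`, `TV-Y7`, `TV-G`, `TV-PG`, `NC-17`) are not in the original function; whether they occur in `age_rating`, and which group each belongs to, should be checked against the data first.

```python
import pandas as pd

AGE_GROUPS = {
    # Groups taken from the age_group function above
    "G": "Kids",
    "PG": "Kids",
    "PG-13": "Teens",
    "TV-14": "Teens",
    "R": "Adults",
    "TV-MA": "Adults",
    # Assumptions: additional ratings and their groups, not in the original
    "TV-Y": "Kids",
    "TV-Y7": "Kids",
    "TV-G": "Kids",
    "TV-PG": "Kids",
    "NC-17": "Adults",
}

ratings = pd.Series(["PG", "TV-Y7", "TV-14", "NC-17", "Unknown"])

# map() returns NaN for keys missing from the dictionary, so fill those afterwards
groups = ratings.map(AGE_GROUPS).fillna("Unknown")

print(groups.tolist())
```

`Series.map` with a dictionary is vectorized, so it also tends to be faster than `apply` with a Python function on large frames.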

In [20]:
print("Before cleaning shape:", raw_titles.shape)
print("After cleaning shape:", titles.shape)

print("\nBefore cleaning missing values:")
print(raw_titles.isnull().sum())

print("\nAfter cleaning missing values:")
print(titles.isnull().sum())

Before cleaning shape: (5849, 16)
After cleaning shape: (5849, 16)

Before cleaning missing values:
id                         0
title                      0
type                       0
description                0
year                       0
age_rating                 0
runtime                    0
genres                    58
production_countries       0
seasons                 3743
imdb_id                  403
imdb_rating              481
imdb_votes_count         497
tmdb_popularity           90
tmdb_score               310
audience_group             0
dtype: int64

After cleaning missing values:
id                         0
title                      0
type                       0
description                0
year                       0
age_rating                 0
runtime                    0
genres                     0
production_countries       0
seasons                 3743
imdb_id                  403
imdb_rating              481
imdb_votes_count         497
tmdb_popularity           90
tmdb_score               310
audience_group             0
dtype: int64
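
Two separate printouts are hard to compare by eye. A side-by-side table is one way to make the effect of cleaning visible. `missing_report` below is a hypothetical helper shown on made-up data.

```python
import pandas as pd


def missing_report(before: pd.DataFrame, after: pd.DataFrame) -> pd.DataFrame:
    """Count missing values per column before and after cleaning."""
    report = pd.DataFrame({
        "before": before.isnull().sum(),
        "after": after.isnull().sum(),
    })
    report["filled"] = report["before"] - report["after"]
    return report


# Made-up frames: genres gets filled, seasons is left as-is
before = pd.DataFrame({"genres": [None, "drama"], "seasons": [None, 2.0]})
after = before.assign(genres=before["genres"].fillna("Unknown"))

report = missing_report(before, after)
print(report)
```

In the notebook this could be called as `missing_report(raw_titles, titles)`.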

In [21]:
titles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5849 entries, 0 to 5848
Data columns (total 16 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    5849 non-null   object 
 1   title                 5849 non-null   object 
 2   type                  5849 non-null   object 
 3   description           5849 non-null   object 
 4   year                  5849 non-null   int64  
 5   age_rating            5849 non-null   object 
 6   runtime               5849 non-null   int64  
 7   genres                5849 non-null   object 
 8   production_countries  5849 non-null   object 
 9   seasons               2106 non-null   float64
 10  imdb_id               5446 non-null   object 
 11  imdb_rating           5368 non-null   float64
 12  imdb_votes_count      5352 non-null   float64
 13  tmdb_popularity       5759 non-null   float64
 14  tmdb_score            5539 non-null   float64
 15  audience_group        5849 non-null   object 

In [22]:
titles.to_csv("netflix_titles_cleaned.csv", index=False)
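
After exporting, it can be worth reading the file back to confirm that it round-trips. The sketch below writes made-up data to a temporary directory so it does not touch the real output file; the file name `demo_cleaned.csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Made-up frame with a missing value to check that NaN survives the round trip
demo = pd.DataFrame({
    "title": ["Alpha", "Beta"],
    "year": [2020, 2021],
    "seasons": [1.0, None],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_cleaned.csv")
    demo.to_csv(path, index=False)  # index=False avoids an extra unnamed column
    reloaded = pd.read_csv(path)

# Raises AssertionError if values or dtypes differ
pd.testing.assert_frame_equal(demo, reloaded)
print(reloaded.shape)
```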

### Conclusion
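The dataset has been cleaned and standardized: columns are renamed, the four filled text columns have no missing values, and numeric types are explicit. Several columns (`seasons`, `imdb_id`, `imdb_rating`, `imdb_votes_count`, `tmdb_popularity`, `tmdb_score`) still contain missing values, so these need handling before exploratory data analysis or machine learning tasks.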
