## Netflix Movies & TV Shows – Data Cleaning

This notebook focuses on cleaning the Netflix Movies & TV Shows dataset using Python and pandas.

### Objectives
- Handle missing values
- Standardize column names
- Clean text-based fields
- Convert data types
- Prepare a clean dataset for analysis




In [1]:
import pandas as pd

In [2]:
raw_titles = pd.read_csv("titles.csv")
titles = raw_titles.copy()

FileNotFoundError: [Errno 2] No such file or directory: 'titles.csv'
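
The traceback above means `titles.csv` was not found in the notebook's working directory, so none of the later cells can run until the path is corrected. A minimal sketch of a loader that fails early with a more explicit message; `load_titles` is a hypothetical helper, not part of the original notebook, and the default path is simply the one used in the cell above.

```python
from pathlib import Path

import pandas as pd


def load_titles(csv_path="titles.csv"):
    """Read the raw titles CSV, failing early with a clearer message if it is missing."""
    path = Path(csv_path)
    if not path.is_file():
        # Report the absolute path so it is obvious where pandas was looking.
        raise FileNotFoundError(
            f"Could not find {path.resolve()}; move the file there or pass its full path."
        )
    return pd.read_csv(path)
```

With this in place, the loading cell could read `raw_titles = load_titles("titles.csv")` followed by the same `.copy()` call.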

In [None]:
titles = titles.rename(columns={
    "release_year": "year",
    "age_certification": "age_rating",
    "imdb_score": "imdb_rating",
    "imdb_votes": "imdb_votes_count"
})
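
The rename above targets four known columns. If a source file arrives with inconsistent casing or spacing in its headers, a generic pass could normalize them before the rename. The sketch below assumes snake_case is the desired convention; `standardize_columns` is a hypothetical helper introduced for illustration.

```python
import pandas as pd


def standardize_columns(df):
    """Return a copy whose column names are stripped, lower-cased, and snake_cased."""
    cleaned = (
        df.columns.astype(str)
        .str.strip()
        .str.lower()
        .str.replace(r"[^0-9a-z]+", "_", regex=True)  # runs of other characters become "_"
        .str.strip("_")
    )
    out = df.copy()
    out.columns = cleaned
    return out
```

Because it returns a copy, the raw frame stays untouched for the before/after comparison later in the notebook.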

In [None]:
# Drop rows where title is missing
titles = titles.dropna(subset=["title"])

# Fill text columns
text_cols = ["description", "age_rating", "genres", "production_countries"]

for col in text_cols:
    titles[col] = titles[col].fillna("Unknown")

In [None]:
# astype(int) raises if a column contains NaN, so use pandas' nullable Int64 dtype
titles["year"] = titles["year"].astype("Int64")
titles["runtime"] = titles["runtime"].astype("Int64")
titles["imdb_rating"] = titles["imdb_rating"].astype(float)

In [None]:
# Remove rows that are exact duplicates across all columns
titles = titles.drop_duplicates()

In [None]:
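# Strip list punctuation so "['drama', 'crime']" becomes "drama, crime"
titles["genres"] = (
    titles["genres"]
    .str.replace("[", "", regex=False)
    .str.replace("]", "", regex=False)
    .str.replace("'", "", regex=False)
    # An empty list "[]" would otherwise become "" and be read back from CSV as NaN
    .replace("", "Unknown")
)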
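
Stripping brackets and quotes leaves a comma-separated string. If the analysis needs one genre per element, for example to count genres with `explode`, the raw values could be parsed into lists instead. The sketch below assumes the raw column holds stringified Python lists, which the characters removed above suggest but the notebook does not confirm; `parse_genres` is a hypothetical helper and the sample values are invented.

```python
import ast

import pandas as pd


def parse_genres(value):
    """Turn a string such as "['drama', 'crime']" into a Python list of genre names."""
    if not isinstance(value, str):
        return []
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        # Not a list literal (for example the "Unknown" placeholder): keep it as one item.
        return [value]
    return list(parsed) if isinstance(parsed, (list, tuple)) else [str(parsed)]


# Invented sample values, used only to illustrate the parsing.
sample = pd.Series(["['drama', 'crime']", "[]", "Unknown"])
genre_lists = sample.apply(parse_genres)
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for text read from a file.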

In [None]:
def age_group(rating):
    if rating in ["G", "PG"]:
        return "Kids"
    elif rating in ["PG-13", "TV-14"]:
        return "Teens"
    elif rating in ["R", "TV-MA"]:
        return "Adults"
    else:
        return "Unknown"

titles["audience_group"] = titles["age_rating"].apply(age_group)
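
The same grouping can also be written as a lookup table with `Series.map`, which keeps the rules in one place and avoids a Python-level function call per row. In the sketch below, the entries marked as assumed are not in the original function; they are guesses at other certifications the column might contain and should be checked against the actual values.

```python
import pandas as pd

# Same groups as age_group, plus a few assumed certifications (marked below).
AUDIENCE_BY_RATING = {
    "G": "Kids",
    "PG": "Kids",
    "TV-Y": "Kids",    # assumed addition
    "TV-Y7": "Kids",   # assumed addition
    "TV-G": "Kids",    # assumed addition
    "PG-13": "Teens",
    "TV-14": "Teens",
    "R": "Adults",
    "TV-MA": "Adults",
    "NC-17": "Adults",  # assumed addition
}

# Invented sample values; any rating missing from the table falls back to "Unknown".
ratings = pd.Series(["PG", "TV-14", "TV-MA", "TV-Y7", "Unknown"])
audience = ratings.map(AUDIENCE_BY_RATING).fillna("Unknown")
```

In the notebook this would correspond to `titles["age_rating"].map(AUDIENCE_BY_RATING).fillna("Unknown")`.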

In [None]:
print("Before cleaning shape:", raw_titles.shape)
print("After cleaning shape:", titles.shape)

print("\nBefore cleaning missing values:")
print(raw_titles.isnull().sum())

print("\nAfter cleaning missing values:")
print(titles.isnull().sum())

In [None]:
titles.info()

In [None]:
titles.to_csv("netflix_titles_cleaned.csv", index=False)
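
Before relying on the exported file, it can help to read it back and confirm that the cleaning steps held, since `read_csv` treats empty strings as missing values. A minimal sketch; `check_cleaned` is a hypothetical helper, and its default column list mirrors the `text_cols` list used earlier.

```python
import pandas as pd


def check_cleaned(df, text_cols=("description", "age_rating", "genres", "production_countries")):
    """Return a list of problems found in a cleaned titles frame (empty if none)."""
    problems = []
    if df["title"].isna().any():
        problems.append("missing titles")
    for col in text_cols:
        # Skip columns that are absent so the check also works on partial frames.
        if col in df.columns and df[col].isna().any():
            problems.append(f"missing values in {col}")
    if df.duplicated().any():
        problems.append("duplicate rows")
    return problems
```

For example, `check_cleaned(pd.read_csv("netflix_titles_cleaned.csv"))` should return an empty list if the export round-trips cleanly.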

### Conclusion
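The cleaning steps rename columns, drop rows without a title, fill missing text fields with "Unknown", convert data types, remove duplicate rows, and tidy the genres field. The result is written to `netflix_titles_cleaned.csv`, ready for exploratory data analysis or machine learning tasks.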
