# Netflix Content Analysis - Part 1: Data Cleaning
## 📊 Load, Clean, and Save Dataset

**Author:** DATA 606 - Capstone in Data Science  
**Objective:** Load the raw Netflix dataset, perform comprehensive data cleaning, and save the cleaned dataset for further analysis.

---

## 1. Import Required Libraries

In [None]:
# Install required packages (uncomment if needed)
# !pip install pandas numpy matplotlib seaborn scikit-learn

import pandas as pd
import numpy as np
import re
import warnings
from datetime import datetime

warnings.filterwarnings('ignore')

print("✅ All libraries imported successfully!")
print(f"📅 Execution Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

## 2. Load Dataset

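Upload `netflix_titles.csv` to Colab, mount Google Drive, or place the file in the working directory. The cell below defaults to the third option; uncomment one of the others if needed.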

In [None]:
# Option 1: Upload file directly (uncomment to use)
# from google.colab import files
# uploaded = files.upload()

# Option 2: Mount Google Drive (uncomment to use)
# from google.colab import drive
# drive.mount('/content/drive')
# df = pd.read_csv('/content/drive/MyDrive/netflix_titles.csv')

# Option 3: Load from current directory
df = pd.read_csv('netflix_titles.csv')

print("📥 Dataset loaded successfully!")
print(f"📊 Dataset shape: {df.shape}")
print(f"📋 Total records: {len(df):,}")
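
A quick schema check right after loading can catch a wrong or truncated file before any cleaning runs. The sketch below is one possible way to do it; `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names, and the column list is assumed from the columns this notebook reads later, not from an official schema.

```python
import pandas as pd

# Assumption: these are the columns the later cells of this notebook rely on.
EXPECTED_COLUMNS = {
    "type", "title", "director", "cast", "country", "date_added",
    "release_year", "rating", "duration", "listed_in", "description",
}

def missing_columns(frame: pd.DataFrame) -> list:
    """Return the expected columns that are absent from `frame`, sorted by name."""
    return sorted(EXPECTED_COLUMNS - set(frame.columns))

# Toy frame standing in for the real CSV, with two columns left out on purpose.
toy = pd.DataFrame(columns=sorted(EXPECTED_COLUMNS - {"rating", "duration"}))
print(missing_columns(toy))  # ['duration', 'rating']
```

On the real data, `missing_columns(df)` returning an empty list would suggest the file has the shape the rest of the notebook expects.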

## 3. Initial Data Exploration

In [None]:
# Display column names
print("📋 Column Names:")
print(list(df.columns))
print("\n" + "="*60)

In [None]:
# Display first few rows
print("🔍 First 5 Rows:")
df.head()

In [None]:
# Dataset information
print("ℹ️ Dataset Information:")
df.info()

In [None]:
# Content type distribution
print("🎭 Content Type Distribution:")
content_counts = df['type'].value_counts()
for content_type, count in content_counts.items():
    percentage = (count / len(df)) * 100
    print(f"   {content_type}: {count:,} titles ({percentage:.1f}%)")
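
The same counts and percentages can also be computed without an explicit loop by using `value_counts(normalize=True)`. This is a small sketch on a made-up `types` series, not the notebook's actual data:

```python
import pandas as pd

# Toy stand-in for df['type'].
types = pd.Series(["Movie", "Movie", "TV Show", "Movie"], name="type")

counts = types.value_counts()                      # absolute counts per type
shares = types.value_counts(normalize=True) * 100  # percentage per type

distribution = pd.DataFrame({"count": counts, "percent": shares.round(1)})
print(distribution)
```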

## 4. Missing Values Analysis

In [None]:
# Check missing values
print("🔍 Missing Values Analysis:")
print("="*60)

missing_values = df.isnull().sum()
missing_percentage = (df.isnull().sum() / len(df)) * 100

missing_df = pd.DataFrame({
    'Column': missing_values.index,
    'Missing Count': missing_values.values,
    'Percentage': missing_percentage.values
})

# Display only columns with missing values
missing_df_filtered = missing_df[missing_df['Missing Count'] > 0].sort_values('Missing Count', ascending=False)

if len(missing_df_filtered) > 0:
    print(missing_df_filtered.to_string(index=False))
else:
    print("✅ No missing values found!")

print("\n" + "="*60)
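
For reference, the missing-value table can be built a little more compactly because `isna().mean()` gives the fraction of missing entries per column directly. The sketch below uses a made-up `toy` frame; the column names `missing_count` and `missing_pct` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps, standing in for the raw dataset.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", np.nan, np.nan, "Y"],
    "country": ["US", "IN", np.nan, "UK"],
})

summary = pd.DataFrame({
    "missing_count": toy.isna().sum(),
    "missing_pct": toy.isna().mean() * 100,  # mean of booleans = fraction missing
})

# Keep only columns that actually have gaps, largest first.
summary = summary[summary["missing_count"] > 0].sort_values(
    "missing_count", ascending=False
)
print(summary)
```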

## 5. Data Cleaning

### 5.1 Handle Missing Values

In [None]:
print("🧹 Starting data cleaning process...")
print("="*60)

# Store original shape
original_shape = df.shape

# Fill missing values with appropriate defaults
df['director'] = df['director'].fillna('Unknown Director')
df['cast'] = df['cast'].fillna('Unknown Cast')
df['country'] = df['country'].fillna('Unknown Country')
df['date_added'] = df['date_added'].fillna('Unknown Date')
df['rating'] = df['rating'].fillna('Not Rated')
df['duration'] = df['duration'].fillna('Unknown Duration')
df['listed_in'] = df['listed_in'].fillna('Unknown Genre')
df['description'] = df['description'].fillna('No description available')

print("✅ Missing values filled successfully!")
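
The per-column fills above can also be expressed as a single `fillna` call with a dictionary, which keeps the defaults in one place. A minimal sketch on a made-up frame; `FILL_DEFAULTS` is a hypothetical name and lists only a subset of the columns:

```python
import numpy as np
import pandas as pd

# Assumption: a subset of the notebook's defaults, gathered into one mapping.
FILL_DEFAULTS = {
    "director": "Unknown Director",
    "cast": "Unknown Cast",
    "rating": "Not Rated",
}

toy = pd.DataFrame({
    "director": ["X", np.nan],
    "cast": [np.nan, "P, Q"],
    "rating": [np.nan, "PG"],
    "release_year": [2019, np.nan],  # not in the mapping, so left untouched
})

# A dict passed to fillna is applied column by column.
filled = toy.fillna(value=FILL_DEFAULTS)
print(filled)
```

Columns that are not keys in the mapping keep their missing values, so numeric columns such as `release_year` are not accidentally filled with text.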

### 5.2 Clean Release Year

In [None]:
# Convert release_year to numeric and remove invalid entries
df['release_year'] = pd.to_numeric(df['release_year'], errors='coerce')

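# Drop rows where release_year couldn't be converted
before_drop = len(df)
# .copy() makes the result an independent frame, so the assignment below
# does not trigger a SettingWithCopyWarning
df = df.dropna(subset=['release_year']).copy()
after_drop = len(df)

df['release_year'] = df['release_year'].astype(int)

print("✅ Release year cleaned!")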
print(f"   Removed {before_drop - after_drop} invalid records")
print(f"   Year range: {df['release_year'].min()} - {df['release_year'].max()}")
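
To illustrate what `errors='coerce'` does in the cell above: values that cannot be parsed as numbers become `NaN` instead of raising, which is why the subsequent `dropna` removes them. A small sketch on a made-up `raw` series:

```python
import pandas as pd

# Toy values: three valid years, one unparseable string, one missing entry.
raw = pd.Series(["2019", "1995", "n/a", None, "2021"])

years = pd.to_numeric(raw, errors="coerce")  # invalid entries become NaN
clean = years.dropna().astype(int)           # safe to cast once NaN is gone

print(years.isna().sum())  # 2
print(clean.tolist())      # [2019, 1995, 2021]
```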

### 5.3 Parse Duration

In [None]:
def parse_duration(duration_str, content_type):
    """
    Extract the leading integer from a duration string.
    - Movies: the number is minutes (e.g. '90 min' -> 90)
    - TV Shows: the number is seasons (e.g. '2 Seasons' -> 2)

    The parsing is identical for both types; content_type only tells the
    caller how to interpret the returned number.
    """
    if pd.isna(duration_str) or duration_str == 'Unknown Duration':
        return np.nan

    match = re.search(r'(\d+)', str(duration_str))
    return int(match.group(1)) if match else np.nan

# Apply duration parsing
df['duration_value'] = df.apply(lambda x: parse_duration(x['duration'], x['type']), axis=1)

print("✅ Duration parsed successfully!")
print(f"   Movies with duration: {df[df['type']=='Movie']['duration_value'].notna().sum()}")
print(f"   TV Shows with seasons: {df[df['type']=='TV Show']['duration_value'].notna().sum()}")
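
Because the number is extracted the same way for both content types, a vectorized alternative to the row-wise `apply` is possible with `Series.str.extract`. This is a sketch on a made-up frame, not a replacement the notebook requires:

```python
import pandas as pd

# Toy rows covering movies, shows, and the placeholder for missing durations.
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "TV Show", "Movie"],
    "duration": ["90 min", "2 Seasons", "1 Season", "Unknown Duration"],
})

# Extract the first run of digits; rows without digits yield NaN.
digits = toy["duration"].str.extract(r"(\d+)", expand=False)
toy["duration_value"] = pd.to_numeric(digits, errors="coerce")

print(toy)
```

The resulting column is floating point because it has to hold `NaN` for the rows without a parsable duration.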

### 5.4 Create Additional Features

In [None]:
# Create decade feature
df['decade'] = (df['release_year'] // 10) * 10

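# Create content age categories
current_year = 2024  # reference year for computing age; update when re-running later
df['age'] = current_year - df['release_year']
# include_lowest=True keeps age 0 (titles released in current_year) in 'Recent';
# without it the first bin is (0, 5] and age 0 would become NaN
df['age_category'] = pd.cut(df['age'],
                           bins=[0, 5, 15, 30, float('inf')],
                           labels=['Recent', 'Modern', 'Classic', 'Vintage'],
                           include_lowest=True)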

# Create a combined text field for recommendations
df['combined_features'] = (df['listed_in'].fillna('') + ' ' + 
                          df['description'].fillna('') + ' ' + 
                          df['director'].fillna('') + ' ' + 
                          df['cast'].fillna(''))

print("✅ Additional features created:")
print("   - decade")
print("   - age")
print("   - age_category")
print("   - combined_features")
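
`pd.cut` builds right-closed intervals by default, so with a lowest edge of 0 an age of exactly 0 falls outside the first bin unless `include_lowest=True` is passed. A minimal sketch on made-up ages (the `ages` series is illustrative, not taken from the dataset):

```python
import pandas as pd

ages = pd.Series([0, 5, 6, 15, 16, 30, 31])
bins = [0, 5, 15, 30, float("inf")]
labels = ["Recent", "Modern", "Classic", "Vintage"]

# Default: bins are (0, 5], (5, 15], (15, 30], (30, inf], so 0 is unassigned.
default = pd.cut(ages, bins=bins, labels=labels)

# include_lowest=True closes the first interval on the left: [0, 5].
inclusive = pd.cut(ages, bins=bins, labels=labels, include_lowest=True)

print(default.isna().sum())  # 1
print(inclusive.tolist())
```

Boundary values such as 5, 15, and 30 belong to the lower category in both cases, since each interval includes its right edge.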

## 6. Data Quality Check

In [None]:
print("🔍 Final Data Quality Check:")
print("="*60)

# Check for remaining missing values
remaining_missing = df.isnull().sum().sum()
print(f"Remaining missing values: {remaining_missing}")

# Display data types
print("\nData Types:")
print(df.dtypes)

# Display summary statistics
print("\n📊 Summary Statistics:")
print(f"   Total records: {len(df):,}")
print(f"   Movies: {len(df[df['type'] == 'Movie']):,}")
print(f"   TV Shows: {len(df[df['type'] == 'TV Show']):,}")
print(f"   Unique countries: {df['country'].nunique()}")
print(f"   Unique directors: {df['director'].nunique()}")
print(f"   Unique ratings: {df['rating'].nunique()}")

In [None]:
# Display sample of cleaned data
print("\n📋 Sample of Cleaned Data:")
df[['title', 'type', 'release_year', 'duration', 'duration_value', 
    'listed_in', 'rating', 'decade', 'age_category']].head(10)

## 7. Save Cleaned Dataset

In [None]:
# Save cleaned dataset
import os

output_filename = 'netflix_cleaned.csv'
df.to_csv(output_filename, index=False)

print("💾 Cleaned dataset saved successfully!")
print("="*60)
print(f"📁 Filename: {output_filename}")
print(f"📊 Total records: {len(df):,}")
print(f"📋 Total columns: {len(df.columns)}")
# Report the size of the file on disk (memory_usage would give the in-memory size instead)
print(f"📏 File size: {os.path.getsize(output_filename) / 1024**2:.2f} MB")
print("\n✅ Data cleaning completed successfully!")
print("🎯 Ready for EDA in the next notebook!")

In [None]:
# Download file (optional - for Google Colab)
# from google.colab import files
# files.download(output_filename)
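
One way to sanity-check a saved CSV is a round trip: write it, read it back, and compare shape and columns. The sketch below does this with a made-up frame in a temporary directory; the file name `toy_cleaned.csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the cleaned dataset.
toy = pd.DataFrame({
    "title": ["A", "B"],
    "release_year": [2019, 1995],
    "duration_value": [90.0, float("nan")],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_cleaned.csv")
    toy.to_csv(path, index=False)
    size_bytes = os.path.getsize(path)  # size on disk, not in memory
    reloaded = pd.read_csv(path)

print(reloaded.shape == toy.shape)
print(list(reloaded.columns) == list(toy.columns))
```

Note that a CSV does not store dtypes, so categorical columns such as `age_category` come back as plain strings when the file is read in a later notebook.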

---

## Summary

✅ **Completed Tasks:**
1. Loaded raw Netflix dataset
2. Analyzed missing values
3. Filled missing values with appropriate defaults
4. Cleaned and validated release_year
5. Parsed duration for movies and TV shows
6. Created additional features (decade, age, age_category)
7. Created combined_features for recommendations
8. Saved cleaned dataset as `netflix_cleaned.csv`

**Next Step:** Proceed to `02_eda.ipynb` for Exploratory Data Analysis