# Netflix Metadata Enrichment
This notebook demonstrates the process of enriching the Netflix dataset with additional metadata, engineered features, text analysis, and data quality improvements. Each step is explained below.

In [6]:
import pandas as pd
import requests
import time
import os

# Load the cleaned dataset
df = pd.read_csv('data/netflix_titles_cleaned.csv')
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,month_added,month_name_added,year_added,duration_minutes,duration_seasons
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,No Data,United States,2021-09-25,2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",9,September,2021,90,1
1,s2,TV Show,Blood & Water,No Data,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",9,September,2021,98,2
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",United States,2021-09-24,2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,9,September,2021,98,1
3,s4,TV Show,Jailbirds New Orleans,No Data,No Data,United States,2021-09-24,2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo...",9,September,2021,98,1
4,s5,TV Show,Kota Factory,No Data,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,2021-09-24,2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...,9,September,2021,98,2


In [7]:
OMDB_API_KEY = '6fdfb278'

def fetch_omdb_data(title, year=None):
    """Fetch metadata from OMDb API for a given title (and optional year)."""
    base_url = 'http://www.omdbapi.com/'
    params = {
        'apikey': OMDB_API_KEY,
        't': title,
    }
    if year:
        params['y'] = year
    response = requests.get(base_url, params=params)
    if response.status_code == 200:
        return response.json()
    else:
        return None

## External Metadata Enrichment

In [11]:
# Path to save progress
progress_path = 'data/enriched_metadata_progress.csv'
DAILY_LIMIT = 1000

# Load already processed titles if file exists
if os.path.exists(progress_path):
    enriched_df = pd.read_csv(progress_path)
    processed_titles = set(enriched_df['title'])
else:
    enriched_df = pd.DataFrame()
    processed_titles = set()

# Filter unprocessed rows
to_process = df[~df['title'].isin(processed_titles)]

batch = to_process.head(DAILY_LIMIT)

enriched_data = []
for idx, row in batch.iterrows():
    omdb_info = fetch_omdb_data(row['title'], row.get('release_year'))
    if omdb_info and omdb_info.get('Response') == 'True':
        ratings = omdb_info.get('Ratings', [])
        imdb_rating = omdb_info.get('imdbRating')
        imdb_votes = omdb_info.get('imdbVotes')
        value=omdb_info.get('Value')
        metascore=omdb_info.get('Metascore')
        awards=omdb_info.get('Awards')
        language=omdb_info.get('Language')
        plot=omdb_info.get('Plot')
        poster=omdb_info.get('Poster')
    else:
        imdb_rating = None
        imdb_votes = None
    enriched_data.append({
        'title': row['title'],
        'ratings': ratings,
        'imdb_rating': imdb_rating,
        'imdb_votes': imdb_votes,
        'value': value,
        'metascore': metascore,
        'language': language,
        'short_plot': plot,
        'poster': poster,
    })
    time.sleep(0.2)  # Be polite to the API

# Convert to DataFrame and append to progress
batch_enriched_df = pd.DataFrame(enriched_data)
enriched_df = pd.concat([enriched_df, batch_enriched_df], ignore_index=True)
enriched_df.to_csv(progress_path, index=False)

# Merge with original for analysis
df_merged = df.merge(enriched_df, on='title', how='left')
df_merged.head()

ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

## Feature Engineering and Geographical Enrichment

In [12]:
# Create a flag for Netflix Originals (if 'Netflix' is in the director or production columns, adjust as needed)
df['is_original'] = df['director'].fillna('').str.contains('Netflix', case=False)

# Create a flag for kids content based on rating
def is_kids(rating):
    kids_ratings = ['G', 'TV-Y', 'TV-Y7', 'TV-G', 'TV-PG']
    return rating in kids_ratings
df['is_kids_content'] = df['rating'].apply(is_kids)

# Calculate content age (current year - release year)
current_year = pd.Timestamp.now().year
df['content_age'] = current_year - df['release_year']

# Group countries into regions
region_map = {
    'United States': 'North America',
    'Canada': 'North America',
    'India': 'Asia',
    'United Kingdom': 'Europe',
    'France': 'Europe',
    'Japan': 'Asia',
    'Australia': 'Oceania',
    'New Zealand': 'Oceania'
}
def map_region(country):
    if pd.isna(country):
        return None
    for key in region_map:
        if key in country:
            return region_map[key]
    return 'Other'
df['region'] = df['country'].apply(map_region)

df[['title', 'added_year', 'added_month', 'is_original', 'is_kids_content', 'content_age', 'region']].head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,...,short_plot_x,poster_x,ratings_y,imdb_rating_y,imdb_votes_y,value_y,metascore_y,language_y,short_plot_y,poster_y
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,No Data,United States,2021-09-25,2020,PG-13,90 min,...,A daughter helps her father prepare for the en...,https://m.media-amazon.com/images/M/MV5BOTQyN2...,"[{'Source': 'Internet Movie Database', 'Value'...",7.4,7464.0,,89.0,English,A daughter helps her father prepare for the en...,https://m.media-amazon.com/images/M/MV5BOTQyN2...
1,s2,TV Show,Blood & Water,No Data,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,2021-09-24,2021,TV-MA,2 Seasons,...,,https://m.media-amazon.com/images/M/MV5BNGE5YW...,[],,,,,,,https://m.media-amazon.com/images/M/MV5BNGE5YW...
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",United States,2021-09-24,2021,TV-MA,1 Season,...,"Mehdi, a qualified robber, and Liana, an appre...",https://m.media-amazon.com/images/M/MV5BYWJkOW...,"[{'Source': 'Internet Movie Database', 'Value'...",7.2,4880.0,,,French,"Mehdi, a qualified robber, and Liana, an appre...",https://m.media-amazon.com/images/M/MV5BYWJkOW...
3,s4,TV Show,Jailbirds New Orleans,No Data,No Data,United States,2021-09-24,2021,TV-MA,1 Season,...,"Feuds, flirtations and toilet talk go down amo...",https://m.media-amazon.com/images/M/MV5BNTI0OG...,"[{'Source': 'Internet Movie Database', 'Value'...",6.5,332.0,,,English,"Feuds, flirtations and toilet talk go down amo...",https://m.media-amazon.com/images/M/MV5BNTI0OG...
4,s5,TV Show,Kota Factory,No Data,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,2021-09-24,2021,TV-MA,2 Seasons,...,"Feuds, flirtations and toilet talk go down amo...",https://m.media-amazon.com/images/M/MV5BNTI0OG...,"[{'Source': 'Internet Movie Database', 'Value'...",,,,,English,"Feuds, flirtations and toilet talk go down amo...",https://m.media-amazon.com/images/M/MV5BNTI0OG...


## Text Enrichment

In [13]:
import nltk
from textblob import TextBlob
nltk.download('punkt')

# Extract keywords using TextBlob noun phrases
def extract_keywords(text):
    if pd.isna(text):
        return []
    blob = TextBlob(text)
    return blob.noun_phrases

df['description_keywords'] = df['description'].apply(extract_keywords)

# Calculate sentiment polarity and subjectivity
def get_sentiment(text):
    if pd.isna(text):
        return 0.0, 0.0
    blob = TextBlob(text)
    return blob.sentiment.polarity, blob.sentiment.subjectivity

df[['desc_polarity', 'desc_subjectivity']] = df['description'].apply(lambda x: pd.Series(get_sentiment(x)))

df[['title', 'description_keywords', 'desc_polarity', 'desc_subjectivity']].head()

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/VincentCai/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data]   Unzipping tokenizers/punkt.zip.


MissingCorpusError: 
Looks like you are missing some required data for this feature.

To download the necessary data, simply run

    python -m textblob.download_corpora

or use the NLTK downloader to download the missing data: http://nltk.org/data.html
If this doesn't fix the problem, file an issue at https://github.com/sloria/TextBlob/issues.


## Data Quality Improvements

In [None]:
# Standardize genre, country, and rating fields
if 'listed_in' in df.columns:
    df['genre_standardized'] = df['listed_in'].str.title().str.strip()
df['country_standardized'] = df['country'].str.title().str.strip()
df['rating_standardized'] = df['rating'].str.upper().str.strip()

# Fill missing values for key fields
df['director'] = df['director'].fillna('Unknown')
df['cast'] = df['cast'].fillna('Unknown')
df['country'] = df['country'].fillna('Unknown')
df['rating'] = df['rating'].fillna('Unknown')
df['description'] = df['description'].fillna('')

df.head()

## Save the Enriched Dataset

In [None]:
# Save the Enriched Dataset
df.to_csv('data/netflix_titles_enriched.csv', index=False)
print('Enriched dataset saved to data/netflix_titles_enriched.csv')

# Summary and Next Steps

This notebook has enriched the Netflix dataset with external metadata, engineered features, text analysis, and data quality improvements. The enriched dataset is now ready for advanced analytics and machine learning applications.

## Next Steps: Project 2 - Netflix Content Recommendation Systems
You can now move on to Project 2, where you will use this enriched dataset to build and evaluate content recommendation systems for Netflix. For more details and code, see the repository:

[Netflix Content Recommendation Systems (GitHub)](https://github.com/hongwei-cai/Netflix-Content-Recommendation-Systems)

Potential directions include:
- Building collaborative filtering or content-based recommenders
- Using enriched metadata for hybrid recommendation models
- Evaluating recommendation quality with user or content metrics
- Visualizing recommendations and user/content relationships