# Movie Recommendation System with TMDB 5000

**Course**: Final Project - Recommendation System  
**Description**: Build a movie recommender based on the TMDB 5000 dataset (`tmdb_5000_movies.csv`, `tmdb_5000_credits.csv`).

Main parts:
1. Data collection & loading
2. Data cleaning & preparation
3. Data analysis & visualization
4. Building the recommender (content-based)
5. Model evaluation (RMSE, MAE, Precision@K, Recall@K)
6. Recommendation interface in the notebook (enter a movie title to get recommendations)



In [None]:
# Import libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error

pd.set_option('display.max_colwidth', 200)
plt.style.use('seaborn-v0_8')

print("Libraries imported.")


In [None]:
# 1. Data collection & loading

movies_path = 'tmdb_5000_movies.csv'
credits_path = 'tmdb_5000_credits.csv'

movies = pd.read_csv(movies_path)
credits = pd.read_csv(credits_path)

print('movies shape:', movies.shape)
print('credits shape:', credits.shape)

movies.head(3)


In [None]:
# 2. Data cleaning & preparation

# Merge cast and crew from credits into movies (credits uses movie_id, movies uses id)
credits_renamed = credits.rename(columns={'movie_id': 'id'})
movies_merged = movies.merge(credits_renamed[['id', 'cast', 'crew']], on='id', how='left')

print('Shape after merge:', movies_merged.shape)

# Drop duplicate titles, then reset the index so that row labels equal row positions.
# The similarity matrix built later is positional, so the two must agree.
before_dups = movies_merged.shape[0]
movies_merged = movies_merged.drop_duplicates(subset=['title']).reset_index(drop=True)
after_dups = movies_merged.shape[0]
print(f'Rows after dropping duplicate titles: {before_dups} -> {after_dups}')

# Handle missing text values: replace with an empty string
text_cols = ['overview', 'tagline', 'cast', 'crew', 'keywords', 'genres']
for col in text_cols:
    if col in movies_merged.columns:
        movies_merged[col] = movies_merged[col].fillna('')

# Handle missing numeric values: replace with the column median
num_cols = ['vote_average', 'vote_count', 'popularity', 'runtime']
for col in num_cols:
    if col in movies_merged.columns:
        movies_merged[col] = movies_merged[col].fillna(movies_merged[col].median())

# Simple outlier handling: clip vote_count to the 1st-99th percentile range
low, high = movies_merged['vote_count'].quantile([0.01, 0.99])
movies_merged['vote_count_clipped'] = movies_merged['vote_count'].clip(lower=low, upper=high)

# Min-max scale selected numeric features to [0, 1]
scaler = MinMaxScaler()
movies_merged[['vote_avg_scaled', 'popularity_scaled', 'vote_count_scaled']] = scaler.fit_transform(
    movies_merged[['vote_average', 'popularity', 'vote_count_clipped']]
)

movies_merged[['title', 'vote_average', 'vote_avg_scaled']].head(3)
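
As a small illustration of the clipping and scaling step above, the sketch below applies the same two operations to a toy column. The values and the 25%/75% cut-offs are made up for readability (the notebook uses 1%/99%); only `quantile`, `clip`, and `MinMaxScaler` are taken from the cell above.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy data (hypothetical): one extreme value that would dominate a naive scaling
toy = pd.DataFrame({"vote_count": [0, 10, 20, 30, 10_000]})

# Clip to the chosen percentile range before scaling
low, high = toy["vote_count"].quantile([0.25, 0.75])
toy["clipped"] = toy["vote_count"].clip(lower=low, upper=high)

# Min-max scale the clipped column to [0, 1]
toy["scaled"] = MinMaxScaler().fit_transform(toy[["clipped"]]).ravel()

print(toy)
```

After clipping, the extreme value no longer stretches the scaled range, so the remaining values stay spread across [0, 1] instead of being squeezed near zero.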


## 3. Data Analysis & Visualization


In [None]:
### 3.1. Rating distribution (vote_average)

fig, ax = plt.subplots(1, 2, figsize=(14, 5))

# Histogram
ax[0].hist(movies_merged['vote_average'], bins=30, edgecolor='black', alpha=0.7)
ax[0].set_xlabel('Vote Average')
ax[0].set_ylabel('Frequency')
ax[0].set_title('Distribution of movie ratings')
ax[0].grid(alpha=0.3)

# Boxplot
ax[1].boxplot(movies_merged['vote_average'], vert=True)
ax[1].set_ylabel('Vote Average')
ax[1].set_title('Boxplot of Vote Average')
ax[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Mean vote: {movies_merged['vote_average'].mean():.2f}")
print(f"Median vote: {movies_merged['vote_average'].median():.2f}")
print(f"Std vote: {movies_merged['vote_average'].std():.2f}")


In [None]:
### 3.2. Genre frequency

import json
from collections import Counter

# Parse genre names from the JSON string stored in the `genres` column
def extract_genres(genres_str):
    try:
        genres_list = json.loads(genres_str)
        return [g['name'] for g in genres_list]
    except (TypeError, ValueError, KeyError):
        # Missing value, malformed JSON, or an entry without a 'name' key
        return []

movies_merged['genres_list'] = movies_merged['genres'].apply(extract_genres)

# Count how often each genre occurs
all_genres = []
for genres in movies_merged['genres_list']:
    all_genres.extend(genres)

genre_counts = Counter(all_genres)
top_genres = genre_counts.most_common(15)

# Draw a horizontal bar chart
genres_df = pd.DataFrame(top_genres, columns=['Genre', 'Count'])
plt.figure(figsize=(12, 6))
plt.barh(genres_df['Genre'], genres_df['Count'], color='steelblue')
plt.xlabel('Number of movies')
plt.ylabel('Genre')
plt.title('Top 15 most common genres')
plt.gca().invert_yaxis()
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Total number of distinct genres: {len(genre_counts)}")
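
The cell above parses the `genres` column with `json.loads`, so each value is expected to be a JSON array stored as a string. The sketch below shows the parsing and counting on a few hand-written strings. The sample strings are illustrative rather than copied from the dataset, and `genre_names` is a hypothetical stand-in for `extract_genres`.

```python
import json
from collections import Counter

def genre_names(genres_str):
    """Return the genre names in a JSON-array string, or [] if it cannot be parsed."""
    try:
        return [g["name"] for g in json.loads(genres_str)]
    except (TypeError, ValueError, KeyError):
        return []

# Hypothetical rows: two identical genre lists, one empty string, one single genre
rows = [
    '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
    '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]',
    "",
    '[{"id": 18, "name": "Drama"}]',
]

counts = Counter(name for row in rows for name in genre_names(row))
print(counts.most_common())
```

Catching specific exception types keeps unrelated errors, such as a typo inside the function, visible instead of silently turning them into empty lists.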


In [None]:
### 3.3. Top 10 highest-rated movies

top_rated = movies_merged.nlargest(10, 'vote_average')[['title', 'vote_average', 'vote_count', 'popularity']]
print(top_rated.to_string(index=False))


In [None]:
### 3.4. Correlation heatmap of numeric variables

corr_cols = ['vote_average', 'vote_count', 'popularity', 'runtime', 'budget', 'revenue']
corr_data = movies_merged[corr_cols].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr_data, annot=True, fmt='.2f', cmap='coolwarm', square=True, linewidths=0.5)
plt.title('Correlation heatmap of numeric variables')
plt.tight_layout()
plt.show()


## 4. Building the Movie Recommender (Content-Based Filtering)


In [None]:
### 4.1. Prepare features for the model

# Parse keywords into a space-separated string
def extract_keywords(keywords_str):
    try:
        keywords_list = json.loads(keywords_str)
        return ' '.join([k['name'] for k in keywords_list])
    except (TypeError, ValueError, KeyError):
        return ''

# Parse cast (first 5 actors); spaces are removed so each name becomes one token
def extract_cast(cast_str):
    try:
        cast_list = json.loads(cast_str)
        return ' '.join([c['name'].replace(' ', '') for c in cast_list[:5]])
    except (TypeError, ValueError, KeyError):
        return ''

# Parse the director's name from crew
def extract_director(crew_str):
    try:
        crew_list = json.loads(crew_str)
        for person in crew_list:
            if person.get('job') == 'Director':
                return person['name'].replace(' ', '')
        return ''
    except (TypeError, ValueError, KeyError, AttributeError):
        return ''

movies_merged['keywords_clean'] = movies_merged['keywords'].apply(extract_keywords)
movies_merged['cast_clean'] = movies_merged['cast'].apply(extract_cast)
movies_merged['director_clean'] = movies_merged['crew'].apply(extract_director)
movies_merged['genres_clean'] = movies_merged['genres_list'].apply(lambda x: ' '.join([g.replace(' ', '') for g in x]))

# Combine the features into a single string per movie
movies_merged['combined_features'] = (
    movies_merged['overview'].fillna('') + ' ' +
    movies_merged['genres_clean'] + ' ' +
    movies_merged['keywords_clean'] + ' ' +
    movies_merged['cast_clean'] + ' ' +
    movies_merged['director_clean']
)

print("Sample combined features:")
print(movies_merged[['title', 'combined_features']].head(2))


In [None]:
### 4.2. Vectorize with TF-IDF and compute cosine similarity

# Initialize the TF-IDF vectorizer
tfidf = TfidfVectorizer(
    max_features=5000,
    stop_words='english',
    ngram_range=(1, 2)
)

# Fit and transform
tfidf_matrix = tfidf.fit_transform(movies_merged['combined_features'])

print(f"TF-IDF matrix shape: {tfidf_matrix.shape}")

# Compute pairwise cosine similarity between all movies
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
print(f"Cosine similarity matrix shape: {cosine_sim.shape}")

# Map each title to its row position (cosine_sim is indexed by position, not by label).
# Titles are already unique after the deduplication step.
indices = pd.Series(range(len(movies_merged)), index=movies_merged['title'])

print(f"\nNumber of movies in the system: {len(indices)}")
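
To make the TF-IDF and cosine-similarity step more concrete, here is a minimal sketch on three made-up feature strings. The documents are hypothetical, and the vectorizer uses default settings rather than the `max_features`, `stop_words`, and `ngram_range` values chosen above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical "combined_features" strings for three movies
docs = [
    "space marine alien planet",
    "alien planet war space",
    "romantic comedy wedding",
]

# Each row of `matrix` is the TF-IDF vector of one document
matrix = TfidfVectorizer().fit_transform(docs)

# sim[i, j] is the cosine similarity between documents i and j
sim = cosine_similarity(matrix)

print(sim.round(2))
```

The first two documents share three terms and therefore score well above zero, while the third shares no term with either and scores zero against both. Each document has similarity 1 with itself, which is why the recommender has to exclude the query movie from its own results.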


In [None]:
### 4.3. Recommendation function

def get_recommendations(title, top_n=10):
    """
    Return the top N movies most similar to the movie with the given title.
    Returns an error message string if the title is not in the dataset.
    """
    try:
        # Look up the row position of the movie
        idx = indices[title]

        # Pair every other movie's position with its similarity to the query movie
        sim_scores = [(i, s) for i, s in enumerate(cosine_sim[idx]) if i != idx]

        # Sort by similarity (descending) and keep the top N
        sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[:top_n]

        # Row positions of the recommended movies
        movie_indices = [i[0] for i in sim_scores]

        # Return the movie details (copy to avoid modifying a slice of movies_merged)
        result = movies_merged.iloc[movie_indices][['title', 'vote_average', 'vote_count', 'genres_clean', 'overview']].copy()
        result['similarity_score'] = [score[1] for score in sim_scores]

        return result
    except KeyError:
        return f"Movie '{title}' does not exist in the database."

# Try the recommender
print("=== Movies similar to 'Avatar' ===")
recommendations = get_recommendations('Avatar', top_n=5)
if isinstance(recommendations, str):
    print(recommendations)
else:
    print(recommendations[['title', 'similarity_score', 'vote_average']].to_string(index=False))
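
A vectorized alternative to sorting a Python list of `(index, score)` pairs is `np.argsort` on the similarity row. The sketch below, with a hypothetical helper `top_n_similar` and a made-up similarity row, removes the query by its index instead of by its rank. This stays correct when another movie ties with the query at similarity 1.0.

```python
import numpy as np

def top_n_similar(sim_row, query_idx, top_n=3):
    """Return the positions of the top_n most similar items, excluding the query itself."""
    # Negate so that argsort (ascending) yields descending similarity; stable keeps tie order
    order = np.argsort(-sim_row, kind="stable")
    # Drop the query by index, not by assuming it is ranked first
    order = order[order != query_idx]
    return order[:top_n].tolist()

# Hypothetical similarity row for the movie at position 2;
# position 0 has identical features, so it also scores 1.0
sim_row = np.array([1.0, 0.2, 1.0, 0.7, 0.0])

print(top_n_similar(sim_row, query_idx=2, top_n=2))
```

Here, dropping the first-ranked entry would remove position 0 and keep the query (position 2) in the results, whereas filtering by index returns positions 0 and 3.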


## 5. Model Evaluation


In [None]:
### 5.1. RMSE and MAE

# For content-based filtering, we evaluate as follows:
# - Predict a movie's rating as the similarity-weighted average rating of its most similar movies
# - Compare the prediction with the movie's actual vote_average

def predict_rating(title, top_n=10):
    """Predict a movie's rating from the ratings of its most similar movies."""
    try:
        idx = indices[title]
    except KeyError:
        return None

    # Top N neighbours by similarity, excluding the movie itself
    sim_scores = [(i, s) for i, s in enumerate(cosine_sim[idx]) if i != idx]
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[:top_n]

    # Predicted rating = similarity-weighted average of the neighbours' ratings
    total_sim = sum(score for _, score in sim_scores)
    if total_sim == 0:
        # No similar movie: skip instead of returning the true rating,
        # which would count as a perfect prediction and bias the error downwards
        return None

    weighted_rating = sum(
        movies_merged.iloc[i]['vote_average'] * score
        for i, score in sim_scores
    ) / total_sim

    return weighted_rating

# Evaluation sample: up to 100 movies with vote_count > 100
eligible = movies_merged[movies_merged['vote_count'] > 100]
sample_movies = eligible.sample(min(100, len(eligible)), random_state=42)

y_true = []
y_pred = []

for title, true_rating in zip(sample_movies['title'], sample_movies['vote_average']):
    pred_rating = predict_rating(title, top_n=10)

    if pred_rating is not None:
        y_true.append(true_rating)
        y_pred.append(pred_rating)

# Compute RMSE and MAE
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
mae = mean_absolute_error(y_true, y_pred)

print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
print(f"Mean Absolute Error (MAE): {mae:.4f}")
print(f"\nNumber of movies evaluated: {len(y_true)}")
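
The rating prediction above is a similarity-weighted average, and RMSE and MAE summarize how far such predictions fall from the true ratings. A worked toy example, with made-up ratings and similarities and a hypothetical helper `weighted_rating`, might look like this:

```python
import numpy as np

def weighted_rating(ratings, sims):
    """Similarity-weighted average of neighbour ratings, or None if all weights are zero."""
    ratings = np.asarray(ratings, dtype=float)
    sims = np.asarray(sims, dtype=float)
    total = sims.sum()
    if total == 0:
        return None
    return float(np.dot(ratings, sims) / total)

# Two neighbours rated 8.0 and 6.0 with similarities 0.75 and 0.25:
# (8.0 * 0.75 + 6.0 * 0.25) / (0.75 + 0.25) = 7.5
pred = weighted_rating([8.0, 6.0], [0.75, 0.25])

# Hypothetical true and predicted ratings for two movies
y_true = np.array([7.0, 6.0])
y_pred = np.array([pred, 5.0])

errors = y_pred - y_true
rmse = float(np.sqrt(np.mean(errors ** 2)))  # penalizes large errors more strongly
mae = float(np.mean(np.abs(errors)))         # average absolute deviation

print(pred, rmse, mae)
```

Because RMSE squares each error before averaging, it is never smaller than MAE. A wide gap between the two suggests that a few predictions are far off.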


In [None]:
### 5.2. Precision@K and Recall@K

# Assumption: a movie counts as "relevant" if its rating is >= threshold and it shares a genre with the query movie
def evaluate_precision_recall(test_movies, K=10, rating_threshold=7.0):
    """
    Compute mean Precision@K and Recall@K over the test movies.
    - Relevant items: movies rated >= threshold that share at least one genre with the query movie
    """
    precisions = []
    recalls = []

    for title in test_movies['title']:
        movie_info = movies_merged[movies_merged['title'] == title].iloc[0]
        movie_genres = set(movie_info['genres_list'])

        # Get the recommendations
        recs = get_recommendations(title, top_n=K)

        if isinstance(recs, str):  # Title not found
            continue

        # Find the relevant items across the whole dataset
        relevant_items = movies_merged[
            (movies_merged['vote_average'] >= rating_threshold) &
            (movies_merged['title'] != title) &
            (movies_merged['genres_list'].apply(lambda x: len(set(x) & movie_genres) > 0))
        ]['title'].tolist()

        if len(relevant_items) == 0:
            continue

        # Count the recommended items that are relevant
        recommended_titles = recs['title'].tolist()
        relevant_recommended = set(recommended_titles) & set(relevant_items)

        # Precision@K = (relevant items in top K) / K
        precision = len(relevant_recommended) / K

        # Recall@K = (relevant items in top K) / (total relevant items)
        recall = len(relevant_recommended) / len(relevant_items)

        precisions.append(precision)
        recalls.append(recall)

    return np.mean(precisions), np.mean(recalls)

# Evaluate on a sample of up to 50 movies with vote_count > 50
eligible_test = movies_merged[movies_merged['vote_count'] > 50]
test_sample = eligible_test.sample(min(50, len(eligible_test)), random_state=42)
precision_at_10, recall_at_10 = evaluate_precision_recall(test_sample, K=10, rating_threshold=7.0)

print(f"Precision@10: {precision_at_10:.4f}")
print(f"Recall@10: {recall_at_10:.4f}")
print(f"\nNumber of test movies: {len(test_sample)}")
print("Rating threshold for relevant items: 7.0")
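
For a single query, Precision@K and Recall@K reduce to simple set arithmetic. The sketch below uses a hypothetical helper `precision_recall_at_k` and made-up titles to show the two ratios side by side.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Return (Precision@K, Recall@K) for one ranked list and one set of relevant items."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical ranked recommendations and relevant set
recommended = ["A", "B", "C", "D"]
relevant = {"B", "D", "E"}

# 2 of the 4 recommendations are relevant, and 2 of the 3 relevant items were found
p, r = precision_recall_at_k(recommended, relevant, k=4)
print(p, r)
```

With the relevance definition used in this notebook (any movie above the rating threshold that shares a genre), the relevant set can be much larger than K. Recall@K is then capped at K divided by the size of that set, so a low recall is partly a property of the definition rather than of the recommender.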


In [None]:
### 5.3. Summary of evaluation results

print("=" * 60)
print("MODEL EVALUATION SUMMARY")
print("=" * 60)
print(f"✓ Root Mean Squared Error (RMSE): {rmse:.4f}")
print(f"✓ Mean Absolute Error (MAE): {mae:.4f}")
print(f"✓ Precision@10: {precision_at_10:.4f}")
print(f"✓ Recall@10: {recall_at_10:.4f}")
print("=" * 60)
print("\nThe content-based filtering model was built successfully!")
print(f"Dataset: {len(movies_merged)} movies from TMDB 5000")
print("Features: overview, genres, keywords, cast, director")
print("Vectorization: TF-IDF (max_features=5000, ngram_range=(1,2))")


In [None]:
## 6. Save the model and data for deployment

import pickle

# Save the objects the app needs
data_to_save = {
    'movies_data': movies_merged[['id', 'title', 'vote_average', 'vote_count', 'popularity', 
                                   'genres_clean', 'overview', 'release_date', 'runtime']],
    'cosine_sim': cosine_sim,
    'indices': indices
}

with open('movie_recommender_model.pkl', 'wb') as f:
    pickle.dump(data_to_save, f)

print("✓ Saved the model and data to 'movie_recommender_model.pkl'")
print(f"✓ File size: {os.path.getsize('movie_recommender_model.pkl') / (1024*1024):.2f} MB")
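
The saved file can later be loaded with `pickle.load` and queried without recomputing the similarity matrix. The sketch below shows one way this could work on a tiny stand-in bundle with the same three keys. The toy data, the temporary file path, and the `recommend` helper are assumptions for illustration and are not taken from `app.py`.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd

# Toy bundle with the same keys as data_to_save (values are made up)
bundle = {
    "movies_data": pd.DataFrame({"title": ["A", "B", "C"]}),
    "cosine_sim": np.array([[1.0, 0.9, 0.1],
                            [0.9, 1.0, 0.2],
                            [0.1, 0.2, 1.0]]),
    "indices": pd.Series([0, 1, 2], index=["A", "B", "C"]),
}

# Round trip through a temporary file
path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")
with open(path, "wb") as f:
    pickle.dump(bundle, f)
with open(path, "rb") as f:
    loaded = pickle.load(f)

def recommend(title, top_n=2):
    """Return the top_n most similar titles using only the loaded bundle."""
    idx = loaded["indices"][title]
    order = np.argsort(-loaded["cosine_sim"][idx], kind="stable")
    order = order[order != idx][:top_n]
    return loaded["movies_data"]["title"].iloc[order].tolist()

print(recommend("A"))
```

Unpickling executes code embedded in the file, so only load pickle files that you created yourself or obtained from a trusted source.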


## 7. Using the Web App

To run the Streamlit web interface:

1. **Make sure you have run all the cells above** to create the file `movie_recommender_model.pkl`

2. **Open a terminal/command prompt** and run:
   ```
   streamlit run app.py
   ```

3. **The browser will open automatically** at `http://localhost:8501`

4. **Use the app**:
   - Choose a favorite movie from the dropdown
   - Adjust the number of recommendations
   - Click the "Tìm phim tương tự" (Find similar movies) button
   - View the results and download them as CSV if needed

---

### 🎉 Done!

The project includes:
- ✅ Full analysis notebook (`tmdb_recommender.ipynb`)
- ✅ Web App built with Streamlit (`app.py`)
- ✅ Requirements file (`requirements.txt`)
- ✅ Detailed guide (`README.md`)
- ✅ Saved model (`movie_recommender_model.pkl`)

Good luck with your Final Project! 🚀
