Let's walk through building a movie recommendation system at "StreamFlix," a fictional streaming service where you've just been hired as a junior data scientist. Your first project is to implement several recommendation algorithms with the Surprise library and compare how well each performs for StreamFlix's user base.

#### The Business Challenge ####

StreamFlix has a growing catalog of movies but struggles with user engagement. Customer research shows that users often spend more time searching for content than watching it, a problem known in the industry as "choice paralysis." Your manager has asked you to develop a recommendation system that helps users discover movies they'll enjoy, with the goal of increasing watch time and reducing churn.

##### Setting Up the Environment #####

In [None]:
# Install necessary packages
!pip install numpy pandas scikit-surprise

# Import libraries
import numpy as np
import pandas as pd
from surprise import Dataset, Reader, KNNBasic, SVD, accuracy
from surprise.model_selection import train_test_split, cross_validate

##### Load and Prepare the Data #####

In [None]:
# Load the dataset
ratings = pd.read_csv('streamflix_ratings.csv')
print(ratings.head())

# Output:
#    user_id  movie_id  rating  timestamp
# 0        1      1193     5.0  978300760
# 1        1       661     3.0  978302109
# 2        1       914     3.0  978301968
# 3        1      3408     4.0  978300275
# 4        1      2355     5.0  978824291

In [None]:
# Convert the data into a format Surprise can use
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings[['user_id', 'movie_id', 'rating']], reader)

# Hold out 25% of the ratings for testing; fix the seed so the split is reproducible
trainset, testset = train_test_split(data, test_size=0.25, random_state=42)

##### Collaborative Filtering #####

*User-based collaborative filtering*

In [None]:
# User-based collaborative filtering
user_based = KNNBasic(k=50, sim_options={'name': 'pearson', 'user_based': True})
user_based.fit(trainset)

# Make predictions on the test set
user_based_predictions = user_based.test(testset)

# Evaluate the performance
user_accuracy = accuracy.rmse(user_based_predictions)
print(f"User-based CF RMSE: {user_accuracy}")

*Item-based collaborative filtering*

In [None]:
# Item-based collaborative filtering
item_based = KNNBasic(k=50, sim_options={'name': 'pearson', 'user_based': False})
item_based.fit(trainset)

# Make predictions on the test set
item_based_predictions = item_based.test(testset)

# Evaluate the performance
item_accuracy = accuracy.rmse(item_based_predictions)
print(f"Item-based CF RMSE: {item_accuracy}")
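
Both KNN models above are configured with `'name': 'pearson'`, so neighbours are chosen by Pearson correlation. The following is a minimal, standard-library sketch of that similarity for two users, assuming the means are taken over co-rated movies only. The dictionary layout and the names `alice`, `bob`, and `carol` are hypothetical and are not part of the StreamFlix data.

```python
from math import sqrt

def pearson_similarity(ratings_a, ratings_b):
    """Pearson correlation over the items both users rated.

    ratings_a, ratings_b: dicts mapping item id -> rating (hypothetical layout).
    Returns 0.0 when there are fewer than two common items or no variance.
    """
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return 0.0
    mean_a = sum(ratings_a[i] for i in common) / len(common)
    mean_b = sum(ratings_b[i] for i in common) / len(common)
    cov = sum((ratings_a[i] - mean_a) * (ratings_b[i] - mean_b) for i in common)
    var_a = sum((ratings_a[i] - mean_a) ** 2 for i in common)
    var_b = sum((ratings_b[i] - mean_b) ** 2 for i in common)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)

# Toy ratings: bob rates everything one star lower than alice,
# carol's tastes run in the opposite direction.
alice = {"m1": 5, "m2": 3, "m3": 4}
bob = {"m1": 4, "m2": 2, "m3": 3, "m4": 5}
carol = {"m1": 1, "m2": 5, "m3": 2}

sim_ab = pearson_similarity(alice, bob)
sim_ac = pearson_similarity(alice, carol)
print(f"alice vs bob:   {sim_ab:.3f}")
print(f"alice vs carol: {sim_ac:.3f}")
```

Because Pearson correlation subtracts each user's mean, a consistently harsher rater such as `bob` can still be a perfect neighbour for `alice`. The item-based variant applies the same formula to pairs of movies instead of pairs of users.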

##### Matrix Factorization #####

In [None]:
# Matrix Factorization using SVD
svd_model = SVD(n_factors=100, n_epochs=20, lr_all=0.005, reg_all=0.02)
svd_model.fit(trainset)

# Make predictions on the test set
svd_predictions = svd_model.test(testset)

# Evaluate the performance
svd_accuracy = accuracy.rmse(svd_predictions)
print(f"Matrix Factorization RMSE: {svd_accuracy}")
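
To make the `lr_all` and `reg_all` arguments less opaque, here is a small sketch of a single stochastic-gradient update for a biased matrix-factorization model, where a rating is predicted as the global mean plus a user bias, an item bias, and the dot product of the user and item factor vectors. It is a hand-written illustration with toy numbers, not Surprise's internal code; the function names `predict_rating` and `sgd_step` are hypothetical.

```python
def predict_rating(mu, b_u, b_i, p_u, q_i):
    """Predicted rating: global mean + user bias + item bias + p_u . q_i."""
    return mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))

def sgd_step(r_ui, mu, b_u, b_i, p_u, q_i, lr=0.005, reg=0.02):
    """One regularized SGD update for a single observed rating r_ui."""
    err = r_ui - predict_rating(mu, b_u, b_i, p_u, q_i)
    new_b_u = b_u + lr * (err - reg * b_u)
    new_b_i = b_i + lr * (err - reg * b_i)
    # Both factor updates use the *old* values of the other vector.
    new_p_u = [p + lr * (err * q - reg * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr * (err * p - reg * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p_u, new_q_i

# Toy state with two latent factors (values chosen only for illustration).
mu, b_u, b_i = 3.5, 0.0, 0.0
p_u, q_i = [0.1, -0.2], [0.3, 0.1]
r_ui = 5.0

err_before = r_ui - predict_rating(mu, b_u, b_i, p_u, q_i)
b_u, b_i, p_u, q_i = sgd_step(r_ui, mu, b_u, b_i, p_u, q_i)
err_after = r_ui - predict_rating(mu, b_u, b_i, p_u, q_i)
print(f"error before: {err_before:.4f}, after one step: {err_after:.4f}")
```

A larger learning rate moves the parameters further per rating, while a larger regularization term pulls biases and factors back toward zero; `n_epochs` controls how many passes over the training ratings repeat this step.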

##### Content-Based Filtering with Movie Features #####

In [None]:
# Load movie features
movie_features = pd.read_csv('movie_features.csv')

# Create a custom content-based algorithm
from surprise import AlgoBase  # numpy is already imported as np above

class ContentBasedAlgorithm(AlgoBase):
    def __init__(self, movie_features_df):
        AlgoBase.__init__(self)
        self.movie_features = {}
        
        # Convert movie features to a dictionary for faster lookup
        for _, row in movie_features_df.iterrows():
            movie_id = row['movie_id']
            # Convert genre features to a float numpy array so later NumPy
            # math does not operate on object-dtype values
            features = row[['action', 'adventure', 'comedy', 'drama', 'fantasy', 
                            'horror', 'romance', 'sci_fi', 'thriller']].values.astype(float)
            self.movie_features[movie_id] = features
    
    def fit(self, trainset):
        AlgoBase.fit(self, trainset)
        
        # Create user profiles based on their rated movies
        self.user_profiles = {}
        for u in trainset.all_users():
            # Get all movies this user has rated
            items_rated = [j for (j, _) in trainset.ur[u]]
            
            if not items_rated:
                # Handle users with no ratings
                self.user_profiles[u] = np.zeros(9)  # 9 genre features
                continue
                
            # Average the features of all movies this user has rated
            features = [self.movie_features.get(self.trainset.to_raw_iid(j), np.zeros(9)) 
                       for j in items_rated if self.trainset.to_raw_iid(j) in self.movie_features]
            
            if features:
                self.user_profiles[u] = np.mean(features, axis=0)
            else:
                self.user_profiles[u] = np.zeros(9)
        
        return self
    
    def estimate(self, u, i):
        if not (self.trainset.knows_user(u) and self.trainset.knows_item(i)):
            return self.trainset.global_mean
        
        # Get the raw movie id
        raw_movie_id = self.trainset.to_raw_iid(i)
        
        # If we don't have features for this movie, return global mean
        if raw_movie_id not in self.movie_features:
            return self.trainset.global_mean
        
        # Calculate cosine similarity between user profile and movie features
        user_vector = self.user_profiles[u]
        movie_vector = self.movie_features[raw_movie_id]
        
        # Avoid division by zero
        if np.all(user_vector == 0) or np.all(movie_vector == 0):
            return self.trainset.global_mean
        
        cos_sim = np.dot(user_vector, movie_vector) / (np.linalg.norm(user_vector) * np.linalg.norm(movie_vector))
        
        # Convert similarity to the rating scale (1-5).
        # Assumption: genre features are non-negative (e.g. 0/1 flags), so
        # cos_sim lies in [0, 1]; map that range linearly onto [1, 5].
        predicted_rating = 1 + 4 * cos_sim
        
        return predicted_rating

# Create an instance of our content-based algorithm
cb_algo = ContentBasedAlgorithm(movie_features)
cb_algo.fit(trainset)

# Make predictions
cb_predictions = cb_algo.test(testset)

# Evaluate
cb_accuracy = accuracy.rmse(cb_predictions)
print(f"Content-based filtering RMSE: {cb_accuracy}")
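
The core of `ContentBasedAlgorithm` is two small operations: averaging the genre vectors of the movies a user has rated, then comparing that profile to a candidate movie with cosine similarity. The sketch below reproduces just those two steps with the standard library. The three-genre list and the toy vectors are hypothetical stand-ins for the nine genre columns used above.

```python
from math import sqrt

GENRES = ["action", "comedy", "drama"]  # shortened, hypothetical genre list

def mean_vector(vectors):
    """Element-wise mean of equally sized vectors (the user profile)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

# A user who rated one pure action movie and one action-comedy.
rated_movie_vectors = [[1, 0, 0], [1, 1, 0]]
profile = mean_vector(rated_movie_vectors)

sim_action_comedy = cosine_similarity(profile, [1, 1, 0])
sim_drama = cosine_similarity(profile, [0, 0, 1])
print(dict(zip(GENRES, profile)))
print(f"action-comedy: {sim_action_comedy:.3f}, drama: {sim_drama:.3f}")
```

Note that the profile is an unweighted average: a movie the user rated 1 star contributes as much as one rated 5 stars. Weighting each genre vector by the user's rating is one possible refinement.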

##### Building a Hybrid Recommendation System #####

In [None]:
# Simple weighted hybrid: Combine SVD and content-based predictions
class HybridRecommender:
    def __init__(self, cf_algo, cb_algo, cf_weight=0.7):
        self.cf_algo = cf_algo  # Collaborative filtering algorithm
        self.cb_algo = cb_algo  # Content-based algorithm
        self.cf_weight = cf_weight
    
    def predict(self, user_id, movie_id):
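        # Get predictions from both algorithms.
        # Surprise's predict() normally handles unknown users/items itself,
        # so these fallbacks only guard against unexpected errors.
        try:
            cf_pred = self.cf_algo.predict(user_id, movie_id).est
        except Exception:
            cf_pred = 3.0  # Default (scale midpoint) if prediction fails
        
        try:
            cb_pred = self.cb_algo.predict(user_id, movie_id).est
        except Exception:
            cb_pred = 3.0  # Default (scale midpoint) if prediction fails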
        
        # Combine predictions with weighted average
        hybrid_pred = (self.cf_weight * cf_pred) + ((1 - self.cf_weight) * cb_pred)
        
        return hybrid_pred

# Create the hybrid recommender
hybrid = HybridRecommender(svd_model, cb_algo, cf_weight=0.7)

# Evaluate on testset
hybrid_predictions = []
for uid, iid, true_r in testset:
    pred = hybrid.predict(uid, iid)
    hybrid_predictions.append((uid, iid, true_r, pred))

# Calculate RMSE manually
def calculate_rmse(predictions):
    sum_squared_diff = 0
    for _, _, true_r, pred_r in predictions:
        sum_squared_diff += (true_r - pred_r) ** 2
    return np.sqrt(sum_squared_diff / len(predictions))

hybrid_rmse = calculate_rmse(hybrid_predictions)
print(f"Hybrid recommender RMSE: {hybrid_rmse}")
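
The hybrid above fixes `cf_weight` at 0.7. One simple way to choose the weight is to sweep a grid of candidates and keep the one with the lowest RMSE. The sketch below shows the idea on hand-made numbers; the `rows` data and the helper `best_cf_weight` are hypothetical, and in practice the sweep should run on a separate validation split rather than the test set used for the final comparison.

```python
from math import sqrt

def rmse(pairs):
    """Root mean squared error over (true_rating, predicted_rating) pairs."""
    return sqrt(sum((t - p) ** 2 for t, p in pairs) / len(pairs))

def best_cf_weight(rows, weights):
    """Return (rmse, weight) for the blend weight with the lowest RMSE.

    rows: (true_rating, cf_prediction, cb_prediction) tuples.
    """
    scored = []
    for w in weights:
        blended = [(t, w * cf + (1 - w) * cb) for t, cf, cb in rows]
        scored.append((rmse(blended), w))
    return min(scored)

# Toy validation rows: (true rating, collaborative estimate, content estimate).
rows = [
    (5.0, 4.0, 5.0),
    (3.0, 3.0, 4.0),
    (1.0, 2.0, 2.5),
    (4.0, 4.5, 3.0),
]
weights = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0

best_rmse, best_w = best_cf_weight(rows, weights)
print(f"best cf_weight: {best_w:.1f} (RMSE {best_rmse:.3f})")
```

Because the grid includes 0.0 and 1.0, the selected blend can never score worse on the sweep data than either model alone.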

##### Implementing Top-N Recommendations #####

In [None]:
# Function to get top N recommendations for a user
def get_top_n_recommendations(user_id, n=10, algorithm=svd_model, movies_df=None):
    """
    Generate top N movie recommendations for a specific user
    
    Parameters:
    user_id: The user ID
    n: Number of recommendations
    algorithm: Trained recommendation algorithm
    movies_df: DataFrame with movie information
    
    Returns:
    List of (movie_id, predicted_rating) tuples, or
    (title, predicted_rating) tuples when movies_df is provided
    """
    # Get a list of all movies
    all_movie_ids = ratings['movie_id'].unique()
    
    # Get movies this user has already rated
    user_ratings = ratings[ratings['user_id'] == user_id]['movie_id'].unique()
    
    # Movies the user hasn't rated
    movies_to_predict = np.setdiff1d(all_movie_ids, user_ratings)
    
    # Predict ratings for all unrated movies
    predictions = [algorithm.predict(user_id, movie_id) for movie_id in movies_to_predict]
    
    # Sort predictions by estimated rating
    predictions.sort(key=lambda x: x.est, reverse=True)
    
    # Take top N
    top_n = predictions[:n]
    
    if movies_df is not None:
        # Return movie information along with predictions
        result = []
        for pred in top_n:
            movie_info = movies_df[movies_df['movie_id'] == pred.iid]
            if not movie_info.empty:
                result.append((movie_info['title'].values[0], pred.est))
        return result
    else:
        # Just return movie IDs and predicted ratings
        return [(pred.iid, pred.est) for pred in top_n]

# Load movie information
movies = pd.read_csv('movies.csv')

# Get recommendations for a specific user
user_recommendations = get_top_n_recommendations(
    user_id=42, n=5, algorithm=svd_model, movies_df=movies)

print("Top 5 recommendations for user 42:")
for title, est in user_recommendations:
    print(f"{title} (predicted rating: {est:.2f})")
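
`get_top_n_recommendations` sorts every prediction and then keeps the first `n`. When the catalog is large and `n` is small, a partial selection does less work. The sketch below uses `heapq.nlargest` from the standard library on plain `(movie_id, estimated_rating)` pairs; the helper name `top_n` and the candidate data are hypothetical.

```python
import heapq

def top_n(scored_items, n=10):
    """Return the n pairs with the highest estimated rating, best first.

    scored_items: iterable of (item_id, estimated_rating) pairs.
    """
    return heapq.nlargest(n, scored_items, key=lambda pair: pair[1])

candidates = [(101, 3.9), (102, 4.7), (103, 4.1), (104, 2.5), (105, 4.4)]
top3 = top_n(candidates, n=3)
print(top3)
```

With Surprise `Prediction` objects, the same call works with `key=lambda pred: pred.est`.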

##### Metrics to Evaluate Impact #####

In [None]:
# Evaluate the business impact using historical data
def estimate_business_impact(algorithm, holdout_data):
    """
    Estimate business impact of recommendations
    
    Parameters:
    algorithm: Trained recommendation algorithm
    holdout_data: Data held out for testing
    
    Returns:
    Dictionary with impact metrics
    """
    # Predict ratings for holdout data
    predictions = algorithm.test(holdout_data)
    
    # Calculate accuracy
    rmse = accuracy.rmse(predictions)
    
    # Estimate potential improvements
    # Assume that a predicted rating of 4.0 or higher would lead to a watch
    potential_watches = sum(1 for pred in predictions if pred.est >= 4.0)
    watch_rate = potential_watches / len(predictions)
    
    # Assuming average watch time of 100 minutes
    additional_watch_minutes = potential_watches * 100
    
    # Rough dollar proxy for reduced churn: assume a $10 monthly subscription
    # feels "worth it" at 30 hours of viewing per month, and value the
    # additional watch minutes at that rate
    monthly_minutes = 30 * 60  # 30 hours in minutes
    subscription_value = 10  # $10 per month
    churn_reduction_estimate = additional_watch_minutes / monthly_minutes * subscription_value
    
    return {
        'rmse': rmse,
        'potential_watches': potential_watches,
        'watch_rate': watch_rate,
        'additional_watch_minutes': additional_watch_minutes,
        'estimated_churn_reduction_value': churn_reduction_estimate
    }

# Calculate impact metrics for the SVD model
impact = estimate_business_impact(svd_model, testset)
print("Estimated Business Impact:")
for metric, value in impact.items():
    print(f"{metric}: {value}")
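
The watch-rate estimate above counts every prediction of 4.0 or higher as a watch, without checking whether the user actually rated that movie highly. A complementary check is precision at k: of the top-k movies the model would recommend to each user, what fraction did the user really like? The following is a minimal sketch on plain tuples. The function name, the tuple layout, and the toy data are assumptions for illustration; with Surprise predictions you would feed in `(pred.uid, pred.r_ui, pred.est)`.

```python
from collections import defaultdict

def precision_at_k(predictions, k=10, threshold=4.0):
    """Per-user precision@k.

    predictions: iterable of (user_id, true_rating, estimated_rating).
    An item counts as recommended if it is in the user's top k by estimate
    and its estimate is >= threshold; it is a hit if the true rating is
    also >= threshold.
    """
    per_user = defaultdict(list)
    for uid, true_r, est in predictions:
        per_user[uid].append((est, true_r))

    precisions = {}
    for uid, rated in per_user.items():
        rated.sort(key=lambda pair: pair[0], reverse=True)
        recommended = [(est, true_r) for est, true_r in rated[:k] if est >= threshold]
        hits = sum(1 for _, true_r in recommended if true_r >= threshold)
        precisions[uid] = hits / len(recommended) if recommended else 0.0
    return precisions

toy_predictions = [
    (1, 5.0, 4.8),  # recommended and liked
    (1, 2.0, 4.5),  # recommended but disliked
    (1, 4.0, 4.1),  # recommended and liked
    (1, 5.0, 3.0),  # liked, but outside the top 3
    (2, 3.0, 3.5),  # nothing clears the threshold for user 2
    (2, 4.0, 3.9),
]
precisions = precision_at_k(toy_predictions, k=3, threshold=4.0)
mean_precision = sum(precisions.values()) / len(precisions)
print(precisions, f"mean: {mean_precision:.3f}")
```

Multiplying `watch_rate` by a precision figure like this would give a more conservative estimate of the watches the recommendations might drive.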

##### Summary and Recommendations #####

1. The Matrix Factorization (SVD) model performed best in terms of prediction accuracy with an RMSE of [value from your analysis].
2. The hybrid approach combining collaborative and content-based filtering achieved an RMSE of [value from your analysis]. Its content-based component is a natural starting point for the cold-start problem, although the implementation above still falls back to the global mean for users or movies that are absent from the training set.
3. Under the assumptions in `estimate_business_impact`, the recommendation system could add approximately [value] minutes of watch time and potentially reduce churn, translating to an estimated [value] in preserved subscription revenue.
4. For immediate implementation, we recommend deploying the SVD model for established users and the content-based approach for new users.
5. In the next phase, we suggest developing a more sophisticated hybrid system that also incorporates contextual information such as time of day and device type.