### 1: We have several libraries and tools that are going to used such as advanced CF algorithims (SVD).

### Dataset, Reader, SVD, train_test_split, accuracy, GridSearchCV are going to be used for building and evaluating the matrix factorization component.

### TF-IDF and cosine_similarity, will be used to build the Content-Based Filtering component based on movie genres.Lastly,the final model will be serialized for deployment.

In [31]:
#Import necessary libraries
!pip install scikit-surprise
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from surprise.model_selection import GridSearchCV
from sklearn.metrics.pairwise import cosine_similarity
from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split
from surprise import accuracy
import pickle



### 2: Loading Data: The ratings.csv and movies.csv files are loaded into Pandas DataFrames.
### Shape Inspection: The size of the raw datasets is printed, confirming we are starting with 100,836 ratings and 9,742 movies.
### Merging: The ratings data is merged with the necessary title and genres information from the movies data using movieId as the common key. This creates a single DataFrame (ratings_merged) containing all user-item-rating details along with the movie's descriptive features.
### Preview: The first 10 rows of the merged data are displayed for a quick verification of the merge operation.

In [32]:
#Load Csv files
ratings=pd.read_csv('ratings.csv')
movies=pd.read_csv('movies.csv')
print('Ratings:', ratings.shape)
print('Movies:', movies.shape)
#Merging ratings with Movies
ratings_merged= ratings.merge(movies[['movieId', 'title', 'genres']], on='movieId', how='left')
ratings_merged.head(10)

Ratings: (100836, 4)
Movies: (9742, 3)


Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,1,3,4.0,964981247,Grumpier Old Men (1995),Comedy|Romance
2,1,6,4.0,964982224,Heat (1995),Action|Crime|Thriller
3,1,47,5.0,964983815,Seven (a.k.a. Se7en) (1995),Mystery|Thriller
4,1,50,5.0,964982931,"Usual Suspects, The (1995)",Crime|Mystery|Thriller
5,1,70,3.0,964982400,From Dusk Till Dawn (1996),Action|Comedy|Horror|Thriller
6,1,101,5.0,964980868,Bottle Rocket (1996),Adventure|Comedy|Crime|Romance
7,1,110,4.0,964982176,Braveheart (1995),Action|Drama|War
8,1,151,5.0,964984041,Rob Roy (1995),Action|Drama|Romance|War
9,1,157,5.0,964984100,Canadian Bacon (1995),Comedy|War


### 3:Exploratory Data Analysis: 
### Dataset Size: Users: 610 unique users.Items (Movies): 9,724 unique movies. (Note: The user-entered code accidentally printed the user count for items, but the movieId unique count is 9,724 based on the movies.csv file structure).Interactions: 100,836 total ratings.

### User Activity (Interactions per User):The average user has provided 165.7 ratings, which is quite high.Min: 20 ratings (Confirming the MovieLens condition that all included users have rated at least 20 movies).Max: 7,376 ratings (Indicating a few highly active "super-users").

### Rating Distribution:The ratings are centered around the high end, with a median and mean rating of 4.0.Mean Rating: 3.501 (Out of 5.0)Std Dev: 1.043These statistics confirm the data is from a relatively small but highly active group of users, and ratings are generally positive.

In [33]:
#Number of Users,Items and interactions
n_users= ratings['userId'].nunique()
n_items= ratings['movieId'].nunique()
print(f'Users: {n_users}, Items: {n_users}, Interactions: {len(ratings)}')
#Count Interactions per user
user_counts= ratings.groupby('userId').size().describe()
print(user_counts)
# ratings distribution
ratings['rating'].describe()

Users: 610, Items: 610, Interactions: 100836
count     610.000000
mean      165.304918
std       269.480584
min        20.000000
25%        35.000000
50%        70.500000
75%       168.000000
max      2698.000000
dtype: float64


count    100836.000000
mean          3.501557
std           1.042529
min           0.500000
25%           3.000000
50%           3.500000
75%           4.000000
max           5.000000
Name: rating, dtype: float64

### 4:Temporal Train/Test Split
### This code block implements a Temporal Leave-One-Out Cross-Validation strategy, which is important for evaluating sequential and time-series-dependent data like a recommendation system.

### Why Temporal Split?Avoids Data Leakage: It ensures that the model is trained only on events that happened before the events it is asked to predict. A random split would leak future information.Simulates Production: We train on a user's past behavior and test on their most recent behavior, mimicking a real-world scenario where the model predicts the user's next action.

### Split Details:The timestamp is converted from Unix seconds to a proper datetime object.

### The data is sorted by userId and timestamp.

### The last (most recent) rating for every single user is extracted to form the Test Set (test_df).

### All remaining ratings are used for the Train Set (train_df).

### The resulting split is:Total Interactions: 100,836, Train Interactions: 100,226 (Used for model training) Test Interaction


In [34]:
#Data Type Conversion for Timestamps
if 'timestamp' in ratings.columns:
    ratings['timestamp'] = pd.to_datetime(ratings['timestamp'], unit='s')
# Handling Timestamps and Sorting
ratings_sorted = ratings.sort_values(['userId', 'timestamp']) if 'timestamp' in ratings.columns else ratings.sort_values (['userId'])
#Identifying the Test Set Indices (Leave-One-Out)
test_idx = ratings_sorted.groupby('userId').tail(1).index
# Creating the Final Train and Test DataFrames
test_df = ratings.loc[test_idx].reset_index(drop=True)
train_df = ratings.drop(test_idx).reset_index(drop=True)
#Showing the resulting split.
print('Train Interactions:', len(train_df))
print('Test Interactions (held-out last per user):', len(test_df))


Train Interactions: 100226
Test Interactions (held-out last per user): 610


### 5:Data Preparation for Surprise (CF)
### The code below prepares the Pandas DataFrame into the specific data structure required by the scikit-surprise library. This is a necessary step to utilize algorithms like Singular Value Decomposition (SVD).

### Key Components: Reader: The Reader object is initialized to inform Surprise about the expected rating scale, which is (0.5,5.0) for the MovieLens dataset. This normalization is vital for the matrix factorization algorithms.Dataset.load_from_df: This function transforms the Pandas DataFrame, containing the essential userId, movieId, and rating columns, into a Dataset object.

### Datasets Created: data_train: This object is loaded using the raw ratings DataFrame. Crucially, since you performed a manual temporal split previously, this full dataset object will be internally split later to feed the model only the train_df data.full_train: A duplicate object often created for procedures where the model is trained on all available data (e.g., after the best hyperparameters are found) to generate final recommendations.

### This step finalizes the data's readiness for the Matrix Factorization training phase.

In [35]:
#Defining the Rating Scale
reader= Reader(rating_scale=(0.5,5.0))
#Loading Data into Surprise Format
data_train= Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)
full_train= Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

print(reader)
print(data_train)
print(full_train)


<surprise.reader.Reader object at 0x00000273E4FDFCE0>
<surprise.dataset.DatasetAutoFolds object at 0x00000273EFBD1040>
<surprise.dataset.DatasetAutoFolds object at 0x00000273EFBD36B0>


### 6:Hyperparameter Tuning for Collaborative Filtering (SVD):
### The code below performs Grid Search Cross-Validation (GridSearchCV) to find the optimal set of parameters for the Singular Value Decomposition (SVD) matrix factorization algorithm. This is the crucial step for optimizing the performance of the Collaborative Filtering component.

### Tuning Strategy
### Algorithm: SVD is chosen as it's highly effective for recommendation systems, decomposing the user-item matrix into lower-dimensional user and item feature matrices (latent factors).

### Parameters Tuned (param_grid):

### n_factors (Latent Factors): The dimensionality of the latent factor space (e.g., 20, 50, 100). This represents how many hidden features describe user tastes and movie properties.

### lr_all (Learning Rate): Controls the step size at each iteration of stochastic gradient descent.

### reg_all (Regularization Term): Prevents overfitting by penalizing large parameter values.

### Evaluation:

### Cross-Validation: The training data is split into 3 folds (cv=3).

### Metric: The model is optimized to minimize the Root Mean Squared Error (RMSE), which is the standard measure of prediction accuracy for explicit rating data.

### Output: The search returns the combination of parameters that achieved the lowest average RMSE across the 3 folds.

In [58]:
#Defining the Search Space
param_grid = {
    'n_factors':[20, 50, 100],
    'lr_all':[0.002, 0.005],
    'reg_all': [0.02, 0.05]
}
#Setting up the Grid Search
gs= GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3, n_jobs=-1)
#Execution and Result
gs.fit(data_train)

print('Best RMSE score:', gs.best_score['rmse'])
print('Best params (RMSE):', gs.best_params['rmse'])

best_params=gs.best_params['rmse']

Best RMSE score: 0.8751026559158208
Best params (RMSE): {'n_factors': 20, 'lr_all': 0.005, 'reg_all': 0.05}


### 7:  Final Collaborative Filtering (SVD) Model Training and Evaluation.
### This block finalizes the training of the Collaborative Filtering (CF) model using the optimal hyperparameters found during the grid search, and then evaluates its real-world performance using the held-out temporal test set.

### Key Steps:

### Final Training (svd_final):

### The model (Singular Value Decomposition, SVD) is initialized using the best parameters (n\_factors, lr\_all, reg\_all) determined by GridSearchCV to ensure peak predictive performance.

### The model is trained on the full training dataset (trainset), which includes all user interactions before the held-out test ratings.

### Test Set Preparation:

### The test_df (which contains the single, most recent rating for every user from the temporal split) is converted into a list of tuples ((user\_id, movie\_id, true\_rating)), which is the exact format required by the surprise library's .test() function.

### Evaluation:

### Predictions are generated for every item in the test set.

### The final performance is measured using the standard regression metrics for rating prediction:

### RMSE (Root Mean Squared Error): The primary metric, which heavily penalizes large prediction errors.

### MAE (Mean Absolute Error): Measures the average magnitude of prediction error.

### These final RMSE and MAE scores represent the model's expected accuracy when predicting a user's future rating behavior.


In [59]:
#Final Model Training
trainset = full_train.build_full_trainset()
svd_final = SVD(n_factors=best_params['n_factors'], lr_all=best_params['lr_all'], reg_all=best_params['reg_all'], biased=True, random_state=42)
svd_final.fit(trainset)
#Test Set Preparation for Surprise
raw_testset= list(zip(test_df['userId'].astype(str).tolist(),test_df['movieId'].astype(str).tolist(), test_df['rating'].astype(str).tolist()))
testset_for_suprise= [(row.userId, row.movieId, row.rating) for row in test_df.itertuples()]
#Evaluate the Testset
predictions=svd_model.test(testset_for_suprise)
rmse= accuracy.rmse(predictions, verbose=False)
mae= accuracy.mae(predictions, verbose=False)
print(f'RMSE: {rmse:.4f}, MAE: {mae:.4f}')

RMSE: 0.8319, MAE: 0.6490


### 8:Content-Based Filtering (CBF) Model Setup
### This establishes the the Content-Based Filtering (CBF) component of the hybrid recommender system, focusing on movie genres.

### Key Steps:
### Genre Preprocessing:

### movies['genres'] = movies['genres'].fillna(''): Ensures all missing genre values are replaced with an empty string, preventing errors in the vectorization step.

### TF-IDF Vectorization:

### tfidf = TfidfVectorizer(...): Initializes the TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer.

### token_pattern=r"(?u)\b[^,]+\b": Crucially, the token pattern is adjusted to treat the genres (which are comma-separated in the input but typically pipe-separated in MovieLens) as distinct terms. For the standard MovieLens format using pipes (|), this custom pattern may need adjustment, but the intent is to capture genre tags like 'Action', 'Drama', etc., as separate features.

### genre_tfidf = tfidf.fit_transform(...): Transforms the list of movie genres into a sparse matrix where each row is a movie and each column is a genre term. TF-IDF gives higher weight to genres that are unique to a few movies, reducing the importance of common genres like 'Drama' or 'Comedy'.

### Similarity Matrix Calculation:

### genre_sim = cosine_similarity(genre_tfidf, genre_tfidf): Calculates the cosine similarity between every movie pair. The resulting matrix (genre_sim) measures how closely aligned the genre profiles of any two movies are, with values closer to 1.0 indicating higher similarity.

### Index Mapping:

### movieid_to_idx and idx_to_movieid: These dictionaries are created to map the internal DataFrame row index to the external MovieLens movieId and back. This is essential for quickly looking up a movie's similarity scores in the matrix using its movieId.

### This entire process enables the model to identify movies that are content-wise (genre-wise) similar to those a user has liked.

In [66]:
#Data Cleaning and Vectorization
movies['genres'] = movies['genres'].fillna('')
tfidf = TfidfVectorizer(token_pattern=r"(?u)\b[^,]+\b")
genre_tfidf =tfidf.fit_transform(movies['genres'])
#Calculating Similarity
movieid_to_idx = {mid: idx for idx, mid in enumerate(movies['movieId'].values)}
#Index Mapping
idx_to_movieid = {idx: mid for mid, idx in movieid_to_idx.items()}
genre_sim = cosine_similarity(genre_tfidf, genre_tfidf)


### 9:Recommendation Functions(CF-only and Hybrid Reranking)

### 10:Example Recommendations

### 11: Metrics