# Model Development for KuaiRec Recommender System

## Introduction
This notebook builds upon our data preprocessing and feature engineering work to develop effective recommendation models for the KuaiRec short video platform. Recommendation systems are critical for video platforms as they directly impact user engagement, satisfaction, and platform growth. By leveraging the rich features we've extracted, we aim to create personalized recommendations that match users with content they'll likely enjoy.

## Why Multiple Recommendation Approaches?
No single recommendation algorithm works best for all scenarios. Each approach has unique strengths:
- **Collaborative Filtering**: Captures shared preferences among similar users
- **Content-Based Filtering**: Leverages video characteristics and user preferences
- **Sequence-Aware Models**: Accounts for temporal patterns in viewing behavior
- **Hybrid Approaches**: Combines multiple techniques to overcome individual limitations

## Models We'll Implement

1. **Collaborative Filtering (SVD-based Matrix Factorization)**
   - Uses the dense interaction matrix (79.71% density)
   - Implements SVD (Singular Value Decomposition) via scipy.sparse.linalg.svds
   - Centers the data by subtracting the global mean
   - Extracts latent factors to capture underlying patterns in user-item interactions
   - Treats watch ratio as a graded, rating-like feedback signal

2. **Content-Based Filtering**
   - Creates item profiles using category features and engagement metrics
   - Normalizes features using StandardScaler
   - Builds user profiles as weighted averages of item profiles based on watch ratios
   - Uses cosine similarity to find items matching user preferences
   - Particularly valuable for new videos with limited interaction data

3. **Sequence-Aware Recommendation (Sequential Rules)**
   - Tracks item sequences and transition patterns in user viewing history
   - Implements a sequential rules model that counts item-to-item transitions
   - Uses configurable sequence length (default: 5) and minimum support threshold
   - Predicts next items based on recent viewing sequences
   - Captures temporal patterns in video consumption behavior

4. **LightGBM Hybrid Model**
   - Implements a gradient boosting decision tree model using LightGBM
   - Uses all available features except ID columns and target variables
   - Automatically handles both numerical and categorical features
   - Predicts watch ratios directly as a regression task
   - Combines the strengths of all feature types for more accurate predictions

## Table of Contents
1. [Setup and Imports](#setup-and-imports)
2. [Loading Processed Data](#loading-processed-data)
3. [Collaborative Filtering Models](#collaborative-filtering-models)
4. [Content-Based Filtering Models](#content-based-filtering-models)
5. [Sequence-Aware Models](#sequence-aware-models)
6. [Hybrid Approaches](#hybrid-approaches)
7. [Model Development Summary](#model-development-summary)
8. [Analysis and Next Steps](#analysis-and-next-steps)

## 1. Setup and Imports

First, we import all necessary libraries and set up the environment for model development.

In [12]:
# Import necessary libraries
import sys
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import sparse
import pickle
from datetime import datetime
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

# Add the src directory to the path so we can import our modules
sys.path.append(os.path.abspath("../"))
from src.models.collaborative import ALSModel
from src.models.content_based import ContentBasedModel
from src.models.sequence_aware import SequentialRules
from src.models.hybrid import LightGBMModel

# Set up directories
processed_dir = "../data/processed"
models_dir = "../models"
os.makedirs(models_dir, exist_ok=True)

## 2. Loading Processed Data

We load the processed data generated from the previous notebooks, including user features, item features, and training interaction data.

In [13]:
# Load the data
print("Loading processed data...")
train_features = pd.read_csv(os.path.join(processed_dir, "train_features.csv"), low_memory=True)
user_features = pd.read_csv(os.path.join(processed_dir, "user_features.csv"), low_memory=True)
item_features = pd.read_csv(os.path.join(processed_dir, "item_features.csv"), low_memory=True)

Loading processed data...


In [14]:
# Load the interaction matrix and the user/item index mappings

interaction_matrix = sparse.load_npz(os.path.join(processed_dir, "interaction_matrix.npz"))
with open(os.path.join(processed_dir, "user_indices.pkl"), 'rb') as f:
    user_indices = pickle.load(f)
with open(os.path.join(processed_dir, "item_indices.pkl"), 'rb') as f:
    item_indices = pickle.load(f)
print("Loaded interaction matrix from file.")


print(f"Data loaded. Train features shape: {train_features.shape}")
print(f"Interaction matrix shape: {interaction_matrix.shape}, density: {interaction_matrix.nnz / (interaction_matrix.shape[0] * interaction_matrix.shape[1]) * 100:.2f}%")

Loaded interaction matrix from file.
Data loaded. Train features shape: (3741835, 21)
Interaction matrix shape: (1411, 3327), density: 79.71%


## 3. Collaborative Filtering Models

Collaborative filtering makes recommendations based on historical user-item interactions without requiring explicit content knowledge. Our implementation uses Singular Value Decomposition (SVD), a matrix factorization technique that:

- Decomposes the user-item interaction matrix into lower-dimensional latent factor matrices
- Captures hidden patterns in user preferences and item characteristics
- Centers the data by subtracting the global mean to account for rating biases
- Represents users and items as vectors in the same latent space
- Measures similarity through vector operations in this latent space

The high density of our interaction matrix (79.71%) makes this approach particularly effective, as it has sufficient data to identify meaningful patterns. The model extracts 100 latent factors, balancing expressiveness with computational efficiency.

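Our implementation uses `scipy.sparse.linalg.svds`, which computes only the leading singular vectors (a truncated SVD) and accepts the sparse-format matrix we load from disk. Two points are worth keeping in mind: the matrix is stored in a sparse format but is about 80% dense, and the class is named `ALSModel` although the factorization described here is SVD, not alternating least squares.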
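
The centering and factorization steps described above can be sketched as follows. This is a minimal illustration, not the code in `src.models.collaborative`: the function names `fit_centered_svd` and `predict`, the toy matrix, the choice to treat zeros as unobserved, and the even split of singular values between user and item factors are all assumptions.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import svds


def fit_centered_svd(interactions, n_factors):
    """Factorize a user-item matrix after subtracting the global mean.

    Assumption: zero entries are treated as unobserved, so the mean is
    taken over non-zero entries only and zeros stay at zero after centering.
    """
    dense = interactions.toarray().astype(np.float64)
    observed = dense != 0
    global_mean = dense[observed].mean()
    centered = np.where(observed, dense - global_mean, 0.0)

    # svds returns the k largest singular triplets; k must be < min(shape).
    u, s, vt = svds(centered, k=n_factors)

    # The returned order is not guaranteed to be descending, so sort explicitly.
    order = np.argsort(s)[::-1]
    u, s, vt = u[:, order], s[order], vt[order, :]

    # Split the singular values evenly between the two factor matrices.
    user_factors = u * np.sqrt(s)
    item_factors = vt.T * np.sqrt(s)
    return user_factors, item_factors, global_mean


def predict(user_factors, item_factors, global_mean):
    """Reconstruct predicted scores and add the global mean back."""
    return user_factors @ item_factors.T + global_mean


# Toy data: 6 users x 5 items, roughly 80% of entries observed.
rng = np.random.default_rng(0)
toy = sparse.csr_matrix(rng.random((6, 5)) * (rng.random((6, 5)) < 0.8))

user_f, item_f, mu = fit_centered_svd(toy, n_factors=2)
scores = predict(user_f, item_f, mu)
```

Each row of `scores` holds one user's predicted values for every item, so ranking a row (after masking items the user has already seen) would yield candidate recommendations.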

In [15]:
print("\n--- Training Collaborative Filtering Models ---")

print("\nTraining ALS model...")
als_model = ALSModel(factors=100)
als_model.fit(interaction_matrix)
als_model.save(os.path.join(models_dir, "als_model.pkl"))
print("ALS model trained and saved.")


--- Training Collaborative Filtering Models ---

Training ALS model...
ALS model trained and saved.


## 4. Content-Based Filtering Models

Content-based filtering recommends items similar to those a user has previously enjoyed, based on item attributes rather than user interactions. Our implementation:

- Creates item profiles using category features and engagement metrics (49 features total)
- Normalizes features using StandardScaler to ensure equal contribution from different scales
- Builds user profiles as weighted averages of item profiles, with weights based on watch ratios
- Recommends items whose profiles are most similar to the user's profile
- Uses cosine similarity to measure profile matching

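This approach excels at:
- Addressing the item "cold start" problem: new videos with limited interaction data can still be recommended from their category features, although their engagement metrics need some interactions to become reliable
- Providing personalized recommendations based on a user's specific content preferences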

Our model leverages both categorical features (video categories) and behavioral metrics (engagement score, watch ratio) to create comprehensive item profiles.
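
The profile-building and matching steps above can be sketched with NumPy alone. This is an illustration under stated assumptions, not the `ContentBasedModel` implementation: the helper names, the three toy feature columns, and the single-item viewing history are hypothetical. The `standardize` helper applies the same per-column z-scoring that `StandardScaler` performs (population standard deviation).

```python
import numpy as np


def standardize(features):
    """Z-score each column: zero mean, unit variance (population std)."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns at zero instead of dividing by 0
    return (features - mean) / std


def build_user_profile(item_profiles, watched_idx, watch_ratios):
    """Weighted average of the watched items' profiles, weighted by watch ratio."""
    weights = np.asarray(watch_ratios, dtype=float)
    return np.average(item_profiles[watched_idx], axis=0, weights=weights)


def cosine_scores(user_profile, item_profiles):
    """Cosine similarity between one user profile and every item profile."""
    norms = np.linalg.norm(item_profiles, axis=1) * np.linalg.norm(user_profile)
    norms[norms == 0] = 1.0  # zero vectors get a similarity of 0
    return item_profiles @ user_profile / norms


# Hypothetical item features: [is_category_A, is_category_B, engagement_score]
raw_items = np.array([
    [1.0, 0.0, 0.9],
    [1.0, 0.0, 0.7],
    [0.0, 1.0, 0.2],
    [0.0, 1.0, 0.4],
])
profiles = standardize(raw_items)

# The user has watched item 0 with a watch ratio of 1.5.
user = build_user_profile(profiles, watched_idx=[0], watch_ratios=[1.5])
scores = cosine_scores(user, profiles)

# Exclude the already-watched item before picking the best match.
scores[[0]] = -np.inf
best_unseen = int(np.argmax(scores))
```

In this toy case the best unseen match is item 1, which shares the watched item's category. Note that `np.average` raises an error when the weights sum to zero, so a real implementation would need to handle users whose watch ratios are all zero.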

In [16]:
print("\n--- Training Content-Based Filtering Model ---")
content_model = ContentBasedModel()
content_model.fit(item_features, train_features)
content_model.save(os.path.join(models_dir, "content_model.pkl"))
print("Content-based model trained and saved.")


--- Training Content-Based Filtering Model ---
Content-based model trained and saved.


## 5. Sequence-Aware Models

Sequence-aware recommendation recognizes that the order of interactions matters, especially for short video platforms where viewing patterns often follow temporal sequences. Our Sequential Rules model:

- Tracks item-to-item transitions in user viewing histories
- Counts sequences of different lengths (up to 5 items)
- Applies a minimum support threshold (2) to filter out rare patterns
- Predicts next items based on the most recent items in a user's sequence
- Weights predictions by the frequency of observed transitions

This approach is particularly valuable because:
- It captures temporal dynamics in user preferences
- It models the context of recent interactions
- It can detect common viewing paths through content
- It adapts to evolving user interests over time

Unlike traditional collaborative or content-based methods that treat interactions as independent events, sequential models recognize that each viewing decision is influenced by what the user has just watched.
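
The transition counting and next-item prediction described above can be sketched as follows. This is a simplified illustration, not the `SequentialRules` class from `src.models.sequence_aware`: the function names, the windowing scheme, and the unweighted counts are assumptions, and only the parameter names `max_sequence_length` and `min_support` come from the notebook.

```python
from collections import Counter, defaultdict


def fit_transitions(sequences, max_sequence_length=5, min_support=2):
    """Count how often an item follows another within a window of recent items."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for pos, target in enumerate(seq):
            # Look back at most max_sequence_length items before the target.
            start = max(0, pos - max_sequence_length)
            for prev in seq[start:pos]:
                counts[prev][target] += 1

    # Keep only transitions observed at least min_support times.
    return {
        prev: {nxt: c for nxt, c in followers.items() if c >= min_support}
        for prev, followers in counts.items()
    }


def predict_next(rules, recent_items, top_n=3):
    """Score candidates by summing transition counts from the recent items."""
    scores = Counter()
    for item in recent_items:
        for nxt, count in rules.get(item, {}).items():
            scores[nxt] += count
    # Do not recommend items the user has just watched.
    for item in recent_items:
        scores.pop(item, None)
    return [item for item, _ in scores.most_common(top_n)]


# Toy viewing histories, already ordered by time within each user.
histories = [
    ["a", "b", "c"],
    ["a", "b", "d"],
    ["b", "c", "a"],
]
rules = fit_transitions(histories, max_sequence_length=2, min_support=2)
next_items = predict_next(rules, recent_items=["a", "b"])
```

With these toy histories, only the transitions `a -> b` and `b -> c` reach the support threshold, so a user who has just watched `a` then `b` is predicted to watch `c` next.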

In [17]:
print("\n--- Training Sequence-Aware Model ---")
# Sort training data by user and timestamp to create sequences
if 'timestamp' in train_features.columns:
    train_features_sorted = train_features.sort_values(['user_id', 'timestamp'])
    
    seq_model = SequentialRules(max_sequence_length=5, min_support=2)
    seq_model.fit(train_features_sorted)
    seq_model.save(os.path.join(models_dir, "sequential_model.pkl"))
    print("Sequential model trained and saved.")
else:
    print("Timestamp column not found in training data. Skipping sequence-aware model.")


--- Training Sequence-Aware Model ---
Sequential model trained and saved.


## 6. Hybrid Approaches

### Understanding LightGBM Hybrid Model

Hybrid recommendation systems combine multiple techniques to overcome the limitations of individual approaches. Our implementation uses LightGBM, a gradient boosting framework that:

- Treats recommendation as a regression problem, predicting watch ratios directly
- Combines all feature types (user, item, interaction, and contextual features)
- Automatically handles both numerical and categorical features
- Uses gradient boosted decision trees to model complex non-linear relationships
- Balances accuracy with computational efficiency

The hybrid approach offers several advantages:
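- Draws on user, item, and interaction features in a single model (the training log reports 14 features in use, out of the 21 columns in `train_features`, rather than the full sets of 41 user and 49 item features)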
- Captures complex interactions between different feature types
- Adapts to different recommendation scenarios within the same model
- Mitigates weaknesses of individual algorithms by combining their strengths

Our implementation sets parameters intended to limit overfitting (the training log shows row bagging in use) and holds out a validation set to check generalization: 2,993,468 of the 3,741,835 rows are used for training, an 80/20 split.
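
The data preparation implied above (dropping ID columns and the target, then holding out a validation set) can be sketched with pandas. This is an assumption-laden illustration rather than the `LightGBMModel` code: the helper names, the toy column names, and the random row-level split are hypothetical, and the actual split strategy in `src.models.hybrid` may differ.

```python
import numpy as np
import pandas as pd


def split_features_target(frame, id_cols, target_col):
    """Use every column as a feature except the ID columns and the target."""
    feature_cols = [c for c in frame.columns if c not in id_cols and c != target_col]
    return frame[feature_cols], frame[target_col]


def train_valid_split(X, y, valid_fraction=0.2, seed=42):
    """Randomly hold out a fraction of rows for validation."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(X))
    n_valid = int(round(len(X) * valid_fraction))
    valid_idx, train_idx = shuffled[:n_valid], shuffled[n_valid:]
    return X.iloc[train_idx], X.iloc[valid_idx], y.iloc[train_idx], y.iloc[valid_idx]


# Hypothetical miniature of the training table.
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "video_id": [10, 11, 10, 12, 11, 12, 10, 13, 12, 13],
    "video_duration": [7.0, 12.5, 7.0, 30.0, 12.5, 30.0, 7.0, 9.0, 30.0, 9.0],
    "category": pd.Categorical(["a", "b", "a", "c", "b", "c", "a", "b", "c", "b"]),
    "watch_ratio": [1.2, 0.4, 0.9, 0.1, 0.6, 0.3, 1.5, 0.8, 0.2, 1.0],
})

X, y = split_features_target(toy, id_cols=["user_id", "video_id"], target_col="watch_ratio")
X_train, X_valid, y_train, y_valid = train_valid_split(X, y, valid_fraction=0.2)
```

The resulting frames could then be passed to a gradient boosting regressor such as LightGBM, with the validation pair used to monitor generalization during training. Keeping `category` as a pandas categorical column is one way to let the model treat it as a categorical feature rather than a number.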

In [18]:
print("\n--- Training Hybrid Model ---")
hybrid_model = LightGBMModel()
hybrid_model.fit(train_features)
hybrid_model.save(os.path.join(models_dir, "hybrid_model"))
print("Hybrid model trained and saved.")


--- Training Hybrid Model ---
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Debug] Dataset::GetMultiBinFromSparseFeatures: sparse rate 0.793290
[LightGBM] [Info] Total Bins 1608
[LightGBM] [Info] Number of data points in the train set: 2993468, number of used features: 14
[LightGBM] [Info] Using GPU Device: NVIDIA GeForce RTX 2060, Vendor: NVIDIA Corporation
[LightGBM] [Info] Compiling OpenCL Kernel with 256 bins...
[LightGBM] [Info] GPU programs have been built
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 12 dense feature groups (34.26 MB) transferred to GPU in 0.061215 secs. 1 sparse feature groups
[LightGBM] [Debug] Use subset for bagging
[LightGBM] [Info] Start training from score 0.907228
[LightGBM] [Debug] Re-bagging, using 2395705 data to train
[LightGBM] [Info] Size of histogram bin entry: 8
[LightGBM] [Info] 12 dense feature groups (27.42 MB) transferred to GPU in 0.049773 secs. 1 sparse feature groups
[LightGBM] [Debug] Trained a tree with lea

## 7. Model Development Summary

Here we summarize the models developed and their characteristics.

In [19]:
print("\n=== Model Development Summary ===")
print("1. Collaborative Filtering:")
print("   - Trained ALS model with 100 factors")

print("\n2. Content-Based Filtering:")
print("   - Trained model using video categories and engagement metrics")

print("\n3. Sequence-Aware Model:")
if 'seq_model' in locals():
    print("   - Trained sequential rules model with max sequence length of 5")
else:
    print("   - Not trained due to missing timestamp data")

print("\n4. Hybrid Model:")
print("   - Trained LightGBM model combining user, item, and interaction features")

print("\nAll models have been saved to the models directory.")
print("Next step: Implement the recommendation algorithm to generate recommendations for users.")


=== Model Development Summary ===
1. Collaborative Filtering:
   - Trained ALS model with 100 factors

2. Content-Based Filtering:
   - Trained model using video categories and engagement metrics

3. Sequence-Aware Model:
   - Trained sequential rules model with max sequence length of 5

4. Hybrid Model:
   - Trained LightGBM model combining user, item, and interaction features

All models have been saved to the models directory.
Next step: Implement the recommendation algorithm to generate recommendations for users.


## 8. Analysis and Next Steps

Thanks to the rich features engineered in notebook 2, we successfully implemented four complementary recommendation models. These models have been trained and stored, ready for testing in notebook 4 and formal evaluation in notebook 5. This systematic assessment will guide us in selecting the optimal model configuration for the KuaiRec recommendation system, balancing accuracy, diversity, and computational efficiency.