## Imports

In [1]:
import polars as pl
import numpy as np
from scipy.sparse import coo_matrix
from implicit.als import AlternatingLeastSquares
from sklearn.metrics import root_mean_squared_error
import random
from sklearn.model_selection import train_test_split
import gc
from collections import defaultdict

# User-Item Interaction Preparation

## Objective
Prepare a user-item interaction matrix for recommendation system training by:
1. Filtering active users
2. Creating dense indices for users/items
3. Extracting core rating features

In [None]:
DATASET = "C:/Users/anees/Desktop/datasets/unified_dataset"
df = (
    pl.scan_parquet(DATASET).select(["user_id", "asin", "rating"])
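      # Keep only users with at least 5 interactions
      .filter(pl.len().over("user_id") >= 5)
      .with_columns([
          # rank("dense") is 1-based; subtract 1 so indices start at 0 and no matrix row/column is left empty
          (pl.col("user_id").rank("dense") - 1).cast(pl.Int32).alias("user_idx"),
          (pl.col("asin").rank("dense") - 1).cast(pl.Int32).alias("item_idx"),
      ])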
      .select(["user_idx","item_idx","rating"])
      .collect()
)
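The filtering and dense-indexing steps can be illustrated on a toy example. This is a minimal NumPy sketch, not the notebook's polars pipeline: the arrays and the `min_interactions` threshold are hypothetical, and `np.unique(..., return_inverse=True)` is used here only as one way to obtain zero-based, gap-free integer codes.

```python
import numpy as np

# Hypothetical toy interactions (not taken from the notebook's dataset).
user_ids = np.array(["u9", "u2", "u9", "u5", "u2", "u9"])
asins = np.array(["B3", "B1", "B1", "B3", "B7", "B7"])

# Keep only users with at least `min_interactions` rows (the notebook uses 5).
min_interactions = 2
_, inverse, counts = np.unique(user_ids, return_inverse=True, return_counts=True)
keep = counts[inverse] >= min_interactions
user_ids, asins = user_ids[keep], asins[keep]

# return_inverse yields zero-based, gap-free codes following the sorted labels.
user_labels, user_idx = np.unique(user_ids, return_inverse=True)
item_labels, item_idx = np.unique(asins, return_inverse=True)

print(user_labels, user_idx)
print(item_labels, item_idx)
```

Keeping `user_labels` and `item_labels` around makes it possible to map matrix rows and columns back to the original identifiers later.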

### Data Preparation for User Recommendation System

The following steps prepare the user-item ratings data for building a recommendation system:

1. **Convert Columns to NumPy Arrays**  
   The `user_idx`, `item_idx`, and `rating` columns are extracted from the DataFrame and converted to NumPy arrays for efficient computation.
2. **Determine Matrix Shape**  
   Calculate the total number of unique users and items to define the shape of the user-item interaction matrix.
3. **Train-Test Split**  
   Split the dataset into training and testing sets using index-based selection to preserve correspondence between users, items, and ratings.
4. **Slice the Arrays**  
   Extract the corresponding user, item, and rating values for both the training and testing sets.

In [3]:
# Convert columns to numpy arrays
user_idx = df["user_idx"].to_numpy()
item_idx = df["item_idx"].to_numpy()
ratings = df["rating"].cast(pl.Float32).to_numpy()

# Determine matrix shape
num_users = user_idx.max() + 1
num_items = item_idx.max() + 1

# Train-test split on indices
indices = np.arange(len(df))
train_indices, test_indices = train_test_split(indices, test_size=0.2, random_state=42)

train_user_idx = user_idx[train_indices]
train_item_idx = item_idx[train_indices]
train_ratings = ratings[train_indices]

test_user_idx = user_idx[test_indices]
test_item_idx = item_idx[test_indices]
test_ratings = ratings[test_indices]
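A random row-level split can place users or items in the test set that have no interactions in the training set, so their latent factors are never fitted on any data. The sketch below, using hypothetical stand-in arrays rather than the notebook's real splits, shows one way to measure how many test rows are affected.

```python
import numpy as np

# Hypothetical index arrays standing in for the notebook's train/test splits.
train_user_idx = np.array([0, 0, 1, 2, 2])
train_item_idx = np.array([0, 1, 1, 2, 3])
test_user_idx = np.array([0, 1, 3])
test_item_idx = np.array([1, 4, 2])

# A test row is "warm" only if both its user and its item occur in training.
warm_user = np.isin(test_user_idx, train_user_idx)
warm_item = np.isin(test_item_idx, train_item_idx)
warm = warm_user & warm_item

cold_fraction = 1.0 - warm.mean()
print(f"cold-start share of test rows: {cold_fraction:.2f}")
```

If the cold-start share is large, reporting the metric separately for warm rows (for example via `test_ratings[warm]`) may give a clearer picture of what the model actually learned.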

### Building the Recommendation Model

1. **Create Sparse User-Item Matrix**  
   Construct a sparse matrix from the training data, where rows represent users, columns represent items, and values represent ratings. The matrix is converted to CSR format for efficient access.
2. **Train ALS Model**  
   Initialize and train an Alternating Least Squares (ALS) model on the training matrix. ALS is a matrix factorization algorithm commonly used for collaborative filtering.

In [None]:
del df
gc.collect()

# Create sparse matrix
train_matrix = coo_matrix((train_ratings, (train_user_idx, train_item_idx)), shape=(num_users, num_items)).tocsr()

# Train ALS model
model = AlternatingLeastSquares(
    factors=50,
    regularization=0.01,
    iterations=15,
    use_gpu=False,
    random_state=42
)
model.fit(train_matrix)

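One detail of the COO-to-CSR conversion is worth keeping in mind: repeated `(row, col)` pairs are summed, not overwritten. The sketch below uses hypothetical triplets to show the effect; whether the real dataset contains repeated user-item pairs is not known from this notebook.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical triplets; user 0 rated item 1 twice (3.0 and 5.0).
rows = np.array([0, 0, 1, 2])
cols = np.array([1, 1, 0, 2])
vals = np.array([3.0, 5.0, 4.0, 2.0], dtype=np.float32)

csr = coo_matrix((vals, (rows, cols)), shape=(3, 3)).tocsr()

# Duplicate (row, col) pairs are summed during conversion.
summed_value = csr[0, 1]

# Density: share of cells that hold a stored value.
density = csr.nnz / (csr.shape[0] * csr.shape[1])
print(summed_value, csr.nnz, density)
```

If repeated pairs are possible, deduplicating or averaging them before building the matrix keeps stored values within the original rating range.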


### Model Evaluation on Test Set

1. **Generate Predictions**  
   Loop through the test set and compute predicted ratings using the dot product of user and item latent factors, but only if both the user and item exist in the model.
2. **Compute RMSE**  
   Evaluate model performance using Root Mean Squared Error (RMSE), which measures the average deviation between predicted and actual ratings.

In [5]:
preds, actuals = [], []
for u, i, r in zip(test_user_idx, test_item_idx, test_ratings):
    if u < model.user_factors.shape[0] and i < model.item_factors.shape[0]:
        pred = model.user_factors[u] @ model.item_factors[i]
        preds.append(pred)
        actuals.append(r)

rmse = root_mean_squared_error(actuals, preds)
print(f" RMSE on test set: {rmse:.4f}")

 RMSE on test set: 4.5054
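The per-row loop can also be expressed as a single vectorized operation. The following is a minimal sketch under stated assumptions: the factor matrices are random, hypothetical stand-ins for `model.user_factors` and `model.item_factors` (assumed to be NumPy arrays, as with `use_gpu=False`), and RMSE is computed directly from its definition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical factor matrices standing in for the trained model's factors.
user_factors = rng.normal(size=(4, 3)).astype(np.float32)
item_factors = rng.normal(size=(5, 3)).astype(np.float32)

test_user_idx = np.array([0, 1, 3, 3])
test_item_idx = np.array([2, 4, 0, 1])
test_ratings = np.array([5.0, 3.0, 4.0, 1.0], dtype=np.float32)

# Row-wise dot product: one score per (user, item) pair, no Python loop.
preds = np.einsum(
    "ij,ij->i", user_factors[test_user_idx], item_factors[test_item_idx]
)

# RMSE = sqrt(mean((prediction - actual)^2))
rmse = float(np.sqrt(np.mean((preds - test_ratings) ** 2)))
print(f"RMSE: {rmse:.4f}")
```

On large test sets this avoids millions of Python-level iterations while producing the same scores as the loop.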


## Results

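* The RMSE on the test set was 4.51
* This means the typical distance between the predicted score and the actual rating is almost as large as the entire rating scale
* This indicates poor model performance
* The model is not capturing rating patterns well, perhaps due to:
    - Sparse data
    - Inadequate feature learning (too few factors or iterations)
    - Not enough regularization
    - A mismatch between model and metric: the ALS model in `implicit` is designed for implicit feedback and treats matrix values as confidence weights, so its dot-product scores are not calibrated to the rating scale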
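To put an RMSE value in context, it helps to compare it against a trivial baseline that predicts the training-set mean rating for every test row. The sketch below uses small hypothetical rating arrays and a hypothetical `rmse` helper; it is meant as a reference point, not as part of the notebook's pipeline.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between two equal-length arrays."""
    actual = np.asarray(actual, dtype=np.float64)
    predicted = np.asarray(predicted, dtype=np.float64)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))

# Hypothetical ratings standing in for train_ratings / test_ratings.
train_ratings = np.array([5, 4, 5, 3, 1, 5], dtype=np.float32)
test_ratings = np.array([4, 5, 2], dtype=np.float32)

# Constant predictor: every test row gets the training mean.
global_mean = float(train_ratings.mean())
baseline_preds = np.full(test_ratings.shape, global_mean)
baseline_rmse = rmse(test_ratings, baseline_preds)

print(f"global mean: {global_mean:.3f}, baseline RMSE: {baseline_rmse:.3f}")
```

A model whose RMSE is above this baseline is doing worse than ignoring users and items entirely, which is a strong hint that the scores and the ratings are not on the same scale.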