
# Project 1 – Baseline Recommender System using Global Averages and Biases

In this project for DATA 612, we explore how to make rating predictions using minimal information.
We build a simple recommender system that estimates ratings using global averages, user biases,
and item biases. The goal is to evaluate how well these baseline predictors perform in comparison
to using just the global mean.

**Business Perspective:**
This recommender system is designed to help an online movie streaming platform provide personalized
movie recommendations to users. By leveraging patterns in existing user ratings, the system aims to
predict how much a user might like a movie they haven't rated yet — even when only limited data is available.

**Objectives:**
- Simulate a small user-item ratings dataset with missing values.
- Split the data into training and testing sets.
- Calculate the global average rating and evaluate its prediction error.
- Estimate user and item biases based on deviations from the global average.
- Use those biases to make baseline predictions.
- Compare the prediction accuracy of the two approaches using RMSE.



## Step 1: Create a Toy Dataset

We simulate ratings for 10 users and 10 items (100 user-item pairs). Each rating is drawn at random from 1 to 5, and about 20% of the pairs are left missing.


In [2]:
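import pandas as pd
import numpy as np

# Create a toy dataset: 10 users x 10 items, one row per (user, item) pair.
# No random seed is set, so the sampled ratings change on every run.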

data = {
    'user': [
        'U1','U1','U1','U1','U1','U1','U1','U1','U1','U1',
        'U2','U2','U2','U2','U2','U2','U2','U2','U2','U2',
        'U3','U3','U3','U3','U3','U3','U3','U3','U3','U3',
        'U4','U4','U4','U4','U4','U4','U4','U4','U4','U4',
        'U5','U5','U5','U5','U5','U5','U5','U5','U5','U5',
        'U6','U6','U6','U6','U6','U6','U6','U6','U6','U6',
        'U7','U7','U7','U7','U7','U7','U7','U7','U7','U7',
        'U8','U8','U8','U8','U8','U8','U8','U8','U8','U8',
        'U9','U9','U9','U9','U9','U9','U9','U9','U9','U9',
        'U10','U10','U10','U10','U10','U10','U10','U10','U10','U10',
    ],
    'item': [
        *(['I1','I2','I3','I4','I5','I6','I7','I8','I9','I10'] * 10)
    ],
    'rating': list(np.random.choice([1, 2, 3, 4, 5, np.nan], size=100, p=[0.1, 0.15, 0.2, 0.2, 0.15, 0.2]))
}


df = pd.DataFrame(data)
df.head()


  user item  rating
0   U1   I1     4.0
1   U1   I2     1.0
2   U1   I3     1.0
3   U1   I4     2.0
4   U1   I5     NaN
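
Because the ratings are drawn without a fixed seed, the table above will not be reproduced on a rerun. A minimal sketch of a seeded variant follows, assuming NumPy's `default_rng` generator and an arbitrary seed of 42. The seed and the names `rng`, `pairs`, and `df_seeded` are illustrative and not part of the original notebook.

```python
import itertools

import numpy as np
import pandas as pd

# Hypothetical seed; any fixed integer makes the draw repeatable.
rng = np.random.default_rng(42)

users = [f"U{i}" for i in range(1, 11)]
items = [f"I{i}" for i in range(1, 11)]

# One row per (user, item) pair: U1-I1, U1-I2, ..., U10-I10.
pairs = list(itertools.product(users, items))

# Same rating values and probabilities as the original cell;
# NaN marks a missing rating.
ratings = rng.choice(
    [1, 2, 3, 4, 5, np.nan],
    size=len(pairs),
    p=[0.1, 0.15, 0.2, 0.2, 0.15, 0.2],
)

df_seeded = pd.DataFrame(pairs, columns=["user", "item"])
df_seeded["rating"] = ratings
```

With a seed in place, every downstream number (the split, the biases, the RMSE values) becomes repeatable, although it will differ from the unseeded results reported in this notebook.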


## Step 2: Split into Train and Test Sets

We randomly split the non-missing ratings into training and testing datasets (80/20 split).


In [3]:
from sklearn.model_selection import train_test_split

# Drop rows with missing ratings
df_non_missing = df.dropna(subset=['rating']).copy()

# Use sklearn to split into train/test
train_idx, test_idx = train_test_split(df_non_missing.index, test_size=0.2, random_state=42)

# Assign train/test labels
df['set'] = 'train'
df.loc[test_idx, 'set'] = 'test'

# Create separate train/test DataFrames
train = df[(df['set'] == 'train') & df['rating'].notna()].copy()
test = df[(df['set'] == 'test') & df['rating'].notna()].copy()
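
The same kind of 80/20 split can also be written without index bookkeeping. The sketch below assumes that sampling with `DataFrame.sample` is an acceptable stand-in for `train_test_split`; it will not select the same rows as the cell above. The small `ratings` frame is made up for illustration, and the names `train_alt` and `test_alt` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up ratings purely for illustration; NaN marks a missing rating.
ratings = pd.DataFrame({
    "user": ["U1", "U1", "U2", "U2", "U3", "U3",
             "U4", "U4", "U5", "U5", "U6", "U6"],
    "item": ["I1", "I2"] * 6,
    "rating": [4, np.nan, 3, 5, 2, 1, np.nan, 4, 5, 3, 2, 4],
})

# Keep only the observed ratings (10 of the 12 rows here).
observed = ratings.dropna(subset=["rating"])

# Hold out 20% of the observed ratings for testing; the rest is training data.
test_alt = observed.sample(frac=0.2, random_state=42)
train_alt = observed.drop(test_alt.index)

# Cold-start check: users or items that appear only in the test set
# have no training ratings from which a bias could be estimated.
unseen_users = set(test_alt["user"]) - set(train_alt["user"])
unseen_items = set(test_alt["item"]) - set(train_alt["item"])
```

On a dataset this small, it is worth inspecting `unseen_users` and `unseen_items`, since any test row involving them can only be predicted from the global average.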



## Step 3: Global Average Prediction

We calculate the global average from the training set and use it to predict all ratings.


In [4]:
# Global average rating
global_avg = train['rating'].mean()

train['pred_avg'] = global_avg
test['pred_avg'] = global_avg

# Root mean squared error between the true and predicted ratings
def rmse(true, pred):
    return np.sqrt(np.mean((true - pred) ** 2))

rmse_train_avg = rmse(train['rating'], train['pred_avg'])
rmse_test_avg = rmse(test['rating'], test['pred_avg'])
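
To make the metric concrete, here is a small worked example on made-up ratings. The helper `rmse_example` mirrors the `rmse` function above but is a separate, hypothetical name so that it can run on its own.

```python
import numpy as np

def rmse_example(true, pred):
    """Root mean squared error between two equal-length sequences."""
    true = np.asarray(true, dtype=float)
    pred = np.asarray(pred, dtype=float)
    return np.sqrt(np.mean((true - pred) ** 2))

true_ratings = [3, 4, 5]              # made-up ratings
mean_rating = np.mean(true_ratings)   # 4.0

# Predicting the mean everywhere: errors are -1, 0, 1, so RMSE = sqrt(2/3).
rmse_at_mean = rmse_example(true_ratings, [mean_rating] * 3)

# Predicting 3 everywhere: errors are 0, 1, 2, so RMSE = sqrt(5/3).
rmse_at_three = rmse_example(true_ratings, [3, 3, 3])
```

On the data it was computed from, the mean gives the lowest RMSE of any constant prediction, and that RMSE equals the population standard deviation of the ratings (`np.std` with its default `ddof=0`). This is why the global average is a natural floor to compare the bias-based predictor against.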


## Step 4: Calculate User and Item Biases

We compute how much each user and each item deviates from the global average.


In [5]:
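# User bias: each user's mean training rating minus the global average
user_bias = train.groupby('user')['rating'].mean() - global_avg
# Item bias: each item's mean training rating minus the global average
item_bias = train.groupby('item')['rating'].mean() - global_avg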
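
A tiny worked example may help show what these two lines compute. The ratings below are made up, and the `toy_` names are hypothetical; only the `groupby(...).mean() - global average` pattern comes from the notebook.

```python
import pandas as pd

# Made-up training ratings for two users and two items.
toy_train = pd.DataFrame({
    "user": ["U1", "U1", "U2", "U2"],
    "item": ["I1", "I2", "I1", "I2"],
    "rating": [5.0, 3.0, 4.0, 2.0],
})

toy_global_avg = toy_train["rating"].mean()  # (5 + 3 + 4 + 2) / 4 = 3.5

# U1 averages 4.0 and U2 averages 3.0, so their biases are +0.5 and -0.5.
toy_user_bias = toy_train.groupby("user")["rating"].mean() - toy_global_avg

# I1 averages 4.5 and I2 averages 2.5, so their biases are +1.0 and -1.0.
toy_item_bias = toy_train.groupby("item")["rating"].mean() - toy_global_avg
```

A positive user bias marks a generous rater and a positive item bias marks a well-liked item; negative values mean the opposite.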


## Step 5: Baseline Predictors

We add the global average, the user bias, and the item bias to produce each baseline prediction.
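
In symbols, the predictor can be written as follows. The notation is introduced here for illustration (the notebook itself uses only variable names): $\mu$ is the global training average, $\bar{r}_u$ and $\bar{r}_i$ are the mean training ratings of user $u$ and item $i$, and $b_u$, $b_i$ are the corresponding biases.

$$
\hat{r}_{ui} = \mu + b_u + b_i, \qquad b_u = \bar{r}_u - \mu, \qquad b_i = \bar{r}_i - \mu
$$

For example, with $\mu = 3.5$, $b_u = 0.5$, and $b_i = -1.0$, the predicted rating is $3.5 + 0.5 - 1.0 = 3.0$.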


In [6]:
# Predict on training set
train['user_bias'] = train['user'].map(user_bias)
train['item_bias'] = train['item'].map(item_bias)
train['pred_base'] = global_avg + train['user_bias'] + train['item_bias']

# Predict on test set (fill unknown biases with 0)
test['user_bias'] = test['user'].map(user_bias).fillna(0)
test['item_bias'] = test['item'].map(item_bias).fillna(0)
test['pred_base'] = global_avg + test['user_bias'] + test['item_bias']

rmse_train_base = rmse(train['rating'], train['pred_base'])
rmse_test_base = rmse(test['rating'], test['pred_base'])
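
Because the baseline is a sum of three terms, it can fall outside the 1 to 5 rating scale. The notebook does not clip its predictions; the sketch below shows one way this could be done with `np.clip`. The function name `clip_predictions` and the example numbers are hypothetical.

```python
import numpy as np

RATING_MIN, RATING_MAX = 1.0, 5.0  # rating scale used in the toy dataset

def clip_predictions(pred):
    """Clamp predictions to the valid rating range."""
    return np.clip(np.asarray(pred, dtype=float), RATING_MIN, RATING_MAX)

# Made-up predictions of the form global_avg + user_bias + item_bias.
raw = np.array([
    3.4 + 1.2 + 0.9,   # 5.5: generous user, well-liked item (above the scale)
    3.4 - 1.5 - 1.3,   # 0.6: harsh user, disliked item (below the scale)
    3.4 + 0.1 - 0.2,   # 3.3: already within the scale
])

clipped = clip_predictions(raw)  # out-of-range values move to 5.0 and 1.0
```

Since every true rating lies between 1 and 5, moving an out-of-range prediction to the nearest bound never increases its absolute error, so clipping cannot raise the RMSE.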


## Step 6: Summary of Results

We compare the RMSE of the raw-average prediction with that of the baseline predictor on both the training and test sets.


In [7]:
summary = pd.DataFrame({
    'Metric': ['Raw Average RMSE', 'Baseline Predictor RMSE'],
    'Train RMSE': [rmse_train_avg, rmse_train_base],
    'Test RMSE': [rmse_test_avg, rmse_test_base]
})

print("RMSE Summary:")
print(summary)


RMSE Summary:
                    Metric  Train RMSE  Test RMSE
0         Raw Average RMSE    1.286111   1.457835
1  Baseline Predictor RMSE    1.076531   1.406671
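
To put the two rows on a common footing, the relative RMSE reduction can be computed directly from the reported values. This is a small sketch using the numbers printed above; the dictionary names are hypothetical.

```python
# RMSE values copied from the summary output above.
rmse_avg = {"train": 1.286111, "test": 1.457835}
rmse_base = {"train": 1.076531, "test": 1.406671}

# Relative RMSE reduction of the baseline predictor over the raw average.
improvement = {
    split: (rmse_avg[split] - rmse_base[split]) / rmse_avg[split]
    for split in rmse_avg
}

for split, value in improvement.items():
    print(f"{split}: {value:.1%} lower RMSE")
```

For this run, the baseline predictor lowers RMSE by roughly 16% on the training set but only about 3.5% on the test set. With about 80 observed ratings spread over 10 users and 10 items, each bias is estimated from only a handful of ratings, which may explain why much of the training gain does not carry over. Because the data were generated without a fixed seed, these figures will vary from run to run.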
