## Introduction: Memory-Based User-Based Collaborative Filtering

In this notebook, we implement a **memory-based User-Based Collaborative Filtering (CF)** approach to build a simple recommendation system. Collaborative Filtering is a widely used method for personalized recommendations, relying on the principle that **users with similar preferences in the past will have similar preferences in the future**.

### How It Works
- **User Similarity**: The system identifies users with similar interaction patterns using a similarity metric (e.g., cosine similarity).
- **Neighbors and Aggregation**: Recommendations are generated by aggregating preferences from a target user's most similar neighbors.
- **No Training**: Unlike modern machine learning-based approaches, memory-based CF does not involve explicit training. Instead, it directly uses observed user-item interactions to make predictions.

### Limitations of Memory-Based User-Based CF
1. **Scalability**: 
   - Computing user-user similarities requires comparing every pair of users, so the cost grows quadratically with the number of users.
   - This limits its use in large-scale systems.
2. **Sparsity**:
   - User-item interaction matrices in real-world datasets are highly sparse.
   - With limited overlap in interactions, finding meaningful similarities between users is challenging, leading to poor recommendation quality.
3. **Cold-Start Problem**:
   - New or inactive users with little interaction history cannot receive effective recommendations.
   - Similarly, new items with no ratings cannot be suggested.
4. **No Generalization**:
   - The method does not learn latent patterns. It relies only on observed interactions, making it less robust in sparse or noisy datasets.

### Comparison to Modern Approaches
Modern approaches, such as **matrix factorization** and **deep learning**, address these issues by:
- Learning latent user and item features.
- Capturing relationships beyond direct user-user similarities.
- Generalizing better to sparse data (cold-start cases usually still require side information such as user or item metadata).

Despite its limitations, memory-based CF is conceptually important and serves as a foundation for understanding recommendation systems. It is often used as a baseline or for small-scale systems.

---

### Concepts Behind User-Based CF

#### 1. **Core Idea**:
- Users with similar preferences (e.g., rating patterns) are likely to enjoy the same items.
- For a target user, the system identifies other users with similar interactions and recommends items they have liked but the target user has not yet interacted with.

#### 2. **Steps in User-Based CF**:

##### Step 1: Build the User-Item Interaction Matrix
- Create a matrix where rows represent users, columns represent items, and values represent interactions (e.g., ratings).
- This matrix is usually sparse, as most users interact with only a small subset of items.

##### Step 2: Compute User Similarity
- Calculate the similarity between all pairs of users based on their interaction patterns.
- Common similarity measures include:
  - **Cosine Similarity**: Measures the cosine of the angle between two user vectors.
  - **Pearson Correlation**: Measures the linear correlation between two user vectors.
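
As a small illustration of the two measures listed above, the sketch below computes both for made-up rating vectors using NumPy. The vectors and the helper names `cosine_sim` and `pearson_sim` are hypothetical and not part of the notebook's pipeline; for simplicity it assumes both users rated every item.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pearson_sim(a, b):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    return cosine_sim(a - a.mean(), b - b.mean())

# Hypothetical ratings of three users on the same four items
u = np.array([5.0, 4.0, 5.0, 4.0])  # generous rater
v = np.array([2.0, 1.0, 2.0, 1.0])  # harsh rater, same ups and downs as u
w = np.array([1.0, 2.0, 1.0, 2.0])  # harsh rater, opposite ups and downs

print("cosine(u, v): ", round(cosine_sim(u, v), 3))
print("pearson(u, v):", round(pearson_sim(u, v), 3))
print("cosine(u, w): ", round(cosine_sim(u, w), 3))
print("pearson(u, w):", round(pearson_sim(u, w), 3))
```

Because Pearson correlation removes each user's mean rating, it treats `u` and `v` as perfectly aligned and `u` and `w` as opposed, while cosine similarity on non-negative ratings stays positive in both cases.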

##### Step 3: Find Similar Users
- For a target user, identify the most similar users (neighbors).
- Similar users are those with the highest similarity scores.

##### Step 4: Generate Recommendations
- Aggregate the preferences of similar users to recommend items to the target user.
- Common aggregation methods:
  - Weighted average of ratings from similar users.
  - Top-N items interacted with by similar users but not by the target user.

##### Step 5: Evaluate the Model
- Compare the recommendations to the ground truth from the test set.
- Metrics to evaluate:
  - **Precision@K**: Proportion of recommended items in the top-K that are relevant.
  - **Recall@K**: Proportion of relevant items that are recommended in the top-K.
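
To make the two metrics concrete, here is a minimal sketch for a single user. The item IDs and the helper name `precision_recall_at_k` are made up for illustration.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Return (Precision@K, Recall@K) for one user."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical data
recommended = [10, 20, 30, 40, 50]  # ranked list produced by the recommender
relevant = {20, 50, 60, 70}         # items the user interacted with in the test set

precision, recall = precision_recall_at_k(recommended, relevant, k=5)
print(f"Precision@5: {precision:.2f}")  # 2 hits out of 5 recommendations
print(f"Recall@5: {recall:.2f}")        # 2 hits out of 4 relevant items
```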

---

## Step 1: Create the User-Item Interaction Matrix
- Convert the dataset into a matrix format suitable for similarity computations.



In [21]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
from scipy.sparse import csr_matrix

data_path = 'data/ml-latest-small/'

In [22]:

# Load ratings dataset
ratings = pd.read_csv(data_path + 'ratings.csv')

# Split the ratings into training and test sets (random 80/20 split)
train_data, test_data = train_test_split(ratings, test_size=0.2, random_state=42)

# Build the User-Item Interaction Matrix from the training set
user_item_matrix_train = train_data.pivot(index='userId', columns='movieId', values='rating').fillna(0)
sparse_user_item_matrix_train = csr_matrix(user_item_matrix_train.values)

# Display the User-Item matrix
print("User-Item Interaction Matrix:")
print(user_item_matrix_train.head())
print(f"Matrix Shape: {user_item_matrix_train.shape}")

User-Item Interaction Matrix:
movieId  1       2       3       4       5       6       7       8       \
userId                                                                    
1           4.0     0.0     4.0     0.0     0.0     4.0     0.0     0.0   
2           0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
3           0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
4           0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   
5           0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0   

movieId  9       10      ...  191005  193565  193571  193573  193579  193581  \
userId                   ...                                                   
1           0.0     0.0  ...     0.0     0.0     0.0     0.0     0.0     0.0   
2           0.0     0.0  ...     0.0     0.0     0.0     0.0     0.0     0.0   
3           0.0     0.0  ...     0.0     0.0     0.0     0.0     0.0     0.0   
4           0.0     0.0  ...     0.0     0.0

### Explanation
1. **Pivot Table**:
   - `pivot()` reshapes the data to create rows for users, columns for movies, and values as ratings.

2. **Fill Missing Values**:
   - Missing values are filled with `0` to indicate no interaction. Alternatively, other imputation methods can be used.

3. **Sparse Matrix**:
   - Converting to a sparse matrix reduces memory usage and speeds up similarity computations.
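
The cell above builds the sparse matrix from an already dense pivoted array, so the memory benefit applies only to later computations. As an alternative sketch (not the notebook's method), the sparse matrix can be assembled directly from `(user, item, rating)` triples without creating the dense table first. The triples below are made up and stand in for rows of `ratings.csv`.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical (user, item, rating) triples
user_ids = np.array([1, 1, 2, 3, 3])
item_ids = np.array([10, 30, 20, 10, 20])
values = np.array([4.0, 5.0, 3.0, 2.0, 4.5])

# Map raw IDs to contiguous row/column positions
users, row_idx = np.unique(user_ids, return_inverse=True)
items, col_idx = np.unique(item_ids, return_inverse=True)

# Build the sparse matrix without materializing the dense one
sparse_matrix = csr_matrix(
    (values, (row_idx, col_idx)), shape=(len(users), len(items))
)

# Fraction of cells that actually hold a rating
density = sparse_matrix.nnz / (sparse_matrix.shape[0] * sparse_matrix.shape[1])
print(sparse_matrix.toarray())
print(f"Stored entries: {sparse_matrix.nnz}, density: {density:.2f}")
```

The `users` and `items` arrays keep the mapping from matrix positions back to the original IDs.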

---

## Step 2: Compute User Similarities
- Use a similarity measure (e.g., cosine similarity) to calculate the pairwise similarity between users.

In [23]:
# Compute pairwise user similarities from the training matrix
user_similarities = cosine_similarity(sparse_user_item_matrix_train)
user_similarity_df = pd.DataFrame(user_similarities, index=user_item_matrix_train.index, columns=user_item_matrix_train.index)

# Display a sample of the similarity matrix
print("User-User Similarity Matrix:")
user_similarity_df.head()

User-User Similarity Matrix:


userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
1,1.0,0.016314,0.049021,0.165799,0.123392,0.118556,0.112563,0.142135,0.056088,0.012906,...,0.070901,0.152097,0.187324,0.067264,0.151517,0.139042,0.198771,0.232811,0.112174,0.143902
2,0.016314,1.0,0.0,0.004627,0.0,0.013391,0.029067,0.032754,0.0,0.080739,...,0.170123,0.020395,0.014415,0.0,0.0,0.019846,0.016076,0.05561,0.032404,0.07581
3,0.049021,0.0,1.0,0.0,0.00577,0.004833,0.0,0.005911,0.0,0.0,...,0.006401,0.005889,0.015344,0.0,0.012783,0.008884,0.004642,0.009433,0.0,0.031309
4,0.165799,0.004627,0.0,1.0,0.133565,0.090914,0.094497,0.050417,0.0,0.021991,...,0.075828,0.090252,0.241155,0.054366,0.081585,0.162277,0.083074,0.107276,0.02672,0.068325
5,0.123392,0.0,0.00577,0.133565,1.0,0.238812,0.071386,0.393773,0.0,0.006245,...,0.050523,0.343953,0.101064,0.159651,0.111464,0.086797,0.073278,0.09704,0.205395,0.05309


### Explanation
1. **Cosine Similarity**:
   - Measures the cosine of the angle between two user vectors in the interaction matrix.
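   - In general it ranges from `-1` (opposite direction) to `1` (same direction), with `0` meaning the vectors are orthogonal. Because the ratings here are non-negative, the values fall between `0` and `1`, and `0` means the two users have no rated movies in common.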

2. **Similarity Matrix**:
   - The matrix is symmetric, where entry `(i, j)` represents the similarity between user `i` and user `j`.

3. **Finding Similar Users**:
   - For a target user, sort the similarity scores to identify the most similar users.

---

### Cosine Similarity

Cosine similarity measures the angle between two vectors in a multi-dimensional space. It is defined as:

$$
cosine\_similarity(A, B) = \frac{A \cdot B}{||A|| \, ||B||}
$$

Where:
- $A \cdot B$ is the dot product of vectors $A$ and $B$.
- $||A||$ is the magnitude (norm) of vector $A$.
- $||B||$ is the magnitude (norm) of vector $B$.

### Example Calculation

#### User-Item Interaction Matrix:
| User | Movie 1 | Movie 2 | Movie 3 | Movie 4 |
|------|---------|---------|---------|---------|
| 1    | 5       | 0       | 3       | 0       |
| 2    | 4       | 0       | 3       | 0       |

For **User 1**: $ A = [5, 0, 3, 0] $  
For **User 2**: $ B = [4, 0, 3, 0] $  

The cosine similarity is calculated as:
$$
cosine\_similarity(A, B) = \frac{(5 \cdot 4) + (0 \cdot 0) + (3 \cdot 3) + (0 \cdot 0)}{\sqrt{(5^2 + 0^2 + 3^2 + 0^2)} \cdot \sqrt{(4^2 + 0^2 + 3^2 + 0^2)}}
$$

Simplifying:
$$
cosine\_similarity(A, B) = \frac{20 + 9}{\sqrt{34} \cdot \sqrt{25}} = \frac{29}{\sqrt{850}} \approx 0.995
$$

This high similarity score indicates that the users have very similar preferences.
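
The hand calculation above can be checked with a few lines of NumPy. This is only a verification sketch of the worked example, using the same two vectors.

```python
import numpy as np

A = np.array([5, 0, 3, 0], dtype=float)  # User 1
B = np.array([4, 0, 3, 0], dtype=float)  # User 2

# Dot product divided by the product of the two norms
similarity = float(A @ B / (np.linalg.norm(A) * np.linalg.norm(B)))
print(round(similarity, 3))  # 0.995
```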

---

## Step 3: Identify Neighbors
- For a target user, find the top-N most similar users.



In [24]:
# Function to find top-N neighbors for a given user
def find_top_neighbors(user_id, user_similarity_df, n=5):
    """Find top-N similar users to the given user."""
    # Drop the target user by label so a tie at similarity 1.0 cannot remove a real neighbor
    similar_users = user_similarity_df[user_id].drop(user_id).sort_values(ascending=False).head(n)
    return similar_users

# Example usage: Find top-5 neighbors for user 1
target_user_id = 1
top_neighbors = find_top_neighbors(target_user_id, user_similarity_df, n=5)

print(f"Top 5 neighbors for user {target_user_id}:")
print(top_neighbors)

Top 5 neighbors for user 1:
userId
266    0.324689
368    0.297333
45     0.288969
480    0.288702
217    0.282339
Name: 1, dtype: float64


### Explanation
1. **Identify Neighbors**:
   - Sort the similarity scores for the target user.
   - Exclude the target user from the list.
   - Select the top-N most similar users (neighbors).

2. **Adjustable Number of Neighbors**:
   - Use the `n` parameter to control how many neighbors to consider.
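
The selection steps can also be sketched on a plain NumPy similarity matrix. The matrix and the helper name `top_n_neighbors` below are made up for illustration. The target user is excluded by index rather than by position, which stays correct when another user also has a similarity of exactly 1.0.

```python
import numpy as np

def top_n_neighbors(similarity_matrix, user_idx, n=2):
    """Return (indices, scores) of the n most similar users, excluding user_idx."""
    scores = similarity_matrix[user_idx].astype(float).copy()
    scores[user_idx] = -np.inf  # mask the user's own entry
    order = np.argsort(-scores, kind="stable")[:n]
    return order, scores[order]

# Hypothetical symmetric similarity matrix for four users.
# Users 0 and 2 have identical rating directions, so their similarity is 1.0.
S = np.array([
    [1.0, 0.2, 1.0, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [1.0, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
])

idx, sims = top_n_neighbors(S, user_idx=2, n=2)
print(idx.tolist(), sims.tolist())  # [0, 3] [1.0, 0.4]
```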

---

## Step 4: Generate Recommendations
- Recommend items that similar users have interacted with but the target user has not.

In [25]:
def generate_recommendations(user_id, user_item_matrix, user_similarity_df, n_neighbors=5, top_n=10):
    """Generate recommendations for a given user."""
    neighbors = find_top_neighbors(user_id, user_similarity_df, n=n_neighbors)
    neighbor_ids = neighbors.index
    neighbor_weights = neighbors.values

    # Aggregate ratings from neighbors
    neighbor_ratings = user_item_matrix.loc[neighbor_ids].T
    weighted_ratings = neighbor_ratings.dot(neighbor_weights)

    # Exclude items the user has already interacted with
    user_interactions = user_item_matrix.loc[user_id]
    recommendations = weighted_ratings[user_interactions == 0]

    # Sort recommendations and return top-N items
    top_recommendations = recommendations.sort_values(ascending=False).head(top_n)
    return top_recommendations

# Example usage: Generate recommendations for user 1
target_user_id = 1
recommendations = generate_recommendations(target_user_id, user_item_matrix_train, user_similarity_df, n_neighbors=5, top_n=10)

print(f"Top recommendations for user {target_user_id}:")
print(recommendations)

Top recommendations for user 1:
movieId
1242    6.217099
2115    6.048990
1036    5.494809
1610    5.490178
1198    5.412163
474     5.306376
1527    4.879418
589     4.482618
1221    4.353398
1370    4.332176
dtype: float64


### Explanation
1. **Aggregate Ratings**:
   - Use the ratings from the top-N neighbors, weighted by their similarity to the target user.

2. **Exclude Existing Interactions**:
   - Remove items the user has already interacted with to focus on new recommendations.

3. **Rank and Recommend**:
   - Sort items by their aggregated scores and recommend the top-N items.

---

### Math Behind `generate_recommendations`

The `generate_recommendations` function predicts items for a target user based on their neighbors' ratings. The steps are:

---

#### Step 1: Neighbor Selection
We select the top $ N $ neighbors for the target user $ u $, based on their similarity scores:
$$
S(u, v)
$$
Where:
- $ S(u, v) $ is the similarity score between user $ u $ and user $ v $ (e.g., cosine similarity).

---

#### Step 2: Aggregating Ratings
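The predicted score for an item $ i $ for user $ u $ is the similarity-weighted sum of the neighbors' ratings:

$$
R(u, i) = \sum_{v \in \text{Neighbors}} S(u, v) \cdot R(v, i)
$$

Where:
- $ R(u, i) $: Predicted score of user $ u $ for item $ i $. The sum is not normalized, so it serves as a ranking score and can exceed the rating scale.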
- $ R(v, i) $: Rating of neighbor $ v $ for item $ i $.
- $ S(u, v) $: Similarity between user $ u $ and neighbor $ v $.
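
The formula above is a plain weighted sum, which is why the scores printed earlier go above 5. A common variant, shown here only as an assumption and not as what `generate_recommendations` does, divides by the summed similarity of the neighbors who actually rated the item, which brings the result back onto the rating scale. The data and the helper name `predict_scores` are hypothetical.

```python
import numpy as np

def predict_scores(neighbor_ratings, neighbor_sims, normalize=False):
    """
    neighbor_ratings: (n_neighbors, n_items) array, 0 means "not rated".
    neighbor_sims:    (n_neighbors,) similarity weights S(u, v).
    """
    weighted_sum = neighbor_sims @ neighbor_ratings
    if not normalize:
        return weighted_sum  # the unnormalized sum from the formula above
    # Per item, sum the similarities of only those neighbors who rated it
    rated_mask = (neighbor_ratings > 0).astype(float)
    sim_sum = np.abs(neighbor_sims) @ rated_mask
    # Items no neighbor rated keep a score of 0 instead of dividing by zero
    return np.divide(
        weighted_sum, sim_sum, out=np.zeros_like(weighted_sum), where=sim_sum > 0
    )

# Hypothetical ratings of two neighbors on four items
ratings = np.array([
    [5.0, 0.0, 2.0, 0.0],
    [3.0, 4.0, 0.0, 0.0],
])
sims = np.array([0.5, 0.25])

raw = predict_scores(ratings, sims)
normalized = predict_scores(ratings, sims, normalize=True)
print("weighted sum:    ", raw)
print("weighted average:", normalized)
```

The unnormalized sum favors items rated by many neighbors, while the normalized version estimates a rating and ignores how many neighbors contributed. Which behavior is preferable depends on whether the goal is ranking or rating prediction.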


---

#### Step 3: Exclude Items Already Rated
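To ensure we recommend new items, the function excludes items the target user has already rated. The check uses the observed rating in the training matrix, written here as $ R_{\text{train}}(u, i) $, not the predicted score:

$$
\text{Exclude items where } R_{\text{train}}(u, i) > 0
$$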


---

#### Step 4: Ranking and Recommendation
The remaining items are sorted by their predicted scores $ R(u, i) $, and the top $ N $ items are recommended:

$$
\text{Top-N Recommendations} = \text{Items with highest } R(u, i)
$$


---

### Advantages
- **Weighted Contribution**: Neighbors contribute proportionally to their similarity scores.
- **Personalized Recommendations**: Items are tailored to the target user based on similar users' preferences.

---

### Example: Generate Recommendations for User 1

#### Given Data:

1. **User-Item Interaction Matrix** (partial):
   Each row represents a user, and each column represents an item. Values are ratings, and `0` means no interaction.

   | User \\ Item | Item 1 | Item 2 | Item 3 | Item 4 |
   |--------------|--------|--------|--------|--------|
   | **User 1**   | 5      | 0      | 0      | 0      |
   | **User 2**   | 0      | 4      | 0      | 3      |
   | **User 3**   | 3      | 4      | 0      | 0      |



2. **Similarity Scores**:
   The similarity of User 1 with others:
   - $ S(1, 2) = 0.8 $
   - $ S(1, 3) = 0.6 $

---

#### Step 1: Neighbor Selection

We select the top-2 neighbors for User 1:
- Neighbors: **User 2** and **User 3**
- Similarity scores:
  - $ S(1, 2) = 0.8 $
  - $ S(1, 3) = 0.6 $

---

#### Step 2: Aggregating Ratings

We compute the predicted rating $ R(1, i) $ for each item $ i $ that User 1 has not interacted with. Neither neighbor rated **Item 3**, so its score is 0 and only Items 2 and 4 are worked out below:

For **Item 2**:
$$
R(1, 2) = S(1, 2) \cdot R(2, 2) + S(1, 3) \cdot R(3, 2)
$$
Substitute the values:
$$
R(1, 2) = (0.8 \cdot 4) + (0.6 \cdot 4) = 3.2 + 2.4 = 5.6
$$

For **Item 4**:
$$
R(1, 4) = S(1, 2) \cdot R(2, 4) + S(1, 3) \cdot R(3, 4)
$$
Substitute the values:
$$
R(1, 4) = (0.8 \cdot 3) + (0.6 \cdot 0) = 2.4 + 0 = 2.4
$$

---

#### Step 3: Exclude Items Already Rated

User 1 has already rated **Item 1**, so it is excluded from recommendations.

---

#### Step 4: Ranking and Recommendation

The predicted ratings for the remaining items are:
- **Item 2**: $ R(1, 2) = 5.6 $
- **Item 4**: $ R(1, 4) = 2.4 $

Sort by predicted rating and recommend the top-N items:
1. **Item 2** (Predicted Rating: 5.6)
2. **Item 4** (Predicted Rating: 2.4)

---

#### Final Recommendation for User 1:
- **Item 2**
- **Item 4**
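
The worked example can be reproduced with a short NumPy sketch. It uses the same three-user table and similarity scores; the variable names are chosen only for this illustration.

```python
import numpy as np

# Rows: User 1, User 2, User 3; columns: Item 1..Item 4 (0 = no interaction)
R = np.array([
    [5, 0, 0, 0],
    [0, 4, 0, 3],
    [3, 4, 0, 0],
], dtype=float)
sims = np.array([0.8, 0.6])  # S(1, 2) and S(1, 3)

# Similarity-weighted sum of the two neighbors' ratings for every item
scores = sims @ R[1:]

# Keep only items User 1 has not rated, then rank by score (highest first)
unseen = np.flatnonzero(R[0] == 0)
ranked = unseen[np.argsort(-scores[unseen], kind="stable")]

for item in ranked:
    print(f"Item {item + 1}: {scores[item]:.1f}")
```

Item 3 also appears in the ranking, with a score of 0, because no neighbor rated it. In practice such zero-score items would only surface if fewer scored candidates exist than the requested number of recommendations.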


## Step 5: Evaluate the Model
- Compare recommendations to the ground truth and compute evaluation metrics.

In [26]:
# Function to evaluate recommendations with a proper test set
def evaluate_recommendations(user_id, recommendations, test_data, k=10):
    """Evaluate recommendations using Precision@K and Recall@K."""
    true_items = test_data[test_data['userId'] == user_id]['movieId'].tolist()
    if not true_items:
        return 0.0, 0.0  # No test data for this user

    recommended_items = recommendations.index[:k].tolist()
    relevant_items = set(recommended_items) & set(true_items)

    precision = len(relevant_items) / k
    recall = len(relevant_items) / len(true_items) if true_items else 0

    return precision, recall

# Example: evaluate against the random 80/20 hold-out split created earlier
target_user_id = 1
user_item_matrix_test = test_data.pivot(index='userId', columns='movieId', values='rating').fillna(0)

recommendations = generate_recommendations(target_user_id, user_item_matrix_train, user_similarity_df, n_neighbors=5, top_n=10)
precision, recall = evaluate_recommendations(target_user_id, recommendations, test_data, k=10)

print(f"Top recommendations for user {target_user_id}:")
print(recommendations)
print(f"Precision@10: {precision:.2f}")
print(f"Recall@10: {recall:.2f}")


Top recommendations for user 1:
movieId
1242    6.217099
2115    6.048990
1036    5.494809
1610    5.490178
1198    5.412163
474     5.306376
1527    4.879418
589     4.482618
1221    4.353398
1370    4.332176
dtype: float64
Precision@10: 0.20
Recall@10: 0.05


To evaluate the recommender for every user in the test set, we average the per-user metrics:

In [27]:
# Evaluate recommendations for all users
def evaluate_all_users(user_item_matrix_train, user_similarity_df, test_data, n_neighbors=5, k=10):
    """
    Evaluate the recommendation system for all users in the test set.
    Args:
        user_item_matrix_train (pd.DataFrame): User-Item matrix built from training data.
        user_similarity_df (pd.DataFrame): User similarity matrix.
        test_data (pd.DataFrame): Test data containing ground truth interactions.
        n_neighbors (int): Number of neighbors to consider.
        k (int): Number of top recommendations to evaluate.
    Returns:
        avg_precision (float): Average Precision@K across all users.
        avg_recall (float): Average Recall@K across all users.
    """
    precisions = []
    recalls = []
    
    # Iterate through all unique users in the test set
    for user_id in test_data['userId'].unique():
        # Generate recommendations for the user
        try:
            recommendations = generate_recommendations(
                user_id, user_item_matrix_train, user_similarity_df, n_neighbors=n_neighbors, top_n=k
            )
        except KeyError:
            # If the user does not exist in the training data, skip
            continue
        
        # Evaluate recommendations for the user
        precision, recall = evaluate_recommendations(user_id, recommendations, test_data, k=k)
        
        # Append metrics
        precisions.append(precision)
        recalls.append(recall)
    
    # Compute average metrics
    avg_precision = sum(precisions) / len(precisions) if precisions else 0.0
    avg_recall = sum(recalls) / len(recalls) if recalls else 0.0
    
    return avg_precision, avg_recall

# Example: Evaluate over all users
avg_precision, avg_recall = evaluate_all_users(user_item_matrix_train, user_similarity_df, test_data, n_neighbors=5, k=10)

print(f"Average Precision@10: {avg_precision:.2f}")
print(f"Average Recall@10: {avg_recall:.2f}")


Average Precision@10: 0.23
Average Recall@10: 0.14


## Conclusion and Outlook: Toward Better Collaborative Filtering

In this notebook, we explored a **memory-based User-Based Collaborative Filtering** approach to building a recommendation system. While simple and intuitive, this method has significant limitations when applied to large, sparse datasets. The low average Precision@10 (0.23) and Recall@10 (0.14) are consistent with its difficulties on sparse data and its lack of generalization; scalability is a separate, computational concern that these metrics do not measure.

### Limitations of User-Based CF
1. **Scalability**: The cost of computing pairwise similarities grows quadratically with the number of users.
2. **Sparsity**: Difficulty finding meaningful user-user similarities in sparse interaction matrices.
3. **Cold-Start Problem**: Inability to recommend effectively for new users or items.
