# Collaborative Filtering

1. ***Simple Collaborative Filtering***
- **User-based CF**
- **Item-based CF**
- **Matrix Factorization** (w/Pacakges)

2. **Advanced Collaborative Filtering**
- Neural Collaborative Filtering
- Deep Matrix Factorization



In [122]:
# reset directory
%reset -f

# load libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [123]:
# load data from csv file (AMAZON REVIEWS)
# amazon = pd.read_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/amz_with_senti_1.csv')

# load sample data from csv file (AMAZON REVIEWS)
# amazon = pd.read_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/amz_with_senti_sample_1.csv')

# amazon = amazon[amazon['asin'].isin(amazon['asin'].value_counts()[amazon['asin'].value_counts() > 5].index)] # asin with more than 5 reviews (keep reviews with these products)
# amazon = amazon.groupby('reviewerID').filter(lambda x: len(x) >= 10) # how many users with more than 5 reviews (keep those)
# amazon = amazon.drop_duplicates(subset=['reviewerID', 'asin']) # drop duplicates


# load data (AMAZON BOOK REVIEWS)
# amazon = pd.read_csv('/Users/pavansingh/Desktop/Amazon Review Data/Books/Books_rating.csv')
# amazon_1 = pd.read_csv('/Users/pavansingh/Desktop/Amazon Review Data/Books/books_data.csv')
# amazon = pd.merge(amazon, amazon_1, on='Title', how='left')
# amazon = amazon.rename(columns={'User_id': 'reviewerID', 'Title': 'asin', 'review/score': 'overall'})
# amazon = amazon[amazon['asin'].isin(amazon['asin'].value_counts()[amazon['asin'].value_counts() >= 20].index)] # Title with more than 5 reviews (keep reviews with these products)
# amazon = amazon.groupby('reviewerID').filter(lambda x: len(x) >= 20) # how many users with more than 5 reviews (keep those)
# amazon = amazon.drop_duplicates(subset=['reviewerID', 'asin']) # drop duplicates
# amazon = amazon[['reviewerID', 'asin', 'overall']]
amazon = pd.read_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Masters-Dissertation/Code/Data/book_data.csv', index_col=0)

In [124]:
# check data 
print("Shape of data", amazon.shape)
print("Number of unique users", amazon.reviewerID.nunique())
print("Number of unique products", amazon.asin.nunique())
display(amazon.head(4))

Shape of data (292303, 3)
Number of unique users 7557
Number of unique products 19717


Unnamed: 0,reviewerID,asin,overall
140,A281NPSIMI1C2R,Eyewitness Travel Guide to Europe,5.0
141,A2TAPL67U2A5HM,Eyewitness Travel Guide to Europe,5.0
142,AT9YSY20RJUDX,Eyewitness Travel Guide to Europe,4.0
413,A2KBHSK5BS35BH,Night World: Daughters Of Darkness,1.0


***
#  Simple Collaborative Filtering

## Models

1. User-based CF
2. Item-based CF
3. Matrix Factorization

## Data

Using Amazon Product reviews data which we have cleaned and conducted EDA, feature engineering, and feature selection.

## Evaluation

Model Evaluation: evaluate the performance of our user-based collaborative filtering model by comparing the predicted ratings with the actual ratings provided by users. We split the data into training and testing sets. In the evaluation process of collaborative filtering, we simulate the real-world scenario of making predictions for unseen or unrated items by temporarily hiding some of the user's rated items. These hidden items serve as the testing set, and we use the remaining rated items as the training set.

The process can be summarized as follows:

1. For each user:
- Identify their rated items.
- Randomly split their rated items into two sets: training set and testing set.
- The testing set contains some of the rated items (usually a small percentage), which we temporarily hide from the model.
- The training set contains the remaining rated items, which the model uses to learn about user preferences and calculate user similarity.

2. Using the training set, calculate user similarity and predict ratings for the items in the testing set. We do this by finding similar users and leveraging their ratings to make predictions for the hidden items.

3. Compare the predicted ratings with the actual ratings in the testing set, which were temporarily hidden. Calculate evaluation metrics such as Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) to assess how well the model is predicting user-item ratings for unseen items.

## Applications

### User-Based Collaborative Filtering

1. **change data to user-item matrix**

- We shall be converting the original dataframe into a user-item matrix where each row represents a user, each column represents an item (product), and the cell values are the corresponding ratings given by users.

2. **calculate similarity between users**

- We calculate the similarity between users based on their rating patterns. This is typically done using similarity metrics like cosine similarity or Pearson correlation. The similarity metric measures how similar two users are in terms of their preferences for items.

3. **predict rating for each user-item pair**

- Once we have calculated user similarity, we can use this information to predict the missing (unrated) ratings in the user-item matrix. The idea is to find users who are similar to the target user and use their ratings for unrated items to make predictions.
- we used a weighted sum of the user's past ratings for similar items to predict the ratings

4. **evaluate the performance of the algorithm**

- We need to evaluate how well our predicted ratings match the actual ratings provided by users. This is typically done by splitting the data into training and testing sets, making predictions on the test set, and comparing the predicted ratings with the actual ratings. 

5. **recommend items with highest predicted ratings**

- By identifying the unrated items with the highest predicted ratings, we can suggest new items to users based on their preferences and the preferences of similar users.

**TLDR**: User-based collaborative filtering involves constructing a user-item matrix, calculating user similarity, predicting missing ratings, evaluating the model's performance, and using the model to make personalized item recommendations. 

In [13]:
# import libraries
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error

In [14]:
# user item matrix
user_item_matrix = amazon.pivot(index='reviewerID', columns='asin', values='overall').fillna(0)
print("Shape of user_item_matrix: ", user_item_matrix.shape)

Shape of user_item_matrix:  (7557, 19717)


In [15]:
# normalise ratings
scaler = MinMaxScaler()
normalized_user_item_matrix = pd.DataFrame(scaler.fit_transform(user_item_matrix), columns=user_item_matrix.columns, index=user_item_matrix.index)

In [16]:
# cosine similarity
user_similarity = pd.DataFrame(cosine_similarity(normalized_user_item_matrix), columns=normalized_user_item_matrix.index, index=normalized_user_item_matrix.index)

In [45]:
# make recommendations
def get_user_recommendations(user_id, num_recommendations='all'):
    similar_users = user_similarity[user_id].sort_values(ascending=False)[1:]
    user_reviews = user_item_matrix.loc[user_id]
    recommendations = []

    for product in user_reviews[user_reviews == 0].index:
        similar_users_ratings = normalized_user_item_matrix.loc[similar_users.index, product]
        weighted_rating = (similar_users * similar_users_ratings).sum() / similar_users.sum()
        recommendations.append((product, weighted_rating))

    recommendations.sort(key=lambda x: x[1], reverse=True)

    if num_recommendations == 'all':
        return recommendations
    else:
        return recommendations[:num_recommendations]

# apply recommender function to user
user_id = 'A281NPSIMI1C2R'
recommendations = get_user_recommendations(user_id = user_id, num_recommendations = 5)
print(f"Recommendations for {user_id}:")
for product, rating in recommendations:
    print(f"Product: {product}, Weighted Rating: {rating}")

Recommendations for A281NPSIMI1C2R:
Product: Jane Eyre, Weighted Rating: 0.07768270846022797
Product: Jane Eyre (Everyman's Classics), Weighted Rating: 0.07768270846022797
Product: Jane Eyre (Large Print), Weighted Rating: 0.07768270846022797
Product: Jane Eyre (New Windmill), Weighted Rating: 0.07768270846022797
Product: Jane Eyre (Penguin Classics), Weighted Rating: 0.07768270846022797
Product: Jane Eyre (Signet classics), Weighted Rating: 0.07768270846022797
Product: Jane Eyre (Simple English), Weighted Rating: 0.07768270846022797
Product: Jane Eyre: Complete and Unabridged (Puffin Classics), Weighted Rating: 0.07768270846022797
Product: Wuthering Heights, Weighted Rating: 0.06628760257227284
Product: Wuthering Heights (Penguin Audiobooks), Weighted Rating: 0.06628760257227284
Product: Wuthering Heights (Riverside editions), Weighted Rating: 0.06628760257227284
Product: Wuthering Heights (Signet classics), Weighted Rating: 0.06628760257227284
Product: Wuthering Heights., Weighted Ra

***

In [121]:
# User-based collaborative filtering with Accuracy Metrics
user_id = 'A2TAPL67U2A5HM'
user_item_matrix = amazon.pivot(index='reviewerID', columns='asin', values='overall').fillna(0)

# Scale the ratings in User-Item Matrix
scaler = MinMaxScaler()
normalized_user_item_matrix = pd.DataFrame(scaler.fit_transform(user_item_matrix),
                                            columns=user_item_matrix.columns,
                                            index=user_item_matrix.index)

# Split data into training and testing sets
train_ratings = normalized_user_item_matrix.loc[user_id]

# Identify ratings for testing set
test_ratings = train_ratings[train_ratings != 0].sample(frac=0.2)

# Update the training set by removing the ratings present in the testing set
train_ratings = train_ratings[~train_ratings.index.isin(test_ratings.index) & (train_ratings != 0)]
print("Num Ratings", len(train_ratings))
print("Num ratings (Test)", len(test_ratings))

# Set the testing ratings to 0 in the original user_item_matrix
user_item_matrix.loc[user_id, test_ratings.index] = 0

# get user item matrix for training se
user_similarity = pd.DataFrame(cosine_similarity(normalized_user_item_matrix),
                                columns=normalized_user_item_matrix.index,
                                index=normalized_user_item_matrix.index)

mae_scores = []
rmse_scores = []

similar_users = user_similarity[user_id].sort_values(ascending=False)[1:]
user_reviews = normalized_user_item_matrix.loc[user_id]
recommendations = []

for product in user_reviews[user_reviews == 0].index:
    similar_users_ratings = normalized_user_item_matrix.loc[similar_users.index, product]
    numerator = (similar_users * similar_users_ratings).sum()
    denominator = similar_users.sum()

    if denominator == 0:
        predicted_rating = 0
    else:
        predicted_rating = numerator / denominator

    recommendations.append((product, predicted_rating))

recommendations = pd.DataFrame(recommendations, columns=['product_id', 'predicted_rating'])

# Reverse the scaling of the predicted ratings
recommendations['predicted_rating'] = scaler.inverse_transform(recommendations[['predicted_rating']])

Num Ratings 42
Num ratings (Test) 11


ValueError: non-broadcastable output operand with shape (19664,1) doesn't match the broadcast shape (19664,19717)

In [119]:
recommendations.sort_values(by='predicted_rating', ascending=False).head(10)

Unnamed: 0,product_id,predicted_rating
15289,The Lord of the Rings Trilogy: Three Volumes i...,0.116706
15283,The Lord Of The Rings THREE VOLUME BOXED SET (...,0.116706
15284,The Lord of the Rings (3 Volume Set),0.116706
15286,The Lord of the Rings Box Set,0.116324
15287,The Lord of the Rings Trilogy (The Fellowship ...,0.115725
15285,The Lord of the Rings - Boxed Set,0.115559
15288,The Lord of the Rings Trilogy 3 Volumes,0.115481
5896,Harry Potter and The Sorcerer's Stone,0.096567
15291,The Lord of the Rings: The Fellowship of the R...,0.077487
4845,Fellowship of the Ring 2ND Edition,0.077487


In [116]:
def predict_rating(user_id, item_id, user_similarity, user_item_matrix):
    similar_users = user_similarity[user_id].sort_values(ascending=False)[1:]
    user_reviews = user_item_matrix.loc[user_id]

    numerator = 0
    denominator = 0

    for other_user_id, similarity_score in similar_users.items():
        if user_reviews[other_user_id] != 0:
            numerator += similarity_score * user_reviews[other_user_id]
            denominator += abs(similarity_score)

    if denominator == 0:
        return 0

    predicted_rating = numerator / denominator
    return predicted_rating

In [113]:
recommendations

Unnamed: 0,product_id,weighted_rating
0,"""A"" IS FOR ALIBI",0.004238
1,"""Beatles"" Illustrated Lyrics",0.001426
2,"""C"" is for Corpse (A Kinsey Millhone mystery, ...",0.003137
3,"""Cool Stuff"" They Should Teach in School: Crui...",0.000000
4,"""Could Be Worse!"" (Reading Rainbow Library)",0.000437
...,...,...
19659,the three little pigs,0.000000
19660,the winter prince,0.000669
19661,ttyl,0.002238
19662,using what you got,0.000000


In [111]:
test_ratings

asin
ALICE'S ADVENTURES IN WONDERLAND & THROUGH THE LOOKING GLASS and What Alice Found There (2 volumes)          1.0
Alice's Adventures in Wonderland Throught the Looking Glass                                                  1.0
Twenty Thousand Leagues Under the Sea (Caxton Edition)                                                       1.0
The annotated Jules Verne, Twenty thousand leagues under the sea                                             1.0
Alice's Adventures in Wonderland / Through the Looking-Glass                                                 1.0
Alice's Adventures in Wonderland & Through The Looking Glass (2 Volumes in Slipcase by The Folio Society)    1.0
How to Be a Pirate: The Heroic Misadventures of Hiccup the Viking                                            1.0
Alice's Adventures in Wonderland and Through the Looking Glass (Classic Collection)                          1.0
Twenty Thousand leagues Under the Sea                                                      

In [112]:
predictions

Unnamed: 0,product_id,weighted_rating


In [108]:
predictions = recommendations[recommendations['product_id'].isin(test_ratings.index)]

# merge with test_ratings to get the actual rating
predictions.merge(pd.DataFrame(test_ratings), left_on='product_id', right_on='asin')

KeyError: 'asin'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  predictions['actual'] = test_ratings


Unnamed: 0,product_id,weighted_rating,actual
1022,Alice in Wonderland and Through the Looking Glass,0.202009,
1027,Alice's Adventures in Wonderland Throught the ...,0.202009,
1028,Alice's Adventures in Wonderland and Through t...,0.202009,
1029,Alice's Adventures in Wonderland and Through t...,0.202009,
1030,"Alice's Adventures in Wonderland, and, Through...",0.202009,
2752,"Ceres: Celestial Legend, Vol. 1: Aya",0.001105,
8349,Magic Kingdom for Sale - Sold! (Landover Series),0.008518,
14788,"The Hobbit; Or, There and Back Again",0.392613,
17167,The Wonderful Wizard Of Oz,0.050588,
17168,The Wonderful Wizard of Oz (Lrs Large Print He...,0.013109,


In [94]:
test_ratings

asin
The Wonderful Wizard Of Oz                                                                                1.0
The Wonderful Wizard of Oz (Lrs Large Print Heritage Series)                                              1.0
The Wonderful Wizard of Oz (Oxford World's Classics)                                                      1.0
Ceres: Celestial Legend, Vol. 1: Aya                                                                      1.0
The Hobbit; Or, There and Back Again                                                                      1.0
Alice's Adventures in Wonderland Throught the Looking Glass                                               1.0
Alice's Adventures in Wonderland, and, Through the Looking-Glass                                          1.0
Alice in Wonderland and Through the Looking Glass                                                         1.0
Alice's Adventures in Wonderland and Through the Looking Glass (Classic Collection (Brilliance Audio))    1.0
Alice

Jane Eyre                          0.077683
Jane Eyre (Everyman's Classics)    0.077683
Jane Eyre (Large Print)            0.077683
Jane Eyre (New Windmill)           0.077683
Jane Eyre (Penguin Classics)       0.077683
                                     ...   
the Enemy Within                   0.000000
the New Breed                      0.000000
the lion's paw                     0.000000
using what you got                 0.000000
with an everlasting love           0.000000
Length: 19558, dtype: float64

In [78]:
print("Test:", len(test_ratings))
print("\nPredicted\n", len(predicted_ratings))

Test: 32

Predicted
 0


In [77]:
test_ratings.index
predicted_ratings.index

Index([], dtype='object')

In [74]:
predicted_ratings = predicted_ratings[predicted_ratings.index.isin(test_ratings.index)]
# concat predicted ratings with test ratings (on asin)
acc_df = pd.concat([test_ratings, predicted_ratings], axis=1)
acc_df



Unnamed: 0,A281NPSIMI1C2R,0
"Alice's Adventures in Wonderland, and, Through the Looking-Glass",1.0,
The Portable Dorothy Parker,1.0,
How to Break Your Addiction to a Person,1.0,
"The Perricone Promise: Look Younger, Live Longer in Three Easy Steps",1.0,
Charlotte's web,1.0,
Intimate Issues: 21 Questions Christian Women Ask About Sex,1.0,
Awaken The Giant Within (CD),1.0,
How to Protect Your Children from the National Assault on Innocence,1.0,
How to Really Love Your Child,1.0,
I Can Do It 2005 Calendar,1.0,


In [49]:
# items in predicted that appear in test ratings
predicted_ratings[predicted_ratings.index.isin(test_ratings.index)]

# get predicted and actual ratings
acc_df = 

Series([], dtype: float64)

In [20]:
for user_id in user_item_matrix.index:
    # Split data into training and testing sets
    train_ratings = user_item_matrix.loc[user_id]
    test_ratings = train_ratings[train_ratings != 0].sample(frac=0.2)
    train_ratings = train_ratings[~train_ratings.index.isin(test_ratings.index) & (train_ratings != 0)]

    if len(test_ratings) == 0 or len(train_ratings) == 0:
        print(f'Not enough data for user {user_id}')

    recommendations = get_user_recommendations(user_id, num_recommendations=5)
    predicted_ratings = pd.Series(dict(recommendations))

    common_items = train_ratings.index.intersection(predicted_ratings.index)

    mae_scores.append(mean_absolute_error(train_ratings[common_items], predicted_ratings[common_items]))
    rmse_scores.append(mean_squared_error(train_ratings[common_items], predicted_ratings[common_items], squared=False))

ValueError: Found array with 0 sample(s) (shape=(0,)) while a minimum of 1 is required.

In [48]:
mean_mae = sum(mae_scores) / len(mae_scores)
mean_rmse = sum(rmse_scores) / len(rmse_scores)

print("Mean MAE: ", mean_mae)
print("Mean RMSE: ", mean_rmse)

ZeroDivisionError: division by zero

In [28]:
# function to evaluate model
def evaluate_collaborative_filtering(reviews_df, num_recommendations=5):
    user_item_matrix = reviews_df.pivot(index='reviewerID', columns='asin', values='overall').fillna(0)

    scaler = MinMaxScaler()
    normalized_user_item_matrix = pd.DataFrame(scaler.fit_transform(user_item_matrix),
                                               columns=user_item_matrix.columns,
                                               index=user_item_matrix.index)

    user_similarity = pd.DataFrame(cosine_similarity(normalized_user_item_matrix),
                                   columns=normalized_user_item_matrix.index,
                                   index=normalized_user_item_matrix.index)

    mae_scores = []
    rmse_scores = []

    for user_id in user_item_matrix.index:
        # Split data into training and testing sets
        train_ratings = user_item_matrix.loc[user_id]
        test_ratings = train_ratings[train_ratings == 0]
        train_ratings = train_ratings[train_ratings > 0]

        if len(test_ratings) == 0:
            continue

        recommendations = get_user_recommendations(user_id, num_recommendations)
        predicted_ratings = pd.Series(dict(recommendations))

        common_items = train_ratings.index.intersection(predicted_ratings.index)
        
        if len(common_items) == 0:
            continue


        mae_scores.append(mean_absolute_error(train_ratings[common_items], predicted_ratings[common_items]))
        rmse_scores.append(mean_squared_error(train_ratings[common_items], predicted_ratings[common_items], squared=False))

    mean_mae = sum(mae_scores) / len(mae_scores)
    mean_rmse = sum(rmse_scores) / len(rmse_scores)

    return mean_mae, mean_rmse

# evaluate model
mean_mae, mean_rmse = evaluate_collaborative_filtering(amazon)
print(f"Mean Absolute Error (MAE): {mean_mae}")
print(f"Root Mean Squared Error (RMSE): {mean_rmse}")

ValueError: Found array with 0 sample(s) (shape=(0,)) while a minimum of 1 is required.

### Item-Based Collaborative Filtering

1. change the data to user-item matrix
2. calculate the similarity between items
3. predict the rating of the user for the item
4. recommend the items with the highest rating
5. evaluate the performance of the model

Summary: item-based collaborative filtering  involves constructing a user-item matrix, calculating the similarity between items, predicting the rating of the user for the item, recommending the items with the highest rating, and evaluating the performance of the model. So **instead of finding similar users, we will find similar items based on their rating patterns and use that information to make item recommendations for a target user.**



In [None]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error

In [None]:

# Function to calculate item similarity
def calculate_item_similarity(user_item_matrix):
    item_similarity = pd.DataFrame(cosine_similarity(user_item_matrix.T), columns=user_item_matrix.columns, index=user_item_matrix.columns)
    return item_similarity

# Function to get item recommendations for a user
def get_item_recommendations(user_id, item_similarity, num_recommendations=5):
    user_ratings = user_item_matrix.loc[user_id]
    recommendations = []

    for item in user_ratings[user_ratings == 0].index:
        similar_items = item_similarity[item].sort_values(ascending=False)[1:]
        weighted_rating = (user_ratings * similar_items).sum() / similar_items.sum()
        recommendations.append((item, weighted_rating))

    recommendations.sort(key=lambda x: x[1], reverse=True)

    return recommendations[:num_recommendations]

# Function to evaluate item-based collaborative filtering
def evaluate_item_based_collaborative_filtering(reviews_df, num_recommendations=5):
    user_item_matrix = reviews_df.pivot(index='reviewerID', columns='asin', values='overall').fillna(0)

    scaler = MinMaxScaler()
    normalized_user_item_matrix = pd.DataFrame(scaler.fit_transform(user_item_matrix),
                                               columns=user_item_matrix.columns,
                                               index=user_item_matrix.index)

    item_similarity = calculate_item_similarity(normalized_user_item_matrix)

    mae_scores = []
    rmse_scores = []

    for user_id in user_item_matrix.index:
        # Split data into training and testing sets
        train_ratings = user_item_matrix.loc[user_id]
        test_ratings = train_ratings[train_ratings == 0]
        train_ratings = train_ratings[train_ratings > 0]

        if len(test_ratings) == 0:
            continue

        recommendations = get_item_recommendations(user_id, item_similarity, num_recommendations)
        predicted_ratings = pd.Series(dict(recommendations))

        common_items = train_ratings.index.intersection(predicted_ratings.index)

        mae_scores.append(mean_absolute_error(train_ratings[common_items], predicted_ratings[common_items]))
        rmse_scores.append(mean_squared_error(train_ratings[common_items], predicted_ratings[common_items], squared=False))

    mean_mae = sum(mae_scores) / len(mae_scores)
    mean_rmse = sum(rmse_scores) / len(rmse_scores)

    return mean_mae, mean_rmse

# Load data (replace 'amazon' with your dataframe)
amazon = pd.read_csv('your_data.csv')  # Replace 'your_data.csv' with the actual file path

# Evaluate item-based collaborative filtering
mean_mae, mean_rmse = evaluate_item_based_collaborative_filtering(amazon)
print(f"Mean Absolute Error (MAE): {mean_mae}")
print(f"Root Mean Squared Error (RMSE): {mean_rmse}")

### Matrix Factorization

1. convert the data into a matrix
    - the rows represent the users
    - the columns represent the items
    - the values represent the ratings

<br>

2. decompose the matrix into two matrices
    - one representing the users and the latent features (user matrix)
    - the other representing the items and the latent features (item matrix)

<br>

3. reconstruct the original matrix by taking the dot product of the two matrices from step 2
    - the reconstructed matrix will have the ratings for all users and items

<br>

4. use the reconstructed matrix from step 3 to make predictions
    - the predictions will be the missing ratings in the original matrix

<br>

5. evaluate the performance of the algorithm
    - compare the predictions to the actual values in the original matrix


**Summary**: Matrix factorization is a class of collaborative filtering algorithms used in recommender systems. Matrix factorization algorithms work by decomposing the user-item interaction matrix into the product of two lower dimensionality rectangular matrices. The rows of the first matrix represent the users and the columns of the second matrix represent the items. The lower dimensionality is chosen such that the product of the two matrices is equal to the original matrix. The values in the reconstructed matrix are the predicted ratings for the missing values in the original matrix.

In [135]:
# user item matrix
user_item_matrix = amazon.pivot(index='reviewerID', columns='asin', values='overall')
print("Shape of user_item_matrix: ", user_item_matrix.shape)

# convert user item matrix values to binary
used_items = user_item_matrix.applymap(lambda x: 1 if x > 0 else 0)
used_items.head()

asin,"""A"" IS FOR ALIBI","""Beatles"" Illustrated Lyrics","""C"" is for Corpse (A Kinsey Millhone mystery, Book 3)","""Cool Stuff"" They Should Teach in School: Cruise into the Real World...with styyyle (jobs/people skills/attitude/goals/money)","""Could Be Worse!"" (Reading Rainbow Library)","""D"" is for Deadbeat","""F"" is for fugitive: A Kinsey Millhone mystery","""Hey, Whipple, Squeeze This"": A Guide to Creating Great Ads (Adweek Magazine Series)","""I, the Jury""","""Let's Face it, Men are @$#%"": What Women Can Do About It",...,the jinx ship,the land i Lost: Adventures of a Boy in Vietnam,the lion's paw,the rebels,the story of ferdinand,the three little pigs,the winter prince,ttyl,using what you got,with an everlasting love
reviewerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
A01038432MVI9JXYTTK5T,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
A100V1W0C8BWOL,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
A101DG7P9E26PW,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
A103U0Q3IKSXHE,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
A105E427BB6J65,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [138]:
user_item_matrix.shape[0], 5, user_item_matrix.shape[1]

(7557, 5, 19717)

In [139]:
def  recommender_accuracy(x, observed_rating):
    
    #extract user and movie factors from parameter vector
    n_users, n_factors, n_items = user_item_matrix.shape[0], 5, user_item_matrix.shape[1]
    user_factors = x[:n_users * n_factors].reshape(n_users, n_factors)
    item_factors = x[n_users * n_factors:].reshape(n_factors, n_items)


    # get predictions from dot products of respective user and movie factor
    predicted_ratings = np.dot(user_factors, item_factors)
    
    # convert ratings matrix to numpy array
    observed_rating = observed_rating.to_numpy()
    
    # model accuracy is sum of squared errors over all rated movies
    errors = np.power(observed_rating - predicted_ratings, 2)
    
    # only use rated movies
    rated_items = ~np.isnan(observed_rating)
    mean_error = np.mean(errors[rated_items])
    
    return np.sqrt(mean_error)

# See if it works
recommender_accuracy(x=np.random.rand(7557 * 5 + 5 * 19717), observed_rating=user_item_matrix)

3.190706009187354

In [141]:
def general_recommender_accuracy(x, observed_ratings, n_users, n_items, n_factors):
    user_factors = x[:n_users * n_factors].reshape(n_users, n_factors)
    item_factors = x[n_users * n_factors:].reshape(n_factors, n_items)

    predicted_ratings = np.dot(user_factors, item_factors)
    observed_ratings = observed_ratings.to_numpy()

    errors = (observed_ratings - predicted_ratings) ** 2
    rated_items = ~np.isnan(observed_ratings)

    return  np.sqrt(np.mean(errors[rated_items]))

# See if it works
general_recommender_accuracy(x=np.random.rand(7557 * 5 + 5 * 19717), observed_ratings=user_item_matrix, n_users=7557, n_items=19717, n_factors=5)

3.1873223630866905

In [None]:
# Optimize
from scipy.optimize import minimize
np.random.seed(10)

# BFGS Method
n_users, n_factors, n_items = user_item_matrix.shape[0], 5, user_item_matrix.shape[1]
result = minimize(fun=general_recommender_accuracy,
 x0=np.random.randn(n_users * n_factors + n_factors * n_items),
  args=(user_item_matrix, n_users, n_items, n_factors),
   method='BFGS',
   options={"maxiter":1000})

# see result
print("Convergence using BFGS:", result.success)
print("Minimized Value: ", result.fun)

# different method
result = minimize(fun=general_recommender_accuracy,
 x0=np.random.randn(n_users * n_factors + n_factors * n_items),
  args=(user_item_matrix, n_users, n_items, n_factors),
   method='Nelder-Mead',
   options={"maxiter":1000})

print("\nConvergence using Nelder-Mead:", result.success)
print("Minimized Value: ", result.fun)

In [None]:
# Extract user factors and item factors
user_factors = np.reshape(result.x[:n_users * n_factors], (n_users, n_factors))
item_factors = np.reshape(result.x[n_users * n_factors:], (n_factors, n_items))

# See result
print("User Factor: \n", pd.DataFrame(user_factors).head(6), "\n Shape:", user_factors.shape, "\n")
print("Item Factor: \n", pd.DataFrame(item_factors).head(6), "\n Shape:", item_factors.shape, "\n")


In [None]:
# Check Predictions for User 1
predicted_ratings = np.dot(user_factors, item_factors)

# Get DataFrame for visibility
predicted_ratings_df = pd.DataFrame(predicted_ratings, index=user_item_matrix.index, columns=user_item_matrix.columns)
predicted_ratings_df = np.round(predicted_ratings_df, 1)

# See for User 1 and compare to observed
user_id = 'User 1'  # Replace 'User 1' with the actual ID of the user you want to check
user_1_predictions = predicted_ratings_df.loc[user_id]
user_1_observed = user_item_matrix.loc[user_id]

comparison_df = pd.DataFrame({"Predicted Rating": user_1_predictions, "Observed Rating": user_1_observed})
print(comparison_df)

# Advanced Collaborative Filtering

## Models

1. Neural Collaborative Filtering (NCF)
2. Neural Matrix Factorization (NeuMF)

## Application

### Neural Collaborative Filtering

This piece is a TensorFlow implementation of Neural Collaborative Filtering (NCF) from the paper [He et al. (2017)](https://arxiv.org/pdf/1708.05031.pdf).

Summary: NCF uses neural networks to model the interactions between users and items. 

### Neural Matrix Factorization

This piece is an implementation of a neural matrix factorization model for collaborative filtering. 

The neural network is trained to learn the user and item embeddings that best reconstruct the observed ratings in the interaction matrix. NMF leverages the power of neural networks to capture non-linear relationships between users and items, enabling better representation of the user and item latent features.

**I.e., a neural network-based matrix factorization approach**. NMF directly factorizes the user-item interaction matrix, and the latent features are learned by a neural network.