**COLLABORATIVE FILTER FOR TEST SAMPLE OF 500 UNIQUE USERS**

1. Dataset Sampling:
    - The dataset is sampled to 5000 rows to make the processing manageable:
        (data_sample = data.sample(n=5000, random_state=42))

2. User-Item Matrix:
    - The size of the user-item matrix depends on how many unique users (user_ID) and products (product_ID) exist in the sampled dataset.

        For example, a sample with 500 unique users and 10,000 unique products would give a matrix of shape (500, 10000). The 5000-row sample used in this notebook produces a matrix of shape (4994, 4222).

3. Test Example for RMSE Calculation:
    - The RMSE function iterates over every user in user_item_matrix; it is not limited to a handful of users. For each user, it evaluates only the recommended items (5 by default).
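
To make the shape relationship concrete, here is a minimal sketch on a hypothetical toy frame (the IDs and ratings are made up for illustration). The pivoted matrix has one row per unique user_ID and one column per unique product_ID.

```python
import pandas as pd

# Hypothetical toy ratings; the IDs and values are illustrative only.
toy = pd.DataFrame({
    "user_ID": ["u1", "u1", "u2", "u3"],
    "product_ID": ["p1", "p2", "p2", "p3"],
    "rating": [5, 3, 4, 1],
})

# Same pivot call as the notebook uses, applied to the toy frame.
toy_matrix = toy.pivot_table(index="user_ID", columns="product_ID", values="rating")

# Shape is (number of unique users, number of unique products).
expected_shape = (toy["user_ID"].nunique(), toy["product_ID"].nunique())
print(toy_matrix.shape, expected_shape)
```

With a random 5000-row sample, most users appear only once, which is why the number of rows in the matrix ends up close to the number of sampled rows.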

**Step 1: Import Necessary Libraries**


In [1]:
# Install any missing libraries
!pip install pandas numpy scikit-learn scipy matplotlib




**Step 2: Load the Dataset**


In [14]:
import pandas as pd
import numpy as np

# Load the dataset
file_path = r'.\..\data\data_clean\user_clean_data_ecommerce.csv'

# Read the full CSV file (the smaller sample is drawn further below)
data = pd.read_csv(file_path)

# Select only the columns needed for collaborative filtering
selected_columns = ['user_ID', 'product_ID', 'rating']
data = data[selected_columns]

# Drop rows with missing values in these columns
data = data.dropna()

# Sample a manageable portion (e.g., 5000 rows) for analysis
data_sample = data.sample(n=5000, random_state=42)

print(data_sample.head())


                             user_ID  product_ID  rating
158     AFTWZJUP2224KGWPBCBBLHS7573A  B07NGCSZYY       5
77549   AETIWS5ZNO2BDWPYIKIH27GKWL2Q  B07CJLJZG9       5
463962  AHXT6J3MAV3SSVYLTBJ326JZH7VQ  B07Z5MJ4Q3       5
345085  AEOGI3A7QFFPGOMEK6Z65X5MV4UA  B007JLYEYQ       5
274505  AGKUCVQPPOXFR5AOTK6FZEZUEGOQ  B07BR1J7HG       5


**Step 3: Create a User-Item Matrix**


We need a matrix where rows represent users, columns represent products, and the values are the ratings:

In [15]:
# Create the User-Item matrix
user_item_matrix = data_sample.pivot_table(index='user_ID', columns='product_ID', values='rating')

# Fill missing values with 0 (after this step an unrated product looks like a rating of 0)
user_item_matrix = user_item_matrix.fillna(0)

print(user_item_matrix.shape)


(4994, 4222)
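
Since almost every entry of this matrix is 0, a sparse representation may save memory. The sketch below is one possible approach, not the method used in this notebook: the helper name to_sparse_user_item and the toy data are hypothetical. Note that csr_matrix sums duplicate (user, product) pairs, whereas pivot_table averages them by default.

```python
import pandas as pd
from scipy.sparse import csr_matrix

def to_sparse_user_item(ratings):
    """Build a CSR user-item matrix plus its row and column labels."""
    users = ratings["user_ID"].astype("category")
    products = ratings["product_ID"].astype("category")
    matrix = csr_matrix(
        (
            ratings["rating"].to_numpy(dtype=float),
            (users.cat.codes.to_numpy(), products.cat.codes.to_numpy()),
        ),
        shape=(users.cat.categories.size, products.cat.categories.size),
    )
    return matrix, users.cat.categories, products.cat.categories

# Hypothetical toy ratings; the IDs and values are illustrative only.
toy = pd.DataFrame({
    "user_ID": ["u1", "u1", "u2", "u3"],
    "product_ID": ["p1", "p2", "p2", "p3"],
    "rating": [5, 3, 4, 1],
})

sparse_matrix, user_index, product_index = to_sparse_user_item(toy)
print(sparse_matrix.shape, sparse_matrix.nnz)
```

TruncatedSVD accepts sparse input, so a matrix built this way could be passed to fit_transform directly.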


**Step 4: Use Collaborative Filtering (Singular Value Decomposition - SVD)**


We'll use TruncatedSVD from scikit-learn for dimensionality reduction:

In [16]:
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Perform SVD
svd = TruncatedSVD(n_components=500, random_state=42)
decomposed_matrix = svd.fit_transform(user_item_matrix)

# Calculate user-user cosine similarity in the reduced space
similarity_matrix = cosine_similarity(decomposed_matrix)

# Convert to a DataFrame for interpretability
similarity_df = pd.DataFrame(similarity_matrix, index=user_item_matrix.index, columns=user_item_matrix.index)

print(similarity_df.head())


user_ID                       AE234757Z3N6KU76N3GUKO73IJDA   
user_ID                                                      
AE234757Z3N6KU76N3GUKO73IJDA                      1.000000  \
AE23D7HHJAMENM7IKA4IOIHNS7OA                      0.032573   
AE25FOZEGSDS6SD342N7M7ZGAN6Q                     -0.021844   
AE25UKBBGGC72TTWTI5RKAJEC3FQ                     -0.000738   
AE26DIN5RJVLKQWNO2RF4PKX33LA                      0.040270   

user_ID                       AE23D7HHJAMENM7IKA4IOIHNS7OA   
user_ID                                                      
AE234757Z3N6KU76N3GUKO73IJDA                      0.032573  \
AE23D7HHJAMENM7IKA4IOIHNS7OA                      1.000000   
AE25FOZEGSDS6SD342N7M7ZGAN6Q                     -0.123499   
AE25UKBBGGC72TTWTI5RKAJEC3FQ                     -0.000105   
AE26DIN5RJVLKQWNO2RF4PKX33LA                     -0.055539   

user_ID                       AE25FOZEGSDS6SD342N7M7ZGAN6Q   
user_ID                                                      
AE2347
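
The choice of n_components can be sanity-checked by looking at how much variance the retained components explain. The following is a small sketch on synthetic data (the matrix toy_ratings and the component counts are assumptions for illustration, not values from this dataset); the same check could be run on the fitted svd object via its explained_variance_ratio_ attribute.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(42)

# Hypothetical ratings: 200 users x 50 products, roughly 5% of cells filled.
toy_ratings = rng.integers(1, 6, size=(200, 50)) * (rng.random((200, 50)) < 0.05)

svd_check = TruncatedSVD(n_components=20, random_state=42)
svd_check.fit(toy_ratings.astype(float))

# Cumulative share of variance explained by the first k components.
cumulative = np.cumsum(svd_check.explained_variance_ratio_)
for k in (5, 10, 20):
    print(f"{k} components explain {cumulative[k - 1]:.1%} of the variance")
```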

**Step 5: Generate Recommendations**


Using the similarity matrix, recommend products for a given user:

In [17]:
def recommend_products(user_id, similarity_df, user_item_matrix, n_recommendations=5):
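    # Sort all users by similarity to the target user (every user is kept, not only the closest ones)
    similar_users = similarity_df[user_id].sort_values(ascending=False)
    
    # Get the ratings of those users (the row order changes, but all rows are included)
    similar_users_ratings = user_item_matrix.loc[similar_users.index]
    
    # Sum the ratings for each product (unweighted, so the similarity scores do not affect the totals)
    recommended_products = similar_users_ratings.sum(axis=0).sort_values(ascending=False)
    
    # Keep only products the target user has not rated (rating == 0).
    # Note: indexing by the matrix columns returns products in column order, not sorted by score.
    user_products = user_item_matrix.loc[user_id]
    recommended_products = recommended_products[user_products[user_products == 0].index]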
    
    return recommended_products.head(n_recommendations)

# Example: Recommend products for a specific user
user_id = user_item_matrix.index[0]  # Replace with a valid user ID
recommendations = recommend_products(user_id, similarity_df, user_item_matrix)
print("Recommended Products:", recommendations)


Recommended Products: product_ID
1453085815     3.0
B000050FDE    14.0
B00005JKQ4     5.0
B000068PBM     5.0
B00009RB0X    10.0
dtype: float64
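
The function above sums ratings across all users without using the similarity values as weights, and its final selection follows the column order of the matrix, which is why the scores printed above are not in descending order. A possible variant is sketched below. It is an assumption about what a similarity-weighted version could look like, not the notebook's method: the name recommend_products_weighted, the n_neighbors parameter, and the toy matrices are all hypothetical.

```python
import pandas as pd

def recommend_products_weighted(user_id, similarity_df, user_item_matrix,
                                n_recommendations=5, n_neighbors=20):
    """Hypothetical variant: score unrated products by similarity-weighted ratings."""
    # Most similar users, excluding the target user.
    neighbors = (
        similarity_df[user_id]
        .drop(user_id)
        .sort_values(ascending=False)
        .head(n_neighbors)
    )
    # Keep only positively similar neighbours so the weights stay non-negative.
    neighbors = neighbors[neighbors > 0]

    # Weight each neighbour's ratings by similarity, then sum per product.
    neighbor_ratings = user_item_matrix.loc[neighbors.index]
    scores = neighbor_ratings.mul(neighbors, axis=0).sum(axis=0)

    # Drop products the target user has already rated, then sort by score.
    unrated = user_item_matrix.loc[user_id] == 0
    return scores[unrated].sort_values(ascending=False).head(n_recommendations)

# Hypothetical toy inputs; the similarity values are hand-picked for illustration.
toy_matrix = pd.DataFrame(
    [[5, 0, 0], [5, 4, 0], [0, 0, 3]],
    index=["u1", "u2", "u3"], columns=["p1", "p2", "p3"], dtype=float,
)
toy_similarity = pd.DataFrame(
    [[1.0, 0.8, 0.1], [0.8, 1.0, 0.0], [0.1, 0.0, 1.0]],
    index=toy_matrix.index, columns=toy_matrix.index,
)

toy_recs = recommend_products_weighted("u1", toy_similarity, toy_matrix, n_recommendations=2)
print(toy_recs)
```

In the toy example, u2 is the closest neighbour of u1 and has rated p2, so p2 ranks ahead of p3.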


**Step 6: Evaluation**


Evaluate the model with a metric such as precision and recall, or RMSE. Here we use RMSE.

Because y_true and y_pred can differ in length, the calculate_rmse function compares predictions only for items that exist in both 'actual_ratings' and 'predicted_ratings'.

In [18]:
from sklearn.metrics import mean_squared_error

def calculate_rmse(user_item_matrix, similarity_df, n_recommendations=5):
    mse = 0
    count = 0
    
    for user_id in user_item_matrix.index:
        # Get the recommended products and their summed-rating scores
        recommended_products = recommend_products(user_id, similarity_df, user_item_matrix, n_recommendations)
        
        # Get the user's matrix entries for the recommended products
        actual_ratings = user_item_matrix.loc[user_id, recommended_products.index]
        
        # Drop NaN entries; after fillna(0) in Step 3 there are none, so unrated items remain as 0
        actual_ratings = actual_ratings.dropna()
        predicted_ratings = recommended_products.loc[actual_ratings.index]
        
        # Calculate MSE for the current user (only if there are common items)
        if not actual_ratings.empty:
            mse += mean_squared_error(actual_ratings, predicted_ratings)
            count += 1

    # Calculate RMSE
    rmse = np.sqrt(mse / count) if count > 0 else None
    return rmse

# Call the function
rmse = calculate_rmse(user_item_matrix, similarity_df)
print("RMSE:", rmse)


RMSE: 8.424612092357014


**Explanation of Modification**


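1. Align Actual and Predicted Ratings:
    - actual_ratings holds the user's matrix entries for the recommended products.
    - predicted_ratings is restricted to the same index as actual_ratings.
    - This alignment ensures both arrays have the same length.


2. Handle Missing Ratings:
    - dropna() removes items whose actual rating is missing. Because Step 3 fills missing values with 0, nothing is removed here, and the recommended (unrated) products all have an actual rating of 0.


3. Avoid Division by Zero:
    - If no overlapping items exist for a user, that user is skipped. If no user contributes, the function returns None.


4. Aggregate RMSE:
    - The per-user MSE values are averaged over all contributing users, and the square root of that average is returned.
    - Since the predicted values are summed ratings and the actual values are 0, the reported figure reflects the size of those sums rather than rating-prediction error.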
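
One way to obtain an RMSE that compares predictions with real ratings is to hold out some known ratings before building the matrix and score only those. The sketch below illustrates the idea under stated assumptions: the function name holdout_rmse, the similarity-weighted-average prediction rule, and the toy inputs are hypothetical and are not part of this notebook. It also assumes every held-out user and product still appears in the training matrix.

```python
import numpy as np
import pandas as pd

def holdout_rmse(train_matrix, similarity_df, test_ratings):
    """RMSE on held-out (user_ID, product_ID, rating) rows.

    Each prediction is a similarity-weighted average of other users' ratings
    for the product (an assumed rule, chosen for illustration).
    """
    squared_errors = []
    for row in test_ratings.itertuples(index=False):
        # Ratings given to this product by every other user.
        item_ratings = train_matrix[row.product_ID].drop(row.user_ID)
        raters = item_ratings[item_ratings > 0]

        # Non-negative similarity weights for the users who rated it.
        weights = similarity_df.loc[row.user_ID, raters.index].clip(lower=0)
        if weights.sum() == 0:
            continue  # no usable neighbour for this pair

        prediction = (raters * weights).sum() / weights.sum()
        squared_errors.append((row.rating - prediction) ** 2)

    return float(np.sqrt(np.mean(squared_errors))) if squared_errors else None

# Hypothetical toy inputs: u1's rating of p2 (3) is held out of the training matrix.
train = pd.DataFrame(
    [[5, 0], [4, 2], [0, 1]],
    index=["u1", "u2", "u3"], columns=["p1", "p2"], dtype=float,
)
sim = pd.DataFrame(
    [[1.0, 0.5, 0.5], [0.5, 1.0, 0.0], [0.5, 0.0, 1.0]],
    index=train.index, columns=train.index,
)
held_out = pd.DataFrame({"user_ID": ["u1"], "product_ID": ["p2"], "rating": [3]})

toy_rmse = holdout_rmse(train, sim, held_out)
print("Hold-out RMSE:", toy_rmse)
```

In the toy case, the prediction for u1 on p2 is the equally weighted average of 2 and 1, so the error against the held-out rating of 3 is 1.5.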