<a href="https://colab.research.google.com/github/reckn/super-disco/blob/main/choosing_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Surprise Library Evaluation 🎁

### Overview
This notebook uses the Surprise library to compare recommendation algorithms on synthetic data. It loads the customer interactions, purchase history, and product details datasets in chunks, merges them, and splits the resulting ratings into train and test sets. It then trains each algorithm and reports Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and execution time (training plus prediction).

### Functions and Workflow Explained:

#### 1. `process_data_in_batches(filename, chunksize)`
   - **Purpose**: Process data in batches to handle large datasets efficiently.
   - **Inputs**:
       - `filename` (str): Path to the data file.
       - `chunksize` (int): Size of each batch.
   - **Output**: Generator yielding data chunks.
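
To make the generator's behaviour concrete, here is a minimal sketch that reads a tiny in-memory CSV in chunks. The CSV text and the `chunk_sizes` name are hypothetical stand-ins; the function body mirrors the one used in the cells below.

```python
import io

import pandas as pd


def process_data_in_batches(source, chunksize):
    """Yield successive DataFrames holding at most `chunksize` rows."""
    # With `chunksize` set, pd.read_csv returns an iterator of DataFrames
    for chunk in pd.read_csv(source, chunksize=chunksize):
        yield chunk


# Hypothetical five-row CSV standing in for a file on disk
csv_text = "Customer ID,Page Views\n1,10\n2,20\n3,30\n4,40\n5,50\n"

chunk_sizes = [
    len(chunk)
    for chunk in process_data_in_batches(io.StringIO(csv_text), chunksize=2)
]
print(chunk_sizes)  # [2, 2, 1]
```

The last chunk is smaller whenever the row count is not a multiple of `chunksize`.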

#### 2. Processing and Splitting Data
   - **Purpose**: Merge datasets, replace NaN values with 0, and filter relevant columns.
   - **Workflow**:
       - Read data in chunks.
       - Merge datasets and handle missing values.
       - Filter relevant columns (Customer ID, Product ID, Ratings).
       - Define the Reader object and load data into Surprise Dataset.
       - Split the data into train and test sets.
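
The merge steps above can be sketched on toy frames. This is an illustration under assumptions, not the notebook's exact approach: the three small DataFrames are made up, and instead of filling missing ratings with 0 (which falls outside the 1 to 5 rating scale) the sketch counts unmatched rows with `indicator=True` and drops them.

```python
import pandas as pd

# Hypothetical toy tables using the notebook's column names
customers = pd.DataFrame({"Customer ID": [1, 2], "Page Views": [10, 20]})
purchases = pd.DataFrame({"Customer ID": [1, 3], "Product ID": [100, 200]})
products = pd.DataFrame({"Product ID": [100], "Ratings": [4.5]})

# Right merge keeps every purchase, even when the customer row is missing
merged = pd.merge(customers, purchases, on="Customer ID", how="right")

# Left merge keeps every purchase; `_merge` records whether a product matched
merged = pd.merge(merged, products, on="Product ID", how="left", indicator=True)

# Purchases whose product was not found have no rating
unmatched = int((merged["_merge"] == "left_only").sum())

# Keep only rows that carry a real rating
ratings = merged[["Customer ID", "Product ID", "Ratings"]].dropna(subset=["Ratings"])

print(unmatched)     # 1
print(len(ratings))  # 1
```

Whether to drop or fill unmatched rows is a design choice; dropping avoids feeding ratings of 0 into a `Reader` declared with `rating_scale=(1, 5)`.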

#### 3. Algorithm Evaluation
   - **Purpose**: Train various recommendation algorithms and evaluate their performance.
   - **Workflow**:
       - Define a list of algorithms.
       - Iterate over algorithms and train/test each.
       - Calculate RMSE, MAE, and execution time for each algorithm.
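
For reference, the two accuracy metrics can be written out in plain Python. This is a sketch of the standard definitions, not Surprise's implementation; the `pairs` list of (true rating, estimated rating) tuples is made up.

```python
import math


def rmse(pairs):
    """Root Mean Squared Error over (true, estimated) rating pairs."""
    return math.sqrt(sum((t - e) ** 2 for t, e in pairs) / len(pairs))


def mae(pairs):
    """Mean Absolute Error over (true, estimated) rating pairs."""
    return sum(abs(t - e) for t, e in pairs) / len(pairs)


# Hypothetical predictions: errors of 1, 2 and 0 rating points
pairs = [(4.0, 3.0), (2.0, 4.0), (5.0, 5.0)]

print(f"RMSE: {rmse(pairs):.4f}")  # RMSE: 1.2910
print(f"MAE: {mae(pairs):.4f}")    # MAE: 1.0000
```

RMSE squares each error before averaging, so the single 2-point miss weighs more heavily than it does in MAE.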

### How to Use:
1. Ensure you have the necessary CSV files containing customer interactions, purchase history, and product details.
2. Adjust the chunk size for batch processing based on memory constraints.
3. Run the script to evaluate recommendation algorithms.
4. Check the printed output for algorithm performance metrics and execution times.

### Additional Notes:
- The Surprise library provides a convenient framework for building and evaluating recommendation systems.
- Batch processing keeps memory use bounded, but each batch is merged and evaluated independently, so the reported metrics vary with the chunk size.
- Experiment with different algorithms and parameters to find the best balance between accuracy and run time.


In [1]:
pip install surprise

Collecting surprise
  Downloading surprise-0.1-py2.py3-none-any.whl (1.8 kB)
Collecting scikit-surprise (from surprise)
  Downloading scikit-surprise-1.1.3.tar.gz (771 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 772.0/772.0 kB 5.4 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.3-cp310-cp310-linux_x86_64.whl size=3163007 sha256=db81d743c9d5d2028bdbd3ba01cdbbb093d7ce9c516823c5c6d6c109db42a82b
  Stored in directory: /root/.cache/pip/wheels/a5/ca/a8/4e28def53797fdc4363ca4af740db15a9c2f1595ebc51fb445
Successfully built scikit-surprise
Installing collected packages: scikit-surprise, surprise
Successfully installed scikit-surprise-1.1.3 surprise-0.1


This part of the notebook is an experiment using the 'purchase_history' dataframe of size 300K.

In [2]:
import pandas as pd
from surprise import Dataset, Reader, KNNBasic, KNNWithMeans, KNNWithZScore, BaselineOnly, CoClustering, SlopeOne, NMF, SVD, SVDpp
from surprise.model_selection import train_test_split
from surprise import accuracy
import time

# Define function to process data in batches
def process_data_in_batches(filename, chunksize):
    reader = pd.read_csv(filename, chunksize=chunksize)
    for chunk in reader:
        yield chunk

# Set the chunk size for batch processing
chunksize = 10000  # Adjust based on your memory constraints

# Process and split data into train and test sets in batches
customer_chunks = process_data_in_batches('/content/drive/MyDrive/fair_dataset/customer_interactions_synthetic.csv', chunksize)
purchase_chunks = process_data_in_batches('/content/drive/MyDrive/fair_dataset/purchase_history_synthetic.csv', chunksize)
product_chunks = process_data_in_batches('/content/drive/MyDrive/fair_dataset/product_details_synthetic.csv', chunksize)

for customer_chunk, purchase_chunk, product_chunk in zip(customer_chunks, purchase_chunks, product_chunks):
    # Merge datasets
    merged_data = pd.merge(customer_chunk, purchase_chunk, on='Customer ID', how='right')
    merged_data = pd.merge(merged_data, product_chunk, on='Product ID', how='left')

    # Replace all NaN values with 0
    merged_data.fillna(0, inplace=True)

    # Filter relevant columns
    data = merged_data[['Customer ID', 'Product ID', 'Ratings']]

    # Define the Reader object
    reader = Reader(rating_scale=(1, 5))

    # Load the data into Surprise Dataset
    data = Dataset.load_from_df(data, reader)

    # Split the data into train and test sets
    trainset, testset = train_test_split(data, test_size=0.2)

    # List of algorithms
    algorithms = [
        KNNBasic(),
        KNNWithMeans(),
        KNNWithZScore(),
        BaselineOnly(),
        CoClustering(),
        SlopeOne(),
        NMF(),
        SVD(),
        SVDpp()
    ]

    # Iterate over algorithms
    for algo in algorithms:
        start_time = time.time()
        # Train the algorithm on the trainset
        algo.fit(trainset)

        # Test the algorithm on the testset
        predictions = algo.test(testset)

        # Evaluate the model
        rmse = accuracy.rmse(predictions, verbose=False)
        mae = accuracy.mae(predictions, verbose=False)
        execution_time = time.time() - start_time
        print(f"{algo.__class__.__name__} RMSE: {rmse:.4f}, MAE: {mae:.4f}, Execution Time: {execution_time:.2f} seconds")


Computing the msd similarity matrix...
Done computing similarity matrix.
KNNBasic RMSE: 1.1491, MAE: 0.9811, Execution Time: 0.58 seconds
Computing the msd similarity matrix...
Done computing similarity matrix.
KNNWithMeans RMSE: 1.3657, MAE: 1.1316, Execution Time: 0.61 seconds
Computing the msd similarity matrix...
Done computing similarity matrix.
KNNWithZScore RMSE: 1.3668, MAE: 1.1333, Execution Time: 0.76 seconds
Estimating biases using als...
BaselineOnly RMSE: 0.6968, MAE: 0.5888, Execution Time: 0.03 seconds
CoClustering RMSE: 1.1539, MAE: 0.9562, Execution Time: 0.44 seconds
SlopeOne RMSE: 1.3644, MAE: 1.1286, Execution Time: 0.06 seconds
NMF RMSE: 0.9789, MAE: 0.7770, Execution Time: 0.44 seconds
SVD RMSE: 0.6480, MAE: 0.5401, Execution Time: 0.11 seconds
SVDpp RMSE: 0.5929, MAE: 0.4910, Execution Time: 0.12 seconds
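
The run above prints a single set of results. One possible explanation is that `zip` stops as soon as its shortest input is exhausted, so the loop runs only as many times as the smallest file has chunks. `zip` also pairs chunks by position, so rows from different files only join when their keys happen to fall in the same chunk. The sketch below illustrates the truncation with stand-in generators; the `batches` helper and the 1,200-row product table are hypothetical, not taken from the data.

```python
def batches(n_rows, chunksize):
    """Yield (start, stop) row ranges, mimicking a chunked CSV reader."""
    for start in range(0, n_rows, chunksize):
        yield (start, min(start + chunksize, n_rows))


chunksize = 10000
purchase_batches = batches(300_000, chunksize)  # 30 batches
product_batches = batches(1_200, chunksize)     # hypothetical small table: 1 batch

# zip ends when the shorter generator runs out
iterations = sum(1 for _ in zip(purchase_batches, product_batches))
print(iterations)  # 1
```

If that is what happens here, loading the small lookup tables whole and chunking only the large purchase file would let every purchase row be processed.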


In [3]:
import pandas as pd
from surprise import Dataset, Reader, KNNBasic, KNNWithMeans, KNNWithZScore, BaselineOnly, CoClustering, SlopeOne, NMF, SVD, SVDpp
from surprise.model_selection import train_test_split
from surprise import accuracy
import time

# Define function to process data in batches
def process_data_in_batches(filename, chunksize):
    reader = pd.read_csv(filename, chunksize=chunksize)
    for chunk in reader:
        yield chunk

# Set the chunk size for batch processing
chunksize = 10000  # Adjust based on your memory constraints

# Process and split data into train and test sets in batches
customer_chunks = process_data_in_batches('/content/drive/MyDrive/fair_dataset/customer_interactions_synthetic.csv', chunksize)
purchase_chunks = process_data_in_batches('/content/drive/MyDrive/fair_dataset/purchase_history_synthetic.csv', chunksize)
product_chunks = process_data_in_batches('/content/drive/MyDrive/fair_dataset/product_details_synthetic.csv', chunksize)

for customer_chunk, purchase_chunk, product_chunk in zip(customer_chunks, purchase_chunks, product_chunks):
    # Merge datasets
    merged_data = pd.merge(customer_chunk, purchase_chunk, on='Customer ID', how='right')
    merged_data = pd.merge(merged_data, product_chunk, on='Product ID', how='left')

    # Replace all NaN values with 0
    merged_data.fillna(0, inplace=True)

    # Filter relevant columns
    data = merged_data[['Customer ID', 'Product ID', 'Ratings', 'Price_x', 'Page Views_x', 'Page Views_y', 'Time Spent (minutes)']]

    # Define the Reader object
    reader = Reader(rating_scale=(1, 5), line_format='user item rating')

    # Load the data into Surprise Dataset
    data = Dataset.load_from_df(data[['Customer ID', 'Product ID', 'Ratings']], reader)

    # Split the data into train and test sets
    trainset, testset = train_test_split(data, test_size=0.2)

    # List of algorithms
    algorithms = [
        KNNBasic(),
        KNNWithMeans(),
        KNNWithZScore(),
        BaselineOnly(),
        CoClustering(),
        SlopeOne(),
        NMF(),
        SVD(),
        SVDpp()
    ]

    # Iterate over algorithms
    for algo in algorithms:
        start_time = time.time()
        # Train the algorithm on the trainset
        algo.fit(trainset)

        # Test the algorithm on the testset
        predictions = algo.test(testset)

        # Evaluate the model
        rmse = accuracy.rmse(predictions, verbose=False)
        mae = accuracy.mae(predictions, verbose=False)
        execution_time = time.time() - start_time
        print(f"{algo.__class__.__name__} RMSE: {rmse:.4f}, MAE: {mae:.4f}, Execution Time: {execution_time:.2f} seconds")


Computing the msd similarity matrix...
Done computing similarity matrix.
KNNBasic RMSE: 1.1412, MAE: 0.9802, Execution Time: 0.64 seconds
Computing the msd similarity matrix...
Done computing similarity matrix.
KNNWithMeans RMSE: 1.3684, MAE: 1.1261, Execution Time: 0.64 seconds
Computing the msd similarity matrix...
Done computing similarity matrix.
KNNWithZScore RMSE: 1.3688, MAE: 1.1267, Execution Time: 0.74 seconds
Estimating biases using als...
BaselineOnly RMSE: 0.6873, MAE: 0.5854, Execution Time: 0.03 seconds
CoClustering RMSE: 1.1330, MAE: 0.8916, Execution Time: 0.43 seconds
SlopeOne RMSE: 1.3678, MAE: 1.1245, Execution Time: 0.06 seconds
NMF RMSE: 0.9501, MAE: 0.7514, Execution Time: 0.45 seconds
SVD RMSE: 0.6445, MAE: 0.5407, Execution Time: 0.12 seconds
SVDpp RMSE: 0.5870, MAE: 0.4888, Execution Time: 0.11 seconds


In [4]:
merged_data.head()

Unnamed: 0,Customer ID,Page Views_x,Time Spent (minutes),Avatar,Product ID,Purchase Date,Category_x,Price_x,Page Views_y,Category_y,Price_y,Ratings,Product Icon
0,1217225,178,134.0,https://raw.githubusercontent.com/reckn/super-...,942,2023-10-05 06:51:34.943239,Electronics,3787.672607,3,Electronics,3787.672607,2.6,https://raw.githubusercontent.com/reckn/super-...
1,3670929,182,150.0,https://raw.githubusercontent.com/reckn/super-...,884,2023-12-23 10:09:31.943314,Electronics,3988.595379,1,Electronics,3988.595379,4.8,https://raw.githubusercontent.com/reckn/super-...
2,3103312,63,50.0,https://raw.githubusercontent.com/reckn/super-...,232,2023-08-27 14:17:02.943330,Consumer Electronics Accessories,140.166011,1,Consumer Electronics Accessories,140.166011,1.2,https://raw.githubusercontent.com/reckn/super-...
3,980994,104,89.0,https://raw.githubusercontent.com/reckn/super-...,631,2023-12-06 02:52:11.943345,Home and Kitchen Appliances,272.764911,3,Home and Kitchen Appliances,272.764911,2.4,https://raw.githubusercontent.com/reckn/super-...
4,4941220,178,142.0,https://raw.githubusercontent.com/reckn/super-...,1005,2023-09-17 08:53:39.943360,Beauty and Personal Care Products,164.574978,3,Beauty and Personal Care Products,164.574978,3.2,https://raw.githubusercontent.com/reckn/super-...


In [5]:
len(data.raw_ratings)

10000

In [6]:
import pandas as pd
from surprise import Dataset, Reader, KNNBasic, KNNWithMeans, KNNWithZScore, BaselineOnly, CoClustering, SlopeOne, NMF, SVD, SVDpp
from surprise.model_selection import train_test_split
from surprise import accuracy
import time

# Define function to process data in batches
def process_data_in_batches(filename, chunksize):
    reader = pd.read_csv(filename, chunksize=chunksize)
    for chunk in reader:
        yield chunk

# Set the chunk size for batch processing
chunksize = 10000  # Adjust based on your memory constraints

# Load customer interactions data
customer_interactions = pd.read_csv('/content/drive/MyDrive/fair_dataset/customer_interactions_synthetic.csv')

# Process and split data into train and test sets in batches
customer_chunks = process_data_in_batches('/content/drive/MyDrive/fair_dataset/customer_interactions_synthetic.csv', chunksize)
purchase_chunks = process_data_in_batches('/content/drive/MyDrive/fair_dataset/purchase_history_synthetic.csv', chunksize)
product_chunks = process_data_in_batches('/content/drive/MyDrive/fair_dataset/product_details_synthetic.csv', chunksize)

for customer_chunk, purchase_chunk, product_chunk in zip(customer_chunks, purchase_chunks, product_chunks):
    # Merge datasets
    merged_data = pd.merge(customer_chunk, purchase_chunk, on='Customer ID', how='right')
    merged_data = pd.merge(merged_data, product_chunk, on='Product ID', how='left')

    # Replace NaN values of 'Page Views_x' and 'Time Spent (minutes)' with values from customer interactions
    merged_data['Page Views_x'] = merged_data['Page Views_x'].fillna(merged_data['Customer ID'].map(customer_interactions.set_index('Customer ID')['Page Views']))
    merged_data['Time Spent (minutes)'] = merged_data['Time Spent (minutes)'].fillna(merged_data['Customer ID'].map(customer_interactions.set_index('Customer ID')['Time Spent (minutes)']))

    # Replace all other NaN values with 0
    merged_data.fillna(0, inplace=True)

    # Filter relevant columns
    data = merged_data[['Customer ID', 'Product ID', 'Ratings', 'Price_x', 'Page Views_x', 'Page Views_y', 'Time Spent (minutes)']]

    # Define the Reader object
    reader = Reader(rating_scale=(1, 5), line_format='user item rating')

    # Load the data into Surprise Dataset
    data = Dataset.load_from_df(data[['Customer ID', 'Product ID', 'Ratings']], reader)

    # Split the data into train and test sets
    trainset, testset = train_test_split(data, test_size=0.2)

    # List of algorithms
    algorithms = [
        KNNBasic(),
        KNNWithMeans(),
        KNNWithZScore(),
        BaselineOnly(),
        CoClustering(),
        SlopeOne(),
        NMF(),
        SVD(),
        SVDpp()
    ]

    # Iterate over algorithms
    for algo in algorithms:
        start_time = time.time()
        # Train the algorithm on the trainset
        algo.fit(trainset)

        # Test the algorithm on the testset
        predictions = algo.test(testset)

        # Evaluate the model
        rmse = accuracy.rmse(predictions, verbose=False)
        mae = accuracy.mae(predictions, verbose=False)
        execution_time = time.time() - start_time
        print(f"{algo.__class__.__name__} RMSE: {rmse:.4f}, MAE: {mae:.4f}, Execution Time: {execution_time:.2f} seconds")


Computing the msd similarity matrix...
Done computing similarity matrix.
KNNBasic RMSE: 1.1137, MAE: 0.9452, Execution Time: 0.62 seconds
Computing the msd similarity matrix...
Done computing similarity matrix.
KNNWithMeans RMSE: 1.3375, MAE: 1.0958, Execution Time: 0.62 seconds
Computing the msd similarity matrix...
Done computing similarity matrix.
KNNWithZScore RMSE: 1.3376, MAE: 1.0960, Execution Time: 0.81 seconds
Estimating biases using als...
BaselineOnly RMSE: 0.6685, MAE: 0.5620, Execution Time: 0.04 seconds
CoClustering RMSE: 1.1038, MAE: 0.9051, Execution Time: 0.49 seconds
SlopeOne RMSE: 1.3373, MAE: 1.0952, Execution Time: 0.06 seconds
NMF RMSE: 0.9258, MAE: 0.7256, Execution Time: 0.49 seconds
SVD RMSE: 0.6219, MAE: 0.5156, Execution Time: 0.14 seconds
SVDpp RMSE: 0.5735, MAE: 0.4720, Execution Time: 0.18 seconds
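
The cell above fills gaps by looking up each customer in the full interactions table. A minimal sketch of that lookup on made-up frames, assuming `Customer ID` is unique in the lookup table; building the indexed Series once, outside the loop, avoids repeating `set_index` for every column and chunk.

```python
import pandas as pd

# Hypothetical full interactions table
customer_interactions = pd.DataFrame(
    {"Customer ID": [1, 2, 3], "Page Views": [10, 20, 30]}
)

# Hypothetical merged chunk with gaps; customer 9 is not in the lookup table
merged = pd.DataFrame(
    {"Customer ID": [1, 3, 9], "Page Views_x": [10.0, None, None]}
)

# Build the lookup once: a Series indexed by Customer ID
lookup = customer_interactions.set_index("Customer ID")["Page Views"]

# Fill only the missing entries with the looked-up values
merged["Page Views_x"] = merged["Page Views_x"].fillna(
    merged["Customer ID"].map(lookup)
)

print(merged["Page Views_x"].tolist())  # [10.0, 30.0, nan]
```

Customers absent from the lookup table stay NaN after this step, which is why a later fill or drop is still needed.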


In [7]:
merged_data.head()

Unnamed: 0,Customer ID,Page Views_x,Time Spent (minutes),Avatar,Product ID,Purchase Date,Category_x,Price_x,Page Views_y,Category_y,Price_y,Ratings,Product Icon
0,1217225,178,134.0,https://raw.githubusercontent.com/reckn/super-...,942,2023-10-05 06:51:34.943239,Electronics,3787.672607,3,Electronics,3787.672607,2.6,https://raw.githubusercontent.com/reckn/super-...
1,3670929,182,150.0,https://raw.githubusercontent.com/reckn/super-...,884,2023-12-23 10:09:31.943314,Electronics,3988.595379,1,Electronics,3988.595379,4.8,https://raw.githubusercontent.com/reckn/super-...
2,3103312,63,50.0,https://raw.githubusercontent.com/reckn/super-...,232,2023-08-27 14:17:02.943330,Consumer Electronics Accessories,140.166011,1,Consumer Electronics Accessories,140.166011,1.2,https://raw.githubusercontent.com/reckn/super-...
3,980994,104,89.0,https://raw.githubusercontent.com/reckn/super-...,631,2023-12-06 02:52:11.943345,Home and Kitchen Appliances,272.764911,3,Home and Kitchen Appliances,272.764911,2.4,https://raw.githubusercontent.com/reckn/super-...
4,4941220,178,142.0,https://raw.githubusercontent.com/reckn/super-...,1005,2023-09-17 08:53:39.943360,Beauty and Personal Care Products,164.574978,3,Beauty and Personal Care Products,164.574978,3.2,https://raw.githubusercontent.com/reckn/super-...


In [8]:
len(data.raw_ratings)

10000

In [9]:
print(customer_interactions[customer_interactions['Customer ID'] == 2582589])


Empty DataFrame
Columns: [Customer ID, Page Views, Time Spent (minutes), Avatar]
Index: []


In [10]:
# Check the rating range in merged_data (the last chunk processed above)
ratings_min = merged_data['Ratings'].min()
ratings_max = merged_data['Ratings'].max()

print(f"Minimum Rating: {ratings_min}")
print(f"Maximum Rating: {ratings_max}")


Minimum Rating: 1.0
Maximum Rating: 5.0


In [11]:
import pandas as pd
from surprise import Dataset, Reader, KNNBasic, KNNWithMeans, KNNWithZScore, BaselineOnly, CoClustering, SlopeOne, NMF, SVD, SVDpp
from surprise.model_selection import train_test_split
from surprise import accuracy
import time

# Define function to process data in batches
def process_data_in_batches(filename, chunksize):
    reader = pd.read_csv(filename, chunksize=chunksize)
    for chunk in reader:
        yield chunk

# Set the chunk size for batch processing
chunksize = 1000  # Adjust based on your memory constraints

# Load customer interactions data
customer_interactions = pd.read_csv('/content/drive/MyDrive/fair_dataset/customer_interactions_synthetic.csv')

# Process and split data into train and test sets in batches
customer_chunks = process_data_in_batches('/content/drive/MyDrive/fair_dataset/customer_interactions_synthetic.csv', chunksize)
purchase_chunks = process_data_in_batches('/content/drive/MyDrive/fair_dataset/purchase_history_synthetic.csv', chunksize)
product_chunks = process_data_in_batches('/content/drive/MyDrive/fair_dataset/product_details_synthetic.csv', chunksize)

for idx, (customer_chunk, purchase_chunk, product_chunk) in enumerate(zip(customer_chunks, purchase_chunks, product_chunks)):
    # Print the chunk size
    print(f"Processing Chunk {idx + 1} with Chunk Size: {len(customer_chunk)}")

    # Merge datasets
    merged_data = pd.merge(customer_chunk, purchase_chunk, on='Customer ID', how='right')
    merged_data = pd.merge(merged_data, product_chunk, on='Product ID', how='left')

    # Replace NaN values of 'Page Views_x' and 'Time Spent (minutes)' with values from customer interactions
    merged_data['Page Views_x'] = merged_data['Page Views_x'].fillna(merged_data['Customer ID'].map(customer_interactions.set_index('Customer ID')['Page Views']))
    merged_data['Time Spent (minutes)'] = merged_data['Time Spent (minutes)'].fillna(merged_data['Customer ID'].map(customer_interactions.set_index('Customer ID')['Time Spent (minutes)']))

    # Replace all remaining NaN values with 0
    merged_data.fillna(0, inplace=True)

    # Filter relevant columns
    data = merged_data[['Customer ID', 'Product ID', 'Ratings']]

    # Define the Reader object
    reader = Reader(rating_scale=(1, 5))

    # Load the data into Surprise Dataset
    data = Dataset.load_from_df(data, reader)

    # Split the data into train and test sets
    trainset, testset = train_test_split(data, test_size=0.2)

    # List of algorithms
    algorithms = [
        KNNBasic(),
        KNNWithMeans(),
        KNNWithZScore(),
        BaselineOnly(),
        CoClustering(),
        SlopeOne(),
        NMF(),
        SVD(),
        SVDpp()
    ]

    # Iterate over algorithms
    for algo in algorithms:
        start_time = time.time()
        # Train the algorithm on the trainset
        algo.fit(trainset)

        # Test the algorithm on the testset
        predictions = algo.test(testset)

        # Evaluate the model
        rmse = accuracy.rmse(predictions, verbose=False)
        mae = accuracy.mae(predictions, verbose=False)
        execution_time = time.time() - start_time
        print(f"{algo.__class__.__name__} RMSE: {rmse:.4f}, MAE: {mae:.4f}, Execution Time: {execution_time:.2f} seconds")


Processing Chunk 1 with Chunk Size: 1000
Computing the msd similarity matrix...
Done computing similarity matrix.
KNNBasic RMSE: 1.3479, MAE: 1.1207, Execution Time: 0.01 seconds
Computing the msd similarity matrix...
Done computing similarity matrix.
KNNWithMeans RMSE: 1.3701, MAE: 1.1385, Execution Time: 0.01 seconds
Computing the msd similarity matrix...
Done computing similarity matrix.
KNNWithZScore RMSE: 1.3701, MAE: 1.1385, Execution Time: 0.03 seconds
Estimating biases using als...
BaselineOnly RMSE: 1.2680, MAE: 1.0543, Execution Time: 0.01 seconds
CoClustering RMSE: 1.3268, MAE: 1.1046, Execution Time: 0.16 seconds
SlopeOne RMSE: 1.3701, MAE: 1.1385, Execution Time: 0.01 seconds
NMF RMSE: 1.3304, MAE: 1.1054, Execution Time: 0.09 seconds
SVD RMSE: 1.2708, MAE: 1.0561, Execution Time: 0.01 seconds
SVDpp RMSE: 1.2407, MAE: 1.0316, Execution Time: 0.01 seconds
Processing Chunk 2 with Chunk Size: 1000
Computing the msd similarity matrix...
Done computing similarity matrix.
KNNBas

Let's push the chunksize to 50000 and save the model!

In [12]:
import pandas as pd
from surprise import Dataset, Reader, KNNBasic, KNNWithMeans, KNNWithZScore, BaselineOnly, CoClustering, SlopeOne, NMF, SVD, SVDpp
from surprise.model_selection import train_test_split
from surprise import accuracy
import time

# Define function to process data in batches
def process_data_in_batches(filename, chunksize):
    reader = pd.read_csv(filename, chunksize=chunksize)
    for chunk in reader:
        yield chunk

# Set the chunk size for batch processing
chunksize = 50000  # Adjust based on your memory constraints

# Load customer interactions data
customer_interactions = pd.read_csv('/content/drive/MyDrive/fair_dataset/customer_interactions_synthetic.csv')

# Process and split data into train and test sets in batches
customer_chunks = process_data_in_batches('/content/drive/MyDrive/fair_dataset/customer_interactions_synthetic.csv', chunksize)
purchase_chunks = process_data_in_batches('/content/drive/MyDrive/fair_dataset/purchase_history_synthetic.csv', chunksize)
product_chunks = process_data_in_batches('/content/drive/MyDrive/fair_dataset/product_details_synthetic.csv', chunksize)

for customer_chunk, purchase_chunk, product_chunk in zip(customer_chunks, purchase_chunks, product_chunks):
    # Merge datasets
    merged_data = pd.merge(customer_chunk, purchase_chunk, on='Customer ID', how='right')
    merged_data = pd.merge(merged_data, product_chunk, on='Product ID', how='left')

    # Replace NaN values of 'Page Views_x' and 'Time Spent (minutes)' with values from customer interactions
    merged_data['Page Views_x'] = merged_data['Page Views_x'].fillna(merged_data['Customer ID'].map(customer_interactions.set_index('Customer ID')['Page Views']))
    merged_data['Time Spent (minutes)'] = merged_data['Time Spent (minutes)'].fillna(merged_data['Customer ID'].map(customer_interactions.set_index('Customer ID')['Time Spent (minutes)']))

    # Replace all other NaN values with 0
    merged_data.fillna(0, inplace=True)

    # Filter relevant columns
    data = merged_data[['Customer ID', 'Product ID', 'Ratings', 'Price_x', 'Page Views_x', 'Page Views_y', 'Time Spent (minutes)']]

    # Define the Reader object
    reader = Reader(rating_scale=(1, 5), line_format='user item rating')

    # Load the data into Surprise Dataset
    data = Dataset.load_from_df(data[['Customer ID', 'Product ID', 'Ratings']], reader)

    # Split the data into train and test sets
    trainset, testset = train_test_split(data, test_size=0.2)

    # List of algorithms
    algorithms = [
        KNNBasic(),
        KNNWithMeans(),
        KNNWithZScore(),
        BaselineOnly(),
        CoClustering(),
        SlopeOne(),
        NMF(),
        SVD(),
        SVDpp()
    ]

    # Iterate over algorithms
    for algo in algorithms:
        start_time = time.time()
        # Train the algorithm on the trainset
        algo.fit(trainset)

        # Test the algorithm on the testset
        predictions = algo.test(testset)

        # Evaluate the model
        rmse = accuracy.rmse(predictions, verbose=False)
        mae = accuracy.mae(predictions, verbose=False)
        execution_time = time.time() - start_time
        print(f"{algo.__class__.__name__} RMSE: {rmse:.4f}, MAE: {mae:.4f}, Execution Time: {execution_time:.2f} seconds")


Computing the msd similarity matrix...
Done computing similarity matrix.
KNNBasic RMSE: 0.8661, MAE: 0.5663, Execution Time: 2.94 seconds
Computing the msd similarity matrix...
Done computing similarity matrix.
KNNWithMeans RMSE: 1.0991, MAE: 0.8497, Execution Time: 3.02 seconds
Computing the msd similarity matrix...
Done computing similarity matrix.
KNNWithZScore RMSE: 1.1435, MAE: 0.8966, Execution Time: 3.22 seconds
Estimating biases using als...
BaselineOnly RMSE: 0.2536, MAE: 0.2152, Execution Time: 0.10 seconds
CoClustering RMSE: 0.7478, MAE: 0.5926, Execution Time: 1.08 seconds
SlopeOne RMSE: 1.1485, MAE: 0.9112, Execution Time: 0.17 seconds
NMF RMSE: 0.2820, MAE: 0.1786, Execution Time: 1.22 seconds
SVD RMSE: 0.1594, MAE: 0.1272, Execution Time: 0.72 seconds
SVDpp RMSE: 0.1609, MAE: 0.1261, Execution Time: 0.85 seconds
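
The printed lines have a regular shape, so they can be parsed and ranked rather than compared by eye. A small sketch using four lines copied from the output above; the regular expression and variable names are illustrative.

```python
import re

# Four result lines copied from the run above
log = """\
BaselineOnly RMSE: 0.2536, MAE: 0.2152, Execution Time: 0.10 seconds
NMF RMSE: 0.2820, MAE: 0.1786, Execution Time: 1.22 seconds
SVD RMSE: 0.1594, MAE: 0.1272, Execution Time: 0.72 seconds
SVDpp RMSE: 0.1609, MAE: 0.1261, Execution Time: 0.85 seconds
"""

pattern = re.compile(
    r"^(\w+) RMSE: ([\d.]+), MAE: ([\d.]+), Execution Time: ([\d.]+) seconds$",
    re.MULTILINE,
)

# One (name, rmse, mae, seconds) tuple per matching line
results = [
    (name, float(rmse), float(mae), float(secs))
    for name, rmse, mae, secs in pattern.findall(log)
]

best_by_rmse = min(results, key=lambda r: r[1])[0]
best_by_mae = min(results, key=lambda r: r[2])[0]

print(best_by_rmse)  # SVD
print(best_by_mae)   # SVDpp
```

In this run SVD and SVDpp are nearly tied, and which one ranks first depends on the metric.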


In [13]:
import pandas as pd
from surprise import Dataset, Reader, KNNBasic, KNNWithMeans, KNNWithZScore, BaselineOnly, CoClustering, SlopeOne, NMF, SVD, SVDpp
from surprise.model_selection import train_test_split
from surprise import accuracy
import time
import pickle

# Define function to process data in batches
def process_data_in_batches(filename, chunksize):
    reader = pd.read_csv(filename, chunksize=chunksize)
    for chunk in reader:
        yield chunk

# Set the chunk size for batch processing
chunksize = 50000  # Adjust based on your memory constraints

# Load customer interactions data
customer_interactions = pd.read_csv('/content/drive/MyDrive/fair_dataset/customer_interactions_synthetic.csv')

# Process and split data into train and test sets in batches
customer_chunks = process_data_in_batches('/content/drive/MyDrive/fair_dataset/customer_interactions_synthetic.csv', chunksize)
purchase_chunks = process_data_in_batches('/content/drive/MyDrive/fair_dataset/purchase_history_synthetic.csv', chunksize)
product_chunks = process_data_in_batches('/content/drive/MyDrive/fair_dataset/product_details_synthetic.csv', chunksize)

for customer_chunk, purchase_chunk, product_chunk in zip(customer_chunks, purchase_chunks, product_chunks):
    # Merge datasets
    merged_data = pd.merge(customer_chunk, purchase_chunk, on='Customer ID', how='right')
    merged_data = pd.merge(merged_data, product_chunk, on='Product ID', how='left')

    # Replace NaN values of 'Page Views_x' and 'Time Spent (minutes)' with values from customer interactions
    merged_data['Page Views_x'] = merged_data['Page Views_x'].fillna(merged_data['Customer ID'].map(customer_interactions.set_index('Customer ID')['Page Views']))
    merged_data['Time Spent (minutes)'] = merged_data['Time Spent (minutes)'].fillna(merged_data['Customer ID'].map(customer_interactions.set_index('Customer ID')['Time Spent (minutes)']))

    # Replace all other NaN values with 0
    merged_data.fillna(0, inplace=True)

    # Filter relevant columns
    data = merged_data[['Customer ID', 'Product ID', 'Ratings', 'Price_x', 'Page Views_x', 'Page Views_y', 'Time Spent (minutes)']]

    # Define the Reader object
    reader = Reader(rating_scale=(1, 5), line_format='user item rating')

    # Load the data into Surprise Dataset
    data = Dataset.load_from_df(data[['Customer ID', 'Product ID', 'Ratings']], reader)

    # Split the data into train and test sets
    trainset, testset = train_test_split(data, test_size=0.2)

    # List of algorithms
    algorithms = [
        KNNBasic(),
        KNNWithMeans(),
        KNNWithZScore(),
        BaselineOnly(),
        CoClustering(),
        SlopeOne(),
        NMF(),
        SVD(),
        SVDpp()
    ]

    # Iterate over algorithms
    for algo in algorithms:
        start_time = time.time()
        # Train the algorithm on the trainset
        algo.fit(trainset)

        # Test the algorithm on the testset
        predictions = algo.test(testset)

        # Evaluate the model
        rmse = accuracy.rmse(predictions, verbose=False)
        mae = accuracy.mae(predictions, verbose=False)
        execution_time = time.time() - start_time
        print(f"{algo.__class__.__name__} RMSE: {rmse:.4f}, MAE: {mae:.4f}, Execution Time: {execution_time:.2f} seconds")

        # Save the trained model using pickle
        model_filename = f"{algo.__class__.__name__}_model.pkl"
        with open(model_filename, 'wb') as f:
            pickle.dump(algo, f)
        print(f"Trained model saved as {model_filename}")


Computing the msd similarity matrix...
Done computing similarity matrix.
KNNBasic RMSE: 0.8738, MAE: 0.5744, Execution Time: 3.08 seconds
Trained model saved as KNNBasic_model.pkl
Computing the msd similarity matrix...
Done computing similarity matrix.
KNNWithMeans RMSE: 1.0976, MAE: 0.8526, Execution Time: 2.98 seconds
Trained model saved as KNNWithMeans_model.pkl
Computing the msd similarity matrix...
Done computing similarity matrix.
KNNWithZScore RMSE: 1.1378, MAE: 0.8945, Execution Time: 3.18 seconds
Trained model saved as KNNWithZScore_model.pkl
Estimating biases using als...
BaselineOnly RMSE: 0.2540, MAE: 0.2158, Execution Time: 0.12 seconds
Trained model saved as BaselineOnly_model.pkl
CoClustering RMSE: 0.8182, MAE: 0.6333, Execution Time: 1.22 seconds
Trained model saved as CoClustering_model.pkl
SlopeOne RMSE: 1.1488, MAE: 0.9145, Execution Time: 0.16 seconds
Trained model saved as SlopeOne_model.pkl
NMF RMSE: 0.2878, MAE: 0.1801, Execution Time: 1.19 seconds
Trained model 
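
The save step relies on `pickle`, and loading is the mirror image. Below is a minimal round-trip sketch; a plain dictionary stands in for the fitted algorithm, and the `save_model` / `load_model` helpers and the temporary path are hypothetical. Because the notebook builds the filename from the class name alone, a later chunk would overwrite the file written by an earlier one.

```python
import os
import pickle
import tempfile


def save_model(model, filename):
    """Serialize any picklable object to `filename`."""
    with open(filename, "wb") as f:
        pickle.dump(model, f)


def load_model(filename):
    """Load and return the object stored in `filename`."""
    with open(filename, "rb") as f:
        return pickle.load(f)


# Hypothetical stand-in for a fitted algorithm
params = {"algo": "SVD", "n_factors": 100}

path = os.path.join(tempfile.mkdtemp(), "SVD_model.pkl")
save_model(params, path)
restored = load_model(path)

print(restored == params)  # True
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.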

In [14]:
merged_data.head()

Unnamed: 0,Customer ID,Page Views_x,Time Spent (minutes),Avatar,Product ID,Purchase Date,Category_x,Price_x,Page Views_y,Category_y,Price_y,Ratings,Product Icon
0,1217225,178,134.0,https://raw.githubusercontent.com/reckn/super-...,942,2023-10-05 06:51:34.943239,Electronics,3787.672607,3,Electronics,3787.672607,2.6,https://raw.githubusercontent.com/reckn/super-...
1,3670929,182,150.0,https://raw.githubusercontent.com/reckn/super-...,884,2023-12-23 10:09:31.943314,Electronics,3988.595379,1,Electronics,3988.595379,4.8,https://raw.githubusercontent.com/reckn/super-...
2,3103312,63,50.0,https://raw.githubusercontent.com/reckn/super-...,232,2023-08-27 14:17:02.943330,Consumer Electronics Accessories,140.166011,1,Consumer Electronics Accessories,140.166011,1.2,https://raw.githubusercontent.com/reckn/super-...
3,980994,104,89.0,https://raw.githubusercontent.com/reckn/super-...,631,2023-12-06 02:52:11.943345,Home and Kitchen Appliances,272.764911,3,Home and Kitchen Appliances,272.764911,2.4,https://raw.githubusercontent.com/reckn/super-...
4,4941220,178,142.0,https://raw.githubusercontent.com/reckn/super-...,1005,2023-09-17 08:53:39.943360,Beauty and Personal Care Products,164.574978,3,Beauty and Personal Care Products,164.574978,3.2,https://raw.githubusercontent.com/reckn/super-...


In [15]:
print(merged_data[merged_data['Customer ID'] == 2582589])

Empty DataFrame
Columns: [Customer ID, Page Views_x, Time Spent (minutes), Avatar, Product ID, Purchase Date, Category_x, Price_x, Page Views_y, Category_y, Price_y, Ratings, Product Icon]
Index: []


In [16]:
df = pd.read_csv('/content/drive/MyDrive/fair_dataset/purchase_history_synthetic.csv')
print(df[df['Customer ID'] == 2582589])

Empty DataFrame
Columns: [Customer ID, Product ID, Purchase Date, Category, Price, Page Views]
Index: []


In [17]:
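# Check whether every customer who interacted with the website also made a purchase.
# Caveat: merged_data comes from an inner join, so it only holds customers with purchases
# and the filter below is empty by construction. To confirm the claim, inspect
# customer_ids_without_purchase_history directly.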

import pandas as pd

# Read the CSV files into DataFrames
customer_interactions = pd.read_csv('/content/drive/MyDrive/fair_dataset/customer_interactions_synthetic.csv')
purchase_history = pd.read_csv('/content/drive/MyDrive/fair_dataset/purchase_history_synthetic.csv')
product_detail = pd.read_csv('/content/drive/MyDrive/fair_dataset/product_details_synthetic.csv')

# Merge the datasets
merged_data = pd.merge(customer_interactions, purchase_history, on='Customer ID', how='inner')
merged_data = pd.merge(merged_data, product_detail, on='Product ID', how='left')

# Find customer IDs that appear in customer_interactions but not in purchase_history
customer_ids_with_interactions = set(customer_interactions['Customer ID'])
customer_ids_with_purchases = set(purchase_history['Customer ID'])

customer_ids_without_purchase_history = customer_ids_with_interactions - customer_ids_with_purchases

filtered_data = merged_data[merged_data['Customer ID'].isin(customer_ids_without_purchase_history)]

print(filtered_data)


Empty DataFrame
Columns: [Customer ID, Page Views_x, Time Spent (minutes), Avatar, Product ID, Purchase Date, Category_x, Price_x, Page Views_y, Category_y, Price_y, Ratings, Product Icon]
Index: []
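
Because `merged_data` in the cell above is built with `how='inner'`, it cannot contain customers without purchases, so filtering it by those IDs always returns an empty frame. Below is a minimal sketch of a check that can actually fail, using the `indicator=True` option of a pandas left merge. The two small frames are invented stand-ins for the CSV files, not the real data, and only the `Customer ID` column name is taken from the notebook.

```python
import pandas as pd

# Toy stand-ins for the real CSV files (values are invented for illustration)
customer_interactions = pd.DataFrame({
    'Customer ID': [1, 2, 3],
    'Page Views': [10, 20, 30],
})
purchase_history = pd.DataFrame({
    'Customer ID': [1, 1, 3],
    'Product ID': [100, 101, 102],
})

# Left-merge against the unique purchaser IDs; the '_merge' column records
# whether each interaction row found a match in purchase_history
purchasers = purchase_history[['Customer ID']].drop_duplicates()
flagged = customer_interactions.merge(
    purchasers, on='Customer ID', how='left', indicator=True
)

# 'left_only' rows are customers who interacted but never purchased
no_purchase = flagged.loc[flagged['_merge'] == 'left_only', 'Customer ID'].tolist()
print(no_purchase)
```

An empty list here would support the claim that every interacting customer has at least one purchase. With the toy data, customer 2 is reported instead.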


In [18]:
import pandas as pd
from surprise import Dataset, Reader
import pickle

# Define a function to load the model from the .pkl file
def load_model(filename):
    with open(filename, 'rb') as f:
        model = pickle.load(f)
    return model

# Load the three datasets in a single batch (adjust as needed)
customer_interactions = pd.read_csv('/content/drive/MyDrive/fair_dataset/customer_interactions_synthetic.csv')
customer_chunk = customer_interactions  # same file, so reuse the DataFrame instead of reading it twice
purchase_chunk = pd.read_csv('/content/drive/MyDrive/fair_dataset/purchase_history_synthetic.csv')
product_chunk = pd.read_csv('/content/drive/MyDrive/fair_dataset/product_details_synthetic.csv')

# Merge datasets
merged_data = pd.merge(customer_chunk, purchase_chunk, on='Customer ID', how='right')
merged_data = pd.merge(merged_data, product_chunk, on='Product ID', how='left')

# Fill missing 'Page Views_x' and 'Time Spent (minutes)' from customer interactions.
# Note: customer_chunk and customer_interactions come from the same file, so a customer
# missing after the merge is missing from this lookup too and the fills change nothing.
interactions_by_id = customer_interactions.set_index('Customer ID')
merged_data['Page Views_x'] = merged_data['Page Views_x'].fillna(merged_data['Customer ID'].map(interactions_by_id['Page Views']))
merged_data['Time Spent (minutes)'] = merged_data['Time Spent (minutes)'].fillna(merged_data['Customer ID'].map(interactions_by_id['Time Spent (minutes)']))

# Replace all other NaN values with 0
# (note: a 0 in 'Ratings' falls outside the 1-5 rating scale declared below)
merged_data.fillna(0, inplace=True)

# Filter relevant columns
data = merged_data[['Customer ID', 'Product ID', 'Ratings', 'Price_x', 'Page Views_x', 'Page Views_y', 'Time Spent (minutes)']]

# Define the Reader object
reader = Reader(rating_scale=(1, 5), line_format='user item rating')

# Load the data into Surprise Dataset
data = Dataset.load_from_df(data[['Customer ID', 'Product ID', 'Ratings']], reader)

# Load a specific model
model_filename = "KNNBasic_model.pkl"
loaded_model = load_model(model_filename)

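# Example prediction. Replace the placeholders with raw IDs exactly as they appear in
# the training data (integer Customer ID and Product ID values in this dataset).
# IDs the model has never seen make Surprise fall back to its default estimate,
# the global mean of the training ratings.
user_id = 'some_user_id'
item_id = 'some_item_id'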
prediction = loaded_model.predict(user_id, item_id)

# Print the prediction
print(f"Model: {model_filename}, Prediction: {prediction.est}")


Model: KNNBasic_model.pkl, Prediction: 3.048925
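
The prediction above was requested for placeholder IDs, so the value is most likely the fallback estimate rather than a neighbourhood-based one. Surprise's `predict` returns a `Prediction` named tuple (`uid`, `iid`, `r_ui`, `est`, `details`), and `details['was_impossible']` is set when the algorithm could not compute an estimate. The sketch below shows one way to surface that flag. `describe_prediction` is a hypothetical helper, and the `Prediction` tuple defined here is only a stand-in with the same field names so the example runs without a trained model; the values are invented.

```python
from collections import namedtuple

# Stand-in with the same fields as Surprise's Prediction named tuple, so the
# sketch runs without a trained model. With a real model, pass the result of
# loaded_model.predict(user_id, item_id) instead.
Prediction = namedtuple('Prediction', ['uid', 'iid', 'r_ui', 'est', 'details'])

def describe_prediction(pred):
    """Hypothetical helper: format an estimate and flag fallback predictions."""
    if pred.details.get('was_impossible', False):
        reason = pred.details.get('reason', 'unknown reason')
        return f"{pred.est:.4f} (fallback estimate: {reason})"
    return f"{pred.est:.4f}"

# Invented example values, not output from the notebook's models
fallback = Prediction('some_user_id', 'some_item_id', None, 3.0489,
                      {'was_impossible': True,
                       'reason': 'User and/or item is unknown.'})
regular = Prediction(1217225, 942, None, 2.75,
                     {'was_impossible': False, 'actual_k': 5})

print(describe_prediction(fallback))
print(describe_prediction(regular))
```

Checking the flag before trusting `est` helps distinguish a real recommendation score from the global-mean default.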


In [19]:
import pickle

# Load the model file to inspect its contents
with open("SVD_model.pkl", 'rb') as f:
    loaded_object = pickle.load(f)

print(type(loaded_object))


<class 'surprise.prediction_algorithms.matrix_factorization.SVD'>
