#### **5.Machine Learning Model :  Recommendation System**

The recommendation will be in three different approach : content-based method, collaborative filtering and combining both in a hybrid model of recommendation system.

The difference in recommendations between content-based filtering and collaborative filtering comes from the underlying mechanisms of how these methods work:

**A. Content-Based Filtering:**

**How it works:** Content-based filtering recommends items that are similar to those a user has liked in the past, based on the features of the items themselves. In this case, features like product_category_name_english are used to find similarities between items.

Since the method is based on features (like category, description length, etc.), it will naturally recommend items that are similar to those the user has interacted with in the same category. For example, if the user likes electronics, content-based filtering will recommend other electronics products, as they share similar attributes (e.g., category).

**In-Short :** Content-Based Filtering is based solely on the similarity of item features.

**B. Collaborative Filtering:**

**How it works:** Collaborative filtering uses the preferences of many users to make recommendations. It looks at what other users with similar tastes have liked and recommends those items to you, regardless of the content or category of the items.

Collaborative filtering does not directly consider item features like categories. Instead, it looks at patterns of user behavior. If other users who liked the same items as you have also liked items in different categories, it may recommend those cross-category items to you. This is why collaborative filtering can recommend items from different categories—because it leverages user behavior patterns, not just item similarities.

**In-Short :** Collaborative Filtering considers user preferences based on their ratings of other items.









In [1]:
!pip install surprise



In [2]:
import pandas as pd
import numpy as np

from sklearn.metrics import mean_squared_error
from sklearn.metrics.pairwise import linear_kernel
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer


from surprise import Dataset, Reader, SVD, accuracy
from surprise.model_selection import train_test_split as surprise_train_test_split
from sklearn.metrics.pairwise import cosine_similarity


from surprise import accuracy, Reader, Dataset
from surprise import NormalPredictor, KNNBasic, SVD, SVDpp, CoClustering, SlopeOne, NMF, KNNBaseline
from surprise.model_selection import cross_validate, KFold, GridSearchCV, train_test_split


In [3]:
def format_percentage(value):
    return f"{value:.2f}%"

In [4]:
df_all = pd.read_csv('/content/df_olist_clean.csv')
df_all.head()

Unnamed: 0,customer_id,customer_unique_id,customer_city,customer_state,order_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_customer_date,order_estimated_delivery_date,...,payment_type,payment_installments,payment_value,product_category_name,seller_city,seller_state,review_score,review_comment_message,review_creation_date,review_answer_timestamp
0,06b8999e2fba1a1fbc88172c00ba8bc7,C48272,franca,SP,00e7ee1b050b8499577073aeb2a297a1,delivered,2017-05-16 15:05:35,2017-05-16 15:22:12,2017-05-25 10:35:35,2017-06-05,...,credit_card,2.0,146.87,office_furniture,itaquaquecetuba,SP,4.0,,2017-05-26 00:00:00.000000000,2017-05-30 22:34:40
1,18955e83d337fd6b2def6b18a428ac77,C14788,sao bernardo do campo,SP,29150127e6685892b6eab3eec79f59c7,delivered,2018-01-12 20:48:24,2018-01-12 20:58:32,2018-01-29 12:41:19,2018-02-06,...,credit_card,8.0,335.48,housewares,itajai,SC,5.0,,2018-01-30 00:00:00.000000000,2018-02-10 22:43:29
2,4e7b3e00288586ebd08712fdd0374a03,C2174,sao paulo,SP,b2059ed67ce144a36e2aa97d2c9e9ad2,delivered,2018-05-19 16:07:45,2018-05-20 16:19:10,2018-06-14 17:58:51,2018-06-13,...,credit_card,7.0,157.73,office_furniture,itaquaquecetuba,SP,5.0,,2018-06-15 00:00:00.000000000,2018-06-15 12:10:59
3,b2b6027bc5c5109e529d4dc6358b12c3,C13604,mogi das cruzes,SP,951670f92359f4fe4a63112aa7306eba,delivered,2018-03-13 16:06:38,2018-03-13 17:29:19,2018-03-28 16:04:25,2018-04-10,...,credit_card,1.0,173.3,office_furniture,itaquaquecetuba,SP,5.0,,2018-03-29 00:00:00.000000000,2018-04-02 18:36:47
4,4f2d8ab171c80ec8364f7c12e35b23ad,C18911,campinas,SP,6b7d50bd145f6fc7f33cebabd7e49d0f,delivered,2018-07-29 09:51:30,2018-07-29 10:10:09,2018-08-09 20:55:48,2018-08-15,...,credit_card,8.0,252.25,home_confort,ibitinga,SP,5.0,O baratheon è esxelente Amo adoro o baratheon,2018-08-10 00:00:00.000000000,2018-08-17 01:59:52


In [5]:
df_all.shape

(114085, 26)

##### **5.1 Feature Engineering and Selection**

##### **5.1.1 Feature Selection**

In [7]:
df_collab = df_all.groupby(['customer_unique_id', 'product_id'])['review_score'].agg('mean').reset_index()
df_collab = df_collab.rename(columns={'review_score': 'rating'})


df_content = df_all[['product_id',
                     'product_category_name']].drop_duplicates()


In [8]:
df_all['order_purchase_timestamp'] = pd.to_datetime(df_all['order_purchase_timestamp'])
print(df_all['order_purchase_timestamp'].dtype)


datetime64[ns]


    5.1.2 Dataset Preparation : Model Performance Evaluation

In this section, we will separate the dataset to be used for building and testing models by dividing it between customers who have made their first purchase, which will later be used for a content-based filtering model and a hybrid recommendation system, and customers who have made their second purchase, which will be used for a collaborative filtering model.

In [10]:
df_all.shape

(114085, 26)

In [12]:
# Group by 'customer_unique_id' and filter for customers with exactly 2 unique 'order_id'
df_two_transactions = df_all.groupby('customer_unique_id').filter(lambda x: x['order_id'].nunique() == 2)

# Ensuring the data contains only customers with exactly 2 transactions
df_two_transactions = df_two_transactions.groupby('customer_unique_id').filter(lambda x: len(x) == 2)


    5.1.2.A Dataset for Repeat Transaction Evaluation

In [13]:
import numpy as np

# Set the random seed
np.random.seed(42)

# Ensure the data contains only customers with exactly 2 transactions
df_two_transactions = df_two_transactions.groupby('customer_unique_id').filter(lambda x: len(x) == 2)

# Get the last transaction of each unique customer (ordered by 'customer_unique_id' and 'order_id')
df_last_transaction = df_two_transactions.sort_values(['customer_unique_id', 'order_id']).groupby('customer_unique_id').tail(1).reset_index(drop=True)

# Select the first 100 unique customers' last transactions
df_check = df_last_transaction.head(100).reset_index(drop=True)
df_check = df_check.sort_values(['customer_unique_id', 'order_id'])


# Exclude the identified last transactions from df_all and update df_all
df_all = df_all.merge(df_check[['customer_unique_id', 'order_id']], on=['customer_unique_id', 'order_id'], how='left', indicator=True)
# df_all = df_all.merge(df_check[['customer_unique_id', 'order_id']], on=['customer_unique_id', 'order_id'], how='left', indicator=True)
df_all = df_all[df_all['_merge'] == 'left_only'].drop('_merge', axis=1)

# Print the resulting df_all
df_all.shape


(113985, 27)

    5.1.2.B Dataset for Cross Categorical Repeat Transaction Evaluation

In [15]:
import pandas as pd

# Step 1: Ensure the data is sorted by 'customer_unique_id' and 'order_id' to distinguish 1st and 2nd transactions
df_two_transactions_sorted = df_two_transactions.sort_values(by=['customer_unique_id', 'order_id'])

# Step 2: Identify customers whose 1st and 2nd transactions have different product categories
df_different_categories = df_two_transactions_sorted.groupby('customer_unique_id').filter(
    lambda x: x.iloc[0]['product_category_name'] != x.iloc[1]['product_category_name']
)

# Step 3: Get the last transaction for each customer in df_different_categories
df_last_different_category_transaction = df_different_categories.groupby('customer_unique_id').tail(1).reset_index(drop=True)

# Step 4: Select the first 100 unique customers' last transactions where 1st and 2nd transactions have different categories
df_check_2 = df_last_different_category_transaction.head(100).reset_index(drop=True)
# Step 1: Exclude the identified last transactions in df_check_2 from df_all
df_all = df_all.merge(df_check_2[['customer_unique_id', 'order_id']], on=['customer_unique_id', 'order_id'], how='left', indicator=True)

# Step 2: Keep only the rows that are not in df_check_2
df_all = df_all[df_all['_merge'] == 'left_only'].drop('_merge', axis=1)

# Print the resulting df_all to verify
print(df_all.shape)



(113951, 27)


##### **5.2. Modelling Recommendation Systems**



In [18]:
df_all.columns

Index(['customer_id', 'customer_unique_id', 'customer_city', 'customer_state',
       'order_id', 'order_status', 'order_purchase_timestamp',
       'order_approved_at', 'order_delivered_customer_date',
       'order_estimated_delivery_date', 'order_item_id', 'product_id',
       'seller_id', 'price', 'freight_value', 'payment_sequential',
       'payment_type', 'payment_installments', 'payment_value',
       'product_category_name', 'seller_city', 'seller_state', 'review_score',
       'review_comment_message', 'review_creation_date',
       'review_answer_timestamp', 'seller_avg_review_score'],
      dtype='object')

    5.2.1 Collaborative Filtering

In [19]:
#Identify customers with more than one occurrence (Repeat Customer)
repeat = df_all.groupby('customer_unique_id').filter(lambda x: len(x) > 1)
repeat['repeat'] = 1
repeat.shape

(35597, 28)

In [20]:
#Identify customers with one occurrence (First Time Customer)
new = df_all.groupby('customer_unique_id').filter(lambda x: len(x) == 1).reset_index(drop=True)
new['repeat'] = 0
new.shape

(78354, 28)

In [21]:
df_full = pd.concat((repeat, new), axis=0).reset_index(drop=True)

In [22]:
df_collaborative = repeat.groupby(['customer_unique_id','product_id'])['review_score'].agg(['mean']).reset_index()
df_collaborative = df_collaborative.rename({'mean':'estimator', 'product_id':'productId'}, axis=1)

In [23]:
df_collaborative.sort_values(by='estimator', ascending=False)

Unnamed: 0,customer_unique_id,productId,estimator
5830,C358,P17041,5.0
14574,C71291,P9418,5.0
14577,C71306,P28749,5.0
14576,C71305,P25790,5.0
14575,C71295,P20518,5.0
...,...,...,...
15501,C75235,P5803,1.0
6145,C37102,P26294,1.0
15490,C75172,P30295,1.0
15489,C75172,P19509,1.0


In [24]:
null =df_collaborative.isnull().sum()
null

Unnamed: 0,0
customer_unique_id,0
productId,0
estimator,0


In [25]:
df_collaborative = df_collaborative.dropna(subset=['estimator'])

In [26]:
# Scaling for Feature of Raw Rating
scaler = (df_collaborative.estimator.min(), df_collaborative.estimator.max())
reader = Reader(rating_scale=scaler)

In [27]:
#Load the dataframe
data = Dataset.load_from_df(df_collaborative[['customer_unique_id','productId', 'estimator']], reader)

In [28]:
import random

random.seed(42)

#shuffle the user, item, rating for unbiased result
all_collab = data.raw_ratings
random.shuffle(all_collab)

In [29]:
#split data with ratio 80:20 into train (set A) and test data (set B)
threshold = int(0.8 * len(all_collab))
train_raw_collab = all_collab[:threshold]
test_raw_collab = all_collab[threshold:]

In [30]:
listed = [all_collab, train_raw_collab, test_raw_collab]
names = ['all_collab', 'train_raw_ratings', 'test_raw_ratings']

for i, lis in enumerate(listed):
    count = len(lis)
    print(f"Shape of {names[i]}: {count}")

Shape of all_collab: 19908
Shape of train_raw_ratings: 15926
Shape of test_raw_ratings: 3982


In [31]:
#insert train_raw_collab into data
data.raw_ratings = train_raw_collab

In [32]:
from surprise import accuracy, Reader, Dataset
from surprise import NormalPredictor, KNNBasic, SVD, SVDpp, CoClustering, SlopeOne, NMF, KNNBaseline
from surprise.model_selection import cross_validate, KFold, GridSearchCV, train_test_split

    5.2.1 Training and Testing Model for Collaborative Filtering 

In [33]:
# Define the NormalPredictor model
normpr = NormalPredictor()

# Cross-validation
np_result = cross_validate(normpr, data, measures=['RMSE'], cv=5, verbose=False, n_jobs=2)

# Training the model
trainset = data.build_full_trainset()
normpr.fit(trainset)

# Calculate RMSE
np_pred_train = normpr.test(trainset.build_testset())
np_rmse_train = accuracy.rmse(np_pred_train, verbose=False)

# Load and calculate RMSE on the testset
testset = data.construct_testset(test_raw_collab)
np_pred_test = normpr.test(testset)
np_rmse_test = accuracy.rmse(np_pred_test, verbose=False)

In [34]:
# Define the SVD model
svd = SVD(n_factors=30, n_epochs=25, biased=True, lr_all=0.00004, reg_all=0.4, verbose=False, random_state=47)

# Cross-validation
svd_result = cross_validate(svd, data, measures=['RMSE'], cv=5, verbose=False, n_jobs=2)

# Training the model
trainset = data.build_full_trainset()
svd.fit(trainset)

# Calculate RMSE
svd_pred_train = svd.test(trainset.build_testset())
svd_rmse_train = accuracy.rmse(svd_pred_train, verbose=False)

# Load and calculate RMSE on the testset
testset = data.construct_testset(test_raw_collab)
svd_pred_test = svd.test(testset)
svd_rmse_test = accuracy.rmse(svd_pred_test, verbose=False)

In [35]:
# Define the SVDpp model
svdpp = SVDpp(n_factors=50, n_epochs=20, lr_all=0.00008, reg_all=0.4, verbose=False, random_state=47)

# Cross-validation
svdpp_result = cross_validate(svdpp, data, measures=['RMSE'], cv=5, verbose=False, n_jobs=2)

# Training the model
trainset = data.build_full_trainset()
svdpp.fit(trainset)

# Calculate RMSE
svdpp_pred_train = svdpp.test(trainset.build_testset())
svdpp_rmse_train = accuracy.rmse(svdpp_pred_train, verbose=False)

# Load and calculate RMSE on the testset
testset = data.construct_testset(test_raw_collab)
svdpp_pred_test = svdpp.test(testset)
svdpp_rmse_test = accuracy.rmse(svdpp_pred_test, verbose=False)


In [36]:
# Define the NMF model
nmf = NMF(n_factors=20, n_epochs=30, lr_bu=0.0000001, lr_bi=0.0000001, reg_pu=5, reg_qi=5, biased=True, random_state=47)

# Cross-validation
nmf_result = cross_validate(nmf, data, measures=['RMSE'], cv=5, verbose=False, n_jobs=2)

# Training the model
trainset = data.build_full_trainset()
nmf.fit(trainset)

# Calculate RMSE
nmf_pred_train = nmf.test(trainset.build_testset())
nmf_rmse_train = accuracy.rmse(nmf_pred_train, verbose=False)

# Load and calculate RMSE on the testset
testset = data.construct_testset(test_raw_collab)
nmf_pred_test = nmf.test(testset)
nmf_rmse_test = accuracy.rmse(nmf_pred_test, verbose=False)


In [37]:

# Define the KNNBasic model
knnb = KNNBasic(sim_options={'name': 'cosine', 'user_based': False}, verbose=False, random_state=47)

# Cross-validation
knnb_result = cross_validate(knnb, data, measures=['RMSE'], cv=5, verbose=False, n_jobs=2)

# Training the model
trainset = data.build_full_trainset()
knnb.fit(trainset)

# Calculate RMSE
knnb_pred_train = knnb.test(trainset.build_testset())
knnb_rmse_train = accuracy.rmse(knnb_pred_train, verbose=False)

# Load and calculate RMSE on the testset
testset = data.construct_testset(test_raw_collab)
knnb_pred_test = knnb.test(testset)
knnb_rmse_test = accuracy.rmse(knnb_pred_test, verbose=False)


In [38]:
# Define the CoClustering model
coc = CoClustering(n_cltr_u=8, n_cltr_i=8, n_epochs=30, random_state=47)

# Cross-validation
coc_result = cross_validate(coc, data, measures=['RMSE'], cv=5, verbose=False, n_jobs=2)

# Training the model
trainset = data.build_full_trainset()
coc.fit(trainset)

# Calculate RMSE
coc_pred_train = coc.test(trainset.build_testset())
coc_rmse_train = accuracy.rmse(coc_pred_train, verbose=False)

# Load and calculate RMSE on the testset
testset = data.construct_testset(test_raw_collab)
coc_pred_test = coc.test(testset)
coc_rmse_test = accuracy.rmse(coc_pred_test, verbose=False)


In [39]:
# Define the SlopeOne model
so = SlopeOne()

# Cross-validation
so_result = cross_validate(so, data, measures=['RMSE'], cv=5, verbose=False, n_jobs=2)

# Training the model
trainset = data.build_full_trainset()
so.fit(trainset)

# Calculate RMSE
so_pred_train = so.test(trainset.build_testset())
so_rmse_train = accuracy.rmse(so_pred_train, verbose=False)

# Load and calculate RMSE on the testset
testset = data.construct_testset(test_raw_collab)
so_pred_test = so.test(testset)
so_rmse_test = accuracy.rmse(so_pred_test, verbose=False)


In [40]:
rec_rmses = [[np_rmse_train, np_rmse_test],
             [svd_rmse_train, svd_rmse_test],
             [svdpp_rmse_train, svd_rmse_test],
             [nmf_rmse_train, nmf_rmse_test],
             [knnb_rmse_train, knnb_rmse_test],
             [coc_rmse_train, coc_rmse_test],
             [so_rmse_train, so_rmse_test]]

rec_model_names = ['normalpred', 'SVD', 'SVD++', 'NMF','KNNBasic','CoClustering','SlopeOne']
df_comparison = pd.DataFrame(rec_rmses, index=rec_model_names)
df_comparison.rename(columns = {0:'RMSE train', 1:'RMSE test'}, inplace = True)
df_comparison.T

Unnamed: 0,normalpred,SVD,SVD++,NMF,KNNBasic,CoClustering,SlopeOne
RMSE train,1.94866,1.517063,1.51376,1.520551,0.303789,0.409178,0.036394
RMSE test,1.953035,1.536,1.536,1.537422,1.495519,1.428014,1.386304


    5.2.2 Hyperparameter Tuning for Best Model : SVD ++ 

In [41]:
# Scaling for Feature of Raw Rating
scaler = (df_collaborative.estimator.min(), df_collaborative.estimator.max())
reader = Reader(rating_scale=scaler)
#Load the dataframe
data = Dataset.load_from_df(df_collaborative[['customer_unique_id','productId', 'estimator']], reader)

In [42]:
#Hyperparameter Tuning
param_grid = {'n_factors': [25, 50],'n_epochs': [30,50],
              'lr_all': [0.005,0.01],'reg_all':[0.02,0.1]}

gs_svdpp = GridSearchCV(SVDpp, param_grid, measures=['rmse'], cv=5)
gs_svdpp.fit(data)

best_score_svdpp = gs_svdpp.best_score['rmse']
best_param_svdpp = gs_svdpp.best_params['rmse']

print(f'Best score: {best_score_svdpp}')
print(f'Best parameter: {best_param_svdpp}')



Best score: 1.3643328551431844
Best parameter: {'n_factors': 25, 'n_epochs': 50, 'lr_all': 0.01, 'reg_all': 0.1}


In [43]:
# Create the model using the best parameters from GridSearchCV
best_param_svdpp = gs_svdpp.best_params['rmse']
best_model_svdpp = SVDpp(**best_param_svdpp)

# Fit the model on the entire dataset
best_model_svdpp.fit(data.build_full_trainset())

<surprise.prediction_algorithms.matrix_factorization.SVDpp at 0x7de6699f2aa0>

##### **5.3 Content Based Filtering**

In [44]:
# Handle missing values (if any)
df_content['product_category_name'] = df_content['product_category_name'].fillna('unknown')

In [45]:
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(df_content['product_category_name'])

In [46]:
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

In [47]:
def get_recommendations(product_id, df_content, cosine_sim, top_n=5):
    idx = df_content[df_content['product_id'] == product_id].index[0]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:top_n+1]  # Skip the item itself

    # Extract product IDs and categories
    recommended_indices = [i[0] for i in sim_scores]
    recommended_products = df_content.iloc[recommended_indices][['product_id', 'product_category_name']]

    return recommended_products

In [48]:
def get_user_purchased_products(user_id, df_all):
    # Filter df_all to get products bought by the specified user
    user_purchases = df_all[df_all['customer_unique_id'] == user_id]

    # Extract the product IDs of the products purchased by the user
    purchased_product_ids = user_purchases['product_id'].unique()

    return purchased_product_ids



In [49]:
# Content Based Recommendation System Implementation

user_id =   'C48272'
get_product_id = get_user_purchased_products(user_id,df_all)
product_id_str = get_product_id[0]
recommendations = get_recommendations(product_id_str, df_content, cosine_sim, top_n=5)

print(get_product_id)
print(recommendations)


['P20845']
    product_id product_category_name
2       P23317      office_furniture
3       P20373      office_furniture
42      P17672      office_furniture
72      P21112      office_furniture
292     P29007      office_furniture


##### **5.4 Hybrid Recommendation System**

In [50]:
df_content.columns

Index(['product_id', 'product_category_name'], dtype='object')

In [51]:
df_collab.columns

Index(['customer_unique_id', 'product_id', 'rating'], dtype='object')

In [52]:
reader = Reader(rating_scale=(1, 5))  # Adjust rating scale if needed
data = Dataset.load_from_df(df_collab[['customer_unique_id', 'product_id', 'rating']], reader)

trainset = data.build_full_trainset()
svdpp = SVDpp()
svdpp.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVDpp at 0x7de66940dc60>

In [53]:
svdpp_model = best_model_svdpp

In [54]:
def hybrid_recommendation_system(user_id, df_all, df_content, cosine_sim, svdpp_model, top_n=5):
    # Step 1: Content-Based Recommendation
    get_product_id = get_user_purchased_products(user_id, df_all)
    content_based_recommendations = []

    if get_product_id:  # Ensure user has purchased products
        product_id_str = get_product_id[0]  # Get the first product ID from the list
        content_based_recommendations = get_recommendations(product_id_str, df_content, cosine_sim, top_n=top_n)

    # Step 2: Collaborative Filtering - Get top 2 recommended categories
    user_rated_items = df_all[df_all['customer_unique_id'] == user_id]['product_id'].values

    category_predictions = []
    for product_id in df_content['product_id'].unique():
        if product_id not in user_rated_items:
            pred = svdpp_model.predict(user_id, product_id)
            category = df_content[df_content['product_id'] == product_id]['product_category_name'].values[0]
            category_predictions.append((category, pred.est))

    # Sort and select top 2 categories based on the prediction score
    top_categories = sorted(category_predictions, key=lambda x: x[1], reverse=True)[:2]
    top_categories = [category for category, _ in top_categories]

    # Step 3: Cascade into Content-Based for top 2 categories
    cascade_recommendations = []
    for category in top_categories:
        # Filter products by category
        category_products = df_content[df_content['product_category_name'] == category]['product_id'].values
        for product_id in category_products[:2]:  # Select 2 products for each category
            recs = get_recommendations(product_id, df_content, cosine_sim, top_n=top_n)
            cascade_recommendations.extend(recs[['product_id', 'product_category_name']].values.tolist())

    return {
        "content_based_recommendations": content_based_recommendations,
        "collaborative_filtering_recommendations": cascade_recommendations
    }



In [55]:
user_id = 'C48272'

# Call the hybrid recommendation system function
recommendations = hybrid_recommendation_system(user_id, df_all, df_content, cosine_sim, svdpp_model)

In [56]:
def print_recommendations(recommendations):
    # Print Content-Based Recommendations
    print("Similar Items:\n")
    for index, row in recommendations["content_based_recommendations"].iterrows():
        print(f"Product ID: {row['product_id']}, Category: {row['product_category_name']}")

    print("\nYou might also like:\n")
    for rec in recommendations["collaborative_filtering_recommendations"]:
        print(f"Product ID: {rec[0]}, Category: {rec[1]}")

# Call the function to print recommendations
print_recommendations(recommendations)


Similar Items:

Product ID: P23317, Category: office_furniture
Product ID: P20373, Category: office_furniture
Product ID: P17672, Category: office_furniture
Product ID: P21112, Category: office_furniture
Product ID: P29007, Category: office_furniture

You might also like:

Product ID: P11675, Category: consoles_games
Product ID: P16377, Category: consoles_games
Product ID: P18203, Category: consoles_games
Product ID: P1318, Category: consoles_games
Product ID: P10955, Category: consoles_games
Product ID: P12310, Category: computers_accessories
Product ID: P2403, Category: computers_accessories
Product ID: P30580, Category: computers_accessories
Product ID: P7743, Category: computers_accessories
Product ID: P29467, Category: computers_accessories
Product ID: P10103, Category: perfumery
Product ID: P6578, Category: perfumery
Product ID: P11134, Category: perfumery
Product ID: P11833, Category: perfumery
Product ID: P715, Category: perfumery
Product ID: P16030, Category: auto
Product ID: 

In [57]:
df_all[['customer_unique_id','product_id','order_status','product_category_name']].head()

Unnamed: 0,customer_unique_id,product_id,order_status,product_category_name
0,C48272,P20845,delivered,office_furniture
1,C14788,P9335,delivered,housewares
2,C2174,P23317,delivered,office_furniture
3,C13604,P20373,delivered,office_furniture
4,C18911,P18210,delivered,home_confort


In [58]:
df_all[df_all['customer_unique_id']=='C48272'][['customer_unique_id','product_id','order_status','product_category_name']]

Unnamed: 0,customer_unique_id,product_id,order_status,product_category_name
0,C48272,P20845,delivered,office_furniture


##### **5.5 Recommendation for New User (Cold Start)**

In a cold-start scenario, where a new user has just joined the platform and there is no prior interaction data, our recommendation system will focus on suggesting popular items. These items are selected based on metrics such as the most purchased, most reviewed, and those trending in the user's geographic area. This approach ensures that new users receive relevant and appealing recommendations even without a personalized history.

In [59]:
def find_popular_items(data, n_recs):
    top_n_items = data.product_id.value_counts().sort_values(ascending=False)[:n_recs].index
    return list(top_n_items)

In [60]:
def popular_in_your_area(data, state, n_recs):
    location_df = data[data.customer_state == state]
    top_n_items = location_df.product_id.value_counts().sort_values(ascending=False)[:n_recs].index
    return list(top_n_items)

In [61]:
def first_time_recommender(data, uid, n_recs):
    hot_items = find_popular_items(data, n_recs)
    state = data[data.customer_unique_id==uid].customer_state.max()
    popular_in_area = popular_in_your_area(data, state, n_recs)

    print(f"Hot items you might like:\n {hot_items}\n")
    print(f"Popular items in your area:{state}\n {popular_in_area}")

    recommendation = {'Hot Items': hot_items, 'Area': popular_in_area}

    return recommendation

In [62]:
# Example Recommendation
recommendation = first_time_recommender(df_all, 'C48272', 3)

Hot items you might like:
 ['P21247', 'P18956', 'P8281']

Popular items in your area:SP
 ['P21247', 'P18956', 'P8281']


#### **5.6 Machine Learning Implementation**


In [63]:
def find_popular_items(df_all, n_recs):
    # Group by product_id and count the number of occurrences (i.e., sales)
    popular_items = df_all['product_id'].value_counts().head(n_recs).index.tolist()

    # Get product details (e.g., name, category) based on product_id
    popular_items_details = df_all[df_all['product_id'].isin(popular_items)][['product_id', 'product_category_name']].drop_duplicates()

    return popular_items_details


In [64]:
def popular_in_your_area(df_all, global_area, n_recs):
    # Filter data for the specific global area
    area_data = df_all[df_all['customer_state'] == global_area]

    # Group by product_id and count the number of occurrences (i.e., sales) in the specified area
    most_sold_items = area_data['product_id'].value_counts().head(n_recs).index.tolist()

    # Get product details (e.g., name, category) based on product_id
    most_sold_items_details = area_data[area_data['product_id'].isin(most_sold_items)][['product_id','product_category_name']].drop_duplicates()

    return most_sold_items_details



In [65]:
global_area = 'SP'
n_recs = 3
def check_and_recommend(user_id, df_all, df_content, cosine_sim, svdpp_model,data):
    if user_id in df_all['customer_unique_id'].values:
        # User exists in df_all, use the hybrid recommendation system
        recommendations = hybrid_recommendation_system(user_id, df_all, df_content, cosine_sim, svdpp_model)
    else:
        # If the user does not exist in df_all, provide cold start recommendations
        print(f"New customer detected: {user_id}. Generating cold start recommendations.")

        # Cold start recommendations: find most sold products globally
        hot_items = find_popular_items(df_all, n_recs)

        # Cold start recommendations: find popular items in the global area
        popular_in_area = popular_in_your_area(df_all, global_area, n_recs)

        # Return cold start recommendations
        return {
            "hot_items": hot_items,
            "popular_in_area": popular_in_area
        }

    return recommendations



5.6.1. Implementation On New User

In [66]:
# Example usage:
user_id = '81766705'
recommendations = check_and_recommend(user_id, df_all, df_content, cosine_sim, svdpp_model,data)

New customer detected: 81766705. Generating cold start recommendations.


In [67]:
# Separate the hot items and popular items in the area
hot_items = recommendations.get("hot_items", pd.DataFrame())  # Fallback to empty DataFrame if not found
popular_in_area = recommendations.get("popular_in_area", pd.DataFrame())  # Fallback to empty DataFrame if not found

# Print hot items
print("Hot Items You Might Like:")
print(hot_items)

# Print popular items in the area
print("\nPopular Items in Your Area:")
print(popular_in_area)

Hot Items You Might Like:
    product_id product_category_name
17      P18956        bed_bath_table
32       P8281          garden_tools
329     P21247       furniture_decor

Popular Items in Your Area:
    product_id product_category_name
17      P18956        bed_bath_table
32       P8281          garden_tools
390     P21247       furniture_decor


### Normal Customer ###

In [68]:
# Example usage:
user_id = 'C48272'
recommendations = check_and_recommend(user_id, df_all,df_content, cosine_sim, svdpp_model,data)


In [69]:
# Call the function to print recommendations
print_recommendations(recommendations)

Similar Items:

Product ID: P23317, Category: office_furniture
Product ID: P20373, Category: office_furniture
Product ID: P17672, Category: office_furniture
Product ID: P21112, Category: office_furniture
Product ID: P29007, Category: office_furniture

You might also like:

Product ID: P11675, Category: consoles_games
Product ID: P16377, Category: consoles_games
Product ID: P18203, Category: consoles_games
Product ID: P1318, Category: consoles_games
Product ID: P10955, Category: consoles_games
Product ID: P12310, Category: computers_accessories
Product ID: P2403, Category: computers_accessories
Product ID: P30580, Category: computers_accessories
Product ID: P7743, Category: computers_accessories
Product ID: P29467, Category: computers_accessories
Product ID: P10103, Category: perfumery
Product ID: P6578, Category: perfumery
Product ID: P11134, Category: perfumery
Product ID: P11833, Category: perfumery
Product ID: P715, Category: perfumery
Product ID: P16030, Category: auto
Product ID: 

## 5.7 Batch Processing for Unseen Data ##

### 5.7.1 Batch Processing for 2nd Transaction of 100 Customers ###


In [70]:
# Extract 'customer_unique_id' from df_check and convert to a list
customer_ids_list = df_check['customer_unique_id'].tolist()

# Print the resulting list
print(customer_ids_list)

['C10000', 'C10012', 'C10084', 'C10273', 'C10285', 'C104', 'C10403', 'C10433', 'C1046', 'C10488', 'C10532', 'C1055', 'C10701', 'C10723', 'C10757', 'C10837', 'C10922', 'C11006', 'C11055', 'C11199', 'C11306', 'C1138', 'C11446', 'C11460', 'C11563', 'C11581', 'C11600', 'C11633', 'C11634', 'C11685', 'C11722', 'C11768', 'C11819', 'C12028', 'C12071', 'C12104', 'C12110', 'C12184', 'C12225', 'C12233', 'C12240', 'C12272', 'C12351', 'C12481', 'C12672', 'C12697', 'C12755', 'C12784', 'C1282', 'C12921', 'C12975', 'C13042', 'C13052', 'C13053', 'C13055', 'C13077', 'C1312', 'C13183', 'C13321', 'C13375', 'C13411', 'C13420', 'C13514', 'C13582', 'C13609', 'C13705', 'C1378', 'C13782', 'C1391', 'C13931', 'C14022', 'C14096', 'C1412', 'C14126', 'C14142', 'C14187', 'C14214', 'C14372', 'C14398', 'C14409', 'C14473', 'C14492', 'C14496', 'C14587', 'C14601', 'C14677', 'C14679', 'C14716', 'C14729', 'C14861', 'C14869', 'C14882', 'C14960', 'C15067', 'C15154', 'C15194', 'C15199', 'C15214', 'C15217', 'C15246']


In [71]:
# Convert the list to a DataFrame
df_customer_ids1 = pd.DataFrame(customer_ids_list, columns=['customer_unique_id'])

# Save the DataFrame to a CSV file
df_customer_ids1.to_csv('customer_ids1.csv', index=False)

print("CSV file saved successfully!")

CSV file saved successfully!


In [72]:
# Filter df_all to include only rows where 'customer_unique_id' is in customer_ids_list
df_first_transaction = df_all[df_all['customer_unique_id'].isin(customer_ids_list)]

# Print the resulting DataFrame
df_first_transaction.shape


(100, 27)

In [81]:
import pandas as pd

# Step 1: Extract unique user IDs from df_transaction
user_ids = df_first_transaction['customer_unique_id'].unique()

# Step 2: Initialize an empty list to store the top 3 recommendations for each user
top_3_recommendations = []

# Step 3: Loop through each user and generate predictions
for user_id in user_ids:
    predictions_list = []

    # Get the list of all items (e.g., products) in your dataset
    items = df_first_transaction['product_id'].unique()

    # Predict for each item for the current user
    for item_id in items:
        prediction = svdpp.predict(user_id, item_id)

        # Map item_id to product_category using df_first_transaction
        product_category = df_first_transaction[df_first_transaction['product_id'] == item_id]['product_category_name'].values[0]

        # Append the prediction details to the list
        predictions_list.append({
            'user_id': user_id,
            'item_id': item_id,
            'product_category': product_category,
            'estimated_rating': prediction.est
        })

    # Convert the predictions list to a DataFrame
    df_user_predictions = pd.DataFrame(predictions_list)

    # Step 4: Group by product_category and calculate the average estimated rating
    df_avg_ratings = df_user_predictions.groupby('product_category')['estimated_rating'].mean().reset_index()

    # Step 5: Sort by estimated rating in descending order and get the top 2 categories
    df_top_2 = df_avg_ratings.sort_values(by='estimated_rating', ascending=False).head(4)

    # Step 6: Get the user's own product category (assuming it's the category of the user's last transaction)
    own_category = df_first_transaction[df_first_transaction['customer_unique_id'] == user_id].iloc[-1]['product_category_name']

    # Step 7: Append the user's own category to the front of the top 2 categories
    top_3_categories = [own_category] + df_top_2['product_category'].tolist()

    # Append the top 3 categories to the recommendations list
    top_3_recommendations.append({
        'user_id': user_id,
        'top_5_categories': top_3_categories
    })

# Step 8: Convert the top 3 recommendations to a DataFrame
df_top_3_recommendations = pd.DataFrame(top_3_recommendations)

# Print the resulting DataFrame with top 3 product categories for each user
print(df_top_3_recommendations)


   user_id                                   top_5_categories
0   C14882  [telephony, garden_tools, stationery, books_ge...
1   C13705  [health_beauty, garden_tools, stationery, auto...
2   C11768  [telephony, garden_tools, books_general_intere...
3   C12784  [furniture_decor, garden_tools, stationery, co...
4   C12975  [fashion_bags_accessories, auto, books_general...
..     ...                                                ...
95  C10922  [garden_tools, stationery, auto, cool_stuff, c...
96  C13321  [watches_gifts, books_general_interest, cool_s...
97  C15194  [stationery, stationery, auto, cool_stuff, cin...
98  C12028  [bed_bath_table, books_general_interest, lugga...
99  C11563  [sports_leisure, luggage_accessories, cool_stu...

[100 rows x 2 columns]


In [85]:
# Step 1: Merge df_check with df_top_3_recommendations on user_id (customer_unique_id)
merged_df = pd.merge(df_check, df_top_3_recommendations, left_on='customer_unique_id', right_on='user_id')

# Step 2: Check if the product_category_name in df_check exists in the top 3 categories
merged_df['is_in_top_3'] = merged_df.apply(
    lambda row: row['product_category_name'] in row['top_5_categories'], axis=1
)

# Step 3: Calculate the percentage of product categories in df_check that are in the top 3 recommendations
percentage_existing_prediction = merged_df['is_in_top_3'].mean() * 100

# Print the percentage
print(f"The percentage of product categories in df_check that are in the top 5 recommendations is: {percentage_existing_prediction:.2f}%")


The percentage of product categories in df_check that are in the top 5 recommendations is: 41.00%


In [86]:
import pandas as pd

# Step 1: Filter df_two_transactions to include only the customers in df_check
df_filtered = df_two_transactions[df_two_transactions['customer_unique_id'].isin(df_check['customer_unique_id'])]

# Step 2: Ensure the DataFrame is sorted by customer and transaction time/order (to differentiate 1st and 2nd transactions)
df_filtered = df_filtered.sort_values(by=['customer_unique_id', 'order_id'])

# Step 3: Compare the product categories of the 1st and 2nd transactions for each customer
df_comparison = df_filtered.groupby('customer_unique_id').apply(lambda x: x.iloc[0]['product_category_name'] == x.iloc[1]['product_category_name']).reset_index()
df_comparison.columns = ['customer_unique_id', 'same_category']

# Step 4: Calculate the percentage of customers with the same product category for both transactions
percentage_same_category = df_comparison['same_category'].mean() * 100

# Print the percentage
print(f"The percentage of customers in df_check with the same product category for both transactions is: {percentage_same_category:.2f}%")



The percentage of customers in df_check with the same product category for both transactions is: 34.00%


### Batch Processing for 2nd Transaction of 100 Customer with Cross-Categorical Transaction ###

In [76]:
# Extract 'customer_unique_id' from df_check and convert to a list
customer_ids_list2 = df_check_2['customer_unique_id'].tolist()

# Print the resulting list
print(customer_ids_list2)

['C10273', 'C10285', 'C104', 'C10433', 'C1046', 'C10488', 'C10532', 'C10701', 'C10723', 'C10837', 'C10922', 'C11006', 'C11306', 'C1138', 'C11446', 'C11460', 'C11581', 'C11633', 'C11634', 'C11685', 'C11722', 'C11768', 'C11819', 'C12071', 'C12104', 'C12184', 'C12225', 'C12233', 'C12240', 'C12272', 'C12351', 'C12481', 'C12697', 'C12784', 'C1282', 'C13055', 'C13077', 'C13183', 'C13321', 'C13375', 'C13411', 'C13420', 'C13514', 'C13582', 'C13609', 'C1378', 'C14022', 'C14096', 'C1412', 'C14142', 'C14214', 'C14398', 'C14473', 'C14492', 'C14496', 'C14587', 'C14601', 'C14677', 'C14679', 'C14716', 'C14869', 'C14882', 'C14960', 'C15194', 'C15199', 'C15214', 'C15459', 'C15524', 'C15597', 'C15608', 'C15640', 'C15669', 'C15804', 'C15819', 'C15953', 'C16001', 'C16072', 'C16151', 'C16210', 'C1622', 'C16253', 'C16331', 'C16364', 'C16525', 'C16545', 'C16629', 'C17016', 'C17081', 'C1711', 'C17159', 'C17236', 'C17300', 'C17617', 'C1764', 'C17839', 'C17950', 'C17996', 'C18128', 'C18189', 'C18223']


In [77]:
# Convert the list to a DataFrame
df_customer_ids = pd.DataFrame(customer_ids_list2, columns=['customer_unique_id'])

# Save the DataFrame to a CSV file
df_customer_ids.to_csv('customer_ids2.csv', index=False)

print("CSV file saved successfully!")

CSV file saved successfully!


In [78]:
# Filter df_all to include only rows where 'customer_unique_id' is in customer_ids_list
df_first_transaction2 = df_all[df_all['customer_unique_id'].isin(customer_ids_list2)]

# Print the resulting DataFrame
df_first_transaction2.shape

(100, 27)

In [88]:
import pandas as pd

# Step 1: Extract unique user IDs from df_transaction
user_ids = df_first_transaction2['customer_unique_id'].unique()

# Step 2: Initialize an empty list to store the top 3 recommendations for each user
top_3_recommendations2 = []

# Step 3: Loop through each user and generate predictions
for user_id in user_ids:
    predictions_list = []

    # Get the list of all items (e.g., products) in your dataset
    items = df_first_transaction2['product_id'].unique()

    # Predict for each item for the current user
    for item_id in items:
        prediction = svdpp.predict(user_id, item_id)

        # Map item_id to product_category using df_first_transaction
        product_category = df_first_transaction2[df_first_transaction2['product_id'] == item_id]['product_category_name'].values[0]

        # Append the prediction details to the list
        predictions_list.append({
            'user_id': user_id,
            'item_id': item_id,
            'product_category': product_category,
            'estimated_rating': prediction.est
        })

    # Convert the predictions list to a DataFrame
    df_user_predictions2 = pd.DataFrame(predictions_list)

    # Step 4: Group by product_category and calculate the average estimated rating
    df_avg_ratings2 = df_user_predictions2.groupby('product_category')['estimated_rating'].mean().reset_index()

    # Step 5: Sort by estimated rating in descending order and get the top 2 categories
    df_top2 = df_avg_ratings2.sort_values(by='estimated_rating', ascending=False).head(4)

    # Step 6: Get the user's own product category (assuming it's the category of the user's last transaction)
    own_category2 = df_first_transaction2[df_first_transaction2['customer_unique_id'] == user_id].iloc[-1]['product_category_name']

    # Step 7: Append the user's own category to the front of the top 2 categories
    top_3_categories2 = [own_category] + df_top2['product_category'].tolist()

    # Append the top 3 categories to the recommendations list
    top_3_recommendations2.append({
        'user_id': user_id,
        'top_5_categories': top_3_categories2
    })

# Step 8: Convert the top 3 recommendations to a DataFrame
df_top_3_recommendations2 = pd.DataFrame(top_3_recommendations2)

# Print the resulting DataFrame with top 3 product categories for each user
print(df_top_3_recommendations2)

   user_id                                   top_5_categories
0   C14882  [sports_leisure, drinks, garden_tools, cool_st...
1   C11768  [sports_leisure, drinks, cool_stuff, auto, sta...
2   C15669  [sports_leisure, health_beauty, auto, baby, st...
3   C12784  [sports_leisure, cool_stuff, stationery, drink...
4    C1282  [sports_leisure, stationery, cool_stuff, baby,...
..     ...                                                ...
95  C14473  [sports_leisure, baby, consoles_games, cool_st...
96  C12104  [sports_leisure, baby, cool_stuff, stationery,...
97  C10922  [sports_leisure, cool_stuff, drinks, stationer...
98  C13321  [sports_leisure, home_confort, baby, consoles_...
99  C15194  [sports_leisure, drinks, cool_stuff, stationer...

[100 rows x 2 columns]


In [90]:
# Step 1: Merge df_check with df_top_3_recommendations on user_id (customer_unique_id)
merged_df2 = pd.merge(df_check_2, df_top_3_recommendations2, left_on='customer_unique_id', right_on='user_id')

# Step 2: Check if the product_category_name in df_check exists in the top 3 categories
merged_df2['is_in_top_3'] = merged_df2.apply(
    lambda row: row['product_category_name'] in row['top_5_categories'], axis=1
)

# Step 3: Calculate the percentage of product categories in df_check that are in the top 3 recommendations
percentage_existing_prediction2 = merged_df2['is_in_top_3'].mean() * 100

# Print the percentage
print(f"The percentage of product categories in df_check that are in the top 5 recommendations is: {percentage_existing_prediction2:.2f}%")

The percentage of product categories in df_check that are in the top 5 recommendations is: 19.00%
