# **BUSINESS CASE 3: Recheio Recommendation System**  


## 🎓 Master’s Program in Data Science & Advanced Analytics 
**Nova IMS** | March 2025   
**Course:** Business Cases with Data Science

## 👥 Team **Group A**  
- **Alice Viegas** | 20240572  
- **Bernardo Faria** | 20240579  
- **Dinis Pinto** | 20240612  
- **Daan van Holten** | 20240681
- **Philippe Dutranoit** | 20240518

## 📊 Project Overview  
This notebook uses the following dataset:  
- Case3_Recheio_2025 (1).xlsx

The goal of the project is to design a recommendation system so that the company can propose better products to existing customers.

## 📊 Goal of the notebook

In this notebook we will build a smart basket for existing clients. <br>

In [8]:
import os
import pandas as pd
import numpy as np
from sklearn.metrics import pairwise_distances
import random

In [9]:
# Global definitions
baseFolder = os.getcwd()
exportsFolder = os.path.join(baseFolder, 'Exports') + os.sep

In [10]:
transactions = pd.read_csv('../Data/df.csv')
print("Number of unique clients:", transactions['Client ID'].nunique())

Number of unique clients: 1529


In [11]:
transactions.info()
print("Number of unique clients:", transactions['Client ID'].nunique())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 884099 entries, 0 to 884098
Data columns (total 8 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   Date                 884099 non-null  object
 1   Client ID            884099 non-null  int64 
 2   ZIP Code             884099 non-null  int64 
 3   ID Client Type       389817 non-null  object
 4   ID Product           884099 non-null  int64 
 5   Product Description  884099 non-null  object
 6   ID Product Category  884099 non-null  object
 7   Own Brand            884099 non-null  int64 
dtypes: int64(4), object(4)
memory usage: 54.0+ MB
Number of unique clients: 1529


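We will build a smart basket based on the following 2 criteria:
- items previously purchased by the client (the 5 products bought most often)
- items the client has not yet bought that were purchased by similar clients, where similarity between clients is measured with cosine similarity on their purchase histories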
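
As a rough illustration of the second criterion, the sketch below finds the most similar client on a toy binary client-by-product matrix using cosine distance, then lists the products that client bought and the target has not. The client labels, product names and variable names are invented for the example and are not part of the Recheio data.

```python
import pandas as pd
from sklearn.metrics import pairwise_distances

# Toy purchase log (hypothetical clients and products).
toy = pd.DataFrame({
    "Client ID": ["A", "A", "B", "B", "B", "C", "C"],
    "ID Product": ["rice", "milk", "rice", "milk", "eggs", "soap", "eggs"],
})

# Binary client x product matrix: 1 if the client ever bought the product.
toy_matrix = (pd.crosstab(toy["Client ID"], toy["ID Product"]) > 0).astype(int)

# Cosine distance from client A to every client (including A itself).
dist_to_a = pd.Series(
    pairwise_distances(
        toy_matrix.loc[["A"]].to_numpy(), toy_matrix.to_numpy(), metric="cosine"
    )[0],
    index=toy_matrix.index,
)

# Nearest neighbour of A, excluding A.
nearest_client = dist_to_a.drop("A").sort_values().index[0]

# Products bought by the nearest client that A has not bought yet.
bought_by_a = set(toy.loc[toy["Client ID"] == "A", "ID Product"])
bought_by_nearest = set(toy.loc[toy["Client ID"] == nearest_client, "ID Product"])
candidates = sorted(bought_by_nearest - bought_by_a)
print(nearest_client, candidates)
```

Client B shares two of A's products while client C shares none, so B is the nearest neighbour and the only new candidate for A is the product B bought in addition.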

The smart basket combines up to 5 products from each criterion, giving up to 10 recommended products. If a client has bought fewer than 5 distinct products, the basket will contain fewer recommendations, as the first criterion cannot be fully met.
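
The way the two lists are combined can be sketched as follows. This is a minimal illustration under stated assumptions: `merge_basket` is a hypothetical helper name and the product IDs are made up.

```python
def merge_basket(history_items, similar_items, max_items=10):
    """Keep history items first, then append unseen similar-client items."""
    basket = list(history_items)
    for item in similar_items:
        if len(basket) >= max_items:
            break
        if item not in basket:
            basket.append(item)
    return basket

# Client with a full purchase history: 5 + 5 = 10 recommendations.
full = merge_basket([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
# Client who bought only 2 distinct products: 2 + 5 = 7 recommendations.
short = merge_basket([1, 2], [6, 7, 8, 9, 10])
print(len(full), len(short))  # 10 7
```

History items keep the first ranks, so a client with a short purchase history simply receives a shorter basket.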

We will then run a Monte Carlo cross-validation with 5 iterations. In each iteration the data is split at a random point in time: the training set is the data before that point and the test set is the data after it. Clients that do not meet the minimum criteria (at least 1 transaction in the test set and 5 transactions in the training set) are excluded from the accuracy analysis. The remaining clients are used to compute the mean hit rate, which compares the products recommended from the training set with the products actually purchased in the test set.
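
A single random temporal split with the eligibility filter might look like the sketch below. The toy transactions, the 6-week minimum training window and the variable names are assumptions made for illustration only.

```python
import random
import pandas as pd

# Toy transaction log (hypothetical data): client 1 buys weekly, client 2 once.
toy_tx = pd.DataFrame({
    "Client ID": [1] * 10 + [2],
    "Date": list(pd.date_range("2024-01-01", periods=10, freq="7D"))
            + [pd.Timestamp("2024-01-15")],
})

rng = random.Random(0)  # seeded so the split date is reproducible
first, last = toy_tx["Date"].min(), toy_tx["Date"].max()
min_train_span = pd.Timedelta(weeks=6)  # assumed minimum training window

# Draw the split date uniformly between (first + 6 weeks) and the last date.
split_date = first + min_train_span + (last - first - min_train_span) * rng.random()

train = toy_tx[toy_tx["Date"] <= split_date]
test = toy_tx[toy_tx["Date"] > split_date]

# Keep clients with at least 5 training and 1 test transactions.
train_counts = train["Client ID"].value_counts()
test_counts = test["Client ID"].value_counts()
eligible = [
    int(c) for c in toy_tx["Client ID"].unique()
    if train_counts.get(c, 0) >= 5 and test_counts.get(c, 0) >= 1
]
print(split_date.date(), eligible)
```

Client 2 has a single transaction, which falls in the training window, so it never meets the minimum criteria and is left out of the evaluation whatever split date is drawn.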

# Recommendation System

In [12]:
class SmartBasketRecommenderCV:
    def __init__(self, transactions, min_train=5, min_test=1, k=10):
        self.transactions = transactions.copy()
        self.k = k
        self.min_train = min_train
        self.min_test = min_test

        self.transactions['Date'] = pd.to_datetime(self.transactions['Date'])

    # this function will be used to get the top 5 products purchased by a client
    def top_purchase_history(self, client_id, df):
        client_data = df[df['Client ID'] == client_id]
        top_products = (
            client_data['ID Product']
            .value_counts()
            .head(5)
            .index
            .tolist()
        )
        return top_products

    # this function will be used to get the top 5 products purchased by similar clients (based on cosine similarity)
    # it also boosts the score of own brand products
    # it uses the cosine similarity to find similar clients
    # it then filters out the products already purchased by the target client
    # and returns the top 5 products purchased by similar clients
    def collaborative_recommendations(self, client_id, interaction_matrix, df, top_n_similar=5):
        if client_id not in interaction_matrix.index:
            return []

        client_idx = interaction_matrix.index.get_loc(client_id)
        distance_matrix = pairwise_distances(interaction_matrix, metric='cosine')
        distances = distance_matrix[client_idx]
        similar_indices = distances.argsort()[1:top_n_similar+1]
        similar_clients = interaction_matrix.index[similar_indices]

        similar_purchases = df[df['Client ID'].isin(similar_clients)]
        target_purchases = df[df['Client ID'] == client_id]['ID Product'].unique()

        # Filter out already purchased items
        new_purchases = similar_purchases[~similar_purchases['ID Product'].isin(target_purchases)]

        # Group by product and boost own brand
        product_scores = (
            new_purchases.groupby('ID Product')
            .agg({
                'Client ID': 'count',
                'Own Brand': 'max'  # Assuming binary (0 or 1)
            })
            .rename(columns={'Client ID': 'count'})
        )

        # Boost own brand products
        product_scores['score'] = product_scores['count'] * (1.5 * product_scores['Own Brand'] + 1)

        # Return top 5
        top_products = product_scores.sort_values(by='score', ascending=False).head(5).index.tolist()
        return top_products

    # this function gets the smart basket recommendations for a client combining the recommendations from the 2 previous functions
    def smart_basket(self, client_id, df, interaction_matrix):
        hist_recs = self.top_purchase_history(client_id, df)
        collab_recs = self.collaborative_recommendations(client_id, interaction_matrix, df)
        final_recs = hist_recs.copy()
        for item in collab_recs:
            if item not in final_recs:
                final_recs.append(item)
            if len(final_recs) == 10:
                break
        return final_recs

    # this function calculates the precision at k for the recommendations
    # it checks how many of the top k recommendations are in the test set
    def precision_at_k(self, train_recs, test_items):
        if not test_items:
            return 0.0
        hits = len(set(train_recs[:self.k]) & set(test_items))
        return hits / self.k

    # this function creates the binary client x product interaction matrix
    # (1 if the client ever bought the product, 0 otherwise)
    def create_interaction_matrix(self):
        matrix = pd.crosstab(self.transactions['Client ID'], self.transactions['ID Product'])
        # vectorized comparison instead of the deprecated DataFrame.applymap
        return (matrix > 0).astype(int)

    # this function performs a Monte Carlo cross-validation
    # it repeatedly splits the data at a random date into a training set (before) and a test set (after)
    # and calculates the mean hit rate (precision at k) for each iteration
    # the function takes the number of iterations, the minimum length of the training window in weeks,
    # and a random seed for reproducibility
    # it returns a list of hit rates for each iteration
    # the function also prints the mean and standard deviation of the hit rates
    def monte_carlo_cv(self, iterations=5, weeks_train=45, random_seed=42):
        random.seed(random_seed)  # make the random split dates reproducible
        hit_rates_all = []

        for i in range(iterations):
            earliest = self.transactions['Date'].min()
            latest = self.transactions['Date'].max() - pd.to_timedelta(weeks_train, unit='w')
            random_start = earliest + (latest - earliest) * random.random()
            split_date = pd.to_datetime(random_start) + pd.to_timedelta(weeks_train, unit='w')

            self.transactions['train_split'] = (self.transactions['Date'] <= split_date).astype(int)
            train_set = self.transactions[self.transactions['train_split'] == 1]
            test_set = self.transactions[self.transactions['train_split'] == 0]

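            # binary client x product matrix; note that it is built from all transactions
            # (training and test periods), not only from train_set
            interaction_matrix = pd.crosstab(self.transactions['Client ID'], self.transactions['ID Product'])
            interaction_matrix = (interaction_matrix > 0).astype(int)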

            valid_clients = []
            for client_id in self.transactions['Client ID'].unique():
                if (train_set[train_set['Client ID'] == client_id].shape[0] >= self.min_train and
                    test_set[test_set['Client ID'] == client_id].shape[0] >= self.min_test):
                    valid_clients.append(client_id)

            hit_rates = []
            for client_id in valid_clients:
                train_recs = self.smart_basket(client_id, train_set, interaction_matrix)
                test_items = test_set[test_set['Client ID'] == client_id]['ID Product'].unique().tolist()
                hit = self.precision_at_k(train_recs, test_items)
                hit_rates.append(hit)

            mean_hit = np.mean(hit_rates)
            hit_rates_all.append(mean_hit)
            print(f"Iteration {i+1}: Hit Rate = {mean_hit:.2%}")

        print(f"\nFinal Hit Rate: {np.mean(hit_rates_all):.2%} ± {np.std(hit_rates_all):.2%}")
        return hit_rates_all


# Performance of the model

To evaluate the recommendation model, we applied Monte Carlo cross-validation, repeatedly splitting the transaction data into training and test sets at a random date. In each iteration, the recommender generated product suggestions from the training data, and accuracy was measured with the hit rate, computed as precision at k: the number of recommended products that the client actually bought in the test period, divided by k = 10. Averaging the hit rate over several random splits gives a more robust estimate of real-world performance than a single split, because the result depends less on one particular data partition.
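
As a small worked example of the metric, the sketch below mirrors the logic of `precision_at_k` in the class above as a standalone function. The function name `precision_at_k_example` and the product IDs are made up for illustration.

```python
def precision_at_k_example(recommended, purchased, k=10):
    """Share of the k recommendation slots that the client later bought."""
    if not purchased:
        return 0.0
    hits = len(set(recommended[:k]) & set(purchased))
    return hits / k

recommended = [101, 102, 103, 104, 105, 106, 107, 108, 109, 110]
purchased_later = [103, 107, 110, 555]

score = precision_at_k_example(recommended, purchased_later, k=10)
print(f"{score:.0%}")  # 3 of the 10 recommended products were bought: 30%
```

Because the denominator is always k, a client who receives fewer than 10 recommendations cannot reach a hit rate of 100%, which is worth keeping in mind when reading the averages below.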

In [13]:
recommender = SmartBasketRecommenderCV(transactions, min_train=5, min_test=1, k=10)
hit_rates = recommender.monte_carlo_cv(iterations=5, random_seed=42)

Iteration 1: Hit Rate = 34.90%
Iteration 2: Hit Rate = 39.33%
Iteration 3: Hit Rate = 39.95%
Iteration 4: Hit Rate = 38.33%
Iteration 5: Hit Rate = 36.47%

Final Hit Rate: 37.80% ± 1.87%


# Final recommendations

In [14]:
# getting the actual recommendations for each client
recs = {}
interaction_matrix = recommender.create_interaction_matrix()  
for client_id in transactions['Client ID'].unique():
    recs[client_id] = recommender.smart_basket(client_id, transactions, interaction_matrix)



In [21]:
# map product IDs to their descriptions and build a long-format table of recommendations
prod_map = transactions[['ID Product', 'Product Description']].drop_duplicates().set_index('ID Product')['Product Description'].to_dict()
mapped_recs = {client: [(pid, prod_map.get(pid, "Unknown")) for pid in prod_list] for client, prod_list in recs.items()}

rows = []
for client, rec_list in mapped_recs.items():
    for rank, (pid, desc) in enumerate(rec_list, start=1):
        rows.append({
            "Client ID": client,
            "Rank": rank,
            "ID Product": pid,
            "Product Description": desc
        })

df_mapped_recs = pd.DataFrame(rows)
df_own_brand = transactions[['ID Product', 'Own Brand']].drop_duplicates()
df_mapped_recs = df_mapped_recs.merge(df_own_brand, on='ID Product', how='left')
df_mapped_recs['Own Brand'] = df_mapped_recs['Own Brand'].fillna(0).astype(int)

# share of recommended products that are own brand
percentage_own_brand = df_mapped_recs['Own Brand'].mean() * 100
print(f"Percentage of recommendations containing own brand: {percentage_own_brand:.2f}%")
# from this we can see that all the clients have at least 1 recommendation
print("Number of unique clients:", df_mapped_recs['Client ID'].nunique())
# average number of recommendations per client
avg_recs = df_mapped_recs.groupby("Client ID").size().mean()
print("Average number of recommendations per client:", avg_recs)

Percentage of recommendations containing own brand: 56.95%
Number of unique clients: 1529
Average number of recommendations per client: 9.457815565729234


In [None]:
df_mapped_recs    # recommendations for all clients with transaction history in the dataset

Unnamed: 0,Client ID,Rank,ID Product,Product Description,Own Brand
0,210100281,1,53429,ARROZ CAROLINO MASTERCHEF 5 KG,1
1,210100281,2,370149,CENOURA SC10KG (CAL25/40) RCH,1
2,210100281,3,278283,COUVE CORACAO DE BOI C/FOLHAS RCH,1
3,210100281,4,927088,POLPA TOMATE MCHEF 1LT,1
4,210100281,5,277674,COGUMELO BRANCO MÉDIO RCH,1
...,...,...,...,...,...
14456,210106443,6,53429,ARROZ CAROLINO MASTERCHEF 5 KG,1
14457,210106443,7,655701,IOG.AMANHECER TUTTI FRUTTI 125GR,1
14458,210106443,8,655700,IOG.AMANHECER BANANA 125GR,1
14459,210106443,9,915237,DET LOICA MCHEF 5LT,1


To access the recommendations for a specific client, we can use the `get_recommendation` function:

In [16]:
def get_recommendation(client_id_input):
    if client_id_input in df_mapped_recs['Client ID'].unique():
        return df_mapped_recs.loc[df_mapped_recs['Client ID'] == client_id_input]
    else:
        print("Client ID not found.")
        return pd.DataFrame()

In [28]:
# Example of usage:

client_id_input = 210106443
get_recommendation(client_id_input)

Unnamed: 0,Client ID,Rank,ID Product,Product Description,Own Brand
14451,210106443,1,926336,"V. TTO ABDEGAS 11,5% TETRA 1 LT",0
14452,210106443,2,659149,FARINHA AMANH S/FERMENTO 1KG,0
14453,210106443,3,645238,SACO PEBD MCHEF CRIS.40X60 10KG,1
14454,210106443,4,752025,SACO LIXO MCHEF AT 100LT 10 UN,1
14455,210106443,5,868736,PIMENTÃO DOCE CARMENCITA 490G PET,0
14456,210106443,6,53429,ARROZ CAROLINO MASTERCHEF 5 KG,1
14457,210106443,7,655701,IOG.AMANHECER TUTTI FRUTTI 125GR,1
14458,210106443,8,655700,IOG.AMANHECER BANANA 125GR,1
14459,210106443,9,915237,DET LOICA MCHEF 5LT,1
14460,210106443,10,53376,COGUMELOS MAST.CHEF LAMIN.780GR,1


# Export

In [27]:
df_mapped_recs.to_csv('../Data/recommendations_clients_with_transactions.csv', index=False)