# **BUSINESS CASE 3: Recheio Recommendation System**  


## 🎓 Master’s Program in Data Science & Advanced Analytics 
**Nova IMS** | March 2025   
**Course:** Business Cases with Data Science

## 👥 Team **Group A**  
- **Alice Viegas** | 20240572  
- **Bernardo Faria** | 20240579  
- **Dinis Pinto** | 20240612  
- **Daan van Holten** | 20240681
- **Philippe Dutranoit** | 20240518

## 📊 Project Overview  
This notebook utilizes the following datasets:  
- Case3_Recheio_2025 (1).xlsx <br>
- The goal of the project is to design a recomendation system so that the company can propose better products to existing costumers.

## 📊 Goal of the notebook

In this notebook we will build a smart basket for existing clients. <br>

In [1]:
import os
import pandas as pd
import numpy as np
from sklearn.metrics import pairwise_distances
import random

In [2]:
# Global definitions
baseFolder = os.getcwd()
exportsFolder = baseFolder + os.sep +'Exports' + os.sep

In [3]:
transactions = pd.read_csv('../Data/df.csv')

In [4]:
transactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 884099 entries, 0 to 884098
Data columns (total 8 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   Date                 884099 non-null  object
 1   Client ID            884099 non-null  int64 
 2   ZIP Code             884099 non-null  int64 
 3   ID Client Type       389817 non-null  object
 4   ID Product           884099 non-null  int64 
 5   Product Description  884099 non-null  object
 6   ID Product Category  884099 non-null  object
 7   Own Brand            884099 non-null  int64 
dtypes: int64(4), object(4)
memory usage: 54.0+ MB


We will build a smart basket based on the following 2 criteria:
- items previously purchased by the client
- items similar to the ones he has bought, based on cosine similarity

The smart basket will incorporate 5 products from each criteria, giving 10 recommended products. In the cases where the client has not bought 5 products, there will be less recommendations, as the first criteria cannot be met.

We will then do a Monte Carlo cross validation with 5 iterations. In each iteration the data will be split at a random point in time, where the train will be the data before and the test the data after that point. The client that don't meet a minimum criteria (1 transaction in the test set and 5 transactions in the train set) will be excluded from the accuracy analysis. The rest will be used to compute the mean hit rate, where the actual recommendations in the test set are compared with the predicted ones in the train set.

In [5]:
class SmartBasketRecommenderCV:
    def __init__(self, transactions, min_train=5, min_test=1, k=10):
        self.transactions = transactions.copy()
        self.k = k
        self.min_train = min_train
        self.min_test = min_test

        self.transactions['Date'] = pd.to_datetime(self.transactions['Date'])

    # this function will be used to get the top 5 products purchased by a client
    def top_purchase_history(self, client_id, df):
        client_data = df[df['Client ID'] == client_id]
        top_products = (
            client_data['ID Product']
            .value_counts()
            .head(5)
            .index
            .tolist()
        )
        return top_products

    # this function will be used to get the top 5 products purchased by similar clients (based on cosine similarity)
    def collaborative_recommendations(self, client_id, interaction_matrix, df, top_n_similar=5):
        if client_id not in interaction_matrix.index:
            return []

        client_idx = interaction_matrix.index.get_loc(client_id)
        distance_matrix = pairwise_distances(interaction_matrix, metric='cosine')
        distances = distance_matrix[client_idx]
        similar_indices = distances.argsort()[1:top_n_similar+1]
        similar_clients = interaction_matrix.index[similar_indices]

        similar_purchases = df[df['Client ID'].isin(similar_clients)]['ID Product']
        target_purchases = df[df['Client ID'] == client_id]['ID Product'].unique()
        recommendations = similar_purchases[~similar_purchases.isin(target_purchases)]
        return recommendations.value_counts().head(5).index.tolist()

    # this function gets the smart basket recommendations for a client combining the recommendations from the 2 previous functions
    def smart_basket(self, client_id, df, interaction_matrix):
        hist_recs = self.top_purchase_history(client_id, df)
        collab_recs = self.collaborative_recommendations(client_id, interaction_matrix, df)
        final_recs = hist_recs.copy()
        for item in collab_recs:
            if item not in final_recs:
                final_recs.append(item)
            if len(final_recs) == 10:
                break
        return final_recs

    # this function calculates the precision at k for the recommendations
    # it checks how many of the top k recommendations are in the test set
    def precision_at_k(self, train_recs, test_items):
        if not test_items:
            return 0.0
        hits = len(set(train_recs[:self.k]) & set(test_items))
        return hits / self.k

    # this function creates the interaction matrix for the transactions
    def create_interaction_matrix(self):
        matrix = pd.crosstab(self.transactions['Client ID'], self.transactions['ID Product'])
        return matrix.applymap(lambda x: 1 if x > 0 else 0)

    # this function performs a Monte Carlo cross-validation
    # it randomly splits the data into training and test sets multiple times
    # and calculates the hit rate for each iteration
    # the function takes the number of iterations, the number of weeks to train on, and a random seed for reproducibility
    # it returns a list of hit rates for each iteration
    # the function also prints the mean and standard deviation of the hit rates
    def monte_carlo_cv(self, iterations=5, weeks_train=45, random_seed=42):
        hit_rates_all = []

        for i in range(iterations):
            earliest = self.transactions['Date'].min()
            latest = self.transactions['Date'].max() - pd.to_timedelta(weeks_train, unit='w')
            random_start = earliest + (latest - earliest) * random.random()
            split_date = pd.to_datetime(random_start) + pd.to_timedelta(weeks_train, unit='w')

            self.transactions['train_split'] = (self.transactions['Date'] <= split_date).astype(int)
            train_set = self.transactions[self.transactions['train_split'] == 1]
            test_set = self.transactions[self.transactions['train_split'] == 0]

            interaction_matrix = pd.crosstab(self.transactions['Client ID'], self.transactions['ID Product'])
            interaction_matrix = interaction_matrix.applymap(lambda x: 1 if x > 0 else 0)

            valid_clients = []
            for client_id in self.transactions['Client ID'].unique():
                if (train_set[train_set['Client ID'] == client_id].shape[0] >= self.min_train and
                    test_set[test_set['Client ID'] == client_id].shape[0] >= self.min_test):
                    valid_clients.append(client_id)

            hit_rates = []
            for client_id in valid_clients:
                train_recs = self.smart_basket(client_id, train_set, interaction_matrix)
                test_items = test_set[test_set['Client ID'] == client_id]['ID Product'].unique().tolist()
                hit = self.precision_at_k(train_recs, test_items)
                hit_rates.append(hit)

            mean_hit = np.mean(hit_rates)
            hit_rates_all.append(mean_hit)
            print(f"Iteration {i+1}: Hit Rate = {mean_hit:.2%}")

        print(f"\nFinal Hit Rate: {np.mean(hit_rates_all):.2%} ± {np.std(hit_rates_all):.2%}")
        return hit_rates_all


In [6]:
recommender = SmartBasketRecommenderCV(transactions, min_train=5, min_test=1, k=10)
hit_rates = recommender.monte_carlo_cv(iterations=5, random_seed=42)

  interaction_matrix = interaction_matrix.applymap(lambda x: 1 if x > 0 else 0)


Iteration 1: Hit Rate = 34.44%


  interaction_matrix = interaction_matrix.applymap(lambda x: 1 if x > 0 else 0)


Iteration 2: Hit Rate = 35.56%


  interaction_matrix = interaction_matrix.applymap(lambda x: 1 if x > 0 else 0)


Iteration 3: Hit Rate = 40.08%


  interaction_matrix = interaction_matrix.applymap(lambda x: 1 if x > 0 else 0)


Iteration 4: Hit Rate = 37.62%


  interaction_matrix = interaction_matrix.applymap(lambda x: 1 if x > 0 else 0)


Iteration 5: Hit Rate = 40.86%

Final Hit Rate: 37.71% ± 2.48%


In [7]:
# getting the actual recommendations for each client
recs = {}
interaction_matrix = recommender.create_interaction_matrix()  
for client_id in transactions['Client ID'].unique():
    recs[client_id] = recommender.smart_basket(client_id, transactions, interaction_matrix)


  return matrix.applymap(lambda x: 1 if x > 0 else 0)


In [8]:
recs

{210100281: [53429,
  370149,
  278283,
  927088,
  277674,
  484903,
  731071,
  277728,
  533168,
  748027],
 210100701: [700041,
  53407,
  428129,
  865304,
  903222,
  646028,
  655704,
  655700,
  509797,
  655701],
 210100742: [278794,
  278727,
  912742,
  623900,
  658661,
  484903,
  589377,
  451586,
  53425,
  731071],
 210101007: [269462,
  940196,
  369845,
  896927,
  372332,
  910876,
  885753,
  848551,
  859288],
 210101290: [959682,
  692893,
  46551,
  752197,
  707764,
  891782,
  917036,
  381775,
  467788,
  748027],
 210101632: [589377,
  721171,
  461321,
  276806,
  828437,
  776452,
  659149,
  622341,
  731071,
  749441],
 210101746: [734459,
  704578,
  708763,
  954063,
  276809,
  890937,
  483897,
  915237,
  41045,
  922602],
 210101779: [655701,
  621958,
  655704,
  629616,
  629613,
  628320,
  765372,
  119036,
  578320,
  932724],
 210101784: [926338,
  119036,
  716133,
  939014,
  761134,
  806215,
  749441,
  757619,
  857504,
  53425],
 2101018