# Recommending products based on the user's shopping cart

### Objective

Increase users' Basket Size (the size of the shopping cart) by improving the shopping experience: recommend products that have a high probability of being bought together, based on association rules.

## Import libraries

In [1]:
import time
from datetime import datetime
import matplotlib.pyplot as plt  # needed by the plotting helper functions
import numpy as np
import pandas as pd

In [2]:
orders = pd.read_csv('C:/Users/josefina.lin/Documents/Master/10-Trabajo Final/datasets/model_input.csv')

In [3]:
orders.head()

Unnamed: 0,StockCode,Description,InvoiceNo,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,21730,GLASS STAR FROSTED T-LIGHT HOLDER,536365,6,2010-12-01 08:26:00,4.25,17850.0,United Kingdom
1,21730,GLASS STAR FROSTED T-LIGHT HOLDER,536373,6,2010-12-01 09:02:00,4.25,17850.0,United Kingdom
2,21730,GLASS STAR FROSTED T-LIGHT HOLDER,536375,6,2010-12-01 09:32:00,4.25,17850.0,United Kingdom
3,21730,GLASS STAR FROSTED T-LIGHT HOLDER,536396,6,2010-12-01 10:51:00,4.25,17850.0,United Kingdom
4,21730,GLASS STAR FROSTED T-LIGHT HOLDER,536406,6,2010-12-01 11:33:00,4.25,17850.0,United Kingdom


In [4]:
orders.drop(["CustomerID","Quantity","UnitPrice","InvoiceDate"],
            axis=1,
           inplace=True)

## Inspect the data

In [5]:
orders.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 379115 entries, 0 to 379114
Data columns (total 4 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   StockCode    379115 non-null  object
 1   Description  379115 non-null  object
 2   InvoiceNo    379115 non-null  object
 3   Country      379115 non-null  object
dtypes: object(4)
memory usage: 11.6+ MB


In [6]:
### Identify the categorical columns

CATEGORICAL_COLUMNS = ['InvoiceNo',
                       'StockCode',
                       'Description', 
                       'Country'
                      ]
orders[CATEGORICAL_COLUMNS] = orders[CATEGORICAL_COLUMNS].astype('object')

In [7]:
orders[CATEGORICAL_COLUMNS].describe()

Unnamed: 0,InvoiceNo,StockCode,Description,Country
count,379115,379115,379115,379115
unique,18476,3778,3673,3
top,576339,85123A,WHITE HANGING HEART T-LIGHT HOLDER,United Kingdom
freq,519,1964,1997,361763


## Implementing the association rules

In [8]:
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules, fpgrowth

In [9]:
import networkx as nx
from pyvis.network import Network

The __support__ of an itemset measures how popular it is, that is, how frequently it appears across purchases. We have to choose the threshold ourselves: for example, if we set it to 0.05 we keep only the itemsets that occur together in at least 5% of the transactions. The lower the threshold, the more item combinations pass the filter and therefore the more recommendations we can produce, which is our goal.
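As a rough illustration of how the support threshold acts as a filter, here is a minimal pure-Python sketch. The toy baskets and the helper names `support` and `passes` are assumptions made for this example only; they are not part of the notebook or of mlxtend.

```python
# Toy baskets (hypothetical data, not taken from the notebook's dataset).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def passes(itemset, transactions, min_support):
    """True when the itemset is frequent enough to be kept."""
    return support(itemset, transactions) >= min_support

pair = {"bread", "milk"}
print(support(pair, transactions))       # the pair appears in 2 of the 4 baskets
print(passes(pair, transactions, 0.05))  # kept with a 5% threshold
print(passes(pair, transactions, 0.60))  # dropped with a 60% threshold
```

Lowering `min_support` lets rarer combinations through, which is why the notebook uses a small value such as 0.005.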

### Helper Functions

In [10]:
def perform_rule_calculation(transact_items_matrix, rule_type="fpgrowth", min_support=0.005):
    """
    desc: computes the frequent itemsets from which the association rules are derived
    @params:
        - transact_items_matrix: the transactions x items boolean matrix
        - rule_type: "apriori" or "fpgrowth" (default="fpgrowth");
                     any value other than "fpgrowth" runs Apriori
        - min_support: minimum support threshold value (default = 0.005)
        
    @returns:
        - a dataframe containing 3 columns:
            - support: support values for each combination of items
            - itemsets: the combination of items
            - number_of_items: the number of items in each combination of items
            
        - the execution time (in seconds) for the corresponding algorithm
        
    """
    start_time = 0
    total_execution = 0
    
    if rule_type != "fpgrowth":
        start_time = time.time()
        rule_items = apriori(transact_items_matrix, 
                       min_support=min_support, 
                       use_colnames=True)
        total_execution = time.time() - start_time
        print("Computed Apriori!")
        
    else:
        start_time = time.time()
        rule_items = fpgrowth(transact_items_matrix, 
                       min_support=min_support, 
                       use_colnames=True)
        total_execution = time.time() - start_time
        print("Computed Fp Growth!")
    
    rule_items['number_of_items'] = rule_items['itemsets'].apply(len)
    
    return rule_items, total_execution

In [11]:
def compute_association_rule(rule_matrix, metric="lift", min_thresh=1):
    """
    @desc: Compute the final association rule
    @params:
        - rule_matrix: the corresponding algorithms matrix
        - metric: the metric to be used (default is lift)
        - min_thresh: the minimum threshold (default is 1)
        
    @returns:
        - rules: a dataframe with one row per rule satisfying the given metric & threshold
    """
    rules = association_rules(rule_matrix, 
                              metric=metric, 
                              min_threshold=min_thresh)
    
    return rules
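To make the `metric` and `min_thresh` arguments more concrete, the sketch below computes confidence and lift by hand on toy data. It is independent of mlxtend; the baskets and the helper names are assumptions introduced for illustration.

```python
# Toy baskets (hypothetical data, not taken from the notebook's dataset).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """confidence(antecedent -> consequent) / support(consequent)."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

# A lift above 1 means the items co-occur more often than if they were independent.
print(lift({"butter"}, {"bread"}, transactions))  # above 1: passes a lift threshold of 1
print(lift({"bread"}, {"milk"}, transactions))    # below 1: filtered out by that threshold
```

With `metric="lift"` and a threshold of 1, as in the defaults above, only the first of these two toy rules would be kept.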

In [12]:
# Plot one rule metric against another (e.g. lift vs. confidence)
def plot_metrics_relationship(rule_matrix, col1, col2):
    """
    desc: shows the relationship between the two input columns 
    @params:
        - rule_matrix: the matrix containing the result of a rule (apriori or Fp Growth)
        - col1: first column
        - col2: second column
    """
    # Fit a straight line (degree-1 polynomial) to show the trend
    fit = np.polyfit(rule_matrix[col1], rule_matrix[col2], 1)
    fit_fn = np.poly1d(fit)
    plt.plot(rule_matrix[col1], rule_matrix[col2], 'yo',
             rule_matrix[col1], fit_fn(rule_matrix[col1]))
    plt.xlabel(col1)
    plt.ylabel(col2)
    plt.title('{} vs {}'.format(col1, col2))

In [13]:
def compare_time_exec(algo1, algo2):
    """
    @desc: shows the execution time between two algorithms
    @params:
        - algo1: list describing the first algorithm, where
                 algo1[0] is its name and algo1[1] its execution time
        - algo2: list describing the second algorithm, in the same format
    """
    
    execution_times = [algo1[1], algo2[1]]
    algo_names = (algo1[0], algo2[0])
    y=np.arange(len(algo_names))
    
    plt.bar(y,execution_times,color=['orange', 'blue'])
    plt.xticks(y,algo_names)
    plt.xlabel('Algorithms')
    plt.ylabel('Time')
    plt.title("Execution Time (seconds) Comparison")
    plt.show()

## Fp Growth Algorithm

The support of every itemset is computed from the __trans_encoder_matrix__ matrix with a minimum support of 0.005. 
The rule type applied is "fpgrowth", which is the default.

In [14]:
min_thresh = 1
f_metric='lift'
t = datetime.now().strftime("%Y%m%d")

In [15]:
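for country in orders.Country.unique():
    
    print(country)
    ## build the list of transactions (one list of StockCodes per invoice) for this country
    ## note: all_transactions is overwritten on every iteration, so after the loop
    ## it only holds the transactions of the last country printed
    all_transactions = [group['StockCode'].tolist()
                        for _, group
                        in orders.loc[orders.Country == country].groupby('InvoiceNo')]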

United Kingdom
France
Germany
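The loop above rebinds `all_transactions` on each pass, so only the last country's baskets remain afterwards. One possible way to keep the baskets of every country is a dictionary keyed by country. The sketch below shows the idea in plain Python; the rows and the function name `transactions_by_country` are hypothetical and only stand in for the `orders` dataframe.

```python
from collections import defaultdict

# Hypothetical (Country, InvoiceNo, StockCode) rows standing in for `orders`.
rows = [
    ("Germany", "inv-1", "X"),
    ("Germany", "inv-1", "Y"),
    ("France", "inv-2", "X"),
    ("Germany", "inv-3", "X"),
    ("France", "inv-2", "Z"),
]

def transactions_by_country(rows):
    """Group stock codes by invoice, and invoices by country."""
    baskets = defaultdict(lambda: defaultdict(list))
    for country, invoice, stock_code in rows:
        baskets[country][invoice].append(stock_code)
    # One list of baskets per country, each basket being a list of stock codes.
    return {country: list(invoices.values()) for country, invoices in baskets.items()}

per_country = transactions_by_country(rows)
for country, baskets in per_country.items():
    print(country, baskets)
```

Each value of `per_country` has the same shape as `all_transactions`, so the later steps could in principle be run once per country.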


In [16]:
## Build the transaction-encoder matrix
trans_encoder = TransactionEncoder() # Instantiate the encoder
trans_encoder_matrix = trans_encoder.fit(all_transactions).transform(all_transactions)
trans_encoder_matrix = pd.DataFrame(trans_encoder_matrix, columns=trans_encoder.columns_)

In [17]:
trans_encoder_matrix.head()

Unnamed: 0,10002,10125,10135,11001,15034,15036,15039,15044A,15044B,15044D,...,90170,90173,90201A,90201B,90201C,90201D,90202D,90204,M,POST
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
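Conceptually, the encoded matrix has one row per transaction and one boolean column per distinct item. The following pure-Python sketch reproduces that layout on toy data. It is an illustration of the idea, not mlxtend's implementation; the function name `one_hot_encode` and the baskets are assumptions.

```python
def one_hot_encode(transactions):
    """Return (columns, matrix): one boolean row per transaction, one column per item."""
    # Sorted so the column order is stable from run to run.
    columns = sorted({item for basket in transactions for item in basket})
    matrix = [[item in basket for item in columns] for basket in transactions]
    return columns, matrix

# Hypothetical baskets of stock codes.
baskets = [["A", "POST"], ["B"], ["A", "B", "POST"]]
columns, matrix = one_hot_encode(baskets)

print(columns)
for row in matrix:
    print(row)
```

The frequent-itemset algorithms then only need to count, for each candidate combination of columns, how many rows are `True` in all of them.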


In [19]:
## Compute the frequent itemsets
    
fpgrowth_matrix, fp_growth_exec_time = perform_rule_calculation(trans_encoder_matrix)

Computed Fp Growth!


In [20]:
## Derive the rules, filtered by lift
fp_growth_rule_lift = compute_association_rule(fpgrowth_matrix, metric=f_metric, min_thresh=min_thresh)

result_fpg_rule_lift = fp_growth_rule_lift.explode('antecedents', 
                                                   ignore_index=True).explode('consequents').copy()

MemoryError: Unable to allocate 1.86 GiB for an array with shape (7, 35720178) and data type float64
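The array shape in the error points to roughly 35.7 million rows. One factor that can contribute to sizes like this is combinatorial: a frequent itemset with k items can be split into 2^k − 2 antecedent/consequent pairs, and exploding both sides to single items multiplies the row count again. The sketch below counts both on a toy itemset. The function names are hypothetical, and this is an illustration of the growth, not a diagnosis of this particular run.

```python
from itertools import combinations

def candidate_rules(itemset):
    """All splits of `itemset` into a non-empty antecedent and a non-empty consequent."""
    items = sorted(itemset)
    rules = []
    for size in range(1, len(items)):
        for antecedent in combinations(items, size):
            consequent = tuple(i for i in items if i not in antecedent)
            rules.append((antecedent, consequent))
    return rules

def exploded_rows(rules):
    """Rows left after exploding antecedents and then consequents to single items."""
    return sum(len(antecedent) * len(consequent) for antecedent, consequent in rules)

for k in (2, 3, 4, 5):
    rules = candidate_rules(range(k))
    print(k, len(rules), exploded_rows(rules))
```

If this is what is happening, possible mitigations include raising `min_support`, raising the metric threshold, or capping the itemset size (mlxtend's `fpgrowth` and `apriori` accept a `max_len` argument for this, to the best of our knowledge).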

In [None]:
## Keep the columns needed to recommend products, tagged with the country
recommendation_by_product = result_fpg_rule_lift[['antecedents',
                                                  'consequents', 
                                                  'support',
                                                  'confidence',
                                                  'lift'
                                                 ]].copy()  # explicit copy avoids SettingWithCopyWarning
recommendation_by_product['Country'] = country

recommendation_by_product = recommendation_by_product[['Country',
                                                       'antecedents',
                                                       'consequents',
                                                       'support',
                                                       'confidence',
                                                       'lift'
                                                      ]].groupby(['Country',
                                                                  'antecedents',
                                                                  'consequents'
                                                                 ]).max().reset_index()

In [None]:
recommendation_by_product

In [None]:
fpgrowth_matrix, fp_growth_exec_time = perform_rule_calculation(trans_encoder_matrix) # Run the algorithm
print("Fp Growth execution took: {} seconds".format(fp_growth_exec_time))

In [None]:
recommendation_by_product.info()

#### Add the product description for antecedents and consequents

##### Antecedents

In [None]:
# 1. Merge with the orders dataframe to bring in the product description
fpg_recommendation_by_product = pd.merge(recommendation_by_product,
                                         orders[['Description','StockCode']],
                                         how='inner',
                                         left_on = ['antecedents'],
                                         right_on = ['StockCode']
                                         )
fpg_recommendation_by_product.drop_duplicates(inplace=True)
fpg_recommendation_by_product.info()

In [None]:
# 2. Rename the field to "antecedents_description" and drop the columns that came from the orders dataframe
fpg_recommendation_by_product['antecedents_description'] = fpg_recommendation_by_product['Description']
fpg_recommendation_by_product.drop(['Description','StockCode'], axis=1, inplace=True)
fpg_recommendation_by_product.info()

##### Consequents

In [None]:
# 1. Merge with the orders dataframe to bring in the product description
fpg_recommendation_by_product = pd.merge(fpg_recommendation_by_product,
                                         orders[['Description','StockCode']],
                                         how='inner',
                                         left_on = ['consequents'],
                                         right_on = ['StockCode']
                                         )
fpg_recommendation_by_product.drop_duplicates(inplace=True)
fpg_recommendation_by_product.info()

In [None]:
# 2. Rename the field to "consequents_description" and drop the columns that came from the orders dataframe
fpg_recommendation_by_product['consequents_description'] = fpg_recommendation_by_product['Description']
fpg_recommendation_by_product.drop(['Description','StockCode'], axis=1, inplace=True)
fpg_recommendation_by_product.info()

In [27]:
# 3. Reorder the dataframe columns for readability
fpg_recommendation_by_product = fpg_recommendation_by_product[['Country',
                                                               'antecedents',
                                                               'antecedents_description',
                                                               'consequents',
                                                               'consequents_description',
                                                               'support',
                                                               'confidence',
                                                               'lift']]
fpg_recommendation_by_product.sort_values(by='antecedents_description', inplace=True)
fpg_recommendation_by_product.head(5)

Unnamed: 0,Country,antecedents,antecedents_description,consequents,consequents_description,support,confidence,lift
1123730,Germany,23437,50'S CHRISTMAS GIFT BAG LARGE,20719,WOODLAND CHARLOTTE BAG,0.005725,0.6,5.932075
5378433,Germany,23437,50'S CHRISTMAS GIFT BAG LARGE,85049G,CHOCOLATE BOX RIBBONS,0.005725,1.0,37.428571
958524,Germany,23437,50'S CHRISTMAS GIFT BAG LARGE,POST,POSTAGE,0.009542,1.0,37.428571
2215559,Germany,23437,50'S CHRISTMAS GIFT BAG LARGE,22423,REGENCY CAKESTAND 3 TIER,0.005725,0.6,5.716364
5363551,Germany,23437,50'S CHRISTMAS GIFT BAG LARGE,85049E,SCANDINAVIAN REDS RIBBONS,0.005725,1.0,34.933333


#### Save the FP-Growth results

In [28]:
fpg_recommendation_by_product.to_csv('C:/Users/josefina.lin/Documents/Master/10-Trabajo Final/datasets/fp_growth_results.csv', 
                                     index = False)

## Apriori Algorithm

In [29]:
apriori_matrix, apriori_exec_time = perform_rule_calculation(trans_encoder_matrix, rule_type="apriori")
print("Apriori Execution took: {} seconds".format(apriori_exec_time))

Computed Apriori!
Apriori Execution took: 42.11325025558472 seconds


In [30]:
apriori_rule_lift = compute_association_rule(apriori_matrix)


In [31]:
apriori_rule_lift.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(20725),(10125),0.03626,0.01145,0.005725,0.157895,13.789474,0.00531,1.173903
1,(10125),(20725),0.01145,0.03626,0.005725,0.5,13.789474,0.00531,1.927481
2,(21880),(10125),0.030534,0.01145,0.005725,0.1875,16.375,0.005376,1.216676
3,(10125),(21880),0.01145,0.030534,0.005725,0.5,16.375,0.005376,1.938931
4,(21883),(10125),0.024809,0.01145,0.005725,0.230769,20.153846,0.005441,1.285115


In [32]:
result_apriori_rule = apriori_rule_lift.explode('antecedents',
                                                ignore_index=True).explode('consequents').copy()

In [33]:
## Keep the columns needed to recommend products, tagged with the country
apr_recommendation_by_product = result_apriori_rule[['antecedents',
                                                     'consequents', 
                                                     'support',
                                                     'confidence',
                                                     'lift'
                                                    ]].copy()  # explicit copy avoids SettingWithCopyWarning
apr_recommendation_by_product['Country'] = country

apr_recommendation_by_product = apr_recommendation_by_product[['Country',
                                                               'antecedents',
                                                               'consequents',
                                                               'support',
                                                               'confidence',
                                                               'lift'
                                                             ]].groupby(['Country',
                                                                         'antecedents',
                                                                         'consequents'
                                                                        ]).max().reset_index()


In [34]:
apr_recommendation_by_product.head()

Unnamed: 0,Country,antecedents,consequents,support,confidence,lift
0,Germany,10125,20725,0.005725,1.0,174.666667
1,Germany,10125,21880,0.005725,1.0,174.666667
2,Germany,10125,21883,0.005725,1.0,174.666667
3,Germany,10125,22326,0.005725,0.75,4.596491
4,Germany,10125,22328,0.007634,1.0,7.594203


#### Add the product description for antecedents and consequents

##### Antecedents

In [35]:
# 1. Merge with the orders dataframe to bring in the product description
apr_recommendation_by_product = pd.merge(apr_recommendation_by_product,
                                         orders[['Description','StockCode']],
                                         how='inner',
                                         left_on = ['antecedents'],
                                         right_on = ['StockCode']
                                         )
apr_recommendation_by_product.drop_duplicates(inplace=True)
apr_recommendation_by_product.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15918 entries, 0 to 6816390
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Country      15918 non-null  object 
 1   antecedents  15918 non-null  object 
 2   consequents  15918 non-null  object 
 3   support      15918 non-null  float64
 4   confidence   15918 non-null  float64
 5   lift         15918 non-null  float64
 6   Description  15918 non-null  object 
 7   StockCode    15918 non-null  object 
dtypes: float64(3), object(5)
memory usage: 1.1+ MB


In [36]:
# 2. Rename the field to "antecedents_description" and drop the columns that came from the orders dataframe
apr_recommendation_by_product['antecedents_description'] = apr_recommendation_by_product['Description']
apr_recommendation_by_product.drop(['Description','StockCode'], axis=1, inplace=True)
apr_recommendation_by_product.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15918 entries, 0 to 6816390
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Country                  15918 non-null  object 
 1   antecedents              15918 non-null  object 
 2   consequents              15918 non-null  object 
 3   support                  15918 non-null  float64
 4   confidence               15918 non-null  float64
 5   lift                     15918 non-null  float64
 6   antecedents_description  15918 non-null  object 
dtypes: float64(3), object(4)
memory usage: 994.9+ KB


##### Consequents

In [37]:
# 1. Merge with the orders dataframe to bring in the product description
apr_recommendation_by_product = pd.merge(apr_recommendation_by_product,
                                         orders[['Description','StockCode']],
                                         how='inner',
                                         left_on = ['consequents'],
                                         right_on = ['StockCode']
                                         )
apr_recommendation_by_product.drop_duplicates(inplace=True)
apr_recommendation_by_product.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15918 entries, 0 to 6817037
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Country                  15918 non-null  object 
 1   antecedents              15918 non-null  object 
 2   consequents              15918 non-null  object 
 3   support                  15918 non-null  float64
 4   confidence               15918 non-null  float64
 5   lift                     15918 non-null  float64
 6   antecedents_description  15918 non-null  object 
 7   Description              15918 non-null  object 
 8   StockCode                15918 non-null  object 
dtypes: float64(3), object(6)
memory usage: 1.2+ MB


In [38]:
# 2. Rename the field to "consequents_description" and drop the columns that came from the orders dataframe
apr_recommendation_by_product['consequents_description'] = apr_recommendation_by_product['Description']
apr_recommendation_by_product.drop(['Description','StockCode'], axis=1, inplace=True)
apr_recommendation_by_product.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15918 entries, 0 to 6817037
Data columns (total 8 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Country                  15918 non-null  object 
 1   antecedents              15918 non-null  object 
 2   consequents              15918 non-null  object 
 3   support                  15918 non-null  float64
 4   confidence               15918 non-null  float64
 5   lift                     15918 non-null  float64
 6   antecedents_description  15918 non-null  object 
 7   consequents_description  15918 non-null  object 
dtypes: float64(3), object(5)
memory usage: 1.1+ MB


In [39]:
# 3. Reorder the dataframe columns for readability
apr_recommendation_by_product = apr_recommendation_by_product[['Country',
                                                               'antecedents',
                                                               'antecedents_description',
                                                               'consequents',
                                                               'consequents_description',
                                                               'support',
                                                               'confidence',
                                                               'lift']]
apr_recommendation_by_product.sort_values(by='antecedents_description', inplace=True)
apr_recommendation_by_product.head(5)

Unnamed: 0,Country,antecedents,antecedents_description,consequents,consequents_description,support,confidence,lift
1123730,Germany,23437,50'S CHRISTMAS GIFT BAG LARGE,20719,WOODLAND CHARLOTTE BAG,0.005725,0.6,5.932075
5378433,Germany,23437,50'S CHRISTMAS GIFT BAG LARGE,85049G,CHOCOLATE BOX RIBBONS,0.005725,1.0,37.428571
958524,Germany,23437,50'S CHRISTMAS GIFT BAG LARGE,POST,POSTAGE,0.009542,1.0,37.428571
2215559,Germany,23437,50'S CHRISTMAS GIFT BAG LARGE,22423,REGENCY CAKESTAND 3 TIER,0.005725,0.6,5.716364
5363551,Germany,23437,50'S CHRISTMAS GIFT BAG LARGE,85049E,SCANDINAVIAN REDS RIBBONS,0.005725,1.0,34.933333


In [40]:
apr_recommendation_by_product.tail(5)

Unnamed: 0,Country,antecedents,antecedents_description,consequents,consequents_description,support,confidence,lift
1891235,Germany,16161U,WRAP SUKI AND FRIENDS,22423,REGENCY CAKESTAND 3 TIER,0.009542,1.0,131.0
906765,Germany,23232,WRAP VINTAGE PETALS DESIGN,POST,POSTAGE,0.007634,1.0,1.420054
796686,Germany,22709,WRAP WEDDING DAY,POST,POSTAGE,0.005725,1.0,32.75
2484622,Germany,22709,WRAP WEDDING DAY,22704,WRAP RED APPLES,0.005725,1.0,32.75
1000806,Germany,84832,ZINC WILLIE WINKIE CANDLE STICK,POST,POSTAGE,0.007634,1.0,1.420054


In [41]:
fpg_recommendation_by_product.tail(5)

Unnamed: 0,Country,antecedents,antecedents_description,consequents,consequents_description,support,confidence,lift
1891235,Germany,16161U,WRAP SUKI AND FRIENDS,22423,REGENCY CAKESTAND 3 TIER,0.009542,1.0,131.0
906765,Germany,23232,WRAP VINTAGE PETALS DESIGN,POST,POSTAGE,0.007634,1.0,1.420054
796686,Germany,22709,WRAP WEDDING DAY,POST,POSTAGE,0.005725,1.0,32.75
2484622,Germany,22709,WRAP WEDDING DAY,22704,WRAP RED APPLES,0.005725,1.0,32.75
1000806,Germany,84832,ZINC WILLIE WINKIE CANDLE STICK,POST,POSTAGE,0.007634,1.0,1.420054


#### Save the Apriori results