# Recommending products based on the user's shopping cart

## Objective

Increase users' basket size (the size of the shopping cart) by improving the shopping experience: recommend products that are highly likely to be bought together, based on association rules generated with the FP-Growth algorithm.
FP-Growth was chosen because the previous notebook showed that it is more efficient in practice than Apriori and that both arrive at the same results.

## Scope

The input is `model_input.csv`, produced by the first notebook (1_eda). The exploratory analysis led to the following decisions:
1. Keep only orders from the United Kingdom.
2. Drop orders that contain a single product.
3. Drop orders that may belong to more than one customer: those with no CustomerID and more than 100 products per order.
4. Drop duplicate descriptions associated with a single StockCode.

## Import libraries

In [1]:
import time
from datetime import datetime
import matplotlib.pyplot as plt  # used by the plotting helpers below
import numpy as np
import pandas as pd
import pymongo

In [2]:
orders = pd.read_csv('C:/Users/josefina.lin/Documents/Master/10-Trabajo Final/datasets/model_input.csv')

In [3]:
orders.head()

Unnamed: 0,StockCode,Description,InvoiceNo,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,21730,GLASS STAR FROSTED T-LIGHT HOLDER,536365,6,2010-12-01 08:26:00,4.25,17850.0,United Kingdom
1,21730,GLASS STAR FROSTED T-LIGHT HOLDER,536373,6,2010-12-01 09:02:00,4.25,17850.0,United Kingdom
2,21730,GLASS STAR FROSTED T-LIGHT HOLDER,536375,6,2010-12-01 09:32:00,4.25,17850.0,United Kingdom
3,21730,GLASS STAR FROSTED T-LIGHT HOLDER,536396,6,2010-12-01 10:51:00,4.25,17850.0,United Kingdom
4,21730,GLASS STAR FROSTED T-LIGHT HOLDER,536406,6,2010-12-01 11:33:00,4.25,17850.0,United Kingdom


In [4]:
orders.drop(["CustomerID","Quantity","UnitPrice","InvoiceDate"],
            axis=1,
           inplace=True)

## Inspect the data

In [5]:
orders.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 361763 entries, 0 to 361762
Data columns (total 4 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   StockCode    361763 non-null  object
 1   Description  361763 non-null  object
 2   InvoiceNo    361763 non-null  object
 3   Country      361763 non-null  object
dtypes: object(4)
memory usage: 11.0+ MB


In [6]:
# Identify the categorical columns

CATEGORICAL_COLUMNS = ['InvoiceNo',
                       'StockCode',
                       'Description', 
                       'Country'
                      ]
orders[CATEGORICAL_COLUMNS] = orders[CATEGORICAL_COLUMNS].astype('object')

In [7]:
orders[CATEGORICAL_COLUMNS].describe()

Unnamed: 0,InvoiceNo,StockCode,Description,Country
count,361763,361763,361763,361763
unique,17549,3774,3669,1
top,576339,85123A,WHITE HANGING HEART T-LIGHT HOLDER,United Kingdom
freq,519,1960,1993,361763


## Implementing the association rules

In [8]:
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules, fpgrowth

In [9]:
import networkx as nx
from pyvis.network import Network

__Support__ indicates how popular an itemset is, that is, how often it appears across purchases. The threshold is ours to choose: with a value of 0.05, for example, we keep only the combinations of items that occur together in at least 5% of the orders. The lower the threshold, the more item combinations survive and therefore the more recommendations we can produce, which is our goal.

### Helper Functions

In [10]:
def perform_rule_calculation(transact_items_matrix, rule_type="fpgrowth", min_support=0.005):
    """
    desc: computes the frequent itemsets used to build the association rules
    @params:
        - transact_items_matrix: the one-hot transactions x items matrix
        - rule_type: "fpgrowth" (default) runs FP-Growth; any other value runs Apriori
        - min_support: minimum support threshold (default = 0.005)

    @returns:
        - a dataframe with 3 columns:
            - support: support value of each combination of items
            - itemsets: the combination of items
            - number_of_items: the number of items in each combination
        - the execution time of the chosen algorithm, in seconds
    """
    start_time = 0
    total_execution = 0
    
    if rule_type != "fpgrowth":
        start_time = time.time()
        rule_items = apriori(transact_items_matrix, 
                       min_support=min_support, 
                       use_colnames=True)
        total_execution = time.time() - start_time
        print("Computed Apriori!")
        
    else:
        start_time = time.time()
        rule_items = fpgrowth(transact_items_matrix, 
                       min_support=min_support, 
                       use_colnames=True)
        total_execution = time.time() - start_time
        print("Computed Fp Growth!")
    
    rule_items['number_of_items'] = rule_items['itemsets'].apply(len)
    
    return rule_items, total_execution

In [11]:
def compute_association_rule(rule_matrix, metric="lift", min_thresh=1):
    """
    @desc: Compute the final association rule
    @params:
        - rule_matrix: the corresponding algorithms matrix
        - metric: the metric to be used (default is lift)
        - min_thresh: the minimum threshold (default is 1)
        
    @returns:
        - rules: one row per association rule that satisfies the given metric and threshold
    """
    rules = association_rules(rule_matrix, 
                              metric=metric, 
                              min_threshold=min_thresh)
    
    return rules

In [12]:
# Plot the relationship between two rule metrics (e.g. lift vs. confidence)
def plot_metrics_relationship(rule_matrix, col1, col2):
    """
    desc: shows the relationship between the two input columns 
    @params:
        - rule_matrix: the matrix containing the result of a rule (apriori or Fp Growth)
        - col1: first column
        - col2: second column
    """
    fit = np.polyfit(rule_matrix[col1], rule_matrix[col2], 1)  # least-squares line
    fit_fn = np.poly1d(fit)
    plt.plot(rule_matrix[col1], rule_matrix[col2], 'yo',
             rule_matrix[col1], fit_fn(rule_matrix[col1]))
    plt.xlabel(col1)
    plt.ylabel(col2)
    plt.title('{} vs {}'.format(col1, col2))

In [13]:
def compare_time_exec(algo1, algo2):
    """
    @desc: plots the execution time of two algorithms side by side
    @params:
        - algo1: list describing the first algorithm, as [name, execution time in seconds]
        - algo2: list describing the second algorithm, in the same format
    """
    
    execution_times = [algo1[1], algo2[1]]
    algo_names = (algo1[0], algo2[0])
    y=np.arange(len(algo_names))
    
    plt.bar(y,execution_times,color=['orange', 'blue'])
    plt.xticks(y,algo_names)
    plt.xlabel('Algorithms')
    plt.ylabel('Time')
    plt.title("Execution Time (seconds) Comparison")
    plt.show()

## FP-Growth algorithm

The support of every itemset is computed from the __trans_encoder_matrix__ built below, with a minimum support of 0.005.
The rule type is "fpgrowth", which is the default.

The orders are split into two sets: training and test.
The training set is used to compute the association rules; the test set is used later to evaluate the results.

In [14]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(orders, test_size=0.1)
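Note that `train_test_split(orders, ...)` samples individual rows, so the lines of a single invoice can end up on both sides of the split. If whole baskets should stay together, one option is to sample invoice numbers instead. The sketch below is an alternative, not what this notebook ran; the function name `split_by_invoice`, the seed and the demo data are assumptions.

```python
import numpy as np
import pandas as pd

def split_by_invoice(df, test_size=0.1, seed=42):
    """Split rows into train/test so that each InvoiceNo lands entirely on one side."""
    invoices = df["InvoiceNo"].drop_duplicates().to_numpy()
    rng = np.random.default_rng(seed)
    n_test = max(1, int(round(len(invoices) * test_size)))
    test_invoices = rng.choice(invoices, size=n_test, replace=False)
    is_test = df["InvoiceNo"].isin(test_invoices)
    return df[~is_test], df[is_test]

# Tiny made-up frame to exercise the function
demo = pd.DataFrame({
    "InvoiceNo": ["A", "A", "B", "B", "C", "C", "D", "D", "E", "E"],
    "StockCode": ["1", "2", "1", "3", "2", "3", "1", "4", "2", "4"],
})
demo_train, demo_test = split_by_invoice(demo, test_size=0.2)
```

Here `test_size` is a fraction of invoices rather than of rows, so the row counts of the two sets only approximate the requested proportion.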

In [15]:
min_thresh = 1
f_metric='lift'
t = datetime.now().strftime("%Y%m%d")

In [16]:
# One list of StockCodes per invoice
all_transactions = [group['StockCode'].tolist()
                    for _, group
                    in train.groupby('InvoiceNo')]


In [17]:
## Create the transaction encoder matrix
trans_encoder = TransactionEncoder()  # Instantiate the encoder
trans_encoder_matrix = trans_encoder.fit(all_transactions).transform(all_transactions)
trans_encoder_matrix = pd.DataFrame(trans_encoder_matrix, columns=trans_encoder.columns_)
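The encoder turns the list of baskets into a boolean matrix with one row per invoice and one column per product. The following sketch reproduces that idea with plain pandas on made-up baskets; it is only meant to show the shape of `trans_encoder_matrix`, not to replace `TransactionEncoder`.

```python
import pandas as pd

# Hypothetical baskets; the codes are only examples
baskets = [["21730", "85123A"], ["85123A"], ["21730", "22411", "85123A"]]

# Columns are the sorted set of all items seen in any basket
items = sorted({item for basket in baskets for item in basket})

# One row per basket, True where the basket contains the item
one_hot = pd.DataFrame(
    [[item in basket for item in items] for basket in baskets],
    columns=items,
)

# Column means of a boolean matrix are the single-item supports
item_support = one_hot.mean()
```

Reading the matrix this way also explains why the mean of a column is the support of that single product.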

In [18]:
trans_encoder_matrix.head()

Unnamed: 0,10002,10080,10120,10123C,10124A,10124G,10125,10133,10135,11001,...,90214Y,90214Z,C2,DOT,M,PADS,POST,gift_0001_20,gift_0001_30,m
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [19]:
## Compute the frequent itemsets
    
fpgrowth_matrix, fp_growth_exec_time = perform_rule_calculation(trans_encoder_matrix)

Computed Fp Growth!


In [20]:
## Generate the association rules, filtered by lift
fp_growth_rule_lift = compute_association_rule(fpgrowth_matrix, metric=f_metric, min_thresh=min_thresh)

result_fpg_rule_lift = fp_growth_rule_lift.explode('antecedents', 
                                                   ignore_index=True).explode('consequents').copy()
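`association_rules` returns antecedents and consequents as frozensets, so a rule such as {A, B} -> {C} occupies a single row. Exploding both columns breaks it into item-to-item pairs that each inherit the metrics of the full rule. A small sketch with a made-up rules table (item codes and lift values are invented):

```python
import pandas as pd

toy_rules = pd.DataFrame({
    "antecedents": [frozenset({"A", "B"}), frozenset({"C"})],
    "consequents": [frozenset({"C"}), frozenset({"A", "B"})],
    "lift": [2.0, 2.0],
})

# Sort each set into a list first so the exploded order is deterministic
for col in ["antecedents", "consequents"]:
    toy_rules[col] = toy_rules[col].apply(sorted)

pairs = (
    toy_rules.explode("antecedents", ignore_index=True)
             .explode("consequents", ignore_index=True)
)
# pairs now holds A->C, B->C, C->A, C->B, each with lift 2.0
```

Because the same pair can come out of several rules, the next cell keeps the maximum of each metric per pair.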

In [21]:
## Keep one row per (antecedent, consequent) pair with its best metrics
recommendation_by_product = (
    result_fpg_rule_lift[['antecedents',
                          'consequents',
                          'support',
                          'confidence',
                          'lift']]
    .groupby(['antecedents', 'consequents'])
    .max()
    .reset_index()
)

In [22]:
recommendation_by_product

Unnamed: 0,antecedents,consequents,support,confidence,lift
0,15056BL,15056N,0.005996,0.509709,30.464230
1,15056N,15056BL,0.005996,0.358362,30.464230
2,15056N,20679,0.005539,0.331058,26.472548
3,20675,20676,0.005482,0.436364,29.618605
4,20675,20677,0.007595,0.604545,49.703286
...,...,...,...,...,...
2213,DOT,22379,0.005767,0.372694,14.900942
2214,DOT,22386,0.008223,0.671233,25.742897
2215,DOT,22411,0.008965,0.745098,36.422140
2216,DOT,22502,0.005539,0.357934,20.351081


In [23]:
recommendation_by_product.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2218 entries, 0 to 2217
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   antecedents  2218 non-null   object 
 1   consequents  2218 non-null   object 
 2   support      2218 non-null   float64
 3   confidence   2218 non-null   float64
 4   lift         2218 non-null   float64
dtypes: float64(3), object(2)
memory usage: 86.8+ KB


#### Add the product description for antecedents and consequents

##### Antecedents

In [24]:
# 1. Merge with the orders dataframe to bring in the product description
fpg_recommendation_by_product = pd.merge(recommendation_by_product,
                                         orders[['Description','StockCode']],
                                         how='inner',
                                         left_on = ['antecedents'],
                                         right_on = ['StockCode']
                                         )
fpg_recommendation_by_product.drop_duplicates(inplace=True)
fpg_recommendation_by_product.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2218 entries, 0 to 1638934
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   antecedents  2218 non-null   object 
 1   consequents  2218 non-null   object 
 2   support      2218 non-null   float64
 3   confidence   2218 non-null   float64
 4   lift         2218 non-null   float64
 5   Description  2218 non-null   object 
 6   StockCode    2218 non-null   object 
dtypes: float64(3), object(4)
memory usage: 138.6+ KB
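Merging against the full `orders` table repeats every rule once per matching order line before `drop_duplicates` collapses them again, which is why the index above runs up to 1638934. A lighter variant, sketched here on made-up data, deduplicates the code-to-description lookup first. It assumes each `StockCode` maps to a single `Description`, which the scope section lists as a cleaning step; the frames `toy_orders` and `toy_rules` are hypothetical.

```python
import pandas as pd

toy_orders = pd.DataFrame({
    "StockCode": ["A", "A", "B", "B", "B"],
    "Description": ["ALPHA", "ALPHA", "BETA", "BETA", "BETA"],
})
toy_rules = pd.DataFrame({"antecedents": ["A", "B"], "consequents": ["B", "A"]})

# One row per product instead of one row per order line
lookup = toy_orders[["StockCode", "Description"]].drop_duplicates("StockCode")

with_names = (
    toy_rules
    .merge(lookup.rename(columns={"StockCode": "antecedents",
                                  "Description": "antecedents_description"}),
           on="antecedents", how="inner")
    .merge(lookup.rename(columns={"StockCode": "consequents",
                                  "Description": "consequents_description"}),
           on="consequents", how="inner")
)
```

Renaming the lookup columns before each merge also removes the need to drop `Description` and `StockCode` afterwards.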


In [25]:
# 2. Copy the description into "antecedents_description" and drop the columns that came from orders
fpg_recommendation_by_product['antecedents_description'] = fpg_recommendation_by_product['Description']
fpg_recommendation_by_product.drop(['Description','StockCode'], axis=1, inplace=True)
fpg_recommendation_by_product.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2218 entries, 0 to 1638934
Data columns (total 6 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   antecedents              2218 non-null   object 
 1   consequents              2218 non-null   object 
 2   support                  2218 non-null   float64
 3   confidence               2218 non-null   float64
 4   lift                     2218 non-null   float64
 5   antecedents_description  2218 non-null   object 
dtypes: float64(3), object(3)
memory usage: 121.3+ KB


##### Consequents

In [26]:
# 1. Merge with the orders dataframe to bring in the product description
fpg_recommendation_by_product = pd.merge(fpg_recommendation_by_product,
                                         orders[['Description','StockCode']],
                                         how='inner',
                                         left_on = ['consequents'],
                                         right_on = ['StockCode']
                                         )
fpg_recommendation_by_product.drop_duplicates(inplace=True)
fpg_recommendation_by_product.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2218 entries, 0 to 1638797
Data columns (total 8 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   antecedents              2218 non-null   object 
 1   consequents              2218 non-null   object 
 2   support                  2218 non-null   float64
 3   confidence               2218 non-null   float64
 4   lift                     2218 non-null   float64
 5   antecedents_description  2218 non-null   object 
 6   Description              2218 non-null   object 
 7   StockCode                2218 non-null   object 
dtypes: float64(3), object(5)
memory usage: 156.0+ KB


In [27]:
# 2. Copy the description into "consequents_description" and drop the columns that came from orders
fpg_recommendation_by_product['consequents_description'] = fpg_recommendation_by_product['Description']
fpg_recommendation_by_product.drop(['Description','StockCode'], axis=1, inplace=True)
fpg_recommendation_by_product.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2218 entries, 0 to 1638797
Data columns (total 7 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   antecedents              2218 non-null   object 
 1   consequents              2218 non-null   object 
 2   support                  2218 non-null   float64
 3   confidence               2218 non-null   float64
 4   lift                     2218 non-null   float64
 5   antecedents_description  2218 non-null   object 
 6   consequents_description  2218 non-null   object 
dtypes: float64(3), object(4)
memory usage: 138.6+ KB


In [28]:
# 3. Reorder the dataframe columns so the table is easier to read
fpg_recommendation_by_product = fpg_recommendation_by_product[['antecedents',
                                                               'antecedents_description',
                                                               'consequents',
                                                               'consequents_description',
                                                               'support',
                                                               'confidence',
                                                               'lift']]
fpg_recommendation_by_product.sort_values(by='antecedents_description', inplace=True)
fpg_recommendation_by_product.head(5)

Unnamed: 0,antecedents,antecedents_description,consequents,consequents_description,support,confidence,lift
1252978,20975,12 PENCILS SMALL TUBE RED RETROSPOT,20974,12 PENCILS SMALL TUBE SKULL,0.005482,0.363636,25.781377
1252685,20974,12 PENCILS SMALL TUBE SKULL,20975,12 PENCILS SMALL TUBE RED RETROSPOT,0.005482,0.388664,25.781377
1478605,22150,3 STRIPEY MICE FELTCRAFT,22149,FELTCRAFT 6 FLOWER FRIENDS,0.005825,0.298246,12.376486
1250240,22150,3 STRIPEY MICE FELTCRAFT,22147,FELTCRAFT BUTTERFLY HEARTS,0.006167,0.315789,13.964912
1457592,23354,6 GIFT TAGS 50'S CHRISTMAS,23351,ROLL WRAP 50'S CHRISTMAS,0.005196,0.439614,33.040824
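With the pairwise table in place, recommending for a cart amounts to looking up every rule whose antecedent is in the cart and ranking the consequents. The sketch below is one possible way to do that; the function `recommend_for_cart`, the ranking by lift and the toy rows are assumptions, not part of this notebook.

```python
import pandas as pd

def recommend_for_cart(rules, cart, top_n=3):
    """Return up to `top_n` consequents for the items in `cart`, ranked by lift."""
    cart = set(cart)
    # Rules triggered by the cart whose consequent is not already in it
    candidates = rules[rules["antecedents"].isin(cart)
                       & ~rules["consequents"].isin(cart)]
    # Keep the best-scoring rule per recommended product
    best = candidates.sort_values("lift", ascending=False).drop_duplicates("consequents")
    return best["consequents"].head(top_n).tolist()

# Made-up rules table with the same column names as the notebook's output
toy = pd.DataFrame({
    "antecedents": ["A", "A", "B", "C"],
    "consequents": ["B", "C", "C", "D"],
    "lift": [5.0, 2.0, 9.0, 4.0],
})
suggestions = recommend_for_cart(toy, cart=["A", "B"])
```

Ranking by confidence instead of lift is an equally reasonable choice; which one works better would have to be checked against the test set.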


#### Save the FP-Growth results

1. Save to CSV

In [29]:
fpg_recommendation_by_product.to_csv('C:/Users/josefina.lin/Documents/Master/10-Trabajo Final/datasets/fp_growth_results.csv', 
                                     index = False)

2. Save to MongoDB

In [30]:
import json

In [31]:
# 2.1 First, convert the per-product recommendations to JSON records
records = json.loads(fpg_recommendation_by_product.T.to_json()).values()
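A more direct way to obtain one dictionary per row is `DataFrame.to_dict(orient="records")`, which skips the JSON round trip. The sketch below compares both approaches on a made-up frame; for a table of strings and floats like this one they should yield the same documents.

```python
import json
import pandas as pd

# Made-up frame with the same kind of columns as the recommendations table
toy = pd.DataFrame({
    "antecedents": ["A", "B"],
    "consequents": ["B", "A"],
    "lift": [2.0, 2.0],
})

# Approach used in this notebook: transpose, serialize, parse back
via_json = list(json.loads(toy.T.to_json()).values())

# Direct conversion to a list of dicts, one per row
records_direct = toy.to_dict(orient="records")
```

Either result is a list of plain dictionaries that can be passed to `insert_many`.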

In [32]:
# 2.2 Connect to the local MongoDB server
myclient = pymongo.MongoClient("mongodb://localhost:27017/")

In [33]:
# 2.3 Select the "association_rules" database and the "rules" collection
db = myclient['association_rules']
collection = db['rules']

In [34]:
# 2.4 Insert into the collection the records converted to JSON in step 2.1
collection.insert_many(records)

<pymongo.results.InsertManyResult at 0x24ff35e9a60>

#### Save the test dataset

In [35]:
test.to_csv('C:/Users/josefina.lin/Documents/Master/10-Trabajo Final/datasets/test.csv', 
                                     index = False)