<img src="newlogomioti.png" style="height: 100px">
<center style="color:#888">Data Science in IoT Module<br/>Course: Machine Learning 2 (Unsupervised learning)</center>

# Challenge S6: Association rules for market basket analysis

## Objectives

The goal of this challenge is to tackle a real association rule mining problem: starting from a supermarket's purchase transaction data, carry out a *market basket analysis* by generating association rules.

In [1]:
# Uncomment the following lines to install the mlxtend package
#import sys
#!{sys.executable} -m pip install mlxtend

from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

import pandas as pd

## Loading the data

To begin, we load the file of sales invoices:

In [2]:
df = pd.read_excel('data/Online Retail.xlsx')
df.head()

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010-12-01 08:26:00 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 2010-12-01 08:26:00 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010-12-01 08:26:00 | 3.39 | 17850.0 | United Kingdom |


We then convert it to the format expected by the association rules library:

In [4]:
# Preprocessing: clean up descriptions, drop rows without an invoice number,
# and discard invoices whose number contains 'C' (cancellations)
df['Description'] = df['Description'].str.strip()
df.dropna(axis=0, subset=['InvoiceNo'], inplace=True)
df['InvoiceNo'] = df['InvoiceNo'].astype('str')
df = df[~df['InvoiceNo'].str.contains('C')]

# Filter by country, group by invoice, and pivot so that each item becomes a column
basket = (df[df['Country'] == "Germany"]
          .groupby(['InvoiceNo', 'Description'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))

In [5]:
basket.head()

| InvoiceNo | 10 COLOUR SPACEBOY PEN | 12 COLOURED PARTY BALLOONS | 12 IVORY ROSE PEG PLACE SETTINGS | 12 MESSAGE CARDS WITH ENVELOPES | 12 PENCIL SMALL TUBE WOODLAND | 12 PENCILS SMALL TUBE RED RETROSPOT | 12 PENCILS SMALL TUBE SKULL | 12 PENCILS TALL TUBE POSY | 12 PENCILS TALL TUBE RED RETROSPOT | 12 PENCILS TALL TUBE SKULLS | ... | YULETIDE IMAGES GIFT WRAP SET | ZINC HEART T-LIGHT HOLDER | ZINC STAR T-LIGHT HOLDER | ZINC BOX SIGN HOME | ZINC FOLKART SLEIGH BELLS | ZINC HEART LATTICE T-LIGHT HOLDER | ZINC METAL HEART DECORATION | ZINC T-LIGHT HOLDER STAR LARGE | ZINC T-LIGHT HOLDER STARS SMALL | ZINC WILLIE WINKIE CANDLE STICK |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 536527 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 536840 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 536861 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 536967 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 536983 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [6]:
# Convert quantities to binary flags: 1 if the item was bought, 0 otherwise
def encode_units(x):
    # Any quantity below 1 (including negatives) counts as "not purchased"
    return 1 if x >= 1 else 0

In [7]:
basket_sets = basket.applymap(encode_units)
basket_sets.drop('POSTAGE', inplace=True, axis=1)

In [8]:
basket_sets.head()

| InvoiceNo | 10 COLOUR SPACEBOY PEN | 12 COLOURED PARTY BALLOONS | 12 IVORY ROSE PEG PLACE SETTINGS | 12 MESSAGE CARDS WITH ENVELOPES | 12 PENCIL SMALL TUBE WOODLAND | 12 PENCILS SMALL TUBE RED RETROSPOT | 12 PENCILS SMALL TUBE SKULL | 12 PENCILS TALL TUBE POSY | 12 PENCILS TALL TUBE RED RETROSPOT | 12 PENCILS TALL TUBE SKULLS | ... | YULETIDE IMAGES GIFT WRAP SET | ZINC HEART T-LIGHT HOLDER | ZINC STAR T-LIGHT HOLDER | ZINC BOX SIGN HOME | ZINC FOLKART SLEIGH BELLS | ZINC HEART LATTICE T-LIGHT HOLDER | ZINC METAL HEART DECORATION | ZINC T-LIGHT HOLDER STAR LARGE | ZINC T-LIGHT HOLDER STARS SMALL | ZINC WILLIE WINKIE CANDLE STICK |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 536527 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 536840 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 536861 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 536967 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 536983 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


* What is the size of our item set $I$?

* What is the size of our transaction set $D$?
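Assuming the one-hot table has one row per invoice and one column per item (as `basket_sets` does above), both sizes can be read from its shape. A minimal sketch on a hypothetical toy table, not the real data:

```python
import pandas as pd

# Hypothetical toy one-hot basket: rows are invoices, columns are items
toy_basket = pd.DataFrame(
    {"A": [1, 0, 1, 1], "B": [1, 1, 0, 1], "C": [0, 0, 1, 0]},
    index=["t1", "t2", "t3", "t4"],
)

# |D| is the number of rows (transactions), |I| the number of columns (items)
n_transactions, n_items = toy_basket.shape
print(f"|I| = {n_items}, |D| = {n_transactions}")
```

On the real data the same idea would apply to `basket_sets.shape`.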

## Association rules with mlxtend

We will now use the [`mlxtend`](http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/) library to generate the most interesting association rules in our dataset.

* We start by computing the support of the frequent itemsets in our dataset, applying a minimum threshold of 0.07:

* Try running the same apriori algorithm with minimum support thresholds of 0.05 and 0.09. What do you observe? What do you think causes it?

From now on we will use the first set of supports, computed with a threshold of 0.07.

* Which item has the highest support?

* Now compute the association rules from the supports calculated previously. Try the confidence metric with a minimum threshold of 1.

* Now try the lift metric with a minimum threshold of 1.

* From these last rules, keep only those with a lift greater than 1.5 and a confidence greater than 0.7.

* How would you interpret the specific meaning of the association rules generated?

The two fruit and food boxes are usually bought together, about 80% of the time. The high lift value also points to a dependency between the two products.

## Association rules by hand

Take the association rule generated in the previous step as an example. We will now reproduce by hand the calculations previously obtained with `mlxtend`.

* We start with the support. Compute the support of the antecedent, of the consequent, and of both together for the last association rule (without using mlxtend).

* Once the supports are computed, calculate the confidence of the rule.

* Now calculate the lift.

Check that you obtain the same results as with `mlxtend`.
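As a reminder, the support of an itemset is the fraction of transactions that contain it, confidence(A → B) = support(A ∪ B) / support(A), and lift(A → B) = confidence(A → B) / support(B). A minimal sketch of the manual calculation, assuming a hypothetical toy table in which column `A` plays the antecedent and `B` the consequent:

```python
import pandas as pd

# Hypothetical toy one-hot basket; "A" is the antecedent and "B" the consequent
toy_basket = pd.DataFrame(
    {
        "A": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
        "B": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    }
).astype(bool)

n = len(toy_basket)

# Support: fraction of transactions containing the itemset
support_a = toy_basket["A"].sum() / n
support_b = toy_basket["B"].sum() / n
support_ab = (toy_basket["A"] & toy_basket["B"]).sum() / n

# Confidence and lift follow directly from the three supports
confidence = support_ab / support_a
lift = confidence / support_b

print(f"support(A)={support_a:.2f}, support(B)={support_b:.2f}, support(A,B)={support_ab:.2f}")
print(f"confidence(A->B)={confidence:.2f}, lift(A->B)={lift:.2f}")
```

On the real data, the two columns would be replaced by the antecedent and consequent items of the rule under study.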

## Optional

* Using the values computed by `mlxtend`, draw a scatter plot of the `confidence` against the `lift` of the association rules obtained.
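A possible starting point for the plot, assuming matplotlib is available. `toy_rules` is a hypothetical stand-in with made-up values. In the notebook it would be replaced by the DataFrame returned by `association_rules`, which has `confidence` and `lift` columns.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the rules DataFrame (values are illustrative only)
toy_rules = pd.DataFrame(
    {"confidence": [0.80, 1.00, 0.20, 0.50], "lift": [2.0, 2.0, 1.0, 1.0]}
)

# One point per rule: lift on the x axis, confidence on the y axis
fig, ax = plt.subplots()
ax.scatter(toy_rules["lift"], toy_rules["confidence"], alpha=0.6)
ax.set_xlabel("lift")
ax.set_ylabel("confidence")
ax.set_title("Association rules: confidence vs. lift")
```

The choice of axes is a judgment call. Swapping them gives an equally valid view.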

* The previous analysis was restricted to sales in Germany. Compute the association rules for another country and compare the results with the earlier ones.
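To repeat the analysis for several countries, the basket construction can be wrapped in a small helper. `build_basket` below is a hypothetical function, not part of the original notebook, and the toy rows only mimic the columns of the retail file:

```python
import pandas as pd

def build_basket(df, country):
    """Return a boolean invoice-by-item table for one country (hypothetical helper)."""
    subset = df[df["Country"] == country]
    quantities = (
        subset.groupby(["InvoiceNo", "Description"])["Quantity"]
        .sum()
        .unstack(fill_value=0)
    )
    # True when the invoice contains at least one unit of the item
    return quantities >= 1

# Hypothetical toy rows with the same column names as the retail data
toy_df = pd.DataFrame(
    {
        "InvoiceNo": ["1", "1", "2", "3"],
        "Description": ["MUG", "PLATE", "MUG", "MUG"],
        "Quantity": [2, 1, 3, 1],
        "Country": ["France", "France", "France", "Germany"],
    }
)

france_basket = build_basket(toy_df, "France")
print(france_basket)
```

The resulting table can be passed to `apriori` in the same way as the German one. Note that the helper does not remove the `POSTAGE` column, so that step would still need to be applied where relevant.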