## 4. Association Rules
### 4.1 Fundamentals of Association Rules
- A closer look at the concepts, the metrics (support, confidence, lift), and their meaning in data mining.
### 4.2 The Apriori Algorithm in Action
- Detailed explanation and implementation in Python.
- **Real-world application:** Mining association rules on a retail dataset.

# Association Rules

Association rules are a data mining technique used to discover interesting relationships between variables in large databases. They are used mainly in market basket analysis.

Every rule has the form A ⇒ B, where the item (or itemset) A is the antecedent and B is the consequent. From there we compute three main metrics that tell us how frequent a rule is and how strongly its antecedent and consequent are associated.

1. **Support**: 
   - Support measures how often an item or itemset appears in the dataset. In a grocery store example, if 100 of 1,000 transactions include milk, the support of milk is 10%.
   - Why does it matter? Support lets us filter out infrequent itemsets before further analysis, so we can focus on the most common and potentially significant patterns.

  $\text{support}(A \Rightarrow B) = \frac{\text{number of transactions containing } A \cup B}{\text{total number of transactions}}$
  

2. **Confidence**:
   - Confidence measures how often the items in B appear in transactions that contain A. If 30 of the 100 transactions with milk also include cookies, the confidence of the rule {Milk -> Cookies} is 30%.
   - Confidence indicates how reliable the inference made by the rule is. High-confidence rules are more likely to be of interest because the consequent follows the antecedent often.
   
  $\text{confidence}(A \Rightarrow B) = \frac{\text{number of transactions containing } A \cup B}{\text{number of transactions containing } A}$


3. **Lift**
   - What is it? Lift measures how much more often A and B appear together than would be expected if they were statistically independent. If milk and cookies are bought together three times as often as independence would suggest, the lift is 3.
   - Lift shows the strength of a rule relative to the base rate of the consequent. A lift above 1 indicates a positive association between A and B.
   
   $\text{lift}(A \Rightarrow B) = \frac{\text{confidence}(A \Rightarrow B)}{\text{support}(B)}$
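
As a quick check of the three formulas, the toy grocery numbers can be plugged in directly. The number of cookie transactions (`n_cookies = 100`) is not given in the text; it is an assumption chosen here so that the lift comes out as 3, matching the example.

```python
# Toy counts from the grocery example; n_cookies is an assumed value.
n_total = 1000    # all transactions
n_milk = 100      # transactions containing milk (A)
n_cookies = 100   # transactions containing cookies (B), assumption
n_both = 30       # transactions containing both milk and cookies

support_milk = n_milk / n_total          # support(A)
support_cookies = n_cookies / n_total    # support(B)
support_rule = n_both / n_total          # support(A => B)
confidence = n_both / n_milk             # confidence(A => B)
lift = confidence / support_cookies      # lift(A => B)

print(f"support(milk)      = {support_milk:.2f}")
print(f"support(A => B)    = {support_rule:.2f}")
print(f"confidence(A => B) = {confidence:.2f}")
print(f"lift(A => B)       = {lift:.2f}")
```

Note that the rule's support (3%) is much lower than its confidence (30%): the first is measured against all transactions, the second only against those that contain the antecedent.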

## Interpreting the results

Unlike other statistical indicators, there is no universal threshold that marks an association as significant; it depends on the situation and the dataset.

A high **confidence** does not guarantee usefulness, especially if the consequent is a very common product. As a rule of thumb, values of 0.7 and above can be considered a strong association.

**lift** > 1: positive association; the items are bought together more often than expected.

**lift** = 1: no association; the items behave as if they were independent.

**lift** < 1: negative association; the items are bought together less often than expected.
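
The reference value for "no association" follows from the definitions of confidence and support. A short derivation, using the same notation as the formulas above:

$$
\text{lift}(A \Rightarrow B)
= \frac{\text{confidence}(A \Rightarrow B)}{\text{support}(B)}
= \frac{\text{support}(A \cup B)}{\text{support}(A)\,\text{support}(B)}
$$

If A and B are independent, then $\text{support}(A \cup B) = \text{support}(A)\,\text{support}(B)$, so the ratio equals 1. A lift of 0 would instead mean that A and B never appear in the same transaction.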


## Usage example

We will use the groceries dataset, which lists the items purchased in each shopping cart.

In [6]:
import numpy as np
import pandas as pd

#Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

#Apriori libraries 
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
from mlxtend.preprocessing import TransactionEncoder  # used later to one-hot encode the carts

In [13]:
groceries = pd.read_csv('./../Data/Modelos ML/Groceries_dataset.csv', sep = ';')
groceries.rename(columns = {'Member_number':'id','itemDescription':'item'}, inplace = True)
groceries[groceries['id']==1000].sort_values('Date')

Unnamed: 0,id,Date,item
4843,1000,15/03/2015,sausage
8395,1000,15/03/2015,whole milk
20992,1000,15/03/2015,semi-finished bread
24544,1000,15/03/2015,yogurt
13331,1000,24/06/2014,whole milk
29480,1000,24/06/2014,pastry
32851,1000,24/06/2014,salty snack
2047,1000,24/07/2015,canned beer
18196,1000,24/07/2015,misc. beverages
6388,1000,25/11/2015,sausage


In [45]:
groceries.shape

(38765, 3)

In [39]:
quantity = pd.DataFrame(groceries[['id','Date']].value_counts())
quantity.columns = ['qty_purchased']
quantity.reset_index(inplace = True)
quantity

Unnamed: 0,id,Date,qty_purchased
0,1780,12/07/2015,11
1,1669,19/08/2015,9
2,1806,06/07/2015,9
3,2393,17/04/2015,9
4,3543,11/10/2015,9
...,...,...,...
14958,1741,10/12/2014,2
14959,3111,30/10/2014,2
14960,1741,08/10/2014,2
14961,3112,03/07/2014,2


In [35]:
groceries2 = groceries.copy()
groceries2['qty_purchased']=groceries['id'].map(groceries['id'].value_counts())
# groceries2 = groceries2[['item','qty_purchased']]
# groceries2 = groceries2.drop_duplicates(['item','qty_purchased'])
groceries2.sort_values('qty_purchased', ascending = False)[0:15]

Unnamed: 0,id,Date,item,qty_purchased
20631,3180,15/09/2015,domestic eggs,36
5052,3180,19/10/2015,pastry,36
16091,3180,19/10/2015,rolls/buns,36
30682,3180,02/11/2014,rolls/buns,36
421,3180,15/03/2015,whole milk,36
21616,3180,03/07/2015,margarine,36
21201,3180,19/10/2015,tropical fruit,36
1810,3180,04/05/2015,tropical fruit,36
13853,3180,09/11/2014,whole milk,36
20190,3180,19/10/2015,condensed milk,36


In [42]:
#Creating the member x item matrix (one row per member id, so support is per member, not per cart)
basket = (groceries2.groupby(['id', 'item'])['qty_purchased']
          .sum().unstack().reset_index().fillna(0)
          .set_index('id'))
basket

id,Instant food products,UHT-milk,abrasive cleaner,artif. sweetener,baby cosmetics,bags,baking powder,bathroom cleaner,beef,berries,...,turkey,vinegar,waffles,whipped/sour cream,whisky,white bread,white wine,whole milk,yogurt,zwieback
1000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,26.0,13.0,0.0
1001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,12.0,0.0,...,0.0,0.0,0.0,12.0,0.0,12.0,0.0,24.0,0.0,0.0
1002,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,0.0,0.0
1003,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1004,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,63.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4997,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,6.0,6.0,0.0,0.0
4998,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4999,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,32.0,...,0.0,0.0,0.0,16.0,0.0,0.0,0.0,0.0,16.0,0.0


In [41]:
#Encoding the quantity purchased
# Encode the basket as booleans: True when qty is greater than 0, False otherwise.
# A vectorized comparison replaces the element-wise basket.applymap(encode) call,
# which is deprecated in recent pandas versions.
basket_sets = basket > 0
basket_sets

In [28]:
frequent_itemsets = apriori(basket_sets, min_support=0.02, use_colnames=True)
frequent_itemsets



Unnamed: 0,support,itemsets
0,0.078502,(UHT-milk)
1,0.031042,(baking powder)
2,0.119548,(beef)
3,0.079785,(berries)
4,0.062083,(beverages)
...,...,...
889,0.027963,"(yogurt, soda, other vegetables, whole milk)"
890,0.021293,"(yogurt, tropical fruit, other vegetables, who..."
891,0.021036,"(sausage, soda, whole milk, rolls/buns)"
892,0.022832,"(sausage, yogurt, whole milk, rolls/buns)"
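
To make the idea behind `apriori` concrete, here is a minimal pure-Python sketch of the level-wise search. It relies on the fact that every subset of a frequent itemset must itself be frequent, so candidates of size k are built only from frequent itemsets of size k-1. This is an illustration on made-up data, not mlxtend's implementation; `apriori_sketch` and the `toy` transactions are hypothetical names.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Return {frozenset(itemset): support} for every itemset meeting min_support."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = {}
    k = 1
    while current:
        frequent.update(current)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a, b in combinations(list(current), 2) if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # Count step: keep only the candidates that meet the threshold.
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
    return frequent

# Made-up transactions, for illustration only.
toy = [
    ["milk", "bread"],
    ["milk", "cookies"],
    ["milk", "bread", "cookies"],
    ["bread"],
]
result = apriori_sketch(toy, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

Lowering `min_support` lets more candidates survive each level, which is why the number of itemsets (and the running time) can grow quickly on real data.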


In [29]:
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
0,(UHT-milk),(bottled water),0.078502,0.213699,0.021293,0.271242,1.269268,1.0,0.004517,1.078960,0.230217,0.078598,0.073181,0.185441
1,(bottled water),(UHT-milk),0.213699,0.078502,0.021293,0.099640,1.269268,1.0,0.004517,1.023477,0.269801,0.078598,0.022939,0.185441
2,(UHT-milk),(other vegetables),0.078502,0.376603,0.038994,0.496732,1.318979,1.0,0.009430,1.238697,0.262440,0.093711,0.192700,0.300137
3,(other vegetables),(UHT-milk),0.376603,0.078502,0.038994,0.103542,1.318979,1.0,0.009430,1.027933,0.387936,0.093711,0.027174,0.300137
4,(UHT-milk),(rolls/buns),0.078502,0.349666,0.031042,0.395425,1.130863,1.0,0.003592,1.075687,0.125578,0.078165,0.070361,0.242100
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2951,"(soda, rolls/buns)","(yogurt, whole milk)",0.119805,0.150590,0.024628,0.205567,1.365080,1.0,0.006587,1.069203,0.303844,0.100209,0.064724,0.184555
2952,(yogurt),"(soda, whole milk, rolls/buns)",0.282966,0.065162,0.024628,0.087035,1.335684,1.0,0.006190,1.023959,0.350499,0.076130,0.023398,0.232494
2953,(whole milk),"(yogurt, soda, rolls/buns)",0.458184,0.042329,0.024628,0.053751,1.269836,1.0,0.005233,1.012071,0.392193,0.051752,0.011927,0.317785
2954,(soda),"(yogurt, whole milk, rolls/buns)",0.313494,0.065931,0.024628,0.078560,1.191540,1.0,0.003959,1.013705,0.234157,0.069414,0.013520,0.226050


In [30]:
#Customizable function to change the lift and confidence
def rules_mod(lift,confidence):
    '''rules_mod is a function to control the rules 
    based on lift and confidence threshold'''
    return rules[ (rules['lift'] >= lift) &
      (rules['confidence'] >= confidence) ].sort_values('lift', ascending = False)

#Calling function
rules_mod(0.7,0.2)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
2933,"(sausage, whole milk)","(yogurt, rolls/buns)",0.106978,0.111339,0.022832,0.213429,1.916929,1.0,0.010921,1.129791,0.535633,0.116798,0.114881,0.209249
2936,"(yogurt, rolls/buns)","(sausage, whole milk)",0.111339,0.106978,0.022832,0.205069,1.916929,1.0,0.010921,1.123396,0.538262,0.116798,0.109842,0.209249
2934,"(sausage, rolls/buns)","(yogurt, whole milk)",0.082350,0.150590,0.022832,0.277259,1.841148,1.0,0.010431,1.175261,0.497859,0.108669,0.149125,0.214438
2878,"(sausage, whole milk)","(yogurt, other vegetables)",0.106978,0.120318,0.023089,0.215827,1.793806,1.0,0.010217,1.121796,0.495538,0.113065,0.108572,0.203862
2738,"(yogurt, bottled water)","(other vegetables, whole milk)",0.066444,0.191380,0.022063,0.332046,1.735009,1.0,0.009346,1.210593,0.453786,0.093580,0.173958,0.223664
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
18,(beef),(bottled water),0.119548,0.213699,0.025911,0.216738,1.014220,1.0,0.000363,1.003880,0.015925,0.084307,0.003865,0.168993
147,(fruit/vegetable juice),(bottled water),0.124936,0.213699,0.026937,0.215606,1.008921,1.0,0.000238,1.002430,0.010105,0.086420,0.002425,0.170828
2143,"(tropical fruit, root vegetables)",(other vegetables),0.057465,0.376603,0.021806,0.379464,1.007597,1.0,0.000164,1.004610,0.007999,0.052894,0.004589,0.218683
2207,"(soda, tropical fruit)",(other vegetables),0.081837,0.376603,0.031042,0.379310,1.007188,1.0,0.000222,1.004361,0.007773,0.072629,0.004342,0.230868


In [44]:
# Group the items by member id + Date (each group represents a single cart)
transactions = groceries.groupby(['id', 'Date'])['item'].apply(list).tolist()

# Transform the transactions into binary (market basket) format
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
basket_df = pd.DataFrame(te_ary, columns=te.columns_)
basket_df

Unnamed: 0,Instant food products,UHT-milk,abrasive cleaner,artif. sweetener,baby cosmetics,bags,baking powder,bathroom cleaner,beef,berries,...,turkey,vinegar,waffles,whipped/sour cream,whisky,white bread,white wine,whole milk,yogurt,zwieback
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,True,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14958,False,False,False,False,False,False,False,False,False,True,...,False,False,False,False,False,False,False,False,True,False
14959,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
14960,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
14961,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
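
The boolean table above has one row per cart and one column per distinct item. A minimal pure-Python sketch of that kind of encoding is shown below; `one_hot_encode` and the `carts` list are hypothetical names used for illustration, and this is not how `TransactionEncoder` is implemented internally.

```python
def one_hot_encode(transactions):
    """Return (columns, rows): sorted item names and one boolean row per transaction."""
    columns = sorted({item for t in transactions for item in t})
    rows = []
    for t in transactions:
        present = set(t)
        # True where the cart contains the item, False otherwise.
        rows.append([item in present for item in columns])
    return columns, rows

# Made-up carts, for illustration only.
carts = [
    ["whole milk", "yogurt", "sausage"],
    ["whole milk", "pastry"],
    ["canned beer"],
]
columns, rows = one_hot_encode(carts)
print(columns)
for row in rows:
    print(row)
```

Because each cart becomes a set, an item that appears twice in the same cart still produces a single `True`, which is what the support counts require.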


In [61]:
df = pd.read_csv('./../Data/Modelos ML/Groceries_dataset.csv', sep = ';')

# Group the items by Member_number + Date (each group represents a single cart)
transactions = df.groupby(['Member_number', 'Date'])['itemDescription'].apply(list).tolist()

# Transform the transactions into binary (market basket) format
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
basket_df = pd.DataFrame(te_ary, columns=te.columns_)

# Apply the Apriori algorithm to find frequent itemsets (min_support = 0.001)
frequent_itemsets = apriori(basket_df, min_support=0.001, use_colnames=True)

# Generate the association rules; a minimum confidence of 0.00 keeps every rule
rules = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.00)




In [65]:
rules.sort_values('lift', ascending = False)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
1235,(sausage),"(yogurt, whole milk)",0.060349,0.011161,0.001470,0.024363,2.182917,1.0,0.000797,1.013532,0.576701,0.020992,0.013351,0.078050
1234,"(yogurt, whole milk)",(sausage),0.011161,0.060349,0.001470,0.131737,2.182917,1.0,0.000797,1.082219,0.548014,0.020992,0.075973,0.078050
1233,"(sausage, whole milk)",(yogurt),0.008955,0.085879,0.001470,0.164179,1.911760,1.0,0.000701,1.093681,0.481231,0.015748,0.085657,0.090650
1236,(yogurt),"(sausage, whole milk)",0.085879,0.008955,0.001470,0.017121,1.911760,1.0,0.000701,1.008307,0.521727,0.015748,0.008239,0.090650
474,(specialty chocolate),(citrus fruit),0.015973,0.053131,0.001403,0.087866,1.653762,1.0,0.000555,1.038081,0.401735,0.020731,0.036684,0.057141
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57,(beef),(tropical fruit),0.033950,0.067767,0.001136,0.033465,0.493817,1.0,-0.001165,0.964510,-0.514814,0.011296,-0.036796,0.025115
48,(beef),(rolls/buns),0.033950,0.110005,0.001604,0.047244,0.429474,1.0,-0.002131,0.934127,-0.578968,0.011268,-0.070518,0.030912
49,(rolls/buns),(beef),0.110005,0.033950,0.001604,0.014581,0.429474,1.0,-0.002131,0.980344,-0.598817,0.011268,-0.020050,0.030912
469,(citrus fruit),(sausage),0.053131,0.060349,0.001203,0.022642,0.375177,1.0,-0.002003,0.961419,-0.637531,0.010714,-0.040129,0.021288


## Other association measures and indicators



| **Metric**            | **Description**                                                                                      | **Interpretation**                                     |
|-----------------------|------------------------------------------------------------------------------------------------------|--------------------------------------------------------|
| **Representativity**  | Share of the original transactions that can be used to evaluate the rule; as far as we can tell, mlxtend uses it to account for missing values. It is not the same as `confidence`. | 1.0 for every rule in the outputs above, as expected when no values are missing. |
| **Leverage**          | Difference between the observed joint support and the support expected under independence:<br> \\( support(A \cup B) - support(A) \cdot support(B) \\) | 0 = no relationship, > 0 = positive association, < 0 = negative association. |
| **Conviction**        | Ratio between how often A would appear without B under independence and how often it actually does:<br> \\( \frac{P(A) \cdot P(\neg B)}{P(A \land \neg B)} \\) | 1 = independence, > 1 = positive association, ∞ = perfect rule. |
| **Zhang’s Metric**    | Measures the strength of the implication while reducing the bias toward frequent items.<br> Not symmetric: A ⇒ B and B ⇒ A can score differently. | 0 = no relationship, close to 1 = strong positive, close to −1 = strong negative. |
| **Jaccard**           | Similarity between A and B:<br> \\( \frac{support(A \cup B)}{support(A) + support(B) - support(A \cup B)} \\)   | 0 to 1. Higher = more relative co-occurrence.          |
| **Certainty**         | How much knowing A improves the certainty of B:<br> \\( \frac{confidence(A \Rightarrow B) - support(B)}{1 - support(B)} \\) | > 0 = A increases the certainty of B.                  |
| **Kulczynski**        | Average of the direct and inverse confidences:<br> \\( \frac{P(B \mid A) + P(A \mid B)}{2} \\)       | Symmetric metric. High value = strong association.     |
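
As a sanity check, several of these columns can be recomputed from the three supports of a single rule. The sketch below uses the supports of the first rule in the member-level output, (UHT-milk) ⇒ (bottled water). `extra_metrics` is a hypothetical helper. The formula for Zhang's metric is written in the form we understand mlxtend to use, so treat it as an assumption. Certainty is computed as the certainty factor normalised by 1 − support(B), which is the form that reproduces the value in that output.

```python
def extra_metrics(s_a, s_b, s_ab):
    """Recompute rule metrics from support(A), support(B) and support(A u B).

    Assumes 0 < s_a, 0 < s_b < 1 and confidence(A => B) < 1; otherwise some
    ratios below divide by zero (conviction is infinite for a perfect rule).
    """
    confidence_ab = s_ab / s_a   # confidence(A => B)
    confidence_ba = s_ab / s_b   # confidence(B => A)
    leverage = s_ab - s_a * s_b
    return {
        "leverage": leverage,
        "conviction": (1 - s_b) / (1 - confidence_ab),
        # Assumed form of Zhang's metric (see lead-in).
        "zhangs_metric": leverage / max(s_ab * (1 - s_a), s_a * (s_b - s_ab)),
        "jaccard": s_ab / (s_a + s_b - s_ab),
        "certainty": (confidence_ab - s_b) / (1 - s_b),
        "kulczynski": (confidence_ab + confidence_ba) / 2,
    }

# Supports copied from the rule (UHT-milk) => (bottled water) in the output above.
m = extra_metrics(s_a=0.078502, s_b=0.213699, s_ab=0.021293)
for name, value in m.items():
    print(f"{name:14s} {value:.6f}")
```

Small differences in the last decimals are expected, because the supports shown in the output are already rounded.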

---

### Usage recommendations

- **Leverage** and **Conviction** are useful for detecting non-trivial relationships.
- **Zhang**, **Jaccard**, **Kulczynski**, and **Certainty** help filter out misleading rules, especially with frequent items or low support.
- **Kulczynski** is useful when the relationship A ↔ B should be strong in both directions.

