# **Practical Assignment - Association Rules - Lisandro Duhalde**


This assignment applies association rules to analyze frequent patterns in a transactional dataset.  
The Apriori algorithm is used to identify frequent itemsets and extract relevant association rules, which are evaluated with metrics such as support, confidence, and lift.

The goal is to discover meaningful relationships among the items in the dataset and to interpret them from the standpoint of the business or problem domain.



## Techniques used
- Association rules
- Apriori algorithm
- Support, confidence, and lift
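
As a quick illustration of the three metrics, the sketch below computes them by hand on a toy set of transactions. The transactions and the helper names (`support`, `confidence`, `lift`) are hypothetical and only meant to show the standard definitions; they are not the code used in the rest of this notebook.

```python
# Toy transactions (hypothetical, not taken from the penguin dataset).
transactions = [
    {"species_Gentoo", "island_Biscoe"},
    {"species_Gentoo", "island_Biscoe"},
    {"species_Adelie", "island_Biscoe"},
    {"species_Adelie", "island_Dream"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Share of the transactions containing the antecedent that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence relative to the baseline frequency of the consequent (1.0 means independence)."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

antecedent, consequent = {"species_Gentoo"}, {"island_Biscoe"}
print("support:   ", support(antecedent | consequent, transactions))
print("confidence:", confidence(antecedent, consequent, transactions))
print("lift:      ", lift(antecedent, consequent, transactions))
```

A lift above 1 suggests that the antecedent and the consequent appear together more often than they would if they were independent.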

In [None]:
import warnings

# Ignore DeprecationWarning messages raised from the jupyter_client module
warnings.filterwarnings("ignore", category=DeprecationWarning, module="jupyter_client")

We install and import the library that gives us access to the dataset.

In [None]:
# support for loading datasets from https://www.openml.org/
!pip install openml
import openml


We load the dataset.

In [None]:
import pandas as pd

# select the dataset by its OpenML ID, in this case 42585
dataset = openml.datasets.get_dataset(42585)

# split the information stored in the dataset
X, y, categorical_indicator, attribute_names = dataset.get_data(
    dataset_format='dataframe',
    target=dataset.default_target_attribute
)

# concatenate the relevant information into a single DataFrame
df = pd.concat([X, y], axis=1)
df

Unnamed: 0,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex,species
0,Torgersen,39.1,18.7,181.0,3750.0,MALE,Adelie
1,Torgersen,39.5,17.4,186.0,3800.0,FEMALE,Adelie
2,Torgersen,40.3,18.0,195.0,3250.0,FEMALE,Adelie
3,Torgersen,,,,,,Adelie
4,Torgersen,36.7,19.3,193.0,3450.0,FEMALE,Adelie
...,...,...,...,...,...,...,...
339,Biscoe,,,,,,Gentoo
340,Biscoe,46.8,14.3,215.0,4850.0,FEMALE,Gentoo
341,Biscoe,50.4,15.7,222.0,5750.0,MALE,Gentoo
342,Biscoe,45.2,14.8,212.0,5200.0,FEMALE,Gentoo


a) The selected dataset contains information about penguins recorded at Palmer Station, a research base in Antarctica. It covers 344 penguins, each described by 7 features.

Of the 7 features, species, island, and sex are nominal: species and island each take 3 values, while sex takes 2 (MALE and FEMALE), aside from missing entries. The remaining features are numeric and span a wider range of values.

b) Applying association rules to this dataset can reveal different kinds of relationships between the variables, for example whether one species predominates on a given island, how body mass varies by species or island, or how body measurements differ across species and sexes.

c) Rows with missing values are dropped, since the continuous variables cannot be discretized while values are missing. For the 'sex' variable, cells containing '_' are first replaced with NaN so that 'dropna' removes those rows as well.

The continuous variables are then discretized into 3 intervals.

Finally, one-hot encoding is applied, because the Apriori implementation expects a table of boolean items.
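
To make the two transformations concrete, here is a small standard-library sketch of tercile binning followed by one-hot encoding. The sample values and the `to_bin` helper are hypothetical, and `statistics.quantiles` may place the cut points slightly differently from scikit-learn's `KBinsDiscretizer`, so treat it as an illustration of the idea rather than a reproduction of the cell below.

```python
from bisect import bisect_right
from statistics import quantiles

# Hypothetical body masses in grams (illustrative values only).
body_mass_g = [3250.0, 3450.0, 3750.0, 3800.0, 4850.0, 5200.0, 5750.0, 4100.0, 3600.0]

# Two cut points split the values into three roughly equal-sized bins.
cut_points = quantiles(body_mass_g, n=3)

def to_bin(value, cut_points):
    """Return 0, 1 or 2 depending on which tercile `value` falls into."""
    return bisect_right(cut_points, value)

bins = [to_bin(v, cut_points) for v in body_mass_g]

# One-hot encode: one boolean item per bin, exactly one True per row.
one_hot = [
    {f"body_mass_g_{b}": (b == row_bin) for b in range(3)}
    for row_bin in bins
]

for value, row in zip(body_mass_g, one_hot):
    print(value, row)
```

Each numeric column therefore turns into three mutually exclusive boolean items, which is the form in which Apriori can count co-occurrences.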


In [None]:
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

columnas_cat = ["island", "sex", "species"]
columnas_cont = ["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm", "body_mass_g"]

# '_' marks an unknown sex; turn it into a missing value so dropna removes the row
df['sex'] = df['sex'].replace('_', pd.NA)

# drop incomplete rows; .copy() avoids pandas' SettingWithCopyWarning on the assignment below
df_limpio = df.dropna(subset=columnas_cat + columnas_cont).copy()

# discretize each continuous variable into 3 quantile-based bins (0.0, 1.0, 2.0)
kbd = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
df_limpio[columnas_cont] = kbd.fit_transform(df_limpio[columnas_cont])

# one-hot encode every column so each value becomes a boolean item for Apriori
df_procesado = pd.get_dummies(df_limpio, columns=columnas_cat + columnas_cont)
df_procesado


Unnamed: 0,island_Biscoe,island_Dream,island_Torgersen,sex_FEMALE,sex_MALE,species_Adelie,species_Chinstrap,species_Gentoo,culmen_length_mm_0.0,culmen_length_mm_1.0,culmen_length_mm_2.0,culmen_depth_mm_0.0,culmen_depth_mm_1.0,culmen_depth_mm_2.0,flipper_length_mm_0.0,flipper_length_mm_1.0,flipper_length_mm_2.0,body_mass_g_0.0,body_mass_g_1.0,body_mass_g_2.0
0,False,False,True,False,True,True,False,False,True,False,False,False,False,True,True,False,False,False,True,False
1,False,False,True,True,False,True,False,False,True,False,False,False,True,False,True,False,False,False,True,False
2,False,False,True,True,False,True,False,False,True,False,False,False,True,False,False,True,False,True,False,False
4,False,False,True,True,False,True,False,False,True,False,False,False,False,True,False,True,False,True,False,False
5,False,False,True,False,True,True,False,False,True,False,False,False,False,True,True,False,False,True,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
338,True,False,False,True,False,False,False,True,False,False,True,True,False,False,False,False,True,False,False,True
340,True,False,False,True,False,False,False,True,False,False,True,True,False,False,False,False,True,False,False,True
341,True,False,False,False,True,False,False,True,False,False,True,True,False,False,False,False,True,False,False,True
342,True,False,False,True,False,False,False,True,False,True,False,True,False,False,False,False,True,False,False,True


d) To obtain the association rules, we first extract the frequent itemsets with the Apriori algorithm and then generate the rules from those itemsets. The minimum support (0.15) and minimum confidence (0.75) are set to moderate values, since the post-processing stage will filter out further irrelevant rules.
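
For readers unfamiliar with how Apriori finds frequent itemsets, the following is a minimal level-wise sketch in plain Python. It is not mlxtend's implementation; the function name `apriori_sketch` and the toy transactions are hypothetical, and the code favors clarity over speed.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Return {frozenset: support} for every itemset meeting `min_support`."""
    n = len(transactions)

    def supp(itemset):
        # Fraction of transactions that contain the whole itemset.
        return sum(itemset <= t for t in transactions) / n

    items = {item for t in transactions for item in t}
    # Level 1: frequent single items.
    current = {frozenset([i]) for i in items if supp(frozenset([i])) >= min_support}
    frequent = {}
    k = 1
    while current:
        frequent.update({s: supp(s) for s in current})
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent (the Apriori property).
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if supp(c) >= min_support}
    return frequent

# Toy transactions (hypothetical item names).
toy = [
    {"species_Gentoo", "island_Biscoe", "body_mass_high"},
    {"species_Gentoo", "island_Biscoe"},
    {"species_Gentoo", "body_mass_high"},
    {"island_Biscoe", "body_mass_high"},
    {"species_Gentoo", "island_Biscoe", "body_mass_high"},
]
toy_frequent = apriori_sketch(toy, min_support=0.6)
for itemset, s in sorted(toy_frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

With a minimum support of 0.6, every single item and every pair is frequent in this toy data, but the full triple appears in only 2 of 5 transactions and is discarded.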

In [None]:
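from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

# frequent itemsets: combinations of items present in at least 15% of the rows
frequent_itemsets = apriori(df_procesado, min_support=0.15, use_colnames=True)

# rules derived from those itemsets, keeping only those with confidence >= 0.75
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.75)

rules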

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
0,(species_Gentoo),(island_Biscoe),0.357357,0.489489,0.357357,1.000000,2.042945,1.0,0.182435,inf,0.794393,0.730061,1.000000,0.865031
1,(culmen_depth_mm_0.0),(island_Biscoe),0.327327,0.489489,0.315315,0.963303,1.967974,1.0,0.155092,13.911411,0.731207,0.628743,0.928117,0.803737
2,(flipper_length_mm_2.0),(island_Biscoe),0.345345,0.489489,0.330330,0.956522,1.954121,1.0,0.161287,11.741742,0.745830,0.654762,0.914834,0.815684
3,(body_mass_g_2.0),(island_Biscoe),0.336336,0.489489,0.318318,0.946429,1.933501,1.0,0.153685,9.529530,0.727482,0.627219,0.895063,0.798368
4,(species_Chinstrap),(island_Dream),0.204204,0.369369,0.204204,1.000000,2.707317,1.0,0.128777,inf,0.792453,0.552846,1.000000,0.776423
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
617,"(culmen_length_mm_2.0, culmen_depth_mm_0.0, bo...","(island_Biscoe, species_Gentoo, flipper_length...",0.159159,0.330330,0.153153,0.962264,2.913036,1.0,0.100578,17.746246,0.781022,0.455357,0.943650,0.712950
618,"(culmen_length_mm_2.0, flipper_length_mm_2.0, ...","(island_Biscoe, species_Gentoo, body_mass_g_2.0)",0.156156,0.309309,0.153153,0.980769,3.170836,1.0,0.104853,35.915916,0.811318,0.490385,0.972157,0.737957
619,"(island_Biscoe, culmen_length_mm_2.0)","(culmen_depth_mm_0.0, flipper_length_mm_2.0, s...",0.201201,0.252252,0.153153,0.761194,3.017591,1.0,0.102400,3.131194,0.837019,0.510000,0.680633,0.684168
620,"(culmen_length_mm_2.0, species_Gentoo)","(culmen_depth_mm_0.0, island_Biscoe, flipper_l...",0.201201,0.252252,0.153153,0.761194,3.017591,1.0,0.102400,3.131194,0.837019,0.510000,0.680633,0.684168


e) For post-processing, on top of the support and confidence thresholds already applied, we require lift above 2 and conviction above 3 to filter out rules that are not useful.
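
As a sanity check on these two metrics, the sketch below recomputes lift and conviction from the confidence and consequent support of one rule in the table above, using the standard definitions (lift = confidence / consequent support; conviction = (1 - consequent support) / (1 - confidence)). The helper names are hypothetical, and it is an assumption that mlxtend follows exactly these formulas, although the recomputed values agree with the table up to rounding.

```python
import math

def lift_from(confidence, consequent_support):
    """Lift: how much more often the consequent appears given the antecedent."""
    return confidence / consequent_support

def conviction_from(confidence, consequent_support):
    """Conviction: infinite when the rule never fails (confidence == 1)."""
    if confidence >= 1.0:
        return math.inf
    return (1.0 - consequent_support) / (1.0 - confidence)

# Values copied from the rule (culmen_depth_mm_0.0) -> (island_Biscoe) in the rules table.
conf_value, cons_support = 0.963303, 0.489489
rule_lift = lift_from(conf_value, cons_support)
rule_conviction = conviction_from(conf_value, cons_support)

# Same thresholds as the post-processing filter: lift > 2 and conviction > 3.
kept = rule_lift > 2 and rule_conviction > 3
print(rule_lift, rule_conviction, kept)
```

This rule has a high conviction but a lift just under 2, so the combined filter discards it. Rules with confidence 1.0, such as (species_Gentoo) -> (island_Biscoe), have infinite conviction, which is why `inf` appears in the table.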

In [None]:
rules_filtradas = rules[(rules['conviction'] > 3) & (rules['lift'] > 2)]

rules_filtradas



Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
0,(species_Gentoo),(island_Biscoe),0.357357,0.489489,0.357357,1.000000,2.042945,1.0,0.182435,inf,0.794393,0.730061,1.000000,0.865031
4,(species_Chinstrap),(island_Dream),0.204204,0.369369,0.204204,1.000000,2.707317,1.0,0.128777,inf,0.792453,0.552846,1.000000,0.776423
8,(culmen_length_mm_0.0),(species_Adelie),0.333333,0.438438,0.333333,1.000000,2.280822,1.0,0.187187,inf,0.842342,0.760274,1.000000,0.880137
11,(species_Gentoo),(culmen_depth_mm_0.0),0.357357,0.327327,0.312312,0.873950,2.669956,1.0,0.195339,5.336537,0.973266,0.838710,0.812613,0.914039
12,(culmen_depth_mm_0.0),(species_Gentoo),0.327327,0.357357,0.312312,0.954128,2.669956,1.0,0.195339,14.009610,0.929816,0.838710,0.928620,0.914039
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
617,"(culmen_length_mm_2.0, culmen_depth_mm_0.0, bo...","(island_Biscoe, species_Gentoo, flipper_length...",0.159159,0.330330,0.153153,0.962264,2.913036,1.0,0.100578,17.746246,0.781022,0.455357,0.943650,0.712950
618,"(culmen_length_mm_2.0, flipper_length_mm_2.0, ...","(island_Biscoe, species_Gentoo, body_mass_g_2.0)",0.156156,0.309309,0.153153,0.980769,3.170836,1.0,0.104853,35.915916,0.811318,0.490385,0.972157,0.737957
619,"(island_Biscoe, culmen_length_mm_2.0)","(culmen_depth_mm_0.0, flipper_length_mm_2.0, s...",0.201201,0.252252,0.153153,0.761194,3.017591,1.0,0.102400,3.131194,0.837019,0.510000,0.680633,0.684168
620,"(culmen_length_mm_2.0, species_Gentoo)","(culmen_depth_mm_0.0, island_Biscoe, flipper_l...",0.201201,0.252252,0.153153,0.761194,3.017591,1.0,0.102400,3.131194,0.837019,0.510000,0.680633,0.684168


f) To analyze the rules, we sort them by lift.

Many patterns can be read from the table. One is the relationship between species and island: every Gentoo penguin in the data lives on Biscoe island (confidence 1.0). Another is that male Gentoo penguins tend to fall in the highest bin for culmen length, flipper length, and body mass. These are only a few examples; a deeper analysis would surface more patterns.

In [None]:
pd.set_option("display.max_colwidth", None)

rules_filtradas_ordenadas = rules_filtradas.sort_values(by="lift", ascending=False)
rules_filtradas_ordenadas[["antecedents","consequents","support","confidence","lift","conviction"]].head(30)



Unnamed: 0,antecedents,consequents,support,confidence,lift,conviction
592,"(sex_MALE, species_Gentoo)","(culmen_length_mm_2.0, island_Biscoe, flipper_length_mm_2.0, body_mass_g_2.0)",0.15015,0.819672,4.264857,4.479661
570,"(culmen_length_mm_2.0, island_Biscoe, flipper_length_mm_2.0, body_mass_g_2.0)","(sex_MALE, species_Gentoo)",0.15015,0.78125,4.264857,3.73402
420,"(sex_MALE, species_Gentoo)","(island_Biscoe, culmen_length_mm_2.0, body_mass_g_2.0)",0.153153,0.836066,4.218331,4.890991
413,"(island_Biscoe, culmen_length_mm_2.0, body_mass_g_2.0)","(sex_MALE, species_Gentoo)",0.153153,0.772727,4.218331,3.593994
584,"(culmen_length_mm_2.0, species_Gentoo, body_mass_g_2.0)","(island_Biscoe, sex_MALE, flipper_length_mm_2.0)",0.15015,0.757576,4.204545,3.381757
579,"(island_Biscoe, culmen_length_mm_2.0, body_mass_g_2.0)","(sex_MALE, species_Gentoo, flipper_length_mm_2.0)",0.15015,0.757576,4.204545,3.381757
585,"(sex_MALE, species_Gentoo, flipper_length_mm_2.0)","(island_Biscoe, culmen_length_mm_2.0, body_mass_g_2.0)",0.15015,0.833333,4.204545,4.810811
580,"(island_Biscoe, sex_MALE, flipper_length_mm_2.0)","(culmen_length_mm_2.0, species_Gentoo, body_mass_g_2.0)",0.15015,0.833333,4.204545,4.810811
577,"(sex_MALE, island_Biscoe, species_Gentoo)","(culmen_length_mm_2.0, flipper_length_mm_2.0, body_mass_g_2.0)",0.15015,0.819672,4.199243,4.463008
543,"(sex_MALE, species_Gentoo)","(culmen_length_mm_2.0, flipper_length_mm_2.0, body_mass_g_2.0)",0.15015,0.819672,4.199243,4.463008


g) Going forward, this information can be used in several ways: someone looking for Gentoo penguins should go to Biscoe island; researchers could study whether something specific to that island explains why the males are larger; and the rules could help estimate the likely measurements of a penguin of that species.