# StaTDS Tutorial: Assoaction Rule Mining

Association rule mining is a type of unsupervised learning technique that focuses on uncovering interesting relationships, frequent patterns, or associations among sets of items in large databases. The goal is to identify rules that point to the co-occurrence of certain items within these databases. This tutorial presents a comparison of two well-known algorithms for mining association rules:Apriori and FP-Growth. The algorithms are compared in terms of runtime (the lower the better) for a varied set of datasets.

## Dependency Installation




Before we start, it's crucial to set up our development environment by installing the necessary dependencies. If you're using Google Colab, you only need to install StaTDS.

In [1]:
!pip install statds

Collecting statds
  Downloading statds-1.1.1-py3-none-any.whl (2.2 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.2/2.2 MB[0m [31m10.1 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting pandas>=2.0 (from statds)
  Downloading pandas-2.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.0 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m13.0/13.0 MB[0m [31m51.4 MB/s[0m eta [36m0:00:00[0m
Collecting tzdata>=2022.7 (from pandas>=2.0->statds)
  Downloading tzdata-2024.1-py2.py3-none-any.whl (345 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m345.4/345.4 kB[0m [31m29.1 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: tzdata, pandas, statds
  Attempting uninstall: pandas
    Found existing installation: pandas 1.5.3
    Uninstalling pandas-1.5.3:
      Successfully uninstalled pandas-1.5.3
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed

## Evaluating Classification Algorithms: A Structured Approach


The evaluation process of classification algorithms can be structured into the following steps to ensure a rigorous and meaningful comparison. This task has become essential due to the wide variety of available algorithms and the impact of their parameters on performance.

1. Selection of Datasets and Algorithms: We begin by jointly selecting datasets and algorithms. This choice is critical as it defines the context and expectations of our comparative study.

2. Training and Evaluation of Algorithms: Each algorithm is trained on the selected datasets. This step is crucial in understanding how each algorithm performs with different data. Here, we also determine which metrics to study, knowing whether these are intended to be maximized or minimized.

3. Collection and Analysis of Metrics: The results from each algorithm-dataset combination are collected. We then conduct a preliminary analysis to identify trends and notable behaviors. This analysis helps us formulate hypotheses about the performance and characteristics of the algorithms.

4. Statistical Analysis: We proceed with conducting statistical tests on the gathered results. This step is vital to determine the presence of significant statistical differences between the algorithms and to ensure that our conclusions are solidly grounded.

5. Conclusion and Presentation of Results: Finally, we conclude our study with the presentation of our findings. Here, we synthesize our research, highlight key differences, and provide data-based recommendations.

![StepsEvaluate](https://media.discordapp.net/attachments/1179836822440902738/1179836868834119710/StepsEvaluate.png?ex=657b3bb5&is=6568c6b5&hm=6e753d5fe91a04bb6d407b99acc6f9a5302f9c36c455f6d8a7a8183fd1368dfc&=&format=webp&quality=lossless)


Each of these steps will be illustrated and detailed below, using a case example. This approach will facilitate tracking and replication in similar situations. Through this structured flow, we aim to provide a clear and reproducible methodology for comparing classification algorithms.

### Select Datasets and Algorithms

In [2]:
import mlxtend

In [3]:
!wget https://cdn.discordapp.com/attachments/1179836822440902738/1184774217053507585/dataset_association_rule.zip
!unzip dataset_association_rule.zip

--2024-02-21 12:05:05--  https://cdn.discordapp.com/attachments/1179836822440902738/1184774217053507585/dataset_association_rule.zip
Resolving cdn.discordapp.com (cdn.discordapp.com)... 162.159.129.233, 162.159.134.233, 162.159.135.233, ...
Connecting to cdn.discordapp.com (cdn.discordapp.com)|162.159.129.233|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2126603 (2.0M) [application/zip]
Saving to: ‘dataset_association_rule.zip’


2024-02-21 12:05:06 (14.4 MB/s) - ‘dataset_association_rule.zip’ saved [2126603/2126603]

Archive:  dataset_association_rule.zip
   creating: dataset_association_rule/
  inflating: dataset_association_rule/BakeryProcess.csv  
  inflating: dataset_association_rule/BasketAnalysisProcess.csv  
  inflating: dataset_association_rule/CarProcess.csv  
  inflating: dataset_association_rule/GroceriesDatasetProcess.csv  
  inflating: dataset_association_rule/MarketOptimisationProcess.csv  
  inflating: dataset_association_rule/TicTacToeProces

In [4]:
import os
import pandas as pd
def process_datasets():
  path = "dataset_association_rule/"

  contenido = os.listdir(path)
  datasets = {}
  for fichero in contenido:
      if os.path.isfile(os.path.join(path, fichero)) and fichero.endswith('.csv'):
          df = pd.read_csv(path + fichero, index_col=0)
          datasets[fichero]=df

  return datasets


In [5]:
list_df = process_datasets()

In [9]:
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import fpgrowth
import time

results = {"Dataset": [],"Apriori":[], "FP-Growth": []}
for key in list_df.keys():
  df = list_df[key].astype(bool)
  frequent_itemsets_apriori = apriori(df, min_support=0.05, use_colnames=True)
  a = time.time()
  # Mostrar los conjuntos de ítems frecuentes
  b = time.time()
  print(f"Apriori: {b-a}", end=" | ")
  results["Apriori"].append(b-a)
  # Ejecutar el algoritmo FP-Growth
  a = time.time()
  frequent_itemsets_fp = fpgrowth(df, min_support=0.05, use_colnames=True)
  b = time.time()
  print(f"FP-Growth: {b-a}")
  results["FP-Growth"].append(b-a)
  results["Dataset"].append(key)
  # Mostrar los conjuntos de ítems frecuentes

results_df = pd.DataFrame(results)

  and should_run_async(code)


Apriori: 1.1920928955078125e-06 | FP-Growth: 0.08519220352172852
Apriori: 7.152557373046875e-07 | FP-Growth: 0.06537485122680664
Apriori: 7.152557373046875e-07 | FP-Growth: 0.011725187301635742
Apriori: 9.5367431640625e-07 | FP-Growth: 0.15550851821899414
Apriori: 7.152557373046875e-07 | FP-Growth: 0.13412714004516602
Apriori: 7.152557373046875e-07 | FP-Growth: 0.5480668544769287
Apriori: 1.1920928955078125e-06 | FP-Growth: 8.561946392059326
Apriori: 9.5367431640625e-07 | FP-Growth: 5.116408586502075
Apriori: 7.152557373046875e-07 | FP-Growth: 0.05773282051086426
Apriori: 1.1920928955078125e-06 | FP-Growth: 4.8218724727630615


In [10]:
results_df

  and should_run_async(code)


Unnamed: 0,Dataset,Apriori,FP-Growth
0,BakeryProcess.csv,1.192093e-06,0.085192
1,MarketOptimisationProcess.csv,7.152557e-07,0.065375
2,TitanicProcess.csv,7.152557e-07,0.011725
3,GroceriesDatasetProcess.csv,9.536743e-07,0.155509
4,BasketAnalysisProcess.csv,7.152557e-07,0.134127
5,TicTacToeProcess.csv,7.152557e-07,0.548067
6,Transactions70kProcess.csv,1.192093e-06,8.561946
7,Transactions50kProcess.csv,9.536743e-07,5.116409
8,CarProcess.csv,7.152557e-07,0.057733
9,Transactions30kProcess.csv,1.192093e-06,4.821872


### Statistical Analysis

In this scenario, the choice of statistical tests for multiple or pairwise comparisons depends on our specific objectives. However, before proceeding to these comparisons, it's crucial to assess the normality and homoscedasticity (equal variances) of our data.


![Steps Test Statistical](https://media.discordapp.net/attachments/1179836822440902738/1180048286271418418/flowchart-Test_1.png?ex=657c009b&is=65698b9b&hm=2a7003943fa9cd6ae996484f94ad55850850b141d9e21b80585481f4b0ad2090&=&format=webp&quality=lossless)


To evaluate these aspects, we will employ two key statistical tests: the D'Agostino-Pearson test and the Levene's test. The D'Agostino-Pearson test will help us determine whether our data distribution deviates from normality, while Levene's test will assess the homogeneity of variances across different groups. These tests are essential prerequisites that will ensure the validity and reliability of our subsequent statistical analyses and comparisons, providing a solid foundation for our findings.

In [11]:
results = results_df # Es una métrica a mimizar
criterion = True # Si es minimizar


  and should_run_async(code)


In [12]:
from statds.normality import d_agostino_pearson
from statds.homoscedasticity import levene_test

results_to_test = results.copy()
alpha = 0.05
columns = list(results_to_test.columns)
results_normality = []

for i in range(1, len(columns)):
    results_normality.append(d_agostino_pearson(results_to_test[columns[i]].to_numpy(), alpha))

statistic_list, p_value_list, cv_value_list, hypothesis_list = zip(*results_normality)

results_test = pd.DataFrame({"Algorithm": columns[1:], "Statistic": statistic_list, "p-value": p_value_list, "Results": hypothesis_list})
print(results_test)


statistic, p_value, rejected_value, hypothesis = levene_test(results_to_test, alpha, center='mean')
print(f"Statistic {statistic}, Rejected Value {rejected_value}, Hypothesis: {hypothesis}")

   Algorithm  Statistic   p-value  \
0    Apriori   3.971633  0.137268   
1  FP-Growth   5.030418  0.080846   

                                             Results  
0  Same distributions (fail to reject H0) with al...  
1  Same distributions (fail to reject H0) with al...  
Statistic 27.350576777760303, Rejected Value 5.3177, Hypothesis: Different distributions (reject H0) with alpha 0.05


  and should_run_async(code)


In [13]:
from statds.no_parametrics import binomial

statistic, rejected_value, p_value, hypothesis = binomial(results_to_test[columns[1:]], alpha)
print(hypothesis)
print(f"Statistic {statistic}, Rejected Value {rejected_value}, p-value {p_value}")

Different distributions (reject H0) with alpha 0.05
Statistic 10, Rejected Value None, p-value 0.0009765625


  and should_run_async(code)


### Conclusion

After conducting this comprehensive study, we can conclude that, at a significance level of 0.05 (alpha), these algorithms do exhibit significant differences overall. This nuanced insight highlights the importance of context and specific comparative analyses in understanding the relative strengths and weaknesses of different machine learning algorithms.