
# Association Rules: Apriori

Each row of this dataset is a shopping basket, and each column records whether a given item was purchased in it. The data can be used to analyze purchasing patterns, identify frequently bought items, or explore relationships between items (e.g., whether people who buy apples are also more likely to buy yogurt). The columns are:

* **Apple:** Whether apples were purchased.
* **Bread:** Whether bread was purchased.
* **Butter:** Whether butter was purchased.
* **Cheese:** Whether cheese was purchased.
* **Corn:** Whether corn was purchased.
* **Dill:** Whether dill was purchased.
* **Eggs:** Whether eggs were purchased.
* **Ice cream:** Whether ice cream was purchased.
* **Kidney Beans:** Whether kidney beans were purchased.
* **Milk:** Whether milk was purchased.
* **Nutmeg:** Whether nutmeg was purchased.
* **Onion:** Whether onions were purchased.
* **Sugar:** Whether sugar was purchased.
* **Unicorn:** While listed, it's highly unlikely that unicorns were actually purchased. This could be a placeholder, a joke entry, or a mislabeled item.
* **Yogurt:** Whether yogurt was purchased.
* **chocolate:** Whether chocolate was purchased (this column name is lowercase in the dataset).



In [None]:
import pandas as pd
groceries = pd.read_csv('https://raw.githubusercontent.com/lcbjrrr/quantai/refs/heads/main/datasets/basket_analysis.csv')
groceries.head()

Unnamed: 0,Apple,Bread,Butter,Cheese,Corn,Dill,Eggs,Ice cream,Kidney Beans,Milk,Nutmeg,Onion,Sugar,Unicorn,Yogurt,chocolate
0,False,True,False,False,True,True,False,True,False,False,False,False,True,False,True,True
1,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False
2,True,False,True,False,False,True,False,True,False,True,False,False,False,False,True,True
3,False,False,True,True,False,True,False,False,False,True,True,True,False,False,False,False
4,True,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False


Apriori is a classic data mining algorithm for frequent itemset mining and association rule learning over relational databases. It first identifies the individual items that are frequent, then extends them to larger and larger itemsets for as long as those itemsets still appear sufficiently often in the database. The frequent itemsets found by Apriori can then be used to derive association rules that highlight general trends in the database, with applications in domains such as market basket analysis.
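
To make the level-wise search concrete, here is a minimal pure-Python sketch of the idea. It is an illustration only, not mlxtend's implementation: the function name `frequent_itemsets` and the toy `baskets` list are assumptions made for this example and are not part of the grocery dataset.

```python
from itertools import combinations


def frequent_itemsets(baskets, min_support):
    """Level-wise (Apriori-style) search on a list of sets of item names.

    Returns a dict mapping each frequent frozenset to its support,
    i.e. the fraction of baskets that contain it.
    """
    n = len(baskets)

    def support(itemset):
        return sum(itemset <= basket for basket in baskets) / n

    # Level 1: keep the single items that are frequent on their own.
    items = sorted({item for basket in baskets for item in basket})
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = dict(current)
    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # Count step: keep only the candidates that reach min_support.
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


# Toy baskets (hypothetical data, chosen only to keep the example small).
baskets = [
    {"Bread", "Butter", "Milk"},
    {"Bread", "Butter"},
    {"Bread", "Milk"},
    {"Butter", "Milk"},
    {"Bread", "Butter", "Milk", "Eggs"},
]
result = frequent_itemsets(baskets, min_support=0.6)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(s, 2))
```

The prune step is what gives the algorithm its name: an itemset can only be frequent if all of its subsets are frequent, so most large combinations never need to be counted.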

Finding frequent item combinations in a dataset (such as grocery shopping baskets) and generating association rules from them takes two steps:

1.  **`apriori(...)`**: Finds itemsets appearing in at least 20% of the baskets.
2.  **`association_rules(...)`**: Creates rules like "If X, then Y" from those itemsets, keeping only rules with at least 45% confidence (meaning if someone buys X, there's at least a 45% chance they also buy Y).

Rules can be evaluated and filtered with the following metrics:
- **Support**: How often an item or combination appears, as a fraction of all baskets (support shows popularity).
- **Confidence(X → Y) = Support(X and Y) / Support(X)**: How often Y is bought given that X is bought (confidence shows the reliability of the "If X then Y" rule).
- **Lift(X → Y) = Confidence(X → Y) / Support(Y)**: How much more likely is Y given X, compared to Y's usual likelihood? (Lift > 1 means X and Y appear together more often than expected; lift = 1 means they are independent.)
- **Leverage(X → Y) = Support(X and Y) - (Support(X) * Support(Y))**: How much more often do X and Y appear together than expected if they were independent? (A value of 0 means independence; positive values mean a positive association.)
- **Conviction(X → Y) = (1 - Support(Y)) / (1 - Confidence(X → Y))**: How much more often would the rule fail if X and Y were independent than it actually fails? (Higher is better; 1 means independence, and it is infinite when confidence is 1.)
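
As a quick check on these formulas, the sketch below recomputes the metrics by hand for the Butter → Ice cream rule, using the three support values reported for that rule in the output further down. The variable names are chosen for this example only, and because the inputs are rounded, the results match the table only to a few decimal places.

```python
# Supports reported for the rule Butter -> Ice cream in the rules output.
support_x = 0.42042    # Support(Butter), the antecedent support
support_y = 0.41041    # Support(Ice cream), the consequent support
support_xy = 0.207207  # Support(Butter and Ice cream)

confidence = support_xy / support_x
lift = confidence / support_y
leverage = support_xy - support_x * support_y
conviction = (1 - support_y) / (1 - confidence)

print(f"confidence={confidence:.4f}")
print(f"lift={lift:.4f}")
print(f"leverage={leverage:.4f}")
print(f"conviction={conviction:.4f}")
```

A lift of about 1.2 says that baskets containing butter include ice cream roughly 20% more often than baskets in general.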


In [None]:
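from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

# Frequent itemsets: combinations present in at least 20% of the baskets
model = apriori(groceries, min_support=0.20, use_colnames=True)
# Rules with confidence >= 0.45; num_itemsets is documented as the number of
# transactions in the original data, so pass the row count of `groceries`
association_rules(model, metric="confidence", min_threshold=0.45, num_itemsets=len(groceries))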



Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
0,(Butter),(Ice cream),0.42042,0.41041,0.207207,0.492857,1.200889,1.0,0.034662,1.162571,0.288629,0.332263,0.139837,0.498868
1,(Ice cream),(Butter),0.41041,0.42042,0.207207,0.504878,1.200889,1.0,0.034662,1.170579,0.283728,0.332263,0.145722,0.498868
2,(Butter),(Kidney Beans),0.42042,0.408408,0.202202,0.480952,1.177626,1.0,0.030499,1.139764,0.260247,0.322684,0.122625,0.488025
3,(Kidney Beans),(Butter),0.408408,0.42042,0.202202,0.495098,1.177626,1.0,0.030499,1.147905,0.254963,0.322684,0.128848,0.488025
4,(Butter),(chocolate),0.42042,0.421421,0.202202,0.480952,1.141262,1.0,0.025028,1.114693,0.213564,0.316119,0.102892,0.480381
5,(chocolate),(Butter),0.421421,0.42042,0.202202,0.47981,1.141262,1.0,0.025028,1.114169,0.213933,0.316119,0.10247,0.480381
6,(Cheese),(Kidney Beans),0.404404,0.408408,0.2002,0.49505,1.212143,1.0,0.035038,1.171583,0.293849,0.326797,0.146454,0.492623
7,(Kidney Beans),(Cheese),0.408408,0.404404,0.2002,0.490196,1.212143,1.0,0.035038,1.168284,0.295838,0.326797,0.144043,0.492623
8,(Ice cream),(chocolate),0.41041,0.421421,0.202202,0.492683,1.169098,1.0,0.029246,1.140467,0.245323,0.321145,0.123167,0.486246
9,(chocolate),(Ice cream),0.421421,0.41041,0.202202,0.47981,1.169098,1.0,0.029246,1.133412,0.249991,0.321145,0.117708,0.486246


# Practice: Apriori

Analyze the credit history data and check whether the Apriori algorithm can identify any rule(s) in it.
- What level of support do you believe is necessary?
- And what do you think about confidence (the trustworthiness of a rule)?
- Compare these rules with the tree generated in the previous exercise.

data: `http://raw.githubusercontent.com/lcbjuk/ML/master/dados/RiscoCredito%20-%20okk.csv`

In [None]:
cred = pd.read_csv('http://raw.githubusercontent.com/lcbjuk/ML/master/dados/RiscoCredito%20-%20okk.csv')
cred.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Inadimplente      100 non-null    int64  
 1   Genero            100 non-null    int64  
 2   GrauEscolaridade  100 non-null    int64  
 3   Profissao         100 non-null    int64  
 4   Renda             100 non-null    float64
dtypes: float64(1), int64(4)
memory usage: 4.0 KB




In [None]:
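# Binary flag Mulher ("woman"): Genero 1 -> 0, Genero 2 -> 1
cred['Mulher'] = cred['Genero'].map({1: 0, 2: 1})
# One-hot encode the education level (GrauEscolaridade)
escolars = pd.get_dummies(cred['GrauEscolaridade'])
escolars.columns = ['fund_incomp', 'fund_comp', 'medio_incomp', 'sup_incomp', 'sup_comp', 'auto_didata']
# One-hot encode the profession (Profissao)
profs = pd.get_dummies(cred['Profissao'])
profs.columns = ['vendedor', 'corretor', 'atendente', 'youtuber', 'programador']
cred = pd.concat([cred, escolars, profs], axis=1)
# Binarize income (Renda): True when it is above 4000
cred['renda_alta'] = cred['Renda'] > 4000
# Drop the original columns now that they are encoded
cred = cred.drop(['Genero', 'GrauEscolaridade', 'Profissao', 'Renda'], axis=1)
cred.head(2)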



Unnamed: 0,Inadimplente,Mulher,fund_incomp,fund_comp,medio_incomp,sup_incomp,sup_comp,auto_didata,vendedor,corretor,atendente,youtuber,programador,renda_alta
0,1,0,False,False,True,False,False,False,True,False,False,False,False,False
1,0,1,False,False,False,False,True,False,False,False,False,False,True,False


In [None]:
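import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

# Frequent itemsets present in at least 25% of the rows
regras = apriori(cred, min_support=0.25, use_colnames=True)
# No metric or threshold given, so mlxtend's defaults apply (confidence >= 0.8)
association_rules(regras, num_itemsets=len(cred))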



Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
0,(medio_incomp),(Inadimplente),0.43,0.61,0.43,1.0,1.639344,1.0,0.1677,inf,0.684211,0.704918,1.0,0.852459
1,(atendente),(Inadimplente),0.27,0.61,0.27,1.0,1.639344,1.0,0.1053,inf,0.534247,0.442623,1.0,0.721311
2,"(Mulher, Inadimplente)",(medio_incomp),0.27,0.43,0.27,1.0,2.325581,1.0,0.1539,inf,0.780822,0.627907,1.0,0.813953
3,"(Mulher, medio_incomp)",(Inadimplente),0.27,0.61,0.27,1.0,1.639344,1.0,0.1053,inf,0.534247,0.442623,1.0,0.721311
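
Every rule in this output has a confidence of 1.0, which is why the conviction column shows `inf`: the denominator `1 - Confidence(X → Y)` in the conviction formula is zero. The sketch below illustrates this with a hypothetical helper, `conviction`, written for this example only (it is not mlxtend's own code). The first call uses the values of rule 0; the second uses a made-up confidence of 0.9 for contrast.

```python
import math


def conviction(consequent_support, confidence):
    """Conviction(X -> Y) = (1 - Support(Y)) / (1 - Confidence(X -> Y))."""
    if confidence >= 1.0:
        # The rule never fails in the data, so conviction is unbounded.
        return math.inf
    return (1 - consequent_support) / (1 - confidence)


# Rule 0 above: medio_incomp -> Inadimplente (consequent support 0.61).
rule0 = conviction(consequent_support=0.61, confidence=1.0)
# Hypothetical rule with the same consequent but confidence 0.9.
weaker = conviction(consequent_support=0.61, confidence=0.9)

print(rule0)
print(round(weaker, 2))
```

An infinite conviction on only 100 rows means the rule had no counterexamples in this sample, not that it can never fail.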


# Activity: Apriori

Using one of the following datasets, try to identify business rules with Apriori. Find rules whose consequent is a categorical variable, use that variable as the label for a Decision Tree, and compare the tree's rules with those found by Apriori.
Don't forget to write up your analyses and conclusions alongside your code (use the +Text button).

- **option 1**: `https://raw.githubusercontent.com/lcbjrrr/quantai/refs/heads/main/activities/fraud%20-%20easy.csv`
- **option 2**: `https://raw.githubusercontent.com/lcbjrrr/quantai/refs/heads/main/activities/fraud%20-%20auto.csv`

