# Frequent Itemset Mining: Apriori Alternatives

In this notebook, we will apply **apriori**, **FP-Growth**, and **maximal frequent itemset** methods on the congressional voting records dataset. You can learn more about this dataset here: https://archive.ics.uci.edu/ml/datasets/congressional+voting+records

### Import Required Libraries

In [1]:
import pandas as pd
import numpy as np
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules, fpgrowth, fpmax
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder
%matplotlib inline

### T1: Data Loading

The data is located here: `/dsa/data/DSA-8410/association-mining/house-vote/house-votes-84.csv`


In [2]:
df = pd.read_csv('/dsa/data/DSA-8410/association-mining/house-vote/house-votes-84.csv')
df.head()

Unnamed: 0,Class Name,handicapped-infants,water-project-cost-sharing,adoption-of-the-budget-resolution,physician-fee-freeze,el-salvador-aid,religious-groups-in-schools,anti-satellite-test-ban,aid-to-nicaraguan-contras,mx-missile,immigration,synfuels-corporation-cutback,education-spending,superfund-right-to-sue,crime,duty-free-exports,export-administration-act-south-africa
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y


### T2: Show the number of transactions

In [3]:
df.shape[0]

435

### T3: Transform the dataset to a binary incidence matrix for applying itemset mining methods

In [5]:
df_trans = pd.get_dummies(df)

df_trans.head()

Unnamed: 0,Class Name_democrat,Class Name_republican,handicapped-infants_?,handicapped-infants_n,handicapped-infants_y,water-project-cost-sharing_?,water-project-cost-sharing_n,water-project-cost-sharing_y,adoption-of-the-budget-resolution_?,adoption-of-the-budget-resolution_n,...,superfund-right-to-sue_y,crime_?,crime_n,crime_y,duty-free-exports_?,duty-free-exports_n,duty-free-exports_y,export-administration-act-south-africa_?,export-administration-act-south-africa_n,export-administration-act-south-africa_y
0,0,1,0,1,0,0,0,1,0,1,...,1,0,0,1,0,1,0,0,0,1
1,0,1,0,1,0,0,0,1,0,1,...,1,0,0,1,0,1,0,1,0,0
2,1,0,1,0,0,0,0,1,0,0,...,1,0,0,1,0,1,0,0,1,0
3,1,0,0,1,0,0,0,1,0,0,...,1,0,1,0,0,1,0,0,0,1
4,1,0,0,0,1,0,0,1,0,0,...,1,0,0,1,0,0,1,0,0,1
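
The encoding step above can be illustrated on a much smaller frame. The sketch below uses a hypothetical two-column toy dataset (the names `toy`, `party`, and `vote` are made up for illustration) to show that `pd.get_dummies` creates one indicator column per (column, value) pair, so a missing vote recorded as `?` becomes an item of its own, just like the `_?` columns in the output above. Passing `dtype=bool` is an optional choice made here to keep the matrix strictly True/False.

```python
import pandas as pd

# Hypothetical toy data: one row per member, one column per attribute.
toy = pd.DataFrame({
    "party": ["republican", "democrat", "democrat"],
    "vote": ["y", "n", "?"],
})

# One indicator column per (column, value) pair.
# dtype=bool keeps the incidence matrix as True/False values.
encoded = pd.get_dummies(toy, dtype=bool)

print(encoded.columns.tolist())
# Every row has exactly one True per original column.
print(encoded.sum(axis=1).tolist())
```

Because each original column contributes exactly one item per row, every transaction in the encoded matrix has the same length, which is worth keeping in mind when reading the itemset counts later on.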


### T4: Identify frequent patterns with the FP-Growth method. Use `min_support = 0.3`. Show the number of itemsets per itemset length.

In [8]:
freq_items_fp = fpgrowth(df_trans, min_support=0.3, use_colnames=True)

# Put the itemsets column first and record the size of each itemset
freq_items_fp = freq_items_fp.reindex(columns=['itemsets', 'support'])
freq_items_fp['length'] = freq_items_fp['itemsets'].apply(len)

print(f"Total number of frequent itemsets = {freq_items_fp.shape[0]}")

Total number of frequent itemsets = 973


In [13]:
# Number of itemsets per length

freq_items_fp['length'].value_counts()

3    313
4    270
2    174
5    134
6     43
1     33
7      6
Name: length, dtype: int64
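
To make the meaning of `min_support` and the per-length counts concrete, here is a minimal sketch that enumerates frequent itemsets by brute force on a hypothetical five-transaction dataset. This is not FP-Growth itself, which avoids enumerating every candidate. It only illustrates the definition: an itemset is frequent when the fraction of transactions containing it is at least `min_support`.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy transactions (not taken from the voting data).
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
]
min_support = 0.5

items = sorted(set().union(*transactions))
frequent = {}
for k in range(1, len(items) + 1):
    for candidate in combinations(items, k):
        itemset = frozenset(candidate)
        # Support = fraction of transactions that contain every item.
        support = sum(itemset <= t for t in transactions) / len(transactions)
        if support >= min_support:
            frequent[itemset] = support

# Number of frequent itemsets per itemset length.
per_length = Counter(len(itemset) for itemset in frequent)
print(dict(per_length))
```

In this toy case every single item and every pair is frequent, but the full triple appears in only two of five transactions and is dropped.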

### T5: Generate association rules from the frequent itemsets with a minimum confidence of 90%.

* Show the total number of rules

In [15]:
rules = association_rules(freq_items_fp, metric="confidence", min_threshold=0.9)
rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,"( duty-free-exports_n, crime_y)",( religious-groups-in-schools_y),0.432184,0.625287,0.390805,0.904255,1.446144,0.120565,3.913665
1,"( handicapped-infants_n, duty-free-exports_n)",( religious-groups-in-schools_y),0.335632,0.625287,0.314943,0.938356,1.50068,0.105076,6.078672
2,"( handicapped-infants_n, duty-free-exports_n)",( crime_y),0.335632,0.570115,0.305747,0.910959,1.597851,0.114398,4.82794
3,( el-salvador-aid_y),( religious-groups-in-schools_y),0.487356,0.625287,0.452874,0.929245,1.486109,0.148136,5.295939
4,( el-salvador-aid_y),( crime_y),0.487356,0.570115,0.445977,0.915094,1.605105,0.168128,5.063091
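
The metric columns in this table follow from the three support values alone. As a check, the sketch below recomputes confidence, lift, leverage, and conviction for rule 3 (`el-salvador-aid_y` → `religious-groups-in-schools_y`) from the supports printed above, using the standard definitions. Because the inputs are rounded to six decimals, the results should agree with the table only up to small rounding differences.

```python
# Support values copied from row 3 of the rules table.
antecedent_support = 0.487356   # support(el-salvador-aid_y)
consequent_support = 0.625287   # support(religious-groups-in-schools_y)
joint_support = 0.452874        # support of both together

# confidence(A -> C) = support(A and C) / support(A)
confidence = joint_support / antecedent_support
# lift(A -> C) = confidence(A -> C) / support(C)
lift = confidence / consequent_support
# leverage(A -> C) = support(A and C) - support(A) * support(C)
leverage = joint_support - antecedent_support * consequent_support
# conviction(A -> C) = (1 - support(C)) / (1 - confidence(A -> C))
conviction = (1 - consequent_support) / (1 - confidence)

print(f"confidence = {confidence:.4f}")
print(f"lift       = {lift:.4f}")
print(f"leverage   = {leverage:.4f}")
print(f"conviction = {conviction:.4f}")
```

A lift above 1 means the antecedent and consequent occur together more often than they would if they were independent.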


### T6: Identify the top 5 rules by confidence whose consequent is exactly `Class Name_democrat`. Do the same for rules whose consequent is exactly `Class Name_republican`.

* Iterate over these two subsets of rules and print only antecedents, consequents, and confidence.
* Based on these rules, characterize Democratic and Republican members of Congress.

In [34]:
# Work on a copy so the original `rules` frame is left unchanged,
# then convert the frozenset consequents to plain sets for comparison

rules_temp = rules.copy()

rules_temp['consequents'] = rules_temp['consequents'].apply(set)

In [46]:
# Find sets only containing string 'Class Name_democrat'

rules_dem = rules_temp[rules_temp['consequents'] == { 'Class Name_democrat'}]

dem_top_5 = rules_dem.sort_values(by=['confidence'], ascending=False).head()

# Printing

dem_top_5[['antecedents', 'consequents', 'confidence']]

Unnamed: 0,antecedents,consequents,confidence
1356,"( physician-fee-freeze_n, duty-free-exports_y...",{Class Name_democrat},1.0
2627,"( physician-fee-freeze_n, adoption-of-the-bud...",{Class Name_democrat},1.0
2675,"( physician-fee-freeze_n, adoption-of-the-bud...",{Class Name_democrat},1.0
2701,"( physician-fee-freeze_n, adoption-of-the-bud...",{Class Name_democrat},1.0
1401,"( physician-fee-freeze_n, duty-free-exports_y...",{Class Name_democrat},1.0


In [51]:
dem_top_5['antecedents'].values

array([frozenset({' physician-fee-freeze_n', ' duty-free-exports_y', ' adoption-of-the-budget-resolution_y', ' aid-to-nicaraguan-contras_y'}),
       frozenset({' physician-fee-freeze_n', ' adoption-of-the-budget-resolution_y', ' superfund-right-to-sue_n', ' aid-to-nicaraguan-contras_y', ' el-salvador-aid_n'}),
       frozenset({' physician-fee-freeze_n', ' adoption-of-the-budget-resolution_y', ' superfund-right-to-sue_n', ' el-salvador-aid_n', ' anti-satellite-test-ban_y'}),
       frozenset({' physician-fee-freeze_n', ' adoption-of-the-budget-resolution_y', ' superfund-right-to-sue_n', ' aid-to-nicaraguan-contras_y', ' el-salvador-aid_n', ' anti-satellite-test-ban_y'}),
       frozenset({' physician-fee-freeze_n', ' duty-free-exports_y', ' aid-to-nicaraguan-contras_y', ' el-salvador-aid_n'})],
      dtype=object)

In [47]:
# Find rules whose consequent is exactly {'Class Name_republican'}

rules_rep = rules_temp[rules_temp['consequents'] == {'Class Name_republican'}]

rep_top_5 = rules_rep.sort_values(by=['confidence'], ascending=False).head()

rep_top_5[['antecedents', 'consequents', 'confidence']]

Unnamed: 0,antecedents,consequents,confidence
666,"( physician-fee-freeze_y, synfuels-corporatio...",{Class Name_republican},0.978261
604,"( el-salvador-aid_y, adoption-of-the-budget-r...",{Class Name_republican},0.971631
623,"( mx-missile_n, adoption-of-the-budget-resolu...",{Class Name_republican},0.97037
608,"( crime_y, adoption-of-the-budget-resolution_...",{Class Name_republican},0.963768
595,"( adoption-of-the-budget-resolution_n, physic...",{Class Name_republican},0.958904


In [52]:
rep_top_5['antecedents'].values

array([frozenset({' physician-fee-freeze_y', ' synfuels-corporation-cutback_n'}),
       frozenset({' el-salvador-aid_y', ' adoption-of-the-budget-resolution_n', ' physician-fee-freeze_y'}),
       frozenset({' mx-missile_n', ' adoption-of-the-budget-resolution_n', ' physician-fee-freeze_y'}),
       frozenset({' crime_y', ' adoption-of-the-budget-resolution_n', ' physician-fee-freeze_y'}),
       frozenset({' adoption-of-the-budget-resolution_n', ' physician-fee-freeze_y'})],
      dtype=object)

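Based on the top 5 rules in each subset, Democrats characteristically vote against the physician fee freeze (`physician-fee-freeze_n` appears in all five antecedents) and for adopting the budget resolution (four of five). They also tend to vote for aid to the Nicaraguan contras, against El Salvador aid, and against the superfund right to sue. Each of these five rules has a confidence of 1.0.

Republicans characteristically vote for the physician fee freeze (`physician-fee-freeze_y` appears in all five antecedents) and against adopting the budget resolution (four of five). The confidence of these rules ranges from about 0.959 to 0.978.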
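
The task also asks to iterate over the two rule subsets and print only the antecedents, consequents, and confidence. A minimal sketch of that loop is shown below in plain Python. It assumes the rules are held as a list of dictionaries rather than the mlxtend DataFrame, the helper `top_rules` is hypothetical, and the three example rules are copied from the outputs above with the leading spaces in the item names stripped.

```python
# Three rules copied from the outputs above (a stand-in for the full table).
rules = [
    {"antecedents": frozenset({"adoption-of-the-budget-resolution_n",
                               "physician-fee-freeze_y"}),
     "consequents": frozenset({"Class Name_republican"}),
     "confidence": 0.958904},
    {"antecedents": frozenset({"physician-fee-freeze_y",
                               "synfuels-corporation-cutback_n"}),
     "consequents": frozenset({"Class Name_republican"}),
     "confidence": 0.978261},
    {"antecedents": frozenset({"physician-fee-freeze_n",
                               "duty-free-exports_y",
                               "adoption-of-the-budget-resolution_y",
                               "aid-to-nicaraguan-contras_y"}),
     "consequents": frozenset({"Class Name_democrat"}),
     "confidence": 1.0},
]


def top_rules(rules, target, n=5):
    """Return up to n rules whose consequent is exactly {target}, best first."""
    matching = [r for r in rules if r["consequents"] == frozenset({target})]
    return sorted(matching, key=lambda r: r["confidence"], reverse=True)[:n]


for target in ("Class Name_democrat", "Class Name_republican"):
    for rule in top_rules(rules, target):
        antecedents = sorted(rule["antecedents"])
        consequents = sorted(rule["consequents"])
        print(f'{antecedents} -> {consequents} '
              f'(confidence = {rule["confidence"]:.3f})')
```

Comparing the consequent against a one-element set, rather than testing membership, excludes rules that predict the party label together with other items.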

### T7: Show the number of maximal frequent itemsets for min_support = 0.3

In [17]:
max_patterns = fpmax(df_trans, min_support=0.3, use_colnames=True)

# Put the itemsets column first for readability and record the size of each itemset
max_patterns = max_patterns.reindex(columns=['itemsets', 'support'])
max_patterns['length'] = max_patterns['itemsets'].apply(len)

print(f"Total number of maximal frequent patterns = {max_patterns.shape[0]}")
max_patterns

Total number of maximal frequent patterns = 179
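
This count is much smaller than the 973 frequent itemsets found earlier because a frequent itemset is maximal only if none of its proper supersets is also frequent. The sketch below applies that definition directly to a hypothetical list of frequent itemsets. It is a naive pairwise filter for illustration and not the algorithm `fpmax` uses internally.

```python
# Hypothetical frequent itemsets: every single item and every pair is
# frequent, but the triple {a, b, c} is not.
frequent = [
    frozenset(s)
    for s in (["a"], ["b"], ["c"], ["a", "b"], ["a", "c"], ["b", "c"])
]

# Keep an itemset only if no other frequent itemset strictly contains it.
# The `<` operator on sets tests for a proper subset.
maximal = [s for s in frequent if not any(s < other for other in frequent)]

print([sorted(s) for s in maximal])
```

Every frequent itemset is a subset of some maximal one, so the maximal itemsets summarize which itemsets are frequent, although they do not preserve the supports of the subsets.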


Unnamed: 0,itemsets,support,length
0,( synfuels-corporation-cutback_y),0.344828,1
1,"( education-spending_n, religious-groups-in-s...",0.301149,2
2,"( adoption-of-the-budget-resolution_y, religi...",0.303448,2
3,"( physician-fee-freeze_n, religious-groups-in...",0.301149,3
4,"( aid-to-nicaraguan-contras_y, physician-fee-...",0.303448,4
...,...,...,...
174,"( crime_y, export-administration-act-south-af...",0.340230,2
175,"( crime_y, religious-groups-in-schools_y, sy...",0.328736,3
176,"( adoption-of-the-budget-resolution_y, synfue...",0.305747,2
177,"( export-administration-act-south-africa_y, s...",0.381609,2
