# CSC172 Association Mining - Association Rule Mining for Cardiovascular Comorbidity and Risk Factor Assessment

Implement the Apriori algorithm to find frequent itemsets and generate strong association rules for cardiovascular comorbidity patterns. The project covers the complete association rule mining workflow: data preprocessing, frequent itemset mining, rule generation, parameter tuning, and evaluation.

### Rules are assessed using the following criteria:

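1. **Support:** Fraction of all transactions that contain every item in both X and Y.
2. **Confidence:** Conditional probability that a transaction containing X also contains Y, computed as support(X and Y) / support(X).
3. **Lift:** Ratio of the observed support of X and Y together to the support expected if X and Y were independent, computed as confidence(X $\implies$ Y) / support(Y). A lift above 1.0 indicates a positive association.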
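
As a rough illustration of the three criteria above, the sketch below computes them by hand on a handful of toy transactions. The item names, the transactions, and the helper names `support` and `rule_metrics` are made up for this example and are not taken from the BRFSS data or from any library.

```python
# Toy transactions: each set lists the conditions recorded for one patient.
# These values are invented for illustration only.
transactions = [
    {"Smoking", "HeartDisease"},
    {"Smoking", "Diabetic", "HeartDisease"},
    {"Diabetic"},
    {"Smoking"},
    {"Diabetic", "HeartDisease"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence, lift) for the rule antecedent => consequent."""
    sup_xy = support(set(antecedent) | set(consequent), transactions)
    sup_x = support(antecedent, transactions)
    sup_y = support(consequent, transactions)
    confidence = sup_xy / sup_x   # P(Y | X)
    lift = confidence / sup_y     # > 1 suggests a positive association
    return sup_xy, confidence, lift


sup, conf, lift = rule_metrics({"Smoking"}, {"HeartDisease"}, transactions)
print(f"support={sup:.2f} confidence={conf:.2f} lift={lift:.2f}")
# support=0.40 confidence=0.67 lift=1.11
```

Here two of the five patients have both items (support 0.40), two of the three smokers have heart disease (confidence 0.67), and that is slightly more than the 0.60 base rate of heart disease (lift 1.11).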

### Architecture Design
| Step | Action | Description |
| :---: | :--- | :--- |
| 1 | Input Data | Raw patient records from the CDC BRFSS heart disease dataset (Kaggle). |
| 2 | Preprocessing | Discretization and one-hot encoding (e.g., AgeCategory, BMI into Obese/Non-Obese, SleepTime into Short/Normal/Long) |
| 3 | Transactional Data | Output: Sparse DataFrame (each column is a specific medical feature/risk level). |
| 4 | Apriori Algorithm | Find Frequent Itemsets using a tuned minimum Support threshold. |
| 5 | Rule Generation | Generate rules (A $\implies$ B) based on a minimum Confidence threshold. |
| 6 | Evaluation & Filtering | Calculate Lift, Conviction, and Leverage. Filter for rules where Lift $>$ 1.0 (positive correlation). |
| 7 | Final Output | Ranked list of comorbidity rules, network graph visualization, item frequency bar charts, transaction length distribution, and co-occurrence heatmap |
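
Steps 4 to 6 of the table can be sketched in plain Python to show the level-wise idea behind Apriori. This is a minimal illustration under stated assumptions, not the mlxtend implementation used in the notebook: the function names `apriori_sketch` and `rules_sketch`, the toy transactions, and the thresholds are all hypothetical.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Level-wise search: return {frozenset: support} for all frequent itemsets."""
    n = len(transactions)

    def sup(itemset):
        return sum(itemset <= t for t in transactions) / n

    level = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while level:
        # Keep only candidates that meet the support threshold.
        survivors = {}
        for cand in level:
            s = sup(cand)
            if s >= min_support:
                survivors[cand] = s
        frequent.update(survivors)
        k += 1
        # Join step: unions of two survivors that contain exactly k items.
        candidates = {a | b for a in survivors for b in survivors if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        level = {
            c for c in candidates
            if all(frozenset(s) in survivors for s in combinations(c, k - 1))
        }
    return frequent


def rules_sketch(frequent, min_confidence, min_lift=1.0):
    """Build rules X => Y from frequent itemsets, keeping those with lift > min_lift."""
    rules = []
    for itemset, sup_xy in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in combinations(itemset, r):
                x = frozenset(antecedent)
                y = itemset - x
                confidence = sup_xy / frequent[x]
                lift = confidence / frequent[y]
                if confidence >= min_confidence and lift > min_lift:
                    rules.append((x, y, sup_xy, confidence, lift))
    # Rank the strongest associations first.
    return sorted(rules, key=lambda rule: rule[4], reverse=True)


# Invented toy data, not drawn from the BRFSS dataset.
transactions = [
    {"Obese", "Diabetic", "HeartDisease"},
    {"Obese", "Diabetic"},
    {"Smoking", "HeartDisease"},
    {"Obese", "Diabetic", "HeartDisease"},
    {"Smoking"},
]

frequent = apriori_sketch(transactions, min_support=0.3)
rules = rules_sketch(frequent, min_confidence=0.7)
for x, y, s, c, l in rules:
    print(f"{sorted(x)} => {sorted(y)}  support={s:.2f} confidence={c:.2f} lift={l:.2f}")
```

The prune step relies on the downward-closure property: an itemset can only be frequent if all of its subsets are, which is also why `frequent[x]` and `frequent[y]` are always available when rules are generated.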

In [16]:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx

In [None]:
# Input dataset from Kaggle

df = pd.read_csv(r'C:\Users\Admin\Documents\heart_2020_cleaned.csv\heart_2020_cleaned.csv')

print(df.head())
print(df.shape)


  HeartDisease    BMI Smoking AlcoholDrinking Stroke  PhysicalHealth  \
0           No  16.60     Yes              No     No             3.0   
1           No  20.34      No              No    Yes             0.0   
2           No  26.58     Yes              No     No            20.0   
3           No  24.21      No              No     No             0.0   
4           No  23.71      No              No     No            28.0   

   MentalHealth DiffWalking     Sex  AgeCategory   Race Diabetic  \
0          30.0          No  Female        55-59  White      Yes   
1           0.0          No  Female  80 or older  White       No   
2          30.0          No    Male        65-69  White      Yes   
3           0.0          No  Female        75-79  White       No   
4           0.0         Yes  Female        40-44  White       No   

  PhysicalActivity  GenHealth  SleepTime Asthma KidneyDisease SkinCancer  
0              Yes  Very good        5.0    Yes            No        Yes  
1       

In [None]:
# Preprocess dataset for Apriori

# 1. Handle missing values
df = df.dropna()

# 2. Discretize numeric attributes (BMI, PhysicalHealth, MentalHealth, SleepTime) using pd.cut()
# BMI classes; right=False makes each bin [low, high), so a BMI of 25.0 is Overweight and 30.0 is Obese
df['BMI_cat'] = pd.cut(df['BMI'], bins=[0, 18.5, 25, 30, 100], right=False,
                       labels=['Underweight', 'Normal', 'Overweight', 'Obese'])

# Number of physically unhealthy days in the past 30 days (0-30).
# include_lowest=True keeps 0 in the 'Low' bin; by default pd.cut maps it to NaN.
df['PhysicalHealth_cat'] = pd.cut(df['PhysicalHealth'], bins=[0, 3, 7, 30], include_lowest=True,
                                  labels=['Low', 'Moderate', 'High'])

# Number of mentally unhealthy days in the past 30 days (0-30); 0 is kept in 'Low' as above
df['MentalHealth_cat'] = pd.cut(df['MentalHealth'], bins=[0, 3, 7, 30], include_lowest=True,
                                labels=['Low', 'Moderate', 'High'])

# Average number of hours of sleep per day
df['SleepTime_cat'] = pd.cut(df['SleepTime'], bins=[0, 4, 7, 9, 24],
                             labels=['Very Short', 'Short', 'Normal', 'Very Long'])

print(df.head())

Empty DataFrame
Columns: [HeartDisease, BMI, Smoking, AlcoholDrinking, Stroke, PhysicalHealth, MentalHealth, DiffWalking, Sex, AgeCategory, Race, Diabetic, PhysicalActivity, GenHealth, SleepTime, Asthma, KidneyDisease, SkinCancer, BMI_cat, PhysicalHealth_cat, MentalHealth_cat, SleepTime_cat]
Index: []

[0 rows x 22 columns]
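
When binning with `pd.cut()`, note that intervals are right-closed by default, so a value equal to the lowest bin edge falls outside the first bin and becomes NaN. The sketch below shows this on a few toy values standing in for `PhysicalHealth` (the values are invented), and then one-hot encodes the result into the boolean column layout that mlxtend's `apriori` works with. The `prefix` naming is an illustrative choice, not necessarily the one used later in the notebook.

```python
import pandas as pd

# Toy values standing in for the PhysicalHealth column (days, 0-30).
days = pd.Series([0, 2, 5, 30], name="PhysicalHealth")
bins = [0, 3, 7, 30]
labels = ["Low", "Moderate", "High"]

# Default behaviour: bins are (0, 3], (3, 7], (7, 30], so 0 is not covered.
default_cut = pd.cut(days, bins=bins, labels=labels)
print(default_cut.isna().sum())   # 1

# include_lowest=True closes the first bin on the left, so 0 lands in 'Low'.
inclusive_cut = pd.cut(days, bins=bins, labels=labels, include_lowest=True)
print(inclusive_cut.tolist())     # ['Low', 'Low', 'Moderate', 'High']

# One-hot encode into boolean columns, one per category.
onehot = pd.get_dummies(inclusive_cut, prefix="PhysicalHealth").astype(bool)
print(onehot.columns.tolist())
# ['PhysicalHealth_Low', 'PhysicalHealth_Moderate', 'PhysicalHealth_High']
```

Any NaN produced by the default binning would be silently removed by a later `dropna()` call, so it is worth checking `isna().sum()` on each new category column before dropping rows.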
