# Preprocessing - Association Rule Mining for Cardiovascular Comorbidity and Risk Factor Assessment

Preprocessing prepares the raw patient records for association rule mining. It consists of the following steps:

1. Handling Missing Values: Missing or incomplete records are either imputed using appropriate statistical methods (e.g., mean, median, or mode) or removed if necessary, to prevent bias in the analysis.

2. Discretization (Binning): Continuous variables such as BMI, PhysicalHealth, MentalHealth, and SleepTime are transformed into categorical bins. This step converts numerical ranges into discrete categories (e.g., low, medium, high), which is required for association rule mining algorithms like Apriori that work on categorical data.

3. One-Hot Encoding: Categorical features are converted into binary vectors using one-hot encoding. This ensures that each category is represented as a separate column with values of 0 or 1, making the data suitable for mining frequent itemsets.

4. Final Dataset Verification: After preprocessing, the dataset is checked to confirm that all variables are properly discretized or encoded, and that it contains no missing or inconsistent values. This ensures reliable and interpretable association rules.
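
Step 4 can be sketched as a small helper. This is only an illustration, not the notebook's own check: the function name `verify_encoded` and the toy frame are hypothetical, and the sketch assumes the encoded frame should contain nothing but Boolean indicator columns with no missing values.

```python
import pandas as pd


def verify_encoded(encoded: pd.DataFrame) -> None:
    """Raise ValueError if the frame is not ready for frequent-itemset mining."""
    # Any remaining NaN means a value was neither imputed nor dropped.
    n_missing = int(encoded.isna().sum().sum())
    if n_missing:
        raise ValueError(f"{n_missing} missing values remain")
    # Every column should be a Boolean indicator after one-hot encoding.
    non_bool = [col for col, dtype in encoded.dtypes.items()
                if not pd.api.types.is_bool_dtype(dtype)]
    if non_bool:
        raise ValueError(f"non-Boolean columns: {non_bool}")


# Toy frame standing in for the encoded dataset (values are made up).
toy = pd.get_dummies(pd.DataFrame({"Smoking": ["Yes", "No", "Yes"]}), dtype=bool)
verify_encoded(toy)  # passes silently
```

Calling the helper right before mining makes a failed preprocessing step visible early instead of surfacing as an error inside the mining library.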

### Rules are assessed using the following criteria:

1. **Support:** Fraction of all transactions that contain every item of both X and Y.
2. **Confidence:** Probability that transactions with X also include Y.
3. **Lift:** The ratio of observed support to that expected if X and Y were independent.
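
The three criteria can be worked through by hand on a tiny example. The transactions below are made up for illustration and are not drawn from the dataset; the helper name `support` is likewise hypothetical.

```python
# Toy transactions (made-up items, not taken from the dataset).
transactions = [
    {"Smoking_Yes", "HeartDisease_Yes"},
    {"Smoking_Yes", "HeartDisease_Yes"},
    {"Smoking_Yes"},
    {"HeartDisease_Yes"},
    {"Smoking_No"},
]


def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


X = {"Smoking_Yes"}
Y = {"HeartDisease_Yes"}

supp_xy = support(X | Y)            # 2 of 5 transactions contain both X and Y
confidence = supp_xy / support(X)   # estimate of P(Y | X)
lift = confidence / support(Y)      # same as supp_xy / (supp(X) * supp(Y))

print(f"support={supp_xy:.2f} confidence={confidence:.2f} lift={lift:.2f}")
# prints: support=0.40 confidence=0.67 lift=1.11
```

A lift above 1 means X and Y occur together more often than independence would predict; a lift below 1 means less often.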

### Architecture Design
| Step | Action | Description |
| :---: | :--- | :--- |
| 1 | Input Data | Raw patient records from the CDC BRFSS heart disease dataset (Kaggle). |
| 2 | Preprocessing | Discretization and one-hot encoding (e.g., BMI into Underweight/Normal/Overweight/Obese, SleepTime into Very Short/Short/Normal/Very Long; AgeCategory is already categorical). |
| 3 | Transactional Data | Output: one-hot encoded Boolean DataFrame (each column is a specific medical feature/risk level). |
| 4 | Apriori Algorithm | Find Frequent Itemsets using a tuned minimum Support threshold. |
| 5 | Rule Generation | Generate rules (A $\implies$ B) based on a minimum Confidence threshold. |
| 6 | Evaluation & Filtering | Calculate Lift, Conviction, and Leverage. Filter for rules where Lift $>$ 1.0 (positive correlation). |
| 7 | Final Output | Ranked list of comorbidity rules, network graph visualization, item frequency bar charts, transaction length distribution, and co-occurrence heatmap |
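
Step 6 mentions Leverage and Conviction without defining them. The sketch below uses their commonly cited definitions, leverage = supp(A, B) - supp(A) * supp(B) and conviction = (1 - supp(B)) / (1 - confidence), and then applies the Lift > 1.0 filter. The function names and the candidate rules with their support values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math


def leverage(supp_ab, supp_a, supp_b):
    """Observed joint support minus the support expected under independence."""
    return supp_ab - supp_a * supp_b


def conviction(supp_ab, supp_a, supp_b):
    """(1 - supp(B)) / (1 - confidence); infinite when confidence is 1."""
    conf = supp_ab / supp_a
    return math.inf if conf == 1.0 else (1.0 - supp_b) / (1.0 - conf)


# Hypothetical candidate rules: (antecedent, consequent, supp(A, B), supp(A), supp(B)).
candidates = [
    ("A", "B", 0.02, 0.04, 0.09),
    ("C", "D", 0.05, 0.48, 0.13),
]

kept = []
for a, b, s_ab, s_a, s_b in candidates:
    lift = s_ab / (s_a * s_b)
    if lift > 1.0:  # keep only positively associated rules
        kept.append((a, b, lift, leverage(s_ab, s_a, s_b), conviction(s_ab, s_a, s_b)))

kept.sort(key=lambda rule: rule[2], reverse=True)  # rank the surviving rules by lift
print(kept)
```

Here only the rule A $\implies$ B survives, because the second candidate has a lift below 1.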

In [2]:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx

## 1. Input dataset

In [9]:
# Input dataset from Kaggle

df = pd.read_csv(r'C:\Users\Admin\Documents\dataset_association_mining\heart_2020_cleaned.csv')

print(df.head())
print(df.shape)

  HeartDisease    BMI Smoking AlcoholDrinking Stroke  PhysicalHealth  \
0           No  16.60     Yes              No     No             3.0   
1           No  20.34      No              No    Yes             0.0   
2           No  26.58     Yes              No     No            20.0   
3           No  24.21      No              No     No             0.0   
4           No  23.71      No              No     No            28.0   

   MentalHealth DiffWalking     Sex  AgeCategory   Race Diabetic  \
0          30.0          No  Female        55-59  White      Yes   
1           0.0          No  Female  80 or older  White       No   
2          30.0          No    Male        65-69  White      Yes   
3           0.0          No  Female        75-79  White       No   
4           0.0         Yes  Female        40-44  White       No   

  PhysicalActivity  GenHealth  SleepTime Asthma KidneyDisease SkinCancer  
0              Yes  Very good        5.0    Yes            No        Yes  
1       

## 2. Descriptive statistics of numerical columns

In [10]:
print(df[['BMI','PhysicalHealth','MentalHealth','SleepTime']].describe())


                 BMI  PhysicalHealth   MentalHealth      SleepTime
count  319795.000000    319795.00000  319795.000000  319795.000000
mean       28.325399         3.37171       3.898366       7.097075
std         6.356100         7.95085       7.955235       1.436007
min        12.020000         0.00000       0.000000       1.000000
25%        24.030000         0.00000       0.000000       6.000000
50%        27.340000         0.00000       0.000000       7.000000
75%        31.420000         2.00000       3.000000       8.000000
max        94.850000        30.00000      30.000000      24.000000


## 3. Handle missing values and discretize numerical columns

In [11]:
# Handle missing values

print(df.isna().sum())  # shows how many NaNs in each column
df = df.dropna() # drop rows with NaN values

HeartDisease        0
BMI                 0
Smoking             0
AlcoholDrinking     0
Stroke              0
PhysicalHealth      0
MentalHealth        0
DiffWalking         0
Sex                 0
AgeCategory         0
Race                0
Diabetic            0
PhysicalActivity    0
GenHealth           0
SleepTime           0
Asthma              0
KidneyDisease       0
SkinCancer          0
dtype: int64
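
Every column reports zero missing values, so `dropna()` removes nothing here. For a dataset where dropping rows would discard too many records, the imputation option mentioned in the preprocessing steps could look like the sketch below. The toy frame and the choice of median for the numeric column and mode for the categorical column are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps (made-up values; the real dataset has no NaNs).
toy = pd.DataFrame({
    "BMI": [22.0, np.nan, 31.0, 27.0],
    "Smoking": ["Yes", "No", None, "No"],
})

# Numeric column: fill with the median, which is robust to outliers.
toy["BMI"] = toy["BMI"].fillna(toy["BMI"].median())

# Categorical column: fill with the most frequent value (the mode).
toy["Smoking"] = toy["Smoking"].fillna(toy["Smoking"].mode().iloc[0])

print(toy)
```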


In [12]:
# Discretize the numeric attributes (BMI, PhysicalHealth, MentalHealth, SleepTime) using pd.cut().
# Bins are right-closed by default, so a value equal to an upper edge falls in the lower bin.

# BMI: classify into four weight categories
df['BMI_cat'] = pd.cut(df['BMI'], bins=[0, 18.5, 25, 30, 100], labels=['Underweight', 'Normal', 'Overweight', 'Obese'])

# PhysicalHealth: number of physically unhealthy days (0 to 30)
# The lowest edge is -1 so that 0 days falls in the 'None' category
df['PhysicalHealth_cat'] = pd.cut(df['PhysicalHealth'], bins=[-1, 0, 3, 7, 30], labels=['None', 'Low', 'Moderate', 'High'])

# MentalHealth: number of mentally unhealthy days (0 to 30)
# The lowest edge is -1 so that 0 days falls in the 'None' category
df['MentalHealth_cat'] = pd.cut(df['MentalHealth'], bins=[-1, 0, 3, 7, 30], labels=['None', 'Low', 'Moderate', 'High'])

# SleepTime: number of hours a person sleeps
df['SleepTime_cat'] = pd.cut(df['SleepTime'], bins=[0, 4, 7, 9, 24], labels=['Very Short', 'Short', 'Normal', 'Very Long'])

print(df.head())

  HeartDisease    BMI Smoking AlcoholDrinking Stroke  PhysicalHealth  \
0           No  16.60     Yes              No     No             3.0   
1           No  20.34      No              No    Yes             0.0   
2           No  26.58     Yes              No     No            20.0   
3           No  24.21      No              No     No             0.0   
4           No  23.71      No              No     No            28.0   

   MentalHealth DiffWalking     Sex  AgeCategory  ... PhysicalActivity  \
0          30.0          No  Female        55-59  ...              Yes   
1           0.0          No  Female  80 or older  ...              Yes   
2          30.0          No    Male        65-69  ...              Yes   
3           0.0          No  Female        75-79  ...               No   
4           0.0         Yes  Female        40-44  ...              Yes   

   GenHealth SleepTime Asthma  KidneyDisease SkinCancer      BMI_cat  \
0  Very good       5.0    Yes             No      
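
One detail of `pd.cut()` is worth checking against the intended categories: by default the bins are closed on the right, so a BMI of exactly 18.5, 25, or 30 lands in the lower category. The conventional BMI cut-offs place those values in the upper category, which corresponds to `right=False`. The sketch below compares both settings on the three boundary values; it illustrates the behavior and does not change the notebook's cell.

```python
import pandas as pd

bmi = pd.Series([18.5, 25.0, 30.0])
bins = [0, 18.5, 25, 30, 100]
labels = ["Underweight", "Normal", "Overweight", "Obese"]

# Default right=True: intervals are (0, 18.5], (18.5, 25], (25, 30], (30, 100].
right_closed = pd.cut(bmi, bins=bins, labels=labels)

# right=False: intervals are [0, 18.5), [18.5, 25), [25, 30), [30, 100).
left_closed = pd.cut(bmi, bins=bins, labels=labels, right=False)

print(pd.DataFrame({"BMI": bmi, "right=True": right_closed, "right=False": left_closed}))
```

With `right=True` the three values map to Underweight, Normal, and Overweight; with `right=False` they map to Normal, Overweight, and Obese.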

## 4. Transactional Data
Output: a one-hot encoded Boolean DataFrame in which each column is a specific medical feature/risk level.

In [13]:
# Select the categorical columns
categorical_cols = ['HeartDisease', 'BMI_cat', 'Smoking', 'AlcoholDrinking', 'Stroke', 'PhysicalHealth_cat', 'MentalHealth_cat', 
                    'DiffWalking', 'Sex', 'AgeCategory', 'Race', 'Diabetic', 'PhysicalActivity', 'GenHealth', 'SleepTime_cat', 
                    'Asthma', 'KidneyDisease', 'SkinCancer']

df_categorical = df[categorical_cols]

# One-hot encode the categorical variables
df_encoded = pd.get_dummies(df_categorical)
print('\n Dataframe after performing One-Hot Encoding:')
display(df_encoded)


 Dataframe after performing One-Hot Encoding:


| | HeartDisease_No | HeartDisease_Yes | BMI_cat_Underweight | BMI_cat_Normal | BMI_cat_Overweight | BMI_cat_Obese | Smoking_No | Smoking_Yes | AlcoholDrinking_No | AlcoholDrinking_Yes | ... | SleepTime_cat_Very Short | SleepTime_cat_Short | SleepTime_cat_Normal | SleepTime_cat_Very Long | Asthma_No | Asthma_Yes | KidneyDisease_No | KidneyDisease_Yes | SkinCancer_No | SkinCancer_Yes |
| ---: | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | True | False | True | False | False | False | False | True | True | False | ... | False | True | False | False | False | True | True | False | False | True |
| 1 | True | False | False | True | False | False | True | False | True | False | ... | False | True | False | False | True | False | True | False | True | False |
| 2 | True | False | False | False | True | False | False | True | True | False | ... | False | False | True | False | False | True | True | False | True | False |
| 3 | True | False | False | True | False | False | True | False | True | False | ... | False | True | False | False | True | False | True | False | False | True |
| 4 | True | False | False | True | False | False | True | False | True | False | ... | False | False | True | False | True | False | True | False | True | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 319790 | False | True | False | False | True | False | False | True | True | False | ... | False | True | False | False | False | True | True | False | True | False |
| 319791 | True | False | False | False | True | False | False | True | True | False | ... | False | True | False | False | False | True | True | False | True | False |
| 319792 | True | False | False | True | False | False | True | False | True | False | ... | False | True | False | False | True | False | True | False | True | False |
| 319793 | True | False | False | False | False | True | True | False | True | False | ... | False | False | False | True | True | False | True | False | True | False |
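
The column naming in this output follows the `<column>_<value>` pattern that `pd.get_dummies()` produces. The sketch below reproduces it on a made-up two-column frame, checks that every source column contributes exactly one True per row, and reads one row back as a transaction. Passing `dtype=bool` is an assumption added here so the result is Boolean regardless of the installed pandas version.

```python
import pandas as pd

# Toy frame (made-up values) standing in for the categorical columns.
toy = pd.DataFrame({
    "Smoking": ["Yes", "No", "No"],
    "GenHealth": ["Good", "Fair", "Good"],
})

# dtype=bool keeps the indicator columns Boolean regardless of pandas version.
encoded = pd.get_dummies(toy, dtype=bool)
print(list(encoded.columns))

# Each source column contributes exactly one True per row.
for col in toy.columns:
    indicators = encoded.filter(regex=f"^{col}_")
    assert (indicators.sum(axis=1) == 1).all()

# Read one row back as a transaction: the names of its True columns.
first_transaction = encoded.columns[encoded.iloc[0].to_numpy()].tolist()
print(first_transaction)
```

Because each record always sets exactly one indicator per feature, every transaction has the same length, and complementary columns such as a `_No` and `_Yes` pair never appear together in an itemset.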


## 5. Save preprocessed data in CSV format

In [15]:
df_preprocessed = df_encoded.copy()

# Save the preprocessed dataframe to a new CSV file
df_preprocessed.to_csv(r'C:\Users\Admin\Documents\dataset_association_mining\heart_disease_preprocessed.csv', index=False)
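
When the saved file is loaded again for mining, it is worth confirming that the Boolean columns survive the round trip through CSV. The sketch below does this on a made-up two-column frame and writes to a temporary directory; the frame, the variable names, and the output location are hypothetical and stand in for the notebook's absolute Windows path.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy encoded frame (made-up values).
encoded = pd.DataFrame({"Smoking_Yes": [True, False], "Stroke_Yes": [False, False]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "heart_disease_preprocessed.csv"  # hypothetical location
    encoded.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# read_csv parses the True/False text back into Boolean columns.
print(reloaded.dtypes)
roundtrip_ok = reloaded.equals(encoded)
print(roundtrip_ok)
```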