## Association Rule Mining

Association rule mining finds patterns of items that occur together.
 

## Why use it in healthcare
- Which conditions commonly occur together?
- Which symptoms appear together?
- Which patient behaviors co‑occur?
- Which risk factors are associated with certain outcomes?

It’s like discovering hidden relationships in data.


In [4]:
import pandas as pd
import numpy as np
df=pd.read_csv(r"D:\HealthCare System\diabetic_data_cleaned.csv")

## Binning the columns 

In [5]:
# Inpatient visits
df['inpatient_cat'] = pd.cut(
    df['number_inpatient'],
    bins=[-1, 0, 1, 100],
    labels=['0', '1', '2+']
)

# Emergency visits
df['emergency_cat'] = pd.cut(
    df['number_emergency'],
    bins=[-1, 0, 1, 100],
    labels=['0', '1', '2+']
)

# Outpatient visits
df['outpatient_cat'] = pd.cut(
    df['number_outpatient'],
    bins=[-1, 0, 1, 100],
    labels=['0', '1', '2+']
)

# Medications
df['medication_cat'] = pd.cut(
    df['num_medications'],
    bins=[-1, 10, 20, 100],
    labels=['Low', 'Medium', 'High']
)


# Time in hospital
df['time_in_hospital_cat'] = pd.cut(
    df['time_in_hospital'],
    bins=[0, 3, 6, 20],
    labels=['Short', 'Medium', 'Long']
)

# Number of diagnoses
df['diagnoses_cat'] = pd.cut(
    df['number_diagnoses'],
    bins=[0, 4, 7, 20],
    labels=['Low', 'Medium', 'High']
)
# Display the first few rows of the modified DataFrame
print(df.head())

   race  gender  age  admission_type_id  discharge_disposition_id  \
0     3       0    0                  6                        25   
1     3       0    1                  1                         1   
2     1       0    2                  1                         1   
3     3       1    3                  1                         1   
4     3       1    4                  1                         1   

   admission_source_id  time_in_hospital  medical_specialty  \
0                    1                 1                 38   
1                    7                 3                  0   
2                    7                 2                  0   
3                    7                 2                  0   
4                    7                 1                  0   

   num_lab_procedures  num_procedures  ...  admission_type_desc  \
0                  41               0  ...                    7   
1                  59               0  ...                    1   
2    
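As a minimal sketch of how the bin edges above behave, the toy example below applies the same `[-1, 0, 1, 100]` edges to a few made-up visit counts. `pd.cut` uses right-inclusive intervals by default, so a count of 0 falls in `(-1, 0]`, and any value above the last edge becomes `NaN`. The helper name `bin_counts` and the sample values are hypothetical, and the open-ended `np.inf` edge is only one possible way to avoid the cap.

```python
import numpy as np
import pandas as pd


def bin_counts(values: pd.Series, upper: float = np.inf) -> pd.Series:
    """Bin visit counts into '0', '1', '2+' with right-inclusive intervals."""
    return pd.cut(values, bins=[-1, 0, 1, upper], labels=['0', '1', '2+'])


# Made-up counts; 150 lies above the cap of 100 used in the cell above
visits = pd.Series([0, 1, 2, 5, 150])

capped = bin_counts(visits, upper=100)   # same edges as the notebook cell
open_ended = bin_counts(visits)          # last bin extends to infinity

print(capped.tolist())
print(open_ended.tolist())
```

With the capped edges the value 150 is left unbinned (missing), while the open-ended variant places it in `2+`.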

## Final feature selection for Apriori

In [6]:
df_assoc = df[
    [
        'race', 'gender', 'age',
        'inpatient_cat', 'emergency_cat', 'outpatient_cat',
        'medication_cat', 'time_in_hospital_cat', 'diagnoses_cat',
        'max_glu_serum', 'A1Cresult',
        'readmitted'
    ]
]
print(df_assoc.head())

   race  gender  age inpatient_cat emergency_cat outpatient_cat  \
0     3       0    0             0             0              0   
1     3       0    1             0             0              0   
2     1       0    2             1             0             2+   
3     3       1    3             0             0              0   
4     3       1    4             0             0              0   

  medication_cat time_in_hospital_cat diagnoses_cat  max_glu_serum  A1Cresult  \
0            Low                Short           Low              3          3   
1         Medium                Short          High              3          3   
2         Medium                Short        Medium              3          3   
3         Medium                Short        Medium              3          3   
4            Low                Short        Medium              3          3   

   readmitted  
0           2  
1           1  
2           2  
3           2  
4           2  


## Convert ALL columns to strings

In [7]:
df_assoc = df_assoc.astype(str)

## One hot encoding

In [8]:
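# Note: drop_first=True removes the first (alphabetically sorted) category of every column,
# so items such as diagnoses_cat_High never reach Apriori; drop_first=False keeps them.
df_encoded = pd.get_dummies(df_assoc, drop_first=True)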
df_encoded.head()

|   | race_1 | race_2 | race_3 | race_4 | race_5 | gender_1 | gender_2 | age_1 | age_2 | age_3 | ... | diagnoses_cat_Low | diagnoses_cat_Medium | max_glu_serum_1 | max_glu_serum_2 | max_glu_serum_3 | A1Cresult_1 | A1Cresult_2 | A1Cresult_3 | readmitted_1 | readmitted_2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | True | False | False | False | False | False | False | False | ... | True | False | False | False | True | False | False | True | False | True |
| 1 | False | False | True | False | False | False | False | True | False | False | ... | False | False | False | False | True | False | False | True | True | False |
| 2 | True | False | False | False | False | False | False | False | True | False | ... | False | True | False | False | True | False | False | True | False | True |
| 3 | False | False | True | False | False | True | False | False | False | True | ... | False | True | False | False | True | False | False | True | False | True |
| 4 | False | False | True | False | False | True | False | False | False | False | ... | False | True | False | False | True | False | False | True | False | True |
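To illustrate what `drop_first` does to the set of items available for rule mining, here is a small sketch on a made-up single-column frame. The column name mirrors `diagnoses_cat` from the notebook, but the rows are invented for illustration.

```python
import pandas as pd

# Toy column with the three labels used for diagnoses_cat
toy = pd.DataFrame({'diagnoses_cat': ['High', 'Low', 'Medium', 'High']})

# drop_first=True discards the first category in sorted order ('High')
dropped = pd.get_dummies(toy, drop_first=True)

# drop_first=False keeps one indicator column per category
full = pd.get_dummies(toy, drop_first=False)

print(list(dropped.columns))
print(list(full.columns))
```

Dropping a reference category is useful for regression models, where it avoids collinearity. For association rules, each category is an item in its own right, so keeping all of them is usually the safer choice.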


In [9]:
from mlxtend.frequent_patterns import apriori, association_rules
frequent_itemsets = apriori(df_encoded, min_support=0.05, use_colnames=True)
print("Total Frequent Itemsets:", frequent_itemsets.shape[0])


Total Frequent Itemsets: 836
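The `min_support=0.05` threshold keeps only itemsets that are present in at least 5% of rows. As a rough sketch of what "support" means, the example below computes it by hand with plain pandas on a made-up boolean frame. The helper `support` and the five rows are hypothetical and are not part of mlxtend or of the dataset.

```python
import pandas as pd

# Made-up one-hot rows; column names borrowed from the encoded frame
baskets = pd.DataFrame({
    'medication_cat_Low':         [True,  True,  False, True,  False],
    'time_in_hospital_cat_Short': [True,  True,  False, False, True],
    'readmitted_2':               [True,  False, True,  True,  False],
})


def support(frame: pd.DataFrame, items: list) -> float:
    """Fraction of rows in which every listed item is True."""
    return float(frame[items].all(axis=1).mean())


s_single = support(baskets, ['medication_cat_Low'])
s_pair = support(baskets, ['medication_cat_Low', 'time_in_hospital_cat_Short'])

print(s_single)  # 3 of 5 rows
print(s_pair)    # 2 of 5 rows
```

Adding items to an itemset can only keep support the same or lower it, which is the property Apriori relies on to prune candidates.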


In [10]:
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)



In [11]:
rules = rules[rules['lift'] > 1]
rules = rules.sort_values(by='lift', ascending=False).head(10)
#print("Total Association Rules after Lift Filtering:", rules.shape[0])

In [12]:
pd.set_option('display.max_colwidth', None)


In [13]:
rules_all = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)
rules = rules_all[rules_all['lift'] > 1.1]  # filter the full rule set, not the earlier top-10 subset

rules = rules.sort_values(by='lift', ascending=False)
rules.head(10)

#print("Total rules before lift filtering:", rules_all.shape[0])

|   | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | representativity | leverage | conviction | zhangs_metric | jaccard | certainty | kulczynski |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 919 | (gender_1, medication_cat_Low, readmitted_2) | (time_in_hospital_cat_Short) | 0.076843 | 0.483344 | 0.058526 | 0.761637 | 1.575765 | 1.0 | 0.021385 | 2.167515 | 0.395802 | 0.116665 | 0.538642 | 0.441362 |
| 1519 | (gender_1, medication_cat_Low, max_glu_serum_3, readmitted_2) | (time_in_hospital_cat_Short) | 0.072873 | 0.483344 | 0.055461 | 0.761057 | 1.574566 | 1.0 | 0.020238 | 2.162257 | 0.393586 | 0.110754 | 0.53752 | 0.4379 |
| 1522 | (gender_1, medication_cat_Low, readmitted_2) | (max_glu_serum_3, time_in_hospital_cat_Short) | 0.076843 | 0.459643 | 0.055461 | 0.721739 | 1.570218 | 1.0 | 0.02014 | 1.941909 | 0.393374 | 0.115297 | 0.485043 | 0.4212 |
| 1388 | (A1Cresult_3, race_3, medication_cat_Low, readmitted_2) | (time_in_hospital_cat_Short) | 0.089146 | 0.483344 | 0.067262 | 0.754519 | 1.56104 | 1.0 | 0.024174 | 2.104671 | 0.394576 | 0.133132 | 0.524866 | 0.44684 |
| 1179 | (gender_1, A1Cresult_3, medication_cat_Low, race_3) | (time_in_hospital_cat_Short) | 0.07757 | 0.483344 | 0.058507 | 0.754244 | 1.560469 | 1.0 | 0.021014 | 2.10231 | 0.389371 | 0.116453 | 0.524333 | 0.437645 |
| 1605 | (max_glu_serum_3, gender_1, race_3, medication_cat_Low, A1Cresult_3) | (time_in_hospital_cat_Short) | 0.072293 | 0.483344 | 0.054507 | 0.753976 | 1.559915 | 1.0 | 0.019565 | 2.100021 | 0.38691 | 0.108769 | 0.523814 | 0.433374 |
| 1572 | (A1Cresult_3, medication_cat_Low, diagnoses_cat_Medium, max_glu_serum_3) | (time_in_hospital_cat_Short) | 0.08039 | 0.483344 | 0.06059 | 0.753698 | 1.559339 | 1.0 | 0.021734 | 2.097648 | 0.39006 | 0.120423 | 0.523276 | 0.439527 |
| 1649 | (max_glu_serum_3, race_3, readmitted_2, medication_cat_Low, A1Cresult_3) | (time_in_hospital_cat_Short) | 0.08326 | 0.483344 | 0.062722 | 0.753334 | 1.558587 | 1.0 | 0.022479 | 2.094557 | 0.390943 | 0.124478 | 0.522572 | 0.441551 |
| 1077 | (A1Cresult_3, medication_cat_Low, readmitted_2) | (time_in_hospital_cat_Short) | 0.126172 | 0.483344 | 0.094884 | 0.752025 | 1.555879 | 1.0 | 0.0339 | 2.083499 | 0.408863 | 0.184373 | 0.520038 | 0.474166 |
| 1578 | (A1Cresult_3, medication_cat_Low, max_glu_serum_3, readmitted_2) | (time_in_hospital_cat_Short) | 0.119107 | 0.483344 | 0.08946 | 0.751093 | 1.553951 | 1.0 | 0.031891 | 2.0757 | 0.404679 | 0.174389 | 0.518235 | 0.468089 |
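The relationship between the columns can be checked by hand. The sketch below recomputes confidence, lift, and leverage for rule 919 from its three support values, using the standard definitions (confidence = support / antecedent support, lift = confidence / consequent support, leverage = support minus the product of the two marginal supports). The inputs are the rounded values printed in the table, so the results agree with the table only up to rounding.

```python
# Values copied from rule 919 in the table above (rounded to six decimals)
antecedent_support = 0.076843   # P(gender_1, medication_cat_Low, readmitted_2)
consequent_support = 0.483344   # P(time_in_hospital_cat_Short)
support = 0.058526              # P(antecedent and consequent together)

# Confidence: how often the consequent holds when the antecedent holds
confidence = support / antecedent_support

# Lift: confidence relative to the consequent's baseline frequency
lift = confidence / consequent_support

# Leverage: observed co-occurrence minus what independence would predict
leverage = support - antecedent_support * consequent_support

print(round(confidence, 3), round(lift, 3), round(leverage, 4))
```

A lift of about 1.58 here means that short stays are roughly 58% more frequent among patients matching the antecedent than among all patients.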


<!-- # Rule 1970 : (medication_cat_High_True, time_in_hospital_cat_Long_True, race_3, A1Cresult_3)
  ## Interpretation: 
                     High medication + long stay + high A1C → high comorbidity.
                     A1Cresult_3 means poor long‑term glucose control, which aligns with multiple chronic conditions.
                     Lift = 1.39 → strong association.


# Rule 1941 : (medication_cat_High_True, time_in_hospital_cat_Long_True, race_3) + (diagnoses_cat_High_True, max_glu_serum_3)
  ## Interpretation : 
                  Patients with high meds + long stay are not only multi‑morbid but also show abnormal glucose levels. -->



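In this analysis, high comorbidity is represented by the item diagnoses_cat_High_True, which marks patients with a large number of diagnoses (more than 7, per the binning above). The association rules discussed below show that patients with a high medication burden, long hospital stays, and poor glucose control are more likely to have high comorbidity. This is a clinically plausible pattern: patients with multiple chronic conditions tend to require more medications and longer inpatient care.

These item names carry a _True suffix and rely on categories (diagnoses_cat_High, medication_cat_High, time_in_hospital_cat_Long) that the drop_first=True encoding in In [8] removes, so the rules behind the insights below are not among the top ten rules listed above.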


# Insight 1: High Medication Burden + Long Hospital Stay → High Comorbidity

Across multiple rules, patients who had:

- High medication usage (medication_cat_High_True)
- Long hospital stays (time_in_hospital_cat_Long_True)

were positively associated with:

- High diagnosis count (diagnoses_cat_High_True)

Lift values between 1.36 and 1.40 mean that a high diagnosis count is 36–40% more frequent among these patients than in the overall patient population.

## Interpretation:
Patients who require many medications and remain hospitalized for extended periods typically have multiple comorbidities, reflecting a higher overall disease burden.

# Insight 2: Poor Glucose Control Reinforces High Comorbidity

Several rules added glucose‑related indicators to the above pattern:

- Abnormal glucose levels (max_glu_serum_3)
- High A1C values (A1Cresult_3)

When these features appeared alongside high medication use and long hospital stays, the confidence remained above 80%, and lift stayed above 1.36.


## Interpretation:
Poor glycemic control often co‑occurs with multiple chronic conditions. These patients tend to require more intensive management and longer inpatient care.

# Insight 3: Age and Medium Medication Use Are Also Associated with High Comorbidity

One rule showed that even patients with:

- Medium medication usage
- Abnormal glucose levels
- Older age (age_8)

were associated with:

- High diagnosis count


## Interpretation:
Even moderate medication complexity combined with older age and abnormal glucose levels signals a higher likelihood of multiple chronic conditions.








In [14]:
rules_all = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)
rules = rules_all[rules_all['lift'] > 1.1]
rules = rules.sort_values(by='lift', ascending=False)

top_rules = rules[['antecedents','consequents','support','confidence','lift']].head(10)


In [15]:
top_rules.to_csv(r"D:\HealthCare System\top_association_rules.csv", index=False)
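mlxtend stores antecedents and consequents as `frozenset` objects, which `to_csv` writes out in their Python repr form (for example `frozenset({'a', 'b'})`). If the CSV is meant for a spreadsheet or a BI tool, one option is to convert them to plain text first. The sketch below does this on a made-up one-row frame. The helper `itemset_to_text` and the sample values are hypothetical.

```python
import pandas as pd

# Made-up rule with frozenset columns, shaped like the mlxtend output
toy_rules = pd.DataFrame({
    'antecedents': [frozenset({'medication_cat_Low', 'gender_1'})],
    'consequents': [frozenset({'time_in_hospital_cat_Short'})],
    'lift': [1.58],
})


def itemset_to_text(items) -> str:
    """Join the items of a set into a sorted, comma-separated string."""
    return ', '.join(sorted(items))


# Work on a copy so the original frozenset columns stay usable
readable = toy_rules.copy()
for column in ['antecedents', 'consequents']:
    readable[column] = readable[column].apply(itemset_to_text)

print(readable)
```

Sorting the items makes the text deterministic, so the same rule always produces the same string regardless of set ordering.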
