# Project 7: Data mining with market basket

Select a dataset of interest to you and perform a market basket analysis, including finding frequent itemsets and mining association rules. Do not use a shopping cart dataset - select (or create) another kind of dataset and think of how to frame it as a market basket problem. You can use whatever implementation of the A Priori algorithm you want: from the book, from here: http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/, or anything else you find.

This assignment is a little more subjective than previous assignments. Before starting, discuss your dataset with me. You will be graded on the quality of your explanation as well as the code. There are no performance goals to meet as this is a data mining project, but the model does need to be carefully tuned to select frequent itemsets and association rules with high support, confidence and lift.  Your write-up should discuss what dataset you chose and why, what parameters you selected and why, give examples of itemsets and rules. You should wrap it up with a conclusion about what you 'discovered' about this dataset using this method.

In [135]:
import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules


## Disease Symptoms and Patient Profile Dataset

I found this dataset on Kaggle. It unfortunately comes with little background information, but I found many articles that use it for data mining and machine learning purposes (for example, https://www.nature.com/articles/s41598-024-69029-8#Bib1). By applying market basket analysis to this dataset, I hope to find the symptoms that most often occur together and the patient profiles that most often share the same symptoms.

In [136]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("uom190346a/disease-symptoms-and-patient-profile-dataset")

print("Path to dataset files:", path)

Path to dataset files: /home/jaeho/.cache/kagglehub/datasets/uom190346a/disease-symptoms-and-patient-profile-dataset/versions/2


In [137]:
df = pd.read_csv(path + "/Disease_symptom_and_patient_profile_dataset.csv")
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 349 entries, 0 to 348
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Disease               349 non-null    object
 1   Fever                 349 non-null    object
 2   Cough                 349 non-null    object
 3   Fatigue               349 non-null    object
 4   Difficulty Breathing  349 non-null    object
 5   Age                   349 non-null    int64 
 6   Gender                349 non-null    object
 7   Blood Pressure        349 non-null    object
 8   Cholesterol Level     349 non-null    object
 9   Outcome Variable      349 non-null    object
dtypes: int64(1), object(9)
memory usage: 27.4+ KB


Unnamed: 0,Disease,Fever,Cough,Fatigue,Difficulty Breathing,Age,Gender,Blood Pressure,Cholesterol Level,Outcome Variable
0,Influenza,Yes,No,Yes,Yes,19,Female,Low,Normal,Positive
1,Common Cold,No,Yes,Yes,No,25,Female,Normal,Normal,Negative
2,Eczema,No,Yes,Yes,No,25,Female,Normal,Normal,Negative
3,Asthma,Yes,Yes,No,Yes,25,Male,Normal,Normal,Positive
4,Asthma,Yes,Yes,No,Yes,25,Male,Normal,Normal,Positive


In [138]:
# Map Yes/No to True/False for each symptom column
for column in ['Fever', 'Cough', 'Fatigue', 'Difficulty Breathing']:
    df[column] = df[column].map({'Yes': True, 'No': False})

df = pd.get_dummies(df, columns=['Gender', 'Blood Pressure', 'Cholesterol Level'])
df = df.drop(columns=['Blood Pressure_Normal', 'Cholesterol Level_Normal'])
df.head()

Unnamed: 0,Disease,Fever,Cough,Fatigue,Difficulty Breathing,Age,Outcome Variable,Gender_Female,Gender_Male,Blood Pressure_High,Blood Pressure_Low,Cholesterol Level_High,Cholesterol Level_Low
0,Influenza,True,False,True,True,19,Positive,True,False,False,True,False,False
1,Common Cold,False,True,True,False,25,Negative,True,False,False,False,False,False
2,Eczema,False,True,True,False,25,Negative,True,False,False,False,False,False
3,Asthma,True,True,False,True,25,Positive,False,True,False,False,False,False
4,Asthma,True,True,False,True,25,Positive,False,True,False,False,False,False


I chose to drop the "Normal" columns for blood pressure and cholesterol because they only indicate the patient's baseline state, and I don't expect them to be useful for the analysis.
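
To make the market basket framing concrete, here is a minimal sketch of how one patient row becomes a "basket" of items under this encoding. The helper `to_basket` and the hard-coded record are illustrative assumptions, not part of the notebook's pipeline; the record copies row 0 of the table shown earlier, and the sketch ignores the Disease and Age columns.

```python
# Hypothetical helper (not in the notebook): turn one patient record into a set of items.
SYMPTOMS = ["Fever", "Cough", "Fatigue", "Difficulty Breathing"]
LEVELS = ["Blood Pressure", "Cholesterol Level"]


def to_basket(record):
    """Return the set of items that are 'present' for one patient."""
    # A symptom is an item only when it is reported as "Yes".
    basket = {symptom for symptom in SYMPTOMS if record[symptom] == "Yes"}
    basket.add(f"Gender_{record['Gender']}")
    for column in LEVELS:
        # "Normal" is treated as the baseline and contributes no item.
        if record[column] != "Normal":
            basket.add(f"{column}_{record[column]}")
    return basket


# Row 0 of the dataset preview (Influenza patient).
row0 = {
    "Fever": "Yes",
    "Cough": "No",
    "Fatigue": "Yes",
    "Difficulty Breathing": "Yes",
    "Gender": "Female",
    "Blood Pressure": "Low",
    "Cholesterol Level": "Normal",
}

basket0 = to_basket(row0)
print(sorted(basket0))
```

A patient with normal blood pressure and normal cholesterol simply has a smaller basket, which is the effect of dropping the two "Normal" columns.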

In [139]:
# Create a new column with True/False values based on the 'Outcome Variable' column
df['Diagnosis'] = df['Outcome Variable'].map({'Positive': True, 'Negative': False})
# Create columns from entries in the 'Disease' column with values from the 'Diagnosis' column
pivot_table = df.pivot_table(index=df.index, columns='Disease', values='Diagnosis', fill_value=False)
for column in pivot_table.columns:
    pivot_table[column] = pivot_table[column].map({1: True, 0: False})
df = pd.concat([df.drop(['Disease', 'Outcome Variable', 'Diagnosis'], axis='columns'), pivot_table], axis='columns')
df.head()

Unnamed: 0,Fever,Cough,Fatigue,Difficulty Breathing,Age,Gender_Female,Gender_Male,Blood Pressure_High,Blood Pressure_Low,Cholesterol Level_High,...,Tonsillitis,Tourette Syndrome,Tuberculosis,Turner Syndrome,Typhoid Fever,Ulcerative Colitis,Urinary Tract Infection,Urinary Tract Infection (UTI),Williams Syndrome,Zika Virus
0,True,False,True,True,19,True,False,False,True,False,...,False,False,False,False,False,False,False,False,False,False
1,False,True,True,False,25,True,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,True,True,False,25,True,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,True,True,False,True,25,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,True,True,False,True,25,False,True,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [140]:
# Age Range Boolean Columns
age_bins    = [0,   12,     19,            35,      50,            65,      100]
age_labels  = ['Child', 'Teen', 'Young Adult', 'Adult', 'Middle Aged', 'Senior']

# Map ages to ranges
df['Age Range'] = pd.cut(df['Age'], bins=age_bins, labels=age_labels, right=False)
df = pd.get_dummies(df, columns=['Age Range'])
df = df.drop(columns=['Age'])
df.head()

Unnamed: 0,Fever,Cough,Fatigue,Difficulty Breathing,Gender_Female,Gender_Male,Blood Pressure_High,Blood Pressure_Low,Cholesterol Level_High,Cholesterol Level_Low,...,Urinary Tract Infection,Urinary Tract Infection (UTI),Williams Syndrome,Zika Virus,Age Range_Child,Age Range_Teen,Age Range_Young Adult,Age Range_Adult,Age Range_Middle Aged,Age Range_Senior
0,True,False,True,True,True,False,False,True,False,False,...,False,False,False,False,False,False,True,False,False,False
1,False,True,True,False,True,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
2,False,True,True,False,True,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
3,True,True,False,True,False,True,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
4,True,True,False,True,False,True,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False


To make age compatible with the mlxtend apriori algorithm, which expects boolean columns, I replaced the integer age with one-hot columns for the patient's age range.
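
The bin edges above are used with `right=False`, which as I understand it makes each bin half-open, [left, right), so an age of exactly 19 counts as "Young Adult" rather than "Teen". The sketch below mirrors that behaviour with the standard library only; `age_range` is a hypothetical helper written for illustration and is not how pandas implements `pd.cut`.

```python
from bisect import bisect_right

age_bins = [0, 12, 19, 35, 50, 65, 100]
age_labels = ['Child', 'Teen', 'Young Adult', 'Adult', 'Middle Aged', 'Senior']


def age_range(age):
    """Return the label of the half-open bin [left, right) that contains age.

    Ages outside [0, 100) get None, which plays the role of a missing value.
    """
    if age < age_bins[0] or age >= age_bins[-1]:
        return None
    # bisect_right counts the edges <= age; subtracting 1 gives the bin index.
    return age_labels[bisect_right(age_bins, age) - 1]


for age in (11, 12, 18, 19, 64, 65, 100):
    print(age, age_range(age))
```

One consequence worth checking on real data is the upper edge: under this scheme an age of exactly 100 falls outside every bin.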

In [141]:
frequent_itemsets = apriori(df, min_support=0.3, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
frequent_itemsets = frequent_itemsets.sort_values(by=['support'], ascending=False)
frequent_itemsets.head()

Unnamed: 0,support,itemsets,length
2,0.69341,(Fatigue),1
3,0.504298,(Gender_Female),1
0,0.501433,(Fever),1
4,0.495702,(Gender_Male),1
1,0.47851,(Cough),1


Because the one-hot encoded dataset is sparse (most disease and age-range columns are False for any given patient), I set the minimum support to 0.3, which still gives a good number of frequent itemsets and association rules.
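
To illustrate how the minimum support threshold controls how many itemsets survive, here is a small sketch on five made-up transactions. It enumerates candidates by brute force rather than using the Apriori pruning that mlxtend applies, and the transactions, the function names, and the thresholds are all assumptions chosen for illustration.

```python
from itertools import combinations

# Toy transactions (invented for this example, not taken from the dataset).
transactions = [
    {"Fever", "Fatigue"},
    {"Fever", "Cough", "Fatigue"},
    {"Cough"},
    {"Fatigue"},
    {"Fever", "Fatigue"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for transaction in transactions if itemset <= transaction)
    return hits / len(transactions)


def frequent_itemsets_bruteforce(transactions, min_support, max_len=2):
    """Return {itemset: support} for all itemsets up to max_len that meet min_support."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_len + 1):
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= min_support:
                result[candidate] = s
    return result


# Count how many itemsets survive at each threshold.
counts = {
    threshold: len(frequent_itemsets_bruteforce(transactions, threshold))
    for threshold in (0.3, 0.5, 0.7)
}
print(counts)
```

Raising the threshold shrinks the result quickly, which is why the threshold has to be tuned per dataset rather than fixed in advance.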

In [142]:
frequent_itemsets[frequent_itemsets['length'] > 1].head()

Unnamed: 0,support,itemsets,length
11,0.363897,"(Blood Pressure_High, Fatigue)",2
9,0.34957,"(Fatigue, Gender_Female)",2
12,0.346705,"(Cholesterol Level_High, Fatigue)",2
10,0.34384,"(Fatigue, Gender_Male)",2
8,0.329513,"(Fever, Fatigue)",2


This table shows the itemsets with more than one item. Fatigue appears in every one of the highest-support itemsets, probably because it is a very general and very common symptom (it has the highest single-item support, 0.69). As a result, I later removed fatigue from the dataset to see whether more interesting itemsets would emerge.

In [143]:
confidence_rules = association_rules(frequent_itemsets, num_itemsets=349, metric="confidence", min_threshold=0.1)
confidence_rules = confidence_rules.sort_values(by='confidence', ascending=False)
confidence_rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
0,(Blood Pressure_High),(Fatigue),0.47851,0.69341,0.363897,0.760479,1.096724,1.0,0.032093,1.280014,0.169118,0.450355,0.218759,0.642636
4,(Cholesterol Level_High),(Fatigue),0.475645,0.69341,0.346705,0.728916,1.051205,1.0,0.016888,1.130977,0.092896,0.421603,0.115809,0.614458
13,(Age Range_Adult),(Fatigue),0.429799,0.69341,0.30086,0.7,1.009504,1.0,0.002832,1.021968,0.016511,0.365854,0.021495,0.566942
7,(Gender_Male),(Fatigue),0.495702,0.69341,0.34384,0.693642,1.000334,1.0,0.000115,1.000757,0.000663,0.40678,0.000756,0.594755
3,(Gender_Female),(Fatigue),0.504298,0.69341,0.34957,0.693182,0.999671,1.0,-0.000115,0.999257,-0.000663,0.412162,-0.000743,0.598657


In [144]:
lift_rules = association_rules(frequent_itemsets, num_itemsets=349, metric="lift", min_threshold=0.7)
lift_rules = lift_rules.sort_values(by='lift', ascending=False)
lift_rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
10,(Blood Pressure_High),(Cholesterol Level_High),0.47851,0.475645,0.320917,0.670659,1.409999,1.0,0.093316,1.592133,0.557594,0.506787,0.371912,0.672679
11,(Cholesterol Level_High),(Blood Pressure_High),0.475645,0.47851,0.320917,0.674699,1.409999,1.0,0.093316,1.603099,0.554547,0.506787,0.376208,0.672679
0,(Blood Pressure_High),(Fatigue),0.47851,0.69341,0.363897,0.760479,1.096724,1.0,0.032093,1.280014,0.169118,0.450355,0.218759,0.642636
1,(Fatigue),(Blood Pressure_High),0.69341,0.47851,0.363897,0.524793,1.096724,1.0,0.032093,1.097396,0.287659,0.450355,0.088752,0.642636
5,(Fatigue),(Cholesterol Level_High),0.69341,0.475645,0.346705,0.5,1.051205,1.0,0.016888,1.048711,0.158879,0.421603,0.046448,0.614458
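
The rule tables above report confidence and lift for each rule. As a sanity check, the sketch below recomputes both from the support columns for two of the fatigue rules, using the standard definitions confidence = support(A and C) / support(A) and lift = confidence / support(C). The function name `confidence_and_lift` is my own; the input numbers are copied from the tables.

```python
def confidence_and_lift(support_both, support_antecedent, support_consequent):
    """Recompute confidence and lift of a rule A -> C from three supports."""
    confidence = support_both / support_antecedent
    lift = confidence / support_consequent
    return confidence, lift


# Blood Pressure_High -> Fatigue (values from the rule table)
conf_bp, lift_bp = confidence_and_lift(0.363897, 0.47851, 0.69341)

# Gender_Male -> Fatigue (values from the rule table)
conf_male, lift_male = confidence_and_lift(0.34384, 0.495702, 0.69341)

print(f"Blood Pressure_High -> Fatigue: confidence={conf_bp:.4f}, lift={lift_bp:.4f}")
print(f"Gender_Male -> Fatigue:         confidence={conf_male:.4f}, lift={lift_male:.4f}")
```

Both rules have a confidence near 0.7, but that is close to the support of Fatigue itself, so the lift stays near 1. A lift near 1 means the antecedent tells us almost nothing about whether the patient is fatigued.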


### With Fatigue column dropped

In [145]:
df = df.drop(columns=['Fatigue'])
df.head()

Unnamed: 0,Fever,Cough,Difficulty Breathing,Gender_Female,Gender_Male,Blood Pressure_High,Blood Pressure_Low,Cholesterol Level_High,Cholesterol Level_Low,Acne,...,Urinary Tract Infection,Urinary Tract Infection (UTI),Williams Syndrome,Zika Virus,Age Range_Child,Age Range_Teen,Age Range_Young Adult,Age Range_Adult,Age Range_Middle Aged,Age Range_Senior
0,True,False,True,True,False,False,True,False,False,False,...,False,False,False,False,False,False,True,False,False,False
1,False,True,False,True,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
2,False,True,False,True,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
3,True,True,True,False,True,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
4,True,True,True,False,True,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False


In [146]:
frequent_itemsets = apriori(df, min_support=0.2, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
frequent_itemsets = frequent_itemsets.sort_values(by=['support'], ascending=False)
frequent_itemsets.head()

Unnamed: 0,support,itemsets,length
3,0.504298,(Gender_Female),1
0,0.501433,(Fever),1
4,0.495702,(Gender_Male),1
1,0.47851,(Cough),1
5,0.47851,(Blood Pressure_High),1


In [147]:
frequent_itemsets[frequent_itemsets['length'] > 1].head()

Unnamed: 0,support,itemsets,length
25,0.320917,"(Blood Pressure_High, Cholesterol Level_High)",2
12,0.297994,"(Fever, Blood Pressure_High)",2
9,0.272206,"(Fever, Cough)",2
11,0.252149,"(Fever, Gender_Male)",2
10,0.249284,"(Fever, Gender_Female)",2


Without Fatigue, I can see that high blood pressure and high cholesterol often occur together. I did some research and found that this is often the case in real life (https://www.ahajournals.org/doi/10.1161/circ.106.25.3329):

- Atherosclerosis: Elevated LDL cholesterol leads to plaque buildup in arteries, narrowing them and increasing blood pressure.
- Endothelial Dysfunction: High cholesterol impairs the endothelium, reducing its ability to regulate blood vessel dilation, contributing to hypertension.
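
To put a number on how strongly the two conditions co-occur in this dataset, the sketch below compares the observed number of patients with both conditions to the number expected if the two were independent. The row counts are not printed anywhere in the notebook: I back-calculated them by multiplying the reported supports by the 349 rows (for example, 0.320917 × 349 ≈ 112), so treat them as an assumption. Leverage and conviction follow the definitions in the mlxtend documentation.

```python
# Counts back-calculated from the reported supports (assumption, see lead-in).
n_rows = 349
n_high_bp = 167      # support 0.47851
n_high_chol = 166    # support 0.475645
n_both = 112         # support 0.320917

support_bp = n_high_bp / n_rows
support_chol = n_high_chol / n_rows
support_both = n_both / n_rows

# Rows expected to have both conditions if they were independent.
expected_both = n_high_bp * n_high_chol / n_rows

# Lift is the ratio of observed to expected co-occurrence.
lift = n_both / expected_both

# Leverage is the difference between observed and expected support.
leverage = support_both - support_bp * support_chol

# Conviction for the rule Blood Pressure_High -> Cholesterol Level_High.
confidence = n_both / n_high_bp
conviction = (1 - support_chol) / (1 - confidence)

print(f"expected rows with both: {expected_both:.1f}, observed: {n_both}")
print(f"lift={lift:.4f}, leverage={leverage:.4f}, conviction={conviction:.4f}")
```

Under these counts, roughly 79 patients would be expected to have both conditions by chance, against 112 observed.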

In [148]:
confidence_rules = association_rules(frequent_itemsets, num_itemsets=349, metric="confidence", min_threshold=0.5)
confidence_rules = confidence_rules.sort_values(by='confidence', ascending=False)
confidence_rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
1,(Cholesterol Level_High),(Blood Pressure_High),0.475645,0.47851,0.320917,0.674699,1.409999,1.0,0.093316,1.603099,0.554547,0.506787,0.376208,0.672679
0,(Blood Pressure_High),(Cholesterol Level_High),0.47851,0.475645,0.320917,0.670659,1.409999,1.0,0.093316,1.592133,0.557594,0.506787,0.371912,0.672679
3,(Blood Pressure_High),(Fever),0.47851,0.501433,0.297994,0.622754,1.24195,1.0,0.058054,1.321599,0.373574,0.436975,0.243341,0.60852
2,(Fever),(Blood Pressure_High),0.501433,0.47851,0.297994,0.594286,1.24195,1.0,0.058054,1.285363,0.390749,0.436975,0.222009,0.60852
5,(Cough),(Fever),0.47851,0.501433,0.272206,0.568862,1.134474,1.0,0.032266,1.156399,0.227299,0.384615,0.135247,0.55586


In [149]:
lift_rules = association_rules(frequent_itemsets, num_itemsets=349, metric="lift", min_threshold=1.2)
lift_rules = lift_rules.sort_values(by='lift', ascending=False)
lift_rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
0,(Blood Pressure_High),(Cholesterol Level_High),0.47851,0.475645,0.320917,0.670659,1.409999,1.0,0.093316,1.592133,0.557594,0.506787,0.371912,0.672679
1,(Cholesterol Level_High),(Blood Pressure_High),0.475645,0.47851,0.320917,0.674699,1.409999,1.0,0.093316,1.603099,0.554547,0.506787,0.376208,0.672679
2,(Fever),(Blood Pressure_High),0.501433,0.47851,0.297994,0.594286,1.24195,1.0,0.058054,1.285363,0.390749,0.436975,0.222009,0.60852
3,(Blood Pressure_High),(Fever),0.47851,0.501433,0.297994,0.622754,1.24195,1.0,0.058054,1.321599,0.373574,0.436975,0.243341,0.60852


These association rules make sense but seem almost too obvious. I think this is because the dataset is very small and does not have enough variety to produce more interesting rules, so I looked for a larger dataset and found "SymbiPredict" on Mendeley Data.

## Larger Dataset (SymbiPredict)

In [150]:
df = pd.read_csv("./symbipredict_2022.csv")
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4961 entries, 0 to 4960
Columns: 133 entries, itching to prognosis
dtypes: int64(132), object(1)
memory usage: 5.0+ MB


Unnamed: 0,itching,skin_rash,nodal_skin_eruptions,continuous_sneezing,shivering,chills,joint_pain,stomach_pain,acidity,ulcers_on_tongue,...,blackheads,scurring,skin_peeling,silver_like_dusting,small_dents_in_nails,inflammatory_nails,blister,red_sore_around_nose,yellow_crust_ooze,prognosis
0,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal Infection
1,0,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal Infection
2,1,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal Infection
3,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal Infection
4,1,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Fungal Infection


In [151]:
# Convert the 0/1 symptom columns to booleans; 'prognosis' is one-hot encoded below
for column in df.columns:
    if column != 'prognosis':
        df[column] = df[column].map({1: True, 0: False}, na_action='ignore')
df = pd.get_dummies(df, columns=['prognosis'])
df.head()

Unnamed: 0,itching,skin_rash,nodal_skin_eruptions,continuous_sneezing,shivering,chills,joint_pain,stomach_pain,acidity,ulcers_on_tongue,...,prognosis_Osteoarthritis,prognosis_Paralysis (brain hemorrhage),prognosis_Peptic Ulcer Disease,prognosis_Pneumonia,prognosis_Psoriasis,prognosis_Tuberculosis,prognosis_Typhoid,prognosis_Urinary Tract Infection,prognosis_Varicose Veins,prognosis_Vertigo
0,True,True,True,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,True,True,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,True,False,True,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,True,True,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,True,True,True,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [152]:
frequent_itemsets = apriori(df, min_support=0.1, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
frequent_itemsets = frequent_itemsets.sort_values(by=['support'], ascending=False)
frequent_itemsets.head()

Unnamed: 0,support,itemsets,length
5,0.392864,(fatigue),1
4,0.389236,(vomiting),1
7,0.27696,(high_fever),1
13,0.234227,(loss_of_appetite),1
12,0.233018,(nausea),1


In [153]:
frequent_itemsets[frequent_itemsets['length'] > 1].head()

Unnamed: 0,support,itemsets,length
25,0.198952,"(vomiting, nausea)",2
29,0.198952,"(high_fever, fatigue)",2
27,0.17698,"(vomiting, abdominal_pain)",2
50,0.159847,"(yellowing_of_eyes, loss_of_appetite)",2
33,0.157428,"(fatigue, loss_of_appetite)",2


In [154]:
confidence_rules = association_rules(frequent_itemsets, num_itemsets=4961, metric="confidence", min_threshold=0.9)
confidence_rules = confidence_rules.sort_values(by='confidence', ascending=False)
confidence_rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
23,"(yellowish_skin, abdominal_pain, loss_of_appet...",(yellowing_of_eyes),0.106229,0.165894,0.10381,0.97723,5.890688,1.0,0.086187,36.631156,0.928918,0.616766,0.972701,0.801495
7,"(vomiting, yellowing_of_eyes)",(loss_of_appetite),0.113485,0.234227,0.109857,0.968028,4.132865,1.0,0.083276,23.951679,0.855075,0.461864,0.958249,0.718524
9,"(yellowing_of_eyes, fatigue)",(loss_of_appetite),0.112276,0.234227,0.108647,0.967684,4.131395,1.0,0.082349,23.696421,0.853814,0.45678,0.9578,0.71577
3,"(yellowish_skin, loss_of_appetite)",(yellowing_of_eyes),0.133038,0.165894,0.1282,0.963636,5.808748,1.0,0.10613,22.937916,0.954881,0.750885,0.956404,0.868209
0,(yellowing_of_eyes),(loss_of_appetite),0.165894,0.234227,0.159847,0.963548,4.113736,1.0,0.12099,21.007707,0.907453,0.665268,0.952398,0.822996


In [155]:
lift_rules = association_rules(frequent_itemsets, num_itemsets=4961, metric="lift", min_threshold=1.2)
lift_rules = lift_rules.sort_values(by='lift', ascending=False)
lift_rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
185,"(abdominal_pain, yellowing_of_eyes)","(yellowish_skin, loss_of_appetite)",0.114695,0.133038,0.10381,0.905097,6.80331,1.0,0.088551,9.135214,0.963524,0.721289,0.890533,0.8427
184,"(yellowish_skin, loss_of_appetite)","(abdominal_pain, yellowing_of_eyes)",0.133038,0.114695,0.10381,0.780303,6.80331,1.0,0.088551,4.029666,0.98391,0.721289,0.75184,0.8427
197,"(yellowish_skin, loss_of_appetite)","(yellowing_of_eyes, nausea)",0.133038,0.111066,0.100181,0.75303,6.780006,1.0,0.085405,3.599363,0.983327,0.696078,0.722173,0.827513
200,"(yellowing_of_eyes, nausea)","(yellowish_skin, loss_of_appetite)",0.111066,0.133038,0.100181,0.901996,6.780006,1.0,0.085405,8.846226,0.959023,0.696078,0.886957,0.827513
179,"(yellowish_skin, abdominal_pain, loss_of_appet...",(yellowing_of_eyes),0.106229,0.165894,0.10381,0.97723,5.890688,1.0,0.086187,36.631156,0.928918,0.616766,0.972701,0.801495


### Drop yellowing columns

In [156]:
df = df.drop(columns=['yellowing_of_eyes', 'yellowish_skin'])

In [157]:
frequent_itemsets = apriori(df, min_support=0.1, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
frequent_itemsets = frequent_itemsets.sort_values(by=['support'], ascending=False)
frequent_itemsets.head()

Unnamed: 0,support,itemsets,length
5,0.392864,(fatigue),1
4,0.389236,(vomiting),1
7,0.27696,(high_fever),1
12,0.234227,(loss_of_appetite),1
11,0.233018,(nausea),1


In [158]:
frequent_itemsets[frequent_itemsets['length'] > 1].head()

Unnamed: 0,support,itemsets,length
25,0.198952,"(high_fever, fatigue)",2
22,0.198952,"(vomiting, nausea)",2
24,0.17698,"(vomiting, abdominal_pain)",2
28,0.157428,"(fatigue, loss_of_appetite)",2
23,0.156219,"(vomiting, loss_of_appetite)",2


In [159]:
confidence_rules = association_rules(frequent_itemsets, num_itemsets=4961, metric="confidence", min_threshold=0.9)
confidence_rules = confidence_rules.sort_values(by='confidence', ascending=False)
confidence_rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
1,(dark_urine),(abdominal_pain),0.115904,0.209837,0.111066,0.958261,4.566698,1.0,0.086745,18.930995,0.883415,0.517371,0.947177,0.74378
2,"(chills, fatigue)",(high_fever),0.112276,0.27696,0.107438,0.956912,3.455051,1.0,0.076342,16.780547,0.800439,0.381259,0.940407,0.672415
0,(malaise),(fatigue),0.142713,0.392864,0.135457,0.949153,2.41598,1.0,0.07939,11.940335,0.683656,0.338539,0.91625,0.646972
3,"(headache, nausea)",(vomiting),0.112276,0.389236,0.106229,0.94614,2.430762,1.0,0.062527,11.339851,0.663051,0.26874,0.911815,0.609528
4,"(high_fever, malaise)",(fatigue),0.112276,0.392864,0.106229,0.94614,2.408312,1.0,0.062119,11.272485,0.658731,0.266296,0.911288,0.608268


In [160]:
lift_rules = association_rules(frequent_itemsets, num_itemsets=4961, metric="lift", min_threshold=1.2)
lift_rules = lift_rules.sort_values(by='lift', ascending=False)
lift_rules.head()

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
37,(dark_urine),(abdominal_pain),0.115904,0.209837,0.111066,0.958261,4.566698,1.0,0.086745,18.930995,0.883415,0.517371,0.947177,0.74378
36,(abdominal_pain),(dark_urine),0.209837,0.115904,0.111066,0.529299,4.566698,1.0,0.086745,1.878253,0.988433,0.517371,0.46759,0.74378
70,"(high_fever, fatigue)",(malaise),0.198952,0.142713,0.106229,0.533941,3.741359,1.0,0.077836,1.839439,0.914698,0.451199,0.456356,0.639146
75,(malaise),"(high_fever, fatigue)",0.142713,0.198952,0.106229,0.74435,3.741359,1.0,0.077836,3.133382,0.854693,0.451199,0.680856,0.639146
69,(nausea),"(vomiting, headache)",0.233018,0.130619,0.106229,0.455882,3.490173,1.0,0.075792,1.597782,0.930245,0.412686,0.374132,0.634577


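The confidence and lift values are higher than in the previous dataset, but at the cost of lower support: the top rules here have support of roughly 0.10 to 0.16, versus roughly 0.27 to 0.36 for the first dataset.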
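
Since support is a fraction of the rows, it can help to convert it back to a row count before comparing the two datasets. The sketch below does this for the top lift rule of each dataset; the helper name `rows_supporting` is hypothetical, and the supports and row counts are the ones reported above.

```python
def rows_supporting(support, n_rows):
    """Convert a relative support into an approximate number of rows."""
    return round(support * n_rows)


# First dataset: (Blood Pressure_High, Cholesterol Level_High), 349 rows
rows_small = rows_supporting(0.320917, 349)

# SymbiPredict: (dark_urine, abdominal_pain), 4961 rows
rows_large = rows_supporting(0.111066, 4961)

print(f"first dataset: {rows_small} rows, SymbiPredict: {rows_large} rows")
```

So although the relative support is lower on SymbiPredict, the rule there rests on more patient records in absolute terms.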

# Conclusion

From these market basket analyses, I discovered that many symptoms plausibly occur together. 'dark_urine' and 'abdominal_pain' occurring together is intuitively reasonable, as they are both related to the gastrointestinal system; chills, fatigue, and fever are all symptoms of the flu; and, from experience, having a headache and nausea usually means I will be vomiting soon.