In [14]:
import numpy as np
import pandas as pd

### PLEASE READ:
<br>
Welcome! <br> <br>
After running Latent Class Analysis (LCA) in R (02_lca.Rmd), you will have 7 CSV files, one for each of the 7 subgroups produced by LCA. <br>

We chose **nclass = 7** as the optimal number of subgroups based on a combination of the lowest Bayesian information criterion (BIC) and the lowest Akaike information criterion (AIC) in our results. <br>
<br>

To find out the characteristics of each subgroup, we need to check several conditions:<br>
- Check whether the number of patients in each subgroup is greater than 5% of the total number of patients.
- (CAUTION: Because LCA results are random, your analysis will differ slightly from ours. However, the overall trend should be similar. Please check final_report.pdf on GitHub for reference.)<br>
- Check all 7 CSV files, regardless of how each subgroup compares to the 5% threshold, to identify the 6 groups (Cardiopulmonary, Young, Hepatic & Addiction, Complicated Diabetics, Uncomplicated Diabetics, Cardiac) by their specific comorbidities.
- At this point, you should have 7 CSV files saved in the processed_data folder. This should have been completed in __03_lca.Rmd__.
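
The 5% check in the first bullet can be sketched as below. This is a minimal illustration, not part of the original pipeline: it assumes each CSV row is one patient and that the subgroup files together cover all patients. The helper name `subgroup_shares` and the toy frames are hypothetical.

```python
import pandas as pd

def subgroup_shares(subgroups):
    """Return each subgroup's share of all patients, in percent.

    Assumes one row per patient and that the subgroups together
    make up the full patient population.
    """
    sizes = pd.Series({name: len(df) for name, df in subgroups.items()})
    return (sizes / sizes.sum() * 100).round(2)

# Toy frames standing in for the subgroup CSV files (illustrative only)
toy = {
    "subgroup_1": pd.DataFrame({"hypertension": [1] * 60}),
    "subgroup_2": pd.DataFrame({"hypertension": [0] * 37}),
    "subgroup_3": pd.DataFrame({"hypertension": [1] * 3}),
}

shares = subgroup_shares(toy)
# Keep only the subgroups that hold more than 5% of all patients
above_threshold = shares[shares > 5].index.tolist()
print(shares)
print(above_threshold)
```

With the real data, you would pass a dictionary of the 7 loaded dataframes in place of `toy`.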

In [15]:
check_sub1 = pd.read_csv("../processed_data/subgroup_1.csv")
check_sub2 = pd.read_csv("../processed_data/subgroup_2.csv")
check_sub3 = pd.read_csv("../processed_data/subgroup_3.csv")
check_sub4 = pd.read_csv("../processed_data/subgroup_4.csv")
check_sub5 = pd.read_csv("../processed_data/subgroup_5.csv")
check_sub6 = pd.read_csv("../processed_data/subgroup_6.csv")
check_sub7 = pd.read_csv("../processed_data/subgroup_7.csv")

In [16]:
# helper function: convert comorbidity flags (0 = absent, 1 = present) into the percentage of patients with each comorbidity
def convert2percentage(check_df):
    sub_num = check_df[
        ['congestive_heart_failure', 'cardiac_arrhythmias', 
         'valvular_disease', 'pulmonary_circulation', 
         'peripheral_vascular', 'hypertension',
         'paralysis', 'other_neurological',
         'chronic_pulmonary', 'diabetes_uncomplicated',
         'diabetes_complicated', 'hypothyroidism',
         'renal_failure', 'liver_disease',
         'peptic_ulcer', 'aids',
         'lymphoma', 'metastatic_cancer',
         'solid_tumor', 'rheumatoid_arthritis',
         'coagulopathy', 'obesity',
         'weight_loss', 'fluid_electrolyte',
         'blood_loss_anemia', 'deficiency_anemias',
         'alcohol_abuse', 'drug_abuse',
         'psychoses', 'depression']]
        
    percentage_results = {}
    
    for col in sub_num.columns:
        percentages = np.round(sub_num[col].sum() / check_df.shape[0], 4) * 100
        percentage_results[col] = percentages
    return percentage_results
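
The printed results further down contain values such as `15.479999999999999` because the helper rounds the fraction first and multiplies by 100 afterwards. A possible vectorized alternative, shown here only as a sketch, scales first and rounds last. The function name `convert2percentage_vectorized` and the toy dataframe are hypothetical, and the results should agree with the helper above to two decimals.

```python
import pandas as pd

def convert2percentage_vectorized(check_df, columns):
    """Percentage of rows with each 0/1 flag set, rounded to 2 decimals."""
    # The mean of a 0/1 column is the fraction of rows where the flag is 1
    return check_df[columns].mean().mul(100).round(2).to_dict()

# Toy data (illustrative only); subject_id is a hypothetical extra column
toy = pd.DataFrame({
    "hypertension": [1, 1, 1, 0],
    "obesity": [0, 1, 0, 0],
    "subject_id": [10, 11, 12, 13],
})

result = convert2percentage_vectorized(toy, ["hypertension", "obesity"])
print(result)
```

Passing the column list explicitly keeps non-comorbidity columns out of the calculation.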

In [17]:
# get the top 5 morbidities in the subgroup
def top5_morbidities(perc_dict):
    res = dict(sorted(perc_dict.items(), key = lambda x: x[1], reverse = True)[: 5])
    return res
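
If you want to look at more or fewer than 5 morbidities, the same ranking can be done with `pandas.Series.nlargest`. This is a sketch of one option; the function name `top_n_morbidities` is hypothetical, and the sample percentages are a few of the Subgroup5 values from the table at the end of this notebook.

```python
import pandas as pd

def top_n_morbidities(perc_dict, n=5):
    """Return the n highest-percentage morbidities, largest first."""
    return pd.Series(perc_dict).nlargest(n).to_dict()

# A few Subgroup5 percentages taken from the final table in this notebook
perc = {
    "congestive_heart_failure": 7.6,
    "cardiac_arrhythmias": 23.68,
    "hypertension": 93.53,
    "chronic_pulmonary": 15.48,
    "diabetes_uncomplicated": 34.19,
}

top3 = top_n_morbidities(perc, n=3)
print(top3)
```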

#### PLEASE READ:
From this point on, your analysis will differ slightly from ours. That is okay! We are going to check the patterns (the top 5 morbidities in each group) and match them to the group characteristics described in the research paper (1. Cardiopulmonary, 2. Young, 3. Hepatic & Addiction, 4. Complicated Diabetics, 5. Uncomplicated Diabetics, 6. Cardiac)!

In [18]:
# check where our subgroup 1 belongs to
perc_sub1 = convert2percentage(check_sub1)
print(top5_morbidities(perc_sub1))

# this looks like group 5 (UNCOMPLICATED DIABETES), so store the percentages as sub5
sub5 = perc_sub1

{'hypertension': 93.53, 'diabetes_uncomplicated': 34.19, 'cardiac_arrhythmias': 23.68, 'chronic_pulmonary': 15.479999999999999, 'fluid_electrolyte': 13.639999999999999}


In [19]:
# check where our subgroup 2 belongs to
perc_sub2 = convert2percentage(check_sub2)
print(top5_morbidities(perc_sub2))

# this looks like group 6 (CARDIAC), so store the percentages as sub6
sub6 = perc_sub2

{'hypertension': 66.71000000000001, 'cardiac_arrhythmias': 65.81, 'congestive_heart_failure': 49.07, 'fluid_electrolyte': 39.12, 'diabetes_uncomplicated': 24.779999999999998}


In [20]:
# check where our subgroup 3 belongs to
perc_sub3 = convert2percentage(check_sub3)
print(top5_morbidities(perc_sub3))

# this looks like group 2 (YOUNG), so store the percentages as sub2
sub2 = perc_sub3

{'alcohol_abuse': 19.42, 'fluid_electrolyte': 18.07, 'drug_abuse': 15.25, 'depression': 14.580000000000002, 'other_neurological': 11.700000000000001}


In [21]:
# check where our subgroup 4 belongs to
perc_sub4 = convert2percentage(check_sub4)
print(top5_morbidities(perc_sub4))

# this looks like group 1 (CARDIOPULMONARY), so store the percentages as sub1
sub1 = perc_sub4

{'chronic_pulmonary': 92.78999999999999, 'congestive_heart_failure': 75.05, 'pulmonary_circulation': 67.10000000000001, 'hypertension': 65.92, 'cardiac_arrhythmias': 63.17}


In [22]:
# check where our subgroup 5 belongs to
perc_sub5 = convert2percentage(check_sub5)
print(top5_morbidities(perc_sub5))

# Not sure what this group is; it does not seem to match any of the 6 given categories.
# The number of patients in this group is greater than 5% of the total,
# but given the randomness of LCA, we did not include it in our analysis.
# You may see different values here.

{'cardiac_arrhythmias': 26.179999999999996, 'fluid_electrolyte': 17.669999999999998, 'chronic_pulmonary': 17.05, 'congestive_heart_failure': 12.26, 'metastatic_cancer': 10.620000000000001}


In [23]:
# check where our subgroup 6 belongs to
perc_sub6 = convert2percentage(check_sub6)
print(top5_morbidities(perc_sub6))

# this looks like group 4 (COMPLICATED DIABETES), so store the percentages as sub4
sub4 = perc_sub6

{'diabetes_complicated': 98.16, 'hypertension': 81.77, 'renal_failure': 56.28999999999999, 'congestive_heart_failure': 44.379999999999995, 'fluid_electrolyte': 37.2}


In [24]:
# check where our subgroup 7 belongs to
perc_sub7 = convert2percentage(check_sub7)
print(top5_morbidities(perc_sub7))

# this looks like group 3 (HEPATIC & ADDICTION), so store the percentages as sub3
sub3 = perc_sub7

{'liver_disease': 60.47, 'fluid_electrolyte': 59.67, 'alcohol_abuse': 44.93, 'hypertension': 41.449999999999996, 'coagulopathy': 40.28}


Now, we have 6 groups selected! <br>
At this point, you will have the 6 percentage dictionaries named below (make sure you assign your LCA groups to the right names):
- sub1: Cardiopulmonary
- sub2: Young
- sub3: Hepatic & Addiction
- sub4: Complicated Diabetes
- sub5: Uncomplicated Diabetes
- sub6: Cardiac
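
Because the class numbers change between LCA runs, it can help to keep the assignment in one place instead of spreading it over several cells. The sketch below is an optional pattern, not the notebook's own method. The names `lca_class_to_name` and `label_subgroups` are hypothetical, and the mapping mirrors the assignments made in this particular run (class 5 is left out because it was not assigned to any group).

```python
# Mapping from LCA class number to subgroup name, as assigned in this run.
# Your own run will likely need a different mapping.
lca_class_to_name = {
    4: "Cardiopulmonary",
    3: "Young",
    7: "Hepatic & Addiction",
    6: "Complicated Diabetes",
    1: "Uncomplicated Diabetes",
    2: "Cardiac",
}

def label_subgroups(perc_by_class, class_to_name):
    """Re-key {lca_class: percentages} by subgroup name.

    Classes that are not in class_to_name are skipped.
    """
    missing = sorted(set(class_to_name) - set(perc_by_class))
    if missing:
        raise KeyError(f"No percentages found for LCA classes: {missing}")
    return {name: perc_by_class[cls] for cls, name in class_to_name.items()}

# Toy percentages for 7 classes (illustrative only)
toy_perc_by_class = {k: {"hypertension": float(k)} for k in range(1, 8)}

labeled = label_subgroups(toy_perc_by_class, lca_class_to_name)
print(list(labeled))
```

With the real results, `perc_by_class` would be something like `{1: perc_sub1, 2: perc_sub2, ...}`.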

### Here is our version of Supplementary Table 3 from the research paper

In [25]:
# sub1 to sub6 are the percentage dictionaries assigned in the cells above
total_subs_perc_dict = {
    'Subgroup1': sub1,
    'Subgroup2': sub2,
    'Subgroup3': sub3,
    'Subgroup4': sub4,
    'Subgroup5': sub5,
    'Subgroup6': sub6
}

In [26]:
percentage_df = pd.DataFrame(total_subs_perc_dict)
percentage_df

| Comorbidity | Subgroup1 | Subgroup2 | Subgroup3 | Subgroup4 | Subgroup5 | Subgroup6 |
|---|---|---|---|---|---|---|
| congestive_heart_failure | 75.05 | 0.74 | 12.41 | 44.38 | 7.6 | 49.07 |
| cardiac_arrhythmias | 63.17 | 4.66 | 15.42 | 35.54 | 23.68 | 65.81 |
| valvular_disease | 38.05 | 0.76 | 2.88 | 11.66 | 6.55 | 19.75 |
| pulmonary_circulation | 67.1 | 1.35 | 3.71 | 0.98 | 1.07 | 0.92 |
| peripheral_vascular | 14.77 | 0.86 | 2.6 | 20.14 | 7.83 | 11.45 |
| hypertension | 65.92 | 4.12 | 41.45 | 81.77 | 93.53 | 66.71 |
| paralysis | 0.92 | 2.98 | 2.67 | 1.35 | 3.01 | 3.3 |
| other_neurological | 7.51 | 11.7 | 18.61 | 8.84 | 7.03 | 11.4 |
| chronic_pulmonary | 92.79 | 5.76 | 17.54 | 14.61 | 15.48 | 17.83 |
| diabetes_uncomplicated | 28.62 | 0.78 | 18.43 | 0.0 | 34.19 | 24.78 |


### Does it seem reasonable?
Compare your results with final_result.pdf in our Team 2 GitHub repository!<br>
Explore the rest of the eda_notebooks folder for the age distribution within each group, the subgroups' circular plots, and more!