In [2]:
import pandas as pd
from medpredictor import Graph, Config, DataFrameOperations, Utils, MetricsDisplay

### **Setting up the dataframe**

In [8]:
df = pd.read_csv(Config.data_cleaned_path)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 253680 entries, 0 to 253679
Data columns (total 22 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   Diabetes_012          253680 non-null  object 
 1   HighBP                253680 non-null  object 
 2   HighChol              253680 non-null  object 
 3   CholCheck             253680 non-null  object 
 4   BMI                   253680 non-null  float64
 5   Smoker                253680 non-null  object 
 6   Stroke                253680 non-null  object 
 7   HeartDiseaseorAttack  253680 non-null  object 
 8   PhysActivity          253680 non-null  object 
 9   Fruits                253680 non-null  object 
 10  Veggies               253680 non-null  object 
 11  HvyAlcoholConsump     253680 non-null  object 
 12  AnyHealthcare         253680 non-null  object 
 13  NoDocbcCost           253680 non-null  object 
 14  GenHlth               253680 non-null  object 
 15  

In [9]:
df.head()

Unnamed: 0,Diabetes_012,HighBP,HighChol,CholCheck,BMI,Smoker,Stroke,HeartDiseaseorAttack,PhysActivity,Fruits,...,AnyHealthcare,NoDocbcCost,GenHlth,MentHlth,PhysHlth,DiffWalk,Sex,Age,Education,Income
0,No diabetes,Yes,Yes,Yes,40.0,Yes,No,No,No,No,...,Yes,No,Poor,18.0,15.0,Yes,Female,60 to 64,High school graduate,"$15,000 to less than $20,000"
1,No diabetes,No,No,No,25.0,Yes,No,No,Yes,No,...,No,Yes,Good,0.0,0.0,No,Female,50 to 54,College graduate,"Less than $10,000"
2,No diabetes,Yes,Yes,Yes,28.0,No,No,No,No,Yes,...,Yes,Yes,Poor,30.0,30.0,Yes,Female,60 to 64,High school graduate,"$75,000 or more"
3,No diabetes,Yes,No,Yes,27.0,No,No,No,Yes,Yes,...,Yes,No,Very good,0.0,0.0,No,Female,70 to 74,Some high school,"$35,000 to less than $50,000"
4,No diabetes,Yes,Yes,Yes,24.0,No,No,No,Yes,Yes,...,Yes,No,Very good,3.0,0.0,No,Female,70 to 74,Some college or technical school,"$20,000 to less than $25,000"


In [13]:
df.describe()

| | BMI | MentHlth | PhysHlth |
|---|---|---|---|
| count | 253680.0 | 253680.0 | 253680.0 |
| mean | 28.382364 | 3.184772 | 4.242081 |
| std | 6.608694 | 7.412847 | 8.717951 |
| min | 12.0 | 0.0 | 0.0 |
| 25% | 24.0 | 0.0 | 0.0 |
| 50% | 27.0 | 0.0 | 0.0 |
| 75% | 31.0 | 2.0 | 3.0 |
| max | 98.0 | 30.0 | 30.0 |


From these numeric features we can draw the following insights:

**Body Mass Index (BMI)**
* The mean for this index is ~28.38.
* The standard deviation for this feature is ~6.61. This is small relative to the mean, so the dispersion of the BMI values is fairly low.
* The minimum BMI recorded in this dataset is 12. This indicates that at least one respondent is underweight, which might require medical attention.
* The maximum BMI recorded in this dataset is 98. Such a value is very uncommon and could be due to a data entry error.
* The median (50%) is 27, so half of the respondents have a BMI of 27 or lower. This is above the upper limit of the 'normal' range (18.5 to 24.9) in the standard adult BMI categories and falls in the overweight range.
* 75% of the respondents have a BMI of 31 or lower, so roughly a quarter have a BMI of 31 or more. A BMI of 31 falls within the obese category (30 or above), which can increase health risks.

**Days of poor mental health (MentHlth)**
* The mean for this feature is ~3.18. 
* The standard deviation for this feature is ~7.41, indicating high variability in the number of poor mental health days among respondents.
* The high standard deviation suggests that while many respondents have few poor mental health days, a subset experiences significantly more, potentially requiring targeted mental health interventions.
* The minimum number of days recorded in this dataset is 0. Since the median is also 0, at least half of the respondents reported no days of poor mental health.
* The maximum number of days recorded in this dataset is 30. This indicates that at least one respondent reported poor mental health on all 30 days and may need mental health support.
* 75% of the respondents reported 2 or fewer days of poor mental health. This is positive, because it indicates that most respondents are in good mental health.
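
A mean of ~3.18 next to a median of 0 points to a right-skewed distribution: most respondents report zero days while a minority report many. A minimal sketch of how this could be quantified follows. `summarize_days` is a hypothetical helper, and `sample` is made-up illustrative data, not values taken from the dataset.

```python
from statistics import mean, median


def summarize_days(days):
    """Summarize a list of day counts (0 to 30) for a quick skew check."""
    n = len(days)
    return {
        "mean": mean(days),
        "median": median(days),
        "share_zero": sum(d == 0 for d in days) / n,  # no poor days at all
        "share_max": sum(d == 30 for d in days) / n,  # poor days all month
    }


# Made-up sample for illustration only (not taken from the dataset).
sample = [0, 0, 0, 0, 0, 0, 2, 5, 30, 30]
summary = summarize_days(sample)
print(summary)
```

On the real data, `summarize_days(df["MentHlth"].tolist())` would report the same quantities; a mean well above the median, together with a large share of zeros, would confirm the skew.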

**Physical illness or injury days in past 30 days (PhysHlth)**
* The mean is ~4.24, so on average respondents experienced physical illness or injury for about 4 days in the past month, which might be a concern depending on the nature and severity of these conditions.
* The standard deviation for this feature is ~8.72. This is roughly twice the mean, so the dispersion of the data is high.
* The minimum number of days recorded in this dataset is 0. Since the median is also 0, at least half of the respondents reported no days of illness or injury, which is positive.
* The maximum number of days recorded in this dataset is 30. This indicates that at least one respondent suffered an illness or injury on all 30 days.
* 75% of the respondents reported experiencing 3 or fewer days of physical illness or injury. 
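
Whether a standard deviation counts as low or high depends on the scale of the feature. One way to put the three features on a common footing is the coefficient of variation (standard deviation divided by mean). The sketch below applies it to the means and standard deviations from `df.describe()`. Using this ratio is a suggestion, not something the analysis above relies on, and `coefficient_of_variation` is a hypothetical helper.

```python
def coefficient_of_variation(mean: float, std: float) -> float:
    """Return std / mean, a unit-free measure of relative dispersion."""
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for mean 0")
    return std / mean


# (mean, std) pairs copied from the df.describe() output above.
stats = {
    "BMI": (28.382364, 6.608694),
    "MentHlth": (3.184772, 7.412847),
    "PhysHlth": (4.242081, 8.717951),
}

cv = {name: coefficient_of_variation(m, s) for name, (m, s) in stats.items()}
for name, value in cv.items():
    print(f"{name}: {value:.2f}")
# BMI: 0.23
# MentHlth: 2.33
# PhysHlth: 2.06
```

The BMI spread is about a quarter of its mean, while for both day-count features the standard deviation is more than twice the mean, which matches the low versus high dispersion described above.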

### **Utils**

In [12]:
# code section to add new functions

### **Analysis of respondents' demographic features**