Importing the dataset

In [8]:
from datasets import load_dataset
import numpy as np
import pandas as pd

dataset = load_dataset("mstz/heart_failure", split="train")
df = pd.DataFrame(dataset)

The following code splits the dataframe in two: one part holds the participants who passed away during the study (deceased), and the other holds the participants who stayed alive for the duration of the study (survived). It then prints the average age of each group.

In [9]:
deceased = df[df['is_dead']==1]
survived = df[df['is_dead']==0]
print(f'Participants who passed away throughout the study on average were {deceased["age"].mean():.2f} years old')
print(f"Participants who stayed alive during the study's duration had an average age of {survived['age'].mean():.2f} years")

Participants who passed away throughout the study on average were 65.21 years old
Participants who stayed alive during the study's duration had an average age of 58.76 years
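The same comparison can also be written as a single groupby instead of two boolean masks. The sketch below is only an illustration: it uses a tiny made-up frame (`toy`, with invented ages) so that it runs without downloading the dataset, and it assumes the `age` and `is_dead` column names used above.

```python
import pandas as pd

# Toy stand-in for df. The values are invented for illustration only;
# they are not taken from the heart failure dataset.
toy = pd.DataFrame({
    "age": [70, 60, 50, 40],
    "is_dead": [1, 1, 0, 0],
})

# Group by outcome and average the age within each group.
mean_age_by_outcome = toy.groupby("is_dead")["age"].mean()
print(mean_age_by_outcome)
```

Applied to the real `df`, the same call should return one mean per value of `is_dead`, which keeps the two groups in a single result that is easy to extend to other columns.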


We checked that each column of the dataframe holds the expected type of data. In summary, there are 5 boolean columns and 8 numerical columns (3 of type int64 and 5 of type float64). We also confirmed that there are no missing values: every column has 299 non-null values, one for each of the 299 rows.

In [10]:
df.info()
df.head(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
 #   Column                                           Non-Null Count  Dtype  
---  ------                                           --------------  -----  
 0   age                                              299 non-null    int64  
 1   has_anaemia                                      299 non-null    bool   
 2   creatinine_phosphokinase_concentration_in_blood  299 non-null    float64
 3   has_diabetes                                     299 non-null    bool   
 4   heart_ejection_fraction                          299 non-null    float64
 5   has_high_blood_pressure                          299 non-null    bool   
 6   platelets_concentration_in_blood                 299 non-null    float64
 7   serum_creatinine_concentration_in_blood          299 non-null    float64
 8   serum_sodium_concentration_in_blood              299 non-null    float64
 9   is_male                         

Unnamed: 0,age,has_anaemia,creatinine_phosphokinase_concentration_in_blood,has_diabetes,heart_ejection_fraction,has_high_blood_pressure,platelets_concentration_in_blood,serum_creatinine_concentration_in_blood,serum_sodium_concentration_in_blood,is_male,is_smoker,days_in_study,is_dead
0,75,False,582.0,False,20.0,True,265000.0,1.9,130.0,True,False,4,1
1,55,False,7861.0,False,38.0,False,263358.03,1.1,136.0,True,False,6,1
2,65,False,146.0,False,20.0,False,162000.0,1.3,129.0,True,True,7,1
3,50,True,111.0,False,20.0,False,210000.0,1.9,137.0,True,False,7,1
4,65,True,160.0,True,20.0,False,327000.0,2.7,116.0,False,False,8,1
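The dtype and missing-value checks described above can also be done programmatically rather than by reading the `info()` output. The sketch below rebuilds only the five rows shown by `df.head(5)` (the name `sample` is hypothetical) so that it runs offline; on the full `df` the same two calls should report the counts for all 299 rows.

```python
import pandas as pd

# The five rows printed by df.head(5), retyped by hand.
sample = pd.DataFrame({
    "age": [75, 55, 65, 50, 65],
    "has_anaemia": [False, False, False, True, True],
    "creatinine_phosphokinase_concentration_in_blood": [582.0, 7861.0, 146.0, 111.0, 160.0],
    "has_diabetes": [False, False, False, False, True],
    "heart_ejection_fraction": [20.0, 38.0, 20.0, 20.0, 20.0],
    "has_high_blood_pressure": [True, False, False, False, False],
    "platelets_concentration_in_blood": [265000.0, 263358.03, 162000.0, 210000.0, 327000.0],
    "serum_creatinine_concentration_in_blood": [1.9, 1.1, 1.3, 1.9, 2.7],
    "serum_sodium_concentration_in_blood": [130.0, 136.0, 129.0, 137.0, 116.0],
    "is_male": [True, True, True, True, False],
    "is_smoker": [False, False, True, False, False],
    "days_in_study": [4, 6, 7, 7, 8],
    "is_dead": [1, 1, 1, 1, 1],
})

# Number of columns per dtype (dtype names converted to plain strings).
dtype_counts = sample.dtypes.astype(str).value_counts()
print(dtype_counts)

# Total number of missing cells across the whole frame.
missing_total = int(sample.isna().sum().sum())
print(f"Missing values: {missing_total}")
```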


Basic statistics of the 8 numerical variables for the complete dataframe (by default, describe() leaves out the boolean columns):

In [4]:
df.describe()

Unnamed: 0,age,creatinine_phosphokinase_concentration_in_blood,heart_ejection_fraction,platelets_concentration_in_blood,serum_creatinine_concentration_in_blood,serum_sodium_concentration_in_blood,days_in_study,is_dead
count,299.0,299.0,299.0,299.0,299.0,299.0,299.0,299.0
mean,60.829431,581.839465,38.083612,263358.029264,1.39388,136.625418,130.26087,0.32107
std,11.894997,970.287881,11.834841,97804.236869,1.03451,4.412477,77.614208,0.46767
min,40.0,23.0,14.0,25100.0,0.5,113.0,4.0,0.0
25%,51.0,116.5,30.0,212500.0,0.9,134.0,73.0,0.0
50%,60.0,250.0,38.0,262000.0,1.1,137.0,115.0,0.0
75%,70.0,582.0,45.0,303500.0,1.4,140.0,203.0,1.0
max,95.0,7861.0,80.0,850000.0,9.4,148.0,285.0,1.0


The dataset contains a total of 96 smokers, of whom 92 are men and 4 are women.

In [14]:
print(df.groupby(['is_smoker']).size())
print(df.groupby(['is_smoker','is_male']).size())

is_smoker
False    203
True      96
dtype: int64
is_smoker  is_male
False      False      101
           True       102
True       False        4
           True        92
dtype: int64
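A cross-tabulation shows the same breakdown as one table with row and column totals. The sketch below is an illustration under stated assumptions: it rebuilds the two columns from the group sizes printed above (the names `counts`, `rebuilt`, `sex` and `smoking` are hypothetical), and it assumes that `is_male == False` means female, as the text above does.

```python
import pandas as pd

# Group sizes copied from the groupby output above:
# (is_smoker, is_male) -> number of participants.
counts = {
    (False, False): 101,
    (False, True): 102,
    (True, False): 4,
    (True, True): 92,
}

# Expand the counts back into one row per participant.
rows = [
    {"is_smoker": smoker, "is_male": male}
    for (smoker, male), n in counts.items()
    for _ in range(n)
]
rebuilt = pd.DataFrame(rows)

# Readable labels instead of True/False (the female mapping is an assumption).
sex = rebuilt["is_male"].map({True: "male", False: "female"})
smoking = rebuilt["is_smoker"].map({True: "smoker", False: "non-smoker"})

# margins=True adds an "All" row and column holding the totals.
table = pd.crosstab(smoking, sex, margins=True)
print(table)
```

With the real dataframe, `pd.crosstab(df["is_smoker"], df["is_male"], margins=True)` should give the same counts directly, without the rebuilding step.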
