# Heart Failure Prediction Analysis

## by Justin Sierchio

In this analysis, we will be looking at heart failure conditions. Ideally, we would like to be able to answer the following questions:

<ul>
    <li>Which patient characteristics are most strongly correlated with heart failure?</li>
    <li>Can we predict if a patient will have a heart failure incident?</li>
    <li>What other conclusions might we be able to draw from this analysis?</li>
</ul>

This data is in .csv file format and is from Kaggle at: https://www.kaggle.com/andrewmvd/heart-failure-clinical-data/download. More information related to the dataset can be found at: https://www.kaggle.com/andrewmvd/heart-failure-clinical-data.

## Notebook Initialization

In [1]:
# Import Relevant Libraries
import pandas as pd
import numpy as np
import seaborn as sns 
import matplotlib.pyplot as plt

print('Initial libraries loaded into workspace!')

Initial libraries loaded into workspace!


In [2]:
# Load the dataset for study
df_HEART = pd.read_csv("heart_failure_clinical_records_dataset.csv")

print('Datasets uploaded!')

Datasets uploaded!


In [3]:
# Display 1st 5 rows from Heart failure dataset
df_HEART.head()

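| | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75.0 | 0 | 582 | 0 | 20 | 1 | 265000.0 | 1.9 | 130 | 1 | 0 | 4 | 1 |
| 1 | 55.0 | 0 | 7861 | 0 | 38 | 0 | 263358.03 | 1.1 | 136 | 1 | 0 | 6 | 1 |
| 2 | 65.0 | 0 | 146 | 0 | 20 | 0 | 162000.0 | 1.3 | 129 | 1 | 1 | 7 | 1 |
| 3 | 50.0 | 1 | 111 | 0 | 20 | 0 | 210000.0 | 1.9 | 137 | 1 | 0 | 7 | 1 |
| 4 | 65.0 | 1 | 160 | 1 | 20 | 0 | 327000.0 | 2.7 | 116 | 0 | 0 | 8 | 1 |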


As a final step, let's list how the dataset defines each of the terms (using the Kaggle definitions).

<ul>
    <li> age: age of the patient</li>
    <li> anaemia: Decrease of red blood cells or hemoglobin (boolean).</li>
    <li> creatinine_phosphokinase: Level of the CPK enzyme in the blood (mcg/L)</li>
    <li> diabetes: If the patient has diabetes (boolean)</li>
    <li> ejection_fraction: Percentage of blood leaving the heart at each contraction (percentage).</li>
    <li> high_blood_pressure: If the patient has hypertension (boolean).</li>
    <li> platelets: Platelets in the blood (kiloplatelets/mL)</li>
    <li> serum_creatinine: Level of serum creatinine in the blood (mg/dL)</li>
    <li> serum_sodium: Level of serum sodium in the blood (mEq/L)</li>
    <li> sex: female (0) or male (1).</li>
    <li> smoking: If the patient smokes or not (boolean)</li>
    <li> time: Follow-up period (days). </li>
    <li> DEATH_EVENT: If the patient died during the follow-up period (boolean). </li>
</ul>
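
Several of the definitions above describe boolean fields, and `sex` is likewise coded as 0 or 1. As a minimal sketch, one way to confirm that such columns really hold only 0 and 1 is shown below. The helper name `non_binary_columns` and the constant `BINARY_COLUMNS` are hypothetical, and the small `sample` frame is just a stand-in built from the five rows displayed earlier, not the full dataset.

```python
import pandas as pd

# Columns the definitions describe as boolean, plus sex (coded 0/1).
# This list is an assumption drawn from the definitions, not part of the dataset.
BINARY_COLUMNS = ["anaemia", "diabetes", "high_blood_pressure",
                  "sex", "smoking", "DEATH_EVENT"]


def non_binary_columns(df, columns=BINARY_COLUMNS):
    """Return the listed columns that contain any value other than 0 or 1."""
    return [col for col in columns if not df[col].isin([0, 1]).all()]


# Stand-in frame copied from the five rows shown by df_HEART.head()
sample = pd.DataFrame({
    "anaemia": [0, 0, 0, 1, 1],
    "diabetes": [0, 0, 0, 0, 1],
    "high_blood_pressure": [1, 0, 0, 0, 0],
    "sex": [1, 1, 1, 1, 0],
    "smoking": [0, 0, 1, 0, 0],
    "DEATH_EVENT": [1, 1, 1, 1, 1],
})

# An empty list means every flag column holds only 0 or 1
print(non_binary_columns(sample))
```

Calling `non_binary_columns(df_HEART)` on the loaded dataset would apply the same check to all 299 rows.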

## Data Cleaning

Let's first make sure that the data is sufficiently cleaned for analysis.

In [4]:
# Find the shape of the data
df_HEART.shape

(299, 13)

So we can see that we have 299 different patients and 13 variables to view. Let's make sure that the dataset is actually complete.

In [5]:
# Find any 'NaN' or 'null' values
df_HEART.isnull().sum()

age                         0
anaemia                     0
creatinine_phosphokinase    0
diabetes                    0
ejection_fraction           0
high_blood_pressure         0
platelets                   0
serum_creatinine            0
serum_sodium                0
sex                         0
smoking                     0
time                        0
DEATH_EVENT                 0
dtype: int64

As we can see, the dataset is complete and has values for every row and column.
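
If this check needs to be repeated, for instance after reloading the file, it could be wrapped in a small helper that lists only the columns with gaps. The following is a rough sketch: the function name `missing_report` is hypothetical, and the `demo` frame deliberately blanks out one value so that the helper has something to report. It is not the real data.

```python
import numpy as np
import pandas as pd


def missing_report(df):
    """Return per-column missing-value counts, keeping only columns with gaps."""
    counts = df.isnull().sum()
    return counts[counts > 0]


# Illustrative frame: three columns from the dataset, with one
# serum_creatinine value removed on purpose to demonstrate the report.
demo = pd.DataFrame({
    "age": [75.0, 55.0, 65.0],
    "serum_creatinine": [1.9, np.nan, 1.3],
    "DEATH_EVENT": [1, 1, 1],
})

report = missing_report(demo)
print(report)
```

Given the all-zero counts shown above, `missing_report(df_HEART)` should come back empty for this dataset.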