# VADSTI 2022

# Module 4: Data Exploration and Visualization

# Exploring Pandas Exercise

### Dataset
This dataset contains the medical records of 299 heart failure patients, collected during their follow-up period. Each patient profile has 13 clinical features.

Attribute Information:

Thirteen (13) clinical features:

- age: age of the patient (years)
- anaemia: decrease of red blood cells or hemoglobin (boolean)
- high blood pressure: if the patient has hypertension (boolean)
- creatinine phosphokinase (CPK): level of the CPK enzyme in the blood (mcg/L)
- diabetes: if the patient has diabetes (boolean)
- ejection fraction: percentage of blood leaving the heart at each contraction (percentage)
- platelets: platelets in the blood (kiloplatelets/mL)
- sex: woman or man (binary)
- serum creatinine: level of serum creatinine in the blood (mg/dL)
- serum sodium: level of serum sodium in the blood (mEq/L)
- smoking: if the patient smokes or not (boolean)
- time: follow-up period (days)
- [target] death event: if the patient died during the follow-up period (boolean)
    
Data Source:
https://archive.ics.uci.edu/ml/datasets/Heart+failure+clinical+records

#### Load the dataset
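The loading step itself is not shown here, so the following is only a minimal sketch of how `df` could be created with `pd.read_csv`. To keep it runnable without the data file, it reads the first three records shown in this exercise from an in-memory string; `csv_text` is a stand-in introduced for illustration, and in the notebook you would pass the path of the downloaded CSV file instead.

```python
import io

import pandas as pd

# Stand-in for the real CSV file: the header and the first three records
# listed in this exercise. The real file name is not given in the notebook,
# so no path is assumed here.
csv_text = """age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
75.0,0,582,0,20,1,265000.00,1.9,130,1,0,4,1
55.0,0,7861,0,38,0,263358.03,1.1,136,1,0,6,1
65.0,0,146,0,20,0,162000.00,1.3,129,1,1,7,1
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)
```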

In [3]:
df

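| | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75.0 | 0 | 582 | 0 | 20 | 1 | 265000.00 | 1.9 | 130 | 1 | 0 | 4 | 1 |
| 1 | 55.0 | 0 | 7861 | 0 | 38 | 0 | 263358.03 | 1.1 | 136 | 1 | 0 | 6 | 1 |
| 2 | 65.0 | 0 | 146 | 0 | 20 | 0 | 162000.00 | 1.3 | 129 | 1 | 1 | 7 | 1 |
| 3 | 50.0 | 1 | 111 | 0 | 20 | 0 | 210000.00 | 1.9 | 137 | 1 | 0 | 7 | 1 |
| 4 | 65.0 | 1 | 160 | 1 | 20 | 0 | 327000.00 | 2.7 | 116 | 0 | 0 | 8 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 294 | 62.0 | 0 | 61 | 1 | 38 | 1 | 155000.00 | 1.1 | 143 | 1 | 1 | 270 | 0 |
| 295 | 55.0 | 0 | 1820 | 0 | 38 | 0 | 270000.00 | 1.2 | 139 | 0 | 0 | 271 | 0 |
| 296 | 45.0 | 0 | 2060 | 1 | 60 | 0 | 742000.00 | 0.8 | 138 | 0 | 0 | 278 | 0 |
| 297 | 45.0 | 0 | 2413 | 0 | 38 | 0 | 140000.00 | 1.4 | 140 | 1 | 1 | 280 | 0 |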


#### Code to randomly add 5% null values into the dataset. [DO NOT CHANGE]

In [6]:
import collections
import random
import numpy as np

replaced = collections.defaultdict(set)
ix = [(row, col) for row in range(df.shape[0]) for col in range(df.shape[1])]
random.shuffle(ix)
to_replace = int(round(.05*len(ix)))
for row, col in ix:
    if len(replaced[row]) < df.shape[1] - 1:
        df.iloc[row, col] = np.nan
        to_replace -= 1
        replaced[row].add(col)
        if to_replace == 0:
            break

#### Check for null values
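One common way to count missing values per column is `isnull().sum()`. The sketch below uses a small made-up frame (`toy` is not the real dataset) so it can be run on its own; in the exercise the same call would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values and a few missing entries.
toy = pd.DataFrame({
    "age": [75.0, np.nan, 65.0],
    "diabetes": [0.0, 1.0, np.nan],
    "time": [4, 6, 7],
})

# isnull() marks each missing cell as True; sum() counts the True values per column.
null_counts = toy.isnull().sum()
print(null_counts)
```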

age                         13
anaemia                     14
creatinine_phosphokinase    13
diabetes                    24
ejection_fraction           18
high_blood_pressure         13
platelets                   11
serum_creatinine            22
serum_sodium                13
sex                         11
smoking                     14
time                        10
DEATH_EVENT                 18
dtype: int64

#### Drop every row of the dataframe ```df``` that contains a null value, then reset the row index.
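A possible approach, sketched on made-up data, is to chain `dropna()` with `reset_index(drop=True)`. The names `toy` and `toy_clean` are illustrative only; in the exercise the result would be assigned back to `df`.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; rows 1 and 2 each contain one missing entry.
toy = pd.DataFrame({
    "age": [75.0, np.nan, 65.0, 50.0],
    "diabetes": [0.0, 1.0, np.nan, 0.0],
})

# dropna() removes every row that has at least one missing value.
# reset_index(drop=True) renumbers the remaining rows from 0 and
# discards the old index instead of keeping it as a new column.
toy_clean = toy.dropna().reset_index(drop=True)
print(toy_clean)
```

Without `drop=True`, the old row labels would be kept as an extra `index` column, which changes the column count.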

#### Display datatypes for each column
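The `dtypes` attribute lists the data type of each column. The sketch below uses a made-up two-column frame. By contrast, an all-zero listing of counts looks like the output of `isnull().sum()` once the null rows have been dropped, which is a useful check on the previous step but is a different call.

```python
import pandas as pd

# Toy frame with made-up values: one float column and one integer column.
toy = pd.DataFrame({
    "age": [75.0, 55.0],
    "time": [4, 6],
})

# dtypes is an attribute (no parentheses) that returns one dtype per column.
col_types = toy.dtypes
print(col_types)
```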

age                         0
anaemia                     0
creatinine_phosphokinase    0
diabetes                    0
ejection_fraction           0
high_blood_pressure         0
platelets                   0
serum_creatinine            0
serum_sodium                0
sex                         0
smoking                     0
time                        0
DEATH_EVENT                 0
dtype: int64

#### Display summary statistics of ```df```. Hint: use the ```describe()``` method.
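As a small illustration on made-up data, `describe()` returns one column of statistics per numeric column, with rows for count, mean, standard deviation, minimum, quartiles, and maximum.

```python
import pandas as pd

# Toy frame with made-up values.
toy = pd.DataFrame({
    "age": [40.0, 50.0, 60.0],
    "smoking": [0, 1, 0],
})

# describe() summarizes every numeric column.
stats = toy.describe()
print(stats)

# Individual statistics can be read with .loc[row_label, column_label].
print(stats.loc["mean", "age"])
```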

| | age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 | 150.0 | 150.0 | 150.0 | 150.0 | 150.0 | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 61.466667 | 0.406667 | 621.213333 | 0.453333 | 38.22 | 0.42 | 261466.9222 | 1.377867 | 136.7 | 0.626667 | 0.28 | 123.74 | 0.306667 |
| std | 11.205623 | 0.492857 | 1055.839331 | 0.499485 | 10.900323 | 0.495212 | 81245.145272 | 0.956419 | 4.237496 | 0.48531 | 0.450503 | 79.078708 | 0.462655 |
| min | 40.0 | 0.0 | 30.0 | 0.0 | 20.0 | 0.0 | 25100.0 | 0.5 | 113.0 | 0.0 | 0.0 | 4.0 | 0.0 |
| 25% | 53.25 | 0.0 | 115.75 | 0.0 | 30.0 | 0.0 | 216500.0 | 1.0 | 135.0 | 0.0 | 0.0 | 59.25 | 0.0 |
| 50% | 60.0 | 0.0 | 330.5 | 0.0 | 38.0 | 0.0 | 259500.0 | 1.1 | 137.0 | 1.0 | 0.0 | 107.5 | 0.0 |
| 75% | 68.0 | 1.0 | 582.0 | 1.0 | 45.0 | 1.0 | 302000.0 | 1.4 | 139.0 | 1.0 | 1.0 | 197.0 | 1.0 |
| max | 95.0 | 1.0 | 7861.0 | 1.0 | 70.0 | 1.0 | 543000.0 | 9.0 | 145.0 | 1.0 | 1.0 | 285.0 | 1.0 |


#### Create a new dataframe ```df_diab_blood_pressue```. This dataframe should contain only patients aged 40 to 55 who have both diabetes and high blood pressure. Note: a value of 1 = Yes and 0 = No.
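A row filter like this can be sketched with a boolean mask. The example assumes the age range is inclusive at both ends (40 and 55 both count), which the exercise does not state explicitly. The data is made up, and `toy` and `toy_selected` are illustrative names; in the exercise the mask would be applied to `df` and the result assigned to `df_diab_blood_pressue`.

```python
import pandas as pd

# Toy frame with made-up values.
toy = pd.DataFrame({
    "age": [45.0, 60.0, 50.0, 40.0, 55.0],
    "diabetes": [1, 1, 0, 1, 1],
    "high_blood_pressure": [1, 1, 1, 0, 1],
})

# between() is inclusive at both ends by default (assumption: 40 <= age <= 55).
# Each condition is wrapped in parentheses and combined with &, the
# element-wise "and" for pandas Series.
mask = (
    toy["age"].between(40, 55)
    & (toy["diabetes"] == 1)
    & (toy["high_blood_pressure"] == 1)
)
toy_selected = toy[mask]
print(toy_selected)
```

Python's `and` keyword does not work on Series, which is why `&` is used here.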

#### Return the shape of ```df_diab_blood_pressue``` (Hint: use the ```shape``` attribute)
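For reference, `shape` is an attribute rather than a method, so it is read without parentheses and returns a `(rows, columns)` tuple. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy frame with made-up values: 3 rows, 2 columns.
toy = pd.DataFrame({
    "age": [45.0, 50.0, 55.0],
    "diabetes": [1, 1, 1],
})

# shape is a (number_of_rows, number_of_columns) tuple and can be unpacked.
n_rows, n_cols = toy.shape
print(toy.shape)
```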

(5, 13)

#### Create another new dataframe ```df_platelets```. This dataframe should only consist of records with platelets over 200000.
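A single-condition filter follows the same masking pattern. The sketch below reads "over 200000" as strictly greater than, so a value of exactly 200000 is excluded. The data is made up, and `toy` and `toy_high_platelets` are illustrative names.

```python
import pandas as pd

# Toy frame with made-up values.
toy = pd.DataFrame({
    "age": [75.0, 65.0, 50.0, 65.0],
    "platelets": [265000.0, 162000.0, 200000.0, 327000.0],
})

# Keep only rows whose platelet count is strictly above 200000.
toy_high_platelets = toy[toy["platelets"] > 200000]
print(toy_high_platelets.shape)
```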

#### Return the shape of the ```df_platelets```

(121, 13)

#### Merge ```df_diab_blood_pressue``` and ```df_platelets``` into a new dataframe ```df_merge```
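The exercise does not say which kind of merge is expected, so the sketch below shows two options with `pd.merge` on made-up data. When no `on` argument is given, `pd.merge` joins on all columns the two frames share. An inner merge keeps only rows present in both frames, and an outer merge keeps rows present in either. The names `left`, `right`, `inner`, and `outer` are illustrative.

```python
import pandas as pd

# Two toy frames with made-up values and identical columns.
left = pd.DataFrame({"age": [45, 50], "platelets": [250000, 150000]})
right = pd.DataFrame({"age": [45, 62], "platelets": [250000, 300000]})

# Inner merge: only rows that appear in both frames.
inner = pd.merge(left, right, how="inner")

# Outer merge: rows that appear in at least one of the frames.
outer = pd.merge(left, right, how="outer")

print(len(inner), len(outer))
```

Which variant fits depends on whether `df_merge` is meant to hold patients who satisfy both filters (inner) or either filter (outer).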

#### Save dataframe ``df_merge`` as a ``./export/merged_file.csv`` file
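Writing a dataframe to CSV can be sketched with `to_csv`. The example uses a made-up frame (`toy`) and creates the `export` folder first, on the assumption that it may not exist yet; `index=False` keeps the row index out of the file.

```python
from pathlib import Path

import pandas as pd

# Toy frame with made-up values, standing in for the dataframe to save.
toy = pd.DataFrame({"age": [45, 62], "platelets": [250000, 300000]})

# Build the output path and make sure the target folder exists.
out_path = Path("export") / "merged_file.csv"
out_path.parent.mkdir(parents=True, exist_ok=True)

# index=False prevents the row index from being written as an extra column.
toy.to_csv(out_path, index=False)

# Read the file back to confirm what was written.
round_trip = pd.read_csv(out_path)
print(round_trip)
```

If the index is written and the file is read back later, it shows up as an extra column named `Unnamed: 0`.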