# Data processing


In the `why we love numpy` case we applied linear regression to a randomly generated dataset. We can apply the same principles to data coming from experiments or studies, as long as they are nicely structured in a 2D array format. Unfortunately, real-life data is often not nicely structured, so we need to manipulate the unstructured or messy data into a structured, clean form. We may need to drop rows and columns because they are not needed for the analysis or because they contain too many missing values. We may need to relabel columns or convert text values into numerical values. We may need to combine data from several sources. Cleaning and manipulating data into a structured form is called **data processing**. Data processing starts with data in its raw form and converts it into a more readable format (tables, graphs, etc.), giving it the form and context necessary to be interpreted by computers and used by people.
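As a minimal sketch of these operations, the example below builds a small hypothetical table (all column names and values are invented for illustration and are not part of the course data). It drops an unusable column, relabels columns, converts text to numbers, drops an incomplete row and combines two sources.

```python
import numpy as np
import pandas as pd

# Hypothetical raw measurements; names and values are made up for illustration
raw = pd.DataFrame({
    "Patient ID": [1, 2, 3, 4],
    "weight (kg)": ["70", "82", "n/a", "65"],   # numbers stored as text
    "notes": [np.nan, np.nan, np.nan, "ok"],    # mostly missing
})
# A second hypothetical source with one extra attribute per patient
ages = pd.DataFrame({"patient_id": [1, 2, 3, 4], "age": [54, 61, 47, 70]})

# drop a column we cannot use and relabel the remaining columns
clean = raw.drop(columns=["notes"]).rename(
    columns={"Patient ID": "patient_id", "weight (kg)": "weight_kg"}
)
# convert text to numbers; entries that cannot be parsed become NaN
clean["weight_kg"] = pd.to_numeric(clean["weight_kg"], errors="coerce")
# drop rows without a usable weight
clean = clean.dropna(subset=["weight_kg"])
# combine the two sources on the shared key
clean = clean.merge(ages, on="patient_id", how="left")
print(clean)
```

Each step returns a new dataframe, so the raw table stays available for comparison.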


In previous courses you learned the basics of programming in Python and object-oriented Python. In this course we use Python with the libraries `NumPy` and `Pandas`. These are high-performance libraries that are especially suitable for data manipulation and computation.

# Data processing example: heart failure case

Cardiovascular diseases kill approximately 17 million people globally every year, and they mainly exhibit as myocardial infarctions and heart failures. Heart failure occurs when the heart cannot pump enough blood to meet the needs of the body. Available electronic medical records of patients quantify symptoms, body features, and clinical laboratory test values, which can be used to perform biostatistics analysis aimed at highlighting patterns and correlations otherwise undetectable by medical doctors. Machine learning can predict patients’ survival from their data and can identify the most important features among those included in their medical records [1]. As a data scientist you are required to inspect whether the data can be used for modelling and to select the most important features for predicting the patient's survival. The data for the analysis is available in `heart_failure_clinical_records_dataset.csv`. The data description can be found in the table `data_description.csv`.

[1] https://doi.org/10.1186/s12911-020-1023-5

In [1]:
import pandas as pd
import numpy as np

## Step 1: Inspect the data

The first step is to inspect the data and get an idea of the meaning, format and units of the variables.

In [2]:
# load and display the meta data, the data that describes the data
md = pd.read_csv('data/data_description.csv', sep=';')
md

Unnamed: 0,Feature,Explanation,Measurement
0,Age,Age of patient,years
1,Anaemia,Decrease of red blood cells or hemoglobin,Boolean
2,High blood pressure,If a patient has hypertension,Boolean
3,Creatinine phosphokinase,Level of the CPK enzyme in the blood,mcg/L
4,Diabetes,If the patient has diabetes,Boolean
5,Ejection fraction,Percentage of blood leaving the heart at each ...,Percentage
6,Sex,Woman or Man,Binary
7,Platelets,Platelets in the blood,kiloplatelets/mL
8,Serum creatinine,Level of creatinine in the blood,mg/dL
9,Serum sodium,Level of sodium in the blood,mEq/L


The death event is the class variable that will be used to predict survival. The variable `DEATH_EVENT` is a boolean: if `DEATH_EVENT` is 1 (True) the patient died, and if it is 0 (False) the patient survived.

In [3]:
# load and display data 
df = pd.read_csv('data/heart_failure_clinical_records_dataset.csv')
print(f'this dataset contains {len(df)} rows')
df.head(5)

this dataset contains 299 rows


Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,75.0,0,582,0,20,1,265000.0,1.9,130,1.0,0.0,4,1
1,55.0,0,7861,0,38,0,263358.03,1.1,136,1.0,,6,1
2,65.0,0,146,0,20,0,162000.0,1.3,129,1.0,1.0,7,1
3,50.0,1,111,0,20,0,210000.0,1.9,137,1.0,0.0,7,1
4,65.0,1,160,1,20,0,327000.0,2.7,116,0.0,0.0,8,1


In [4]:
list(df.columns)

['age',
 'anaemia',
 'creatinine_phosphokinase',
 'diabetes',
 'ejection_fraction',
 'high_blood_pressure',
 'platelets',
 'serum_creatinine',
 'serum_sodium',
 'sex',
 'smoking',
 'time',
 'DEATH_EVENT']

Note that the column names in the meta data differ slightly from those in the clinical records, and the order is different as well. We must take that into account if we want to use the meta data to select a subset of the clinical records.
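One way to bridge the two naming schemes is to derive a column-style name from each feature name in the meta data. The sketch below uses a few inline rows that mirror the tables above instead of reading the files. The helper column `column_name` and the variable names are assumptions introduced for illustration.

```python
import pandas as pd

# A few rows mirroring the meta data shown above (not read from file)
md_demo = pd.DataFrame({
    "Feature": ["Age", "High blood pressure", "Creatinine phosphokinase", "Serum sodium"],
    "Measurement": ["years", "Boolean", "mcg/L", "mEq/L"],
})
# A subset of the column names as they appear in the clinical records
record_columns = ["age", "anaemia", "creatinine_phosphokinase",
                  "high_blood_pressure", "serum_sodium"]

# lower-case the feature names and replace spaces by underscores
md_demo["column_name"] = (
    md_demo["Feature"].str.strip().str.lower().str.replace(" ", "_", regex=False)
)

# select the boolean features, but only those that exist in the records
boolean_columns = [
    name
    for name, unit in zip(md_demo["column_name"], md_demo["Measurement"])
    if unit == "Boolean" and name in record_columns
]
# record columns for which this demo meta data has no description
unmatched = sorted(set(record_columns) - set(md_demo["column_name"]))
print(boolean_columns)
print(unmatched)
```

Checking for unmatched names makes it visible when a description is missing or spelled differently, instead of silently selecting fewer columns.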

### Missing data

Looking at the dataframe values we also see NaN in the column smoking. This means that the data contains missing values. Let us inspect them.

In [5]:
# first inspect missing data
df.isnull().sum()

age                         0
anaemia                     0
creatinine_phosphokinase    0
diabetes                    0
ejection_fraction           0
high_blood_pressure         0
platelets                   0
serum_creatinine            0
serum_sodium                0
sex                         1
smoking                     1
time                        0
DEATH_EVENT                 0
dtype: int64

The columns sex and smoking each have one missing value. When a column has a lot of missing data we can consider dropping the column from the dataframe. In this case only a few values are missing, so we can either fill them with a guessed value or drop the affected rows.
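Filling with a guessed value, as opposed to dropping rows, can be sketched as follows. The small frame is hypothetical, and using the most frequent value (the mode) as the guess is an assumption; whether that is appropriate depends on the analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical binary column with one missing value
demo = pd.DataFrame({"smoking": [0.0, np.nan, 1.0, 0.0, 0.0]})

# most common value among the non-missing entries
most_frequent = demo["smoking"].mode().iloc[0]
# fill the gap instead of removing the row
filled = demo.assign(smoking=demo["smoking"].fillna(most_frequent))

print(f"rows kept when filling: {len(filled)}")
print(f"rows kept when dropping: {len(demo.dropna())}")
```

Filling keeps all rows at the cost of introducing a value that was never measured, while dropping keeps only observed values at the cost of fewer rows.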

In [6]:
df = df.dropna(axis = 0) # drop NaN rows
print(f'this dataset contains {len(df)} rows')
df.isnull().sum()

this dataset contains 297 rows


age                         0
anaemia                     0
creatinine_phosphokinase    0
diabetes                    0
ejection_fraction           0
high_blood_pressure         0
platelets                   0
serum_creatinine            0
serum_sodium                0
sex                         0
smoking                     0
time                        0
DEATH_EVENT                 0
dtype: int64

In [7]:
df.head()

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,75.0,0,582,0,20,1,265000.0,1.9,130,1.0,0.0,4,1
2,65.0,0,146,0,20,0,162000.0,1.3,129,1.0,1.0,7,1
3,50.0,1,111,0,20,0,210000.0,1.9,137,1.0,0.0,7,1
4,65.0,1,160,1,20,0,327000.0,2.7,116,0.0,0.0,8,1
5,90.0,1,47,0,40,1,204000.0,2.1,132,1.0,1.0,8,1


Furthermore, we can see that all binary and boolean (yes/no) data is represented by either a zero or a one. It might be unclear what these values mean when plotting the data.

In [8]:
df['sex'].value_counts() 

1.0    193
0.0    104
Name: sex, dtype: int64

In the meta data the description of sex is "Woman or Man", so we might want to use those labels instead.

In [9]:
# reload the raw data (this also restores the rows that were dropped above)
df = pd.read_csv('data/heart_failure_clinical_records_dataset.csv')
df['sex'] = df['sex'].astype('category') # make the format categorical
df['sex'] = df['sex'].map({0:"Woman", 1: "Man"}) # map the values to the category
df['sex'].value_counts() 

Man      194
Woman    104
Name: sex, dtype: int64

### Inspect the datatypes

We changed the sex column to category, but what datatypes are the other columns? 

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   age                       299 non-null    float64 
 1   anaemia                   299 non-null    int64   
 2   creatinine_phosphokinase  299 non-null    int64   
 3   diabetes                  299 non-null    int64   
 4   ejection_fraction         299 non-null    int64   
 5   high_blood_pressure       299 non-null    int64   
 6   platelets                 299 non-null    float64 
 7   serum_creatinine          299 non-null    float64 
 8   serum_sodium              299 non-null    int64   
 9   sex                       298 non-null    category
 10  smoking                   298 non-null    float64 
 11  time                      299 non-null    int64   
 12  DEATH_EVENT               299 non-null    int64   
dtypes: category(1), float64(4), int64(8)
memory usage:

We know that some of the numeric columns should be booleans (logical). Let's change that.

In [11]:
# convert the 0/1 columns to booleans
# note: astype('bool') turns a missing value (NaN) into True, see smoking in row 1
df["anaemia"] = df["anaemia"].astype('bool')
df["high_blood_pressure"] = df["high_blood_pressure"].astype('bool')
df["diabetes"] = df["diabetes"].astype('bool')
df["smoking"] = df["smoking"].astype('bool')
df["DEATH_EVENT"] = df["DEATH_EVENT"].astype('bool')
df.head()

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,time,DEATH_EVENT
0,75.0,False,582,False,20,True,265000.0,1.9,130,Man,False,4,True
1,55.0,False,7861,False,38,False,263358.03,1.1,136,Man,True,6,True
2,65.0,False,146,False,20,False,162000.0,1.3,129,Man,True,7,True
3,50.0,True,111,False,20,False,210000.0,1.9,137,Man,False,7,True
4,65.0,True,160,True,20,False,327000.0,2.7,116,Woman,False,8,True
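Row 1 now shows `True` for smoking although the raw value was missing, because `astype('bool')` treats NaN as truthy. The sketch below shows the effect on a hypothetical series, together with one possible alternative: the nullable `'boolean'` dtype of pandas, which keeps the value missing. Whether to use it, or to drop or fill the missing values first, is a choice left to the analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical 0/1 column with one missing value
s = pd.Series([0.0, np.nan, 1.0])

as_bool = s.astype("bool")          # NaN silently becomes True
as_nullable = s.astype("boolean")   # NaN is kept as a missing value (<NA>)

print(as_bool.tolist())
print(as_nullable.isna().tolist())
```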


In [12]:
df['anaemia'].value_counts() 

False    170
True     129
Name: anaemia, dtype: int64

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 13 columns):
 #   Column                    Non-Null Count  Dtype   
---  ------                    --------------  -----   
 0   age                       299 non-null    float64 
 1   anaemia                   299 non-null    bool    
 2   creatinine_phosphokinase  299 non-null    int64   
 3   diabetes                  299 non-null    bool    
 4   ejection_fraction         299 non-null    int64   
 5   high_blood_pressure       299 non-null    bool    
 6   platelets                 299 non-null    float64 
 7   serum_creatinine          299 non-null    float64 
 8   serum_sodium              299 non-null    int64   
 9   sex                       298 non-null    category
 10  smoking                   299 non-null    bool    
 11  time                      299 non-null    int64   
 12  DEATH_EVENT               299 non-null    bool    
dtypes: bool(5), category(1), float64(3), int64(4)
memo

## Step 2: Explore data

It is useful to understand the range of the data. The method `describe` displays descriptive statistics of the numerical columns.

In [14]:
df.describe()

Unnamed: 0,age,creatinine_phosphokinase,ejection_fraction,platelets,serum_creatinine,serum_sodium,time
count,299.0,299.0,299.0,299.0,299.0,299.0,299.0
mean,60.833893,581.839465,38.083612,263358.029264,1.39388,136.625418,130.26087
std,11.894809,970.287881,11.834841,97804.236869,1.03451,4.412477,77.614208
min,40.0,23.0,14.0,25100.0,0.5,113.0,4.0
25%,51.0,116.5,30.0,212500.0,0.9,134.0,73.0
50%,60.0,250.0,38.0,262000.0,1.1,137.0,115.0
75%,70.0,582.0,45.0,303500.0,1.4,140.0,203.0
max,95.0,7861.0,80.0,850000.0,9.4,148.0,285.0


What we can see is that the ranges differ per feature. If we want to use the data for prediction we need to normalize the data later on. We can do that with NumPy. From the describe table we can also see that most of the features are not symmetrically distributed. Let us inspect the distributions by plotting.
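A quick numeric check of symmetry is to compare the mean with the median: in the table above, creatinine_phosphokinase has a mean of about 582 and a median of 250, which points to a long right tail. The sample skewness from `skew()` summarizes the same idea in one number. The sketch below uses synthetic data (an assumption, so that it runs without the dataset): one symmetric sample and one right-skewed sample.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "symmetric": rng.normal(loc=60, scale=10, size=2000),       # bell-shaped
    "right_skewed": rng.lognormal(mean=5, sigma=1, size=2000),  # long right tail
})

summary = pd.DataFrame({
    "mean": demo.mean(),
    "median": demo.median(),
    "skew": demo.skew(),   # near 0 for symmetric data, positive for a right tail
})
print(summary)
```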

## Step 3: Plotting the data

Plotting the data helps to answer questions such as: are the attributes independent of each other? Can we assume that the data is normally distributed? In the examples below the distributions of the numeric features are plotted, a bar plot is drawn to investigate the number of deaths over time, and a heatmap is created for the correlations between the attributes.


In [15]:
#import bokeh and direct the output to the notebook
from bokeh.io import output_notebook

In [16]:
output_notebook()

### Plot distributions of numeric values

In [17]:
#plot numeric values distributions
df_num = df.select_dtypes(include=['float64', 'int64'])

In [18]:
from bokeh.layouts import gridplot
from bokeh.plotting import figure, output_file, show

#use a function to generalize the plotting creation
def make_plot(title, hist, edges):
    p = figure(title=title, tools='', background_fill_color="#fafafa")
    p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],
           fill_color="navy", line_color="white", alpha=0.5)
    p.y_range.start = 0
    p.xaxis.axis_label = 'value'
    p.yaxis.axis_label = 'count'
    p.grid.grid_line_color="white"
    return p

# Distribution: one histogram per numeric column
g = []
for column in df_num.columns:
    hist, edges = np.histogram(df_num[column], bins=40)
    g.append(make_plot(column, hist, edges))


#output_file('histogram.html', title="distribution plots")
# plot_width/plot_height are the Bokeh 2.x argument names; Bokeh 3.x uses width/height
show(gridplot(g, ncols=4, plot_width=250, plot_height=250, toolbar_location=None))

### Plot number of deaths related to time

First we reorganize the data by creating a new table with the number of deaths per value of time.

In [19]:
grouped = pd.DataFrame(df.groupby('time')['DEATH_EVENT'].sum())
print(grouped.head(10))

      DEATH_EVENT
time             
4             1.0
6             1.0
7             2.0
8             2.0
10            6.0
11            2.0
12            0.0
13            1.0
14            2.0
15            2.0


Then we use the new table as the source for the bar plot.

In [20]:
p = figure(title="death events in time", plot_width=950, plot_height=300, toolbar_location=None)
p.vbar(x='time', top='DEATH_EVENT', width=1, source=grouped, color='black')
p.xaxis.axis_label = 'number of days'
p.yaxis.axis_label = 'number of deaths'
show(p)


### Heatmap

To investigate whether the attributes are independent of each other we first remove the class variable. Then we create a matrix of absolute correlation coefficients. We reshape this matrix into long format and wrap it in a ColumnDataSource object to be used for the heatmap plot.

In [21]:
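df = df.drop(['DEATH_EVENT'], axis=1)  # remove the class variable
# corr() needs numeric input, so the categorical 'sex' column is dropped explicitly
# (older pandas versions dropped it silently, pandas 2.x raises an error instead)
c = df.drop(columns=['sex']).corr().abs()
y_range = list(reversed(c.columns))
x_range = list(c.index)
c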

Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,smoking,time
age,1.0,0.088006,0.081584,0.101012,0.060098,0.093289,0.052354,0.159187,0.045966,0.015108,0.224068
anaemia,0.088006,1.0,0.190741,0.012729,0.031557,0.038182,0.043786,0.052174,0.041882,0.113222,0.141414
creatinine_phosphokinase,0.081584,0.190741,1.0,0.009639,0.04408,0.07059,0.024463,0.016408,0.05955,0.056099,0.009346
diabetes,0.101012,0.012729,0.009639,1.0,0.00485,0.012732,0.092193,0.046975,0.089551,0.15283,0.033726
ejection_fraction,0.060098,0.031557,0.04408,0.00485,1.0,0.024445,0.072177,0.011302,0.175902,0.067183,0.041729
high_blood_pressure,0.093289,0.038182,0.07059,0.012732,0.024445,1.0,0.049963,0.004935,0.037109,0.060816,0.196439
platelets,0.052354,0.043786,0.024463,0.092193,0.072177,0.049963,1.0,0.041198,0.062125,0.028158,0.010514
serum_creatinine,0.159187,0.052174,0.016408,0.046975,0.011302,0.004935,0.041198,1.0,0.189095,0.029373,0.149315
serum_sodium,0.045966,0.041882,0.05955,0.089551,0.175902,0.037109,0.062125,0.189095,1.0,0.003786,0.08764
smoking,0.015108,0.113222,0.056099,0.15283,0.067183,0.060816,0.028158,0.029373,0.003786,1.0,0.034234


In [22]:
# reshape the matrix into long format: one row per pair of attributes
dfc = pd.DataFrame(c.stack(), columns=['r']).reset_index()
dfc.head()

Unnamed: 0,level_0,level_1,r
0,age,age,1.0
1,age,anaemia,0.088006
2,age,creatinine_phosphokinase,0.081584
3,age,diabetes,0.101012
4,age,ejection_fraction,0.060098
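Besides feeding the heatmap, the long format makes it easy to rank attribute pairs by the strength of their correlation. The sketch below uses synthetic data in which `b` is constructed to depend on `a`. The column names `a`, `b`, `c` and the variable names are placeholders.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
a = rng.normal(size=500)
demo = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.5, size=500),  # strongly related to a
    "c": rng.normal(size=500),                     # independent noise
})

corr = demo.corr().abs()
long_form = corr.stack().reset_index()
long_form.columns = ["level_0", "level_1", "r"]

# keep each unordered pair once, which also drops the diagonal (self-correlation)
pairs = long_form[long_form["level_0"] < long_form["level_1"]]
pairs = pairs.sort_values("r", ascending=False)
print(pairs)
```

Pairs at the top of this list are candidates for closer inspection, since strongly correlated attributes carry overlapping information.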


In [23]:
#transfer to ColumnDataSource object
from bokeh.models import ColumnDataSource
source = ColumnDataSource(dfc)

In [24]:
#plot a heatmap
from bokeh.models import (BasicTicker, ColorBar, ColumnDataSource,
                          LinearColorMapper, PrintfTickFormatter,)
from bokeh.transform import transform
from bokeh.palettes import Viridis256

#create colormapper 
mapper = LinearColorMapper(palette=Viridis256, low=dfc.r.min(), high=dfc.r.max())

#create plot
p = figure(title="correlation heatmap", plot_width=500, plot_height=450,
           x_range=x_range, y_range=y_range, x_axis_location="above", toolbar_location=None)

#use mapper to fill the rectangles in the plot
p.rect(x="level_0", y="level_1", width=1, height=1, source=source,
       line_color=None, fill_color=transform('r', mapper))

#create and add colorbar to the right
color_bar = ColorBar(color_mapper=mapper, location=(0, 0),
                     ticker=BasicTicker(desired_num_ticks=len(x_range)), 
                     formatter=PrintfTickFormatter(format="%.1f"))
p.add_layout(color_bar, 'right')

#draw axis
p.axis.axis_line_color = None
p.axis.major_tick_line_color = None
p.axis.major_label_text_font_size = "10px"
p.axis.major_label_standoff = 0
p.xaxis.major_label_orientation = 1.0

#show
show(p)


## Step 4: Clean data

Based on the inspection and exploration of the data, it is decided to drop the column time; this feature will not be used for prediction. All the other variables will be used for further analysis. For computational convenience the numeric 0/1 encoding is used instead of booleans and categories. Furthermore, the data needs to be transformed and normalized.

In [25]:
import numpy as np

df = pd.read_csv('data/heart_failure_clinical_records_dataset.csv')
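df = df.dropna(axis=0)          # drop rows with missing values
df = df.drop(['time'], axis=1)  # drop the time column
# remove outliers: keep only rows in which every value lies within
# 3 standard deviations of its column mean
df = df[df.apply(lambda x: np.abs(x - x.mean()) / x.std() < 3).all(axis=1)]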

print(f'this dataset contains {len(df)} rows and {len(df.columns)} columns')
df.head()

this dataset contains 280 rows and 12 columns


Unnamed: 0,age,anaemia,creatinine_phosphokinase,diabetes,ejection_fraction,high_blood_pressure,platelets,serum_creatinine,serum_sodium,sex,smoking,DEATH_EVENT
0,75.0,0,582,0,20,1,265000.0,1.9,130,1.0,0.0,1
2,65.0,0,146,0,20,0,162000.0,1.3,129,1.0,1.0,1
3,50.0,1,111,0,20,0,210000.0,1.9,137,1.0,0.0,1
5,90.0,1,47,0,40,1,204000.0,2.1,132,1.0,1.0,1
6,75.0,1,246,0,15,0,127000.0,1.2,137,1.0,0.0,1
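The filter above applies the three-standard-deviation rule to every column, including the 0/1 columns. For a binary column with a rare class this can remove every row of that class. The sketch below shows the effect on synthetic data and restricts the rule to the continuous column. Which columns count as continuous is an assumption made here for illustration, and the helper `within_3_std` is hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 200
demo = pd.DataFrame({
    "value": rng.normal(loc=100, scale=10, size=n),   # continuous measurement
    "flag": np.r_[np.ones(10), np.zeros(n - 10)],     # binary, only 5% ones
})
demo.loc[0, "value"] = 500                            # one obvious outlier

def within_3_std(frame):
    """Boolean mask of rows where every column lies within 3 standard deviations."""
    z = (frame - frame.mean()) / frame.std()
    return (z.abs() < 3).all(axis=1)

all_columns = demo[within_3_std(demo)]                 # rule applied to all columns
continuous_only = demo[within_3_std(demo[["value"]])]  # rule applied to 'value' only

print(all_columns["flag"].sum(), continuous_only["flag"].sum())
```

Comparing the class counts before and after filtering is a cheap way to notice when an outlier rule removes more than intended.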


## Step 5: Split into feature matrix and class vector, and normalize the features

In [26]:
y = np.array(df['DEATH_EVENT'])   # class vector
X = np.array(df.iloc[:, 0:11])    # feature matrix: the first 11 columns
print(y.shape)
print(X.shape)
y = y.reshape(-1, 1)              # turn y into a column vector
print(y.shape)

(280,)
(280, 11)
(280, 1)


In [27]:
# normalize the data (standardize each feature to zero mean and unit variance)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler = scaler.fit(X)
X = scaler.transform(X)
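For reference, `StandardScaler` with its default settings subtracts the column mean and divides by the column standard deviation (computed with `ddof=0`). The same transformation can be sketched in plain NumPy. The small matrix below is hypothetical and only serves to show the arithmetic.

```python
import numpy as np

# Hypothetical feature matrix: 4 samples, 2 features on very different scales
X_demo = np.array([[40.0, 150000.0],
                   [55.0, 250000.0],
                   [70.0, 300000.0],
                   [95.0, 500000.0]])

mean = X_demo.mean(axis=0)         # per-column mean
std = X_demo.std(axis=0)           # per-column standard deviation (ddof=0)
X_scaled = (X_demo - mean) / std   # broadcasting applies this to every row

print(X_scaled.mean(axis=0).round(6))
print(X_scaled.std(axis=0).round(6))
```

After scaling, both columns are expressed in standard deviations from their own mean, so features with large raw values no longer dominate.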

We now have a cleaned, normalized feature matrix and a class variable vector. We have successfully prepared the dataset for machine learning algorithms that predict the heart failure death event.