# Actors

- Data Scientist: *The hospital employs him to stay in touch with current developments*
- Clinician: *She is a general practitioner, invited to join the discussions about the new setup of the Cardiovascular Disease Department*
- Cardiology expert: *An expert in cardiovascular disease who diagnoses patients day in, day out*

# Exploring and understanding data

After analyzing the different commercial solutions, taking into account the needs of the patients and the problems to be addressed in the department, it was decided to implement the Ultromics solution.

**Cardiology expert:** Perfect, so after discussing potential tools we have agreed that AI can be piloted within the unit to address some of the issues.

In addition to the Ultromics system, we want to look at a prevention system for General Practitioners (GPs) to identify early onset of cardiovascular disease. This is a preventive measure, but it can also help us with referral decisions and thereby improve the triaging of patients. I am wondering if we can really trust this system.

**Clinician:** Certainly, if we could predict cardiovascular diseases with some computer system, this might help us to prevent cardiovascular diseases in some patients, and it might reduce the number of patients who require intense treatments or supervision in the long term.

**Data scientist:** What we can do is examine the parts of the computer software and test them one by one. After that, we will see how well the system performs.

> Can you think about parts that make up such a system? Before you read on, try to recap the intro video and what you have read and heard about machine learning yourself.

> Take some notes below, so you can later compare.

---

**Here's some space for own remarks/ideas**

---

## AI solutions: A general structure

**Data scientist:** Let's sum up what our system will need to consist of. 
1.   It is supposed to read in the clinical information of a patient
2.   As a result, it should give us one or more numbers that we can interpret either as a risk, or as an indicator which test to take, or even as an indicator that a treatment is necessary.

So we are looking for a computer calculation that receives input and produces output. This is what we call an algorithm, and the algorithm for us will be one that does not directly produce the output, but that produces a model -- that's what machine learning is about. Later, we can use the model on the input data. 

---
**Data scientist:** How does the machine learning algorithm know what the model should look like?

---
**Data scientist:** It will learn this from examples. Right now, I will not explain to you how exactly machine learning generates a model from example data -- just wait one week, and you'll learn everything about it. But it is important to know that we need those examples.

So, to build a model capable of helping to prevent cardiovascular disease, we first need to collect and store data from previous patients that the system can learn from. The system should learn to predict whether someone is likely to develop a disease, so for these patients we need to know the outcome already! This will be our dataset: many patients, with as much information as we can get about them from the clinical systems, and most importantly, a label attached to each patient stating whether he or she developed the disease or not.

After that, in order to use the data efficiently, we need to analyze and prepare the data. I will show you how I do this in a minute. 

Then, it is time to run the machine learning algorithm to create the model (we say: we train the model). Once we have the model, we need to assess its performance, that is, how reliably it predicts a patient's risk of disease.

<figure><center>
<img src='https://drive.google.com/uc?id=1MknQ3lC-rI6DsDg_pMc0le5fOfHRDbB5' width=20% />
</center></figure>
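
The steps described above (collect labelled examples, train a model, assess it) can be sketched in a few lines of Python. This is only a toy illustration and not the method used later in the course: the example patients are invented, the only feature is the systolic blood pressure, the "model" is a single threshold, and the names `train_threshold_model`, `predict` and `accuracy` are hypothetical.

```python
# Toy sketch of the workflow: labelled examples -> train -> assess.
# All data below are invented for illustration; they are NOT hospital data.

# Each example: (systolic blood pressure in mmHg, label), label 1 = diseased, 0 = not
train_examples = [(110, 0), (120, 0), (125, 0), (140, 1), (150, 1), (160, 1)]
test_examples = [(115, 0), (130, 0), (145, 1), (155, 1)]


def predict(threshold, ap_hi):
    """Apply the model: predict 1 (diseased) if the pressure is at or above the threshold."""
    return 1 if ap_hi >= threshold else 0


def accuracy(threshold, examples):
    """Fraction of examples for which the prediction matches the known label."""
    correct = sum(predict(threshold, x) == label for x, label in examples)
    return correct / len(examples)


def train_threshold_model(examples):
    """'Learning': try every observed pressure as threshold and keep the most accurate one."""
    candidates = sorted(x for x, _ in examples)
    return max(candidates, key=lambda t: accuracy(t, examples))


model = train_threshold_model(train_examples)   # the learned model is just one number
test_accuracy = accuracy(model, test_examples)  # assessed on patients not used for training

print("Learned threshold:", model, "mmHg")
print("Accuracy on held-out examples:", test_accuracy)
```

Note that the model is assessed on patients that were not used for training; otherwise we would only measure how well it memorised the examples.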


**Data scientist:** Let's do an example with the hospital data:

## **What are the clinical data?**

First of all, we are going to see what data we have and what type they are.

In recent years, about 70,000 patients have been seen in our department, and for each of them we already know whether they developed heart disease. We have always been careful to record all available information digitally, so we can now put together a long table with 70,000 rows, each with 12 columns of collected information.

We call these information items "features". Some say "parameters", or "traits".

So, we have 12 features in total, where the last one is the prediction we would like to make, that is, whether there is a cardiovascular disease or not.

There are 3 types of input features:

- Objective: factual information;
- Examination: results of medical examination;
- Subjective: information given by the patient.

Here's a table of the features, their types, and how we will abbreviate them for the computer. It's not strictly necessary to abbreviate names for computers, but as a programmer, you end up typing them in over and over again, and at some point, you just want to be faster. So abbreviations are introduced. Be careful to balance shortness with readability, though.


Features:

| Name | Feature Type | Abbr. | Datatype (Unit) |
|:--- |:--- |:--- |:--- |
| Age | Objective Feature | age | int (years) |
| Height | Objective Feature | height | int (cm) |
| Weight | Objective Feature | weight | float (kg) |
| Gender | Objective Feature | gender | categorical code (1: male, 2: female) |
| Systolic blood pressure | Examination Feature | ap_hi | int (mmHg) |
| Diastolic blood pressure | Examination Feature | ap_lo | int (mmHg) |
| Cholesterol | Examination Feature | cholesterol | ordinal (1: normal, 2: above normal, 3: well above normal) |
| Glucose | Examination Feature | gluc | ordinal (1: normal, 2: above normal, 3: well above normal) |
| Smoking | Subjective Feature | smoke | binary (0: doesn't smoke, 1: smokes) |
| Alcohol intake | Subjective Feature | alco | binary (0: doesn't drink, 1: drinks) |
| Physical activity | Subjective Feature | active | binary (0: no, 1: yes) |
| Presence or absence of cardiovascular disease | Target Variable | cardio | binary (0: no, 1: yes) |
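
To make the table above more concrete, here is a minimal sketch of how one row could look in Python and how the numeric codes translate back into text. The patient values are invented, and `LEVELS`, `YES_NO` and `describe_patient` are hypothetical names used only for this illustration; the notebook itself works with a pandas table instead.

```python
# Sketch: one patient row using the abbreviations from the feature table.
# The values are invented; the code-to-text mappings follow the table.
LEVELS = {1: "normal", 2: "above normal", 3: "well above normal"}  # cholesterol, gluc
YES_NO = {0: "no", 1: "yes"}                                        # smoke, alco, active, cardio

patient = {
    "age": 52, "height": 165, "weight": 68.0, "gender": 1,
    "ap_hi": 120, "ap_lo": 80, "cholesterol": 2, "gluc": 1,
    "smoke": 0, "alco": 0, "active": 1, "cardio": 0,
}


def describe_patient(row):
    """Translate the ordinal and binary codes of one row into readable text."""
    readable = dict(row)  # copy, so the original row keeps its numeric codes
    for name in ("cholesterol", "gluc"):
        readable[name] = LEVELS[row[name]]
    for name in ("smoke", "alco", "active", "cardio"):
        readable[name] = YES_NO[row[name]]
    return readable


readable = describe_patient(patient)
print(readable)
```

The computer works with the numeric codes; the readable version is only for us humans when we inspect single rows.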


> Let's go, and try some real data science! Don't worry, you will not need to write a single line of code if you don't want to.
> 
> We have prepared this for you. From here on, you will find some Notebook parts with a "PLAY" arrow in the top left, for example right below here.
>
> You can press this arrow, and the Notebook part (the cell) will be executed. You can see that it was successful if something happens. What that something is depends on what we have programmed there for you. Some cells are "housekeeping". They have to be run once, but nothing interesting will happen. Most of them, though, will result in a graphic, a table, or something else for you to look at.
>
> Don't worry, you can break nothing. And we will always tell you what you can expect.

In [None]:
#@markdown **Imports**
#@markdown > When you "run" this cell, it will produce some text below telling you the progress. Don't mind this, unless it is red and says "error" -- this should not happen!
#@markdown > 
#@markdown > Go ahead, and press the PLAY arrow to the left. 
#@markdown > 
#@markdown > When the execution is finished, a bracketed number (e.g. `[1]`) will appear where the PLAY arrow was. 
#@markdown > 
#@markdown > Should you hit an error, let your tutor(s) know.
!pip install scikit-plot

# Data
import pandas as pd
import numpy as np
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, StandardScaler

# Plot
import matplotlib.pyplot as plt
import seaborn as sns
import scikitplot as skplt
import cufflinks as cf
from plotly.offline import iplot, init_notebook_mode
cf.go_offline(connected=True)
init_notebook_mode(connected=True)
from matplotlib import rcParams
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets
from google.colab import widgets as wd
import warnings
warnings.filterwarnings('ignore')

In [None]:
#@markdown **Read the data and play with it.**
#@markdown > This cell reads in the data we need for the rest of the unit. 
#@markdown > 
#@markdown > Go ahead, and press the PLAY arrow to the left.
#@markdown > 
#@markdown > A little bit of text should appear, and afterwards a small window below this cell where you can examine the data.
cardio = pd.read_csv('https://drive.google.com/uc?id=1BtmONEo5xVKkeHCUmYry-C4Ag_YI7bgv', sep=';')
cardio['age'] = (cardio['age'] / 365).round().astype('int')

def show_data(cardio, samples, features):
    # Show 5 consecutive patients (rows) starting at row `samples`,
    # restricted to the first `features` columns.
    return cardio.iloc[samples:samples + 5, 0:features]

# Sliders to choose the first row and the number of columns to display
samples = widgets.IntSlider(min=0, max=cardio.shape[0] - 5, value=1)
features = widgets.IntSlider(min=1, max=cardio.shape[1], value=1)
interact(show_data, cardio=fixed(cardio), samples=samples, features=features)

## What are the basic statistics to describe clinical data?

Before applying ML, let's look at some summary statistics to understand the data. The first tool is the **mean**.

### Mean


The **mean** is the most basic and important summary statistic. 

$$ \mu = \frac{1}{n} \sum_i x_i$$

It is the average of a set of numbers, where $n$ is the number of values and $x_i$ is the $i$-th value, and it describes the central tendency of the data.
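
As a small worked example of the formula, the following sketch computes the mean once by hand and once with Python's built-in `statistics` module. The five ages are invented for illustration and are not taken from the hospital data.

```python
from statistics import mean

ages = [48, 52, 55, 60, 50]  # invented ages in years

# Formula: mu = (1/n) * (sum of all x_i)
n = len(ages)
mu = sum(ages) / n

print("Mean by formula:", mu)
print("Mean with statistics.mean:", mean(ages))
```

Both ways give the same result; the interactive cell below does the same for a whole column of the patient table.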


In [None]:
# Mean
@interact
def mean(feature=list(cardio.select_dtypes('number').columns)):
    print("Mean {}: ".format(feature) + str(cardio[feature].mean()))

Looking at the mean, we can draw some conclusions:

- The average age of patients is 53.4 years
- The average height of patients is 164.5 cm
- The average weight of patients is 73.6 kg
- The average systolic blood pressure (ap_hi) of patients is 126.17 mmHg
- The average diastolic blood pressure (ap_lo) of patients is 81.16 mmHg


### Median


The **median** is the middle value of a sorted set of numbers. It is less sensitive to outliers than the mean! With an odd number of samples, it is always a *real* value from the samples; with an even number of samples, it is the average of the two middle values.

<figure>
<br/>
<center>
<img src='https://drive.google.com/uc?id=1Gb_CHlDxF9uBCmTwipmLJhcjpv15SPpi'/>
<br/>
</center>
</figure>
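
The following sketch illustrates the median on a few invented weights, using only Python's built-in `statistics` module. It shows the middle value for an odd number of samples, the average of the two middle values for an even number, and how a single faulty entry (for example a typing error) shifts the mean much more than the median.

```python
from statistics import mean, median

weights = [60, 65, 72, 80, 85]   # invented weights in kg, odd number of values
print("Median (odd count):", median(weights))        # the middle value, 72

weights_even = [60, 65, 72, 80]  # even number of values
print("Median (even count):", median(weights_even))  # (65 + 72) / 2 = 68.5

# One faulty entry: 720 kg instead of 72 kg
with_outlier = weights + [720]
print("Mean without / with outlier:  ", mean(weights), "/", round(mean(with_outlier), 1))
print("Median without / with outlier:", median(weights), "/", median(with_outlier))
```

The mean jumps from about 72 kg to about 180 kg, while the median only moves from 72 kg to 76 kg.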

In [None]:
# Median
@interact
def median(feature=list(cardio.select_dtypes('number').columns)):
    print("Median {}: ".format(feature) + str(cardio[feature].median()))

We can also draw some conclusions from the median:

- 50% of patients are 54 years old or older
- 50% of patients are 165 cm tall or shorter
- 50% of patients weigh 72 kg or less
- 50% of patients have a systolic blood pressure (ap_hi) of 120 mmHg or more
- 50% of patients have a diastolic blood pressure (ap_lo) of 80 mmHg or less





### Mode

There is another descriptive summary value that we can consider, which is called the **mode**. The mode indicates the value which appears most frequently. It does not say how often, but we'll come to that.

It is only a sensible descriptive value for features that have discrete values in a range that is comparatively narrow compared to the number of examples in your data. If you were to measure age in milliseconds for our 70,000 patients, for example, you can be very confident that this value will be different for each of them -- so the mode will not exist.
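
A minimal sketch of both situations, using Python's built-in `statistics.multimode` and `collections.Counter` on invented numbers: first a feature with few distinct values, where the mode is meaningful, and then a feature where every value is unique, so that all values tie and no single mode stands out.

```python
from collections import Counter
from statistics import multimode

heights = [165, 170, 165, 160, 170, 165]  # invented heights in cm
print("Mode:", multimode(heights))         # list of the most frequent value(s)
print("How often:", Counter(heights)[165]) # the mode itself does not tell us this

# Every value occurs exactly once: all of them are "most frequent"
unique_values = [1.01, 2.5, 3.7, 4.2]
print("Modes of unique values:", multimode(unique_values))
```

When several values tie, the cell below reports only the largest of them, because it takes `max(...)` over all modes that pandas returns.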

In [None]:
# Mode
@interact
def mode(feature=list(cardio.select_dtypes('number').columns)):
    print("Mode {}: ".format(feature) + str(max(cardio[feature].mode())))

Analyzing the mode:
- The most frequent age is 56 years
- The most frequent gender is male
- The most frequent height is 165 cm
- The most frequent weight is 65 kg
- The most frequent ap_hi is 120 mmHg
- The most frequent ap_lo is 80 mmHg
- The most frequent cholesterol is normal.
- The most frequent gluc is normal.
- It is more frequent not to smoke.
- It is more frequent not to drink alcohol.
- It is more frequent to be active.



### Standard deviation

Furthermore, we can consider the variation of the data. Specifically, the **variance** describes the spread of the data.

$$ \sigma^2 = \frac{1}{n} \sum_i (x_i - \mu)^2 $$

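The square root of the variance, $\sigma$, is called the **standard deviation**. We use the standard deviation because the variance is expressed in squared units (for example, years squared) and is therefore hard to interpret, whereas the standard deviation has the same unit as the data.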
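
The two formulas can be checked by hand on a few invented blood pressure values, here with Python's built-in `statistics` module. One detail worth knowing: the formula above divides by $n$ (population variance), whereas pandas' `.std()`, which is used in the cell below, divides by $n - 1$ by default (`ddof=1`, the sample standard deviation). With tens of thousands of patients the difference is negligible, but on five values it is visible.

```python
from math import sqrt
from statistics import pstdev, pvariance, stdev

pressures = [110, 120, 120, 130, 140]  # invented systolic pressures in mmHg

n = len(pressures)
mu = sum(pressures) / n                               # mean: 124.0
variance = sum((x - mu) ** 2 for x in pressures) / n  # formula with 1/n
sigma = sqrt(variance)                                # standard deviation

print("Variance by formula:", variance, "| statistics.pvariance:", pvariance(pressures))
print("Std by formula:", round(sigma, 2), "| statistics.pstdev:", round(pstdev(pressures), 2))

# Dividing by n - 1 instead of n gives a slightly larger value
print("Sample std (n - 1):", round(stdev(pressures), 2))
```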

In [None]:
# Standard deviation
@interact
def std(feature=list(cardio.select_dtypes('number').columns)):
    print(" ")
    print("Mean {}: ".format(feature) + str(cardio[feature].mean()))
    print("Standard deviation {}: ".format(feature) + str(cardio[feature].std()))


Drawing conclusions from the standard deviation:

- Age tends to vary 6.7 years above or below the average
- Height tends to vary 6.9 cm above or below the average
- Weight tends to vary 11.9 kg above or below the average
- Ap_hi tends to vary 14.3 mmHg above or below the average
- Ap_lo tends to vary 8.3 mmHg above or below the average


## Exercises
1. Why do you think these statistics are important?

2. Which one provides the most useful information?




**==========================WRITE YOUR ANSWERS HERE==========================**