# Actors

- Data Scientist: *The hospital employs him to stay in touch with current developments*
- Clinician: *She is a general practitioner, invited to join the discussions about the new setup of the Cardiovascular Disease Department*
- Cardiology expert: *An expert in cardiovascular disease who diagnoses patients day in, day out*

# Visually understanding data

## **Outliers**


When we get raw data, we should check whether it contains outliers that may distort our modeling. We haven't talked about "modeling" yet; for now it is enough to know that some machine learning algorithms don't work well with outliers in the data.

Outliers, in a general sense, are examples that occur very infrequently. Let's be concrete. You don't see typical cardiovascular diseases in children (excluding congenital heart failure here). This means you will not expect, or need, your model to work well for children. In that case, if you happen to have a few children in your dataset, you can safely exclude them.

The same might go for extremely tall patients. But suppose one such group (say, the tall patients) is of interest, yet you have only very few examples of it: then these are outliers you don't want to exclude. In short: be careful, and don't exclude "rare" cases automatically. You don't want to use a bulldozer at an archaeological excavation site.

So let's see the range of body heights and body weights we have. 

For this purpose, box plots are a common analysis tool. You probably know them from your practice.

In [None]:
# Visualize box plots for the "outlier columns" height/weight -- the outliers should be visible!
df = pd.melt(cardio[['height', 'weight']])
sns.set(style="whitegrid")
plt.figure(figsize=(15,5)) # play with the values (15, 5) to change the plot size. First value is width, second is height.
sns.boxplot(x='variable', y='value', data=df, whis=[2, 98]) # whis are the "whiskers" at 2nd and 98th percentile. They create the black horizontal lines above/below the boxes.

Another plot type, the violin plot, conveys more information about where most examples lie. In the middle of each "violin", a small box marks the median and the interquartile range (the same quantities that the box shows in the plot above).

In [None]:
plt.figure(figsize=(15,5))
sns.violinplot(x='variable', y='value', data=df)

We see that the range is very large. The most natural way to reduce it is probably to set numeric limits (e.g. for body height: only consider subjects taller than 130 cm and shorter than 200 cm).
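
As a minimal sketch of such absolute limits, the snippet below filters a small made-up table. The `toy` DataFrame and its values are assumptions for illustration; only the column name `height` and the 130 cm / 200 cm limits come from the text.

```python
import pandas as pd

# Toy data standing in for the real table (values are made up for illustration).
toy = pd.DataFrame({"height": [55, 128, 150, 165, 172, 199, 207, 250]})

# Absolute limits from the text: taller than 130 cm and shorter than 200 cm.
LOWER_CM, UPPER_CM = 130, 200
in_range = (toy["height"] > LOWER_CM) & (toy["height"] < UPPER_CM)

# Filter into a new DataFrame instead of dropping rows in place.
filtered = toy[in_range]
print(f"kept {len(filtered)} of {len(toy)} rows")
print(filtered["height"].tolist())
```

Filtering into a new variable leaves the original table untouched, so you can still check which rows a limit would remove before committing to it.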

An alternative is to specify percentiles. We then only keep the subjects that lie above the lower percentile and below the upper one. A percentile is a cutoff (threshold) value: the 2nd percentile, for example, is chosen such that 2% of the subjects lie below it and the rest above it. Percentiles are therefore always relative to the distribution of your population.
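
The following sketch illustrates this definition on synthetic heights. The random sample and its parameters are assumptions, not the course data. It computes the 2nd and 98th percentile with pandas' `quantile`, checks which share of the sample lies outside the cutoffs, and shows that the cutoffs move with the population.

```python
import numpy as np
import pandas as pd

# Synthetic heights in cm (random values for illustration, not the course dataset).
rng = np.random.default_rng(0)
heights = pd.Series(rng.normal(loc=165, scale=8, size=1000))

# 2nd and 98th percentile cutoffs, computed from this sample itself.
low_cut = heights.quantile(0.02)
high_cut = heights.quantile(0.98)

# Share of subjects outside the cutoffs: about 2% on each side.
share_below = (heights < low_cut).mean()
share_above = (heights > high_cut).mean()
print(f"cutoffs: {low_cut:.1f} cm and {high_cut:.1f} cm")
print(f"share below: {share_below:.3f}, share above: {share_above:.3f}")

# Percentiles are relative: a population that is 10 cm taller has a cutoff 10 cm higher.
taller = heights + 10
shift = taller.quantile(0.02) - low_cut
print(f"cutoff shift: {shift:.1f} cm")
```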

Whether an absolute or a percentile threshold is more appropriate depends on your problem.

In [None]:
#@title Delete outliers
# For each column, drop the rows below the 2nd or above the 98th percentile.
cardio.drop(cardio[(cardio['height'] > cardio['height'].quantile(0.98)) | (cardio['height'] < cardio['height'].quantile(0.02))].index,inplace=True)
cardio.drop(cardio[(cardio['weight'] > cardio['weight'].quantile(0.98)) | (cardio['weight'] < cardio['weight'].quantile(0.02))].index,inplace=True)

cardio.drop(cardio[(cardio['ap_hi'] > cardio['ap_hi'].quantile(0.98)) | (cardio['ap_hi'] < cardio['ap_hi'].quantile(0.02))].index,inplace=True)
cardio.drop(cardio[(cardio['ap_lo'] > cardio['ap_lo'].quantile(0.98)) | (cardio['ap_lo'] < cardio['ap_lo'].quantile(0.02))].index,inplace=True)
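
Note that the cell above drops rows one column at a time, so each later percentile is computed on data that the earlier steps have already trimmed. One possible variant, sketched here on synthetic data, computes all cutoffs on the untouched table first and then filters once. The helper `percentile_mask` and the `toy` table are hypothetical names introduced for this illustration; they are not part of the notebook.

```python
import numpy as np
import pandas as pd

def percentile_mask(frame, columns, lower=0.02, upper=0.98):
    """Return a boolean mask: True for rows inside the percentile band of every column.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    keep = pd.Series(True, index=frame.index)
    for col in columns:
        lo, hi = frame[col].quantile([lower, upper])
        # between() includes both ends, so values equal to a cutoff are kept.
        keep &= frame[col].between(lo, hi)
    return keep

# Synthetic stand-in for the real table (random values, for illustration only).
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "height": rng.normal(165, 8, 500),
    "weight": rng.normal(74, 14, 500),
})

# All cutoffs are computed on the full table before any row is removed.
mask = percentile_mask(toy, ["height", "weight"])
trimmed = toy[mask]
print(f"kept {len(trimmed)} of {len(toy)} rows")
```

Which of the two behaviours you want is a design choice. Sequential dropping removes slightly more data with every additional column, while the single mask applies the same percentile definition to every column.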

## Data distribution

Next, we visualize and analyze the distribution of the data to understand the problem and choose the best ML method.

### Histogram

A histogram is a graphical representation of how often each value (or range of values) occurs. It is used in statistics to make the distribution of the data visible at a glance. The histograms of our dataset show the following:

- Age: Most patients are between 50 and 65 years old.
- Gender: Most patients are male.
- Height: Most patients are between 165 and 171 cm tall.
- Weight: Most patients weigh between 60 and 80 kg.
- Blood pressure (ap_hi, ap_lo): Most patients have an ap_hi of 120-125 and an ap_lo of 80-84.
- Cholesterol, glucose: Most patients have normal cholesterol and glucose levels.
- Smoking: Most patients don't smoke.
- Alcohol intake: Most patients don't drink alcohol.
- Physical activity: Most patients are physically active.
- Cardio: The number of patients with cardiovascular disease is almost the same as the number without.
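
To make the idea of "frequency per value range" concrete, here is a minimal sketch that counts made-up ages in 5-year bins with `numpy.histogram` and prints a text histogram. The ages are invented for illustration and do not come from the dataset.

```python
import numpy as np

# Made-up ages in years (for illustration only).
ages = np.array([41, 48, 52, 53, 55, 57, 58, 60, 61, 63, 64, 64, 67, 70])

# Bin edges every 5 years from 40 to 70. Each bin includes its left edge and
# excludes its right edge, except the last bin, which includes both.
edges = np.arange(40, 75, 5)
counts, _ = np.histogram(ages, bins=edges)

# One line per bin: the bar length equals the number of patients in that bin.
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left}-{right}: {'#' * int(n)} ({n})")
```

The bar heights in a plotted histogram are exactly these counts, which is why the choice of bins changes how the plot looks.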

In [None]:
#@title Histogram visualization
@interact
def histogram(feature=list(cardio.select_dtypes('number').columns)):
    cardio.hist(feature, bins=len(cardio[feature].unique()), figsize=(15,15))

Some features carry important information about the final decision (see the plot below):

In [None]:
#@title Plot the data distribution with respect to the output feature
# Print the range of ap_hi that remains in the data
print(cardio['ap_hi'].max())
print(cardio['ap_hi'].min())
@interact
def distribution(feature = cardio.columns):
    plt.figure(figsize=(15, 15)) 
    cardio[cardio["cardio"] == 0][feature].hist(bins=len(cardio[feature].unique()), color='green', label='Have NO cardiovascular disease', alpha=0.6)
    cardio[cardio["cardio"] == 1][feature].hist(bins=len(cardio[feature].unique()), color='red', label='Have cardiovascular disease', alpha=0.6)
    plt.legend()
    plt.xlabel(feature)

## **Exercises**
1. What conclusions can we draw from the features age, weight, blood pressure (ap_hi, ap_lo), cholesterol and glucose regarding the final decision?
 

**==========================WRITE YOUR ANSWERS HERE==========================**