# Actors

- Data Scientist: *The hospital employs him to stay in touch with current developments*
- Clinician: *She is a general practitioner, invited to join the discussions about the new setup of the Cardiovascular Disease Department*
- Cardiology expert: *An expert in cardiovascular disease who diagnoses patients day in, day out*

# Data processing to prepare for Machine Learning

There are two types of values: categorical values and continuous values.

On the one hand, categorical data can be sorted into groups, and the numbers used to label those groups have no mathematical meaning. For example, the feature "gender" takes the values 1 (male) and 2 (female), but these numbers only name groups.

On the other hand, continuous data represent measurements with a mathematical meaning, such as "height" and "ap_hi". Unlike categorical codes, these numbers can be ordered and compared: a larger number means a larger measured quantity.

In [None]:
#@markdown Identify categorical and continuous features
categorical_val = []
continous_val = []

for column in cardio_modified.columns:
    # Treat a column with at most 10 unique values as a categorical feature
    if len(cardio_modified[column].unique()) <= 10:
        categorical_val.append(column)
    else:
        continous_val.append(column)
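
To see what this heuristic does, here is a minimal sketch on a small made-up table. The DataFrame `toy`, its values, and the names `toy_categorical`, `toy_continuous` and `MAX_CATEGORIES` are assumptions for illustration only; they are not part of the cardio dataset.

```python
import pandas as pd

# Hypothetical toy data: column names mirror the notebook, values are made up.
toy = pd.DataFrame({
    "gender": [1, 2, 2, 1, 2, 1, 1, 2, 2, 1, 2, 1],
    "cholesterol": [1, 3, 2, 1, 1, 2, 3, 1, 1, 2, 1, 3],
    "height": [168, 156, 165, 169, 151, 172, 160, 158, 175, 181, 163, 170],
})

MAX_CATEGORIES = 10  # same cut-off as in the cell above

# A column counts as categorical if it has at most MAX_CATEGORIES distinct values.
toy_categorical = [c for c in toy.columns if toy[c].nunique() <= MAX_CATEGORIES]
toy_continuous = [c for c in toy.columns if toy[c].nunique() > MAX_CATEGORIES]

print(toy_categorical)
print(toy_continuous)
```

Keep in mind that this is only a rule of thumb: a measurement that happens to take few distinct values (for example in a very small dataset) would be labelled categorical, so it is worth checking the resulting lists by eye.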

## Categorical features

For some features, grouping the values can make the prediction more effective. For example, *age* is a continuous feature that we can split into 3 groups:

    - Young age 0 : < 40
    - Middle age 1 : 40 to under 55
    - Elderly age 2 : 55 and older

Grouping ages can also help to anonymize the data. Imagine your dataset contains only a few very old or very young patients. Someone who knows roughly where the data came from might be able to identify them.

In [None]:
#@markdown ### Perform age grouping
cardio_grouped = cardio_modified.copy()

def groupAge(age):
    """Map an age to its group: 0 (< 40), 1 (40 to under 55), 2 (55 and older)."""
    if age < 40:
        return 0
    elif age < 55:
        return 1
    else:
        return 2

# Replace each age by its group label
cardio_grouped['age'] = cardio_grouped['age'].apply(groupAge)
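
As an alternative, the same binning could be sketched with `pd.cut`, which avoids calling a Python function row by row. The series `ages` below is made up for illustration and assumes ages are given in years; `age_group` is a hypothetical name.

```python
import pandas as pd

# Made-up ages in years, chosen to sit on both sides of each boundary.
ages = pd.Series([29, 39, 40, 54, 55, 64], name="age")

# right=False makes each bin closed on the left: [0, 40), [40, 55), [55, inf)
age_group = pd.cut(
    ages,
    bins=[0, 40, 55, float("inf")],
    labels=[0, 1, 2],
    right=False,
).astype(int)

print(age_group.tolist())  # [0, 0, 1, 1, 2, 2]
```

Note that values outside the bins (for example negative ages) would become missing values here, and the final `astype(int)` would then fail, so this sketch assumes the age column has already been cleaned.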

After grouping all ages into three groups, let's assess the new age distribution with another histogram.

In [None]:
#@markdown ### Plot age histogram
cardio_grouped.hist('age', figsize=(10,10))

In addition, we have to prepare our categorical features by splitting each category into its own column (one-hot encoding). This way the numeric codes are not read as an order or a magnitude, and each group is given the same importance.

In [None]:
#@markdown ## Split categories into different columns
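# Keep 'cardio' as a single column; the check lets this cell be re-run safely
if 'cardio' in categorical_val:
    categorical_val.remove('cardio')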
cardio_categorize = pd.get_dummies(cardio_grouped, columns = categorical_val)

display(cardio_categorize)
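
To make the effect of `pd.get_dummies` concrete, here is a minimal sketch on a made-up table. The DataFrame `toy` and its values are assumptions for illustration; `dtype=int` is passed only so that the indicator columns show 0/1 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical toy data with one categorical and one continuous column.
toy = pd.DataFrame({
    "cholesterol": [1, 3, 2, 1],
    "weight": [62.0, 85.0, 64.0, 82.0],
})

# Each distinct value of 'cholesterol' becomes its own 0/1 indicator column.
encoded = pd.get_dummies(toy, columns=["cholesterol"], dtype=int)

print(encoded)
```

Every row has exactly one indicator set to 1, so no category is treated as "larger" than another.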

## Continuous features

Normalizing continuous data is important because our features have different ranges of values, which can skew the model and distort the results. For example, the feature "weight" ranges from 51 to 109 kg, while the feature "ap_hi" ranges from 100 to 170 mmHg. Without normalization, many models would give more importance to "ap_hi" simply because its values are numerically larger. To avoid this, let's scale these features to the range between 0 and 1.

In [None]:
#@markdown ## Min Max scaler

from sklearn.preprocessing import MinMaxScaler

# Normalize continuous features weight, ap_hi and ap_lo to the range [0, 1]
cardio_scaled = cardio_categorize.copy()
scaler = MinMaxScaler()
columns_to_scale = ['weight', 'ap_hi', 'ap_lo']
cardio_scaled[columns_to_scale] = scaler.fit_transform(cardio_scaled[columns_to_scale])
display(cardio_scaled)
#cardio_scaled.to_csv (r'cardio_scaled.csv', index = False, header=True)
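
With its default settings, `MinMaxScaler` rescales each column as (x - min) / (max - min). The sketch below applies that formula by hand to a made-up table whose ranges match the ones quoted above; the DataFrame `toy` and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical values spanning the ranges mentioned in the text.
toy = pd.DataFrame({
    "weight": [51.0, 80.0, 109.0],   # kg
    "ap_hi": [100.0, 135.0, 170.0],  # mmHg
})

# Column-wise min-max scaling: the minimum maps to 0, the maximum to 1.
scaled = (toy - toy.min()) / (toy.max() - toy.min())

print(scaled)
```

After scaling, both columns cover the same interval, so neither dominates just because of its unit. This hand-written version would divide by zero for a constant column, which is one reason to prefer the scikit-learn scaler in practice.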

# Data dimensionality

One of the main problems of working with large amounts of data is its dimensionality. Having many features can lead to poor performance and meaningless results because of the so-called "curse of dimensionality".

Each column in the dataset represents a dimension, and a large number of dimensions in the feature space makes algorithms expensive to compute and results hard to visualize. Hence, it is desirable to reduce the number of dimensions.

One technique for solving this problem is Principal Component Analysis (PCA), an unsupervised learning algorithm used to reduce dimensionality.

<figure>
<br/>
<center>
<img src='https://drive.google.com/uc?id=1wfWKVWfXNqFt5Of2FbjHnraeplIhHoFy' width=500 height=250/>
<br/>
</center>
</figure>

It projects the dataset onto a lower-dimensional hyperplane while keeping as much of the information (variance) as possible.
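
The idea can be sketched with scikit-learn's `PCA` on synthetic data. Everything below (the random data, the names `X`, `X_2d` and the choice of 2 components) is an assumption for illustration and is not taken from the cardio dataset.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 200 samples with 3 features. The third feature is almost
# the sum of the first two, so the points lie close to a 2D plane.
base = rng.normal(size=(200, 2))
third = base[:, 0] + base[:, 1] + rng.normal(scale=0.05, size=200)
X = np.column_stack([base, third])

# Project the 3D points onto the 2 directions of largest variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X_2d.shape)
print(pca.explained_variance_ratio_.sum())
```

The sum of `explained_variance_ratio_` indicates how much of the original variance survives the projection. Here it is close to 1 because the data were built to be nearly two-dimensional; on real data the value is usually lower and helps to decide how many components to keep.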


---
To play with an interactive tool, visit http://projector.tensorflow.org/






## **Exercises**
1. Why is it important to categorize the data?

2. Why is it important to normalize the data?

**==========================WRITE YOUR ANSWERS HERE==========================**