# Exercise 1: Access and Preprocessing

## What you will learn:
* Access and clean data
* One-Hot-Encoding
* Scaling
* Split labeled data into training and validation partitions
* Train and evaluate a *LogisticRegression*-classifier.

## Data description

The `Data` folder of this repository contains the dataset `HeartDiseaseCleveland.csv`. Here is a description of this dataset:


**Features:**

1. age: age in years
2. sex: sex (1 = male; 0 = female)
3. cp: chest pain type 
    - Value 1: typical angina 
    - Value 2: atypical angina 
    - Value 3: non-anginal pain 
    - Value 4: asymptomatic 
4. trestbps: resting blood pressure (in mm Hg on admission to the hospital)
5. chol: serum cholesterol in mg/dl
6. fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
7. restecg: resting electrocardiographic results 
    - Value 0: normal 
    - Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV) 
    - Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
8. thalach: maximum heart rate achieved
9. exang: exercise induced angina (1 = yes; 0 = no)
10. oldpeak: ST depression induced by exercise relative to rest
11. slope: the slope of the peak exercise ST segment
    - Value 1: upsloping 
    - Value 2: flat 
    - Value 3: downsloping
12. ca: number of major vessels (0-3) colored by fluoroscopy
13. thal: thallium stress test result
    - Value 3: normal 
    - Value 6: fixed defect
    - Value 7: reversible defect
    
    
**Feature types**
    
- Real-valued attributes: 1,4,5,8,10,12
- Binary attributes: 2,6,9
- Ordered attribute: 11
- Nominal attributes: 3,7,13

**Target (Class label):** 

- 0: no disease
- 1, 2, 3, 4: degree of disease


```python
import numpy as np
import pandas as pd
from IPython.display import Image, display
from matplotlib import pyplot as plt
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler, normalize
```

## Tasks
**Task 1:** Load this `.csv`-file into a pandas dataframe.
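As a minimal sketch of this step, the snippet below reads CSV content with `pd.read_csv`. The in-memory `sample` text and its column names are made up for illustration. For the exercise, the path to `HeartDiseaseCleveland.csv` inside the `Data` folder would be passed instead, and whether that file has a header row should be checked first.

```python
import io

import pandas as pd

# Stand-in for the real file: a few made-up rows with a header line.
# For the exercise, pass the path to HeartDiseaseCleveland.csv instead of `sample`.
sample = io.StringIO(
    "age,sex,cp,target\n"
    "63,1,1,0\n"
    "67,1,4,2\n"
    "41,0,2,0\n"
)

df_demo = pd.read_csv(sample)

print(df_demo.shape)   # (number of rows, number of columns)
print(df_demo.dtypes)  # column types inferred by pandas
print(df_demo.head())  # first rows, for a quick visual check
```

Inspecting `shape`, `dtypes` and `head()` right after loading helps to spot parsing problems, for example numeric columns that were read as strings.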

**Task 2:** Check if there is missing data in this file. If so, display all rows with missing values. 

**Task 3:** Then replace the missing values by the median of the corresponding column.
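One possible way to approach Tasks 2 and 3 is sketched below on a toy dataframe with made-up values, in which `NaN` marks a missing entry. It assumes that missing values are already represented as `NaN`. If the file encodes them differently, for instance as a placeholder string, they would have to be converted first (e.g. via the `na_values` argument of `pd.read_csv`).

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; NaN marks a missing entry.
toy = pd.DataFrame({
    "age":  [63, 67, 41, 56, 62],
    "ca":   [0.0, 2.0, np.nan, 1.0, 1.0],
    "thal": [3.0, np.nan, 7.0, 7.0, 7.0],
})

# Task 2: count missing values per column and select the affected rows.
print(toy.isna().sum())
rows_with_missing = toy[toy.isna().any(axis=1)]
print(rows_with_missing)

# Task 3: replace each missing value by the median of its column.
toy_filled = toy.fillna(toy.median(numeric_only=True))
print(toy_filled)
```

`fillna` accepts a Series indexed by column name, so each column is filled with its own median in a single call.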

**Task 4:** The nominal features are in columns 2 (`cp`), 6 (`restecg`) and 12 (`thal`), counting columns from 0. Apply the pandas function `get_dummies()` to calculate a dataframe representation with one-hot-encoded nominal features.
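The following sketch illustrates one-hot encoding with `pd.get_dummies` on a toy dataframe. The values and the label column name `target` are assumptions made for illustration only.

```python
import pandas as pd

# Toy frame with made-up values; "target" is a hypothetical label column name.
toy = pd.DataFrame({
    "age":     [63, 67, 41, 56],
    "cp":      [1, 4, 2, 3],
    "restecg": [0, 2, 0, 1],
    "thal":    [3, 6, 7, 3],
    "target":  [0, 2, 0, 1],
})

nominal_cols = ["cp", "restecg", "thal"]

# Each nominal column is replaced by one 0/1 indicator column per category.
encoded = pd.get_dummies(toy, columns=nominal_cols, dtype=int)
print(encoded.columns.tolist())
print(encoded.head())
```

Note that `get_dummies` appends the new indicator columns at the end of the dataframe, so a label column that was last before encoding is no longer last afterwards. It can be selected by name or moved back to the end before separating features and labels.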

**Task 5:** All columns except the last are the features; the last column is the class label. Split the dataset into a numpy array `X`, which contains only the features, and a numpy array `y_raw`, which contains only the class labels.
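A possible sketch of this split, assuming the label really is the last column of the dataframe (the toy values and the column name `target` are made up):

```python
import pandas as pd

# Toy frame with made-up values; the label is the last column.
toy = pd.DataFrame({
    "age":     [63, 67, 41, 56],
    "thalach": [150, 108, 172, 178],
    "target":  [0, 2, 0, 1],
})

X = toy.iloc[:, :-1].to_numpy()     # all columns except the last
y_raw = toy.iloc[:, -1].to_numpy()  # only the last column

print(X.shape, y_raw.shape)
```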

**Task 6:** In this experiment a binary classifier shall be implemented, which differentiates between the classes disease and no disease. For this, all non-zero class labels in `y_raw` shall be mapped to 1. The new binary class-label array shall be named `y`.
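The mapping can be expressed as a vectorized comparison. The sketch below uses a made-up label array:

```python
import numpy as np

# Made-up multi-class labels: 0 = no disease, 1..4 = degree of disease.
y_raw = np.array([0, 2, 0, 1, 4, 3])

# Every non-zero label becomes 1, zero stays 0.
y = (y_raw > 0).astype(int)
print(y)  # [0 1 0 1 1 1]
```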

**Task 7:** Some machine learning algorithms perform poorly if the value ranges of the features differ significantly. For example, in the preprocessed dataframe the value range of many columns is $[0,1]$, but some features, such as `thalach` and `trestbps`, have much larger values. In particular, clustering algorithms and all algorithms that apply a gradient-descent-based learning approach require features with similar value ranges. Transform all feature columns such that their value range is $[0,1]$.
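One way to obtain the value range $[0,1]$ is min-max scaling, which maps each column value $x$ to $(x - x_{min}) / (x_{max} - x_{min})$. The sketch below applies scikit-learn's `MinMaxScaler` to a small made-up feature matrix.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up feature matrix: two wide-range columns and one 0/1 column.
X = np.array([
    [63.0, 150.0, 1.0],
    [67.0, 108.0, 0.0],
    [41.0, 172.0, 0.0],
    [56.0, 178.0, 1.0],
])

scaler = MinMaxScaler()             # default feature_range is (0, 1)
X_scaled = scaler.fit_transform(X)  # learns min/max per column, then rescales

print(X_scaled.min(axis=0))  # column minima after scaling
print(X_scaled.max(axis=0))  # column maxima after scaling
```

Here the scaler is fitted on the complete data for simplicity. A stricter variant would fit it on the training partition only and apply the learned transformation to the validation partition, so that no information from the validation data enters the preprocessing.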

**Task 8:** Split the set of labeled data into disjoint training- and validation-partitions and train a `LogisticRegression`-classifier with the training partition. After training, display the parameters of the learned model.

```python
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
```
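A minimal sketch of Task 8 on synthetic data is shown below. The random features, the 70/30 split ratio, the `random_state` values and `max_iter=1000` are illustrative assumptions, not prescribed by the exercise.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the scaled features and binary labels.
rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# Disjoint partitions; stratify keeps the class ratio similar in both.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Learned parameters: one weight per feature plus a bias term.
print(clf.coef_)
print(clf.intercept_)
```

`coef_` and `intercept_` hold the learned weights and bias, whereas `clf.get_params()` returns the hyperparameters the estimator was configured with.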

**Task 10:** Calculate the learned model's prediction on the validation data. Determine the accuracy and the confusion matrix on the validation data.

```python
from sklearn.metrics import accuracy_score, confusion_matrix
```
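The two metrics can be illustrated with made-up label arrays. In the exercise, `y_pred` would come from calling `predict` on the validation features with the trained classifier.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true labels and predictions for eight validation samples.
y_val = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 1, 0, 1, 0, 1, 0])

acc = accuracy_score(y_val, y_pred)
cm = confusion_matrix(y_val, y_pred)

print(acc)  # 0.75
print(cm)   # rows: true class, columns: predicted class
```

With scikit-learn's convention, `cm[1, 0]` counts samples of class 1 (disease) that were predicted as class 0 (no disease), while `cm[0, 1]` counts the opposite error.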

```python
import sys

sys.path.append("..")  # add the parent directory to the module search path
import utilsJM
```

**Task 11:** Discuss the confusion matrix. Are you satisfied with this performance? How can it be improved?