# Problem description

This is an analysis for the DrivenData competition on predicting heart disease: https://www.drivendata.org/competitions/54/machine-learning-with-a-heart/page/107/

The goal is to predict the binary class ```heart_disease_present```, which represents whether or not a patient has heart disease:

- ```0``` represents no heart disease present
- ```1``` represents heart disease present

The dataset has 14 columns. The ```patient_id``` column is a unique, random identifier; the remaining 13 features are described in the list below.
- ```slope_of_peak_exercise_st_segment``` (type: int): the slope of the peak exercise ST segment, an electrocardiography read out indicating quality of blood flow to the heart
- ```thal``` (type: categorical): results of thallium stress test measuring blood flow to the heart, with possible values ```normal```, ```fixed_defect```, ```reversible_defect```
- ```resting_blood_pressure``` (type: int): resting blood pressure
- ```chest_pain_type``` (type: int): chest pain type (4 values)
- ```num_major_vessels``` (type: int): number of major vessels (0-3) colored by fluoroscopy
- ```fasting_blood_sugar_gt_120_mg_per_dl``` (type: binary): fasting blood sugar > 120 mg/dl
- ```resting_ekg_results``` (type: int): resting electrocardiographic results (values 0,1,2)
- ```serum_cholesterol_mg_per_dl``` (type: int): serum cholesterol in mg/dl
- ```oldpeak_eq_st_depression``` (type: float): oldpeak = ST depression induced by exercise relative to rest, a measure of abnormality in electrocardiograms
- ```sex``` (type: binary): ```0```: female, ```1```: male
- ```age``` (type: int): age in years
- ```max_heart_rate_achieved``` (type: int): maximum heart rate achieved (beats per minute)
- ```exercise_induced_angina``` (type: binary): exercise-induced chest pain (```0```: False, ```1```: True)

Performance is evaluated according to binary log loss.

The submission file has two columns: ```patient_id``` and ```heart_disease_present```. Because the evaluation metric is log loss, the ```heart_disease_present``` values you submit should be the predicted probabilities that a patient has heart disease, not binary labels.
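
To illustrate why probabilities score better than hard labels under this metric, here is a minimal sketch of binary log loss in plain Python. The function name `binary_log_loss`, the clipping constant `eps`, and the toy values are illustrative assumptions, not the competition's scoring code.

```python
import math

def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        # Clip so that log(0) never occurs for predictions of exactly 0 or 1
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y_true = [1, 0, 1, 0]
soft_predictions = [0.8, 0.2, 0.6, 0.4]  # probabilities
hard_predictions = [1, 0, 0, 0]          # binary labels with one mistake

loss_soft = binary_log_loss(y_true, soft_predictions)
loss_hard = binary_log_loss(y_true, hard_predictions)
print(loss_soft, loss_hard)
```

A single confidently wrong hard label dominates the average, which is why submitting probabilities is usually the safer choice.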

# Preparation of Environment

## Get the required libraries

In [None]:
# Load required packages


## Get the data

In [None]:
#define and create the data folder


In [None]:
#download the data


In [None]:
#import the data into the notebook and define the first column as index (patient id)


In [None]:
#create one training dataframe


# Data Preparation

In [None]:
#check what the first rows of the dataset look like


In [None]:
#check the types of the columns


In [None]:
#create lists of categorical, numerical and label columns


In [None]:
#convert categorical columns to categories
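
One possible way to do the conversion is `astype("category")` in pandas. The sketch below uses a toy frame with made-up values; which columns count as categorical (`categorical_cols`) is an assumption for illustration.

```python
import pandas as pd

# Toy rows using column names from the feature list; the values are made up
df = pd.DataFrame({
    "thal": ["normal", "fixed_defect", "reversible_defect", "normal"],
    "chest_pain_type": [2, 3, 4, 4],
    "age": [45, 54, 77, 40],
})

categorical_cols = ["thal", "chest_pain_type"]  # assumed selection
df[categorical_cols] = df[categorical_cols].astype("category")
print(df.dtypes)
```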


In [None]:
#check if it worked


In [None]:
#check the content of the columns again


In [None]:
#check for missing values
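
A minimal sketch of a missing-value check, assuming the data is held in a pandas DataFrame. The toy frame contains one deliberately missing entry so the count is visible.

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing value (made-up data)
df = pd.DataFrame({
    "age": [45, 54, np.nan],
    "sex": [0, 1, 1],
})

# Number of missing entries per column
missing_per_column = df.isnull().sum()
print(missing_per_column)
```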


In [None]:
#check for duplicate rows


## Visualization

### Numerical Features

In [None]:
#investigate descriptive statistics for numeric features


In [None]:
#investigate distribution properties kurtosis and skewness of numeric features
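
Skewness and kurtosis can be read directly off a pandas DataFrame. This is a sketch on made-up values; the name `shape_stats` is illustrative.

```python
import pandas as pd

# Made-up values: 'age' is symmetric, 'oldpeak_eq_st_depression' has a long right tail
df = pd.DataFrame({
    "oldpeak_eq_st_depression": [0.0, 0.0, 0.2, 0.5, 1.0, 1.6, 4.2],
    "age": [40, 45, 50, 55, 60, 65, 70],
})

shape_stats = pd.DataFrame({
    "skewness": df.skew(),
    "kurtosis": df.kurt(),  # excess kurtosis: 0 for a normal distribution
})
print(shape_stats)
```

A skewness near zero suggests a symmetric distribution, while a clearly positive value points to a right tail.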


In [None]:
#visualize numeric feature distributions with displots


In [None]:
#test for normality


In [None]:
#visualize correlations between numerical features in a heat map


In [None]:
#visualize label separation by numeric features with a box plot


In [None]:
#test the relationships between the numerical features and the label


### Categorical features

In [None]:
#investigate descriptive statistics for categorical features


In [None]:
#visualize the number of cases per category with bar charts


In [None]:
#visualize label separation by categorical features with bar charts


In [None]:
#test the relationships between the categorical features and the label
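
One common choice for testing the association between a categorical feature and a binary label is a chi-square test of independence. The notebook does not prescribe a specific test, so treat this as an assumed approach on made-up data.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up data: 'normal' mostly without disease, 'reversible_defect' mostly with
df = pd.DataFrame({
    "thal": ["normal"] * 6 + ["reversible_defect"] * 6,
    "heart_disease_present": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1],
})

# Cross-tabulate category against label, then test for independence
contingency = pd.crosstab(df["thal"], df["heart_disease_present"])
chi2, p_value, dof, expected = chi2_contingency(contingency)
print(contingency)
print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}")
```

With counts this small the test is only indicative; on the real data the expected cell counts should be checked before relying on the p-value.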


In [None]:
#visualize class distribution of the label


### Observations

What did you observe?

## Transformation and Feature Engineering

### Aggregating categories

Which categories could you aggregate?

### Transforming numeric variables

Which numeric feature is not normally distributed?

In [None]:
#transform 'oldpeak_eq_st_depression' with a square root and compare the distributions
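
A sketch of the square-root transform, comparing skewness before and after. The values are made up; on the real column, a histogram comparison would complement the numbers.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed values standing in for the real column
oldpeak = pd.Series([0.0, 0.0, 0.2, 0.5, 1.0, 1.6, 4.2], name="oldpeak_eq_st_depression")

# Square root compresses large values more than small ones
oldpeak_sqrt = np.sqrt(oldpeak)

print("skewness before:", oldpeak.skew())
print("skewness after: ", oldpeak_sqrt.skew())
```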


### Compute new features

In [None]:
#create a new categorical feature called 'rng_depression' from 'oldpeak_eq_st_depression'
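
One way to derive a range feature is `pd.cut`. The bin edges and labels below are hypothetical; they should be chosen from the distribution actually observed in the data.

```python
import pandas as pd

df = pd.DataFrame({"oldpeak_eq_st_depression": [0.0, 0.4, 1.0, 1.8, 2.5, 4.2]})

# Hypothetical bin edges and labels; intervals are right-inclusive by default,
# so the first bin (-0.001, 0.0] captures exact zeros
bins = [-0.001, 0.0, 1.0, 2.0, float("inf")]
labels = ["none", "low", "medium", "high"]

df["rng_depression"] = pd.cut(df["oldpeak_eq_st_depression"], bins=bins, labels=labels)
print(df)
```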


In [None]:
#visualize counts of rng_depression categories with a bar chart


### Observations

What did you observe?

# Local modelling

## Preparing data for scikit-learn

In [None]:
#define categorical and numerical features used for modelling


In [None]:
#one-hot encode categorical features and concatenate them all into a new df
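
A minimal sketch of one-hot encoding with `pd.get_dummies`, which encodes the listed columns and returns them in a single new frame. The feature selection in `categorical_features` is an assumption.

```python
import pandas as pd

# Made-up rows using column names from the feature list
df = pd.DataFrame({
    "thal": ["normal", "fixed_defect", "reversible_defect"],
    "chest_pain_type": [1, 4, 4],
})

categorical_features = ["thal", "chest_pain_type"]  # assumed selection
encoded = pd.get_dummies(df, columns=categorical_features)
print(encoded.columns.tolist())
```

Each category value becomes its own indicator column, named after the original column and the value.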


In [None]:
#add selected numerical features to new df


In [None]:
#standardize numerical features
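
Standardization could be done with scikit-learn's `StandardScaler`, as in this sketch on made-up values. The feature selection is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up values for two numerical features
df = pd.DataFrame({
    "age": [40, 50, 60, 70],
    "max_heart_rate_achieved": [180, 160, 150, 130],
})
numerical_features = ["age", "max_heart_rate_achieved"]  # assumed selection

# Rescale each column to zero mean and unit variance
scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(df[numerical_features]),
    columns=numerical_features,
    index=df.index,
)
print(scaled)
```

Keeping the fitted `scaler` around allows the same transformation to be applied to the competition test data later with `scaler.transform`.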


In [None]:
#save the prepared dataset in a file


In [None]:
#create a numpy array of train features


In [None]:
#create a numpy array of the label


In [None]:
#split the new dataset into a training and test dataset and check their shapes
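
A sketch of the split with `train_test_split`, using random stand-in arrays. The `test_size`, `random_state`, and the use of `stratify` are assumptions rather than settings taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))  # stand-in feature array
y = np.array([0, 1] * 10)     # stand-in label array

# stratify=y keeps the class ratio similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```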


## Create and test a model using logistic regression

In [None]:
#train the model


In [None]:
#create predictions with the model


In [None]:
#transform probabilities into class scores
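
The three steps above (train, predict probabilities, convert to class labels) can be sketched as follows. The data is synthetic and the 0.5 threshold is an assumption; default `LogisticRegression` settings are used only for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, linearly separable stand-in data
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
X_test = np.array([[2.0, 2.0], [-2.0, -2.0]])

# Train the model
model = LogisticRegression()
model.fit(X_train, y_train)

# Column 1 of predict_proba holds the probability of class 1
probabilities = model.predict_proba(X_test)[:, 1]

# Convert probabilities to class labels with an assumed threshold
threshold = 0.5
predicted_labels = (probabilities > threshold).astype(int)
print(probabilities, predicted_labels)
```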


In [None]:
#evaluate the predictions with a confusion matrix, accuracy, precision, recall and F1
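
The metrics named above are all available in `sklearn.metrics`. This sketch uses made-up labels and predictions.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Made-up true labels and predicted class labels
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(cm)
print(accuracy, precision, recall, f1)
```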


In [None]:
#calculate the log loss evaluation metric


In [None]:
#evaluate the predictions with the ROC curve and AUC
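
A sketch of the ROC computation on made-up values. Note that it takes the predicted probabilities, not the thresholded labels. Plotting is left out; `fpr` and `tpr` can be passed to any plotting library.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up true labels and predicted probabilities
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

# False and true positive rates at each candidate threshold
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
auc_score = roc_auc_score(y_true, y_prob)
print(fpr, tpr, auc_score)
```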


## Model selection

# Modelling in AMLS

# Modelling with AutoML

# Submission

In [None]:
#define a function to do all the preprocessing with the test data


In [None]:
#check if the preprocessing worked


In [None]:
#create predictions with the model (Do not convert to binary labels, submissions must be made with probabilities)


In [None]:
#add the predictions to the submission file and save it as csv
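
A sketch of building the two-column submission described in the problem description. The patient IDs and probabilities are hypothetical, and the CSV is written to an in-memory buffer here; in the notebook, a file path (for example a hypothetical `submission.csv`) would be passed to `to_csv` instead.

```python
import io

import pandas as pd

patient_ids = ["a1b2c3", "d4e5f6", "g7h8i9"]  # hypothetical IDs
probabilities = [0.12, 0.87, 0.55]            # hypothetical model output

# patient_id as index, predicted probability as the only column
submission = pd.DataFrame(
    {"heart_disease_present": probabilities},
    index=pd.Index(patient_ids, name="patient_id"),
)

buffer = io.StringIO()
submission.to_csv(buffer)
print(buffer.getvalue())
```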
