# My First Machine Learning Project
## International Master of Bioinformatics, University of Bologna

### Immanuela Antigone Engländer



## 1. The Cleveland Heart Disease (UCI Repository) Dataset: Classification With Various Models

Heart disease, also referred to as cardiovascular disease (CVD), is the most common cause of death in EU countries, accounting for 37% of all deaths in 2017. Cancer is the second most common cause, responsible for 26% of all deaths in EU countries in the same year.

![Main causes of mortality in EU countries 2017](https://www.oecd-ilibrary.org/sites/82129230-en/images/images/03-chapter3/media/image8.png)

For more information please refer to the OECD report:

<a href='https://read.oecd.org/10.1787/82129230-en?format=html' title='Health at a Glance: Europe 2020 State of Health in the EU Cycle'><img src='https://assets.oecdcode.org/covers/340/82129230.jpg' alt='Health at a Glance: Europe 2020 State of Health in the EU Cycle'/></a>

The scientific community estimates that up to 90% of CVD is preventable by reducing risk factors through:
* Healthy eating
* Exercise
* Avoiding tobacco
* Limiting alcohol intake 

Although the scientific and medical community is well aware that symptoms of CVD differ significantly between women and men, the general population appears to need more information on the topic: men are still diagnosed 7 to 10 years earlier than women [[1](https://web.archive.org/web/20140817123106/http://whqlibdoc.who.int/publications/2011/9789241564373_eng.pdf?ua=1)].
Developing a predictor could help laypeople assess their risk and recognize the illness at an early, treatable stage.

For more information please refer to https://en.wikipedia.org/wiki/Cardiovascular_disease


## 2. The Dataset

The dataset was created by the following entities:

* Hungarian Institute of Cardiology. Budapest: Andras Janosi, M.D.
* University Hospital, Zurich, Switzerland: William Steinbrunn, M.D.
* University Hospital, Basel, Switzerland: Matthias Pfisterer, M.D.
* V.A. Medical Center, Long Beach and Cleveland Clinic Foundation: Robert Detrano, M.D., Ph.D.

## 3. Attributes of the Dataset

1. age
1. sex
1. chest pain type (4 values)
1. resting blood pressure
1. serum cholesterol in mg/dl
1. fasting blood sugar > 120 mg/dl
1. resting electrocardiographic results (values 0,1,2)
1. maximum heart rate achieved
1. exercise induced angina
1. oldpeak = ST depression induced by exercise relative to rest
1. the slope of the peak exercise ST segment
1. number of major vessels (0-3) colored by fluoroscopy
1. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect

The label takes one of five values: 0 indicates the absence of the disease, and 1, 2, 3 and 4 indicate that it is present.
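Because every non-zero label means the disease is present, the five-valued label is often collapsed into a binary one. The following is a minimal sketch of that step; the label values and the series name `num` are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical labels in the original 0-4 coding (illustrative values only).
raw_labels = pd.Series([0, 2, 1, 0, 4, 3, 0], name="num")

# Any value above 0 means the disease is present, so 1-4 collapse into 1.
binary_labels = (raw_labels > 0).astype(int)

print(binary_labels.tolist())  # [0, 1, 1, 0, 1, 1, 0]
```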

## 4. Importing the Required Libraries

In [None]:
!pip install -q colabcode
from colabcode import ColabCode

In [None]:
# Loading module to monitor time:
%load_ext autotime

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn  # the scikit-learn package is imported under the name sklearn

# TensorFlow and tf.keras
import tensorflow as tf
from tensorflow import keras
sns.set() # load seaborn's default theme

In [8]:
# Libraries imported
import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import keras
from pandas import read_csv, set_option
from matplotlib import pyplot
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, KFold
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier, VotingClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.pipeline import Pipeline
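# to_categorical and plot_model are exposed directly under keras.utils;
# the np_utils and vis_utils submodules are not available in recent Keras releases.
from keras.utils import to_categorical, plot_model
from keras.models import Sequential
from keras.layers import Dense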

## 5. Loading the Data

The data was acquired from the UCI Machine Learning Repository at the following [link](https://archive.ics.uci.edu/ml/datasets/Heart+Disease).

Here I'm loading the data from my GitHub repository. Doing it this way makes it reproducible for other students without the need for downloading the data set.

In [13]:
url = 'https://raw.githubusercontent.com/ilante/AML_91934_exam/main/heart.csv'
data = pd.read_csv(url, header=0)
data 

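| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 298 | 57 | 0 | 0 | 140 | 241 | 0 | 1 | 123 | 1 | 0.2 | 1 | 0 | 3 | 0 |
| 299 | 45 | 1 | 3 | 110 | 264 | 0 | 1 | 132 | 0 | 1.2 | 1 | 0 | 3 | 0 |
| 300 | 68 | 1 | 0 | 144 | 193 | 1 | 1 | 141 | 0 | 3.4 | 1 | 2 | 3 | 0 |
| 301 | 57 | 1 | 0 | 130 | 131 | 0 | 1 | 115 | 1 | 1.2 | 1 | 1 | 3 | 0 |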


In [14]:
data.shape

(303, 14)

As we can see, the dataset is made up of 303 observations and 14 columns. The `target` column is the attribute to be predicted and takes one of two values:
* 0: no heart disease present
* 1: heart disease present
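With a binary target it is worth checking how balanced the two classes are before training any classifier. The sketch below uses only the `target` values of the nine rows visible in the data preview, so the numbers describe that excerpt and not the full dataset; the name `sample` is an assumption made for illustration.

```python
import pandas as pd

# Target values of the nine rows shown in the preview
# (first five and last four displayed rows); the full table has 303 rows.
sample = pd.DataFrame({"target": [1, 1, 1, 1, 1, 0, 0, 0, 0]})

# Absolute and relative class frequencies.
counts = sample["target"].value_counts()
shares = sample["target"].value_counts(normalize=True)

print(counts.to_dict())
print(shares.round(3).to_dict())
```

On the full dataframe the same call would be `data["target"].value_counts()`. For a 0/1 column the mean equals the share of rows labelled 1, so `data["target"].mean()` gives the same information in a single number.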

## 5.1 Exploring the Data

Getting familiar with the dataset helps reveal how the data is distributed and how the given variables relate to one another.

In [15]:
data.dtypes

age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal          int64
target        int64
dtype: object

All columns except `oldpeak` are integers; `oldpeak` is a float and refers to the ST depression induced by exercise relative to rest.
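Several of the integer columns (for example `sex` and `cp`) hold category codes, not measurements, which the `int64` dtype alone does not reveal. The following sketch separates float from integer columns and marks one code column as categorical. It uses the first three preview rows of four columns; treating `cp` as a nominal variable is an assumption based on the attribute list.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

# First three preview rows, restricted to four columns for brevity.
sample = pd.DataFrame({
    "age": [63, 37, 41],
    "cp": [3, 2, 1],
    "oldpeak": [2.3, 3.5, 1.4],
    "target": [1, 1, 1],
})

# Split the columns by storage type.
float_cols = [c for c in sample.columns if is_float_dtype(sample[c])]
int_cols = [c for c in sample.columns if is_integer_dtype(sample[c])]
print(float_cols)  # ['oldpeak']
print(int_cols)    # ['age', 'cp', 'target']

# Assumption: cp is a nominal code, so a categorical dtype documents that intent.
sample["cp"] = sample["cp"].astype("category")
print(sample.dtypes)
```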

In [21]:
pd.set_option('display.max_columns', None) # to avoid truncation
pd.set_option('display.max_rows', None)
pd.set_option('display.precision', 10) # number of decimal places shown
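`pd.set_option` changes the display settings for the rest of the session. If the wider output is only needed for one table, `pd.option_context` is an alternative that restores the previous settings afterwards. A minimal sketch, with a small hypothetical frame named `stats`:

```python
import pandas as pd

# Small hypothetical frame, used only to have something to print.
stats = pd.DataFrame({"oldpeak": [2.3, 3.5, 1.4]}).describe()

default_precision = pd.get_option("display.precision")

# The options apply only inside the with-block.
with pd.option_context("display.precision", 10, "display.max_columns", None):
    inside_precision = pd.get_option("display.precision")
    print(stats)

# After the block the previous setting is back in place.
after_precision = pd.get_option("display.precision")
```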

In [22]:
data.describe()

| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 | 303.0 |
| mean | 54.3663366337 | 0.6831683168 | 0.9669966997 | 131.6237623762 | 246.2640264026 | 0.1485148515 | 0.5280528053 | 149.6468646865 | 0.3267326733 | 1.0396039604 | 1.399339934 | 0.7293729373 | 2.3135313531 | 0.5445544554 |
| std | 9.0821009898 | 0.4660108233 | 1.0320524895 | 17.5381428135 | 51.8307509879 | 0.3561978749 | 0.5258595964 | 22.9051611149 | 0.4697944645 | 1.1610750221 | 0.6162261453 | 1.022606365 | 0.6122765073 | 0.4988347842 |
| min | 29.0 | 0.0 | 0.0 | 94.0 | 126.0 | 0.0 | 0.0 | 71.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 47.5 | 0.0 | 0.0 | 120.0 | 211.0 | 0.0 | 0.0 | 133.5 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 |
| 50% | 55.0 | 1.0 | 1.0 | 130.0 | 240.0 | 0.0 | 1.0 | 153.0 | 0.0 | 0.8 | 1.0 | 0.0 | 2.0 | 1.0 |
| 75% | 61.0 | 1.0 | 2.0 | 140.0 | 274.5 | 0.0 | 1.0 | 166.0 | 1.0 | 1.6 | 2.0 | 1.0 | 3.0 | 1.0 |
| max | 77.0 | 1.0 | 3.0 | 200.0 | 564.0 | 1.0 | 2.0 | 202.0 | 1.0 | 6.2 | 2.0 | 4.0 | 3.0 | 1.0 |


`data.describe()` is a useful function for exploring the dataset; for each column of the dataframe it reports the following statistics:
* Count: the number of non-missing samples (303 here)
* Mean
* Standard deviation
* Min and max: the smallest and largest value of each feature
* The 25%, 50% (median) and 75% percentiles

Since every column has a count of 303, none of them contains NaN entries.
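To make the entries of the summary concrete, the sketch below recomputes them by hand for a toy series. The values are hypothetical and chosen only so that the results are easy to verify. Note that pandas reports the sample standard deviation (`ddof=1`) and interpolates linearly between observations for the percentiles.

```python
import pandas as pd

# Hypothetical toy values, not taken from the dataset.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], name="toy")

summary = values.describe()

# Each entry of describe() matches the corresponding Series method.
manual = {
    "count": values.count(),
    "mean": values.mean(),
    "std": values.std(),            # sample standard deviation (ddof=1)
    "min": values.min(),
    "25%": values.quantile(0.25),
    "50%": values.median(),
    "75%": values.quantile(0.75),
    "max": values.max(),
}

for name, value in manual.items():
    print(f"{name:>5}: describe={summary[name]:.4f} manual={value:.4f}")
```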

In [25]:
data.columns

Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'], dtype='object')

### 5.1.2 Describing the Column Names in More Detail:

1. **age**: age of the individual in years
2. **sex**: sex of the individual:
    * 1 = male
    * 0 = female
3. **cp**: chest-pain type experienced by the individual:
    * 1 = typical angina
    * 2 = atypical angina
    * 3 = non-anginal pain
    * 4 = asymptomatic
4. **trestbps**: resting blood pressure of the individual in mmHg
5. **chol**: serum cholesterol in mg/dl
6. **fbs**: compares the fasting blood sugar of the individual with 120 mg/dl:
    * 1 = true (fasting blood sugar > 120 mg/dl)
    * 0 = false
7. **restecg**: resting ECG result:
    * 0 = normal
    * 1 = having ST-T wave abnormality
    * 2 = left ventricular hypertrophy
8. **thalach**: maximum heart rate achieved by the individual
9. **exang**: exercise-induced angina:
    * 1 = yes
    * 0 = no
10. **oldpeak**: ST depression induced by exercise relative to rest, based on a treadmill ECG stress test. The reading is considered abnormal if the ST segment has a **horizontal** or **negative** slope. The value can be an integer or a float.
11. **slope**: slope of the peak exercise ST segment:
    * 1 = upsloping
    * 2 = flat
    * 3 = downsloping
12. **ca**: number of major vessels (0–3) colored by fluoroscopy, stored as an integer
13. **thal**: result of the thallium stress test (the name is often read as thalassemia, but the codes describe perfusion defects):
    * 3 = normal
    * 6 = fixed defect
    * 7 = reversible defect
14. **target**: diagnosis of heart disease, i.e. whether the individual is suffering from heart disease or not:
    * 0 = absence
    * 1, 2, 3, 4 = present

These codes follow the original UCI documentation. In the CSV file loaded here, `cp`, `slope`, `ca`, `thal` and `target` span different integer ranges (compare the minima and maxima in the `data.describe()` output), so the codes should be checked against the file before they are interpreted.
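Columns such as `cp` are nominal: their integer codes carry no order, so many models work better when each code becomes its own indicator column. The sketch below shows this with `pd.get_dummies` on the `cp` values of the nine preview rows. Using `pd.get_dummies` is an assumption made for brevity; the `OneHotEncoder` imported earlier would be the scikit-learn route to the same result.

```python
import pandas as pd

# cp values of the nine rows shown in the data preview.
sample = pd.DataFrame({"cp": [3, 2, 1, 1, 0, 0, 3, 0, 0]})

# One indicator column per distinct code; dtype=int yields 0/1 instead of booleans.
encoded = pd.get_dummies(sample, columns=["cp"], prefix="cp", dtype=int)

print(encoded.columns.tolist())
print(encoded.head(3))
```

Each row of `encoded` contains exactly one 1, located in the column that matches its original code.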