# Machine Learning for Smart Health: Project 2
### Instructor: Juber Rahman
### Student : Vikram Kamthe

#### 1. Select a heart disease database (e.g. MIT-BIH Arrhythmia Database) from Physionet
####    and read the dataset descriptions to identify the beat types (N, V, S, F, Q)

In [2]:
### The MIT-BIH Arrhythmia Database contains:
# 48 half-hour excerpts of two-channel ambulatory ECG recordings.
# The recordings were digitized at 360 samples per second per channel with 11-bit resolution over a 10 mV range.
# Computer-readable reference annotations for each beat (approximately 110,000 annotations in all) are included with the database.
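
The figures in the description above imply a few derived quantities that are handy when reasoning about the signals. The following is a small arithmetic sketch using only the numbers quoted above; the variable names are illustrative and not part of any library.

```python
# Figures quoted in the database description above.
SAMPLE_RATE_HZ = 360      # samples per second per channel
RECORD_MINUTES = 30       # each excerpt is half an hour long
N_RECORDS = 48
ADC_BITS = 11
ADC_RANGE_MV = 10.0

# Samples per channel in one half-hour record.
samples_per_record = SAMPLE_RATE_HZ * RECORD_MINUTES * 60

# Number of quantization levels and the voltage step between them.
adc_levels = 2 ** ADC_BITS
resolution_uv = ADC_RANGE_MV / adc_levels * 1000.0  # microvolts per step

# Total recording time across the whole database.
total_hours = N_RECORDS * RECORD_MINUTES / 60

print(samples_per_record, adc_levels, round(resolution_uv, 2), total_hours)
```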

In [3]:
# More information about the subjects can be found here and can become important for modelling - 
#https://archive.physionet.org/physiobank/database/html/mitdbdir/records.htm

In [4]:
# The labels 0 to 4 correspond to 5 different beat types, each with a specific name.
# 0 - Normal
# 1 - A
# 2 - B
# 3 - C
# 4 - Missing
# We will use this translation to convert the numeric labels to letters.

#### 2. The dataset has been processed and prepared for you and is available at
#### https://drive.google.com/drive/folders/159WV3PR3x5vpWwbbsjCXK5k4tgaNn0Ut?usp=sharing

In [6]:
# import the general modules
import pandas as pd
import numpy as np

In [7]:
# Read the train and test datasets

In [8]:
train = pd.read_csv('mitbih_train.csv', header = None)
test = pd.read_csv('mitbih_test.csv', header = None)

In [9]:
#Add column names
train.columns=["x"+str(i) for i in range(1, 189)]
test.columns=["x"+str(i) for i in range(1, 189)]
train.rename(columns={'x188':'y'}, inplace=True)
test.rename(columns={'x188':'y'}, inplace=True)
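
The naming and renaming steps above could also be done in one pass by building the list of names directly, with the last column called `y`. This is only a sketch on a toy frame of zeros standing in for the real CSV; `toy` and `n_samples` are illustrative names.

```python
import numpy as np
import pandas as pd

# Stand-in for the real CSV: 3 rows, 187 sample columns plus 1 label column.
toy = pd.DataFrame(np.zeros((3, 188)))

# Name the sample columns x1..x187 and the final column y in one step.
n_samples = toy.shape[1] - 1
toy.columns = [f"x{i}" for i in range(1, n_samples + 1)] + ["y"]

print(toy.columns[0], toy.columns[-2], toy.columns[-1])
```

Deriving `n_samples` from the frame's shape avoids hard-coding 189 in the `range` call.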

In [10]:
train.head()

Unnamed: 0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,...,x179,x180,x181,x182,x183,x184,x185,x186,x187,y
0,0.977941,0.926471,0.681373,0.245098,0.154412,0.191176,0.151961,0.085784,0.058824,0.04902,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.960114,0.863248,0.461538,0.196581,0.094017,0.125356,0.099715,0.088319,0.074074,0.082621,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.659459,0.186486,0.07027,0.07027,0.059459,0.056757,0.043243,0.054054,0.045946,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.925414,0.665746,0.541436,0.276243,0.196133,0.077348,0.071823,0.060773,0.066298,0.058011,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.967136,1.0,0.830986,0.586854,0.356808,0.248826,0.14554,0.089202,0.117371,0.150235,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [11]:
train.shape

(87554, 188)

In [12]:
test.head()

Unnamed: 0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,...,x179,x180,x181,x182,x183,x184,x185,x186,x187,y
0,1.0,0.758264,0.11157,0.0,0.080579,0.078512,0.066116,0.049587,0.047521,0.035124,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.908425,0.783883,0.531136,0.362637,0.3663,0.344322,0.333333,0.307692,0.296703,0.300366,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.730088,0.212389,0.0,0.119469,0.10177,0.10177,0.110619,0.123894,0.115044,0.132743,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,0.910417,0.68125,0.472917,0.229167,0.06875,0.0,0.004167,0.014583,0.054167,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.57047,0.399329,0.238255,0.147651,0.0,0.003356,0.040268,0.080537,0.07047,0.090604,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [13]:
test.shape

(21892, 188)

In [14]:
#?? It is not clear how these 187 columns were derived for each beat. Are they simply the ECG sample values of the beat?

In [15]:
# how many null values?
train.isna().sum()

x1      0
x2      0
x3      0
x4      0
x5      0
       ..
x184    0
x185    0
x186    0
x187    0
y       0
Length: 188, dtype: int64

In [16]:
# how many non zero values in each row - 
np.count_nonzero(train, axis=1)

array([ 99, 135,  94, ..., 124, 124, 119])

In [17]:
# The output above tells us that the number of zero-valued (blank) columns differs from record to record.

In [18]:
# how many non zero values in each column - 
np.count_nonzero(train, axis=0)

array([84776, 86865, 85965, 76683, 83492, 85324, 85926, 85736, 85943,
       86376, 86231, 86186, 86168, 86327, 86820, 87030, 87080, 87085,
       87129, 87154, 87188, 87165, 87164, 86998, 86762, 86453, 86392,
       86603, 86683, 86777, 86812, 86761, 86606, 86589, 86875, 87057,
       87136, 87140, 87140, 87101, 87081, 87040, 87031, 86992, 86939,
       86910, 86874, 86799, 86711, 86601, 86599, 86582, 86526, 86410,
       86301, 86153, 85996, 85842, 85735, 85604, 85417, 85240, 85098,
       84864, 84677, 84538, 84307, 83987, 83766, 83445, 83188, 82859,
       82527, 82037, 81470, 80890, 80298, 79548, 78693, 77682, 76506,
       75415, 74465, 73707, 72961, 72195, 71466, 70787, 70081, 69115,
       68293, 67448, 66418, 65336, 64054, 62706, 61462, 60126, 58847,
       57435, 55917, 54404, 52781, 51134, 49431, 47725, 46026, 44481,
       43228, 42079, 40980, 40009, 39018, 38073, 36991, 35924, 34797,
       33559, 32219, 30772, 29392, 27923, 26525, 25250, 24011, 22903,
       21818, 20875,

In [19]:
# The output above tells us that each record is padded with zeros at the end, so the later columns contain more zero values.
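
Counting non-zero cells only approximates the length of a beat, because a genuine sample can also be exactly 0 (several rows shown earlier contain a 0.0 among their first few columns) and the label column is 0 for one of the classes. A minimal sketch of an alternative, assuming the padding is always trailing, is to take the position of the last non-zero sample. The toy array `signals` stands in for the sample columns only (for the real data, something like `train.drop(columns="y").to_numpy()`).

```python
import numpy as np

# Toy stand-in: 3 beats, 8 sample columns, zero-padded on the right.
signals = np.array([
    [0.9, 0.5, 0.0, 0.2, 0.0, 0.0, 0.0, 0.0],  # interior zero at index 2
    [1.0, 0.7, 0.3, 0.1, 0.4, 0.2, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # entirely empty row
])

nonzero = signals != 0
# Distance of the last non-zero sample from the right edge of each row.
last_from_end = np.argmax(nonzero[:, ::-1], axis=1)
# Effective length = index of last non-zero sample + 1; all-zero rows get 0.
lengths = np.where(nonzero.any(axis=1), signals.shape[1] - last_from_end, 0)

print(lengths)                              # effective lengths
print(np.count_nonzero(signals, axis=1))    # plain non-zero counts
```

For the first toy row the two measures differ (4 versus 3) because of the interior zero.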

In [20]:
# Count the number of records for each type of heartbeat

In [21]:
train.y.value_counts()

0.0    72471
4.0     6431
2.0     5788
1.0     2223
3.0      641
Name: y, dtype: int64
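
The counts above show a strong class imbalance. As a quick worked calculation, the sketch below turns the counts copied from the output into class shares and a majority-to-minority ratio; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
class_counts = {0: 72471, 4: 6431, 2: 5788, 1: 2223, 3: 641}

total = sum(class_counts.values())
shares = {label: count / total for label, count in class_counts.items()}
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())

for label, share in sorted(shares.items()):
    print(f"class {label}: {share:.1%}")
print(f"majority/minority ratio: {imbalance_ratio:.0f}")
```

Class 0 accounts for roughly 83% of the training records, so overall accuracy alone can hide poor performance on the rare classes.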

In [22]:
# Beat type dictionary - 
# 0 - 'N'ormal
# 1 - 'A'
# 2 - 'B'
# 3 - 'C'
# 4 - 'M'issing
beat_type_dict = {0:"N", 1:"A", 2:"B", 3:"C", 4:"M"}

In [23]:
train['y'] = train['y'].map(beat_type_dict)
test['y'] = test['y'].map(beat_type_dict)
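
`Series.map` with a dictionary returns NaN for any value that has no key, so a typo in the dictionary would silently produce missing labels. A minimal sketch of a guard, shown on a toy series in which 7.0 is deliberately not in the dictionary:

```python
import pandas as pd

beat_type_dict = {0: "N", 1: "A", 2: "B", 3: "C", 4: "M"}

# Toy labels; 7.0 is deliberately absent from the dictionary.
labels = pd.Series([0.0, 4.0, 2.0, 7.0])
mapped = labels.map(beat_type_dict)

# Collect the original values that did not get a letter.
unmapped = labels[mapped.isna()].unique().tolist()
print(unmapped)
```

On the real data the same check could be run on a copy of the numeric column before overwriting it; an empty list means every label was translated.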

In [24]:
# Making sure that the mapping was successful -
train['y'].value_counts()

N    72471
M     6431
B     5788
A     2223
C      641
Name: y, dtype: int64

#### 3. Train your machine learning models on the train set (mitbih_train)
#### 4. Evaluate your model on the test set (mitbih_test)
#### 5. Identify/ rank the important features (optional)

In [32]:
# We will use h2o AutoML tools to train the model, evaluate the model and find out the feature importance

In [33]:
#!pip install h2o

In [27]:
import h2o
from h2o.automl import H2OAutoML

In [28]:
h2o.init(max_mem_size='8G')

Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: java version "14.0.1" 2020-04-14; Java(TM) SE Runtime Environment (build 14.0.1+7); Java HotSpot(TM) 64-Bit Server VM (build 14.0.1+7, mixed mode, sharing)
  Starting server from /Users/vikramkamthe/opt/anaconda3/envs/SmartHealthML/lib/python3.8/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /var/folders/yj/gj315bwd4yn9hmdnq0fwhrlw0000gn/T/tmpb9kv9jyh
  JVM stdout: /var/folders/yj/gj315bwd4yn9hmdnq0fwhrlw0000gn/T/tmpb9kv9jyh/h2o_vikramkamthe_started_from_python.out
  JVM stderr: /var/folders/yj/gj315bwd4yn9hmdnq0fwhrlw0000gn/T/tmpb9kv9jyh/h2o_vikramkamthe_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


H2O_cluster_uptime:,02 secs
H2O_cluster_timezone:,Asia/Kolkata
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.34.0.3
H2O_cluster_version_age:,1 month and 5 days
H2O_cluster_name:,H2O_from_python_vikramkamthe_ywuoyv
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,8 Gb
H2O_cluster_total_cores:,16
H2O_cluster_allowed_cores:,16


In [29]:
train_hf = h2o.H2OFrame(train)
test_hf = h2o.H2OFrame(test)

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


In [30]:
#Making sure that y variable is treated as a factor
train_hf['y'] = train_hf['y'].asfactor()
test_hf['y'] = test_hf['y'].asfactor()

In [31]:
aml = H2OAutoML(max_runtime_secs = 7200, seed = 1, project_name = "beat_type_classification", nfolds=0, balance_classes=False, 
                class_sampling_factors=None, max_after_balance_size=5.0, max_runtime_secs_per_model=None, max_models=None, 
                stopping_metric='AUTO', stopping_tolerance=None, stopping_rounds=3, exclude_algos=None, 
                #include_algos = ["DeepLearning", "XGBoost", "StackedEnsemble"], 
                exploitation_ratio=0, modeling_plan=None, preprocessing=None, monotone_constraints=None, 
                keep_cross_validation_predictions=False, keep_cross_validation_models=False, 
                keep_cross_validation_fold_assignment=False, sort_metric='AUTO')
aml.train(y='y', training_frame = train_hf, validation_frame = test_hf)

AutoML progress: |███████████████████████████████████████████████████████████████| (done) 100%
Model Details
H2OXGBoostEstimator :  XGBoost
Model Key:  XGBoost_grid_1_AutoML_1_20211112_163926_model_1


Model Summary: 


Unnamed: 0,Unnamed: 1,number_of_trees
0,,110.0




ModelMetricsMultinomial: xgboost
** Reported on train data. **

MSE: 0.0006521739395373405
RMSE: 0.025537696441483137
LogLoss: 0.006338700346612113
Mean Per-Class Error: 0.0002691692970919025
AUC: NaN
AUCPR: NaN
Multinomial auc values: Table is not computed because it is disabled (model parameter 'auc_type' is set to AUTO or NONE) or due to domain size (maximum is 50 domains).
Multinomial auc_pr values: Table is not computed because it is disabled (model parameter 'auc_type' is set to AUTO or NONE) or due to domain size (maximum is 50 domains).

Confusion Matrix: Row labels: Actual class; Column labels: Predicted class


Actual,A,B,C,M,N,Error,Rate
A,2024.0,0.0,0.0,0.0,2.0,0.000987,"2 / 2,026"
B,0.0,5168.0,0.0,0.0,0.0,0.0,"0 / 5,168"
C,0.0,0.0,573.0,0.0,0.0,0.0,0 / 573
M,0.0,1.0,0.0,5823.0,1.0,0.000343,"2 / 5,825"
N,1.0,0.0,0.0,0.0,65222.0,1.5e-05,"1 / 65,223"
Total,2025.0,5169.0,573.0,5823.0,65225.0,6.3e-05,"5 / 78,815"



Top-5 Hit Ratios: 


Unnamed: 0,k,hit_ratio
0,1,0.999937
1,2,1.0
2,3,1.0
3,4,1.0
4,5,1.0



ModelMetricsMultinomial: xgboost
** Reported on validation data. **

MSE: 0.015784850174583698
RMSE: 0.12563777367728107
LogLoss: 0.06983686171855555
Mean Per-Class Error: 0.13170552022272422
AUC: NaN
AUCPR: NaN
Multinomial auc values: Table is not computed because it is disabled (model parameter 'auc_type' is set to AUTO or NONE) or due to domain size (maximum is 50 domains).
Multinomial auc_pr values: Table is not computed because it is disabled (model parameter 'auc_type' is set to AUTO or NONE) or due to domain size (maximum is 50 domains).

Confusion Matrix: Row labels: Actual class; Column labels: Predicted class


Actual,A,B,C,M,N,Error,Rate
A,394.0,4.0,0.0,1.0,157.0,0.291367,162 / 556
B,2.0,1339.0,13.0,6.0,88.0,0.075276,"109 / 1,448"
C,0.0,10.0,120.0,0.0,32.0,0.259259,42 / 162
M,1.0,4.0,0.0,1559.0,44.0,0.030473,"49 / 1,608"
N,14.0,19.0,2.0,4.0,18079.0,0.002153,"39 / 18,118"
Total,411.0,1376.0,135.0,1570.0,18400.0,0.018317,"401 / 21,892"



Top-5 Hit Ratios: 


Unnamed: 0,k,hit_ratio
0,1,0.981683
1,2,0.996209
2,3,0.998812
3,4,0.99936
4,5,1.0



Scoring History: 


Unnamed: 0,Unnamed: 1,timestamp,duration,number_of_trees,training_rmse,training_logloss,training_classification_error,training_auc,training_pr_auc,validation_rmse,validation_logloss,validation_classification_error,validation_auc,validation_pr_auc
0,,2021-11-12 17:15:04,0.034 sec,0.0,0.8,1.609438,0.172454,,,0.8,1.609438,0.172392,,
1,,2021-11-12 17:15:30,26.141 sec,5.0,0.285094,0.333022,0.031276,,,0.29256,0.345437,0.035401,,
2,,2021-11-12 17:15:50,46.118 sec,10.0,0.163843,0.136907,0.024526,,,0.181362,0.159082,0.030376,,
3,,2021-11-12 17:16:11,1 min 6.294 sec,15.0,0.129594,0.07707,0.018778,,,0.157258,0.108811,0.027088,,
4,,2021-11-12 17:16:30,1 min 26.219 sec,20.0,0.111704,0.054067,0.01435,,,0.147376,0.091395,0.024804,,
5,,2021-11-12 17:16:50,1 min 45.887 sec,25.0,0.098536,0.041575,0.011102,,,0.141503,0.082942,0.023296,,
6,,2021-11-12 17:17:09,2 min 5.267 sec,30.0,0.088802,0.034143,0.008932,,,0.137738,0.078443,0.022428,,
7,,2021-11-12 17:17:29,2 min 24.847 sec,35.0,0.080128,0.028373,0.007232,,,0.134929,0.075357,0.020921,,
8,,2021-11-12 17:17:49,2 min 44.722 sec,40.0,0.072618,0.024053,0.005798,,,0.133162,0.073449,0.020692,,
9,,2021-11-12 17:18:08,3 min 4.248 sec,45.0,0.065562,0.020453,0.004479,,,0.131512,0.071941,0.02019,,



See the whole table with table.as_data_frame()

Variable Importances: 


Unnamed: 0,variable,relative_importance,scaled_importance,percentage
0,x5,12505.969727,1.0,0.109896
1,x3,7651.562012,0.611833,0.067238
2,x2,6245.037598,0.499365,0.054878
3,x6,5863.927734,0.46889,0.051529
4,x12,5439.962402,0.434989,0.047804
5,x4,4309.022949,0.344557,0.037866
6,x15,4073.611084,0.325733,0.035797
7,x35,3684.303711,0.294604,0.032376
8,x1,3496.267334,0.279568,0.030723
9,x7,2014.085083,0.16105,0.017699



See the whole table with table.as_data_frame()
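
The `aml.train` call above passes the test set as `validation_frame`, so model selection and early stopping can see it. One possible alternative, offered only as a sketch, is to carve a stratified validation split out of the training data and keep the test set for the final evaluation. The toy frame and the names `train_part` and `valid_part` are illustrative.

```python
import pandas as pd

# Toy stand-in for `train`: two features and an imbalanced label column.
toy = pd.DataFrame({
    "x1": range(100),
    "x2": range(100, 200),
    "y": ["N"] * 80 + ["A"] * 20,
})

# Draw 20% of each class for validation; the remaining rows stay in training.
valid_part = toy.groupby("y", group_keys=False).sample(frac=0.2, random_state=1)
train_part = toy.drop(valid_part.index)

print(valid_part["y"].value_counts().to_dict())
print(len(train_part), len(valid_part))
```

With the real data, the two parts could be converted with `h2o.H2OFrame` and passed as `training_frame` and `validation_frame`, leaving `test_hf` for a final `aml.leader.model_performance(test_hf)` call.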




#### h2o AutoML has determined XGBoost to be the best-performing model for our scenario.
![image.png](attachment:68af8565-8bea-4781-a9a7-5288e515ab86.png)

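#### Based on the confusion matrix below (validation data), the model performs well overall: 401 of 21,892 beats (about 1.8%) are misclassified, although classes A and C have error rates of about 29% and 26%.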
![image.png](attachment:dd376788-27cf-4e09-89e5-b4e8cdc14a03.png)
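
The per-class picture can be recomputed directly from the validation confusion matrix printed in the model output. The sketch below copies those counts into an array and derives recall, precision, overall accuracy and the mean per-class error; only NumPy is assumed.

```python
import numpy as np

labels = ["A", "B", "C", "M", "N"]
# Rows: actual class; columns: predicted class (validation confusion matrix).
cm = np.array([
    [394,    4,   0,    1,   157],
    [  2, 1339,  13,    6,    88],
    [  0,   10, 120,    0,    32],
    [  1,    4,   0, 1559,    44],
    [ 14,   19,   2,    4, 18079],
])

recall = np.diag(cm) / cm.sum(axis=1)      # correct / actual count per class
precision = np.diag(cm) / cm.sum(axis=0)   # correct / predicted count per class
accuracy = np.trace(cm) / cm.sum()
mean_per_class_error = (1 - recall).mean()

for name, r, p in zip(labels, recall, precision):
    print(f"{name}: recall={r:.3f} precision={p:.3f}")
print(f"accuracy={accuracy:.4f} mean per-class error={mean_per_class_error:.4f}")
```

The mean per-class error weights every class equally, which is why it is much higher than the overall error rate dominated by class N.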

#### Variable importance for this model can be seen below - 
![image.png](attachment:555c4b65-395d-411b-bc10-dfaac457bfd1.png)
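
In the variable-importance table printed in the model output, the `scaled_importance` column appears to be the relative importance divided by the largest relative importance. The sketch below checks that reading against the top rows copied from the table. The `percentage` column cannot be reproduced here because the printed table is truncated; on a live session the full table should be available through `aml.leader.varimp(use_pandas=True)`.

```python
# Top rows of the variable-importance table printed in the model output.
relative_importance = {
    "x5": 12505.969727,
    "x3": 7651.562012,
    "x2": 6245.037598,
    "x6": 5863.927734,
    "x12": 5439.962402,
}

# Scale every value by the largest one so the top feature gets 1.0.
top = max(relative_importance.values())
scaled = {name: value / top for name, value in relative_importance.items()}

for name, value in scaled.items():
    print(f"{name}: {value:.6f}")
```

All of the top-ranked features are early samples of the beat (x1 to x35), which is consistent with the later columns being mostly zero padding.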