# ELEC0118: Fourth Year MEng Projects 
# Sensor Arrays for Movement Sensing
## Members: Ken Yew Piong, Ka Shing Liong, Jing Wei Chan, Chin Yang Tan

# 1. Introduction
Wearable technology has considerable potential in healthcare. This project concerns the use of motion sensors, such as accelerometers, in the self-management of symptoms of Parkinson's Disease (PD), a neurological condition with multiple motor symptoms including tremor and slowness of movement.

Motion sensors are already widely used in fitness monitoring, but commercial systems do not make the raw data readily available. For this project, data will be streamed and stored from a sensor array device, containing accelerometers and gyroscopes, that is worn by the patient. The signals from these sensors are processed to extract features that are characteristic of particular movements. The challenge is to distinguish the signals of the particular movements of interest from those of other, intended movements and to study how these movements vary during the day.

## 1.1 Motivation
This notebook will investigate signal processing algorithms to extract the relevant movement data and suggest parameters that can be clearly provided to clinicians to quantify the variation in relevant movement during a 24-hour period.

We will first visualise the given data using several plotting methods (boxplots, scatter plots, heatmaps and histograms) to gain a better understanding of the correlations between features. We will also use dimensionality reduction methods to visualise the data more compactly and to capture the overall trend of the data. **IN PROGRESS - PENDING ACTUAL DATA**

We will then focus on running the following machine learning models for this case and then compute their corresponding performance scores to determine the most effective and feasible model for this application. 
1. Linear Logistic Regression
2. K-Nearest Neighbours Classifier
3. Linear Support Vector Machine
4. Kernel RBF Support Vector Machine
5. Adaptive Boosting (AdaBoost)
6. Random Forest
7. Neural Networks

## Importing the Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict, cross_validate
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report, accuracy_score,confusion_matrix,precision_score, recall_score, roc_curve, roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier

# 2. Data Loading, Wrangling and Pre-processing

In [2]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/parkinsons/parkinsons.data"
dataset = pd.read_csv(url)
df = dataset.copy() # df - dataframe
print("The dimensions of the dataframe (rows x cols):", df.shape) # (195, 24)

The dimensions of the dataframe (rows x cols): (195, 24)


In [3]:
df.describe()

Unnamed: 0,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,MDVP:Shimmer(dB),...,Shimmer:DDA,NHR,HNR,status,RPDE,DFA,spread1,spread2,D2,PPE
count,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,...,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0
mean,154.228641,197.104918,116.324631,0.00622,4.4e-05,0.003306,0.003446,0.00992,0.029709,0.282251,...,0.046993,0.024847,21.885974,0.753846,0.498536,0.718099,-5.684397,0.22651,2.381826,0.206552
std,41.390065,91.491548,43.521413,0.004848,3.5e-05,0.002968,0.002759,0.008903,0.018857,0.194877,...,0.030459,0.040418,4.425764,0.431878,0.103942,0.055336,1.090208,0.083406,0.382799,0.090119
min,88.333,102.145,65.476,0.00168,7e-06,0.00068,0.00092,0.00204,0.00954,0.085,...,0.01364,0.00065,8.441,0.0,0.25657,0.574282,-7.964984,0.006274,1.423287,0.044539
25%,117.572,134.8625,84.291,0.00346,2e-05,0.00166,0.00186,0.004985,0.016505,0.1485,...,0.024735,0.005925,19.198,1.0,0.421306,0.674758,-6.450096,0.174351,2.099125,0.137451
50%,148.79,175.829,104.315,0.00494,3e-05,0.0025,0.00269,0.00749,0.02297,0.221,...,0.03836,0.01166,22.085,1.0,0.495954,0.722254,-5.720868,0.218885,2.361532,0.194052
75%,182.769,224.2055,140.0185,0.007365,6e-05,0.003835,0.003955,0.011505,0.037885,0.35,...,0.060795,0.02564,25.0755,1.0,0.587562,0.761881,-5.046192,0.279234,2.636456,0.25298
max,260.105,592.03,239.17,0.03316,0.00026,0.02144,0.01958,0.06433,0.11908,1.302,...,0.16942,0.31482,33.047,1.0,0.685151,0.825288,-2.434031,0.450493,3.671155,0.527367


## 2.1 Dataframe Attribute Information

### Matrix column entries (attributes):
**name**: ASCII subject name and recording number

**MDVP:Fo(Hz)**: Average vocal fundamental frequency

**MDVP:Fhi(Hz)**: Maximum vocal fundamental frequency

**MDVP:Flo(Hz)**: Minimum vocal fundamental frequency

**MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP**: Several measures of variation in fundamental frequency

**MDVP:Shimmer,MDVP:Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,MDVP:APQ,Shimmer:DDA**: Several measures of variation in amplitude

**NHR,HNR**: Two measures of ratio of noise to tonal components in the voice

**status**: Health status of the subject (one) - Parkinson's, (zero) - healthy

**RPDE,D2**: Two nonlinear dynamical complexity measures

**DFA**: Signal fractal scaling exponent

**spread1,spread2,PPE**: Three nonlinear measures of fundamental frequency variation

## 2.2 Data Restructuring

In [4]:
df["result"] = df["status"] # duplicate the status column as result so the label ends up last
df.drop(["name", "status"], axis=1, inplace=True) # drop name and the original status column (now replaced by result)
df

Unnamed: 0,MDVP:Fo(Hz),MDVP:Fhi(Hz),MDVP:Flo(Hz),MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP,MDVP:Shimmer,MDVP:Shimmer(dB),...,Shimmer:DDA,NHR,HNR,RPDE,DFA,spread1,spread2,D2,PPE,result
0,119.992,157.302,74.997,0.00784,0.00007,0.00370,0.00554,0.01109,0.04374,0.426,...,0.06545,0.02211,21.033,0.414783,0.815285,-4.813031,0.266482,2.301442,0.284654,1
1,122.400,148.650,113.819,0.00968,0.00008,0.00465,0.00696,0.01394,0.06134,0.626,...,0.09403,0.01929,19.085,0.458359,0.819521,-4.075192,0.335590,2.486855,0.368674,1
2,116.682,131.111,111.555,0.01050,0.00009,0.00544,0.00781,0.01633,0.05233,0.482,...,0.08270,0.01309,20.651,0.429895,0.825288,-4.443179,0.311173,2.342259,0.332634,1
3,116.676,137.871,111.366,0.00997,0.00009,0.00502,0.00698,0.01505,0.05492,0.517,...,0.08771,0.01353,20.644,0.434969,0.819235,-4.117501,0.334147,2.405554,0.368975,1
4,116.014,141.781,110.655,0.01284,0.00011,0.00655,0.00908,0.01966,0.06425,0.584,...,0.10470,0.01767,19.649,0.417356,0.823484,-3.747787,0.234513,2.332180,0.410335,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
190,174.188,230.978,94.261,0.00459,0.00003,0.00263,0.00259,0.00790,0.04087,0.405,...,0.07008,0.02764,19.517,0.448439,0.657899,-6.538586,0.121952,2.657476,0.133050,0
191,209.516,253.017,89.488,0.00564,0.00003,0.00331,0.00292,0.00994,0.02751,0.263,...,0.04812,0.01810,19.147,0.431674,0.683244,-6.195325,0.129303,2.784312,0.168895,0
192,174.688,240.005,74.287,0.01360,0.00008,0.00624,0.00564,0.01873,0.02308,0.256,...,0.03804,0.10715,17.883,0.407567,0.655683,-6.787197,0.158453,2.679772,0.131728,0
193,198.764,396.961,74.904,0.00740,0.00004,0.00370,0.00390,0.01109,0.02296,0.241,...,0.03794,0.07223,19.020,0.451221,0.643956,-6.744577,0.207454,2.138608,0.123306,0


In [5]:
df.columns = list(range(23)) # renaming columns to the integers 0 to 22
df.describe() # the count row should read 195 for every column if no values are missing

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,13,14,15,16,17,18,19,20,21,22
count,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,...,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0,195.0
mean,154.228641,197.104918,116.324631,0.00622,4.4e-05,0.003306,0.003446,0.00992,0.029709,0.282251,...,0.046993,0.024847,21.885974,0.498536,0.718099,-5.684397,0.22651,2.381826,0.206552,0.753846
std,41.390065,91.491548,43.521413,0.004848,3.5e-05,0.002968,0.002759,0.008903,0.018857,0.194877,...,0.030459,0.040418,4.425764,0.103942,0.055336,1.090208,0.083406,0.382799,0.090119,0.431878
min,88.333,102.145,65.476,0.00168,7e-06,0.00068,0.00092,0.00204,0.00954,0.085,...,0.01364,0.00065,8.441,0.25657,0.574282,-7.964984,0.006274,1.423287,0.044539,0.0
25%,117.572,134.8625,84.291,0.00346,2e-05,0.00166,0.00186,0.004985,0.016505,0.1485,...,0.024735,0.005925,19.198,0.421306,0.674758,-6.450096,0.174351,2.099125,0.137451,1.0
50%,148.79,175.829,104.315,0.00494,3e-05,0.0025,0.00269,0.00749,0.02297,0.221,...,0.03836,0.01166,22.085,0.495954,0.722254,-5.720868,0.218885,2.361532,0.194052,1.0
75%,182.769,224.2055,140.0185,0.007365,6e-05,0.003835,0.003955,0.011505,0.037885,0.35,...,0.060795,0.02564,25.0755,0.587562,0.761881,-5.046192,0.279234,2.636456,0.25298,1.0
max,260.105,592.03,239.17,0.03316,0.00026,0.02144,0.01958,0.06433,0.11908,1.302,...,0.16942,0.31482,33.047,0.685151,0.825288,-2.434031,0.450493,3.671155,0.527367,1.0
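The `count` row of `describe()` shows missing values only indirectly. A more direct check can be sketched as below. The frame `toy` is a hypothetical stand-in for `df`, with one value deliberately removed so that the check has something to find.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df; one entry is deliberately missing.
toy = pd.DataFrame({"a": [1.0, 2.0, np.nan], "b": [0.1, 0.2, 0.3]})

# Number of NaN entries in each column.
missing_per_column = toy.isna().sum()

# True if any cell anywhere in the frame is NaN.
has_missing = bool(toy.isna().any().any())

print(missing_per_column)
print("Any missing values:", has_missing)
```

Applied to the real dataframe, `df.isna().sum()` should report zero for every column if the data is complete.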


## 2.3 Restructured Dataframe Attribute Information

### Matrix column entries (attributes):
The **name** column (ASCII subject name and recording number) was dropped in section 2.2, so the integer labels start at the first acoustic feature.

**0**: MDVP:Fo(Hz) - Average vocal fundamental frequency

**1**: MDVP:Fhi(Hz) - Maximum vocal fundamental frequency

**2**: MDVP:Flo(Hz) - Minimum vocal fundamental frequency

**3, 4, 5, 6, 7**: MDVP:Jitter(%),MDVP:Jitter(Abs),MDVP:RAP,MDVP:PPQ,Jitter:DDP - Several measures of variation in fundamental frequency

**8, 9, 10, 11, 12, 13**: MDVP:Shimmer,MDVP:Shimmer(dB),Shimmer:APQ3,Shimmer:APQ5,MDVP:APQ,Shimmer:DDA - Several measures of variation in amplitude

**14, 15**: NHR,HNR - Two measures of ratio of noise to tonal components in the voice

**16, 20**: RPDE,D2 - Two nonlinear dynamical complexity measures

**17**: DFA - Signal fractal scaling exponent

**18, 19, 21**: spread1,spread2,PPE - Three nonlinear measures of fundamental frequency variation

**22**: result (formerly status) - Health status of the subject (one) - Parkinson's, (zero) - healthy
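The correspondence between integer labels and attribute names can be generated rather than written by hand. The sketch below assumes the column order of the UCI `parkinsons.data` header with `name` and `status` removed and `result` appended last, as done in section 2.2; the list `feature_names` is typed out here for illustration and is not read from the file.

```python
# Assumed column order after section 2.2: UCI header order without
# "name" and "status", followed by the appended "result" column.
feature_names = [
    "MDVP:Fo(Hz)", "MDVP:Fhi(Hz)", "MDVP:Flo(Hz)",
    "MDVP:Jitter(%)", "MDVP:Jitter(Abs)", "MDVP:RAP", "MDVP:PPQ", "Jitter:DDP",
    "MDVP:Shimmer", "MDVP:Shimmer(dB)", "Shimmer:APQ3", "Shimmer:APQ5",
    "MDVP:APQ", "Shimmer:DDA", "NHR", "HNR",
    "RPDE", "DFA", "spread1", "spread2", "D2", "PPE",
    "result",
]

# Map each integer column label to its original attribute name.
index_to_name = dict(enumerate(feature_names))

for idx, name in index_to_name.items():
    print(idx, name)
```

In the notebook itself, the same mapping could be captured with `dict(enumerate(df.columns))` just before the columns are renamed, which avoids relying on a hand-typed list.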

## 2.4 Data Wrangling

In [6]:
data = df.values
X_data = data[:, :22] # Take all rows and all columns but the last one
y_data = data[:, 22] # Take all rows and only the last column
# Split data set into 70% train - 30% test 
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.3, random_state=0)

# Standardize the X_train feature to have a mean of 0 and a standard deviation of 1
scaler = StandardScaler() # Create scaler

X_train_scaled = scaler.fit_transform(X_train) # Scale feature
X_test_scaled = scaler.transform(X_test) # Transform feature matrix
num_feature = len(df.columns) - 1
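The reason the scaler is fitted on the training split only can be illustrated on synthetic data. In the sketch below, `X_demo` and `y_demo` are made-up arrays, and `stratify=y_demo` is an optional addition that the split above does not use; it keeps the class proportions equal in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(200, 4))  # synthetic features
y_demo = np.array([0, 1] * 100)                          # synthetic balanced labels

# 70% train, 30% test; stratify is an assumption added for this sketch.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

demo_scaler = StandardScaler()
X_tr_s = demo_scaler.fit_transform(X_tr)  # statistics estimated from training data only
X_te_s = demo_scaler.transform(X_te)      # the same statistics reused on the test data

print("train mean:", X_tr_s.mean(axis=0).round(6))
print("train std :", X_tr_s.std(axis=0).round(6))
print("test mean :", X_te_s.mean(axis=0).round(3))
```

The training columns come out with mean 0 and standard deviation 1 by construction, while the test columns are only approximately standardised. That is expected: the test set is treated as unseen data and must not influence the scaling parameters.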

### 2.4.1 Rescaled X_train array

In [7]:
X_train_scaled

array([[ 1.20277786,  3.55007322,  1.46378495, ...,  0.37523479,
         1.38253798,  0.70950026],
       [-0.08490869, -0.40139205,  0.69127853, ..., -0.57555776,
        -0.31740397, -0.64646829],
       [-1.11208247, -0.68412437, -0.30308418, ..., -0.30690231,
        -1.04720163,  1.20993771],
       ...,
       [ 0.41632981,  2.41567671, -0.86941153, ...,  1.69605993,
         1.55536714,  0.84509927],
       [ 2.43201619,  0.56163635,  1.59509347, ..., -0.36324299,
        -0.93743211, -1.26715615],
       [-1.06441596, -0.89097029, -0.36804959, ..., -0.46543789,
        -1.3048217 , -0.32328725]])

### 2.4.2 Rescaled X_test array

In [8]:
X_test_scaled

array([[-1.35887755, -1.00188268, -0.66104715, ..., -1.83845482,
        -0.87167251, -0.28969673],
       [-0.41832753, -0.43650645,  0.37155845, ..., -0.94120095,
        -0.90410363, -0.70662576],
       [ 1.19374787,  0.07167505,  1.93460009, ..., -0.63181426,
        -2.21110636, -1.59259592],
       ...,
       [ 0.80957851,  0.07518158,  0.80306179, ...,  1.31294332,
         1.51043619,  1.04630896],
       [ 0.3139159 ,  0.01670996, -0.89316488, ...,  1.90191687,
         2.45455716,  1.05131043],
       [ 0.06226042, -0.14643695,  0.64581462, ...,  1.31198126,
         1.24165312,  1.26813762]])

## 2.5 Data Pre-processing
### 2.5.1 Principal Component Analysis

In [9]:
pca = PCA()
projected = pca.fit_transform(X_train_scaled)
# print(projected[:, 0])
print("Training data shape :", X_train.shape)
print("Projected data shape :", projected.shape)
print("Explained variance :", np.sum(pca.explained_variance_ratio_))

Training data shape : (136, 22)
Projected data shape : (136, 22)
Explained variance : 1.0


### 2.5.2 Calculating the optimal number of PCA components, k

In [10]:
k = 0
total = sum(pca.explained_variance_)
current_sum = 0
while(current_sum / total < 0.99):
    current_sum += pca.explained_variance_[k]
    k += 1
print(k)
num_feature = k

12
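The loop above picks the smallest number of components whose cumulative explained variance reaches 99%. The same selection can be sketched with a cumulative sum, and scikit-learn's `PCA` also accepts a fraction for `n_components`. The data below is synthetic (ten features driven by three hypothetical latent factors), so the printed value is not the notebook's `k`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic correlated data: 10 observed features from 3 latent factors plus small noise.
latent = rng.normal(size=(150, 3))
X_demo = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(150, 10))

pca_demo = PCA().fit(X_demo)

# Cumulative fraction of variance explained by the first 1, 2, ... components.
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)

# First index at which the cumulative fraction reaches 0.99, converted to a count.
k_demo = int(np.searchsorted(cumulative, 0.99) + 1)

# Alternative: let PCA choose the number of components for a target variance fraction.
pca_auto = PCA(n_components=0.99, svd_solver="full").fit(X_demo)

print(k_demo, pca_auto.n_components_)
```

Both routes should agree except in the edge case where the cumulative fraction equals the threshold exactly.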


### 2.5.3 Data Pre-processing using PCA with optimal k

In [11]:
pca = PCA(n_components=k, whiten=True)
X_train_scaled = pca.fit_transform(X_train_scaled)
X_test_scaled = pca.transform(X_test_scaled)

# 3. Machine Learning Models

## 3.1 Linear Logistic Regression
This **Linear Logistic Regression** model uses a logistic function to model a binary dependent variable, with an $\ell_2$-regularised penalty on the weights.

We start with **linear logistic regression** as the baseline of classification models. 
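To make the two ingredients concrete, the sketch below writes out the logistic function and an $\ell_2$-penalised log-loss in NumPy. It follows the usual textbook form of the objective, which matches scikit-learn's formulation only up to an overall scaling; the function names and the toy arrays are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def l2_logistic_loss(w, b, X, y, C=1.0):
    """0.5 * ||w||^2 + C * (summed log-loss); the intercept b is not penalised."""
    p = sigmoid(X @ w + b)
    eps = 1e-12  # guards against log(0)
    log_loss = -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return 0.5 * np.dot(w, w) + C * log_loss

# Toy data: four samples, two features, binary labels.
X_toy = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
y_toy = np.array([1, 0, 1, 0])

# With zero weights every predicted probability is 0.5, so the loss is 4 * ln(2).
loss_at_zero = l2_logistic_loss(np.zeros(2), 0.0, X_toy, y_toy)
print(loss_at_zero)
```

A smaller `C` gives the penalty term more relative weight and therefore shrinks the weights more strongly.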

In [12]:
#-------------- 
# Linear Logistic Regression 
#--------------
# Instantiate and train the machine learning model
log_reg = LogisticRegression(solver='lbfgs', penalty='l2')
log_reg.fit(X_train_scaled, y_train)
print("log_reg score on training data:", round(log_reg.score(X_train_scaled,y_train), 4))
print("log_reg score on testing data:", round(log_reg.score(X_test_scaled,y_test), 4))

log_reg score on training data: 0.8824
log_reg score on testing data: 0.8814


In [13]:
# Perform K = 20 fold cross validation and evaluate the performance scores 
score_ACC = np.mean(cross_val_score(log_reg, X_train_scaled, y_train, scoring = 'accuracy', cv=20))
score_AUROC = np.mean(cross_val_score(log_reg, X_train_scaled, y_train, scoring = 'roc_auc', cv=20))
score_P = np.mean(cross_val_score(log_reg, X_train_scaled, y_train, scoring = 'precision', cv=20))
score_AP = np.mean(cross_val_score(log_reg, X_train_scaled, y_train, scoring = 'average_precision', cv=20))
score_F1 = np.mean(cross_val_score(log_reg, X_train_scaled, y_train, scoring = 'f1', cv=20))
score_RECALL = np.mean(cross_val_score(log_reg, X_train_scaled, y_train, scoring = 'recall', cv=20))

# Print the mean accuracy and AUROC over the 20 folds. The scores above were already
# averaged with np.mean, so they are scalars and no per-fold spread is available here.
print("log_reg 20-fold cross-validation on training data\nAccuracy: %.4f, AUROC Score: %.4f" % (score_ACC, score_AUROC))

# Calculate, save and print the accuracy of the model on the held-out testing data
y_pred = log_reg.predict(X_test_scaled)
cm = confusion_matrix(y_test, y_pred)
score_ACC_TEST = (cm[0,0] + cm[1,1]) / len(X_test_scaled)
print("log_reg score on testing data (no cross-validation)\nAccuracy: %.4f" % score_ACC_TEST)

log_reg 20-fold cross-validation on training data
Accuracy: 0.8458, AUROC Score: 0.8642
log_reg score on testing data (no cross-validation)
Accuracy: 0.8814
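The six `cross_val_score` calls in the cell above refit the model once per metric and keep only the fold means. A possible alternative is a single `cross_validate` call with a list of scorers, which returns the per-fold scores so that a genuine standard deviation can be reported. The sketch uses synthetic data from `make_classification` as a stand-in for `X_train_scaled` and `y_train`, so the numbers it prints are not results for the Parkinson's dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the training data (assumption for this sketch).
X_demo, y_demo = make_classification(n_samples=200, n_features=12, random_state=0)

metrics = ["accuracy", "roc_auc", "precision", "average_precision", "f1", "recall"]

# One call evaluates every metric on the same 20 folds.
cv_results = cross_validate(
    LogisticRegression(solver="lbfgs"), X_demo, y_demo, scoring=metrics, cv=20
)

for m in metrics:
    fold_scores = cv_results["test_" + m]  # array with one score per fold
    print("%-18s %.4f (+/- %.2f)" % (m, fold_scores.mean(), fold_scores.std()))
```

The same pattern applies unchanged to the other classifiers by swapping the estimator passed as the first argument.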


## 3.2 K-Nearest Neighbours (KNN) Classifier
The **k-NN classification** model classifies by using the majority vote of its neighbours with the object being assigned to the class most common among its k-nearest neighbours. 
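The majority-vote rule can be written out in a few lines of NumPy. This is a minimal sketch using Euclidean distance on a made-up toy set; `knn_predict`, `X_toy` and `y_toy` are hypothetical names, and the sketch is not how `KNeighborsClassifier` is implemented internally.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=5):
    """Classify x_query by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]                     # indices of the k closest points
    votes = np.bincount(y_train[nearest].astype(int))       # vote count per class label
    return int(np.argmax(votes))                            # ties go to the lower label

# Toy data: two well-separated clusters with labels 0 and 1.
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.05, 0.05]), k=3))  # → 0
print(knn_predict(X_toy, y_toy, np.array([1.0, 0.95]), k=3))   # → 1
```

Because the vote depends on distances, the features should be on comparable scales, which is one reason the data is standardised before this model is fitted.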

In [14]:
#-------------- 
# KNN Classifier
#--------------
# Instantiate and train the machine learning model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)
print("knn score on training data:", round(knn.score(X_train_scaled,y_train), 4))
print("knn score on testing data:", round(knn.score(X_test_scaled,y_test), 4))

knn score on training data: 0.9044
knn score on testing data: 0.9153


In [15]:
# Perform K = 20 fold cross validation and evaluate the performance scores 
score_ACC = np.mean(cross_val_score(knn, X_train_scaled, y_train, scoring = 'accuracy', cv=20))
score_AUROC = np.mean(cross_val_score(knn, X_train_scaled, y_train, scoring = 'roc_auc', cv=20))
score_P = np.mean(cross_val_score(knn, X_train_scaled, y_train, scoring = 'precision', cv=20))
score_AP = np.mean(cross_val_score(knn, X_train_scaled, y_train, scoring = 'average_precision', cv=20))
score_F1 = np.mean(cross_val_score(knn, X_train_scaled, y_train, scoring = 'f1', cv=20))
score_RECALL = np.mean(cross_val_score(knn, X_train_scaled, y_train, scoring = 'recall', cv=20))

# Print the mean accuracy and AUROC over the 20 folds. The scores above were already
# averaged with np.mean, so they are scalars and no per-fold spread is available here.
print("knn 20-fold cross-validation on training data\nAccuracy: %.4f, AUROC Score: %.4f" % (score_ACC, score_AUROC))

# Calculate, save and print the accuracy of the model on the held-out testing data
y_pred = knn.predict(X_test_scaled)
cm = confusion_matrix(y_test, y_pred)
score_ACC_TEST = (cm[0,0] + cm[1,1]) / len(X_test_scaled)
print("knn score on testing data (no cross-validation)\nAccuracy: %.4f" % score_ACC_TEST)

knn 20-fold cross-validation on training data
Accuracy: 0.8449, AUROC Score: 0.8879
knn score on testing data (no cross-validation)
Accuracy: 0.9153


## 3.3 Linear Support Vector Machine (SVM)

Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outlier detection. An SVM uses only a subset of the training points (called support vectors) in its decision function, which makes it memory efficient. The SVM model is used here because it is versatile, in that different kernel functions can be specified for the decision function, and effective in high-dimensional spaces.

In [16]:
#-------------- 
# Linear SVM 
#--------------
# Instantiate and train the machine learning model
svc = SVC(kernel = "linear")
svc.fit(X_train_scaled, y_train)
print("svc score on training data:", round(svc.score(X_train_scaled,y_train), 4))
print("svc score on testing data:", round(svc.score(X_test_scaled,y_test), 4))

svc score on training data: 0.8897
svc score on testing data: 0.8475


In [17]:
# Perform K = 20 fold cross validation and evaluate the performance scores 
score_ACC = np.mean(cross_val_score(svc, X_train_scaled, y_train, scoring = 'accuracy', cv=20))
score_AUROC = np.mean(cross_val_score(svc, X_train_scaled, y_train, scoring = 'roc_auc', cv=20))
score_P = np.mean(cross_val_score(svc, X_train_scaled, y_train, scoring = 'precision', cv=20))
score_AP = np.mean(cross_val_score(svc, X_train_scaled, y_train, scoring = 'average_precision', cv=20))
score_F1 = np.mean(cross_val_score(svc, X_train_scaled, y_train, scoring = 'f1', cv=20))
score_RECALL = np.mean(cross_val_score(svc, X_train_scaled, y_train, scoring = 'recall', cv=20))

# Print the mean accuracy and AUROC over the 20 folds. The scores above were already
# averaged with np.mean, so they are scalars and no per-fold spread is available here.
print("svc 20-fold cross-validation on training data\nAccuracy: %.4f, AUROC Score: %.4f" % (score_ACC, score_AUROC))

# Calculate, save and print the accuracy of the model on the held-out testing data
y_pred = svc.predict(X_test_scaled)
cm = confusion_matrix(y_test, y_pred)
score_ACC_TEST = (cm[0,0] + cm[1,1]) / len(X_test_scaled)
print("svc score on testing data (no cross-validation)\nAccuracy: %.4f" % score_ACC_TEST)

svc 20-fold cross-validation on training data
Accuracy: 0.8399, AUROC Score: 0.8517
svc score on testing data (no cross-validation)
Accuracy: 0.8475


## 3.4 Kernel RBF SVM
Based on the results from the linear SVM, it seems that the feature space provided was not rich enough to describe all relationships between the classes linearly. Because SVMs are amenable to the kernel trick, we can use the Radial Basis Function (RBF) kernel, which implicitly maps the features into an infinite-dimensional space, and observe any improvements.
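The RBF kernel itself is a simple function of the squared distance between two samples, $K(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$. The sketch below computes it by hand on random data and compares it with scikit-learn's `rbf_kernel`. The helper `rbf_kernel_manual` and the array `X_demo` are illustrative assumptions; the value `1 / n_features` is what `gamma="auto"` stands for in `SVC`.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def rbf_kernel_manual(A, B, gamma):
    """Pairwise RBF kernel exp(-gamma * ||a - b||^2) between rows of A and rows of B."""
    sq_dists = (
        np.sum(A**2, axis=1)[:, None]
        + np.sum(B**2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5, 12))     # 5 synthetic samples with 12 features

gamma_auto = 1.0 / X_demo.shape[1]    # meaning of gamma="auto" in SVC

K = rbf_kernel_manual(X_demo, X_demo, gamma_auto)
print(np.allclose(K, rbf_kernel(X_demo, X_demo, gamma=gamma_auto)))
```

Each sample has kernel value 1 with itself, and the value decays towards 0 as two samples move apart, at a rate set by `gamma`.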

In [18]:
#-------------- 
# Kernel RBF SVM 
#--------------
# Instantiate and train the machine learning model
svc_RBF = SVC(kernel = "rbf", gamma = "auto")
svc_RBF.fit(X_train_scaled, y_train)
print("svc_RBF score on training data:", round(svc_RBF.score(X_train_scaled,y_train), 4))
print("svc_RBF score on testing data:", round(svc_RBF.score(X_test_scaled,y_test), 4))

svc_RBF score on training data: 0.9191
svc_RBF score on testing data: 0.9322


In [19]:
# Perform K = 20 fold cross validation and evaluate the performance scores 
score_ACC = np.mean(cross_val_score(svc_RBF, X_train_scaled, y_train, scoring = 'accuracy', cv=20))
score_AUROC = np.mean(cross_val_score(svc_RBF, X_train_scaled, y_train, scoring = 'roc_auc', cv=20))
score_P = np.mean(cross_val_score(svc_RBF, X_train_scaled, y_train, scoring = 'precision', cv=20))
score_AP = np.mean(cross_val_score(svc_RBF, X_train_scaled, y_train, scoring = 'average_precision', cv=20))
score_F1 = np.mean(cross_val_score(svc_RBF, X_train_scaled, y_train, scoring = 'f1', cv=20))
score_RECALL = np.mean(cross_val_score(svc_RBF, X_train_scaled, y_train, scoring = 'recall', cv=20))

# Print the mean accuracy and AUROC over the 20 folds. The scores above were already
# averaged with np.mean, so they are scalars and no per-fold spread is available here.
print("svc_RBF 20-fold cross-validation on training data\nAccuracy: %.4f, AUROC Score: %.4f" % (score_ACC, score_AUROC))

# Calculate, save and print the accuracy of the model on the held-out testing data
y_pred = svc_RBF.predict(X_test_scaled)
cm = confusion_matrix(y_test, y_pred)
score_ACC_TEST = (cm[0,0] + cm[1,1]) / len(X_test_scaled)
print("svc_RBF score on testing data (no cross-validation)\nAccuracy: %.4f" % score_ACC_TEST)

svc_RBF 20-fold cross-validation on training data
Accuracy: 0.8768, AUROC Score: 0.8617
svc_RBF score on testing data (no cross-validation)
Accuracy: 0.9322


## 3.5 Adaptive Boosting (AdaBoost)
The **AdaBoost Classifier** is a meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset, with the weights of incorrectly classified instances adjusted so that subsequent classifiers focus more on difficult cases.

Boosting models such as AdaBoost combine the weak classifiers through a weighted vote, which can reduce the impact of imbalanced classes on performance, so we expect it to perform slightly better than random forest models.

**Ensemble Learning**

Boosting ensemble methods create a strong classifier from a number of weaker classifiers by first building a classification model from the training data and then creating another classification model that attempts to correct the errors of the first model, and so on.
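The re-weighting step can be illustrated with one round of the classic discrete AdaBoost update, using labels in {-1, +1}. This is a sketch of the textbook formulation and not necessarily the exact variant that `AdaBoostClassifier` runs; the function `adaboost_round` and the toy predictions are hypothetical.

```python
import numpy as np

def adaboost_round(weights, y_true, y_pred):
    """One round of discrete AdaBoost. Labels are -1/+1; requires 0 < error < 1."""
    miss = y_true != y_pred
    err = np.sum(weights[miss]) / np.sum(weights)       # weighted error of the weak learner
    alpha = 0.5 * np.log((1.0 - err) / err)             # vote strength of this learner
    new_w = weights * np.exp(-alpha * y_true * y_pred)  # raise weights of mistakes, lower the rest
    return alpha, new_w / new_w.sum()                   # renormalise to sum to 1

y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1])   # hypothetical weak learner with one mistake
w0 = np.full(5, 0.2)                    # uniform starting weights

alpha, w1 = adaboost_round(w0, y_true, y_pred)
print(alpha)
print(w1)
```

After the update, the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.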


In [20]:
#-------------- 
# AdaBoost
#--------------
# Instantiate and train the machine learning model
ada = AdaBoostClassifier()
ada.fit(X_train_scaled, y_train)
print("ada score on training data:", round(ada.score(X_train_scaled,y_train), 4))
print("ada score on testing data:", round(ada.score(X_test_scaled,y_test), 4))

ada score on training data: 1.0
ada score on testing data: 0.9153


In [21]:
# Perform K = 20 fold cross validation and evaluate the performance scores 
score_ACC = np.mean(cross_val_score(ada, X_train_scaled, y_train, scoring = 'accuracy', cv=20))
score_AUROC = np.mean(cross_val_score(ada, X_train_scaled, y_train, scoring = 'roc_auc', cv=20))
score_P = np.mean(cross_val_score(ada, X_train_scaled, y_train, scoring = 'precision', cv=20))
score_AP = np.mean(cross_val_score(ada, X_train_scaled, y_train, scoring = 'average_precision', cv=20))
score_F1 = np.mean(cross_val_score(ada, X_train_scaled, y_train, scoring = 'f1', cv=20))
score_RECALL = np.mean(cross_val_score(ada, X_train_scaled, y_train, scoring = 'recall', cv=20))

# Print the mean accuracy and AUROC over the 20 folds. The scores above were already
# averaged with np.mean, so they are scalars and no per-fold spread is available here.
print("ada 20-fold cross-validation on training data\nAccuracy: %.4f, AUROC Score: %.4f" % (score_ACC, score_AUROC))

# Calculate, save and print the accuracy of the model on the held-out testing data
y_pred = ada.predict(X_test_scaled)
cm = confusion_matrix(y_test, y_pred)
score_ACC_TEST = (cm[0,0] + cm[1,1]) / len(X_test_scaled)
print("ada score on testing data (no cross-validation)\nAccuracy: %.4f" % score_ACC_TEST)

ada 20-fold cross-validation on training data
Accuracy: 0.7887, AUROC Score: 0.8417
ada score on testing data (no cross-validation)
Accuracy: 0.9153


## 3.6 Random Forest
**The Random Forest Classifier** is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. 
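The averaging step can be checked directly: in scikit-learn, the forest's predicted class probabilities are the mean of the probabilities predicted by its individual trees. The sketch below verifies this on synthetic data; `X_demo`, `y_demo` and `rf_demo` are stand-ins introduced for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption for this sketch).
X_demo, y_demo = make_classification(n_samples=150, n_features=12, random_state=0)

rf_demo = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Class probabilities from every individual tree: shape (n_trees, n_samples, n_classes).
tree_probs = np.stack([tree.predict_proba(X_demo) for tree in rf_demo.estimators_])

# Averaging over the trees reproduces the forest's own probabilities.
averaged = tree_probs.mean(axis=0)
print(np.allclose(averaged, rf_demo.predict_proba(X_demo)))
```

Each tree is trained on a bootstrap sample and can overfit on its own; averaging many such trees reduces the variance of the combined prediction.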

In [22]:
#-------------- 
# Random Forest
#--------------
# Instantiate and train the machine learning model
rf = RandomForestClassifier(n_estimators=100)
rf.fit(X_train_scaled, y_train)
print("rf score on training data:", round(rf.score(X_train_scaled,y_train), 4))
print("rf score on testing data:", round(rf.score(X_test_scaled,y_test), 4))

rf score on training data: 1.0
rf score on testing data: 0.9322


In [23]:
# Perform K = 20 fold cross validation and evaluate the performance scores 
score_ACC = np.mean(cross_val_score(rf, X_train_scaled, y_train, scoring = 'accuracy', cv=20))
score_AUROC = np.mean(cross_val_score(rf, X_train_scaled, y_train, scoring = 'roc_auc', cv=20))
score_P = np.mean(cross_val_score(rf, X_train_scaled, y_train, scoring = 'precision', cv=20))
score_AP = np.mean(cross_val_score(rf, X_train_scaled, y_train, scoring = 'average_precision', cv=20))
score_F1 = np.mean(cross_val_score(rf, X_train_scaled, y_train, scoring = 'f1', cv=20))
score_RECALL = np.mean(cross_val_score(rf, X_train_scaled, y_train, scoring = 'recall', cv=20))

# Print the mean accuracy and AUROC over the 20 folds. The scores above were already
# averaged with np.mean, so they are scalars and no per-fold spread is available here.
print("rf 20-fold cross-validation on training data\nAccuracy: %.4f, AUROC Score: %.4f" % (score_ACC, score_AUROC))

# Calculate, save and print the accuracy of the model on the held-out testing data
y_pred = rf.predict(X_test_scaled)
cm = confusion_matrix(y_test, y_pred)
score_ACC_TEST = (cm[0,0] + cm[1,1]) / len(X_test_scaled)
print("rf score on testing data (no cross-validation)\nAccuracy: %.4f" % score_ACC_TEST)

rf 20-fold cross-validation on training data
Accuracy: 0.8470, AUROC Score: 0.8875
rf score on testing data (no cross-validation)
Accuracy: 0.9322


# 4. Deep Learning Models

## 4.1 Neural Network

In [24]:
import keras
from keras.models import Sequential
from keras.layers import Dense,Dropout
model = Sequential()
model.add(Dense(units=64, activation='sigmoid', input_dim=num_feature))
model.add(Dense(units=32, activation='sigmoid'))
model.add(Dropout(rate=0.2))
model.add(Dense(units=1, activation='sigmoid'))
opti=keras.optimizers.Adam(lr=0.01, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=False)
# opti=keras.optimizers.RMSprop(lr=0.01, rho=0.9, epsilon=None, decay=0.0)
# opti=keras.optimizers.Adagrad(lr=0.01, epsilon=None, decay=0.0)
model.compile(optimizer=opti, loss='binary_crossentropy', metrics=['accuracy'])

"""
model.fit(X_train_scaled, y_train, epochs=75, batch_size=50, validation_data=(X_test_scaled,y_test))
score=model.evaluate(X_test_scaled,y_test)
score
"""

Using TensorFlow backend.

'\nmodel.fit(X_train_scaled, y_train, epochs=75, batch_size=50, validation_data=(X_test_scaled,y_test))\nscore=model.evaluate(X_test_scaled,y_test)\nscore\n'
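The size of this network can be worked out by hand, which helps when judging it against the 136 training samples. The sketch below counts the trainable parameters of the three `Dense` layers, assuming the input dimension is `num_feature = 12` as printed by the PCA step; `dense_params` is a helper introduced for this illustration.

```python
def dense_params(n_in, n_out):
    """Trainable parameters of a fully connected layer: weights plus biases."""
    return n_in * n_out + n_out

# Layer widths: 12 PCA inputs (assumed), then 64, 32 and 1 units. Dropout adds no parameters.
layer_sizes = [12, 64, 32, 1]

counts = [dense_params(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
total_params = sum(counts)

print(counts, total_params)  # → [832, 2080, 33] 2945
```

With roughly 2,900 parameters and 136 training samples, the model has far more free parameters than examples, so regularisation such as the dropout layer and a held-out validation set matter when training it.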

# 5. Evaluation and Results

#### **1. Linear Logistic Regression**
#### **2. K-Nearest Neighbours**
#### **3. Linear SVM**
#### **4. Kernel RBF SVM**
#### **5. AdaBoost**
#### **6. Random Forest**
#### **7. Neural Networks**
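As a starting point for the comparison, the scores printed in sections 3.1 to 3.6 can be collected into one table. The sketch below only copies the numbers reported above (20-fold cross-validation means on the training data and accuracy on the held-out test data); the neural network is omitted because no score was recorded for it, and the column names are illustrative.

```python
import pandas as pd

# Scores copied from the outputs in sections 3.1 to 3.6.
reported = pd.DataFrame(
    [
        ("Linear Logistic Regression", 0.8458, 0.8642, 0.8814),
        ("K-Nearest Neighbours",       0.8449, 0.8879, 0.9153),
        ("Linear SVM",                 0.8399, 0.8517, 0.8475),
        ("Kernel RBF SVM",             0.8768, 0.8617, 0.9322),
        ("AdaBoost",                   0.7887, 0.8417, 0.9153),
        ("Random Forest",              0.8470, 0.8875, 0.9322),
    ],
    columns=["model", "cv_accuracy", "cv_auroc", "test_accuracy"],
)

# Rank the models by cross-validated accuracy, best first.
ranked = reported.sort_values("cv_accuracy", ascending=False).reset_index(drop=True)
print(ranked.to_string(index=False))
```

Ranking on the cross-validated columns rather than on test accuracy keeps the test set out of model selection; with only 59 test samples, a single sample changes test accuracy by about 1.7 percentage points.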