# Machine Learning Part

You will explore how best to apply machine learning algorithms, for example Neural Network, Boosted Decision Tree (BDT), Support Vector Machine(SVM) to solve a High Energy Data analysis issue, more specifically,  separating the signal events from the background events.

A set of input samples (simulated with Delphes) is provided in NumPy NPZ format [Download Input](https://drive.google.com/open?id=1r_MZB_crfpij6r3SxPDeU_3JD6t6AxAj). In the input file, there are only 100 samples for training and 100 samples for testing so it won’t take much computing resources to accomplish this task. The signal events are labeled with 1 while the background sample are labeled with 0.

You can apply one machine learning algorithm to this input but be sure to show that you understand how to fine tune your machine learning model to improve the performance. The performance can be evaluated with classification accuracy or Area Under ROC Curve (AUC).


##Loading and Processing Data

In [0]:
## Mount Drive into Colab
from google.colab import drive
drive.mount('/content/drive')
# !cd drive/My\ Drive/ 
!ls

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at /content/drive
drive  sample_data


In [0]:
import numpy as np
import pandas as pd

file = np.load('drive/My Drive/QIS_EXAM_200Events.npz', allow_pickle=True)

# temp = file['training_input']
# print(temp.item())

train_dataset = file['training_input'].item()
test_dataset = file['test_input'].item()

In [0]:

df_train_0 = pd.DataFrame(train_dataset['0'])
df_test_0  = pd.DataFrame(test_dataset['0'])

df_train_1 = pd.DataFrame(train_dataset['1'])
df_test_1  = pd.DataFrame(test_dataset['1'])

df_train = df_train_0.append(df_train_1)
df_test  = df_test_0.append(df_test_1)

df_train_0 -= df_train.mean()
df_train_1 -= df_train.mean()

df_test_0 -= df_test.mean()
df_test_1 -= df_test.mean()

In [0]:
from sklearn.utils import shuffle

X_train = np.append(df_train_0.values, df_train_1.values, axis = 0)
y_train = np.append(np.zeros(shape = (1,50)), np.ones(shape = (1,50)))

X_test = np.append(df_test_0.values, df_test_1.values, axis = 0)
y_test = np.copy(y_train)

X_train, y_train = shuffle(X_train, y_train)
X_test,  y_test  = shuffle(X_test,  y_test)

##ML Algorithms
###1. Logistic Regression

In [0]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

#Quoting from sklearn documentations "For small datasets, ‘liblinear’ is a good choice"
classifier = LogisticRegression(solver='liblinear', class_weight='balanced')
logistic_regression = classifier.fit(X_train, y_train)

y_pred = logistic_regression.predict(X_test)
# print(y_pred)

auc = metrics.roc_auc_score(y_test, y_pred)
print('AUC: %.5f' % auc)

[0. 0. 0. 1. 1. 0. 0. 0. 1. 1. 0. 0. 1. 1. 1. 0. 0. 0. 1. 0. 1. 0. 0. 1.
 0. 0. 1. 1. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0. 0. 1. 0.
 1. 1. 0. 0. 0. 1. 1. 1. 0. 1. 1. 0. 0. 1. 0. 1. 0. 0. 0. 1. 0. 1. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 1. 0. 1. 1. 1. 0. 0. 0. 1. 0. 1. 1. 0. 1.
 0. 0. 0. 1.]
AUC: 0.72000


###2. Decision Tree Classifier

In [0]:
from sklearn.tree import DecisionTreeClassifier

#Any value of max_depth other than 2 or 3 significantly decreases the accuracy.
#Entropy seems to perform better almost always than the gini criteria
decision_tree =  DecisionTreeClassifier(criterion = "entropy", random_state = 32, max_depth = 3, min_samples_split = 3)

decision_tree.fit(X_train,y_train)
y_pred_dt = decision_tree.predict(X_test)

# print(y_pred_dt)
auc = metrics.roc_auc_score(y_test, y_pred_dt)
print('AUC: %.5f' % auc)

AUC: 0.72000


###3. Random Forest Classifier

In [0]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

#Any value of max_depth other than 2 or 3 significantly decreases the accuracy.
#Entropy seems to perform better almost always than the gini criteria
#The performance of the random forest increases with no of estimators(#trees) and peaks at around 100 trees
random_forest_classifier = RandomForestClassifier(criterion = "entropy", max_depth = 2, random_state = 32, n_estimators = 100, min_samples_split = 3)

scores = cross_val_score(random_forest_classifier, X_train, y_train, cv=10)
print("Cross Validation Average Accuracy: {:.5f}".format(scores.mean()))

random_forest_classifier.fit(X_train, y_train)
score = random_forest_classifier.score(X_test, y_test)

print("Random Forest Test Set Accuracy: {:.5f}".format(score))

Cross Validation Average Accuracy: 0.80000
Random Forest Test Set Accuracy: 0.72000


###4. Support Vector Machine

In [0]:
from sklearn import svm
from sklearn.metrics import classification_report

#Default rbf and poly kernels work a bit better than linear kernels
svm_classifier = svm.SVC(kernel = 'poly', break_ties = 'true', random_state = 32)
svm_classifier.fit(X_train, y_train)

print("SVM Test Set Accuracy: {:.5f}".format(svm_classifier.score(X_test, y_test)))

y_pred = svm_classifier.predict(X_test)
print("SVM Prediction Report: \n {}".format(classification_report(y_test, y_pred)))

#Value of C in 1- to 1e1 has the same accuracy
svm_classifier = svm.SVC(C = 1e1, kernel = 'poly', break_ties = 'true', random_state = 32)
svm_classifier.fit(X_train, y_train)

print("SVM Test Set Accuracy: {:.5f}".format(svm_classifier.score(X_test, y_test)))

SVM Test Set Accuracy: 0.72000
SVM Prediction Report: 
               precision    recall  f1-score   support

         0.0       0.68      0.82      0.75        50
         1.0       0.78      0.62      0.69        50

    accuracy                           0.72       100
   macro avg       0.73      0.72      0.72       100
weighted avg       0.73      0.72      0.72       100

SVM Test Set Accuracy: 0.72000


###5. K Means Clustering

In [0]:
from sklearn.cluster import KMeans

#Any other cluster amount reduces accuracy
#Also note, #clusters should be clearly <= 4
num_clusters = 3

kmeans = KMeans(n_clusters = num_clusters, random_state = 32)
clusters_train = kmeans.fit_predict(X_train)
clusters_test = kmeans.predict(X_test)

score_all = 0
for i in range(0, num_clusters):
    svm_classifier = svm.SVC(kernel = 'poly', degree = 3, break_ties = 'true', random_state = 32)
    svm_classifier.fit(X_train[clusters_train == i], y_train[clusters_train == i])

    score = svm_classifier.score(X_test[clusters_test == i], y_test[clusters_test == i])
    print("SVM accuracy for class {}: {:.5f}".format(i, score))
    
    score_all += score

print("K means classification accuracy: {:.5f}".format(score_all / num_clusters))

SVM accuracy for class 0: 0.75000
SVM accuracy for class 1: 0.66038
SVM accuracy for class 2: 0.81481
K means classification accuracy: 0.74173


###6. Naive Bayes Classifier

In [0]:
from sklearn.naive_bayes import GaussianNB

naive_bayes_classifier = GaussianNB()
naive_bayes_classifier.fit(X_train, y_train)

y_pred =  naive_bayes_classifier.predict(X_test)

auc = metrics.roc_auc_score(y_test, y_pred)
print('AUC: %.5f' % auc)

AUC: 0.71000


###7. Boosted Random Forest

In [0]:
from sklearn.ensemble import AdaBoostClassifier

#The Cross validation accuracy seems to decrease with increase in number of estimators. Huh!?
boosted_random_forest = AdaBoostClassifier(random_forest_classifier, n_estimators = 5, random_state = 32)
scores = cross_val_score(boosted_random_forest, X_train, y_train, cv=10)
print("Cross Validation Average Accuracy: {:.5f}".format(scores.mean()))

boosted_random_forest.fit(X_train, y_train)
score = boosted_random_forest.score(X_test, y_test)
print("Boosted Random Forest Test Set Accuracy: {:.5f}".format(score))

Cross Validation Average Accuracy: 0.78000
Boosted Random Forest Test Set Accuracy: 0.61000


###8. Neural Network

In [0]:
import tensorflow as tf
from tensorflow import keras

# Stochastic hyperparameter search
count = 10
max_score = 0

while (count > 0):
  count = count - 1
  
  model = tf.keras.models.Sequential([
    keras.layers.Dense(64, input_dim = 5, activation = 'relu', kernel_initializer='random_uniform', kernel_regularizer = keras.regularizers.l2(0.01)),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(16,  activation = 'relu', kernel_initializer='random_uniform', kernel_regularizer = keras.regularizers.l2(0.01)),
    keras.layers.Dropout(.2),
    keras.layers.Dense(4,  activation = 'relu', kernel_initializer='random_uniform', kernel_regularizer = keras.regularizers.l2(0.01)),
    keras.layers.Dropout(.2),
    keras.layers.Dense(1, activation='sigmoid')
  ])

  model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

  model.fit(X_train, y_train, epochs=20, batch_size = 20, verbose=0)

  test_loss, score = model.evaluate(X_test, y_test)
  print('Test accuracy:', score)
  
  if score > max_score:
    max_score = score
    model.save('my_model')



loaded_model = keras.models.load_model('my_model')
accuracy = loaded_model.evaluate(X_test, y_test, verbose=0)[1]
print("Neural Network Test Data Accuracy: {}".format(accuracy))

Test accuracy: 0.5
Test accuracy: 0.67
Test accuracy: 0.71
Test accuracy: 0.71
Test accuracy: 0.73
Test accuracy: 0.73
Test accuracy: 0.65
Test accuracy: 0.69
Test accuracy: 0.68
Test accuracy: 0.7
Neural Network Test Data Accuracy: 0.7300000190734863



#Analysis and Conclusions.

I have compared various Machine Learning classification Algorithms using AUC as a metric.

*   I have mean centered the data (normalisation), however it cannot be attributed to impacting the AUC value much (See last bullet).

*  The final scores for the various Machine Learning Algorithms implemented are as follows:-

  1.   Logistic Regression : 0.72
  2.   Decision Tree : 0.72
  3.   Random Forest : 0.720
  4.   SVM Classifier : 0.720
  5.   K Means + SVM Classifier : 0.74173
  6.   Naive Bayes' Classifier : 0.71
  7.   AdaBoost - 0.610
  8.   Neural Network : 0.7300




*   The best results were obtained by K Means + SVM Algorithm

*   Since machine learning algorithms are data intensive; Convergence cannot be guaranteed without sufficient data along with a significant variation in accuracy. This is also evident in our observations above