# Algorithms comparison - Unsupervised
In this notebook I am going to use the pre-processed data to train unsupervised algorithms. I will compare its performance using some numerical metrics.

In [1]:
# Importing main libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

## Data
I will import the data and separate it for training and testing.

In [2]:
# Importing the data 
from benchtools.src.datatools import read_multifiles

df = read_multifiles(filename='RD_dataset', nbatch=10)
df.head()

Unnamed: 0,pT_j1,m_j1,eta_j1,phi_j1,E_j1,tau_21_j1,nhadrons_j1,pT_j2,m_j2,eta_j2,phi_j2,E_j2,tau_21_j2,nhadrons_j2,m_jj,deltaR_j12,n_hadrons,label
0,1286.727685,106.912129,0.185508,-2.763676,1313.290435,0.624659,36,1283.220733,63.164215,0.064989,0.393688,1287.481934,0.713248,33,2580.489568,3.159663,109.0,0.0
1,1354.39407,614.269108,0.826505,1.365524,1943.559886,0.311688,84,1325.613761,439.06415,-0.874319,-1.786248,1916.370744,0.276881,97,3859.315047,3.581406,208.0,0.0
2,1214.955723,645.865619,-0.196786,2.040545,1396.840654,0.238205,119,1072.462085,113.76884,0.143831,-1.09033,1089.53063,0.726963,59,2480.769725,3.149348,196.0,0.0
3,1285.227873,516.835248,0.328693,2.975321,1450.485926,0.013429,65,1220.251279,174.796077,0.294854,-0.322661,1285.618789,0.706361,89,2609.893413,3.298155,183.0,0.0
4,1210.415787,129.499352,-0.744836,-2.883347,1567.3453,0.42355,54,1091.785816,155.362262,1.060534,0.264977,1772.340209,0.787662,57,3313.488835,3.629229,169.0,1.0


In [3]:
from sklearn.preprocessing import MinMaxScaler
from benchtools.src.datatools import separate_data
from sklearn.model_selection import train_test_split

# Separating variables from label
X, y = separate_data(df, standardize=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)

# Eliminating the columns of mass so that the training is model-free 
X_train_nm = X_train.drop(['m_j1', 'm_j2', 'm_jj'], axis=1)
X_test_nm = X_test.drop(['m_j1', 'm_j2', 'm_jj'], axis=1)
X_nm = X.drop(['m_j1', 'm_j2', 'm_jj'], axis=1)

## Classification
I will use three different classificators and will calculate the following metrics for each:
- Precision
- Recall
- F1 score
- Rand index

According to [sklearn site](https://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation):

> Evaluating the performance of a clustering algorithm is not as trivial as counting the number of errors or the precision and recall of a supervised classification algorithm. In particular any evaluation metric should not take the absolute values of the cluster labels into account but rather if this clustering define separations of the data similar to some ground truth set of classes or satisfying some assumption such that members belong to the same class are more similar than members of different classes according to some similarity metric.

The rand index is a measure of the similarity between two data clusterings. We will use it to address this. However, we will calculate the other variables because we want to compare these algorithms with the supervised ones.

In [4]:
# Importing the metrics
from sklearn.metrics import precision_score, log_loss, recall_score, plot_confusion_matrix,classification_report, f1_score, silhouette_score, rand_score

# I will define a function that calculates all the metrics and returns a dataframe with them 
def metrics(log, log_cols, name, y_test, y_pred, X_pca_train):
    print("="*30)
    print(name)

    print('****Results****')

    # Calculating metrics
    precision = precision_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    rand = rand_score(y_test, y_pred)

    # Prin the report
    print(classification_report(y_test, y_pred, target_names=['background','signal']))

    # Inserting the data in the dataframe
    log_entry = pd.DataFrame([[name, recall, precision*100, f1, rand]], columns=log_cols)
    return log_entry

In [5]:
# Importing the algorithms
from sklearn.cluster import KMeans, Birch, OPTICS
from sklearn.mixture import GaussianMixture

# Record for visual comparison 
log_cols=["Classifier", "Precision", "Recall", "F1 score", "Rand score"]
log = pd.DataFrame(columns=log_cols)
classifiers = [KMeans(n_clusters=2, random_state=19)
              , Birch(n_clusters=2)
              , GaussianMixture(n_components=2, random_state=0)
              ]

for clf in classifiers:
    try: del y_pred
    except Exception:
        pass
    # Getting the name
    name = clf.__class__.__name__
    # Training
    clf.fit(X_train_nm)
    # Obtaining predictions
    y_pred = clf.predict(X_test_nm)
    # Obtaining metrics
    log_entry = metrics(log, log_cols, name, y_test, y_pred, X_train_nm)
    log = log.append(log_entry)

KMeans
****Results****
              precision    recall  f1-score   support

  background       0.91      0.50      0.64    139578
      signal       0.09      0.50      0.15     14022

    accuracy                           0.50    153600
   macro avg       0.50      0.50      0.40    153600
weighted avg       0.83      0.50      0.60    153600

Birch
****Results****
              precision    recall  f1-score   support

  background       0.91      0.50      0.64    139578
      signal       0.09      0.50      0.15     14022

    accuracy                           0.50    153600
   macro avg       0.50      0.50      0.40    153600
weighted avg       0.83      0.50      0.60    153600

GaussianMixture
****Results****
              precision    recall  f1-score   support

  background       0.91      0.50      0.64    139578
      signal       0.09      0.50      0.15     14022

    accuracy                           0.50    153600
   macro avg       0.50      0.50      0.40    1536

## Comparison
Here I'll plot the values to compare the algorithms

In [6]:
import numpy as np
import pandas as pd
import seaborn as sns

# Defining a function for ploting the bar plots
def comparative_barplot(log, x, color):
    sns.set_color_codes("muted")
    sns.barplot(x=x, y='Classifier', data=log, color=color)
    plt.xlabel('{}'.format(x))
    plt.title('Classifiers: {}'.format(x))
    plt.show()

In [7]:
colors = ['b', 'y', 'r', 'g', 'o']
columns = log.columns.tolist()
columns.remove('Classifier')
log.head()
#for column,color in zip(columns, colors):
#    comparative_barplot(log, column, color)

Unnamed: 0,Classifier,Precision,Recall,F1 score,Rand score
0,KMeans,0.495222,9.020408,0.15261,0.500005
0,Birch,0.495222,9.020408,0.15261,0.500005
0,GaussianMixture,0.495222,9.020408,0.15261,0.500005


In general, the algorithms did not performed well. The precision, f1 scores and rand index are low. Also, we see that we can't select any of them as better seems all of them seem to have about the same metrics. Checking the classification report above, it can be seen that the classification is indeed really similar.

## Reduction of dimensions
I'll try reducing the dimension of the input to see if the algorithms improve the classification. Principal component analysis, or PCA, is commonly use for this

In [8]:
from sklearn.decomposition import PCA
# Reducing to 5 inputs
pca = PCA(n_components=5)
pca.fit(X_train_nm)

# Transforming the data to this inputs
X_pca_train = pca.transform(X_train_nm)
X_pca_train = pd.DataFrame(X_pca_train)

X_pca_test = pca.transform(X_test_nm)
X_pca_test = pd.DataFrame(X_pca_test)

In [9]:
# Record for visual comparison 
log_cols=["Classifier", "Precision", "Recall", "F1 score", "Rand score"]
log = pd.DataFrame(columns=log_cols)
classifiers = [KMeans(n_clusters=2, random_state=15)
              , Birch(n_clusters=2)
              , GaussianMixture(n_components=2, random_state=0)
              ]

for clf in classifiers:
    # Getting the name
    name = clf.__class__.__name__
    # Training
    clf.fit(X_pca_train)
    # Obtaining predictions
    y_pred = clf.predict(X_pca_test)
    # Obtaining metrics
    log_entry = metrics(log, log_cols, name, y_test, y_pred, X_pca_train)
    log = log.append(log_entry)

KMeans
****Results****
              precision    recall  f1-score   support

  background       0.91      0.50      0.64    139578
      signal       0.09      0.50      0.15     14022

    accuracy                           0.50    153600
   macro avg       0.50      0.50      0.40    153600
weighted avg       0.83      0.50      0.60    153600

Birch
****Results****
              precision    recall  f1-score   support

  background       0.91      0.50      0.65    139578
      signal       0.09      0.50      0.16     14022

    accuracy                           0.50    153600
   macro avg       0.50      0.50      0.40    153600
weighted avg       0.84      0.50      0.60    153600

GaussianMixture
****Results****
              precision    recall  f1-score   support

  background       0.91      0.50      0.65    139578
      signal       0.09      0.50      0.16     14022

    accuracy                           0.50    153600
   macro avg       0.50      0.50      0.40    1536

In [10]:
colors = ['b', 'y', 'r', 'g', 'o']
columns = log.columns.tolist()
columns.remove('Classifier')

log.head()
#for column,color in zip(columns, colors):
#    comparative_barplot(log, column, color)

Unnamed: 0,Classifier,Precision,Recall,F1 score,Rand score
0,KMeans,0.495222,9.020408,0.15261,0.500005
0,Birch,0.504778,9.237917,0.156177,0.500005
0,GaussianMixture,0.504778,9.237917,0.156177,0.500005


Although the classification did not improve, we can see a difference particularly in KMeans. Surprisingly, the precision, recall and f1 score decreased for this classificator. Further exploration has to be done to see how are these classifiers classifying.