# Higgs Machine Learning Challenge Example

This is an example solution to the Kaggle Higgs Machine Learning Challenge (https://www.kaggle.com/c/higgs-boson).

This script should serve as a starting point for learning how to get the data into a format suitable for training a model. It uses the popular Python packages NumPy and pandas, which should make your life easier. It also shows basic plotting with matplotlib and a machine learning model built with scikit-learn.

https://numpy.org/

https://pandas.pydata.org/

https://scikit-learn.org/stable/

First we will import the modules we are going to use, including code from the files Plotting and Tools found in the same directory as this file.

In [None]:
import pandas as pd
import numpy as np
%matplotlib notebook
import matplotlib.pyplot as plt
from sklearn.ensemble import AdaBoostClassifier
from sklearn import tree
from sklearn.utils import shuffle
from sklearn.metrics import accuracy_score
import Plotting
import Tools
from joblib import dump, load

We will define a few functions to help us out later on.

In [None]:
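def drop_neg(value):
    # True where the value equals -999.0
    return value == -999.0

def get_class(value,cls):
    # True where the label differs from cls, i.e. the rows to drop when selecting cls
    return value != cls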
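
The helper drop_neg is defined above but is not called anywhere else in this example. The following is a minimal sketch of one way it could be used, assuming that -999.0 is the placeholder the dataset uses for values that could not be computed. The toy DataFrame and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

def drop_neg(value):
    return value == -999.0

# Toy stand-in for a slice of train.csv (values are invented)
toy = pd.DataFrame({
    "DER_mass_MMC": [125.0, -999.0, 91.2],
    "DER_pt_h": [30.0, 45.0, -999.0],
})

# Apply the helper column by column to get a boolean mask of placeholder entries
missing = toy.apply(drop_neg)

# Option 1: keep only the rows where no feature is a placeholder
complete = toy[~missing.any(axis=1)]

# Option 2: keep every row but turn the placeholders into NaN
with_nan = toy.mask(missing, np.nan)
```

Whether to drop, replace, or keep the placeholder values is a modelling choice; tree-based models can often cope with them as they are.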

Below is a function for training the model. 

First the data is read in using pandas' built-in CSV reader. A filter is applied to select only certain features from the data, and then the data is sorted into signal and background classes based on the values in the Label column. You should repurpose this code to help you wrangle the data into your own models.

This particular model is a DecisionTreeClassifier from scikit-learn boosted by an algorithm called AdaBoost (https://en.wikipedia.org/wiki/AdaBoost). We set some of the parameters for the classifier and the boosting algorithm, but to see the full suite of options you should check the scikit-learn documentation.

We also call a function from the Tools file that will train the model. You can adapt this function for your own use.

Next, a function from Plotting is called, which uses matplotlib (https://matplotlib.org/) to plot the model's output score for the background and signal events separately. Later on, when we call the train function, we will look at this plot in more detail.

Finally a call to the function dump is made. This function is from the library joblib (https://joblib.readthedocs.io/en/latest/) and will save our model to the file "example_sol.joblib".

In [None]:
def train():
    df = pd.read_csv("train.csv")

    df = df.filter(regex='DER_mass_MMC|DER_mass_transverse_met_lep|DER_pt_h|DER_deltar_tau_lep|DER_mass_vis|Label|Weight')

    # get_class flags rows whose Label differs from cls; dropping them leaves a single class
    sig = df.drop( df[ np.vectorize(get_class,excluded=['cls'])(value=df.Label,cls="s") ].index ).drop('Label',axis=1)
    bkg = df.drop( df[ np.vectorize(get_class,excluded=['cls'])(value=df.Label,cls="b") ].index ).drop('Label',axis=1)

    sig_weights = sig.Weight.values
    bkg_weights = bkg.Weight.values

    sig = sig.drop('Weight',axis=1)
    bkg = bkg.drop('Weight',axis=1)

    clf = AdaBoostClassifier(tree.DecisionTreeClassifier(max_depth=4),
                             algorithm="SAMME",
                             n_estimators=200)


    Tools.train_mva(clf,sig,bkg,sig_weights,bkg_weights)

    Plotting.plot_output(clf,sig,bkg)

    dump(clf, "example_sol.joblib")
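
The contents of Tools.train_mva are not shown in this notebook. As a rough sketch of what such a helper might do, the code below assumes it labels signal events as 1 and background events as 0, stacks the two samples, and passes the event weights to the classifier's fit method. The name train_mva_sketch and the toy data are hypothetical and are not the actual Tools code.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

def train_mva_sketch(clf, sig, bkg, sig_weights, bkg_weights):
    # Stack signal and background into one feature matrix
    X = np.concatenate([sig.values, bkg.values])
    # Assumed label convention: 1 for signal, 0 for background
    y = np.concatenate([np.ones(len(sig)), np.zeros(len(bkg))])
    # Per-event weights, in the same order as the rows of X
    w = np.concatenate([sig_weights, bkg_weights])
    clf.fit(X, y, sample_weight=w)
    return clf

# Toy data: two Gaussian blobs standing in for signal and background
rng = np.random.default_rng(0)
sig = pd.DataFrame(rng.normal(1.0, 1.0, size=(200, 2)), columns=["a", "b"])
bkg = pd.DataFrame(rng.normal(-1.0, 1.0, size=(200, 2)), columns=["a", "b"])

clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2),
                         n_estimators=20, random_state=0)
train_mva_sketch(clf, sig, bkg, np.ones(len(sig)), np.ones(len(bkg)))

# Accuracy on the training sample itself, only as a sanity check
X_all = np.concatenate([sig.values, bkg.values])
y_all = np.concatenate([np.ones(len(sig)), np.zeros(len(bkg))])
train_accuracy = clf.score(X_all, y_all)
```

If you write your own training helper, it is worth holding back part of train.csv so that you can measure performance on events the model has not seen.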

Below is a function that tests our model.

First we read in the validation data and filter the same columns as before, except this time we don't have labels to load.

Then we use joblib's load function to load the model we saved earlier. Being able to save a trained model is very useful: you can reuse it later without keeping it in memory or retraining it.

The loaded classifier is then used to predict the labels of the validation sample. Those labels are stored in the Label column, and the value of the decision function for each event is stored in the DF column.

We then sort the events by the value of the decision function, highest first.

Lastly, a simple for loop demonstrates how to present your model's predictions so that we can properly evaluate how well your model performs in the challenge using the AMS score. Please ensure that whatever code you write outputs its submission files to the same specification as this for loop.

In [None]:
def test():
    val = pd.read_csv("validation.csv")
    val = val.filter(regex='EventId|DER_mass_MMC|DER_mass_transverse_met_lep|DER_pt_h|DER_deltar_tau_lep|DER_mass_vis')

    clf = load("example_sol.joblib")

    eva = val.drop('EventId',axis=1)

    val['Label'] = clf.predict(eva.values)
    val['DF'] = clf.decision_function(eva.values)

    val = val.sort_values(by='DF',ascending=False).reset_index(drop=True)

    with open('validation_submission.csv','w') as f:
        f.write('EventId,RankOrder,Class\n')
        for index, row in val.iterrows():
            cls = 's' if row.Label == 1.0 else 'b'
            f.write(str(int(row.EventId)))
            f.write(',')
            f.write(str(index+1))
            f.write(',')
            f.write(cls)
            f.write('\n')
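
The same submission layout can also be produced without an explicit loop. The sketch below is one possible alternative, assuming the label convention used above (1.0 for signal); the toy scores and the names scored, submission, and csv_text are invented for illustration. It writes to an in-memory buffer so that it does not overwrite any real file.

```python
import io
import numpy as np
import pandas as pd

# Toy stand-in for the scored validation sample (values are invented)
scored = pd.DataFrame({
    "EventId": [100002, 100000, 100001],
    "Label": [0.0, 1.0, 1.0],
    "DF": [-0.3, 0.8, 0.1],
})

# Highest decision-function value first, as in test()
scored = scored.sort_values(by="DF", ascending=False).reset_index(drop=True)

submission = pd.DataFrame({
    "EventId": scored["EventId"].astype(int),
    "RankOrder": np.arange(1, len(scored) + 1),   # 1-based position after sorting
    "Class": np.where(scored["Label"] == 1.0, "s", "b"),
})

# Write to a buffer here; pass a filename instead to create the real file
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```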

Let's run the train function and see what happens...

In [None]:
train()

... that probably took a little while, but hopefully not too long. Part of the challenge of building a machine learning model is designing something that will give you results in a suitable timeframe. Solutions for this challenge must be submitted by 15:30, so bear this in mind when deciding how complex to make your model.

And look! Our plot appeared (hopefully!)

What can you learn from this plot? How are the signal and background events distributed as a function of the model score?

What would the output of a better model look like? And a worse one?

Next, let's run the test function and produce the submission file.

In [None]:
test()

You should be able to see the submission file in the same directory as this file. Take a look inside and make sure you understand the submission format so that you can correctly submit your solution later on!
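
Submissions are evaluated with the AMS score mentioned earlier. If you hold back some labelled events from train.csv, you can estimate it yourself. The sketch below assumes the usual definition from the Kaggle challenge, AMS = sqrt(2 * ((s + b + b_reg) * ln(1 + s / (b + b_reg)) - s)) with b_reg = 10, where s and b are the summed weights of the true signal and true background events that the model classifies as signal. The function names and the toy arrays are illustrative only.

```python
import math
import numpy as np

def ams(s, b, b_reg=10.0):
    # s: weighted count of true signal events selected as signal
    # b: weighted count of true background events selected as signal
    return math.sqrt(2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s))

def ams_from_predictions(y_true, y_pred, weights):
    # Assumed label convention: 1 for signal, 0 for background
    selected = y_pred == 1
    s = weights[selected & (y_true == 1)].sum()
    b = weights[selected & (y_true == 0)].sum()
    return ams(s, b)

# Toy example: four events with invented labels, predictions, and weights
y_true = np.array([1, 1, 0, 0])
y_pred = np.array([1, 0, 1, 0])
weights = np.array([2.0, 3.0, 4.0, 5.0])
toy_score = ams_from_predictions(y_true, y_pred, weights)
```

A higher AMS is better: it rewards selecting more signal and penalises background that is wrongly selected as signal.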

Now it's time for you to start building your own model. Feel free to use any code from this example to help you get started, but also check the materials provided for more ideas.

Good luck! •ᴗ•