# BIP Framework training example 

In [1]:
using Pkg
Pkg.activate("..")
using BIPs

[32m[1m  Activating[22m[39m project at `/z1-josemm/josemm/MyDocs/personal_BIP/BIPs.jl`


In [2]:
using Statistics
using Pkg.Artifacts

Let's begin by bringing in the dataset. It contains three splits:
* **train**: the training set with about 1.2M jets
* **validation**: the validation set with about 400k jets

Later we will use the third split, the **test** set with another 400k jets, to report the results.


In [3]:
dataset_path = "../../../DataLake/raw"
# dataset_path = "/Users/ortner/datasets/toptagging"

train_data_path = dataset_path*"/train.h5"
val_data_path = dataset_path*"/val.h5"

"../../../DataLake/raw/val.h5"

### Reading the data

To read the datasets, we call the `read_data` function with the `"TQ"` tag, which selects the TopQuark format:

In [4]:
train_jets, train_labels = BIPs.read_data("TQ", train_data_path)
train_labels = [b == 1.0 for b in train_labels]  # the comparison already yields a Bool
print("Number of entries in the training data: ", length(train_jets))

Number of entries in the training data: 1210997

In [5]:
val_jets, val_labels = BIPs.read_data("TQ", val_data_path)
val_labels = [b == 1.0 for b in val_labels]  # the comparison already yields a Bool
print("Number of entries in the validation data: ", length(val_jets))

Number of entries in the validation data: 402999
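The label conversion used in the two cells above can be checked in isolation. The following is a minimal sketch, assuming the raw labels are stored as floating-point `0.0`/`1.0` values; the `raw_labels` vector is made up for illustration and does not come from the dataset.

```julia
using Statistics

# Hypothetical raw labels, standing in for what a reader returns as floats.
raw_labels = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0]

# Comparing against 1.0 yields a Bool for each entry, so the result is a Vector{Bool}.
labels = [b == 1.0 for b in raw_labels]

# The mean of a Bool vector is the fraction of `true` entries,
# which gives a quick class-balance check.
signal_fraction = mean(labels)
println("Fraction of positive labels: ", signal_fraction)
```

Running the same check on the real `train_labels` and `val_labels` is a cheap way to confirm that the labels were read as expected before computing any embeddings.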

Let's examine what one of the jets looks like. Each entry is the four-momentum $(E, p_x, p_y, p_z)$ of one detected particle.

However, to compute the embeddings, the jets must first be converted to a format the framework can use. The function `data2basis`, called with `basis="hyp"`, converts each detected four-momentum to the jet basis, i.e. $(\tilde p_T, \cos(\theta), \sin(\theta), \tilde y, E_T)$.

In [6]:
train_transf_jets = data2basis(train_jets; basis="hyp")
val_transf_jets = data2basis(val_jets; basis="hyp")
println("Transformed jets")

Transformed jets
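To give a feeling for the quantities in the jet basis, here is a minimal sketch of the textbook collider kinematics for a single four-momentum. It is not the implementation behind `data2basis`: the helper name `kinematics` is hypothetical, $\theta$ is assumed to be the azimuthal angle in the transverse plane, and the tilde quantities in the jet basis are presumably defined relative to the jet as a whole, which this sketch does not attempt.

```julia
# Textbook kinematics for one four-momentum (E, px, py, pz).
# Hypothetical helper for illustration, not part of BIPs.
function kinematics(p)
    E, px, py, pz = p
    pt = hypot(px, py)                          # transverse momentum
    rapidity = 0.5 * log((E + pz) / (E - pz))   # rapidity along the beam axis
    return (pt = pt, cos_theta = px / pt, sin_theta = py / pt, rapidity = rapidity)
end

# A massless example particle: E = |p| = 13.
k = kinematics((13.0, 3.0, 4.0, 12.0))
println(k)
```

Storing the angle as a cosine and sine pair, as the jet basis does, avoids the discontinuity an angle variable has where it wraps around.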


### The embeddings

Once the jets are converted to the jet basis, it is time to embed them using the *Invariant Polynomials*.

The function `build_ip` efficiently allocates the sparse basis, while `bip_data` computes the invariant representation of each jet.

In [1]:
f_bip, specs, a_basis = build_ip(order=2, levels=5)
    
function bip_data(dataset_jets)
    storage = zeros(length(dataset_jets), length(specs))
    for i = 1:length(dataset_jets)
        storage[i, :] = f_bip(dataset_jets[i])
    end
    storage[:, 2:end]
end

UndefVarError: UndefVarError: build_ip not defined
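The pattern used by `bip_data` (one feature vector per jet, stored as one row of a matrix, with the first column dropped at the end) can be tried without the framework. The sketch below uses a made-up `toy_features` function in place of `f_bip`; both it and `embed_all` are hypothetical names. The assumption that the dropped first column is a constant term is a guess based on the first entry of `specs` being `0`.

```julia
# Stand-in for f_bip: maps a variable-length "jet" to a fixed-length feature vector.
# The first feature is a constant, mimicking an order-0 term.
toy_features(jet) = [1.0, sum(jet), sum(abs2, jet)]

# Fill one row per jet, then drop the first (constant) column.
function embed_all(feature_fn, jets, nfeatures)
    storage = zeros(length(jets), nfeatures)
    for (i, jet) in enumerate(jets)
        storage[i, :] = feature_fn(jet)
    end
    return storage[:, 2:end]
end

jets = [[1.0, 2.0], [3.0], [0.5, 0.5, 1.0]]
X = embed_all(toy_features, jets, 3)
println(size(X))
```

Every jet maps to a row of the same length regardless of how many particles it contains, which is what lets a standard tabular classifier be trained on the result.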

In [8]:
specs

596-element Vector{Int64}:
 0
 1
 1
 1
 1
 1
 1
 2
 2
 2
 ⋮
 3
 3
 3
 3
 3
 3
 3
 3
 3

In [9]:
train_embedded_jets = bip_data(train_transf_jets)
println("Embedded train jets correctly")
val_embedded_jets = bip_data(val_transf_jets)
println("Embedded validation jets correctly")

Embedded train jets correctly


Embedded validation jets correctly


In [10]:
length(specs)

596

### Training a classifier model

The embeddings are now created for the dataset. From this point on, the choice of classifier is entirely open. For this specific example we will use the out-of-the-box classifier `sklearn.ensemble.HistGradientBoostingClassifier`, which bins the data and then applies a gradient boosted trees algorithm.

Now, let's fit a simple model to the data.


In [11]:
using PyCall
sk_ensemble = pyimport("sklearn.ensemble")  # function form, preferred over the older @pyimport macro

In [12]:
#GCT = sk_ensemble.HistGradientBoostingClassifier(verbose=false, max_iter=3000, l2_regularization=0.02, learning_rate=0.02, max_depth = 4).fit(train_embedded_jets, train_labels)

### Testing the performance

Now that we understand the framework, let's see how our model performs on the test set.

In [13]:
test_data_path = "../../../DataLake/raw/test.h5"

"../../../DataLake/raw/test.h5"

In [14]:
test_jets, test_labels = BIPs.read_data("TQ", test_data_path)
test_labels = [b == 1.0 for b in test_labels]  # the comparison already yields a Bool
test_transf_jets = data2basis(test_jets; basis="hyp")
test_embedded_jets = bip_data(test_transf_jets)
print("Embedded test jets correctly")

Embedded test jets correctly

In [15]:
test_accuracy = GCT.score(test_embedded_jets, test_labels)  # score returns the mean accuracy, not predictions

UndefVarError: UndefVarError: GCT not defined
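For a scikit-learn classifier, `score` returns the mean accuracy on the given data and labels. The same number can be computed directly in Julia from a vector of predictions, as in the minimal sketch below. The `preds` and `labels` vectors are made up for illustration, and `accuracy` is a hypothetical helper name.

```julia
using Statistics

# Fraction of predictions that match the true labels.
accuracy(preds, labels) = mean(preds .== labels)

labels = [true, false, true, true, false]
preds  = [true, false, false, true, false]   # one mistake, at position 3

acc = accuracy(preds, labels)
println("Accuracy: ", acc)
```

With a fitted model, the predictions would come from its `predict` method applied to `test_embedded_jets`.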

In [16]:
using DelimitedFiles

In [17]:
writedlm( "/home/josemm/MyDocs/personal_BIP/BIPs.jl/foo/storage_basis/specs.csv",  specs, ',')

In [18]:
writedlm( "/home/josemm/MyDocs/personal_BIP/BIPs.jl/foo/storage_basis/train_basis.csv",  train_embedded_jets, ',')
writedlm( "/home/josemm/MyDocs/personal_BIP/BIPs.jl/foo/storage_basis/train_labels.csv",  train_labels, ',')

In [19]:
writedlm( "/home/josemm/MyDocs/personal_BIP/BIPs.jl/foo/storage_basis/val_basis.csv",  val_embedded_jets, ',')
writedlm( "/home/josemm/MyDocs/personal_BIP/BIPs.jl/foo/storage_basis/val_labels.csv",  val_labels, ',')

In [20]:
writedlm( "/home/josemm/MyDocs/personal_BIP/BIPs.jl/foo/storage_basis/test_basis.csv",  test_embedded_jets, ',')
writedlm( "/home/josemm/MyDocs/personal_BIP/BIPs.jl/foo/storage_basis/test_labels.csv",  test_labels, ',')
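The export cells above rely on absolute paths that exist only on one machine. As a minimal sketch, assuming a small made-up matrix in place of the embedded jets, the same `writedlm` call can be combined with `joinpath` and a temporary directory, and the file can be read back with `readdlm` to verify the round trip. The variable names and the file name `basis.csv` are illustrative.

```julia
using DelimitedFiles

# Temporary output directory; in practice this would be a project-relative folder.
outdir = mktempdir()

# Small stand-in for an embedded-jets matrix (rows are jets, columns are features).
X = [1.0 2.0; 3.0 4.0]

basis_file = joinpath(outdir, "basis.csv")
writedlm(basis_file, X, ',')

# Read the file back as Float64 and compare with the original.
X_back = readdlm(basis_file, ',', Float64)
println(X_back == X)
```

Building paths with `joinpath` from a single base directory also keeps the export cells portable across machines and operating systems.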

In [22]:
a_basis.spec[[6,6,1]]

3-element Vector{BIPs.BiPolynomials.Modules.ASpec}:
 BIPs.BiPolynomials.Modules.ASpec(1, 0, 0)
 BIPs.BiPolynomials.Modules.ASpec(1, 0, 0)
 BIPs.BiPolynomials.Modules.ASpec(0, 0, 0)