Second-ranked solution to the Kaggle "Flavours of Physics" competition

Kaggle's Flavours of Physics: the second-ranked solution

This solution ranked second on the Private Leaderboard of the Kaggle "Flavours of Physics: Finding τ → μμμ" competition. The model is based on gradient boosting and is implemented in Python with the XGBoost library. It combines two XGBoost classifiers (boosters) trained on different sets of features. The first booster is an ensemble of 200 decision trees that uses mostly geometric features (such as impact parameters and track isolation variables). The second booster consists of 100 trees trained on purely kinematic features. The final prediction is a weighted average of the probabilities predicted by the two classifiers, with a weight of 0.78 assigned to the first booster. Combining two independent classifiers makes it easy to pass the correlation test. To pass the agreement test, it is enough to exclude SPDhits from the features used in training.
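
The weighted average described above can be sketched in a few lines. This is a minimal illustration, not the code from predict.py: the function name `blend` is hypothetical, and it assumes the second booster's weight is 1 − 0.78 = 0.22 so that the two weights sum to one.

```python
def blend(p_first, p_second, w_first=0.78):
    """Weighted average of the probabilities from two boosters.

    Assumption: the second booster gets weight 1 - w_first, so the
    weights sum to one. Works on plain floats and on NumPy arrays.
    """
    return w_first * p_first + (1.0 - w_first) * p_second


# Hypothetical probabilities for one event from the two boosters.
combined = blend(0.9, 0.5)
print(combined)
```

Because the function only uses arithmetic operators, the same call works element-wise on two arrays of per-event probabilities.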

Dependencies

  • The XGBoost library should be installed
  • The Python packages numpy and pandas are required, along with the csv module from the standard library
  • The training and test datasets (the files training.csv and test.csv) can be downloaded from here

How to generate the solution

  1. Put the data files training.csv and test.csv in the data directory.
  2. To train the XGBoost classifiers, run python train.py. The trained boosters will be saved in the files bst1.model and bst2.model, so you can make predictions on new datasets without re-training the model.
  3. To make a prediction, run python predict.py. Results will be written to submission.csv.

Feature engineering

Some new features were designed in addition to the original ones. The original feature SPDhits was not used since it prevents passing the agreement test. Lists of the features used to train each booster are provided below.

Features for the first booster

  • Original features: FlightDistance, FlightDistanceError, LifeTime, IP, IPSig, VertexChi2, dira, pt, DOCAone, DOCAtwo, DOCAthree, IP_p0p2, IP_p1p2, isolationa, isolationb, isolationc, isolationd, isolatione, isolationf, iso, CDF1, CDF2, CDF3, ISO_SumBDT, p0_IsoBDT, p1_IsoBDT, p2_IsoBDT, p0_track_Chi2Dof, p1_track_Chi2Dof, p2_track_Chi2Dof, p0_IP, p0_IPSig, p1_IP, p1_IPSig, p2_IP, p2_IPSig.

  • New features:

    • E is the full energy of the mother particle calculated assuming that the final-state particles p0, p1, and p2 are muons (E = E0 + E1 + E2).
    • FlightDistanceSig is the ratio (FlightDistance / FlightDistanceError).
    • DOCA_sum is the sum (DOCAone + DOCAtwo + DOCAthree).
    • isolation_sum is the sum (isolationa + isolationb + isolationc + isolationd + isolatione + isolationf).
    • IsoBDT_sum is the sum (p0_IsoBDT + p1_IsoBDT + p2_IsoBDT).
    • track_Chi2Dof is calculated as sqrt[(p0_track_Chi2Dof – 1)^2 + (p1_track_Chi2Dof – 1)^2 + (p2_track_Chi2Dof – 1)^2].
    • IP_sum is the sum (p0_IP + p1_IP + p2_IP).
    • IPSig_sum is the sum (p0_IPSig + p1_IPSig + p2_IPSig).
    • CDF_sum is the sum (CDF1 + CDF2 + CDF3).
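
The derived features listed above (except E, which depends on the per-muon energies) could be computed roughly as follows. This is a sketch based only on the formulas in the list, not the contents of features.py; the function name `geometric_features` is hypothetical. It takes any mapping from original column names to values, so it can be tried on a plain dict of floats or on a pandas DataFrame.

```python
def geometric_features(ev):
    """Compute the derived first-booster features from the original columns.

    `ev` maps column names to values (a dict of floats, or a DataFrame
    whose columns are Series). Returns a dict of new feature values.
    """
    # Deviation of each track's chi2/ndof from its ideal value of 1.
    dev = [ev[f"p{i}_track_Chi2Dof"] - 1.0 for i in range(3)]
    return {
        "FlightDistanceSig": ev["FlightDistance"] / ev["FlightDistanceError"],
        "DOCA_sum": ev["DOCAone"] + ev["DOCAtwo"] + ev["DOCAthree"],
        "isolation_sum": sum(ev["isolation" + s] for s in "abcdef"),
        "IsoBDT_sum": sum(ev[f"p{i}_IsoBDT"] for i in range(3)),
        "track_Chi2Dof": (dev[0] ** 2 + dev[1] ** 2 + dev[2] ** 2) ** 0.5,
        "IP_sum": sum(ev[f"p{i}_IP"] for i in range(3)),
        "IPSig_sum": sum(ev[f"p{i}_IPSig"] for i in range(3)),
        "CDF_sum": ev["CDF1"] + ev["CDF2"] + ev["CDF3"],
    }
```

With pandas, the result can be attached to the table with something like `df = df.assign(**geometric_features(df))`, where `df` is the DataFrame loaded from training.csv.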

Features for the second booster

  • Original features: dira, pt, p0_pt, p0_p, p0_eta, p1_pt, p1_p, p1_eta, p2_pt, p2_p, p2_eta.

  • New features:

    • E is the full energy of the mother particle calculated assuming that the final-state particles p0, p1, and p2 are muons (E = E0 + E1 + E2).
    • pz is the longitudinal momentum of the mother particle.
    • beta is the relativistic beta of the mother particle (beta = v / c).
    • gamma is the relativistic gamma of the mother particle (gamma = 1 / sqrt(1 – beta^2)).
    • beta_gamma is beta×gamma calculated as FlightDistance / (LifeTime×c), where c is the speed of light.
    • Delta_E is the difference between energies of the mother particle calculated in two different ways.
    • Delta_M is the difference between masses of the mother particle calculated in two different ways.
    • flag_M equals 1 if the mass of the mother particle is close to the tau mass and 0 otherwise.
    • E0 is the full energy of the particle p0 calculated as E0 = sqrt[(m_mu)^2 + (p0_p)^2], where m_mu is the muon mass.
    • E1 is the full energy of the particle p1 calculated as E1 = sqrt[(m_mu)^2 + (p1_p)^2], where m_mu is the muon mass.
    • E2 is the full energy of the particle p2 calculated as E2 = sqrt[(m_mu)^2 + (p2_p)^2], where m_mu is the muon mass.
    • E0_ratio is the ratio (E0 / E).
    • E1_ratio is the ratio (E1 / E).
    • E2_ratio is the ratio (E2 / E).
    • p0_pt_ratio is the ratio (p0_pt / pt).
    • p1_pt_ratio is the ratio (p1_pt / pt).
    • p2_pt_ratio is the ratio (p2_pt / pt).
    • eta_01 is the difference (p0_eta – p1_eta).
    • eta_02 is the difference (p0_eta – p2_eta).
    • eta_12 is the difference (p1_eta – p2_eta).
    • t_coll is calculated as (p0_pt + p1_pt + p2_pt) / pt (this equals unity if the final-state particles p0, p1, and p2 are collinear in the transverse plane).
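
Several of the kinematic features above follow directly from the stated formulas and can be sketched as below. This is an illustration rather than the code from features.py. The names `kinematic_features` and `beta_and_gamma` are hypothetical, and the muon mass constant assumes momenta are given in MeV/c. The list does not say how beta and gamma are obtained; the second function shows one consistent route, via beta_gamma and the identity gamma^2 = 1 + (beta×gamma)^2. Features whose definitions are not spelled out (pz, Delta_E, Delta_M, flag_M) are left out.

```python
M_MU = 105.658  # muon mass in MeV/c^2 (assumes momenta are in MeV/c)


def kinematic_features(ev, m_mu=M_MU):
    """Energies, ratios, eta differences and t_coll from original columns.

    `ev` maps column names to values (dict of floats or a DataFrame).
    """
    # Energy of each final-state particle under the muon mass hypothesis.
    e = [(m_mu ** 2 + ev[f"p{i}_p"] ** 2) ** 0.5 for i in range(3)]
    e_total = e[0] + e[1] + e[2]
    out = {"E": e_total}
    for i in range(3):
        out[f"E{i}"] = e[i]
        out[f"E{i}_ratio"] = e[i] / e_total
        out[f"p{i}_pt_ratio"] = ev[f"p{i}_pt"] / ev["pt"]
    out["eta_01"] = ev["p0_eta"] - ev["p1_eta"]
    out["eta_02"] = ev["p0_eta"] - ev["p2_eta"]
    out["eta_12"] = ev["p1_eta"] - ev["p2_eta"]
    # Equals 1 when the three transverse momenta are collinear.
    out["t_coll"] = (ev["p0_pt"] + ev["p1_pt"] + ev["p2_pt"]) / ev["pt"]
    return out


def beta_and_gamma(beta_gamma):
    """Recover (beta, gamma) from the product beta*gamma.

    Assumption: this route is chosen for illustration; it is consistent
    with gamma = 1 / sqrt(1 - beta^2).
    """
    gamma = (1.0 + beta_gamma ** 2) ** 0.5
    return beta_gamma / gamma, gamma
```

Here beta_gamma itself would come from FlightDistance / (LifeTime×c), with c expressed in units that match those two columns.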