![stuff.png](attachment:stuff.png)

# Machine Learning using ROOT's TMVA

The Toolkit for Multivariate Data Analysis with ROOT (TMVA) provides a machine learning environment for training and evaluating multivariate classifiers. This notebook covers the following ROOT-TMVA methods:
- Boosted Decision Tree
- Support Vector Machine
- Deep Neural Networks
- H-Matrix Discriminant




In [36]:
import ROOT

import numpy as np
import pandas as pd 



In [37]:
import os, sys, glob, collections

ROOT.gROOT.SetBatch()
ROOT.gStyle.SetOptStat(0)
ROOT.TMVA.Tools.Instance()

<cppyy.gbl.TMVA.Tools object at 0x7fb62ebda430>

#### Creating an output file for the trained dataset 

In [38]:
MethodName 	= "ML"
outFileName	= MethodName + "_output.root"
outputFile 	= ROOT.TFile(outFileName, "RECREATE")

#### Declaring Signal Tree and Background Tree

In [39]:
sigTree	= ROOT.TChain('jetTree')

#### Accessing the open data

In [40]:
inFileName 	= '../files/JetNtuple_RunIISummer16_13TeV_MC_'

for ifile in range(1, 4):
	sigTree.Add(inFileName+str(ifile)+"_skimmed.root")
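The loop above builds three file names from a common prefix. A mistyped path added to a chain can go unnoticed until the tree is actually read, so it can help to check the inputs first. The following is a minimal sketch of that check; the helper names `build_input_paths` and `split_by_existence` are hypothetical and not part of ROOT.

```python
import os


def build_input_paths(prefix, first, last, suffix="_skimmed.root"):
    """Return prefix + index + suffix for every index in first..last (inclusive)."""
    return [f"{prefix}{index}{suffix}" for index in range(first, last + 1)]


def split_by_existence(paths):
    """Separate paths into (found, missing) so missing inputs are reported early."""
    found = [path for path in paths if os.path.isfile(path)]
    missing = [path for path in paths if not os.path.isfile(path)]
    return found, missing


# Same prefix and index range as the notebook cell above
paths = build_input_paths("../files/JetNtuple_RunIISummer16_13TeV_MC_", 1, 3)
found, missing = split_by_existence(paths)
print(f"{len(found)} input files found, {len(missing)} missing")
```

Only the paths in `found` would then be passed to `sigTree.Add`.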

#### ROOT TMVA Data Loader and Factory

In [41]:
dataLoader	= ROOT.TMVA.DataLoader("dataset")
factory		= ROOT.TMVA.Factory(MethodName + "_Train", outputFile, "!V:!Silent:Color:DrawProgressBar:Transformations=I;D;P;G,D:AnalysisType=Classification")
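TMVA options are passed as one colon-separated string, where a bare name switches a flag on, a leading `!` switches it off, and `Name=value` sets a parameter. Long option strings are easy to mistype, so one possible approach is to assemble them from a dictionary. The helper `build_option_string` below is a hypothetical convenience function, not a TMVA API.

```python
def build_option_string(options):
    """Join options into TMVA's colon-separated form.

    True becomes a bare flag ("Name"), False a negated flag ("!Name"),
    and any other value becomes "Name=value".
    """
    parts = []
    for name, value in options.items():
        if value is True:
            parts.append(name)
        elif value is False:
            parts.append("!" + name)
        else:
            parts.append(f"{name}={value}")
    return ":".join(parts)


# Reproduces the Factory option string used in the cell above
factory_options = build_option_string({
    "V": False,
    "Silent": False,
    "Color": True,
    "DrawProgressBar": True,
    "Transformations": "I;D;P;G,D",
    "AnalysisType": "Classification",
})
print(factory_options)
```

The resulting string can be passed as the third argument of `ROOT.TMVA.Factory` in place of the literal.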

#### Identifying jetTree for Signal and Background

In [42]:
bkgTree = sigTree.Clone("tree_bkg")

sigCut 	= "(isPhysUDS==1)"
bkgCut	= "(isPhysG==1)"

dataLoader.AddSignalTree(sigTree)
dataLoader.AddBackgroundTree(bkgTree)

dataLoader.Print()

DataSetInfo              : [dataset] : Added class "Signal"
                         : Add Tree jetTree of type Signal with 360148 events
DataSetInfo              : [dataset] : Added class "Background"
                         : Add Tree tree_bkg of type Background with 360148 events
OBJ: TMVA::DataLoader	dataset	Configurable


#### Declaring variables for the DataLoader

In [43]:
varList = ['jetPt', 'jetEta', 'QG_mult', 'QG_ptD', 'QG_axis2']
for var in varList:
	dataLoader.AddVariable(var)

#### Setting weight for Signal and Background

In [44]:
dataLoader.SetSignalWeightExpression("eventWeight")
dataLoader.SetBackgroundWeightExpression("eventWeight")

#### Preparing the Training and Test Trees

In [45]:
dataLoader.PrepareTrainingAndTestTree(sigCut, bkgCut, "SplitMode=Random:NormMode=NumEvents:!V")
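The options `SplitMode=Random` and `NormMode=NumEvents` control how events are divided and how their weights are rescaled. As a rough illustration of the idea, and not of TMVA's internal code, the sketch below shuffles a toy event list into two halves and rescales a set of weights so that their average is 1. The function names, the 50/50 fraction, and the seed are assumptions made for this example.

```python
import random


def random_split(events, train_fraction=0.5, seed=100):
    """Shuffle a copy of the events and cut it into (train, test) parts."""
    shuffled = list(events)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


def normalise_to_num_events(weights):
    """Rescale weights so that they sum to the number of events (mean weight 1)."""
    scale = len(weights) / sum(weights)
    return [weight * scale for weight in weights]


events = list(range(10))  # toy event indices
train, test = random_split(events)

weights = [0.5, 2.0, 1.5, 4.0]  # toy event weights
normalised = normalise_to_num_events(weights)
print(len(train), len(test), sum(normalised))
```

In the notebook this rescaling would apply separately to the signal and background samples selected by `sigCut` and `bkgCut`.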

### Booking a Method with the Factory

**Boosted Decision Tree**

In [46]:
factory.BookMethod(dataLoader, ROOT.TMVA.Types.kBDT, "BDTG", "!H:!V:NTrees=1000:MinNodeSize=2.5%:BoostType=Grad:Shrinkage=0.10:UseBaggedBoost:BaggedSampleFraction=0.5:nCuts=20:MaxDepth=2")

<cppyy.gbl.TMVA.MethodBDT object at 0x7fb61fbbaea0>

Factory                  : Booking method: BDTG
                         : 
                         : the option NegWeightTreatment=InverseBoostNegWeights does not exist for BoostType=Grad
                         : --> change to new default NegWeightTreatment=Pray
                         : Building event vectors for type 2 Signal
                         : Dataset[dataset] :  create input formulas for tree jetTree
                         : Dataset[dataset] :  create input formulas for tree jetTree
                         : Dataset[dataset] :  create input formulas for tree jetTree
                         : Dataset[dataset] :  create input formulas for tree jetTree
                         : Building event vectors for type 2 Background
                         : Dataset[dataset] :  create input formulas for tree jetTree
                         : Dataset[dataset] :  create input formulas for tree jetTree
                         : Dataset[dataset] :  create input formulas 
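The booking above asks for gradient boosting (`BoostType=Grad`) with many shallow trees and `Shrinkage=0.10`. To show what shrinkage does, here is a deliberately simplified sketch: each new tree is a one-split stump fitted to the current residuals, and only a fraction of its prediction is added to the model. It uses a squared-error loss on one toy variable, which is an assumption made for brevity; TMVA's own loss function, tree building, and bagging are not reproduced here.

```python
def fit_stump(xs, residuals):
    """Find the single split on x that best fits the residuals (least squares)."""
    best = None
    # Skip the largest x as a threshold: it would leave the right leaf empty
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_trees, shrinkage):
    """Add n_trees stumps, each fitted to the residuals and damped by shrinkage."""
    predictions = [0.0] * len(xs)
    losses = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        predictions = [
            p + shrinkage * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return predictions, losses


# Toy data: label 1 stands in for "signal", 0 for "background"
xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

predictions, losses = boost(xs, ys, n_trees=50, shrinkage=0.10)
print(f"MSE after 1 tree: {losses[0]:.3f}, after 50 trees: {losses[-1]:.3f}")
```

A small shrinkage makes each tree contribute little, which is why it is usually paired with a large number of trees, as in the `NTrees=1000` setting above.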

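#### Training, testing, and evaluating the methods

In [47]:
factory.TrainAllMethods()
factory.TestAllMethods()
factory.EvaluateAllMethods()

# Close the output file so that all TMVA results are written to disk before it is reopened
outputFile.Close()

ROOT.TMVA.TMVAGui(outFileName)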

Factory                  : Train all methods
Factory                  : [dataset] : Create Transformation "I" with events from all classes.
                         : 
                         : Transformation, Variable selection : 
                         : Input : variable 'jetPt' <---> Output : variable 'jetPt'
                         : Input : variable 'jetEta' <---> Output : variable 'jetEta'
                         : Input : variable 'QG_mult' <---> Output : variable 'QG_mult'
                         : Input : variable 'QG_ptD' <---> Output : variable 'QG_ptD'
                         : Input : variable 'QG_axis2' <---> Output : variable 'QG_axis2'
Factory                  : [dataset] : Create Transformation "D" with events from all classes.
                         : 
                         : Transformation, Variable selection : 
                         : Input : variable 'jetPt' <---> Output : variable 'jetPt'
                         : Input : variable 'jetEta' 


### Plotting the ROC

In [73]:
from plotting_helpers import *

colors = [ROOT.kGreen+2, ROOT.kBlue+2, ROOT.kRed+2, ROOT.kMagenta+2]

METHOD     = "ML"
FILENAME   = METHOD + "_output.root" 
OUTDIR     = "./PNG/"

PLOT_ROC = True
PLOT_KST = True

TAIL = ".png"
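The ROC curve plotted in this section shows background rejection against signal efficiency as the cut on the classifier output is varied. As a minimal sketch of how such points arise, the function below scans a few cut values over toy classifier scores. The name `roc_points` and the toy numbers are illustrative assumptions; TMVA fills its own `_rejBvsS` histograms during evaluation.

```python
def roc_points(signal_scores, background_scores, thresholds):
    """Return (signal efficiency, background rejection) for each cut value.

    An event passes the cut when its score is strictly greater than the cut.
    """
    points = []
    for cut in thresholds:
        sig_eff = sum(score > cut for score in signal_scores) / len(signal_scores)
        bkg_rej = sum(score <= cut for score in background_scores) / len(background_scores)
        points.append((sig_eff, bkg_rej))
    return points


# Toy classifier outputs: signal tends towards 1, background towards 0
signal_scores = [0.9, 0.8, 0.6, 0.4]
background_scores = [0.7, 0.3, 0.2, 0.1]

points = roc_points(signal_scores, background_scores, thresholds=[0.0, 0.5, 1.0])
for sig_eff, bkg_rej in points:
    print(f"signal efficiency = {sig_eff:.2f}, background rejection = {bkg_rej:.2f}")
```

A classifier with more separation between the two score distributions pushes the curve towards the top-right corner (1, 1).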

In [91]:
def main(fileName):

  inFile  = ROOT.TFile(fileName, "OPEN")
  dataset = inFile.Get("dataset")

  hNameROC = "_rejBvsS" # name of ROC histograms
  hNameKST = "_Train"   # training histograms

  # Make sure the plot directory exists before any canvas is printed to it
  os.makedirs(OUTDIR, exist_ok=True)

  # Initialise the dictionaries once, so that no method is overwritten inside the loop
  methodDict = {"mother": {}, "daughter": {}}
  histoDict  = {}

  #
  # Access first directory to obtain keys for 2nd directory
  #
  for key in dataset.GetListOfKeys():
    if "Method" in key.GetName():
      methodDict["mother"][key.GetName()] = dataset.Get(key.GetName())

  #
  # Obtain keys for final directory
  #
  for method in methodDict["mother"]:

    for methodKey in methodDict["mother"][method].GetListOfKeys():
      methodName = methodKey.GetName()
      histoDict[methodName] = {}
      methodDict["daughter"][methodName] = methodDict["mother"][method].Get(methodName)

      #
      # Clone histograms
      #
      if PLOT_ROC:
        histoDict[methodName]["ROC"]      = methodDict["daughter"][methodName].Get("MVA_"+methodName+hNameROC     ).Clone()

      if PLOT_KST:
        histoDict[methodName]["SigTrain"] = methodDict["daughter"][methodName].Get("MVA_"+methodName+hNameKST+"_S").Clone()
        histoDict[methodName]["BkgTrain"] = methodDict["daughter"][methodName].Get("MVA_"+methodName+hNameKST+"_B").Clone()
        histoDict[methodName]["SigTest"]  = methodDict["daughter"][methodName].Get("MVA_"+methodName+"_S"         ).Clone()
        histoDict[methodName]["BkgTest"]  = methodDict["daughter"][methodName].Get("MVA_"+methodName+"_B"         ).Clone()

        plotKST(histoDict[methodName], methodName)

  if PLOT_ROC:
    plotROC(histoDict)

In [80]:
 
def plotKST(histoDict, methodName):

  canv = ROOT.TCanvas("canvKST"+methodName, "canv", 800, 800)

  #
  # Setup legends and histograms
  #
  lgd = ROOT.TLegend(0.55, 0.85, 0.95, 0.975)

  sigTrain = histoDict["SigTrain"]
  bkgTrain = histoDict["BkgTrain"]
  sigTest  = histoDict["SigTest"] 
  bkgTest  = histoDict["BkgTest"] 

  ROOT.TMVA.TMVAGlob.NormalizeHists(sigTrain, bkgTrain)
  ROOT.TMVA.TMVAGlob.NormalizeHists(sigTest,  bkgTest)
  ROOT.TMVA.TMVAGlob.SetSignalAndBackgroundStyle(sigTest, bkgTest)
  sigTest.SetLineWidth(1)
  bkgTest.SetLineWidth(1)

  #
  # Setup frame
  #
  frame = SetupFrame(sigTest, bkgTest, methodName)
  frame.Draw()

  canv.GetPad(0).SetLeftMargin(0.105)
  frame.GetYaxis().SetTitleOffset(1.2)

  #
  # Overlay signal and background test histos
  #
  sigTest.Draw("same hist")
  bkgTest.Draw("same hist")
  sigTrain.Draw("e1 same")
  bkgTrain.Draw("e1 same")

  lgd.AddEntry(sigTest,  "Signal (test)",      "F")
  lgd.AddEntry(bkgTest,  "Background (test)",  "F")
  lgd.AddEntry(sigTrain, "Signal (train)",     "E1")
  lgd.AddEntry(bkgTrain, "Background (train)", "E1")

  #
  # K-S Test
  #
  kolS = sigTest.KolmogorovTest(sigTrain, "X")
  kolB = bkgTest.KolmogorovTest(bkgTrain, "X")

  tt1    = ROOT.TText(0.15, 0.89,   "Kolmogorov-Smirnov Test")
  tt2    = ROOT.TText(0.15, 0.865, "Sig = "+str(kolS))
  tt3    = ROOT.TText(0.15, 0.84,  "Bkg = "+str(kolB))

  for text in [tt1, tt2, tt3]:
    text.SetNDC()
    text.SetTextSize(0.032)
    text.AppendPad()


  lgd.Draw()

  # Redraw the axes on top of the histograms before saving the canvas
  frame.Draw("same axis")
  canv.Print(OUTDIR+"KST_"+methodName+TAIL)


def plotROC(histoDict):

  #
  # Declare ROC canvas with grids, and the corresponding attributes
  #
  canv = ROOT.TCanvas("canvROC", "canv", 800, 800)
  canv.SetGrid()
  canv.SetTicks()

  #
  # Plot legends and histograms
  #
  lgd = ROOT.TLegend(0.2, 0.2, 0.6, 0.3)

  for i, methodName in enumerate(histoDict):
    histo = histoDict[methodName]["ROC"]
    # Cycle through the colour list so that extra methods do not run past its end
    histo.SetLineColor(colors[i % len(colors)])
    FormatAxisText(histo, xTitle="Signal Efficiency", yTitle="Background Rejection (1-#epsilon_{B})", yOffset=1.2)
    histo.GetXaxis().SetRangeUser(0., 1.0)
    histo.SetMaximum(1.01)
    histo.SetMinimum(0.0)
    histo.SetLineWidth(2)
    # The first curve sets up the axes; later curves are overlaid on it
    histo.Draw("c" if i == 0 else "c same")

    lgd.AddEntry(histo, methodName.replace("_", " "), "l")

  lgd.Draw()
  # canv.SetLogy()
  canv.Print(OUTDIR+"ROC"+TAIL)
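`plotKST` compares the classifier response on the training and test samples with a Kolmogorov-Smirnov test, which is a common check for overtraining. The quantity underlying that test is the largest gap between the two cumulative distributions. The sketch below computes only this distance for two binned histograms given as plain lists; the name `ks_distance` and the bin contents are made up for illustration. ROOT's `KolmogorovTest` goes further and returns a probability, with the `"X"` option running pseudo-experiments, which is not reproduced here.

```python
def ks_distance(hist_a, hist_b):
    """Return the maximum distance between the cumulative distributions of
    two histograms with identical binning, each normalised to unit area."""
    total_a, total_b = sum(hist_a), sum(hist_b)
    cum_a = cum_b = 0.0
    distance = 0.0
    for content_a, content_b in zip(hist_a, hist_b):
        cum_a += content_a / total_a
        cum_b += content_b / total_b
        distance = max(distance, abs(cum_a - cum_b))
    return distance


# Toy bin contents of a classifier response for training and test events
train_hist = [10, 20, 40, 30]
test_hist = [12, 18, 38, 32]

print(f"K-S distance = {ks_distance(train_hist, test_hist):.3f}")
```

A distance close to zero means that the training and test responses have nearly the same shape, which is what one hopes to see for a classifier that is not overtrained.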

In [83]:
if __name__== "__main__":
  main(FILENAME)

  print("***** DONE PLOTTING ******")

***** DONE PLOTTING ******


Info in <TFile::Recover>: ML_output.root, recovered key TDirectoryFile:dataset at address 224
