<img src="http://oproject.org/tiki-download_file.php?fileId=8&display&x=450&y=128">
<img src="http://files.oproject.org/tmvalogo.png" height="50%" width="50%">

# DataLoader Example

## Declare Factory
Load the TMVA library, then open the training and testing data from GitHub as the input file. Set up the output file, which stores the results on the training and test sets.

Then set up the Factory. You can define different models that train on different datasets and add them to the factory; for convenience, the results for all of them are saved to the same output file.

In [1]:
TMVA::Tools::Instance();

auto inputFile = TFile::Open("https://raw.githubusercontent.com/iml-wg/tmvatutorials/master/inputdata.root");
auto outputFile = TFile::Open("TMVAOutputBDT.root", "RECREATE");

TMVA::Factory factory("TMVARegression", outputFile,
                      "!V:!Silent:Color:DrawProgressBar:AnalysisType=Regression" ); 

--- Factory                  : You are running ROOT Version: 6.07/07, Apr 1, 2016
--- Factory                  : 
--- Factory                  : _/_/_/_/_/ _|      _|  _|      _|    _|_|   
--- Factory                  :    _/      _|_|  _|_|  _|      _|  _|    _| 
--- Factory                  :   _/       _|  _|  _|  _|      _|  _|_|_|_| 
--- Factory                  :  _/        _|      _|    _|  _|    _|    _| 
--- Factory                  : _/         _|      _|      _|      _|    _| 
--- Factory                  : 
--- Factory                  : ___________TMVA Version 4.2.1, Feb 5, 2015
--- Factory                  : 
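The option strings passed to the Factory, the DataLoader, and BookMethod in this notebook are colon-separated lists: a bare name switches a flag on, a leading `!` switches it off, and `Key=Value` sets a value. As a rough illustration of how such a string reads, here is a minimal standalone C++ sketch. `parseOptions` is a hypothetical helper written for this example; it is not part of TMVA and does not reproduce how TMVA parses options internally.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper (not a TMVA function): split a colon-separated option
// string into key/value pairs. "Flag" maps to "true", "!Flag" maps to
// "false", and "Key=Value" maps Key to Value. Empty tokens are skipped.
std::map<std::string, std::string> parseOptions(const std::string &options) {
    std::map<std::string, std::string> parsed;
    std::istringstream stream(options);
    std::string token;
    while (std::getline(stream, token, ':')) {
        if (token.empty()) continue;
        const auto eq = token.find('=');
        if (eq != std::string::npos) {
            parsed[token.substr(0, eq)] = token.substr(eq + 1);
        } else if (token[0] == '!') {
            parsed[token.substr(1)] = "false";
        } else {
            parsed[token] = "true";
        }
    }
    return parsed;
}
```

Applied to the Factory string above, this sketch would report `V` and `Silent` as off, `Color` and `DrawProgressBar` as on, and `AnalysisType` as `Regression`.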


## Declare DataLoader
The DataLoader defines the feature variables and the target to use for training. With different DataLoaders, it is easy to train and test on datasets with different variables. You'll see how to link the DataLoader to the model a bit further down.

In [2]:
TMVA::DataLoader loader("dataset"); 

// Add the feature variables
loader.AddVariable("var1");
loader.AddVariable("var2");
loader.AddVariable("var3");
loader.AddVariable("var4");
loader.AddVariable("var5 := var1-var3"); // create new features
loader.AddVariable("var6 := var1+var2");

loader.AddTarget( "target := var2+var3" ); // define the target for the regression
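To make the expressions above concrete, the following standalone C++ sketch computes by hand what the `var5`, `var6`, and `target` definitions evaluate to for a single event. `Event`, `Row`, and `buildRow` are hypothetical names used only for this illustration; in the notebook, TMVA evaluates the expressions directly on the TTree.

```cpp
#include <array>

// Hypothetical record holding the four input values of one event.
struct Event {
    double var1, var2, var3, var4;
};

// The six features and the regression target for one event.
struct Row {
    std::array<double, 6> features;
    double target;
};

// Mirror the loader definitions: four raw inputs, two derived features,
// and a target built from var2 and var3.
Row buildRow(const Event &e) {
    Row row;
    row.features = {e.var1, e.var2, e.var3, e.var4,
                    e.var1 - e.var3,   // var5 := var1-var3
                    e.var1 + e.var2};  // var6 := var1+var2
    row.target = e.var2 + e.var3;      // target := var2+var3
    return row;
}
```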


## Set Up Dataset
Tell the DataLoader about the actual training/testing data by linking it to a TTree. The TTree leaves should have the same names as those defined in the loader above. Define cuts if you only want to consider a subset of the data for training and testing. Then split the data into a training set and a test set; here each gets 1000 randomly selected events.

In [3]:
TTree *tree = nullptr; // initialize so the pointer never holds an indeterminate value
inputFile->GetObject("Sig", tree);

TCut mycuts = ""; // e.g. TCut mycuts = "abs(var1)<0.5";

loader.AddRegressionTree(tree, 1.0);   // link the TTree to the loader, weight for each event  = 1
loader.PrepareTrainingAndTestTree(mycuts,
                                   "nTrain_Regression=1000:nTest_Regression=1000:SplitMode=Random:NormMode=NumEvents:!V" );

--- DataSetInfo              : Dataset[dataset1] : Added class "Signal"	 with internal class number 0
--- dataset1                 : Add Tree Sig of type Signal with 6000 events
--- DataSetInfo              : Dataset[dataset1] : Added class "Background"	 with internal class number 1
--- dataset1                 : Add Tree Bkg of type Background with 6000 events
--- dataset1                 : Preparing trees for training and testing...
--- DataSetInfo              : Dataset[dataset2] : Added class "Signal"	 with internal class number 0
--- dataset2                 : Add Tree Sig of type Signal with 6000 events
--- DataSetInfo              : Dataset[dataset2] : Added class "Background"	 with internal class number 1
--- dataset2                 : Add Tree Bkg of type Background with 6000 events
--- dataset2                 : Preparing trees for training and testing...
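Conceptually, `SplitMode=Random` together with `nTrain_Regression=1000:nTest_Regression=1000` selects two disjoint random subsets of the events. A minimal standalone sketch of that idea, assuming a plain shuffle of event indices, might look like the following. `randomSplit` and its `seed` argument are hypothetical, and TMVA's own splitting logic may differ in detail.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

using IndexList = std::vector<std::size_t>;

// Hypothetical helper (not TMVA's implementation): shuffle the event indices,
// then take the first nTrain for training and the next nTest for testing.
std::pair<IndexList, IndexList> randomSplit(std::size_t nEvents, std::size_t nTrain,
                                            std::size_t nTest, unsigned seed) {
    assert(nTrain + nTest <= nEvents);
    IndexList indices(nEvents);
    std::iota(indices.begin(), indices.end(), std::size_t{0});
    std::mt19937 rng(seed);
    std::shuffle(indices.begin(), indices.end(), rng);

    const auto trainEnd = indices.begin() + static_cast<std::ptrdiff_t>(nTrain);
    const auto testEnd = trainEnd + static_cast<std::ptrdiff_t>(nTest);
    return {IndexList(indices.begin(), trainEnd), IndexList(trainEnd, testEnd)};
}
```

Because both subsets come from one shuffled list, no event can appear in both the training and the test set.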


## Book the Regression Method

Book the method for regression. Here we choose the Boosted Decision Tree model with gradient boosting, hence the method name BDTG and BoostType=Grad.

Define the hyperparameters: the number of trees (NTrees), the boosting type (BoostType), the shrinkage (Shrinkage), and the maximum tree depth (MaxDepth). Also choose the loss function through RegressionLossFunctionBDTG: 'AbsoluteDeviation', 'Huber', or 'LeastSquares'. nCuts determines how finely each feature's range is scanned for the best cut. Larger values take more time, but you may get more accurate results.
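For orientation, the three loss names correspond to the textbook per-event losses sketched below in standalone C++, written in terms of the residual `r = prediction - target`. This is only a sketch of the standard definitions: the factor 0.5 is a common convention, and the fixed `delta` threshold in `huberLoss` is an assumption of this example, so TMVA's normalization and its choice of Huber threshold may differ.

```cpp
#include <cmath>

// Least squares: penalizes large residuals quadratically.
double leastSquaresLoss(double r) { return 0.5 * r * r; }

// Absolute deviation: grows linearly, so outliers weigh less.
double absoluteDeviationLoss(double r) { return std::fabs(r); }

// Huber: quadratic for |r| <= delta and linear beyond it.
// delta is an assumed, fixed threshold for this illustration.
double huberLoss(double r, double delta) {
    const double a = std::fabs(r);
    return a <= delta ? 0.5 * a * a : delta * (a - 0.5 * delta);
}
```

The Huber form behaves like least squares for small residuals and like absolute deviation for large ones, which is why it is often described as a compromise between the two.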

Booking a method tells the factory to train this model on this dataset. Book multiple methods with multiple datasets and the factory will train, test, and evaluate all of them.

In [4]:
// Boosted Decision Trees 
factory.BookMethod(&loader, TMVA::Types::kBDT, "BDTG",
                   TString("!H:!V:NTrees=64:BoostType=Grad:Shrinkage=0.3:nCuts=20:MaxDepth=4:")+
                   TString("RegressionLossFunctionBDTG=AbsoluteDeviation"));

--- Factory                  : Booking method: BDT DataSet Name: dataset1
--- DataSetFactory           : Dataset[dataset1] : Splitmode is: "RANDOM" the mixmode is: "SAMEASSPLITMODE"
--- DataSetFactory           : Dataset[dataset1] : Create training and testing trees -- looping over class "Signal" ...
--- DataSetFactory           : Dataset[dataset1] : Weight expression for class 'Signal': ""
--- DataSetFactory           : Dataset[dataset1] : Create training and testing trees -- looping over class "Background" ...
--- DataSetFactory           : Dataset[dataset1] : Weight expression for class 'Background': ""
--- DataSetFactory           : Dataset[dataset1] : Number of events in input trees (after possible flattening of arrays):
--- DataSetFactory           : Dataset[dataset1] :     Signal          -- number of events       : 6000   / sum of weights: 6000 
--- DataSetFactory           : Dataset[dataset1] :     Background      -- number of events       : 6000   / sum of wei
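To give a feel for what the `nCuts` option controls, here is a standalone C++ sketch of a grid scan for the best cut on a single feature. It assumes evenly spaced candidate thresholds strictly inside the feature's range and a squared-error criterion; `bestCut` is a hypothetical helper, and the grid and separation criterion TMVA actually uses may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper (not TMVA code): try nCuts evenly spaced thresholds
// between min(x) and max(x) and return the one that minimizes the summed
// squared deviation of y from its mean on each side of the cut.
double bestCut(const std::vector<double> &x, const std::vector<double> &y, int nCuts) {
    assert(x.size() == y.size() && !x.empty() && nCuts > 0);
    const double lo = *std::min_element(x.begin(), x.end());
    const double hi = *std::max_element(x.begin(), x.end());

    double bestThreshold = lo;
    double bestError = std::numeric_limits<double>::infinity();
    for (int i = 1; i <= nCuts; ++i) {
        const double threshold = lo + (hi - lo) * i / (nCuts + 1);
        double nL = 0, sumL = 0, sumSqL = 0;
        double nR = 0, sumR = 0, sumSqR = 0;
        for (std::size_t k = 0; k < x.size(); ++k) {
            if (x[k] < threshold) { nL += 1; sumL += y[k]; sumSqL += y[k] * y[k]; }
            else                  { nR += 1; sumR += y[k]; sumSqR += y[k] * y[k]; }
        }
        if (nL == 0 || nR == 0) continue;  // a cut must leave events on both sides
        const double error = (sumSqL - sumL * sumL / nL) + (sumSqR - sumR * sumR / nR);
        if (error < bestError) { bestError = error; bestThreshold = threshold; }
    }
    return bestThreshold;
}
```

For ten points at x = 0, 1, ..., 9 whose target steps from 0 to 1 at x = 5, a grid of 20 cuts places the threshold between 4 and 5, while a grid of 2 cuts can only offer 3 or 6 and settles for 6. A finer grid costs more evaluations per node but can land closer to the ideal split.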

## Train Method

In [6]:
factory.TrainAllMethods();

--- Factory                  :  
--- Factory                  : Train all methods for Classification ...
--- Factory                  : 
--- Factory                  : current transformation string: 'I'
--- Factory                  : Dataset[dataset1] : Create Transformation "I" with events from all classes.
--- Id                       : Transformation, Variable selection : 
--- Id                       : Input : variable 'var1' (index=0).   <---> Output : variable 'var1' (index=0).
--- Id                       : Input : variable 'var2' (index=1).   <---> Output : variable 'var2' (index=1).
--- Id                       : Input : variable 'var3' (index=2).   <---> Output : variable 'var3' (index=2).
--- Id                       : Preparing the Identity transformation...
--- TFHandler_Factory        : -----------------------------------------------------------
--- TFHandler_Factory        : Variable        Mean        RMS   [        Min        Max ]
--- TFHandler_Factory        : ------

## Test and Evaluate the Model

In [7]:
factory.TestAllMethods();
factory.EvaluateAllMethods();    

--- Factory                  : Test all methods...
--- Factory                  : Test method: BDT for Classification performance
--- BDT                      : Dataset[dataset1] : Evaluation of BDT on testing sample (10000 events)
--- BDT                      : Dataset[dataset1] : Elapsed time for evaluation of 10000 events: 0.153 sec
--- Factory                  : Test method: MLP for Classification performance
--- MLP                      : Dataset[dataset1] : Evaluation of MLP on testing sample (10000 events)
--- MLP                      : Dataset[dataset1] : Elapsed time for evaluation of 10000 events: 0.014 sec
--- Factory                  : Test method: BDT for Classification performance
--- BDT                      : Dataset[dataset2] : Evaluation of BDT on testing sample (10000 events)
--- BDT                      : Dataset[dataset2] : Elapsed time for evaluation of 10000 events: 0.135 sec
--- Factory                  : Tes

## Gather and Plot the Results
Close the output file so that its contents are written to disk and it can be reopened for reading.

The results on the training set are stored in dataset/TrainTree and the results on the test set are stored in dataset/TestTree. The BDT predictions are in the "BDTG" leaf and the target is in the "target" leaf. 

Let's plot the residuals for the BDTG predictions.

In [None]:
outputFile->Close();
auto resultsFile = TFile::Open("TMVAOutputBDT.root");
TTree *resultsTree = nullptr;
resultsFile->GetObject("dataset/TestTree", resultsTree); // typed lookup of the test tree
resultsTree->Draw("BDTG-target");
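Beyond the histogram, it can help to reduce the residuals to a couple of numbers. The standalone C++ sketch below computes the mean and the root mean square of `prediction - target` from two plain vectors. `ResidualSummary` and `summarizeResiduals` are hypothetical names for this illustration, and filling the vectors from the "BDTG" and "target" leaves of the test tree is left out.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Mean residual (bias) and root mean square of the residuals.
struct ResidualSummary {
    double mean;
    double rms;
};

// Hypothetical helper: summarize predicted - target over all events.
ResidualSummary summarizeResiduals(const std::vector<double> &predicted,
                                   const std::vector<double> &target) {
    assert(predicted.size() == target.size() && !predicted.empty());
    double sum = 0.0, sumSq = 0.0;
    for (std::size_t i = 0; i < predicted.size(); ++i) {
        const double r = predicted[i] - target[i];
        sum += r;
        sumSq += r * r;
    }
    const double n = static_cast<double>(predicted.size());
    return {sum / n, std::sqrt(sumSq / n)};
}
```

A mean close to zero suggests the predictions are unbiased on average, while the root mean square gives the typical size of the error in the units of the target.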