## General setup

To make sure things are working and `hepdata_lib` is available, run the following command:

In [1]:
import hepdata_lib

Welcome to JupyROOT 6.22/06


## Creating your HEPData submission

The `Submission` object represents the whole HEPData entry and thus carries the top-level metadata that applies equally to all the tables and variables you may want to enter. The object is also used to create the physical submission files you will upload to the HEPData web interface.

When using `hepdata_lib` to make an entry, you always need to create a `Submission` object. Let's do that now, and then add data to it step by step:

In [2]:
from hepdata_lib import Submission
submission = Submission()

In general, a `Submission` should contain details on the actual analysis, such as its abstract, as well as links to the publication. The abstract should be in a plain text file. For `inspire` there is a dedicated `record_id`, while for links to `arXiv` etc. you should use plain hyperlinks.

In [3]:
submission.read_abstract("abstract.txt")
#submission.add_link("Webpage with all figures and tables", "https://cms-results.web.cern.ch/cms-results/public-results/publications/B2G-16-029/")
#submission.add_link("arXiv", "http://arxiv.org/abs/arXiv:1802.09407")
#submission.add_record_id(1657397, "inspire")

Additional resources, here the CalcHEP model and the LHE headers, can be attached to the submission as files:

In [4]:
submission.add_additional_resource("CalcHEP model.","HNmodel.tar", copy_file=True)
submission.add_additional_resource("CalcHEP LHE headers.","LHEheaders.tar", copy_file=True)

## Adding a table/figure

In HEPData, figures and tables are both represented as `Table` objects. The example here reads a plain text file containing a cut flow: the efficiency after each selection step for the `mumujj` and `eejj` channels. The file has been uploaded to the `example_files` directory. For your submission, create a new directory, e.g. named after the analysis identifier.

Let's have a look at the file:

In [5]:
!head cutflowM500.txt

SELECTION           mumujj   eejj
---------           ------  -----
Trigger             0.692   0.515
LeadingLepton     0.692   0.515
SubleadingLepton     0.692   0.515
FatJet               0.407   0.305
m(ll)                0.379   0.305


The first column is the name of the selection step; the other columns contain the efficiency values for the `mumujj` and `eejj` channels.

Let's create the table/figure. First, we need to give it a name, which is usually just the identifier in the paper (e.g. "Figure 1"); here we use "Cut-flow table". The table also needs a description, which is usually the caption. You also need to describe the location, i.e. where to find it in the publication:

In [6]:
from hepdata_lib import Table
table = Table("Cut-flow table")
table.description = "Cut-flow table mN=0.5TeV, electron, muon channel, 2016."
table.location = "Additional material"

Now we need to provide more information on what is actually shown, which is done via `keywords`. The ones that are available can be taken from the documentation:
- [Observables](https://hepdata-submission.readthedocs.io/en/latest/keywords/observables.html)
- [Phrases](https://hepdata-submission.readthedocs.io/en/latest/keywords/phrases.html)
- [Particles](https://hepdata-submission.readthedocs.io/en/latest/keywords/partlist.html)

In [7]:
table.keywords["observables"] = ["ACC", "EFF"]

Let's read in the file. For this purpose, `numpy` is very handy. Since the first two rows are the header, we skip them. The numeric columns and the selection labels are read separately, since they have different types:

In [8]:
import numpy as np
data = np.loadtxt("cutflowM500.txt", skiprows=2, usecols=range(1,3))

In [9]:
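# Read only the first column (the selection labels) as strings;
# ndmin=2 keeps the result two-dimensional so that dataTXT[:,0] works
dataTXT = np.loadtxt("cutflowM500.txt", skiprows=2, usecols=(0,), dtype="str", ndmin=2)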

`numpy` stores the content as arrays. Printing `data` shows the two numeric columns, one row per selection step:

In [10]:
from __future__ import print_function
print(data)

[[0.692 0.515]
 [0.692 0.515]
 [0.692 0.515]
 [0.407 0.305]
 [0.379 0.305]]
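If you prefer to read the labels and the numbers in a single pass, the same file layout can also be parsed without `numpy`. The following is only a sketch: `read_cutflow` is a hypothetical helper, not part of `hepdata_lib`, and the sample text is copied from the `head` output above so that the block runs on its own.

```python
import io

# Sample contents copied from the `head` output above; in practice,
# open("cutflowM500.txt") would replace the StringIO object.
SAMPLE = """\
SELECTION           mumujj   eejj
---------           ------  -----
Trigger             0.692   0.515
LeadingLepton     0.692   0.515
SubleadingLepton     0.692   0.515
FatJet               0.407   0.305
m(ll)                0.379   0.305
"""


def read_cutflow(handle, n_header=2):
    """Return (labels, columns) from a whitespace-separated cut-flow table.

    Hypothetical helper: the first field of each row is treated as the
    text label, all remaining fields as floats.
    """
    labels, rows = [], []
    for i, line in enumerate(handle):
        # Skip the header rows and any empty lines
        if i < n_header or not line.strip():
            continue
        fields = line.split()
        labels.append(fields[0])
        rows.append([float(x) for x in fields[1:]])
    # Transpose the rows into one list per numeric column
    columns = [list(col) for col in zip(*rows)]
    return labels, columns


labels, columns = read_cutflow(io.StringIO(SAMPLE))
print(labels)
print(columns[0])
```

This approach assumes that the labels contain no whitespace, which holds for the file shown here but would need a different split rule otherwise.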


We will now use this for our `Variable` definitions. The independent variable (`is_independent=True`) usually corresponds to the x-axis, here the selection step, whereas the other variables are dependent (i.e. a function of the former). You also need to declare whether the variable is binned, and give its units. As with the `keywords` used above, it is important to provide additional information, here a qualifier for the centre-of-mass energy, so that the entry can be found via the HEPData web interface; suitable terms are listed under the observables and particles linked above. The values assigned are just slices of the `dataTXT` and `data` arrays:

In [11]:
from hepdata_lib import Variable
d = Variable("Selection", is_independent=True, is_binned=False, units="")
d.values = dataTXT[:,0]

Effmumujj = Variable("Efficiency mumujj", is_independent=False, is_binned=False, units="")
Effmumujj.values = data[:,0]
Effmumujj.add_qualifier("SQRT(S)", 13, "TeV")

Effeejj = Variable("Efficiency eejj", is_independent=False, is_binned=False, units="")
Effeejj.values = data[:,1]
Effeejj.add_qualifier("SQRT(S)", 13, "TeV")

table.add_variable(d)
table.add_variable(Effmumujj)
table.add_variable(Effeejj)
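Every variable in a table is expected to provide one value per row, so it can be worth checking the lengths before adding the variables. A minimal sketch, assuming the values are available as plain sequences; `check_equal_lengths` is a hypothetical helper, not part of `hepdata_lib`, and the numbers are the ones printed above:

```python
def check_equal_lengths(columns):
    """Return the common length of all value lists, or raise ValueError.

    `columns` maps a variable name to its list of values.
    """
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError("Inconsistent column lengths: {}".format(lengths))
    return next(iter(lengths.values()), 0)


n_rows = check_equal_lengths({
    "Selection": ["Trigger", "LeadingLepton", "SubleadingLepton", "FatJet", "m(ll)"],
    "Efficiency mumujj": [0.692, 0.692, 0.692, 0.407, 0.379],
    "Efficiency eejj": [0.515, 0.515, 0.515, 0.305, 0.305],
})
print(n_rows)
```

In the notebook, the same check could be applied to `d.values`, `Effmumujj.values` and `Effeejj.values`.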

This is all that's needed for the table/figure. We still need to add it to the submission:

In [12]:
submission.add_table(table)

Once you've added all tables/figures and the general submission details, you should add a few more keywords to all tables for better identification and searchability, e.g. the centre-of-mass energy:

In [13]:
for table in submission.tables:
    table.keywords["cmenergies"] = [13000]

## Reading histograms for SR plots

### Electron channel

In [16]:
from hepdata_lib import Table
table = Table("Figure 4a")
# NOTE: \mllj and \eeqq are custom LaTeX macros from the paper; they should be expanded to plain LaTeX before upload.
# The raw string (r"...") keeps the backslashes literal.
table.description = r"Distributions of \mllj for the data, and the pre-fit backgrounds (stacked histograms), in the SRs of the \eeqq channel. The template for one signal hypothesis is shown overlaid as a yellow solid line. The overflow is included in the last bin. The middle panels show ratios of the data to the pre-fit background prediction and post-fit background yield as red open squares and blue points, respectively. The gray band in the middle panels indicates the systematic component of the post-fit uncertainty. The lower panels show the distributions of the pulls, defined in the text."
table.location = "Data from Figure 4 (upper left)."
table.keywords["observables"] = ["N"]
table.add_image("Figure_004-a.pdf")

In [20]:
from hepdata_lib import RootFileReader

# All three histograms live in the same file, so one reader is enough
reader = RootFileReader("eejj_PostFit_histograms_L13_M05.root")

TotalBackground = reader.read_hist_1d("prefit/TotalBkg")
#TT = reader.read_hist_1d("shapes_prefit/cat0_singleH/TT")
#QCD = reader.read_hist_1d("shapes_prefit/cat0_singleH/QCDTT")
#WJets = reader.read_hist_1d("shapes_prefit/cat0_singleH/WJets")
#ZJets = reader.read_hist_1d("shapes_prefit/cat0_singleH/ZJets")

Data = reader.read_hist_1d("prefit/data_obs")

signal = reader.read_hist_1d("prefit/TotalSig")

In [21]:
from hepdata_lib import Variable, Uncertainty

# x-axis: m(eeJ)
mmed = Variable("$m(eeJ)$", is_independent=True, is_binned=False, units="TeV")
mmed.values = signal["x"]

# y-axis: N events
sig = Variable("Number of signal events", is_independent=False, is_binned=False, units="")
sig.values = signal["y"]

totalbackground = Variable("Number of background events", is_independent=False, is_binned=False, units="")
totalbackground.values = TotalBackground["y"]

#tt = Variable("Number of ttbar events", is_independent=False, is_binned=False, units="")
#tt.values = TT["y"]

#qcd = Variable("Number of qcd events", is_independent=False, is_binned=False, units="")
#qcd.values = QCD["y"]

#wjets = Variable("Number of wjets events", is_independent=False, is_binned=False, units="")
#wjets.values = WJets["y"]

#zjets = Variable("Number of zjets events", is_independent=False, is_binned=False, units="")
#zjets.values = ZJets["y"]

data = Variable("Number of data events", is_independent=False, is_binned=False, units="")
data.values = Data["y"]

In [25]:
from hepdata_lib import Uncertainty

unc_totalbackground = Uncertainty("total uncertainty", is_symmetric=True)
# Use the bin errors ("dy"), not the bin contents ("y")
unc_totalbackground.values = TotalBackground["dy"]

unc_data = Uncertainty("Poisson errors", is_symmetric=True)
unc_data.values = Data["dy"]

totalbackground.add_uncertainty(unc_totalbackground)
data.add_uncertainty(unc_data)
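For orientation, the "Poisson errors" label usually refers to the statistical uncertainty on event counts. A minimal sketch of the common symmetric approximation, where the error on a count N is sqrt(N). This is an assumption for illustration: the errors stored in the ROOT histogram may have been computed differently, and `sqrt_n_errors` is a hypothetical helper with made-up counts.

```python
import math


def sqrt_n_errors(counts):
    """Return symmetric sqrt(N) errors for a list of non-negative counts."""
    errors = []
    for n in counts:
        if n < 0:
            raise ValueError("Counts must be non-negative, got {}".format(n))
        errors.append(math.sqrt(n))
    return errors


# Hypothetical bin contents, not taken from the analysis
errors = sqrt_n_errors([0, 4, 25, 100])
print(errors)
```

For small counts, the sqrt(N) approximation is known to be crude, so asymmetric intervals are often preferred there.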

In [26]:
table.add_variable(mmed)
table.add_variable(sig)
table.add_variable(totalbackground)
#table.add_variable(tt)
#table.add_variable(qcd)
#table.add_variable(zjets)
#table.add_variable(wjets)
table.add_variable(data)

submission.add_table(table)
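Table names have to be unique within a submission; the validation step reports duplicates as an error. A minimal sketch of a check you could run before adding further tables, assuming the names are available as plain strings. `find_duplicate_names` is a hypothetical helper, not part of `hepdata_lib`.

```python
from collections import Counter


def find_duplicate_names(names):
    """Return a sorted list of the names that occur more than once."""
    counts = Counter(names)
    return sorted(name for name, n in counts.items() if n > 1)


duplicates = find_duplicate_names(["Cut-flow table", "Figure 4a", "Figure 4a"])
print(duplicates)
```

In the notebook, the list of names could be collected from `submission.tables`, assuming each `Table` exposes its name as a `name` attribute.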

### Muon channel

In [None]:
from hepdata_lib import Table
# Table names must be unique; this is panel (b) of Figure 4
table = Table("Figure 4b")
# NOTE: \mllj and \mmqq are custom LaTeX macros from the paper; they should be expanded to plain LaTeX before upload.
table.description = r"Distributions of \mllj for the data, and the pre-fit backgrounds (stacked histograms), in the SRs of the \mmqq channel. The template for one signal hypothesis is shown overlaid as a yellow solid line. The overflow is included in the last bin. The middle panels show ratios of the data to the pre-fit background prediction and post-fit background yield as red open squares and blue points, respectively. The gray band in the middle panels indicates the systematic component of the post-fit uncertainty. The lower panels show the distributions of the pulls, defined in the text."
table.location = "Data from Figure 4 (upper right)."
table.keywords["observables"] = ["N"]
table.add_image("Figure_004-b.pdf")


from hepdata_lib import RootFileReader
# All three histograms live in the same file, so one reader is enough
reader = RootFileReader("mumujj_PostFit_histograms_L13_M05.root")
TotalBackground = reader.read_hist_1d("prefit/TotalBkg")
Data = reader.read_hist_1d("prefit/data_obs")
signal = reader.read_hist_1d("prefit/TotalSig")


from hepdata_lib import Variable, Uncertainty
# Re-create the variables so that this table does not reuse the electron-channel objects.
# The x-axis name is chosen by analogy with the electron channel.
mmed = Variable(r"$m(\mu\mu J)$", is_independent=True, is_binned=False, units="TeV")
mmed.values = signal["x"]
sig = Variable("Number of signal events", is_independent=False, is_binned=False, units="")
sig.values = signal["y"]
totalbackground = Variable("Number of background events", is_independent=False, is_binned=False, units="")
totalbackground.values = TotalBackground["y"]
data = Variable("Number of data events", is_independent=False, is_binned=False, units="")
data.values = Data["y"]

# Use the bin errors ("dy"), not the bin contents ("y")
unc_totalbackground = Uncertainty("total uncertainty", is_symmetric=True)
unc_totalbackground.values = TotalBackground["dy"]
unc_data = Uncertainty("Poisson errors", is_symmetric=True)
unc_data.values = Data["dy"]
totalbackground.add_uncertainty(unc_totalbackground)
data.add_uncertainty(unc_data)


table.add_variable(mmed)
table.add_variable(sig)
table.add_variable(totalbackground)
table.add_variable(data)
submission.add_table(table)

## Output file

Now it's time to create the submission for the upload. Here, we choose `HN_output` as the output directory:

In [29]:
outdir = "HN_output"
submission.create_files(outdir,remove_old=True)

Note that bins with zero content should preferably be omitted completely from the HEPData table.

In the working directory, you will now find a `submission.tar.gz` file, which you can use for uploading to your HEPData sandbox:

In [None]:
!ls submission.tar.gz

And the `HN_output` directory will contain the generated `yaml` files:

In [None]:
!ls HN_output
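The note printed when creating the files recommends omitting bins with zero content. A minimal sketch of one way to filter such bins before assigning the values to the variables, assuming the bin centres, contents and errors are available as parallel lists. `drop_zero_bins` is a hypothetical helper, not part of `hepdata_lib`, and the numbers are made up.

```python
def drop_zero_bins(x, y, dy):
    """Remove the bins whose content is zero from three parallel lists."""
    kept = [(xi, yi, dyi) for xi, yi, dyi in zip(x, y, dy) if yi != 0]
    if not kept:
        return [], [], []
    # Transpose the surviving (x, y, dy) triples back into three lists
    xs, ys, dys = (list(col) for col in zip(*kept))
    return xs, ys, dys


# Hypothetical bin centres, contents and errors
x, y, dy = drop_zero_bins([0.5, 1.0, 1.5], [12.0, 0.0, 3.0], [3.5, 0.0, 1.7])
print(x, y, dy)
```

When a table holds several dependent variables, the same bins have to be removed from all of them so that the rows stay aligned. The sketch does not decide what to do when only some of the variables are zero in a given bin.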