### Comparing a set of datasets with the dataset popularity listing

This is an attempt to use the dataset popularity listing to select the most relevant samples from a list of MC samples proposed as legacy.

In [1]:
import pandas as pd
from pandas import json_normalize  # pandas >= 1.0; the pandas.io.json import path is deprecated

The dataset listing is from CMSDAS (https://cmsweb.cern.ch/das/request?view=plain&limit=50&instance=prod%2Fglobal&input=dataset%3D%2F*%2F*START42*%2FAODSIM).

Read it into a dataframe.

In [2]:
datasets = pd.read_csv('datasets.txt', sep=" ", header=None)

In [3]:
datasets.head()

Unnamed: 0,0
0,/2B1Jet_TuneZ2_7TeV-alpgen-pythia6/Fall11-PU_S...
1,/2B2C1Jet_TuneZ2_7TeV-alpgen-pythia6/Fall11-PU...
2,/2B2C_TuneZ2_7TeV-alpgen-pythia6/Fall11-PU_S6_...
3,/2B2Jets_TuneZ2_7TeV-alpgen-pythia6/Fall11-PU_...
4,/2B3Jets_TuneZ2_7TeV-alpgen-pythia6/Fall11-PU_...


The dataset popularity JSON file is from https://cmsweb.cern.ch/popdb/popularity/dataSetTable

Read it into a dataframe.

In [4]:
datapop = pd.read_json('datasets_start2010-01-01_stop2012-03-31.json')

In [5]:
datapop.head()


Unnamed: 0,DATA,SITENAME
0,"{'NACC': 2900135, 'TOTCPU': 3296253, 'NUSERS':...",summary
1,"{'NACC': 2026464, 'TOTCPU': 2056789, 'NUSERS':...",summary
2,"{'NACC': 1026045, 'TOTCPU': 2053499, 'NUSERS':...",summary
3,"{'NACC': 1262370, 'TOTCPU': 1915483, 'NUSERS':...",summary
4,"{'NACC': 1537482, 'TOTCPU': 1765946, 'NUSERS':...",summary


It has two columns, DATA and SITENAME, but only the DATA column is relevant in this context:

In [6]:
datapop['DATA']

0        {'NACC': 2900135, 'TOTCPU': 3296253, 'NUSERS':...
1        {'NACC': 2026464, 'TOTCPU': 2056789, 'NUSERS':...
2        {'NACC': 1026045, 'TOTCPU': 2053499, 'NUSERS':...
3        {'NACC': 1262370, 'TOTCPU': 1915483, 'NUSERS':...
4        {'NACC': 1537482, 'TOTCPU': 1765946, 'NUSERS':...
5        {'NACC': 987978, 'TOTCPU': 1167262, 'NUSERS': ...
6        {'NACC': 923959, 'TOTCPU': 1059471, 'NUSERS': ...
7        {'NACC': 2851106, 'TOTCPU': 966514, 'NUSERS': ...
8        {'NACC': 500687, 'TOTCPU': 922797, 'NUSERS': 1...
9        {'NACC': 15017490, 'TOTCPU': 900783, 'NUSERS':...
10       {'NACC': 688387, 'TOTCPU': 861974, 'NUSERS': 8...
11       {'NACC': 657485, 'TOTCPU': 746178, 'NUSERS': 1...
12       {'NACC': 329100, 'TOTCPU': 712978, 'NUSERS': 6...
13       {'NACC': 391693, 'TOTCPU': 689265, 'NUSERS': 8...
14       {'NACC': 284142, 'TOTCPU': 634980, 'NUSERS': 1...
15       {'NACC': 437517, 'TOTCPU': 567978, 'NUSERS': 1...
16       {'NACC': 243920, 'TOTCPU': 556346, 'NUSERS': 2.

In [7]:
datapop['DATA'][0]


{'COLLNAME': '/WJetsToLNu_TuneZ2_7TeV-madgraph-tauola/Summer11-PU_S4_START42_V11-v1/AODSIM',
 'NACC': 2900135,
 'NUSERS': 3667,
 'RNACC': '2.7',
 'RNUSERS': '1.3',
 'RTOTCPU': '5.2',
 'TOTCPU': 3296253}

`datapop` is a dataframe, but not yet in a format suited to data operations. Each element of DATA is a dict holding the information we need; we want its keys (NACC, NUSERS, etc.) as column headers.

To get a dataframe in a useful format,
see https://stackoverflow.com/questions/29681906/python-pandas-dataframe-from-series-of-dict

In [8]:
type(datapop)

pandas.core.frame.DataFrame

The DATA column of the dataframe is a series:

In [9]:
type(datapop['DATA'])

pandas.core.series.Series

Each element in DATA is a dict:

In [10]:
type(datapop["DATA"].iloc[0])  # .ix was removed in pandas 1.0; .iloc selects by position

dict

We want the keys of this dict as column headers:

In [11]:
datapop["DATA"].iloc[0].keys()  # .ix was removed in pandas 1.0; .iloc selects by position

dict_keys(['NACC', 'TOTCPU', 'NUSERS', 'COLLNAME', 'RNACC', 'RNUSERS', 'RTOTCPU'])

This command, taken from the Stack Overflow answer linked above, builds a dataframe from a series of dicts:

In [12]:
new_df = pd.DataFrame(list(datapop['DATA']))
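
The conversion above can be tried on a small scale. This is a minimal sketch, not part of the original notebook: the first record is copied from the `datapop['DATA'][0]` output, the second is a made-up placeholder, and the names `records`, `from_list` and `from_normalize` are hypothetical. It also shows `pd.json_normalize` (pandas 1.0 or later) as one possible alternative, which should give the same columns for flat dicts like these.

```python
import pandas as pd

# Two records shaped like the elements of datapop['DATA'].
# The first is copied from the output above; the second is a made-up placeholder.
records = pd.Series([
    {'COLLNAME': '/WJetsToLNu_TuneZ2_7TeV-madgraph-tauola/Summer11-PU_S4_START42_V11-v1/AODSIM',
     'NACC': 2900135, 'NUSERS': 3667, 'RNACC': '2.7',
     'RNUSERS': '1.3', 'RTOTCPU': '5.2', 'TOTCPU': 3296253},
    {'COLLNAME': '/Hypothetical/Example-v1/AODSIM',
     'NACC': 10, 'NUSERS': 2, 'RNACC': '0.0',
     'RNUSERS': '0.0', 'RTOTCPU': '0.0', 'TOTCPU': 5},
])

# Option 1: build the frame from a list of dicts, as in the cell above.
from_list = pd.DataFrame(list(records))

# Option 2: json_normalize, which would also flatten nested dicts if there were any.
from_normalize = pd.json_normalize(records.tolist())

print(from_list.columns.tolist())
```

For flat records either option works; `json_normalize` only becomes the better choice if the dicts contain nested structures.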

In [13]:
new_df.head()

Unnamed: 0,COLLNAME,NACC,NUSERS,RNACC,RNUSERS,RTOTCPU,TOTCPU
0,/WJetsToLNu_TuneZ2_7TeV-madgraph-tauola/Summer...,2900135,3667,2.7,1.3,5.2,3296253
1,/DYJetsToLL_TuneZ2_M-50_7TeV-madgraph-tauola/S...,2026464,3786,1.9,1.3,3.3,2056789
2,/TTJets_TuneZ2_7TeV-madgraph-tauola/Fall11-PU_...,1026045,948,0.9,0.3,3.2,2053499
3,/SingleMu/Run2011B-PromptReco-v1/AOD,1262370,2238,1.2,0.8,3.0,1915483
4,/SingleMu/Run2011A-PromptReco-v4/AOD,1537482,3571,1.4,1.3,2.8,1765946
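
In the single-element output further up, RNACC, RNUSERS and RTOTCPU appear as quoted strings (for example `'2.7'`), so these columns probably hold text rather than numbers. If they are needed for sorting or arithmetic, they can be converted first. The following is a sketch under that assumption, using a made-up two-row frame (`small` and `ratio_cols` are hypothetical names):

```python
import pandas as pd

# Made-up frame with the ratio columns stored as strings, as in the JSON records.
small = pd.DataFrame({
    'COLLNAME': ['/Hypothetical/A-v1/AODSIM', '/Hypothetical/B-v1/AODSIM'],
    'RNACC': ['2.7', '1.9'],
    'RNUSERS': ['1.3', '1.3'],
    'RTOTCPU': ['5.2', '3.3'],
})

ratio_cols = ['RNACC', 'RNUSERS', 'RTOTCPU']

# Convert each ratio column to a numeric dtype; unparsable values become NaN.
small[ratio_cols] = small[ratio_cols].apply(pd.to_numeric, errors='coerce')

print(small.dtypes)
```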


Sort the dataframe by the number of users:

In [14]:
new_df.sort_values('NUSERS',ascending=False)

Unnamed: 0,COLLNAME,NACC,NUSERS,RNACC,RNUSERS,RTOTCPU,TOTCPU
1,/DYJetsToLL_TuneZ2_M-50_7TeV-madgraph-tauola/S...,2026464,3786,1.9,1.3,3.3,2056789
0,/WJetsToLNu_TuneZ2_7TeV-madgraph-tauola/Summer...,2900135,3667,2.7,1.3,5.2,3296253
4,/SingleMu/Run2011A-PromptReco-v4/AOD,1537482,3571,1.4,1.3,2.8,1765946
7,/SingleMu/Run2011A-May10ReReco-v1/AOD,2851106,3082,2.6,1.1,1.5,966514
3,/SingleMu/Run2011B-PromptReco-v1/AOD,1262370,2238,1.2,0.8,3.0,1915483
18,/DoubleMu/Run2011A-PromptReco-v4/AOD,502407,1959,0.5,0.7,0.8,511015
11,/DoubleElectron/Run2011A-PromptReco-v4/AOD,657485,1942,0.6,0.7,1.2,746178
23,/TTJets_TuneZ2_7TeV-madgraph-tauola/Summer11-P...,329509,1898,0.3,0.7,0.7,434610
40,/DoubleMu/Run2011A-May10ReReco-v1/AOD,680902,1739,0.6,0.6,0.5,312152
32,/DoubleElectron/Run2011A-May10ReReco-v1/AOD,891548,1571,0.8,0.6,0.6,351559
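
Note that `sort_values` returns a new dataframe and leaves the original untouched unless the result is assigned. When only the top few rows are of interest, `nlargest` is a possible shortcut. A small sketch with made-up names (`demo`, `ranked`, `top2`) and placeholder dataset paths:

```python
import pandas as pd

# Made-up frame; the dataset names are placeholders, not real CMS datasets.
demo = pd.DataFrame({
    'COLLNAME': ['/Hypothetical/A', '/Hypothetical/B', '/Hypothetical/C'],
    'NUSERS': [948, 3786, 3667],
})

# sort_values returns a sorted copy; demo keeps its original row order.
ranked = demo.sort_values('NUSERS', ascending=False)

# nlargest returns only the n rows with the highest values in the given column.
top2 = demo.nlargest(2, 'NUSERS')

print(top2)
```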


Next, work out how to look up the dataset names from `datasets` in this popularity listing:

In [15]:
new_df.loc[new_df['COLLNAME'] == "/DY1JetToLL_M-10To50_TuneZ2_7TeV-madgraph/Fall11-PU_S6_START42_V14B-v1/AODSIM"]

Unnamed: 0,COLLNAME,NACC,NUSERS,RNACC,RNUSERS,RTOTCPU,TOTCPU
1728,/DY1JetToLL_M-10To50_TuneZ2_7TeV-madgraph/Fall...,932,9,0.0,0.0,0.0,765


In [25]:
datasets[0][1]

'/2B2C1Jet_TuneZ2_7TeV-alpgen-pythia6/Fall11-PU_S6_START42_V14B-v1/AODSIM'

In [29]:
new_df.loc[new_df['COLLNAME'] == datasets[0][1]]

Unnamed: 0,COLLNAME,NACC,NUSERS,RNACC,RNUSERS,RTOTCPU,TOTCPU


In [28]:
datasets.loc[datasets[0] == "/DY1JetToLL_M-10To50_TuneZ2_7TeV-madgraph/Fall11-PU_S6_START42_V14B-v1/AODSIM"]

Unnamed: 0,0
154,/DY1JetToLL_M-10To50_TuneZ2_7TeV-madgraph/Fall...


In [30]:
new_df.loc[new_df['COLLNAME'] == datasets[0][154]]

Unnamed: 0,COLLNAME,NACC,NUSERS,RNACC,RNUSERS,RTOTCPU,TOTCPU
1728,/DY1JetToLL_M-10To50_TuneZ2_7TeV-madgraph/Fall...,932,9,0.0,0.0,0.0,765
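
The lookups above check one dataset name at a time. One way to match every entry of `datasets` in a single step would be a left merge on the dataset name. The sketch below is an assumption about how that could look, not something the notebook does; it uses small made-up frames (`datasets_demo`, `pop_demo`) with placeholder names in place of the real `datasets` and `new_df`.

```python
import pandas as pd

# Stand-in for `datasets`: one unnamed column (label 0) of dataset names.
datasets_demo = pd.DataFrame({0: [
    '/Hypothetical/A-v1/AODSIM',
    '/Hypothetical/B-v1/AODSIM',
    '/Hypothetical/C-v1/AODSIM',
]})

# Stand-in for `new_df`: popularity records keyed by COLLNAME.
pop_demo = pd.DataFrame({
    'COLLNAME': ['/Hypothetical/B-v1/AODSIM',
                 '/Hypothetical/D-v1/AODSIM',
                 '/Hypothetical/A-v1/AODSIM'],
    'NACC': [932, 50, 10],
    'NUSERS': [9, 4, 1],
})

# Left merge keeps every proposed dataset; indicator=True adds a `_merge`
# column saying whether a popularity record was found for it.
matched = (
    datasets_demo.rename(columns={0: 'COLLNAME'})
    .merge(pop_demo, on='COLLNAME', how='left', indicator=True)
)

# Datasets with a popularity record, most users first.
found = matched[matched['_merge'] == 'both'].sort_values('NUSERS', ascending=False)

# Datasets that never appear in the popularity listing.
missing = matched.loc[matched['_merge'] == 'left_only', 'COLLNAME']

print(found[['COLLNAME', 'NACC', 'NUSERS']])
print(missing.tolist())
```

A left merge is used here so that datasets without any popularity record, like the empty result for `datasets[0][1]` above, stay visible instead of silently dropping out.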
