## Magic function for graphs (matplotlib library)

In [1]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


## NumPy

NumPy is a Python library that supports large multi-dimensional arrays and matrices and provides high-level mathematical functions for processing such data.

In [2]:
import numpy

### create numpy array

In [None]:
x = numpy.array(range(10))
print x, type(x), x.dtype, x.shape

In [None]:
y = numpy.array( [(1.5, 2, 3), (4, 5, 6) ])
print y, type(y), y.dtype, y.shape

In [None]:
# slice: array elements with indices 3 and 4
print x[3:5]
# array elements with step=3
print x[::3]
# array elements with some boolean condition
print x[x%4 ==0]
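
As a small illustration of the three indexing styles used in the cell above, the sketch below applies them to a ten-element array. The variable names are chosen for this example only.

```python
import numpy

x = numpy.arange(10)             # same values as numpy.array(range(10))

window = x[3:5]                  # slice: indices 3 and 4 (the stop index is excluded)
every_third = x[::3]             # every third element, starting at index 0
multiples_of_4 = x[x % 4 == 0]   # boolean mask: keep elements where the condition is True

print(window.tolist())           # [3, 4]
print(every_third.tolist())      # [0, 3, 6, 9]
print(multiples_of_4.tolist())   # [0, 4, 8]
```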

### linspace: create a 1D mesh

In [None]:
points = numpy.linspace(0, 1, num=100)
print points
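
With 100 points the printed mesh is hard to read, so here is the same call with a smaller `num` (an illustrative choice). Note that both endpoints are included by default.

```python
import numpy

coarse = numpy.linspace(0, 1, num=5)   # 5 evenly spaced points from 0 to 1, endpoints included

print(coarse.tolist())                 # [0.0, 0.25, 0.5, 0.75, 1.0]
```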

### simple functions

In [None]:
points.max(), points.min(), points.mean(), points.var(), points.sum()

### generate random values and plot histograms

In [None]:
random_normal = numpy.random.normal(size=10000, loc=50, scale=5)
random_exp = numpy.random.exponential(size=5000, scale=5)
hist(random_normal, bins=30, color='red', label='normal')
hist(random_exp, bins=30, color='blue', label='exp')
legend()

In [None]:
random_floats = numpy.random.random(size=10000)
random_classes = numpy.random.random(size=10000) > 0.5
hist(random_floats[random_classes == True], bins=30, color='red', label='greater')
hist(random_floats[random_classes == False], bins=30, color='blue', label='lower')
legend()

In [None]:
hist(random_floats[random_classes == True], bins=30, color='red', label='greater', histtype='step')
hist(random_floats[random_classes == False], bins=30, color='blue', label='lower', histtype='step')
legend()

### sort elements

In [None]:
# return new sorted array
numpy.sort(random_floats)

### argsort 
Returns the indices that would sort an array

In [None]:
random_floats.argsort()
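
A minimal sketch of what `argsort` returns, using a three-element array made up for this example: indexing the array with the returned indices gives the sorted array.

```python
import numpy

values = numpy.array([0.7, 0.1, 0.4])

order = values.argsort()        # indices that would sort `values`
sorted_values = values[order]   # indexing with those indices yields the sorted array

print(order.tolist())           # [1, 2, 0]
print(sorted_values.tolist())   # [0.1, 0.4, 0.7]
```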

In [None]:
# sort the array in place; argsort then returns 0, 1, 2, ...
random_floats.sort()
random_floats.argsort()

### cumulative sum

In [None]:
random_floats[random_classes == True].cumsum(axis=0)
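
For reference, a tiny sketch of `cumsum` on hand-picked values: element `i` of the result is the sum of the input elements up to and including index `i`.

```python
import numpy

values = numpy.array([1, 2, 3, 4])

running_total = values.cumsum()   # 1, 1+2, 1+2+3, 1+2+3+4

print(running_total.tolist())     # [1, 3, 6, 10]
```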

In [None]:
plot(range(numpy.sum(random_classes == True)), random_floats[random_classes == True].cumsum(axis=0))
xlabel('points')
ylabel('cumsum')

### searchsorted
Find indices where elements should be inserted to maintain order

In [None]:
bins = numpy.linspace(0.1, 0.9, 6)
print bins
bins_index = numpy.searchsorted(bins, random_floats)
print random_floats
print bins_index

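Note: index 0 collects the values below the first edge (0.1), and the last index, 6, collects the values above the last edge (0.9); indices 1 to 5 correspond to the intervals between consecutive edges.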
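
To make the index convention concrete, the sketch below runs `numpy.searchsorted` on a few hand-picked values (chosen for illustration) against the same six bin edges.

```python
import numpy

edges = numpy.linspace(0.1, 0.9, 6)             # 0.1, 0.26, 0.42, 0.58, 0.74, 0.9
samples = numpy.array([0.05, 0.3, 0.5, 0.95])   # illustrative values

# index i means edges[i-1] < sample <= edges[i];
# 0 is "below the first edge", len(edges) is "above the last edge"
index = numpy.searchsorted(edges, samples)

print(index.tolist())                           # [0, 2, 3, 6]
```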

### bincount 
Count number of occurrences of each value in array of non-negative ints

In [None]:
# count events with label 1 in each bin 
print numpy.bincount(bins_index, weights=random_classes)
# count events with label 0 in each bin
print numpy.bincount(bins_index, weights=1 - random_classes)
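
The `weights` trick above can be checked by hand on a small example. In this sketch the bin indices and labels are made up; with weights equal to the labels, `bincount` sums the labels in each bin, which is the number of label-1 events.

```python
import numpy

bin_index = numpy.array([0, 1, 1, 2, 2, 2])                    # bin assigned to each event
labels = numpy.array([True, False, True, True, True, False])   # class label of each event

label_weights = labels.astype(float)                           # 1.0 for label 1, 0.0 for label 0

ones_per_bin = numpy.bincount(bin_index, weights=label_weights)        # label-1 events per bin
zeros_per_bin = numpy.bincount(bin_index, weights=1 - label_weights)   # label-0 events per bin

print(ones_per_bin.tolist())    # [1.0, 1.0, 2.0]
print(zeros_per_bin.tolist())   # [0.0, 1.0, 1.0]
```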

### percentile

Compute the qth percentile of the data

In [None]:
q10 = numpy.percentile(random_floats, 10)
print 'value', q10
print 'check percentile', 1. * numpy.sum(random_floats < q10) / len(random_floats)
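
The same check can be done on deterministic data, where the answer is known in advance. This is a minimal sketch with an evenly spaced array standing in for the random sample.

```python
import numpy

data = numpy.arange(1000) / 1000.0        # 0.000, 0.001, ..., 0.999

q10 = numpy.percentile(data, 10)          # value with about 10% of the data below it
fraction_below = numpy.mean(data < q10)   # mean of a boolean array = fraction of True values

print(fraction_below)                     # 0.1
```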

## Exercise

### Generate a `numpy.array` with shape (10, 10) and calculate min, max, mean, var along each axis

Hint: use `numpy.reshape` and the `axis` parameter of `min` (and the other functions)
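
To illustrate the hint without solving the exercise, the sketch below uses a smaller (2, 3) array: `axis=0` reduces over rows and gives one value per column, while `axis=1` reduces over columns and gives one value per row.

```python
import numpy

table = numpy.arange(6).reshape(2, 3)   # [[0, 1, 2],
                                        #  [3, 4, 5]]

col_min = table.min(axis=0)     # one value per column
row_max = table.max(axis=1)     # one value per row
row_mean = table.mean(axis=1)   # mean of each row
col_var = table.var(axis=0)     # variance of each column

print(col_min.tolist())    # [0, 1, 2]
print(row_max.tolist())    # [2, 5]
print(row_mean.tolist())   # [1.0, 4.0]
print(col_var.tolist())    # [2.25, 2.25, 2.25]
```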

### Generate predictions for class 1 from a normal + uniform pdf and predictions for class 0 from an exponential + uniform pdf; the size of each class is 10000

Hint: use `numpy.concatenate` function 
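
A minimal sketch of how `numpy.concatenate` joins two samples into one array. The distribution parameters, the sizes and the fixed seed are assumptions made for this illustration, not values prescribed by the exercise.

```python
import numpy

rng = numpy.random.RandomState(0)                 # fixed seed so the sketch is reproducible

peak = rng.normal(loc=0.7, scale=0.1, size=500)   # hypothetical normal component
flat = rng.uniform(0.0, 1.0, size=500)            # uniform component

mixture = numpy.concatenate([peak, flat])         # one 1D array: first `peak`, then `flat`

print(mixture.shape)                              # (1000,)
```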

### Plot histograms of predictions for each class, bins=100

### For the grid of bin edges [0, 20-p, 40-p, 60-p, 80-p, 1] (where X-p is the X-th percentile of the 1-labeled data), compute the following metrics:
* efficiency for each class
* s / sqrt(b + 10), where s is the number of 1-labeled events and b is the number of 0-labeled events

### Plot each metric as a function of the bin

## `.csv` and `.root` data formats

In [None]:
import pandas

In [None]:
dict_data = {'id': range(100), 'random': numpy.random.normal(size=100), 'sin': numpy.sin(numpy.random.random(100))}
df = pandas.DataFrame(dict_data)
df.head()

### Comma-separated values (CSV) 

This data format is widespread among data scientists. CSV divides data into fields separated by a delimiter (a comma in this case, although the tab delimiter '\t' is also common). Data are represented as a table: rows are objects (events in HEP), and columns are fields or features (branches in HEP).
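
To show what the table layout looks like as text, here is a small sketch that parses a CSV string with the standard-library `csv` module. The column names mirror the DataFrame above, and the values are made up.

```python
import csv

text = "id,random,sin\n0,0.5,0.1\n1,-1.2,0.8\n"   # header line followed by two rows

reader = csv.reader(text.splitlines(), delimiter=',')
header = next(reader)           # first line holds the column (feature) names
rows = list(reader)             # remaining lines are the objects; fields are read as strings

print(header)                   # ['id', 'random', 'sin']
print(rows)                     # [['0', '0.5', '0.1'], ['1', '-1.2', '0.8']]
```

Switching to tab-separated data only requires changing the `delimiter` argument to `'\t'`.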

#### save data into `.csv`

In [None]:
df.to_csv('example.csv', sep=',', index=False)

#### read `.csv` file into `pandas.DataFrame`

In [None]:
pandas.read_csv('example.csv', sep=',').head()

### ROOT format

ROOT provides a machine-independent, compressed binary file format that stores both the data and its description. It also provides an open-source automated tool that generates the data description (or "dictionary") when saving data, and generates C++ classes corresponding to this description when reading the data back. The dictionary is used to build and load the C++ code that loads the binary objects saved in the ROOT file and stores them in instances of the automatically generated C++ classes.

Details: https://root.cern.ch/drupal/content/root-files-1

### `root_numpy` - interface between ROOT and NumPy
http://rootpy.github.io/root_numpy

In [None]:
import root_numpy

#### create `.root` from `pandas.DataFrame`

In [None]:
root_numpy.array2root(df.to_records(), 'example.root', treename='data', mode='recreate')

#### read from `.root` to `numpy.array`

In [None]:
root_data = root_numpy.root2array('example.root', treename='data', branches=['random', 'sin'], 
                                  selection='random > 0.')

In [None]:
root_data

In [None]:
# Note that the numpy.array here has a structured dtype with named fields
root_data.dtype
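
An array with named fields can also be built directly in NumPy, which is handy for experimenting without a ROOT file. The sketch below uses made-up values and the same two field names, and shows access by field name and row selection with a boolean mask.

```python
import numpy

records = numpy.array([(0.5, 0.1), (-1.2, 0.8), (2.0, 0.3)],
                      dtype=[('random', 'f8'), ('sin', 'f8')])   # two float64 fields

random_column = records['random']              # access a column by field name
positive = records[records['random'] > 0]      # keep rows where 'random' is positive

print(records.dtype.names)                     # ('random', 'sin')
print(random_column.tolist())                  # [0.5, -1.2, 2.0]
print(len(positive))                           # 2
```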

In [None]:
print len(root_data)
hist(root_data['random'], bins=30)
pass

#### convert to `pandas.DataFrame`

In [None]:
pandas.DataFrame(root_data).head()

## Exercise


### Read `example.root` with `id > 10 and (random + sin > 1.4 or random * sin > 0.5)`

### Plot scatter  `random + sin` vs `random * sin` 

Hint: use scatter function from matplotlib