# Lecture 2

## Outline
* ## Basic UNIX operation
* ## Data type in python
    * ### Assignment and variable
    * ### Precision of data types
    * ### Formatting output with string
* ## Input and Output
    * ### Keyboard input
    * ### Writing and reading files
        * ### ASCII file
        * ### Binary file
* ## Explorative tasks 

This lecture is paced for the roughly 70% of you who are new to Python. If you are already proficient with Python, you do not have to wait for me: go through these cells at your own pace. If you finish them while the lecture is still ongoing, the end of this notebook has a couple of explorative, open-ended tasks for you.

# Basic UNIX operations
![UNIX.png](attachment:UNIX.png)

## Let's check out the terminal from datahub
Open a new tab in your browser and go to https://datahub.berkeley.edu/
<p style="font-size:20px"> 
    On the upper right corner, click the button "New", in the dropdown, select "Terminal"
</p>
<p style="font-size:20px"> 
    Let's do a few operations: make a directory, rename/move a directory, do a wget, pwd.</p>
<p style="font-size:20px"> 
    How about the iPython magic % 
</p>

In [None]:
%pwd
%pip install uproot

### os - Miscellaneous operating system interfaces
This module provides a portable way of using operating system dependent functionality. 

Here I am using this module to run UNIX shell commands, which will be useful for handling files/directories, etc.

In [None]:
import os

In [None]:
os.system("pwd")

In [None]:
os.system(" echo $PWD ")

In [None]:
os.system("ls -ltr")

In [None]:
# Download a file from internet
os.system("wget https://portal.nersc.gov/project/m3438/physics77/2012.root")

In [None]:
# check out the size of the file
os.system("du -sh 2012.root")
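
The shell commands above only work where a UNIX shell is available. As a rough sketch of the portable route mentioned earlier, the same kind of operations (make a directory, rename it, list it, print the working directory) can be done with functions from `os` and `shutil`. The directory names below are made up for illustration, and everything happens inside a temporary directory.

```python
import os
import shutil
import tempfile

# Work inside a throwaway directory so nothing in your home area is touched.
workdir = tempfile.mkdtemp()

# mkdir: create a directory (the name "lecture2_demo" is just an illustration)
demo_dir = os.path.join(workdir, "lecture2_demo")
os.makedirs(demo_dir)

# mv: rename the directory
renamed_dir = os.path.join(workdir, "lecture2_renamed")
os.rename(demo_dir, renamed_dir)

# ls: list the contents of the working directory
contents = os.listdir(workdir)
print(contents)

# pwd: current working directory of the Python process
print(os.getcwd())

# rm -r: clean up the throwaway directory
shutil.rmtree(workdir)
```

Unlike `os.system`, these calls return Python objects (lists, strings) that you can use directly in later code.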

# Data type in Python

### Assignments and variables

A "variable" is a named handle to a piece of data that you can manipulate. 
This is a very common concept in programming! The actual implementation and properties depend on the language. In Python, there are 3 basic variable types: floats (real numbers), ints (integers), and strings (sequences of characters).

Giving names to values lets you write more general code, just like the transition from arithmetic to algebra!

In [None]:
import math as m
import cmath as cm



Define the solutions to a quadratic equation:

$ax^2+bx+c=0$

In [None]:
a = -5
b = +50
c = 22

D = b**2 - 4*a*c
x1 = (-b + m.sqrt(D))/(2*a)
x2 = (-b - m.sqrt(D))/(2*a)

print("Solutions to the quadratic equation",
    "are (to 3 decimal places): x_1={FirstArg:.3f}, x_2={SecondArg:.3f}".format(FirstArg=x1,SecondArg=x2))

# Let's play with a,b,c and see if we always get real solutions
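
If the discriminant becomes negative, `m.sqrt(D)` raises a `ValueError`. One possible way around this, sketched here with a hypothetical helper `solve_quadratic`, is to use the `cmath` module (already imported above as `cm`), whose `sqrt` returns complex numbers.

```python
import cmath

def solve_quadratic(a, b, c):
    """Return both roots of a*x**2 + b*x + c = 0 as complex numbers (a != 0)."""
    D = b**2 - 4*a*c
    sqrtD = cmath.sqrt(D)  # works for negative D, unlike math.sqrt
    return (-b + sqrtD) / (2*a), (-b - sqrtD) / (2*a)

# x**2 + 1 = 0 has no real solutions
x1, x2 = solve_quadratic(1, 0, 1)
print(x1, x2)

# The coefficients used above give real roots: the imaginary parts are zero
r1, r2 = solve_quadratic(-5, 50, 22)
print(r1, r2)
```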

## Basic built-in types of variables in Python
In programming, the data type is an important concept. Variables can store data of different types, and different types support different operations. Python has the following built-in data types, which the cells below walk through one by one:

In [None]:
x = 27
print(type(x))

In [None]:
# Let's define a function that prints out the value of x and its type
def Type(x):
    print(x, type(x))

In [None]:
x = 27.0
Type(x)

In [None]:
x = '27'
Type(x)

In [None]:
x = [27]
Type(x)

In [None]:
x = (27,)
Type(x)

In [None]:
x = (27)
Type(x)
# not a tuple

In [None]:
x = range(27)
Type(x)


In [None]:
x = {27:'Just a name'}
Type(x)

In [None]:
x = 27 > 5.0
Type(x)

In [None]:
x = {27,}
Type(x)

In [None]:
x = None
Type(x)

In [None]:
x = None > 0   # raises TypeError in Python 3: None cannot be ordered against an int
Type(x)

Over the course of the semester, we will be using all these built-in types available in Python

https://www.w3schools.com/python/python_datatypes.asp

![builtindatatype.png](attachment:builtindatatype.png)


### Binary and hexadecimal representation 

https://www.rapidtables.com/convert/number/decimal-to-hex.html

Decimal to Binary / Hexadecimal converter

In [None]:
x = 0b11011  # binary
Type(x)

In [None]:
x = 0x1b #hexadecimal 
Type(x)
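
Going the other way, Python's built-ins can print an integer in binary or hexadecimal and parse such strings back. A small sketch using only built-in functions:

```python
x = 27
print(bin(x))            # integer -> binary string
print(hex(x))            # integer -> hexadecimal string
print(int("11011", 2))   # parse a base-2 string
print(int("1b", 16))     # parse a base-16 string
print(format(x, "08b"))  # binary, zero-padded to 8 digits
```

Note that `0b11011`, `0x1b`, and `27` are three ways of writing the same `int`; the base only matters when reading or printing.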

### ASCII encoding 
ASCII, standing for "American Standard Code for Information Interchange", is a character encoding standard for electronic communication. ASCII codes represent text in computers, telecommunications equipment, and other devices. Most modern character-encoding schemes are based on ASCII, although they support many additional characters.
https://en.wikipedia.org/wiki/ASCII

In [None]:
x = ord('x')
Type(x)

x = chr(120)
Type(x)

x = chr(ord('x'))
Type(x)
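
The same idea applies to whole strings. As a small sketch (the sample text is arbitrary), each character maps to one integer code, and encoding the string as ASCII gives one byte per character:

```python
text = "Physics 77"
codes = [ord(ch) for ch in text]           # character -> integer code
print(codes)

restored = "".join(chr(n) for n in codes)  # integer code -> character
print(restored)

raw = text.encode("ascii")                 # bytes object, one byte per character
print(raw, len(raw))
```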


### Precision of data types

### Data on computer are represented in binary format
* This originates from how physical devices store data 
* Base-2 representation

$ (abcd)_2 = a*2^3 + b*2^2 + c*2^1 + d*2^0 $


Example: $1101_2$ → decimal?

$ 1101_2 = 1*2^{3} + 1*2^2 + 0*2^1 + 1*2^0 = 13$

* Smallest memory cell is a “bit” (b)
     * 8 bits = 1 byte  ; 8 b = 1 B
* Measure of memory
     * 1kB = 1024 bytes; 1 MB = 1024 kB, etc.





#### Why do we care?
#### The practical consequences of this are
* All numbers in digital formats are discrete, i.e., they have a finite precision
     * Integers, discrete by definition, e.g., π would be 3
* Real numbers: think about representations in powers of 2
     * 0.125  = ⅛  = $2^{-3}$
     * Could you represent 0.1 in powers of 2? How about π?
* Basic data types have max and min values, as well as finite precisions, determined by data type size
    * How much memory is allocated for each type of data
    * Most common: 4 or 8 bytes for integer and real values 
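
To make the question about 0.1 concrete, here is a short sketch using the standard `fractions` and `decimal` modules. It shows that 0.125 is stored exactly, while 0.1 is stored as the nearest fraction whose denominator is a power of 2.

```python
from decimal import Decimal
from fractions import Fraction

print((0.125).as_integer_ratio())  # exactly 1/8
print(Fraction(0.1))               # the binary fraction actually stored for 0.1
print(Decimal(0.1))                # the same stored value, written out in decimal
print(0.1 + 0.2 == 0.3)            # the small errors do not cancel
```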

#### Example: real numbers (floating-point numbers) are represented by three fields:
##### Sign, exponent, and fraction

![float32.png](attachment:float32.png)

see https://en.wikipedia.org/wiki/Single-precision_floating-point_format#IEEE_754_standard:_binary32 for more
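
The three fields can be pulled out of an actual number with the standard `struct` module. This is a minimal sketch: the helper name `float32_fields` is made up for this example, and the reconstruction formula at the end only holds for normalized numbers.

```python
import struct

def float32_fields(value):
    """Split a number, stored as IEEE 754 binary32, into (sign, exponent, fraction)."""
    # Pack as a big-endian 32-bit float, then reinterpret the 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31               # 1 bit
    exponent = (bits >> 23) & 0xFF  # 8 bits, stored with a bias of 127
    fraction = bits & 0x7FFFFF      # 23 bits
    return sign, exponent, fraction

sign, exponent, fraction = float32_fields(0.15625)
print(sign, exponent, fraction)

# Rebuild the value from the three fields (valid for normalized numbers only)
rebuilt = (-1)**sign * (1 + fraction / 2**23) * 2.0**(exponent - 127)
print(rebuilt)
```

Python's own `float` uses the 64-bit version of the same layout (1 sign bit, 11 exponent bits, 52 fraction bits).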

#### Now let's check out the size of a few variables with different types

We will import the sys module for this task

* sys - System-specific parameters and functions
    * https://docs.python.org/3/library/sys.html

In [None]:
import sys



In [None]:
x = 27.0
Type(x)

In [None]:
sys.getsizeof(x) 
# what's the unit of this value? Bytes or bits?

In [None]:
x = 27
Type(x)
sys.getsizeof(x)

An interesting thing about Python is that it allocates more bytes for an integer as needed, so there is no practical limit to the integer value


In [None]:

for i in range(280):
    x = 2**(i)
    print( i, sys.getsizeof(x))
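
A short sketch of what this means in practice: a large integer stays exact, while converting it to a float keeps only about 16 significant digits. The sizes reported by `sys.getsizeof` are specific to the CPython implementation.

```python
import sys

big = 2**200
print(big)                                   # exact, no overflow
print(big.bit_length())                      # number of bits needed to store the value
print(sys.getsizeof(1), sys.getsizeof(big))  # bytes used by the two int objects

exact_difference = (big + 1) - big           # integer arithmetic is exact
float_unchanged = (float(big) + 1) == float(big)  # the float cannot resolve +1
print(exact_difference, float_unchanged)
```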


In [None]:
x = 2**(100)*1.5
Type(x)
sys.getsizeof(x)


##### Largest representable float

In [None]:
sys.float_info.max


In [None]:
#how about 1.8e+308
print(1.8e+308)


In [None]:
#how about
1.0 +1.8e308/1.8e+308

##### Smallest positive normalized float

In [None]:
sys.float_info.min

In [None]:
sys.float_info.epsilon


In [None]:
x = sys.float_info.min*sys.float_info.epsilon
print(x)

In [None]:
print(x*0.5)

### Round off errors




In [None]:
a = 1e15-1
b = 1e15+1
print("Size of float:",sys.getsizeof(a))

d1 = (b**2-a**2)
d2 = (b-a)*(a+b)

print(b**2)
print(a**2)
print(b-a)
print(b+a)


print("Ratio = ",d1/d2)

In [None]:
import decimal as d
d.getcontext().prec = 24  # set precision to 24 significant digits

a = d.Decimal(1e15)-1
b = d.Decimal(1e15)+1
print("Size of Decimal object:",sys.getsizeof(a))

d1 = (b**2-a**2)
d2 = (b-a)*(a+b)

print("Ratio = ",d1/d2)

#### Sizes of composite data types (tuples, lists, arrays)

In [None]:
xlist = [1,2,3,4,5]
sys.getsizeof(xlist)

In [None]:
xtuple = (1,2,3,4,5)
sys.getsizeof(xtuple)

In [None]:
import numpy as np
xnp = np.zeros((5,))
sys.getsizeof(xnp)




In [None]:
# Does it scale with the number of elements?
xnp = np.zeros((50,1))
sys.getsizeof(xnp)



In [None]:
def size(x):
    return sys.getsizeof( np.zeros((int(x),1)))
vecsize = np.vectorize(size)
x = np.linspace(0,5000,51)
y = vecsize(x)

import matplotlib.pyplot as plt
plt.plot(x,y)
plt.xlabel("Number of elements")
plt.ylabel("Size of array [Bytes]")

#### Beware of floating point comparisons !
    

In [None]:
i = 1
print(type(i))
print(2 == i)

In [None]:
x = 11.00000000000001/10
y = 1.1
x == y


In [None]:
#how about 
Tolerance = 1e-6
abs(x - y) < Tolerance

In [None]:
np.pi

### In general, exact equality (==) between floating-point numbers is not guaranteed to behave as you expect. Behavior in Python may be very different from other languages, and may depend on the OS, compilers, etc. It is considered bad practice to use == comparisons on floating-point data. Preferably, you should check whether the difference between the two numbers is within a certain tolerance:

In [None]:
x = 3.1415926
Tolerance = 1e-6
print (abs(x-np.pi)<Tolerance)
Tolerance = 1e-10
print (abs(x-np.pi)<Tolerance)
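
The standard library offers `math.isclose` for this kind of check. The sketch below reuses the numbers from the cells above; the tolerances are arbitrary choices for illustration. Note that `math.isclose` uses a relative tolerance by default, which is useless when comparing against zero, so an absolute tolerance is passed in that case.

```python
import math

x = 11.00000000000001 / 10
y = 1.1
print(x == y)
print(math.isclose(x, y, rel_tol=1e-9))

approx_pi = 3.1415926
print(math.isclose(approx_pi, math.pi, rel_tol=1e-6))
print(math.isclose(approx_pi, math.pi, rel_tol=1e-10))

# Near zero a relative tolerance never matches; use an absolute one instead.
print(math.isclose(1e-12, 0.0))
print(math.isclose(1e-12, 0.0, abs_tol=1e-9))
```

For numpy arrays, `np.isclose` and `np.allclose` play the same role element by element.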

### Formatted output

Usually the data you manipulate has finite precision. You do not know it absolutely precisely, and therefore you should not report it with an arbitrary number of digits. One of the cardinal rules of a good science paper: round off all your numbers to the precision you know them (or care about) -- and no more ! 



#### Examples:

In [None]:
t = 21.35   # I only know 4 digits
print(t)   # OK, let Python handle it

In [None]:
print(t**2)
# do I know more than 15 significant digits now? 

##### Do I all of a sudden know so many significant digits? 
* No
* So let's round it when we print it

In [None]:
import math   # the cells above only imported math under the alias m

t2 = t**2
print('{a:5.2f}'.format(a=t2))
print('sqrt(t) = {val1:5.2f}, t**2 = {val2:5.2f}'.format(val1=math.sqrt(t),val2=t2))

In [None]:
print('t**2 = %5.2f' % t2)
print('sqrt(t) = %5.2f' % math.sqrt(t))

In [None]:
x = 26.2333
print ('sqrt(x) = {0:3.2e}, x**2 = {1:4.2f}'.format(np.sqrt(x),x**2))
print ('x**2 = {1:4.2f}, sqrt(x) = {0:3.2e}'.format(np.sqrt(x),x**2))
# note the difference between the two lines

print ('sqrt(x) = %3.2e, x**2 = %4.6f' % (np.sqrt(x), x**2))


##### String formatting: you can specify the length of the field and the justification. By default, strings are left-justified, and if the length specifier exceeds the length of the string, the string is padded with white space

In [None]:
name = 'Jones, Bill'   # avoid calling the variable str, which would shadow the built-in type
print('My name is {0:20s} and I like {1:s}'.format(name,'pineapples'))

In [None]:
print('My name is {0:^20s} and I like {1:s}'.format(name,'pineapples'))

In [None]:
print('My name is {0:>20s} and I like {1:>20s}'.format(name,'pineapples'))

Another cool feature of the `format()` method (in Python 3) is the ability to label parameters by readable names instead of numbers

In [None]:
print ('x={x:4.2f}, sqrt(x) = {sqrt:4.2f}, x**2 = {square:4.2f}'.format(x=x,sqrt=np.sqrt(x),square=x**2))

For more formatting options, see https://pyformat.info/
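
Since Python 3.6 the same format specifiers can also be used in f-strings, which place the expression directly inside the braces. A brief sketch, with variable names chosen for this example only:

```python
import math

x = 26.2333
name = "Jones, Bill"

# The part after the colon is the same format specifier as in str.format()
line1 = f"x = {x:4.2f}, sqrt(x) = {math.sqrt(x):4.2f}, x**2 = {x**2:4.2f}"
line2 = f"My name is {name:>20s} and I like {'pineapples':s}"
print(line1)
print(line2)
```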

A tricky example: treat '%' in the old-style formatting expression as an operator with the same precedence as multiplication and division, i.e. higher than + and -. It takes a string and another value, and returns a string. E.g. 

In [None]:
stringFormat = 'var = %5.3f'
Type(stringFormat)

In [None]:
out = stringFormat % 2.5
Type(out)

###### So far so good. But what if you want to do some operations inline, e.g.:

In [None]:
out = stringFormat % 2.5+3
Type(out)

This does not work, because % binds more tightly than +: it first formats the number 2.5 into the string, and the resulting string cannot be added to an integer (3).
So use parentheses when you are unsure about the precedence rules 

In [None]:
out = stringFormat % (2.5+3)
Type(out)

# Input and Output
Most of the time, your code will need to process external data -- either entered by a human (through a keyboard), or read from external media. *This is an example of abstraction*: you write generic code that is kept separately from the data. The same is true for the data generated by your code. You may want to display it on the screen or store it in a file.

Let's look at some basic examples

### Keyboard prompt

In [None]:
s = input('What is your name ? ')
print(len(s))
print (type(s), "Hello,", s)

##### note that what you typed is interpreted as a string
You may want to convert strings to numerical types in order to perform calculations:

In [None]:
age = input('What is your age ? ')
print (type(age))
number = int(age)
print (type(number), number)
nextYear = number+1
print('Next year you will be',nextYear)


##### This code may fail (see examples) if the user inputs something that can't be parsed as an integer. 
##### `Exception handling` 
allows you to catch errors and recover. See the following example, where we loop until the user enters a correctly parsable number

In [None]:
failed = True
while failed:
    age = input('What is your age ? ')
    try:
        number = int(age)
        print (type(number), number)
        nextYear = number+1
        print('Next year you will be',nextYear)
        failed = False
    except ValueError:   # int() raises ValueError for text it cannot parse
        print('Try again: please enter an integer value')
        failed = True
        
print('Success !')

# More on try except structure, check out https://www.w3schools.com/python/python_try_except.asp

A slightly more flexible way to do the conversion is to use <tt>eval()</tt>

In [None]:
import numpy as np
x = 5
age = input('What is your age ? ')
number = eval(age)
print (type(number))
print('You are',number,'years old')

#Try type a fractional number
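
Keep in mind that `eval()` executes whatever expression the user types, not just numbers. One possible alternative, sketched here with a hypothetical helper `parse_number`, is `ast.literal_eval` from the standard library, which only accepts Python literals. The helper additionally rejects anything that is not an int or a float.

```python
import ast

def parse_number(text):
    """Convert text to an int or a float; raise ValueError for anything else."""
    try:
        value = ast.literal_eval(text.strip())
    except (ValueError, SyntaxError):
        raise ValueError(f"not a number: {text!r}")
    # literal_eval also accepts lists, strings, booleans, ...; keep plain numbers only
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError(f"not a number: {text!r}")
    return value

print(type(parse_number("21")), type(parse_number("21.5")))
```

In an interactive cell it could be used as `number = parse_number(input('What is your age ? '))`, inside the same `try`/`except ValueError` loop as before.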

##### Most often, you would want to enter several values and parse them. Use string method *split()*:
But pay attention: the parsing is pretty rudimentary! (examples)

In [None]:
s = input('Enter coordinates (x,y,z):')
print(s)
[x,y,z] = s.split(',')  # what type of data does split return?

print ("x=",x,"y=",y,"z=",z)
print (type(x), type(y), type(z))

# more on string operations in python
# https://docs.python.org/3/library/stdtypes.html#string-methods

Sometimes you want to convert to float or int immediately, so you can use a list comprehension:

In [None]:
s = input('Enter coordinates (x,y,z):')
mylist = s.split(',')
print(mylist)
print(type(mylist))
print(type(mylist[0]))

#listOfFloats = [float(var) for var in mylist]
#print(listOfFloats)


[x,y,z] = [float(var) for var in mylist] 
print ("x=",x,"y=",y,"z=",z)
print (type(x), type(y), type(z))
print('x squared = ',x**2)

Of course, there is no need to type the string at the prompt:

In [None]:
s = '4312, 5.90, 12'
newlist = s.split(',')
for var in newlist:
    a = float(var)
    print(a)
    
[x,y,z] = [float(var) for var in newlist]
print(x,y,z)
print( type(x), type(y) , type(z))
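
Because the parsing with `split()` is rudimentary, it can help to wrap it in a small function that tolerates extra spaces and brackets and complains clearly otherwise. This is only a sketch; the function name `parse_coordinates` and the accepted input format are assumptions made for this example.

```python
def parse_coordinates(text):
    """Parse a string such as '(1, 2.5, -3)' into a tuple of three floats."""
    cleaned = text.strip().strip("()[]")   # drop surrounding brackets, if any
    fields = [field.strip() for field in cleaned.split(",")]
    if len(fields) != 3:
        raise ValueError(f"expected 3 values, got {len(fields)}: {text!r}")
    # float() itself raises ValueError for empty or non-numeric fields
    return tuple(float(field) for field in fields)

x, y, z = parse_coordinates(" (4312, 5.90, 12) ")
print(x, y, z)
```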

### Reading and writing files


### Part I: ASCII files

ASCII stands for American Standard Code for Information Interchange, a standard for encoding text and control characters in files that dates back to the 1960s. The standard is still in use today. 

* ##### Think of ASCII files as text files. 
    * You can open them using a text editor (like vim or emacs in Unix, Notepad in Windows, or TextEdit on a Mac) and read the information they contain directly. 
    * There are a few ways to produce these files, and to read them once they've been produced. 

* ##### In Python, the simplest way to handle ASCII is to use file objects. 

    * Let's give it a try. We create a file object by calling the function `open( filename, access_mode )` and assigning its return value to a variable (usually `f`). 
    This variable is often called a "file descriptor", or a "handle". It keeps information about the current state of the file, and also allows operating on the file, e.g. for reading or writing. 

The argument `filename` just specifies the name of the file we're interested in, and `access_mode` tells Python what we plan to do with that file:  
   * 'r': read the file  
   * 'w': write to the file (creates a new file, or clears an existing file)
   * 'a': append to the file  
     
Note that both arguments should be strings.
For full syntax and special arguments, see documentation at https://docs.python.org/3/library/functions.html#open

In [None]:
f = open( 'welcome.txt', 'w' )
print(f)    # see what we got

In [None]:
# Let's see what's created in your current directory 
%ls -ltr


In [None]:
f.write('One more line\n')
f.write('And another\n')

In [None]:
# a quick way to print out the content of an ASCII file
# in UNIX

os.system("cat welcome.txt")

# or simply %cat welcome.txt


In [None]:
os.system("ls -tlr welcome.txt")

##### The reason we didn't see anything is that the text is still held in a buffer: it is only guaranteed to reach the file once we close (or flush) it

In [None]:
f.close()

In [None]:
os.system("ls -tlr welcome.txt")

In [None]:
# a quick way to print out the content of an ASCII file
# in UNIX
os.system("cat welcome.txt")
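
The buffering seen above can be made visible from Python itself. The sketch below writes to a throwaway file (the file name is made up), checks its size before and after `flush()`, and then shows the `with` statement, which closes the file automatically at the end of the block.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "buffer_demo.txt")  # throwaway file

f = open(path, "w")
f.write("One more line\n")
size_before = os.path.getsize(path)   # typically 0: the text is still in the buffer
f.flush()                             # push the buffer to the operating system
size_after_flush = os.path.getsize(path)
f.close()
print(size_before, size_after_flush)

# The with-statement closes (and therefore flushes) the file automatically,
# even if an exception is raised inside the block.
with open(path, "a") as f:
    f.write("And another\n")

with open(path) as f:
    content = f.read()
print(content)
```

Using `with` is a common way to avoid forgetting `close()`.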

A note of caution: as soon as you call `open()` in write ('`w`') mode, Python creates a new file with the name you pass to it, and it will overwrite an existing file of the same name. For the next few exercises, we will overwrite the existing welcome.txt file

In [None]:
f = open('welcome.txt','w')

Now we can write to the file using `f.write( thing_to_write )`. 

We can write anything we want, but it must be formatted as *a string*.

In [None]:
topics = ['Data types', 'Loops', 'Functions', 'Arrays', 'Plotting', 'Statistics', 'Physics']

In [None]:
f.write( 'Welcome to Physics 77, Spring 2021\n' ) # the newline command \n tells Python to start a new line
f.write( '     Topics we will learn about include:\n' )
for top in topics:
    f.write( '         ' + top + '\n')
f.close()  

In [None]:
os.system('cat welcome.txt')

In [None]:
%cat welcome.txt

#### How about reading the content of the file with Python?

In [None]:
f = open( 'welcome.txt', 'r' )    # note that we reused the handle
print(f)

so f is an object; printing f doesn't print out the content of f

In [None]:
for line in f:
    print (line.strip())
f.close()

The strip() method removes leading and trailing characters from a string. By default it removes whitespace (spaces, tabs, and newlines); you can pass a string of other characters to remove instead.
https://www.w3schools.com/python/ref_string_strip.asp

In [None]:
f = open( 'welcome.txt', 'r' )    # note that we reused the handle

for line in f:
    print(line.upper()) # of course you have lower(), too
f.close()

In [None]:
f = open( 'welcome.txt', 'r' )    # note that we reused the handle

for line in f:
    print(line) 
f.close()

##### There are also shortcuts available, if we only want to read in some of the data:

In [None]:
f = open('welcome.txt','r')
print(f.readlines())





In [None]:
f.readline()   # returns '' here: readlines() above already read to the end of the file

In [None]:
# skipping two lines
with open('welcome.txt') as f:
    for _ in range(2):
        next(f)
    for line in f:
        print(line.strip())

What can you do if you want to add a few more items to a file that already has data and you don't want to overwrite it?

In [None]:
f = open( 'welcome.txt', 'a' )
print(f)    # see what we got

In [None]:
f.write( '         Machine learning\n' ) # the newline command \n tells Python to start a new line
f.close()

In [None]:
%cat welcome.txt

### Numerical data

For the most part, our text files will contain numeric information, not strings. These can be somewhat trickier to read in. Let's read in a file produced in another program, that contains results from a BaBar experiment, where we searched for a "dark photon" produced in e+e- collisions [https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.119.131804, https://arxiv.org/abs/1702.03327]. The data are presented in two columns: 

  mass charge
  
First, let's peek into the file using iPython magic (direct interface to Unix operating system):

In [None]:
os.system("wget https://portal.nersc.gov/project/m3438/physics77/Babar_2016.dat")
%ls -ltr

In [None]:
%cat Babar_2016.dat

In [None]:
fname = 'Babar_2016.dat'   # must match the downloaded file name exactly (file names are case sensitive)
f = open(fname, 'r')
# read each line, split the data wherever there's a blank space,
# and convert the values to floats
mass = []
charge = []
for line in f:
    mval, cval = [float(dat) for dat in line.split()]   # not named m, which is our alias for the math module
    mass.append(mval)
    charge.append(cval)
f.close()

print('Read',len(mass),'lines from file',fname)

In [None]:
# Let's plot mass vs charge
plt.plot(mass, charge, 'r-' )
plt.xlim(0, 8)
plt.ylim(0, 3e-3)
plt.xlabel('mass (GeV)')
plt.ylabel('charge, 90% C.L. limit')
plt.show()

The conversion from string to float is still pretty cumbersome, particularly when there are many more columns of data

Fortunately, Python's `numpy` library has functions for converting file information into numpy arrays, which can be easily analyzed and plotted. The above can be accomplished with a lot less code (and a lot less head scratching!)

The two most common functions to read tabulated text are numpy's `loadtxt` and `genfromtxt`. They are subtly different and mostly interchangeable. The most useful feature of `genfromtxt` is that it is able to assign default values to missing fields. See https://numpy.org/doc/stable/reference/generated/numpy.loadtxt.html and https://numpy.org/doc/stable/reference/generated/numpy.genfromtxt.html

In [None]:
import numpy as np
mass, charge = np.genfromtxt('Babar_2016.dat', unpack = True)   # same file name as downloaded above
print(type(mass))
plt.plot(mass, charge,'r-')
plt.xlim(0, 8)
plt.ylim(0, 2e-3)
plt.xlabel('mass (GeV)')
plt.ylabel('charge, 90% C.L. limit')
plt.show()
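
To illustrate how `genfromtxt` treats missing fields, here is a small sketch on a made-up table held in memory (it is not the BaBar file). By default a missing field becomes `nan`; the `filling_values` argument replaces it with a value of your choice.

```python
import io
import numpy as np

# A small made-up table; the second row has a missing value in the last column.
text = "1.0,2.0\n3.0,\n5.0,6.0\n"

data_nan = np.genfromtxt(io.StringIO(text), delimiter=",")
data_filled = np.genfromtxt(io.StringIO(text), delimiter=",", filling_values=-1.0)
print(data_nan)
print(data_filled)
```

Choosing a fill value that cannot occur in real data (here -1.0) makes it easy to mask those entries later.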

### Part II: CSV

CSV stands for Comma Separated Values. Python's `csv` module allows easy reading and writing of sequences. CSV is especially useful for loading data from spreadsheets and databases.

##### Let's make a list and write a file!  
##### First, we load the module

In [None]:
import csv

Next, we create a file object that opens the file we want to write to.  
Then, we create a *csv writer*, a special object that is built specifically to write
sequences to our csv file.

In [None]:
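f_csv = open( 'nationData.csv', 'w', newline='' )    # newline='' is recommended by the csv documentation
SAWriter = csv.writer( f_csv,                 # write to this file object
                           delimiter = ',',          # place comma between items we write
                           quotechar = None,         # no quote character (an empty string is rejected by recent Python versions)
                           quoting = csv.QUOTE_NONE )# don't place quotes around strings made up of multiple words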

In [None]:
# let's write some data

countries = ['Argentina', 'Bolivia', 'Brazil', 'Chile', 'Colombia', 'Ecuador', 'Guyana',\
             'Paraguay', 'Peru', 'Suriname', 'Uruguay', 'Venezuela']
capitals = ['Buenos Aires', 'Sucre', 'Brasilia', 'Santiago', 'Bogota', 'Quito', 'Georgetown',\
             'Asuncion', 'Lima', 'Paramaribo', 'Montevideo', 'Caracas']
population_mils = [ 42.8, 10.1, 203.4, 16.9, 46.4, 15.0, 0.7, 6.5, 29.2, 0.5,\
                      3.3, 27.6]

In [None]:
SAWriter.writerow(['Data on South American Nations'])
SAWriter.writerow(['Country', 'Capital', 'Population (millions)'])
for i in range(len(countries)):
    SAWriter.writerow( [countries[i], capitals[i], population_mils[i]] )
f_csv.close()

In [None]:
#let's take a look

%cat nationData.csv
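
As a variation on the cells above, the sketch below writes a shortened version of the same table with `zip()` and `writerows()`, and reads it back with `csv.DictReader`, which uses the header row as dictionary keys. The file is written to a temporary directory, and only three nations are included to keep the example short.

```python
import csv
import os
import tempfile

countries = ["Argentina", "Bolivia", "Brazil"]
capitals = ["Buenos Aires", "Sucre", "Brasilia"]
population_mils = [42.8, 10.1, 203.4]

path = os.path.join(tempfile.mkdtemp(), "nations_demo.csv")  # throwaway file

# newline="" is what the csv documentation recommends when opening csv files
with open(path, "w", newline="") as f_out:
    writer = csv.writer(f_out)
    writer.writerow(["Country", "Capital", "Population (millions)"])
    writer.writerows(zip(countries, capitals, population_mils))  # one row per nation

# DictReader uses the header row as keys, so columns are addressed by name
with open(path, newline="") as f_in:
    rows = list(csv.DictReader(f_in))

for row in rows:
    print(row["Country"], float(row["Population (millions)"]))
```

Everything read from a csv file is a string, so numeric columns still need an explicit `float()` or `int()`.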

Now let's see if we can open your file using a spreadsheet program, like MS Excel. How did we do?
https://docs.google.com/spreadsheets



We can use a similar process to read data into Python from a csv file. Let's read in a list of the most populous cities and store them for analysis.

In [None]:
cities = []
cityPops = []
metroPops = []

os.system("wget https://portal.nersc.gov/project/m3438/physics77/cities.csv")

%cat cities.csv

f_csv = open( 'cities.csv', 'r')
readCity = csv.reader( f_csv, delimiter = ',' )
next(readCity) # skip the header row
for row in readCity:
    print(row)
    print (', '.join(row)) # join the elements of the list together, with the string ', ' in between
    city_country = ', '.join(row[0:2])
    cities.append(city_country)
    if row[2] != '':
        cityPops.append( float(row[2]) )
    else: cityPops.append(-1)
    if row[3] != '':
        metroPops.append( float(row[3]) )
    else: metroPops.append(-1)
f_csv.close()

print(cityPops)

In [None]:
metroPops, cityPops = np.array(metroPops), np.array(cityPops)
cIds = np.argsort(cityPops)[::-1] # sort in descending order
mIds= np.argsort(metroPops)[::-1]

print ("The five most populous cities (within city proper) are:\n")
for j in range(5):
        print (cities[cIds[j]], "with a population of {} million".format(cityPops[cIds[j]]))

print ("\nThe five most populous metropolitan regions in the world are:\n")
for i in range(5):
        print (cities[mIds[i]], "with a metro population of {} million".format(metroPops[mIds[i]]))


## Binary files

So far, we've been dealing with text files. If you opened these files up with a text editor, you could see what was written in them. Binary files are different. They're written in a form that Python (and other languages) understand how to read, but we can't read them directly in a text editor.  The most common binary file you'll encounter in Python is a *.npy* file, which stores numpy arrays. You can create these files using the command `np.save( filename, arr )`. That command will store the array `arr` as a file called filename, which should have the extension .npy. We can then reload the data with the command `np.load(filename)`

In [None]:
import numpy as np
x = np.linspace(-1.0, 1.0, 100)
y = np.sin(10*x)*np.exp(-x) - x
plt.plot(x,y,'r-')

In [None]:
x = x.reshape(100,1)
y = y.reshape(100,1)
xy = np.hstack((x,y))

#print(xy)
np.save('y_of_x.npy', xy )
print(len(xy))

In [None]:
del x, y, xy # erase these variables from Python's memory

In [None]:
xy = np.load('y_of_x.npy')
print (len(xy))
x = xy[:,0]
y = xy[:,1]

plt.plot(x,y,'r-')
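
To see what "binary" means at the byte level, here is a sketch that does not use numpy at all. It stores three arbitrary floats once as formatted text and once as raw 8-byte values with the standard `struct` module, then reads the binary file back. The file names are made up and live in a temporary directory; this is not the `.npy` format, which adds a header describing the array.

```python
import os
import struct
import tempfile

values = [0.1, 2.0 / 3.0, 1e-12]
folder = tempfile.mkdtemp()                    # throwaway directory
text_path = os.path.join(folder, "values.txt")
bin_path = os.path.join(folder, "values.bin")

# Text: digits are written as characters; precision depends on the format used
with open(text_path, "w") as f:
    for v in values:
        f.write(f"{v:.4e}\n")

# Binary: each float is stored as its 8 raw bytes ("d" = C double), mode "wb"
with open(bin_path, "wb") as f:
    f.write(struct.pack(f"<{len(values)}d", *values))

with open(bin_path, "rb") as f:
    restored = list(struct.unpack(f"<{len(values)}d", f.read()))

print(os.path.getsize(bin_path), "bytes in the binary file")
print(restored == values)   # bit-for-bit identical after the round trip
```

The text file, by contrast, keeps only the digits that the format string asked for.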

In [None]:
a = np.array((1,2,3))

In [None]:
a.shape

In [None]:
a = np.array([[1],[2],[3]])
a.shape

In [None]:
print(a.shape)

In [None]:
np.savetxt('test.out', xy, fmt='%1.4e')

In [None]:
%cat test.out

More on numpy.savetxt
https://numpy.org/doc/stable/reference/generated/numpy.savetxt.html

### HDF5 Files (Hierarchical Data Format version 5)
HDF5 files are ideally suited for managing large amounts of complex data. Python can read them using the module `h5py`.

In [None]:
import h5py

Let's load our first hdf5 file:

In [None]:
fh5 = h5py.File( 'solar.h5py', 'r' )

hdf5 files are made up of data sets. Each data set has a name, or a key. Let's take a look at our data sets:

In [None]:
for k in fh5.keys(): # loop through the keys
    print (k)

We access the data sets in our file by name:

In [None]:
print(len(fh5["names"]))
for nm in fh5["names"]: # make sure to include the quotation marks!
    print (nm)

In [None]:
print(fh5["names"])
print(len(fh5["names"]))
for nm in fh5["names"]: # make sure to include the quotation marks!
    Type(nm)

It looks like we've got some planet data on our hands!  
`names` is a special case, in that its elements are strings. The other data sets contain float values, and can be treated like numpy arrays.

In [None]:
print (fh5["solar_AU"][:])

In [None]:
# Let's check out a few other quantities
# TOrbit_yr
# density
# mass_earthM
# names
# solar_AU
# surfT_K

print (fh5["surfT_K"][:])
surfT = np.array(fh5["surfT_K"])
print(surfT.shape)
surfT = fh5["surfT_K"]
print(surfT.shape)

Let's make a plot of the solar system that shows each planet's:  
* distance from the sun (position on the x-axis)
* orbital period (position on the y-axis)
* mass (size of scatter plot marker)
* surface temperature (color of marker)
* density (transparency (or alpha, in matplotlib language))

In [None]:
distAU = fh5["solar_AU"][:]
mass = fh5["mass_earthM"][:]
torb = fh5["TOrbit_yr"][:]
temp = fh5["surfT_K"][:]
rho = fh5["density"][:]
names = fh5["names"][:]

In [None]:
import numpy as np

# Let's use the size of the circle to represent mass of the planet
def get_size( ms ):
    m = 400.0/(np.max(mass) - np.min(mass))
    return 100.0 + (ms - np.min(mass))*m 

# Let's use the transparency of the circle to represent the density of the planet
def get_alpha( p ):
    m = .9/(np.max(rho)-np.min(rho))
    return .1+(p - np.min(rho))*m

In [None]:
alfs = get_alpha(rho)

In [None]:
import matplotlib as mpl
import matplotlib.pyplot as plt

# normalize the color scale so that the color variation corresponds to the range of
# planet temperatures

norm = mpl.colors.Normalize(vmin=np.min(temp), vmax=np.max(temp))

#https://matplotlib.org/stable/tutorials/colors/colormaps.html
cmap = plt.cm.cool
m = plt.cm.ScalarMappable(norm=norm, cmap=cmap)

fig, ax = plt.subplots(1)
for i in range(8):
    ax.scatter( distAU[i], torb[i], s = get_size(mass[i]), color = m.to_rgba(temp[i]), alpha=alfs[i] ) 
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_ylim(0.1,200)
ax.set_ylabel( 'orbital period (y)' )
ax.set_xlabel( 'average dist. from sun (AU)' )
ax.set_title( 'Our solar system' )
plt.show()

### Recall Kepler's third law describes the relationship between the distance of planets from the Sun, and their orbital periods
# $\frac{r^3}{T^2} \approx \frac{GM}{4\pi^2} \approx 7.496 \times 10^{-6} \frac{AU^3}{days^2}$

$M$ is the mass of Sun


If we show r vs T on log scales, we get a linear relationship between the x and y axes, which are log(r) and log(T):

## $3\log(r) - 2\log(T) \approx \log(GM) - 2\log(2\pi)$ 
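
Taking the logarithm of Kepler's third law, $r^3/T^2 \approx$ const, predicts that log(T) plotted against log(r) is a straight line with slope 3/2. As a rough check, the sketch below fits that slope by ordinary least squares. The numbers are approximate textbook values typed in by hand (an assumption of this example, not read from `solar.h5py`), with distances in AU and periods in years.

```python
import math

# Approximate textbook values (assumed here, not read from solar.h5py):
# semi-major axis in AU and orbital period in years, Mercury through Neptune.
r_au = [0.387, 0.723, 1.000, 1.524, 5.203, 9.537, 19.19, 30.07]
t_yr = [0.241, 0.615, 1.000, 1.881, 11.86, 29.45, 84.02, 164.8]

log_r = [math.log10(r) for r in r_au]
log_t = [math.log10(t) for t in t_yr]

# Ordinary least-squares slope of log(T) versus log(r)
n = len(log_r)
mean_r = sum(log_r) / n
mean_t = sum(log_t) / n
slope = (sum((lr - mean_r) * (lt - mean_t) for lr, lt in zip(log_r, log_t))
         / sum((lr - mean_r) ** 2 for lr in log_r))
print(f"fitted slope = {slope:.3f} (Kepler's third law predicts 1.5)")
```

With the arrays loaded from the HDF5 file, the same fit could be done in one line with `np.polyfit(np.log10(distAU), np.log10(torb), 1)`.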

# Uproot 

If you are proficient with Python, or at least familiar with the operations earlier in this notebook, and you have reached this part of the notebook while the lecture is still ongoing, then this is an open-ended explorative exercise for you.

## Background

In this part, we will be loading and analyzing data from a ROOT file. ROOT is a software package developed by physicists at CERN for High Energy Physics data analysis. You can find everything you want to know about ROOT at this webpage https://root.cern/

While ROOT has many powerful functionalities for data analysis, we will not dedicate any significant fraction of this analysis to ROOT. Rather, we will load data from ROOT files to data structures such as numpy arrays and perform data analysis using Python libraries more widely used in the data science community.

## ROOT Tree 


## Uproot
There are many Python software packages that convert ROOT files to numpy arrays. Here we will use uproot. The full documentation of uproot can be found here
https://uproot.readthedocs.io/en/latest/index.html

and I would start with their `Getting Started Guide` examples https://uproot.readthedocs.io/en/latest/basic.html

### Below are some semi-structured guidelines

Because datahub@berkeley doesn't have uproot installed, we will install it on the fly

In [1]:
!pip install uproot awkward

Collecting uproot
  Using cached uproot-4.3.5-py3-none-any.whl (302 kB)
Collecting awkward
  Using cached awkward-1.9.0-cp39-cp39-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (11.9 MB)
Installing collected packages: uproot, awkward
Successfully installed awkward-1.9.0 uproot-4.3.5


In [2]:
# once installed, let's import it

import uproot

In [7]:
#load a root file from somewhere. in this example, it is on the web, but usually it is on your local disk

# this file is provided by uproot authors for demos

file = uproot.open("https://scikit-hep.org/uproot3/examples/nesteddirs.root") 

# I have another root file here https://portal.nersc.gov/project/m3438/physics77/2012.root you can also check that out

In [8]:
# Let's see what's inside the root file

print(file.keys())
print(file.classnames())



['one;1', 'one/two;1', 'one/two/tree;1', 'one/tree;1', 'three;1', 'three/tree;1']
{'one;1': 'TDirectory', 'one/two;1': 'TDirectory', 'one/two/tree;1': 'TTree', 'one/tree;1': 'TTree', 'three;1': 'TDirectory', 'three/tree;1': 'TTree'}


In [9]:
# directly open the object called 'events' in the root file
# N.B. this is a different file!!
events = uproot.open("https://scikit-hep.org/uproot3/examples/Zmumu.root:events")


In [10]:
# See what variables are available there
events.keys()

['Type',
 'Run',
 'Event',
 'E1',
 'px1',
 'py1',
 'pz1',
 'pt1',
 'eta1',
 'phi1',
 'Q1',
 'E2',
 'px2',
 'py2',
 'pz2',
 'pt2',
 'eta2',
 'phi2',
 'Q2',
 'M']

In [11]:
events.values()

[<TBranch 'Type' at 0x7fb14c298c70>,
 <TBranch 'Run' at 0x7fb14c2a0460>,
 <TBranch 'Event' at 0x7fb14c2a0b80>,
 <TBranch 'E1' at 0x7fb14c2a62e0>,
 <TBranch 'px1' at 0x7fb14c2a6a00>,
 <TBranch 'py1' at 0x7fb14c2ac160>,
 <TBranch 'pz1' at 0x7fb14c2ac880>,
 <TBranch 'pt1' at 0x7fb14c2acfa0>,
 <TBranch 'eta1' at 0x7fb14c2b2700>,
 <TBranch 'phi1' at 0x7fb14c2b2e20>,
 <TBranch 'Q1' at 0x7fb14c2b7580>,
 <TBranch 'E2' at 0x7fb14c2b7ca0>,
 <TBranch 'px2' at 0x7fb14c2bd400>,
 <TBranch 'py2' at 0x7fb14c2bdb20>,
 <TBranch 'pz2' at 0x7fb14c2c2280>,
 <TBranch 'pt2' at 0x7fb14c2c29a0>,
 <TBranch 'eta2' at 0x7fb14c2c8130>,
 <TBranch 'phi2' at 0x7fb14c2c8850>,
 <TBranch 'Q2' at 0x7fb14c2c8f70>,
 <TBranch 'M' at 0x7fb14c2cf6d0>]

In [14]:
# Loading a branch
mass = events['M']
type(mass)

uproot.models.TBranch.Model_TBranch_v12

In [15]:
# Other ways to browse variables in the file
# note that typenames are names of C++ types as ROOT files are created with C++
events.typenames()
events.show()

name                 | typename                 | interpretation                
---------------------+--------------------------+-------------------------------
Type                 | char*                    | AsStrings()
Run                  | int32_t                  | AsDtype('>i4')
Event                | int32_t                  | AsDtype('>i4')
E1                   | double                   | AsDtype('>f8')
px1                  | double                   | AsDtype('>f8')
py1                  | double                   | AsDtype('>f8')
pz1                  | double                   | AsDtype('>f8')
pt1                  | double                   | AsDtype('>f8')
eta1                 | double                   | AsDtype('>f8')
phi1                 | double                   | AsDtype('>f8')
Q1                   | int32_t                  | AsDtype('>i4')
E2                   | double                   | AsDtype('>f8')
px2                  | double                   | AsDtype('>f
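
The `interpretation` column uses NumPy dtype strings: `>` means big-endian byte order, `f8` is an 8-byte float and `i4` is a 4-byte integer. The sketch below uses made-up numbers (nothing is read from the file) to show what these strings mean and how such an array can be converted to the machine's native `float64`.

```python
import numpy as np

big_f8 = np.dtype(">f8")   # big-endian 8-byte float, the counterpart of C++ double
big_i4 = np.dtype(">i4")   # big-endian 4-byte integer, the counterpart of int32_t
print(big_f8.kind, big_f8.itemsize)
print(big_i4.kind, big_i4.itemsize)

# Made-up values stored big-endian, then converted to the native float64 type
raw = np.array([91.2, 3.1], dtype=">f8")
native = raw.astype(np.float64)
print(native.dtype, np.array_equal(raw, native))
```

The byte order only affects how the numbers are laid out in memory, not their values or precision.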

In [18]:
# How to load all the mass values to a numpy array
mass = events['M'].array(library='np')
mass.shape


(2304,)

In [19]:
# Load other variables
E1 = events['E1'].array(library='np')
E2 = events['E2'].array(library='np')

px1 = events['px1'].array(library='np')
px2 = events['px2'].array(library='np')

py1 = events['py1'].array(library='np')
py2 = events['py2'].array(library='np')

pz1 = events['pz1'].array(library='np')
pz2 = events['pz2'].array(library='np')

pz2.shape

(2304,)

A bit of physics background:
* The variables that are in the events object (which corresponds to a TTree in ROOT) have the same shapes.
    * We saw mass has a shape of (2304,), so do pz2, E1, px1, etc.
    * in this example, the whole file contains a sample of collision events
    * there are 2304 collisions in this sample/file
* every collision in this sample has exactly two leptons
    * the kinematics of a lepton are described by its four-momentum (E, momentum)
    * since momentum is a vector, the four-momentum (aka four-vector) of the lepton has four components (E, px, py, pz)
    * Einstein's mass-momentum relation (in natural units, $c = 1$) gives us $m^2 = E^2 - |\vec{p}|^2$
        * so you can calculate the mass of a lepton with the above formula, or written out in components, $M = \sqrt{E^2 - (p_x^{2}+p_y^{2}+p_z^{2})}$
        
* can you calculate the mass for lepton1 and lepton2 and draw their distribution separately?
    * note that the numbers 1 and 2 in the variable names are indices of leptons.
    * i.e., E1, px1, py1, pz1 are properties of the same lepton, lepton 1; while those with 2 in name are properties of lepton 2
    
* can you calculate the mass of the di-lepton system?
    * what this means is that you consider the two leptons as a single physical system
    * as such, the dilepton system has an energy of E1 + E2 and a momentum of $\vec{p}_{ll} = \vec{p}_{1} + \vec{p}_{2}$. N.B. the momentum sum is always a vector sum
    * this quantity is also known as the invariant mass of dilepton system ($m_{ll}$)
    * what does this distribution look like?
    * how does the invariant mass of dilepton system compare to the sum of two individual lepton masses?
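
One possible way to set up the two calculations above is sketched here. The helper name `invariant_mass` and the two hand-made leptons are assumptions made for illustration only. With the real data you would pass in the `E1`, `px1`, ... arrays loaded earlier. Natural units ($c = 1$) are assumed, as in the formula above.

```python
import numpy as np

def invariant_mass(E, px, py, pz):
    """Return sqrt(E^2 - |p|^2) element by element (natural units, c = 1)."""
    mass_squared = E**2 - (px**2 + py**2 + pz**2)
    # Rounding can push m^2 slightly below zero; clip so the square root stays real
    return np.sqrt(np.maximum(mass_squared, 0.0))

# One made-up event standing in for the arrays read from the file
E1, px1, py1, pz1 = np.array([5.0]), np.array([3.0]), np.array([0.0]), np.array([0.0])
E2, px2, py2, pz2 = np.array([5.0]), np.array([-3.0]), np.array([0.0]), np.array([0.0])

m1 = invariant_mass(E1, px1, py1, pz1)   # mass of lepton 1
m2 = invariant_mass(E2, px2, py2, pz2)   # mass of lepton 2

# Dilepton system: add the energies, and add the momenta component by component
m_ll = invariant_mass(E1 + E2, px1 + px2, py1 + py2, pz1 + pz2)
print(m1, m2, m_ll)
```

In this toy event the two leptons fly back to back, so their momenta cancel and the dilepton mass is larger than the sum of the two individual masses. To look at the distributions, the resulting arrays can be passed to a histogram call such as `matplotlib.pyplot.hist`.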


In [21]:
# How about this file?

file = uproot.open("https://portal.nersc.gov/project/m3438/physics77/2012.root") 

print(file.keys())
print(file.classnames())

# I've opened the TTree for you
events2 = uproot.open("https://portal.nersc.gov/project/m3438/physics77/2012.root:tree_TYPE_DATA")


['tree_TYPE_DATA;1']
{'tree_TYPE_DATA;1': 'TTree'}


In [22]:
events2.show()

name                 | typename                 | interpretation                
---------------------+--------------------------+-------------------------------
RunNumber            | int32_t                  | AsDtype('>i4')
EventNumber          | int32_t                  | AsDtype('>i4')
invariant_mass       | double                   | AsDtype('>f8')
vtx_z                | float                    | AsDtype('>f4')
ph_pt_leading        | float                    | AsDtype('>f4')
ph_pt_subleading     | float                    | AsDtype('>f4')
ph_eta_corrected_... | float                    | AsDtype('>f4')
ph_eta_corrected_... | float                    | AsDtype('>f4')
ph_etas1_leading     | float                    | AsDtype('>f4')
ph_etas1_subleading  | float                    | AsDtype('>f4')
ph_etas2_leading     | float                    | AsDtype('>f4')
ph_etas2_subleading  | float                    | AsDtype('>f4')
ph_cl_eta_leading    | float                    | AsDtype(

#### A bit of physics background on this file

* this file contains a sample of collision events where there are always two photons
    * can you check how many events are in this sample?
    * the two photons are labeled "leading" and "subleading", which appear as suffixes in the variable names
        * i.e. ph_phi_leading, ph_eta_leading, ph_pt_leading as well as other ph_*_leading variables are properties of the leading photon. The same can be said about the subleading photon
        
* in this file, the four-momentum of a photon is not given by E, px, py, pz, but rather by pT, eta, phi
    * we have the following relations to convert pT, eta, phi to E, px, py, pz
        * $p_x = p_{T} \cos(\phi)$
        * $p_y = p_{T} \sin(\phi)$
        * $p_z = p_{T} \sinh(\eta)$
        * $E = p_{T} \cosh(\eta)$
    * btw, why do we only need to know about three quantities (pT, eta, phi) to define the four momentum of photon which in principle has 4 DOFs?    
    
* can you calculate the invariant mass of the diphoton system consisting of the leading photon and subleading photon?
    * N.B. leading and subleading are the ordering of photons in pT. 
    * Leading photon's pT is greater than that of subleading photon
    * can you make a plot of the diphoton invariant mass distribution?
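
As a starting point, the conversion relations listed above can be written as a small NumPy function. This is a sketch under assumptions: the function name `to_cartesian` and the numbers for the two photons are made up for illustration, and with the real data you would pass in arrays loaded from `events2` instead.

```python
import numpy as np

def to_cartesian(pt, eta, phi):
    """Convert (pT, eta, phi) to (E, px, py, pz) using the relations above."""
    px = pt * np.cos(phi)
    py = pt * np.sin(phi)
    pz = pt * np.sinh(eta)
    E = pt * np.cosh(eta)
    return E, px, py, pz

# One made-up event: two photons at eta = 0, back to back in phi
pt_lead, eta_lead, phi_lead = np.array([60.0]), np.array([0.0]), np.array([0.0])
pt_sub, eta_sub, phi_sub = np.array([50.0]), np.array([0.0]), np.array([np.pi])

E_a, px_a, py_a, pz_a = to_cartesian(pt_lead, eta_lead, phi_lead)
E_b, px_b, py_b, pz_b = to_cartesian(pt_sub, eta_sub, phi_sub)

# Diphoton system: sum the energies and sum the momenta component by component
E = E_a + E_b
px, py, pz = px_a + px_b, py_a + py_b, pz_a + pz_b
mass_squared = E**2 - (px**2 + py**2 + pz**2)
m_gg = np.sqrt(np.maximum(mass_squared, 0.0))  # clip tiny negative rounding errors
print(m_gg)
```

Because everything is done with array arithmetic, the same lines work unchanged when each input holds one value per event.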