# Python in data analysis

Python is a programming language with many features that are useful in data analysis. Python libraries such as NumPy, pandas and Matplotlib provide robust tools for manipulating and visualizing data.

This tutorial covers the Python features that are most important for this course.
The examples in this notebook can be useful in the upcoming exercises.

You can run code cells in notebooks by clicking the cell and pressing `CTRL`+`ENTER`.
If you can't remember how to do something in a notebook, press **H** while in command mode to see a list of the available shortcuts.

Feel free to modify the code cells and try different things!

1. [Basics](#basics)
1. [Functions](#functions)
1. [Modules](#modules)
    1. [NumPy](#numpy)
    1. [Pandas](#pandas)
    1. [Matplotlib](#matplotlib)
    1. [SciPy](#scipy)
1. [More examples](#examples)

<a id="basics"></a>
## Basics

The following examples show the very basic properties of Python, such as variables, the list datatype, if-statements and loops.

In [None]:
# Example 1 of basics of python

# Variables
text = 'The result is' # you can also use double quotes ""
num1 = 25
num2 = 7
result = num1 - num2
 
print(text,result)

In [None]:
# Example 2 of basics of python

# lists
values = [1,3,19,-75]
fruits = ['Apple', 'Banana', 'Orange']

print("Values:",values)
print("Fruits:",fruits)

fruits.append('Passion fruit')
print("New fruits:",fruits)

### If-statements

**If**-statements can be used if we only want to execute the code if a certain condition is met. The syntax is quite self-explanatory:

In [None]:
# Example 1 of if-statement

condition1 = 1 > 2 # this is False
condition2 = 2 > 1 # this is True

if condition1:
    # This part of code is accessed if condition1 is true
    print("Condition 1 was true.")
elif condition2:
    # This part of code is accessed if condition1 is false and condition2 is true
    print("Condition 2 was true.")
else:
    # This part of code is accessed if each condition is false
    print("Each condition was false.")
    
# You can try changing the conditions and see what happens. You can also put the condition directly into
# the if-statement rather than assigning it to a variable (e.g. "if 1 < 2:").

### Loops

**For**-loops are used to iterate over a sequence of values. A for-loop can be used in the following way:

`for element in sequence:
    do something`

In this case the indented code is executed once for each element in the sequence.

A sequence is usually a list or is defined using the `range(start, stop, step)` function (see the examples below).

In [None]:
# Example of for-loop. Print contents of a list.

fruits = ['Apple', 'Banana', 'Orange']

for fruit in fruits:
    print(fruit)

In [None]:
# Example 2 of for-loop. Print integers and their squares from 1 to 9

for i in range(1,10): # note that the stopping value is not included in the iteration
    print("Number: ",i,"  Square:",i*i)
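
The three-argument form of `range` mentioned above can be sketched as follows. The variable names `evens` and `countdown` are only illustrative and are not used elsewhere in this notebook.

```python
# Sketch of range(start, stop, step); the variable names are only illustrative.
evens = list(range(0, 10, 2))      # step of 2, the stop value 10 is excluded
countdown = list(range(5, 0, -1))  # a negative step counts downwards

print(evens)      # [0, 2, 4, 6, 8]
print(countdown)  # [5, 4, 3, 2, 1]
```

Wrapping `range` in `list(...)` is only needed here to print all the values at once. In a for-loop the range can be used directly.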

In [None]:
# Example of if-statement and for-loop.

values = range(0,10)
limit = 5

for value in values:
    if value < limit:
        print(value,"is smaller than",limit)
    elif value > limit:
        print(value,"is larger than",limit)
    else:
        print(value,"is equal to",limit)

A **while**-loop is another way to write iterative code. Here are a couple of examples of how to use it.

In [None]:
# Example 1 of while loop

i = 0
while i < 5: # The loop continues until the condition becomes false.
    print(i)
    i = i+1 # increase i by 1 for next iteration

In [None]:
# Example 2 of while loop. This code asks the user for input and repeats it. Program exits when "exit" is passed.

while True: # Infinite loop
    answer = input() # input() reads a line of user input and returns it as a string
    if answer == "exit":
        print("Exiting... bye bye.\n")
        break # break-keyword will break the loop
    else:
        print("You said",answer)

### Basics - Examples

In [None]:
# Save texts 'Test 1' and 'Test 2' and numbers 0 and 1 into variables called t1,t2,n1,n2

t1 = 'Test 1'
t2 = 'Test 2'
n1 = 0
n2 = 1

# Make if-statement that prints 'Test 1' if number is 0, 'Test 2' if number is 1 and 'Unknown' otherwise

number = 1 # You can change this to something else and see what happens when you run the code.
if number == n1:   # note the double equal-sign here
    print(t1)
elif number == n2:
    print(t2)
else:
    print('Unknown')

In [None]:
# Let's make a for loop that prints the integers from 1 to 10

for i in range(1,11): # The last number in range is not included. The default step for the range-function is 1.
    print(i)

<a id="functions"></a>
## Functions

Functions are pieces of code that allow you to use the same functionality multiple times. There are a number of ready-made functions in Python itself and in the modules that are discussed in the following section. It is also often useful to implement new functions to perform some task. Some useful functions in Python are:
- `print(argument)` prints the argument inside the parentheses
- `len(object)` returns the length of an object, for example a list or a string
- `abs(value)` returns the absolute value of the value
- `min(values)` and `max(values)` return the minimum or maximum value of the argument values

The second example below shows a custom-made function.

In [None]:
# Example of python-functions

vals = [-36, 13, -4, 1, 196]
length = len(vals)
absvalue = abs(vals[2]) # note that index of first element is 0
minimum = min(vals)
print("Number of elements in vals is", length)
print("Absolute value of the third element in vals is", absvalue) 
print("Minimum value of vals is", minimum)

In [None]:
# Example of custom function

# Let's make a function that takes a single float number as a parameter and returns the square of it.

def square(x):
    return x*x

# Now we can use this function

print("The square of 3 is", square(3))
print()

# Let's test the function with a list of multiple numbers

testValues = [-5,-3,0,2,12]
# Loop over each value in list
for i in testValues:
    print("The square of", i, "is", square(i))

<a id="modules"></a>
## Modules

Python is widely used in the scientific community for computing, modifying and analyzing data, and many well-optimized libraries exist for these purposes. Much of this functionality comes from *modules*, which are files containing definitions (such as functions) and statements. Modules are loaded with the **import** statement. The most important modules we're going to use are introduced below.
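
As a minimal sketch of the different forms of the import statement, the block below uses the `math` module from the Python standard library. The choice of `math` and the alias `m` are only for illustration.

```python
import math                # import the whole module
import math as m           # import the module under a shorter alias
from math import sqrt      # import only a selected name

print(math.sqrt(16))  # 4.0
print(m.cos(0))       # 1.0
print(sqrt(25))       # 5.0
```

The alias form is the one used for the modules below, for example `import numpy as np`.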

<a id="numpy"></a>
### Numerical calculus - NumPy

NumPy is the fundamental package for scientific computing in Python. For more information, see [NumPy documentation](https://numpy.org/devdocs/user/whatisnumpy.html).

**Summary** of useful commands

Import
```Python
import numpy as np
```

Calculus ( [pi](https://docs.scipy.org/doc/numpy/reference/constants.html), [sqrt](https://docs.scipy.org/doc/numpy/reference/generated/numpy.sqrt.html), [exp](https://docs.scipy.org/doc/numpy/reference/generated/numpy.exp.html), [sin](https://docs.scipy.org/doc/numpy/reference/generated/numpy.sin.html), [sum](https://docs.scipy.org/doc/numpy/reference/generated/numpy.sum.html), [linspace](https://docs.scipy.org/doc/numpy/reference/generated/numpy.linspace.html) )
```Python
np.pi
np.sqrt(x)
np.exp(x)
np.sin(x)
np.sum(array)
np.linspace(start,stop,num)
```
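
A short sketch of how some of the commands above behave on arrays. The variable names `x`, `roots` and `total` are chosen only for this illustration.

```python
import numpy as np

x = np.linspace(0, 1, 5)   # 5 evenly spaced points: 0, 0.25, 0.5, 0.75, 1
roots = np.sqrt(x)         # the function is applied to each element of the array
total = np.sum(x)          # 0 + 0.25 + 0.5 + 0.75 + 1 = 2.5

print(x)
print(roots)
print(total)
print(np.sin(np.pi / 2))   # 1.0
```

Note that, unlike `range`, `np.linspace` includes the stop value by default.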


<a id="pandas"></a>
### Data manipulation - Pandas

Pandas is an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. For more information, see [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/index.html).

**Summary** of useful commands (df is a variable that contains the pandas dataframe):

Import
```Python
import pandas as pd
```

Data reading ( [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html), [read_table](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_table.html) )
```Python
df = pd.read_csv('path')
df = pd.read_table('path')
```

Data inspecting ( [head](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html), [len](https://docs.python.org/3/library/functions.html#len), [shape](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shape.html) )
```Python
df.head(n)
len(df)
df.shape
```

Data manipulation ( pick specific columns, filter data using limits )
```Python
df.column_name
df['column_name']
df[ (df.column_name >= lower_limit) & (df.column_name <= upper_limit)]  
```
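
The selection and filtering commands above can be tried without a data file by building a small dataframe by hand. This is only a sketch: the column names `name` and `mass` and all the values are made up for the illustration.

```python
import pandas as pd

# Hypothetical example data; the column names and values are made up.
df = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'mass': [2.5, 9.4, 91.2, 125.0],
})

lower_limit, upper_limit = 5, 100

# Keep only the rows where mass is between the limits (both included)
selected = df[(df.mass >= lower_limit) & (df.mass <= upper_limit)]

print(df.shape)          # (4, 2): 4 rows and 2 columns
print(len(selected))     # 2
print(selected['name'])  # names of the rows that passed the filter
```

The parentheses around each condition are needed because `&` binds more tightly than the comparison operators.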

<a id="matplotlib"></a>
### Plotting - Matplotlib

Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. For more information, see [Matplotlib documentation](https://matplotlib.org/).

**Summary** of useful commands:

Import
```Python
import matplotlib.pyplot as plt
```

Plot functions ( [plot](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.plot.html), [hist](https://matplotlib.org/3.2.0/api/_as_gen/matplotlib.pyplot.hist.html), [scatter](https://matplotlib.org/3.2.0/api/_as_gen/matplotlib.pyplot.scatter.html), [errorbar](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.errorbar.html) )

```Python
plt.plot(xdata, ydata, 'style')
plt.hist(data, n)
plt.scatter(xdata, ydata)
plt.errorbar(val1, val2, xerr, yerr)
```

Formatting ( [figure](https://matplotlib.org/3.2.0/api/_as_gen/matplotlib.pyplot.figure.html), [xscale](https://matplotlib.org/3.2.0/api/_as_gen/matplotlib.pyplot.xscale.html), [xlabel](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.xlabel.html), [title](https://matplotlib.org/3.2.0/api/_as_gen/matplotlib.pyplot.title.html), [legend](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.legend.html) )
```Python
plt.figure(figsize=[width,height])
plt.xscale('log')
plt.xlabel('x-axis name')
plt.title('title name')
plt.legend()
```


Other ( [show](https://matplotlib.org/3.2.0/api/_as_gen/matplotlib.pyplot.show.html) )
```Python
plt.show()
```

<a id="scipy"></a>
### Scientific computing - SciPy

The SciPy library provides many user-friendly and efficient numerical routines, such as routines for numerical integration, interpolation, optimization, linear algebra, and statistics. For more information, see [SciPy documentation](https://www.scipy.org/scipylib/index.html).

**Summary** of useful commands (each function is imported separately):

Import only the ones you need
```Python
from scipy.optimize import curve_fit
from scipy.integrate import simps # recent SciPy versions name this function simpson
from scipy.interpolate import interp1d
```

Functions ( [curve_fit](https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html), [simps](https://docs.scipy.org/doc/scipy/reference/generated/scipy.integrate.simps.html), [interp1d](https://docs.scipy.org/doc/scipy/reference/generated/scipy.interpolate.interp1d.html) )
```Python
curve_fit(function, xdata, ydata, initial_guess)
simps(y,x)
interp1d(x,y,kind='...') # ... can be for example linear, quadratic or cubic 
```
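
A minimal sketch of interpolation and numerical integration on a known function, \( y = x^2 \), so that the results are easy to check by hand. The `try`/`except` import is an assumption about the installed SciPy version: recent versions provide the integration routine as `simpson`, older ones as `simps`.

```python
import numpy as np
from scipy.interpolate import interp1d

try:
    from scipy.integrate import simpson            # name in recent SciPy versions
except ImportError:
    from scipy.integrate import simps as simpson   # name in older SciPy versions

x = np.linspace(0, 3, 7)   # 7 points: 0, 0.5, 1, ..., 3
y = x**2

# interp1d returns a function that can be evaluated between the data points
f_linear = interp1d(x, y, kind='linear')
f_cubic = interp1d(x, y, kind='cubic')
print(f_linear(1.25))   # straight line between (1, 1) and (1.5, 2.25)
print(f_cubic(1.25))    # close to the true value 1.25**2 = 1.5625

# Integrate y over x with Simpson's rule; the exact integral of x**2 from 0 to 3 is 9
area = simpson(y, x=x)
print(area)
```

The linear interpolation overestimates the curve between the points, while the cubic one follows it closely.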

<a id="examples"></a>
### Examples

#### Simple calculus

In [None]:
# numpy examples

import numpy as np

number = 100 # you can change this to see what happens
print("Square root of",number,"is",np.sqrt(number))

angle_rad = np.pi
angle_deg = np.degrees(angle_rad)
print("Cosine of", angle_deg, "degrees is", np.cos(angle_rad))

#### Using matplotlib

In [None]:
# Simple matplotlib example

import numpy as np
import matplotlib.pyplot as plt

# Define x-points using np.linspace
x = np.linspace(-2,8,10) # 10 points evenly spaced between -2 and 8
y = x**2 # x**2 means x squared

plt.figure(figsize=(10,5)) # figsize in inches
plt.plot(x,y,'ro-', label='dataset') # plot in red dots connected with line. Label is shown in legend
plt.xlabel('x-axis label')
plt.ylabel('y-axis label')
plt.title('title')
plt.legend() # show legend
plt.show() # show plot

#### Plotting a histogram from data in file

In [None]:
# Pandas and matplotlib example

# Let's read sample data from a csv-file '../Data/sample.csv' that contains the birthdays for a group of people. 
# We want to plot a histogram of the birth months.

# first we need to import the necessary modules (in notebooks the import-statements that were run before are 
# still valid and in that case the statements here wouldn't be necessary)

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('../Data/sample.csv') # read the data
months = data.month              # month-column in csv-file can be selected using data.month or data['month']

plt.figure(figsize=(10,5))
plt.hist(months,12)              # plot the histogram
plt.xlabel('month')              # set label for x-axis
plt.ylabel('number of people')   # set label for y-axis
plt.show()                       # show the histogram


#### Fitting a function

In [None]:
# Numpy, matplotlib and scipy example

# In this example we fit a quadratic function to a dataset using curve_fit from scipy.optimize.
# numpy-library also has ready-made functions for fitting, but it is useful to know how to fit an arbitrary 
# function to your data.

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

# This is the function we want to fit to the data.
def quadratic(x,a,b,c):
    return a*x**2 + b*x + c

# This code is just to produce some random noisy data where we need to fit a function. Don't mind this too much.
np.random.seed(5) # set the seed first so that the same data is produced on every run
xdata = np.linspace(-10,10,50)
y = quadratic(xdata, np.random.uniform(-1,1), np.random.uniform(-1,1), np.random.uniform(-1,1))
y_noise = 0.4 * np.random.normal(size=xdata.size)
ydata = y + y_noise
fig = plt.figure(figsize=(10,5))
plt.plot(xdata, ydata, 'r-', label='Data')

# The function-fitting is done here
p0 = [0.5, 1, -2] # This is the initial guess for parameters a, b and c. It is necessary especially for complicated functions
curve, covar = curve_fit(quadratic, xdata, ydata, p0) # curve is now an array of the fitted parameters a, b and c
[a,b,c] = curve

plt.plot(xdata, quadratic(xdata,a,b,c), 'k-', label='Fit')
plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()
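
The second return value of `curve_fit` is the covariance matrix of the fitted parameters. A common way to estimate the parameter uncertainties is to take the square roots of its diagonal elements. The sketch below shows this for a straight line; the function `line`, the true parameter values and the noise level are made up for the illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical model for this illustration: a straight line
def line(x, a, b):
    return a * x + b

# Made-up noisy data around the line y = 2x - 1
rng = np.random.default_rng(1)
xdata = np.linspace(0, 10, 50)
ydata = line(xdata, 2.0, -1.0) + rng.normal(scale=0.5, size=xdata.size)

params, covar = curve_fit(line, xdata, ydata, p0=[1.0, 0.0])
errors = np.sqrt(np.diag(covar))  # one standard deviation for each parameter

for name, value, error in zip(['a', 'b'], params, errors):
    print(name, '=', round(value, 3), '+-', round(error, 3))
```

The fitted values should land close to 2 and -1, with uncertainties that shrink if the noise level is reduced or more points are used.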

# CODERUNNER EXERCISES

# Week 1 (getting familiar with python)

## Average

<p>Write a function <b>average( vals )</b> that takes a list of values and returns the average value.</p>

In [None]:
# possible solution

import numpy as np

def average( vals ):
    return np.sum(vals)/len(vals) # or simply np.mean(vals)

In [None]:
# test1
testlist = [0]
print(average(testlist))
# expected output:
# 0.0

# test2
testlist = [1,2,3,4]
print(average(testlist))
# expected output:
# 2.5

# test3
testlist = [-2,-1,0,1,2,3]
print(average(testlist))
# expected output:
# 0.5

# test4
testlist = [-1.2345, 0.1245, 663.13223, 32847.1111]
print(round(average(testlist),2))
# expected output:
# 8377.28
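
For comparison, the same average can be sketched without NumPy by using the built-in `sum` and `len` functions. The name `average_plain` is hypothetical and is used only to keep it apart from the solution above.

```python
# Hypothetical variant of average() that uses only built-in functions
def average_plain(vals):
    return sum(vals) / len(vals)

print(average_plain([1, 2, 3, 4]))  # 2.5
print(average_plain([0]))           # 0.0
```

Like the NumPy version, this sketch does not handle an empty list, where the division by zero raises an error.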

## Exponential

<p>Write a function <b>exponential( x, a, b, f0 )</b> that returns the value of the function \( f(x)=ae^{bx}+f_0 \).</p>

In [None]:
# possible solution

import numpy as np

def exponential( x, a, b, f0 ):
    return a*np.exp(b*x) + f0


# Hint 1 if fails: Use the exp() function from the numpy module.
# Hint 2 if fails: exp-function can be used like this:
# import numpy as np
# np.exp(...)
# Hint 3 if fails:
# import numpy as np
# def exponential( x, a, b, f0 ):
#     return ... * np.exp(...) + ...

In [None]:
# test1, x=-1, a=b=1, f0=0
x = -1
a = b = 1
f0 = 0
print(round(exponential(x,a,b,f0),2))
# expected output:
# 0.37

# test2, x=b=1, a=0, f0=18
print(exponential(1,0,1,18))
# expected output:
# 18.0

# test3, x=3, a=18, b=-0.71, f0=-5.968
x = 3
a = 18
b = -0.71
f0 = -5.968
print(round(exponential(x,a,b,f0),2))
# expected output:
# -3.83

# Week 2 (particle physics exercises start)

## Bin centers

<h3>Function that returns the bin centers</h3><p><br></p><p>Write a Python function <b>bin_centers( bins )</b> that takes a list of bin edges as an argument and <b>returns the bin centers</b>.</p><p>When fitting a distribution to a histogram, it is convenient to use the <b>numpy</b> module's histogram function, which returns the bin heights and bin edges as arrays. For function fitting, it is better to use the bin centers instead of the edges. Therefore, the function you write in this exercise can be used later to determine the bin center locations in the function fitting exercises.</p>

In [None]:
# possible solution

def bin_centers(bins):
    if len(bins) > 1:
        return [0.5 * (bins[i] + bins[i+1]) for i in range(len(bins)-1)]
    else:
        print("At least two bin edges are needed.")

In [None]:
# test1, evenly spaced positive values
bins = [0, 1, 2, 3, 4, 5]
print(bin_centers(bins))
# expected output: [0.5, 1.5, 2.5, 3.5, 4.5]

# test2, evenly spaced mixed values
bins = [-10, -6, -2, 2, 6, 10]
print(bin_centers(bins))
# expected output: [-8.0, -4.0, 0.0, 4.0, 8.0]

# test3, real data
import numpy as np
import pandas as pd
data = pd.read_csv('../Data/DoubleMuRun2011A.csv')
data_filtered = data[(data.M <= 100) & (data.M >= 80)]
inv_mass = data_filtered.M
hist, bins = np.histogram(inv_mass,bins=10)
print(bin_centers(bins))
# expected output: [81.001835, 83.001505, 85.00117499999999, 87.000845, 89.00051500000001, 91.000185, 92.999855, 94.999525, 96.999195, 98.998865]

# test4, unevenly spaced values (optional test, fails will most likely be here)
bins = [0,1,3,9,21,52]
print(bin_centers(bins))
# expected output: [0.5, 2.0, 6.0, 15.0, 36.5]

# other optional tests: empty list, list with only 1 element (use this if you want to be mean)
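
Because `np.histogram` returns the bin edges as an array, the bin centers can also be computed with array slicing instead of a loop. The sketch below is one possible variant; the name `bin_centers_np` and the example values are made up for the illustration.

```python
import numpy as np

# Hypothetical NumPy variant of bin_centers
def bin_centers_np(bins):
    bins = np.asarray(bins, dtype=float)
    # bins[:-1] are the left edges and bins[1:] the right edges of each bin
    return 0.5 * (bins[:-1] + bins[1:])

# Made-up values histogrammed into 3 bins between 0 and 3
values = [0.2, 0.7, 1.1, 1.9, 2.5, 2.6]
heights, edges = np.histogram(values, bins=3, range=(0, 3))

print(heights)                # [2 2 2]
print(bin_centers_np(edges))  # [0.5 1.5 2.5]
```

This variant returns a NumPy array rather than a list, and an empty array if fewer than two edges are given.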

## Invariant mass

<h3>Function that returns the invariant mass</h3><p><br></p><p>Write a Python function <b>invariant_mass( pt1, pt2, eta1, eta2, phi1, phi2 )</b> that takes the transverse momenta, pseudorapidities and azimuthal angles of two particles as parameters and <b>returns the invariant mass</b>. Remember that the invariant mass can be calculated as \(M=\sqrt{2p_{T1}p_{T2}(\cosh(\eta_1-\eta_2)-\cos(\phi_1-\phi_2))} \).</p><p>Hint: You can use the square root and trigonometric functions by importing the <b>numpy</b> module.<br></p>

In [None]:
# possible solution

import numpy as np

def invariant_mass(pt1,pt2,eta1,eta2,phi1,phi2):
    return np.sqrt( 2*pt1*pt2* ( np.cosh(eta1-eta2) - np.cos(phi1-phi2) ) )

In [None]:
# test with artificial parameters
pt1,pt2,eta1,eta2,phi1,phi2  =  0,1,2,3,4,5
print(round(invariant_mass(pt1,pt2,eta1,eta2,phi1,phi2),2))

# expected output: 
# 0.0

# test with artificial parameters
pt1,pt2,eta1,eta2,phi1,phi2  =  9,8,7,6,5,4
print(round(invariant_mass(pt1,pt2,eta1,eta2,phi1,phi2),2))

# expected output: 
# 12.02

# test with data near Z mass
pt1,pt2,eta1,eta2,phi1,phi2  =  58.6914, 45.7231, -1.02101, -0.37030, 0.836256, 2.741820
print(round(invariant_mass(pt1,pt2,eta1,eta2,phi1,phi2),2))

# expected output: 
# 91.14

# test with data near J/psi mass
pt1,pt2,eta1,eta2,phi1,phi2  =  6.60896, 17.876452, -0.250095, -0.375014, 0.512346, 0.765667
print(round(invariant_mass(pt1,pt2,eta1,eta2,phi1,phi2),2))

# expected output: 
# 3.06

# test with more data
import pandas as pd
df = pd.read_csv('../Data/DoubleMuRun2011A.csv')
df2 = df.head(10)
inv_mass = invariant_mass(df2.pt1,df2.pt2,df2.eta1,df2.eta2,df2.phi1,df2.phi2)
print(inv_mass.round(2))

# expected output:
# 0    17.49
# 1    11.55
# 2     9.16
# 3    12.48
# 4    14.31
# 5     6.82
# 6    39.53
# 7    37.74
# 8    10.54
# 9     3.11
# dtype: float64