In [None]:
import cekComputerLabs as cek
# cek.checkGitRepo()
from IPython.display import Image

from scipy import stats
import pandas as pd
import numpy as np
from lmfit import Model
import matplotlib.pyplot as plt

# Statistical Analysis and Linear Regression
In the short introductory lecture to this numerical laboratory we briefly discussed averages and linear regressions. In this first activity we practise computing averages and performing (linear) regressions using a Jupyter notebook and some of the most common Physics/Chemistry equations.

## Virtual experiment 
## [_You can start your first laboratory here_](virtualLaboratory.ipynb) 

# Glossary

- **Analytic solution**:\
Solving a problem with "pen and paper" and showing all the steps of the solution.
- **Numerical solution**:\
Solving a problem with an algorithm implemented in Python or Excel to obtain an approximate solution of the problem.
- **Linearise an equation**:\
Rearrange an equation so that after a simple change of variables the plot of the data will be a straight line, _e.g._ the Arrhenius equation can be written in either of these two forms
\begin{equation}
k = Ae^{-E/RT} \qquad\qquad \ln k = -\frac{E}{RT} + \ln A
\end{equation}
If we then define $y=\ln k$ and $x=1/T$ we obtain
\begin{equation}
y = -\frac{E}{R} x + \ln A \qquad\qquad y=mx+q
\end{equation}
Hence, the slope of a linear fit of $y$ _vs_ $x$ is $m=-E/R$ and the y-axis intercept is $q=\ln A$.
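
As a minimal sketch of this linearisation, the snippet below generates noise-free synthetic rate constants from assumed values of $A$ and $E$ (both hypothetical, chosen only for illustration) and recovers them from a straight-line fit of $\ln k$ _vs_ $1/T$ using `scipy.stats.linregress`.

```python
import numpy as np
from scipy import stats

R = 8.314  # gas constant, J/(mol K)

# Hypothetical Arrhenius parameters, chosen only for illustration
A_true = 1.0e10   # pre-exponential factor, 1/s
E_true = 50.0e3   # activation energy, J/mol

# Synthetic, noise-free rate constants at a few temperatures
T = np.array([280.0, 300.0, 320.0, 340.0, 360.0])  # K
k = A_true * np.exp(-E_true / (R * T))

# Linearise: y = ln k and x = 1/T, so that y = m x + q
x = 1.0 / T
y = np.log(k)
fit = stats.linregress(x, y)

# Recover the physical parameters from the slope and the intercept
E_fit = -fit.slope * R
A_fit = np.exp(fit.intercept)
print("E = {:.1f} kJ/mol".format(E_fit / 1000))
print("A = {:.2e} 1/s".format(A_fit))
```

With real (noisy) data the recovered values would of course carry an uncertainty, which is discussed in the curve fitting section.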


## Number of significant figures

When you conduct a measurement, the precision of the value you record should correspond to the measurement's inherent error. For instance, if you specify an object's length as 6.725 cm, it implies an uncertainty of approximately 0.001 cm. Expressing this measurement as either 6.7 or 6.7250012 would suggest that you know it only to the nearest 0.1 cm in the former case, or to an extraordinary precision of 0.0000001 cm in the latter. It is advisable to report only as many significant figures as are consistent with the estimated error. The value 6.725 cm has four significant figures, meaning that four digits are meaningful in relation to the measurement. Note that this concept does not depend on the number of decimal places: converted to millimeters, the same measurement becomes 67.25 mm and still has four significant figures.

As per the widely accepted convention, only one uncertain digit should be included in a measurement report. For instance, if the estimated error is 0.03 cm, the result should be presented as 6.73 cm ± 0.03 cm, rather than 6.725 cm ± 0.03 cm. In other words, quantities should be reported to a number of significant figures such that their uncertainty has one significant figure, _e.g._
* 123 $\pm$ 1
* 123.4 $\pm$ 0.1
* 120 $\pm$ 10
* 120 $\pm$ 2

Any ambiguity about whether the zero is significant or not in the last example is removed when the uncertainty of the measurement is explicitly reported.
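
This convention can be sketched in a few lines of Python. The helper below, `round_to_uncertainty`, is a hypothetical name introduced here for illustration (it is not part of any library): it rounds the uncertainty to one significant figure and the value to the same decimal position, using the built-in `round` function.

```python
import math

def round_to_uncertainty(value, sigma):
    """Round sigma to one significant figure and value to the same position."""
    # Position of the first significant digit of sigma (0.03 -> -2, 12 -> 1)
    exponent = math.floor(math.log10(abs(sigma)))
    ndigits = -exponent
    return round(value, ndigits), round(sigma, ndigits)

# Hypothetical measurements and uncertainties
print(round_to_uncertainty(6.7281, 0.03))
print(round_to_uncertainty(123.44, 0.12))
print(round_to_uncertainty(1234.5, 12.0))
```

A negative `ndigits` rounds to the nearest 10, 100, etc., which is why the same helper also handles uncertainties larger than one.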

An advantage of using a notebook or Excel for all the calculations is that you can use variables/cells to store the results of all the operations up to the machine precision and then round only the final answer for reporting purposes. If all operations are done using variables, rounding errors do not accumulate, since all operations are done in double precision and the intermediate results are stored with about 15-16 significant decimal digits.

While rounding numbers to a given number of decimal places can be trivially done in Python using formatted printing, it is less obvious how to round to the nearest 10 or 100.
In order to do that we can use the ```round``` function with a negative value of ```ndigits```

```
round(number, ndigits=None)
Return number rounded to ndigits precision after the decimal point. 
If ndigits is omitted or is None, it returns the nearest integer to its input.

For the built-in types supporting round(), values are rounded to the closest 
multiple of 10 to the power minus ndigits; if two multiples are equally close, 
rounding is done toward the even choice (so, for example, both round(0.5) and 
round(-0.5) are 0, and round(1.5) is 2). Any integer value is valid for ndigits 
(positive, zero, or negative). The return value is an integer if ndigits 
is omitted or None. Otherwise, the return value has the same type as number.

Notes
-----
For values exactly halfway between rounded decimal values, Python rounds to 
the nearest even value. Thus 1.5 and 2.5 round to 2.0, -0.5 and 0.5 round to 0.0, 
etc. Results may also be surprising due to the inexact representation of decimal 
fractions in the IEEE floating point standard [1]_ and errors introduced when 
scaling by powers of ten.
```

Instead, if you want to always round .5 up you can use
```python
np.rint(np.nextafter(a, a+1))
```
where ```a``` is your variable.

This is however largely immaterial, since the result of a floating point calculation on measured data will rarely be exactly .5.

In [None]:
number = 123.456789

print("Value {:f}".format(number))
print("Value {:.4f}".format(number))
print("Value {:.2f}".format(number))
print("Value {:.0f}".format(number))
print("-------")
rounded = round(number,6)
print("Value {}".format(rounded))
rounded = round(number,4)
print("Value {}".format(rounded))
rounded = round(number,2)
print("Value {}".format(rounded))
rounded = round(number,0)
print("Value {}".format(rounded))
rounded = round(number,-1)
print("Value {}".format(rounded))
rounded = round(number,-2)
print("Value {}".format(rounded))

## Random Errors

Random errors stem from the fluctuations observed when conducting multiple trials of a particular measurement. For instance, if you were to repeatedly measure the period of a pendulum using a stopwatch, you would notice that the measurements vary each time. These fluctuations primarily arise from the challenge of precisely determining when the pendulum reaches a specific point in its motion and accurately starting and stopping the stopwatch accordingly. Since you would obtain different period values in each attempt, your result inherently carries uncertainty. In any experiment there are various common sources of such random uncertainties:

1. Constraints imposed by the precision of your measuring equipment and the uncertainty in interpolating between the smallest divisions. Precision refers to the smallest directly measurable unit. A typical meter stick, for instance, is divided into millimeters, making its precision one millimeter.

2. Unpredictable variations in initial conditions during measurements. 

3. Absence of a precise definition for the measured quantity. The quantity being measured might be influenced by an uncontrolled variable, such as the temperature of the object.

4. Occasionally, the measured quantity is well-defined but is subject to inherent random fluctuations. These fluctuations may have a quantum origin or result from the fact that the values of the measured quantity are determined by the statistical behavior of a large number of particles. For example, AC noise can cause the needle of a voltmeter to fluctuate.

Regardless of the source of uncertainty, for it to be classified as "random," the fluctuations from a "true" value must be equally likely to be positive or negative. This characteristic provides a crucial insight into how to address random errors. By conducting a large number of measurements and calculating the average result, you can expect that, if uncertainties are truly equally likely to be positive or negative, the average of these measurements will closely approximate the correct value of the measured quantity. This is because positive and negative fluctuations tend to balance each other out over a large number of trials. 
In many cases the measurements are then "normally" distributed, which means that the normalised histogram of an infinitely large number of observations will be a Gaussian (bell curve) of width $\sigma$ centered at the "true" value, $\mu$.
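
The effect of averaging can be illustrated with a small simulation. This is only a sketch: the "true" period and the spread of a single measurement are hypothetical numbers, and the measurements are drawn from a normal distribution with `numpy`.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # fixed seed for reproducibility

true_period = 2.00   # s, hypothetical "true" value
sigma = 0.10         # s, hypothetical spread of a single measurement

# Average an increasing number of simulated measurements
means = {}
for n in (5, 50, 5000):
    measurements = rng.normal(true_period, sigma, size=n)
    means[n] = measurements.mean()
    print("N = {:5d}   mean = {:.4f} s".format(n, means[n]))
```

As the number of simulated measurements grows, the average is expected to approach the "true" value, because positive and negative fluctuations cancel out.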

In [None]:
Image(filename="normal.png",width=600)

**Figure 1.** Normal distribution representation By Ainali - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=3141713

If the exact values of $\mu$ and $\sigma$ are known, a single measurement of the quantity would have a
* 68.3% chance of being within $\sigma$ from $\mu$
* 95.4% chance of being within $2\sigma$ from $\mu$
* ...
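
These probabilities follow from the cumulative distribution function of the normal distribution, and can be checked with `scipy.stats.norm` as in the short sketch below.

```python
from scipy import stats

# Probability that a normally distributed value lies within n sigma of the mean
probs = {}
for n in (1, 2, 3):
    probs[n] = stats.norm.cdf(n) - stats.norm.cdf(-n)
    print("Within {} sigma: {:.2f}%".format(n, 100 * probs[n]))
```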

## Confidence interval
https://en.wikipedia.org/wiki/Student%27s_t-distribution

Because the true value of a quantity is generally unknown, and the magnitude of the random errors depends on the experimental setup, the operator's skills, etc., it is important to report our results with confidence intervals.
This is like saying, "I am 95% confident that the true value of this quantity sits within this range of values..."

Given a set of $N$ measurements of a certain quantity, $X = \{x_1, x_2, \dots, x_N\}$, define the average of that quantity as

\begin{equation}
\mu = \frac{\sum_i x_i}{N},
\end{equation}

its standard deviation as

\begin{equation}
\sigma = \sqrt{\frac{\sum_i (x_i-\mu)^2}{N-1}},
\end{equation}

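and its standard error as
\begin{equation}
s = \frac{\sigma}{\sqrt{N}}.
\end{equation}

Because the number of observations is usually limited (<50), the standard deviation computed from the sample is itself uncertain, and using the standard error with the multipliers of the normal distribution would underestimate the confidence interval.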
A better estimate of the confidence interval for our observable is

\begin{equation}
CI = \mu \pm ts = \mu\pm t\frac{\sigma}{\sqrt{N}}
\end{equation}

where $t$ depends on the chosen confidence level and on the number of degrees of freedom of our sample, $df=N-1$, and can be easily obtained from _Student's t_ tables, such as the one below.
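
As an alternative to a printed table, the $t$ values can be computed with `scipy.stats.t.ppf`. The sketch below assumes a two-tailed 95% confidence level and a few arbitrary numbers of degrees of freedom.

```python
from scipy import stats

confidence = 0.95

# Two-tailed t value: leave (1 - confidence)/2 in each tail
tvalues = {}
for df in (2, 4, 9, 29, 1000):
    tvalues[df] = stats.t.ppf(1 - (1 - confidence) / 2, df)
    print("df = {:4d}   t = {:.3f}".format(df, tvalues[df]))
```

For a large number of degrees of freedom $t$ approaches 1.96, the value for a normal distribution, while for small samples it is noticeably larger.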

In [None]:
Image(filename="t-table.png",width=600)

Let's make a practical example to see how this works.
Imagine we measure a reaction rate using 5 nominally equivalent initial conditions and we obtain these values:

| #   | $k_r$    |
|:---:|:--------:|
|  1  |  0.0291  |
|  2  |  0.0321  |
|  3  |  0.0293  |
|  4  |  0.0295  |
|  5  |  0.0272  |


In [None]:
# put the values into a numpy array
kr = np.array([0.0291, 0.0321, 0.0293, 0.0295, 0.0272])

ndata = len(kr)
print("The number of values is {}".format(ndata))

ndf = len(kr)-1
print("The number of degrees of freedom is {}".format(ndf))

mean = np.mean(kr)
print("The mean is {}".format(mean))

stdev = np.std(kr,ddof=1)
print("The standard deviation is {}".format(stdev))

sterr = stdev / np.sqrt(ndata)
print("The standard error is {}".format(sterr))

# Extract the t value from scipy for a two tailed distribution
tvalue = stats.t.ppf(q=1-.05/2,df=ndf)
print("The t value for a 95% confidence interval with {} degrees of freedom is {}".format(ndf,tvalue))

CI = tvalue * sterr
print("The value of kr is {:.3f} +/- {:.3f}".format(mean,CI))

## Comparison with experimental data - Student's t test
https://en.wikipedia.org/wiki/Student%27s_t-test

Once we have calculated our best estimate for a quantity and its 95% confidence interval, we can then compare it with its known/expected value, which we indicate as $k_{exp}$.
This can be done by performing the so called _t test_.

\begin{equation}
T_{test} = \frac{\mu - k_{exp}}{\sigma\big/\sqrt{N}}
\end{equation}

The value of $T_{test}$ computed above should then be compared with the $t$ value obtained before when we computed the confidence interval. If
* $-t \le T_{test} \le t$ the literature value is within our confidence interval, hence our estimate $\mu$ is consistent with the literature value $k_{exp}$
* $T_{test} < -t \quad or \quad T_{test}>t$ the literature value is outside our confidence interval, hence our estimate $\mu$ is statistically different from the literature value, _i.e._ they are inconsistent.

Using the data set above, we can check whether our measurement is consistent with the two available literature values, $k_{exp} = 0.026$ and $k_{exp}=0.031$.
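
The same comparison can be cross-checked with `scipy.stats.ttest_1samp`, which returns the test statistic together with a two-sided p-value. The sketch below uses the five rate constants from the table above; treating p < 0.05 as "inconsistent" is the assumption equivalent to using the 95% confidence interval.

```python
import numpy as np
from scipy import stats

# Rate constants from the table above
kr = np.array([0.0291, 0.0321, 0.0293, 0.0295, 0.0272])

results = {}
for k_exp in (0.026, 0.031):
    res = stats.ttest_1samp(kr, popmean=k_exp)
    results[k_exp] = (res.statistic, res.pvalue)
    # p < 0.05 corresponds to |T_test| > t at the 95% confidence level
    verdict = "inconsistent" if res.pvalue < 0.05 else "consistent"
    print("k_exp = {:.3f}  T = {:.3f}  p = {:.3f}  -> {}".format(
        k_exp, res.statistic, res.pvalue, verdict))
```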

In [None]:
literature = [0.026,0.031]

for mu in literature:
    Ttest1 = (mean - mu)/sterr
    print("Ttest = {:.4f} || t value {:.4f}".format(Ttest1,tvalue))
    
    if Ttest1 >= -tvalue and Ttest1 <= tvalue:
        print("My measurement is consistent with a literature value of {:.4f}\n".format(mu))
    else:
        print("My measurement is inconsistent with a literature value of {:.4f}\n".format(mu))

## Propagation of the uncertainty
https://en.wikipedia.org/wiki/Propagation_of_uncertainty

In statistics, propagation of uncertainty (or propagation of error) is the effect of variables' uncertainties (or errors, more specifically random errors) on the uncertainty of a function based on them. When the variables are the values of experimental measurements they have uncertainties due to measurement limitations (e.g., instrument precision) which propagate due to the combination of variables in the function.

Let's assume we have some quantities $x, y,\dots$ with given uncertainty $\sigma_x, \sigma_y, \dots$ and that we want to compute the uncertainty on another quantity $\Gamma$ obtained from them $\Gamma=f(x,y,\dots)$. There are standard formulae to propagate the error, which depend on what the function $f$ is, for example:

\begin{eqnarray}
f(x)   &= ax       \qquad  \qquad \sigma_f &= \lvert a\rvert\sigma_x \\
f(x,y) &= ax+by    \qquad  \qquad \sigma_f &= \sqrt{a^2\sigma_x^2 + b^2\sigma_y^2} \\
f(x)   &= ax^b     \qquad  \qquad \sigma_f &= \Bigg\lvert \frac{ax^b\ b\sigma_x}{x}\Bigg\rvert  \\
f(x)   &= a\ln(bx) \qquad  \qquad \sigma_f &= \Bigg\lvert a\frac{\sigma_x}{x}\Bigg\rvert  \\
f(x,y) &= xy       \qquad  \qquad \sigma_f &= \lvert f\rvert\sqrt{\Bigg(\frac{\sigma_x}{x}\Bigg)^2 + \Bigg(\displaystyle\frac{\sigma_y}{y}\Bigg)^2} \\
f(x,y) &= x/y       \qquad  \qquad \sigma_f &= \lvert f\rvert\sqrt{\Bigg(\frac{\sigma_x}{x}\Bigg)^2 + \Bigg(\displaystyle\frac{\sigma_y}{y}\Bigg)^2} \\
\end{eqnarray}

where $a$ and $b$ are pure numbers with no uncertainty.
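
All of these formulae are special cases of the first-order rule $\sigma_f^2 = \sum_i (\partial f/\partial x_i)^2 \sigma_{x_i}^2$ for independent variables. As a rough sketch, the hypothetical helper `propagate` below (not a library function) evaluates the partial derivatives numerically with central finite differences, and compares the result with the analytic formula for $x/y$ using made-up numbers.

```python
import numpy as np

def propagate(f, values, sigmas, rel_step=1e-6):
    """First-order uncertainty of f(*values) for independent variables."""
    values = np.asarray(values, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    variance = 0.0
    for i in range(len(values)):
        h = rel_step * max(abs(values[i]), 1.0)
        up, down = values.copy(), values.copy()
        up[i] += h
        down[i] -= h
        dfdx = (f(*up) - f(*down)) / (2 * h)   # central finite difference
        variance += (dfdx * sigmas[i]) ** 2
    return np.sqrt(variance)

# Hypothetical values, only to compare with the analytic formula for x/y
x, sx = 2.50, 0.05
y, sy = 4.00, 0.10
numeric = propagate(lambda x, y: x / y, [x, y], [sx, sy])
analytic = abs(x / y) * np.sqrt((sx / x) ** 2 + (sy / y) ** 2)
print("numeric  = {:.5f}".format(numeric))
print("analytic = {:.5f}".format(analytic))
```

The numerical approach is convenient when $f$ is too complicated to differentiate by hand, but it relies on the same assumptions as the formulae above: small, independent uncertainties.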

### Example 1
Let's imagine we want to prepare a solution of potassium hydrogen phthalate (molar mass = 204.22 g/mol). Given the precision of the scale and of our pipette we measure
* 15.00 $\pm$ 0.05 g of potassium hydrogen phthalate
* 100.0 $\pm$ 0.1 mL of DI water

What is the uncertainty of the concentration of the final solution?

In [None]:
# Problem data
molarMass = 204.22 # g/mol
mass = 15.00 # g
mass_sigma = 0.05
volume = 0.1000 # L
volume_sigma = 0.0001

# Compute the number of moles and its uncertainty
moles = mass / molarMass
moles_sigma = mass_sigma / molarMass

# Compute the concentration ...
conc = moles / volume

Because the concentration is obtained as the ratio of two quantities with uncertainties, we use 
\begin{eqnarray}
f(x,y) &= x/y       \qquad  \qquad \sigma_f &= \lvert f\rvert\sqrt{\Bigg(\frac{\sigma_x}{x}\Bigg)^2 + \Bigg(\displaystyle\frac{\sigma_y}{y}\Bigg)^2} \\
\end{eqnarray}

In [None]:
# ... and its uncertainty
conc_sigma = conc * np.sqrt( (moles_sigma/moles)**2 + (volume_sigma/volume)**2 )

print("Solution concentration = {:.3f} +/- {:.3f} mol/L".format(conc,conc_sigma))

### Example 2
We prepared two stock solutions of potassium hydrogen phthalate with concentrations
* 0.735 $\pm$ 0.003 mol/L
* 0.512 $\pm$ 0.007 mol/L

What is the concentration of potassium hydrogen phthalate if we mix equal volumes of the two stock solutions?

In [None]:
# Store the concentrations and uncertainties in lists
solution1 = [0.735 , 0.003] 
solution2 = [0.512 , 0.007]

mix = (solution1[0] + solution2[0]) / 2
mix_sigma = np.sqrt(solution1[1]**2 + solution2[1]**2) / 2

print("Solution concentration = {:.3f} +/- {:.3f} mol/L".format(mix,mix_sigma))

## Curve fitting
https://en.wikipedia.org/wiki/Least_squares

Curve fitting is generally done by minimising the sum of the squares of the residuals (the difference between an observed value and the fitted value provided by a model) by varying the model parameters. If we have a set of observations $y$ at different conditions $x$, we can fit those data using a function $f(x,a,b,c\dots)$ that contains a number of parameters; the number of parameters has to be smaller than the number of data points we have. For example
* linear fit:\
$f(x,a,b) = ax + b$

* quadratic fit:\
$f(x,a,b,c) = ax^2 + bx + c$

* exponential fit:\
$f(x,a,b) = a\ e^{bx}$

* $\dots$
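
To make the idea of "minimising the sum of the squares of the residuals" concrete, here is a minimal sketch for the linear case, where the minimum is known in closed form. The data points are hypothetical and roughly follow $y = 2x + 1$.

```python
import numpy as np

# Hypothetical data, roughly following y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

def sum_sq_residuals(a, b):
    """Sum of the squared residuals for the line a*x + b."""
    return np.sum((y - (a * x + b)) ** 2)

# Closed-form least-squares solution for a straight line
a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - a * x.mean()

print("a = {:.3f}, b = {:.3f}".format(a, b))
print("S(a, b)       = {:.4f}".format(sum_sq_residuals(a, b)))
# Moving away from the optimal parameters increases the sum of squares
print("S(a + 0.1, b) = {:.4f}".format(sum_sq_residuals(a + 0.1, b)))
print("S(a, b + 0.1) = {:.4f}".format(sum_sq_residuals(a, b + 0.1)))
```

For non-linear models no closed form exists in general, and the minimum is found iteratively starting from an initial guess of the parameters.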

There are many ways of fitting a function in Python; the one we suggest using is [_lmfit_](https://lmfit.github.io/lmfit-py/). In the example below, we will demonstrate
* How to import data from a CSV file
* How to perform a linear fit
* How to extract the values of the fitting parameters and their uncertainties
* How to show the goodness of the fit
* How to use the fitted function to predict the expected value at an arbitrary $x$

In [None]:
# Import data in a Pandas DataFrame from a file with columns separated by commas
data = pd.read_csv("../../miscData/LB.csv")
# Import the data from a file with columns separated by white spaces
# Skip comments and file has no header
#data = pd.read_csv("../../miscData/timing.dat",sep=r'\s+',
#                   comment="#",header=None)

print(data)

# Change column names
data.columns = ("x","y","e")
print(data)

In [None]:
# For clarity we put the values in the dataframe into variables
x = data["x"]
y = data["y"]

# Define the function to be used in the fitting
def func(x,a,b):
    y = a*x + b
    return y

# Create the lmfit model
lmodel = Model(func)

# Initialise the function parameters to be optimised
params = lmodel.make_params(a=1,b=1)

# Fit the data:
#   first argument is y
#   second argument is the set of fitting parameters
#   third argument is x
result = lmodel.fit(y,params,x=x)

# Print the summary of the result
print(result.fit_report())

# Goodness of the fit
fig = result.plot()
ax = fig.gca()
ax.set(xlabel="X label")
ax.set(ylabel="Y label")
plt.show()

In [None]:
# Extract the value of the parameters
aParam = result.params["a"].value
bParam = result.params["b"].value

# Extract the uncertainty of the parameters
# Lmfit reports the uncertainty at 1 sigma (standard error), hence we need to multiply
# it by 1.960 to get the 95% confidence interval
aErr = result.params["a"].stderr * 1.960
bErr = result.params["b"].stderr * 1.960
print("The slope is {:.3f} +/- {:.3f}".format(aParam,aErr))
print("The y-axis intercept is {:.2f} +/- {:.2f}".format(bParam,bErr))

In [None]:
# Direct evaluation of the function at a given x value
# Note that we use a numpy array, so the same input can be passed to the result object below
xValue = np.array([20.])
yValue = func(xValue,aParam,bParam)
# Here we use the "*" operator to unpack the array content
print("The value of func at x={} is {}".format(xValue,yValue))
print("The value of func at x={:.2f} is {:.2f}".format(*xValue,*yValue))

# Evaluation of the function using lmfit to get the 95% confidence interval
# Note that we have to pass a numpy array to the result object
yValue = result.eval(x=xValue)
yError = result.eval_uncertainty(x=xValue,sigma=0.95)
print("The value of func at x={} is {} +/- {}".format(xValue,yValue,yError))
print("The value of func at x={} is {:.2f} +/- {:.2f}".format(*xValue,*yValue,*yError))

In [None]:
# Function evaluation for a set of values and plotting with the data
#   50 linearly spaced values between 0 and 10
xValues = np.linspace(0,10,50)
yValues = func(xValues,aParam,bParam)

# 95% confidence interval for the fit
dely = result.eval_uncertainty(x=xValues,sigma=0.95)

# Create a figure and plot the data
fig, ax = plt.subplots()
ax.plot(xValues,yValues,label='fit',color='r')
ax.fill_between(xValues, yValues-dely, yValues+dely, color="gray", alpha=0.5)
ax.scatter(data["x"],data['y'],label="data")

ax.set(xlabel="X label (mol/L)")
ax.set(ylabel="Y label")

plt.legend()
plt.show()

## Weighted fit
The data we used in the previous example also contained the uncertainty of each value. We can use that information to give a larger weight to the values with smaller uncertainty, and then refit the data. In lmfit the weights multiply the residuals before they are squared, so we pass `weights=1/e`.
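
The effect of the weights can be illustrated independently of lmfit. The sketch below uses `np.polyfit`, whose `w` argument also expects $1/\sigma$ for Gaussian uncertainties, on hypothetical data in which the last point is an outlier with a large uncertainty.

```python
import numpy as np

# Hypothetical data: the first four points follow y = 2x closely,
# the last point is an outlier with a much larger uncertainty
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.0, 13.0])
e = np.array([0.1, 0.1, 0.1, 0.1, 3.0])

# Unweighted fit: every point counts the same
a_unweighted, b_unweighted = np.polyfit(x, y, 1)

# Weighted fit: points with a small uncertainty count more
a_weighted, b_weighted = np.polyfit(x, y, 1, w=1 / e)

print("Unweighted slope = {:.3f}".format(a_unweighted))
print("Weighted slope   = {:.3f}".format(a_weighted))
```

In the weighted fit the uncertain outlier barely affects the slope, whereas in the unweighted fit it pulls the line away from the other four points.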

In [None]:
# Import the data
data = pd.read_csv("../../miscData/LB.csv")
data.columns = ("x","y","e")
# For clarity we put the values in the dataframe into variables
x = data["x"]
y = data["y"]
e = data["e"]

# Define the function to be used in the fitting
def func(x,a,b):
    y = a*x + b
    return y

# Create the lmfit model
lmodel = Model(func)

# Initialise the function parameters to be optimised
params = lmodel.make_params(a=1,b=1)

# Fit the data:
#   first argument is y
#   second argument is the set of fitting parameters
#   third argument is x
#   weights multiply the residuals, hence we use 1/e
result = lmodel.fit(y,params,x=x,weights=1/e)

# Print the summary of the result
print(result.fit_report())

# Goodness of the fit
fig = result.plot()
ax = fig.gca()
ax.set(xlabel="X label")
ax.set(ylabel="Y label")
plt.show()

# $\LaTeX$ for Greek Letters

Below you can find the Greek letters and the corresponding $\LaTeX$ commands to include them in your equations.

\begin{eqnarray*}
\verb| Lowercase letters| &\qquad\qquad& \verb| Uppercase letters| \\
\verb|\alpha      | \alpha   &\qquad\qquad& \verb|  A         |  A      \\
\verb|\beta       | \beta    &\qquad\qquad& \verb|  B         |  B      \\
\verb|\gamma      | \gamma   &\qquad\qquad& \verb| \Gamma     | \Gamma  \\
\verb|\delta      | \delta   &\qquad\qquad& \verb| \Delta     | \Delta  \\
\verb|\epsilon    | \epsilon &\qquad\qquad& \verb|  E         |  E      \\
\verb|\zeta       | \zeta    &\qquad\qquad& \verb|  Z         |  Z      \\
\verb|\eta        | \eta     &\qquad\qquad& \verb|  H         |  H      \\
\verb|\theta      | \theta   &\qquad\qquad& \verb| \Theta     | \Theta  \\
\verb|\iota       | \iota    &\qquad\qquad& \verb|  I         |  I      \\
\verb|\kappa      | \kappa   &\qquad\qquad& \verb|  K         |  K      \\
\verb|\lambda     | \lambda  &\qquad\qquad& \verb| \Lambda    | \Lambda \\
\verb|\mu         | \mu      &\qquad\qquad& \verb|  M         |  M      \\
\verb|\nu         | \nu      &\qquad\qquad& \verb|  N         |  N      \\
\verb|\xi         | \xi      &\qquad\qquad& \verb| \Xi        | \Xi     \\
\verb|o           | o        &\qquad\qquad& \verb|  O         |  O      \\
\verb|\pi         | \pi      &\qquad\qquad& \verb| \Pi        | \Pi     \\
\verb|\rho        | \rho     &\qquad\qquad& \verb|  P         |  P      \\
\verb|\sigma      | \sigma   &\qquad\qquad& \verb| \Sigma     | \Sigma  \\
\verb|\tau        | \tau     &\qquad\qquad& \verb|  T         |  T      \\
\verb|\upsilon    | \upsilon &\qquad\qquad& \verb| \Upsilon   | \Upsilon\\
\verb|\phi        | \phi     &\qquad\qquad& \verb| \Phi       | \Phi    \\
\verb|\chi        | \chi     &\qquad\qquad& \verb|  X         |  X      \\
\verb|\psi        | \psi     &\qquad\qquad& \verb| \Psi       | \Psi    \\
\verb|\omega      | \omega   &\qquad\qquad& \verb| \Omega     | \Omega  
\end{eqnarray*}

These are the $\LaTeX$ commands for the most commonly used mathematical symbols:
\begin{eqnarray*}
\verb|\frac{a}{b}| &\qquad\qquad& \frac{a}{b} \\
\verb+\sum_{a}^{b}+ &\qquad\qquad& \sum_{a}^{b} \\
\verb+\prod_{a}^{b}+ &\qquad\qquad& \prod_{a}^{b} \\
\verb+\int_{a}^{b}+ &\qquad\qquad& \int_{a}^{b} \\
\end{eqnarray*}