Inga Ulusoy, Computational modelling in python, SoSe2020 

# Some fundamentals

All data in a program consists of data items. In Python, every data item is a data object with three attributes:

1. Identity: Location of the data object in memory (in CPython, the address returned by id())
2. Type: Defines which operations are allowed with the data object
3. Value: The content of the data object, which either can be changed (mutable) or cannot (immutable)
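As a minimal sketch of these three attributes, the built-in functions id() and type() can be used to inspect an object. The names numbers, alias and point are made up for this illustration.

```python
# a list is a mutable object
numbers = [1, 2, 3]
# a second name for the very same object
alias = numbers

# identity: both names refer to one object
print(id(numbers) == id(alias))
# type: decides which operations are allowed
print(type(numbers))

# value: changing a mutable object in place keeps its identity,
# so the change is visible through both names
numbers.append(4)
print(alias)

# a tuple is immutable: its value cannot be changed after creation
point = (1, 2)
try:
    point[0] = 5
except TypeError as err:
    print('tuples cannot be changed:', err)
```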

In [None]:
myval = 0

In [None]:
myval2 = myval

Both names now refer to the same data object, i.e. the same address.

In [None]:
myval = 7 

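The name myval now refers to a new data object with the value 7, located at a different address. Integers are immutable, so the original object itself has not changed, and the name myval2 still refers to the originally assigned value 0.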
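The effect of the reassignment can be checked with the is operator, which compares identities rather than values. This is a small sketch with the hypothetical names first and second, so that the variables defined above stay untouched.

```python
first = 0
second = first
# both names refer to the same object
print(first is second)   # True

# rebinding first creates no change in the object that second refers to
first = 7
print(first is second)   # False
print(first, second)     # 7 0
```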

In [None]:
type(myval)

The variable myval is an integer.

In [None]:
myval3=7.0

The variable myval3 is a floating-point number.

In [None]:
myval4 = 0.5+1.5j

The variable myval4 is a complex number.

In [None]:
myval5=True

A logical variable is called a Boolean and can only take one of two possible values.

In [None]:
myval6='Hello world'

Variables containing text are strings.
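The types introduced above can be listed side by side with type(). The following sketch uses a made-up list called examples that holds one literal of each kind.

```python
# one literal for each of the basic types discussed above
examples = [7, 7.0, 0.5 + 1.5j, True, 'Hello world']

for item in examples:
    # type(item).__name__ is the short name of the type, e.g. 'int'
    print(repr(item), 'has type', type(item).__name__)
```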

# Basic input and output

Variables can be printed directly using the print() function. Formatted string literals (f-strings) and the str.format() method provide an easy way to format the output.

In [None]:
print(myval6)
print('I just meant to say',myval6)
print(myval,myval2,myval3)
print('First variable',myval,'second variable',myval2,'and third variable',myval3)
print(f'The value of myval3 is approximately {myval3:.3f}.')
print('The value of myval3 is approximately {:.3f}.'.format(myval3))
print('The value of myval3 is approximately {:.3e}.'.format(myval3))
print('First variable {:1d}, second variable {:03d}, and third variable {:3.8f}'.format(myval,myval2,myval3))

They can also be written to a file, which is placed relative to the directory where Jupyter is running. Please note the slightly different syntax: the variables and strings are first concatenated into a new string, which is then written to the file.
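A minimal sketch of writing to a file could look as follows. The file name myoutput.txt and the variables value_a and value_b are assumptions chosen for this illustration.

```python
value_a = 7
value_b = 7.0

# open the file for writing ('w' overwrites an existing file of that name)
outfile = open('myoutput.txt', 'w')
# write() expects a single string, so numbers are converted with str()
# and concatenated with the text; the newline '\n' has to be added by hand
line = 'First variable ' + str(value_a) + ' and second variable ' + str(value_b) + '\n'
outfile.write(line)
# close the file so that the content is actually flushed to disk
outfile.close()
```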

Variables can also be read in from a file. For this, please place the file "data.dat" in your directory.

The file contains the following lines

|||
| - | - |
| 2.5     | 0.097  |
| 5.0     | 0.195  |
| 7.5     | 0.289  |
| 10.0    | 0.387  |
| 15.0    | 0.581  |
| 20.0    | 0.775  |
| 30.0    | 0.966  |

Files can be read like this:
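The following sketch shows one way to do this with explicit open() and close() calls. To keep it self-contained, it first writes a small stand-in file with the first three rows of the table above. The name demo.dat is hypothetical, and with the real file you would open 'data.dat' instead.

```python
# create a small stand-in file with the first three rows of the table
outfile = open('demo.dat', 'w')
outfile.write('2.5 0.097\n5.0 0.195\n7.5 0.289\n')
outfile.close()

# open the file for reading, collect all lines in a list, then close it
infile = open('demo.dat', 'r')
lines = infile.readlines()
infile.close()

for line in lines:
    # strip() removes the trailing newline character of each line
    print(line.strip())
```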

There is also a more pythonic way to read a file, the with statement. It ensures that the file is properly closed at the end, so there is no need for an explicit close() call.
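As a sketch of this approach, the block below reads a file inside a with statement and then checks that it has been closed. The stand-in file demo_with.dat is an assumption made so that the example runs on its own.

```python
# write a small stand-in file; it is closed automatically after the block
with open('demo_with.dat', 'w') as outfile:
    outfile.write('2.5 0.097\n5.0 0.195\n7.5 0.289\n')

# read the file line by line inside a with block
lines = []
with open('demo_with.dat') as infile:
    for line in infile:
        lines.append(line.strip())

# outside the with block the file is already closed
print(infile.closed)
print(lines)
```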

This provides a very compact way of accessing a file.

# Linear regression

Imagine you know only values of a function at specific points, and you would like to know to what extent these follow a linear behaviour, and what the linear coefficients are.

## A simple example

In linear regression, the functional form that describes the data is assumed to be linear of the type $y=mx+b$.

We will use least-squares linear regression. For a set of $n$ data points, the sum of the squares of the vertical deviations is denoted $R^2$ (not to be confused with the squared correlation coefficient $r^2$):
\begin{align}
R^2(m,b)=\sum_{i=1}^n [y_i-(mx_i+b)]^2
\end{align}
This value is minimized with respect to $m$ and $b$, so that the conditions read
\begin{align}
\frac{\partial R^2}{\partial b}&=0\\
\frac{\partial R^2}{\partial m}&=0
\end{align}
resulting in
\begin{align}
\frac{\partial R^2}{\partial b}&=-2 \sum_{i=1}^n [y_i-(mx_i+b)] = 0 \\
\frac{\partial R^2}{\partial m}&=-2 \sum_{i=1}^n [y_i-(mx_i+b)] x_i = 0
\end{align}
or, reformulated
\begin{align}
n b + m \sum_{i=1}^n x_i &= \sum_{i=1}^n y_i \\
b \sum_{i=1}^n x_i + m \sum_{i=1}^n x_i^2 &= \sum_{i=1}^n x_i y_i
\end{align}
This can be expressed in matrix form
\begin{align}
\begin{pmatrix}
n & \sum_{i=1}^n x_i \\
\sum_{i=1}^n x_i & \sum_{i=1}^n x_i^2
\end{pmatrix}
\begin{pmatrix}
b \\
m
\end{pmatrix} =
\begin{pmatrix}
\sum_{i=1}^n y_i \\
\sum_{i=1}^n x_i y_i
\end{pmatrix}
\end{align}
and solved for the intercept $b$ and the slope $m$.
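The matrix equation can be solved numerically, for example with numpy.linalg.solve. The sketch below builds the 2x2 system from the seven value pairs listed in the table above and solves for the intercept and the slope. The variable names are chosen for this illustration only.

```python
import numpy as np

# value pairs from the table above (left column x, right column y)
x = np.array([2.5, 5.0, 7.5, 10.0, 15.0, 20.0, 30.0])
y = np.array([0.097, 0.195, 0.289, 0.387, 0.581, 0.775, 0.966])
n = len(x)

# coefficient matrix and right-hand side of the normal equations
A = np.array([[n,       x.sum()],
              [x.sum(), (x**2).sum()]])
rhs = np.array([y.sum(), (x * y).sum()])

# the solution vector holds the intercept first and the slope second
intercept, slope = np.linalg.solve(A, rhs)
print(f'slope {slope:.4f}, intercept {intercept:.4f}')
```

The result should agree with a library fit of a straight line to the same points, which is a useful consistency check.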

The data file "data.dat" contains value pairs: the left column (x-values) is the sample volume and the right column (y-values) is the measured extinction (light absorption) for each sample volume.


In [None]:
#initialize myx and myy as empty lists
myx=[]
myy=[]
with open('data.dat') as f:
    for line in f:
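        #split the line once into its whitespace-separated columns
        columns = line.split()
        #every new item for x is converted to float and appended to the list myx
        myx.append(float(columns[0]))
        #print the growing list to follow the loop (can be removed)
        print(myx)
        #every new item for y is converted to float and appended to the list myy
        myy.append(float(columns[1]))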
# in the end we obtain one list filled with the x values and one with the y values
print('x-values:',myx)
print('y-values:',myy)
#the values are now stored in two lists, myx and myy. 
#the elements in the list can be accessed as such:
print('first list element has the index zero', myx[0])
print('second list element has the index one', myx[1])
print('first to third element can be accessed as such', myx[0:3], 'or as such',myx[:3])
print('the last element is most easily accessed like this', myx[-1])
print('if I want to know how many elements I have in my list:',len(myx))
#and I can loop over them:
for i in range(len(myx)):
    something = myx[i]
    print(something)
#or like this:
for i in myx:
    something = i
    print(something)
#both loops achieve the same in this case

In [None]:
#Now we can plot the values
#use matplotlib
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(8,5))
#subplots returns a tuple that is unpacked into fig and ax here - ax is the object controlling the axes formatting,
#and fig is useful for global figure attributes, for example, if you want to save the figure into a file
#scatter plot (without a line)
ax.scatter(myx,myy)
ax.set_xlabel('x (Volume)',fontsize=18)
ax.set_ylabel('y (Extinction)',fontsize=18)
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)

#if you want, you can save the plot
#plt.savefig('data.pdf',dpi=300,bbox_inches='tight')
#display the plot - in jupyter not really required but it is cleaner
plt.show()

Now we can do the linear regression using the scipy library.

In [None]:
from scipy import stats
#call the linear regression function
slope,intercept,r,p,std_err = stats.linregress(myx,myy)
print('Slope m {:.3f}, intercept b {:.3f}, Pearson correlation coefficient r {:.3f};'.format(slope,intercept,r))
print('r^2 {:.3f}, standard error of the slope {:.3f}'.format(r**2,std_err))

In [None]:
#Now we can plot the values together with the fit
#myx is a list - and Python lists do not support element-wise arithmetic
#it has to be converted to an array first, for that we need the numpy library
import numpy as np
mm=np.array(myx)

fig, ax = plt.subplots(figsize=(8,5))
ax.scatter(myx,myy,label='values')
ax.plot(mm,slope*mm+intercept,label='fit')

ax.set_xlabel('x (Volume)',fontsize=18)
ax.set_ylabel('y (Extinction)',fontsize=18)
plt.xticks(fontsize=18)
plt.yticks(fontsize=18)

plt.savefig('data_regression.pdf',dpi=300,bbox_inches='tight')
plt.show()

# Problem 2: Good fit

A linear behaviour of extinction vs concentration (volume) signifies validity of the Lambert-Beer law. The Lambert-Beer law is only valid for dilute solutions, which explains the deviation of the large-volume data points from linear behaviour. 

Please carry out the linear regression using fewer points to obtain a better fit. How many points should be used for a good fit? What is the error in comparison to a fit using all data points?
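As a starting point, the sketch below fits only the first few value pairs by slicing and compares the quality of the fit for different numbers of points. It is one possible approach under stated assumptions, not the intended solution: it uses numpy.polyfit and numpy.corrcoef as a stand-in for stats.linregress, takes the value pairs from the table above, and the name r2_by_count is made up. The same slicing, e.g. myx[:count], works with the lists read from the file.

```python
import numpy as np

# value pairs from the table above
x = np.array([2.5, 5.0, 7.5, 10.0, 15.0, 20.0, 30.0])
y = np.array([0.097, 0.195, 0.289, 0.387, 0.581, 0.775, 0.966])

# r^2 for fits that use only the first `count` points
r2_by_count = {}
for count in range(3, len(x) + 1):
    xs, ys = x[:count], y[:count]
    # straight-line fit: polyfit returns the slope first, then the intercept
    slope, intercept = np.polyfit(xs, ys, 1)
    # Pearson correlation coefficient of the selected points
    r = np.corrcoef(xs, ys)[0, 1]
    r2_by_count[count] = r**2
    print(f'{count} points: slope {slope:.4f}, intercept {intercept:.4f}, r^2 {r**2:.5f}')
```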