# Module 13: Data & Curve fitting

In physics, we often have a physical model in mind for how we expect our data to behave or for how we want to analyze it. In order to make a comparison between the model and our data, we need to be able to load data into Python (rather than generating it within Python). Many data analysis packages are available inside SciPy which help ease the task of understanding data.

*Note:* Please complete these activities on your own computer or on JHub, where you have full control of where the files are placed within your path (filesystem). Google Colab is difficult to use in this context.

## Learning objectives
* Learn how to import data from elsewhere
* Learn how to extract something useful from the data


## Reading and writing text data files

Often, an experimental apparatus will be controlled by a computer; the data from the experiment is often saved in an ASCII plain text file, with numbers separated by spaces, commas, or tabs. The data is often organized in rows or columns, like you might see in a spreadsheet. Common suffixes for these files are `.txt` or `.csv` (comma-separated values). A nice feature of ASCII files is that they can be opened by any software, including software that you write yourself.

Particularly when these data files are large and/or numerous, it is preferable to have a data-analysis program read the data directly from a file. This also allows you to batch-process many files using the same Python script, rather than having to perform each analysis by hand.

Reading data from a file and writing data to a file can be a bit tricky, because Python interprets each character, including spaces and carriage returns, quite literally. For scientific purposes, we usually want to read and write arrays of floating point (real) numbers, but reading text characters is also useful, particularly in biophysics, geophysics, and astrophysics where the names of proteins, geographic locations, and astronomical objects might be part of the dataset.

###  numpy tools

As you might already expect, `numpy` includes easy-to-use functions for reading and writing data arrays. Two of these functions are called `loadtxt()` and `savetxt()`. 

To load the contents of the file  *filename* into the array variable `a`, simply use:

    a = np.loadtxt(filename) 

To save the array `a` to a file, just write:

    np.savetxt(filename, a) 


There are optional arguments to `loadtxt()` and `savetxt()` that you should explore on your own. By default, 
`loadtxt()` and `savetxt()` assume all numbers are real (type `float`). 

Note that `loadtxt` will give you a 2D array with the number of rows and columns corresponding to the rows and columns in the file.  Recall that to get a single column out of a 2D array, you slice it using `a[:,column_index]`.
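As a minimal sketch of a few of those optional arguments, the example below writes a small made-up table to a temporary file and reads it back. The file name `example.csv` and the data values are assumptions chosen only for illustration.

```python
import os
import tempfile

import numpy as np

# Made-up two-column data: time (s) and angle (rad)
t = np.linspace(0.0, 1.0, 5)
theta = np.cos(2 * np.pi * t)
table = np.column_stack((t, theta))   # shape (5, 2)

# Write to a temporary folder so the example does not clutter your directory
path = os.path.join(tempfile.mkdtemp(), 'example.csv')
np.savetxt(path, table, delimiter=',', header='t (s), theta (rad)', fmt='%.6f')

# The header line is written with a leading '#', which loadtxt skips by default
loaded = np.loadtxt(path, delimiter=',')

# unpack=True transposes the result, so each column can go to its own variable
t_in, theta_in = np.loadtxt(path, delimiter=',', unpack=True)

print(loaded.shape)
print(np.allclose(theta_in, loaded[:, 1]))
```

Using `unpack=True` is simply a shortcut for slicing out the columns one by one afterwards.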

Two other useful file-readers are:
* `genfromtxt()` from `numpy` (has some nice features if the file has missing numbers)
* `read_csv()` from the `pandas` package (if you know the file has comma-separated values)
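To illustrate the missing-number case, here is a small sketch using `genfromtxt()` on made-up text in which one entry is absent. The text is wrapped in `io.StringIO` so that no file is needed; the column titles and values are invented for this example.

```python
import io

import numpy as np

# Made-up comma-separated text with one missing entry in the second data row
text = """t,theta
0.0,1.00
0.1,
0.2,0.81
"""

# skip_header=1 skips the column titles; empty fields become nan by default
arr = np.genfromtxt(io.StringIO(text), delimiter=',', skip_header=1)
print(arr)

# Keep only the rows in which every entry is a valid number
good = arr[~np.isnan(arr).any(axis=1)]
print(good.shape)
```

Whether dropping incomplete rows is appropriate depends on the experiment; it is shown here only as one simple option.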

### Exercise 1

Download the file `damped_oscillation.txt` from the Google Drive and place it in the same directory where you saved the Jupyter notebook. Then load it using the commands listed above.

Note that, unless you specify a full file path, Python will assume that the file *filename* is in the same directory as the program you are running.  You may have to experiment with the precise file path, since it will depend on the machine you're running on, and where you saved the files. Typically, `loadtxt('damped_oscillation.txt')` should work, but here are some other things to try out.

*If you need to use the full path:* be sure to get the correct spelling and punctuation, and note that the direction the slashes face depends on which operating system you are using. Doing a copy-paste from your file browser is a good way to avoid typos. Here are some examples of what this might look like on different operating systems:
* Windows: `C:\\Documents\\username\\sub\\directory\\damped_oscillation.txt` (note the doubled backslashes, which a Python string needs because a single backslash starts an escape sequence)
* Mac: `/Users/username/sub/directory/damped_oscillation.txt`
* Linux: `/home/username/sub/directory/damped_oscillation.txt`
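One way to avoid typing slashes by hand is the standard-library `pathlib` module. The sketch below builds a path from its pieces; the folder names `sub` and `directory` are the same placeholders used in the list above, not real locations.

```python
from pathlib import Path

# Placeholder folder names; replace them with your own
data_file = Path.home() / 'sub' / 'directory' / 'damped_oscillation.txt'

print(data_file)            # full path, with the slashes your OS expects
print(data_file.exists())   # False until the file is really there

# The directory Python searches when you give a bare file name
print(Path.cwd())
```

A `Path` object can be passed directly to `np.loadtxt()` in place of a string.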

Once you have no error on the file-loading (the first cell), you are ready to access the data (the second cell).

The file contains experimental results of an angular position vs. time (in seconds) for a damped oscillator.
The first column is a list of times at which the measurements were made, the second column is the corresponding angular position.

* Assign the first column of the loaded variable `a` to a new variable `t` (for time)
* Assign the second column of the loaded variable `a` to a new variable `theta` (for angle)
* Make a properly labeled plot of `theta` vs `t` (with units). 
* Given that it's experimental data, it's a good idea to plot points in addition to lines. This conveys how much/little data is available, and how closely-spaced it is.



In [None]:
import numpy as np
import pylab

a = np.loadtxt('damped_oscillation.txt')
print(type(a[0,0]))
print(np.shape(a))

In [None]:
t     = a[:,0]
theta = a[:,1]

fig, ax = pylab.subplots(1,1)
ax.plot(t, theta, 'o-')  # points and connecting line
ax.set_xlabel('time (s)', fontsize=14)
ax.set_ylabel('theta (rad)', fontsize=14)
pylab.show()

## Reading and writing text data files from GitHub

In [None]:
import requests  # third-party HTTP library used to download the file

In [None]:
# To query data files from GitHub, make sure to switch to Raw format on the website
# This strips out all the HTML tags of the rendered page
# Raw format
url = 'https://raw.githubusercontent.com/lblogan14/CDSA2022_CompCourse/main/Day_09/damped_oscillation.txt'
# HTML-rendered page (do not use this one)
#url = 'https://github.com/lblogan14/CDSA2022_CompCourse/blob/main/Day_09/damped_oscillation.txt'

req = requests.get(url)
raw_text = req.text
print(raw_text)

In [None]:
raw_text.split('\n')[2].split('\t')

In [None]:
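data = []
for line in raw_text.split('\n'):
    if '#' in line:            # skip comment/header lines
        continue
    fields = line.split('\t')  # columns are separated by tabs
    if len(fields) < 2:        # skip blank or incomplete lines
        continue
    data.append(fields)
# Convert the list of strings to a 2D array of floats
data = np.array(data, dtype=float)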

In [None]:
data

In [None]:
t     = data[:,0]
theta = data[:,1]

fig, ax = pylab.subplots(1,1)
ax.plot(t, theta, 'o-')  # points and connecting line
ax.set_xlabel('time (s)', fontsize=14)
ax.set_ylabel('theta (rad)', fontsize=14)
pylab.show()
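
The line-by-line parsing loop above can often be replaced by `np.loadtxt()`, since a downloaded string can be wrapped in `io.StringIO` and read as if it were a file. The sketch below uses a short made-up string, `sample_text`, as a stand-in for `raw_text`; the real file may be laid out differently.

```python
import io

import numpy as np

# Stand-in for `raw_text`: tab-separated numbers with a '#' comment line.
# These values are invented for illustration.
sample_text = "# time\ttheta\n0.0\t1.00\n0.1\t0.95\n0.2\t0.81\n"

# io.StringIO wraps a string so that it can be read like an open file;
# loadtxt skips '#' lines and splits on any whitespace (including tabs)
data_alt = np.loadtxt(io.StringIO(sample_text))
print(data_alt.shape)
```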

## Interacting with data

Given our experimental data, we may want to extract something from it. For the damped oscillator data above, we might want to measure the frequency, amplitude, and damping coefficient.

Often, we will want to know the functional relationship between these variables. In some cases, we may have a theoretical reason 
to expect a certain relationship between the variables; in other cases the data itself might suggest a type of functional 
relationship. 

Note that data will always contain experimental noise and have been measured with finite precision, causing the data points to be a bit scattered from a perfect line. 

We need a rational way to determine the functional relationship between these variables; that is, we need a way to choose a function that best fits the scattered data.

Write code that reads the freefall data and plots the speed versus time. Add the line $s = 9.8\,t$ to the graph, showing the behavior we should expect.

*Hint:* The plot you make in the exercise below should look something like this:
![Data](http://www.physics.ncsu.edu/kemperlab/images/freefallgraph.png)



In [None]:
rdm_data = [[0.00,  -0.10290], 
            [0.10,   0.37364],
            [0.20,   2.43748],
            [0.30,   3.93836],
            [0.40,   3.31230],
            [0.50,   5.49472],
            [0.60,   5.43325],
            [0.70,   6.39321],
            [0.80,   9.06048],
            [0.90,   9.36415],
            [1.00,   9.52066]]

In [None]:
rdm_data = np.array(rdm_data)
t = rdm_data[:,0]
speed = rdm_data[:,1]

pylab.plot(t, speed, 'o', label='data')
pylab.plot(t, 9.8*t, '-', label='expected: s = 9.8 t')
pylab.xlabel('time (s)', fontsize=14)
pylab.ylabel('speed (m/s)', fontsize=14)
pylab.legend()
pylab.show()

## Least Squares Fitting: Theory

Often, we want to find the parameters of a function $f(x)$ that *best fits* the data. For instance, the slope of a line, or the frequency of an oscillator. 

Let $x_i$, $y_i$ denote the data, where $i = 1,\ldots,N$ enumerates the individual data points (here, rows). 

We are looking for a function $f(x)$ that best fits this data. We will define *best fit* as the 
function that minimizes the sum of squares of differences between the function values $f(x_i)$ and the data
values $y_i$ (this is not the only possible definition, as we will see below, but a very common one). Therefore, we define 
$$
	S = \sum_{i=1}^N (f(x_i) - y_i)^2
$$
and look for the function $f(x)$ that minimizes $S$. This is called the *least squares* method, and $S$ is the square of the "$L_2$-norm" of the differences $f(x_i) - y_i$. Note that the differences are squared so that each error contributes a positive amount to $S$. *To discuss:* can you see why this method has the name *least squares*?


The first step is to guess the form of the function $f(x)$. For the problem above, where we know something about uniform gravitational acceleration near the earth's surface, we might assume a linear relationship of the form $f(x) = ax + b$.  In that case, $S$ becomes
$$
	S = \sum_{i=1}^N (ax_i + b - y_i)^2
$$

Note that $S$ is a quadratic function of the fit parameters $a$ and $b$. The minimum of $S$ satisfies both

$$
\begin{aligned}
    \partial S/\partial a &= 0 \\
    \partial S/\partial b &= 0
\end{aligned}
$$

where the partial derivative means that we take the derivative with respect to only that variable, holding the other one fixed. Note that we are treating $a$ and $b$ as variables because we want to see what moving them around does to the score $S$: the lower the score, the better the fit.

If you take these two derivatives (do this for yourself on a piece of paper) and divide out the common factor of 2, these equations become:

$$
\begin{aligned}
	\sum(ax_i + b - y_i)\,x_i &= 0 \\
	\sum(ax_i + b - y_i) &= 0
\end{aligned}
$$

where we have dropped the limits on the summation signs for notational simplicity. 

Since $\sum c = c N$ for any constant $c$, these two equations simplify to 

$$
\begin{aligned}
	a\sum x_i^2 + b\sum x_i &= \sum x_i y_i \\
	a\sum x_i + bN &= \sum y_i
\end{aligned}
$$

This is a system of linear equations for the two unknowns, $a$ and $b$. In matrix notation, 
we have 
$$
\begin{aligned}
	\left( \begin{array}{cc} \sum x_i^2 & \sum x_i \\
		\sum x_i & N \end{array} \right) \left( \begin{array}{c} a \\ b \end{array}\right) 
	= \left( \begin{array}{c} \sum x_i y_i \\ \sum y_i \end{array}\right)
\end{aligned}
$$

As you know from the previous lesson, you can find the solution using numpy's `solve()` function:

    import numpy as np
    A = np.array([[a11, a12], [a21, a22]])
    r = np.array([r1, r2])
    a, b = np.linalg.solve(A, r)

with appropriate values used for the entries of `A` and `r`. 
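As a small check of the matrix recipe above, the sketch below applies it to made-up points that lie exactly on the line $y = 2x + 1$, so the expected slope and intercept are known in advance. The comparison with `np.polyfit` is an extra cross-check, not part of the derivation.

```python
import numpy as np

# Made-up points that lie exactly on y = 2x + 1, so the answer is known
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

# Matrix and right-hand side of the two linear equations for a and b
A = np.array([[np.sum(x**2), np.sum(x)],
              [np.sum(x),    len(x)]])
r = np.array([np.sum(x * y), np.sum(y)])

a, b = np.linalg.solve(A, r)
print(a, b)

# Cross-check: np.polyfit with degree 1 performs the same least-squares fit
print(np.polyfit(x, y, 1))
```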


### Other definitions of *best fit*
Another reasonable approach, which is sometimes used, is based on the "$L_1$-norm". The $L_1$ norm is 
defined by $S = \sum_{i=1}^N |f(x_i) - y_i|$. 
This is the *least absolute deviation* method of defining a best fit function. *To discuss:* can you think of a reason you might pick one method over the other?
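One way to explore the discussion question is to score two candidate lines under both definitions. The sketch below uses made-up data with a single large outlier; the helper names `s_l2` and `s_l1` are invented for this illustration.

```python
import numpy as np

# Made-up data: nine points exactly on y = 2x + 1, plus one large outlier
x = np.arange(10.0)
y = 2.0 * x + 1.0
y[-1] += 30.0

def s_l2(a, b):
    """Sum of squared differences for the line f(x) = a*x + b."""
    return np.sum((a * x + b - y) ** 2)

def s_l1(a, b):
    """Sum of absolute differences for the same line."""
    return np.sum(np.abs(a * x + b - y))

# Candidate 1: the least-squares line (np.polyfit minimizes the squared sum)
a_ls, b_ls = np.polyfit(x, y, 1)
# Candidate 2: the line through the nine points without the outlier
a_true, b_true = 2.0, 1.0

print('least-squares line:', s_l2(a_ls, b_ls), s_l1(a_ls, b_ls))
print('underlying line:   ', s_l2(a_true, b_true), s_l1(a_true, b_true))
```

In this example the least-squares line has the lower squared score but is pulled toward the outlier, while the underlying line has the lower absolute-deviation score.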

## Using least squares on a real dataset

Now, let's pretend that we didn't know the value of $g$, and we wanted to use the `freefall.data` dataset
to find it. We do know enough to expect a linear relationship between speed $s$ and time $t$, and apart from 
experimental noise, the data do seem to follow a linear trend. So our goal will be to find the straight 
line that comes closest to fitting the data. 



In [None]:
# Build the least-squares matrix and right-hand side
A = np.array([[np.sum(t**2), np.sum(t)],
              [np.sum(t),    len(t)]])
r = np.array([np.sum(t*speed), np.sum(speed)])
# Solve for the slope a and the intercept b
a, b = np.linalg.solve(A, r)
print('slope a =', a, ' intercept b =', b)
# Build the fit line
yt = np.linspace(0, t[-1], 20)
yy = a*yt + b

In [None]:
# Plot
pylab.plot(t, speed, 'o')
pylab.plot(yt, yy, '--')
pylab.xlabel('time (s)', fontsize=14)
pylab.ylabel('speed (m/s)', fontsize=14)
pylab.show()

## Least squares curve-fitting with `scipy`

As you may imagine, `scipy` has routines for this already built in.  Here we'll learn to use them.

Least squares analysis can be applied to any function $f(x)$. For example, consider the data from 
the graph of the damped oscillator we made above.
The data appear to follow a damped oscillator relationship, but we don't know the amplitude, frequency, phase, or damping coefficient. From what we know about damped oscillators, we think that the function 

$$
	f(t) = a\cos(\omega t + \phi)  \exp(-t/\tau) + B
$$

would be a good fit. We can carry out a least-squares analysis to fit the `damped_oscillation.txt` data to a function of this form, and find the best-fit parameter values $a$, $\omega$, $\phi$, $\tau$ and $B$ that minimize $S =  \sum_{i=1}^N (a\cos(\omega t_i + \phi)  \exp(-t_i/\tau) +B - y_i)^2$. 

This type of *curve fitting* is so useful that the module `scipy` includes the command `curve_fit` to carry out the 
analysis. You can import the command with 

    from scipy.optimize import curve_fit

As usual, the documentation  (and a good example) is available online: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html

To use this command, you will first need to create a definition of the function we're going to use for the fit:

    def func(t,a,omega,phi,tau,B):
        return damped_oscillation_function_goes_here
        
The first argument is the independent variable `t`, and the remaining arguments 
are the parameters (in this case, $a$, $\omega$, $\phi$, $\tau$, and $B$). 

How do we use the scipy interface?
Let's say the data are contained in arrays `xdata` and `ydata`. Then the command
    
    par, con = curve_fit(func,xdata,ydata)

will return the best-fit values for $a$, $\omega$, $\phi$, $\tau$ and $B$ in the parameter array `par`. 

*Optional:* If you are interested in the confidence you should place in these parameters, the array `con` contains 
further information about a type of error analysis that you should explore on your own (it's called the
*covariance matrix*). 
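As a starting point for that exploration, a common convention is to take the square root of each diagonal entry of the covariance matrix as a one-standard-deviation uncertainty on the corresponding parameter. The sketch below shows this on made-up noisy straight-line data; the model `line` and all data values are assumptions for illustration only.

```python
import numpy as np
from scipy.optimize import curve_fit

def line(x, a, b):
    """Straight-line model used only for this illustration."""
    return a * x + b

# Made-up noisy data around y = 2x + 1 (fixed seed so it is repeatable)
rng = np.random.default_rng(0)
xdata = np.linspace(0.0, 10.0, 50)
ydata = line(xdata, 2.0, 1.0) + rng.normal(0.0, 0.5, xdata.size)

par, con = curve_fit(line, xdata, ydata)

# Diagonal of the covariance matrix = estimated variance of each parameter
perr = np.sqrt(np.diag(con))
for name, value, err in zip(['a', 'b'], par, perr):
    print(f'{name} = {value:.3f} +/- {err:.3f}')
```

The off-diagonal entries of `con` describe how the parameter estimates are correlated with each other.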

Sometimes the `curve_fit` function does not converge to a result you expect, and you can nudge it along by
providing a starting point for the fit through the optional parameter `p0`, where you can pass in an initial guess
of parameters:

    pguess = [guess_for_a, guess_for_omega, guess_for_phi, guess_for_tau, guess_for_B]
    par, con = curve_fit(func, xdata, ydata, p0=pguess)

You'll notice this may be necessary if the fitting routine complains: _RuntimeError: Optimal parameters not found_
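Reasonable guesses can often be read off the data itself. The sketch below is one possible heuristic, not a prescribed method: it uses made-up damped-oscillation data with known parameters, estimates the offset from the mean, the amplitude from the total swing, and the angular frequency from the strongest peak of a Fourier transform. The guesses for `phi` (zero) and `tau` (half the record length) are crude assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

def func(t, a, omega, phi, tau, B):
    return a * np.cos(omega * t + phi) * np.exp(-t / tau) + B

# Made-up damped oscillation with known parameters and a little noise
rng = np.random.default_rng(1)
t = np.linspace(0.0, 10.0, 200)
y = func(t, 1.0, 3.0, 0.5, 4.0, 0.2) + rng.normal(0.0, 0.02, t.size)

# Rough guesses read off the data
B0 = y.mean()                              # offset: average value
a0 = (y.max() - y.min()) / 2               # amplitude: half the total swing
spectrum = np.abs(np.fft.rfft(y - B0))     # strength of each frequency
freqs = np.fft.rfftfreq(t.size, d=t[1] - t[0])
omega0 = 2 * np.pi * freqs[np.argmax(spectrum)]   # strongest frequency
tau0 = (t[-1] - t[0]) / 2                  # decay time: half the record length

pguess = [a0, omega0, 0.0, tau0, B0]
par, con = curve_fit(func, t, y, p0=pguess)
print(par)
```

For real data, looking at the plot and adjusting these guesses by eye is just as valid.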

## Exercise

Write a code that will import the data file `damped_oscillation.txt` (available on Google drive)
and carry out a least-squares fit 
to a function of the form just discussed, using the scipy `curve_fit` function. Plot a graph showing the data points, as well as the best fit curve. What 
values did you get for the amplitude, frequency, phase and decay constant?  Annotate these on the plot, as well as the residual: $R^2 = \sum_i (f(x_i) - y_i)^2$.

*Stretch goal (optional):* include information about how you interpreted the errors, using the covariance matrix.
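For the residual, a short sketch of the arithmetic is given here. The arrays `ydata` and `yfit` are hypothetical stand-ins with made-up values: in the exercise, `yfit` would be the fitted function evaluated at the measured times.

```python
import numpy as np

# Hypothetical stand-ins: `ydata` for the measured values, `yfit` for the
# fitted function evaluated at the same points, e.g. func(xdata, *par)
ydata = np.array([1.0, 2.1, 2.9, 4.2])
yfit = np.array([1.0, 2.0, 3.0, 4.0])

# R^2 = sum_i (f(x_i) - y_i)^2, as defined in the exercise
residual = np.sum((yfit - ydata) ** 2)

# A text label that could be placed on a plot, e.g. with pylab.text(x, y, label)
label = f'$R^2$ = {residual:.3f}'
print(label)
```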

In [None]:
from scipy.optimize import curve_fit
###################
# Main code
###################
# function to fit
def func(time, a, omega, phi, tau, B):
    return a*np.cos(omega*time + phi)*np.exp(-time/tau) + B

# the oscillation data we read earlier in this notebook
time = data[:, 0] # goes to time
theta = data[:, 1] # goes to theta
# scipy least-squares fit
par, con = curve_fit(func, time, theta)
# Output result for parameters. Make fit line
a = par[0]
omega = par[1]
phi = par[2]
tau = par[3]
B = par[4]
print(a, omega, phi, tau, B)
yt= np.linspace(time[0], time[-1], 100)
yy= func(yt, a, omega, phi, tau, B)

In [None]:
# Plot
pylab.plot(time, theta, 'x')
pylab.plot(yt, yy, '-')
pylab.xlabel('time (s)', fontsize=14)
pylab.ylabel('theta (rad)', fontsize=14)
pylab.show()

## Overfitting, Underfitting, Model Complexity

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def true_fun(X):
    return np.cos(1.5 * np.pi * X)


np.random.seed(0)

n_samples = 30
degrees = [1, 4, 15]

X = np.sort(np.random.rand(n_samples))
y = true_fun(X) + np.random.randn(n_samples) * 0.1

In [None]:
plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)
    plt.setp(ax, xticks=(), yticks=())

    polynomial_features = PolynomialFeatures(degree=degrees[i], include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline(
        [
            ("polynomial_features", polynomial_features),
            ("linear_regression", linear_regression),
        ]
    )
    pipeline.fit(X[:, np.newaxis], y)

    # Evaluate the models using crossvalidation
    scores = cross_val_score(
        pipeline, X[:, np.newaxis], y, scoring="neg_mean_squared_error", cv=10
    )

    X_test = np.linspace(0, 1, 100)
    plt.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    plt.plot(X_test, true_fun(X_test), label="True function")
    plt.scatter(X, y, edgecolor="b", s=20, label="Samples")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((0, 1))
    plt.ylim((-2, 2))
    plt.legend(loc="best")
    plt.title(
        "Degree {}\nMSE = {:.2e}(+/- {:.2e})".format(
            degrees[i], -scores.mean(), scores.std()
        )
    )
plt.show()