# Data Analysis and Curve Fitting
## Lecture 13

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import optimize

## Best fits with polynomials

### Example 1

Suppose an experiment measuring the height $y$ and time $t$ of a ball being thrown upwards is performed and the following data is generated:

In [None]:
df = pd.DataFrame([[0,1.302], [0.03333,1.411],[0.06667,1.5],[0.1,1.578],
                    [0.1333,1.646],[0.1667,1.703],[0.2,1.745],[0.2333,1.781],
                    [0.2667,1.807],[0.3,1.828],[0.3333,1.818],[0.3667,1.818],
                    [0.4,1.807],[0.4333,1.776],[0.4667,1.734],[0.5,1.682],
                    [0.5333,1.63],[0.567,1.552],[0.6,1.469],[0.6333,1.37],
                    [0.667,1.266],[0.7,1.151],[0.733,1.026],[0.7667,0.875],
                    [0.8,0.719],[0.8333,0.557],[0.867,0.385],[0.9,0.193],
                    [0.9333,0.005]], columns=['t', 'y'])

This data is stored as a Pandas DataFrame, `df`.

In [None]:
df

We can plot the data

In [None]:
plt.plot(df.t, df.y, '.')
plt.xlabel('Time (s)')
plt.ylabel('Position (m)')
plt.title('Ball thrown upward')
plt.show()

In experimental physics, curve fitting is an important statistical tool for analyzing data and quantifying correlations between variables.
The command `np.polyfit` finds the coefficients of a polynomial by doing a best fit, in the least-squares sense, of the polynomial to a set of data.

For example, we can fit a quadratic to the ball data like this:

In [None]:
np.polyfit(df.t, df.y, 2)

The third argument is the degree of the polynomial fit; for a quadratic the degree is 2. Notice that the function `np.polyfit` returns an array of three numbers.

These are the coefficients of a polynomial, ordered from the highest power to the lowest:

$$P(t) = a t^2 + b t + c$$ 
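As a quick check of the coefficient order, here is a minimal sketch on made-up, noise-free data. The names `t_demo` and `y_demo` and the coefficients -4.9, 3.0, and 1.3 are illustrative assumptions, not the ball data.

```python
import numpy as np

# Illustrative (made-up) quadratic: y = -4.9 t^2 + 3.0 t + 1.3
t_demo = np.linspace(0, 1, 20)
y_demo = -4.9 * t_demo**2 + 3.0 * t_demo + 1.3

# polyfit returns the coefficients from the highest power to the lowest
coeffs = np.polyfit(t_demo, y_demo, 2)
a_demo, b_demo, c_demo = coeffs
print(a_demo, b_demo, c_demo)
```

Because the data has no noise, the fit should recover the three coefficients we put in, in the order $a$, $b$, $c$.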

We could write:

In [None]:
a, b, c = np.polyfit(df.t, df.y, 2)

Then

In [None]:
y_fit = a*df.t**2 + b*df.t + c

and finally compare the best fit curve to the original data

In [None]:
plt.plot(df.t, df.y, '.')
plt.plot(df.t, y_fit, '-') 
plt.xlabel('Time (s)')
plt.ylabel('Position (m)')
plt.title('Ball thrown upward')
plt.show()

Since evaluating a polynomial is a very common operation, there is a function called `np.polyval` that does it for us:

In [None]:
p = np.polyfit(df.t, df.y, 2)
y_fit = np.polyval(p, df.t)

which gives exactly the same thing

In [None]:
plt.plot(df.t, df.y, '.')
plt.plot(df.t, y_fit, '-') 
plt.xlabel('Time (s)')
plt.ylabel('Position (m)')
plt.title('Ball thrown upward')
plt.show()

### Example 2

Here's another example with a linear fit and a set of artificial data

In [None]:
x = np.arange(0, 5, 0.5)

# make y a straight line
m = 2
b = 3
y = m*x + b

# add some artificial noise
noise = np.random.normal(0, 1.5, size=len(x))
y = y + noise

df = pd.DataFrame( {'x': x, 'y': y})

In [None]:
df

Fit the data to a degree-one polynomial, that is, a straight line.

In [None]:
p = np.polyfit(df.x, df.y, 1)
y_fit = np.polyval(p, df.x)

Plot the data and best fit line together in the same plot.

In [None]:
plt.plot(df.x, df.y, 'o')
plt.plot(df.x, y_fit, '-') 
plt.xlabel('x')
plt.ylabel('y')
plt.show()

In [None]:
p

## Least-squares fit

Discussion...

## Nonlinear Curve Fitting

In the section above, we fit the parameters of a polynomial. Under the hood, this is typically done by solving a linear system of equations for the parameters. Let's bring up our small data set from Example 2 and consider our best fit again.

In [None]:
np.polyfit(df.x, df.y, 1)
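The linear system mentioned above can be sketched as follows, on a small made-up data set (`x_demo`, `y_demo`, and `design` are illustrative names). Each column of the design matrix holds one power of $x$, and `np.linalg.lstsq` solves for the coefficients in the least-squares sense. This illustrates the idea only; it is not necessarily the exact routine `np.polyfit` runs internally.

```python
import numpy as np

# Made-up, roughly linear data
x_demo = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_demo = np.array([3.1, 4.9, 7.2, 8.8, 11.1])

# Design matrix with columns [x, 1], so that design @ [m, b] approximates y
design = np.vander(x_demo, 2)

# Least-squares solution of the overdetermined system
coeffs_lstsq, *_ = np.linalg.lstsq(design, y_demo, rcond=None)

# The same fit through polyfit, for comparison
coeffs_polyfit = np.polyfit(x_demo, y_demo, 1)
print(coeffs_lstsq, coeffs_polyfit)
```

Both approaches should print the same slope and intercept, up to rounding.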

We can also fit our data to functions that are not necessarily polynomials. The `curve_fit` function from the `scipy.optimize` subpackage is useful here.

In [None]:
from scipy import optimize

To use `curve_fit` we need to first define a function for the model we want to fit.

To understand what this function does, let's use it to fit a function that just happens to be a linear function and show that we get the same coefficients as `np.polyfit` gave us.

In [None]:
def line_func(x, m, b):
    return m*x + b

popt, pcov = optimize.curve_fit(line_func, df.x, df.y)

What is returned is the array of best-fit parameters, followed by their covariance matrix.

In [None]:
popt

This covariance matrix can be used to estimate confidence intervals for the parameters. From the documentation for `curve_fit()`:

*To compute one standard deviation errors
    on the parameters use* ``perr = np.sqrt(np.diag(pcov))``.


In [None]:
print(pcov)

In [None]:
perr = np.sqrt(np.diag(pcov))

In [None]:
perr

This gives us estimates of the slope and intercept, each to within one standard deviation.

In [None]:
print(f"m = {popt[0]:.2f} ± {perr[0]:.2f}")
print(f"b = {popt[1]:.2f} ± {perr[1]:.2f}")

The best-fit line itself is exactly the same:

In [None]:
x_fit = np.linspace(min(df.x), max(df.x), 100)
y_fit = line_func(x_fit, *popt)

plt.plot(df.x, df.y, 'o')
plt.plot(x_fit, y_fit, '-') 

plt.xlabel('x')
plt.ylabel('y')
plt.show()

which is, not surprisingly, what we had seen before.



### A non-linear function

Consider now the following artificial data set:

In [None]:
x = np.arange(0, 8, 0.1)

A = 3
μ = 1
σ = 1.5
y = A*np.exp(-(x-μ)**2/(2*σ**2))

# add some artificial noise
noise = np.random.normal(0, 0.1, size=len(x))
y = y + noise

df = pd.DataFrame( {'x': x, 'y': y})

This data is a little more interesting.

In [None]:
plt.plot(df.x, df.y, 'o')
plt.xlabel('x')
plt.ylabel('y')
plt.show()

We can try a linear, quadratic, or cubic fit

In [None]:
plt.plot(df.x, df.y, 'o')

for n in range(1,4):
    p = np.polyfit(df.x, df.y, n)
    y_fit = np.polyval(p, x)
    plt.plot(df.x, y_fit, 
             label=f"n={n}")

plt.xlabel('x')
plt.ylabel('y')
plt.legend()
plt.show()

But none of them look especially good.  That is because this data is not a polynomial! It is a Gaussian.

In [None]:
def gaussian_func(x, a, μ, σ):
    return a * np.exp(-(x-μ)**2 / (2*σ**2))

In [None]:
xs = np.arange(-2.5, 2.5, 0.1)
ys = gaussian_func(xs, 1, 0, 1)
plt.plot(xs, ys, color='orange')
plt.show()

Our goal is to find the amplitude, $A$, the centre, $\mu$, and the standard deviation, $\sigma$, for our original data.

In [None]:
plt.plot(df.x, df.y, 'o')
plt.plot(xs, ys)

plt.xlabel('x')
plt.ylabel('y')

plt.show()

We can use `curve_fit` to find the parameters $A, \mu,$ and $\sigma$ for us.

In [None]:
popt, pcov = optimize.curve_fit(gaussian_func, df.x, df.y)

In [None]:
popt

In [None]:
perr = np.sqrt(np.diag(pcov))
perr

So this is telling us that

In [None]:
print(f"A = {popt[0]:.2f} ± {perr[0]:.2f}")
print(f"μ = {popt[1]:.2f} ± {perr[1]:.2f}")
print(f"σ = {popt[2]:.2f} ± {perr[2]:.2f}")

In [None]:
x_fit = np.linspace(min(df.x), max(df.x), 100)
y_fit = gaussian_func(x_fit, *popt)

plt.plot(df.x, df.y, 'o')
plt.plot(x_fit, y_fit, '-') 

plt.xlabel('x')
plt.ylabel('y')
plt.show()


### Initial guesses

Note that in calling `curve_fit`, you are able to provide initial guesses for the parameters through the `p0` argument; if it is omitted, every parameter starts at 1. In general it is difficult to find the global best fit in the nonlinear case. Rather, routines find a best fit near the initial guess, so different initial guesses may yield different fit parameters. We won't go any deeper into the methods of finding parameters for nonlinear fits.
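As a minimal sketch of supplying initial guesses, the example below fits a made-up noisy Gaussian. The names `gaussian_demo`, `x_demo`, and `y_demo`, the true values 3.0, 5.0, and 0.8, and the rule for choosing the guesses are all illustrative assumptions. The guesses are simply read off the data: the peak height for the amplitude and the location of the peak for the centre.

```python
import numpy as np
from scipy import optimize

def gaussian_demo(x, a, mu, sigma):
    return a * np.exp(-(x - mu)**2 / (2 * sigma**2))

# Made-up data: a Gaussian centred well away from the default guess of 1
rng = np.random.default_rng(0)
x_demo = np.arange(0, 8, 0.1)
y_demo = gaussian_demo(x_demo, 3.0, 5.0, 0.8) + rng.normal(0, 0.1, size=len(x_demo))

# Rough initial guesses: [amplitude, centre, width]
p0 = [y_demo.max(), x_demo[np.argmax(y_demo)], 1.0]

popt_demo, pcov_demo = optimize.curve_fit(gaussian_demo, x_demo, y_demo, p0=p0)
print(popt_demo)
```

Starting close to the answer makes it much more likely that the routine converges to the fit we want.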


### Best fits vs interpolation

The idea of a best fit of a curve is different from an interpolation of data: an interpolating curve passes through every data point, while a best-fit curve generally does not. For interpolation, we could use the `interpolate.interp1d` function from scipy. For comparison,

In [None]:
from scipy import interpolate

x = [1, 2, 4, 5, 6, 7, 8]
y = [-1, 5, 14, 12, 19, 21, 25]
fig, axs = plt.subplots(1,2, figsize=(12,6))
x_fit = np.linspace(min(x), max(x), 100)

# linear interpolation vs linear fit
axs[0].plot(x, y, 'o') # plot the data

p = np.polyfit(x, y, 1)
y_fit = np.polyval(p, x_fit)
axs[0].plot(x_fit, y_fit, label='best fit')

interp = interpolate.interp1d(x, y, kind='linear')
axs[0].plot(x_fit, interp(x_fit), label='interpolation') 

axs[0].set_xlabel('x')
axs[0].set_ylabel('y')
axs[0].set_title('Linear interpolation vs linear fit')
axs[0].legend(loc='lower right')

# cubic interpolation vs cubic fit  ########
axs[1].plot(x, y, 'o') # plot the data

p = np.polyfit(x, y, 3)
y_fit = np.polyval(p, x_fit)
axs[1].plot(x_fit, y_fit, label='best fit')

interp = interpolate.interp1d(x, y, kind='cubic')
axs[1].plot(x_fit, interp(x_fit), label='interpolation')
axs[1].set_xlabel('x')
axs[1].set_ylabel('y')
axs[1].set_title('Cubic interpolation vs cubic fit')
axs[1].legend(loc='lower right')

plt.show()
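The difference can also be checked numerically. The sketch below reuses the same seven points (stored under the illustrative names `x_demo` and `y_demo`) and compares the residuals at the data points for a linear interpolation and a linear fit.

```python
import numpy as np
from scipy import interpolate

x_demo = np.array([1, 2, 4, 5, 6, 7, 8])
y_demo = np.array([-1, 5, 14, 12, 19, 21, 25])

interp_demo = interpolate.interp1d(x_demo, y_demo, kind='linear')
p_demo = np.polyfit(x_demo, y_demo, 1)

# Residuals evaluated at the data points themselves
interp_resid = y_demo - interp_demo(x_demo)
fit_resid = y_demo - np.polyval(p_demo, x_demo)

print(np.abs(interp_resid).max())  # interpolation reproduces the data
print(np.abs(fit_resid).max())     # the fit line misses some points
```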

## Application of Fitting

There is a data file that you should download along with this exercise called `falling_object.csv` (the code below reads it from the `data/` folder) that has two columns.  The first column contains time (s) and the second contains height (m).

a) Load the data into a data frame and plot the height as a function of time.

In [None]:
df = pd.read_csv('data/falling_object.csv')

In [None]:
df.head()

In [None]:
plt.plot(df.t, df.y, '.')
plt.xlabel('time (s)')
plt.ylabel('height (m)')
plt.title('Falling Object')
plt.show()

b) Using the centred difference scheme, calculate and plot the velocity (but do not interpolate) as a function of time.  You should see the velocity approach a terminal value.

#### Centered scheme
$$\frac{df}{dt}(t_0) \approx \frac{f(t_0+\Delta t)-f(t_0-\Delta t)}{2\Delta t}= \frac{y_{i+1} - y_{i-1}}{t_{i+1} - t_{i-1}}$$

In [None]:
df['v'] = np.nan

for i in range(1, len(df) - 1):
    df.loc[i, 'v'] = (df.y[i+1] - df.y[i-1]) / (df.t[i+1] - df.t[i-1])
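The same centred difference can be written without a loop by using array slices. This is a sketch on made-up free-fall data (`t_demo` and `y_demo` are illustrative, not the contents of the data file); the end points are left as `NaN` because they have no neighbour on one side.

```python
import numpy as np

# Made-up heights for an object dropped from 100 m with no drag
t_demo = np.linspace(0, 2, 21)
y_demo = 100 - 4.9 * t_demo**2

# Start with NaN everywhere, then fill in the interior points
v_demo = np.full_like(t_demo, np.nan)
v_demo[1:-1] = (y_demo[2:] - y_demo[:-2]) / (t_demo[2:] - t_demo[:-2])
print(v_demo)
```

For this quadratic test case the centred difference reproduces the exact velocity, $-9.8\,t$, at every interior point.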

In [None]:
plt.plot(df.t, df.v, '.')
plt.xlabel('time (s)')
plt.ylabel('velocity (m/s)')
plt.title('Falling Object')
plt.show()

c) Calculate and plot the acceleration of the falling object directly from the height data by using the centred scheme for the second derivative.  You should see the acceleration approach zero.  The acceleration graph looks noisy. This is because the original measurements contain some uncertainty and random noise. This noise gets amplified by taking derivatives.

#### Centered scheme for 2nd order derivative
$$\frac{d^2 f}{dt^2}(t_0) \approx \frac{f(t_0+\Delta t)-2f(t_0)+f(t_0-\Delta t)}{(\Delta t)^2}=\frac{y_{i+1} - 2y_i + y_{i-1}}{(t_{i+1} - t_{i})^2} $$


In [None]:
df['a'] = np.nan

for i in range(1, len(df) - 1):
    df.loc[i, 'a'] = (df.y[i+1] - 2*df.y[i] + df.y[i-1]) / (df.t[i+1]-df.t[i])**2
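To see how strongly the second difference amplifies noise, here is a small sketch on made-up data. The time step of 0.1 s, the 1 cm noise level, and the helper name `second_diff` are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
dt = 0.1
t_demo = np.arange(0, 5, dt)

# Made-up drag-free fall, with and without 1 cm of measurement noise
y_clean = 100 - 4.9 * t_demo**2
y_noisy = y_clean + rng.normal(0, 0.01, size=len(t_demo))

def second_diff(y, dt):
    """Centred second difference on a uniform grid (interior points only)."""
    return (y[2:] - 2*y[1:-1] + y[:-2]) / dt**2

a_clean = second_diff(y_clean, dt)
a_noisy = second_diff(y_noisy, dt)

print(a_clean.std(), a_noisy.std())
```

The clean data gives a constant acceleration of $-9.8$ m/s$^2$, while the noisy data scatters by a few m/s$^2$ even though the height noise is only 0.01 m, because the noise is divided by $(\Delta t)^2$.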

In [None]:
plt.plot(df.t, df.a, '.')
plt.xlabel('time (s)')
plt.ylabel('acceleration (m/s$^2$)')
plt.title('Falling Object')
plt.show()

d)  Let's assume that the object experiences a drag force of $F= - b v$.  We will attempt to find $b$.  As usual, we'll begin with Newton's 2nd illustrious Law

\begin{align}
F&=ma \\
ma &= -bv -mg
\end{align}
 
 Aha!

If we plot the quantity $ma$ as a function of $-v$, the graph should be a straight line (with noise) with intercept $-mg$ and having slope equal to $b$.

Make a plot of $m a$ vs $-v$.

(The mass of the object is 0.2 kg, while the acceleration due to gravity is approximately 9.81 m/s$^2$.  These values should jibe with our intercept value.)

In [None]:
m = 0.2 # kg
g = 9.81 # m/s^2

In [None]:
plt.plot(-df.v, m*df.a, 'o')
plt.xlabel('-v (m/s)')
plt.ylabel('m a (kg m/s$^2$)')
plt.title('Falling Object')
plt.show()

e) Make a linear fit for the data in part (d) using the `np.polyfit` command.

When we try and use `polyfit` we encounter a problem...

In [None]:
p = np.polyfit(-df.v, m*df.a, 1)
print(p)

The issue is that those `NaN`s are not valid numbers for a least-squares regression.

In [None]:
df.head()

One solution would be to remove all of the non-numbers from the dataframe.

In [None]:
df = df.dropna()
df.head()
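The same idea works on plain NumPy arrays by building a boolean mask with `np.isfinite`. This is a sketch on made-up arrays (`x_demo` and `y_demo` are illustrative), offered as an alternative when the data is not in a DataFrame.

```python
import numpy as np

# Made-up data with NaN at both ends, like the centred-difference columns
x_demo = np.array([np.nan, 1.0, 2.0, 3.0, np.nan])
y_demo = np.array([np.nan, 2.1, 3.9, 6.0, np.nan])

# Keep only the rows where both values are finite numbers
mask = np.isfinite(x_demo) & np.isfinite(y_demo)

p_demo = np.polyfit(x_demo[mask], y_demo[mask], 1)
print(p_demo)
```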

In [None]:
p = np.polyfit(-df.v, m*df.a, 1)

print(f"{p[0]:.3f}, {p[1]:.3f}")

Another option would be to use a forward and backward difference to estimate the velocity and acceleration at the end points.
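The end-point idea can be sketched for the velocity as follows, again on made-up free-fall data (`t_demo` and `y_demo` are illustrative). The interior uses the centred difference, the first point a forward difference, and the last point a backward difference. The one-sided estimates are only first-order accurate, so they are somewhat less precise than the interior values.

```python
import numpy as np

t_demo = np.linspace(0, 2, 21)
y_demo = 100 - 4.9 * t_demo**2

v_demo = np.empty_like(t_demo)

# Centred difference at the interior points
v_demo[1:-1] = (y_demo[2:] - y_demo[:-2]) / (t_demo[2:] - t_demo[:-2])

# Forward difference at the first point
v_demo[0] = (y_demo[1] - y_demo[0]) / (t_demo[1] - t_demo[0])

# Backward difference at the last point
v_demo[-1] = (y_demo[-1] - y_demo[-2]) / (t_demo[-1] - t_demo[-2])

print(v_demo[:3], v_demo[-3:])
```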

f) Show both the fit line and the discrete data in a single graph.

In [None]:
plt.plot(-df.v, m*df.a, 'o')
plt.plot(-df.v, np.polyval(p, -df.v))

plt.xlabel('-v (m/s)')
plt.ylabel('m a (kg m/s$^2$)')
plt.title('Falling Object')
plt.show()

g) Consider the parameters of the best fit

In [None]:
print(f"{p[0]:.3f}, {p[1]:.3f}")

The slope of this line is $b$ (units of kg/s)

In [None]:
b = p[0]
print(f"{b:.3f}")

And the intercept is $-mg$ (units of kg m /s$^2$)

In [None]:
print(f"{p[1]:.3f}")

Let's compare the intercept value with what we expect it to be ($-mg$).


In [None]:
print(f"{-m*g:.3f}")

Golden!

h) Newton's equation $m a = -b v - m g$ can be solved analytically for the height as a function of time.

In [None]:
import sympy as sym
sym.init_printing()

b, g, m, t = sym.symbols('b g m t')
y = sym.Function('y')
expr = sym.Eq(m*y(t).diff(t,2), -b*y(t).diff(t) -m*g)
expr

In [None]:
sym.dsolve(expr)

j)  Use  `optimize.curve_fit` to find the values of $b, g, C_1,$ and $C_2$. We can use the given value for $m = 0.2$ kg.

In [None]:
m = 0.2

def model1(t, b, g, C1, C2):
    return C1 + C2*np.exp(-b/m*t) - g*m/b*t

popt1, pcov = optimize.curve_fit(model1, df.t, df.y)

print([f'{p:.3f}' for p in popt1])

The first parameter is the drag coefficient $b$ and the second is the acceleration due to gravity $g$. It is not immediately obvious how $C_1$ and $C_2$ relate to the initial height and initial velocity.
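A short derivation makes the connection explicit. It assumes the general solution has the form used in `model1`, and writes $y_0$ and $v_0$ for the height and velocity at $t = 0$:

\begin{align}
y(t) &= C_1 + C_2 e^{-bt/m} - \frac{g m}{b} t \\
y(0) &= C_1 + C_2 = y_0 \\
\frac{dy}{dt}(0) &= -\frac{b}{m} C_2 - \frac{g m}{b} = v_0 \\
C_2 &= -\frac{m (b v_0 + g m)}{b^2} \\
C_1 &= y_0 + \frac{m (b v_0 + g m)}{b^2}
\end{align}

So $C_1$ and $C_2$ each mix the initial height, the initial velocity, and the drag coefficient, which is why they are hard to interpret directly.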


To solve this analytically, we can also provide initial conditions.

In [None]:
y0, v0 = sym.symbols('y0 v0')
ics = { y(t).subs(t, 0): y0, 
        y(t).diff(t).subs(t, 0): v0} 

In [None]:
sym.dsolve(expr, ics=ics )

In [None]:
def model2(t, b, g, y0, v0):
    return y0 - g*m*t/b + m*v0/b + g*m**2/b**2 - m*(b*v0 + g*m)*np.exp(-b*t/m)/b**2

popt2, pcov = optimize.curve_fit(model2, df.t, df.y)

print([f'{p:.3f}' for p in popt2])

This makes the interpretation of the coefficients much clearer. In this problem, $ y_0 = 100$ m and $v_0 = 0.0$ m/s.

k)  Now plot the fitting functions over the data.  Does it seem like our model ($F=-b v - m g$) describes the data?

In [None]:
fit_y1 = model1(df.t, *popt1)
fit_y2 = model2(df.t, *popt2)
    
plt.plot(df.t, df.y, '.', label='Data')
plt.plot(df.t, fit_y1, '-', label='Model1') 
plt.plot(df.t, fit_y2, '-', label='Model2') 
plt.xlabel('time (s)')
plt.ylabel('height (m)')
plt.title('Falling Object')
plt.legend()

plt.show()

In [None]:
fit_y1 = model1(df.t, *popt1)
fit_y2 = model2(df.t, *popt2)
    
plt.plot(df.t, df.y, '.', label='Data')
plt.plot(df.t, fit_y1, '-', label='Model1') 
plt.plot(df.t, fit_y2, '-', label='Model2') 
plt.xlabel('time (s)')
plt.ylabel('height (m)')
plt.title('Falling Object')
plt.xlim(7, 8)
plt.ylim(0, 20)
plt.legend()

plt.show()