In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

<img src="https://raw.githubusercontent.com/harmsm/pythonic-science/master/chapters/02_regression/data/wands-away.jpg" style="margin:auto" height="50%" width="50%"/>

## Fitting models to data

<img src="https://raw.githubusercontent.com/harmsm/pythonic-science/master/chapters/02_regression/data/spreadsheet-fit.png" style="margin:auto" />

### Steps

+ Getting data into Python
+ Performing the fit
+ Plotting the fit
+ Analyzing the fit (quality, parameter estimates, etc.)

<img src="https://raw.githubusercontent.com/harmsm/pythonic-science/master/chapters/02_regression/data/dataset-csv.png" style="margin:auto" />

#### We can read files using `pandas`

In [None]:
import pandas as pd
pd.read_csv("data/test-dataset.csv")

In [None]:
d = pd.read_csv("data/test-dataset.csv") # <- capture in variable d
type(d)

In [None]:
plt.plot(d.time,d.obs,"o")
plt.xlabel("time")
plt.ylabel("observable")

#### How can we fit a model to these data?

In [None]:
import scipy.stats

In [None]:
m, b, r_value, p_value, std_err = scipy.stats.linregress(d.time,
                                                         d.obs)
print("slope:",m)
print("intercept:",b)
print("R squared:",r_value**2)
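The tuple unpacking above works, but the same call can also be read by attribute name, which makes it harder to mix up the five return values. The sketch below assumes synthetic stand-in data (the arrays `time` and `obs` and their values are made up here, not the course dataset).

```python
import numpy as np
import scipy.stats

# Synthetic stand-in for the time/obs columns (hypothetical values)
rng = np.random.default_rng(0)
time = np.linspace(0, 10, 21)
obs = 0.07 * time + 0.2 + rng.normal(0, 0.02, size=time.size)

# linregress returns a result object; fields can be read by name
result = scipy.stats.linregress(time, obs)

print("slope:", result.slope, "+/-", result.stderr)  # stderr is the slope's standard error
print("intercept:", result.intercept)
print("R squared:", result.rvalue**2)
print("p-value (null hypothesis: slope is zero):", result.pvalue)
```

Note that `stderr` describes the slope only, not the intercept.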

#### Plot it

In [None]:
plt.plot(d.time,d.obs,"o")
plt.xlabel("time")
plt.ylabel("observable")

# Two x values are enough to draw the fitted line
t_range = np.array([0,10])
plt.plot(t_range,t_range*m + b,"--")

#### Putting it together

In [None]:
d = pd.read_csv("data/test-dataset.csv")
plt.plot(d.time,d.obs,"o")

m, b, r_value, p_value, std_err = scipy.stats.linregress(d.time,d.obs)
t_range = np.array([0,10]) 
plt.plot(t_range,t_range*m + b,"--")

print("slope:",m)
print("intercept:",b)
print("R squared:",r_value**2)

## What is actually going on in a fit???

#### Let's try fitting manually

In [None]:
def y_calc(x,m,b):
    return m*x + b

In [None]:
plt.plot(d.time,d.obs,"ko")
plt.plot(d.time,y_calc(d.time,m=0.1,b=0))
plt.plot(d.time,y_calc(d.time,m=0.1,b=0.2))
plt.plot(d.time,y_calc(d.time,m=0.08,b=0.2))


#### Adjusting by eye implicitly minimizes a *residuals* function

In [None]:
def y_residuals(x_obs,y_obs,m,b):
    """Residuals: calculated y minus observed y at each x."""
    return (m*x_obs + b) - y_obs

In [None]:
def plot_figs(x_obs,y_obs,m,b,ax):
    """Convenience plotting function"""
    ax[0].plot(x_obs,y_obs,"ko")
    ax[0].plot(x_obs,y_calc(x_obs,m,b))
    ax[1].plot(x_obs,y_residuals(x_obs,y_obs,m,b))

In [None]:
fig, ax = plt.subplots(1,2,figsize=(10,5))

plot_figs(d.time,d.obs,m=0.1,b=0,ax=ax)
plot_figs(d.time,d.obs,m=0.1,b=0.2,ax=ax)
plot_figs(d.time,d.obs,m=0.08,b=0.2,ax=ax)


#### Software minimizes the sum of the squared residuals

$$ SSR = \sum_{i=0}^{N-1} (y_{calc,i} - y_{obs,i})^{2} $$

In [None]:

sum_squared_residuals = [np.sum(y_residuals(d.time,d.obs,m=0.10,b=0.0)**2),
                         np.sum(y_residuals(d.time,d.obs,m=0.10,b=0.2)**2),
                         np.sum(y_residuals(d.time,d.obs,m=0.08,b=0.2)**2),
                         np.sum(y_residuals(d.time,d.obs,m=0.0696240601504,b=0.186085714286)**2)]

plt.plot([1,2,3,4],sum_squared_residuals)
plt.xlabel("round")
plt.ylabel("SSR")

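#### Why use the sum of squared residuals?
+ It penalizes big deviations heavily.
+ It penalizes negative and positive deviations equally.
+ There are deep statistical reasons. Minimizing SSR is a *maximum likelihood* calculation. (If every point has the same normally distributed uncertainty $\sigma$, then $-\ln(L) = SSR/(2\sigma^{2}) + \text{constant}$, so the parameters that minimize SSR also maximize the likelihood $L$.)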
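The link between SSR and likelihood can be sketched in two lines. This assumes independent, normally distributed errors with one shared standard deviation $\sigma$ (a symbol introduced here for illustration). The likelihood of the data given the model is

$$ L = \prod_{i=0}^{N-1} \frac{1}{\sqrt{2\pi\sigma^{2}}} \exp\left(-\frac{(y_{calc,i} - y_{obs,i})^{2}}{2\sigma^{2}}\right) $$

Taking the negative log turns the product into a sum:

$$ -\ln(L) = \frac{N}{2}\ln(2\pi\sigma^{2}) + \frac{1}{2\sigma^{2}}\sum_{i=0}^{N-1} (y_{calc,i} - y_{obs,i})^{2} = \frac{N}{2}\ln(2\pi\sigma^{2}) + \frac{SSR}{2\sigma^{2}} $$

The first term does not depend on the fit parameters, so the parameters that minimize $SSR$ are the ones that maximize $L$. If the points have different uncertainties, this shortcut no longer holds as written.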

#### Regression ingredients
+ Experimental data
+ Residuals function
+ Way to search through "parameter space"

#### Parameter space
<img src="https://raw.githubusercontent.com/harmsm/pythonic-science/master/chapters/02_regression/data/param-space.png" style="margin:auto" />
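One brute-force way to look at parameter space is to compute SSR at every point on a grid of slopes and intercepts. The sketch below assumes synthetic stand-in data and a hypothetical helper named `ssr`; the grid ranges are arbitrary choices for this example.

```python
import numpy as np

# Synthetic stand-in data (hypothetical values, not the course dataset)
rng = np.random.default_rng(1)
x_obs = np.linspace(0, 10, 21)
y_obs = 0.07 * x_obs + 0.2 + rng.normal(0, 0.02, size=x_obs.size)

def ssr(m, b):
    """Sum of squared residuals for the line y = m*x + b."""
    return np.sum(((m * x_obs + b) - y_obs) ** 2)

# Evaluate SSR at every (m, b) pair on a coarse grid
m_values = np.linspace(0.0, 0.15, 61)
b_values = np.linspace(0.0, 0.4, 81)
surface = np.array([[ssr(m, b) for b in b_values] for m in m_values])

# Locate the grid point with the smallest SSR
i, j = np.unravel_index(np.argmin(surface), surface.shape)
print("best m on grid:", m_values[i])
print("best b on grid:", b_values[j])
print("SSR there:", surface[i, j])
```

`plt.contourf(b_values, m_values, surface)` would draw this surface as a map. A grid scan is only practical for a couple of parameters, which is why real fitting software searches parameter space more cleverly.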

### How do we search parameter space?

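One example: the Nelder-Mead simplex method. With two parameters, it moves a triangle of guesses downhill by flipping, contracting, and shrinking it.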

<img src="https://raw.githubusercontent.com/harmsm/pythonic-science/master/chapters/02_regression/data/00_flip.png" style="margin:auto" height="65%" width="65%"/>

<img src="https://raw.githubusercontent.com/harmsm/pythonic-science/master/chapters/02_regression/data/01_contract.png" style="margin:auto" height="65%" width="65%"/>

<img src="https://raw.githubusercontent.com/harmsm/pythonic-science/master/chapters/02_regression/data/02_shrink.png" style="margin:auto" height="65%" width="65%"/>

In [None]:
%%HTML
<video width="650" height="600" controls="controls"> 
  <source src="https://raw.githubusercontent.com/harmsm/pythonic-science/master/chapters/02_regression/data/simplex.mp4" type="video/mp4">
</video>
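SciPy ships a Nelder-Mead implementation through `scipy.optimize.minimize`. Below is a minimal sketch that hands it an SSR function for a straight line. The data are synthetic stand-ins, and the function name `ssr` and the starting guess are assumptions made for this example.

```python
import numpy as np
import scipy.optimize

# Synthetic stand-in data (hypothetical values, not the course dataset)
rng = np.random.default_rng(2)
x_obs = np.linspace(0, 10, 21)
y_obs = 0.07 * x_obs + 0.2 + rng.normal(0, 0.02, size=x_obs.size)

def ssr(params):
    """Sum of squared residuals for the line y = m*x + b."""
    m, b = params
    return np.sum(((m * x_obs + b) - y_obs) ** 2)

# Start from a deliberately rough guess and let the simplex walk downhill
fit = scipy.optimize.minimize(ssr, x0=[0.1, 0.1], method="Nelder-Mead")

print("converged:", fit.success)
print("m, b:", fit.x)
print("SSR:", fit.fun)
print("function evaluations:", fit.nfev)
```

Nelder-Mead needs only function values, not derivatives, which makes it easy to apply but often slower than gradient-based methods.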

### When does this fail?

+ If there is no minimum (or an infinitely long valley)
+ If there are multiple minima
+ If the function has infinities
+ If the initial guesses are so bad that the search cannot reach the minimum
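The multiple-minima problem is easy to reproduce. The sketch below uses a made-up one-parameter curve, `toy_ssr`, as a stand-in for an SSR surface with two valleys (it is not derived from any real model), and runs the same Nelder-Mead search from two starting guesses.

```python
import scipy.optimize

def toy_ssr(params):
    """Hypothetical one-parameter 'SSR' curve with two minima."""
    p = params[0]
    return p**4 - 3 * p**2 + p + 4

# Same function, same method, two different starting guesses
fit_from_right = scipy.optimize.minimize(toy_ssr, x0=[2.0], method="Nelder-Mead")
fit_from_left = scipy.optimize.minimize(toy_ssr, x0=[-2.0], method="Nelder-Mead")

print("start at  2.0 -> p =", fit_from_right.x[0], "SSR =", fit_from_right.fun)
print("start at -2.0 -> p =", fit_from_left.x[0], "SSR =", fit_from_left.fun)
```

Both runs report success, but the search started at 2.0 settles in the shallower valley near p = 1.13, while the search started at -2.0 finds the deeper one near p = -1.30. Nothing in a single result reveals that a better minimum exists.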

### Next class:

+ Implementing residual functions
+ `scipy.optimize.least_squares`
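As a preview, here is a minimal sketch of how a residuals function plugs into `scipy.optimize.least_squares`. The data are synthetic stand-ins, and the function name `residuals` and the starting guess are assumptions made for this example.

```python
import numpy as np
import scipy.optimize

# Synthetic stand-in data (hypothetical values, not the course dataset)
rng = np.random.default_rng(3)
x_obs = np.linspace(0, 10, 21)
y_obs = 0.07 * x_obs + 0.2 + rng.normal(0, 0.02, size=x_obs.size)

def residuals(params, x, y):
    """Calculated minus observed y for the line y = m*x + b."""
    m, b = params
    return (m * x + b) - y

# least_squares takes the residuals vector; it squares and sums internally
fit = scipy.optimize.least_squares(residuals, x0=[0.1, 0.0], args=(x_obs, y_obs))

print("converged:", fit.success)
print("m, b:", fit.x)
print("SSR:", np.sum(fit.fun**2))  # fit.fun holds the residuals at the solution
```

The function passed in returns the residuals themselves, not their squared sum. `fit.cost` reports half of the SSR.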

# Do
+ Choose the model that fits the data best with the fewest parameters
+ Check your residuals for randomness
+ Use a biologically/physically informed model with independently testable/interpretable parameters  

# Don't
+ Transform your data (e.g. take a log before fitting). Most regression approaches assume measurement uncertainty is normally distributed, and a nonlinear transform distorts that distribution.
+ Try only one set of parameter guesses
+ Overfit your data (e.g. fit a model with more parameters than observations)