### Week 05 ###
__March 7, 2017__

* [Multivariate regression](#Multivariate-regression)
* [Example: modeling aragonite saturation state](#Example:-modeling-aragonite-saturation-state)
* [Matrix algebra](#Matrix-Multiplication-Review)
* [Dates and times](#Dates-and-times-in-computing)

### Error propagation review 


Example: 
Given a 5% error in the diameter of a sphere (and therefore a 5% relative error in the radius)<br>
__What is the % error in calculating volume $ V = \frac{4}{3}\pi r^3 $?__

Multiplying by a constant does not change the relative error<br>
$\sigma_V =\frac{4}{3}\pi \sigma_{r^3}$, so $\frac{\sigma_V}{V} = \frac{\sigma_{r^3}}{r^3}$

Need to find the relative error in cubing the radius, use the power rule<br>
$\frac{\sigma_{r^3}}{r^3} = 3(\frac{\sigma_{r}}{r})$

$=3(0.05)$
__$=0.15$__ 
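
As a quick numerical check of the result above, the power-rule estimate can be compared with a direct recalculation of the volume after perturbing the radius by 5%. This is only a sketch: the radius value is arbitrary and chosen for illustration. The direct calculation comes out slightly larger because the power rule is a first-order approximation.

```python
import numpy as np

def sphere_volume(r):
    """Volume of a sphere with radius r."""
    return 4.0 / 3.0 * np.pi * r**3

r = 2.0            # arbitrary radius, chosen only for illustration
rel_err_r = 0.05   # 5% relative error in the radius (same as in the diameter)

# First-order (power rule) estimate: relative error is multiplied by the exponent
rel_err_power_rule = 3 * rel_err_r

# Direct check: perturb the radius by 5% and recompute the volume
v0 = sphere_volume(r)
rel_err_direct = (sphere_volume(r * (1 + rel_err_r)) - v0) / v0

print(rel_err_power_rule)  # first-order estimate
print(rel_err_direct)      # slightly larger: includes higher-order terms
```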


## Multivariate regression ##

__Goal:__ Find a relationship that explains variable $y$ in terms of $k$ predictor variables, $x_1, x_2, x_3, ... x_k$

<img src='images/mult_reg_viz.png' width="600">
source: [http://www.sjsu.edu/faculty/gerstman/EpiInfo/cont-mult.htm](http://www.sjsu.edu/faculty/gerstman/EpiInfo/cont-mult.htm)

This three-dimensional visualization shows how a linear model based on two predictor variables, $x_1$ and $x_2$, can be used to model a response variable $y$. A constant and two slopes define a 2D plane in 3D space. The sum of squared vertical distances between the plane (model) and observations of $y$ is minimized. Like fitting a line in 2D space, this procedure assumes the validity of a linear model.

#### Example: Predicting beach volume from environmental variables

Beach volume is measured periodically with a terrestrial laser scanner. Can beach volume be predicted with common, continuously observed variables such as wind, wave height and river outflow? 

$y = b_0 + b_1x_1 + b_2x_2 + b_3x_3$

where $y$ = beach volume

$x_1$ = wind

$x_2$ = wave height

$x_3$ = river outflow

$k$ = 3 variables

$N$ = 100 samples

With three predictor variables, the model lives in four-dimensional space, which is hard to visualize. However, the same principles are involved as fitting a 1D line in 2D space, or a 2D plane in 3D space.

#### Equations for linear model

A linear model can be represented as a system of $N$ equations.

$ y_1 = b_0 + b_1x_{11} + b_2x_{12} + b_3x_{13} + \epsilon_1 = \hat{y}_1 + \epsilon_1 $<br>
$ y_2 = b_0 + b_1x_{21} + b_2x_{22} + b_3x_{23} + \epsilon_2 = \hat{y}_2 + \epsilon_2 $<br>
$ y_3 = b_0 + b_1x_{31} + b_2x_{32} + b_3x_{33} + \epsilon_3 = \hat{y}_3 + \epsilon_3 $<br>
... <br>
$ y_N = b_0 + b_1x_{N1} + b_2x_{N2} + b_3x_{N3} + \epsilon_N = \hat{y}_N + \epsilon_N $<br>

where $\hat{y}_i$ is a modeled value and $\epsilon_i$ is the difference between the modeled value, $\hat{y}_i$ and an observation $y_i$. The least squares regression minimizes the sum of $\epsilon_i^2$, the overall deviation between the linear model and data.

#### Matrix form

To solve a least squares problem numerically, it helps to write the system of equations for the model in matrix form.

Form a vector $Y$ of $N$ observations of beach volume.

$ Y = \begin{bmatrix}
        y_1 \\
        y_2 \\
        y_3 \\
        \vdots \\
        y_N \\
        \end{bmatrix} $

A vector $B$ contains $k+1$ unknown coefficients.

$ B = \begin{bmatrix}
        b_0 \\
        b_1 \\
        b_2 \\
        b_3 \\
        \end{bmatrix} $

The predictor variables are stored as columns in a matrix with $N$ rows and $k+1$ columns. The first column of ones multiplies the constant $b_0$.

$
X =     \begin{bmatrix}
        1 & x_{11} & x_{12} & x_{13} \\
        1 & x_{21} & x_{22} & x_{23} \\
        1 & x_{31} & x_{32} & x_{33} \\
        \vdots & \vdots & \vdots     & \vdots \\
        1 & x_{N1} & x_{N2} & x_{N3} \\
        \end{bmatrix}
      $

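Now the system of equations for the linear model can be written as

$ X B = \hat{Y} $

where $\hat{Y}$ is the vector of modeled values. The least squares solution is the $B$ that brings $XB$ as close as possible to the observations $Y$.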
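
The matrix $X$ can be assembled in NumPy by stacking a column of ones next to the predictor columns. The sketch below uses made-up values and only five samples for brevity; the variable names are illustrative, not from a real data set.

```python
import numpy as np

# Hypothetical observations of the three predictors (N = 5 samples for brevity)
wind = np.array([3.1, 5.4, 2.2, 7.8, 4.0])
wave_height = np.array([0.8, 1.5, 0.6, 2.3, 1.1])
river_outflow = np.array([120.0, 95.0, 150.0, 80.0, 110.0])

# Leading column of ones multiplies the constant b_0
ones = np.ones_like(wind)
X = np.column_stack([ones, wind, wave_height, river_outflow])

print(X.shape)  # (N, k+1)
```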


#### Numerical solution

The least squares problem is solved using a singular value decomposition method. Efficient algorithms for this procedure are typically included in scientific computing software. In Python, create an array for the vector $Y$ and a 2D array for the matrix $X$. Then use `np.linalg.lstsq` to solve for $B$.

```python
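import numpy as np
# X is the N x (k+1) matrix and Y the length-N vector defined above
# lstsq returns a tuple; its first element holds the coefficients B
B, residuals, rank, sing_vals = np.linalg.lstsq(X, Y, rcond=None)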
```
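
To see the whole procedure run end to end, here is a minimal sketch with synthetic data. The "true" coefficients and the noise level are invented for illustration, so the fitted values should land close to them but not exactly on them.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the example is repeatable
N, k = 100, 3

# Synthetic predictors and "true" coefficients (made up for illustration)
predictors = rng.normal(size=(N, k))
B_true = np.array([10.0, 2.0, -1.5, 0.5])

X = np.column_stack([np.ones(N), predictors])   # N x (k+1) design matrix
Y = X @ B_true + rng.normal(scale=0.1, size=N)  # observations with small noise

# lstsq returns (coefficients, residual sum of squares, rank, singular values)
B, residuals, rank, sing_vals = np.linalg.lstsq(X, Y, rcond=None)

Y_hat = X @ B    # modeled values
eps = Y - Y_hat  # differences between observations and model
print(B)
```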

#### Testing for significance

##### F test 

Similar to ANOVA significance calculation, which also involves ratios of squared values.

$ \hat{y} = b_0 + b_1x_1 + b_2x_2 + b_3x_3 $

$H_0 : \hat{y} = b_0$ (All non-constant coefficients are zero)

$H_1 :$ At least one coefficient is non-zero

#### Total sum of squares
$ SST =\sum_{j=1}^N{(y_j - \bar{y})^2} $

#### Regression sum of squares

$ SSR =\sum_{j=1}^N{(\hat{y}_j - \bar{y})^2} $

where $\hat{y}_j$ are model values


#### Error sum of squares

$ SSE =\sum_{j=1}^N{(y_j - \hat{y}_j)^2} $

$MST =\frac{SST}{N-1}$

$MSR =\frac{SSR}{k}$ , where k is the number of variables

$MSE = \frac{SSE}{N-k-1}$


__F-statistic:__ 

$F = \frac{MSR}{MSE}$<br>

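This test statistic can be compared with a critical F value, which depends on the significance level $\alpha$ and the degrees of freedom in the numerator ($k$) and denominator ($N-k-1$). If F is larger than the critical value, $H_0$ is rejected: the regression explains much more variance than is left in the residuals. Find the critical F value using statistical tables, or `stats.f.ppf` from `scipy.stats` in Python.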
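
The sums of squares and the F test can be sketched as follows. The data are synthetic (the coefficients and noise are invented for illustration), and $\alpha = 0.05$ is an assumed significance level.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
N, k = 100, 3

# Synthetic data (made up for illustration): y depends on x_1 and x_2, not x_3
predictors = rng.normal(size=(N, k))
X = np.column_stack([np.ones(N), predictors])
y = 5.0 + 2.0 * predictors[:, 0] - 1.0 * predictors[:, 1] + rng.normal(size=N)

B = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ B

SST = np.sum((y - y.mean())**2)      # total sum of squares
SSR = np.sum((y_hat - y.mean())**2)  # regression sum of squares
SSE = np.sum((y - y_hat)**2)         # error sum of squares

MSR = SSR / k
MSE = SSE / (N - k - 1)
F = MSR / MSE

alpha = 0.05  # assumed significance level
F_crit = stats.f.ppf(1 - alpha, k, N - k - 1)  # critical value for (k, N-k-1) d.o.f.

print(F, F_crit, F > F_crit)
```

For a least squares fit that includes a constant term, $SST = SSR + SSE$, which is a useful sanity check on the calculation.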



## Example: modeling aragonite saturation state


<img src='images/arag_sat.png' width="600">

From Feely et al. (2008) Evidence for upwelling of corrosive "acidified" water onto the continental shelf, Science

When the aragonite saturation state is less than 1, aragonite (a mineral form of calcium carbonate) will dissolve in seawater.

Is there a way to estimate aragonite saturation state $\Omega_{Ar}$ based on more commonly measured parameters?

$ \Omega = \frac{[Ca^{2+}][CO_3^{2-}]}{K'_{sp}} $<br>
where $K'_{sp}$ is the stoichiometric solubility product, a function of temperature (T), salinity (S), pressure (P) and mineral phase (aragonite or calcite)

$[Ca^{2+}]$  doesn't change much<br>
$[CO_3^{2-}]$ can be calculated from chemical measurements of DIC, $pCO_2$, total alkalinity and pH (at least two of these 4 parameters).

### Models

Juranek et al. (2009) describe a set of least squares regression models for aragonite saturation state, based on more commonly measured oceanographic variables (temperature, salinity, pressure, oxygen and nitrate).

Juranek, L. W., R. A. Feely, W. T. Peterson, S. R. Alin, B. Hales, K. Lee, C. L. Sabine, and J. Peterson, 2009: A novel method for determination of aragonite saturation state on the continental shelf of central Oregon using multi-parameter relationships with hydrographic data. Geophys. Res. Lett., 36, doi:10.1029/2009GL040778.

#### Model 1 

$\Omega_{arag}^e = \beta_0 + \beta_1T + \beta_2S + \beta_3P + \beta_4O_2 + \beta_5NO_3^-$

* Has high $R_a^2$ ("adjusted" $R^2$)
* High "variance inflation factor"
* Indicates multicollinearity
* Coefficients are ambiguous and not meaningful: when you add more data, you will get a different answer (this is bad!)

##### Adjusted $R^2$

Accounts for reduction of degrees of freedom when using multiple predictor variables.

$R_a^2 = R^2 - (1-R^2)\frac{k}{N-k-1}$

$= 1 - \frac{MSE}{MST}$

If the MSE is low, the adjusted R-squared is going to be high. The more observations you have, the less this adjustment matters.
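
A small sketch of the adjustment, using a hypothetical $R^2$ of 0.90 and five predictor variables, shows how the correction shrinks as the number of observations grows. The function name is illustrative.

```python
def adjusted_r2(r2, n_obs, k):
    """Adjusted R^2 for a model with k predictor variables and n_obs observations."""
    return r2 - (1 - r2) * k / (n_obs - k - 1)

r2 = 0.90  # hypothetical R^2, used for both cases
few = adjusted_r2(r2, n_obs=20, k=5)     # few observations: noticeable penalty
many = adjusted_r2(r2, n_obs=2000, k=5)  # many observations: almost no penalty
print(few, many)
```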


##### Variance Inflation Factor

$ VIF = \frac{1}{1 - R^2}$

where $R^2$ comes from regressing one predictor variable against all of the other predictor variables. There is no clear "cut-off" that defines high VIF, but greater than 5 (and definitely greater than 10) is generally considered high.
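
One way to compute the VIF with NumPy alone is to regress each predictor on the others in turn. This is a minimal sketch with synthetic predictors, where `x2` is deliberately built as a near copy of `x1`; the helper function name is hypothetical.

```python
import numpy as np

def variance_inflation_factors(predictors):
    """VIF for each column of an (N, k) array of predictor variables."""
    N, k = predictors.shape
    vifs = np.empty(k)
    for j in range(k):
        target = predictors[:, j]
        others = np.delete(predictors, j, axis=1)
        X = np.column_stack([np.ones(N), others])
        coef = np.linalg.lstsq(X, target, rcond=None)[0]
        resid = target - X @ coef
        r2 = 1 - np.sum(resid**2) / np.sum((target - target.mean())**2)
        vifs[j] = 1 / (1 - r2)
    return vifs

rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)  # nearly a copy of x1 (collinear)
x3 = rng.normal(size=200)             # independent of the others
vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print(vifs)  # large for x1 and x2, close to 1 for x3
```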

<img src='images/arag_sat_table.png' width="600">

#### Final Model

$\Omega_{arag}^e = \alpha_0 + \alpha_1(O_2 - O_{2,r}) + \alpha_2(T - T_r) \circ (O_2 - O_{2,r}) $

* Fewer variables, avoids multicollinearity
* Includes interaction term
* Reference values ($T_r$ and $O_{2,r}$) keep the product from getting too big
* Using variables with differing magnitudes can lead to problems like round-off errors
* Standardizing variables (using z-scores) is another common strategy
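
The effect of subtracting reference values before forming the interaction term can be sketched as below. The temperature and oxygen values are synthetic, and the reference values are simply taken as the sample means, which is an assumption made for illustration and not necessarily the choice made by Juranek et al. (2009).

```python
import numpy as np

rng = np.random.default_rng(3)
N = 50

# Synthetic temperature and oxygen values (made up for illustration)
T = rng.uniform(7.0, 12.0, size=N)
O2 = rng.uniform(50.0, 300.0, size=N)

# Hypothetical reference values; here simply the sample means
T_r, O2_r = T.mean(), O2.mean()

dO2 = O2 - O2_r
interaction = (T - T_r) * dO2  # elementwise product of centered variables

# Design matrix for a model with a constant, an O2 term and the interaction term
X = np.column_stack([np.ones(N), dO2, interaction])

# Compare the size of the raw product with the centered product
print(np.abs(T * O2).max(), np.abs(interaction).max())
```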

<img src='images/arag_sat_final.png' width="600">


__Aragonite saturation state__

_Red:_ from measured DIC and total alkalinity

_Blue:_ Multiple regression model

__Application of the model to time series that do not have direct observations of the carbonate system parameters__

<img src='images/arag_sat_ts.png' width="600">


## Matrix Multiplication Review

$ \begin{pmatrix}
a & b \\
c & d \\
\end{pmatrix} 
\begin{pmatrix}
x & y \\
z & w \\
\end{pmatrix} 
 = 
\begin{pmatrix}
ax + bz & ay + bw \\
cx + dz & cy + dw \\
\end{pmatrix} $
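
In NumPy, the matrix product is written with the `@` operator (the `*` operator multiplies element by element instead). A small sketch with arbitrary integer entries:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
M = np.array([[5, 6],
              [7, 8]])

product = A @ M  # matrix product: rows of A times columns of M
print(product)
# [[19 22]
#  [43 50]]
```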


Transformations in 2D Space<br>
<img src='images/2d_trans.png' width="600">

__Identity Matrix__<br>

No transformation

$\begin{pmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1 \\
\end{pmatrix}$
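
A short sketch confirms that multiplying by the identity matrix leaves a vector unchanged. A uniform scaling matrix is included only as a contrasting example; the vector is arbitrary.

```python
import numpy as np

I = np.eye(3)                   # 3x3 identity matrix
v = np.array([2.0, -1.0, 0.5])  # arbitrary vector

print(I @ v)                    # same as v: no transformation

S = np.diag([2.0, 2.0, 2.0])    # contrast: uniform scaling by 2
print(S @ v)
```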

### Also in 3D ###

<img src='images/3d_trans.png' width="600">

<img src='images/3d_trans2.png' width="600">

<img src='images/3d_trans3.png' width="600">

[source for above two images](http://www.c-jump.com/bcc/common/Talk3/Math/Matrices/Matrices.html)

## Dates and times in computing

__Gregorian calendar__
* Current civil calendar
* Introduced by Pope Gregory XIII
* Designed to tie Easter celebration to Spring Equinox
* Mean length of calendar year = 365.2425 days

ISO8601 standard: YYYY-MM-DD hh:mm:ss

__UNIX timestamp__ = number of seconds since 1970-01-01 00:00:00 UTC

__Matlab__ = # of days since (0000-01-01 00:00:00) + 1

__Excel__ = # of days since (1900-01-01 00:00:00) +1


__Julian day number__ = # of days since noon of November 24, 4714 BC (proleptic Gregorian calendar)
* Historically, 15 year indiction cycle (Roman census)
* 19 year metonic cycle (235 lunar months)
* 28 year solar cycle

[Modified Julian dates](https://en.wikipedia.org/wiki/Julian_day)

__Caution__<br>
Don't confuse the __"Julian Day"__ (a continuous count of days) with the __"Year Day"__ (the day of the year, from 1 to 365 or 366)

https://landweb.modaps.eosdis.nasa.gov/browse/calendar.html
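
The difference between the two can be sketched with NumPy. The example relies on the fact that the Unix epoch (1970-01-01 00:00:00) corresponds to Julian date 2440587.5, and that the modified Julian date is the Julian date minus 2400000.5. The variable names are illustrative.

```python
import numpy as np

date = np.datetime64('2017-03-07T11:41:00')

# Days (with fraction) since the Unix epoch, 1970-01-01 00:00:00
unix_days = (date - np.datetime64(0, 's')) / np.timedelta64(1, 'D')

# The Unix epoch corresponds to Julian date 2440587.5
julian_date = unix_days + 2440587.5
modified_julian_date = julian_date - 2400000.5

# Year day: whole days since January 1 of the same year, counting from 1
day = date.astype('datetime64[D]')
year_start = date.astype('datetime64[Y]').astype('datetime64[D]')
year_day = int((day - year_start) / np.timedelta64(1, 'D')) + 1

print(julian_date, modified_julian_date, year_day)
```

The Julian date is a number in the millions, while the year day for the same instant is 66.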

In [23]:
import numpy as np
import pandas as pd

NumPy stores dates in `datetime64` type objects. A date expressed in the ISO8601 standard format can be converted to a datetime64 object.

In [24]:
date = np.datetime64('2017-03-07 11:41:00')
print(date)

2017-03-07T11:41:00


In [25]:
print(type(date))

<class 'numpy.datetime64'>


Dates can be represented as numbers that are referenced to 1970-01-01 00:00:00, just like UNIX time.

In [26]:
np.datetime64(0,'D')

numpy.datetime64('1970-01-01')

Differences between dates are represented as a "timedelta." A timedelta divided by another timedelta gives a float in units of the denominator. For example, a timedelta of 1 hour divided by a timedelta of 1 second is equal to 3600.0.

In [27]:
np.timedelta64(1,'h')/np.timedelta64(1,'s')

3600.0

In [28]:
# Unix time as timedelta
(date - np.datetime64(0,'s'))

numpy.timedelta64(1488886860,'s')

In [29]:
# Unix time as float
(date - np.datetime64(0,'s'))/ np.timedelta64(1,'s')

1488886860.0
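
The conversion also works in the other direction. As a small sketch, adding a timedelta in seconds to the epoch recovers the original date from the Unix time shown above:

```python
import numpy as np

unix_seconds = 1488886860  # Unix time from the example above

# Epoch plus a timedelta in seconds gives back a datetime64
date_from_unix = np.datetime64(0, 's') + np.timedelta64(unix_seconds, 's')
print(date_from_unix)  # 2017-03-07T11:41:00
```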

In [30]:
# Matlab Time
(date - np.datetime64('0000-01-01 00:00:00')) / np.timedelta64(1,'D')+1

736761.48680555553

Pandas is much less strict about the input datetime format.

In [31]:
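# Using pandas; '1-3-1931' is read month-first (January 3)
# Note: pandas >= 2.0 expects one format per call; pass format='mixed' for a list like this
dates = pd.to_datetime(['2005/11/23','2010-12-31','1-3-1931'])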
print(dates)

DatetimeIndex(['2005-11-23', '2010-12-31', '1931-01-03'], dtype='datetime64[ns]', freq=None)


In [32]:
np.array(dates)

array(['2005-11-23T00:00:00.000000000', '2010-12-31T00:00:00.000000000',
       '1931-01-03T00:00:00.000000000'], dtype='datetime64[ns]')