### Multiple regression workbook

3. [Multivariate regression tutorial: aragonite saturation state](#3.-Multivariate-regression-tutorial:-aragonite-saturation-state)
4. [Application of final model for aragonite saturation state](#4.-Application-of-final-model-for-aragonite-saturation-state)

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pingouin as pg

import PyCO2SYS as pyco2

In [3]:
filename07 = 'wcoa_cruise_2007/32WC20070511.exc.csv'
df07 = pd.read_csv(filename07,header=29,na_values=-999,
                  parse_dates=[[6,7]])

In [4]:
filename13 = 'wcoa_cruise_2013/WCOA2013_hy1.csv'
df13 = pd.read_csv(filename13, header=31, na_values=-999,
                   parse_dates=[[8, 9]])

In [6]:
c07 = pyco2.sys(df07['ALKALI'], df07['TCARBN'], 1, 2,
               salinity=df07['CTDSAL'], temperature=df07['CTDTMP'], 
                pressure=df07['CTDPRS'])

In [7]:
c13 = pyco2.sys(df13['ALKALI'], df13['TCARBN'], 1, 2,
               salinity=df13['CTDSAL'], temperature=df13['CTDTMP'], 
                pressure=df13['CTDPRS'])

### 3. Multivariate regression tutorial: aragonite saturation state

Using data from the West Coast Ocean Acidification (WCOA) cruise, create two different multiple linear regression models to calculate aragonite saturation state (OmegaA) between 30 and 300 dbar as a function of more commonly observed variables.

#### Model 1: Temperature, salinity, pressure, dissolved oxygen and nitrate:

* Temperature
* Salinity
* Pressure
* Oxygen
* Nitrate

$\hat{\Omega}_A = c_0 + c_1 \times T + c_2 \times S + c_3 \times p + c_4 \times O_2 + c_5 \times N$

a. (*in class*) Create a 1-D array for the response variable, `y`. Which variable is the response variable in this case?

b. (*in class*) Create a 2-D array `X` that contains a column of all ones, and additional columns containing the explanatory variables. This 2-D array is called the "design matrix" and should have six columns. What are the explanatory variables in this case? 

* Approach: use `np.ones()` to create a 2-D array of correct size, then fill in the columns.
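Steps a and b can be sketched as follows. This is a minimal illustration on synthetic arrays, not the cruise data: the names `T`, `S`, `p`, `O2`, `N` and all of the values are assumptions made for the example, since the oxygen and nitrate column names in the cruise file are not shown in this workbook.

```python
import numpy as np

# Synthetic stand-ins for the cruise variables (hypothetical values, not WCOA data)
rng = np.random.default_rng(0)
n = 200
T = rng.uniform(6, 12, n)       # temperature
S = rng.uniform(33, 34.5, n)    # salinity
p = rng.uniform(30, 300, n)     # pressure (dbar)
O2 = rng.uniform(50, 300, n)    # dissolved oxygen
N = rng.uniform(0, 40, n)       # nitrate

# Response variable: a made-up linear function of T and O2 plus noise
y = 0.4 + 0.05 * T + 0.004 * O2 + rng.normal(0, 0.05, n)

# Design matrix: start with all ones, then overwrite columns 1 to 5
X = np.ones((n, 6))
X[:, 1] = T
X[:, 2] = S
X[:, 3] = p
X[:, 4] = O2
X[:, 5] = N

print(X.shape)  # (200, 6)
```

Column 0 is left as ones so that its coefficient plays the role of the intercept $c_0$.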

c. (*in class*) Use `np.linalg.lstsq` to compute the set of coefficients, `c`.

d. (*in class*) Use `np.matmul` to compute the modeled values `yhat`
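A minimal sketch of steps c and d, assuming a synthetic design matrix and made-up coefficients (`c_true` is hypothetical and only serves to generate example data):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Synthetic design matrix: a column of ones plus five made-up predictors
X = np.column_stack([np.ones(n), rng.normal(size=(n, 5))])
c_true = np.array([1.0, 0.5, -0.2, 0.0, 0.3, 0.1])  # hypothetical coefficients
y = X @ c_true + rng.normal(0, 0.05, n)

# Least-squares solution; rcond=None selects NumPy's current default cutoff
c, sum_sq_resid, rank, sing_vals = np.linalg.lstsq(X, y, rcond=None)

# Modeled values: design matrix times coefficient vector
yhat = np.matmul(X, c)

print(c)
```

`np.linalg.lstsq` returns four items; only the first, the coefficient vector, is needed to compute `yhat`.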

e. (*in class*) Plot model vs. observations.

f. (*in class*) Plot residuals vs. observations.
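One possible layout for the plots in steps e and f, sketched with synthetic `y` and `yhat` arrays (hypothetical values). The residual is taken here as observed minus modeled, which is an assumed sign convention.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
y = rng.uniform(0.5, 2.5, 150)           # synthetic "observations"
yhat = y + rng.normal(0, 0.1, y.size)    # synthetic "model" values
resid = y - yhat                         # observed minus modeled (assumed convention)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Model vs. observations, with a 1:1 reference line
ax1.plot(y, yhat, '.')
lims = [y.min(), y.max()]
ax1.plot(lims, lims, 'k--')
ax1.set_xlabel('observed')
ax1.set_ylabel('modeled')

# Residuals vs. observations, with a zero reference line
ax2.plot(y, resid, '.')
ax2.axhline(0, color='k', linestyle='--')
ax2.set_xlabel('observed')
ax2.set_ylabel('residual (observed - modeled)')

fig.tight_layout()
```

Points that drift away from the zero line at one end of the observed range would suggest a bias that depends on the magnitude of the response.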

g. (*in class*) Use `statsmodels` to get a complete summary of regression statistics.

h. (*in class*) Alternate approach using `statsmodels` formulas.
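The formula interface in step h might look like the sketch below. The DataFrame columns `temp`, `sal`, `oxy` and `omega` are hypothetical names filled with synthetic values; with the cruise data, the formula would refer to the actual column names instead.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 200
df = pd.DataFrame({
    'temp': rng.uniform(6, 12, n),
    'sal': rng.uniform(33, 34.5, n),
    'oxy': rng.uniform(50, 300, n),
})
# Made-up response that depends on temp and oxy only
df['omega'] = 0.4 + 0.05 * df['temp'] + 0.004 * df['oxy'] + rng.normal(0, 0.05, n)

# The intercept is added automatically by the formula interface
results = smf.ols('omega ~ temp + sal + oxy', data=df).fit()
print(results.params)
```

No explicit column of ones is needed here, which is the main practical difference from the design matrix approach.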

i. Another alternative approach using `pingouin`.

j. (*in class*) Correlation matrix to look for multi-collinearity.
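For step j, a correlation matrix can be computed with `DataFrame.corr()`. In the sketch below the columns are hypothetical and the synthetic `nitrate` values are deliberately constructed to anticorrelate with `oxy`, to show what a strongly collinear pair looks like.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
n = 200
oxy = rng.uniform(50, 300, n)
df = pd.DataFrame({
    'temp': rng.uniform(6, 12, n),
    'oxy': oxy,
    # Built to anticorrelate with oxygen (illustrative, not a real relationship)
    'nitrate': 40 - 0.12 * oxy + rng.normal(0, 1.0, n),
})

corr = df.corr()   # Pearson correlation by default
print(corr.round(2))
```

Off-diagonal entries with magnitude close to 1 indicate pairs of explanatory variables that carry nearly the same information.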

### 4. Application of final model for aragonite saturation state

In this section, data from the 2007 WCOA cruise are used to model aragonite saturation state following the model proposed by Juranek et al. (2009). An important question is whether this model can predict aragonite saturation state in other years. The model can be tested using an independent data set from the 2013 WCOA cruise.

#### Final model: Dissolved oxygen, and the interaction between oxygen and temperature (subtracting constant reference values):

$\hat{\Omega}_A = a_0 + a_1 \times (O - O_{ref}) + a_2 \times (O - O_{ref}) \times (T - T_{ref})$

a. (Homework) Calculate the coefficients $a_0$, $a_1$ and $a_2$ using either the design matrix approach or `statsmodels` formulas. Use the same constant values for $T_{ref}$ and $O_{ref}$ as Juranek et al. (2009).

In [58]:
# insert code here

b. (Homework) Calculate the root mean squared error for this model.
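For reference, a common definition of the root mean squared error over $n$ observations is given below, with $\Omega_{A,i}$ the observed value and $\hat{\Omega}_{A,i}$ the modeled value (the index $i$ and the count $n$ are notation introduced here):

$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\Omega_{A,i} - \hat{\Omega}_{A,i}\right)^2}$

Some texts divide by the residual degrees of freedom instead of $n$; which convention is intended here is not stated, so it is worth noting the choice made.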

In [59]:
# insert code here

c. (Homework) Plot the residuals vs. the observations.

In [60]:
# insert code here

d. (Homework) Use the coefficients calculated in part a to compute predicted aragonite saturation state for the 2013 cruise. Use the subset of 2013 data between 30 dbar and 300 dbar.

*Important:* Use the same model coefficients calculated from the 2007 dataset. The goal is to evaluate whether a model developed based on a single cruise in 2007 would still be useful in 2013.
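The idea of fitting on one data set and evaluating on another, without refitting, can be sketched generically as below. This is not the Juranek et al. (2009) model: the two-predictor model, the helper names `design` and `rmse`, and all values are hypothetical and synthetic.

```python
import numpy as np

def design(x1, x2):
    """Design matrix with an intercept and two predictors (hypothetical model)."""
    return np.column_stack([np.ones(x1.size), x1, x2])

def rmse(obs, pred):
    """Root mean squared error between observations and predictions."""
    return np.sqrt(np.mean((obs - pred) ** 2))

rng = np.random.default_rng(7)
c_true = np.array([1.0, 0.5, -0.3])   # made-up coefficients for generating data

# Data set A: used to fit the coefficients
x1_a, x2_a = rng.normal(size=100), rng.normal(size=100)
y_a = design(x1_a, x2_a) @ c_true + rng.normal(0, 0.1, 100)

# Data set B: independent, used only for evaluation
x1_b, x2_b = rng.normal(size=80), rng.normal(size=80)
y_b = design(x1_b, x2_b) @ c_true + rng.normal(0, 0.1, 80)

# Fit on A only
c, *_ = np.linalg.lstsq(design(x1_a, x2_a), y_a, rcond=None)

# Apply the same coefficients to B; no refitting
yhat_a = design(x1_a, x2_a) @ c
yhat_b = design(x1_b, x2_b) @ c

print('RMSE on A:', rmse(y_a, yhat_a))
print('RMSE on B:', rmse(y_b, yhat_b))
```

Because the synthetic sets are drawn from the same process, the two RMSE values come out similar; with real cruises from different years, a noticeably larger value on the independent set would point to a change the model does not capture.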

In [62]:
# insert code here

e. (Homework) Calculate the root mean squared error (RMSE) between the model prediction and aragonite saturation state observations during 2013. Describe how this RMSE value compares with the RMSE calculated for the 2007 observations.

In [None]:
# insert code here

*insert text here*

f. (Homework) Make a plot of the residuals vs. observations during 2013. Comment on whether you observe any biases in the model.

In [None]:
# insert code here

*insert text here*

g. (Homework) In a paragraph, compare the two different regressions (model 1 and the final model), commenting on:
  * General applicability of the model equations
  * Statistical significance
  * Multicollinearity
  * The potential for numerical errors
  * How well the final model represents aragonite saturation state in different years
  * Your scientific interpretation

*insert text here*