### Least squares regression workbook


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pingouin as pg

import PyCO2SYS as pyco2

Load 2007 data

In [4]:
filename07 = 'wcoa_cruise_2007/32WC20070511.exc.csv'
df07 = pd.read_csv(filename07,header=29,na_values=-999,parse_dates=[[6,7]])

Load 2013 data

In [5]:
filename13 = 'wcoa_cruise_2013/WCOA2013_hy1.csv'
df13 = pd.read_csv(filename13,header=31,na_values=-999,parse_dates=[[8,9]])

Use the PyCO2SYS package to calculate seawater carbon chemistry parameters.

https://pyco2sys.readthedocs.io/en/latest/

In [7]:
c07 = pyco2.sys(df07['ALKALI'], df07['TCARBN'], 1, 2,
                salinity=df07['CTDSAL'], temperature=df07['CTDTMP'],
                pressure=df07['CTDPRS'])

In [8]:
c13 = pyco2.sys(df13['ALKALI'], df13['TCARBN'], 1, 2,
                salinity=df13['CTDSAL'], temperature=df13['CTDTMP'],
                pressure=df13['CTDPRS'])

### 2. Linear regression and correlation

a. Plot dissolved oxygen vs. aragonite saturation state ($\Omega_A$) for all samples collected during the 2007 West Coast Ocean Acidification Cruise.

In [None]:
# insert code here

b. Plot dissolved oxygen vs. aragonite saturation state ($\Omega_A$) for the subset of samples collected at pressures between 30 and 300 dbar during the 2007 West Coast Ocean Acidification Cruise.

In [None]:
# insert code here
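
One way to select a pressure range is a boolean mask. The sketch below uses a small synthetic table rather than the cruise data: the `CTDPRS` column name matches the one used above, while the `OXYGEN` column name and all values are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the cruise table (values are invented;
# 'OXYGEN' is an assumed column name, not confirmed by this workbook).
demo = pd.DataFrame({
    'CTDPRS': [5.0, 30.0, 120.0, 300.0, 450.0, np.nan],
    'OXYGEN': [260.0, 240.0, 150.0, 90.0, 60.0, 55.0],
})

# Boolean mask: True where pressure lies within 30-300 dbar (inclusive).
# Rows with missing pressure compare as False and are excluded.
in_range = demo['CTDPRS'].between(30, 300)
subset = demo[in_range]
print(subset)
```

The same mask can be applied to any array of matching length (for example, a calculated saturation state), provided it is computed from the same rows.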

c. Calculate the slope and intercept of the Type I linear regression between dissolved oxygen and aragonite saturation state ($\Omega_A$) for the subset of samples collected at pressures between 30 and 300 dbar during the 2007 West Coast Ocean Acidification Cruise. Plot the regression line with the data.

In [None]:
# insert code here
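
As a minimal sketch of a Type I (ordinary least squares) fit, `stats.linregress` returns the slope and intercept directly. The data here are synthetic: `x` stands in for dissolved oxygen and `y` for $\Omega_A$, and the "true" slope, intercept, and noise level are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

# Synthetic data: a known line plus Gaussian noise (all values invented).
rng = np.random.default_rng(0)
x = np.linspace(50, 250, 40)                         # stand-in for oxygen
y = 0.005 * x + 0.4 + rng.normal(0, 0.05, x.size)    # stand-in for Omega_A

# Type I regression of y on x
result = stats.linregress(x, y)
slope, intercept = result.slope, result.intercept

# Fitted line evaluated at the observed x values, ready for plotting
y_fit = slope * x + intercept
print(slope, intercept, result.rvalue)
```

Note that `stats.linregress` does not skip missing values, so rows where either variable is NaN would need to be removed from real data before fitting.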

d. Create a function that calculates confidence intervals for the slope of the regression line, given the data values and significance level (where the default $\alpha$ = 0.05 can be modified by the user).

In [None]:
def slope_ci(x,y,alpha=0.05):
    ''' Compute the confidence intervals for the slope of the 
    Type I regression between x and y.
    
    INPUTS:
    x - independent variable values
    y - dependent variable values (must be same length as x)
    alpha - significance level (default 0.05 for 95% confidence)
    
    OUTPUT:
    slope_lower,slope_upper - bounds of confidence interval
    '''
    
    # Insert code here

    return slope_lower,slope_upper
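
One possible approach, sketched here under a hypothetical name (`slope_ci_sketch`) so it does not replace the stub above, is to take the standard error of the slope from `stats.linregress` and scale it by the two-sided critical $t$ value with $N-2$ degrees of freedom. This assumes independent samples and normally distributed residuals; the demonstration data are synthetic.

```python
import numpy as np
from scipy import stats

def slope_ci_sketch(x, y, alpha=0.05):
    """Illustrative two-sided confidence interval for a Type I slope.

    Hypothetical helper for demonstration; assumes no missing values.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    fit = stats.linregress(x, y)
    # Two-sided critical t value with n - 2 degrees of freedom
    t_crit = stats.t.ppf(1 - alpha / 2, n - 2)
    half_width = t_crit * fit.stderr   # fit.stderr is the slope standard error
    return fit.slope - half_width, fit.slope + half_width

# Synthetic demonstration data (values invented)
rng = np.random.default_rng(1)
x_demo = np.linspace(0, 10, 30)
y_demo = 2.0 * x_demo + 1.0 + rng.normal(0, 1.0, x_demo.size)

lo95, hi95 = slope_ci_sketch(x_demo, y_demo)
lo99, hi99 = slope_ci_sketch(x_demo, y_demo, alpha=0.01)
print(lo95, hi95)
print(lo99, hi99)
```

Lowering `alpha` widens the interval, which is a quick sanity check on any implementation.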

e. Use the above function to calculate 95% confidence intervals for the slope of the linear regression calculated in part c.

In [None]:
# insert code here

f. For the linear model created in part c, plot the model residuals (or errors) as a function of observed aragonite saturation state. See Figure 1d of Juranek et al. (2009) for an example of this type of plot.

In [None]:
# insert code here

g. Calculate the root mean squared error (RMSE) for the linear model created in part c.

In [None]:
# insert code here
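
Residuals and RMSE follow directly from the fitted line. The sketch below uses a tiny invented data set, defines residuals as model minus observation (the opposite sign convention is equally common), and divides by $N$ when averaging; some texts divide by $N-2$ instead.

```python
import numpy as np
from scipy import stats

# Small invented data set for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.0])

fit = stats.linregress(x, y)
y_model = fit.slope * x + fit.intercept

# Residuals: modeled minus observed values (sign convention is a choice)
residuals = y_model - y

# Root mean squared error, averaging over all N points
rmse = np.sqrt(np.mean(residuals**2))
print(residuals, rmse)
```

For an ordinary least squares fit with an intercept, the residuals sum to zero, so any trend in a residual plot points to structure the straight line does not capture rather than to a mean offset.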

h. Create a function called `rcrit` that calculates the critical correlation coefficient $r_{crit}$ for statistical significance (the minimum $r$ that will give you a significant correlation), given the degrees of freedom and significance level alpha (default 0.05). Use this function to calculate the critical correlation coefficient for the linear aragonite saturation state model obtained in part c.

Hints:
* There is a helpful equation for testing the significance of a correlation in the [course notes](https://tompc35.github.io/data_marine_science/2-04-corr-regress-least-squares.html).
* You may need to use `stats.t.ppf`
* Use the table from Emery and Thomson to check your results (in the `table` directory in this repository).

In [None]:
def rcrit(nu,alpha=0.05):
    """
    Critical r (correlation coefficient), given significance level
    and degrees of freedom.
    
    INPUTS:
    nu - degrees of freedom (N-2)        
    alpha - significance level (default 0.05 for 95% confidence)
    
    OUTPUT:
    rcrit - critical r value
    
    Values for 0.05 and 0.01 correspond with Appendix E in
    Emery and Thomson (2004) Data Analysis Methods in Physical 
    Oceanography
    """

    # Insert code here.

    return rcrit

In [None]:
# use the rcrit function here
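
A sketch of one way to obtain $r_{crit}$, assuming a two-sided test: the statistic $t = r\sqrt{\nu}/\sqrt{1-r^2}$ can be inverted to give $r_{crit} = t_{crit}/\sqrt{\nu + t_{crit}^2}$, where $t_{crit}$ comes from `stats.t.ppf`. The function name `rcrit_sketch` is hypothetical and is used so the stub above is left intact.

```python
import numpy as np
from scipy import stats

def rcrit_sketch(nu, alpha=0.05):
    """Illustrative critical correlation coefficient (two-sided test).

    Hypothetical helper; nu is the degrees of freedom (N - 2).
    """
    # Two-sided critical t value for the given degrees of freedom
    t_crit = stats.t.ppf(1 - alpha / 2, nu)
    # Invert t = r * sqrt(nu) / sqrt(1 - r**2) for r
    return t_crit / np.sqrt(nu + t_crit**2)

r10_05 = rcrit_sketch(10)         # 10 degrees of freedom, alpha = 0.05
r10_01 = rcrit_sketch(10, 0.01)   # 10 degrees of freedom, alpha = 0.01
print(round(r10_05, 3), round(r10_01, 3))
```

Printed values for a few choices of `nu` can be compared against a tabulated critical $r$ table to confirm that the one-sided versus two-sided convention matches.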

i. Based on the plots and calculations made above, summarize the applicability of the linear regression model for aragonite saturation state based on dissolved oxygen concentration only. Comment on the quality of the fit, the magnitude of the expected errors and the potential for systematic bias. In your explanation, refer to the correlation coefficient $r$, the critical correlation coefficient $r_{crit}$, the RMSE, and the 95% confidence intervals of the slope.

*(insert answer here)*