# Stats as Linear Models

In this notebook, we'll explore statistical tests as linear models.

**Import the modules for this notebook**

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.formula.api as smf
import pingouin as pg
import PyCO2SYS as pyco2
import warnings
# ignore scary PyCO2SYS warnings
warnings.simplefilter("ignore", RuntimeWarning)
np.set_printoptions(legacy='1.25')  # don't print the np.float64 wrapper for every number

## Loading the data
In this example, we'll use data from the 2007 WCOA cruise. As in a previous lab, we'll also use estimates of aragonite saturation state. The cells below load the data.

In [None]:
filename07 = os.path.join('data','wcoa_cruise_2007','32WC20070511.exc.csv')
df07 = pd.read_csv(filename07,header=29,na_values=-999,
                 dtype={'DATE': str, 'TIME': str})
df07.insert(0,'DATE_TIME',pd.to_datetime(df07.pop('DATE')+' '+df07.pop('TIME'),
                                format="%m/%d/%Y %H:%M:%S"))

Use the PyCO2SYS package to calculate seawater carbon chemistry parameters.

https://pyco2sys.readthedocs.io/en/latest/

In [None]:
# parameter types: 1 = total alkalinity, 2 = dissolved inorganic carbon
c07 = pyco2.sys(df07['ALKALI'], df07['TCARBN'], 1, 2,
                salinity=df07['CTDSAL'], temperature=df07['CTDTMP'],
                pressure=df07['CTDPRS'])
df07['OmegaA'] = c07['saturation_aragonite']

Create a subset of good data in the upper 10 m (near surface), keeping only samples whose quality flags all equal 2.

In [None]:
iisurf07 = ((df07['CTDPRS'] <= 10) &
            (df07['NITRAT_FLAG_W'] == 2) & (df07['PHSPHT_FLAG_W'] == 2) &
            (df07['CTDOXY_FLAG_W'] == 2) & (df07['CTDSAL_FLAG_W'] == 2) &
            (df07['ALKALI_FLAG_W'] == 2) & (df07['TCARBN_FLAG_W'] == 2))
df07surf = df07[iisurf07]

## One sample t-test - Aragonite Saturation
For the first example, we'll look at a one-sample t-test, comparing the mean near-surface $\Omega_A$ with a reference value of 1.

In [None]:
plt.figure()
plt.boxplot([df07surf['OmegaA']],
            tick_labels=['2007'],showmeans=True,notch=True);
plt.title('$\\Omega_A$ - upper 10m 2007')
plt.plot([0.5,1.5],[1,1],'r--')
plt.ylabel('$\\Omega_A$')
plt.xlabel('region');

Compute the mean:

In [None]:
print(df07surf['OmegaA'].mean())

Conduct a t-test using the stats library:

In [None]:
result = stats.ttest_1samp(df07surf['OmegaA'],1)
print(result)
ci = result.confidence_interval(confidence_level=0.95)
print(ci)

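Similarly, use the statsmodels OLS method to fit an intercept-only linear model to the data. The fitted intercept is the sample mean. Note that the t statistic in the summary tests the intercept against 0, whereas the t-test above used a null value of 1, so compare the estimate and its confidence interval rather than the t and p values.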

In [None]:
res = smf.ols(formula="OmegaA ~ 1", data=df07surf).fit()
res.summary()

In [None]:
plt.figure()
plt.boxplot([df07surf['OmegaA']],
            tick_labels=['2007'],showmeans=True,notch=True);
plt.title('$\\Omega_A$ - upper 10m 2007')
plt.plot([0.5,1.5],[1,1],'r--')
plt.plot([0.5,1.5],[ci.high,ci.high], 'b--')
plt.plot([0.5,1.5],[res.params.iloc[0],res.params.iloc[0]], 'b-')
plt.plot([0.5,1.5],[ci.low,ci.low], 'b--')
plt.ylabel('$\\Omega_A$')
plt.xlabel('region');

How do the outputs of the above t-test and linear model compare?
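
One way to check the comparison numerically is sketched below on synthetic data. The `demo` frame, its column `y`, and the random values are made up for illustration and are not the cruise data. The sketch assumes the same libraries as above, and it subtracts the null value inside the formula so that the model's intercept test uses a null of 1 rather than 0.

```python
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf

# synthetic sample (illustrative values only, not cruise data)
rng = np.random.default_rng(0)
demo = pd.DataFrame({"y": rng.normal(loc=1.8, scale=0.5, size=40)})

# one-sample t-test against a null value of 1
tt = stats.ttest_1samp(demo["y"], 1)
ci = tt.confidence_interval(confidence_level=0.95)

# intercept-only model: the intercept is the sample mean
fit = smf.ols("y ~ 1", data=demo).fit()
fit_ci = fit.conf_int(alpha=0.05).loc["Intercept"]

# shift the response by the null value so the intercept is tested against 1
fit_shift = smf.ols("I(y - 1) ~ 1", data=demo).fit()

print("mean:", demo["y"].mean(), "intercept:", fit.params["Intercept"])
print("t:", tt.statistic, "model t:", fit_shift.tvalues["Intercept"])
print("p:", tt.pvalue, "model p:", fit_shift.pvalues["Intercept"])
print("CI:", (ci.low, ci.high), "model CI:", tuple(fit_ci))
```

With the response shifted, the t statistic and p-value of the intercept should agree with `ttest_1samp`, and the confidence interval of the unshifted intercept should agree with the t-test's interval for the mean.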

## Two sample t-test - Aragonite Saturation

Next, we'll look at a two-sample t-test. Here, we'll split our data into two subsets at 40.4°N latitude: one for the south and one for the north.

In [None]:
df07surf = df07surf.assign(is_northern = df07surf['LATITUDE'] > 40.4)

In [None]:
plt.figure()
plt.boxplot([df07surf['OmegaA'][~df07surf['is_northern']],
             df07surf['OmegaA'][df07surf['is_northern']]],
            tick_labels=['south','north'],showmeans=True,notch=True);
plt.title('$\\Omega_A$ - upper 10m 2007')
plt.plot([0.5,2.5],
         [np.mean(df07surf['OmegaA'][~df07surf['is_northern']]),
          np.mean(df07surf['OmegaA'][~df07surf['is_northern']])], 'r--')
plt.ylabel('$\\Omega_A$')
plt.xlabel('region');

In [None]:
# compute the difference in the means
print(np.mean(df07surf['OmegaA'][df07surf['is_northern']])-
      np.mean(df07surf['OmegaA'][~df07surf['is_northern']]))

In [None]:
result = stats.ttest_ind(df07surf['OmegaA'][df07surf['is_northern']],
                         df07surf['OmegaA'][~df07surf['is_northern']])
print(result)

# confidence interval around difference in population means
ci = result.confidence_interval(confidence_level=0.95)
print(ci)

In [None]:
res = smf.ols(formula="OmegaA ~ is_northern", data=df07surf).fit()
res.summary()
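
To see why the model above reproduces the two-sample t-test, the fit can be written out by hand. The sketch below uses synthetic values (the `south` and `north` arrays are invented for illustration, not cruise data) and plain NumPy least squares instead of statsmodels. The design matrix has a column of ones and a 0/1 indicator for the northern group, so the intercept is the southern mean and the indicator's coefficient is the north minus south difference. It assumes the default pooled-variance form of `stats.ttest_ind`.

```python
import numpy as np
from scipy import stats

# synthetic groups (illustrative values only, not cruise data)
rng = np.random.default_rng(1)
south = rng.normal(2.0, 0.4, size=30)
north = rng.normal(1.7, 0.4, size=25)

# stack the data and build the design matrix [1, is_north]
y = np.concatenate([south, north])
is_north = np.concatenate([np.zeros(south.size), np.ones(north.size)])
X = np.column_stack([np.ones_like(y), is_north])

# ordinary least squares: beta[0] = south mean, beta[1] = north - south
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# standard errors from the residual variance
resid = y - X @ beta
dof = y.size - X.shape[1]
sigma2 = resid @ resid / dof
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# t statistic and two-sided p-value for the group coefficient
t_slope = beta[1] / se[1]
p_slope = 2 * stats.t.sf(abs(t_slope), dof)

# pooled-variance two-sample t-test (equal_var=True is the default)
tt = stats.ttest_ind(north, south)
print("difference in means:", north.mean() - south.mean(), "coef:", beta[1])
print("t:", tt.statistic, "model t:", t_slope)
print("p:", tt.pvalue, "model p:", p_slope)
```

In the statsmodels summary, the corresponding row is the one for the boolean predictor (typically labelled `is_northern[T.True]`), and the `Intercept` row is the mean of the reference group.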

## Two sample t-test - Temperature

In the above example, we determined there was no significant difference between the means. How does this same test compare for temperature?

In [None]:
plt.figure()
plt.boxplot([df07surf['CTDTMP'][~df07surf['is_northern']],
             df07surf['CTDTMP'][df07surf['is_northern']]],
            tick_labels=['south','north'],showmeans=True,notch=True);
plt.title('Temperature - upper 10m 2007')
plt.plot([0.5,2.5],
         [np.mean(df07surf['CTDTMP'][~df07surf['is_northern']]),
          np.mean(df07surf['CTDTMP'][~df07surf['is_northern']])], 'r--')
plt.ylabel('[deg C]')
plt.xlabel('region');

In [None]:
# compute the difference in the means
print(np.mean(df07surf['CTDTMP'][df07surf['is_northern']])-
      np.mean(df07surf['CTDTMP'][~df07surf['is_northern']]))

In [None]:
result = stats.ttest_ind(df07surf['CTDTMP'][df07surf['is_northern']],
                         df07surf['CTDTMP'][~df07surf['is_northern']])
print(result)

# confidence interval around difference in population means
ci = result.confidence_interval(confidence_level=0.95)
print(ci)

In [None]:
res = smf.ols(formula="CTDTMP ~ is_northern", data=df07surf).fit()
res.summary()