## Data extraction: daily, nutrients (sio4, no3, po4)

In bash, I copied the files from the server to my local machine with `scp`:
```
cd /home/lindsay/hioekg-2013
scp lindsayv@frinkiac.soest.hawaii.edu:/share/frinkraid3/lindsayv/hioekg-2013/output_semi_daily/hioekg_his_* .
```

I ran the following loop to extract the surface layer for each half-day average:

```
printf -v PATH2 "/home/lindsay/hioekg-2013/"
cd $PATH2
# define local variable tn0; here, the integer that corresponds to the ROMS time stamp on the file (without zero padding)  -- this is Jan 2, 2013
tn0=4750
# for loop: loop over 13 files
for ((i=0; i<13; i++ ));
do
## NOW INSIDE FOR LOOP

# in this printf command, %s means to copy a string
# and %05d means a 5 digit integer padded with 0s on the left
printf -v FNin "%shioekg_his_%05d.nc" $PATH2 $tn0
printf -v FNout "%shioekg_his_surface_%05d.nc" $PATH2 $tn0
# this command extracts the surface layer (selected along s_rho) and puts the output in the new file $FNout
ncks -O -d s_rho,-0.975 $FNin $FNout
# output to screen
echo $tn0
echo $FNin
echo $FNout
# increase tn0 by 30 days
tn0=$((tn0+30))

## done CLOSES FOR LOOP
done
cd $PATH2

```
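
To cross-check the file names the loop produces, the same `%05d` padding can be reproduced in Python. This is only a minimal sketch: the function name `history_file_names` and its parameters are hypothetical, while the start stamp (4750), the step of 30, and the 13 iterations come from the loop above.

```python
def history_file_names(start=4750, step=30, count=13, prefix="hioekg_his_"):
    """Return (input, output) file-name pairs matching the bash loop's printf patterns."""
    pairs = []
    for k in range(count):
        stamp = start + k * step
        # {stamp:05d} mirrors printf's %05d: a 5-digit integer, zero-padded on the left
        pairs.append((f"{prefix}{stamp:05d}.nc", f"{prefix}surface_{stamp:05d}.nc"))
    return pairs

pairs = history_file_names()
print(pairs[0])   # first input/output pair
print(pairs[-1])  # last input/output pair
```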

#### Import modules

In [9]:
import netCDF4
from netCDF4 import Dataset
import numpy as np
import pandas as pd
from pathlib import Path
import os
import matplotlib.pyplot as plt
import xarray as xr
import scipy.stats as stats
import itertools

#### Define lists called in the forthcoming for loops

In [2]:
day_2013 = ['4750','4780','4810','4840','4870','4900','4930','4960','4990','5020','5050', '5080'] 
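
The stamps are spaced 30 apart, so the list could also be generated rather than typed. A small sketch, assuming a hypothetical helper `day_stamps`; the start value and spacing are read off the list above.

```python
def day_stamps(start, count=12, step=30):
    """Return `count` ROMS time stamps as strings, `step` apart, beginning at `start`."""
    return [str(start + k * step) for k in range(count)]

generated_2013 = day_stamps(4750)
print(generated_2013)
```

The same call with `5115` as the start gives the stamps used for 2014.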

In [3]:
%cd /home/lindsay/hioekg-2013/
%ls

/home/lindsay/hioekg-2013
hioekg_his_04750.nc  hioekg_his_05020.nc          hioekg_his_surface_04930.nc
hioekg_his_04780.nc  hioekg_his_05050.nc          hioekg_his_surface_04960.nc
hioekg_his_04810.nc  hioekg_his_05110.nc          hioekg_his_surface_04990.nc
hioekg_his_04840.nc  hioekg_his_surface_04750.nc  hioekg_his_surface_05020.nc
hioekg_his_04870.nc  hioekg_his_surface_04780.nc  hioekg_his_surface_05050.nc
hioekg_his_04900.nc  hioekg_his_surface_04810.nc  hioekg_his_surface_05080.nc
hioekg_his_04930.nc  hioekg_his_surface_04840.nc  hioekg_his_surface_05110.nc
hioekg_his_04960.nc  hioekg_his_surface_04870.nc  Seasonal/
hioekg_his_04990.nc  hioekg_his_surface_04900.nc


## Loop through all history files for 2013

In [4]:
# Reset the working directory before running the loops
%pwd
%cd /home/lindsay/hioekg-2013/

# Master list for the opened surface files, plus one list per nutrient
file_list = []
no3=[]
po4=[]
sio4=[]

folder = '/home/lindsay/hioekg-2013/'
os.chdir(folder)
for stamp in day_2013:
    # File names are zero-padded to 5 digits, e.g. hioekg_his_surface_04750.nc
    file = xr.open_dataset('hioekg_his_surface_0' + stamp + '.nc')
    file_list.append(file)

for item in file_list: 
# no3: nitrate
    dat = item.no3
    dat_numpy = dat.values
    dat_numpy=dat_numpy[~np.isnan(dat_numpy)]
    no3.append(dat_numpy)
# po4: phosphate
    dat = item.po4
    dat_numpy = dat.values
    dat_numpy=dat_numpy[~np.isnan(dat_numpy)]
    po4.append(dat_numpy)
# sio4: silicate
    dat = item.sio4
    dat_numpy = dat.values
    dat_numpy=dat_numpy[~np.isnan(dat_numpy)]
    sio4.append(dat_numpy)

/home/lindsay/hioekg-2013
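
The three blocks in the loop above repeat the same flatten-and-drop-NaN step. Below is a hedged sketch of that step as a helper, run on a small synthetic array rather than the real files. `valid_values` and `fake_file` are hypothetical names; with xarray, the input would be `item[name].values`. It also shows that the values of one frame stay contiguous after flattening, which the frame arithmetic further down relies on.

```python
import numpy as np

def valid_values(field):
    """Flatten an array in C (time-major) order and keep only the non-NaN entries."""
    flat = np.asarray(field, dtype=float).ravel()
    return flat[~np.isnan(flat)]

# Synthetic stand-in for one surface file: 3 frames on a 2x2 grid with one masked (NaN) cell
frame = np.array([[1.0, np.nan], [2.0, 3.0]])
fake_file = {name: np.stack([frame * k for k in (1, 2, 3)])
             for name in ("no3", "po4", "sio4")}

# One flattened, NaN-free array per nutrient
nutrients = {name: valid_values(values) for name, values in fake_file.items()}
print(nutrients["no3"])
```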


Each array has length 614636. Each array holds 61 frames, so each frame (one 12-hour, half-day average) has length 10076 (614636/61).

For 4750, the output starts at Jan 2 00:00 and ends at Feb 1 00:00, which is 61 frames. The 61st frame is repeated at the start of the next file, so I want to remove it from each array: 614636-10076=604560.
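
The arithmetic above can be checked in a few lines. This is just a sanity check; the variable names are mine, and the numbers are the ones quoted in the text.

```python
values_per_file = 614636   # length of each flattened array
frames_per_file = 61       # half-day frames per file

values_per_frame, remainder = divmod(values_per_file, frames_per_file)
trimmed_length = values_per_file - values_per_frame   # drop the 61st frame
rows_per_year = 12 * trimmed_length                   # 12 files per year

print(values_per_frame, remainder, trimmed_length, rows_per_year)
# 10076 0 604560 7254720
```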

I want to label each of the 12 arrays of len=604560 in the list with a month and date.

I want to label every other frame (half day average, len=10076) with timestamp 00:00 or 12:00.

My path forward:

- Use a for loop to slice each array within the lists to indices 0:604560 (this slices off the frame that is repeated at the start of the next month).
- Use a nested loop to label every value in array i with entry i of the month and date lists.
- Once my df is assembled, label alternating frames (frame index i%2) from the timestamp list, starting at 00:00.
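
The steps above can be sketched on toy data as follows. This is not the code used in the cells below: it swaps the nested loop for `np.repeat`/`np.tile`, and `label_frames` and its parameters are hypothetical names. It assumes an even number of kept frames per array, so every array starts on a 00:00 frame.

```python
import numpy as np
import pandas as pd

def label_frames(arrays, dates, months, values_per_frame, frames_to_keep):
    """Trim each array to `frames_to_keep` frames and attach month, date and timestamp labels.

    Assumes `frames_to_keep` is even, so every array starts on a 00:00 frame.
    """
    keep = values_per_frame * frames_to_keep
    # One 00:00 frame followed by one 12:00 frame, tiled over the kept frames
    times = np.tile(np.repeat(["00:00", "12:00"], values_per_frame), frames_to_keep // 2)
    pieces = []
    for values, this_date, this_month in zip(arrays, dates, months):
        pieces.append(pd.DataFrame({
            "month": this_month,
            "date": this_date,
            "concentration": np.asarray(values)[:keep],
            "timestamp": times,
        }))
    return pd.concat(pieces, ignore_index=True)

# Toy input: 2 "files", 5 frames of 3 values each; keep the first 4 frames of each
toy = [np.arange(15, dtype=float), np.arange(15, dtype=float) + 100]
df = label_frames(toy, ["2013-01-01", "2013-02-01"], ["Jan", "Feb"],
                  values_per_frame=3, frames_to_keep=4)
print(len(df))
print(df["timestamp"].tolist()[:7])
```

For the real data the call would use `values_per_frame=10076` and `frames_to_keep=60`.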

Then I'll apply this framework to the 2014 data.
For now, I am ignoring that tiny chunk at the end of December; may add it in once I have the other data nicely labeled.

#### Classifying values by date, month, time

In [12]:
# Creating 12-unit date list and month list, 2-unit timestamp list, and empty list to store organized values:
date = ['2013-01-01','2013-02-01','2013-03-01','2013-04-01','2013-05-01',
        '2013-06-01','2013-07-01','2013-08-01','2013-09-01','2013-10-01',
        '2013-11-01','2013-12-01']
month=['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
timestamp=['00:00','12:00']

# Keep a separate list of rows per nutrient (one df each) to reduce the chance of mix-ups
list_for_dataframe_no3_2013=[]
list_for_dataframe_po4_2013=[]
list_for_dataframe_sio4_2013=[]

#### no3 (2013)

In [13]:
%cd /home/lindsay/hioekg-2013

# Iterate over each array (element) in the no3 list; i is the array's position in the list
for i,element in enumerate(no3):
    element = element[0:604560] # keep the first 60 frames; drops the repeated 61st frame
    this_date = date[i] # each array i is assigned date i
    this_month = month[i] # each array i is assigned month i
    for sub_element in element:
        list_for_dataframe_no3_2013.append(
            {'month': this_month, 'date': this_date, 'concentration': sub_element})

no3_df_2013 = pd.DataFrame(list_for_dataframe_no3_2013)

# Add group identifier and year column
no3_df_2013['group']='no3'
no3_df_2013['year']=2013

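# Add timestamp (this, unlike the plankton dfs, is accurate, though presently unused).
# Frames alternate 00:00/12:00 and each frame holds 10076 values, so repeat each label
# 10076 times and tile the pair over the df. Repeating each label 3627360 times (half
# the df) would instead label Jan-Jun as 00:00 and Jul-Dec as 12:00.
timestamp = ['00:00','12:00'] 
timestamp = np.tile(np.repeat(timestamp, 10076), len(no3_df_2013) // (2 * 10076))
no3_df_2013['timestamp']= timestamp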
print(len(no3_df_2013))

%cd /home/lindsay/hioekg-compare-years/
pd.DataFrame.to_csv(no3_df_2013,'no3_semidaily_df_2013.csv')
no3_df_2013.head()

/home/lindsay/hioekg-2013
7254720
/home/lindsay/hioekg-compare-years


Unnamed: 0,month,date,concentration,group,year,timestamp
0,Jan,2013-01-01,2.906196e-08,no3,2013,00:00
1,Jan,2013-01-01,2.900492e-08,no3,2013,00:00
2,Jan,2013-01-01,2.894771e-08,no3,2013,00:00
3,Jan,2013-01-01,2.889031e-08,no3,2013,00:00
4,Jan,2013-01-01,2.924156e-08,no3,2013,00:00


#### po4 (2013)

In [14]:
%cd /home/lindsay/hioekg-2013

# Iterate over each array (element) in the po4 list; i is the array's position in the list
for i,element in enumerate(po4):
    element = element[0:604560] # keep the first 60 frames; drops the repeated 61st frame
    this_date = date[i] # each array i is assigned date i
    this_month = month[i] # each array i is assigned month i
    for sub_element in element:
        list_for_dataframe_po4_2013.append(
            {'month': this_month, 'date': this_date, 'concentration': sub_element})

po4_df_2013 = pd.DataFrame(list_for_dataframe_po4_2013)

# Add group identifier and year column
po4_df_2013['group']='po4'
po4_df_2013['year']=2013

# Add timestamp: frames alternate 00:00/12:00 and each frame holds 10076 values,
# so repeat each label 10076 times and tile the pair over the df
timestamp = ['00:00','12:00'] 
timestamp = np.tile(np.repeat(timestamp, 10076), len(po4_df_2013) // (2 * 10076))
po4_df_2013['timestamp']= timestamp
print(len(po4_df_2013))

%cd /home/lindsay/hioekg-compare-years/
pd.DataFrame.to_csv(po4_df_2013,'po4_semidaily_df_2013.csv')
po4_df_2013.head()

/home/lindsay/hioekg-2013
7254720
/home/lindsay/hioekg-compare-years


Unnamed: 0,month,date,concentration,group,year,timestamp
0,Jan,2013-01-01,5.431153e-08,po4,2013,00:00
1,Jan,2013-01-01,5.430301e-08,po4,2013,00:00
2,Jan,2013-01-01,5.429449e-08,po4,2013,00:00
3,Jan,2013-01-01,5.428597e-08,po4,2013,00:00
4,Jan,2013-01-01,5.434026e-08,po4,2013,00:00


#### sio4 (2013)

In [15]:
%cd /home/lindsay/hioekg-2013

# Iterate over each array (element) in the sio4 list; i is the array's position in the list
for i,element in enumerate(sio4):
    element = element[0:604560] # keep the first 60 frames; drops the repeated 61st frame
    this_date = date[i] # each array i is assigned date i
    this_month = month[i] # each array i is assigned month i
    for sub_element in element:
        list_for_dataframe_sio4_2013.append(
            {'month': this_month, 'date': this_date, 'concentration': sub_element})

sio4_df_2013 = pd.DataFrame(list_for_dataframe_sio4_2013)

# Add group identifier and year column
sio4_df_2013['group']='sio4'
sio4_df_2013['year']=2013

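# Add timestamp: frames alternate 00:00/12:00 and each frame holds 10076 values,
# so repeat each label 10076 times and tile the pair over the df
timestamp = ['00:00','12:00'] 
timestamp = np.tile(np.repeat(timestamp, 10076), len(sio4_df_2013) // (2 * 10076))
sio4_df_2013['timestamp']= timestamp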
print(len(sio4_df_2013))

%cd /home/lindsay/hioekg-compare-years/
pd.DataFrame.to_csv(sio4_df_2013,'sio4_semidaily_df_2013.csv')
sio4_df_2013.head()

/home/lindsay/hioekg-2013
7254720
/home/lindsay/hioekg-compare-years


Unnamed: 0,month,date,concentration,group,year,timestamp
0,Jan,2013-01-01,5.828139e-07,sio4,2013,00:00
1,Jan,2013-01-01,5.82559e-07,sio4,2013,00:00
2,Jan,2013-01-01,5.823036e-07,sio4,2013,00:00
3,Jan,2013-01-01,5.820479e-07,sio4,2013,00:00
4,Jan,2013-01-01,5.833647e-07,sio4,2013,00:00


## Loop through all history files for 2014

In [1]:
day_2014 = ['5115','5145','5175','5205','5235','5265','5295','5325','5355','5385','5415','5445']

In [4]:
# Reset the working directory before running the loops
%pwd
%cd /home/lindsay/hioekg-2014/

# Master list for the opened surface files, plus one list per nutrient
file_list = []
no3=[]
po4=[]
sio4=[]

folder = '/home/lindsay/hioekg-2014/'
os.chdir(folder)
for stamp in day_2014:
    # File names are zero-padded to 5 digits, e.g. hioekg_his_surface_05115.nc
    file = xr.open_dataset('hioekg_his_surface_0' + stamp + '.nc')
    file_list.append(file)

for item in file_list: 
# no3: nitrate
    dat = item.no3
    dat_numpy = dat.values
    dat_numpy=dat_numpy[~np.isnan(dat_numpy)]
    no3.append(dat_numpy)
# po4: phosphate
    dat = item.po4
    dat_numpy = dat.values
    dat_numpy=dat_numpy[~np.isnan(dat_numpy)]
    po4.append(dat_numpy)
# sio4: silicate
    dat = item.sio4
    dat_numpy = dat.values
    dat_numpy=dat_numpy[~np.isnan(dat_numpy)]
    sio4.append(dat_numpy)

/home/lindsay/hioekg-2014


In [8]:
# Creating 12-unit date list and month list, 2-unit timestamp list, and empty list to store organized values:
date = ['2014-01-01','2014-02-01','2014-03-01','2014-04-01','2014-05-01',
        '2014-06-01','2014-07-01','2014-08-01','2014-09-01','2014-10-01',
        '2014-11-01','2014-12-01']
month=['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
timestamp=['00:00','12:00']

# Keep a separate list of rows per nutrient (one df each) to reduce the chance of mix-ups
list_for_dataframe_no3_2014=[]
list_for_dataframe_po4_2014=[]
list_for_dataframe_sio4_2014=[]

#### no3 (2014)

In [10]:
%cd /home/lindsay/hioekg-2014

# Iterate over each array (element) in the no3 list; i is the array's position in the list
for i,element in enumerate(no3):
    element = element[0:604560] # keep the first 60 frames; drops the repeated 61st frame
    this_date = date[i] # each array i is assigned date i
    this_month = month[i] # each array i is assigned month i
    for sub_element in element:
        list_for_dataframe_no3_2014.append(
            {'month': this_month, 'date': this_date, 'concentration': sub_element})

no3_df_2014 = pd.DataFrame(list_for_dataframe_no3_2014)

# Add group identifier and year column
no3_df_2014['group']='no3'
no3_df_2014['year']=2014

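# Add timestamp: frames alternate 00:00/12:00 and each frame holds 10076 values,
# so repeat each label 10076 times and tile the pair over the df
timestamp = ['00:00','12:00'] 
timestamp = np.tile(np.repeat(timestamp, 10076), len(no3_df_2014) // (2 * 10076))
no3_df_2014['timestamp']= timestamp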
print(len(no3_df_2014))

%cd /home/lindsay/hioekg-compare-years/
pd.DataFrame.to_csv(no3_df_2014,'no3_semidaily_df_2014.csv')
no3_df_2014.head()

/home/lindsay/hioekg-2014
7254720
/home/lindsay/hioekg-compare-years


Unnamed: 0,month,date,concentration,group,year,timestamp
0,Jan,2014-01-01,3.185587e-09,no3,2014,00:00
1,Jan,2014-01-01,3.177744e-09,no3,2014,00:00
2,Jan,2014-01-01,3.16982e-09,no3,2014,00:00
3,Jan,2014-01-01,3.161814e-09,no3,2014,00:00
4,Jan,2014-01-01,3.234628e-09,no3,2014,00:00


#### po4 (2014)

In [11]:
%cd /home/lindsay/hioekg-2014

# Iterate over each array (element) in the po4 list; i is the array's position in the list
for i,element in enumerate(po4):
    element = element[0:604560] # keep the first 60 frames; drops the repeated 61st frame
    this_date = date[i] # each array i is assigned date i
    this_month = month[i] # each array i is assigned month i
    for sub_element in element:
        list_for_dataframe_po4_2014.append(
            {'month': this_month, 'date': this_date, 'concentration': sub_element})

po4_df_2014 = pd.DataFrame(list_for_dataframe_po4_2014)

# Add group identifier and year column
po4_df_2014['group']='po4'
po4_df_2014['year']=2014

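# Add timestamp: frames alternate 00:00/12:00 and each frame holds 10076 values,
# so repeat each label 10076 times and tile the pair over the df
timestamp = ['00:00','12:00'] 
timestamp = np.tile(np.repeat(timestamp, 10076), len(po4_df_2014) // (2 * 10076))
po4_df_2014['timestamp']= timestamp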
print(len(po4_df_2014))

%cd /home/lindsay/hioekg-compare-years/
pd.DataFrame.to_csv(po4_df_2014,'po4_semidaily_df_2014.csv')
po4_df_2014.head()

/home/lindsay/hioekg-2014
7254720
/home/lindsay/hioekg-compare-years


Unnamed: 0,month,date,concentration,group,year,timestamp
0,Jan,2014-01-01,4.55849e-08,po4,2014,00:00
1,Jan,2014-01-01,4.55825e-08,po4,2014,00:00
2,Jan,2014-01-01,4.558009e-08,po4,2014,00:00
3,Jan,2014-01-01,4.557767e-08,po4,2014,00:00
4,Jan,2014-01-01,4.560035e-08,po4,2014,00:00


#### sio4 (2014)

In [12]:
%cd /home/lindsay/hioekg-2014

# Iterate over each array (element) in the sio4 list; i is the array's position in the list
for i,element in enumerate(sio4):
    element = element[0:604560] # keep the first 60 frames; drops the repeated 61st frame
    this_date = date[i] # each array i is assigned date i
    this_month = month[i] # each array i is assigned month i
    for sub_element in element:
        list_for_dataframe_sio4_2014.append(
            {'month': this_month, 'date': this_date, 'concentration': sub_element})

sio4_df_2014 = pd.DataFrame(list_for_dataframe_sio4_2014)

# Add group identifier and year column
sio4_df_2014['group']='sio4'
sio4_df_2014['year']=2014

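# Add timestamp: frames alternate 00:00/12:00 and each frame holds 10076 values,
# so repeat each label 10076 times and tile the pair over the df
timestamp = ['00:00','12:00'] 
timestamp = np.tile(np.repeat(timestamp, 10076), len(sio4_df_2014) // (2 * 10076))
sio4_df_2014['timestamp']= timestamp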
print(len(sio4_df_2014))

%cd /home/lindsay/hioekg-compare-years/
pd.DataFrame.to_csv(sio4_df_2014,'sio4_semidaily_df_2014.csv')
sio4_df_2014.head()

/home/lindsay/hioekg-2014
7254720
/home/lindsay/hioekg-compare-years


Unnamed: 0,month,date,concentration,group,year,timestamp
0,Jan,2014-01-01,3.740273e-07,sio4,2014,00:00
1,Jan,2014-01-01,3.740934e-07,sio4,2014,00:00
2,Jan,2014-01-01,3.741607e-07,sio4,2014,00:00
3,Jan,2014-01-01,3.742292e-07,sio4,2014,00:00
4,Jan,2014-01-01,3.737438e-07,sio4,2014,00:00
