This notebook describes the preparation of a NOMAD v.2-based dataset that I will use to reproduce the OC4v6 fitting process, with the aim of comparing it to Bayesian linear regression. I'm interested in the latter for the uncertainty estimates it yields. That, however, is the subject of another post.

In [1]:
import pandas as pd
import os
import pathlib

In [2]:
home = pathlib.Path.home()
fp = home / 'DATA/NOMAD/dfNomadSWF.pkl'
df = pd.read_pickle(fp)

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4459 entries, 0 to 4458
Data columns (total 44 columns):
rrs411      4459 non-null float64
rrs443      4459 non-null float64
rrs489      4459 non-null float64
rrs510      4459 non-null float64
rrs555      4459 non-null float64
rrs670      4459 non-null float64
datetime    4459 non-null datetime64[ns]
hplc_chl    1381 non-null float64
fluo_chl    3392 non-null float64
lat         4459 non-null float64
lon         4459 non-null float64
a411        1131 non-null float64
ad411       1231 non-null float64
ag411       1182 non-null float64
ap411       1272 non-null float64
bb411       369 non-null float64
a443        1138 non-null float64
ad443       1238 non-null float64
ag443       1182 non-null float64
ap443       1279 non-null float64
bb443       369 non-null float64
a489        1137 non-null float64
ad489       1237 non-null float64
ag489       1182 non-null float64
ap489       1278 non-null float64
bb489       369 non-null float64
a510 

In [5]:
colIwant=['id', 'depth','rrs411','rrs443','rrs489', 'rrs510','rrs555','rrs670',
          'hplc_chl','fluo_chl', 'sst', 'lat', 'lon', 'datetime']
dfOreilly = df[colIwant]

In [6]:
dfOreilly.describe()

statistic,id,depth,rrs411,rrs443,rrs489,rrs510,rrs555,rrs670,hplc_chl,fluo_chl,sst,lat,lon
count,4459.0,4459.0,4459.0,4459.0,4459.0,4459.0,4459.0,4459.0,1381.0,3392.0,4459.0,4459.0,4459.0
mean,4377.381251,1312.346715,-0.117524,-0.002515,-0.078922,0.196083,0.251945,0.568632,2.285293,2.703251,14.841534,1.868658,-61.592062
std,2298.272102,1766.435289,5.308161,0.289728,1.080335,0.626941,0.520492,1.331805,5.752391,5.611762,10.374969,44.765125,53.894958
min,6.0,0.0,-325.392327,-14.690636,-22.160603,-16.837003,-13.82609,-41.858711,0.017,0.012,-1.8,-77.0356,-179.955
25%,2028.5,18.0,0.002555,0.002615,0.003011,0.002963,0.001694,0.001545,0.145,0.274,1.86,-61.299,-82.69995
50%,5039.0,240.0,0.004068,0.003896,0.004128,0.003737,0.002656,1.0,0.538,0.83,16.38,27.093,-67.675
75%,6271.5,2789.5,0.006696,0.006074,0.005644,0.009465,1.0,1.0,1.694,2.24008,24.43,34.4585,-63.9615
max,7831.0,7978.0,1.0,0.036769,0.063814,1.0,1.0,1.0,70.2133,77.8648,30.89,79.69,179.907


Next, I get rid of spurious rrs values: negative values and values equal to 1.000.

In [18]:
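rrsCols = [col for col in colIwant if 'rrs' in col]
# Keep only rows where every rrs band is positive. The 411, 510, 555 and 670
# bands also contain spurious 1.0 values, so those are bounded above as well.
dfOreillyClean = dfOreilly.loc[(dfOreilly.rrs411 > 0) & (dfOreilly.rrs411 < 1.0)
                               & (dfOreilly.rrs443 > 0) & (dfOreilly.rrs489 > 0)
                               & (dfOreilly.rrs510 > 0) & (dfOreilly.rrs510 < 1.0)
                               & (dfOreilly.rrs555 > 0) & (dfOreilly.rrs555 < 1.0)
                               & (dfOreilly.rrs670 > 0) & (dfOreilly.rrs670 < 1.0), :]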
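
The rrsCols list defined in the cell above is not actually used by the filter. As a minimal sketch, the same kind of selection could be written against a list of column names, which avoids spelling out each band. The toy DataFrame and the names `toy`, `toy_rrs_cols`, `valid` and `toy_clean` are made up for illustration. Note that this version bounds every band below 1.0, which here would only matter if rrs443 or rrs489 contained values of 1.0 or more.

```python
import pandas as pd

# Toy stand-in for dfOreilly; the values are invented for illustration.
toy = pd.DataFrame({
    'rrs411': [0.004, -0.2, 1.0, 0.006],
    'rrs443': [0.003, 0.004, 0.005, 0.002],
    'lat':    [10.0, 20.0, 30.0, 40.0],
})
toy_rrs_cols = [col for col in toy.columns if 'rrs' in col]

# True only for rows where every rrs band lies strictly between 0 and 1.
valid = ((toy[toy_rrs_cols] > 0) & (toy[toy_rrs_cols] < 1.0)).all(axis=1)
toy_clean = toy.loc[valid]
print(toy_clean.index.tolist())
```

Rows 1 and 2 are dropped because rrs411 is negative in the first and equal to 1.0 in the second.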

In [16]:
dfOreillyClean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1101 entries, 0 to 4458
Data columns (total 14 columns):
id          1101 non-null int32
depth       1101 non-null float64
rrs411      1101 non-null float64
rrs443      1101 non-null float64
rrs489      1101 non-null float64
rrs510      1101 non-null float64
rrs555      1101 non-null float64
rrs670      1101 non-null float64
hplc_chl    417 non-null float64
fluo_chl    794 non-null float64
sst         1101 non-null float64
lat         1101 non-null float64
lon         1101 non-null float64
datetime    1101 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(12), int32(1)
memory usage: 124.7 KB


The next step is to create a single chlorophyll *a* column from the fluorometric and HPLC data. Where possible, the HPLC value will be used, as fluorometric data tend to be plagued by contamination from other pigments. When HPLC data are not available, I will use entries from the fluo_chl column.

In [19]:

dfOreillyClean['chl_all'] = dfOreillyClean.loc[:, 'hplc_chl'].where(dfOreillyClean.hplc_chl.notnull(),
                                                      dfOreillyClean.fluo_chl)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until
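
The warning above appears because dfOreillyClean was obtained by slicing dfOreilly, so pandas cannot tell whether the assignment is meant to reach the original frame. One common way around it, sketched here on invented data, is to take an explicit `.copy()` of the subset before adding columns. The sketch also uses `fillna`, which should give the same result as the `where(...notnull(), ...)` construct for this "prefer HPLC, fall back to fluorometric" rule. The names `toy`, `subset` and `same` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the cleaned data; the values are invented for illustration.
toy = pd.DataFrame({
    'hplc_chl': [0.5, np.nan, np.nan, 2.0],
    'fluo_chl': [0.7, 1.2, np.nan, np.nan],
    'keep':     [True, True, True, False],
})

# An explicit copy makes the subset an independent DataFrame, so adding
# a column to it does not raise SettingWithCopyWarning.
subset = toy.loc[toy['keep']].copy()

# Prefer the HPLC value; fall back to the fluorometric one where HPLC is missing.
subset['chl_all'] = subset['hplc_chl'].fillna(subset['fluo_chl'])

# The construct used in the notebook, for comparison.
same = subset['hplc_chl'].where(subset['hplc_chl'].notnull(), subset['fluo_chl'])
print(subset['chl_all'].tolist())
```

Rows where both measurements are missing stay NaN in the merged column, which is why a later step still has to drop them.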


In [20]:
dfOreillyClean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1101 entries, 0 to 4458
Data columns (total 15 columns):
id          1101 non-null int32
depth       1101 non-null float64
rrs411      1101 non-null float64
rrs443      1101 non-null float64
rrs489      1101 non-null float64
rrs510      1101 non-null float64
rrs555      1101 non-null float64
rrs670      1101 non-null float64
hplc_chl    417 non-null float64
fluo_chl    794 non-null float64
sst         1101 non-null float64
lat         1101 non-null float64
lon         1101 non-null float64
datetime    1101 non-null datetime64[ns]
chl_all     1016 non-null float64
dtypes: datetime64[ns](1), float64(13), int32(1)
memory usage: 133.3 KB


In [21]:
dfOreillyClean.chl_all.min(), dfOreillyClean.chl_all.max()

(0.017000000000000001, 77.864800000000002)

In [24]:
dfOreillyClean.dropna(axis=0, subset=['chl_all'], inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [25]:
dfOreillyClean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1016 entries, 0 to 4458
Data columns (total 15 columns):
id          1016 non-null int32
depth       1016 non-null float64
rrs411      1016 non-null float64
rrs443      1016 non-null float64
rrs489      1016 non-null float64
rrs510      1016 non-null float64
rrs555      1016 non-null float64
rrs670      1016 non-null float64
hplc_chl    417 non-null float64
fluo_chl    794 non-null float64
sst         1016 non-null float64
lat         1016 non-null float64
lon         1016 non-null float64
datetime    1016 non-null datetime64[ns]
chl_all     1016 non-null float64
dtypes: datetime64[ns](1), float64(13), int32(1)
memory usage: 123.0 KB
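
The dropna call above raised the same slice-related warning because it modifies, in place, a frame that pandas regards as derived from another. A minimal sketch of an alternative, on invented data, is to skip `inplace=True` and bind the returned frame to a name instead. The names `toy` and `toy_complete` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-in for dfOreillyClean; the values are invented for illustration.
toy = pd.DataFrame({
    'rrs443':  [0.003, 0.004, 0.005],
    'chl_all': [0.5, np.nan, 1.2],
})

# dropna returns a new DataFrame; the input frame is left untouched.
toy_complete = toy.dropna(axis=0, subset=['chl_all'])
print(len(toy), len(toy_complete))
```

As in the notebook output, the surviving rows keep their original index labels, so the index is no longer contiguous after the drop.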


In [27]:
dfOreillyClean.to_pickle(home / 'DATA/NOMAD/Oreilly2000Clean.pkl')

In [28]:
dfOreillyClean.to_pickle(home / 'DEV-ALL/Bayesian_Chl_Algorithms/pickleJar/dfNOMADV2.pkl')