## <u>Notebook Synopsis</u>
In the previous notebook, I cleaned a subset of the _NOMAD_ 2008 dataset, stored it in a pandas dataframe, and pickled it as "*df_main.pkl*". I also put the _NOMAD_ data in a dataframe and pickled it as "*df_raw.pkl*". All pickled files are in the "pickleJar" subfolder of this project folder. In this notebook I create several smaller datasets: one for the present study (*df_clean_nopurple_nored*) and others for later use in follow-up projects.

In [1]:
import numpy as np
import pandas as pd
%matplotlib inline

In [2]:
df = pd.read_pickle("./pickleJar/df_main.pkl")

In [3]:
df.describe()

statistic,id,rrs411,rrs443,rrs489,rrs510,rrs555,rrs670,chlor_a
count,4459.0,4293.0,4456.0,4422.0,3435.0,3255.0,1598.0,4127.0
mean,4377.381251,0.004881,0.004652,0.00459,0.00413,0.003256,0.001557,2.680228
std,2298.272102,0.003447,0.003002,0.002768,0.00313,0.003536,0.002387,5.758436
min,6.0,5.1e-05,0.00019,0.000284,0.000261,0.000183,0.0,0.012
25%,2028.5,0.002509,0.002617,0.003051,0.002831,0.001588,0.0002,0.233325
50%,5039.0,0.003984,0.003899,0.004153,0.003425,0.002071,0.000614,0.764
75%,6271.5,0.006301,0.006076,0.005655,0.004242,0.003141,0.002,2.15
max,7831.0,0.0306,0.036769,0.063814,0.07774,0.0466,0.0277,77.8648


In the table above, the "min" row shows that the rrs670 column contains entries equal to 0. These zeros stand in for null values and need to be marked as invalid (NaN).

In [4]:
# np.nan is the supported spelling; the np.NaN alias was removed in NumPy 2.0
df.replace(0, np.nan, inplace=True)
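The replacement above is applied to every column of the dataframe. A minimal sketch of a narrower variant, assuming a toy frame with made-up values (the name `toy` and its contents are illustrative, not taken from the NOMAD data), limits the replacement to the reflectance columns so that columns such as `id` and `is_hplc` are out of scope by construction:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df; the values are invented for illustration only.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "rrs555": [0.002, 0.003, 0.001],
    "rrs670": [0.0, 0.0005, 0.0],
    "is_hplc": [False, True, False],
})

# Select the reflectance columns by name prefix.
rrs = [c for c in toy.columns if c.startswith("rrs")]

# Mark zeros as missing in those columns only.
toy[rrs] = toy[rrs].replace(0, np.nan)

n_missing = int(toy["rrs670"].isna().sum())
print(n_missing)
```

Whether this narrower scope is preferable depends on whether a zero can be a legitimate value in any of the other columns.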

In [5]:
df.describe()

statistic,id,rrs411,rrs443,rrs489,rrs510,rrs555,rrs670,chlor_a
count,4459.0,4293.0,4456.0,4422.0,3435.0,3255.0,1515.0,4127.0
mean,4377.381251,0.004881,0.004652,0.00459,0.00413,0.003256,0.001642,2.680228
std,2298.272102,0.003447,0.003002,0.002768,0.00313,0.003536,0.002422,5.758436
min,6.0,5.1e-05,0.00019,0.000284,0.000261,0.000183,3e-05,0.012
25%,2028.5,0.002509,0.002617,0.003051,0.002831,0.001588,0.000222,0.233325
50%,5039.0,0.003984,0.003899,0.004153,0.003425,0.002071,0.0007,0.764
75%,6271.5,0.006301,0.006076,0.005655,0.004242,0.003141,0.002111,2.15
max,7831.0,0.0306,0.036769,0.063814,0.07774,0.0466,0.0277,77.8648


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4459 entries, 0 to 4458
Data columns (total 9 columns):
id         4459 non-null int64
rrs411     4293 non-null float64
rrs443     4456 non-null float64
rrs489     4422 non-null float64
rrs510     3435 non-null float64
rrs555     3255 non-null float64
rrs670     1515 non-null float64
chlor_a    4127 non-null float64
is_hplc    4459 non-null bool
dtypes: bool(1), float64(7), int64(1)
memory usage: 283.1 KB


In [7]:
rrs_cols = df.filter(regex='rrs', axis=1).columns.tolist()
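The pattern `'rrs'` matches any column whose name contains that substring. A small sketch, assuming a toy frame with a hypothetical extra column `rrs_flag` (not present in the actual data), shows how an anchored pattern would restrict the match to names of the form `rrs` followed by digits:

```python
import pandas as pd

# Toy frame; "rrs_flag" is a hypothetical column used only to show the difference.
toy = pd.DataFrame(
    [[1, 0.004, 0.001, True, 0.5]],
    columns=["id", "rrs411", "rrs670", "rrs_flag", "chlor_a"],
)

# Substring match: picks up every column containing "rrs".
loose = toy.filter(regex="rrs").columns.tolist()

# Anchored match: only "rrs" followed by one or more digits.
strict = toy.filter(regex=r"^rrs\d+$").columns.tolist()

print(loose)
print(strict)
```

For the columns listed by `df.info()` above, both patterns would select the same six reflectance columns.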

## Subsetting:
Here I produce smaller datasets for later use in this and other projects.

### $\Rightarrow$ Subset all columns leaving out rrs670
Note that here I leave NaNs in the dataset, for possible imputation in a later project.

In [8]:
rrs_cols_nored = rrs_cols[:-1]
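The slice `rrs_cols[:-1]` drops rrs670 because it is the last reflectance column in the dataframe. A sketch of an equivalent selection by name, which would not depend on column order (the variable names `by_position` and `by_name` are illustrative):

```python
# Column names as listed by df.info() above.
rrs_cols = ["rrs411", "rrs443", "rrs489", "rrs510", "rrs555", "rrs670"]

# Positional selection, as in the notebook.
by_position = rrs_cols[:-1]

# Name-based selection: exclude the red band explicitly.
by_name = [c for c in rrs_cols if c != "rrs670"]

print(by_position == by_name)
```

With the current column order the two lists are identical; the name-based form simply keeps working if the order ever changes.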

In [11]:
df_no_red = df.loc[:, ['id'] + rrs_cols_nored + ['chlor_a', 'is_hplc']]

In [12]:
df_no_red.describe()

statistic,id,rrs411,rrs443,rrs489,rrs510,rrs555,chlor_a
count,4459.0,4293.0,4456.0,4422.0,3435.0,3255.0,4127.0
mean,4377.381251,0.004881,0.004652,0.00459,0.00413,0.003256,2.680228
std,2298.272102,0.003447,0.003002,0.002768,0.00313,0.003536,5.758436
min,6.0,5.1e-05,0.00019,0.000284,0.000261,0.000183,0.012
25%,2028.5,0.002509,0.002617,0.003051,0.002831,0.001588,0.233325
50%,5039.0,0.003984,0.003899,0.004153,0.003425,0.002071,0.764
75%,6271.5,0.006301,0.006076,0.005655,0.004242,0.003141,2.15
max,7831.0,0.0306,0.036769,0.063814,0.07774,0.0466,77.8648


In [13]:
df_no_red.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4459 entries, 0 to 4458
Data columns (total 8 columns):
id         4459 non-null int64
rrs411     4293 non-null float64
rrs443     4456 non-null float64
rrs489     4422 non-null float64
rrs510     3435 non-null float64
rrs555     3255 non-null float64
chlor_a    4127 non-null float64
is_hplc    4459 non-null bool
dtypes: bool(1), float64(6), int64(1)
memory usage: 248.3 KB


### $\Rightarrow$ Subset all columns leaving out rrs411 and rrs670 and dropping out any rows with at least one invalid entry


In [14]:
rrs_cols_nopurple_nored = rrs_cols[1:-1]

In [15]:
df_nopurple_nored = df.loc[:, ['id'] + rrs_cols_nopurple_nored + ['chlor_a', 'is_hplc']]

In [16]:
df_nopurple_nored.describe()

statistic,id,rrs443,rrs489,rrs510,rrs555,chlor_a
count,4459.0,4456.0,4422.0,3435.0,3255.0,4127.0
mean,4377.381251,0.004652,0.00459,0.00413,0.003256,2.680228
std,2298.272102,0.003002,0.002768,0.00313,0.003536,5.758436
min,6.0,0.00019,0.000284,0.000261,0.000183,0.012
25%,2028.5,0.002617,0.003051,0.002831,0.001588,0.233325
50%,5039.0,0.003899,0.004153,0.003425,0.002071,0.764
75%,6271.5,0.006076,0.005655,0.004242,0.003141,2.15
max,7831.0,0.036769,0.063814,0.07774,0.0466,77.8648


In [17]:
df_nopurple_nored.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4459 entries, 0 to 4458
Data columns (total 7 columns):
id         4459 non-null int64
rrs443     4456 non-null float64
rrs489     4422 non-null float64
rrs510     3435 non-null float64
rrs555     3255 non-null float64
chlor_a    4127 non-null float64
is_hplc    4459 non-null bool
dtypes: bool(1), float64(5), int64(1)
memory usage: 213.4 KB
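The `info()` output above still reports fewer non-null values than rows for several columns, so rows with invalid entries are still present at this point. A minimal sketch of the row drop named in this subsection's heading, assuming a toy frame with invented values (`toy` and `complete` are illustrative names, and this is not necessarily how the step was carried out in the project):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_nopurple_nored; values are invented for illustration.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "rrs443": [0.004, 0.005, np.nan, 0.003],
    "rrs555": [0.002, np.nan, 0.001, 0.003],
    "chlor_a": [0.5, 1.2, 0.8, np.nan],
    "is_hplc": [True, False, True, False],
})

# Keep only rows with no NaN in any column, then renumber the index.
complete = toy.dropna(how="any").reset_index(drop=True)

print(len(complete))
```

`dropna` also accepts a `subset` argument if only some columns should decide whether a row is kept.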


In [18]:
df_no_red.to_pickle('./pickleJar/df_clean_nored.pkl')
df_nopurple_nored.to_pickle('./pickleJar/df_clean_nopurple_nored.pkl')
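A quick way to check that a pickled dataframe reads back unchanged is a round trip. The sketch below assumes a toy frame and a temporary directory rather than the project's "pickleJar" folder, so that it runs without touching the real files:

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Toy frame with a NaN and a bool column, mirroring the dtypes used here.
toy = pd.DataFrame({
    "id": [1, 2],
    "rrs443": [0.004, np.nan],
    "is_hplc": [True, False],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy.pkl"   # hypothetical file name
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# Raises an AssertionError if values, dtypes, or index differ.
pd.testing.assert_frame_equal(toy, restored)
```

The same check can be pointed at the files written above by replacing the temporary path with the actual pickle path.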

### End of this notebook on dataset cleanup & prep. Next: feature engineering for band ratio application.