# Accessing Optical Absorption and Attenuation (OPTAA) Data from the OOI Raw Data Server
The example code below shows one pathway for downloading the raw OPTAA data (recorded in binary format) and converting it into a usable form for further processing and analysis. The data is accessible from the [OOI Raw Data Server](https://rawdata.oceanobservatories.org/files/). For this demonstration we are using data from the Spring 2016 deployment of the [Oregon Shelf Surface Mooring (CE02SHSM)](https://rawdata.oceanobservatories.org/files/CE02SHSM/D00003/cg_data/dcl27/optaa/).

Before proceeding, you need to obtain a copy of the cgsn_parsers modules used below. You can install them with conda (using the Anaconda python distribution and the conda-forge channel) or with pip:

```bash
# Via conda
conda install -c conda-forge cgsn_parsers

# Or via pip if not using Anaconda
pip install git+https://bitbucket.org/ooicgsn/cgsn-parsers
```

See the [README](https://github.com/oceanobservatories/ooi-data-explorations/blob/master/python/README.md) in this repo for further information.

In [29]:
# Load required python modules
import os
import requests
import numpy as np
import pandas as pd
import xarray as xr

from bokeh.plotting import figure, show
from bokeh.palettes import Colorblind as palette
from bokeh.io import output_notebook

import warnings
warnings.filterwarnings('ignore')

In [18]:
from cgsn_parsers.parsers.parse_optaa import Parser

In [19]:
# Coastal Endurance Oregon Shelf Surface Mooring NSIF (7 meters) OPTAA data for June 01, 2016 at 12:30 UTC
baseurl = "https://rawdata.oceanobservatories.org/files/CE02SHSM/D00003/cg_data/dcl27/optaa/"
fname = "20160601_123018.optaa.log"

# initialize the Parser object for OPTAA
optaa = Parser(baseurl + fname)
r = requests.get(optaa.infile, verify=True) # use verify=False for expired certificate

In [20]:
# Raw data is available in the raw data object for the parser class. 
optaa.raw = r.content
len(optaa.raw), optaa.raw[:4]  # print a snippet of the raw data

(448731, b'\xff\x00\xff\x00')
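
The four bytes printed above look like a fixed marker at the start of the binary record. As a rough sketch, and assuming (this is not confirmed here) that every packet in the file begins with the same marker, the candidate packet starts in a byte string could be located as follows. The `find_markers` helper and the sample bytes are hypothetical; the parser itself handles the real decoding.

```python
# ASSUMPTION: each packet starts with the 4-byte marker seen at the top of the file
SYNC = b"\xff\x00\xff\x00"


def find_markers(raw: bytes, marker: bytes = SYNC) -> list:
    """Return the byte offsets at which the marker occurs (hypothetical helper)."""
    offsets = []
    start = raw.find(marker)
    while start != -1:
        offsets.append(start)
        # continue the search just past the marker we found
        start = raw.find(marker, start + len(marker))
    return offsets


# synthetic stand-in for optaa.raw: three fake packets with arbitrary payloads
sample = SYNC + b"\x01\x02\x03" + SYNC + b"\x04\x05" + SYNC + b"\x06"
offsets = find_markers(sample)
print(len(offsets), offsets)
```

Note that a payload could contain the same four bytes by chance, so a count like this is only a sanity check and not a substitute for `parse_data`.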

In [21]:
# The parser class method parse_data converts the raw data into a parsed data object
optaa.parse_data()
optaa.data.keys()  # print the resulting dictionary keys in the data object

dict_keys(['time', 'serial_number', 'a_reference_dark', 'pressure_raw', 'a_signal_dark', 'external_temp_raw', 'internal_temp_raw', 'c_reference_dark', 'c_signal_dark', 'elapsed_run_time', 'num_wavelengths', 'c_reference_raw', 'a_reference_raw', 'c_signal_raw', 'a_signal_raw'])

Almost every dataset includes multiple sources of timing data. In this case, the OPTAA reports time only as the elapsed run time since it was started, so the DCL-recorded file start time is added to the elapsed run time from the OPTAA data to create the Epoch time (seconds since 1970-01-01) used in these records.
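
The arithmetic behind that timestamp can be sketched as below. This is an illustration only: it assumes the file name prefix encodes the DCL file start time in UTC and that the elapsed run time is reported in milliseconds (the code further down divides it by 1000 to get seconds). The helper names `file_start_epoch` and `to_epoch` are hypothetical and are not part of cgsn_parsers.

```python
from datetime import datetime, timezone


def file_start_epoch(fname: str) -> float:
    """Parse the YYYYMMDD_HHMMSS prefix of a file name as a UTC epoch time.

    ASSUMPTION: the prefix is the DCL-recorded file start time in UTC.
    """
    stamp = fname.split(".")[0]
    start = datetime.strptime(stamp, "%Y%m%d_%H%M%S").replace(tzinfo=timezone.utc)
    return start.timestamp()


def to_epoch(fname: str, elapsed_ms: list) -> list:
    """Add elapsed run times (ASSUMED milliseconds) to the file start time."""
    start = file_start_epoch(fname)
    return [start + ms / 1000.0 for ms in elapsed_ms]


# elapsed run times at 4 Hz spacing (values invented for illustration)
times = to_epoch("20160601_123018.optaa.log", [0, 250, 500, 1000])
print(times)
```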

From here, you can save the data to disk as a JSON formatted data file if you so desire. We use this method to store the parsed data files locally for all further processing.
```python
# write the resulting Bunch object via the toJSON method to a JSON
# formatted data file (note, no pretty-printing keeping things compact)
with open(outfile, 'w') as f:
    f.write(optaa.data.toJSON())
```
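
Reading such a file back later might look like the sketch below. It assumes the file holds a single JSON object keyed by the variable names listed above; a plain dict with invented values stands in for `optaa.data`, and the output path is a hypothetical temporary location.

```python
import json
import os
import tempfile

# stand-in for optaa.data: a plain dict with a few of the parsed keys (values invented)
data = {
    "time": [1464784218.0, 1464784218.25],
    "elapsed_run_time": [0, 250],
    "num_wavelengths": [3, 3],
    "a_signal_raw": [[1, 2, 3], [4, 5, 6]],
}

# hypothetical output path in a temporary directory
outfile = os.path.join(tempfile.mkdtemp(), "optaa_example.json")
with open(outfile, "w") as f:
    f.write(json.dumps(data))  # compact output, no pretty-printing

# reload the parsed data from disk as a regular dictionary
with open(outfile) as f:
    restored = json.load(f)

print(sorted(restored.keys()))
```
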
We are going to proceed, instead, by converting the data into a [pandas](https://github.com/pandas-dev/pandas) dataframe and then an [xarray](http://xarray.pydata.org/en/stable/index.html) dataset for the following steps.

In [22]:
# Convert the data into a pandas dataframe and then an xarray dataset for further analysis.
df = pd.DataFrame(optaa.data)
df['time'] = pd.to_datetime(df.time, unit='s')  # use the time variable to set the index
df.set_index('time', drop=False, inplace=True)
ds = df.to_xarray()

# add the wavelength number as a coordinate and dimension to the dataset
ds.coords['wavelength_index'] = np.arange(df.num_wavelengths.values[0])
ds.update({'a_reference_raw': (('time', 'wavelength_index'), np.vstack(df.a_reference_raw.values)),
          'a_signal_raw': (('time', 'wavelength_index'), np.vstack(df.a_signal_raw.values)),
          'c_reference_raw': (('time', 'wavelength_index'), np.vstack(df.c_reference_raw.values)),
          'c_signal_raw': (('time', 'wavelength_index'), np.vstack(df.c_signal_raw.values))})

ds

The OPTAA collects data in bursts at approximately 4 Hz for 2-4 minutes every hour. Per the vendor's recommendation, we need to delete at least the first minute's worth of data to account for changes in absorption and attenuation caused by the lamp and instrument warming up. More recently, the vendor recommended removing the first 2 minutes' worth of data, but that may not be possible since some of the bursts are only 2 minutes long. Since each raw data file represents a single burst, we can use `elapsed_run_time` to remove the first minute of data. After that, we can take the median of the burst to create a simpler, easier to work with data record.
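
The logic of that step can be illustrated without xarray. The sketch below uses only the standard library and a synthetic burst (invented counts, two wavelengths, elapsed time assumed to be in milliseconds); the helper names `drop_warmup` and `burst_median` are hypothetical. It is meant to show the idea, not to replace the dataset operations in the next cell.

```python
from statistics import median


def drop_warmup(elapsed_ms, spectra, warmup_s=60):
    """Keep only the records taken more than warmup_s seconds into the burst."""
    return [row for t, row in zip(elapsed_ms, spectra) if t / 1000 > warmup_s]


def burst_median(spectra):
    """Median of each wavelength channel across all records in the burst."""
    return [median(channel) for channel in zip(*spectra)]


# synthetic burst: 4 Hz for 3 minutes, with counts that settle after the first minute
elapsed_ms = [i * 250 for i in range(720)]
spectra = [[500 + max(0, 60000 - t) // 100, 200] for t in elapsed_ms]

kept = drop_warmup(elapsed_ms, spectra)
averaged = burst_median(kept)
print(len(kept), averaged)
```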

In [23]:
# The OPTAA data is collected hourly in a burst mode (~4 Hz data sampled for 2-4 minutes). We need to take a median
# average of each burst to clean up variability in the data created by the movement of the NSIF relative to the
# water column and to make the ultimate data files smaller and easier to work with. First, though, drop the first
# 60 seconds worth of data from the burst.
ds.elapsed_run_time.values = ds.elapsed_run_time.where(ds.elapsed_run_time / 1000 > 60)
ds = ds.dropna(dim='time', subset=['elapsed_run_time'])
burst = ds.resample(time='30Min').median()

In [24]:
# Provide a simple plot showing all the data in the burst.
output_notebook()

# make a list of our columns
cols = ['a_reference_raw', 'a_signal_raw', 'c_reference_raw', 'c_signal_raw']
colors = palette[4]

# make the figure
p = figure(title="Raw OPTAA Data -- Burst", width = 850, height = 500)
p.xaxis.axis_label = 'Wavelength Number'
p.yaxis.axis_label = 'Counts'

# loop through our columns and colours
for c, cname in zip(colors, cols):
    for i in range(0, len(ds.time), 10):
        p.line(ds.wavelength_index.values, ds[cname].values[i, :], color=c, legend_label=cname)

p.toolbar_location = 'above'
show(p)

In [25]:
# Provide a simple plot showing the median averaged data.
p = figure(title="Raw OPTAA Data -- Averaged", width = 850, height = 500)
p.xaxis.axis_label = 'Wavelength Number'
p.yaxis.axis_label = 'Counts'

for c, cname in zip(colors, cols):
    p.line(burst.wavelength_index.values, burst[cname].values[0, :], color=c, legend_label=cname)

p.toolbar_location = 'above'
show(p)

The following two functions, and the implementation below them, combine the work from the examples above into a simple routine we can use to access, download and initially process the OPTAA data for the month of June 2016 (change the example regex to get whatever data you are after).

In [26]:
# Add some additional modules
from bs4 import BeautifulSoup
import re

# Function to create a list of the data files of interest on the raw data server
def list_files(url, tag=''):
    page = requests.get(url).text
    soup = BeautifulSoup(page, 'html.parser')
    pattern = re.compile(str(tag))
    return [node.get('href') for node in soup.find_all('a', text=pattern)]

# Function to download a file, parse it, apply median-averaging to the bursts and create a final dataset.
def process_file(file):
    # Initialize the parser, download and parse the data file
    optaa = Parser(baseurl + file)
    r = requests.get(optaa.infile, verify=True)
    optaa.raw = r.content
    optaa.parse_data()

    # Convert the parsed data to a dataframe and from there to an xarray dataset
    df = pd.DataFrame(optaa.data)
    df['time'] = pd.to_datetime(df.time, unit='s')  # use the time variable to set the index
    df.set_index('time', drop=False, inplace=True)
    ds = df.to_xarray()

    # add the wavelength number as a coordinate and dimension to the dataset, and reset the arrays
    ds.coords['wavelength_index'] = np.arange(df.num_wavelengths.values[0])
    ds.update({'a_reference_raw': (('time', 'wavelength_index'), np.vstack(df.a_reference_raw.values)),
              'a_signal_raw': (('time', 'wavelength_index'), np.vstack(df.a_signal_raw.values)),
              'c_reference_raw': (('time', 'wavelength_index'), np.vstack(df.c_reference_raw.values)),
              'c_signal_raw': (('time', 'wavelength_index'), np.vstack(df.c_signal_raw.values))})

    # remove the first 60 seconds worth of data
    ds.elapsed_run_time.values = ds.elapsed_run_time.where(ds.elapsed_run_time / 1000 > 60)
    ds = ds.dropna(dim='time', subset=['elapsed_run_time'])

    # apply a median average to the burst
    burst = ds.resample(time='30Min').median()
    return burst

In [27]:
# Create a list of the files from June using a simple regex as tag to discriminate the files
files = list_files(baseurl, r'201606[0-9]{2}_[0-9]{6}\.optaa\.log')

# Process the data files for June and concatenate into a single dataset
frames = [process_file(f) for f in files]
june = xr.concat(frames, 'time')
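
To check how a regex tag like this discriminates between files before hitting the server, it can be tried against a few sample names. In the sketch below only the first June file name comes from the example above; the others are hypothetical. The dots are escaped so that they match literal periods.

```python
import re

# pattern for June 2016 files: YYYYMMDD_HHMMSS followed by the literal suffix
pattern = re.compile(r"201606[0-9]{2}_[0-9]{6}\.optaa\.log")

# sample file names (all but 20160601_123018 are hypothetical)
names = [
    "20160531_233018.optaa.log",
    "20160601_123018.optaa.log",
    "20160615_003018.optaa.log",
    "20160701_003018.optaa.log",
]

# keep only the names that match the June pattern
june_files = [n for n in names if pattern.search(n)]
print(june_files)
```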

In [28]:
# Plot the burst averaged data for the month of June 2016 (showing just 1 of the raw wavelengths, 
# nominally around 676 nm).
p = figure(x_axis_type="datetime", title="Raw OPTAA Data -- June 2016", width = 800, height = 500)
p.xaxis.axis_label = 'Date and Time'
p.yaxis.axis_label = 'Counts'

# loop through our columns and colours
for c, cname in zip(colors, cols):
    p.line(june.time.values, june[cname].values[:, 69], color=c, legend_label=cname)
    
show(p)

At this point, you can either save the data or apply the processing routines available in pyseas and cgsn_processing to convert the data from raw engineering units to scientific units, using the calibration coefficients that are available online.

In [32]:
june['time'] = june.time.values.astype(float) / 10.0**9  # Convert time to seconds since 1970
june.to_netcdf(os.path.join(os.path.expanduser('~'), 'ooidata/raw/ce02shsm/ce02shsm_june2016_raw_optaa.nc'))
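
Writing the file is likely to fail if the target directory does not exist yet. A minimal sketch for creating it first is shown below; the `ensure_parent_dir` helper is hypothetical, and a temporary directory stands in for the home directory used above.

```python
import os
import tempfile


def ensure_parent_dir(path: str) -> str:
    """Create the parent directory of path if it is missing, then return path."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    return path


# temporary base directory standing in for os.path.expanduser('~')
base = tempfile.mkdtemp()
out_file = ensure_parent_dir(
    os.path.join(base, "ooidata", "raw", "ce02shsm", "ce02shsm_june2016_raw_optaa.nc")
)
print(os.path.isdir(os.path.dirname(out_file)))
```

The returned path could then be passed to `june.to_netcdf` in place of the hard-coded one.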