In this post, I briefly describe how I download the [NASA bio-Optical Marine Algorithm Dataset, or NOMAD](https://seabass.gsfc.nasa.gov/wiki/NOMAD), which was created for algorithm development, extract the data I need, and store it all neatly in a Pandas DataFrame. Here I use the latest dataset, NOMAD v.2, created in 2008.
<!-- Teaser_End -->
First things first; imports!

In [1]:
import requests
import pandas as pd
import re
import numpy as np
import pickle

The data can be accessed through a URL that I'll store in a string below.

In [2]:
NOMADV2url='https://seabass.gsfc.nasa.gov/wiki/NOMAD/nomad_seabass_v2.a_2008200.txt'

Next, I'll write a couple of functions. The first gets the data from the URL. The second parses the text returned by the first and puts it in a Pandas DataFrame. This second function makes more sense after inspecting the content of the page at the URL above.
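For readers who have not opened the file, the parser relies on three features of the SeaBASS layout: a `/fields=` header line listing the column names, an `/end_header` line, and comma-separated data rows after it. The snippet below is a minimal sketch of that idea on a hand-written sample. The sample text and the helper name `parse_seabass_lines` are made up for illustration, and the values are not real NOMAD data.

```python
import pandas as pd

# Hand-written sample in the SeaBASS style; values are invented, not NOMAD data.
sample_lines = [
    "/begin_header",
    "/missing=-999",
    "/fields=year,month,day,lat,lon,chl",
    "/end_header",
    "2003,04,15,38.43,-76.61,38.19",
    "2003,04,16,38.37,-76.50,-999",
]


def parse_seabass_lines(lines):
    """Hypothetical helper: return a DataFrame of strings from SeaBASS-style lines."""
    fields = None
    rows = []
    in_data = False
    for line in lines:
        if in_data:
            # every line after the header is one comma-separated record
            rows.append(line.strip().split(","))
        elif line.startswith("/fields="):
            fields = line[len("/fields="):].strip().split(",")
        elif line.startswith("/end_header"):
            in_data = True
    return pd.DataFrame(rows, columns=fields)


sample_df = parse_seabass_lines(sample_lines)
print(sample_df)
```

Every entry comes back as a string, which is why the numeric columns are cast with `astype` later in this post.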

In [3]:
def GetNomad(url=NOMADV2url):
    """Download the file at url and return its content as a list of lines"""
    resp = requests.get(url) # use the url argument, not the module-level default
    content = resp.text.splitlines()
    resp.close()
    return content

def ParseTextFile(textFile, topickle=False, convert2DateTime=False, **kwargs):
    """
    * topickle: pickle resulting DataFrame if True
    * convert2DateTime: join date/time columns and convert entries to datetime objects
    * kwargs:
        pkl_fname: pickle file name to save DataFrame by, if topickle=True
    """
    # Pre-compute some regex
    columns = re.compile('^/fields=(.+)') # to get field/column names
    units = re.compile('^/units=(.+)') # to get units -- optional
    endHeader = re.compile('^/end_header') # to know when to start storing data
    # Set some milestones
    noFields = True
    getData = False
    # loop through the text data
    for line in textFile:
        if noFields:
            fieldStr = columns.findall(line)
            if len(fieldStr)>0:
                noFields = False
                fieldList = fieldStr[0].split(',')
                dataDict = dict.fromkeys(fieldList)
                continue # nothing left to do with this line, keep looping
        if not getData:
            if endHeader.match(line):
                # end of header reached, start acquiring data
                getData = True 
        else:
            dataList = line.split(',')
            for field,datum in zip(fieldList, dataList):
                if not dataDict[field]:
                    dataDict[field] = []
                dataDict[field].append(datum)
    df = pd.DataFrame(dataDict, columns=fieldList)
    if convert2DateTime:
        datetimelabels=['year', 'month', 'day', 'hour', 'minute', 'second']
        # pandas assembles one timestamp per row from these component columns;
        # a format string is not needed when a DataFrame is passed
        df['Datetime']= pd.to_datetime(df[datetimelabels])
        df.drop(datetimelabels, axis=1, inplace=True)
    if topickle:
        fname=kwargs.pop('pkl_fname', 'dfNomad2.pkl')
        df.to_pickle(fname)
    return df

In [4]:
df = ParseTextFile(GetNomad(), topickle=True, convert2DateTime=True,
                  pkl_fname='/accounts/ekarakoy/DATA/dfNomadRaw.pkl')
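The `convert2DateTime=True` option leans on the fact that `pd.to_datetime` can build timestamps from a DataFrame whose columns are named year, month, day, hour, minute and second. Here is a small sketch of that behavior on its own, using invented rows and an explicit integer cast (the cast is my addition for clarity, not something the function above does):

```python
import pandas as pd

# Toy rows with invented values; every entry is a string, as it would be
# after parsing a text file.
parts = pd.DataFrame({
    "year": ["2003", "2003"],
    "month": ["04", "07"],
    "day": ["15", "21"],
    "hour": ["15", "18"],
    "minute": ["15", "27"],
    "second": ["00", "00"],
})

# Cast to integers, then let pandas assemble one timestamp per row
# from the column names it recognizes.
stamps = pd.to_datetime(parts.astype(int))
print(stamps)
```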

In [5]:
df.head()

| | lat | lon | id | oisst | etopo2 | chl | chl_a | kd405 | kd411 | kd443 | ... | diato | lut | zea | chl_b | beta-car | alpha-car | alpha-beta-car | flag | cruise | Datetime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 38.4279 | -76.61 | 1565 | 3.7 | 0.0 | 38.19 | -999 | -999 | 3.9455 | 3.1457 | ... | -999 | -999 | -999 | -999 | -999 | -999 | -999 | 20691 | ace0301 | 2003-04-15 15:15:00 |
| 1 | 38.368 | -76.5 | 1566 | 3.7 | 0.0 | 35.01 | -999 | -999 | 2.5637 | 2.0529 | ... | -999 | -999 | -999 | -999 | -999 | -999 | -999 | 20675 | ace0301 | 2003-04-15 16:50:00 |
| 2 | 38.3074 | -76.44 | 1567 | 3.7 | 1.0 | 26.91 | -999 | -999 | 2.1533 | 1.7531 | ... | -999 | -999 | -999 | -999 | -999 | -999 | -999 | 20691 | ace0301 | 2003-04-15 17:50:00 |
| 3 | 38.6367 | -76.32 | 1568 | 3.7 | 3.0 | 47.96 | -999 | -999 | 2.69 | 2.2985 | ... | -999 | -999 | -999 | -999 | -999 | -999 | -999 | 20675 | ace0301 | 2003-04-17 18:15:00 |
| 4 | 38.3047 | -76.44 | 1559 | 22.03 | 1.0 | 23.55 | -999 | -999 | 3.095 | 2.3966 | ... | -999 | -999 | -999 | -999 | -999 | -999 | -999 | 20691 | ace0302 | 2003-07-21 18:27:00 |


This DataFrame is quite large and unwieldy, with 212 columns. But Pandas makes it easy to extract the necessary data for a particular project. For my current project, which I'll go over in a subsequent post, I need field data relevant to the [SeaWiFS sensor](https://en.wikipedia.org/wiki/SeaWiFS), in particular optical data at wavelengths 412, 443, 490, 510, 555, and 670 nm. First, let's look at the available bands as they appear in the spectral surface irradiance column labels, which start with 'es'.

In [6]:
bandregex = re.compile('es([0-9]+)')
bands = bandregex.findall(''.join(df.columns))
print(bands)

['405', '411', '443', '455', '465', '489', '510', '520', '530', '550', '555', '560', '565', '570', '590', '619', '625', '665', '670', '683']


Now I can extract data for the bands closest to what I need. In the process, I'll divide water-leaving radiance ('lw' columns) by spectral surface irradiance ('es' columns) to compute remote sensing reflectance, rrs. I will store this new data in a new DataFrame, dfSwf.
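Picking the closest bands is easy to do by eye here, but it can also be sketched in a few lines. The helper name `nearest_band` is hypothetical; the target wavelengths are the SeaWiFS values quoted above and the candidates are the bands printed by the previous cell.

```python
# Nominal SeaWiFS wavelengths (nm) named in the text, and the bands printed above.
seawifs_nominal = [412, 443, 490, 510, 555, 670]
available = ['405', '411', '443', '455', '465', '489', '510', '520', '530', '550',
             '555', '560', '565', '570', '590', '619', '625', '665', '670', '683']


def nearest_band(target, candidates):
    """Hypothetical helper: candidate label whose wavelength is closest to target."""
    return min(candidates, key=lambda c: abs(int(c) - target))


matched = [nearest_band(w, available) for w in seawifs_nominal]
print(matched)
```

This yields 411, 443, 489, 510, 555 and 670, the same labels used in the next cell.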

In [7]:
swfBands = ['411','443','489','510','555','670']
dfSwf = pd.DataFrame(index=df.index)
for b in swfBands:
    # mask the -999 missing-value sentinel before dividing, so gaps stay NaN
    lw = df['lw%s' % b].astype('f8').replace(-999, np.nan)
    es = df['es%s' % b].astype('f8').replace(-999, np.nan)
    dfSwf['rrs%s' % b] = lw / es

In [8]:
dfSwf.head()

| | rrs411 | rrs443 | rrs489 | rrs510 | rrs555 | rrs670 |
|---|---|---|---|---|---|---|
| 0 | 0.001204 | 0.001686 | 0.003293 | 0.004036 | 0.007479 | 0.003465 |
| 1 | 0.001062 | 0.001384 | 0.002173 | 0.002499 | 0.004152 | 0.001695 |
| 2 | 0.000971 | 0.001185 | 0.001843 | 0.002288 | 0.004246 | 0.001612 |
| 3 | 0.001472 | 0.001741 | 0.002877 | 0.003664 | 0.006982 | 0.003234 |
| 4 | 0.000905 | 0.001022 | 0.001506 | 0.001903 | 0.002801 | 0.001791 |
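Because NOMAD flags missing measurements with -999, the order of masking and dividing matters when computing a ratio like rrs. The toy example below, with invented numbers and hypothetical column names, shows what happens when the sentinel is left in place during the division and what happens when it is turned into NaN first. I have not checked how many NOMAD rows this affects.

```python
import numpy as np
import pandas as pd

# Invented numbers: the second row has a missing lw, the third is missing both.
toy = pd.DataFrame({
    "lw": [0.5, -999.0, -999.0],
    "es": [100.0, 120.0, -999.0],
})

naive = toy["lw"] / toy["es"]        # sentinel treated as a real measurement
masked = toy.replace(-999, np.nan)   # sentinel turned into NaN first
safe = masked["lw"] / masked["es"]

print(pd.DataFrame({"naive": naive, "safe": safe}))
```

In the naive column the row with two missing inputs comes out as a plausible-looking 1.0, while the masked version keeps it as NaN.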


My project is about developing a Bayesian framework for estimating biological variables using remote sensing data. So I'll need to copy over a few more features from the initial dataset.

In [9]:
dfSwf['id'] = df.id.astype('i4') # in case I need to relate this data to the original
dfSwf['datetime'] = df.Datetime
dfSwf['hplc_chl'] = df.chl_a.astype('f8')
dfSwf['fluo_chl'] = df.chl.astype('f8')
dfSwf['lat'] = df.lat.astype('f8')
dfSwf['lon'] = df.lon.astype('f8')
dfSwf['depth'] = df.etopo2.astype('f8')
dfSwf['sst'] = df.oisst.astype('f8')
addprods=['a','ad','ag','ap','bb'] # product prefixes, defined once outside the loop
for band in swfBands:
    for prod in addprods:
        dfSwf['%s%s' % (prod,band)] = df['%s%s' % (prod, band)].astype('f8')
dfSwf.replace(-999,np.nan, inplace=True)

Tallying the features I've gathered...

In [10]:
print(dfSwf.columns)

Index(['rrs411', 'rrs443', 'rrs489', 'rrs510', 'rrs555', 'rrs670', 'id',
       'datetime', 'hplc_chl', 'fluo_chl', 'lat', 'lon', 'depth', 'sst',
       'a411', 'ad411', 'ag411', 'ap411', 'bb411', 'a443', 'ad443', 'ag443',
       'ap443', 'bb443', 'a489', 'ad489', 'ag489', 'ap489', 'bb489', 'a510',
       'ad510', 'ag510', 'ap510', 'bb510', 'a555', 'ad555', 'ag555', 'ap555',
       'bb555', 'a670', 'ad670', 'ag670', 'ap670', 'bb670'],
      dtype='object')
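Beyond listing the column names, it can help to know how many usable observations each feature has once -999 has become NaN. A minimal sketch on a toy stand-in for dfSwf (the values are invented and `toy_swf` is a hypothetical name):

```python
import numpy as np
import pandas as pd

# Toy stand-in for dfSwf with invented values; NaN marks a missing measurement.
toy_swf = pd.DataFrame({
    "rrs443": [0.0017, 0.0014, np.nan, 0.0012],
    "hplc_chl": [np.nan, np.nan, 1.2, np.nan],
    "fluo_chl": [38.19, 35.01, np.nan, 26.91],
})

# Number of non-missing observations per feature, sparsest first.
coverage = toy_swf.notna().sum().sort_values()
print(coverage)
```

Running the same two calls on the real dfSwf would show which features are too sparse to rely on.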


That seems like a good dataset to start with. I'll pickle this DataFrame for later.

In [11]:
dfSwf.to_pickle('/accounts/ekarakoy/DATA/dfNomadSWF.pkl')
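Reading the pickle back in a later session is a one-liner with `pd.read_pickle`. The sketch below does a round trip on a small invented frame, writing to a temporary placeholder path rather than the path used in this post:

```python
import os
import tempfile

import pandas as pd

# Small invented frame standing in for dfSwf.
frame = pd.DataFrame({"id": [1565, 1566], "rrs443": [0.001686, 0.001384]})

# Write to a temporary location (a placeholder, not the path used above).
path = os.path.join(tempfile.mkdtemp(), "example.pkl")
frame.to_pickle(path)

# Load it back and check that values and dtypes survived the round trip.
restored = pd.read_pickle(path)
print(restored.equals(frame))
```

Unlike a CSV export, the pickle keeps column dtypes, so the datetime column does not need to be parsed again.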

That's it. Until next time, *Happy Hacking!*