First, we need to read the data. I'll load the HEF and header data into pandas dataframes.

In [35]:
import os
import h5py
import numpy as np
import pandas as pd
from ntplib import system_to_ntp_time

NTP_DIFF = system_to_ntp_time(0)

hef_file = os.path.join('data', 'horizontal_electric_field.nc')
header_file = os.path.join('data', 'hpies_data_header.nc')

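def netcdf_to_pandas(path):
    # The file is HDF5 underneath, so h5py can read it directly
    with h5py.File(path, 'r') as h5file:
        # Timestamps are NTP seconds; shift them to the Unix epoch
        times = h5file['time'][()] - NTP_DIFF
        times = pd.to_datetime(times, unit='s')
        data = {}
        for group in h5file:
            if group == 'time':
                continue
            for var in h5file[group]:
                # Skip per-group timestamp variables; 'time' is the index
                if 'timestamp' in var:
                    continue
                # [()] reads the whole dataset (.value was removed in h5py 3)
                data[var] = h5file[group][var][()]
    return pd.DataFrame(data, index=times)
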
hef_df = netcdf_to_pandas(hef_file)
header_df = netcdf_to_pandas(header_file)
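
The NTP_DIFF correction above is just the gap between the NTP epoch (1900) and the Unix epoch (1970). As a rough, standard-library-only sketch (written for Python 3; NTP_OFFSET, ntp_to_datetime and the sample value are names made up for this illustration, not part of ntplib), the same shift can be reproduced like this:

```python
from datetime import datetime, timezone

# NTP counts seconds from 1900-01-01; Unix time counts from 1970-01-01.
NTP_EPOCH = datetime(1900, 1, 1, tzinfo=timezone.utc)
UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
NTP_OFFSET = int((UNIX_EPOCH - NTP_EPOCH).total_seconds())

def ntp_to_datetime(ntp_seconds):
    """Convert an NTP timestamp (seconds since 1900) to a UTC datetime."""
    return datetime.fromtimestamp(ntp_seconds - NTP_OFFSET, tz=timezone.utc)

# Hypothetical sample value, chosen only for illustration
sample = NTP_OFFSET + 86400.5
print(NTP_OFFSET)
print(ntp_to_datetime(sample))
```

On a system whose clock epoch is 1970, system_to_ntp_time(0) should give the same offset.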

Now, we can check the contents.

In [36]:
print(len(hef_df))
print(len(header_df))

12009
252


We have lots of HEF data (12009 records) and far less header data (252 records), which arrives at around 1/50th the data rate. Let's re-index the header data onto the same time series as the HEF data, forward-filling the missing data points (using the last seen value).

In [5]:
header_df = header_df.reindex(index=hef_df.index, method='ffill')
print(len(header_df))

12009
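
To make the forward-fill behaviour concrete, here is a small pure-Python sketch of the same idea (Python 3). It is not how pandas implements reindex; forward_fill and the toy series are made up for this illustration. It assumes the source times are sorted, and it shows that target times earlier than the first source time have nothing to carry forward:

```python
from bisect import bisect_right

def forward_fill(source_times, source_values, target_times):
    """For each target time, return the latest source value at or before it.

    source_times must be sorted ascending. Targets that precede every
    source time get None, since there is nothing to carry forward.
    """
    filled = []
    for t in target_times:
        i = bisect_right(source_times, t)  # number of source times <= t
        filled.append(source_values[i - 1] if i else None)
    return filled

# Toy data: a slow "header" series and a faster "HEF" series
header_times = [10, 20, 30]
header_values = ['a', 'b', 'c']
hef_times = [5, 10, 12, 19, 20, 27, 35]

print(forward_fill(header_times, header_values, hef_times))
```

In the pandas version, any HEF rows that fall before the first header record would similarly come out as NaN rather than a filled value.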


Now that the time series for both datasets match, let's join the two dataframes:

In [12]:
joined = hef_df.join(header_df, rsuffix='_header')
print(len(joined))

Now joined contains the combination of the HEF data and header data. We now have a hpies_hcno value for all data points in the HEF data:

In [38]:
print(joined.hpies_hcno.head(20))

2015-04-29 14:21:08.658715       0
2015-04-29 14:21:08.708307       0
2015-04-29 14:21:08.766100       0
2015-04-29 14:21:08.825862       0
2015-04-29 14:21:08.887132       0
2015-04-29 14:21:08.947684       0
2015-04-29 14:21:09.009053       0
2015-04-29 14:21:09.069960       0
2015-04-29 14:21:09.131373999    0
2015-04-29 14:21:09.191834       0
2015-04-29 14:21:09.253042       0
2015-04-29 14:21:09.318136       0
2015-04-29 14:21:09.379428       0
2015-04-29 14:21:09.440475       0
2015-04-29 14:21:09.505502       0
2015-04-29 14:21:09.566555       0
2015-04-29 14:21:09.627611       0
2015-04-29 14:21:09.692813       0
2015-04-29 14:21:09.753664       0
2015-04-29 14:21:09.818702       0
Name: hpies_hcno, dtype: int64


Let's compute the data rate for the HEF data:

In [32]:
# Median spacing between samples, converted from nanoseconds to seconds
data_rate = 1 / np.median(np.diff(joined.index.values).astype('float64') / 1e9)
print('data rate: %.2f Hz' % data_rate)

data rate: 16.32 Hz
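
The median is used above rather than the mean, presumably because it is not skewed by the occasional long gap. A minimal sketch of that difference, using only the standard library (Python 3) and toy timestamps invented for the example:

```python
from statistics import mean, median

def sample_rate_hz(timestamps):
    """Estimate the sample rate from the median gap between timestamps (seconds)."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return 1.0 / median(gaps)

# Toy timestamps in seconds: steady 60 ms spacing with one 1 s dropout
times = [0.00, 0.06, 0.12, 0.18, 1.18, 1.24]
gaps = [b - a for a, b in zip(times, times[1:])]

print('median-based rate: %.2f Hz' % sample_rate_hz(times))
print('mean-based rate:   %.2f Hz' % (1.0 / mean(gaps)))
```

With these toy values the single dropout pulls the mean-based estimate down to roughly a quarter of the median-based one.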


Now, let's compute the storage required over the lifetime of the project (taken here as 25 years) if we add a single 32-bit integer to each HEF particle:

In [34]:
one_day = data_rate * 60 * 60 * 24
one_year = one_day * 365
project_lifetime = one_year * 25
int_storage = project_lifetime * 4 / 1e9
print('Expected storage cost for adding one int to each particle: %.2f GB' % int_storage)

Expected storage cost for adding one int to each particle: 51.48 GB
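
The same arithmetic can be wrapped in a small helper to see how the cost scales with field width. This is only a sketch (Python 3): storage_gb and SECONDS_PER_YEAR are illustrative names, it uses the rounded 16.32 Hz rate printed above, and like the cell above it ignores leap days and any compression.

```python
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

def storage_gb(rate_hz, bytes_per_sample, years):
    """Storage in GB (10**9 bytes) for one fixed-width field per sample."""
    return rate_hz * SECONDS_PER_YEAR * years * bytes_per_sample / 1e9

# 16.32 Hz is the rounded rate printed above; 25 years matches the cell above
for width in (1, 2, 4, 8):
    print('%d-byte field: %.2f GB' % (width, storage_gb(16.32, width, 25)))
```

Because the rate is rounded here, the 4-byte figure comes out marginally below the 51.48 GB computed from the unrounded rate.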


So, by storing the header data separately and putting the onus on the user to combine the datasets, we save over 51 GB for the hcno data alone.

Now, let's see if we can figure out what went wrong to produce the occasional jump in eindex.

In [1]:
import ntplib
import struct
import time
import os

message_types = ( 'Data From Instrument',
                  'Data From Driver',
                  'Port Agent Command',
                  'Port Agent Status',
                  'Port Agent Fault',
                  'Instrument Command',
                  'Heartbeat',
                  'Pickled Data From Instrument',
                  'Pickled Data From Driver' )

HEADER_FORMAT = '>3sBHHII'
SYNC_BYTES = b'\xa3\x9d\x7a'
# Open in binary mode: the port agent log is raw bytes, not text
with open(os.path.join('data', 'hpies-port_agent.20150429.data'), 'rb') as raw_data_fh:
    raw_data = raw_data_fh.read()

In [3]:
def ntp_time_to_ascii(tstamp):
    return time.asctime(time.gmtime(ntplib.ntp_to_system_time(tstamp)))

def parse_header(data):
    sync, message_type, packet_size, checksum, time_upper, time_lower = struct.unpack_from(HEADER_FORMAT, data)
    timestamp = time_upper + float(time_lower) / 2**32
    return message_type, packet_size, checksum, timestamp
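
To show what the header format string describes, the sketch below packs a synthetic 16-byte header and parses it back (Python 3, so byte strings are written with a b prefix). Every field value in the example packet is made up; only the format string, sync bytes and parsing logic come from the cells above.

```python
import struct

HEADER_FORMAT = '>3sBHHII'  # sync, type, size, checksum, NTP seconds, NTP fraction
SYNC_BYTES = b'\xa3\x9d\x7a'

def parse_header(data):
    sync, message_type, packet_size, checksum, time_upper, time_lower = \
        struct.unpack_from(HEADER_FORMAT, data)
    # The fraction field counts units of 2**-32 seconds
    timestamp = time_upper + time_lower / 2**32
    return message_type, packet_size, checksum, timestamp

# Hypothetical packet: every field value here is made up for the example
payload = b'hello'
header_size = struct.calcsize(HEADER_FORMAT)
header = struct.pack(HEADER_FORMAT, SYNC_BYTES, 1, header_size + len(payload),
                     0, 3639305268, 2**31)

print(header_size)
print(parse_header(header + payload))
```

A fraction field of 2**31 corresponds to exactly half a second, so the parsed timestamp ends in .5.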

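Data from the instrument is message type 1. Assuming the message_types tuple above lists the types in order starting at 1, the name of a given type is message_types[message_type - 1].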

In [None]:
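HEADER_SIZE = struct.calcsize(HEADER_FORMAT)
packets = []
offset = 0
while True:
    # Find the next sync marker; stop when no complete header remains
    start = raw_data.find(SYNC_BYTES, offset)
    if start == -1 or start + HEADER_SIZE > len(raw_data):
        break

    message_type, packet_size, checksum, timestamp = parse_header(raw_data[start:start + HEADER_SIZE])
    if packet_size < HEADER_SIZE:
        # Implausible size (likely a false sync match); search past it
        # instead of looping forever on a zero-length packet
        offset = start + len(SYNC_BYTES)
        continue

    packet = raw_data[start:start + packet_size]
    offset = start + packet_size

    # Keep only data from the instrument
    if message_type == 1:
        packets.append(packet)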
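
The scan above can be tried out on a synthetic byte stream before running it on the real capture. The following is a sketch under assumptions (Python 3): make_packet and extract_packets are hypothetical helpers, checksums and timestamps are dummy zeros, and packet_size is taken to include the header, as the slicing in the loop above implies. It also guards against an implausibly small size so a false sync match cannot stall the scan.

```python
import struct

HEADER_FORMAT = '>3sBHHII'
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)
SYNC_BYTES = b'\xa3\x9d\x7a'

def make_packet(message_type, payload):
    """Build a synthetic packet (checksum and timestamp are dummy zeros)."""
    size = HEADER_SIZE + len(payload)
    return struct.pack(HEADER_FORMAT, SYNC_BYTES, message_type, size, 0, 0, 0) + payload

def extract_packets(stream, wanted_type=1):
    """Return every packet of wanted_type found in stream."""
    packets = []
    offset = 0
    while True:
        start = stream.find(SYNC_BYTES, offset)
        if start == -1 or start + HEADER_SIZE > len(stream):
            break
        _, message_type, size, _, _, _ = struct.unpack_from(HEADER_FORMAT, stream, start)
        if size < HEADER_SIZE:
            # Size cannot be right; skip this sync match and keep searching
            offset = start + len(SYNC_BYTES)
            continue
        if message_type == wanted_type:
            packets.append(stream[start:start + size])
        offset = start + size
    return packets

# Two instrument packets, one heartbeat (type 7) and some junk bytes
stream = (b'junk' + make_packet(1, b'first') + make_packet(7, b'')
          + b'\x00\x00' + make_packet(1, b'second'))
found = extract_packets(stream)
print([p[HEADER_SIZE:] for p in found])
```

Note that this kind of scan trusts the size field; a sync pattern that happens to occur inside a payload is skipped only because the scan jumps over whole packets.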