# Observatory Data

<a id="top"/>

## Contents

- [Minute and second mean values](#obsms)
    - [Read data from CDF files](#obsms-read-cdf)
    - [Convert data to pandas DataFrame](#obsms-to-dataframe)
    - [Read data from multiple files](#obsms-multifiles)

<a id="obsms" />

## Minute and second mean values

[[TOP]](#top)

Repositories:
- ftp://ftp.nerc-murchison.ac.uk/geomag/Swarm/AUX_OBS/minute/
- ftp://ftp.nerc-murchison.ac.uk/geomag/Swarm/AUX_OBS/second/

<a id="obsms-read-cdf" />

### Read data from CDF files

[[TOP]](#top)

Settings and functions:

In [None]:
# Python standard library
from contextlib import closing
from pathlib import Path

# Extra libraries
import cdflib
import numpy as np
import pandas as pd


# TODO: update the data directories once the files are available in the shared folder
OBS_MINUTE_DIR = Path('~/data/AUX_OBS/minute').expanduser()
OBS_SECOND_DIR = Path('~/data/AUX_OBS/second').expanduser()


def cdf_to_pandas(*files):
    """Convert CDF `files` to a pandas dataframe."""
    dfs = []
    for file in files:
        with closing(cdflib.cdfread.CDF(file)) as data:
            ts = pd.DatetimeIndex(cdflib.cdfepoch.encode(data.varget('Timestamp'), iso_8601=True), name='Timestamp')
            df = pd.DataFrame(
                {
                    'IAGA_code': data.varget('IAGA_code')[:,0,0],
                    'Latitude': data.varget('Latitude'),
                    'Longitude': data.varget('Longitude'),
                    'Radius': data.varget('Radius'),
                    'B_N': data.varget('B_NEC')[:,0],
                    'B_E': data.varget('B_NEC')[:,1],
                    'B_C': data.varget('B_NEC')[:,2]
                },
                index=ts
            )
        dfs.append(df)
    return pd.concat(dfs).sort_values(by=['IAGA_code', 'Timestamp'])

In [None]:
# NBVAL_SKIP
# OPTIONAL - download data from the FTP server
!wget -nv -nc -P ~/data/AUX_OBS/minute ftp://ftp.nerc-murchison.ac.uk/geomag/Swarm/AUX_OBS/minute/SW_OPER_AUX_OBSM2__201912*
!wget -nv -nc -P ~/data/AUX_OBS/second ftp://ftp.nerc-murchison.ac.uk/geomag/Swarm/AUX_OBS/second/SW_OPER_AUX_OBSS2__201912*
!find ~/data/AUX_OBS -name "*.ZIP" | while read f ; do unzip -u $f -d `dirname $f` ; done
!find ~/data/AUX_OBS -name "*.ZIP" -delete
!find ~/data/AUX_OBS -name "*.HDR" -delete

Select one of the AUX_OBSM2_ files (e.g. the first one):

In [None]:
test_file = sorted(OBS_MINUTE_DIR.glob('SW_OPER_AUX_OBSM2_*.DBL'))[0]

test_file

Read the CDF file using `cdflib` (for more information on `cdflib`, see: https://github.com/MAVENSDC/cdflib):

In [None]:
data = cdflib.CDF(test_file)

Get info about the file as a Python dictionary:

In [None]:
data.cdf_info()

You can see that measurements are stored as *zVariables*:

In [None]:
data.cdf_info()['zVariables']

Data can be retrieved via the `.varget()` method, e.g.:

In [None]:
data.varget('B_NEC')

Data is returned as a `numpy.ndarray` object (for more information on `numpy.ndarray`, see: https://numpy.org/doc/stable/reference/arrays.ndarray.html).

Variable attributes can be retrieved using the `.varattsget()` method, e.g.:

In [None]:
data.varattsget('B_NEC')

Attributes are returned as a Python dictionary.

Let's retrieve the timestamps:

In [None]:
data.varget('Timestamp')

`Timestamp` type is:

In [None]:
data.varget('Timestamp').dtype

Timestamps are represented as NumPy `float64` values. Why? Get info about the `Timestamp` variable using the `.varinq()` method:

In [None]:
data.varinq('Timestamp')

The returned dictionary shows that the data type is *CDF_EPOCH*, which consists of a floating point value representing the number of milliseconds since 01-Jan-0000 00:00:00.000. It can be converted to a more readable format using the `cdflib.cdfepoch.encode()` function:

In [None]:
ts = cdflib.cdfepoch.encode(data.varget('Timestamp'), iso_8601=True)

ts[:5]

Or to `numpy.datetime64`:

In [None]:
ts = np.array(cdflib.cdfepoch.encode(data.varget('Timestamp'), iso_8601=True), dtype='datetime64[ms]')

ts[:5]
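
For intuition, the CDF_EPOCH definition given above can also be reproduced with the Python standard library alone. The following is a minimal sketch, not a replacement for `cdflib.cdfepoch`: the helper name `cdf_epoch_to_datetime` and the sample value are hypothetical, and the offset is simply the number of milliseconds between 01-Jan-0000 and 01-Jan-1970.

```python
from datetime import datetime, timedelta, timezone

# CDF_EPOCH value of 1970-01-01T00:00:00Z:
# 719528 days between 0000-01-01 and 1970-01-01, expressed in milliseconds.
CDF_EPOCH_AT_UNIX_EPOCH_MS = 719_528 * 86_400 * 1_000

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def cdf_epoch_to_datetime(epoch_ms):
    """Convert a single CDF_EPOCH value (milliseconds) to a UTC datetime."""
    return UNIX_EPOCH + timedelta(milliseconds=epoch_ms - CDF_EPOCH_AT_UNIX_EPOCH_MS)


# Hypothetical sample value (not read from a file): 2019-12-01T00:00:00Z
example = 63_742_377_600_000.0
print(cdf_epoch_to_datetime(example).isoformat())
```

Values of this size are still represented exactly by `float64`, which is why the timestamps can be stored as floating point numbers without losing millisecond resolution.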

You may also be interested in the CDF global attributes:

In [None]:
data.globalattsget()

Close the file when you have finished:

In [None]:
data.close()
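
Other cells in this notebook wrap the file object in `contextlib.closing()` instead of calling `.close()` by hand. The sketch below illustrates what that wrapper does, using a hypothetical `DummyReader` class as a stand-in for any object that exposes a `close()` method (it is not part of `cdflib`): `close()` is called when the `with` block exits, even if an exception is raised inside it.

```python
from contextlib import closing


class DummyReader:
    """Hypothetical stand-in for any object exposing a close() method."""

    def __init__(self):
        self.closed = False

    def read(self):
        return [1, 2, 3]

    def close(self):
        self.closed = True


# Normal case: close() is called automatically at the end of the block
reader = DummyReader()
with closing(reader) as data:
    values = data.read()

# Error case: close() is still called before the exception propagates
failing = DummyReader()
try:
    with closing(failing):
        raise ValueError("simulated read error")
except ValueError:
    pass

print(values, reader.closed, failing.closed)
```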

AUX_OBSS2_ data contains the same variables:

In [None]:
with closing(cdflib.cdfread.CDF(sorted(OBS_SECOND_DIR.glob('SW_OPER_AUX_OBSS2_*.DBL'))[0])) as data:
    zvariables = data.cdf_info()['zVariables']

zvariables

<a id="obsms-to-dataframe" />

### Convert data to pandas DataFrame

[[TOP]](#top)

Data can be represented as a `pandas.DataFrame` object:

In [None]:
with closing(cdflib.cdfread.CDF(test_file)) as data:
    ts = np.array(cdflib.cdfepoch.encode(data.varget('Timestamp'), iso_8601=True), dtype='datetime64[ms]')
    df = pd.DataFrame(
        {
            'IAGA_code': data.varget('IAGA_code')[:,0,0],
            'Latitude': data.varget('Latitude'),
            'Longitude': data.varget('Longitude'),
            'Radius': data.varget('Radius'),
            'B_N': data.varget('B_NEC')[:,0],
            'B_E': data.varget('B_NEC')[:,1],
            'B_C': data.varget('B_NEC')[:,2]
        },
        index=ts
    )

df

For more information on `pandas.DataFrame`, see: https://pandas.pydata.org/docs/reference/frame.html.
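
The indexing expressions used to build the DataFrame (`[:,0,0]` for `IAGA_code`, `[:,0]`, `[:,1]` and `[:,2]` for `B_NEC`) each reduce a multi-dimensional array to a one-dimensional column. The following sketch shows this on synthetic arrays. The shapes `(N, 1, 1)` and `(N, 3)` are assumptions inferred from those indexing expressions, and all codes and values are made up.

```python
import numpy as np

# Synthetic arrays mimicking the assumed layout (values are made up)
iaga_code = np.array([[['AAA']], [['AAA']], [['BBB']]])  # shape (3, 1, 1)
b_nec = np.array([
    [10.0, 1.0, 100.0],
    [20.0, 2.0, 200.0],
    [30.0, 3.0, 300.0],
])  # shape (3, 3): one row per record, columns are N, E, C

# Drop the two trailing singleton axes to get one code per record
codes = iaga_code[:, 0, 0]

# Select each vector component as a separate 1-D column
b_n, b_e, b_c = b_nec[:, 0], b_nec[:, 1], b_nec[:, 2]

print(codes.shape, b_n.shape)
print(codes, b_n, b_e, b_c)
```

Every resulting array has length `N`, which is what `pandas.DataFrame` needs when each dictionary entry becomes one column.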

Example: get minimum and maximum dates:

In [None]:
df.index.min(), df.index.max()

Example: get list of observatories (IAGA codes) stored in the file:

In [None]:
df['IAGA_code'].unique()

Example: get list of observatories (IAGA codes) included in the following ranges of coordinates:
- $30 \leq Latitude \leq 70$
- $-10 \leq Longitude \leq 40$

In [None]:
df[(df['Latitude'] >= 30) & (df['Latitude'] <= 70) & (df['Longitude'] >= -10) & (df['Longitude'] <= 40)]['IAGA_code'].unique()

You can do the same using the `.query()` method (see: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html#pandas.DataFrame.query):

In [None]:
df.query('(30 <= Latitude <= 70) and (-10 <= Longitude <= 40)')['IAGA_code'].unique()
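
To check that the two selection styles agree without needing the data files, here is a small sketch on a toy DataFrame. The observatory codes and coordinates are invented for illustration, and `Series.between()` (inclusive on both ends by default) is used as a shorter way of writing the pair of comparisons.

```python
import pandas as pd

# Toy data: codes and coordinates are invented for illustration
toy = pd.DataFrame({
    'IAGA_code': ['AAA', 'AAA', 'BBB', 'CCC', 'DDD'],
    'Latitude': [50.0, 50.0, -34.0, 68.0, 45.0],
    'Longitude': [5.0, 5.0, 19.0, 19.0, 120.0],
})

# Boolean mask: both bounds are inclusive, as in the ranges listed above
mask = toy['Latitude'].between(30, 70) & toy['Longitude'].between(-10, 40)
by_mask = toy.loc[mask, 'IAGA_code'].unique()

# Same selection expressed with .query()
by_query = toy.query(
    '(30 <= Latitude <= 70) and (-10 <= Longitude <= 40)'
)['IAGA_code'].unique()

print(by_mask, by_query)
```

`BBB` is rejected by the latitude range and `DDD` by the longitude range, so both approaches return the same two codes.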

<a id="obsms-multifiles" />

### Read data from multiple files

[[TOP]](#top)

Pandas dataframes can be concatenated to represent data from more than one file. E.g. read data from the next AUX_OBSM2_ file:

In [None]:
test_file1 = sorted(OBS_MINUTE_DIR.glob('SW_OPER_AUX_OBSM2_*.DBL'))[1]

with closing(cdflib.cdfread.CDF(test_file1)) as data:
    ts = np.array(cdflib.cdfepoch.encode(data.varget('Timestamp'), iso_8601=True), dtype='datetime64[ms]')
    df1 = pd.DataFrame(
        {
            'IAGA_code': data.varget('IAGA_code')[:,0,0],
            'Latitude': data.varget('Latitude'),
            'Longitude': data.varget('Longitude'),
            'Radius': data.varget('Radius'),
            'B_N': data.varget('B_NEC')[:,0],
            'B_E': data.varget('B_NEC')[:,1],
            'B_C': data.varget('B_NEC')[:,2]
        },
        index=ts
    )

df1

The two dataframes can be concatenated using the `pandas.concat()` function (for more information see: https://pandas.pydata.org/docs/reference/api/pandas.concat.html#pandas.concat):

In [None]:
new_df = pd.concat([df, df1])
new_df.index.names = ['Timestamp']
new_df = new_df.sort_values(by=['IAGA_code', 'Timestamp'])

new_df.index.min(), new_df.index.max()
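
The same pattern can be tried on two small made-up frames. This sketch (all names and values are hypothetical) shows that `pd.concat()` stacks the frames in the order given, and that `sort_values()` returns a new, sorted DataFrame rather than modifying the original, so its result has to be assigned to a name in order to be kept.

```python
import pandas as pd

# Two made-up frames with a named DatetimeIndex, one per "file"
idx_a = pd.DatetimeIndex(['2019-12-01T00:00', '2019-12-01T00:01'], name='Timestamp')
idx_b = pd.DatetimeIndex(['2019-12-02T00:00', '2019-12-02T00:01'], name='Timestamp')
df_a = pd.DataFrame({'IAGA_code': ['BBB', 'AAA'], 'B_N': [2.0, 1.0]}, index=idx_a)
df_b = pd.DataFrame({'IAGA_code': ['BBB', 'AAA'], 'B_N': [4.0, 3.0]}, index=idx_b)

# Stack the frames; rows keep the order in which they were given
combined = pd.concat([df_a, df_b])

# Sort by a column and by the index level name; the result is a new object
sorted_df = combined.sort_values(by=['IAGA_code', 'Timestamp'])

print(combined['B_N'].tolist())   # original order is unchanged
print(sorted_df['B_N'].tolist())  # grouped by code, then by time
```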

You can use the `cdf_to_pandas()` function defined above to concatenate data from multiple files. E.g.:

In [None]:
cdf_to_pandas(test_file, test_file1)

With AUX_OBSS2_ data:

In [None]:
cdf_to_pandas(*sorted(OBS_SECOND_DIR.glob('SW_OPER_AUX_OBSS2_*.DBL'))[:2])