# Keras-based Machine Learning Pipeline

This is a pipeline for data analysis using the Keras package and TensorFlow.

**Steps:**
1. Load SITL reports
2. Load MMS data
3. Data preprocessing
4. Train LSTM model
5. Apply trained model

Introduction to LSTMs and Keras: https://adventuresinmachinelearning.com/keras-lstm-tutorial/

The pipeline can be based on an example from Machine Learning Mastery: https://machinelearningmastery.com/how-to-develop-rnn-models-for-human-activity-recognition-time-series-classification/

In [12]:
import os
from datetime import datetime, timedelta
import numpy as np
import pandas as pd
from matplotlib import pyplot
from matplotlib.dates import num2date, date2num

# Machine learning
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Dropout
from keras.layers import LSTM
from keras.utils import to_categorical

import retrieve_sitl
import retrieve_mms

## 1. Load SITL data

Load the parsed reports. The following would be useful:
- Reports reduced to "BBF" / "DF" / some other interesting signals
- Start time, end time and duration for each report
- Whether the report covers a single event or multiple events
- FOM value
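
The wish list above could be sketched roughly as follows. This is only an illustration: the regular expressions in `SIGNAL_PATTERNS`, the function name `summarise_reports` and the new column names (`BBF`, `DF`, `Duration`, `MultiSignal`) are assumptions, and "multiple events" is interpreted here as "more than one signal type mentioned in the discussion", which may not be what is wanted in the end. The two demo rows are copied from reports shown in this notebook.

```python
import pandas as pd

# Hypothetical patterns; extend them to match the vocabulary used in the reports.
SIGNAL_PATTERNS = {
    "BBF": r"\bBBFs?\b|bursty bulk flow",
    "DF": r"\bDFs?\b|dipolari[sz]ation front",
}


def summarise_reports(reports):
    """Return a copy of `reports` with signal flags, a duration and a multi-signal flag."""
    out = reports.copy()
    for name, pattern in SIGNAL_PATTERNS.items():
        # Case-insensitive keyword search in the free-text discussion.
        out[name] = out["Discussion"].str.contains(pattern, case=False, regex=True, na=False)
    out["Duration"] = pd.to_datetime(out["Endtime"]) - pd.to_datetime(out["Starttime"])
    # Assumption: a report counts as "multiple" when more than one signal type is mentioned.
    out["MultiSignal"] = out[list(SIGNAL_PATTERNS)].sum(axis=1) > 1
    return out


# Two rows copied from the report table in this notebook.
demo = pd.DataFrame(
    {
        "FOM": [40.0, 10.0],
        "Discussion": [
            "BBFs, dipolarization front, energetic partic...",
            "Tailward flows, particle enhancements, series...",
        ],
        "Starttime": ["2016-05-22 09:03:34", "2016-06-11 06:37:14"],
        "Endtime": ["2016-05-22 10:02:54", "2016-06-11 06:50:54"],
    },
    index=[625, 833],
)
summary = summarise_reports(demo)
print(summary[["FOM", "BBF", "DF", "Duration", "MultiSignal"]])
```

Keeping the flags as separate boolean columns makes it easy to select, for example, all reports that mention BBFs but not DFs.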

In [2]:
# ----------------------------------
# nothing here yet... fill me in!
# ----------------------------------

In [3]:
# ... But for now, here's a basic approach:
pickle_path = "../pydata/reports_df.p"   # pickle created in example_sitl_read.ipynb
reports_df = pd.read_pickle(pickle_path)
reports_df = retrieve_sitl.parse_times(reports_df)
reports_df = reports_df.drop(columns='datetime')
reports_df = retrieve_sitl.combine_rows(reports_df)

# Pick out reports from 2016 with BBFs:
reports_df_2016 = reports_df.loc[np.logical_and(reports_df['Day'] >= '2016-01-01', reports_df['Day'] < '2017-01-01')]
reports_df_2016.loc[reports_df_2016.Discussion.str.contains('BBF',case=False)]

| | FOM | ID | Discussion | Starttime | Endtime | Day |
|---|---|---|---|---|---|---|
| 625 | 40.0 | musanova(EVA) | BBFs, dipolarization front, energetic partic... | 2016-05-22 09:03:34 | 2016-05-22 10:02:54 | 2016-05-22 |
| 629 | 40.0 | musanova(EVA) | dipolarization front, BBFs, energetic partic... | 2016-05-23 08:58:04 | 2016-05-23 09:16:34 | 2016-05-23 |
| 633 | 40.0 | musanova(EVA) | BBFs, plasmasheet thinning, wave activity, en... | 2016-05-24 05:50:34 | 2016-05-24 07:05:04 | 2016-05-24 |
| 637 | 40.0 | musanova(EVA) | Dipolarization front, BBFs, energetic particl... | 2016-05-25 03:57:54 | 2016-05-25 04:34:44 | 2016-05-25 |
| 639 | 30.0 | musanova(EVA) | Dipolarization front, BBFs, waves, injections | 2016-05-25 17:20:54 | 2016-05-25 17:34:04 | 2016-05-25 |
| 641 | 30.0 | musanova(EVA) | Dipolarization front, BBF, energetic particl... | 2016-05-26 06:20:44 | 2016-05-26 07:09:14 | 2016-05-26 |
| 642 | 20.0 | musanova(EVA) | Dipolarization front, BBFs | 2016-05-26 10:20:24 | 2016-05-26 10:47:34 | 2016-05-26 |
| 831 | 10.0 | ajaynes(EVA) | Possible DFs/BBFs, particle enhancements | 2016-06-11 06:11:24 | 2016-06-11 06:21:34 | 2016-06-11 |
| 833 | 10.0 | ajaynes(EVA) | Tailward flows, particle enhancements, series... | 2016-06-11 06:37:14 | 2016-06-11 06:50:54 | 2016-06-11 |
| 838 | 15.0 | ajaynes(EVA) | BBFs/DFs, particle enhancements, nice proton ... | 2016-06-11 11:39:44 | 2016-06-11 11:54:04 | 2016-06-11 |


In [4]:
# Here's an example event, but ideally we would have a list of these events:
event = reports_df_2016.loc[625]
event

FOM                                                        40.0
ID                                                musanova(EVA)
Discussion      BBFs, dipolarization front, energetic partic...
Starttime                                   2016-05-22 09:03:34
Endtime                                     2016-05-22 10:02:54
Day                                                  2016-05-22
Name: 625, dtype: object

## 2. Load MMS data

Load in raw data for processing.

Load both magnetic field data (FGM) and plasma/ion data (FPI, not yet implemented). Both will be useful inputs for a machine learning algorithm.

In [9]:
starttime, endtime = datetime(2016,5,22), datetime(2016,5,23)
local_data_dir = '../pydata'
base_path = os.path.join(local_data_dir, 'mms1', 'fgm', 'srvy', 'l2')
data = retrieve_mms.get_fgm(base_path, starttime, endtime)
t_utc, B_x, B_y, B_z, Bt = data

Reading file at ../pydata/mms1/fgm/srvy/l2/2016/05/mms1_fgm_srvy_l2_20160522_v4.40.0.cdf


In [10]:
# Also read FPI data!

## 3. Data preprocessing

First steps...

1. Resample data to 1 s? A higher resolution is probably not needed, but someone more familiar with the signals may know better.
2. Label data with classes: all data starts as class 0, data marked as BBF in the reports is labelled class 1, data marked as DF is labelled class 2, and so on. We can then try to predict the event type contained in a short time series within the data.
3. We may want to remove general trends so that the data is stationary. The signals we're looking for are probably related only to variations, not to the overall increase or decrease in magnetic field values.
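
One possible way to approach step 3 is to subtract a centred rolling mean from each resampled component. The sketch below assumes 1 s sampling; the helper name `remove_trend` and the roughly 10-minute window are placeholders that have not been tuned, and other detrending methods (differencing, high-pass filtering) may suit the data better.

```python
import numpy as np
import pandas as pd


def remove_trend(values, window=601):
    """Subtract a centred rolling mean so that only short-term variations remain.

    `window` is the number of samples in the rolling mean. An odd value keeps
    the centred window symmetric; 601 samples is about 10 minutes at 1 s
    sampling (an assumed, untuned value).
    """
    series = pd.Series(values)
    # min_periods=1 lets the window shrink at the edges instead of returning NaN.
    trend = series.rolling(window, center=True, min_periods=1).mean()
    return (series - trend).to_numpy()


# Synthetic demo: a slow linear drift plus a fast oscillation.
t = np.arange(3600.0)
signal = 0.01 * t + np.sin(2 * np.pi * t / 30.0)
detrended = remove_trend(signal)
print("mean before:", signal.mean(), "mean after:", detrended.mean())
```

The same call would apply to each of `B_x_sec`, `B_y_sec`, `B_z_sec` and `Bt_sec` once they have been resampled.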

In [13]:
# Resample data:
# --------------
t_sec = np.array([starttime + timedelta(seconds=n) for n in range(int((endtime-starttime).total_seconds()))])
t_sec_num = date2num(t_sec)
t_utc_num = date2num(t_utc)
# Interpolate to new time array:
B_x_sec = np.interp(t_sec_num, t_utc_num, B_x)
B_y_sec = np.interp(t_sec_num, t_utc_num, B_y)
B_z_sec = np.interp(t_sec_num, t_utc_num, B_z)
Bt_sec = np.interp(t_sec_num, t_utc_num, Bt)

In [14]:
# Create arrays with data classes:
# --------------------------------
t_classes = np.zeros(len(t_sec))
t_classes[np.logical_and(t_sec >= event.Starttime, t_sec < event.Endtime)] = 1
print("Values in array:", np.unique(t_classes))

Values in array: [0. 1.]
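
The single-event labelling above could be generalised to a list of events with different classes, along the lines of step 2. This is a sketch only: the mapping `CLASS_CODES` and the helper `label_samples` are hypothetical, the demo times are made up, and where events overlap the later entry in the list simply overwrites the earlier one.

```python
import numpy as np

# Assumed class codes; 0 is reserved for "nothing of interest".
CLASS_CODES = {"BBF": 1, "DF": 2}


def label_samples(times, events):
    """Assign one integer class to every time stamp.

    times  : array of numpy datetime64 values
    events : iterable of (start, end, signal_name) tuples
    """
    labels = np.zeros(len(times), dtype=int)
    for start, end, name in events:
        mask = (times >= np.datetime64(start)) & (times < np.datetime64(end))
        labels[mask] = CLASS_CODES[name]  # later events overwrite earlier ones
    return labels


# Made-up ten-second example with two overlapping events.
times = np.arange(
    np.datetime64("2016-05-22T09:00:00"),
    np.datetime64("2016-05-22T09:00:10"),
    np.timedelta64(1, "s"),
)
events = [
    ("2016-05-22T09:00:02", "2016-05-22T09:00:05", "BBF"),
    ("2016-05-22T09:00:04", "2016-05-22T09:00:07", "DF"),
]
labels = label_samples(times, events)
print(labels)
```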


### Steps in preparation for machine learning

Combine all MMS variables into one array, and then...

1. Split MMS data into 30-minute sections (samples).
2. Give each section a class if it contains an interesting signal. Apply keras.utils.to_categorical() to these class labels.
3. Scale input data for LSTM (sklearn.preprocessing.MinMaxScaler).
4. Make a train-test split of all data.
5. Input data (X) should be n-dimensional and contain magnetic field variations and ion data. **What else can we add to this?** Output data (y) should be 1-dimensional and contain one class (int value) per 30-minute section. It's a good place to start.
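
The steps above might look roughly like this in NumPy. Several things here are assumptions rather than decisions: the helper names `make_windows` and `min_max_scale`, the rule that a window takes the highest class code it contains, the chronological 80/20 split, and the use of a hand-written min-max scaling as a stand-in for sklearn's `MinMaxScaler`. Random numbers stand in for the stacked MMS variables, and the labelled interval corresponds to the example event (09:03:34 to 10:02:54).

```python
import numpy as np


def make_windows(features, labels, window):
    """Cut (n_times, n_features) data into non-overlapping windows of `window` samples."""
    n_windows = len(features) // window
    used = n_windows * window  # drop any incomplete window at the end
    X = features[:used].reshape(n_windows, window, features.shape[1])
    # Assumption: a window takes the highest class code found inside it.
    y = labels[:used].reshape(n_windows, window).max(axis=1)
    return X, y


def min_max_scale(X_train, X_other):
    """Scale each feature to [0, 1] using the range of the training set only."""
    lo = X_train.min(axis=(0, 1), keepdims=True)
    hi = X_train.max(axis=(0, 1), keepdims=True)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero for constant features
    return (X_train - lo) / span, (X_other - lo) / span


# Stand-in for one day of 1 s data with four variables (e.g. B_x, B_y, B_z, Bt).
rng = np.random.default_rng(0)
features = rng.normal(size=(86400, 4))
labels = np.zeros(86400, dtype=int)
labels[32614:36174] = 1  # seconds of day for 09:03:34 to 10:02:54

X, y = make_windows(features, labels, window=1800)  # 30-minute sections

# Chronological split: the first 80 % of windows for training, the rest for testing.
n_train = int(0.8 * len(X))
X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]
X_train, X_test = min_max_scale(X_train, X_test)

print(X.shape, y.shape, "windows with a signal:", np.flatnonzero(y))
```

With only one labelled event every positive window ends up in the training set, so a useful test set will need events from several days. The integer labels in `y_train` and `y_test` are what `keras.utils.to_categorical()` would then convert to one-hot vectors.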

In [15]:
# lots of preprocessing to do...

# example of splitting array into chunks
test = np.array_split(B_x_sec, 24*2)
test

[array([ -66.12447357,  -66.12447357,  -66.12447357, ..., -107.55329567,
        -107.5307406 , -107.45926477]),
 array([-107.42327801, -107.33263814, -107.26846468, ..., -103.12379227,
        -103.11797762, -103.1142554 ]),
 array([-103.08862392, -103.05110097, -102.97588577, ...,  -92.99384607,
         -93.00339458,  -92.98151217]),
 array([-92.96826781, -92.90722263, -92.8760217 , ..., -78.25201963,
        -78.21153439, -78.16517996]),
 array([-78.10604359, -78.08643259, -78.05366718, ..., -71.5058605 ,
        -71.52721376, -71.53060211]),
 array([-71.54413859, -71.50295362, -71.46344387, ..., -64.15184825,
        -64.18456652, -64.19085939]),
 array([-64.17307514, -64.17144794, -64.13855006, ..., -68.25921929,
        -67.91225867, -67.64160247]),
 array([-67.52788397, -67.60527692, -67.87689872, ..., -64.60936155,
        -64.63435936, -64.64636287]),
 array([-64.60705909, -64.62022575, -64.61281618, ..., -62.93120145,
        -63.20035394, -63.22907614]),
 ...]

## 4. Train LSTM model

Take this as a starting example: https://machinelearningmastery.com/how-to-develop-rnn-models-for-human-activity-recognition-time-series-classification/
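
A first model could be a small stacked classifier along these lines. It is a sketch, not a tuned architecture: the function name `build_lstm_model`, the layer sizes and the dropout rate are assumptions, and the input is assumed to have shape (samples, time steps, features) with one-hot encoded class labels.

```python
def build_lstm_model(n_timesteps, n_features, n_classes, n_units=100):
    """Build and compile a small LSTM classifier for windowed time series."""
    # Imported inside the function so that defining it does not start TensorFlow.
    from keras.models import Sequential
    from keras.layers import Dense, Dropout, Input, LSTM

    model = Sequential(
        [
            Input(shape=(n_timesteps, n_features)),
            LSTM(n_units),                           # summarises each window as one vector
            Dropout(0.5),                            # assumed rate, untuned
            Dense(n_units, activation="relu"),
            Dense(n_classes, activation="softmax"),  # one probability per class
        ]
    )
    model.compile(
        loss="categorical_crossentropy",
        optimizer="adam",
        metrics=["accuracy"],
    )
    return model


# Intended use (not run here), with X_train of shape (n_samples, 1800, n_features)
# and integer class labels y_train:
#   model = build_lstm_model(X_train.shape[1], X_train.shape[2], n_classes=3)
#   model.fit(X_train, to_categorical(y_train, num_classes=3), epochs=15, batch_size=64)
```

Windows of 1800 time steps are long for an LSTM, so it may be worth trying shorter windows or a coarser resampling if training turns out to be slow.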

In [None]:
# nothing here yet...

## 5. Apply trained model

Once the model is trained, can we use it to find similar signals in data outside the training set? Would be cool.
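
One way this could work is to slide a window over new data and keep the windows that the model assigns to a non-zero class. In the sketch below, `sliding_windows`, `find_candidates` and the 0.5 probability threshold are assumptions, and `ThresholdModel` is a made-up stand-in with a Keras-like `predict()` method so that the example runs without a trained network.

```python
import numpy as np


def sliding_windows(features, window, step):
    """Return windows of shape (n_windows, window, n_features) and their start indices."""
    starts = np.arange(0, len(features) - window + 1, step)
    windows = np.stack([features[s : s + window] for s in starts])
    return windows, starts


def find_candidates(model, features, window, step, threshold=0.5):
    """List (start, stop, class) for windows predicted as a non-zero class."""
    windows, starts = sliding_windows(features, window, step)
    probabilities = model.predict(windows)  # expected shape: (n_windows, n_classes)
    classes = probabilities.argmax(axis=1)
    confident = (classes != 0) & (probabilities.max(axis=1) >= threshold)
    return [
        (int(s), int(s + window), int(c))
        for s, c in zip(starts[confident], classes[confident])
    ]


class ThresholdModel:
    """Stand-in for a trained model: class 1 when the first feature is mostly high."""

    def predict(self, windows):
        hit = (windows[:, :, 0].mean(axis=1) > 0.5).astype(float)
        return np.column_stack([1.0 - hit, hit])


# Made-up data: 100 samples, 2 features, with a "signal" in samples 40 to 59.
features = np.zeros((100, 2))
features[40:60, 0] = 1.0
candidates = find_candidates(ThresholdModel(), features, window=20, step=20)
print(candidates)
```

New data would need the same resampling, detrending and scaling as the training data before being passed to the model.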

In [None]:
# would be nice...