# Algorithm & Metric Deep Dive - Making appropriate choices for your problem

## Overview 

In the previous notebook, we built a machine learning pipeline containing all the important elements we would find in any sophisticated, real-world pipeline, but in our example we used quite simple components for each step. Even so, there were quite a number of *hyperparameters* we had to choose along the way, without having understood what we were choosing and why or, more importantly, what would be a good choice for our particular problem and dataset. This notebook attempts to give a little insight into how key ML algorithms work, with the aim of giving some basic understanding of what the hyperparameters mean and how they influence the end result. This will hopefully be a starting point for making appropriate choices of algorithm, hyperparameters and metrics.


### Prerequisites
* Completed notebooks 1 & 2

### Learning Outcomes 

* Understand mechanisms of key tree based and neural network algorithms
* Understand key hyperparameters and how to choose them
* Understand key metrics and how to select the right one for your problem

### Links to Best practices and Values
* ML Pitfalls - Avoid problems such as overfitting and underfitting through appropriate choice of algorithms and hyperparameters
* Ethics - Be able to justify your choices for what you have implemented.
* ML Lifecycle - Ensure you are able to reproduce results


### Data Science Framework
* Weather regimes - Discovery and Attribution
* Radiation emulation - Fusing Simulation and Data Science
* XBT - Uncertainty and Trust
* Rotors - Data to Decisions

## Tutorial - Decision Trees

![An example of manually created flowchart for making decisions, similar to the structure of the decision tree.](xbt_imeta_flowchart.png)

Key terms

* Root node: The base of the decision tree.
* Splitting: The process of dividing a node into multiple sub-nodes.
* Decision node: When a sub-node is further split into additional sub-nodes.
* Leaf node: When a sub-node does not further split into additional sub-nodes; represents possible outcomes.
* Pruning: The process of removing sub-nodes of a decision tree.
* Branch: A subsection of the decision tree consisting of multiple nodes.

Hyperparameters

* Max depth

Image of tree: https://miro.medium.com/max/1400/1*3P1333UmqEww6YMpjisj4Q.png (from the article https://towardsdatascience.com/decision-trees-explained-3ec41632ceb6)

Maths of CART algorithm

https://towardsdatascience.com/the-mathematics-of-decision-trees-random-forest-and-feature-importance-in-scikit-learn-and-spark-f2861df67e3
How does this relate to decision trees hyperparameters

In the decision tree, the nodes are split into subnodes on the basis of a threshold value of an attribute. The CART algorithm does that by searching for the best homogeneity for the subnodes, with the help of the Gini Index criterion. 
Quote from https://www.analyticssteps.com/blogs/classification-and-regression-tree-cart-algorithm  
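The split search described above can be sketched in a few lines of plain Python. This is a minimal illustration for a single numeric feature, not the scikit-learn implementation; the function names `gini_impurity` and `best_threshold` and the toy depth/instrument values are made up for this example.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted Gini) of the best binary split on one feature."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini_impurity(labels))  # no split: impurity of the parent node
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        # impurity of the two sub-nodes, weighted by how many samples each receives
        weighted = (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / n
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

# toy data (made up for illustration): maximum depth in metres and instrument label
depths = [450.0, 460.0, 470.0, 750.0, 760.0, 1800.0]
instruments = ['T4', 'T4', 'T4', 'T7', 'T7', 'T5']
threshold, impurity = best_threshold(depths, instruments)
print(threshold, round(impurity, 3))
```

A lower weighted impurity means the sub-nodes are more homogeneous, which is the quantity the split search tries to minimise.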
Training terms
* Greedy algorithm
* Stopping criteria
* Pruning
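To make the training terms concrete, here is a small sketch of greedy tree growing with stopping criteria, written in plain Python. It is an illustration under simplifying assumptions (thresholds are taken from observed values, the tree is a nested dictionary, and pruning is not shown), not how scikit-learn builds its trees. The parameter names mirror the scikit-learn hyperparameters `max_depth`, `min_samples_split` and `min_samples_leaf`; everything else is hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def find_split(rows, labels, min_samples_leaf):
    """Greedy step: test every feature/threshold pair and keep the lowest weighted Gini."""
    best = None
    n = len(rows)
    for feature in range(len(rows[0])):
        # the largest value is skipped because it would leave the right side empty
        for threshold in sorted({row[feature] for row in rows})[:-1]:
            left = [i for i in range(n) if rows[i][feature] <= threshold]
            right = [i for i in range(n) if rows[i][feature] > threshold]
            if min(len(left), len(right)) < min_samples_leaf:
                continue
            score = (len(left) * gini([labels[i] for i in left])
                     + len(right) * gini([labels[i] for i in right])) / n
            if best is None or score < best[0]:
                best = (score, feature, threshold, left, right)
    return best

def build_tree(rows, labels, depth=0, max_depth=3, min_samples_split=2, min_samples_leaf=1):
    """Recursively grow a tree, stored as nested dictionaries."""
    majority = Counter(labels).most_common(1)[0][0]
    # stopping criteria: pure node, depth limit reached, or too few samples to split
    if len(set(labels)) == 1 or depth >= max_depth or len(rows) < min_samples_split:
        return {'leaf': majority}
    split = find_split(rows, labels, min_samples_leaf)
    if split is None:
        return {'leaf': majority}
    _, feature, threshold, left, right = split

    def grow(indices):
        return build_tree([rows[i] for i in indices], [labels[i] for i in indices],
                          depth + 1, max_depth, min_samples_split, min_samples_leaf)

    return {'feature': feature, 'threshold': threshold, 'left': grow(left), 'right': grow(right)}

def predict(tree, row):
    """Follow the decision nodes down to a leaf."""
    while 'leaf' not in tree:
        tree = tree['left'] if row[tree['feature']] <= tree['threshold'] else tree['right']
    return tree['leaf']

def tree_depth(tree):
    return 0 if 'leaf' in tree else 1 + max(tree_depth(tree['left']), tree_depth(tree['right']))

# toy data (made up): one feature, three classes
rows = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
labels = ['a', 'a', 'b', 'b', 'c', 'c']
shallow_tree = build_tree(rows, labels, max_depth=1)
deep_tree = build_tree(rows, labels, max_depth=3)
print(tree_depth(shallow_tree), tree_depth(deep_tree))
```

The algorithm is greedy because each node picks the locally best split and never revisits it. On this toy data the depth-limited tree cannot separate all three classes, while the deeper tree fits the training data exactly, which is the underfitting versus overfitting trade-off that the stopping criteria control.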

* Use XBT example for trees
* Show XBT flowchart
* Example code
* Regression vs classification trees
* Decision tree visualisation

Variants:
* Random forest
* Gradient boosted
* XGBoost

Random forest discussion
* Deals with variance (Leo Breiman)

Key terms
* Bagging
* Bootstrapping
* Aggregation
* Ensemble model
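The key terms above fit together as follows: bagging is bootstrapping (resampling the training data with replacement) followed by aggregation (combining the predictions of the models trained on each resample). The sketch below illustrates this with very simple one-threshold "stumps" standing in for full trees. All names and the toy data are hypothetical, and a real random forest additionally samples a random subset of features at each split.

```python
import random
from collections import Counter

def bootstrap_sample(values, labels, rng):
    """Bootstrapping: draw len(values) samples with replacement."""
    indices = [rng.randrange(len(values)) for _ in values]
    return [values[i] for i in indices], [labels[i] for i in indices]

def fit_stump(values, labels):
    """Weak learner: the single threshold with the fewest training errors."""
    best = None
    for threshold in sorted(set(values)):
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = (sum(lab != left_label for lab in left)
                  + sum(lab != right_label for lab in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    return best[1:]

def predict_stump(stump, value):
    threshold, left_label, right_label = stump
    return left_label if value <= threshold else right_label

def fit_bagged_stumps(values, labels, n_estimators=20, seed=0):
    """Ensemble model: one stump per bootstrap sample."""
    rng = random.Random(seed)
    return [fit_stump(*bootstrap_sample(values, labels, rng)) for _ in range(n_estimators)]

def predict_bagged(stumps, value):
    """Aggregation: majority vote over the ensemble."""
    votes = Counter(predict_stump(stump, value) for stump in stumps)
    return votes.most_common(1)[0][0]

# toy data (made up): probes reaching 1-5 are 'shallow', 6-10 are 'deep'
values = [float(v) for v in range(1, 11)]
labels = ['shallow'] * 5 + ['deep'] * 5
stumps = fit_bagged_stumps(values, labels)
print(predict_bagged(stumps, 1.0), predict_bagged(stumps, 10.0))
```

Each stump sees a slightly different dataset and so places its threshold slightly differently; averaging over them reduces the variance of the final prediction.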


Advantages

* Works for numerical or categorical data and variables.
* Models problems with multiple outputs.
* Tests the reliability of the tree.
* Requires less data cleaning than other data modeling techniques.
* Easy to explain to those without an analytical background.

Disadvantages

* Affected by noise in the data.
* Not ideal for large datasets.
* Can disproportionately value, or weigh, attributes.
* The decisions at nodes are limited to binary outcomes, reducing the complexity that the tree can handle.
* Trees can become very complex when dealing with uncertainty and numerous linked outcomes.




# Example: Decision Trees - XBT classification

In [26]:
import pathlib
import os
import functools
import math
import datetime

In [2]:
import pandas
import numpy

In [3]:
import matplotlib
import matplotlib.pyplot
%matplotlib inline

In [4]:
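import sklearn
import sklearn.tree
import sklearn.preprocessing
import sklearn.ensemble
import sklearn.model_selection  # needed for train_test_split below
import sklearn.metrics  # needed for the precision/recall metrics below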

In [None]:
xbt_data_loc = pathlib.Path('/project/informatics_lab/xbt')
xbt_fname_template = 'xbt_{year}.csv'
year_range= (1966,2015)
xbt_df = pandas.concat([pandas.read_csv(xbt_data_loc / xbt_fname_template.format(year=year1)) for year1 in range(year_range[0], year_range[1])])
xbt_df

#### Clean dataset
Remove bad data points from the dataset.

In [None]:
xbt_df = xbt_df[~((xbt_df['max_depth'] < 0) | (xbt_df['max_depth'].isna()))]
xbt_df

In [None]:
target_feature = 'instrument'

In [None]:
xbt_df[target_feature].value_counts().index[:12]

In [None]:
xbt_df = xbt_df[~(xbt_df[target_feature].str.contains('UNKNOWN'))]
xbt_df.shape

In [None]:
xbt_df = xbt_df[xbt_df[target_feature].isin(list(xbt_df[target_feature].value_counts().index[:12]))]

In [None]:
xbt_labelled = xbt_df[xbt_df['imeta_applied'] == 0]
xbt_unlabelled = xbt_df[xbt_df['imeta_applied'] != 0]

In [None]:
xbt_labelled.shape

In [None]:
xbt_unlabelled.shape

In [None]:
xbt_train, xbt_test = sklearn.model_selection.train_test_split(xbt_labelled)

In [None]:
scaler_dict = {
    'year': sklearn.preprocessing.MinMaxScaler(),
    'max_depth': sklearn.preprocessing.MinMaxScaler(),
    'lat': sklearn.preprocessing.MinMaxScaler(),
    'lon': sklearn.preprocessing.MinMaxScaler(),
}
input_features = list(scaler_dict.keys())

preproc_input_features = []
for feature_name, scaler1 in scaler_dict.items():
    scaler1.fit(xbt_train[[feature_name]])
    preproc_input_features += [scaler1.transform(xbt_train[[feature_name]])]
    
X_train = numpy.concatenate( preproc_input_features, axis=1)
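The scalers above are fitted on the training data only and then reused for the test data. The plain-Python sketch below shows what min-max scaling computes and why this matters; the helper names and the depth values are made up, and this is not the scikit-learn code.

```python
def fit_min_max(values):
    """Learn the minimum and range from the training values only."""
    lowest = min(values)
    return lowest, max(values) - lowest

def apply_min_max(values, params):
    """Rescale values with previously fitted parameters: (x - min) / range."""
    lowest, value_range = params
    if value_range == 0:
        return [0.0 for _ in values]  # a constant feature carries no information
    return [(value - lowest) / value_range for value in values]

# made-up maximum depths in metres
train_depths = [100.0, 460.0, 760.0, 1000.0]
test_depths = [550.0, 1900.0]

params = fit_min_max(train_depths)
scaled_train = apply_min_max(train_depths, params)
scaled_test = apply_min_max(test_depths, params)
print(scaled_train)
print(scaled_test)
```

The training values land in the range 0 to 1, but a test value beyond the training maximum is mapped above 1. That is expected: fitting the scaler on the test data as well would leak information about the test set into the pipeline.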



In [None]:
target_encoder = sklearn.preprocessing.LabelEncoder()
target_encoder.fit(xbt_train[target_feature])
y_train = target_encoder.transform(xbt_train[target_feature])

We can get the hyperparameters for our decision tree by creating a decision tree object. You can get more explanation from `help(sklearn.tree.DecisionTreeClassifier)`, or from the [scikit-learn docs](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier).

In [None]:
sklearn.tree.DecisionTreeClassifier().get_params()

In [None]:
%%time
dt_clf = sklearn.tree.DecisionTreeClassifier(
    max_depth=5, # reduce chance of overfitting
    min_samples_leaf= 2, #ensure that there won't be too small a number of samples in a leaf node
    min_samples_split= 5, # ensure more sample at a node when it splits
)
dt_clf.fit(X_train, y_train)

https://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html#sphx-glr-auto-examples-tree-plot-unveil-tree-structure-py

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

In [None]:
fig1 = matplotlib.pyplot.figure(figsize=(64,64))
_ = sklearn.tree.plot_tree(dt_clf)
matplotlib.pyplot.show()
fig1.savefig('treevis.svg',bbox_inches='tight')

In [None]:
random_forest_hyperparameters = {
    'max_depth' : 5, # reduce chance of overfitting
    'min_samples_leaf' :  2, #ensure that there won't be too small a number of samples in a leaf node
    'min_samples_split' :  5, # ensure more sample at a node when it splits
    'n_estimators' : 20, # restrict number of trees in forest
}

In [None]:
%%time
rf_clf = sklearn.ensemble.RandomForestClassifier(
    **random_forest_hyperparameters
)
rf_clf.fit(X_train, y_train)

In [None]:
for p1,param_val in rf_clf.get_params().items():
    print(f'param value {p1}={param_val}')

In [None]:
rf_clf.estimators_

In [None]:
# evaluate results from random forest and decision tree

In [None]:
X_test = numpy.concatenate(
    [scaler1.transform(xbt_test[[feature_name]]) for feature_name, scaler1 in scaler_dict.items()],
    axis=1)
y_test = target_encoder.transform(xbt_test[target_feature])

In [None]:
y_pred_dt = dt_clf.predict(X_test)
y_pred_rf = rf_clf.predict(X_test)

In [None]:
fig1 = matplotlib.pyplot.figure(figsize=(16,8))
ax1 = fig1.add_subplot(1,2,1, title='frequency of different labels for decision tree predictions.')
pandas.Series(target_encoder.inverse_transform(y_pred_dt)).value_counts().plot.pie(ax=ax1)
ax1 = fig1.add_subplot(1,2,2, title='frequency of different labels for random forest predictions.')
pandas.Series(target_encoder.inverse_transform(y_pred_rf)).value_counts().plot.pie(ax=ax1)
                       

In [None]:
len(list(target_encoder.classes_)), len(list(sklearn.metrics.precision_score(y_test, y_pred_dt, average=None)))

In [None]:
xbt_train['instrument'].unique().shape

In [None]:
prec_dt, recall_dt, f1_dt, support_dt = sklearn.metrics.precision_recall_fscore_support(y_test, y_pred_dt, average=None)
prec_rf, recall_rf, f1_rf, support_rf = sklearn.metrics.precision_recall_fscore_support(y_test, y_pred_rf, average=None)
metrics_xbt = pandas.DataFrame({
    'classes': list(target_encoder.classes_),
    'precision_dt': list(prec_dt),
    'precision_rf': list(prec_rf),
    'recall_dt': list(recall_dt),
    'recall_rf': list(recall_rf),
    'f1_dt': list(f1_dt),
    'f1_rf': list(f1_rf),
    'support_dt': list(support_dt),
    'support_rf': list(support_rf),  
})
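As a rough guide to what the per-class numbers in this table mean, the sketch below computes precision, recall, F1 and support by hand in plain Python. The function name and the toy labels are made up for illustration; the notebook itself relies on `sklearn.metrics.precision_recall_fscore_support`.

```python
def per_class_metrics(y_true, y_pred):
    """Precision, recall, F1 and support for each class, computed from raw counts."""
    metrics = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == cls and p == cls for t, p in pairs)  # correctly predicted as cls
        fp = sum(t != cls and p == cls for t, p in pairs)  # wrongly predicted as cls
        fn = sum(t == cls and p != cls for t, p in pairs)  # cls samples that were missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[cls] = {'precision': precision, 'recall': recall, 'f1': f1, 'support': tp + fn}
    return metrics

# made-up labels for six probes
y_true = ['T4', 'T4', 'T4', 'T7', 'T7', 'T5']
y_pred = ['T4', 'T4', 'T7', 'T7', 'T7', 'T4']
for cls, values in per_class_metrics(y_true, y_pred).items():
    print(cls, values)
```

Precision asks "of the samples predicted as this class, how many were right?", recall asks "of the samples that really are this class, how many were found?", and support is simply the number of true samples of the class. The rare class `T5` scores zero here even though overall accuracy is 4 out of 6, which is why per-class metrics are worth inspecting on imbalanced data.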

In [None]:
fig1 = matplotlib.pyplot.figure(figsize=(20,30))
ax1 = fig1.add_subplot(2,2,1)
metrics_xbt.plot.bar(x='classes',y=['recall_dt','recall_rf'],ax=ax1)
ax1 = fig1.add_subplot(2,2,2)
metrics_xbt.plot.bar(x='classes',y=['precision_dt','precision_rf'],ax=ax1)
ax1 = fig1.add_subplot(2,2,3)
metrics_xbt.plot.bar(x='classes',y=['f1_dt','f1_rf'],ax=ax1)
ax1 = fig1.add_subplot(2,2,4)
metrics_xbt.plot.bar(x='classes',y=['support_dt','support_rf'],ax=ax1)


## Tutorial - Neural Networks

Description of Neural networks
* Explain a single perceptron
* History
* Weighted sum plus threshold (binary activation)
* Non linear activation, sigmoid neuron

Key terms
* neuron
* Perceptron
* Activation
* Weights
* Bias - the constant term
* Sigmoid function
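The terms above can be illustrated with a single neuron written in plain Python. This is a minimal sketch: the weights and bias are chosen by hand (not learned) so that the perceptron behaves like a logical AND, and the function names are made up for this example.

```python
import math

def weighted_sum(inputs, weights, bias):
    """Weighted sum of the inputs plus the bias (the constant term)."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

def perceptron(inputs, weights, bias):
    """Binary (step) activation: fire only if the weighted sum exceeds zero."""
    return 1 if weighted_sum(inputs, weights, bias) > 0 else 0

def sigmoid_neuron(inputs, weights, bias):
    """Smooth, non-linear activation that maps the weighted sum into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-weighted_sum(inputs, weights, bias)))

# hand-picked parameters implementing logical AND
and_weights, and_bias = [1.0, 1.0], -1.5

for inputs in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(inputs,
          perceptron(inputs, and_weights, and_bias),
          round(sigmoid_neuron(inputs, and_weights, and_bias), 3))
```

The step activation gives a hard 0 or 1, whereas the sigmoid output changes smoothly as the weights change. That smoothness is what makes gradient-based training possible.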

Multi layer
* How are they joined together?

How do we train? 
* Gradient descent
* Stochastic gradient descent
* Back propagation 

Key terms
* gradient descent
* Learning rate
* back propagation
* mini batch
* epoch
* cost function
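To show how these training terms relate to each other, here is a sketch of mini-batch stochastic gradient descent fitting a straight line in plain Python. A one-neuron linear model is assumed so that the gradients can be written out by hand; in a real network, back propagation computes the same kind of gradients layer by layer. The data and all names are made up for illustration.

```python
import random

def train_linear_sgd(xs, ys, learning_rate=0.05, n_epochs=200, batch_size=4, seed=0):
    """Fit y = w * x + b by mini-batch gradient descent on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    history = []
    for epoch in range(n_epochs):            # one epoch = one pass over all the data
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]   # one mini batch
            errors = [(w * xs[i] + b) - ys[i] for i in batch]
            # gradients of the cost function with respect to w and b
            grad_w = 2 * sum(e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            grad_b = 2 * sum(errors) / len(batch)
            # step against the gradient, scaled by the learning rate
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
        cost = sum(((w * x + b) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
        history.append(cost)
    return w, b, history

# toy data (made up): points on the line y = 2x + 1
xs = [i / 10 for i in range(-10, 11)]
ys = [2.0 * x + 1.0 for x in xs]
w, b, history = train_linear_sgd(xs, ys)
print(round(w, 3), round(b, 3), history[0], history[-1])
```

The learning rate, batch size and number of epochs are exactly the kind of hyperparameters listed in this section: a larger learning rate takes bigger steps but can overshoot, and a smaller batch gives noisier but cheaper gradient estimates.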

Hyperparameters
* batch size
* learning rate
* solver
* maximum iterations

 
Types of NN
* feed forward
* Convolutional Neural Network
* Recurrent Neural Network
* Graph Neural Network
 

## Example - Socrates Radiation Model Emulation

Explain problem and dataset

In [40]:
import tensorflow
import keras
import tensorflow.keras
import tensorflow.keras.layers
import tensorflow.keras.models

## Define inputs
Specify the hyperparameters for the pipeline and the location of the input data.

In [64]:
data_dir   = pathlib.Path('/project/informatics_lab/data_science_cop/socrates_emulation/')
output_dir = pathlib.Path(os.environ['SCRATCH']) / 'ml_weather_tutorial'

if not output_dir.is_dir():
    output_dir.mkdir()
    print(f'creating directory {output_dir}')

Set up the hyperparameters for training the neural network.

In [63]:
wl='sw'
target='nflx'
nsamps = '50.0K'
scale_data = True
if wl=='sw':
  model = 'sw_260'
  model_ref = 'sw_ga7'
elif wl=='lw':
  model = 'lw_300'
  model_ref = 'lw_ga7'


Construct the paths to file names that contain the data

In [65]:
fnext='train'
fn_meta = model+'_meta_'+nsamps+'_'+fnext+'.npz'
fn_dat_levs = model+'_dat_levs_'+nsamps+'_'+fnext+'.npz'
fn_dat_lays = model+'_dat_lays_'+nsamps+'_'+fnext+'.npz'
fn_dat_surf = model+'_dat_surf_'+nsamps+'_'+fnext+'.npz'
if target=='nflx':
  fn_trg = model+'_trg_levs_'+nsamps+'_'+fnext+'.npz'
if target=='ndiv':
  fn_trg = model+'_trg_lays_'+nsamps+'_'+fnext+'.npz'
 
fnext='test'
fn_meta_test = model+'_meta_'+nsamps+'_'+fnext+'.npz'
fn_dat_levs_test = model+'_dat_levs_'+nsamps+'_'+fnext+'.npz'
fn_dat_lays_test = model+'_dat_lays_'+nsamps+'_'+fnext+'.npz'
fn_dat_surf_test = model+'_dat_surf_'+nsamps+'_'+fnext+'.npz'
if target=='nflx':
  fn_trg_test = model+'_trg_levs_'+nsamps+'_'+fnext+'.npz'
  fn_trg_ref = model_ref+'_trg_levs_'+nsamps+'_'+fnext+'.npz'
if target=='ndiv':
  fn_trg_test = model+'_trg_lays_'+nsamps+'_'+fnext+'.npz'
  fn_trg_ref = model_ref+'_trg_lays_'+nsamps+'_'+fnext+'.npz'

print('root dir:',data_dir)

root dir: /project/informatics_lab/data_science_cop/socrates_emulation


### Loading and preparing the training data

In [67]:
print('loading',fn_dat_lays)
with numpy.load(data_dir / fn_dat_lays) as npzfile:
    dat_lays = npzfile['dat_lays']
    
print('loading',fn_dat_surf)
with numpy.load(data_dir / fn_dat_surf) as npzfile:
    dat_surf = npzfile['dat_surf']
    
print('loading',fn_trg)
with numpy.load(data_dir / fn_trg) as npzfile:
    if target=='nflx':
         trg = npzfile['trg_levs']
    elif target=='ndiv':
         trg = npzfile['trg_lays']

loading sw_260_dat_lays_50.0K_train.npz
loading sw_260_dat_surf_50.0K_train.npz
loading sw_260_trg_levs_50.0K_train.npz


In [68]:
nsamps = trg.shape[0]
nlays = dat_lays.shape[1]
nlay_feats = dat_lays.shape[2]
nsurf_feats = dat_surf.shape[1]

In [69]:
dat_lays.shape

(50000, 70, 35)

In [71]:
if scale_data: # normalize by range
  scaler_lays = []
  use_lays = []
  for ic in range(nlay_feats):
    min0 = numpy.min(dat_lays[:,:,ic])
    range0 = numpy.max(dat_lays[:,:,ic]) - min0
    if range0 > 0.:
      dat_lays[:,:,ic] = (dat_lays[:,:,ic] - min0)/range0
      scaler_lays.append([min0, range0])
      use_lays.append(ic)
  if len(use_lays)<nlay_feats:
    print('removing constant layer features:', nlay_feats-len(use_lays))
    dat_lays = dat_lays[:,:,use_lays]
    nlay_feats = len(use_lays)
      

  scaler_surf = []
  use_surf = []
  for ic in range(nsurf_feats):
    min0 = numpy.min(dat_surf[:,ic])
    range0 = numpy.max(dat_surf[:,ic]) - min0
    if range0 > 0.:
      dat_surf[:,ic] = (dat_surf[:,ic] - min0)/range0
      scaler_surf.append([min0, range0])
      use_surf.append(ic)
  if len(use_surf)<nsurf_feats:
    print('removing constant surf features:', nsurf_feats-len(use_surf))
    dat_surf = dat_surf[:,use_surf]  # dat_surf is 2D: (samples, surface features)
    nsurf_feats = len(use_surf)

removing constant layer features: 6


In [72]:
ntrg_samps = trg.shape[0]

if target=='nflx' or target=='ndiv':
  nouts=1
  ntrg_levs = trg.shape[1]


In [93]:
def build_model_mlp(nlays, nlay_feats):
    profile_input = tensorflow.keras.layers.Input(shape=(nlays, nlay_feats), name='profile_input')
    surf_input = tensorflow.keras.layers.Input(shape=(nsurf_feats,), name='surf_input')
    flat_profs = tensorflow.keras.layers.Flatten()(profile_input)
    raw_in = tensorflow.keras.layers.concatenate([flat_profs, surf_input])
    raw_size = (nlays*nlay_feats)+nsurf_feats
    prof_size = nlays*nlay_feats

    x = tensorflow.keras.layers.Dense(512, use_bias=False, activation='relu')(raw_in)
    x = tensorflow.keras.layers.Dense(512, use_bias=False, activation='relu')(x)
    x = tensorflow.keras.layers.Dense(256, use_bias=False, activation='relu')(x)
    x = tensorflow.keras.layers.Dense(256, use_bias=False, activation='relu')(x)
    x = tensorflow.keras.layers.Dense(128, use_bias=False, activation='relu')(x)
    x = tensorflow.keras.layers.Dense(128, use_bias=False, activation='relu')(x)

    main_output = tensorflow.keras.layers.Dense(ntrg_levs, use_bias=True, activation='linear', name='main_output')(x)
    model = tensorflow.keras.models.Model(inputs=[profile_input, surf_input], outputs=[main_output])
    return model

In [104]:
def build_model_cnn(nlays, nlay_feats):
    profile_input = tensorflow.keras.layers.Input(shape=(nlays, nlay_feats), name='profile_input')
    surf_input = tensorflow.keras.layers.Input(shape=(nsurf_feats,), name='surf_input')
    flat_profs = tensorflow.keras.layers.Flatten()(profile_input)
    raw_in = tensorflow.keras.layers.concatenate([flat_profs, surf_input])
    raw_size = (nlays*nlay_feats)+nsurf_feats
    prof_size = nlays*nlay_feats

    out = tensorflow.keras.layers.ZeroPadding1D(padding=1)(profile_input)
    out = tensorflow.keras.layers.Conv1D(32, 3, strides=1, activation='relu', use_bias=False, kernel_initializer='glorot_uniform', bias_initializer='zeros')(out)
    ident = out
    out = tensorflow.keras.layers.ZeroPadding1D(padding=1)(out)
    out = tensorflow.keras.layers.Conv1D(32, 3, strides=1, activation='relu', use_bias=False, kernel_initializer='glorot_uniform', bias_initializer='zeros')(out)
    out = tensorflow.keras.layers.ZeroPadding1D(padding=1)(out)
    out = tensorflow.keras.layers.Conv1D(32, 3, strides=1, activation='relu', use_bias=False, kernel_initializer='glorot_uniform', bias_initializer='zeros')(out)
    x = tensorflow.keras.layers.add([out, ident])
    out = tensorflow.keras.layers.Flatten()(x)
    out = tensorflow.keras.layers.Dense(prof_size, use_bias=False, activation='relu')(out)

    out = tensorflow.keras.layers.concatenate([out, surf_input])
    x = tensorflow.keras.layers.add([out, raw_in])
    x = tensorflow.keras.layers.Dense(1024, use_bias=False, activation='relu')(x)
    x = tensorflow.keras.layers.Dense(1024, use_bias=False, activation='relu')(x)

    main_output = tensorflow.keras.layers.Dense(ntrg_levs, use_bias=True, activation='linear', name='main_output')(x)
    model = tensorflow.keras.models.Model(inputs=[profile_input, surf_input], outputs=[main_output])
    return model
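The CNN above repeatedly pairs `ZeroPadding1D(padding=1)` with a `Conv1D` of kernel size 3, and then adds the result back onto an earlier tensor. The plain-Python sketch below, for a single channel and a single filter, shows why this combination keeps the number of levels unchanged so that the addition is possible. The kernel values and the profile are made up; in the network the kernel weights are learned.

```python
def zero_pad_1d(sequence, padding=1):
    """Add `padding` zeros to each end of the sequence."""
    return [0.0] * padding + list(sequence) + [0.0] * padding

def conv1d_valid(sequence, kernel):
    """Slide the kernel along the sequence with stride 1 and no implicit padding.

    As in deep learning libraries, the kernel is not flipped (a cross-correlation).
    """
    k = len(kernel)
    return [sum(kernel[j] * sequence[i + j] for j in range(k))
            for i in range(len(sequence) - k + 1)]

profile = [1.0, 2.0, 4.0, 8.0, 16.0]   # made-up values on 5 model layers
kernel = [-1.0, 0.0, 1.0]              # hypothetical filter: difference of neighbours

unpadded = conv1d_valid(profile, kernel)
features = conv1d_valid(zero_pad_1d(profile, padding=1), kernel)
# a skip (residual) connection needs both inputs to have the same length
skip = [f + p for f, p in zip(features, profile)]

print(len(profile), len(unpadded), len(features))
print(features)
print(skip)
```

Without padding, a kernel of size 3 shortens the sequence by 2; padding one zero on each side restores the original length.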

In [105]:
model_dict = {'mlp': {'build_func': build_model_mlp,},
              'cnn_1d': {'build_func': build_model_cnn,},
             }

In [106]:
%%time
for model_name in model_dict.keys():
    print(f'building and training model {model_name}')
    model_dict[model_name]['model_object'] = model_dict[model_name]['build_func'](nlays=nlays, nlay_feats=nlay_feats)
    model_dict[model_name]['model_object'].compile(loss='mean_absolute_error',
                                                   optimizer='adam')
    model_dict[model_name]['model_object'].fit([dat_lays, dat_surf], trg, epochs=1, batch_size=32, verbose=0)

building and training model mlp
building and training model cnn_1d
CPU times: user 4min 39s, sys: 26.7 s, total: 5min 6s
Wall time: 1min 15s


In [115]:
predictions = {}

In [None]:
for model_name, selected_model in model_dict.items():
  predictions[model_name] = selected_model['model_object'].predict([dat_lays_test, dat_surf_test])

In [None]:
metrics_dict = {}
for model_name in model_dict.keys():
    # store the metrics separately for each model so that one model does not overwrite another
    model_metrics = {key: numpy.zeros(ntrg_levs) for key in ['me_p', 'me_ctl', 'mae_p', 'mae_ctl']}
    for ilev in range(ntrg_levs):
        # mean error (ME) and mean absolute error (MAE) of the emulator (p) and the ga7 reference (ctl)
        model_metrics['me_p'][ilev] = numpy.mean(predictions[model_name][:,ilev] - trg_test[:,ilev])
        model_metrics['me_ctl'][ilev] = numpy.mean(trg_ref[:,ilev] - trg_test[:,ilev])
        model_metrics['mae_p'][ilev] = numpy.mean(numpy.abs(predictions[model_name][:,ilev] - trg_test[:,ilev]))
        model_metrics['mae_ctl'][ilev] = numpy.mean(numpy.abs(trg_ref[:,ilev] - trg_test[:,ilev]))
    metrics_dict[model_name] = model_metrics

### Visualise metrics

Display the performance metrics for our trained algorithms.

In [None]:
# level indices for the vertical axis (the first level is skipped in the plots)
yax = numpy.arange(1, ntrg_levs)[::-1]


In [None]:
fig1 = matplotlib.pyplot.figure('compare_NN_results', figsize=(16,6))
# one row of panels per model, one panel per metric
panels = [('me_p', '-r', 'ME emu'), ('mae_p', '--r', 'MAE emu'),
          ('me_ctl', '-c', 'ME ga7'), ('mae_ctl', '--c', 'MAE ga7')]
for ix1, model_name in enumerate(model_dict.keys()):
    for ix2, (metric_name, line_style, label) in enumerate(panels):
        ax1 = fig1.add_subplot(2, 4, (4*ix1) + ix2 + 1, title=f'results for {model_name}')
        ax1.plot(metrics_dict[model_name][metric_name][1:], yax, line_style, label=label)
        ax1.set_xlabel('flux / flux div. difference')
        ax1.set_ylabel('level')
        ax1.legend(loc='upper right')

### Testing the output

Next we load in the test data and do inference to check the result.

In [102]:
print('loading',fn_dat_lays_test)
with numpy.load(data_dir / fn_dat_lays_test) as npzfile:
    dat_lays_test = npzfile['dat_lays']

print('loading',fn_dat_surf_test)
with numpy.load(data_dir / fn_dat_surf_test) as npzfile:
    dat_surf_test = npzfile['dat_surf']

print('loading',fn_trg_test)
with numpy.load(data_dir / fn_trg_test) as npzfile:
    if target=='nflx':
        trg_test = npzfile['trg_levs']
    elif target=='ndiv':
         trg_test = npzfile['trg_lays']

print('loading',fn_trg_ref)
with numpy.load(data_dir / fn_trg_ref) as npzfile:
    if target=='nflx':
         trg_ref = npzfile['trg_levs_ref']
    elif target=='ndiv':
         trg_ref = npzfile['trg_lays_ref']

loading sw_260_dat_lays_50.0K_test.npz
loading sw_260_dat_surf_50.0K_test.npz
loading sw_260_trg_levs_50.0K_test.npz
loading sw_ga7_trg_levs_50.0K_test.npz


In [103]:
# scale test data
if scale_data: # normalize by range
  dat_lays_test = dat_lays_test[:,:,use_lays]
  for ic in range(nlay_feats):
    dat_lays_test[:,:,ic] = (dat_lays_test[:,:,ic] - scaler_lays[ic][0])/scaler_lays[ic][1]

  dat_surf_test = dat_surf_test[:,use_surf]
  for ic in range(nsurf_feats):
    dat_surf_test[:,ic] = (dat_surf_test[:,ic] - scaler_surf[ic][0])/scaler_surf[ic][1]



In [None]:
# evaluate

In [None]:
# train 1D CNN

In [None]:
# evaluate and compare

### Example - Recurrent Neural Network

In [5]:
try:
    falklands_data_dir = os.environ['OPMET_ROTORS_DATA_ROOT']
except KeyError:
    falklands_data_dir = '/project/informatics_lab/data_science_cop/ML_challenges/2021_opmet_challenge'
falklands_data_dir = pathlib.Path(falklands_data_dir) /  'Rotors'

In [19]:
falklands_data_fname = 'new_training.csv'
falklands_data_path = falklands_data_dir / falklands_data_fname
falklands_df = pandas.read_csv(falklands_data_path)

In [13]:
temp_feature_names = [f'air_temp_{i1}' for i1 in range(1,23)]
humidity_feature_names = [f'sh_{i1}' for i1 in range(1,23)]
wind_direction_feature_names = [f'winddir_{i1}' for i1 in range(1,23)]
wind_speed_feature_names = [f'windspd_{i1}' for i1 in range(1,23)]
target_feature_name = 'rotors_present'


In [20]:
falklands_df = falklands_df.rename({'Rotors 1 is true': target_feature_name},axis=1)
falklands_df.loc[falklands_df[falklands_df[target_feature_name].isna()].index, target_feature_name] = 0
falklands_df['DTG'] = pandas.to_datetime(falklands_df['DTG'])
falklands_df = falklands_df.drop_duplicates(subset=['DTG'])
falklands_df = falklands_df[~falklands_df['DTG'].isnull()]
falklands_df = falklands_df[(falklands_df['wind_speed_obs'] >= 0.0) &
                            (falklands_df['air_temp_obs'] >= 0.0) &
                            (falklands_df['wind_direction_obs'] >= 0.0) &
                            (falklands_df['dewpoint_obs'] >= 0.0) 
                           ]
falklands_df = falklands_df.drop_duplicates(subset='DTG')
falklands_df[target_feature_name]  = falklands_df[target_feature_name] .astype(bool)
falklands_df['time'] = pandas.to_datetime(falklands_df['DTG'])

In [10]:
def get_v_wind(wind_dir_name, wind_speed_name, row1):
    return math.cos(math.radians(row1[wind_dir_name])) * row1[wind_speed_name]

def get_u_wind(wind_dir_name, wind_speed_name, row1):
    return math.sin(math.radians(row1[wind_dir_name])) * row1[wind_speed_name]
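One reason for converting direction and speed into components is that direction wraps around at 360 degrees, so two nearly identical winds can have very different direction values. The sketch below illustrates this in plain Python using the same sin/cos convention as `get_u_wind` and `get_v_wind` above; the helper names are made up, and other sign conventions for wind components exist.

```python
import math

def wind_components(direction_deg, speed):
    """Return (u, v) with u = speed * sin(direction) and v = speed * cos(direction)."""
    theta = math.radians(direction_deg)
    return speed * math.sin(theta), speed * math.cos(theta)

def direction_from_components(u, v):
    """Invert the conversion, returning a direction in the range [0, 360)."""
    return math.degrees(math.atan2(u, v)) % 360.0

# directions of 359 and 1 degrees differ by 358 as raw numbers...
u1, v1 = wind_components(359.0, 10.0)
u2, v2 = wind_components(1.0, 10.0)
# ...but are close together once expressed as components
component_gap = math.hypot(u1 - u2, v1 - v2)

print(wind_components(90.0, 10.0))
print(round(component_gap, 3))
print(round(direction_from_components(*wind_components(225.0, 5.0)), 6))
```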

In [23]:
%%time
u_feature_template = 'u_wind_{level_ix}'
v_feature_template = 'v_wind_{level_ix}'
u_wind_feature_names = []
v_wind_features_names = []
for wsn1, wdn1 in zip(wind_speed_feature_names, wind_direction_feature_names):
    level_ix = int( wsn1.split('_')[1])
    u_feature = u_feature_template.format(level_ix=level_ix)
    u_wind_feature_names += [u_feature]
    falklands_df[u_feature] = falklands_df.apply(functools.partial(get_u_wind, wdn1, wsn1), axis='columns')
    v_feature = v_feature_template.format(level_ix=level_ix)
    v_wind_features_names += [v_feature]
    falklands_df[v_feature] = falklands_df.apply(functools.partial(get_v_wind, wdn1, wsn1), axis='columns')

CPU times: user 16.2 s, sys: 2.07 s, total: 18.3 s
Wall time: 18.3 s


In [27]:
rotors_train_df = falklands_df[falklands_df['time'] < datetime.datetime(2020,1,1,0,0)]
rotors_test_df = falklands_df[falklands_df['time'] >= datetime.datetime(2020,1,1,0,0)]
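The split above is by time rather than at random. A plausible motivation, stated here as an assumption, is that consecutive weather observations are strongly correlated, so a random split would put near-duplicates of training samples into the test set. The sketch below shows the idea with plain Python records; the helper name and the toy observations are made up.

```python
import datetime

def split_by_time(records, cutoff):
    """Put records before the cutoff in the training set and the rest in the test set."""
    train = [record for record in records if record['time'] < cutoff]
    test = [record for record in records if record['time'] >= cutoff]
    return train, test

# made-up observations, one every 200 days from the start of 2019
start = datetime.datetime(2019, 1, 1)
records = [{'time': start + datetime.timedelta(days=200 * i), 'rotors_present': i % 2 == 0}
           for i in range(5)]

train_records, test_records = split_by_time(records, datetime.datetime(2020, 1, 1))
print(len(train_records), len(test_records))
```

Using `<` on one side and `>=` on the other ensures that every record ends up in exactly one of the two sets.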

In [28]:
def preproc_input(data_subset, pp_dict):
    return numpy.concatenate([scaler1.transform(data_subset[[if1]]) for if1,scaler1 in pp_dict.items()],axis=1)

def preproc_target(data_subset, enc1):
    # LabelEncoder expects a 1D array of labels, so select the column as a Series
    return enc1.transform(data_subset[target_feature_name])


In [32]:
input_feature_names = temp_feature_names + humidity_feature_names + u_wind_feature_names + v_wind_features_names
preproc_dict = {}
for if1 in input_feature_names:
    scaler1 = sklearn.preprocessing.StandardScaler()
    scaler1.fit(rotors_train_df[[if1]])
    preproc_dict[if1] = scaler1
    
target_encoder = sklearn.preprocessing.LabelEncoder()
target_encoder.fit(rotors_train_df[target_feature_name])


LabelEncoder()

In [34]:
X_train_rotors = preproc_input(rotors_train_df, preproc_dict)
y_train_rotors = preproc_target(rotors_train_df, target_encoder)


In [38]:
X_test_rotors = preproc_input(rotors_test_df, preproc_dict)
y_test_rotors = preproc_target(rotors_test_df, target_encoder)


In [60]:
initial_learning_rate=2e-5
drop_out_rate=0.2
n_epochs=50
batch_size=100

In [57]:
n_nodes = 300
n_layers = 4
inputs_shape=X_train_rotors.shape[1]

In [58]:
def build_ff_nn(n_layers, input_shape):
    model = tensorflow.keras.models.Sequential()
    model.add(tensorflow.keras.layers.Dropout(drop_out_rate, input_shape=(input_shape,)))
    for i in numpy.arange(0,n_layers):
        model.add(tensorflow.keras.layers.Dense(n_nodes, activation='relu', kernel_constraint=tensorflow.keras.constraints.max_norm(3)))
        model.add(tensorflow.keras.layers.Dropout(drop_out_rate))
    model.add(tensorflow.keras.layers.Dense(2, activation='softmax'))             # This is the output layer 
    return model
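The output layer of this model uses a softmax activation with two units, one per class. The plain-Python sketch below shows what softmax computes and how the cross-entropy loss scores its output against an integer class label. The function names and the example logits are made up for illustration.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that are positive and sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probabilities, true_class):
    """Negative log of the probability assigned to the correct class."""
    return -math.log(probabilities[true_class])

undecided = softmax([0.0, 0.0])
confident = softmax([0.0, 4.0])   # strongly favours class 1 ("rotors present")

print(undecided, confident)
print(cross_entropy(confident, 1))   # small loss: confident and correct
print(cross_entropy(confident, 0))   # large loss: confident and wrong
```

Because the loss grows quickly when the model is confidently wrong, cross-entropy is a common choice for training classifiers with a softmax output.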

In [61]:
%%time
rotors_ff_model = build_ff_nn(n_layers=n_layers, input_shape=inputs_shape)
opt = tensorflow.keras.optimizers.Adam(learning_rate=initial_learning_rate)
# sparse categorical cross-entropy matches the two-unit softmax output and the integer class labels
rotors_ff_model.compile(optimizer=opt, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
rotors_history=rotors_ff_model.fit(X_train_rotors, 
                                   y_train_rotors, 
                                   validation_data=(X_test_rotors, 
                                                    y_test_rotors), 
                                   epochs=n_epochs, 
                                   batch_size=batch_size, 
                                   shuffle=True,
                                   verbose=False,
                                  )


CPU times: user 1min 55s, sys: 7.7 s, total: 2min 2s
Wall time: 39.9 s


In [52]:
def build_LSTM_model(input_shape):
    model = tensorflow.keras.Sequential()

    model.add(tensorflow.keras.layers.Input(shape=(1,input_shape,)))
    # Add a LSTM layer with 64 internal units.
    model.add(tensorflow.keras.layers.LSTM(64,))

    # Add a Dense layer with 10 units.
    model.add(tensorflow.keras.layers.Dense(10))
    
    # add output layer
    model.add(tensorflow.keras.layers.Dense(2, activation='softmax'))             # This is the output layer 

    return model


In [62]:
%%time 
rotors_rnn_model = build_LSTM_model(input_shape=inputs_shape)
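opt = tensorflow.keras.optimizers.Adam(learning_rate=initial_learning_rate)
# sparse categorical cross-entropy matches the two-unit softmax output and the integer class labels
rotors_rnn_model.compile(optimizer=opt, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# the LSTM expects input of shape (samples, time steps, features); each sample here is a single time step
rotors_history_rnn = rotors_rnn_model.fit(X_train_rotors[:, numpy.newaxis, :],
                                          y_train_rotors,
                                          validation_data=(X_test_rotors[:, numpy.newaxis, :],
                                                           y_test_rotors),
                                          epochs=n_epochs,
                                          batch_size=batch_size,
                                          shuffle=True,
                                          verbose=False,
                                         )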

CPU times: user 1min 51s, sys: 7.87 s, total: 1min 59s
Wall time: 38.6 s


## Tutorial - Metrics


Description of metrics
* classification metrics
* regression metrics
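As a starting point for the regression metrics, the sketch below computes three common ones by hand in plain Python: mean error (ME, the bias), mean absolute error (MAE) and root mean squared error (RMSE). The sign convention (prediction minus truth) follows the radiation emulation example earlier in this notebook; the function names and numbers are made up.

```python
import math

def mean_error(y_true, y_pred):
    """Average signed error: shows systematic over- or under-prediction."""
    return sum(p - t for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_absolute_error(y_true, y_pred):
    """Average size of the error, ignoring its sign."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)

def root_mean_squared_error(y_true, y_pred):
    """Like MAE, but squaring makes large errors count for more."""
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

y_true = [1.0, 2.0, 3.0, 4.0]
many_small_errors = [1.5, 1.5, 3.5, 3.5]   # every prediction is off by 0.5
one_large_error = [1.0, 2.0, 3.0, 8.0]     # a single prediction is off by 4

for name, y_pred in [('many small', many_small_errors), ('one large', one_large_error)]:
    print(name,
          mean_error(y_true, y_pred),
          mean_absolute_error(y_true, y_pred),
          root_mean_squared_error(y_true, y_pred))
```

The first set of predictions has no bias (ME of 0) even though every value is wrong, so ME alone can hide errors that cancel out. In the second set, MAE and RMSE disagree because RMSE penalises the single large error more heavily; which behaviour is preferable depends on how costly large errors are for the problem at hand.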

## Exercise - Metrics
xx


## Examples of use
xx


## Next steps
Rotor Challenge
Radiation emulation




## Dataset Info
* XBT Data
* Radiation Emulation
* Rotors

## References


Decision trees
* https://www.mastersindatascience.org/learning/introduction-to-machine-learning-algorithms/decision-tree/#:~:text=A
* https://towardsdatascience.com/decision-trees-explained-3ec41632ceb6

Neural Networks
* [Introduction to Neural Networks - Kaggle](https://www.kaggle.com/code/carlosaguayo/introduction-to-neural-networks/notebook)
* [Back propagation - wikipedia](https://en.wikipedia.org/wiki/Backpropagation)
* [Back propagation - brilliant wiki](https://brilliant.org/wiki/backpropagation/#:~:text=Backpropagation%2C%20short%20for%20%22backward%20propagation,to%20the%20neural%20network's%20weights)
* [Introduction to Deep Learning - Kaggle](https://www.kaggle.com/learn/intro-to-deep-learning)
* [Introduction to Neural Networks - IBM](https://www.ibm.com/cloud/learn/neural-networks)
* [Neural Networks - MIT](https://news.mit.edu/2017/explained-neural-networks-deep-learning-0414)

* RNN
* CNN
* GNN

Metrics
* [Regression and Classification metrics - scikit-learn](https://scikit-learn.org/stable/modules/model_evaluation.html)

