# 2. Building a machine learning pipeline

## Overview 
Having explored the data and defined the problem, we are ready to build an initial pipeline, with choices informed by our data exploration.

### Prerequisites

* Same as notebook 1 in this tutorial series, plus successful completion of notebook 1

### Learning Outcomes 
* Understand the fundamental steps in a machine learning pipeline
* Understand key terminology in describing the machine learning pipeline
* Gain an initial understanding of how to choose appropriate components for each stage in the pipeline

### Best Practices & Values

* Link to Data theme
* Link to Ethics theme
* Link to ML lifecycle theme

## Tutorial - Key Elements of a Machine Learning Pipeline

A series of data science and machine learning algorithms that takes us from input data to a set of predictions is usually referred to as a pipeline. In this notebook we will explore the key components of such a pipeline by constructing and training a machine learning algorithm on some input data.

In this notebook we will look at two pipelines, one for a **supervised** **classification** problem, and the other for an **unsupervised** clustering problem.

The steps we will go through are as follows:
* *Data Loading & Cleaning* - Start by loading the data and filtering out any data considered unsuitable for training and evaluating machine learning algorithms. Selecting appropriate data is an important way in which domain expertise is vital to getting good results.
* *Feature Engineering* - Next, prepare the data for presentation to the algorithm. Different ways of presenting the data will emphasise different features, and choosing the right features is important for getting good results. Domain knowledge of what the features represent is again very important.
* *Train/Test Split* - Before we train the algorithm, we need to split the data into *train* and *test* sets. Evaluating on held-out test data lets us check that the algorithm has not *overfit*, learning irrelevant details that are not representative of the whole space of possible data, but instead generalises well.
* *Data Preparation* - The machine learning algorithm only sees numbers as numbers, with no inherent understanding of meaning or context. We need to ensure different features are scaled to be comparable, otherwise big numbers will be treated as more important by the algorithm, irrespective of what those numbers mean. Values are typically scaled to a range of `[0,1]` or, assuming a Gaussian distribution, to have `mean=0` and `std_dev=1`.
* *Algorithm Setup* - Here we select the particular algorithm, e.g. a neural network or k-means clustering, and specify the *hyperparameters*. It is important to distinguish between *parameters* and *hyperparameters*.
  * Parameters are the values that are calculated by the training process.
  * Hyperparameters are values specified in algorithm setup, which are not altered by training. These need to be tuned using an additional outer loop around training, called hyperparameter tuning.
* *Algorithm Training* - Execute the algorithm to calculate the parameters of the chosen ML algorithm that best fit the supplied training data.
* *Inference* - Once we have a trained algorithm, we use it to produce predictions for both the train and test sets.
* *Evaluation* - We then compare the predictions of the trained algorithm to the expected results. For supervised learning, these will be the supplied target values. For unsupervised learning, we will explore the results and their usefulness, much like in exploratory data analysis.
* *Model Storage* - Model training can be an expensive process that we don't want to perform too often. Once we have a model that performs well, we save its state so it can be reloaded and used for inference on later problems.
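
The steps listed above can be sketched end to end in a few lines of scikit-learn. This is a minimal illustration only: the data is synthetic (generated with `make_classification`), and the hyperparameter values are arbitrary assumptions rather than choices made for the problems in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for loaded and cleaned data (assumption: not the tutorial data)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Train/test split, done before scaling so the test set stays unseen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Data preparation and algorithm setup. hidden_layer_sizes and max_iter are
# hyperparameters; the network weights found by fit() are parameters.
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0),
)

# Algorithm training: the scaler statistics and the weights come from the train set only
model.fit(X_train, y_train)

# Inference on both sets
pred_train = model.predict(X_train)
pred_test = model.predict(X_test)

# Evaluation: a large gap between train and test scores would suggest overfitting
train_f1 = f1_score(y_train, pred_train)
test_f1 = f1_score(y_test, pred_test)
print('train F1:', train_f1)
print('test F1:', test_f1)
```

Bundling the scaler and the classifier in one pipeline object means the scaling learned from the training data is applied unchanged at inference time.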


### Key Terms

* supervised learning
* unsupervised learning
* regression
* classification
* metric
* parameter
* hyperparameter
* feature engineering
* training set
* validation set
* test set
* inference


## Problem 1: Supervised Classification - Falkland Islands Rotor Prediction

In [21]:
import pathlib
import datetime
import os
import functools
import math

In [4]:
import matplotlib
%matplotlib inline

In [10]:
import pandas

In [6]:
import iris
import iris.quickplot
import iris.coord_categorisation
import cartopy


In [17]:
import sklearn
import sklearn.neural_network
import sklearn.preprocessing

In [5]:
try:
    falklands_data_dir = os.environ['OPMET_ROTORS_DATA_ROOT']
except KeyError:
    falklands_data_dir = '/project/informatics_lab/data_science_cop/'
falklands_data_dir = pathlib.Path(falklands_data_dir) /  'Rotors'
print(falklands_data_dir)

/Users/stephen.haddad/data/ml_challenges/Rotors


In [7]:
falklands_data_fname = 'new_training.csv'

In [8]:
falklands_data_path = falklands_data_dir / falklands_data_fname
falklands_data_path

PosixPath('/Users/stephen.haddad/data/ml_challenges/Rotors/new_training.csv')

In [27]:
falklands_df = pandas.read_csv(falklands_data_path, header=0).loc[1:,:]

In [28]:
falklands_df = falklands_df.drop_duplicates(subset='DTG')

In [29]:
falklands_df

Unnamed: 0,DTG,air_temp_obs,dewpoint_obs,wind_direction_obs,wind_speed_obs,wind_gust_obs,air_temp_1,air_temp_2,air_temp_3,air_temp_4,...,windspd_18,winddir_19,windspd_19,winddir_20,windspd_20,winddir_21,windspd_21,winddir_22,windspd_22,Rotors 1 is true
1,01/01/2015 00:00,283.9,280.7,110.0,4.1,-9999999.0,284.000,283.625,283.250,282.625,...,5.8,341.0,6.0,334.0,6.1,330.0,6.0,329.0,5.8,
2,01/01/2015 03:00,280.7,279.7,90.0,7.7,-9999999.0,281.500,281.250,280.750,280.250,...,6.8,344.0,5.3,348.0,3.8,360.0,3.2,12.0,3.5,
3,01/01/2015 06:00,279.8,278.1,100.0,7.7,-9999999.0,279.875,279.625,279.125,278.625,...,6.0,345.0,5.5,358.0,5.0,10.0,4.2,38.0,4.0,
4,01/01/2015 09:00,279.9,277.0,120.0,7.2,-9999999.0,279.625,279.250,278.875,278.250,...,3.1,338.0,3.5,354.0,3.9,9.0,4.4,22.0,4.6,
5,01/01/2015 12:00,279.9,277.4,120.0,8.7,-9999999.0,279.250,278.875,278.375,277.875,...,1.6,273.0,2.0,303.0,2.3,329.0,2.5,338.0,2.4,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20101,31/12/2020 06:00,276.7,275.5,270.0,3.6,-9999999.0,277.875,277.750,277.625,277.500,...,12.1,223.0,11.8,221.0,11.4,219.0,11.3,215.0,11.4,
20102,31/12/2020 09:00,277.9,276.9,270.0,3.1,-9999999.0,277.875,277.625,277.875,277.875,...,10.2,230.0,10.8,230.0,11.6,227.0,12.3,222.0,12.0,
20103,31/12/2020 12:00,283.5,277.1,220.0,3.6,-9999999.0,281.125,280.625,280.125,279.625,...,10.3,218.0,11.9,221.0,12.8,222.0,11.9,225.0,10.6,
20104,31/12/2020 15:00,286.1,276.9,250.0,3.6,-9999999.0,284.625,284.125,283.625,283.000,...,9.4,218.0,8.6,212.0,8.3,218.0,8.7,226.0,10.1,


In [45]:
# DTG is day-first (e.g. 31/12/2020 06:00), so give the format explicitly
falklands_df['time'] = pandas.to_datetime(falklands_df['DTG'], format='%d/%m/%Y %H:%M')

## Feature Engineering

Having loaded the data, we then do some preprocessing. This includes:
* Specifying the feature names.
* Converting wind speed and direction back to u/v wind components. Direction in degrees is discontinuous at north (359° and 1° are almost the same direction but numerically far apart), whereas u and v vary smoothly there, and northerly winds are the ones we are interested in.
* Preparing the target variable, including filling in missing data.

In [14]:
temp_feature_names = [f'air_temp_{i1}' for i1 in range(1,23)]
humidity_feature_names = [f'sh_{i1}' for i1 in range(1,23)]
wind_direction_feature_names = [f'winddir_{i1}' for i1 in range(1,23)]
wind_speed_feature_names = [f'windspd_{i1}' for i1 in range(1,23)]
target_feature_name = 'rotors_present'

In [15]:
obs_names = [
    'air_temp_obs',
    'dewpoint_obs',
    'wind_speed_obs',
    'wind_direction_obs',
]

obs_feature_names = [
    'air_temp_obs',
    'dewpoint_obs',
]

In [18]:
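def get_v_wind(wind_dir_name, wind_speed_name, row1):
    # north-south component: speed * cos(direction), direction in degrees
    return math.cos(math.radians(row1[wind_dir_name])) * row1[wind_speed_name]

def get_u_wind(wind_dir_name, wind_speed_name, row1):
    # east-west component: speed * sin(direction), direction in degrees
    return math.sin(math.radians(row1[wind_dir_name])) * row1[wind_speed_name]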

In [31]:
%%time
u_feature_template = 'u_wind_{level_ix}'
v_feature_template = 'v_wind_{level_ix}'
u_wind_feature_names = []
v_wind_features_names = []
for wsn1, wdn1 in zip(wind_speed_feature_names, wind_direction_feature_names):
    level_ix = int( wsn1.split('_')[1])
    u_feature = u_feature_template.format(level_ix=level_ix)
    u_wind_feature_names += [u_feature]
    falklands_df[u_feature] = falklands_df.apply(functools.partial(get_u_wind, wdn1, wsn1), axis='columns')
    v_feature = v_feature_template.format(level_ix=level_ix)
    v_wind_features_names += [v_feature]
    falklands_df[v_feature] = falklands_df.apply(functools.partial(get_v_wind, wdn1, wsn1), axis='columns')

CPU times: user 14.7 s, sys: 1.99 s, total: 16.7 s
Wall time: 17 s
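
The row-wise `apply` above takes several seconds because it calls a Python function for every row. A vectorised alternative, sketched here on a tiny stand-in frame (the frame `demo_df` and its values are assumptions for illustration), computes a whole column at once with NumPy and uses the same sign convention as `get_u_wind` and `get_v_wind`.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame (assumption); the real columns are winddir_1..22 and windspd_1..22
demo_df = pd.DataFrame({
    'winddir_1': [0.0, 90.0, 180.0, 270.0],
    'windspd_1': [2.0, 4.0, 6.0, 8.0],
})

# Convert the whole direction column to radians in one call
direction_rad = np.radians(demo_df['winddir_1'])

# Same formulas as get_u_wind / get_v_wind, applied to entire columns
demo_df['u_wind_1'] = np.sin(direction_rad) * demo_df['windspd_1']
demo_df['v_wind_1'] = np.cos(direction_rad) * demo_df['windspd_1']
print(demo_df)
```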


In [32]:
wdn1 = 'wind_direction_obs'
wsn1 = 'wind_speed_obs'
u_feature = u_feature_template.format(level_ix='obs')
v_feature = v_feature_template.format(level_ix='obs')
falklands_df[u_feature] = falklands_df.apply(functools.partial(get_u_wind, wdn1, wsn1), axis='columns')
falklands_df[v_feature] = falklands_df.apply(functools.partial(get_v_wind, wdn1, wsn1), axis='columns')
# Register each name only once, so re-running this cell does not duplicate columns
for name in (u_feature, v_feature):
    if name not in obs_feature_names:
        obs_feature_names.append(name)

In [33]:
falklands_df[obs_feature_names]

Unnamed: 0,air_temp_obs,dewpoint_obs,u_wind_obs,v_wind_obs
1,283.9,280.7,3.852740,-1.402283e+00
2,280.7,279.7,7.700000,4.714890e-16
3,279.8,278.1,7.583020,-1.337091e+00
4,279.9,277.0,6.235383,-3.600000e+00
5,279.9,277.4,7.534421,-4.350000e+00
...,...,...,...,...
20101,276.7,275.5,-3.600000,-6.613093e-16
20102,277.9,276.9,-3.100000,-5.694608e-16
20103,283.5,277.1,-2.314035,-2.757760e+00
20104,286.1,276.9,-3.382893,-1.231273e+00


In [51]:
# Missing values are treated as "no rotor". Fill them before casting,
# because NaN would otherwise be cast to True.
falklands_df[target_feature_name] = falklands_df['Rotors 1 is true'].fillna(0.0).astype(bool)

In [52]:
falklands_df[target_feature_name].value_counts()

False    17058
True       449
Name: rotors_present, dtype: int64

In [39]:
falklands_df.columns

Index(['DTG', 'air_temp_obs', 'dewpoint_obs', 'wind_direction_obs',
       'wind_speed_obs', 'wind_gust_obs', 'air_temp_1', 'air_temp_2',
       'air_temp_3', 'air_temp_4',
       ...
       'v_wind_19', 'u_wind_20', 'v_wind_20', 'u_wind_21', 'v_wind_21',
       'u_wind_22', 'v_wind_22', 'u_wind_obs', 'v_wind_obs', 'rotors_present'],
      dtype='object', length=142)

In [44]:
# DTG is day-first (e.g. 31/12/2020 06:00), so give the format explicitly
pandas.to_datetime(falklands_df['DTG'], format='%d/%m/%Y %H:%M')

1       2015-01-01 00:00:00
2       2015-01-01 03:00:00
3       2015-01-01 06:00:00
4       2015-01-01 09:00:00
5       2015-01-01 12:00:00
                ...        
20101   2020-12-31 06:00:00
20102   2020-12-31 09:00:00
20103   2020-12-31 12:00:00
20104   2020-12-31 15:00:00
20105   2021-01-01 00:00:00
Name: DTG, Length: 17507, dtype: datetime64[ns]

## Splitting data into train/validation/test sets

Points to consider when splitting the data:
* Consistency of distributions between the sets
* Class imbalance
* Correlation between samples


An initial option might be to split randomly, using the built-in scikit-learn functionality.

In [60]:
test_fraction = 0.2

We know that our two classes (rotor detected / no rotor detected) are imbalanced, so we might want to sample from each class separately, to ensure that our train and test sets both reflect the class distribution of the full dataset.

In [65]:
num_no_rotors = sum(falklands_df[target_feature_name] == False)
num_with_rotors = sum(falklands_df[target_feature_name] == True)

In [62]:
data_no_rotors = falklands_df[falklands_df[target_feature_name] == False]
data_with_rotors = falklands_df[falklands_df[target_feature_name] == True]

In [66]:
data_test = pandas.concat([data_no_rotors.sample(int(test_fraction * num_no_rotors)), data_with_rotors.sample(int(test_fraction * num_with_rotors))])
data_test[target_feature_name].value_counts()

False    3411
True       89
Name: rotors_present, dtype: int64

In [68]:
falklands_df['test_set'] = False
falklands_df.loc[data_test.index,'test_set'] = True
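
The per-class sampling above can also be done with scikit-learn's `train_test_split` by passing the target column to its `stratify` argument. The sketch below uses a small imbalanced stand-in frame (`demo_df` and its values are assumptions for illustration), not the Falklands data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Imbalanced stand-in data (assumption): 5% of samples in the positive class
demo_df = pd.DataFrame({
    'feature': range(1000),
    'rotors_present': [True] * 50 + [False] * 950,
})

# stratify keeps the class proportions the same in both subsets
demo_train, demo_test = train_test_split(
    demo_df,
    test_size=0.2,
    stratify=demo_df['rotors_present'],
    random_state=0,
)

print(demo_train['rotors_present'].value_counts())
print(demo_test['rotors_present'].value_counts())
```

Setting `random_state` makes the split reproducible between runs.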

We also know, though, that data points adjacent in time are likely to be correlated. As a result, data in our test set will be correlated with data in our train set if we split randomly. Instead, for this problem, we should split by time.

In [56]:
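# Split by time: train on data before 2020, test on data after the start of 2020.
# Both comparisons are strict, so a sample at exactly 2020-01-01 00:00 falls in neither set.
train_df = falklands_df[falklands_df['time'] < datetime.datetime(2020,1,1,0,0)]
test_df = falklands_df[falklands_df['time'] > datetime.datetime(2020,1,1,0,0)]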

In [58]:
train_df[target_feature_name].value_counts()

False    14273
True       320
Name: rotors_present, dtype: int64

In [59]:
test_df[target_feature_name].value_counts()

False    2784
True      129
Name: rotors_present, dtype: int64

## Setting up a classifier

In [None]:
# create a random forest (sklearn) and a neural network (tensorflow)
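
As a possible starting point for the cell above, the sketch below sets up both classifiers with scikit-learn. Note the substitution: the neural network here is scikit-learn's `MLPClassifier`, not a TensorFlow model, and all hyperparameter values are illustrative assumptions rather than tuned choices.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hyperparameter values are illustrative assumptions, not tuned choices
random_forest = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    class_weight='balanced',  # weights classes inversely to their frequency
    random_state=0,
)

# Neural networks are sensitive to feature scale, so scaling is bundled with the model
neural_network = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(50, 50), max_iter=500, random_state=0),
)

print(random_forest)
print(neural_network)
```

Nothing has been trained at this point. The objects only hold the hyperparameters, and the parameters are calculated later by `fit`.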

### Training a classifier

In [None]:
# train a random forest

In [None]:
# train neural network
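
A hedged sketch of what the two training cells could look like, using scikit-learn throughout (an `MLPClassifier` stands in for the planned TensorFlow network). The data is a synthetic imbalanced set from `make_classification`, and the 800/200 split point and hyperparameters are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (assumption): roughly 5% positives
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.95], random_state=0
)

# Keep the sample order when splitting, mimicking a split by time
n_train = 800
X_train, y_train = X_demo[:n_train], y_demo[:n_train]
X_test, y_test = X_demo[n_train:], y_demo[n_train:]

# Train a random forest
rf = RandomForestClassifier(n_estimators=50, class_weight='balanced', random_state=0)
rf.fit(X_train, y_train)

# Train a small neural network, with scaling fitted on the train set only
nn = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(20,), max_iter=500, random_state=0),
)
nn.fit(X_train, y_train)

# Inference on the held-out data
rf_pred = rf.predict(X_test)
nn_pred = nn.predict(X_test)
print('random forest positives predicted:', int(rf_pred.sum()))
print('neural network positives predicted:', int(nn_pred.sum()))
```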

### Evaluating classifier performance

Classification metrics to discuss:
* Recall, precision
* False alarm rate, miss rate
* SEDI (Symmetric Extremal Dependence Index)
* F1 score


In [None]:
# calculate and plot for different metrics
# calculate and plot for train and test
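
The metrics listed above can all be derived from the four entries of the confusion matrix. The sketch below is one way to compute them. The helper `classification_scores` and the example labels are hypothetical, and "false alarm rate" is taken here to mean false positives divided by all actual negatives, which is an assumption about the intended definition. SEDI is computed from the hit rate $H$ and false alarm rate $F$ as $(\ln F - \ln H - \ln(1-F) + \ln(1-H)) / (\ln F + \ln H + \ln(1-F) + \ln(1-H))$.

```python
import math

from sklearn.metrics import confusion_matrix


def classification_scores(y_true, y_pred):
    """Return common binary classification scores as a dict (hypothetical helper)."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[False, True]).ravel()
    recall = tp / (tp + fn)            # hit rate H
    false_alarm_rate = fp / (fp + tn)  # F: fraction of actual negatives flagged
    precision = tp / (tp + fp)
    miss_rate = fn / (tp + fn)         # equals 1 - recall
    f1 = 2 * precision * recall / (precision + recall)
    # SEDI is undefined when H or F is exactly 0 or 1
    if 0 < recall < 1 and 0 < false_alarm_rate < 1:
        log_f, log_h = math.log(false_alarm_rate), math.log(recall)
        log_1f, log_1h = math.log(1 - false_alarm_rate), math.log(1 - recall)
        sedi = (log_f - log_h - log_1f + log_1h) / (log_f + log_h + log_1f + log_1h)
    else:
        sedi = float('nan')
    return {
        'recall': recall,
        'precision': precision,
        'false_alarm_rate': false_alarm_rate,
        'miss_rate': miss_rate,
        'f1': f1,
        'sedi': sedi,
    }


# Made-up labels: 4 actual positives (3 detected), 16 actual negatives (2 false alarms)
y_true = [True] * 4 + [False] * 16
y_pred = [True] * 3 + [False] * 1 + [True] * 2 + [False] * 14
scores = classification_scores(y_true, y_pred)
for name, value in scores.items():
    print(f'{name}: {value:.3f}')
```

The helper does not guard against division by zero, so it assumes the labels and predictions each contain both classes.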

### Storing and reusing the trained classifier

Discuss which elements need to be stored.



In [None]:
# pickle a random forest object
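
A minimal sketch of the pickle round trip for a random forest, using a throwaway model trained on synthetic data. The file name `rotors_rf.pkl` is hypothetical. If the model depends on a fitted scaler, that needs to be stored too, for example by pickling a pipeline that contains both.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Throwaway model on synthetic data (assumption), just to have something to store
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
rf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / 'rotors_rf.pkl'  # hypothetical file name
    # Save the fitted model, including its learned parameters
    with open(model_path, 'wb') as model_file:
        pickle.dump(rf, model_file)
    # Reload it, as a later inference session would
    with open(model_path, 'rb') as model_file:
        rf_reloaded = pickle.load(model_file)

# The reloaded model should reproduce the original predictions
same_predictions = bool((rf.predict(X_demo) == rf_reloaded.predict(X_demo)).all())
print('predictions identical after reload:', same_predictions)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.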

In [None]:
# save a random forest object using ONNX

In [None]:
# save neural network using keras format

In [None]:
# load NN and do inference

In [None]:
# save neural network

In [None]:
# load in and start training (transfer learning)

## Problem 2 - Clustering weather types from ERA5

### Loading the data

In [None]:
# load era5 pressure data

### Separating into train/test sets

In [None]:
# 2019-2020 train, 2021 test

### Preprocessing features

* Load MSLP (mean sea level pressure) data (hourly, global)
* Extract the UK area from the global ERA5 data
* Calculate seasonal averages
* Calculate the seasonal pressure anomaly for each hour



In [None]:
# cut out area of interest

In [None]:
# calculate season means

In [None]:
# subtract seasonal means
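
The seasonal-mean and anomaly steps can be sketched with NumPy and pandas as below. Everything here is an assumption for illustration: the pressure fields are random numbers on a 4x5 grid, sampled daily rather than hourly, and the seasons are taken to be DJF, MAM, JJA and SON.

```python
import numpy as np
import pandas as pd

# Synthetic daily "pressure" fields (assumption): one year on a 4x5 grid
times = pd.date_range('2019-01-01', '2019-12-31', freq='D')
rng = np.random.default_rng(0)
pressure = 101325.0 + rng.normal(scale=500.0, size=(len(times), 4, 5))

# Season index per time step: 0=DJF, 1=MAM, 2=JJA, 3=SON
season_ix = np.asarray((times.month % 12) // 3)

# Subtract each season's mean field from the fields in that season
anomaly = np.empty_like(pressure)
for season in range(4):
    in_season = season_ix == season
    seasonal_mean = pressure[in_season].mean(axis=0)  # shape (4, 5)
    anomaly[in_season] = pressure[in_season] - seasonal_mean

print(anomaly.shape)
```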

### Different ways of clustering
* flatten and k-means
* PCA and k-means
* Auto-encoder and k-means

#### Method 1 - Flatten and K-Means
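
A minimal sketch of this method, assuming the anomaly fields are held in an array of shape `(time, lat, lon)`. The random data, grid size and the choice of four clusters are all illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for pressure anomaly fields (assumption): (time, lat, lon)
rng = np.random.default_rng(0)
fields = rng.normal(size=(200, 6, 8))

# Flatten each 2D field into one feature vector per time step
flat = fields.reshape(fields.shape[0], -1)

# n_clusters is a hyperparameter; 4 is an arbitrary choice here
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(flat)

# Reshape the cluster centres back to maps so they can be plotted as weather types
centroid_maps = kmeans.cluster_centers_.reshape(4, 6, 8)
print(np.bincount(labels))
```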

#### Method 2 - PCA and K-Means
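
A minimal sketch of this method: PCA first reduces each flattened field to a few components, and k-means then clusters in that reduced space. The random data, the ten components and the four clusters are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for flattened anomaly fields (assumption): 200 times, 6x8 grid
rng = np.random.default_rng(0)
flat = rng.normal(size=(200, 6 * 8))

# Chain dimensionality reduction and clustering in one pipeline
pca_kmeans = make_pipeline(
    PCA(n_components=10, random_state=0),
    KMeans(n_clusters=4, n_init=10, random_state=0),
)
labels = pca_kmeans.fit_predict(flat)

# Map the cluster centres from PCA space back to the original grid for plotting
pca = pca_kmeans.named_steps['pca']
kmeans = pca_kmeans.named_steps['kmeans']
centroid_maps = pca.inverse_transform(kmeans.cluster_centers_).reshape(4, 6, 8)
print(np.bincount(labels))
```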

#### Method 3 - Autoencoder and Latent Space K-Means

Exploring

## Problem 3 - Radiation Emulation (Regression)

In [None]:
# load radiation dataset


In [None]:
# prepare features

In [None]:
# split into train test

In [None]:
# set up 1D CNN architecture

In [1]:
# do training

In [None]:
# calculate appropriate metrics

## Examples of use
* You can see more example notebooks relating to this challenge in the [Data Science CoP GitHub repository](https://github.com/MetOffice/data_science_cop/tree/master/challenges/2021_falklands_rotors).
 


## Next steps



## Dataset Info
xx


## References
xx
