### Model training

The filtered, computed and encoded features can now be used to train suitable models.

The following approaches are considered:
- Multi-class classification problem with clustered data
- Multi-output regression problem with two outputs (longitude/latitude)

For the input data, two sequence representations are considered:
- variable sequence length: takes all points in POLYLINE into consideration, zero-padding the start of the sequence if necessary
- fixed sequence length: takes 5 points from the beginning and 5 points from the end of POLYLINE
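
The two representations can be sketched in plain Python as follows. This is only an illustration, not the notebook's `preprocessing` code: the helper names `pad_start` and `first_and_last`, the choice to keep the last points of an over-long trip, and the sample coordinates are all assumptions made for the example.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (longitude, latitude)


def pad_start(points: List[Point], length: int, pad_value: float = 0.0) -> List[Point]:
    """Left-pad a trip with (pad_value, pad_value) points up to `length`.

    Assumption: trips longer than `length` keep only their last `length` points.
    """
    if len(points) >= length:
        return points[-length:]
    padding = [(pad_value, pad_value)] * (length - len(points))
    return padding + points


def first_and_last(points: List[Point], k: int = 5) -> List[Point]:
    """Return the first k and the last k points of a trip (2 * k points).

    Assumes the trip has at least k points; with fewer than 2 * k points
    the two halves overlap.
    """
    return points[:k] + points[-k:]


# Made-up coordinates, for illustration only
trip = [(-8.61, 41.14), (-8.62, 41.15), (-8.63, 41.16)]
padded = pad_start(trip, length=5)

long_trip = [(float(i), float(i)) for i in range(12)]
fixed = first_and_last(long_trip, k=5)
```

With `k=5`, the fixed-length variant always yields 10 points per trip, which matches the sequence length of 10 passed to `create_fix_length_sequences` further down.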


Algorithms:
- Long short-term memory (LSTM) network (multi-class classification and regression)
    - handles variable sequence lengths, so the complete trip POLYLINE can be used
- Random forest (regression only)
    - robust to outliers, which matters because the dataset still contains some
    - runs efficiently on large datasets

In [1]:
import tensorflow as tf



In [2]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Softmax
from tensorflow.keras.optimizers import Adam

In [3]:
import pandas as pd 
import numpy as np 
import json
import preprocessing
import dask.array as da

In [4]:
n_cluster=4000

In [5]:
train_data = pd.read_parquet(f's3://think-tank-casestudy/features_engineered/n_cluster_{n_cluster}/feature_engineered_train.parquet')
test_data = pd.read_parquet(f's3://think-tank-casestudy/features_engineered/n_cluster_{n_cluster}/feature_engineered_test.parquet')

### 1) LSTM

In [None]:
# Fill NaNs (positions beyond the end of shorter sequences) with an arbitrary
# large sentinel value (2000) so that they can be masked later
df_sequence_train = da.from_array(pd.DataFrame(train_data.SEQUENCE.tolist()).fillna(2000).to_numpy())
df_sequence_test = da.from_array(pd.DataFrame(test_data.SEQUENCE.tolist()).fillna(2000).to_numpy())
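
The sentinel value only helps if the model skips it later. Keras' `Masking` layer skips a timestep when all feature values in that timestep equal `mask_value`. The rule can be sketched in plain Python as below; the function name `timestep_mask` and the sample sequence are made up for the illustration, and the real layer operates on tensors rather than lists.

```python
from typing import List, Sequence

PAD_VALUE = 2000.0  # same sentinel as in the cell above


def timestep_mask(sequence: Sequence[Sequence[float]],
                  mask_value: float = PAD_VALUE) -> List[bool]:
    """Return True for timesteps to keep and False for timesteps to skip.

    A timestep is skipped only if every feature in it equals mask_value.
    """
    return [any(value != mask_value for value in step) for step in sequence]


# Three timesteps with two features each (illustrative values)
sequence = [[12.0, 7.0], [2000.0, 2000.0], [2000.0, 3.0]]
mask = timestep_mask(sequence)
```

Only the second timestep is skipped here, because the third still carries one real value. With a single feature per timestep, a genuine value of 2000 would also be masked, so the sentinel should lie outside the range of the real data.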

In [None]:
features_config_2 = ['N_COORDINATE_POINTS','TOTAL_DISTANCE_KM','2013_10',
       '2013_11', '2013_12', '2013_7', '2013_8', '2013_9', '2014_1', '2014_2',
       '2014_3', '2014_4', '2014_5', '2014_6', '2014_7', '10.0', '12.0',
       '13.0', '14.0', '15.0', '18.0', '20.0', '21.0', '23.0', '25.0', '26.0',
       '27.0', '28.0', '33.0', '34.0', '35.0', '36.0', '38.0', '40.0', '42.0',
       '52.0', '53.0', '54.0', '56.0', '57.0', '58.0', '6.0', '60.0', '61.0',
       '63.0', '7.0', '9.0', 'OTHER', 'Cloudy', 'Foggy', 'Rainy', 'Sunny',
       'Windy', 'A', 'B', 'C', '16.0', '2014_10', '2014_11', '2014_12',
       '2014_8', '2014_9', '47.0', '49.0']
label_config_2 = ['CLUSTER_LABEL']

In [None]:
X_train = train_data[features_config_2].to_numpy().astype(float)
X_test = test_data[features_config_2].to_numpy().astype(float)

In [None]:
X_train = np.concatenate((df_sequence_train, X_train), axis=1)
X_test = np.concatenate((df_sequence_test, X_test), axis=1)

In [None]:
y_train = train_data[label_config_2]
y_test = test_data[label_config_2]

In [7]:
from tensorflow.keras import Input
from tensorflow.keras.layers import Dense, Masking

# LSTM layers expect 3-D input (samples, timesteps, features). Assumption:
# each column of X_train is treated as one timestep carrying a single value.
X_train_seq = np.asarray(X_train).reshape(X_train.shape[0], X_train.shape[1], 1)

model = Sequential()
model.add(Input(shape=(X_train.shape[1], 1)))
# Skip timesteps that only hold the padding sentinel
model.add(Masking(mask_value=2000))
model.add(LSTM(50, activation='relu'))
# One softmax probability per cluster
model.add(Dense(n_cluster, activation='softmax'))
model.summary()

# Assumes CLUSTER_LABEL holds integer class ids in [0, n_cluster),
# hence the sparse variant of the cross-entropy loss
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

history = model.fit(X_train_seq, y_train.to_numpy().ravel())
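
The classifier is trained with a categorical cross-entropy loss. In Keras, `categorical_crossentropy` expects one-hot encoded targets, whereas `sparse_categorical_crossentropy` expects integer class ids; both compute the same quantity. As a rough sketch of that quantity for a single sample (the function below is a hand-written illustration with made-up probabilities, not the Keras implementation):

```python
import math
from typing import Sequence


def sparse_categorical_crossentropy(probabilities: Sequence[float],
                                    true_class: int) -> float:
    """Cross-entropy for one sample: -log of the probability
    that the model assigned to the true class."""
    return -math.log(probabilities[true_class])


# Softmax output over three classes (illustrative values)
probs = [0.1, 0.7, 0.2]
loss = sparse_categorical_crossentropy(probs, true_class=1)
```

The loss approaches 0 as the probability of the true class approaches 1, and it grows without bound as that probability approaches 0. With 4000 clusters, a model that predicts a uniform distribution starts at a loss of about `log(4000)`.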

### 2) Random Forest - Multiclass

In [None]:
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

In [None]:
train_data = preprocessing.create_fix_length_sequences(train_data, 10)
test_data = preprocessing.create_fix_length_sequences(test_data, 10)

In [None]:
features_config_1 = ['N_COORDINATE_POINTS','TOTAL_DISTANCE_KM', 'START_POINT_LON','START_POINT_LAT','2013_10',
       '2013_11', '2013_12', '2013_7', '2013_8', '2013_9', '2014_1', '2014_2',
       '2014_3', '2014_4', '2014_5', '2014_6', '2014_7', '10.0', '12.0',
       '13.0', '14.0', '15.0', '18.0', '20.0', '21.0', '23.0', '25.0', '26.0',
       '27.0', '28.0', '33.0', '34.0', '35.0', '36.0', '38.0', '40.0', '42.0',
       '52.0', '53.0', '54.0', '56.0', '57.0', '58.0', '6.0', '60.0', '61.0',
       '63.0', '7.0', '9.0', 'OTHER', 'Cloudy', 'Foggy', 'Rainy', 'Sunny',
       'Windy', 'A', 'B', 'C', '16.0', '2014_10', '2014_11', '2014_12',
       '2014_8', '2014_9', '47.0', '49.0','START_SEQUENCE_STACK','STOP_SEQUENCE_STACK']
label_config_1 = ['cluster_label']

In [None]:
estimator = RandomForestClassifier(n_jobs=-1, random_state=0, n_estimators=200, min_samples_split=0.001)

In [None]:
X = train_data[features_config_1]
y = train_data[label_config_1]

In [None]:
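from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.multioutput import MultiOutputRegressor

# NOTE: y as defined above holds the cluster label; for the longitude/latitude
# regression it would need to hold the two coordinate columns instead.
rf = RandomForestRegressor(random_state=0)
wrapper = MultiOutputRegressor(rf)
cv = RepeatedKFold(n_splits=10, random_state=1)
# scikit-learn scorers are maximised, so MSE is exposed as 'neg_mean_squared_error'
n_scores = cross_val_score(wrapper, X, y, scoring='neg_mean_squared_error', cv=cv)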
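
scikit-learn scorers follow a higher-is-better convention, so the mean squared error is available as the negated scorer `neg_mean_squared_error`, and `cross_val_score` returns one negative value per fold. A minimal sketch of turning such scores into a root mean squared error is shown below. The helper name `rmse_per_fold` is hypothetical and the score values are placeholders, not results from this notebook.

```python
import math
from statistics import mean
from typing import List, Sequence


def rmse_per_fold(neg_mse_scores: Sequence[float]) -> List[float]:
    """Convert negated MSE scores (one per fold) into positive RMSE values."""
    return [math.sqrt(-score) for score in neg_mse_scores]


# Placeholder scores for illustration only, not measured results
example_scores = [-0.04, -0.09, -0.01]
fold_rmse = rmse_per_fold(example_scores)
mean_rmse = mean(fold_rmse)
```

Note that `RepeatedKFold` repeats the split 10 times by default, so `n_splits=10` leads to 100 fits of the wrapped model, which may be slow on a large dataset.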