### Model training

The filtered, engineered, and encoded features can now be used to train suitable models.

Two problem formulations are considered:
- Multi-class classification, where the target is the cluster the trip ends in
- Multi-output regression, where the two targets are longitude and latitude
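To illustrate the regression formulation, here is a minimal sketch on synthetic data. The feature matrix and the coordinate values are made up for the example and are not taken from the case-study dataset. It relies on scikit-learn forests accepting a two-column target directly.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data: 200 trips, 5 features, targets are (longitude, latitude).
X = rng.normal(size=(200, 5))
y = np.column_stack([
    -8.6 + 0.01 * X[:, 0],    # hypothetical longitude
    41.15 + 0.01 * X[:, 1],   # hypothetical latitude
])

# A 2-D target makes the forest predict both coordinates jointly.
regressor = RandomForestRegressor(n_estimators=20, random_state=0)
regressor.fit(X, y)

pred = regressor.predict(X[:3])
print(pred.shape)  # (3, 2): one (longitude, latitude) pair per trip
```

The classification formulation uses the same features but replaces the two-column target with a single cluster label per trip.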

The trip coordinates can be fed to the models in two ways:
- Variable sequence length: uses every point in POLYLINE and masks the padded positions where necessary
- Fixed sequence length: uses the first 10 and the last 10 points of POLYLINE
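The two data layouts above can be sketched with plain NumPy. The helpers `head_tail` and `pad_polylines` are hypothetical and only illustrate the idea; they are not the functions in the project's `preprocessing` module. The padding value 2000 is an arbitrary sentinel that lies outside the valid coordinate range.

```python
import numpy as np

def head_tail(polyline, k=10):
    """Return the first k and last k points of a polyline (hypothetical helper).

    Trips with fewer than k points simply yield fewer rows.
    """
    points = np.asarray(polyline, dtype=float).reshape(-1, 2)
    return points[:k], points[-k:]

def pad_polylines(polylines, pad_value=2000.0):
    """Right-pad polylines of different lengths into one (trips, max_len, 2) array."""
    max_len = max(len(p) for p in polylines)
    out = np.full((len(polylines), max_len, 2), pad_value, dtype=float)
    for i, p in enumerate(polylines):
        if len(p):
            out[i, :len(p)] = p
    return out

# Two made-up trips of different lengths, as (longitude, latitude) pairs.
trips = [
    [[-8.61, 41.14], [-8.62, 41.15], [-8.63, 41.16]],
    [[-8.60, 41.10]],
]

start, stop = head_tail(trips[0], k=2)   # fixed-length view of one trip
padded = pad_polylines(trips)            # variable-length view of all trips
print(start.shape, stop.shape, padded.shape)  # (2, 2) (2, 2) (2, 3, 2)
```

A masking layer can later skip every timestep whose values all equal the sentinel, so the padded positions do not influence the sequence model.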


Algorithms:
- Long short-term memory (LSTM) network (multi-class classification and regression)
    - handles variable sequence lengths, so the full trip POLYLINE can be used
- Random forest (regression and classification)
    - robust to outliers, which matters because the dataset still contains some
    - runs efficiently on large datasets
    
Metrics:
- Cluster classification: AUC plus the average distance from the last trip point to the cluster center
- Regression: MAPE plus the average distance from the last trip point to the cluster center
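The distance and MAPE metrics listed above could be computed as follows. This is a sketch under assumptions: the great-circle (haversine) formula is one common way to measure distance between coordinates, the Earth radius is the usual mean value of 6371 km, and the function names are illustrative.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between points given in degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def mean_distance_km(true_lonlat, pred_lonlat):
    """Average haversine distance between true and predicted (lon, lat) rows."""
    t = np.asarray(true_lonlat, dtype=float)
    p = np.asarray(pred_lonlat, dtype=float)
    return float(np.mean(haversine_km(t[:, 0], t[:, 1], p[:, 0], p[:, 1])))

def mape(y_true, y_pred):
    """Mean absolute percentage error; undefined if any true value is zero."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

# Made-up example: true last points vs. the centers of the predicted clusters.
true_points = [[-8.61, 41.15], [-8.58, 41.16]]
pred_centers = [[-8.62, 41.15], [-8.58, 41.17]]
print(mean_distance_km(true_points, pred_centers))
print(mape([100.0, 200.0], [110.0, 180.0]))
```

For classification, `pred_centers` would hold the center of each predicted cluster; for regression, it would hold the predicted coordinates themselves.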


In [1]:
import tensorflow as tf



In [2]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Softmax
from tensorflow.keras.optimizers import Adam

In [3]:
import pandas as pd 
import numpy as np 
import json
import preprocessing
import dask.array as da



In [4]:
!pip install fastparquet


In [5]:
n_cluster = 4000

In [6]:
train_data = pd.read_parquet(f's3://think-tank-casestudy/features_engineered/n_cluster_{n_cluster}/feature_engineered_train.parquet')
test_data = pd.read_parquet(f's3://think-tank-casestudy/features_engineered/n_cluster_{n_cluster}/feature_engineered_test.parquet')

### 1a) Random Forest - Multiclass

In [7]:
from sklearn.ensemble import RandomForestClassifier

In [8]:
train_data = preprocessing.create_fix_length_sequences(train_data, 10)
test_data = preprocessing.create_fix_length_sequences(test_data, 10)

In [9]:
features_config_1 = ['N_COORDINATE_POINTS','TOTAL_DISTANCE_KM', '2013_10',
       '2013_11', '2013_12', '2013_7', '2013_8', '2013_9', '2014_1', '2014_2',
       '2014_3', '2014_4', '2014_5', '2014_6', '2014_7', '10.0', '12.0',
       '13.0', '14.0', '15.0', '18.0', '20.0', '21.0', '23.0', '25.0', '26.0',
       '27.0', '28.0', '33.0', '34.0', '35.0', '36.0', '38.0', '40.0', '42.0',
       '52.0', '53.0', '54.0', '56.0', '57.0', '58.0', '6.0', '60.0', '61.0',
       '63.0', '7.0', '9.0', 'OTHER', 'Cloudy', 'Foggy', 'Rainy', 'Sunny',
       'Windy', 'A', 'B', 'C', '16.0', '2014_10', '2014_11', '2014_12',
       '2014_8', '2014_9', '47.0', '49.0']
label_config_1 = ['CLUSTER_LABEL']

In [10]:
estimator = RandomForestClassifier(n_jobs=-1, random_state=0, n_estimators=200, min_samples_split=0.001)

In [11]:
start_sequence_train = pd.DataFrame(train_data.START_SEQUENCE.tolist()).to_numpy()
stop_sequence_train = pd.DataFrame(train_data.STOP_SEQUENCE.tolist()).to_numpy()

start_sequence_test = pd.DataFrame(test_data.START_SEQUENCE.tolist()).to_numpy()
stop_sequence_test = pd.DataFrame(test_data.STOP_SEQUENCE.tolist()).to_numpy()

In [12]:
X_train = train_data[features_config_1].to_numpy()
y_train = train_data[label_config_1].to_numpy().ravel()  # 1-D label vector, as scikit-learn expects

In [13]:
X_test = test_data[features_config_1].to_numpy()
y_test = test_data[label_config_1].to_numpy().ravel()  # 1-D label vector, as scikit-learn expects

In [14]:
X_train = np.concatenate((X_train, start_sequence_train, stop_sequence_train), axis=1).astype(float)
X_test = np.concatenate((X_test, start_sequence_test, stop_sequence_test), axis=1).astype(float)

In [None]:
# fit() returns the fitted estimator itself, not predictions.
# np.ravel flattens a column-vector label into the 1-D shape scikit-learn expects.
estimator.fit(X_train, np.ravel(y_train))
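Once a classifier is fitted, its predicted cluster labels still have to be turned into coordinates before a distance can be measured. The sketch below assumes that cluster labels are integer ids that index the rows of a `cluster_centers` array. That array, the labels, and the probabilities are made up for illustration. In scikit-learn, the columns of `predict_proba` follow the order of `estimator.classes_`, which the `classes` array stands in for here.

```python
import numpy as np

# Hypothetical cluster centers: row i holds (longitude, latitude) of cluster i.
cluster_centers = np.array([
    [-8.61, 41.15],
    [-8.58, 41.16],
    [-8.65, 41.18],
])

# Labels as a classifier's predict() would return them (made-up values).
predicted_labels = np.array([2, 0, 0, 1])
predicted_lonlat = cluster_centers[predicted_labels]  # shape (4, 2)

# Starting from class probabilities instead: pick the most likely class per row.
classes = np.array([0, 1, 2])  # stand-in for estimator.classes_
proba = np.array([[0.1, 0.2, 0.7],
                  [0.8, 0.1, 0.1]])
labels_from_proba = classes[np.argmax(proba, axis=1)]

print(predicted_lonlat)
print(labels_from_proba)  # [2 0]
```

The resulting coordinates can then be compared with the true last point of each trip.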


### 2) LSTM

In [None]:
# Fill missing sequence steps with an arbitrary large sentinel (2000) so they can be masked later
df_sequence_train = da.from_array(pd.DataFrame(train_data.SEQUENCE.tolist()).fillna(2000).to_numpy())
df_sequence_test = da.from_array(pd.DataFrame(test_data.SEQUENCE.tolist()).fillna(2000).to_numpy())

In [None]:
features_config_2 = ['N_COORDINATE_POINTS','TOTAL_DISTANCE_KM','2013_10',
       '2013_11', '2013_12', '2013_7', '2013_8', '2013_9', '2014_1', '2014_2',
       '2014_3', '2014_4', '2014_5', '2014_6', '2014_7', '10.0', '12.0',
       '13.0', '14.0', '15.0', '18.0', '20.0', '21.0', '23.0', '25.0', '26.0',
       '27.0', '28.0', '33.0', '34.0', '35.0', '36.0', '38.0', '40.0', '42.0',
       '52.0', '53.0', '54.0', '56.0', '57.0', '58.0', '6.0', '60.0', '61.0',
       '63.0', '7.0', '9.0', 'OTHER', 'Cloudy', 'Foggy', 'Rainy', 'Sunny',
       'Windy', 'A', 'B', 'C', '16.0', '2014_10', '2014_11', '2014_12',
       '2014_8', '2014_9', '47.0', '49.0']
label_config_2 = ['CLUSTER_LABEL']

In [None]:
X_train = train_data[features_config_2].to_numpy().astype(float)
X_test = test_data[features_config_2].to_numpy().astype(float)

In [None]:
X_train = np.concatenate((df_sequence_train, X_train), axis=1)
X_test = np.concatenate((df_sequence_test, X_test), axis=1)

In [None]:
y_train = train_data[label_config_2]
y_test =  test_data[label_config_2]

In [None]:
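from tensorflow.keras.layers import Input, Masking, Dense

# Assumption: each column of X_train is treated as one timestep with a single feature,
# because the LSTM layer expects input of shape (samples, timesteps, features).
X_train_seq = np.asarray(X_train, dtype=float)[..., np.newaxis]

model = Sequential()
model.add(Input(shape=(X_train_seq.shape[1], 1)))  # shape excludes the sample dimension
model.add(Masking(mask_value=2000))  # skip the timesteps padded with 2000
model.add(LSTM(32, activation='relu'))
model.add(Dense(n_cluster, activation='softmax'))  # one probability per cluster
model.summary()

# Assumes CLUSTER_LABEL holds integer ids in [0, n_cluster), hence the sparse loss.
# Keras' AUC metric expects one-hot targets, so accuracy is tracked here instead.
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['sparse_categorical_accuracy'])

history = model.fit(X_train_seq, np.ravel(y_train))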