### Model training

Appropriate models can now be trained on the filtered, calculated and encoded features.

The following problem formulations are considered:
- Multi-class classification on clustered data
- Multi-output regression with two outputs (longitude/latitude)

For the input data, two approaches are considered:
- variable sequence length: can take all points in POLYLINE into consideration, masking padded positions if necessary
- fixed sequence length: take 10 points from the beginning of POLYLINE and 10 points from the end
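
The fixed-length variant can be sketched as follows. This is only an illustration of the idea, not the code behind `preprocessing.create_fix_length_sequences`. The helper name `fixed_length_sequences`, the `(longitude, latitude)` point order and the NaN padding of short trips are all assumptions.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # assumed order: (longitude, latitude)


def fixed_length_sequences(
    polyline: Sequence[Point], n_points: int = 10
) -> Tuple[List[float], List[float]]:
    """Return the flattened first and last n_points of a trip.

    Hypothetical helper: trips shorter than n_points are right-padded
    with NaN so that every window holds 2 * n_points values.
    """
    if n_points < 1:
        raise ValueError("n_points must be at least 1")

    pad = [float("nan"), float("nan")]

    def flatten(points: Sequence[Point]) -> List[float]:
        # Interleave coordinates: lon_1, lat_1, lon_2, lat_2, ...
        flat = [float(coord) for point in points for coord in point]
        return flat + pad * (n_points - len(points))

    start = flatten(polyline[:n_points])
    stop = flatten(polyline[-n_points:])
    return start, stop


trip = [(float(i), float(-i)) for i in range(12)]
start, stop = fixed_length_sequences(trip, n_points=3)
```

With `n_points=10` each window has 20 values, which would match the 20 columns seen later for the start sequence if the points are stored flattened in this way.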


Algorithms:
- Long short-term memory (LSTM) network (multi-class classification and regression)
    - handles variable sequence lengths, so the complete trip POLYLINE can be used
- Random forest (regression and classification)
    - copes well with outliers, which matters because the dataset still contains some
    - runs efficiently on large datasets
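
To feed variable-length trips to a sequence model, the trips are usually padded to a common length and the padded positions are marked so the model can skip them. The sketch below shows one way to do this with NumPy. The helper name `pad_sequences` and the `(n_points, 2)` layout per trip are assumptions. The sentinel 2000 mirrors the fill value used in the LSTM section and lies far outside any valid longitude or latitude.

```python
import numpy as np

PAD_VALUE = 2000.0  # sentinel outside the valid longitude/latitude range


def pad_sequences(sequences, pad_value=PAD_VALUE):
    """Right-pad variable-length (n_i, 2) coordinate sequences.

    Returns a (batch, max_len, 2) array and a boolean mask that is
    True for real points and False for padded positions.
    """
    max_len = max(len(seq) for seq in sequences)
    batch = np.full((len(sequences), max_len, 2), pad_value, dtype=float)
    mask = np.zeros((len(sequences), max_len), dtype=bool)
    for i, seq in enumerate(sequences):
        if len(seq) == 0:
            continue  # an empty trip stays fully padded
        batch[i, : len(seq)] = np.asarray(seq, dtype=float)
        mask[i, : len(seq)] = True
    return batch, mask


trips = [[(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)], [(7.0, 8.0)]]
batch, mask = pad_sequences(trips)
print(batch.shape)  # (2, 3, 2)
```

A Keras `Masking(mask_value=2000)` layer placed before the LSTM could then ignore the padded time steps, provided every feature of a padded step equals the sentinel.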
    
Metrics:
- Classification of clusters: AUC + average distance of last point to cluster center
- Regression: mean absolute percentage error (MAPE) + average distance of last point to cluster center
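
The distance-based metric and MAPE can be sketched as below. The notebook does not state which distance is used, so the haversine (great-circle) distance here is an assumption. The function names and the `cluster_centers` array with one `(longitude, latitude)` row per cluster are hypothetical as well.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between points given in degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def mean_distance_to_center(last_points, predicted_labels, cluster_centers):
    """Average distance from each true last point to its predicted cluster center."""
    last_points = np.asarray(last_points, dtype=float)
    labels = np.asarray(predicted_labels, dtype=int)
    centers = np.asarray(cluster_centers, dtype=float)[labels]
    distances = haversine_km(
        last_points[:, 0], last_points[:, 1], centers[:, 0], centers[:, 1]
    )
    return float(np.mean(distances))


def mape(y_true, y_pred):
    """Mean absolute percentage error in percent; y_true must not contain zeros."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)
```

Reporting the distance in kilometres next to AUC or MAPE makes the result easier to interpret, because a wrong cluster that lies next to the true one is penalised less than a distant one.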


In [1]:
import tensorflow



In [25]:
from keras import Sequential
from keras.layers import LSTM, Softmax, Dense, Dropout, Flatten, Embedding, Input,Concatenate
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping

In [3]:
import pandas as pd 
import numpy as np 
import json
import preprocessing
import dask.array as da

In [4]:
#!pip install fastparquet

In [4]:
n_cluster=4000

In [5]:
train_data = pd.read_parquet(f's3://think-tank-casestudy/features_engineered/n_cluster_{n_cluster}/feature_engineered_train.parquet')
test_data = pd.read_parquet(f's3://think-tank-casestudy/features_engineered/n_cluster_{n_cluster}/feature_engineered_test.parquet')

In [6]:
train_data = preprocessing.create_fix_length_sequences(train_data, 10)
test_data = preprocessing.create_fix_length_sequences(test_data, 10)

### 1a) Random Forest - Multiclass

In [7]:
from sklearn.ensemble import RandomForestClassifier

In [9]:
features_config_1 = ['N_COORDINATE_POINTS','TOTAL_DISTANCE_KM', '2013_10',
       '2013_11', '2013_12', '2013_7', '2013_8', '2013_9', '2014_1', '2014_2',
       '2014_3', '2014_4', '2014_5', '2014_6', '2014_7', '10.0', '12.0',
       '13.0', '14.0', '15.0', '18.0', '20.0', '21.0', '23.0', '25.0', '26.0',
       '27.0', '28.0', '33.0', '34.0', '35.0', '36.0', '38.0', '40.0', '42.0',
       '52.0', '53.0', '54.0', '56.0', '57.0', '58.0', '6.0', '60.0', '61.0',
       '63.0', '7.0', '9.0', 'OTHER', 'Cloudy', 'Foggy', 'Rainy', 'Sunny',
       'Windy', 'A', 'B', 'C', '16.0', '2014_10', '2014_11', '2014_12',
       '2014_8', '2014_9', '47.0', '49.0']
label_config_1 = ['CLUSTER_LABEL']

In [10]:
estimator = RandomForestClassifier(n_jobs=-1, random_state=0, n_estimators=200, min_samples_split=0.001)

In [11]:
start_sequence_train = pd.DataFrame(train_data.START_SEQUENCE.tolist()).to_numpy()
stop_sequence_train = pd.DataFrame(train_data.STOP_SEQUENCE.tolist()).to_numpy()

start_sequence_test = pd.DataFrame(test_data.START_SEQUENCE.tolist()).to_numpy()
stop_sequence_test = pd.DataFrame(test_data.STOP_SEQUENCE.tolist()).to_numpy()

In [12]:
X_train = train_data[features_config_1].to_numpy()
y_train = train_data[label_config_1]

In [13]:
X_test = test_data[features_config_1].to_numpy()
y_test = test_data[label_config_1]

In [14]:
X_train = np.concatenate((X_train, start_sequence_train,stop_sequence_train), axis=1).astype(float)
X_test = np.concatenate((X_test, start_sequence_test,stop_sequence_test), axis=1).astype(float)

In [None]:
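# fit() returns the fitted estimator itself, not predictions;
# ravel() flattens the single-column label frame into the 1-D array scikit-learn expects
estimator.fit(X_train, y_train.to_numpy().ravel())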


### 2) LSTM

In [7]:
df_sequence_train_start = pd.DataFrame(train_data.START_SEQUENCE.tolist()).fillna(2000)
df_sequence_train_stop = pd.DataFrame(train_data.STOP_SEQUENCE.tolist()).fillna(2000)

df_sequence_test_start = pd.DataFrame(test_data.START_SEQUENCE.tolist()).fillna(2000)
df_sequence_test_stop = pd.DataFrame(test_data.STOP_SEQUENCE.tolist()).fillna(2000)

In [19]:
df_sequence_train_start.shape

(1371135, 20)

In [8]:
#fill nas with arbitrary large number to mask later
#df_sequence_train = pd.DataFrame(train_data.SEQUENCE.tolist()).fillna(2000).to_numpy()
#df_sequence_test = pd.DataFrame(test_data.SEQUENCE.tolist()).fillna(2000).to_numpy()

In [9]:
features_config_2 = ['N_COORDINATE_POINTS','TOTAL_DISTANCE_KM','2013_10',
       '2013_11', '2013_12', '2013_7', '2013_8', '2013_9', '2014_1', '2014_2',
       '2014_3', '2014_4', '2014_5', '2014_6', '2014_7', '10.0', '12.0',
       '13.0', '14.0', '15.0', '18.0', '20.0', '21.0', '23.0', '25.0', '26.0',
       '27.0', '28.0', '33.0', '34.0', '35.0', '36.0', '38.0', '40.0', '42.0',
       '52.0', '53.0', '54.0', '56.0', '57.0', '58.0', '6.0', '60.0', '61.0',
       '63.0', '7.0', '9.0', 'OTHER', 'Cloudy', 'Foggy', 'Rainy', 'Sunny',
       'Windy', 'A', 'B', 'C', '16.0', '2014_10', '2014_11', '2014_12',
       '2014_8', '2014_9', '47.0', '49.0']
label_config_2 = ['CLUSTER_LABEL']

In [17]:
X_train = train_data[features_config_2].astype(float)
X_test = test_data[features_config_2].astype(float)

In [20]:
X_train = pd.concat([df_sequence_train_start,df_sequence_train_stop, X_train], axis=1)
X_test = pd.concat([df_sequence_test_start,df_sequence_test_stop, X_test], axis=1)

In [21]:
y_train = train_data[label_config_2]
y_test =  test_data[label_config_2]

In [33]:
model = Sequential()
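# NOTE: LSTM layers expect 3-D input (batch, timesteps, features); this 2-D shape raises the ValueError shown below
model.add(Input(shape=(X_train.shape[1])))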
#model.add(tensorflow.keras.layers.Masking(mask_value=2000))
model.add(LSTM(200, activation='relu'))
model.add(Dense(4000, activation='softmax'))
print(model.summary())

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['AUC'])
          
#X_pred_train = model.fit(X_train)

ValueError: Input 0 of layer "lstm_4" is incompatible with the layer: expected ndim=3, found ndim=2. Full shape received: (None, 104)
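
The error arises because an LSTM layer expects input of shape `(batch, timesteps, features)`, while the concatenated frame is a flat `(batch, 104)` table. One possible fix is to reshape the coordinate columns into a 3-D array before passing them to the network. The sketch below assumes that each row stores its points interleaved as `lon_1, lat_1, lon_2, lat_2, ...`, which the notebook does not confirm. The helper name `to_lstm_input` is hypothetical, and the toy arrays only stand in for `df_sequence_train_start` and `df_sequence_train_stop`.

```python
import numpy as np


def to_lstm_input(flat_sequences, n_coords=2):
    """Reshape (batch, timesteps * n_coords) rows to (batch, timesteps, n_coords).

    Assumes coordinates are interleaved per point: lon_1, lat_1, lon_2, lat_2, ...
    """
    flat = np.asarray(flat_sequences, dtype=float)
    if flat.ndim != 2 or flat.shape[1] % n_coords != 0:
        raise ValueError("expected a 2-D array whose width is a multiple of n_coords")
    return flat.reshape(flat.shape[0], flat.shape[1] // n_coords, n_coords)


# Toy stand-ins for the start and stop sequence tables (10 points each).
start = np.arange(40, dtype=float).reshape(2, 20)
stop = start + 100.0

# Stack start and stop windows along the time axis: 10 + 10 = 20 time steps.
sequence = np.concatenate([to_lstm_input(start), to_lstm_input(stop)], axis=1)
print(sequence.shape)  # (2, 20, 2)
```

With this layout the model input could be declared as `Input(shape=(20, 2))`. The 64 static features do not belong on the time axis and would need a separate input branch, for example one joined to the LSTM output with `Concatenate`. Note also that `categorical_crossentropy` expects one-hot labels. If `CLUSTER_LABEL` holds integer class ids, `sparse_categorical_crossentropy` is the matching loss.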