# Task 3. Time Series

### Student: Sandra M Nino A

For this task, I conducted different approaches for each model. One major difficulty of this task is that each model took long time to run, so it was very hard to test and get to the best possible solution. 

Many of the experiments were based on predicting the 100 users, times or cluster names with a multioutput approach. This means, having the last dense layer of the network with 100 neurons. However, the sequence length had to be, at least, 100 to get some decent results. But, this takes a lot of time to run even with a simple model of 1 or 2 LSTM layers. Therefore, I tried decreasing the sequence length to predict the 100 "at once" but the R2 metric was negative. 

To simplify the task, the solution presented here is a one step forecast with univariate approach. 

### Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error, r2_score, accuracy_score, precision_score, recall_score, f1_score
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras.optimizers import Adam
import os
from keras.layers import Dropout

In [2]:
seed = 123
np.random.seed(seed)
tf.compat.v1.set_random_seed(seed)

Update the path to where the file is located to load the data. The file is called `UserLog.csv`

In [3]:
column_names = ['Date_Time', 'Event_Type', 'Cluster_Name', 'Duration', 'Number_Users']
data = pd.read_csv('./UserLog.csv',names=column_names)

## Helper functions

This function creates the input sequences required for a time series task and its labels. 

In [4]:
def create_sequences(data, window_size, future_target):
  data_np = data.to_numpy()
  X = []
  y = []
  for i in range(len(data_np)-window_size-future_target+1):
    row = [[a] for a in data_np[i:i+window_size]]
    X.append(row)
    label = data_np[i+window_size]
    y.append(label)
  return np.array(X), np.array(y)

This function creates the model for regression tasks. These are for number of users and date time predictions. To simplify the task, it has 1 LSTM layer, 1 Dropout layer to avoid overfitting, and 1 output dense layer for the step prediction, in our case, one step forecast.  

In [79]:
def build_model_regression(input_shape, output_shape):
    model = Sequential([
        LSTM(64, input_shape=input_shape, activation='relu', return_sequences=True),
        Dropout(0.5),
        Dense(output_shape)
    ])

    model.compile(loss='mse', optimizer='adam', metrics=['mae'])
    
    return model

This function creates the model for classification tasks. This is for the cluster names. The model consists in two LSTM layers, 3 Dropout layers, 2 dense layers, and one final dense layer to make the multi-class classification.

In [112]:
def build_model_classification(window_size, num_features, num_classes):
    model = keras.Sequential()

    model.add(LSTM(64, input_shape=(window_size, num_features), return_sequences=True))
    model.add(LSTM(32, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(64, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(64, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(num_classes, activation='softmax'))

    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

    return model

The following functions are used to calculate the metrics for the predictions and evaluate the model performance. 

In [71]:
from sklearn.metrics import mean_squared_error

def calculate_metrics_regression(y_true, y_hat):
    mae = mean_absolute_error(y_true, y_hat)
    mse = mean_squared_error(y_true, y_hat)
    r2 = r2_score(y_true, y_hat)
    
    return mae, mse, r2

In [133]:
def evaluate_model_regression(model, X_test, y_test, scaler):
    score = model.evaluate(X_test, y_test, verbose=1)
    print('Summary: Loss over the test dataset: %.2f, MAE: %.2f' % (score[0], score[1]))

    y_true = scaler.inverse_transform(y_test.reshape(-1, 1))
    y_hat = scaler.inverse_transform(model.predict(X_test).reshape(-1, 1))

    mae, mse, r2 = calculate_metrics_regression(y_true, y_hat)

    print('MAE:', mae)
    print('MSE:', mse)
    print('R2:', r2)

    return mae, mse, r2

def evaluate_model_classification(model, X_test, y_test, le):

    score = model.evaluate(X_test, y_test, verbose=1)
    print('Summary: Loss over the test dataset: %.2f, Accuracy: %.2f' % (score[0], score[1]))

    y_true = le.inverse_transform(y_test.reshape(-1, 1))
    y_hat = model.predict(X_test)
    y_hat = np.argmax(y_hat, axis=1)
    y_hat = le.inverse_transform(y_hat.reshape(-1, 1))

    accuracy = accuracy_score(y_true, y_hat)
    precision = precision_score(y_true, y_hat, average='macro')
    recall = recall_score(y_true, y_hat, average='macro')
    f1 = f1_score(y_true, y_hat, average='macro')

    print('Accuracy:', accuracy)
    print('Precision:', precision)
    print('Recall:', recall)
    print('F1-Score:', f1)

    return accuracy, precision, recall, f1

## Data Preprocessing

### For number of users

This is a regression task, therefore, we scale the numbers to be in the range of [0,1]

In [8]:
scaler = MinMaxScaler()

dataset = data['Number_Users'].to_numpy()
dataset = dataset.astype('float32')

dataset = scaler.fit_transform(dataset.reshape(-1,1))

data['Number_Users'] = dataset.flatten()

### For cluster names

This is a classification task, therefore, we encode the cluster names in numbers as the neural network need it. 

In [10]:
le = LabelEncoder()
data['Cluster_Name'] = le.fit_transform(data['Cluster_Name'])

### For login/logout times

For this task, we convert the date time to a Unix Timestamp, which is a number. Therefore, this becomes a regression task. At the end, we scale the numbers to be in the range of [0,1]

In [11]:
import pytz
from datetime import datetime

bst_timezone = pytz.timezone('Europe/London')
gmt_timezone = pytz.timezone('Etc/GMT')

def convert_to_gmt(date_str):
    date_obj = datetime.strptime(date_str, '%a %b %d %H:%M:%S %Z %Y')
    
    if date_obj.tzinfo == bst_timezone:
        gmt_date_obj = date_obj.astimezone(gmt_timezone)
        gmt_date_str = gmt_date_obj.strftime('%Y-%m-%d %H:%M:%S')
        return gmt_date_str
    else:
        return date_obj.strftime('%Y-%m-%d %H:%M:%S')

data['Date_Format'] = data['Date_Time'].apply(convert_to_gmt)

In [12]:
data.index = data['Date_Format']

In [13]:
data.head()

Unnamed: 0_level_0,Date_Time,Event_Type,Cluster_Name,Duration,Number_Users,Date_Format
Date_Format,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
2010-01-01 00:00:00,Fri Jan 01 00:00:00 GMT 2010,LOGIN,9,1261840,0.000977,2010-01-01 00:00:00
2010-01-01 00:00:00,Fri Jan 01 00:00:00 GMT 2010,LOGIN,15,10058927,0.001953,2010-01-01 00:00:00
2010-01-01 00:00:00,Fri Jan 01 00:00:00 GMT 2010,LOGIN,28,6868990,0.00293,2010-01-01 00:00:00
2010-01-01 00:00:00,Fri Jan 01 00:00:00 GMT 2010,LOGIN,15,2997017,0.003906,2010-01-01 00:00:00
2010-01-01 00:00:00,Fri Jan 01 00:00:00 GMT 2010,LOGIN,15,8919800,0.004883,2010-01-01 00:00:00


In [14]:
import calendar, time; 

data['Date_Unix'] = data['Date_Format'].apply(lambda date_str : calendar.timegm(time.strptime(date_str, '%Y-%m-%d %H:%M:%S')))

In [15]:
data.head()

Unnamed: 0_level_0,Date_Time,Event_Type,Cluster_Name,Duration,Number_Users,Date_Format,Date_Unix
Date_Format,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2010-01-01 00:00:00,Fri Jan 01 00:00:00 GMT 2010,LOGIN,9,1261840,0.000977,2010-01-01 00:00:00,1262304000
2010-01-01 00:00:00,Fri Jan 01 00:00:00 GMT 2010,LOGIN,15,10058927,0.001953,2010-01-01 00:00:00,1262304000
2010-01-01 00:00:00,Fri Jan 01 00:00:00 GMT 2010,LOGIN,28,6868990,0.00293,2010-01-01 00:00:00,1262304000
2010-01-01 00:00:00,Fri Jan 01 00:00:00 GMT 2010,LOGIN,15,2997017,0.003906,2010-01-01 00:00:00,1262304000
2010-01-01 00:00:00,Fri Jan 01 00:00:00 GMT 2010,LOGIN,15,8919800,0.004883,2010-01-01 00:00:00,1262304000


In [16]:
scaler_date = MinMaxScaler()
df = data['Date_Unix'].to_numpy().astype('float32')
df = scaler_date.fit_transform(df.reshape(-1,1))

data.loc[:, 'Date_Unix'] = df.flatten()

## 3.1 Predict Number of Users

For this task, we select input sequences of 50 and one future target. As mentioned at the beginning, I tried window sizes of 5, 10, 20, 50, 100, 120 to predict 100 number of users at once, but I didn't get good results and the training time takes long time which was impossible to manage due to the rest of the tasks. Therefore, I opted for this simpler solution.

In [18]:
window_size = 50
future_target = 1

In [19]:
X, y = create_sequences(data['Number_Users'], window_size, future_target)

We make our training, validation and test split like 70/10/20. 

In [98]:
train_size = int(len(X) * 0.7)
val_size = int(len(X) * 0.1)

X_train, y_train = X[:train_size], y[:train_size]
X_val, y_val = X[train_size:train_size+val_size], y[train_size:train_size+val_size]
X_test, y_test = X[train_size+val_size:], y[train_size+val_size:]

print('Shape of X_train:', X_train.shape)
print('Shape of y_train:', y_train.shape)
print('Shape of X_test:', X_test.shape)
print('Shape of y_test:', y_test.shape)
print('Shape of X_val:', X_val.shape)
print('Shape of y_val:', y_val.shape)

Shape of X_train: (1721711, 50, 1)
Shape of y_train: (1721711,)
Shape of X_test: (491919, 50, 1)
Shape of y_test: (491919,)
Shape of X_val: (245958, 50, 1)
Shape of y_val: (245958,)


We build our regression model and use early stopping.

In [80]:
model_users = build_model_regression((window_size, 1), future_target)

In [81]:
early_stopping_users = EarlyStopping(monitor='val_loss', patience=5, verbose=1, restore_best_weights=True)

We fit our model with 20 epochs. However, it reaches until the 8th epoch because it overfits very quickly. 

In [27]:
num_epochs = 20
history_users = model_users.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=num_epochs,
                    batch_size=32,
                    callbacks=[early_stopping_users])

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 8: early stopping


In [33]:
model_users.save('./checkpoints/timeseries/ts-users-model-1.keras')

After saving our model, we can run the following cell to load our final model. Update the path where the file is located. The file is called `ts-users-model-1.keras`

In [96]:
final_model = keras.models.load_model('./checkpoints/timeseries/ts-users-model-1.keras')

Now we can make some predictions. We see that the R2 is near 1. However, the values for MAE and MSE are quite high indicating low performance in the model.

In [99]:
_ = evaluate_model_regression(final_model, X_test, y_test, scaler)

Summary: Loss over the test dataset: 0.00, MAE: 0.01
MAE: 7.400751
MSE: 72.11
R2: 0.9990146468169545


## 3.2 Predict Cluster Names

For this task, we select input sequences of 50 and one future target.

In [123]:
X, y = create_sequences(data['Cluster_Name'], 50, 1)

We create our training, validation and test split like 70/10/20.

In [124]:
train_size = int(len(X) * 0.7)
val_size = int(len(X) * 0.1)

X_train, y_train = X[:train_size], y[:train_size]
X_val, y_val = X[train_size:train_size+val_size], y[train_size:train_size+val_size]
X_test, y_test = X[train_size+val_size:], y[train_size+val_size:]

print('Shape of X_train:', X_train.shape)
print('Shape of y_train:', y_train.shape)
print('Shape of X_test:', X_test.shape)
print('Shape of y_test:', y_test.shape)
print('Shape of X_val:', X_val.shape)
print('Shape of y_val:', y_val.shape)

Shape of X_train: (1721711, 50, 1)
Shape of y_train: (1721711,)
Shape of X_test: (491919, 50, 1)
Shape of y_test: (491919,)
Shape of X_val: (245958, 50, 1)
Shape of y_val: (245958,)


We build our regression model and use early stopping.

In [125]:
num_clusters = len(pd.unique(data['Cluster_Name']))
model_clusters = build_model_classification(50, 1, num_clusters)

In [126]:
early_stopping_clusters = EarlyStopping(monitor='val_loss', patience=3, verbose=1, restore_best_weights=True)

For this task, different experiments did not get an accuracy greater than 0.13. Therefore, I reduced the number of epochs to 8 to show the training for this model. 

In [127]:
num_epochs = 8
history_clusters = model_clusters.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=num_epochs,
                    batch_size=32,
                    callbacks=[early_stopping_clusters])

Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 6: early stopping


In [128]:
model_clusters.save('./checkpoints/timeseries/ts-clusters-model-1.keras')

After saving our model, we can run the following cell to load our final model. Update the path where the file is located. The file is called `ts-clusters-model-1.keras`

In [129]:
final_model_clusters = keras.models.load_model('./checkpoints/timeseries/ts-clusters-model-1.keras')

We can now evaluate our model for the classification task. We can see that the accuracy is 12%. 

In [134]:
_ = evaluate_model_classification(final_model_clusters, X_test, y_test, le) 

Summary: Loss over the test dataset: 3.29, Accuracy: 0.12
   31/15373 [..............................] - ETA: 52s

  y = column_or_1d(y, warn=True)




  y = column_or_1d(y, warn=True)
  _warn_prf(average, modifier, msg_start, len(result))


Accuracy: 0.11634639036101473
Precision: 0.02542384284990699
Recall: 0.03535080505250191
F1-Score: 0.013102093308347336


## 3.3 Predict Login/Logout times

For this task, we select input sequences of 50 and one future target. 

In [115]:
X, y = create_sequences(data['Date_Unix'], 50, 1)

We create our training, validation and test sets like 70/10/20.

In [116]:
train_size = int(len(X) * 0.7)
val_size = int(len(X) * 0.1)

X_train, y_train = X[:train_size], y[:train_size]
X_val, y_val = X[train_size:train_size+val_size], y[train_size:train_size+val_size]
X_test, y_test = X[train_size+val_size:], y[train_size+val_size:]

print('Shape of X_train:', X_train.shape)
print('Shape of y_train:', y_train.shape)
print('Shape of X_test:', X_test.shape)
print('Shape of y_test:', y_test.shape)
print('Shape of X_val:', X_val.shape)
print('Shape of y_val:', y_val.shape)

Shape of X_train: (1721711, 50, 1)
Shape of y_train: (1721711,)
Shape of X_test: (491919, 50, 1)
Shape of y_test: (491919,)
Shape of X_val: (245958, 50, 1)
Shape of y_val: (245958,)


We build our regression model and use early stopping.

In [117]:
model_time = build_model_regression((window_size, 1), future_target)

In [118]:
early_stopping_time = EarlyStopping(monitor='val_loss', patience=5, verbose=1, restore_best_weights=True)

We fit our model with 10 epochs.

In [119]:
num_epochs = 10
history_times = model_time.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=num_epochs,
                    batch_size=32,
                    callbacks=[early_stopping_time])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [120]:
model_time.save('./checkpoints/timeseries/ts-login-logout-time-model-1.keras')

After saving our model, we can run the following cell to load our final model. Update the path where the file is located. The file is called `ts-login-logout-time-model-1.keras`

In [121]:
final_model_time = keras.models.load_model('./checkpoints/timeseries/ts-login-logout-time-model-1.keras')

Now we can make some predictions. The very high MAE, MSE, and negative R2 indicate that the model is performing poorly and is not able to effectively predict the timestamp for login/logout.

In [122]:
_ = evaluate_model_regression(final_model_time, X_test, y_test, scaler_date)

Summary: Loss over the test dataset: 0.01, MAE: 0.10
MAE: 3101496.2
MSE: 10483952000000.0
R2: -8.347839716688487
