The [LSTM model](https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting/) learns a function that maps a sequence of past observations as input to an output observation. As such, the sequence of observations must be transformed into multiple examples from which the LSTM can learn.

The goal of this notebook is to convert the original event sequence dataset into n-item sliding-window examples for training the LSTM model.
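
To make the transformation concrete, here is a minimal sketch on a made-up sequence. The helper name `sliding_windows` and the toy values are illustrative assumptions, not part of the notebook's dataset or code. Each window of `n_steps` events becomes one input, and the event immediately after it becomes the target.

```python
def sliding_windows(events, n_steps):
    """Return (inputs, targets) for every n_steps-long window in `events`."""
    inputs, targets = [], []
    # A window starting at `start` needs one more event after it as the target,
    # so the last valid start is len(events) - n_steps - 1.
    for start in range(len(events) - n_steps):
        inputs.append(events[start:start + n_steps])
        targets.append(events[start + n_steps])
    return inputs, targets


# Hypothetical toy sequence, used only for illustration.
toy_events = [10, 20, 30, 40, 50, 60]
inputs, targets = sliding_windows(toy_events, n_steps=3)
for window, target in zip(inputs, targets):
    print(window, "->", target)
```

A six-event sequence with a 3-item window yields three training examples, one per valid window position.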

### Package Imports

In [1]:
import pandas as pd
import numpy as np
import math
from matplotlib import pyplot
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

import tensorflow as tf
from tensorflow import keras
from keras import layers

from keras.utils import to_categorical
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras.layers import Dense, LSTM, Embedding
from keras.models import Sequential




### Read Data
Get an idea of the number of unique event sequences in the original dataset.

In [None]:
# read the dataset; `sep` and `delimiter` are aliases, so pass only one of them
seq = pd.read_csv('event_seqs.csv', header=None, sep=',')
seq.head()


In [None]:
tval = seq.values

In [6]:
# number of unique event sequences (one sequence per row)
seq.nunique()

0    6345
dtype: int64

### Split Dataset into n-item sequences
n can be any positive integer that suits the LSTM model. In this example, we build 5-item and 10-item sequences for illustration and comparison.
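
The choice of n also determines how many training examples each original sequence contributes. As a rough sketch (the helper `count_windows` and the example lengths are hypothetical), a sequence of length L produces L - n examples when L > n and none otherwise, so a larger n drops more of the short sequences.

```python
def count_windows(seq_len, n_steps):
    """Number of (window, target) pairs a sequence of length seq_len yields."""
    # Each example needs n_steps inputs plus one following event as the target.
    return max(seq_len - n_steps, 0)


# Hypothetical sequence lengths, for illustration only.
for n_steps in (5, 10):
    counts = [count_windows(length, n_steps) for length in (4, 8, 12)]
    print(f"n_steps={n_steps}: {counts}")
```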

In [None]:
# apply an n-item sliding window to every event sequence
def splitSequence(seq, n_steps):
    # X collects the n_steps-long input windows, y the event that follows each window
    X = []
    y = []

    for row in seq:
        # each row holds one comma-separated string of integers in its first column
        result = [int(event) for event in row[0].split(',')]
        for j in range(len(result)):
            # stop once no event is left after the window to serve as the target
            lastIndex = j + n_steps
            if lastIndex > len(result) - 1:
                break

            # input window and the event right after it
            seq_X, seq_y = result[j:lastIndex], result[lastIndex]

            X.append(seq_X)
            y.append(seq_y)

    X = np.array(X)
    y = np.array(y)
    return X, y
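
For larger datasets, the inner loop can be replaced by NumPy's `sliding_window_view`. The sketch below is an alternative, not the notebook's method. It assumes the same input layout as above (a 2-D array whose first column holds comma-separated integer strings); the function name `split_rows` and the toy rows are made up for illustration.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

# Hypothetical stand-in for the real data: one comma-separated string per row.
toy = pd.DataFrame({0: ["1,2,3,4,5,6,7", "8,9", "3,1,4,1,5,9"]})


def split_rows(rows, n_steps):
    """Vectorized n-item sliding window over every row of `rows`."""
    X_parts, y_parts = [], []
    for row in rows:
        events = np.array([int(tok) for tok in row[0].split(",")])
        if len(events) <= n_steps:
            continue  # too short to yield a full window plus a target
        # Windows of n_steps + 1 events: the first n_steps are the input,
        # the last one is the target.
        windows = sliding_window_view(events, n_steps + 1)
        X_parts.append(windows[:, :-1])
        y_parts.append(windows[:, -1])
    if not X_parts:
        return np.empty((0, n_steps), dtype=int), np.empty((0,), dtype=int)
    return np.concatenate(X_parts), np.concatenate(y_parts)


X_toy, y_toy = split_rows(toy.values, n_steps=5)
print(X_toy)
print(y_toy)
```

The second toy row is shorter than the window and contributes nothing, which matches the behaviour of the loop-based function.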

In [None]:
#make 10-item seq
n_steps = 10
X, y = splitSequence(tval, n_steps = n_steps)
X_df = pd.DataFrame(X)
y_df = pd.DataFrame(y)
X_df.shape

In [None]:
# save dataframe to csv, reuse later 
X_df['y'] = y_df
X_df.to_csv('seq10.csv')

In [None]:
#make 5-item seq
n_steps = 5
X, y = splitSequence(tval, n_steps = n_steps)
X_df = pd.DataFrame(X)
y_df = pd.DataFrame(y)
X_df.shape


In [None]:
# save dataframe to csv, reuse later
X_df['y'] = y_df
X_df.to_csv('seq5.csv')

Now that the n-item sequence datasets are saved locally, we can load them directly instead of recreating them every time we train the model.
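
Reloading a saved file might look like the sketch below. It writes a tiny demo file with the same layout the cells above produce (an index column, one column per window position, and a final `y` column) and reads it back. The demo values and the file name `seq5_demo.csv` are hypothetical; for the real data, point `pd.read_csv` at `seq5.csv` or `seq10.csv`.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Build a small demo file in the same layout as seq5.csv (values are made up).
n_steps = 5
demo = pd.DataFrame(np.arange(15).reshape(3, n_steps))
demo['y'] = [100, 101, 102]
path = os.path.join(tempfile.mkdtemp(), 'seq5_demo.csv')
demo.to_csv(path)  # the index is written as the first, unnamed column

# Read it back: the first column is the saved index, the last one is the target.
saved = pd.read_csv(path, index_col=0)
X = saved.drop(columns='y').to_numpy()
y = saved['y'].to_numpy()
print(X.shape, y.shape)
```

Passing `index_col=0` keeps the saved row index from being read as an extra input feature.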