<a href="https://colab.research.google.com/github/gkmurphy/PUS2024_GMurphy/blob/ClassDemos/Transformers(Willow).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Time Series Classification with Transformers via Keras
## *The Great Wrong-doing by Willow*

A transformer is a neural network built from stacked attention layers. Each layer performs a series of computations with learned weights and activation functions. The architecture was introduced by Vaswani et al. (2017) in *Attention Is All You Need*, and attention mechanisms are essentially the backbone of ChatGPT: they find relationships between the words in a sentence. More generally, the attention mechanism finds relationships between the elements of a sequence by first converting each element (for example, a word) into a vector of numbers that can then be compared with the others.
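
To make the attention idea concrete, here is a minimal NumPy sketch of scaled dot-product self-attention, the core operation in the paper. The function name `self_attention` and the toy input are illustrative assumptions. The sketch leaves out the learned query/key/value projections and the multiple heads that a real layer such as `layers.MultiHeadAttention` uses.

```python
import numpy as np

def self_attention(x):
    """Scaled dot-product self-attention without learned projections.

    x has shape (sequence_length, features).
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                          # pairwise similarity between elements
    scores = scores - scores.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ x, weights                            # weighted mix of the elements

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))   # toy data: 4 sequence elements, 8 features each
output, weights = self_attention(tokens)
print(output.shape)                # (4, 8)
print(weights.sum(axis=-1))        # each row of attention weights sums to 1
```

Each output row is a weighted average of all input rows, which is how every element gets to "look at" every other element.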

Transformers work very well in natural language processing, and they are so appealing that people want to apply them to everything, including classification problems that the method is not directly suited to.

Positional encoding, which tells the model where each element sits in the sequence, is also useful, but it isn't used for categorical data.
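
Attention by itself ignores order, so a positional encoding adds information about where each element sits in the sequence. Below is a minimal sketch of the sinusoidal encoding described in *Attention Is All You Need*. The function name and the `depth=16` value are illustrative assumptions, and the model built later in this notebook does not add any positional encoding.

```python
import numpy as np

def sinusoidal_positional_encoding(length, depth):
    """Return a (length, depth) array of sinusoidal encodings; depth must be even."""
    positions = np.arange(length)[:, None]        # (length, 1)
    i = np.arange(depth // 2)[None, :]            # (1, depth / 2)
    angles = positions / np.power(10000.0, 2 * i / depth)
    pe = np.zeros((length, depth))
    pe[:, 0::2] = np.sin(angles)                  # even columns use sine
    pe[:, 1::2] = np.cos(angles)                  # odd columns use cosine
    return pe

# 500 matches the length of each FordA series; depth=16 is an arbitrary choice.
pe = sinusoidal_positional_encoding(length=500, depth=16)
print(pe.shape)   # (500, 16)
```

In a language model this array is typically added to the token embeddings before the first attention layer.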

In [1]:
import numpy as np
import keras
import keras_nlp
from keras import layers

In [9]:
def readucr(filename):
    data = np.loadtxt(filename, delimiter="\t")
    y = data[:, 0]
    x = data[:, 1:]
    return x, y.astype(int)


root_url = "https://raw.githubusercontent.com/hfawaz/cd-diagram/master/FordA/" # the FordA data files are hosted here

x_train, y_train = readucr(root_url + "FordA_TRAIN.tsv")
x_test, y_test = readucr(root_url + "FordA_TEST.tsv")

In [10]:
x_train.shape # this dataset is a set of engine recordings: 3601 training samples
np.unique(y_train) # two classes: the engine is good or not good

array([-1,  1])

In [11]:
x_train[...] # Ellipsis (...) means include all dimensions

array([[-0.79717168, -0.66439208, -0.37301463, ..., -0.66439208,
        -1.0737958 , -1.5643427 ],
       [ 0.80485472,  0.63462859,  0.37347448, ..., -0.71488505,
        -0.56044294, -0.31908642],
       [ 0.7279851 ,  0.11128392, -0.49912439, ...,  0.39446303,
         0.33940042,  0.25539062],
       ...,
       [-0.57005428, -0.33316523, -0.29351853, ..., -1.3937145 ,
        -0.94273327, -0.27072168],
       [ 2.0067321 ,  2.0791499 ,  2.0220362 , ..., -0.43214504,
        -0.44123126, -0.28070891],
       [-0.12524091, -0.32536268, -0.48823697, ...,  0.55576053,
         0.57445102,  0.57311598]])

In [12]:
x_train = x_train[:, :, None] # None adds a new trailing axis of length 1
x_test = x_test[..., None] # same operation, written with an Ellipsis

x_train.shape, x_test.shape


((3601, 500, 1), (1320, 500, 1))
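
A small sketch of the indexing trick used above, on a zero-filled array with the same shape as `x_train` (the array names here are placeholders, not part of the notebook). It also shows `np.expand_dims`, which does the same thing as a function call.

```python
import numpy as np

a = np.zeros((3601, 500))         # placeholder with the shape x_train had before the new axis

b = a[:, :, None]                 # explicit slices plus a new trailing axis
c = a[..., None]                  # Ellipsis stands for "all existing axes"
d = np.expand_dims(a, axis=-1)    # the same operation as a function call

print(b.shape, c.shape, d.shape)  # (3601, 500, 1) three times
```

The extra axis is the channel dimension: each time step carries one feature.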

In [13]:
n_classes = len(np.unique(y_train))
n_classes

2

In [14]:
# Shuffle the training set
# len(x_train) is the size of the first axis, so it is the same as x_train.shape[0]
idx = np.random.permutation(len(x_train))


x_train = x_train[idx] # reorder with the random index; the samples themselves are unchanged
y_train = y_train[idx]

y_train[y_train == -1] = 0 # relabel -1 as 0 so the classes are 0 and 1; Keras expects consecutive class indices starting at 0
y_test[y_test == -1] = 0
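
The shuffle only works because the same index array is applied to both the samples and the labels. Here is a toy sketch of that idea (the arrays `x` and `y` below are made up for illustration), followed by the same boolean-mask relabeling.

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.arange(5) * 10                # toy "samples": 0, 10, 20, 30, 40
y = np.array([-1, 1, -1, 1, 1])      # toy labels paired with x by position

idx = rng.permutation(len(x))        # one permutation shared by both arrays
x_shuf, y_shuf = x[idx], y[idx]

# Every shuffled sample still carries its original label.
pairs_before = set(zip(x.tolist(), y.tolist()))
pairs_after = set(zip(x_shuf.tolist(), y_shuf.tolist()))
print(pairs_before == pairs_after)   # True

y_shuf[y_shuf == -1] = 0             # boolean mask: relabel -1 as 0
print(np.unique(y_shuf))             # [0 1]
```

Shuffling `x` and `y` with two separate permutations would silently scramble the labels.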

In [26]:
# Encoder block written by hand; this is what you have to do without the keras_nlp package imported above
def transformer_encoder(inputs):
  x = layers.LayerNormalization(epsilon=1e-6)(inputs) # normalize before attention (pre-norm layout)
  x = layers.MultiHeadAttention(key_dim=8, num_heads=8, dropout=0)(x, x) # self-attention: the same tensor is passed as both query and value
  res = x + inputs # residual connection: add the block input back to preserve the original data throughout the process

  x = layers.LayerNormalization(epsilon=1e-6)(x)
  x = layers.Conv1D(filters=2000, kernel_size = 1, activation= "relu")(x) # position-wise feed-forward layer, expands to 2000 channels
  x = layers.Conv1D(filters=inputs.shape[-1], kernel_size=1)(x) # project back to the input channel count

  return res + x # second residual connection


def build_model(input_shape, n_classes):
  inputs = keras.Input(shape=input_shape)
  x = inputs

  for _ in range(2): # stack two encoder blocks; the underscore is a throwaway loop variable
    x = transformer_encoder(x) # each block takes the previous block's output as its input

  x = layers.GlobalMaxPooling1D(data_format = "channels_first")(x)
  x = layers.Dense(78, activation="relu")(x)
  x = layers.Dense(50, activation = "relu")(x)

  outputs = layers.Dense(n_classes, activation = "softmax")(x)

  model = keras.Model(inputs, outputs) #add the inputs and outputs

  return model
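
The pooling step is easy to misread, so here is a NumPy sketch of what a global max pool does to a tensor shaped like the encoder output. It reflects my understanding that `data_format="channels_first"` treats the axes as (batch, features, steps) and pools over the last one; treat that as an assumption to check against the Keras documentation. The batch size of 32 and the random data are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
encoded = rng.normal(size=(32, 500, 1))   # (batch, 500, 1), like the encoder output

# Assumed "channels_first" behavior: pool over the last axis. That axis has
# length 1 here, so the max simply drops it and all 500 values survive.
pooled_first = encoded.max(axis=2)
print(pooled_first.shape)                 # (32, 500)

# For contrast, "channels_last" would read the axes as (batch, steps, features)
# and pool over the 500 time steps instead.
pooled_last = encoded.max(axis=1)
print(pooled_last.shape)                  # (32, 1)
```

Under this reading, the dense layers in `build_model` receive all 500 time steps as separate features, not a summary of them.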

In [27]:
input_shape = x_train.shape[1:]
model = build_model(input_shape, n_classes)

In [31]:
model.summary()

In [30]:
model.compile(
    loss = "sparse_categorical_crossentropy", # "sparse" means the labels are integer class indices (0, 1), not one-hot vectors
    optimizer = keras.optimizers.Adam(learning_rate=1e-4),
    metrics = ["sparse_categorical_accuracy"]
)
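
In Keras, the `sparse_` prefix on a loss or metric means the labels are integer class indices and not one-hot vectors. The NumPy sketch below, using made-up probabilities, shows that the sparse and one-hot forms of cross-entropy give the same number. It illustrates the arithmetic only and is not the Keras implementation.

```python
import numpy as np

probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.6, 0.4]])       # made-up softmax outputs for 3 samples
labels = np.array([0, 1, 1])         # integer class indices ("sparse" labels)

# Sparse form: pick the predicted probability of the true class directly.
sparse_loss = -np.log(probs[np.arange(len(labels)), labels]).mean()

# One-hot form: encode the labels first, then sum over the classes.
one_hot = np.eye(probs.shape[1])[labels]
dense_loss = -(one_hot * np.log(probs)).sum(axis=1).mean()

print(np.isclose(sparse_loss, dense_loss))   # True
```

The sparse form skips the one-hot step, which is why `y_train` can stay as a plain array of 0s and 1s.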

In [32]:
callback = [
    keras.callbacks.EarlyStopping(patience = 10, restore_best_weights=True) # stop once val_loss has not improved for 10 epochs in a row, then go back to the best weights seen
]
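
The patience rule can be sketched in a few lines of plain Python. This is a simplified illustration, not the Keras implementation (the real callback has more options, such as `min_delta`). The function name and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    Training stops once the loss has failed to beat its best value
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0   # new best: reset the counter
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch                   # out of patience: stop here
    return len(val_losses) - 1, best_epoch                 # ran through every epoch

losses = [0.63, 0.60, 0.58, 0.59, 0.61, 0.60]   # made-up validation losses
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch, best_epoch)   # 5 2
```

With `restore_best_weights=True`, the model would end up with the weights from the best epoch, not those from the epoch where training stopped.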

In [None]:
model.fit(
    x_train,
    y_train,
    validation_split = 0.2, # hold out 20% of the training data to evaluate the model on samples it never trains on
    epochs = 100,
    batch_size = 32, # number of samples the model sees per gradient update; smaller batches give noisier updates, which can help limit overfitting,
                     # but a batch that is too small slows convergence
    callbacks = callback, # apply the early stopping defined above
    verbose = 2  # how much the model prints while it runs; 2 prints one line per epoch, which is cleaner if you save the output to a file
)

Epoch 1/100
90/90 - 509s - 6s/step - loss: 0.6099 - sparse_categorical_accuracy: 0.6660 - val_loss: 0.6343 - val_sparse_categorical_accuracy: 0.6449
Epoch 2/100
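
The `90/90` in the log above can be reproduced by hand from the dataset size, the validation split, and the batch size. The floor rounding for the training portion is my assumption about how the split is computed; the result matches the step count printed in the log.

```python
import math

n_samples = 3601          # rows in x_train
validation_split = 0.2
batch_size = 32

# Assumption: the training portion is the floor of n_samples * (1 - split).
n_train = math.floor(n_samples * (1.0 - validation_split))
n_val = n_samples - n_train
steps_per_epoch = math.ceil(n_train / batch_size)

print(n_train, n_val, steps_per_epoch)   # 2880 721 90
```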


# Why is the model pretty much perfect?
The transformer changes the input data (for example, the signals are no longer sound waves by the time they reach the dense layers), but the transformed values still carry some of the original structure. At that point the dense layers are learning from the transformer's output and not necessarily from the data itself, so in terms of overfitting the outcome would be much the same with or without the transformer.

The moral of the story is that transformers are very new and people don't really understand how to use them yet. The "miracle" of the transformer is that it is the architecture behind GPT-3 and ChatGPT. As you stack more and more transformer layers (more parameters), performance keeps improving. The "miracle" is that they haven't hit the point of diminishing returns yet: there doesn't seem to be a ceiling.

But there is a climate cost: powering and cooling data centers takes massive amounts of water and electricity. There are also privacy concerns.