# Keras

| Date | User | Change Type | Remarks |  
| ---- | ---- | ----------- | ------- |
| 11/12/2024   | Martin | Created   | Started Keras Preprocessing API | 

# Content

* [Sequence Preprocessing](#sequence-preprocessing)
* [Text Preprocessing](#text-preprocessing)

# Keras Preprocessing API

The Keras preprocessing API provides utilities for data processing and data augmentation.

In [2]:
import os
# Logging variables must be set before TensorFlow is imported to take effect
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
os.environ["GRPC_VERBOSITY"] = "ERROR"
os.environ["GLOG_minloglevel"] = "2"

import tensorflow as tf
from tensorflow import keras
import numpy as np
from tensorflow.keras import preprocessing
from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator, pad_sequences, skipgrams, make_sampling_table
from tensorflow.keras.preprocessing.text import text_to_word_sequence, hashing_trick, Tokenizer

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, GRU

## Sequence preprocessing

_Sequence_ data is data where order matters and earlier values influence later ones, such as text or time series.

### Time series generator

`TimeseriesGenerator` takes consecutive data points and turns them into batches of input windows and targets, controlled by time series parameters such as the window `length`, `sampling_rate`, and `stride`.

In [14]:
series = np.array([i for i in range(10)])
print(f'Original data: {series}')

# predict the next value based on the last 5 lagging observations
generator = TimeseriesGenerator(
  data=series,
  targets=series,
  length=5,
  batch_size=1,
  shuffle=False,
  reverse=False
)
print(f"Samples: {len(generator)}")

for i in range(len(generator)):
  x, y = generator[i]
  print(f"{x} => {y}")

Original data: [0 1 2 3 4 5 6 7 8 9]
Samples: 5
[[0 1 2 3 4]] => [5]
[[1 2 3 4 5]] => [6]
[[2 3 4 5 6]] => [7]
[[3 4 5 6 7]] => [8]
[[4 5 6 7 8]] => [9]
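
To make the windowing explicit, the same input/target pairs can be reproduced without Keras. This is a minimal sketch, assuming a window of 5 values and a target one step ahead; `sliding_windows` is a hypothetical helper written for illustration, not part of the Keras API.

```python
def sliding_windows(series, length):
    """Return (window, target) pairs.

    Each window holds `length` consecutive values and the target is the
    value that immediately follows the window.
    """
    return [
        (series[i:i + length], series[i + length])
        for i in range(len(series) - length)
    ]


series = list(range(10))
pairs = sliding_windows(series, length=5)

for window, target in pairs:
    print(f"{window} => {target}")
```

A series of 10 values with a window of 5 gives 10 - 5 = 5 samples, which matches `len(generator)` above.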


In [18]:
# Define model
model = Sequential()
model.add(Dense(10, activation='relu', input_dim=5))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')

# Train model
model.fit(
  generator,
  epochs=10
)

Epoch 1/10
5/5 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - loss: 60.4673
Epoch 2/10
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 52.3983
Epoch 3/10
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 5ms/step - loss: 49.2140
Epoch 4/10
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 51.8469
Epoch 5/10
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 51.6252
Epoch 6/10
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 50.5210
Epoch 7/10
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 44.9877
Epoch 8/10
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step - loss: 37.1479
Epoch 9/10
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 42.7858
Epoch 10/10
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step - loss: 41.5767

<keras.src.callbacks.history.History at 0x7f93f81e9290>

### Padding sequences

Sequences often have different lengths and must be brought to the same length before they can be batched together.
Padding extends the shorter sequences with a filler value (0 by default) so that they match the longest one.

For time series data, padding is usually done at the beginning of the sequence, which is also the default behaviour of `pad_sequences` (`padding='pre'`).

In [None]:
sentences = [["What", "do", "you", "like", "?"],
             ["I", "like", "basket-ball", "!"],
             ["And", "you", "?"],
             ["I", "like", "coconut", "and", "apple"]]

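# Build the vocabulary (set order is arbitrary, so the indices can differ between runs)
text_set = set(np.concatenate(sentences))
# Indices start at 0, which is also the default padding value of pad_sequences
vocab_to_int = dict(zip(text_set, range(len(text_set))))
int_to_vocab = {index: word for word, index in vocab_to_int.items()}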

In [None]:
# Encode the sentences
encoded_sentences = []
for sentence in sentences:
  encoded_sentence = [vocab_to_int[word] for word in sentence]
  encoded_sentences.append(encoded_sentence)

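# Pad the shorter ones (zeros are added at the front by default)
print(encoded_sentences)
pad_sequences(encoded_sentences)
# Optional arguments: maxlen sets the target length, truncating ('pre' or 'post')
# chooses which end to cut when a sequence is longer than maxlen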

[[8, 1, 6, 0, 10], [4, 0, 7, 5], [3, 6, 10], [4, 0, 9, 11, 2]]


array([[ 8,  1,  6,  0, 10],
       [ 0,  4,  0,  7,  5],
       [ 0,  0,  3,  6, 10],
       [ 4,  0,  9, 11,  2]], dtype=int32)
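
The effect of `maxlen`, `padding`, and `truncating` can be sketched in plain Python. The argument names and defaults below follow `pad_sequences`, but `pad` itself is a hypothetical re-implementation for illustration only, and it returns nested lists rather than a NumPy array.

```python
def pad(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Pad or truncate every sequence to the same length (illustrative helper)."""
    if maxlen is None:
        # Default: match the longest sequence
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # "pre" drops items from the front, "post" drops them from the end
            seq = seq[len(seq) - maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # "pre" puts the filler before the data, "post" puts it after
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded


encoded = [[8, 1, 6, 0, 10], [4, 0, 7, 5], [3, 6, 10], [4, 0, 9, 11, 2]]

default_padded = pad(encoded)
short_padded = pad(encoded, maxlen=4, padding="post", truncating="post")
print(default_padded)
print(short_padded)
```

With the defaults the result has the same rows as the `pad_sequences` output above. With `maxlen=4` and both options set to `"post"`, the five-item sentences lose their last item and the three-item sentence gains one trailing zero.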

### Skip-grams

Skip-gram is an unsupervised learning technique in NLP: given a word, it finds the most related words by predicting the context in which that word appears.

`skipgrams` in TensorFlow takes an integer-encoded sequence of words and returns word pairs together with a label for each pair (1 if the two words appear in the same context window, 0 for a randomly sampled negative pair). Each word in turn acts as the target word, and `window_size` sets how many neighbours on each side are paired with it.

In [29]:
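# Encode sentence into integers
sentence = "I like coconut and apple"
encoded_sentence = [vocab_to_int[word] for word in sentence.split()]
# vocabulary_size is the size of the whole vocabulary, not the sentence length
vocabulary_size = len(vocab_to_int)

# Setup skipgram (negative_samples=0 keeps only the positive pairs)
# Pairs are shuffled by default, so their order varies between runs
# Index 0 is treated as a non-word and skipped, so the word encoded as 0
# ("like" in this run) does not appear in any pair
pairs, labels = skipgrams(
  encoded_sentence,
  vocabulary_size,
  window_size=1,
  negative_samples=0
)

# Print the relevancy as (target -> context) -> label
for (target, context), label in zip(pairs, labels):
  print(f"({int_to_vocab[target]} -> {int_to_vocab[context]}) -> {label}")

(apple -> and) -> 1
(and -> coconut) -> 1
(coconut -> and) -> 1
(and -> apple) -> 1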
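
The pairing rule itself can be sketched in plain Python, working on the words directly. `positive_pairs` is a hypothetical helper, not the Keras implementation; it leaves out negative sampling, shuffling, and the special handling of index 0.

```python
def positive_pairs(tokens, window_size=1):
    """Pair every token with its neighbours inside the window."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        for j in range(start, end):
            if j != i:  # a word is never paired with itself
                pairs.append((target, tokens[j]))
    return pairs


tokens = "I like coconut and apple".split()
pairs = positive_pairs(tokens, window_size=1)

for target, context in pairs:
    print(f"({target} -> {context})")
```

With `window_size=1` each inner word produces two pairs and each edge word produces one, so five words give eight pairs in total.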


---

## Text preprocessing

Text must be encoded as numbers before it can be fed to a model, typically as integer token ids.

### Split text to word sequence

`text_to_word_sequence` transforms a string into a list of words (tokens). It can optionally convert the text to lowercase and remove punctuation.

In [3]:
sentence = "I like coconut, I like apple"
text_to_word_sequence(sentence, lower=False)

['I', 'like', 'coconut', 'I', 'like', 'apple']

In [4]:
# An empty filter string keeps punctuation attached to the words
text_to_word_sequence(sentence, lower=True, filters="")

['i', 'like', 'coconut,', 'i', 'like', 'apple']
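
What the two calls above do can be approximated in plain Python. This is a rough sketch and not the Keras source: `to_word_sequence` is a hypothetical helper, and using `string.punctuation` as the default filter is an assumption that differs slightly from the Keras default filter string.

```python
import string


def to_word_sequence(text, filters=string.punctuation, lower=True, split=" "):
    """Split text into tokens after optional lowercasing and filtering."""
    if lower:
        text = text.lower()
    # Replace every filtered character with the separator, then split on it
    table = str.maketrans({char: split for char in filters})
    return [token for token in text.translate(table).split(split) if token]


sentence = "I like coconut, I like apple"
kept_case = to_word_sequence(sentence, lower=False)
kept_punctuation = to_word_sequence(sentence, lower=True, filters="")
print(kept_case)
print(kept_punctuation)
```

Replacing filtered characters with the separator, rather than deleting them, means that input such as `"coconut,apple"` would still split into two tokens.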

### Tokeniser

`Tokenizer` converts strings or paragraphs into individual tokens based on the configuration specified.

Inputs:

* `num_words`: max number of words to keep, based on frequency
* `filters`: characters to filter out
* `lower`: boolean, whether to convert the text to lower case
* `split`: separator for word splitting

In [5]:
sentences = [
  ["What", "do", "you", "like", "?"],
  ["I", "like", "basket-ball", "!"],
  ["And", "you", "?"],
  ["I", "like", "coconut", "and", "apple"]
]

In [7]:
# Create the tokenizer
t = Tokenizer()

# fit tokenizer on documents
t.fit_on_texts(sentences)

In [9]:
# Tokenizer contains useful information in metadata
## Count of each word across all documents
print(t.word_counts)

## Number of documents
print(t.document_count)

## Unique integer index of each word, ordered by frequency (1 = most frequent)
print(t.word_index)

## Number of documents (in this case lists) that each word appears in
print(t.word_docs)

OrderedDict([('what', 1), ('do', 1), ('you', 2), ('like', 3), ('?', 2), ('i', 2), ('basket-ball', 1), ('!', 1), ('and', 2), ('coconut', 1), ('apple', 1)])
4
{'like': 1, 'you': 2, '?': 3, 'i': 4, 'and': 5, 'what': 6, 'do': 7, 'basket-ball': 8, '!': 9, 'coconut': 10, 'apple': 11}
defaultdict(<class 'int'>, {'?': 2, 'do': 1, 'what': 1, 'like': 3, 'you': 2, 'basket-ball': 1, 'i': 2, '!': 1, 'and': 2, 'apple': 1, 'coconut': 1})
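
How `word_counts` and `word_index` relate can be sketched with `collections.Counter`. This is an approximation of the bookkeeping, assuming input that is already split into tokens; `build_word_index` is a hypothetical helper and not the Keras implementation.

```python
from collections import Counter


def build_word_index(documents):
    """Count words and assign indices by descending frequency, starting at 1."""
    word_counts = Counter()
    for doc in documents:
        word_counts.update(word.lower() for word in doc)
    # sorted() is stable, so words with equal counts keep first-seen order
    ranked = sorted(word_counts.items(), key=lambda item: item[1], reverse=True)
    word_index = {word: rank for rank, (word, _) in enumerate(ranked, start=1)}
    return word_counts, word_index


sentences = [
    ["What", "do", "you", "like", "?"],
    ["I", "like", "basket-ball", "!"],
    ["And", "you", "?"],
    ["I", "like", "coconut", "and", "apple"],
]

word_counts, word_index = build_word_index(sentences)
# Encode the last sentence with the learned indices
encoded = [word_index[word.lower()] for word in sentences[3]]
print(word_index)
print(encoded)
```

Index 0 is never assigned, which leaves it free to act as the padding value when the encoded sentences are later padded.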

