# Small Language Models
This notebook explores several neural network architectures for language modeling tasks. The models are trained on an artificially generated language. The task is the same for every model: predict the probability of the next word $Y$ given the previous words $X$ in a sentence.

$$
P(Y | X) 
$$
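
To make the notation concrete, the sketch below estimates $P(Y | X)$ by simple counting on a few sentences, using only the previous word as the context $X$. This is an illustrative simplification and not one of the neural models built later; the toy corpus and the names `follow_counts` and `next_word_probs` are made up for this example.

```python
from collections import Counter, defaultdict

# Toy corpus (hypothetical example sentences)
corpus = [
    "the cat sits slowly",
    "the dog runs quickly",
    "the cat runs slowly",
]

# Count how often each word follows a given previous word
follow_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words[:-1], words[1:]):
        follow_counts[prev][nxt] += 1

def next_word_probs(prev):
    """Relative frequencies of the words observed after `prev`."""
    counts = follow_counts[prev]
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

# 'cat' follows 'the' in 2 of 3 cases, 'dog' in 1 of 3
print(next_word_probs("the"))
```

A neural language model plays the same role as `next_word_probs`, but it conditions on the whole preceding sequence and generalizes to contexts it has not seen.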

## Background and Motivation
Training language models on real-world language data often requires extensive computational resources typically found in data centers. To explore the fundamental concepts of language modeling in a more accessible manner, we will use an artificially constructed language. This approach allows us to control complexity, understand the underlying mechanics, and ensure resource efficiency.

## Objective
Understand, implement, and compare different neural network architectures (RNN, LSTM, 1DCNN, Transformer) in language modeling tasks using an artificially generated language.

## Tasks
1. **Artificial Language Generation for Training**: Generate an initial dataset of sentences using a simple, rule-based grammar (see below).
2. **Tokenization**: Convert these sentences into sequences of integers.
3. **Model Building**: Implement at least the first three models: a basic RNN (required), an LSTM (required), a 1DCNN (required), and a Transformer (optional).
4. **Training**: Train each model on the artificial language dataset and evaluate it on the validation set during training. Show the learning curves of each model and compare them.

You can use the starter code provided below to get started. Feel free to modify the code as you see fit. The architecture of the models is up to you but can be simple. For example, the RNN might look like this:

| Layer (type)            | Output Shape      | Param # |
|-------------------------|-------------------|---------|
| embedding_1 (Embedding) | (1, 35, 8)        | 208     |
| simple_rnn (SimpleRNN)  | (1, 50)           | 2,950   |
| dense (Dense)           | (1, 26)           | 1,275   |

It is important to choose an appropriate loss function for the next predicted token. In the above example, the vocabulary size is 25.
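
For a one-hot target, a common choice is the categorical cross-entropy, which reduces to the negative log of the probability the model assigns to the correct token. The sketch below computes it by hand for a made-up vocabulary of 4 tokens; the function name and the numbers are illustrative only. In Keras this would correspond to `loss="categorical_crossentropy"`, or `"sparse_categorical_crossentropy"` if the targets are kept as integer indices.

```python
import math

def categorical_cross_entropy(one_hot_target, predicted_probs):
    """Cross-entropy between a one-hot target and a predicted distribution."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, predicted_probs))

# Hypothetical example: 4 tokens, the correct next token has index 2
target = [0.0, 0.0, 1.0, 0.0]
probs = [0.1, 0.2, 0.6, 0.1]   # e.g. the softmax output of some model

loss = categorical_cross_entropy(target, probs)
print(loss)  # equals -log(0.6), only the correct token contributes
```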

In [5]:
################# Keras with JAX Backend ################
##### We use the JAX backend for Keras
import jax 
print(jax.__version__)

##### We set the KERAS_BACKEND environment variable to "jax"
import os
os.environ["KERAS_BACKEND"] = "jax"

##### We import Keras
import keras
print(keras.__version__)

0.4.26
3.6.0


## Strange Language

We use the following procedure to generate samples from the language:

In [7]:
import random
# Define the grammar rules
expanded_grammar = {
    'S': [['NP', 'VP'], ['S', 'CONJ', 'S']],
    'NP': [['Det', 'ADJ', 'N'], ['Det', 'N'], ['Det', 'N', 'PP']],
    'VP': [['V', 'NP'], ['V', 'NP', 'PP'], ['V', 'ADV']],
    'PP': [['P', 'NP']],
    'Det': ['a', 'the'],
    'N': ['cat', 'dog', 'bird', 'tree'],
    'V': ['sits', 'runs', 'flies', 'jumps'],
    'ADJ': ['big', 'small', 'quick', 'lazy'],
    'ADV': ['quickly', 'slowly', 'carefully'],
    'P': ['on', 'in', 'under', 'over'],
    'CONJ': ['and', 'but', 'or']
}

def generate_sentence(symbol, depth=0, max_depth=5):
    if depth > max_depth:  # Limiting recursion depth
        return ''

    if symbol not in expanded_grammar:
        return symbol

    expansion = random.choice(expanded_grammar[symbol])
    if isinstance(expansion, list):
        return ' '.join([generate_sentence(sym, depth+1, max_depth) for sym in expansion]).strip()
    else:
        return expansion

### Generation of Training and Test Data 

Here we generate the training and test data for the language. 

In [8]:
import random
import numpy as np
# Set seed for reproducibility
random.seed(42)
np.random.seed(42)

# Generating sentences
num_train = 1000                        # Number of training examples
num_test  = 2000                        # Number of test examples
num_sentences = num_train + num_test    # Number of sentences to generate
sentences = [generate_sentence('S') for _ in range(num_sentences)]

# Sample some sentences
for i in range(5):
    print(f"{i} {sentences[i]}")

max_length = max(len(sentence.split()) for sentence in sentences)

0 the small dog sits a tree on a small cat
1 the dog under a lazy bird runs the big cat on the bird
2 the cat on the bird on runs slowly
3 a lazy bird flies the quick dog on a dog over
4 the cat on the bird runs the dog over but a dog flies carefully


### Tokenization

In the Tokenization step, we convert the input text into integer numbers. 

In [9]:
import numpy as np
import jax.numpy as jnp
from keras.layers import TextVectorization

# Define TextVectorization layer
vectorize_layer = TextVectorization(
    max_tokens=40, # Maximum vocabulary size (will be less)
    output_mode="int"
)

# Adapt the vectorization layer
vectorize_layer.adapt(sentences)

# Vectorize sentences (output is JAX-compatible)
sequences = vectorize_layer(sentences)
print(sequences[:4])

vocab_size = len(vectorize_layer.get_vocabulary())
print(vocab_size)


# Get the frequency of each word
#from collections import Counter
#word_counts = Counter()
#for sentence in sentences:
#    word_counts.update(sentence.split())

[[ 2 18  7 13  3  4 15  3 18  6  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0]
 [ 2  7 16  3 22  5 12  2 20  6 15  2  5  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0]
 [ 2  6 15  2  5 15 12 24  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0]
 [ 3 22  5 14  2 19  7 15  3  7 17  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0]]
26
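
Conceptually, the layer builds a vocabulary sorted by word frequency, reserves index 0 for padding and index 1 for unknown words, and pads every sequence with zeros to a common length. The pure-Python sketch below mimics that behaviour on a tiny made-up corpus. `build_vocab` and `encode` are hypothetical helpers, and details such as punctuation handling and tie-breaking between equally frequent words may differ from what `TextVectorization` does.

```python
from collections import Counter

def build_vocab(texts):
    """Index 0 = padding, 1 = unknown, then words by descending frequency."""
    counts = Counter(word for text in texts for word in text.lower().split())
    words = [w for w, _ in counts.most_common()]
    return {w: i for i, w in enumerate(["", "[UNK]"] + words)}

def encode(texts, vocab):
    """Map words to integers and right-pad with 0 to the longest sequence."""
    seqs = [[vocab.get(w, 1) for w in t.lower().split()] for t in texts]
    width = max(len(s) for s in seqs)
    return [s + [0] * (width - len(s)) for s in seqs]

toy_sentences = ["the cat sits", "the dog runs slowly", "the cat runs"]
toy_vocab = build_vocab(toy_sentences)
toy_sequences = encode(toy_sentences, toy_vocab)

print(toy_vocab)
print(toy_sequences)
```

As in the output above, the most frequent word receives the smallest non-reserved index, which is why `the` maps to 2.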


### Padding

Because the sentences vary in length, the tokenization step appends zeros so that all sequences have the same length. However, we want every sequence to end with a real token, because the last token is the prediction target. We therefore move the zeros from the end to the beginning of each sequence.

In [10]:
# Function to shift padding tokens to the beginning
def move_padding_to_front(sequence, padding_value=0):
    non_padding = sequence[sequence != padding_value]
    padding = sequence[sequence == padding_value]
    return jnp.concatenate([padding, non_padding])

# Apply the function to all sequences
adjusted_sequences = jnp.array([move_padding_to_front(seq) for seq in sequences])

# Print adjusted sequences
print(adjusted_sequences[:4])
del sequences

[[ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  2 18  7 13  3  4 15  3 18  6]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  2
   7 16  3 22  5 12  2 20  6 15  2  5]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  2  6 15  2  5 15 12 24]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  3 22  5 14  2 19  7 15  3  7 17]]


### Training and Test Data
The first `num_train` sentences are used as training data and the remaining ones as test data. In both cases, the aim is to predict the last token of the sentence from all preceding tokens.


In [13]:
# Prepare X and y for training
X = adjusted_sequences[:num_train, :-1]
Y = keras.utils.to_categorical(adjusted_sequences[:num_train, -1], num_classes=vocab_size)

# Prepare X_test and y_test
X_test = adjusted_sequences[num_train:, :-1]
y_test = keras.utils.to_categorical(adjusted_sequences[num_train:, -1], num_classes=vocab_size)

# Changes end here
# The first sentence
print(f"The first sentence :\'{sentences[0]}\'")
print(f"The first sentence after tokenization and padding :\n{adjusted_sequences[0]}")
print(f"The first sentence X for training (note the last token is missing):\n{X[0]}")
print(f"The first sentence Y for training (note that this is one-hot-encoded):\n{Y[0]}") 

print(f"Shape of X: {X.shape}")
print(f"Shape of Y: {Y.shape}")

The first sentence :'the small dog sits a tree on a small cat'
The first sentence after tokenization and padding :
[ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  2 18  7 13  3  4 15  3 18  6]
The first sentence X for training (note the last token is missing):
[ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  2 18  7 13  3  4 15  3 18]
The first sentence Y for training (note that this is one-hot-encoded):
[0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0.]
Shape of X: (1000, 35)
Shape of Y: (1000, 26)


### Shapes and Embeddings
`X` is a sparse representation in the code. Instead of one-hot encoding the tokens, we use the token index, i.e. every word / token corresponds to an integer value. This is a more efficient representation of the data, but conceptually it is the same.

`Y` is one-hot encoded and has the shape (Batch, Vocabulary). It is the one-hot encoding of the next token. Note that we use only the **last word** of each sentence as the target. This keeps the code simple but is less efficient. In practice, implementations also use the next word at every position $t=1..T$ as a target.
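
The alternative mentioned above, using the next word at every position as a target, amounts to shifting the sequence by one step. A minimal sketch with a made-up token sequence (the variable names are illustrative only):

```python
# Token ids of one (hypothetical) sentence, without padding
tokens = [2, 18, 7, 13, 3, 4]

# Last-token setup used in this notebook: one target per sentence
x_last, y_last = tokens[:-1], tokens[-1]

# All-positions setup: the target at step t is the token at step t + 1
x_all, y_all = tokens[:-1], tokens[1:]

for t, target in enumerate(y_all, start=1):
    print(f"t={t}: context {x_all[:t]} -> target {target}")
```

With the all-positions setup, a single sentence of length $T+1$ yields $T$ training signals instead of one, which is where the efficiency gain comes from.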

#### Embedding layer
The first step, from $X(B,T,V)$ to $X(B,T,C)$, is the embedding layer. Conceptually, this is a simple matrix multiplication of the one-hot encoded tokens with an embedding matrix. The embedding size `C` is a hyperparameter.
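
The NumPy sketch below illustrates this equivalence on toy sizes: multiplying one-hot vectors with an embedding matrix gives the same result as looking up rows by token index. The matrix `E` is filled with random numbers here, whereas in the model it is learned; all names and sizes are assumptions for the example.

```python
import numpy as np

V, C = 6, 3                      # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
E = rng.normal(size=(V, C))      # embedding matrix, one row per token

token_ids = np.array([[0, 2, 5], [1, 1, 4]])   # shape (B, T)
one_hot = np.eye(V)[token_ids]                 # shape (B, T, V)

via_matmul = one_hot @ E         # (B, T, V) @ (V, C) -> (B, T, C)
via_lookup = E[token_ids]        # direct row lookup by token index

print(via_matmul.shape)
print(np.allclose(via_matmul, via_lookup))
```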

In [None]:
from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN, Dense

EMB_DIM = 8  # Embedding dimension (a.k.a. C)
# Define the model
emb_model = Sequential()
emb_model.add(Embedding(vocab_size, EMB_DIM))
### 
# <---- Add your model here ---->
###

# Predict embeddings for the input data
embeddings = emb_model.predict(X)  # X is already prepared as input sequences
print(f"Shape {embeddings.shape} (Batch, Time, Embedding)")

32/32 ━━━━━━━━━━━━━━━━━━━━ 0s 2ms/step
Shape (1000, 35, 8) (Batch, Time, Embedding)


Starting from the embedding layer above, we can now apply the RNN, LSTM, 1DCNN, or Transformer.

#### RNN
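A basic RNN reads the embedded sequence one time step at a time and carries a hidden state forward; only the final state is passed to the output layer. The NumPy sketch below spells out this recurrence with randomly initialised weights. The sizes follow the example table above (embedding dimension 8, 50 hidden units); the variable names are illustrative, and this is a conceptual sketch rather than the internal implementation of the Keras `SimpleRNN` layer.

```python
import numpy as np

T, C, H = 35, 8, 50                      # time steps, embedding dim, hidden units
rng = np.random.default_rng(0)

x = rng.normal(size=(T, C))              # one embedded sentence, shape (T, C)
W_x = rng.normal(size=(C, H)) * 0.1      # input-to-hidden weights
W_h = rng.normal(size=(H, H)) * 0.1      # hidden-to-hidden weights
b = np.zeros(H)                          # bias

h = np.zeros(H)                          # initial hidden state
for t in range(T):
    # The new state depends on the current input and the previous state
    h = np.tanh(x[t] @ W_x + h @ W_h + b)

n_params = W_x.size + W_h.size + b.size  # 8*50 + 50*50 + 50
print(h.shape, n_params)
```

The parameter count of this recurrence, 2,950, matches the `simple_rnn` row of the example table.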

#### LSTM
Adapt the code from the basic RNN and replace the `SimpleRNN` layer with an `LSTM` layer.

#### 1DCNN
Start the same way as for the RNN and LSTM. Replace the `LSTM` layer with several `Conv1D` layers as shown in the lecture slides. The model should return a prediction only for the last time step. There are two ways to do this.

If you use the Sequential API:
```python
cnn_model.add(Lambda(lambda x: x[:, -1, :]))
```

If you use the Functional API:
```python
output_layer = Dense(vocab_size, activation='softmax')(x[:, -1, :]) # x is the output of the last Conv1D layer
```
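
Both variants rely on the same slice: the convolution stack returns one vector per time step, and only the vector at the last position is kept. The NumPy sketch below shows the effect on the shapes, using a random array as a stand-in for the output of the last `Conv1D` layer (the sizes are made up for the example):

```python
import numpy as np

B, T, F = 4, 35, 16                       # batch, time steps, filters (toy sizes)
conv_out = np.random.default_rng(0).normal(size=(B, T, F))

last_step = conv_out[:, -1, :]            # keep only the final time step
print(conv_out.shape, "->", last_step.shape)
```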


#### Transformer
Here we use a single attention block consisting of the following layers. As for the CNN, only the output at the last time step should be used for the prediction.

```python

###### Attention Block
MultiHeadAttention #(B,T,C)
LayerNormalization #(B,T,C)
Dense              #(B,T,C)    
##### 
# At the end we add a Dense layer to get the output
Dense              #(B,V)
```
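
At the core of the `MultiHeadAttention` layer is scaled dot-product self-attention. The NumPy sketch below shows a single head for one sentence with random projection matrices. It leaves out multiple heads, the output projection, residual connections and masking, so it is a simplification under these assumptions and not a reproduction of the Keras layer; all names and sizes are chosen for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

T, C = 5, 8                                   # time steps and embedding dim (toy sizes)
rng = np.random.default_rng(0)
x = rng.normal(size=(T, C))                   # embedded sentence, shape (T, C)

W_q, W_k, W_v = (rng.normal(size=(C, C)) for _ in range(3))
q, k, v = x @ W_q, x @ W_k, x @ W_v           # queries, keys, values, each (T, C)

weights = softmax(q @ k.T / np.sqrt(C))       # (T, T), each row sums to 1
attended = weights @ v                        # (T, C), weighted mix of the values

last_step = attended[-1]                      # vector that would feed the final Dense layer
print(weights.shape, attended.shape, last_step.shape)
```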

# ----------- End of Starter Code ------------