# Small Language Models

This notebook explores several neural network architectures for language modeling tasks. The models are trained on an artificially generated language. 

## Background and Motivation
Training language models on real-world language data often requires extensive computational resources typically found in data centers. To explore the fundamental concepts of language modeling in a more accessible manner, we will use an artificially constructed language. This approach allows us to control complexity, understand the underlying mechanics, and ensure resource efficiency.

## Objective
Understand, implement, and compare different neural network architectures (RNN, LSTM, 1D CNN, Transformer) in language modeling tasks using an artificially generated language.

## Tasks
1. **Artificial Language Generation for Training**: Generate an initial dataset of sentences using a simple, rule-based grammar (see below).
2. **Tokenization**: Convert these sentences into sequences of integers.
3. **Model Building**: Implement at least the first three models: a basic RNN (required), an LSTM (required), and a 1D CNN (required). A Transformer is optional.
4. **Training**: Train each model on the artificial language dataset. Evaluate each model on the validation set during training, plot its learning curve, and compare the curves across models.

You can use the starter code provided below to get started. Feel free to modify the code as you see fit. The architecture of the models is up to you and can be simple. For example, an RNN might look like this:

| Layer (type)            | Output Shape      | Param # |
|-------------------------|-------------------|---------|
| embedding_1 (Embedding) | (1, 35, 8)        | 200     |
| simple_rnn (SimpleRNN)  | (1, 50)           | 2,950   |
| dense (Dense)           | (1, 25)           | 1,275   |

It is important to choose an appropriate loss function for predicting the next token. In the example above, the vocabulary size is 25.
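As a sanity check, the parameter counts in the table above and the loss for a single next-token prediction can be worked out by hand. The sketch below uses plain Python and assumes the sizes implied by the table (vocabulary 25, embedding dimension 8, 50 recurrent units). The function `cross_entropy` is a made-up helper for illustration, not a Keras API.

```python
import math

# Sizes implied by the example table (assumed for illustration)
vocab_size = 25   # number of token ids, including the padding id 0
emb_dim = 8       # embedding dimension
rnn_units = 50    # hidden units in the SimpleRNN layer

# Parameter counts, layer by layer
embedding_params = vocab_size * emb_dim              # one vector per token
rnn_params = rnn_units * (emb_dim + rnn_units + 1)   # input + recurrent weights + bias
dense_params = rnn_units * vocab_size + vocab_size   # weights + bias
print(embedding_params, rnn_params, dense_params)    # 200 2950 1275

# Categorical cross-entropy for one prediction: the negative log of the
# probability the model assigns to the true next token.
def cross_entropy(probs, target_id):
    return -math.log(probs[target_id])

uniform = [1.0 / vocab_size] * vocab_size  # a model that guesses uniformly
print(round(cross_entropy(uniform, target_id=5), 3))  # 3.219
```

A validation loss near $\ln(25) \approx 3.22$ therefore means the model is doing no better than uniform guessing. In Keras, `categorical_crossentropy` expects one-hot targets, while `sparse_categorical_crossentropy` expects integer token ids.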

In [1]:
import random
# Define the grammar rules
expanded_grammar = {
    'S': [['NP', 'VP'], ['S', 'CONJ', 'S']],
    'NP': [['Det', 'ADJ', 'N'], ['Det', 'N'], ['Det', 'N', 'PP']],
    'VP': [['V', 'NP'], ['V', 'NP', 'PP'], ['V', 'ADV']],
    'PP': [['P', 'NP']],
    'Det': ['a', 'the'],
    'N': ['cat', 'dog', 'bird', 'tree'],
    'V': ['sits', 'runs', 'flies', 'jumps'],
    'ADJ': ['big', 'small', 'quick', 'lazy'],
    'ADV': ['quickly', 'slowly', 'carefully'],
    'P': ['on', 'in', 'under', 'over'],
    'CONJ': ['and', 'but', 'or']
}

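def generate_sentence(symbol, depth=0, max_depth=5):
    """Recursively expand `symbol` into a sentence using `expanded_grammar`."""
    if depth > max_depth:  # Limit recursion depth; deeper symbols are dropped
        return ''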

    if symbol not in expanded_grammar:
        return symbol

    expansion = random.choice(expanded_grammar[symbol])
    if isinstance(expansion, list):
        return ' '.join([generate_sentence(sym, depth+1, max_depth) for sym in expansion]).strip()
    else:
        return expansion

### Preprocessing the data (Tokenization and Padding)

In this first step, we will tokenize the data and pad it to the maximum length of the sentences. 
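To make the two steps concrete, here is a minimal pure-Python sketch of what tokenization and pre-padding do. It only mimics Keras's `Tokenizer` and `pad_sequences` under simplifying assumptions: words are split on whitespace, ids are assigned by descending word frequency starting at 1, and id 0 is reserved for padding. The helper names `build_word_index` and `pad_pre` and the toy sentences are made up for illustration.

```python
from collections import Counter

def build_word_index(sentences):
    """Map each word to an integer id; more frequent words get smaller ids."""
    counts = Counter(word for s in sentences for word in s.split())
    # ids start at 1 because 0 is reserved for padding
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def pad_pre(seq, max_length, pad_id=0):
    """Left-pad a sequence with pad_id up to max_length."""
    return [pad_id] * (max_length - len(seq)) + seq

toy_sentences = ["the cat sits", "the dog runs quickly"]
toy_index = build_word_index(toy_sentences)
toy_sequences = [[toy_index[w] for w in s.split()] for s in toy_sentences]
toy_max_length = max(len(s) for s in toy_sequences)
toy_padded = [pad_pre(s, toy_max_length) for s in toy_sequences]

print(toy_index["the"])  # 1 (the most frequent word)
print(toy_padded[0])     # [0, 1, 2, 3]
```

Padding at the front (`padding='pre'`) keeps the last real token in the final position of every row, which is what lets the notebook take the target from the last column.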

In [2]:
random.seed(42) # For reproducibility
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

# Generate the dataset of sentences
num_train = 1000                        # Number of training examples
num_test  = 2000                        # Number of test examples
num_sentences = num_train + num_test    # Number of sentences to generate
sentences = [generate_sentence('S') for _ in range(num_sentences)]

# Creating tokens from sentences
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)

# Preparing data for model input 
vocab_size = len(tokenizer.word_index) + 1  # Plus 1 for padding

# Pad all sequences to the same length
max_length = max(len(s) for s in sequences)
sequences = pad_sequences(sequences, maxlen=max_length, padding='pre')

# Prepare X, y for training
X = sequences[0:num_train,:-1]
y = to_categorical(sequences[0:num_train,-1], num_classes=vocab_size)

# Prepare X, y for testing
X_test = sequences[num_train:,:-1]
y_test = to_categorical(sequences[num_train:,-1], num_classes=vocab_size)

# The first sentence
print(f"The first sentence: '{sentences[0]}'")
print(f"The first sentence after tokenization and padding:\n{sequences[0]}")
print(f"The first sentence X for training (note the last token is missing):\n{X[0]}")
print(f"The first sentence y for training (note that this is one-hot encoded):\n{y[0]}")

T = X.shape[1]  # Length of input sequence
V = y.shape[1]  # Vocabulary size
print(f"Shape of X (Batch, T={T}, V={V})")
print(f"Shape of X (in code): {X.shape}, shape of y: {y.shape}")

The first sentence: 'the small dog sits a tree on a small cat'
The first sentence after tokenization and padding:
[ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  1 17  6 12  2  3 14  2 17  5]
The first sentence X for training (note the last token is missing):
[ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  1 17  6 12  2  3 14  2 17]
The first sentence y for training (note that this is one-hot encoded):
[0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0.]
Shape of X (Batch, T=35, V=25)
Shape of X (in code): (1000, 35), shape of y: (1000, 25)


We have the following shapes:

**X**
Note that `X` is stored as a sparse representation in the code: instead of one-hot encoding the tokens, we store each token's index, so `X` has shape (Batch, T) rather than (Batch, T, V). This is a more efficient representation of the data, but conceptually it is the same.

**y**
`y` has shape (Batch, V). It is the one-hot encoding of the next token. Note that here we use only the **last word** of each sentence as the prediction target. This keeps the code simple but is less efficient. In practice, the tokens at all positions $t=1..T$ are typically used as targets.
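The difference between the two target schemes can be sketched in plain Python. The token ids below are those of the first sentence printed above, with the padding omitted. The helper names `last_token_pair` and `all_positions_pair` are hypothetical, and the second scheme is not used in this notebook.

```python
def last_token_pair(seq):
    """Scheme used in this notebook: predict only the final token."""
    return seq[:-1], seq[-1]

def all_positions_pair(seq):
    """Alternative scheme: at every position t, predict the token at t+1."""
    return seq[:-1], seq[1:]

def one_hot(token_id, vocab_size):
    """One-hot encode a single token id."""
    return [1.0 if i == token_id else 0.0 for i in range(vocab_size)]

seq = [1, 17, 6, 12, 2, 3, 14, 2, 17, 5]  # first sentence, padding omitted

x_last, y_last = last_token_pair(seq)
print(y_last)                          # 5
print(one_hot(y_last, 25).index(1.0))  # 5

x_all, y_all = all_positions_pair(seq)
print(len(x_all), len(y_all))          # 9 9
print(y_all[:3])                       # [17, 6, 12]
```

With the second scheme, one sentence yields a training signal at every position instead of a single one per sentence. With padded inputs, the padded positions would typically be masked out of the loss.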


#### Embedding layer
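The first step, from $X(B,T,V)$ to $X(B,T,C)$, is the embedding layer. Conceptually, this is a simple matrix multiplication of each one-hot token vector with a $V \times C$ weight matrix; in practice it is implemented as a row lookup. The size of the embedding `C` is a hyperparameter.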
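The equivalence between the matrix multiplication and a row lookup can be checked on a toy example. This is a plain-Python sketch with an arbitrary 4 x 2 matrix (V=4, C=2). The helpers `one_hot` and `vec_mat` are made up for illustration and are not how Keras implements `Embedding`.

```python
def one_hot(token_id, vocab_size):
    """One-hot encode a single token id."""
    return [1.0 if i == token_id else 0.0 for i in range(vocab_size)]

def vec_mat(vec, mat):
    """Multiply a row vector (length V) by a matrix (V x C)."""
    n_cols = len(mat[0])
    return [sum(vec[i] * mat[i][j] for i in range(len(vec))) for j in range(n_cols)]

# Toy embedding matrix with V=4 tokens and C=2 dimensions (values are arbitrary)
E = [[0.0, 0.0],
     [0.1, 0.2],
     [0.3, 0.4],
     [0.5, 0.6]]

token_id = 2
via_matmul = vec_mat(one_hot(token_id, vocab_size=4), E)  # one-hot times matrix
via_lookup = E[token_id]                                  # direct row lookup

print(via_matmul == via_lookup)  # True
```

Because the one-hot vector has a single 1, the product simply picks out one row of the matrix. This is why the index representation of `X` is sufficient.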

In [3]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense

EMB_DIM = 8  # Embedding dimension (a.k.a. C)
emb_model = Sequential()
emb_model.add(Embedding(vocab_size, EMB_DIM)) 
print(f"Shape {emb_model.predict(X).shape}")

[1m32/32[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 459us/step
Shape (1000, 35, 8)


### Further Hints
#### Basic RNN

In the basic RNN model, you can use the `SimpleRNN` layer. The beginning of the code is given below.

In [None]:
# RNN Model
rnn_model = Sequential()
rnn_model.add(Embedding(vocab_size, EMB_DIM))           # Embedding layer
##### YOUR CODE HERE #####

#### LSTM
Adapt the code from the basic RNN and replace the `SimpleRNN` layer with an `LSTM` layer.

#### 1DCNN
Start the same way as for the RNN and LSTM. Replace the `LSTM` layer with several `Conv1D` layers, as shown in the lecture slides. The model should return only the output at the last time step. This can be achieved with a `Lambda` layer (imported from `tensorflow.keras.layers`):
```python
cnn_model.add(Lambda(lambda x: x[:, -1, :]))
```

#### Attention
Here we use a single Attention block. The code is given below. 

```python
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, MultiHeadAttention, Dense, LayerNormalization, Dropout, Embedding


MAX_LEN = T
FF_DIM = 2 * EMB_DIM  # Hidden layer size in the feedforward network (after attention)
NUM_HEADS = 2 # Number of attention heads

# Input layer
input_layer = Input(shape=(MAX_LEN,))
embedding_layer = Embedding(vocab_size, EMB_DIM)(input_layer)
pos_encoding_layer = embedding_layer  # PositionalEncoding(EMB_DIM)(embedding_layer)

attention_output = MultiHeadAttention(num_heads=NUM_HEADS, key_dim=EMB_DIM)(pos_encoding_layer, pos_encoding_layer)
attention_output = Dropout(0.1)(attention_output)
attention_output = LayerNormalization(epsilon=1e-6)(attention_output + pos_encoding_layer)

feed_forward_output = Dense(FF_DIM, activation='relu')(attention_output)
feed_forward_output = Dense(EMB_DIM)(feed_forward_output)
feed_forward_output = Dropout(0.1)(feed_forward_output)
feed_forward_output = LayerNormalization(epsilon=1e-6)(feed_forward_output + attention_output)

# The feed-forward output is still a tensor of shape (batch_size, T, EMB_DIM).
# We only need the last position, so we select it with [:,-1,:]
output_layer = Dense(vocab_size, activation='softmax')(feed_forward_output[:,-1,:])



# Create the model
transformer_model = Model(inputs=input_layer, outputs=output_layer)
# Compile the model
transformer_model.compile(loss='categorical_crossentropy', optimizer='adam')

transformer_model.predict(X[1].reshape(1,-1))  # Predict on the second sentence, X[1] (needs to be called)
```

How does this code differ from the diagram in the lecture slides below?

![Attention](trafo.png)
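For intuition about the core operation inside the `MultiHeadAttention` layer used in the code above, here is a minimal single-head sketch of scaled dot-product attention in plain Python. It is a simplification: it omits the learned linear projections for queries, keys, values, and output that the Keras layer applies, and it uses no masking. The toy vectors and the helper names `softmax`, `dot`, and `attention` are made up for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    """Scaled dot-product attention for one head, without masking."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy token vectors (T=3, C=2); in self-attention, queries, keys,
# and values all come from the same sequence
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = attention(x, x, x)

print(len(out), len(out[0]))      # 3 2
print(round(sum(weights[0]), 6))  # 1.0
```

Each output row has the same dimension as the input vectors, and each row of attention weights sums to 1. This is why the output of the attention block can be added back to its input in the residual connection.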


# ----------- End of Starter Code ------------