# Short lecture on "Basics of Neural Language Model"

**Lecturer: Prof. Kosuke Takano, Kanagawa Institute of Technology**

This short lecture introduces the basics of neural language models along with simple Python code. Large Language Models (LLMs) such as OpenAI's ChatGPT and Google's Gemini are dramatically changing our lives and society with their remarkable human-like capabilities; however, their mechanism is not so complicated. This lecture focuses on the basic components used to build an LLM and explains how they work in a neural network architecture. Students will write small pieces of code for the basic functions that make up neural networks for natural language processing, and deepen their understanding of the principles.

## Content

Day 1:
* Basics of neural networks
* Word embedding
* Sequential neural model for Natural Language Processing

Day 2:
* Sequential neural model for Natural Language Processing (Cont.)
* Transformer
* Conversation application by GPT

## Requirement
* PC and Internet connection
* Google Colaboratory ... Google account is required

## Execution environment

Python programs are very version sensitive. Since the execution environment of Colaboratory is updated at Google's discretion, we need to check it.<br>
Python: 3.10.12 (February 27, 2024)<br>
TensorFlow: 2.15.0 (February 27, 2024)

Be sure to specify GPU or TPU as the runtime type.

In [None]:
!python -V

Python 3.10.12


In [None]:
import tensorflow as tf

print(tf.__version__)

2.15.0


# Part-5

## Neural machine translation

* Translation function realized using neural network
* In 2014, a sequence-to-sequence model using RNN was devised and put into practical use.
* Transformer was invented in 2017 and contributes to improving the performance of machine translation.

## Sequence to sequence model

* Given input sequence data, a sequence-to-sequence (seq-to-seq) model outputs another sequence.
* Application: Neural translation, text generation, etc.
* A sequence to sequence model is also called an encoder/decoder model because it (1) encodes the input series data, and (2) decodes the encoded result to output the series data.
* The encoded result is called a semantic vector.
* Since the semantic vector has a fixed length, learning becomes difficult as the length of the input sequence data increases.
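
The fixed-length property in the last bullet can be illustrated with a minimal sketch. It assumes a simple tanh RNN encoder with random, untrained weights; the function name `encode`, the sizes, and the random inputs are illustrative stand-ins and are not part of the lecture code. A 3-word input and a 50-word input are both compressed into a vector of the same size.

```python
import numpy as np

rng = np.random.default_rng(0)
wordvec_size, hidden_size = 8, 5   # illustrative sizes (assumption)

# Random, untrained weights: for illustration only
Wx = rng.standard_normal((wordvec_size, hidden_size))
Wh = rng.standard_normal((hidden_size, hidden_size))
b = np.zeros(hidden_size)

def encode(sequence):
    """Run a simple tanh RNN over the sequence and return the last hidden state."""
    h = np.zeros(hidden_size)
    for x in sequence:
        h = np.tanh(x @ Wx + h @ Wh + b)
    return h  # the "semantic vector"

short_seq = rng.standard_normal((3, wordvec_size))    # stand-in for 3 word vectors
long_seq = rng.standard_normal((50, wordvec_size))    # stand-in for 50 word vectors

# Both inputs end up in a vector of length hidden_size
print(encode(short_seq).shape)
print(encode(long_seq).shape)
```

However long the input is, all of its information has to fit into `hidden_size` numbers, which is why long inputs are harder to learn.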

<center>
<img src='https://drive.google.com/uc?export=view&id=1xnshTq3kThH13CRLV1vbEuRmOvGJ5KAC' width='60%'>
</center>
<center>
Figure 1. Sequence to sequence model
</center>


## Applying a sequence to sequence model of RNN for machine translation

* Input the text to be translated as series data, and output the translated text as series data.
 * I like cat. You like dog. → ฉัน ชอบ แมว คุณ ชอบ สุนัข
 * I like cat. You like dog. → 私は猫が好きです。あなたは犬が好きです。

<center>
<img src='https://drive.google.com/uc?export=view&id=1JBOuHVL_NuonIraFS1MtGkhdtJm-4rhO' width='70%'>
</center>
<center>
Figure 2. Basic architecture of a sequence to sequence model for machine translation
</center>


## Attention
* Introduced by Bahdanau, Cho, and Bengio for neural machine translation (2014).
* Mechanism to focus on specific features of input data (attention) and emphasize them.
* Contributes to improving the performance of sequence-to-sequence models.
* Also functions as an important component in Transformers.


## Self-attention

* Adjust the sequence data to emphasize the elements to be focused on within the same input sequence.

### **Code example**

In [None]:
!wget http://mattmahoney.net/dc/text8.zip
!unzip text8.zip

In [None]:
from gensim.models.word2vec import Word2Vec, Text8Corpus

sentences = Text8Corpus('text8')
model = Word2Vec(sentences, vector_size=100)

model.save('model.bin')

In [None]:
model = Word2Vec.load('model.bin')

In [None]:
text = "I book a room at the hotel."

In [None]:
text = text.lower() # lowercase
text = text.replace('.', ' .') # separate period
words = text.split(' ') # Split words by white space

First, we create a self-attention weight matrix

In [None]:
import numpy as np

# Creating a self-attention weight matrix
a = np.array([])
for w1 in words:
  for w2 in words:
    try:
      score = model.wv.similarity(w1, w2)
    except KeyError:
      # Tokens missing from the word2vec vocabulary (e.g. punctuation) get a score of 0
      score = 0

    #print(w1, w2, score)
    a = np.append(a, score)

Then, we draw a heat map of the self-attention weight

In [None]:
from matplotlib import pyplot as plt
import seaborn as sns

length = len(words)

attention_matrix = a.reshape(length, length)
feature_names = words
# Make a heat map of self-attention weights
sns.heatmap(attention_matrix, annot=True,
            xticklabels=feature_names,
            yticklabels=feature_names)

# Draw a graph
plt.show()

### **Practice 5-1**
Draw a heat map of the self-attention weight for the following English sentence.
<br><br>
Sentence:<br>
I cut oranges with a knife.

## Attention in a sequence to sequence model with RNN
* The concatenated outputs of each cell for the input sequence form the sequence of semantic vectors.
* When computing the input to each decoder cell, the context vector is generated by considering which of the semantic vectors to focus (attend) on.
* Even when the input sequence data is long, the accuracy remains high.

<center>
<img src='https://drive.google.com/uc?export=view&id=1Logb1lxDG7YCZ2ndITEHo6AVHtV-OzAZ' width='70%'>
</center>
<center>
Figure 3. Attention in a sequence to sequence model with RNN
</center>

## Architecture of an RNN-based sequence to sequence model with attention
Figure 4 shows the architecture of an RNN-based sequence to sequence model with attention, in which an attention layer, as shown in Figure 3, is added to the original architecture. In addition, the encoder outputs a sequence of semantic vectors, which the decoder uses to calculate attention over the input sequence.

<center>
<img src='https://drive.google.com/uc?export=view&id=1E4EdUJluX2Tad0beRfcdtXA3n6qpifXY' width='70%'>
</center>
<center>
Figure 4. Architecture of an RNN-based sequence to sequence model with attention
</center>

## Creating context vector in attention calculation

The context vector is created in the attention calculation by the following steps.

Step-1: For the output $\mathbf{h}'_i$ of a cell at the decoder, calculate the inner product with each semantic vector in the semantic vector sequence $[\mathbf{h}_1, \mathbf{h}_2, \cdots, \mathbf{h}_n]$ to obtain the weight vector $[a_1 , a_2, \cdots, a_n]$.

$$ \mathbf{a} = [a_1 , a_2, \cdots, a_n] = [\mathbf{h}_1, \mathbf{h}_2, \cdots, \mathbf{h}_n] \cdot \mathbf{h}'_i \tag{1}$$

Step-2: Normalize $[a_1, a_2, \cdots, a_n]$ applying softmax so that the sum is 1, and create the normalized weight vector $[a'_1, a'_2, \cdots , a'_n]$

$$ [a'_1, a'_2, \cdots , a'_n] = softmax([a_1 , a_2, \cdots, a_n]) \tag{2}$$

Step-3: Calculate the weighted sum of the semantic vectors in the semantic vector sequence, using the normalized weights, to obtain the context vector $\mathbf{c}_i$.

$$ \mathbf{c}_i = a'_1 \mathbf{h}_1 + a'_2 \mathbf{h}_2 + \cdots + a'_n \mathbf{h}_n = \sum^n_{k=1}a'_k \mathbf{h}_k \tag{3}$$
<br>

<center>
<img src='https://drive.google.com/uc?export=view&id=1TXp0poDkllbu3sFjBzvJxbKfjSThmPyB' width='70%'>
</center>
<center>
Figure 5. Creation of context vector in attention layer
</center>
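
Steps 1 to 3 can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions: the function names `softmax` and `context_vector` are hypothetical, and the semantic vectors `H` and the decoder output `h_dec` are random stand-ins rather than outputs of a trained model.

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(a - np.max(a))
    return e / e.sum()

def context_vector(H, h_dec):
    """H: (n, hidden) semantic vectors, h_dec: (hidden,) decoder cell output."""
    a = H @ h_dec          # Step-1: inner product with each semantic vector
    a_norm = softmax(a)    # Step-2: normalize so that the weights sum to 1
    c = a_norm @ H         # Step-3: weighted sum of the semantic vectors
    return c, a_norm

rng = np.random.default_rng(0)
H = rng.standard_normal((7, 5))    # stand-in for 7 semantic vectors of size 5
h_dec = rng.standard_normal(5)     # stand-in for one decoder output

c, weights = context_vector(H, h_dec)
print(weights.sum())   # the normalized weights sum to 1 (up to rounding)
print(c.shape)         # same size as one semantic vector
```

Because the weights are non-negative and sum to 1, the context vector is a weighted average of the semantic vectors, which keeps its scale comparable to that of a single semantic vector.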

### Code example

First, we calculate each output $\mathbf{h}_j$ of the left encoder in a pseudo manner, and create a sequence of semantic vectors using the example input sentence as follows.
<br><br>
Sentence:<br>
I book a room at the hotel.

Let's define RNN_cell0(x, Wx, b) and RNN_cell(x, o, Wx, Wo, b) again.

In [None]:
import numpy as np

wordvec_size = 100
hidden_size = 5

Wx = np.random.randn(wordvec_size, hidden_size)
Wo = np.random.randn(hidden_size, hidden_size)
b = np.zeros(hidden_size)

In [None]:
def RNN_cell0(x, Wx, b):

  _o = np.dot(x, Wx) + b
  o = np.tanh(_o)

  return o

In [None]:
def RNN_cell(x, o, Wx, Wo, b):

  _o = np.dot(o, Wo) + np.dot(x, Wx) + b
  o = np.tanh(_o)

  return o

Load pre-trained word2vec model.

In [None]:
model = Word2Vec.load('model.bin')

In [None]:
x1 = model.wv["i"]
x2 = model.wv["book"]
x3 = model.wv["a"]
x4 = model.wv["room"]
x5 = model.wv["at"]
x6 = model.wv["the"]
x7 = model.wv["hotel"]

We calculate semantic vectors.

In [None]:
h1 = RNN_cell0(x1, Wx, b)
h2 = RNN_cell(x2, h1, Wx, Wo, b)
h3 = RNN_cell(x3, h2, Wx, Wo, b)
h4 = RNN_cell(x4, h3, Wx, Wo, b)
h5 = RNN_cell(x5, h4, Wx, Wo, b)
h6 = RNN_cell(x6, h5, Wx, Wo, b)
h7 = RNN_cell(x7, h6, Wx, Wo, b)

In [None]:
print(h1)
print(h2)

Suppose we enter the sentence "Dinner at the restaurant is my favorite." into the decoder on the right. In this example, we compute the output $\mathbf{hd}_1$ of the first decoder cell in a pseudo manner.

In [None]:
# "Dinner at the restaurant is my favorite."

y1 = model.wv["dinner"]
hd1 = RNN_cell0(y1, Wx, b)

We generate a context vector that pays attention to "dinner". First, calculate the weights.

In [None]:
a1 = np.dot(h1, hd1)
a2 = np.dot(h2, hd1)
a3 = np.dot(h3, hd1)
a4 = np.dot(h4, hd1)
a5 = np.dot(h5, hd1)
a6 = np.dot(h6, hd1)
a7 = np.dot(h7, hd1)

Then, generate a context vector for "dinner" by computing a weighted sum. In this example, softmax normalization is not applied to the weight values.

In [None]:
c1 = a1 * h1 + a2 * h2 + a3 * h3 + a4 * h4 + a5 * h5 + a6 * h6 + a7 * h7

In [None]:
print(c1)
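
The seven cell calls and the seven dot products above can also be written with a loop and matrix products. The following is a sketch, not the lecture's implementation: `rnn_outputs` is a hypothetical helper, and random vectors stand in for the word2vec vectors so that the block runs on its own. Starting from a zero state makes the first step equivalent to `RNN_cell0`. As in the example above, softmax is not applied to the weights.

```python
import numpy as np

rng = np.random.default_rng(0)
wordvec_size, hidden_size = 100, 5

Wx = rng.standard_normal((wordvec_size, hidden_size))
Wo = rng.standard_normal((hidden_size, hidden_size))
b = np.zeros(hidden_size)

def rnn_outputs(xs, Wx, Wo, b):
    """Return the output of every RNN cell, stacked into shape (len(xs), hidden_size)."""
    o = np.zeros(Wo.shape[0])      # zero initial state: first step equals RNN_cell0
    outputs = []
    for x in xs:
        o = np.tanh(o @ Wo + x @ Wx + b)
        outputs.append(o)
    return np.stack(outputs)

xs = rng.standard_normal((7, wordvec_size))   # stand-ins for the 7 encoder word vectors
H = rnn_outputs(xs, Wx, Wo, b)                # rows play the role of h1 ... h7

y1 = rng.standard_normal(wordvec_size)        # stand-in for the vector of "dinner"
hd1 = np.tanh(y1 @ Wx + b)                    # first decoder output

a = H @ hd1    # a[k] is the dot product of H[k] and hd1 (unnormalized weights)
c1 = a @ H     # weighted sum of the rows of H
print(c1)
```

With the real word2vec vectors in place of `xs` and `y1`, `c1` would be computed the same way as in the cell above.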

### Practice 5-2
Generate a context vector that pays attention to "restaurant". Please use RNN_cell(x, o, Wx, Wo, b) for calculating outputs $\mathbf{hd}_2$, $\mathbf{hd}_3$, and so on.

In [None]:
# "Dinner at the restaurant is my favorite."

# at
y2 = model.wv["at"]
hd2 = RNN_cell(y2, hd1, Wx, Wo, b)

# the
y3 = model.wv["the"]
hd3 = RNN_cell(y3, hd2, Wx, Wo, b)

# restaurant
y4 = model.wv["restaurant"]
hd4 = RNN_cell(y4, hd3, Wx, Wo, b)

In [None]:
# Calculate attention weight for "restaurant"
a1 = np.dot(h1, hd4)
a2 = np.dot(h2, hd4)
a3 = np.dot(h3, hd4)
a4 = np.dot(h4, hd4)
a5 = np.dot(h5, hd4)
a6 = np.dot(h6, hd4)
a7 = np.dot(h7, hd4)

In [None]:
# Calculate context vector for "restaurant"
c4 = a1 * h1 + a2 * h2 + a3 * h3 + a4 * h4 + a5 * h5 + a6 * h6 + a7 * h7

# Part-6

## Transformer

* Proposed by Vaswani et al. in 2017
* Although it was proposed as a machine translation model, it is also widely used in natural language processing and image processing.
* BLEU score of 28.4 with English-German translation
 * BLEU score: score to evaluate the accuracy of machine translation
* Does not have a sequential structure like RNN, and can be accelerated by parallel calculation
* Forms the basis of deep learning models such as BERT and GPT-n
* Vision Transformer (ViT) is an example of application to image processing.

## Architecture of Transformer
* Multi-head attention: Attention mechanism based on a scaled dot-product calculation that takes query, key, and value as input. The structure that runs this calculation in parallel for different contexts is called multi-head.
* Masked multi-head attention: An attention mechanism that prevents the model from referring to subsequent words.
* Positional encoding: Embedded information about the position of a word (Vaswani et al. proposed a calculation method using sine and cosine functions)
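
The masking idea in the second bullet can be sketched as follows. This is one common way to build such a mask, shown here under assumptions: the attention scores are random stand-ins, and the variable names are illustrative. Scores for subsequent positions are set to negative infinity before softmax, so their weights become 0.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

seq_length = 4
rng = np.random.default_rng(0)
scores = rng.standard_normal((seq_length, seq_length))   # stand-in attention scores

# True above the diagonal: position j > i is a "subsequent word" for position i
future = np.triu(np.ones((seq_length, seq_length), dtype=bool), k=1)

masked_scores = np.where(future, -np.inf, scores)   # hide subsequent positions
weights = softmax(masked_scores)                    # each row still sums to 1

print(np.round(weights, 2))   # the upper triangle is all zeros
```

Row i of `weights` only distributes attention over positions 0 to i, so the first word can attend only to itself.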

<center>
<img src='https://drive.google.com/uc?export=view&id=1_lq0sXwIOjnzYbm4MZYzg3Tr1V6-xw71' width='50%'>
</center>
<center>
Figure 6. Architecture of Transformer (Vaswani, A. et al., Attention Is All You Need, 2017)
</center>



## Positional encoding
* Encoding processing that gives positional information to each word (token) in a sentence

$$ PE_{(pos,2i)} = \sin(pos/10000^{2i/d_{model}}) \tag{4}$$
$$ PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d_{model}}) \tag{5}$$


<center>
<img src='https://drive.google.com/uc?export=view&id=1SzCdieqFTyHQjzh-Hk3G0jvCv-fJRU7n' width='40%'>
</center>
<center>
Figure 8. Positional encoding
</center>


### **Code example**

Define a positional-encoding function positional_encoding(pos, i, dim).

In [None]:
import numpy as np

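def positional_encoding(pos, i, dim):
  if i % 2 == 0:
    # Even index i = 2k: Equation (4), exponent 2k/dim = i/dim
    return np.sin(pos/10000**(i/dim))
  else:
    # Odd index i = 2k+1: Equation (5), exponent 2k/dim = (i-1)/dim
    return np.cos(pos/10000**((i-1)/dim))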

Let's generate the first (word) positional information in a 100-dimensional vector.

In [None]:
for i in range(0, 100):
  print(positional_encoding(1, i, 100))

### **Practice 6-1**
Generate the second and third positional information in 100-dimensional vector.

### **Code example**
Store the positional information up to the 30th in an array (30 positions x 100 dimensional vector). Then visualize it.

In [None]:
pe = np.zeros((30, 100))
for i in range(0,30):
  for j in range(0, 100):
    pe[i][j] = positional_encoding(i, j, 100)

print(pe)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(pe)
plt.show()
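
The same table can also be computed without Python loops. The sketch below follows Equations (4) and (5) directly; the function name `positional_encoding_matrix` is hypothetical, and the code assumes that `dim` is even.

```python
import numpy as np

def positional_encoding_matrix(num_positions, dim):
    """Return an array of shape (num_positions, dim); dim is assumed to be even."""
    pos = np.arange(num_positions)[:, np.newaxis]    # shape (num_positions, 1)
    two_i = np.arange(0, dim, 2)[np.newaxis, :]      # even indices 2i, shape (1, dim/2)
    angles = pos / np.power(10000.0, two_i / dim)    # pos / 10000^(2i/dim)

    pe = np.zeros((num_positions, dim))
    pe[:, 0::2] = np.sin(angles)   # even columns: Equation (4)
    pe[:, 1::2] = np.cos(angles)   # odd columns: Equation (5)
    return pe

pe_matrix = positional_encoding_matrix(30, 100)
print(pe_matrix.shape)
```

At position 0 every sine entry is 0 and every cosine entry is 1, which is a quick sanity check for the result.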

Embed positional encoding into each word of "I book a room at the hotel."

In [None]:
xe1 = x1 + pe[0] # I
xe2 = x2 + pe[1] # book

print(xe1.shape)
print(xe2.shape)

### **Practice 6-2**
Generate vectors with positional encoding embedded for the remaining words "a", "room", "at", "the", and "hotel".

## Multi-head attention

* Transform input data in different contexts by linear layers and process in parallel with each scaled dot-product attention
* The output is generated by horizontally concatenating the vector outputs of each attention.

<center>
<img src='https://drive.google.com/uc?export=view&id=1cpMJTclA31kwMsLZN_Jv19cvXjp8ETEN' width='30%'>
</center>
<center>
Figure 7. Multi-head attention
</center>
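
The two bullets above can be sketched in NumPy as follows. This is a minimal sketch with random, untrained projection matrices; the names, the sizes, and the number of heads are illustrative assumptions. Each head applies the scaled dot-product attention of Equation (9) to its own projection of the input, and the head outputs are concatenated. The final linear layer `Wo` follows the original Transformer paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
seq_length, input_dim = 3, 8     # 3 words, 8-dimensional vectors (illustrative)
num_heads, head_dim = 2, 4       # 2 heads of size 4 (illustrative)

x = rng.standard_normal((seq_length, input_dim))   # stand-in for word vectors

head_outputs = []
for _ in range(num_heads):
    # Each head has its own linear projections for query, key, and value
    Wq = rng.standard_normal((input_dim, head_dim))
    Wk = rng.standard_normal((input_dim, head_dim))
    Wv = rng.standard_normal((input_dim, head_dim))
    head_outputs.append(scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv))

concat = np.concatenate(head_outputs, axis=-1)                # (3, num_heads * head_dim)
Wo = rng.standard_normal((num_heads * head_dim, input_dim))   # final linear layer
output = concat @ Wo

print(concat.shape, output.shape)
```

The loop is written sequentially for readability; since the heads do not depend on each other, they can be computed in parallel.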




## Scaled Dot-Product Attention

* Attention mechanism by Query, Key, Value
* Normalized (scaled) by the size of the vector
* Positional information of words is added to the input word vectors by positional encoding




$$ Attention(Q,K,V) = softmax\left(\frac{QK^T}{\sqrt{d_k}}\right)V \tag{9}$$

Step-1: Matrix product of $\mathbf{Q}$ and $\mathbf{K}^T$ <br>
Step-2: Normalization with $\sqrt{d_k}$ <br>
Step-3: Applying softmax function <br>
Step-4: Multiply $\mathbf{V}$


<center>
<img src='https://drive.google.com/uc?export=view&id=1OIk3G99JJNBquU5k8aNTXwkNTSQ4SoFA' width='70%'>
</center>
<center>
Figure 9. Scaled dot-product attention
</center>


### Code example

For now, to simplify the discussion, we will use randomly generated vectors. Later, we will use word2vec vectors.

In [None]:
import numpy as np

batch_size = 1 # 1 sentence e.g. I like dog
seq_length = 3 # 3 words
input_dim = 4 # Vector dimensions for word  *word2vec uses 100-dimensional vectors

x = np.random.randn(batch_size, seq_length, input_dim)
print (x)

We use pytorch in this example.

In [None]:
import os
import random
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

Convert numpy array to pytorch tensor.

In [None]:
x = torch.from_numpy(x.astype(np.float32)).clone()

# Check the dimension.
d_k = x.size()[-1]
print(d_k)

Step-1: Matrix product of $\mathbf{Q}$ and $\mathbf{K}^T$

We calculate the matrix product of $\mathbf{Q}$ and $\mathbf{K}^T$ in the numerator.

In [None]:
q = x
k = x
v = x

In [None]:
print(q)
print(k)
print(v)

In [None]:
k.transpose(-2, -1)

In [None]:
attn_logits = torch.matmul(q, k.transpose(-2, -1))

Step-2: Normalization with $\sqrt{d_k}$

In [None]:
attn_logits = attn_logits / math.sqrt(d_k)

print(attn_logits)

Step-3: Applying softmax function

In [None]:
attention_weights = F.softmax(attn_logits, dim=-1)

print(attention_weights)

Step-4: Multiply $\mathbf{V}$

The resulting values are the context vectors produced by attention.

In [None]:
context_vectors = torch.matmul(attention_weights, v)

print(context_vectors)

### **Practice 6-3**
Write a function attention(q, k, v) that returns context vectors and attention weights.

In [None]:
def attention(q, k, v):
  d_k = q.size()[-1]
  attn_logits = torch.matmul(q, k.transpose(-2, -1))
  attn_logits = attn_logits / math.sqrt(d_k)
  attention_weights = F.softmax(attn_logits, dim=-1)
  context_vectors = torch.matmul(attention_weights, v)

  return context_vectors, attention_weights

In [None]:
values, att = attention (q, k, v)

print(values)
print(att)

### Code example
The linear layer plays the role of a projection function. That is, the linear layer outputs projected vectors for the input vectors $\mathbf{q}, \mathbf{k}, \mathbf{v}$.

Create projection spaces (q_proj, k_proj, v_proj) for each of q, k, v. Here, the number of dimensions of the projection space is embed_dim = 4.

In [None]:
embed_dim = 4

q_proj = nn.Linear(input_dim, embed_dim)

In [None]:
q = q_proj(x)
print (q)

### **Practice 6-4**

1. Similarly, make functions k_proj(k) and v_proj(v).
2. Then get the projected vectors for k and v.

In [None]:
k_proj = nn.Linear(input_dim, embed_dim)
v_proj = nn.Linear(input_dim, embed_dim)

k = k_proj(x)
print (k)

v = v_proj(x)
print (v)
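
To see what such a linear layer computes, the projection can be reproduced by hand in NumPy. This is a sketch with random stand-in weights: `W` and `bias` are hypothetical values, not the weights of `q_proj`. The only fact taken from PyTorch is that `nn.Linear` stores its weight with shape (out_features, in_features) and computes $\mathbf{x}W^T + \mathbf{b}$.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_length, input_dim, embed_dim = 3, 4, 4

x = rng.standard_normal((seq_length, input_dim))   # stand-in for 3 word vectors

# Random stand-ins for the layer parameters
W = rng.standard_normal((embed_dim, input_dim))    # shape (out_features, in_features)
bias = rng.standard_normal(embed_dim)

# Projection of every word vector at once: x W^T + b
q = x @ W.T + bias

print(q.shape)   # one projected vector of size embed_dim per word
```

Each row of `q` is the projection of one word vector, so the same matrix `W` is applied to every position in the sentence.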

### Code example

Let's use word vectors obtained by word2vec.

In [None]:
x1 = model.wv['i']
x2 = model.wv['like']
x3 = model.wv['dog']

print(x1)
print(x2)
print(x3)

Embed a vector of positional encoding before projection.

In [None]:
x1_pe = x1 + pe[0] # I
x2_pe = x2 + pe[1] # like
x3_pe = x3 + pe[2] # dog

In [None]:
x_pe = np.array([x1_pe, x2_pe, x3_pe])

x_pe = torch.from_numpy(x_pe.astype(np.float32)).clone()
print(x_pe)

In [None]:
input_dim = 100
embed_dim = 30

q_proj = nn.Linear(input_dim, embed_dim)
k_proj = nn.Linear(input_dim, embed_dim)
v_proj = nn.Linear(input_dim, embed_dim)

In [None]:
q = q_proj(x_pe)
k = k_proj(x_pe)
v = v_proj(x_pe)

In [None]:
context_vectors, attention_weights = attention (q, k, v)

print(context_vectors, attention_weights)

## Applying Transformer encoder for sentiment analysis

The Transformer encoder alone can be applied to NLP tasks such as sentiment analysis and document classification.

### **Code example**

Let's classify movie review texts in the IMDB dataset using the Transformer encoder.

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class TransformerEncoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim)
        self.dense_proj = keras.Sequential(
            [layers.Dense(dense_dim, activation="relu"),
             layers.Dense(embed_dim),]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()

    def call(self, inputs, mask=None):
        if mask is not None:
            mask = mask[:, tf.newaxis, :]
        attention_output = self.attention(
            inputs, inputs, attention_mask=mask)
        proj_input = self.layernorm_1(inputs + attention_output)
        proj_output = self.dense_proj(proj_input)
        return self.layernorm_2(proj_input + proj_output)

    def get_config(self):
        config = super().get_config()
        config.update({
            "embed_dim": self.embed_dim,
            "num_heads": self.num_heads,
            "dense_dim": self.dense_dim,
        })
        return config
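
The step `self.layernorm_1(inputs + attention_output)` combines a residual connection with layer normalization. The following NumPy sketch illustrates that combination under stated assumptions: `layer_norm` is a hypothetical helper, the learnable scale and offset of the Keras layer are omitted, `eps` is an assumed small constant, and the inputs are random stand-ins.

```python
import numpy as np

def layer_norm(x, eps=1e-3):
    """Normalize each vector (last axis) to zero mean and roughly unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
inputs = rng.standard_normal((3, 8))             # stand-in for 3 token vectors
attention_output = rng.standard_normal((3, 8))   # stand-in for the attention output

# Residual connection (addition) followed by layer normalization
proj_input = layer_norm(inputs + attention_output)

print(proj_input.mean(axis=-1))   # close to 0 for every token
print(proj_input.std(axis=-1))    # close to 1 for every token
```

The residual addition lets the original token information pass through unchanged, while normalization keeps the scale of each token vector stable.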

In [None]:
vocab_size = 20000 # Number of words
embed_dim = 256 # Dimension of embedding
num_heads = 2
dense_dim = 32

inputs = keras.Input(shape=(None,), dtype="int64")
x = layers.Embedding(vocab_size, embed_dim)(inputs)
x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
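
The parameter counts reported by `model.summary()` can be cross-checked by hand. The sketch below is arithmetic only. It assumes the usual parameter layout of these layers: each of the query, key, and value projections maps `embed_dim` to `num_heads * key_dim` with a bias, the attention output projection maps back to `embed_dim` with a bias, and each layer normalization has one scale and one offset per dimension. The variable names are illustrative.

```python
vocab_size = 20000
embed_dim = 256
num_heads = 2
dense_dim = 32
key_dim = embed_dim   # the model above passes key_dim=embed_dim

# Embedding: one embed_dim vector per vocabulary entry
embedding = vocab_size * embed_dim

# Multi-head attention: Q, K, V projections (weights + biases) and output projection
qkv = 3 * (embed_dim * num_heads * key_dim + num_heads * key_dim)
attn_out = num_heads * key_dim * embed_dim + embed_dim
attention = qkv + attn_out

# Feed-forward part: Dense(dense_dim) followed by Dense(embed_dim)
dense_proj = (embed_dim * dense_dim + dense_dim) + (dense_dim * embed_dim + embed_dim)

# Two layer normalizations, each with a scale and an offset vector
layer_norms = 2 * (2 * embed_dim)

encoder = attention + dense_proj + layer_norms
classifier = embed_dim * 1 + 1   # final Dense(1)
total = embedding + encoder + classifier

print(f"embedding:  {embedding:,}")
print(f"encoder:    {encoder:,}")
print(f"classifier: {classifier:,}")
print(f"total:      {total:,}")
```

Most of the parameters sit in the embedding layer, which is worth keeping in mind when comparing model sizes.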

In [None]:
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz
!rm -r aclImdb/train/unsup

In [None]:
import os, pathlib, shutil, random
from tensorflow import keras
batch_size = 32
base_dir = pathlib.Path("aclImdb")
val_dir = base_dir / "val"
train_dir = base_dir / "train"
for category in ("neg", "pos"):
    os.makedirs(val_dir / category)
    files = os.listdir(train_dir / category)
    random.Random(1337).shuffle(files)
    num_val_samples = int(0.2 * len(files))
    val_files = files[-num_val_samples:]
    for fname in val_files:
        shutil.move(train_dir / category / fname,
                    val_dir / category / fname)

train_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/train", batch_size=batch_size
)
val_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/val", batch_size=batch_size
)
test_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/test", batch_size=batch_size
)
text_only_train_ds = train_ds.map(lambda x, y: x)

In [None]:
from tensorflow.keras import layers

max_length = 600
max_tokens = 20000
text_vectorization = layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=max_length,
)
text_vectorization.adapt(text_only_train_ds)

int_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
int_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
int_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)

In [None]:
# Print only the first batch; iterating over the whole dataset would print every batch
for item in int_train_ds.take(1):
  print(item)

In [None]:
model.fit(int_train_ds, validation_data=int_val_ds, epochs=3)

In [None]:
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

### **Practice 6-5**
* Increase the number of epochs to 10 in the Transformer encoder and check if the classification accuracy is improved. (If it takes longer to execute, you can reduce the number of epochs.)
* In addition to SimpleRNN, LSTM, and GRU, which we checked last time, compare and discuss the classification accuracy of four models including Transformer encoder. Furthermore, let's compare by also focusing on the number of model parameters.



## Reference
* Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, https://arxiv.org/abs/1810.04805v1, 2018.
* Keras official Website, https://keras.io/examples/