# Short lecture on "Basics of Neural Language Model"

**Lecturer: Prof. Kosuke Takano, Kanagawa Institute of Technology**

This short lecture introduces the basics of neural language models along with simple Python code. Large Language Models (LLMs) such as OpenAI's ChatGPT and Google's Gemini are dramatically changing our lives and society with their remarkable human-like capabilities; however, their underlying mechanism is not so complicated. This lecture focuses on the basic components used to build an LLM and explains how they work in a neural network architecture. Students will write small pieces of code for the basic functions that make up neural networks for natural language processing, and deepen their understanding of the principles.

## Content

Day 1:
* Basics of neural networks
* Word embedding
* Sequential neural model for Natural Language Processing

Day 2:
* Sequential neural model for Natural Language Processing (Cont.)
* Transformer
* Conversation application by GPT

## Requirement
* PC and Internet connection
* Google Colaboratory ... Google account is required

## Execution environment

Python programs are very version sensitive. Since the execution environment of Colaboratory is updated at Google's discretion, we need to check the versions.<br>
Python: 3.10.12 (February 27, 2024)<br>
TensorFlow: 2.15.0 (February 27, 2024)

Be sure to specify GPU or TPU as the runtime type.

In [None]:
!python -V

Python 3.10.12


In [None]:
import tensorflow as tf

print(tf.__version__)

2.15.0


# Part-5

## Neural machine translation

* Translation function realized using neural network
* In 2014, a sequence-to-sequence model using RNN was devised and put into practical use.
* Transformer was invented in 2017 and contributes to improving the performance of machine translation.

## Sequence to sequence model

* Given input sequence data, a sequence-to-sequence (seq-to-seq) model outputs another sequence of data.
* Application: Neural translation, text generation, etc.
* A sequence to sequence model is also called an encoder/decoder model because it (1) encodes the input sequence data, and (2) decodes the encoded result to output the sequence data.
* The encoded result is called a semantic vector.
* Since the semantic vector has a fixed length, learning becomes difficult as the length of the input sequence data increases.
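
The fixed-length bottleneck described in the last bullet can be illustrated with a minimal NumPy sketch. The function `encode`, the random weights, and the sizes are assumptions for illustration only; the encoder here is an untrained tanh RNN, not the model shown in Figure 1.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 8, 5          # illustrative sizes (assumptions)
Wx = rng.standard_normal((input_size, hidden_size))
Wh = rng.standard_normal((hidden_size, hidden_size))
b = np.zeros(hidden_size)

def encode(sequence):
    """Run a simple tanh RNN over the sequence and return the last hidden state."""
    h = np.zeros(hidden_size)
    for x in sequence:
        h = np.tanh(x @ Wx + h @ Wh + b)
    return h

short_seq = rng.standard_normal((3, input_size))    # 3 tokens
long_seq = rng.standard_normal((50, input_size))    # 50 tokens

# Both inputs are compressed into a vector of the same fixed length.
print(encode(short_seq).shape, encode(long_seq).shape)
```

However long the input is, the decoder would receive only `hidden_size` numbers, which is why long inputs are hard to represent faithfully.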

<center>
<img src='https://drive.google.com/uc?export=view&id=1xnshTq3kThH13CRLV1vbEuRmOvGJ5KAC' width='60%'>
</center>
<center>
Figure 1. Sequence to sequence model
</center>


## Applying a sequence to sequence model of RNN for machine translation

* Input the text to be translated as series data, and output the translated text as series data.
 * I like cat. You like dog. → ฉัน ชอบ แมว คุณ ชอบ สุนัข
 * I like cat. You like dog. → 私は猫が好きです。あなたは犬が好きです。

<center>
<img src='https://drive.google.com/uc?export=view&id=1JBOuHVL_NuonIraFS1MtGkhdtJm-4rhO' width='70%'>
</center>
<center>
Figure 2. Basic architecture of a sequence to sequence model for machine translation
</center>


## Attention
* Introduced by Bahdanau, Cho, and Bengio for neural machine translation (2014).
* Mechanism to focus on specific features of input data (attention) and emphasize them.
* Contributes to improving the performance of sequence-to-sequence models.
* Also functions as an important component in Transformers.


## Self-attention

* Adjust the sequence data to emphasize the elements to be focused on within the same input sequence.

### **Code example**

In [None]:
!wget http://mattmahoney.net/dc/text8.zip
!unzip text8.zip

In [None]:
from gensim.models.word2vec import Word2Vec, Text8Corpus

sentences = Text8Corpus('text8')
model = Word2Vec(sentences, vector_size=100)

model.save('model.bin')

In [None]:
model = Word2Vec.load('model.bin')

In [None]:
text = "I book a room at the hotel."

In [None]:
text = text.lower() # lowercase
text = text.replace('.', ' .') # separate period
words = text.split(' ') # Split words by white space

In [None]:
import numpy as np

# Creating a self-attention weight matrix
a = np.array([])
for w1 in words:
  for w2 in words:
    try:
      score = model.wv.similarity(w1, w2)
    except KeyError:  # the word is not in the word2vec vocabulary
      score = 0

    #print(w1, w2, score)
    a = np.append(a, score)

In [None]:
from matplotlib import pyplot as plt
import seaborn as sns

length = len(words)

attention_matrix = a.reshape(length, length)
feature_names = words
# Make a heat map of self-attention weights
sns.heatmap(attention_matrix, annot=True,
            xticklabels=feature_names,
            yticklabels=feature_names)

# Draw a graph
plt.show()
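
The heat map above shows raw similarity scores. As a hedged sketch of how such scores could be used to adjust a sequence, the block below normalizes each row with softmax and replaces every vector with a weighted sum of all vectors in the same sequence. The toy vectors and the helper `softmax_rows` are assumptions for illustration; dot products stand in for the word2vec similarity so that the block runs without the trained model.

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax; subtracting the row maximum keeps exp() numerically stable."""
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 6))      # 4 toy "word" vectors of dimension 6

scores = X @ X.T                     # similarity of every word with every other word
weights = softmax_rows(scores)       # each row now sums to 1
X_attended = weights @ X             # each row is a weighted mix of all word vectors

print(weights.sum(axis=1))
print(X_attended.shape)
```

The output sequence has the same shape as the input, but each element now carries information from the elements it is most similar to.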

### **Practice 5-1**
Draw a heat map of the self-attention weight for the following English sentence.
<br><br>
Sentence:<br>
I cut oranges with a knife.

## Attention in a sequence to sequence model with RNN
* The outputs of each encoder cell for the input sequence are concatenated to form the sequence of semantic vectors.
* For each input to a decoder cell, a context vector is generated by considering which of the semantic vectors to focus (attend) on.
* Accuracy tends to remain high even when the input sequence data is long, because the decoder is no longer limited to a single fixed-length vector.

<center>
<img src='https://drive.google.com/uc?export=view&id=1Logb1lxDG7YCZ2ndITEHo6AVHtV-OzAZ' width='70%'>
</center>
<center>
Figure 3. Attention in a sequence to sequence model with RNN
</center>

## Architecture of an RNN-based sequence to sequence model with attention
Figure 4 shows the architecture of an RNN-based sequence to sequence model with attention, in which an attention layer is added to the original architecture as shown in Figure 3. In addition, the encoder outputs a sequence of semantic vectors, which the decoder uses in the attention calculation over the input sequence.

<center>
<img src='https://drive.google.com/uc?export=view&id=1E4EdUJluX2Tad0beRfcdtXA3n6qpifXY' width='70%'>
</center>
<center>
Figure 4. Architecture of an RNN-based sequence to sequence model with attention
</center>

## Creating a context vector in the attention calculation

A context vector is created in the attention calculation by the following steps.

Step-1: For the output $\mathbf{h}'_i$ of a cell at the decoder, calculate the inner product with each semantic vector in the semantic vector sequence $[\mathbf{h}_1, \mathbf{h}_2, \cdots, \mathbf{h}_n]$ to obtain the weight vector $[a_1 , a_2, \cdots, a_n]$.

$$ \mathbf{a} = [a_1 , a_2, \cdots, a_n] = [\mathbf{h}_1, \mathbf{h}_2, \cdots, \mathbf{h}_n] \cdot \mathbf{h}'_i \tag{1}$$

Step-2: Normalize $[a_1, a_2, \cdots, a_n]$ applying softmax so that the sum is 1, and create the normalized weight vector $[a'_1, a'_2, \cdots , a'_n]$

$$ [a'_1, a'_2, \cdots , a'_n] = \mathrm{softmax}([a_1 , a_2, \cdots, a_n]) \tag{2}$$

Step-3: Calculate the weighted sum of the semantic vectors in the semantic vector sequence, using the normalized weights, to obtain the context vector $\mathbf{c}_i$.

$$ \mathbf{c}_i = a'_1 \mathbf{h}_1 + a'_2 \mathbf{h}_2 + \cdots + a'_n \mathbf{h}_n = \sum^n_{k=1}a'_k \mathbf{h}_k \tag{3}$$
<br>

<center>
<img src='https://drive.google.com/uc?export=view&id=1TXp0poDkllbu3sFjBzvJxbKfjSThmPyB' width='70%'>
</center>
<center>
Figure 5. Creation of context vector in attention layer
</center>

### Code example

First, we calculate each output $\mathbf{h}_j$ of the encoder on the left in a pseudo manner, and create a sequence of semantic vectors using the following example input sentence.
<br><br>
Sentence:<br>
I book a room at the hotel.

Let's define RNN_cell0(x, Wx, b) and RNN_cell(x, o, Wx, Wo, b) again.

In [2]:
import numpy as np

wordvec_size = 100
hidden_size = 5

Wx = np.random.randn(wordvec_size, hidden_size)
Wo = np.random.randn(hidden_size, hidden_size)
b = np.zeros(hidden_size)

In [3]:
def RNN_cell0(x, Wx, b):

  _o = np.dot(x, Wx) + b
  o = np.tanh(_o)

  return o

In [5]:
def RNN_cell(x, o, Wx, Wo, b):

  _o = np.dot(o, Wo) + np.dot(x, Wx) + b
  o = np.tanh(_o)

  return o

Train a word2vec model using the text8 corpus.

In [None]:
!wget http://mattmahoney.net/dc/text8.zip
!unzip text8.zip

In [8]:
from gensim.models.word2vec import Word2Vec, Text8Corpus

sentences = Text8Corpus('text8')
model = Word2Vec(sentences, vector_size=100)

model.save('model.bin')

In [9]:
model = Word2Vec.load('model.bin')

In [10]:
x1 = model.wv["i"]
x2 = model.wv["book"]
x3 = model.wv["a"]
x4 = model.wv["room"]
x5 = model.wv["at"]
x6 = model.wv["the"]
x7 = model.wv["hotel"]

In [11]:
h1 = RNN_cell0(x1, Wx, b)
h2 = RNN_cell(x2, h1, Wx, Wo, b)
h3 = RNN_cell(x3, h2, Wx, Wo, b)
h4 = RNN_cell(x4, h3, Wx, Wo, b)
h5 = RNN_cell(x5, h4, Wx, Wo, b)
h6 = RNN_cell(x6, h5, Wx, Wo, b)
h7 = RNN_cell(x7, h6, Wx, Wo, b)

In [None]:
print(h1)
print(h2)

Suppose we enter the sentence "Dinner at the restaurant is my favorite." into the decoder on the right. In this example, we compute the output $\mathbf{hd}_1$ of the first decoder cell in a pseudo manner.

In [15]:
# "Dinner at the restaurant is my favorite."

y1 = model.wv["dinner"]
hd1 = RNN_cell0(y1, Wx, b)

We generate a context vector that attends to "dinner". First, calculate the weights.

In [16]:
a1 = np.dot(h1, hd1)
a2 = np.dot(h2, hd1)
a3 = np.dot(h3, hd1)
a4 = np.dot(h4, hd1)
a5 = np.dot(h5, hd1)
a6 = np.dot(h6, hd1)
a7 = np.dot(h7, hd1)

Then, generate a context vector for "dinner" by computing a weighted sum. In this example, normalization by softmax is not applied to the weight values.

In [17]:
c1 = a1 * h1 + a2 * h2 + a3 * h3 + a4 * h4 + a5 * h5 + a6 * h6 + a7 * h7

In [None]:
print(c1)
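
The example above omits the softmax of Step-2. A minimal sketch of all three steps, Equations (1) to (3), is given below. Random vectors stand in for the encoder outputs and the decoder output so that the block runs on its own; the function name `context_vector` is an assumption, not part of the lecture code.

```python
import numpy as np

def context_vector(H, h_dec):
    """H: (n, hidden) encoder outputs; h_dec: (hidden,) decoder cell output."""
    a = H @ h_dec                       # Step-1: inner products
    e = np.exp(a - a.max())             # Step-2: softmax (numerically stabilized)
    a_norm = e / e.sum()
    c = a_norm @ H                      # Step-3: weighted sum of encoder outputs
    return c, a_norm

rng = np.random.default_rng(0)
H = np.tanh(rng.standard_normal((7, 5)))    # stand-in for h1..h7
h_dec = np.tanh(rng.standard_normal(5))     # stand-in for hd1

c, a_norm = context_vector(H, h_dec)
print(a_norm.sum())
print(c.shape)
```

With the lecture variables, `H` could be built as `np.stack([h1, h2, h3, h4, h5, h6, h7])` and `h_dec` would be `hd1`.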

### Practice 5-2
Generate a context vector that attends to "restaurant". Please use RNN_cell(x, o, Wx, Wo, b) to calculate the outputs $\mathbf{hd}_2$, $\mathbf{hd}_3$, and so on.

# Part-6

## Transformer

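* Proposed by Vaswani et al. in 2017
* Proposed as a machine translation model, but now widely used in natural language processing and image processing as well
* Achieved a BLEU score of 28.4 on English-to-German translation
 * BLEU score: a score for evaluating the accuracy of machine translation
* Has no sequential structure like an RNN, so it can be sped up through parallel computation
* Forms the basis of deep learning models such as BERT and GPT-n
* Vision Transformer (ViT) is an example of its application to image processing.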


## Architecture of Transformer

<img src='https://drive.google.com/uc?export=view&id=1UoTp8e9Y1NCCsr7-m7QlvQBwWRCFgfNt' width='50%'>

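* Multi-head attention: an attention mechanism that takes query, key, and value as inputs. The input data is split into multiple parts, each processed by its own attention; this arrangement is called multi-head.
* Masked multi-head attention: an attention mechanism that prevents each position from referring to subsequent words
* Positional encoding: embedding information about the position at which each word appears (Vaswani et al. proposed a calculation method using sine and cosine functions)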
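
To make the idea of masking concrete, here is a minimal NumPy sketch of a causal (look-ahead) mask. The score matrix is random, and setting masked scores to a large negative value before softmax is one common way to implement masking; the variable names are assumptions for illustration.

```python
import numpy as np

seq_len = 4
rng = np.random.default_rng(0)
scores = rng.standard_normal((seq_len, seq_len))   # toy attention scores

# True where a query position (row) may look at a key position (column):
# only the current and earlier positions.
allowed = np.tril(np.ones((seq_len, seq_len), dtype=bool))

masked = np.where(allowed, scores, -1e9)           # block subsequent positions
e = np.exp(masked - masked.max(axis=1, keepdims=True))
weights = e / e.sum(axis=1, keepdims=True)

print(np.round(weights, 2))   # entries above the diagonal are 0
```

The first row attends only to the first token, the second row to the first two tokens, and so on.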

## Scaled Dot-Product Attention

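* An attention mechanism based on Query, Key, and Value
* The dot products are normalized (scaled) by the size of the vectors; specifically, they are divided by $\sqrt{d_k}$, where $d_k$ is the key dimension
* Word position information is added using positional encoding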
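
As a minimal sketch, scaled dot-product attention as defined by Vaswani et al., $\mathrm{softmax}(QK^T/\sqrt{d_k})V$, can be written in NumPy as follows. The random matrices and their sizes are assumptions for illustration; in a real model, Q, K, and V come from learned projections of the token embeddings.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> output (n_q, d_v)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # scale by sqrt(d_k)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)         # softmax over the keys
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 8))     # 3 queries
K = rng.standard_normal((5, 8))     # 5 keys
V = rng.standard_normal((5, 16))    # 5 values

out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.shape)
```

Each query receives one output vector, which is a weighted average of the value vectors.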

<img src='https://drive.google.com/uc?export=view&id=1w1Wqzmli6PkzPBXR3tpc02pT5WF3LpxZ' width='50%'>

## Multi-head attention

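* The input data is split and processed in parallel by multiple scaled dot-product attentions
* The output is obtained by concatenating (concat) the outputs of the individual attentions (in vector form) side by side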
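
The split-and-concatenate idea can be sketched in NumPy as below. This is a simplification under stated assumptions: the feature axis is sliced directly into heads and each head performs self-attention on its slice, whereas the actual Transformer applies learned linear projections for query, key, and value in each head and a final linear projection after concatenation. The function names are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, num_heads):
    """Split the feature axis into heads, attend per head, then concatenate."""
    seq_len, d_model = X.shape
    assert d_model % num_heads == 0
    heads = np.split(X, num_heads, axis=1)          # list of (seq_len, d_head)
    outputs = []
    for Xh in heads:
        d_head = Xh.shape[1]
        weights = softmax(Xh @ Xh.T / np.sqrt(d_head))
        outputs.append(weights @ Xh)
    return np.concatenate(outputs, axis=1)          # back to (seq_len, d_model)

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 8))                     # 6 tokens, 8 features
print(multi_head_self_attention(X, num_heads=2).shape)
```

Because each head sees a different slice of the features, the heads can produce different attention patterns over the same sequence.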

<img src='https://drive.google.com/uc?export=view&id=1cpMJTclA31kwMsLZN_Jv19cvXjp8ETEN' width='30%'>




## Positional encoding
* A process that gives positional information to each word (token) in a sentence
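
For reference, the sinusoidal formulation proposed by Vaswani et al. can be written as follows, where $pos$ is the token position, $i$ indexes pairs of dimensions, and $d_{model}$ is the embedding dimension. Note that in the code of this section the argument `i` is the dimension index itself, which corresponds to $2i$ or $2i+1$ in the formula.

$$ PE_{(pos,\, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right) $$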

### **Code example**

In [None]:
import numpy as np

def positional_encoding(pos, i, dim):
  # i is the dimension index: even dimensions use sin, odd dimensions use cos
  if i % 2 == 0:
    return np.sin(pos/10000**(i/dim))
  else:
    return np.cos(pos/10000**((i-1)/dim))

Let's generate the positional information for the first (word) position as a 100-dimensional vector.

In [None]:
for i in range(0, 100):
  print(positional_encoding(1, i, 100))

### **Practice 6-1**
Generate the positional information for the second and third positions as 100-dimensional vectors.

### **Code example**
Store the positional information for up to the 30th position in an array (30 x 100-dimensional vectors), and then visualize it.

In [None]:
pe = np.zeros((30, 100))
for i in range(0,30):
  for j in range(0, 100):
    pe[i][j] = positional_encoding(i, j, 100)

print(pe)
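
The nested loops can also be written in vectorized form. The sketch below follows the sin/cos formulation of Vaswani et al.; the function name `positional_encoding_matrix` and the variable `pe_fast` are assumptions introduced for illustration.

```python
import numpy as np

def positional_encoding_matrix(num_positions, dim):
    """Return a (num_positions, dim) matrix of sinusoidal positional encodings."""
    pos = np.arange(num_positions)[:, np.newaxis]      # (num_positions, 1)
    j = np.arange(dim)[np.newaxis, :]                  # (1, dim)
    # An even column j and the following odd column j+1 share the exponent j/dim.
    angles = pos / np.power(10000.0, (j - j % 2) / dim)
    pe = np.zeros((num_positions, dim))
    pe[:, 0::2] = np.sin(angles[:, 0::2])              # even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])              # odd dimensions
    return pe

pe_fast = positional_encoding_matrix(30, 100)
print(pe_fast.shape)
```

For position 0 every sine entry is 0 and every cosine entry is 1, which is a quick sanity check for the result.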

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(pe)
plt.show()

Let's add the positional encoding to each word of "I book room at the hotel."

In [None]:
xe1 = x1 + pe[0] # I
xe2 = x2 + pe[1] # book

print(xe1.shape)
print(xe2.shape)

### **Practice 6-2**
For the remaining words "room", "at", "the", and "hotel", generate the vectors with the positional encoding added in the same way.

## Applying Transformer encoder for sentiment analysis

We classify the movie review texts of the IMDB dataset using a Transformer encoder.

### **Code example**

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

class TransformerEncoder(layers.Layer):
    def __init__(self, embed_dim, dense_dim, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.embed_dim = embed_dim
        self.dense_dim = dense_dim
        self.num_heads = num_heads
        self.attention = layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=embed_dim)
        self.dense_proj = keras.Sequential(
            [layers.Dense(dense_dim, activation="relu"),
             layers.Dense(embed_dim),]
        )
        self.layernorm_1 = layers.LayerNormalization()
        self.layernorm_2 = layers.LayerNormalization()

    def call(self, inputs, mask=None):
        if mask is not None:
            mask = mask[:, tf.newaxis, :]
        attention_output = self.attention(
            inputs, inputs, attention_mask=mask)
        proj_input = self.layernorm_1(inputs + attention_output)
        proj_output = self.dense_proj(proj_input)
        return self.layernorm_2(proj_input + proj_output)

    def get_config(self):
        config = super().get_config()
        config.update({
            "embed_dim": self.embed_dim,
            "num_heads": self.num_heads,
            "dense_dim": self.dense_dim,
        })
        return config

In [None]:
vocab_size = 20000 # Number of words
embed_dim = 256 # Dimension of embedding
num_heads = 2
dense_dim = 32

inputs = keras.Input(shape=(None,), dtype="int64")
x = layers.Embedding(vocab_size, embed_dim)(inputs)
x = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()

In [None]:
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz
!rm -r aclImdb/train/unsup

In [None]:
import os, pathlib, shutil, random
from tensorflow import keras
batch_size = 32
base_dir = pathlib.Path("aclImdb")
val_dir = base_dir / "val"
train_dir = base_dir / "train"
for category in ("neg", "pos"):
    os.makedirs(val_dir / category)
    files = os.listdir(train_dir / category)
    random.Random(1337).shuffle(files)
    num_val_samples = int(0.2 * len(files))
    val_files = files[-num_val_samples:]
    for fname in val_files:
        shutil.move(train_dir / category / fname,
                    val_dir / category / fname)

train_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/train", batch_size=batch_size
)
val_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/val", batch_size=batch_size
)
test_ds = keras.utils.text_dataset_from_directory(
    "aclImdb/test", batch_size=batch_size
)
text_only_train_ds = train_ds.map(lambda x, y: x)

In [None]:
from tensorflow.keras import layers

max_length = 600
max_tokens = 20000
text_vectorization = layers.TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=max_length,
)
text_vectorization.adapt(text_only_train_ds)

int_train_ds = train_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
int_val_ds = val_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
int_test_ds = test_ds.map(
    lambda x, y: (text_vectorization(x), y),
    num_parallel_calls=4)
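
As a rough picture of what the integer vectorization step produces, the following pure-Python sketch imitates it in a simplified way: lowercase the text, strip punctuation, split on whitespace, map words to integer indices (0 for padding, 1 for unknown words), and pad or truncate to a fixed length. It is an illustration under these assumptions, not the Keras implementation, and all function names are hypothetical.

```python
import string
from collections import Counter

def standardize(text):
    """Lowercase and strip punctuation (a simplification of the Keras default)."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocab(texts, max_tokens):
    """Index 0 is reserved for padding and index 1 for unknown words."""
    counts = Counter(w for t in texts for w in standardize(t).split())
    most_common = [w for w, _ in counts.most_common(max_tokens - 2)]
    return {w: i + 2 for i, w in enumerate(most_common)}

def vectorize(text, vocab, length):
    """Map words to indices, truncate to `length`, and pad with zeros."""
    ids = [vocab.get(w, 1) for w in standardize(text).split()][:length]
    return ids + [0] * (length - len(ids))

corpus = ["I like this movie.", "I do not like this movie at all!"]
vocab = build_vocab(corpus, max_tokens=10)
print(vectorize("I like this film.", vocab, length=6))   # [2, 3, 4, 1, 0, 0]
```

The word "film" does not occur in the toy corpus, so it is mapped to the unknown-word index 1, and the sequence is padded with zeros up to length 6.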

In [None]:
# Print only the first batch; iterating over the whole dataset would print every batch
for item in int_train_ds.take(1):
  print(item)

In [None]:
model.fit(int_train_ds, validation_data=int_val_ds, epochs=3)

In [None]:
print(f"Test acc: {model.evaluate(int_test_ds)[1]:.3f}")

### **Practice 6-3**
* Increase the number of epochs to 10 for the Transformer encoder and check whether the classification accuracy improves. (If execution takes too long, you may use fewer epochs.)
* In addition to SimpleRNN, LSTM, and GRU, which we examined last time, compare and discuss the classification accuracy of the four models including the Transformer encoder. Also compare the number of model parameters.



## Reference
* Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, https://arxiv.org/abs/1810.04805v1, 2018.
* Keras official Website, https://keras.io/examples/