# Short lecture on "Basics of Neural Language Model"

**Lecturer: Prof. Kosuke Takano, Kanagawa Institute of Technology**

This short lecture teaches the basics of neural language models along with simple Python code. Large Language Models (LLMs) such as OpenAI's ChatGPT and Google's Gemini are dramatically changing our lives and society with their human-like capabilities; however, their mechanism is not so complicated. This lecture focuses on the basic components used to build an LLM and explains how they work in a neural network architecture. Students will write small pieces of code for the basic functions that make up neural networks for natural language processing and deepen their understanding of the principles.

## Content

Day 1:
* Basics of neural networks
* Word embedding
* Sequential neural model for Natural Language Processing

Day 2:
* Sequential neural model for Natural Language Processing (Cont.)
* Transformer
* Conversation application by GPT

## Requirement
* PC and Internet connection
* Google Colaboratory ... Google account is required

## Execution environment

Python programs are sensitive to library versions. Because the execution environment of Colaboratory is updated at Google's discretion, we need to check the versions before running the code.<br>
Python: 3.10.12 (February 27, 2024)<br>
TensorFlow: 2.15.0 (February 27, 2024)

Be sure to specify GPU or TPU as the runtime type.

In [None]:
!python --version

Python 3.10.12


In [None]:
import tensorflow as tf

print(tf.__version__)

2.15.0
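
Since the runtime type matters for training speed, it can help to confirm that a GPU is actually attached. The following is a minimal sketch, not part of the original notebook: it only checks for GPUs (not TPUs), and it falls back gracefully when TensorFlow is not installed.

```python
import sys

# Report the Python version of the current runtime.
print("Python:", sys.version.split()[0])

try:
    import tensorflow as tf
    # An empty list means that no GPU is visible to TensorFlow.
    gpus = tf.config.list_physical_devices('GPU')
    print("TensorFlow:", tf.__version__)
    print("GPUs visible to TensorFlow:", gpus)
except ImportError:
    gpus = []
    print("TensorFlow is not installed in this environment.")
```

If the printed list is empty on Colaboratory, the runtime type is probably still set to CPU.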


# Part 3

## RNN (Recurrent Neural Network)

* Neural network with recurrent architecture
* Used in fields such as natural language processing.
 * Document classification, sentiment analysis, machine translation, text generation
* There are derivative models such as LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit).

<center>
<img src='https://drive.google.com/uc?export=view&id=1WkAbRVWHZd6aA1-X9WeqO75ojjiQQihk' width='70%'>
</center>
<center>
Figure 1. RNN architecture
</center>


## Detail of RNN cell
In Figure 1, the part marked "*$H$*" is called the RNN layer. Each cell $h_t$ in $H$ is called an RNN cell, and its output is calculated recurrently using the output of the previous RNN cell. Figure 2 shows the architecture of the RNN cell. As shown in Figure 2, the relationship between the input $\mathbf{x}_t$ and the output $\mathbf{o}_t$ in the RNN cell $h_t$ is given by the following formula.<br><br>

$$\mathbf{o}_t = tanh(\mathbf{o}_{t-1} \mathbf{W}_o + \mathbf{x}_t \mathbf{W}_x  + \mathbf{b}) \tag{1}$$
<br>

* $\mathbf{x}_t$ is a term vector
* $\mathbf{o}_{t-1}$ is the output of the previous RNN cell
* $\mathbf{W}_x$ is the weight matrix applied to $\mathbf{x}_t$
* $\mathbf{W}_o$ is the weight matrix applied to $\mathbf{o}_{t-1}$
* $\mathbf{b}$ is the bias vector
* $tanh$ is the hyperbolic tangent function, which is used as the activation function.

<center>
<img src='https://drive.google.com/uc?export=view&id=105u8Lk7z7RkSfgDTAc_DMgGIpGWbid2q' width='70%'>
</center>
<center>
Figure 2. RNN cell architecture
</center>



## Basic calculation of RNN cell
Figure 3 shows the basic calculation of an RNN cell. The RNN cell $h_{t}$ returns one output vector $\mathbf{o}_t$ for two input vectors, $\mathbf{x}_t$ and $\mathbf{o}_{t-1}$. Here, $\mathbf{x}_t$ is a term vector and $\mathbf{o}_{t-1}$ is the output vector of the previous RNN cell $h_{t-1}$. The calculation steps are as follows.

Step-1: Calculate the output of the previous RNN cell $h_{t-1}$ and input it into $h_{t}$<br>
Step-2: Input a term vector $\mathbf{x}_t$ into $h_{t}$<br>
Step-3: Multiply $\mathbf{o}_{t-1}$ by the weighting coefficient $\mathbf{W}_o$<br>
Step-4: Multiply $\mathbf{x}_{t}$ by the weighting coefficient $\mathbf{W}_x$<br>
Step-5: Add the bias $\mathbf{b}$<br>
Step-6: Finally, apply the tanh function and output $\mathbf{o}_{t}$

In Step-4, as shown in Figure 3, if $\mathbf{x}_{t}$ is a 9-dimensional vector and $\mathbf{W}_x$ is a 9 x 3 matrix, the multiplication outputs a 3-dimensional vector.
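
The dimensions in this example can be checked with a few lines of NumPy. This is only a sketch with random values: the 3 x 3 shape of $\mathbf{W}_o$ is an assumption that follows from the 3-dimensional output, and the variable names are chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, only for reproducibility

input_size = 9    # dimension of the term vector x_t (as in Figure 3)
hidden_size = 3   # dimension of the output o_t

x_t = rng.standard_normal(input_size)                   # term vector
o_prev = rng.standard_normal(hidden_size)               # output of the previous cell, o_{t-1}
W_x = rng.standard_normal((input_size, hidden_size))    # 9 x 3
W_o = rng.standard_normal((hidden_size, hidden_size))   # 3 x 3 (assumed)
b = np.zeros(hidden_size)

a = np.dot(o_prev, W_o)     # (3,) times (3, 3) gives (3,)
c = np.dot(x_t, W_x)        # (9,) times (9, 3) gives (3,)
o_t = np.tanh(a + c + b)    # add the bias, then apply tanh

print(a.shape, c.shape, o_t.shape)
```

Both products have the same 3-dimensional shape, which is why they can be added before the activation function is applied.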

<center>
<img src='https://drive.google.com/uc?export=view&id=1UtAIUFPtClrCQpkXY86azqNzG8yFW4TF' width='65%'>
</center>
<center>
Figure 3. Calculation of RNN cell
</center>



### Code example

Consider the leftmost cell. The leftmost cell has no $\mathbf{o}_{t-1}$ input, so the formula is as follows. Here, a word vector is input as $\mathbf{x}_1$.

$$ \mathbf{o}_1 = tanh(\mathbf{x}_1 \mathbf{W}_x + \mathbf{b}) \tag{3}$$

Let's write code for the above formula, with a word vector of 100 dimensions and 5 dimensions inside the cell (the number of dimensions of the hidden layer).

In [None]:
import numpy as np

wordvec_size = 100
hidden_size = 5

Wx = np.random.randn(wordvec_size, hidden_size)
b = np.zeros(hidden_size)
print(Wx.shape)
print(b.shape)

We use word vectors from an embedding matrix trained with word2vec. Run the same code as in the word embedding part to generate the word embedding matrix.

In [None]:
!wget http://mattmahoney.net/dc/text8.zip

In [None]:
!unzip text8.zip

In [None]:
import logging
from gensim.models.word2vec import Word2Vec, Text8Corpus

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

sentences = Text8Corpus('text8')
model = Word2Vec(sentences, vector_size=100)

model.save('model.bin')

In [None]:
model = Word2Vec.load('model.bin')

x = model.wv['dog']

In [None]:
print(x)

The word embedding vector of "dog" is obtained. Let's do the calculation above.

In [None]:
_o = np.dot(x, Wx) + b
print(_o)

Next, input it to the activation function tanh().

In [None]:
o = np.tanh(_o)
print(o)

### Practice 3-1
1. Let's create a function RNN_cell0(x, Wx, b) that calculates the leftmost RNN cell.
2. Calculate the output of RNN_cell0(x, Wx, b) for the word of "cat".

In [None]:
def RNN_cell0(x, Wx, b):
  _o = np.dot(x, Wx) + b
  o = np.tanh(_o)

  return o

In [None]:
x = model.wv['cat']
o = RNN_cell0(x, Wx, b)

print(o)

### Code example

In the second and subsequent RNN cells, $\mathbf{o}_{t-1}$, which is the output of the previous RNN cell, is also input and calculated. The calculation formula for the second RNN cell is as follows.

$$ \mathbf{o}_2 = tanh(\mathbf{o}_1 \mathbf{W}_o + \mathbf{x}_2 \mathbf{W}_x +\mathbf{b}) \tag{4}$$




Let's calculate the first and second RNN cells for the sentence "I like dog".

In [None]:
x1 = model.wv['i']
x2 = model.wv['like']

Set the weight matrix and bias in the same way as before. $\mathbf{W}_{o}$ is a new addition.

In [None]:
wordvec_size = 100
hidden_size = 5

Wx = np.random.randn(wordvec_size, hidden_size)
Wo = np.random.randn(hidden_size, hidden_size)
b = np.zeros(hidden_size)

In [None]:
_o1 = np.dot(x1, Wx) + b
o1 = np.tanh(_o1)

_o2 = np.dot(o1, Wo) + np.dot(x2, Wx) + b
o2 = np.tanh(_o2)

print(o1)
print(o2)

### Practice 3-2
Calculate the output of the third RNN cell for the sentence 'I like dog'.

In [None]:
x3 = model.wv['dog']
_o3 = np.dot(o2, Wo) + np.dot(x3, Wx) + b
o3 = np.tanh(_o3)

print(o3)

### **Practice 3-3**
1. Create a function RNN_cell(x, o, Wx, Wo, b) that calculates the second and subsequent RNN cells.
2. Calculate the output of the third RNN cell for the sentence "You look mirror".

In [None]:
def RNN_cell(x, o, Wx, Wo, b):

  _o = np.dot(o, Wo) + np.dot(x, Wx) + b
  o = np.tanh(_o)

  return o

In [None]:
x1 = model.wv['you']
x2 = model.wv['look']
x3 = model.wv['mirror']

o1 = RNN_cell0(x1, Wx, b)
o2 = RNN_cell(x2, o1, Wx, Wo, b)
o3 = RNN_cell(x3, o2, Wx, Wo, b)

print(o1)
print(o2)
print(o3)
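
The cell-by-cell calls above can be generalized to a loop over a whole sentence. The sketch below is one possible way to write it; the function name `rnn_forward` is hypothetical, and random vectors stand in for the word2vec vectors so that the block runs on its own. Starting from $\mathbf{o}_0 = \mathbf{0}$ makes the first step identical to the leftmost-cell formula.

```python
import numpy as np

def rnn_forward(xs, Wx, Wo, b):
    """Apply formula (1) to every vector in xs and return all outputs."""
    o = np.zeros(Wo.shape[0])   # o_0 = 0, so the first step reduces to tanh(x_1 Wx + b)
    outputs = []
    for x in xs:
        o = np.tanh(np.dot(o, Wo) + np.dot(x, Wx) + b)
        outputs.append(o)
    return np.stack(outputs)

rng = np.random.default_rng(0)
wordvec_size, hidden_size = 100, 5
xs = rng.standard_normal((3, wordvec_size))   # stand-ins for three word vectors
Wx = rng.standard_normal((wordvec_size, hidden_size))
Wo = rng.standard_normal((hidden_size, hidden_size))
b = np.zeros(hidden_size)

outputs = rnn_forward(xs, Wx, Wo, b)
print(outputs.shape)   # (3, 5)
```

The same weights `Wx`, `Wo`, and `b` are reused at every step, which is the defining property of the recurrent architecture.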

# Part 4

## Sentiment analysis by RNN


Sentiment analysis and document classification are applied natural language processing tasks. Sentiment analysis is the task of predicting the emotion expressed in an input sentence. Document classification is the task of classifying documents according to their content.
When building a neural network for natural language processing tasks, a word embedding layer is often used as the input layer.

Figure 4 shows an example RNN architecture for positive-negative judgment of an input sentence. The sentence is split into $n$ words, $t_1, t_2, \cdots, t_n$, and the words are input into the RNN cells one by one, recurrently.

<center>
<img src='https://drive.google.com/uc?export=view&id=1RN08BTtS54Ubgk0QA1qLNFBK_ixwWchF' width='50%'>
</center>
<center>
Figure 4. Positive - negative judgment by RNN
</center>
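
The flow in Figure 4 can be sketched in NumPy as follows. This is an illustration under assumptions, not the exact network in the figure: the sizes are tiny, the weights are random and untrained, and the output layer (`w_out`, `b_out`) with a sigmoid is a hypothetical choice that mirrors a single-unit output layer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n_words, wordvec_size, hidden_size = 4, 8, 3   # small illustrative sizes

xs = rng.standard_normal((n_words, wordvec_size))       # stand-ins for t_1 ... t_n
Wx = rng.standard_normal((wordvec_size, hidden_size))
Wo = rng.standard_normal((hidden_size, hidden_size))
b = np.zeros(hidden_size)
w_out = rng.standard_normal(hidden_size)                # hypothetical output-layer weights
b_out = 0.0

o = np.zeros(hidden_size)
for x in xs:                                            # feed the words one by one
    o = np.tanh(np.dot(o, Wo) + np.dot(x, Wx) + b)

p_positive = sigmoid(np.dot(o, w_out) + b_out)          # score between 0 and 1
label = "positive" if p_positive > 0.5 else "negative"
print(p_positive, label)
```

Only the output of the last RNN cell is passed to the output layer, so that one vector has to summarize the whole sentence.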

## IMDb dataset
* The IMDb dataset (`http://ai.stanford.edu/~amaas/data/sentiment/`) is a collection of 50,000 movie reviews: 25,000 positive and 25,000 negative. The reviews are text data, and each review is labeled positive or negative as the correct answer.

### Code example

Let's classify the movie reviews in the IMDb dataset into positive and negative. Here, we will use the IMDb data included with Keras. There are 25,000 training samples (x_train, y_train) and 25,000 test samples (x_test, y_test). y_train and y_test store the labels: positive (= 1) or negative (= 0).

In [None]:
import tensorflow as tf

max_words = 10000  # Number of terms
max_len = 500  # Length of input sequence
embedding_dim = 128 # Dimension of word embedding

(x_train, y_train), (x_test, y_test) =  tf.keras.datasets.imdb.load_data(num_words=max_words)

We check both the training data and the test data. The training data consists of body data (x_train) and answers (y_train) and is used for training a model. In IMDb, the body data (x_train) is a list of word ids preprocessed from the original review text. The answer (y_train) is the label for the review, 1 or 0. The test data (x_test, y_test) is used to test the trained model and has the same structure as the training data.

In [None]:
print(len(x_train), 'train (sequence)')
print(len(x_test), 'test (sequence)')

In [None]:
print(len(x_train[0]))
print(x_train[0])

print(len(x_train[1]))
print(x_train[1])

### **Practice 4-1**

1. Check the number of data in y_train and y_test.
2. Display the values of y_train[0] and y_train[1].

### **Code example**

Using the index information, each word id can be converted to the corresponding word.

In [None]:
word2id =  tf.keras.datasets.imdb.get_word_index()
id2word = {i: word for word, i in word2id.items()}

print(len(word2id))
print(len(id2word))

In [None]:
print("Index number of 'car':")
print(word2id['car'])

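print('Words of the review sentence:')
# load_data() shifts word ids by 3 by default (0: padding, 1: start, 2: unknown),
# so subtract 3 before looking a word up in the index.
print([id2word.get(i - 3, '?') for i in x_train[0]])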
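
One detail is easy to miss when decoding: with its default arguments, `load_data` reserves ids 0, 1, and 2 for padding, the start of a sequence, and out-of-vocabulary words (`index_from=3`), so the ids stored in `x_train` are shifted by 3 relative to `get_word_index()`. The sketch below shows the shift with a toy dictionary; `toy_word2id`, the marker strings, and the `encoded` list are made up for illustration.

```python
# Toy word index in the style of get_word_index(): rank 1 is the most frequent word.
toy_word2id = {"the": 1, "movie": 2, "was": 3, "great": 4}
index_from = 3   # default offset used by tf.keras.datasets.imdb.load_data

# Shift every id by index_from and add the reserved markers.
id2token = {i + index_from: w for w, i in toy_word2id.items()}
id2token.update({0: "<pad>", 1: "<start>", 2: "<oov>"})

encoded = [1, 4, 5, 6, 2, 7]   # a made-up encoded review
decoded = [id2token.get(i, "?") for i in encoded]
print(decoded)   # ['<start>', 'the', 'movie', 'was', '<oov>', 'great']
```

Without the shift, every id would be mapped to a word three ranks away from the intended one, and the decoded review would not be readable.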

### **Code example**

**Padding**

The word counts (document lengths) of the review data x_train[0] and x_train[1] were 218 and 189 words, respectively. A neural network is trained on inputs of a fixed length, so every review must contain the same number of word ids. The process of filling the missing positions with a filler value (such as 0) to equalize the lengths is called "padding". Reviews longer than the target length are truncated.

In [None]:
# Padding

max_len = 500
x_train = tf.keras.utils.pad_sequences(x_train, maxlen=max_len)
x_test = tf.keras.utils.pad_sequences(x_test, maxlen=max_len)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)

In [None]:
print(len(x_train[0]))
print(x_train[0])
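
To see what padding does to a single sequence, the sketch below re-implements the idea in a few lines. The helper name `pad_pre` is hypothetical; it mimics the default behavior of `pad_sequences`, which pads and truncates at the front of the sequence (`padding='pre'`, `truncating='pre'`).

```python
import numpy as np

def pad_pre(seq, maxlen, value=0):
    """Left-pad seq with value, or keep only its last maxlen items."""
    seq = list(seq)[-maxlen:]                        # truncate from the front if too long
    return np.array([value] * (maxlen - len(seq)) + seq)

short = pad_pre([5, 8, 2], maxlen=6)
long = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=6)
print(short)   # [0 0 0 5 8 2]
print(long)    # [2 3 4 5 6 7]
```

Because the zeros are placed in front, the actual words of a short review sit at the end of the sequence, closest to the last RNN cell.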

### **Practice 4-2**

1. Display x_train[1] and x_train[2].
2. Similarly, apply padding to x_test to a length of 500.

## Applying RNN model


### Code example

* Build an RNN model using SimpleRNN.
* Embedding() is a word embedding layer that vectorizes words.

In [None]:
from keras.layers import SimpleRNN, Embedding, Dense
from keras.models import Sequential

embedding_dim=128
model_rnn=Sequential()
model_rnn.add(Embedding(max_words, embedding_dim, input_length=max_len))
model_rnn.add(SimpleRNN(100))
model_rnn.add(Dense(1, activation='sigmoid'))
print(model_rnn.summary())
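
Conceptually, a word embedding layer is a lookup table with one trainable row per word id. The following NumPy sketch illustrates that idea only; the sizes are tiny, the matrix `E` is random rather than trained, and the id sequence is made up.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 10, 4                       # tiny sizes for illustration
E = rng.standard_normal((vocab_size, embedding_dim))    # one row per word id

word_ids = np.array([0, 0, 3, 7, 3])   # a padded sequence of word ids
vectors = E[word_ids]                  # row lookup, one vector per position
print(vectors.shape)                   # (5, 4)
```

The same id always maps to the same row, so positions 2 and 4 above receive identical vectors. In the model above, the table has `max_words` rows and `embedding_dim` columns.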

In [None]:
model_rnn.compile(loss='binary_crossentropy',
             optimizer='adam',
             metrics=['acc'])

batch_size = 128
num_epochs = 3

history = model_rnn.fit(x_train, y_train,
              validation_split=0.2,
              batch_size=batch_size, epochs=num_epochs)
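
The loss `binary_crossentropy` used above compares the predicted probability with the 0/1 answer. A minimal NumPy sketch of the formula follows; the clipping constant `eps` and the example values are assumptions for illustration, and the Keras implementation may differ in numerical details.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1 - eps)   # avoid log(0)
    losses = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return float(np.mean(losses))

y_true = np.array([1, 0, 1, 0])
mostly_right = np.array([0.9, 0.1, 0.8, 0.2])
mostly_wrong = np.array([0.1, 0.9, 0.2, 0.8])

loss_right = binary_crossentropy(y_true, mostly_right)
loss_wrong = binary_crossentropy(y_true, mostly_wrong)
print(loss_right, loss_wrong)
```

Predictions close to the correct label give a small loss, while confident wrong predictions are penalized heavily.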

We evaluate the trained model.

In [None]:
import matplotlib.pyplot as plt

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(len(acc))

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

In [None]:
scores = model_rnn.evaluate(x_test, y_test, verbose=0)
print('Test accuracy:', scores[1])

In [None]:
print(x_test[0].shape)
print(x_test[0][np.newaxis].shape)

In [None]:
pred0 = model_rnn.predict(x_test[0][np.newaxis])

print('Prediction: ', pred0)
print('Answer: ', y_test[0])

### **Practice 4-3**
Similarly, check whether the prediction results are correct for x_test[1] and x_test[2].

In [None]:
pred1 = model_rnn.predict(x_test[1][np.newaxis])

print('Prediction: ', pred1)
print('Answer: ', y_test[1])

pred2 = model_rnn.predict(x_test[2][np.newaxis])

print('Prediction: ', pred2)
print('Answer: ', y_test[2])

In [None]:
pred0_2 = model_rnn.predict(x_test[0:3])

print(pred0_2)

print(y_test[0:3])
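
The predictions are sigmoid outputs between 0 and 1, so they have to be turned into labels before they can be compared with the answers. The sketch below uses made-up values with the same `(n, 1)` shape as the output of a single-unit output layer; the threshold of 0.5 is a common choice, not something fixed by the model.

```python
import numpy as np

# Made-up sigmoid outputs, shaped like the result of predict() on three reviews.
probs = np.array([[0.12], [0.97], [0.55]])
answers = np.array([0, 1, 0])

labels = (probs[:, 0] > 0.5).astype(int)   # 1 = positive, 0 = negative
print(labels)                               # [0 1 1]
print("Correct:", labels == answers)
```

The third made-up review shows a borderline case: a probability of 0.55 is counted as positive even though the model is far from certain.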

### **Practice 4-4**
* Increase the number of epochs to 10 in the RNN model and run it. Let's also draw classification accuracy and loss graphs. (If it takes longer to execute, you can reduce the number of epochs.)
* Predict x_test[0], x_test[1], x_test[2] and see if there is a change in the results compared to when the number of epochs is small.

## LSTM (Long Short-Term Memory)
* LSTM is a variant of the RNN.
* Plain RNNs often fail to learn long sequences well because gradients vanish during training.
* LSTM mitigates the vanishing gradient problem by adding a separate memory cell that carries memory on to the next LSTM cell.
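
The vanishing gradient problem mentioned above can be observed numerically in the plain RNN cell. The sketch below multiplies the step-by-step derivatives of $\mathbf{o}_t$ with respect to $\mathbf{o}_{t-1}$ over 50 steps. It assumes small random recurrent weights (scaled by 0.1) and random inputs, so it illustrates the effect rather than reproducing a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, steps = 5, 50
Wo = rng.standard_normal((hidden_size, hidden_size)) * 0.1   # small recurrent weights (assumption)

o = np.zeros(hidden_size)
grad = np.eye(hidden_size)   # derivative of o_0 with respect to itself
norms = []
for _ in range(steps):
    x_term = rng.standard_normal(hidden_size)   # stands in for x_t Wx + b
    o = np.tanh(np.dot(o, Wo) + x_term)
    # Derivative of o_t with respect to o_{t-1}: entry (i, j) is Wo[i, j] * (1 - o_t[j]^2).
    jac = Wo * (1.0 - o ** 2)
    grad = grad @ jac                           # chain rule across time steps
    norms.append(np.linalg.norm(grad))

print(norms[0], norms[-1])
```

The norm after 50 steps is many orders of magnitude smaller than after one step, which means that early words barely influence the weight updates.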

<center>
<img src='https://drive.google.com/uc?export=view&id=1FtxOh6SmN56d8cDyuAFwKN6dtl5AO-4C' width='70%'>
</center>
<center>
Figure 5. LSTM cell architecture
</center>
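
One step of an LSTM cell can be sketched in NumPy as follows. This uses the commonly described formulation with input, forget, and output gates; the packing of the four weight blocks into `W`, `U`, and `b` and all variable names are assumptions for illustration and may not match Figure 5 symbol by symbol.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (input, 4*hidden), U: (hidden, 4*hidden), b: (4*hidden,)."""
    z = np.dot(x, W) + np.dot(h_prev, U) + b
    i, f, g, o = np.split(z, 4)   # input gate, forget gate, candidate, output gate
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)   # update the memory cell
    h = sigmoid(o) * np.tanh(c)                         # output passed to the next step
    return h, c

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3
W = rng.standard_normal((input_size, 4 * hidden_size))
U = rng.standard_normal((hidden_size, 4 * hidden_size))
b = np.zeros(4 * hidden_size)

h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x in rng.standard_normal((5, input_size)):   # five stand-in input vectors
    h, c = lstm_cell(x, h, c, W, U, b)
print(h, c)
```

The memory cell `c` is updated by addition rather than being squashed by tanh at every step, which is what lets information and gradients survive over longer distances.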


### Applying LSTM model


### **Code example**
* Build a prediction model using LSTM.
* The model is the same as the earlier one, with SimpleRNN simply replaced by LSTM.

In [None]:
from keras import Sequential
from keras.layers import Embedding, LSTM, Dense, Dropout

embedding_dim=128
model_lstm=Sequential()
model_lstm.add(Embedding(max_words, embedding_dim, input_length=max_len))
model_lstm.add(LSTM(100))
model_lstm.add(Dense(1, activation='sigmoid'))
print(model_lstm.summary())

In [None]:
model_lstm.compile(loss='binary_crossentropy',
             optimizer='adam',
             metrics=['acc'])

batch_size = 128
num_epochs = 5

history = model_lstm.fit(x_train, y_train,
              validation_split=0.2,
              batch_size=batch_size, epochs=num_epochs)

In [None]:
import matplotlib.pyplot as plt

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(len(acc))

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

In [None]:
scores = model_lstm.evaluate(x_test, y_test, verbose=0)
print('Test accuracy:', scores[1])

### **Practice 4-5**
* Increase the number of epochs to 10 in the LSTM model and run it. Let's also draw classification accuracy and loss graphs. (If it takes longer to execute, you can reduce the number of epochs.)
* Predict x_test[0], x_test[1], x_test[2] and see if there is a change in the results compared to when the number of epochs is small.

## GRU (Gated Recurrent Unit)
* Like LSTM, GRU is a variant of the RNN.
* This model has fewer parameters than LSTM and a lower computational cost.
* Unlike LSTM, it does not use a memory cell, but its gating structure is designed to suppress vanishing gradients during training.

<br>
<center>
<img src='https://drive.google.com/uc?export=view&id=13TskGR7UE5Mm2E5ofvOKxGIXl2xQotfQ' width='90%'>
</center>
<center>
Figure 6. Comparison of cell architectures of RNN, LSTM, GRU
</center>
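
For comparison with the other cells, one GRU step can be sketched as follows. This follows one common convention for the update gate (the new state is `(1 - z) * h_prev + z * h_cand`); other references swap the roles of `z` and `1 - z`, and the Keras layer differs in some details, so treat the code as an illustration with hypothetical names.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell(x, h_prev, W, U, b):
    """One GRU step. W: (input, 3*hidden), U: (hidden, 3*hidden), b: (3*hidden,)."""
    Wz, Wr, Wh = np.split(W, 3, axis=1)
    Uz, Ur, Uh = np.split(U, 3, axis=1)
    bz, br, bh = np.split(b, 3)
    z = sigmoid(np.dot(x, Wz) + np.dot(h_prev, Uz) + bz)             # update gate
    r = sigmoid(np.dot(x, Wr) + np.dot(h_prev, Ur) + br)             # reset gate
    h_cand = np.tanh(np.dot(x, Wh) + np.dot(r * h_prev, Uh) + bh)    # candidate state
    return (1 - z) * h_prev + z * h_cand                             # blend old and new state

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3
W = rng.standard_normal((input_size, 3 * hidden_size))
U = rng.standard_normal((hidden_size, 3 * hidden_size))
b = np.zeros(3 * hidden_size)

h = np.zeros(hidden_size)
for x in rng.standard_normal((5, input_size)):   # five stand-in input vectors
    h = gru_cell(x, h, W, U, b)
print(h)
```

There are three weight blocks here instead of the four in an LSTM cell, and only a single state vector `h` is carried between steps.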


In [None]:
from keras.layers import GRU
embedding_dim=128
model_gru=Sequential()
model_gru.add(Embedding(max_words, embedding_dim, input_length=max_len))
model_gru.add(GRU(100))
model_gru.add(Dense(1, activation='sigmoid'))
print(model_gru.summary())

In [None]:
model_gru.compile(loss='binary_crossentropy',
             optimizer='adam',
             metrics=['acc'])

batch_size = 128
num_epochs = 5

history = model_gru.fit(x_train, y_train,
              validation_split=0.2,
              batch_size=batch_size, epochs=num_epochs)

In [None]:
import matplotlib.pyplot as plt

acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(len(acc))

plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

In [None]:
scores = model_gru.evaluate(x_test, y_test, verbose=0)
print('Test accuracy:', scores[1])

### **Practice 4-6**
* Increase the number of epochs to 10 in the GRU model and run it. Let's also draw classification accuracy and loss graphs. (If it takes longer to execute, you can reduce the number of epochs.)
* Predict x_test[0], x_test[1], x_test[2] and see if there is a change in the results compared to when the number of epochs is small.

### **Practice 4-7**
* Compare and discuss the classification accuracy of the three models: SimpleRNN, LSTM, and GRU.
* Furthermore, let's compare by also focusing on the number of model parameters.
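
As a cross-check for the parameter comparison, the number of weights in each recurrent layer can be computed by hand. The formulas below assume the default layer settings; in particular, the GRU formula assumes two bias vectors per gate, which corresponds to `reset_after=True` in Keras. The function names are hypothetical.

```python
def simple_rnn_params(input_dim, units):
    # One block: input weights, recurrent weights, and a bias.
    return units * (input_dim + units + 1)

def lstm_params(input_dim, units):
    # Four blocks: input gate, forget gate, candidate, output gate.
    return 4 * units * (input_dim + units + 1)

def gru_params(input_dim, units):
    # Three blocks, each with two bias vectors (assumes reset_after=True).
    return 3 * units * (input_dim + units + 2)

embedding_dim, units = 128, 100
n_rnn = simple_rnn_params(embedding_dim, units)
n_lstm = lstm_params(embedding_dim, units)
n_gru = gru_params(embedding_dim, units)
print(n_rnn)    # 22900
print(n_lstm)   # 91600
print(n_gru)    # 69000
```

The embedding layer (10,000 x 128 weights) and the final Dense layer are identical in all three models, so differences in the totals reported by `summary()` should come from the recurrent layer alone.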


## References
* https://www.kaggle.com/shivamb/beginners-guide-to-text-generation-using-lstms
* Keras official Website, https://keras.io/examples/