# Short lecture on "Basics of Neural Language Model"

**Lecturer: Prof. Kosuke Takano, Kanagawa Institute of Technology**

This short lecture teaches the basics of neural language models along with simple Python code. Large Language Models (LLMs) such as OpenAI's ChatGPT and Google's Gemini are dramatically changing our lives and society with their impressive human-like capabilities; however, their mechanism is not so complicated. This lecture focuses on the basic components used to build an LLM and explains how they work in a neural network architecture. Students will write small pieces of code for the basic functions that make up neural networks for natural language processing and deepen their understanding of the principles.

## Content

Day 1:
* Basics of neural networks
* Word embedding
* Sequential neural model for Natural Language Processing

Day 2:
* Sequential neural model for Natural Language Processing (Cont.)
* Transformer
* Conversation application by GPT

## Requirement

* PC and Internet connection
* Google Colaboratory  ... Google account is required


## Execution environment

Python programs are very version sensitive. Since the execution environment of Colaboratory is updated at Google's discretion, we need to check it.<br>
Python: 3.10.12 (February 27, 2024)

Be sure to specify GPU or TPU as the runtime type.

In [None]:
!python --version

Python 3.10.12


# Part-1

## 1. Principle of neural network



## Introduction of neural network

Neural networks are mathematical models that represent nerve cells (neurons) in the brain and their connections as a network of artificial neurons.
The first artificial neuron was devised in 1943 by Dr. W. McCulloch and W. Pitts as a formal neuron. In 1957, Dr. F. Rosenblatt devised the perceptron, which applied formal neurons. Among perceptrons, those with two network layers are called simple perceptrons, and those with three or more layers are called multilayer perceptrons (Figure 1). The multilayer perceptron was repeatedly improved, and in the 1980s error backpropagation, a method for efficiently training neural networks, was applied to it. This made it possible to realize so-called "shallow" neural networks, that is, multilayer perceptrons consisting of several layers.

<center>
<img src='https://drive.google.com/uc?export=view&id=1WAX0TCABY4V-qTHPjTeFrwZtrW_W7l61' width='80%'>
</center>

<center>Figure 1. Initial perceptron</center>

Furthermore, in the late 2000s, (1) problems such as the vanishing gradient problem, a phenomenon in which the learning efficiency of error backpropagation decreases as the network becomes deeper, were largely overcome, and (2) the computational performance of computers improved, making it possible to reduce the training time of neural networks. As a result, it has become possible to realize deep neural networks (Figure 2) with many more network layers.

<center>
<img src='https://drive.google.com/uc?export=view&id=13x2zKSy1HHKz5ReZcUPMddO-aIq-sZpi' width='70%'>
</center>

<center>Figure 2. Deep neural network</center>

## Principle of perceptron

As shown in Figure 1, the simple perceptron accepts $n$ values as input and outputs one value. This calculation is done in two stages. First, the $n$ input values are multiplied by their corresponding weight values, and the sum is calculated. We will call this a "weighted sum." Next, the weighted sum is input to a transformation function called an "activation function", and an output value is calculated depending on the properties of the activation function. Step functions, sigmoid functions, ReLU functions, etc. are often used as activation functions.

<center>
<img src='https://drive.google.com/uc?export=view&id=157Bff_Wen0w72_VDbmLvdG7xAEBPG5k-' width='50%'>
</center>

<center>Figure 3. Overview of calculations in perceptron</center>

Figure 3 shows an overview of the calculation in the perceptron. In Figure 3, if the $n$ input values are $x_1$, $x_2$, …, $x_n$, the corresponding weights are $w_1$, $w_2$, …, $w_n$, and the bias is $b$, the weighted sum $\_y$ can be calculated as follows.

$$\_y=\sum_{k=1}^n x_k\cdot w_k + b \tag{1}$$

In the case of two input values $x_1$ and $x_2$, the formula is as follows.

$$\_y=\sum_{k=1}^2 x_k\cdot w_k + b = x_1\cdot w_1 + x_2\cdot w_2 + b \tag{2}$$

Also, if the activation function is written as Activate(), the output $y$ of the perceptron with the weighted sum $\_y$ as input is expressed by the following equation.

$$y = Activate(\_y) = Activate(\sum_{k=1}^n x_k\cdot w_k + b) \tag{3}$$
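Equations (1) and (3) can be sketched for $n$ input values as follows. This is only a minimal illustration: the function names `step`, `relu`, and `perceptron`, and the input, weight, and bias values, are assumptions chosen for this example and do not come from the figures.

```python
import numpy as np

def step(v):
    # Step activation: 1 if the weighted sum is positive, otherwise 0
    return 1 if v > 0 else 0

def relu(v):
    # ReLU activation: keeps positive values and replaces negative values with 0
    return max(0.0, v)

def perceptron(x, w, b, activate):
    # Equation (1): weighted sum of the inputs plus the bias
    weighted_sum = np.dot(x, w) + b
    # Equation (3): apply the activation function to the weighted sum
    return activate(weighted_sum)

x = np.array([2.0, 4.0, 1.0])     # illustrative input values (n = 3)
w = np.array([0.5, -0.25, 2.0])   # illustrative weights
b = 0.5                           # illustrative bias

y_step = perceptron(x, w, b, step)
y_relu = perceptron(x, w, b, relu)
print(y_step, y_relu)
```

Passing the activation function as an argument makes it easy to see that only the second stage changes when a different activation function is chosen.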

### **Python basic 1**

Using Python, calculate the following expressions.

(1) 1 + 2<br>
(2) 4 - 1 + 2<br>
(3) (4 + 5) $\times$ 4<br>
(4) 5 / (1 + 2 + 3)


In [None]:
a = 1 + 2
b = 4 - 1 + 2
c = (4 + 5) * 4
d = 5 / (1 + 2 + 3)

print(a)
print(b)
print(c)
print(d)

### **Python basic 2**

Using Python, calculate the following expressions when p = 1 and q = 2.

(1) p + q<br>
(2) p - q<br>
(3) p $\times$ q<br>
(4) p / q

In [None]:
p = 1
q = 2

a = p + q
b = p - q
c = p * q
d = p / q

print(a)
print(b)
print(c)
print(d)

### **Python basic 3**
Define a function add(x, y) for adding two input values x and y. Furthermore, calculate the result when x = 2 and y = 3.

In [None]:
def add(x, y):
  z = x + y
  return z

In [None]:
x = 2
y = 3

z = add(x, y)

print(z)

### **Code example**

We can write Equation (2) in Python code as follows.

In [None]:
_y = x1 * w1 + x2 * w2 + b

Let's calculate using the values $(x_1, x_2) = (2, 4)$, $(w_1, w_2) = (3, 5)$, and $b = 1$.

In [None]:
(x1, x2) = (2, 4)
(w1, w2) = (3, 5)
b = 1

_y = x1 * w1 + x2 * w2 + b

print(_y)

### **Practice 1-1**
1. Write a Python program that calculates the weighted sum for two input values $x_1$ and $x_2$ as a function _two_input_weight(). As arguments, take two input values x1, x2, two weight values w1, w2, and bias b, and return _y as follows.<br><br>
_y = _two_input_weight(x1, x2, w1, w2, b)

2. Furthermore, calculate results when $(x_1, x_2) = (2, 4), (w_1, w_2) = (3, 5)$, and $b = 1$.

In [None]:
import numpy as np

def _two_input_weight(x1, x2, w1, w2, b):

  _y = x1 * w1 + x2 * w2 + b
  return _y

In [None]:
(x1, x2) = (2, 4)
(w1, w2) = (3, 5)
b = 1

_y = _two_input_weight(x1, x2, w1, w2, b)

print (_y)

## Layer expression of neural network

A neural network is composed of a large number of neurons. However, if each group of neurons at the same depth is represented as a single network layer as shown in Figure 4, it can be seen that the outputs of neurons are propagated sequentially from network layer to network layer (Figure 5). As shown in Figure 4, a layer that connects all neurons between adjacent network layers and calculates a weighted sum and an activation is called a "fully connected layer".

<center>
<img src='https://drive.google.com/uc?export=view&id=1dGcNIotcQJuHzFpeQ6l6zVI2hd3r89Wc' width='30%'>
</center>

<center>Figure 4. Fully connected layer</center>

<center>
<img src='https://drive.google.com/uc?export=view&id=1F0CMnq3eF91nGk8qFf7vH4s7jxcd1Ua3' width='30%'>
</center>

<center>Figure 5. Layer expression of neural network </center>
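The layer-to-layer propagation in Figure 5 can be sketched as two fully connected layers applied one after the other. This is a rough illustration under stated assumptions: the helper `layer()`, the layer sizes, and all weight values are made up for this example, and the sigmoid activation used here is one of the activation functions mentioned above.

```python
import numpy as np

def sigmoid(v):
    # Sigmoid activation, applied element-wise
    return 1 / (1 + np.exp(-v))

def layer(x, W, b):
    # One fully connected layer: weighted sum followed by the activation
    return sigmoid(np.matmul(x, W) + b)

x = np.array([1.0, 2.0])               # 2 input values
W1 = np.array([[0.1, 0.2, 0.3],
               [0.4, 0.5, 0.6]])       # layer 1: 2 inputs -> 3 neurons
b1 = np.zeros(3)
W2 = np.array([[0.1], [0.2], [0.3]])   # layer 2: 3 inputs -> 1 neuron
b2 = np.zeros(1)

h = layer(x, W1, b1)   # output of layer 1 becomes the input of layer 2
y = layer(h, W2, b2)   # final output of the network
print(h.shape, y.shape)
```

The only requirement for chaining layers is that the number of outputs of one layer equals the number of inputs of the next.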

## Calculation for extending to fully connected layers

<center>
<img src='https://drive.google.com/uc?export=view&id=1vtzsNTep-RtQF0f0sN21L-nnTAhoR24q' width='70%'>
</center>

<center>Figure 6. Calculation flow of weighted sums in fully connected layers </center>

Figure 6 shows the flow of calculating a weighted sum in a fully connected layer consisting of two neurons for two input values. In Figure 6, if the two input values are $x_1$, $x_2$, the corresponding weights are $(w_{11}, w_{21})$, $(w_{12}, w_{22})$, and the biases are $b_1$, $b_2$, then the weighted sums $\_y_1$, $\_y_2$ can be calculated as follows.

$$\_y_1=x_1\cdot w_{11}+x_2\cdot w_{12}+b_1 \tag{4}$$
$$\_y_2=x_1\cdot w_{21}+x_2\cdot w_{22}+b_2 \tag{5}$$

Using the vector $\mathbf{x} = (x_1, x_2)$, the matrix $\mathbf{W} = ((w_{11}, w_{21}), (w_{12}, w_{22}))$, the vector $\mathbf{b} = (b_1, b_2)$, and the vector $\_\mathbf{y} = (\_y_1, \_y_2)$, Equations (4) and (5) can be calculated with the single formula shown in Equation (6).
Equation (6) can be used to calculate the weighted sum in a fully connected layer not only for two neurons with two input values but also for $m$ neurons with $n$ input values.

$$\_\mathbf{y}=\mathbf{xW}+\mathbf{b} \tag{6}$$

### **Python basic 4**
Using Python and its NumPy library, calculate (1) addition, (2) subtraction, and (3) inner product for the vectors a = (5, 2, 3) and b = (-1, 0, 1).

In [None]:
import numpy as np

a = np.array([5, 2, 3])
b = np.array([-1, 0, 1])

addition = a + b
subtraction = a - b
innerprod1 = np.dot(a, b) # inner product
innerprod2 = np.matmul(a, b) # inner product

print(addition)
print(subtraction)
print(innerprod1)
print(innerprod2)

### **Python basic 5**
Using Python and the NumPy library, calculate (1) addition, (2) element-wise product, and (3) matrix product when A = [[5, 2], [1, 4]] and B = [[-1, 0], [2, 3]].

In [None]:
A = np.array([[5, 2], [1, 4]])
B = np.array([[-1, 0], [2, 3]])


g = A + B
h = A * B
i = np.matmul(A, B)

print(g)
print(h)
print(i)

### **Code example**

We can write Equation (6) in Python code as follows.

In [None]:
_y = np.matmul(x, W) + b

Let's calculate for the vector and matrix values $\mathbf{x} = (x_1, x_2) = (2, 4)$, $\mathbf{W} = ((w_{11}, w_{21}), (w_{12}, w_{22})) = ((3, 5), (6, 5))$, and $\mathbf{b} = (b_1, b_2) = (1, 3)$. We can use NumPy's array() function to create arrays.

In [None]:
import numpy as np

x = np.array([2,4])
W = np.array([[3, 5],[6,5]])
b = np.array([1,3])

_y = np.matmul(x, W) + b

print(_y)

### **Practice 1-2**

1. Write a Python program that calculates the weighted sum of $m$ neurons for $n$ input values $\mathbf{x} = (x_1, x_2, \cdots , x_n)$ as a function fc_weight(). This function calculates the weighted sum of a fully connected layer. As arguments, take an array for vector $\mathbf{x}$ corresponding to the $n$ input values, a matrix $\mathbf{W}$ corresponding to the weight values of the $m$ neurons, and an array for vector $\mathbf{b}$ corresponding to the biases, and return an array for vector $\_\mathbf{y}$.
<br><br>
_y = fc_weight(x, W, b)

2. Furthermore, calculate the results for $\mathbf{x} = (x_1, x_2) = (2, 4)$, $\mathbf{W} = ((w_{11}, w_{21}), (w_{12}, w_{22})) = ((3, 5), (6, 5))$, $\mathbf{b} = (b_1, b_2) = (1, 3)$. Use NumPy's array() function to create arrays.

In [None]:
import numpy as np

def fc_weight(x, W, b):
  _y = np.matmul(x, W) + b

  return _y

Next, calculate the results for the values of $\mathbf{x}$, $\mathbf{W}$, and $\mathbf{b}$ given in item 2.

In [None]:
x = np.array([2,4])
W = np.array([[3, 5],[6,5]])
b = np.array([1,3])

_y = fc_weight(x, W, b)

print(_y)

## Applying activation function

A sigmoid function is any mathematical function whose graph has a characteristic S-shaped curve. Here, the sigmoid function in Equation (7) is applied as the activation function in Figure 3.

$$ S(x) = \frac{1}{1+e^{-x}} \tag{7}$$

<br>
<center>
<img src='https://drive.google.com/uc?export=view&id=1pZn0O6YanQoQVDrPRxWhZy7pGzs85pYv' width='40%'>
</center>

<center>Figure 7. Graph of sigmoid function </center>

### **Code example**

Let's define a sigmoid function sigmoid(x) using Python.

In [None]:
import numpy as np

def sigmoid(x):
    return 1/(1+np.exp(-x))

### **Python basic 6**
Draw a graph of sigmoid function using matplotlib library.

In [None]:
import matplotlib.pyplot as plt

fig = plt.figure()

graph = fig.add_subplot(111)
graph.grid(linestyle="dotted")
graph.set_xlabel("x")
graph.set_ylabel("y")

x = np.linspace(-10, 10, num=300)

y = sigmoid(x)
graph.plot(x, y)

plt.show()

## Final output of perceptron
As shown in Equation (8), the output of the perceptron is calculated by applying the activation function to the weighted sum. In particular, when a sigmoid function is applied, Equation (8) becomes Equation (9).

$$\mathbf{y} = Activate(\_\mathbf{y}) = Activate(\mathbf{xW}+\mathbf{b}) \tag{8}$$
$$\mathbf{y} = Activate(\_\mathbf{y}) = sigmoid(\mathbf{xW}+\mathbf{b}) \tag{9}$$

### **Code example**

Let's define a function fc_layer() that performs the calculation of a fully connected layer. Use a sigmoid function as the activation function. As arguments, take an array for vector $\mathbf{x}$ corresponding to the $n$ input values, a matrix $\mathbf{W}$ corresponding to the weight values of the $m$ neurons, and an array for vector $\mathbf{b}$ corresponding to the biases, and return an array for vector $\mathbf{y}$.

In [None]:
def fc_layer(x, W, b):
  _y = fc_weight(x, W, b)  # weighted sum, Equation (6)
  y = sigmoid(_y)          # activation is applied to the weighted sum, not to the input x

  return y

Furthermore, calculate the results for $\mathbf{x} = (x_1, x_2) = (2, 4)$, $\mathbf{W} = ((w_{11}, w_{21}), (w_{12}, w_{22})) = ((3, 5), (6, 5))$, and $\mathbf{b} = (b_1, b_2) = (1, 3)$.

In [None]:
x = np.array([2,4])
W = np.array([[3, 5],[6,5]])
b = np.array([1,3])

y = fc_layer(x, W, b)

print(y)

# Part-2

## What is natural language processing?

* Languages used in everyday conversation, such as Japanese, Thai, and English, are called natural languages.
* Natural language processing enables computers to perform machine translation, automatic summarization, text classification, context understanding, conversation generation, etc. through processes such as word and phrase extraction and dependency analysis of natural language.

* Application examples of natural language processing
 * Document classification
 * Search
 * Machine translation
 * Document summarization
 * Question answering
 * Dialogue
 * Part-of-speech tagging
 * Word segmentation
 * Word sense disambiguation
 * Named entity extraction
 * Parsing
 * Predicate-argument structure analysis


## Natural language processing and deep learning
* Traditionally, statistical methods have been used, such as estimating the probability of word appearance in a document with topic models and estimating parts of speech with hidden Markov models.

* In 2013, methods such as word2vec that use neural networks to learn distributed representations of words were devised. In addition, RNNs and LSTMs have been applied to natural language processing, and applications such as machine translation, dialogue generation, automatic summarization, and image description generation have expanded.

## Flow of natural language processing

1. Dataset preparation
2. Pre-processing
3. Quantifying words
4. Training on the dataset (building the applied model)
5. Classification and regression using applied models



## Pre-processing
* Pre-processing converts the text into a format that is easy for analysis programs to handle, using operations such as n-gram division and stop-word removal.

 * Unified notation
 * Lowercase and uppercase
 * Word replacement
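Stop-word removal and n-gram division mentioned above can be sketched as follows. The stop-word list here is a hypothetical one chosen only for this example; practical systems use much larger lists.

```python
text = 'I look at sky and you look in mirror.'

# Unify case and separate the period from the last word
words = text.lower().replace('.', ' .').split()

# Hypothetical stop-word list, chosen only for this example
stop_words = {'at', 'and', 'in', '.'}
content_words = [w for w in words if w not in stop_words]

# Word 2-grams (bigrams): every pair of adjacent words
bigrams = list(zip(content_words, content_words[1:]))

print(content_words)
print(bigrams)
```

Whether stop words should be removed depends on the task; for example, language models usually keep them because they are part of the context.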



### **Code example**

In [None]:
text = 'I look at sky and you look in mirror.'

In [None]:
print(text)

In [None]:
text = text.lower() # lowercase
text = text.replace('.', ' .') # separate period
words = text.split(' ') # Split words by white space

In [None]:
print (words)

## Quantifying words
* Convert words into numerical values so that relationships between words can be calculated quantitatively.
* The basic way of quantifying words is to index them and use the index number as a word ID.

In [None]:
def word2id(words):

  word_to_id = {}

  for word in words:
    if word not in word_to_id:
      new_id = len(word_to_id)
      word_to_id[word] = new_id

  return word_to_id

def id2word(word_to_id):
  id_to_word = {}
  # word_id is used instead of id to avoid shadowing Python's built-in id()
  for word, word_id in word_to_id.items():
    id_to_word[word_id] = word

  return id_to_word

### **Practice 2-1**
For the following sentence, use word2id() to convert words to IDs. Also, use id2word() to look up the corresponding word from an ID.<br><br>


In the mirror, you saw a bird flying across the blue sky.

In [None]:
# Indexing words
word_to_id = word2id(words)
print(word_to_id)

# Reverse lookup from word id
id_to_word = id2word(word_to_id)
print(id_to_word)

In [None]:
print(word_to_id['look'])
print(id_to_word[7])

## One-hot vector representation
* Another way of quantifying words is to use a one-hot vector representation.
* A one-hot vector is a vector whose element values are 0 or 1, and in which only one element is 1. Neural networks that process natural language often use one-hot vectors as input.

* For the sentence 'I look at sky and you look in mirror.', when it is decomposed as in the example above, a one-hot vector for each word can be generated from the word ID as shown below.

 * i: [1, 0, 0, 0, 0, 0, 0, 0, 0]
 * look: [0, 1, 0, 0, 0, 0, 0, 0, 0]
 * at: [0, 0, 1, 0, 0, 0, 0, 0, 0]

### **Code example**

In [None]:
# Convert a list of word IDs into one-hot format
def make_one_hot(corpus):
    N = corpus.shape[0]      # number of words in the corpus
    dim = len(word_to_id)    # vocabulary size (uses the global word_to_id)

    one_hot = np.zeros((N, dim), dtype=np.int32)
    for idx, word_id in enumerate(corpus):
        one_hot[idx, word_id] = 1   # set the position of the word ID to 1

    return one_hot


In [None]:
import numpy as np
print(words)
corpus = [word_to_id[word] for word in words]
corpus = np.array(corpus)

print(corpus)

make_one_hot(corpus)

### **Practice 2-2**
Let's convert the following sentence into a one-hot vector representation.<br><br>

In the mirror, you saw a bird flying across the blue sky.

## Language models and context
* The process by which words appear in a document is regarded as a stochastic process, and a model that calculates the probability that a word will appear in a certain position is called a language model.
* In a language model, the surrounding words used to calculate the probability of a word appearing are called context.
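As a rough, non-neural illustration of these two definitions, the sketch below estimates the probability of a word from its context by counting, using only the previous word as the context (a bigram model). The function name `bigram_prob` and the choice of a one-word context are assumptions made for this example.

```python
from collections import Counter

words = ['i', 'look', 'at', 'sky', 'and', 'you', 'look', 'in', 'mirror', '.']

# Count each word that has a successor, and each pair (previous word, next word)
unigram_counts = Counter(words[:-1])
bigram_counts = Counter(zip(words, words[1:]))

def bigram_prob(prev, word):
    # Estimate P(word | prev) = count(prev, word) / count(prev)
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob('look', 'at'))
print(bigram_prob('look', 'in'))
print(bigram_prob('i', 'look'))
```

A neural language model plays the same role, but it learns the probabilities with a network instead of a count table, which lets it use longer contexts.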

## Distributed representation of words (Word embedding)
* A distributed representation expresses a word as a vector.
* The following methods are commonly used to obtain distributed representations of words.

 * Word2Vec
 * GloVe
 * fastText

* Distributed representations of words can also be obtained by training them as part of a regular deep learning model.

## Word2Vec
* Method for generating distributed representations of words (word embeddings)

* Applies the CBOW model and the skip-gram model
* Invented by Tomas Mikolov et al. in 2013
* Learns the meaning representation of words based on the distributional hypothesis (the hypothesis that the meaning of a word is formed by its surrounding words).
* Learns with a two-layer neural network. The weights of the hidden layer become the distributed representation of the words.

* The generated distributed representation is a matrix of vectors that represent the meaning of each word, so the distance between words can be calculated, and relations such as the following hold approximately.

 * vector('Paris') - vector('France') + vector('Italy') = vector('Rome') <br>
 * vector('king') - vector('man') + vector('woman') = vector('queen')
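One way to see why the hidden-layer weights become the word vectors is that multiplying a one-hot input by the weight matrix selects a single row of that matrix. The sketch below uses random numbers as a stand-in for trained weights; the sizes and the name `W_in` are assumptions for this illustration only.

```python
import numpy as np

vocab_size, embed_dim = 9, 4                        # illustrative sizes
rng = np.random.default_rng(0)
W_in = rng.normal(size=(vocab_size, embed_dim))     # stand-in for trained hidden-layer weights

word_id = 1                                         # an arbitrary word ID
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1

hidden = np.matmul(one_hot, W_in)                   # hidden-layer output for this word

# The product simply picks out row `word_id` of the weight matrix
print(np.allclose(hidden, W_in[word_id]))
```

Each row of the weight matrix therefore serves as the embedding vector of the word with that ID.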


## CBOW model

* Predict a single word using multiple words as a context.
* The order of the context words does not matter.
<br>

<center>
<img src='https://drive.google.com/uc?export=view&id=17xtbUuEkH9Vot5HhNX2A8tJWXdWzKbCH' width='70%'>
</center>

<center>Figure 9. CBOW model </center>
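The training data for a CBOW-style model can be sketched as (context words, target word) pairs taken from a sliding window. The window size of one word on each side and the helper name `cbow_pairs` are assumptions for this example; actual word2vec implementations differ in details.

```python
words = ['i', 'look', 'at', 'sky', 'and', 'you', 'look', 'in', 'mirror', '.']
window = 1   # assumed window size: one word on each side of the target

def cbow_pairs(words, window):
    pairs = []
    for i in range(window, len(words) - window):
        # Words before and after position i form the context
        context = words[i - window:i] + words[i + 1:i + 1 + window]
        pairs.append((context, words[i]))   # (context words, target word)
    return pairs

pairs = cbow_pairs(words, window)
for context, target in pairs[:3]:
    print(context, '->', target)
```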



## Skip-gram model

* Predict multiple surrounding (context) words from one input word.
* Context words are weighted according to their positional proximity to the input word.

<center>
<img src='https://drive.google.com/uc?export=view&id=1AwKBI_Vqz5s2QsijFt4I3GSUN0znhly8' width='70%'>
</center>

<center>Figure 10. Skip-gram model </center>
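For the skip-gram direction, the training data can be sketched as (input word, context word) pairs: each word is paired with every word inside its window. The window size and the helper name `skipgram_pairs` are assumptions for this example, and this simplified sketch treats all context positions equally rather than weighting them by proximity.

```python
words = ['i', 'look', 'at', 'sky', 'and', 'you', 'look', 'in', 'mirror', '.']
window = 1   # assumed window size: one word on each side of the input word

def skipgram_pairs(words, window):
    pairs = []
    for i, center in enumerate(words):
        start = max(0, i - window)
        end = min(len(words), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, words[j]))   # (input word, one context word)
    return pairs

pairs = skipgram_pairs(words, window)
print(pairs[:5])
```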


### **Practice 2-3**
Explain how training sentences are used to create the word distributed representation matrix (embedding matrix) for the CBOW model and for the skip-gram model in Word2Vec, respectively.

## Creating a word embedding matrix
In order to create a word embedding matrix, a large set of sentences (a corpus) is required. Then, a neural model for extracting the embedding matrix can be trained on the corpus by applying word2vec, GloVe, and so on.

Step-1: Get a corpus that includes a set of sentences.<br>
Step-2: Train a neural model on the corpus by applying an embedding creation method such as word2vec or GloVe.

### Code example

Let's create a word embedding matrix using the corpus provided at the website below. Here, we use word2vec, which is provided in the gensim library.

(Website) http://mattmahoney.net/dc/textdata.html

In [None]:
!wget http://mattmahoney.net/dc/text8.zip

In [None]:
!unzip text8.zip

In [None]:
import logging
from gensim.models.word2vec import Word2Vec, Text8Corpus

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

sentences = Text8Corpus('text8')
model = Word2Vec(sentences, vector_size=100)

model.save('model.bin')

In [None]:
model = Word2Vec.load('model.bin')

# embedding vector of 'dog'
print(model.wv['dog'])
print(model.wv['dog'].shape)
# extract words with similar meaning to 'car'
model.wv.most_similar(['car'])

Let's check whether the following formula used in the above explanation holds true.

vector('Paris') - vector('France') + vector('Italy') = vector('Rome')


In [None]:
vector = model.wv['paris'] - model.wv['france'] + model.wv['italy']

model.wv.most_similar(vector)

### **Practice 2-4**
Let's try to see if the following formula holds true.

vector('king') - vector('man') + vector('woman') = vector('queen')

## Vector space model

* Algorithms for information retrieval
* Research began around 1970
 * The SMART system of Dr. Salton and others is famous.

* Represent the search target data and search words as vectors and place them in the vector space.
* Calculate the similarity between search target data and search terms using vector calculations (cosine, inner product, distance, etc.).

<center>
<img src='https://drive.google.com/uc?export=view&id=1Pj7FI_hkdMK-jlpqdmNkZvPRHQRWbJ01' width='60%'>
</center>

<center>Figure 11. Vector space model </center>

## Cosine similarity
The cosine measure is used as a measure of vector closeness. The proximity of vectors calculated by the cosine measure is called "cosine similarity". Let a query vector and a data vector be $\mathbf{q}$ and $\mathbf{d}$, respectively. Then the cosine similarity between $\mathbf{q}$ and $\mathbf{d}$ is calculated by Equation (10). Here, vector variables (e.g. $\mathbf{q}$ and $\mathbf{d}$) are written in bold.

$$ C(\mathbf{q}, \mathbf{d}) = \frac{\mathbf{q}\cdot \mathbf{d}}{|\mathbf{q}||\mathbf{d}|} \tag{10}$$

<center>
<img src='https://drive.google.com/uc?export=view&id=1T_iVHLGLR3tXiFXdU741klw8bCdDnUbr' width='50%'>
</center>
<center>Figure 14. Cosine measure </center>

Suppose that $\mathbf{q} = (1, 0, 1), \mathbf{d_1} = (1, 1, 1), \mathbf{d_2} = (0, 1, 1)$. Then the cosine similarities $C(\mathbf{q}, \mathbf{d_1})$ for $\mathbf{q}$ and $\mathbf{d_1}$ and $C(\mathbf{q}, \mathbf{d_2})$ for $\mathbf{q}$ and $\mathbf{d_2}$ are calculated as follows.

$$ C(\mathbf{q}, \mathbf{d_1}) = \frac{(1,0,1)\cdot (1,1,1)}{|(1,0,1)||(1,1,1)|} = \frac{1\times 1 + 0\times 1 + 1\times 1}{\sqrt{1^2+0^2+1^2 }\sqrt{1^2+1^2+1^2}} = \frac{2}{1.41\times 1.73} \approx 0.82$$
$$ C(\mathbf{q}, \mathbf{d_2}) = \frac{(1,0,1)\cdot (0,1,1)}{|(1,0,1)||(0,1,1)|} = \frac{1\times 0 + 0\times 1 + 1\times 1}{\sqrt{1^2+0^2+1^2 }\sqrt{0^2+1^2+1^2}} = \frac{1}{1.41\times 1.41} = 0.50$$
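The hand calculation above can be checked with NumPy. This is a minimal sketch of Equation (10); the function name `cosine` is an assumption chosen for this example.

```python
import numpy as np

def cosine(a, b):
    # Equation (10): inner product divided by the product of the vector lengths
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

q = np.array([1, 0, 1])
d1 = np.array([1, 1, 1])
d2 = np.array([0, 1, 1])

c1 = cosine(q, d1)
c2 = cosine(q, d2)
print(round(c1, 2), round(c2, 2))
```

Since $\mathbf{q}$ shares both of its non-zero elements with $\mathbf{d_1}$ but only one with $\mathbf{d_2}$, the first similarity is the larger of the two.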


### Code example

Let's create a function cosine_sim() that calculates the cosine similarity of embedded vectors, and try to calculate the cosine similarity between the embedded vectors of "car" and "truck".

In [None]:
# Calculate cosine similarity
def cosine_sim(w1, w2):
  cosine_value = np.dot(model.wv[w1], model.wv[w2]) / (np.linalg.norm(model.wv[w1]) * np.linalg.norm(model.wv[w2]))

  return cosine_value

print (cosine_sim('car','truck'))

# ('truck', 0.7117820978164673)

The cosine similarity is output as 0.7151368, which is the same value as the one listed for "truck" in the result of "model.wv.most_similar(['car'])" earlier. Note that the values may differ depending on the individual environment.

[('driver', 0.7645907402038574), <br>
 ('cars', 0.7256535291671753), <br>
 ('motorcycle', 0.7231885194778442), <br>
 ('taxi', 0.7163602113723755), <br>
 ('truck', 0.7151368856430054), <br>
 ('vehicle', 0.6943105459213257), <br>
...]

### **Practice 2-5**
Let's use the function cosine_sim() to calculate the cosine similarity of the embedded vectors of "car" and "train."

## Document-term matrix
* A matrix in which each document ($d_1$ to $d_m$) is characterized by the set of terms ($t_1$ to $t_n$) that appear in it

<center>
<img src='https://drive.google.com/uc?export=view&id=1ErQ36FOlADuk54RF2qvPih4OTCS-0qXN' width='40%'>
</center>
<center>Figure 12. Document-term matrix </center>
<br>

* Example of creating document-term matrix

There are five documents $d_1, d_2, d_3, d_4, d_5$, each of which includes five words.

Step-1: Extract all words<br>
In the example, from five documents, a set of words, apple, banana, grape, pear, strawberry, mango, melon, watermelon, peach, tangerine, mathematics, physics, French, geography, chemistry, Japanese, English, are extracted.

Step-2: Create a document-term matrix<br>
We make a document-term matrix, setting words on the horizontal axis and documents on the vertical axis.

Step-3: Add feature values<br>
For each document, add "1" if the word appears in the corresponding document, otherwise add "0".

<center>
<img src='https://drive.google.com/uc?export=view&id=171homWzJFytRIsIaRSCLm_75Lu0N1dLU' width='70%'>
</center>
<center>Figure 13. Basic creation flow of document-term matrix </center>
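The three steps can be sketched in code as follows. For brevity this example uses only two documents, with the word lists used for document 1 and document 2 in this lecture; the variable names are assumptions for this illustration.

```python
import numpy as np

docs = {
    'd1': ['apple', 'banana', 'grape', 'pear', 'strawberry'],
    'd2': ['grape', 'pear', 'mango', 'melon', 'watermelon'],
}

# Step-1: extract all words, keeping the order of first appearance
vocab = []
for terms in docs.values():
    for term in terms:
        if term not in vocab:
            vocab.append(term)

# Step-2 and Step-3: one row per document, one column per word,
# 1 if the word appears in the document, otherwise 0
dt_matrix = np.array([[1 if term in terms else 0 for term in vocab]
                      for terms in docs.values()])

print(vocab)
print(dt_matrix)
```

Each row of the resulting matrix is the vector of one document, which is how the document vectors in the next section are obtained.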

## Creation of document vector
In Figure 13, doc1 is represented as a vector $\mathbf{d}_1 = (1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ,0)$

Similarly, the one-hot vectors of the terms 'apple', 'banana', 'grape', 'pear', 'strawberry' are represented as follows.

$\mathbf{t}_{apple} = (1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)$<br>
$\mathbf{t}_{banana} = (0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)$<br>
$\mathbf{t}_{grape} = (0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0)$<br>
$\mathbf{t}_{pear} = (0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0)$<br>
$\mathbf{t}_{strawberry} = (0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0)$<br>

In this case, doc1 can be regarded as the sum of the vectors of the terms that appear in doc1.

$\mathbf{d}_1 = \mathbf{t}_{apple}+ \mathbf{t}_{banana}+ \mathbf{t}_{grape}+ \mathbf{t}_{pear}+ \mathbf{t}_{strawberry}$
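The sum of one-hot vectors can be confirmed with a short sketch. The vocabulary order follows the list of 17 words extracted in Step-1; the helper name `one_hot` is an assumption for this example.

```python
import numpy as np

vocab = ['apple', 'banana', 'grape', 'pear', 'strawberry', 'mango', 'melon',
         'watermelon', 'peach', 'tangerine', 'mathematics', 'physics', 'french',
         'geography', 'chemistry', 'japanese', 'english']

def one_hot(term):
    # Vector of zeros with a single 1 at the position of the term
    vec = np.zeros(len(vocab), dtype=int)
    vec[vocab.index(term)] = 1
    return vec

doc1_terms = ['apple', 'banana', 'grape', 'pear', 'strawberry']

# The document vector is the sum of the one-hot vectors of its terms
d1 = sum(one_hot(t) for t in doc1_terms)
print(d1)
```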

### Code example
Let's create a document vector by vectorizing the words in the document and adding those word vectors. Here, we use term embedded vectors instead of one-hot vectors.

document 1: apple, banana, grape, pear, strawberry

document 2: grape, pear, mango, melon, watermelon

In [None]:
doc_vec1 = model.wv['apple'] + model.wv['banana'] + model.wv['grape'] + model.wv['pear'] + model.wv['strawberry']

print (doc_vec1)

In [None]:
doc_vec2 = model.wv['grape'] + model.wv['pear'] + model.wv['mango'] + model.wv['melon'] + model.wv['watermelon']

print (doc_vec2)

Calculate the cosine similarity of document 1 and document 2.

In [None]:
cosine_value = np.dot(doc_vec1, doc_vec2) / (np.linalg.norm(doc_vec1) * np.linalg.norm(doc_vec2))

print(cosine_value)

### **Practice 2-6**
Let's modify the cosine_sim function and create a function cosine_sim_vec(v1, v2) that takes two vectors as input and returns the cosine similarity.

### **Practice 2-7**
The following equations show that semantic calculation is possible through vector operations on word embedded vectors.

vector('Paris') - vector('France') + vector('Italy') = vector('Rome')

vector('king') - vector('man') + vector('woman') = vector('queen')

<br>
1. Using embedded vectors by word2vec, create document vectors for the right side and the left side of the above equations, respectively.<br>
2. Using the function cosine_sim_vec() from Practice 2-6, calculate the cosine similarity between the two sides.<br>
3. Based on the results of (2), check whether the equations hold or not.



### **Practice 2-8**

There are three documents d3, d4, d5, each of which includes five words, as follows.

d3: peach, melon, banana, strawberry, orange<br>
d4: mathematics, physics, french, geography, chemistry<br>
d5: mathematics, chemistry, japanese, geography, english<br>

1.   Create each document vector for d3, d4, d5.
2.   Calculate the cosine similarity of documents 3 and 4, and the cosine similarity of documents 4 and 5, respectively.
3.   Explain the similarity between documents 3 and 4 (how similar they are) and the similarity between documents 4 and 5 based on the content of the documents and the values of the cosine similarity.




## References
* https://code.google.com/archive/p/word2vec/
* Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.
* François Chollet, Deep Learning with Python