
# Advanced Machine Learning (2022) Final Report Assignment

Answer Questions 1 to 4 (in either Japanese or English). Submit the report in either PDF (.pdf) or Jupyter Notebook (.ipynb) format.

## Question 1 (50 points)

Consider a convolutional neural network (CNN) that predicts a label $\hat{y} \in \{0, 1\}$ for a given sentence $\boldsymbol{X} \in \mathbb{R}^{d \times T}$. Here, a sentence is represented by a matrix $\boldsymbol{X} = (\boldsymbol{x}_1, \boldsymbol{x}_2, \dots, \boldsymbol{x}_T)$ consisting of a concatenation of $T$ word embeddings, $\boldsymbol{x}_1, \boldsymbol{x}_2, \dots, \boldsymbol{x}_T \in \mathbb{R}^d$, where $d$ is the size of word embeddings, and $T$ is the number of words in the sentence.

The following equations define the whole architecture of the CNN.

\begin{align}
\hat{y} &= \begin{cases}
1 & (0.5 < p) \\
0 & (p \leq 0.5)
\end{cases} \\
p &= \sigma(\boldsymbol{v}^\top \boldsymbol{s}) \\
\boldsymbol{s} &= \max(\boldsymbol{c}_1, \dots, \boldsymbol{c}_{T-\delta+1}) \\
\boldsymbol{c}_t &= {\rm ReLU}(\boldsymbol{W} \boldsymbol{x}_{t:t+\delta-1} + \boldsymbol{b}) & (\forall t \in \{1, \dots, T-\delta+1\}) \\
\boldsymbol{x}_{t:t+\delta-1} &= \boldsymbol{x}_{t} \oplus \boldsymbol{x}_{t+1} \oplus \dots \oplus \boldsymbol{x}_{t+\delta-1}
\end{align}

Here:

+ $\boldsymbol{W} \in \mathbb{R}^{m \times \delta d}$, $\boldsymbol{b} \in \mathbb{R}^m, \boldsymbol{v} \in \mathbb{R}^m$ are the model parameters;
+ $m$ denotes the number of output channels of the CNN;
+ $\delta$ denotes the width (kernel size) of the convolution;
+ $\sigma(\cdot)$ denotes the standard sigmoid function;
+ $\max(\cdot)$ represents the max pooling operation (element-wise maximum over positions);
+ ${\rm ReLU}(\cdot)$ denotes the ReLU activation function;
+ $\oplus$ represents a concatenation of vectors.
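
The definitions above translate almost line by line into NumPy. The following is a minimal sketch rather than a reference implementation: the function name `cnn_forward`, the 0-based window index, and the toy inputs are assumptions introduced here. Each window $\boldsymbol{x}_{t:t+\delta-1}$ is built by flattening a $d \times \delta$ slice of $\boldsymbol{X}$ column by column, so that consecutive embeddings are stacked one after another, as $\oplus$ requires.

```python
import numpy as np

def cnn_forward(X, W, b, v, delta):
    """Forward pass of the CNN defined above (hypothetical helper, not part of the assignment)."""
    d, T = X.shape
    # Column t holds x_t ⊕ ... ⊕ x_{t+delta-1}; order="F" flattens the slice column by column.
    windows = np.stack(
        [X[:, t:t + delta].ravel(order="F") for t in range(T - delta + 1)], axis=1
    )
    C = np.maximum(W @ windows + b[:, None], 0.0)  # ReLU; column t is c_{t+1}
    s = C.max(axis=1)                              # max pooling over positions
    p = 1.0 / (1.0 + np.exp(-(v @ s)))             # sigmoid
    y_hat = int(p > 0.5)
    return C, s, p, y_hat

# Toy check with d=2, T=3, delta=2, m=1 (values chosen only for illustration).
X_toy = np.array([[1.0, 0.0, 2.0],
                  [0.0, 1.0, 0.0]])
W_toy = np.array([[1.0, 0.0, 0.0, 1.0]])  # adds x_t[0] and x_{t+1}[1]
b_toy = np.array([0.0])
v_toy = np.array([1.0])
C_toy, s_toy, p_toy, y_hat_toy = cnn_forward(X_toy, W_toy, b_toy, v_toy, delta=2)
print(C_toy, s_toy, y_hat_toy)
```

Flattening with the default row-major order would interleave the components of neighbouring embeddings instead of concatenating them, which is why `order="F"` is used.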

Setting the hyperparameters $d=3, m=2, \delta=2$, we initialize the model parameters as follows.

\begin{align}
\boldsymbol{W} &= \begin{pmatrix}
-3 & -2 & -1 & -1 & -2 & -3 \\
3 & 2 & 3 & 2 & 3 & 2
\end{pmatrix} \\
\boldsymbol{b} &= \begin{pmatrix}
-0.2 \\ 0.1
\end{pmatrix} \\
\boldsymbol{v} &= \begin{pmatrix}
-1 \\ 2
\end{pmatrix}
\end{align}

Suppose that we give a negative ($y=0$) training instance with the sentence ($T = 5$),

\begin{align}
\boldsymbol{X} &= \begin{pmatrix}
-0.3 & 0 & 0.1 & 0 & 0 \\
-0.2 & -0.1 & 0 & 0.1 & 0 \\
-0.1 & -0.2 & 0.1 & 0 & 0.1
\end{pmatrix} ,
\end{align}
to the CNN model. Answer the following questions.



In [65]:
import numpy as np
X = np.array([[-0.3, 0, 0.1, 0,0],
              [-0.2, -0.1, 0, 0.1, 0],
              [-0.1, -0.2, 0.1, 0, 0.1]])

W = np.array([[-3, -2, -1, -1, -2, -3],
              [3, 2, 3, 2, 3, 2]])
b = np.array([-0.2, 0.1])
v = np.array([-1, 2])



**(1)** Find the value of the vector $\boldsymbol{x}_{3:4}$.

$\boldsymbol{x}_{3:4} = (0.1, 0, 0.1, 0, 0.1, 0)^\top$

In [66]:
# x_{3:4} = x_3 ⊕ x_4: flatten columns 3 and 4 of X column by column.
x_34 = X[:, 2:4].ravel(order="F")
x_34

array([0.1, 0. , 0.1, 0. , 0.1, 0. ])

**(2)** Find the values of the hidden vectors $\boldsymbol{c}_1, \boldsymbol{c}_2, \boldsymbol{c}_3, \boldsymbol{c}_4$.

In [67]:
# Collect c_1..c_4 as the columns of a 2x4 matrix.
# Each window x_{t:t+1} stacks two consecutive columns of X,
# hence the column-major flatten (order="F").
windows = np.stack([X[:, t:t+2].ravel(order="F") for t in range(4)], axis=1)
pre = W @ windows + b[:, None]  # pre-activations, shape (2, 4)
print(pre)
c = np.maximum(pre, 0)          # ReLU
print(c)

[[ 2.  -0.2 -0.8 -0.7]
 [-2.2 -0.3  1.   0.5]]
[[2.  0.  0.  0. ]
 [0.  0.  1.  0.5]]


\begin{align}
\boldsymbol{c}_1 &= \begin{pmatrix}
2.0 \\
0 \\
\end{pmatrix} \\
\boldsymbol{c}_2 &= \begin{pmatrix}
0 \\
0 \\
\end{pmatrix}  \\
\boldsymbol{c}_3 &= \begin{pmatrix}
0 \\
1.0 \\
\end{pmatrix}  \\
\boldsymbol{c}_4 &= \begin{pmatrix}
0 \\
0.5 \\
\end{pmatrix}
\end{align}

**(3)** Find the value of the vector $\boldsymbol{s}$.


In [68]:
# Max pooling: element-wise maximum over the four positions.
s = c.max(axis=1)
s

array([2., 1.])

$\boldsymbol{s} = (2.0, 1.0)^\top$

**(4)** Find the value of $p$.

In [69]:
def sigmoid(a):
  return 1 / (1 + np.exp(-a))
p = sigmoid(v @ s)
p

0.5

**(5)** Write the formula of the binary cross-entropy loss between the correct label $y$ and the probability estimate $p$.

$l(p,y) = -y\log{p}-(1-y)\log{(1-p)}$

**(6)** Compute the loss value by using the formula of (5) for the training instance.

In [81]:
def calc_loss(p, y):
  # Binary cross-entropy from (5).
  return -y * np.log(p) - (1 - y) * np.log(1 - p)

# The training instance is negative, so y = 0.
round(float(calc_loss(p, 0)), 4)

0.6931
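
As a quick check of this value: at $p = 0.5$ both logarithms in the formula of (5) are equal, so the loss does not depend on the label and reduces to $\log 2$.

\begin{align}
l(0.5, y) &= -y \log 0.5 - (1 - y) \log 0.5 = -\log 0.5 = \log 2 \approx 0.6931
\end{align}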

**(7)** Compute the gradient of the loss function with respect to $\boldsymbol{v}$ for the training instance.

In [82]:
import torch
dtype = torch.float

# s = max_t c_t, rebuilt from the windows x_t ⊕ x_{t+1}
# (column-major flatten of X[:, t:t+2]).
windows = np.stack([X[:, t:t+2].ravel(order="F") for t in range(4)], axis=1)
st = torch.tensor(np.maximum(W @ windows + b[:, None], 0).max(axis=1), dtype=dtype)
v = torch.tensor([-1, 2], dtype=dtype, requires_grad=True)
print(st, v)

def sigmoid_tensor(a):
  return 1 / (1 + torch.exp(-a))

def calc_loss_tensor(p, y):
  # Binary cross-entropy from (5).
  return -y * torch.log(p) - (1 - y) * torch.log(1 - p)

pt = sigmoid_tensor(torch.dot(st, v))
print(pt)
loss = calc_loss_tensor(pt, 0)  # negative training instance: y = 0
print(loss)
loss.backward()
v.grad

tensor([2., 1.]) tensor([-1.,  2.], requires_grad=True)
tensor(0.5000, grad_fn=<MulBackward0>)
tensor(0.6931, grad_fn=<SubBackward0>)


tensor([1.0000, 0.5000])
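
The autograd result can be cross-checked by hand. For the loss in (5) with $p = \sigma(\boldsymbol{v}^\top \boldsymbol{s})$, the derivative with respect to the logit $\boldsymbol{v}^\top \boldsymbol{s}$ is $p - y$, so the chain rule gives $\partial l / \partial \boldsymbol{v} = (p - y)\,\boldsymbol{s}$. The sketch below compares this closed form with a central finite difference. The helper names and the values of `s_ex` and `v_ex` are assumptions chosen for illustration, not the assignment's numbers.

```python
import numpy as np

def bce_from_logit(v, s, y):
    """Binary cross-entropy of p = sigmoid(v . s) against the label y."""
    p = 1.0 / (1.0 + np.exp(-(v @ s)))
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

def grad_v_closed_form(v, s, y):
    """Closed-form gradient (p - y) * s."""
    p = 1.0 / (1.0 + np.exp(-(v @ s)))
    return (p - y) * s

def grad_v_numeric(v, s, y, eps=1e-6):
    """Central finite-difference approximation of the same gradient."""
    g = np.zeros_like(v)
    for i in range(len(v)):
        step = np.zeros_like(v)
        step[i] = eps
        g[i] = (bce_from_logit(v + step, s, y) - bce_from_logit(v - step, s, y)) / (2 * eps)
    return g

s_ex = np.array([0.7, 1.3])   # illustrative pooled features
v_ex = np.array([0.4, -0.2])  # illustrative output weights
for y_ex in (0, 1):
    print(y_ex, grad_v_closed_form(v_ex, s_ex, y_ex), grad_v_numeric(v_ex, s_ex, y_ex))
```

The sign of $p - y$ depends on the label, so the gradient points in opposite directions for $y = 0$ and $y = 1$, even when $p = 0.5$ makes the two loss values identical.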

**(8)** Compute the gradients of the loss function with respect to $\boldsymbol{W}$ for the training instance.

In [126]:
W = torch.tensor([[-3, -2, -1, -1, -2, -3],
                  [3, 2, 3, 2, 3, 2]], dtype=torch.float, requires_grad=True)
X = torch.tensor([[-0.3, 0, 0.1, 0, 0],
                  [-0.2, -0.1, 0, 0.1, 0],
                  [-0.1, -0.2, 0.1, 0, 0.1]], dtype=torch.float)
b = torch.tensor([-0.2, 0.1], dtype=torch.float)
v = torch.tensor([-1, 2], dtype=torch.float)
# Column t of trans_X is the window x_{t+1} ⊕ x_{t+2}; transposing before
# the reshape flattens each 3x2 slice column by column.
trans_X = torch.stack([X[:, t:t+2].T.reshape(-1) for t in range(4)], dim=1)
c = torch.relu(W @ trans_X + b.reshape(2, 1))
s = torch.max(c, 1).values
p = torch.sigmoid(torch.dot(s, v))
y = 0  # negative training instance
loss = -y * torch.log(p) - (1 - y) * torch.log(1 - p)
print(loss)
loss.backward()
W.grad

tensor(0.6931, grad_fn=<SubBackward0>)


tensor([[0.1500, 0.1000, 0.0500, 0.0000, 0.0500, 0.1000],
        [0.1000, 0.0000, 0.1000, 0.0000, 0.1000, 0.0000]])
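
Only one window per output channel contributes to the gradient with respect to $\boldsymbol{W}$. Max pooling passes the gradient to the winning position $t^*_k = \arg\max_t c_{t,k}$ of channel $k$, and ReLU passes it on only if that activation is positive. Row $k$ of the gradient is therefore $(p - y)\, v_k\, \boldsymbol{x}_{t^*_k : t^*_k + \delta - 1}^\top$, or zero if the maximum is 0. The following is a minimal NumPy sketch of this rule, checked against finite differences. The helper names and the random toy inputs are assumptions made for illustration.

```python
import numpy as np

def make_windows(X, delta):
    # Column t is x_t ⊕ ... ⊕ x_{t+delta-1} (column-major flatten of a d x delta slice).
    T = X.shape[1]
    return np.stack([X[:, t:t + delta].ravel(order="F") for t in range(T - delta + 1)], axis=1)

def loss_fn(W, X, b, v, y, delta):
    C = np.maximum(W @ make_windows(X, delta) + b[:, None], 0.0)
    p = 1.0 / (1.0 + np.exp(-(v @ C.max(axis=1))))
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

def grad_W_closed_form(W, X, b, v, y, delta):
    win = make_windows(X, delta)
    C = np.maximum(W @ win + b[:, None], 0.0)
    t_star = C.argmax(axis=1)        # winning position per channel
    s = C.max(axis=1)
    p = 1.0 / (1.0 + np.exp(-(v @ s)))
    active = (s > 0).astype(float)   # ReLU gate at the winning position
    return ((p - y) * v * active)[:, None] * win[:, t_star].T

rng = np.random.default_rng(0)
X_ex = rng.normal(size=(3, 5))
W_ex = rng.normal(size=(2, 6))
b_ex = rng.normal(size=2)
v_ex = rng.normal(size=2)
G = grad_W_closed_form(W_ex, X_ex, b_ex, v_ex, y=0, delta=2)

# Central finite differences, entry by entry.
G_num = np.zeros_like(W_ex)
eps = 1e-6
for i in range(W_ex.shape[0]):
    for j in range(W_ex.shape[1]):
        E = np.zeros_like(W_ex)
        E[i, j] = eps
        G_num[i, j] = (loss_fn(W_ex + E, X_ex, b_ex, v_ex, 0, 2)
                       - loss_fn(W_ex - E, X_ex, b_ex, v_ex, 0, 2)) / (2 * eps)
print(np.max(np.abs(G - G_num)))
```

The finite-difference comparison is only meaningful away from ties in the max and away from the ReLU kink at 0, where the loss is not differentiable.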

## Question 2 (20 points)

Give names of two datasets that can be used to evaluate the quality of word embeddings, and explain the datasets with the following perspectives.

+ Brief explanation of the task for the evaluation.
+ Statistics of the dataset (e.g., the number of instances)
+ Measure(s) for evaluating the quality

## Question 3 (20 points)

Explain two reasons why Transformers are superior to Recurrent Neural Networks
(RNNs) in sequence-to-sequence tasks such as Machine Translation.

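**Reason 1**

Let $n$ be the sequence length and $d$ the dimension of the vectors that represent each word. The per-layer computational complexity of the self-attention used in the Transformer is $O(n^2 d)$, whereas that of a recurrent layer is $O(n d^2)$. In language processing, $n < d$ usually holds, so the Transformer has the advantage.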
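
To make the comparison concrete, the two per-layer operation counts can be evaluated for a sentence-level setting. The values $n = 50$ and $d = 512$ below are illustrative assumptions, and constant factors are ignored.

```python
# Per-layer operation counts, up to constant factors (illustrative values only).
n, d = 50, 512                 # assumed sequence length and representation dimension

self_attention_ops = n**2 * d  # O(n^2 d)
recurrent_ops = n * d**2       # O(n d^2)

print(self_attention_ops)      # 1280000
print(recurrent_ops)           # 13107200
print(recurrent_ops / self_attention_ops)  # ratio equals d / n = 10.24
```

When $n$ exceeds $d$, as with very long sequences, the ordering reverses and the self-attention layer becomes the more expensive one.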

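**Reason 2**

The Transformer is built on self-attention, which connects any two positions in the sequence in $O(1)$ steps. An RNN, in contrast, computes each state from the previous one, so a signal needs $O(n)$ steps to travel between two positions. Shorter paths make long-range dependencies easier to learn, so the Transformer is superior to the RNN in this respect as well.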


## Question 4 (10 points)

Implement the code for using a pre-trained **language** model. Show the code and its output as well as the following information:

+ The detail of the pre-trained language model, for example,
    + https://huggingface.co/EleutherAI/gpt-j-6B
    + https://huggingface.co/rinna/japanese-gpt-1b
    + https://huggingface.co/facebook/blenderbot-400M-distill
+ The task addressed by the model (e.g., "text generation", "summarization", "chatbot")
