# CS224N: PyTorch Tutorial (Winter '21)  CS224N：Pytorch教程（2021年冬）

### Author: Dilara Soylu
### Translation & exercises: adapted from DeepL machine translation output 翻譯&練習題：修改自DeepL自動翻譯結果

In this notebook, we give a basic introduction to `PyTorch` and work through a toy NLP task. The following resources were used in preparing this notebook:
* ["Word Window Classification" tutorial notebook](https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1204/materials/ww_classifier.ipynb) by Matt Lamm, from the Winter 2020 offering of CS224N
* Official PyTorch Documentation on [Deep Learning with PyTorch: A 60 Minute Blitz](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html) by Soumith Chintala
* PyTorch Tutorial Notebook, [Build Basic Generative Adversarial Networks (GANs) | Coursera](https://www.coursera.org/learn/build-basic-generative-adversarial-networks-gans) by Sharon Zhou, offered on Coursera

Many thanks to Angelica Sun and John Hewitt for their feedback.


在這本筆記本中，我們將對 "PyTorch"進行基本介紹，並實作一個簡單的範例NLP任務。在準備本notebook時，我們使用了以下資源：
* ["單詞窗口分類"教程筆記本](https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1204/materials/ww_classifier.ipynb)，作者Matt Lamm，來自CS224N的2020年冬季課程。
* PyTorch官方文檔[Deep Learning with PyTorch: A 60 Minute Blitz](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html) 作者：Soumith Chintala
* PyTorch教學筆記本，[構建基本生成對抗網路（GANs）|Coursera](https://www.coursera.org/learn/build-basic-generative-adversarial-networks-gans)，作者Sharon Zhou，在Coursera提供。

非常感謝 Angelica Sun 和 John Hewitt 的回饋。

# Part I.
## Introduction 簡介
[PyTorch](https://pytorch.org/) is a machine learning framework used in both academia and industry for a variety of applications. PyTorch started off as a more flexible alternative to [TensorFlow](https://www.tensorflow.org/), another popular machine learning framework. At the time of its release, `PyTorch` appealed to users because of its user-friendly design: rather than defining a static graph before performing any operation, as in `TensorFlow`, `PyTorch` let users define their operations as they go, an approach that `TensorFlow` also adopted in later releases. Although `TensorFlow` is more widely used in industry, `PyTorch` is often the preferred machine learning framework for researchers. If you would like to learn more about the differences between the two, you can check out [this](https://blog.udacity.com/2020/05/pytorch-vs-tensorflow-what-you-need-to-know.html) blog post.

[PyTorch](https://pytorch.org/)是一個機器學習框架，在學術界和工業界都被用於各種應用。`PyTorch` 最初是作為[TensorFlow](https://www.tensorflow.org/)的一個更靈活的替代品，後者是另一個流行的機器學習框架。在發布時，`PyTorch` 因其用戶友好的特性而吸引了使用者：相對於像 `TensorFlow` 那樣在執行操作前定義靜態圖，`PyTorch` 允許用戶在執行時定義他們的操作，`TensorFlow` 在其後續版本中也采用了這樣的方法。盡管 `TensorFlow` 在業界中更受青睞，但 `PyTorch` 通常是研究人員首選的機器學習框架。如果你想了解更多關於兩者之間的區別，你可以查看[這篇](https://blog.udacity.com/2020/05/pytorch-vs-tensorflow-what-you-need-to-know.html)部落格文章。

Now that we know a little about the background of `PyTorch`, let's import it into our notebook. To install `PyTorch`, you can follow the installation instructions on the [PyTorch website](https://pytorch.org/). Alternatively, you can open this notebook using `Google Colab`, which already has `PyTorch` installed in its base kernel. Once you are done with the installation process, run the following cell:

現在我們已經充分了解了 `PyTorch` 的背景，讓我們開始把它導入我們的筆記本。要安裝 `PyTorch`，你可以按照這裡的說明。或者，你也可以用 `Google Colab` 打開這個筆記本，它的基本內核已經安裝了 `PyTorch`。一旦你完成了安裝過程，執行以下Cell：

In [14]:
import torch
import torch.nn as nn

# Import pprint, module we use for making our print statements prettier
# 導入pprint，用來使我們的print結果更漂亮
import pprint

pp = pprint.PrettyPrinter()

We are all set to start our tutorial. Let's dive in!

設定完畢，以下開始教學

## Tensors 張量

Tensors are the most basic building blocks in `PyTorch`. Tensors are similar to matrices, but they have extra properties and can have more than two dimensions. For example, a square image with 256 pixels on each side can be represented by a `3x256x256` tensor, where the first dimension (of size 3) represents the color channels: red, green, and blue.

張量是 `PyTorch` 中最基本的構建模塊。張量類似於矩陣，但它們有額外的屬性，可以表示更高的維度。例如，一個每邊有256個像素的正方形圖像可以用 `3x256x256` 張量來表示，其中第一個維度（大小為3）代表顏色通道，即紅、綠和藍。




### Tensor Initialization 張量初始化

There are several ways to instantiate tensors in `PyTorch`, which we will go through next. 

在 `PyTorch` 中，有幾種實例化張量的方法，我們接下來將介紹這些方法。

#### **From a Python List** 從 Python List 初始化張量

We can initialize a tensor from a `Python` list, which may include sublists. The dimensions and the data type are inferred automatically by `PyTorch` when we use [`torch.tensor()`](https://pytorch.org/docs/stable/generated/torch.tensor.html).

我們可以從一個 `Python` 列表中初始化一個張量，該列表可以包括子列表。當我們使用[`torch.tensor()`](https://pytorch.org/docs/stable/generated/torch.tensor.html)時，尺寸和數據類型將由`PyTorch`自動推斷出來。

In [15]:
# Initialize a tensor from a Python List
data = [[0, 1], [2, 3], [4, 5]]
x_python = torch.tensor(data)

# Print the tensor
x_python

tensor([[0, 1],
        [2, 3],
        [4, 5]])

We can also call `torch.tensor()` with the optional `dtype` parameter, which will set the data type. Some useful datatypes to be familiar with are: `torch.bool`, `torch.float`, and `torch.long`.

我們也可以在使用 `torch.tensor()` 時選擇 `dtype` 參數，可以設置資料型態。一些需要熟悉的有用的資料型態有 `torch.bool`, `torch.float`, 和`torch.long`。

In [16]:
# We are using the dtype to create a tensor of particular type
x_float = torch.tensor(data, dtype=torch.float)
x_float

tensor([[0., 1.],
        [2., 3.],
        [4., 5.]])

In [17]:
# We are using the dtype to create a tensor of particular type
x_bool = torch.tensor(data, dtype=torch.bool)
x_bool

tensor([[False,  True],
        [ True,  True],
        [ True,  True]])

We can also get the same tensor in our specified data type using methods such as `float()`, `long()` etc. 

我們也可以使用 `float()`, `long()` 等方法獲得我們指定的資料型態的相同張量。

In [18]:
x_python.float()

tensor([[0., 1.],
        [2., 3.],
        [4., 5.]])

We can also use the `torch.FloatTensor`, `torch.LongTensor`, and `torch.Tensor` classes to instantiate a tensor of a particular type. `LongTensor`s are particularly important in NLP, as many methods that deal with indices require the indices to be passed as a `LongTensor`, whose elements are 64-bit integers.

我們也可以使用 `torch.FloatTensor`, `torch.LongTensor`, `torch.Tensor` 類別來實例化一個特定類型的張量。`LongTensor`在NLP中特別重要，因為許多處理索引值的方法需要將索引值作為 `LongTensor` 傳遞，其元素是 64-bit 的整數。

In [19]:
# `torch.Tensor` defaults to float
# Same as torch.FloatTensor(data)
x = torch.Tensor(data)
x

tensor([[0., 1.],
        [2., 3.],
        [4., 5.]])
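
To make the data type rules above concrete, here is a minimal sketch that checks which `dtype` each constructor produces. It assumes the default floating point type of `PyTorch` (`torch.float32`) has not been changed, and the variable names are illustrative only.

```python
import torch

data = [[0, 1], [2, 3], [4, 5]]

# torch.tensor() infers the dtype from the data: Python ints become 64-bit integers
x_int = torch.tensor(data)

# A single float in the data makes the whole tensor floating point
x_mixed = torch.tensor([[0.5, 1], [2, 3]])

# torch.Tensor() produces the default float type, whatever the input values are
x_default = torch.Tensor(data)

# Conversion methods such as long() return a new tensor of the requested type
x_long = x_default.long()

print(x_int.dtype, x_mixed.dtype, x_default.dtype, x_long.dtype)
```

Because `torch.long` is an alias for `torch.int64`, a tensor built from a list of integer indices with `torch.tensor()` can already be used wherever a `LongTensor` is expected.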

#### **From a NumPy Array** 從Numpy Array初始化
We can also initialize a tensor from a `NumPy` array. 

我們也可以從一個 `NumPy` 陣列中初始化一個張量。


In [20]:
import numpy as np

# Initialize a tensor from a NumPy array
ndarray = np.array(data)
x_numpy = torch.from_numpy(ndarray)

# Print the tensor
x_numpy

tensor([[0, 1],
        [2, 3],
        [4, 5]])
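
One detail worth knowing is that `torch.from_numpy()` does not copy the data: the tensor and the array share the same memory. The sketch below contrasts this with `torch.tensor()`, which makes a copy. The variable names are our own choice for illustration.

```python
import numpy as np
import torch

ndarray = np.array([[0, 1], [2, 3], [4, 5]])

shared = torch.from_numpy(ndarray)  # shares memory with ndarray
copied = torch.tensor(ndarray)      # independent copy of the data

# Modify the NumPy array in place
ndarray[0, 0] = 100

print(shared[0, 0].item())  # the shared tensor sees the change
print(copied[0, 0].item())  # the copy keeps the old value

# Converting back with numpy() also shares memory (CPU tensors only)
back = shared.numpy()
```

If we need a tensor that is safe from later changes to the array, copying with `torch.tensor()` is the simpler choice.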

#### **From a Tensor** 從一個張量初始化
We can also initialize a tensor from another tensor, using the following methods:

* `torch.ones_like(old_tensor)`: Initializes a tensor of `1s`.
* `torch.zeros_like(old_tensor)`: Initializes a tensor of `0s`.
* `torch.rand_like(old_tensor)`: Initializes a tensor where all the elements are sampled from a uniform distribution between `0` and `1`.
* `torch.randn_like(old_tensor)`: Initializes a tensor where all the elements are sampled from a normal distribution.

All of these methods preserve the tensor properties of the original tensor passed in, such as the `shape` and `device`, which we will cover in a bit. 

我們也可以從另一個張量中初始化一個張量，使用以下方法。

* `torch.ones_like(old_tensor)`: 初始化一個全部都是 `1` 的張量。
* `torch.zeros_like(old_tensor)`: 初始化一個全部都是 `0` 的張量。
* `torch.rand_like(old_tensor)`: 初始化一個張量，其中所有的元素都是從 `0` 和 `1` 之間的均勻分布中採樣的。
* `torch.randn_like(old_tensor)`: 初始化一個張量，其中所有的元素都是從常態分布中採樣的。

所有這些方法都保留了傳入的原始張量的張量屬性，比如 `shape`和 `device`，我們稍後會介紹這些。

In [21]:
# Initialize a base tensor
x = torch.tensor([[1.0, 2], [3, 4]])
x

tensor([[1., 2.],
        [3., 4.]])

In [22]:
# Initialize a tensor of 0s
x_zeros = torch.zeros_like(x)
x_zeros

tensor([[0., 0.],
        [0., 0.]])

In [23]:
# Initialize a tensor of 1s
x_ones = torch.ones_like(x)
x_ones

tensor([[1., 1.],
        [1., 1.]])

In [24]:
# Initialize a tensor where each element is sampled from a uniform distribution
# between 0 and 1
x_rand = torch.rand_like(x)
x_rand

tensor([[0.8420, 0.0232],
        [0.4553, 0.5807]])

In [25]:
# Initialize a tensor where each element is sampled from a normal distribution
x_randn = torch.randn_like(x)
x_randn

tensor([[-0.7074,  0.1293],
        [-0.7968, -1.8529]])
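
As a quick check of the claim that the `*_like` functions preserve tensor properties, the following sketch builds a base tensor with a non-default data type and inspects the results. The seed value and the variable names are arbitrary assumptions made for this example.

```python
import torch

torch.manual_seed(0)  # fix the seed so the random values are reproducible

# A base tensor with a non-default dtype (64-bit floats)
base = torch.tensor([[1, 2], [3, 4]], dtype=torch.float64)

zeros = torch.zeros_like(base)
uniform = torch.rand_like(base)   # samples from the uniform distribution on [0, 1)
normal = torch.randn_like(base)   # samples from the standard normal distribution

for name, t in [("zeros", zeros), ("uniform", uniform), ("normal", normal)]:
    print(name, tuple(t.shape), t.dtype, t.device)
```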

#### **By Specifying a Shape** 透過指定一個大小來創建張量
We can also instantiate tensors by specifying their shapes (which we will cover in more detail in a bit). The available functions mirror those in the previous section:
* `torch.zeros()`
* `torch.ones()`
* `torch.rand()`
* `torch.randn()`

我們也可以透過指定它們的大小來實例化張量（我們將在稍後詳細介紹）。我們可以使用的方法與上一節中的方法相同。
* `torch.zeros()`
* `torch.ones()`
* `torch.rand()`
* `torch.randn()`

In [26]:
# Initialize a 4x3x2 tensor of 0s
shape = (4, 3, 2)
x_zeros = torch.zeros(shape)  # x_zeros = torch.zeros(4, 3, 2) is an alternative
x_zeros

tensor([[[0., 0.],
         [0., 0.],
         [0., 0.]],

        [[0., 0.],
         [0., 0.],
         [0., 0.]],

        [[0., 0.],
         [0., 0.],
         [0., 0.]],

        [[0., 0.],
         [0., 0.],
         [0., 0.]]])

#### **With `torch.arange()`** **使用 `torch.arange()`**

We can also create a tensor with `torch.arange(end)`, which returns a `1-D` tensor with elements ranging from `0` to `end-1`. We can use the optional `start` and `step` parameters to create tensors with different ranges.  

我們也可以用 `torch.arange(end)` 創建一個張量，它會回傳一個一維張量，元素範圍從 `0` 到 `end-1`。也可以使用 `start` 和 `step` 參數來創建不同範圍的張量。 

In [28]:
# Create a tensor with values 0-9
x = torch.arange(10)
x

tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
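
The optional `start` and `step` parameters mentioned above can be sketched as follows. The particular ranges are just examples.

```python
import torch

# start=2, end=10 (exclusive), step=2 -> values 2, 4, 6, 8
evens = torch.arange(2, 10, 2)

# A non-integer step produces a floating point tensor -> 0.0, 0.25, 0.5, 0.75
quarters = torch.arange(0, 1, 0.25)

print(evens)
print(quarters)
```

As with Python's built-in `range`, the `end` value itself is never included.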

### Tensor Properties Tensor 屬性

Tensors have a few properties that are important for us to cover: `dtype`, `shape`, and `device`.

Tensor的幾個屬性對我們來說很重要： 比如大小(`shape`)和設備(`device`)屬性。

#### Data Type 資料類型

The `dtype` property lets us see the data type of a tensor. 

`dtype` 屬性可以讓我們確認一個張量所儲存的資料型態。

In [29]:
# Initialize a 3x2 tensor, with 3 rows and 2 columns
x = torch.ones(3, 2)
x.dtype

torch.float32

#### Shape 形狀

The `shape` property tells us the shape of our tensor. It shows how many dimensions our tensor has, as well as how many elements exist in each dimension.

`shape`屬性告訴我們 tensor 的大小。這可以幫助我們確定我們的 tensor 是多少維的，以及每個維度上有多少元素存在。

In [30]:
# Initialize a 3x2 tensor, with 3 rows and 2 columns
x = torch.Tensor([[1, 2], [3, 4], [5, 6]])
x

tensor([[1., 2.],
        [3., 4.],
        [5., 6.]])

In [31]:
# Print out its shape
# Same as x.size()
x.shape

torch.Size([3, 2])

In [32]:
# Print out the number of elements in a particular dimension
# 0th dimension corresponds to the rows
x.shape[0]

3

We can also get the size of a particular dimension with the `size()` method.

我們也可以透過 `size()` 函式獲得某個特定維度的大小（相當於`shape[]`）


In [33]:
# Get the size of the 0th dimension
x.size(0)

3

We can change the shape of a tensor with the `view()` method.

我們可以用 `view()` 函式改變張量的大小。

In [34]:
# Example use of view()
# x_view shares the same memory as x, so changing one changes the other
x_view = x.view(2, 3)
x_view

tensor([[1., 2., 3.],
        [4., 5., 6.]])

In [35]:
# We can ask PyTorch to infer the size of a dimension with -1
x_view = x.view(3, -1)
x_view.view(-1)

tensor([1., 2., 3., 4., 5., 6.])

We can also use the `torch.reshape()` function for a similar purpose. There is a subtle difference between `reshape()` and `view()`: `view()` requires the data to be stored contiguously in memory. You can refer to [this](https://stackoverflow.com/questions/49643225/whats-the-difference-between-reshape-and-view-in-pytorch) StackOverflow answer for more information. In simple terms, contiguous means that the way our data is laid out in memory matches the order in which we would read its elements. Tensors can become non-contiguous because some methods, such as `transpose()` and `view()`, do not actually change how our data is stored in memory. They only change the meta information about our tensor, so that when we use it we see the elements in the order we expect.

我們也可以使用 `torch.reshape()` 函式來達到類似目的。`reshape()` 和 `view()` 之間有一個微妙的區別：`view()` 需要資料在記憶體中是被連續儲存的。你可以參考[這個](https://stackoverflow.com/questions/49643225/whats-the-difference-between-reshape-and-view-in-pytorch)StackOverflow 的答案，了解更多資訊。簡單來說，連續意味著我們的資料在記憶體中的布局方式與我們從記憶體中讀取元素的方式是一樣的。這是因為一些函式，如 `transpose()` 和 `view()`，實際上並沒有改變我們的數據在內存中的存儲方式。它們只是改變了張量的元信息（meta information），所以當我們使用它時，我們會按照我們期望的順序看到元素。

`reshape()` calls `view()` internally if the data is stored contiguously; if not, it returns a copy. The difference isn't too important for basic tensors, but if you perform operations that make the underlying storage of the data non-contiguous (such as taking a transpose), `view()` can fail. If you would like to match the way your tensor is stored in memory to how it is used, you can use the `contiguous()` method.

如果資料是連續儲存的，`reshape()` 在內部呼叫 `view()`；如果不是，則返回一個副本。這裡的區別對於基本的張量來說並不太重要，但是如果你執行的操作使得資料的底層記憶體儲存不連續（比如進行矩陣轉置 `transpose()`），使用 `view()` 就會有問題。如果想讓你的張量在記憶體中的儲存方式與它的使用方式相匹配，你可以使用 `contiguous()` 函式。 

In [36]:
# Change the shape of x to be 2x3
# x_reshaped could be a reference to or copy of x
x_reshaped = torch.reshape(x, (2, 3))
x_reshaped

tensor([[1., 2., 3.],
        [4., 5., 6.]])
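
The difference between `view()` and `reshape()` on non-contiguous data can be sketched as follows. This is a minimal illustration with our own variable names; it shows that a view shares memory with the original tensor, and that `view()` raises an error on a transposed tensor while `reshape()` and `contiguous()` both work.

```python
import torch

x = torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# A view shares memory with the original tensor: writing to one changes the other
x_view = x.view(2, 3)
x_view[0, 0] = 100.0
print(x[0, 0].item())

# Transposing only changes the meta information, so the result is non-contiguous
x_t = x.t()
print(x_t.is_contiguous())

# view() cannot flatten the non-contiguous tensor
try:
    x_t.view(-1)
    view_failed = False
except RuntimeError:
    view_failed = True

# reshape() falls back to a copy; contiguous() makes the copy explicit
flat_reshape = x_t.reshape(-1)
flat_contig = x_t.contiguous().view(-1)
print(flat_reshape)
```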

We can use the `torch.unsqueeze(x, dim)` function to add a dimension of size `1` at position `dim`, where `x` is the tensor. Its counterpart, `torch.squeeze(x)`, removes all dimensions of size `1`.

我們可以使用 `torch.unsqueeze(x, dim)` 函數在提供的 `dim` 上增加一個大小為 `1` 的維度，其中 `x` 是張量。我們也可以使用相應的`torch.squeeze(x)`，刪除大小為 `1` 的維度。


In [37]:
# Initialize a 5x2 tensor, with 5 rows and 2 columns
x = torch.arange(10).reshape(5, 2)
x

tensor([[0, 1],
        [2, 3],
        [4, 5],
        [6, 7],
        [8, 9]])

In [38]:
# Add a new dimension of size 1 at the 1st dimension
x = x.unsqueeze(1)
x.shape

torch.Size([5, 1, 2])

In [39]:
# Squeeze the dimensions of x by getting rid of all the dimensions with 1 element
x = x.squeeze()
x.shape

torch.Size([5, 2])
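
Calling `squeeze()` without arguments removes every dimension of size `1`, which is not always what we want (for example, when a batch happens to contain a single item). The sketch below, using a made-up tensor shape, shows how passing a `dim` argument limits the operation to one dimension.

```python
import torch

x = torch.zeros(5, 1, 2, 1)

all_squeezed = x.squeeze()    # removes every size-1 dimension -> (5, 2)
one_squeezed = x.squeeze(1)   # removes only dimension 1 -> (5, 2, 1)
unchanged = x.squeeze(0)      # dimension 0 has size 5, so nothing happens
front = x.unsqueeze(0)        # new leading dimension -> (1, 5, 1, 2, 1)
back = x.unsqueeze(-1)        # new trailing dimension -> (5, 1, 2, 1, 1)

for t in (all_squeezed, one_squeezed, unchanged, front, back):
    print(tuple(t.shape))
```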

If we want to get the total number of elements in a tensor, we can use the `numel()` method. 

如果我們想得到張量中元素的總數，我們可以使用 `numel()` 函式。

In [40]:
x

tensor([[0, 1],
        [2, 3],
        [4, 5],
        [6, 7],
        [8, 9]])

In [41]:
# Get the number of elements in tensor.
x.numel()

10

#### **Device** **設備**

The `device` property tells `PyTorch` where to store our tensor. Where a tensor is stored determines which device, `GPU` or `CPU`, handles the computations involving it. We can find the device of a tensor with the `device` property.

設備屬性告訴 `PyTorch` 將我們的張量儲存在哪裡。張量的儲存位置決定了哪個設備，即 `GPU` 或 `CPU`，這會關係到之後的計算流程。我們可以透過 `device` 屬性知道張量目前被儲存在哪個設備中。

In [42]:
# Initialize an example tensor
x = torch.Tensor([[1, 2], [3, 4]])
x

tensor([[1., 2.],
        [3., 4.]])

In [43]:
# Get the device of the tensor
x.device

device(type='cpu')

We can move a tensor from one device to another with the method `to(device)`.

我們可以用 `to(device) `函式將一個張量從一個設備移動到另一個設備。

In [44]:
# Check if a GPU is available, if so, move the tensor to the GPU
if torch.cuda.is_available():
    x = x.to("cuda")

In [45]:
x.device

device(type='cuda', index=0)
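
Note that the output above shows `cuda` only on a machine with a GPU; on a CPU-only machine the tensor stays on `cpu`. A common way to write code that runs in both cases is to pick the device once and reuse it, as in the following sketch (the variable names are our own):

```python
import torch

# Pick the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create a tensor directly on the chosen device, or move an existing one
x = torch.ones(2, 2, device=device)
y = torch.tensor([[1.0, 2.0], [3.0, 4.0]]).to(device)

# Both operands live on the same device, so the operation is valid
z = x + y
print(z.device)

# Move the result back to the CPU, e.g. before converting it to a list or NumPy array
z_cpu = z.cpu()
print(z_cpu.tolist())
```

Tensors that take part in the same operation generally need to be on the same device, which is why both `x` and `y` are placed on `device` here.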

### Tensor Indexing Tensor索引

In `PyTorch` we can index tensors, similar to `NumPy`. 

在 `PyTorch` 中，我們可以對張量進行索引，類似於 `NumPy`。

In [46]:
# Initialize an example tensor
x = torch.Tensor([[[1, 2], [3, 4]], [[5, 6], [7, 8]], [[9, 10], [11, 12]]])
x

tensor([[[ 1.,  2.],
         [ 3.,  4.]],

        [[ 5.,  6.],
         [ 7.,  8.]],

        [[ 9., 10.],
         [11., 12.]]])

In [47]:
x.shape

torch.Size([3, 2, 2])

In [48]:
# Access the 0th element along the first dimension, which is the first 2x2 matrix
x[0]  # Equivalent to x[0, :]

tensor([[1., 2.],
        [3., 4.]])

We can also index into multiple dimensions with `:`.

我們也可以用 `:` 對多個維度進行索引。

In [49]:
# Get the top left element of each element in our tensor
x[:, 0, 0]

tensor([1., 5., 9.])

We can also access arbitrary elements in each dimension. 

我們也可以在每個維度上存取任意的元素。

In [50]:
# Print x again to see our tensor
x

tensor([[[ 1.,  2.],
         [ 3.,  4.]],

        [[ 5.,  6.],
         [ 7.,  8.]],

        [[ 9., 10.],
         [11., 12.]]])

In [51]:
# Let's access the 0th and 1st elements, each twice
i = torch.tensor([0, 0, 1, 1])
x[i]

tensor([[[1., 2.],
         [3., 4.]],

        [[1., 2.],
         [3., 4.]],

        [[5., 6.],
         [7., 8.]],

        [[5., 6.],
         [7., 8.]]])

In [52]:
# Let's access the 0th elements of the 1st and 2nd elements
i = torch.tensor([1, 2])
j = torch.tensor([0])
x[i, j]

tensor([[ 5.,  6.],
        [ 9., 10.]])
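
Indexing with two index tensors, as in `x[i, j]` above, pairs the indices up after broadcasting them against each other. The sketch below rebuilds the same example tensor and checks that the result matches picking the elements one by one; it also shows slicing with a range. The helper names are illustrative.

```python
import torch

# Same values as the example tensor above: 1..12 arranged as 3 matrices of 2x2
x = torch.arange(1, 13, dtype=torch.float).reshape(3, 2, 2)

i = torch.tensor([1, 2])
j = torch.tensor([0])

# j is broadcast to match i, so this selects x[1, 0] and x[2, 0]
picked = x[i, j]
manual = torch.stack([x[1, 0], x[2, 0]])
print(torch.equal(picked, manual))

# Slices select a range along a dimension
last_two = x[1:]         # the 1st and 2nd matrices
first_col = x[:, :, 0]   # the first column of every matrix
print(first_col)
```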

We can get a `Python` scalar value from a tensor with `item()`. 

我們可以用 `item()` 從張量中取得一個 `Python` 純量（scalar）值。

In [53]:
x[0, 0, 0]

tensor(1.)

In [54]:
x[0, 0, 0].item()

1.0

### Operations 操作
PyTorch operations are very similar to those of `NumPy`. We can work with both scalars and other tensors. 

`PyTorch` 的操作與 `NumPy` 的操作非常相似。我們可以用純量和一個張量進行運算，張量之間也能相互進行運算。

In [55]:
# Create an example tensor
x = torch.ones((3, 2, 2))
x

tensor([[[1., 1.],
         [1., 1.]],

        [[1., 1.],
         [1., 1.]],

        [[1., 1.],
         [1., 1.]]])

In [56]:
# Perform elementwise addition
# Use - for subtraction
x + 2

tensor([[[3., 3.],
         [3., 3.]],

        [[3., 3.],
         [3., 3.]],

        [[3., 3.],
         [3., 3.]]])

In [57]:
# Perform elementwise multiplication
# Use / for division
x * 2

tensor([[[2., 2.],
         [2., 2.]],

        [[2., 2.],
         [2., 2.]],

        [[2., 2.],
         [2., 2.]]])

We can apply the same operations between different tensors of compatible sizes.

我們可以在大小兼容的不同張量之間進行同樣的操作。

In [58]:
# Create a 4x3 tensor of 6s
a = torch.ones((4, 3)) * 6
a

tensor([[6., 6., 6.],
        [6., 6., 6.],
        [6., 6., 6.],
        [6., 6., 6.]])

In [59]:
# Create a 1D tensor of 2s
b = torch.ones(3) * 2
b

tensor([2., 2., 2.])

In [60]:
# Divide a by b
a / b

tensor([[3., 3., 3.],
        [3., 3., 3.],
        [3., 3., 3.],
        [3., 3., 3.]])
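
The division above works because of broadcasting: shapes are compared from the last dimension backwards, and two sizes are compatible when they are equal or one of them is `1`. The following sketch shows a compatible case, an incompatible one, and a fix using `unsqueeze()`. The tensor `c` is a made-up example.

```python
import torch

a = torch.ones(4, 3) * 6
b = torch.ones(3) * 2         # shape (3,): matches the last dimension of a
c = torch.arange(1.0, 5.0)    # shape (4,): values 1, 2, 3, 4

# b is broadcast across the 4 rows of a
row_wise = a / b

# Shapes (4, 3) and (4,) do not line up: the last dimensions are 3 and 4
try:
    a / c
    incompatible = False
except RuntimeError:
    incompatible = True

# Adding a trailing dimension gives c the shape (4, 1), which broadcasts over columns
col_wise = a / c.unsqueeze(1)
print(col_wise)
```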

We can use `tensor.matmul(other_tensor)` for matrix multiplication and `tensor.T` for transpose. Matrix multiplication can also be performed with `@`.

我們可以使用 `tensor.matmul(other_tensor)` 進行矩陣乘法，`tensor.T` 進行轉置。矩陣乘法也可以用 `@` 進行。

In [61]:
# Alternative to a.matmul(b)
# a @ b.T returns the same result since b is 1D tensor and the 2nd dimension
# is inferred
a @ b

tensor([36., 36., 36., 36.])

In [62]:
pp.pprint(a.shape)
pp.pprint(a.T.shape)

torch.Size([4, 3])
torch.Size([3, 4])
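
To see how the shapes combine under matrix multiplication, here is a short sketch covering the vector, matrix, and batched cases. The tensors `w` and `batch` are hypothetical examples.

```python
import torch

a = torch.ones(4, 3) * 6
b = torch.ones(3) * 2
w = torch.ones(3, 5)

vec = a @ b                 # (4, 3) @ (3,)   -> (4,)
mat = a @ w                 # (4, 3) @ (3, 5) -> (4, 5)
same = torch.matmul(a, w)   # function form of the @ operator
gram = a @ a.T              # (4, 3) @ (3, 4) -> (4, 4)

# With more than two dimensions, the leading dimensions are treated as a batch
batch = torch.ones(12, 4, 8) @ torch.ones(12, 8, 16)

for t in (vec, mat, gram, batch):
    print(tuple(t.shape))
```

In every case the inner sizes must match: the last dimension of the left operand and the second-to-last dimension of the right operand.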


We can take the mean and standard deviation along a certain dimension with the methods `mean(dim)` and `std(dim)`. That is, if we want the `3x2` mean over the first dimension of a `4x3x2` tensor, we would set `dim` to 0. We can call these methods with no argument to get the mean and standard deviation of the whole tensor. To use `mean` and `std`, our tensor must have a floating point type.

我們可以用 `mean(dim)` 和 `std(dim)` 函式來取得某一維度的平均和標準差。也就是說，如果我們想在一個 `4x3x2` 的矩陣中得到 `3x2` 的平均值，我們可以將 `dim` 設為 0。我們可以在沒有參數的情況下呼叫這些函式來得到整個張量的平均值和標準差。要使用 `mean` 和 `std`，我們的張量必須是浮點數類別。

In [63]:
# Create an example tensor
m = torch.tensor([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]])

pp.pprint("Mean: {}".format(m.mean()))
pp.pprint("Mean in the 0th dimension: {}".format(m.mean(0)))
pp.pprint("Mean in the 1st dimension: {}".format(m.mean(1)))

'Mean: 2.5'
'Mean in the 0th dimension: tensor([2.5000, 2.5000])'
'Mean in the 1st dimension: tensor([1., 2., 3., 4.])'
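
The cell above only demonstrates `mean`. As a complement, this sketch shows `std`, the `keepdim` option, and the conversion needed for an integer tensor. Note that `std` applies Bessel's correction (dividing by `n - 1`) by default; the tensor `ints` is a made-up example.

```python
import torch

m = torch.tensor([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]])

std_all = m.std()     # standard deviation over all 8 elements
std_dim0 = m.std(0)   # standard deviation of each column

# keepdim=True keeps the reduced dimension with size 1: shape (4, 1) instead of (4,)
mean_keep = m.mean(dim=1, keepdim=True)

# Integer tensors need to be converted to a floating point type first
ints = torch.tensor([[1, 2], [3, 4]])
int_mean = ints.float().mean()

print(std_all, std_dim0, tuple(mean_keep.shape), int_mean)
```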


We can concatenate tensors using `torch.cat`.

我們可以用 `torch.cat` 來連接張量。

In [64]:
# Concatenate in dimension 0 and 1
a_cat0 = torch.cat([a, a, a], dim=0)
a_cat1 = torch.cat([a, a, a], dim=1)

print("Initial shape: {}".format(a.shape))
print("Shape after concatenation in dimension 0: {}".format(a_cat0.shape))
print("Shape after concatenation in dimension 1: {}".format(a_cat1.shape))

Initial shape: torch.Size([4, 3])
Shape after concatenation in dimension 0: torch.Size([12, 3])
Shape after concatenation in dimension 1: torch.Size([4, 9])


Most operations in `PyTorch` are not in place. However, `PyTorch` offers in-place versions of many operations, named by adding an underscore (`_`) to the end of the method name.

`PyTorch` 中的大多數操作都不是就地操作（指：改變原物件的值）。但是，`PyTorch` 在方法名稱的後面加上下劃線（`_`），就可以提供原地操作的版本。

In [65]:
# Print our tensor
a

tensor([[6., 6., 6.],
        [6., 6., 6.],
        [6., 6., 6.],
        [6., 6., 6.]])

In [66]:
# add() is not in place
a = a.add(a)
a

tensor([[12., 12., 12.],
        [12., 12., 12.],
        [12., 12., 12.],
        [12., 12., 12.]])

In [67]:
# add_() is in place
a.add_(a)
a

tensor([[24., 24., 24.],
        [24., 24., 24.],
        [24., 24., 24.],
        [24., 24., 24.]])
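
The practical difference between the two versions shows up when another variable refers to the same tensor. In this sketch (with hypothetical variable names), the out-of-place `add()` leaves the other reference untouched, while the in-place `add_()` changes the data seen through both.

```python
import torch

# Out of place: add() returns a new tensor, so the alias keeps the old values
a = torch.ones(2, 2)
alias_a = a
a = a.add(1)
print(alias_a)

# In place: add_() modifies the existing tensor, so the alias sees the change
b = torch.ones(2, 2)
alias_b = b
b.add_(1)
print(alias_b)
```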

## Exercises: Tensors 練習：張量

Please complete the following code as required.

請按照說明的要求完成下列程式碼。

### 1. Import the pytorch package

In [68]:
import torch
import torch.nn as nn

### 2. Create a tensor of size 10 from python list (dim=1)



In [2]:
data = [i for i in range(1, 11)]
x_python = torch.tensor(data)
x_python

tensor([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

### 3. Create a tensor of data type bool

In [3]:
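# torch.Tensor(3, 2) would allocate uninitialized memory, so its values are
# arbitrary; build the tensor from explicit data instead
x = torch.tensor([[1, 0], [1, 0], [1, 0]], dtype=torch.bool)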
x

tensor([[ True, False],
        [ True, False],
        [ True, False]])

### 4. Create a 3x3 tensor with values ranging from 0 to 8

In [4]:
data = [i for i in range(9)]
x = torch.tensor(data).reshape(3, 3)
x

tensor([[0, 1, 2],
        [3, 4, 5],
        [6, 7, 8]])

### 5. Create a 3x3x3 tensor with random values and initialize a new tensor with 1s from it.

In [6]:
x_shape = (3, 3, 3)
x = torch.randn(x_shape)
ones_x = torch.ones_like(x)
ones_x

tensor([[[1., 1., 1.],
         [1., 1., 1.],
         [1., 1., 1.]],

        [[1., 1., 1.],
         [1., 1., 1.],
         [1., 1., 1.]],

        [[1., 1., 1.],
         [1., 1., 1.],
         [1., 1., 1.]]])

### 6. Output the data type, shape, and device of the following tensor

In [9]:
data = [[[3, 5, 6]], [[2, 1, 3]], [[9, 6, 0]], [[4, -1, 2]]]

t = torch.tensor(data)

print("Data Type: {}".format(t.dtype))  # data type
print("Data Shape: {}".format(t.shape))  # shape
print("Data Device: {}".format(t.device))  # device

Data Type: torch.int64
Data Shape: torch.Size([4, 1, 3])
Data Device: cpu


### 7. Get the top right element of each element in the tensor above.

In [11]:
"""
Output will be:
tensor([[6],
        [3],
        [0],
        [2]])
"""

t[:, :, 2]

tensor([[6],
        [3],
        [0],
        [2]])

### 8. Calculate the matrix multiplication of the following two tensors.

In [12]:
t1 = torch.randn(12, 4, 8)
t2 = torch.rand(12, 8, 16)

t_mul = t1 @ t2
t_mul.shape

torch.Size([12, 4, 16])

# Part II.
## Autograd 自動微分

`PyTorch` and other machine learning libraries are known for their automatic differentiation feature. That is, once we have defined the set of operations to be performed, the framework itself can figure out how to compute the gradients. We can call the `backward()` method to ask `PyTorch` to calculate the gradients, which are then stored in the `grad` attribute.

`PyTorch` 和其他機器學習函式庫以其自動微分的功能而聞名。也就是說，考慮到我們已經定義了需要執行的一系列操作，框架本身就可以想辦法計算梯度。我們可以調用 `backward()` 函式來要求 `PyTorch` 計算梯度，然後將其儲存在 `grad` 屬性中。

In [69]:
# Create an example tensor
# requires_grad parameter tells PyTorch to store gradients
x = torch.tensor([2.0], requires_grad=True)

# Print the gradient if it is calculated
# Currently None since backward() has not been called yet
pp.pprint(x.grad)

None


In [70]:
# Calculating the gradient of y with respect to x
y = x * x * 3  # 3x^2
y.backward()
pp.pprint(x.grad)  # d(y)/d(x) = d(3x^2)/d(x) = 6x = 12

tensor([12.])


Let's run backprop from a different tensor again to see what happens.

讓我們再對另一個不同的張量使用反向傳播，看看會發生什麼。

In [71]:
z = x * x * 3  # 3x^2
z.backward()
pp.pprint(x.grad)

tensor([24.])


We can see that `x.grad` is updated to be the sum of the gradients calculated so far. When we run backprop in a neural network, we sum up all the gradients for a particular neuron before making an update. This is exactly what is happening here! This is also the reason why we need to run `zero_grad()` in every training iteration (more on this later). Otherwise our gradients would keep building up from one training iteration to the next, which would cause our updates to be wrong.

我們可以看到，`x.grad` 被更新為迄今為止計算的梯度之和。當我們在神經網路中運行反向傳播時，在進行更新之前，我們會將某個特定神經元的所有梯度相加，這正是這裡所發生的事情。這也是為什麼我們需要在每次訓練迭代中運行 `zero_grad()` 的原因（後面會詳細介紹）。否則，我們的梯度會在每個迭代中不斷累計，這將導致參數更新出錯。
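
The accumulation behavior described above, and the effect of resetting the gradient, can be sketched as follows. Here we zero the gradient directly on the tensor with `x.grad.zero_()`; the variable names are illustrative.

```python
import torch

x = torch.tensor([2.0], requires_grad=True)

# First backward pass: d(3x^2)/dx = 6x = 12
y = x * x * 3
y.backward()
first = x.grad.clone()

# Second backward pass without resetting: gradients accumulate to 24
z = x * x * 3
z.backward()
accumulated = x.grad.clone()

# Reset the stored gradient in place, then run backward again: back to 12
x.grad.zero_()
w = x * x * 3
w.backward()
after_reset = x.grad.clone()

print(first, accumulated, after_reset)
```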

## Neural Network Module 神經網絡模組

So far we have looked into the tensors, their properties and basic operations on tensors. These are especially useful to get familiar with if we are building the layers of our network from scratch. We will utilize these in Assignment 3, but moving forward, we will use predefined blocks in the `torch.nn` module of `PyTorch`. We will then put together these blocks to create complex networks. Let's start by importing this module with an alias so that we don't have to type `torch` every time we use it. 

到目前為止，我們已經研究了張量、它們的屬性和對張量的基本操作。如果我們要從頭開始構建網路層，熟悉這些內容是特別有用的。我們將在作業3(CS224N課程的)中利用這些東西，但在未來，我們將使用 `PyTorch `的 `torch.nn` 模組中的預定義模組。然後我們會把這些模組放在一起，創建複雜的網路。讓我們先用別名導入這個模組，這樣我們就不必每次使用時都要輸入 `torch`。

In [72]:
import torch.nn as nn

### **Linear Layer** **線性層**

We can use `nn.Linear(H_in, H_out)` to create a linear layer. It takes a tensor of shape `(N, *, H_in)` and outputs a tensor of shape `(N, *, H_out)`. The `*` denotes that there can be an arbitrary number of dimensions in between. The linear layer performs the operation `Ax+b` on the last dimension, where `A` and `b` are initialized randomly. If we don't want the linear layer to learn the bias parameters, we can initialize our layer with `bias=False`.

我們可以使用 `nn.Linear(H_in, H_out)` 來創建一個線性層。這將接受一個 `(N, *, H_in)` 維度的矩陣並輸出一個 `(N, *, H_out)` 的矩陣。`*` 表示中間可以有任意多的維度。線性層執行 `Ax+b` 的操作，其中 `A` 和 `b` 是隨機初始化的。如果我們不希望線性層學習 `bias` 參數，我們可以用`bias=False` 來初始化。

In [73]:
# Create the inputs
input = torch.ones(2, 3, 4)
# Shapes: (N, *, H_in) -> (N, *, H_out)

# Make a linear layer transforming (N, *, H_in)-dimensional inputs to
# (N, *, H_out)-dimensional outputs
linear = nn.Linear(4, 2)
linear_output = linear(input)
linear_output

tensor([[[-0.3197, -0.0709],
         [-0.3197, -0.0709],
         [-0.3197, -0.0709]],

        [[-0.3197, -0.0709],
         [-0.3197, -0.0709],
         [-0.3197, -0.0709]]], grad_fn=<ViewBackward0>)

In [74]:
list(linear.parameters())  # Ax + b

[Parameter containing:
 tensor([[ 0.4699, -0.1039, -0.4500,  0.1296],
         [-0.4941,  0.3705, -0.4157,  0.3590]], requires_grad=True),
 Parameter containing:
 tensor([-0.3653,  0.1094], requires_grad=True)]
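
To connect the printed parameters to the layer's output, the sketch below recomputes the output by hand. It relies on the weight being stored with shape `(H_out, H_in)`, so the layer's operation can be written as `x @ A.T + b`. The seed and variable names are arbitrary choices for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fix the random initialization for reproducibility

linear = nn.Linear(4, 2)
inputs = torch.ones(2, 3, 4)

# Output of the layer versus the same computation written out by hand
output = linear(inputs)
manual = inputs @ linear.weight.T + linear.bias
print(torch.allclose(output, manual, atol=1e-6))

# With bias=False the layer has a single parameter: the weight matrix
no_bias = nn.Linear(4, 2, bias=False)
num_params = len(list(no_bias.parameters()))
print(num_params)
```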

### **Other Module Layers** **其他模塊層**

There are several other preconfigured layers in the `nn` module. Some commonly used examples are `nn.Conv2d`, `nn.ConvTranspose2d`, `nn.BatchNorm1d`, `nn.BatchNorm2d`, `nn.Upsample` and `nn.MaxPool2d` among many others. We will learn more about these as we progress in the course. For now, the only important thing to remember is that we can treat each of these layers as plug and play components: we will be providing the required dimensions and `PyTorch` will take care of setting them up. 

在 `nn` 模塊中還有其他幾個預設的層。一些常用的例子是 `nn.Conv2d`，`nn.ConvTranspose2d`，`nn.BatchNorm1d`，`nn.BatchNorm2d`，`nn.Upsample` 和 `nn.MaxPool2d` 以及其他等等。 隨著課程的進行，我們將進一步了解這些內容。目前唯一需要記住的是，我們可以把這些層當作即插即用的組件：我們只要提供所需的大小，`PyTorch` 會負責設置它們。

### **Activation Function Layer** **激勵函數層**
We can also use the `nn` module to apply activation functions to our tensors. Activation functions are used to add non-linearity to our network. Some examples of activation functions are `nn.ReLU()`, `nn.Sigmoid()` and `nn.LeakyReLU()`. Activation functions operate on each element separately, so the tensors we get as output have the same shape as the ones we pass in.

我們還可以使用 `nn` 模組對我們的張量使用激勵函數。激勵函數被用來給我們的網路添加非線性。激勵函數的一些例子是 `nn.ReLU()`, `nn.Sigmoid()` 和 `nn.LeakyReLU()`。激勵函數是單獨對每個元素進行操作，所以我們得到的張量的大小和我們傳入的張量是一樣的。

In [75]:
linear_output

tensor([[[-0.3197, -0.0709],
         [-0.3197, -0.0709],
         [-0.3197, -0.0709]],

        [[-0.3197, -0.0709],
         [-0.3197, -0.0709],
         [-0.3197, -0.0709]]], grad_fn=<ViewBackward0>)

In [76]:
sigmoid = nn.Sigmoid()
output = sigmoid(linear_output)
output

tensor([[[0.4207, 0.4823],
         [0.4207, 0.4823],
         [0.4207, 0.4823]],

        [[0.4207, 0.4823],
         [0.4207, 0.4823],
         [0.4207, 0.4823]]], grad_fn=<SigmoidBackward0>)
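
A small sketch can confirm that these layers act on each element independently. The input values and the `negative_slope` of `0.1` are made up for illustration.

```python
import torch
import torch.nn as nn

x = torch.tensor([[-2.0, 0.0, 3.0]])

relu_out = nn.ReLU()(x)             # max(0, x) for every element
sigmoid_out = nn.Sigmoid()(x)       # 1 / (1 + exp(-x)) for every element
leaky_out = nn.LeakyReLU(0.1)(x)    # negative inputs are scaled by 0.1

# The sigmoid written out by hand gives the same values
manual_sigmoid = 1 / (1 + torch.exp(-x))

print(relu_out, sigmoid_out, leaky_out)
```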

### **Putting the Layers Together** **將多個層放在一起**

So far we have seen that we can create layers and pass the output of one as the input of the next. Instead of creating intermediate tensors and passing them around, we can use `nn.Sequential`, which chains the layers for us.

到目前為止我們已經學習到如何創建層，並將一個層的輸出作為下一個層的輸入。我們可以使用 `nn.Sequential` 來代替創建中間張量並傳遞各層的結果。

In [77]:
block = nn.Sequential(nn.Linear(4, 2), nn.Sigmoid())

input = torch.ones(2, 3, 4)
output = block(input)
output

tensor([[[0.3866, 0.6464],
         [0.3866, 0.6464],
         [0.3866, 0.6464]],

        [[0.3866, 0.6464],
         [0.3866, 0.6464],
         [0.3866, 0.6464]]], grad_fn=<SigmoidBackward0>)
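
To check that `nn.Sequential` does nothing more than chain its layers, the following sketch compares it with passing the intermediate tensor by hand. The layers inside a `Sequential` block can be accessed by index; the seed and names are our own.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fix the random initialization for reproducibility

block = nn.Sequential(nn.Linear(4, 2), nn.Sigmoid())
inputs = torch.ones(2, 3, 4)

# Run the whole block at once
chained = block(inputs)

# Run the two layers one after the other
hidden = block[0](inputs)
manual = block[1](hidden)

print(torch.allclose(chained, manual))
```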

### Custom Modules 自定義模組

Instead of using the predefined modules, we can also build our own by extending the `nn.Module` class. For example, we can build our own version of `nn.Linear` (which also extends `nn.Module`) using the tensors introduced earlier! We can also build new, more complex modules, such as a custom neural network. You will be practicing these in a later assignment.

我們可以不使用預定義的模組，而是通過擴展 `nn.Module` 來建立自己的模組。例如，我們可以使用前面介紹的張量來建立一個 `nn.Linear`（它也擴展了`nn.Module`）。我們還可以建立新的、更複雜的模組，如自定義神經網路。你可以在CS224N後面的作業中練習這些。

To create a custom module, the first thing we have to do is extend `nn.Module`. We can then initialize our parameters in the `__init__` function, starting with a call to the `__init__` function of the superclass. Every attribute we define that is an `nn.Module` object is registered as a submodule, and its parameters become parameters of our module that can be learned during training. Plain tensors are not parameters, but they can be turned into parameters by wrapping them in the `nn.Parameter` class.

要創建一個自定義模組，我們首先要做的是擴展 `nn.Module`。然後我們可以在 `__init__` 函數中初始化我們的參數，首先是呼叫父類別的 `__init__` 函數。我們定義的所有屬於 `nn` 模組對象的屬性都被當作參數，可以在訓練中學習。張量不是參數，但是如果它們被包在 `nn.Parameter` 類中，它們就可以變成參數。

All classes extending `nn.Module` are also expected to implement a `forward(x)` function, where `x` is a tensor. This is the function that is called when an input is passed to our module, as in `model(x)`.

所有擴展 `nn.Module` 的類別也應該實作一個 `forward(x)` 函數，其中 `x` 是一個張量。當一個參數被傳遞給我們的模組時，這是一個被呼叫的函數，例如在 `model(x)` 中。

In [78]:
class MultilayerPerceptron(nn.Module):
    def __init__(self, input_size, hidden_size):
        # Call to the __init__ function of the super class
        super(MultilayerPerceptron, self).__init__()

        # Bookkeeping: Saving the initialization parameters
        self.input_size = input_size
        self.hidden_size = hidden_size

        # Defining of our model
        # There isn't anything specific about the naming of `self.model`. It could
        # be something arbitrary.
        self.model = nn.Sequential(
            nn.Linear(self.input_size, self.hidden_size),
            nn.ReLU(),
            nn.Linear(self.hidden_size, self.input_size),
            nn.Sigmoid(),
        )

    def forward(self, x):
        output = self.model(x)
        return output

Here is an alternative way to define the same class. You can see that we can replace `nn.Sequential` by defining the individual layers in the `__init__` method and connecting them in the `forward` method.

下面是定義同一類別的另一種方法。你可以看到我們可以透過在 `__init__` 方法中定義各個層，並在 `forward` 函式中連接，來代替 `nn.Sequential`。

In [79]:
class MultilayerPerceptron(nn.Module):
    def __init__(self, input_size, hidden_size):
        # Call to the __init__ function of the super class
        super(MultilayerPerceptron, self).__init__()

        # Bookkeeping: Saving the initialization parameters
        self.input_size = input_size
        self.hidden_size = hidden_size

        # Defining of our layers
        self.linear = nn.Linear(self.input_size, self.hidden_size)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(self.hidden_size, self.input_size)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        linear = self.linear(x)
        relu = self.relu(linear)
        linear2 = self.linear2(relu)
        output = self.sigmoid(linear2)
        return output

Now that we have defined our class, we can instantiate it and see what it does. 

現在我們已經定義了我們的類別，我們可以把它實例化，看看它能作什麼。

In [80]:
# Make a sample input
input = torch.randn(2, 5)

# Create our model
model = MultilayerPerceptron(5, 3)

# Pass our input through our model
model(input)

tensor([[0.3865, 0.4406, 0.5251, 0.4538, 0.5130],
        [0.4015, 0.4533, 0.5372, 0.4204, 0.4530]], grad_fn=<SigmoidBackward0>)

We can inspect the parameters of our model with the `named_parameters()` and `parameters()` methods.

我們可以用 `named_parameters()` 和 `parameters()` 函式檢查我們模型的參數。

In [81]:
list(model.named_parameters())

[('linear.weight', Parameter containing:
  tensor([[ 0.0590, -0.2182, -0.1102,  0.1848,  0.1242],
          [ 0.3129,  0.2491,  0.1808, -0.3214,  0.1112],
          [-0.4008,  0.0207, -0.2799, -0.0014, -0.1322]], requires_grad=True)),
 ('linear.bias', Parameter containing:
  tensor([0.2516, 0.3986, 0.2406], requires_grad=True)),
 ('linear2.weight', Parameter containing:
  tensor([[ 0.1189,  0.5154, -0.2814],
          [-0.0232, -0.1403, -0.3857],
          [-0.0125,  0.1025, -0.1616],
          [ 0.3783,  0.2448, -0.1698],
          [ 0.3612,  0.0360,  0.3431]], requires_grad=True)),
 ('linear2.bias', Parameter containing:
  tensor([-0.4914, -0.1494,  0.1398, -0.4161, -0.2637], requires_grad=True))]

## Optimization 優化（器）
We have shown how gradients are calculated with the `backward()` function. Having the gradients isn't enough for our models to learn; we also need to know how to update the parameters of our models. This is where optimizers come in. The `torch.optim` module contains several optimizers that we can use. Some popular examples are `optim.SGD` and `optim.Adam`. When initializing an optimizer, we pass our model parameters, which can be accessed with `model.parameters()`, telling the optimizer which values it will be optimizing. Optimizers also have a learning rate (`lr`) parameter, which determines how big of an update will be made in every step. Different optimizers have different hyperparameters as well.

我們已經展示了如何用 `backward()` 函數來計算梯度。有了梯度還不足以讓我們的模型進行學習。我們還需要知道如何更新我們模型的參數。這就是優化器登場的時候了。
`torch.optim` 模組裡頭有幾個我們可以使用的優化器。一些比較常用的例子是 `optim.SGD` 和 `optim.Adam`。當初始化優化器時，我們傳遞我們的模型參數，可以用 `model.parameters()` 呼叫，告訴優化器它將優化哪些值。優化器也有一個學習率（`lr`）參數，它決定了每一步將進行多大的更新。不同的優化器也有不同的超參數。
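
To make the role of the learning rate concrete, here is a minimal sketch of a single update with plain `optim.SGD` (no momentum or weight decay). The tensor `w` and the toy loss are assumptions made purely for illustration; they are not part of the model we train below.

```python
import torch
import torch.optim as optim

# A toy "model" consisting of a single parameter tensor (hypothetical values)
w = torch.tensor([1.0, -2.0], requires_grad=True)
sgd = optim.SGD([w], lr=0.1)

# Toy loss: sum of squares, so the gradient with respect to w is 2 * w
loss = (w ** 2).sum()
loss.backward()

# What plain SGD is expected to compute: w - lr * gradient
expected = w.detach().clone() - 0.1 * w.grad

# Let the optimizer perform the update in place
sgd.step()

print(w)
print(torch.allclose(w.detach(), expected))
```

Optimizers such as `optim.Adam` use the same `step()` interface but compute the update differently, which is why they expose additional hyperparameters.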

In [82]:
import torch.optim as optim

After we have our optimizer, we can define a `loss` that we want to minimize. We can either define the loss ourselves, or use one of the predefined loss functions in `PyTorch`, such as `nn.BCELoss()`. Let's put everything together now! We will start by creating some dummy data.

有了優化函數後，我們可以定義一個我們想要優化的「損失」函數。我們可以自己定義損失，或者使用`PyTorch`中預定義的損失函數，如 `nn.BCELoss()`。現在讓我們把所有的東西放在一起！ 我們將從創建一些範例資料開始。

In [83]:
# Create the y data
y = torch.ones(10, 5)

# Add some noise to our goal y to generate our x
# We want our model to predict our original data, despite the noise
x = y + torch.randn_like(y)
x

tensor([[ 0.3453,  1.2104,  0.8543,  2.3149,  1.7949],
        [ 0.1551,  3.4177,  0.5026,  0.9726,  0.3493],
        [-0.0036,  0.1684,  0.9291,  1.8193,  0.4894],
        [ 1.8198,  1.6284,  0.8045,  2.2575,  2.0745],
        [ 1.6753, -0.5617,  0.4887,  1.7341,  2.1398],
        [ 1.6839, -0.1710,  0.8256,  0.4285,  1.4306],
        [-0.1116,  0.6421,  0.8681,  2.6186,  1.0652],
        [ 1.8008,  1.3212,  0.1492,  0.3898, -0.7971],
        [ 2.3707,  0.5358,  0.8470,  1.1773,  0.8518],
        [ 1.7510, -0.9816,  1.6112,  0.4882, -0.2327]])

Now, we can define our model, optimizer and the loss function. 

現在，我們可以定義我們的模型、優化器和損失函數。

In [84]:
# Instantiate the model
model = MultilayerPerceptron(5, 3)

# Define the optimizer
adam = optim.Adam(model.parameters(), lr=1e-1)

# Define loss using a predefined loss function
loss_function = nn.BCELoss()

# Calculate how our model is doing now
y_pred = model(x)
loss_function(y_pred, y).item()

0.6066659092903137
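
For reference, a loss we define ourselves could look like the sketch below, which writes binary cross entropy by hand and compares it with `nn.BCELoss()`. The function name `manual_bce` and the toy tensors are assumptions for illustration only.

```python
import torch
import torch.nn as nn


def manual_bce(pred, target):
    # Mean over all elements of -(t * log(p) + (1 - t) * log(1 - p))
    return -(target * torch.log(pred) + (1 - target) * torch.log(1 - pred)).mean()


# Toy predictions (probabilities) and binary targets
toy_target = torch.tensor([1.0, 0.0, 1.0])
toy_pred = torch.tensor([0.9, 0.2, 0.6])

manual = manual_bce(toy_pred, toy_target)
builtin = nn.BCELoss()(toy_pred, toy_target)
print(manual.item(), builtin.item())
```

The two values should agree for predictions strictly between `0` and `1`. The predefined version is usually the safer choice because it also guards against taking the logarithm of `0`, which this hand-written sketch does not.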

Let's see if we can have our model achieve a smaller loss. Now that we have everything we need, we can set up our training loop.

讓我們看看是否能讓我們的模型達到一個較小的損失。現在我們有了我們需要的一切，我們可以設置我們的訓練循環。

In [85]:
# Set the number of epochs, which determines the number of training iterations
n_epoch = 10

for epoch in range(n_epoch):
    # Set the gradients to 0
    adam.zero_grad()

    # Get the model predictions
    y_pred = model(x)

    # Get the loss
    loss = loss_function(y_pred, y)

    # Print stats
    print(f"Epoch {epoch}: training loss: {loss}")

    # Compute the gradients
    loss.backward()

    # Take a step to optimize the weights
    adam.step()

Epoch 0: training loss: 0.6066659092903137
Epoch 1: training loss: 0.3982151746749878
Epoch 2: training loss: 0.21716317534446716
Epoch 3: training loss: 0.09739437699317932
Epoch 4: training loss: 0.037612032145261765
Epoch 5: training loss: 0.01358792744576931
Epoch 6: training loss: 0.004923112224787474
Epoch 7: training loss: 0.001856371178291738
Epoch 8: training loss: 0.0007394045824185014
Epoch 9: training loss: 0.00031236204085871577


In [86]:
list(model.parameters())

[Parameter containing:
 tensor([[-0.8590, -0.2428, -0.5012, -0.3720, -0.0918],
         [ 0.7235,  1.1215,  1.2097,  1.0908,  0.4797],
         [ 1.2258,  0.4246,  0.7917,  0.7369,  0.7239]], requires_grad=True),
 Parameter containing:
 tensor([-0.4840,  0.8851,  1.2186], requires_grad=True),
 Parameter containing:
 tensor([[ 0.0512,  0.7914,  1.0530],
         [ 0.4277,  0.9813,  1.1057],
         [ 0.9095,  1.2471,  1.2487],
         [-0.0152,  1.3429,  0.2937],
         [ 0.5647,  0.8968,  0.8212]], requires_grad=True),
 Parameter containing:
 tensor([0.4949, 0.9367, 0.1443, 1.1922, 0.3211], requires_grad=True)]

You can see that our loss is decreasing. Let's check the predictions of our model now and see if they are close to our original `y`, which was all `1s`. 

可以看到我們的損失正在減少。讓我們現在檢查一下我們模型的預測，看看它們是否接近我們最初的 `y`，也就是所有的 `1s`。

In [None]:
# See how our model performs on the training data
y_pred = model(x)
y_pred

tensor([[0.9999, 0.9997, 0.9997, 0.9977, 0.9592],
        [0.9998, 0.9992, 0.9992, 0.9952, 0.9422],
        [0.9985, 0.9960, 0.9951, 0.9835, 0.8981],
        [0.9999, 0.9996, 0.9996, 0.9970, 0.9542],
        [1.0000, 0.9999, 1.0000, 0.9994, 0.9790],
        [1.0000, 0.9998, 0.9998, 0.9982, 0.9634],
        [0.9995, 0.9983, 0.9981, 0.9915, 0.9249],
        [0.9972, 0.9935, 0.9917, 0.9762, 0.8799],
        [0.9936, 0.9879, 0.9834, 0.9619, 0.8520],
        [0.9984, 0.9957, 0.9948, 0.9827, 0.8958]], grad_fn=<SigmoidBackward0>)

In [87]:
# Create test data and check how our model performs on it
x2 = y + torch.randn_like(y)
y_pred = model(x2)
y_pred

tensor([[1.0000, 1.0000, 1.0000, 1.0000, 1.0000],
        [0.9986, 0.9996, 0.9998, 0.9983, 0.9974],
        [0.9968, 0.9990, 0.9994, 0.9981, 0.9951],
        [1.0000, 1.0000, 1.0000, 1.0000, 1.0000],
        [1.0000, 1.0000, 1.0000, 1.0000, 1.0000],
        [0.9997, 0.9999, 1.0000, 0.9998, 0.9994],
        [1.0000, 1.0000, 1.0000, 1.0000, 1.0000],
        [1.0000, 1.0000, 1.0000, 1.0000, 1.0000],
        [0.9994, 0.9999, 0.9999, 0.9995, 0.9990],
        [1.0000, 1.0000, 1.0000, 1.0000, 0.9999]], grad_fn=<SigmoidBackward0>)

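Great! Our model's predictions are very close to `1`, even on noisy inputs it has not seen before. Keep in mind that this is a toy task: every target is `1`, so the model can also reach a low loss by learning to output values near `1` regardless of the input.

很好！模型的預測即使在未見過的含噪輸入上也非常接近 `1`。請記得這只是一個簡化的任務：所有目標值都是 `1`，因此模型只要學會不論輸入為何都輸出接近 `1` 的值，也能達到很低的損失。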

# Part III.
## Demo: Word Window Classification 演示：詞匯窗口分類

Up to this point in the notebook, we have learned the fundamentals of PyTorch and built a basic network solving a toy task. Now we will attempt to solve an example NLP task. Here are the things we will learn:

在這部分筆記之前，我們已經學習了 `PyTorch` 的基礎知識，並建立了一個基本的網路來解決一個簡單的任務。現在我們將嘗試解決一個NLP任務的例子。以下是我們將要學習的內容：

1. Data: Creating a Dataset of Batched Tensors
2. Modeling
3. Training
4. Prediction
---

1. 數據：創建一個批次化的張量的數據集
2. 建模
3. 訓練
4. 預測

In this section, our goal will be to train a model that finds the words in a sentence that correspond to a `LOCATION`. A location will always have a span of `1` (meaning that `San Francisco` won't be recognized as a `LOCATION`). Our task is called `Word Window Classification` for a reason. Instead of letting our model look at only one word in each forward pass, we would like it to be able to consider the context of the word in question. That is, for each word, we want our model to be aware of the surrounding words. Let's dive in!

在這一節中，我們的目標是訓練一個模型，在一個句子中找到與「地點」相對應的詞，這些詞的跨度總是 `1`（意味著 `San Francisco` 不會被識別為 `地點`）。我們的任務被稱為「詞窗分類」是有原因的。我們不希望讓我們的模型在每次前進的過程中只看一個單詞，而是希望它能夠考慮到這個詞的文本脈絡。也就是說，對於每個詞，我們希望我們的模型能夠了解作為上下文的詞語。讓我們開始行動吧！

### Data 資料

The very first task of any machine learning project is to set up our training set. Usually, there will be a training corpus we will be utilizing. In NLP tasks, the corpus would generally be a `.txt` or `.csv` file where each row corresponds to a sentence or a tabular datapoint. In our toy task, we will assume that we have already read our data and the corresponding labels into a `Python` list.

機器學習的首要任務是建立我們的訓練集。通常情況下，我們會有一個訓練語料庫來使用。在NLP任務中，語料庫通常是一個 `.txt` 或 `.csv` 文件，每一行對應一個句子或一個表格資料點。在我們的簡易任務中，我們將假設我們已經將我們的數據和相應的標籤讀入一個 `Python` 的列表(list)。

In [88]:
# Our raw data, which consists of sentences
corpus = [
    "We always come to Paris",
    "The professor is from Australia",
    "I live in Stanford",
    "He comes from Taiwan",
    "The capital of Turkey is Ankara",
]

#### Preprocessing 預處理

To make it easier for our models to learn, we usually apply a few preprocessing steps to our data. This is especially important when dealing with text data. Here are some examples of text preprocessing:

* **Tokenization**: Tokenizing the sentences into words.
* **Lowercasing**: Changing all the letters to be lowercase.
* **Noise removal**: Removing special characters (such as punctuation).
* **Stop words removal**: Removing commonly used words.

為了使我們的模型更容易學習，我們通常會對數據進行一些預處理步驟。這在處理文本數據時尤其重要。下面是一些文本預處理的例子：

- **分詞**: 將句子符號化為單詞。
- **字母轉小寫**: 將所有字母改成小寫。
- **去除噪音**: 去除特殊字符（如標點符號）。
- **去掉停頓詞**: 去除常用的單詞。

Which preprocessing steps are necessary is determined by the task at hand. For example, although it is useful to remove special characters in some tasks, for others they may be important (for example, if we are dealing with multiple languages). For our task, we will lowercase our words and tokenize. 

哪些預處理步驟是必要的會由手上的任務決定。例如，盡管在某些任務中刪除特殊字符是有用的，但對於其他任務來說，它們可能是重要的（例如，如果我們正在處理多種語言）。對於我們的任務來說，會把單詞轉為小寫並進行標記。
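
For comparison, the optional steps listed above could be sketched as follows. This is not used in the rest of the tutorial; the function name `preprocess_with_options` and the tiny `STOP_WORDS` set are hypothetical and chosen only for illustration.

```python
import re

# A deliberately tiny, hypothetical stop word list
STOP_WORDS = {"the", "is", "of", "to"}


def preprocess_with_options(sentence, remove_stop_words=False):
    # Lowercasing
    lowered = sentence.lower()
    # Noise removal: drop everything that is not a word character or whitespace
    cleaned = re.sub(r"[^\w\s]", "", lowered)
    # Tokenization: split on whitespace
    tokens = cleaned.split()
    # Optional stop word removal
    if remove_stop_words:
        tokens = [token for token in tokens if token not in STOP_WORDS]
    return tokens


basic_tokens = preprocess_with_options("The capital of Turkey is Ankara!")
filtered_tokens = preprocess_with_options("The capital of Turkey is Ankara!", remove_stop_words=True)
print(basic_tokens)
print(filtered_tokens)
```

Note that removing stop words would hurt our task: words such as `to` and `of` are exactly the context our window classifier relies on.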


In [89]:
# The preprocessing function we will use to generate our training examples
# Our function is a simple one, we lowercase the letters
# and then tokenize the words.
def preprocess_sentence(sentence):
    return sentence.lower().split()


# Create our training set using the preprocessing function defined above
train_sentences = [preprocess_sentence(sent) for sent in corpus]
train_sentences

[['we', 'always', 'come', 'to', 'paris'],
 ['the', 'professor', 'is', 'from', 'australia'],
 ['i', 'live', 'in', 'stanford'],
 ['he', 'comes', 'from', 'taiwan'],
 ['the', 'capital', 'of', 'turkey', 'is', 'ankara']]

For each training example we have, we should also have a corresponding label. Recall that the goal of our model was to determine which words correspond to a `LOCATION`. That is, we want our model to output `0` for all the words that are not `LOCATION`s and `1` for the ones that are `LOCATION`s.

對於我們擁有的每個訓練例子，應該有一個相應的標籤。回顧一下，模型的目標是確定哪些詞對應於 `LOCATION`。也就是說，我們希望模型對所有不是 `LOCATION `的詞輸出 `0`，對是 `LOCATION `的詞輸出 `1`。

In [90]:
# Set of locations that appear in our corpus
locations = set(["australia", "ankara", "paris", "stanford", "taiwan", "turkey"])

# Our train labels
train_labels = [[1 if word in locations else 0 for word in sent] for sent in train_sentences]
train_labels

[[0, 0, 0, 0, 1],
 [0, 0, 0, 0, 1],
 [0, 0, 0, 1],
 [0, 0, 0, 1],
 [0, 0, 0, 1, 0, 1]]

#### Converting Words to Embeddings 將單詞轉換為嵌入

Let's look at our training data a little more closely. Each datapoint we have is a sequence of words. On the other hand, we know that machine learning models work with numbers in vectors. How are we going to turn words into numbers? You may be thinking embeddings and you are right!

讓我們更仔細地看一下我們的訓練數據。每個數據點都是一個單詞序列。另一方面，我們知道機器學習模型是用向量中的數字工作的。如何將單詞變成數字呢？你可能在想詞嵌入，你是對的！

Imagine that we have an embedding lookup table `E`, where each row corresponds to an embedding. That is, each word in our vocabulary would have a corresponding embedding row `i` in this table. Whenever we want to find an embedding for a word, we will follow these steps:
1. Find the corresponding index `i` of the word in the embedding table: `word->index`.
2. Index into the embedding table and get the embedding: `index->embedding`.

想像一下，我們有一個嵌入查詢表 `E`，其中每一行都對應著一個嵌入。也就是說，詞匯中的每個詞在這個表中都有一個對應的嵌入行 `i`。每當我們想找到一個詞的嵌入時，將遵循以下步驟：
1. 找到該詞在嵌入表中的對應索引 `i`: `詞->索引`。
2. 索引到嵌入表並得到嵌入: `index->embedding`。
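
The two lookup steps can be sketched in plain Python before we bring in any `PyTorch` machinery. The vocabulary, the table `toy_E`, and its values are hypothetical and exist only to illustrate the idea.

```python
# Hypothetical toy vocabulary and embedding table, for illustration only
toy_word_to_index = {"paris": 0, "to": 1, "we": 2}
toy_E = [
    [0.1, 0.2, 0.3],  # row 0: embedding for "paris"
    [0.4, 0.5, 0.6],  # row 1: embedding for "to"
    [0.7, 0.8, 0.9],  # row 2: embedding for "we"
]


def lookup_embedding(word):
    index = toy_word_to_index[word]  # step 1: word -> index
    return toy_E[index]              # step 2: index -> embedding


to_embedding = lookup_embedding("to")
print(to_embedding)
```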

Let's look at the first step. We should assign all the words in our vocabulary to a corresponding index. We can do it as follows:
1. Find all the unique words in our corpus.
2. Assign an index to each.

讓我們來看看第一步。我們應該將詞匯表中的所有單詞分配給一個相應的索引。可以按以下方式進行：
1. 在我們的語料庫中找到所有獨特的詞。
2. 為每個詞分配一個索引。

In [91]:
# Find all the unique words in our corpus
vocabulary = set(w for s in train_sentences for w in s)
vocabulary

{'always',
 'ankara',
 'australia',
 'capital',
 'come',
 'comes',
 'from',
 'he',
 'i',
 'in',
 'is',
 'live',
 'of',
 'paris',
 'professor',
 'stanford',
 'taiwan',
 'the',
 'to',
 'turkey',
 'we'}

`vocabulary` now contains all the words in our corpus. At test time, however, we may encounter words that are not in our vocabulary. If we can figure out a way to represent the unknown words, our model can still reason about whether they are a `LOCATION` or not, since we are also looking at the neighboring words for each prediction.

We introduce a special token, `<unk>`, to handle out-of-vocabulary words. We could pick another string for our unknown token if we wanted. The only requirement here is that our token should be unique: we should only be using this token for unknown words. We will also add this special token to our vocabulary.

`vocabulary` 現在包含了我們語料庫中的所有單詞。另一方面，在測試期間，我們可以看到不包含在我們的詞匯表中的詞。如果我們能想出一種方法來表示這些未知的詞，我們的模型仍然可以推理出它們是否是一個 `LOCATION`，因為我們也在看每個預測的鄰近詞。

我們引入了一個特殊的標記，`<unk>`，以解決詞匯量以外的詞。如果我們願意的話，可以選擇另一個字符串作為未知標記。這裡唯一的要求是：標記應該是唯一的：我們應該只對未知的詞使用這個標記。我們還會把這個特殊的標記添加到我們的詞匯表中。

In [92]:
# Add the unknown token to our vocabulary
vocabulary.add("<unk>")

Earlier we mentioned that our task was called `Word Window Classification` because our model is looking at the surrounding words in addition to the given word when it needs to make a prediction.

前面我們提到，我們的任務被稱為「字窗分類」，因為我們的模型在需要進行預測時，除了給定的單詞外，還要查看周圍的單詞。

For example, let's take the sentence "We always come to Paris". The corresponding training label for this sentence is `0, 0, 0, 0, 1` since only Paris, the last word, is a `LOCATION`. In one pass (meaning a call to `forward()`), our model will try to generate the correct label for one word. Let's say our model is trying to generate the correct label `1` for `Paris`. If we only allow our model to see `Paris`, but nothing else, we will miss out on the important information that the word `to` often appears with `LOCATION`s.

例如「我們總是來巴黎」這個句子。這個句子對應的訓練標籤是 `0, 0, 0, 0, 1`，因為只有最後一個詞巴黎是一個「地點」。在一次傳遞中（指呼叫 `forward()`），我們的模型將嘗試為一個詞生成正確的標籤。假設我們的模型正試圖為「巴黎」生成正確的標籤 `1`。如果我們只允許我們的模型看到「巴黎」，而其他東西全部忽略的話，我們會錯過 `to` 這個詞經常與 `LOCATION` 一起出現的重要資訊。

Word windows allow our model to consider the `N` words before and the `N` words after each word when making a prediction. In our earlier example for `Paris`, if we have a window size of `1`, our model will look at the words that come immediately before and after `Paris`, which are `to`, and, well, nothing. Now, this raises another issue. `Paris` is at the end of our sentence, so there isn't another word following it. Remember that we define the input dimensions of our `PyTorch` models when we are initializing them. If we set the window size to be `1`, our model will be accepting `3` words in every pass. We cannot have our model expect `2` words from time to time.

詞窗允許我們的模型在進行預測時考慮每個詞的周圍 `+N` 或 `-N` 個詞。在我們前面關於「巴黎」的例子中，如果窗口大小為 `1`，這意味著模型將查看緊接在「巴黎」之前和之後的詞，也就是只有 `to`，其他什麼也沒有。這提出了另一個問題。
「巴黎」在句子的末尾，所以它後面沒有別的詞。記得，我們在初始化 `PyTorch` 模型的時候定義了輸入大小。如果我們將窗口大小設置為 `1`，這意味著模型將在每一次傳遞中必須輸入 `3` 個詞。我們無法讓模型只輸入 `2` 個單詞。

The solution is to introduce a special token, such as `<pad>`, that will be added to our sentences to make sure that every word has a valid window around it. Similar to the `<unk>` token, we could pick another string for our pad token if we wanted, as long as we make sure it is used for a unique purpose.

解決方案是引入一個特殊的標記，如 `<pad>`，它將被添加到我們的句子中，以確保每個單詞周圍都有一個有效的窗口。與 `<unk>` 標記類似，如果我們願意，我們可以選擇另一個字符串作為 padding 標記，只要確保它被用於一個獨特的目的即可。

In [93]:
# Add the <pad> token to our vocabulary
vocabulary.add("<pad>")


# Function that pads the given sentence
# We are introducing this function here as an example
# We will be utilizing it later in the tutorial
def pad_window(sentence, window_size, pad_token="<pad>"):
    window = [pad_token] * window_size
    return window + sentence + window


# Show padding example
window_size = 2
pad_window(train_sentences[0], window_size=window_size)

['<pad>', '<pad>', 'we', 'always', 'come', 'to', 'paris', '<pad>', '<pad>']
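
To see how padding guarantees a full window for every word, here is a minimal pure Python sketch that extracts one window per original word. The helper `make_windows` is hypothetical; later in the tutorial the windows are created with tensor operations instead.

```python
def make_windows(sentence, window_size, pad_token="<pad>"):
    # Pad both sides so that the first and last words also get full windows
    padding = [pad_token] * window_size
    padded = padding + sentence + padding
    width = 2 * window_size + 1
    # One window per original word, centered on that word
    return [padded[i:i + width] for i in range(len(sentence))]


windows = make_windows(["we", "always", "come", "to", "paris"], window_size=1)
for window in windows:
    print(window)
```

With a window size of `1`, the last window is `['to', 'paris', '<pad>']`: the model always receives `3` tokens, even for the final word.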

Now that our vocabulary is ready, let's assign an index to each of our words.

現在詞匯表已經準備好了，讓我們為每個詞指定一個索引。

In [94]:
# We are just converting our vocabulary to a list to be able to index into it
# Sorting is not necessary, we sort to show an ordered word_to_ix dictionary
# That being said, we will see that having the index for the padding token
# be 0 is convenient as some PyTorch functions use it as a default value
# such as nn.utils.rnn.pad_sequence, which we will cover in a bit
ix_to_word = sorted(list(vocabulary))

# Creating a dictionary to find the index of a given word
word_to_ix = {word: ind for ind, word in enumerate(ix_to_word)}
word_to_ix

{'<pad>': 0,
 '<unk>': 1,
 'always': 2,
 'ankara': 3,
 'australia': 4,
 'capital': 5,
 'come': 6,
 'comes': 7,
 'from': 8,
 'he': 9,
 'i': 10,
 'in': 11,
 'is': 12,
 'live': 13,
 'of': 14,
 'paris': 15,
 'professor': 16,
 'stanford': 17,
 'taiwan': 18,
 'the': 19,
 'to': 20,
 'turkey': 21,
 'we': 22}

In [95]:
ix_to_word[1]

'<unk>'

Great! We are ready to convert our training sentences into a sequence of indices corresponding to each token. 

很好！ 我們已經準備好將訓練句子轉換成對應於每個標記的索引序列。

In [96]:
# Given a sentence of tokens, return the corresponding indices
def convert_token_to_indices(sentence, word_to_ix):
    indices = []
    for token in sentence:
        # Check if the token is in our vocabulary. If it is, get its index.
        # If not, get the index for the unknown token.
        if token in word_to_ix:
            index = word_to_ix[token]
        else:
            index = word_to_ix["<unk>"]
        indices.append(index)
    return indices


# More compact version of the same function
def _convert_token_to_indices(sentence, word_to_ix):
    return [word_to_ix.get(token, word_to_ix["<unk>"]) for token in sentence]


# Show an example
example_sentence = ["we", "always", "come", "to", "kuwait"]
example_indices = convert_token_to_indices(example_sentence, word_to_ix)
restored_example = [ix_to_word[ind] for ind in example_indices]

print(f"Original sentence is: {example_sentence}")
print(f"Going from words to indices: {example_indices}")
print(f"Going from indices to words: {restored_example}")

Original sentence is: ['we', 'always', 'come', 'to', 'kuwait']
Going from words to indices: [22, 2, 6, 20, 1]
Going from indices to words: ['we', 'always', 'come', 'to', '<unk>']


In the example above, `kuwait` shows up as `<unk>`, because it is not included in our vocabulary. Let's convert our `train_sentences` to sequences of indices, which we store in `example_padded_indices`.

在上面的例子中，`kuwait` 顯示為 `<unk>`，因為它不包括在我們的詞匯表中。讓我們把我們的 `train_sentences` 轉換為`example_padded_indices`。

In [97]:
# Converting our sentences to indices
example_padded_indices = [convert_token_to_indices(s, word_to_ix) for s in train_sentences]
example_padded_indices

[[22, 2, 6, 20, 15],
 [19, 16, 12, 8, 4],
 [10, 13, 11, 17],
 [9, 7, 8, 18],
 [19, 5, 14, 21, 12, 3]]

Now that we have an index for each word in our vocabulary, we can create an embedding table with the `nn.Embedding` class in `PyTorch`. It is called as `nn.Embedding(num_words, embedding_dimension)`, where `num_words` is the number of words in our vocabulary and `embedding_dimension` is the dimension of the embeddings we want to have. There is nothing fancy about `nn.Embedding`: it is just a wrapper class around a trainable `NxE` dimensional tensor, where `N` is the number of words in our vocabulary and `E` is the number of embedding dimensions. This table is initially random, but it will change over time. As we train our network, the gradients will be backpropagated all the way to the embedding layer, and hence our word embeddings will be updated. We will initialize the embedding layer that our model uses inside the model itself, but we are showing an example here.

現在我們有了詞匯中每個詞的索引，我們可以用 `PyTorch` 中的 `nn.Embedding` 類創建一個嵌入表。具體呼叫方法如下： `nn.Embedding(num_words, embedding_dimension)` 其中 `num_words` 是詞匯表中的單詞數，`embedding_dimension` 是嵌入維度。 `nn.Embedding` 沒有什麼花俏的東西：它只是一個可訓練的 `NxE` 維度的張量的包裝類別，其中 `N` 是我們詞匯表中的詞數，`E` 是嵌入維度的數量。這個表最初是隨機的，但它會隨著時間的推移而改變。當我們訓練網路時，梯度將被反向傳播到嵌入層，因此詞嵌入將被更新。我們會在之後的模型中初始化嵌入層，但我們還是在這裡展示一個例子。

In [99]:
# Creating an embedding table for our words
embedding_dim = 5
embeds = nn.Embedding(len(vocabulary), embedding_dim)

# Printing the parameters in our embedding table
list(embeds.parameters())

[Parameter containing:
 tensor([[-0.6441, -0.7673, -1.5417, -0.1207, -0.9651],
         [-1.2125, -0.2414, -0.7353, -0.8564, -0.6710],
         [-0.8709, -0.2147,  0.9281, -0.2203, -1.0247],
         [-0.7629,  0.5375, -0.2861, -0.0471,  0.1966],
         [-0.2590, -0.6950, -0.1008,  0.2827, -1.4766],
         [-0.4554, -2.2830, -0.9473, -1.1503, -0.8882],
         [-0.8901, -0.4013, -0.7572, -0.1856, -0.3930],
         [ 2.0512,  1.6718,  0.4780, -0.9675, -1.3268],
         [-1.9501,  0.4185,  0.1948,  0.9153,  0.8057],
         [-0.4385,  1.5488, -1.4032, -0.8632, -1.2927],
         [ 0.1379, -0.0428,  0.9655, -0.5298,  2.1228],
         [-0.7229,  0.0037,  0.3245,  0.0244, -1.3044],
         [-0.9340, -0.6550,  0.4217, -0.5786, -0.3533],
         [-0.9831,  0.0045, -0.6370, -0.5370, -0.4228],
         [ 0.7762, -1.0065,  0.9537, -1.6010,  1.3405],
         [-0.3111, -0.1917, -0.5288, -1.4382, -0.7878],
         [ 0.5280, -0.2763,  1.1959, -0.4081,  0.1691],
         [-0.2543, -1.061

To get the word embedding for a word in our vocabulary, all we need to do is to create a lookup tensor. The lookup tensor is just a tensor containing the index we want to look up. The `nn.Embedding` class expects an integer index tensor; here we use the `torch.long` type (a `LongTensor`), so we create our tensor accordingly.

為了得到我們詞匯表中一個詞的詞嵌入，我們需要做的就是創建一個查找張量。查找張量只是一個包含我們想要查詢的索引的張量。`nn.Embedding` 類別希望索引張量是 Long Tensor 類別，所以我們應該相應地創建我們的張量。

In [100]:
# Get the embedding for the word Paris
index = word_to_ix["paris"]
index_tensor = torch.tensor(index, dtype=torch.long)
paris_embed = embeds(index_tensor)
paris_embed

tensor([-0.3111, -0.1917, -0.5288, -1.4382, -0.7878],
       grad_fn=<EmbeddingBackward0>)

In [101]:
# We can also get multiple embeddings at once
index_paris = word_to_ix["paris"]
index_ankara = word_to_ix["ankara"]
indices = [index_paris, index_ankara]
indices_tensor = torch.tensor(indices, dtype=torch.long)
embeddings = embeds(indices_tensor)
embeddings

tensor([[-0.3111, -0.1917, -0.5288, -1.4382, -0.7878],
        [-0.7629,  0.5375, -0.2861, -0.0471,  0.1966]],
       grad_fn=<EmbeddingBackward0>)
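
Since `nn.Embedding` wraps a trainable table, a lookup should return the same values as indexing the rows of its `weight` tensor directly. The sketch below checks this on a small, hypothetical table (`toy_embeds` is separate from the `embeds` table above).

```python
import torch
import torch.nn as nn

# A small embedding table: 4 "words", 3 dimensions each (toy sizes)
toy_embeds = nn.Embedding(4, 3)

# Look up rows 2 and 0 through the layer
toy_indices = torch.tensor([2, 0], dtype=torch.long)
looked_up = toy_embeds(toy_indices)

# Index the underlying weight tensor directly
indexed = toy_embeds.weight[toy_indices]

print(looked_up.shape)
print(torch.equal(looked_up, indexed))
```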

Usually, we define the embedding layer as part of our model, which you will see in the later sections of our notebook.

通常，我們將嵌入層定義為我們模型的一部分，你將在我們筆記本的後面章節中看到。

#### Batching Sentences 分批句子

We have learned about batches in class. Waiting for our whole training corpus to be processed before making an update is costly. On the other hand, updating the parameters after every training example causes the loss to be less stable between updates. To combat these issues, we instead update our parameters after training on a batch of data. This gives us a better estimate of the gradient of the global loss than a single example would. In this section, we will learn how to structure our data into batches using the `torch.utils.data.DataLoader` class.

我們在課堂上已經了解了批次的概念。等待整個訓練語料庫被處理後再進行更新的成本很高。另一方面，在每個訓練實例之後更新參數會導致在兩次更新之間損失的穩定性降低。為了解決這些問題，我們在對一批數據進行訓練後再更新我們的參數。這使我們能夠更好地估計全局損失的梯度。在本節中，我們將學習如何使用 `torch.utils.data.DataLoader` 類別將我們的數據結構化為批次。

We will be calling the `DataLoader` class as follows: `DataLoader(data, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)`. The `batch_size` parameter determines the number of examples per batch. In every epoch, we will be iterating over all the batches using the `DataLoader`. By default the examples are served in their original order, so every epoch sees the same batches. Setting the `shuffle` parameter to `True` asks the `DataLoader` to reshuffle the examples at every epoch, so the composition of the batches changes and we are unlikely to encounter the same bad batch repeatedly.

我們將以如下方式呼叫 `DataLoader` 類別：`DataLoader(data, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)`。`batch_size` 參數決定了每批的例子數量。在每個 `epoch` 中，我們將使用 `DataLoader` 對所有批次進行迭代。在默認情況下，樣本會按原始順序取出，因此每個 `epoch` 的批次都相同。將 `shuffle` 參數設置為 `True` 會要求 `DataLoader` 在每個 `epoch` 重新打亂樣本，使批次的組成隨之改變。這樣我們就不太可能重複遇到同一個壞的批次。
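
Before adding a custom `collate_fn`, it may help to see the default batching behavior on its own. The sketch below uses a toy list of integers as the dataset (an assumption for illustration) and relies on the default collation, which stacks the integers into a tensor.

```python
from torch.utils.data import DataLoader

# A toy dataset: five integer "examples"
toy_data = list(range(5))

# Without shuffling, batches follow the original order
ordered_loader = DataLoader(toy_data, batch_size=2, shuffle=False)
ordered_batches = [batch.tolist() for batch in ordered_loader]
print(ordered_batches)

# With shuffling, the examples are reshuffled every time we iterate (every epoch)
shuffled_loader = DataLoader(toy_data, batch_size=2, shuffle=True)
for epoch in range(2):
    print(epoch, [batch.tolist() for batch in shuffled_loader])
```

Because `5` is not divisible by `2`, the last batch contains a single example. The shuffled batches differ from run to run, so no fixed output is shown for them.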

If provided, `DataLoader` passes each batch it prepares (a list of examples) to the `collate_fn`. We can write a custom function to pass to the `collate_fn` parameter in order to print stats about our batch or perform extra processing. In our case, we will use the `collate_fn` to:
1. Window pad our train sentences.
2. Convert the words in the training examples to indices.
3. Pad the training examples and the labels so that all the sentences in a batch have the same length. This creates an issue because when calculating the loss, we need to know the actual number of words in a given example. We will therefore also keep track of this number in the function we pass to the `collate_fn` parameter.

如果提供的話，`DataLoader`會把它準備好的 `batch` 處理傳遞給 `collate_fn`。我們可以寫一個自定義函數傳遞給 `collate_fn` 參數，以便印出關於批次的統計訊息或執行額外的處理。在我們的案例中，我們將使用 `collate_fn` 來：
1. 對我們的訓練句子進行窗口填充（pad）。
2. 將訓練實例中的單詞轉換為索引。
3. 對訓練例子進行填充，使所有的句子和標籤具有相同的長度。同樣地，我們也需要對標籤進行填充。這就產生了一個問題，因為在計算損失時，我們需要知道一個給定例子中的實際單詞數。我們也將在傳遞給 `collate_fn` 參數的函數中紀錄這個數字。
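
The third step can be sketched in isolation. Padded labels use the value `0`, which is indistinguishable from a real non-`LOCATION` label, so the recorded lengths are what allow us to tell real positions from padding. The toy tensors and the mask construction below are assumptions for illustration, not the code used later in the notebook.

```python
import torch
import torch.nn as nn

# Two label sequences of different lengths (toy values)
toy_labels = [torch.LongTensor([0, 0, 1]), torch.LongTensor([0, 1])]

# Record the true lengths before padding
toy_lengths = torch.LongTensor([len(label) for label in toy_labels])

# Pad to a rectangular (batch, max_len) tensor
toy_padded = nn.utils.rnn.pad_sequence(toy_labels, batch_first=True, padding_value=0)

# Boolean mask that is True only at real (unpadded) positions
positions = torch.arange(toy_padded.size(1)).unsqueeze(0)
toy_mask = positions < toy_lengths.unsqueeze(1)

print(toy_padded)
print(toy_mask)
```

A mask like `toy_mask` is one possible way to exclude padded positions when computing the loss.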

Because our version of the `collate_fn` function needs access to our `word_to_ix` dictionary (so that it can turn words into indices), we will make use of the `partial` function from Python's `functools` module, which returns a new function with the arguments we give it already filled in.

因為我們的版本的 `collate_fn` 函數會需要使用 `word_to_ix` 字典（以便它能將單詞變成索引），我們將利用 `Python` 中的 `partial` 函數，它將我們給出的參數傳遞給我們傳遞給它的函數。
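
Here is a minimal sketch of how `partial` works, using a hypothetical `describe_batch` function in place of a real collate function. `DataLoader` calls `collate_fn` with a single argument (the batch), so the remaining arguments must be bound ahead of time.

```python
from functools import partial


def describe_batch(batch, window_size, word_to_ix):
    # A stand-in for a collate function that needs extra arguments
    return f"{len(batch)} examples, window_size={window_size}, vocab={len(word_to_ix)}"


toy_word_to_ix = {"<pad>": 0, "<unk>": 1}

# Bind the extra arguments; the result only needs the batch
bound_fn = partial(describe_batch, window_size=2, word_to_ix=toy_word_to_ix)

summary = bound_fn([("a", 0), ("b", 1)])
print(summary)
```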

In [102]:
from torch.utils.data import DataLoader
from functools import partial


def custom_collate_fn(batch, window_size, word_to_ix):
    # Break our batch into the training examples (x) and labels (y).
    # At this point x and y are tuples of Python lists. We will turn them into
    # tensors further below because nn.utils.rnn.pad_sequence expects tensors,
    # and because our model will be expecting tensor inputs.
    x, y = zip(*batch)

    # Now we need to window pad our training examples. We have already defined a
    # function to handle window padding. We are including it here again so that
    # everything is in one place.
    def pad_window(sentence, window_size, pad_token="<pad>"):
        window = [pad_token] * window_size
        return window + sentence + window

    # Pad the train examples.
    x = [pad_window(s, window_size=window_size) for s in x]

    # Now we need to turn words in our training examples to indices. We are
    # copying the function defined earlier for the same reason as above.
    def convert_tokens_to_indices(sentence, word_to_ix):
        return [word_to_ix.get(token, word_to_ix["<unk>"]) for token in sentence]

    # Convert the train examples into indices.
    x = [convert_tokens_to_indices(s, word_to_ix) for s in x]

    # We will now pad the examples so that the lengths of all the examples in
    # one batch are the same, making it possible to do matrix operations.
    # We set the batch_first parameter to True so that the returned matrix has
    # the batch as the first dimension.
    pad_token_ix = word_to_ix["<pad>"]

    # pad_sequence expects a list of tensors, so we turn each example in x into one
    x = [torch.LongTensor(x_i) for x_i in x]
    x_padded = nn.utils.rnn.pad_sequence(x, batch_first=True, padding_value=pad_token_ix)

    # We will also pad the labels. Before doing so, we will record the number
    # of labels so that we know how many words existed in each example.
    lengths = [len(label) for label in y]
    lengths = torch.LongTensor(lengths)

    y = [torch.LongTensor(y_i) for y_i in y]
    y_padded = nn.utils.rnn.pad_sequence(y, batch_first=True, padding_value=0)

    # We are now ready to return our variables. The order we return our variables
    # here will match the order we read them in our training loop.
    return x_padded, y_padded, lengths

This function seems long, but it really doesn't have to be. Check out the alternative version below where we remove the extra function declarations and comments. 

這個函數看起來很長，但它真的不需要。請看下面的替代版本，我們刪除了多餘的函數宣告和註解。

In [103]:
def _custom_collate_fn(batch, window_size, word_to_ix):
    # Prepare the datapoints
    x, y = zip(*batch)
    x = [pad_window(s, window_size=window_size) for s in x]
    x = [convert_token_to_indices(s, word_to_ix) for s in x]

    # Pad x so that all the examples in the batch have the same size
    pad_token_ix = word_to_ix["<pad>"]
    x = [torch.LongTensor(x_i) for x_i in x]
    x_padded = nn.utils.rnn.pad_sequence(x, batch_first=True, padding_value=pad_token_ix)

    # Pad y and record the length
    lengths = [len(label) for label in y]
    lengths = torch.LongTensor(lengths)
    y = [torch.LongTensor(y_i) for y_i in y]
    y_padded = nn.utils.rnn.pad_sequence(y, batch_first=True, padding_value=0)

    return x_padded, y_padded, lengths

Now, we can see the `DataLoader` in action. 

現在，我們可以看到 `DataLoader` 在執行。

In [105]:
# Parameters to be passed to the DataLoader
data = list(zip(train_sentences, train_labels))
batch_size = 2
shuffle = True
window_size = 2
collate_fn = partial(custom_collate_fn, window_size=window_size, word_to_ix=word_to_ix)

# Instantiate the DataLoader
loader = DataLoader(data, batch_size=batch_size, shuffle=shuffle, collate_fn=collate_fn)

# Go through one loop
counter = 0
for batched_x, batched_y, batched_lengths in loader:
    print(f"Iteration {counter}")
    print("Batched Input:")
    print(batched_x)
    print("Batched Labels:")
    print(batched_y)
    print("Batched Lengths:")
    print(batched_lengths)
    print("")
    counter += 1

Iteration 0
Batched Input:
tensor([[ 0,  0, 22,  2,  6, 20, 15,  0,  0],
        [ 0,  0, 10, 13, 11, 17,  0,  0,  0]])
Batched Labels:
tensor([[0, 0, 0, 0, 1],
        [0, 0, 0, 1, 0]])
Batched Lengths:
tensor([5, 4])

Iteration 1
Batched Input:
tensor([[ 0,  0,  9,  7,  8, 18,  0,  0,  0],
        [ 0,  0, 19, 16, 12,  8,  4,  0,  0]])
Batched Labels:
tensor([[0, 0, 0, 1, 0],
        [0, 0, 0, 0, 1]])
Batched Lengths:
tensor([4, 5])

Iteration 2
Batched Input:
tensor([[ 0,  0, 19,  5, 14, 21, 12,  3,  0,  0]])
Batched Labels:
tensor([[0, 0, 0, 1, 0, 1]])
Batched Lengths:
tensor([6])



The batched input tensors you see above will be passed into our model. However, we started off saying that our model would be a window classifier. The way our input tensors are currently formatted, all the words in a sentence are in one datapoint. When we pass this input to our model, it needs to create the windows for each word, make a prediction as to whether the center word is a `LOCATION` or not for each window, put the predictions together, and return them.

你在上面看到的分批輸入的張量將被傳入我們的模型。另一方面，我們一開始就說我們的模型將是一個窗口分類器。我們的輸入張量目前的格式是，我們把一個句子中的所有單詞都放在一個資料點上。當我們把這個輸入傳遞給我們的模型時，它需要為每個詞創建窗口，對每個窗口的中心詞是否是 `LOCATION` 進行預測，把預測結果放在一起並返回。

We could avoid this problem if we formatted our data by breaking it into windows beforehand. In this example, we will instead have our model take care of the formatting.

如果我們事先將資料分解成窗口，就可以避免這個問題。在這個例子中，我們將改用我們的模型來處理格式化的問題。

Given that our `window_size` is `N`, we want our model to make one prediction for every window of `2N+1` tokens. That is, if we have an input with `9` tokens and a `window_size` of `2`, we want our model to return `5` predictions. This makes sense because, before we padded it with `2` tokens on each side, our input also had `5` tokens in it!

鑒於我們的窗口大小是 `N`，我們希望我們的模型能夠對每一個 `2N+1` 的標記進行預測。也就是說，如果我們有一個包含 `9` 個標記的輸入，而 `window_size` 為 `2`，我們希望我們的模型能夠返回 `5` 個預測。這是有道理的，因為在我們用 `2` 個標記填充它之前，我們的輸入也有 `5` 個標記！
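
The counting argument above can be checked with a few lines of plain Python. This is only a sketch: the helper name `num_windows` is hypothetical and does not appear in the notebook.

```python
def num_windows(padded_length, window_size):
    """Count the full windows of width 2 * window_size + 1 that fit in a
    sequence of `padded_length` tokens when sliding one token at a time."""
    full_window = 2 * window_size + 1
    return padded_length - full_window + 1


# A 5-token sentence padded with 2 tokens on each side has 9 tokens.
sentence_length = 5
window_size = 2
padded_length = sentence_length + 2 * window_size

print(num_windows(padded_length, window_size))  # 5
```

The number of windows equals the original sentence length, so the model produces exactly one prediction per word.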

We can create these windows by using for loops, but there is a faster `PyTorch` alternative, which is the `unfold(dimension, size, step)` method. We can create the windows we need using this method as follows:

我們可以通過使用 `for` 迴圈來創建這些窗口，但是有一個更快的 `PyTorch` 替代方法，那就是 `unfold(dimension, size, step)` 函式。我們可以用這個方法創建我們需要的窗口，如下所示。

In [106]:
# Print the original tensor
print("Original Tensor: ")
print(batched_x)
print("")

# Slide a window of size 2 * window_size + 1 along dimension 1 with step 1
chunk = batched_x.unfold(1, window_size * 2 + 1, 1)
print("Windows: ")
print(chunk)

Original Tensor: 
tensor([[ 0,  0, 19,  5, 14, 21, 12,  3,  0,  0]])

Windows: 
tensor([[[ 0,  0, 19,  5, 14],
         [ 0, 19,  5, 14, 21],
         [19,  5, 14, 21, 12],
         [ 5, 14, 21, 12,  3],
         [14, 21, 12,  3,  0],
         [21, 12,  3,  0,  0]]])
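
For comparison, here is a minimal sketch of the `for` loop approach mentioned above, checked against `unfold`. The tensor values are copied from the output above; the variable names `full_window`, `loop_windows`, and `unfold_windows` are illustrative only.

```python
import torch

window_size = 2
full_window = 2 * window_size + 1

# Same values as the padded batch printed above (one sentence of 6 tokens).
batched_x = torch.tensor([[0, 0, 19, 5, 14, 21, 12, 3, 0, 0]])

# Explicit loop: slice out every window of width `full_window`.
num_windows = batched_x.size(1) - full_window + 1
loop_windows = torch.stack(
    [batched_x[:, start:start + full_window] for start in range(num_windows)],
    dim=1,
)

# Vectorized alternative used in the notebook.
unfold_windows = batched_x.unfold(1, full_window, 1)

print(loop_windows.shape)                         # torch.Size([1, 6, 5])
print(torch.equal(loop_windows, unfold_windows))  # True
```

Both approaches yield the same `(batch, num_windows, full_window)` tensor; `unfold` simply avoids the Python-level loop.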


### Model 模型

Now that we have prepared our data, we are ready to build our model. We have learned how to write custom `nn.Module` classes. We will do the same here and put everything we have learned so far together. 

現在我們已經準備好了資料，可以準備建立模型了。我們已經學會了如何編寫自定義的 `nn.Module` 類別。我們將在這裡做同樣的事情，把我們到目前為止所學的一切放在一起。

In [107]:
class WordWindowClassifier(nn.Module):
    def __init__(self, hyperparameters, vocab_size, pad_ix=0):
        super(WordWindowClassifier, self).__init__()

        """ Instance variables """
        self.window_size = hyperparameters["window_size"]
        self.embed_dim = hyperparameters["embed_dim"]
        self.hidden_dim = hyperparameters["hidden_dim"]
        self.freeze_embeddings = hyperparameters["freeze_embeddings"]

        """ Embedding Layer 
    Takes in a tensor containing embedding indices, and returns the 
    corresponding embeddings. The output is of dim 
    (number_of_indices * embedding_dim).

    If freeze_embeddings is True, set the embedding layer parameters to be
    non-trainable. This is useful if we only want the parameters other than the
    embeddings parameters to change. 

    """
        self.embeds = nn.Embedding(vocab_size, self.embed_dim, padding_idx=pad_ix)
        if self.freeze_embeddings:
            self.embeds.weight.requires_grad = False

        """ Hidden Layer
    """
        full_window_size = 2 * self.window_size + 1
        self.hidden_layer = nn.Sequential(nn.Linear(full_window_size * self.embed_dim, self.hidden_dim), nn.Tanh())

        """ Output Layer
    """
        self.output_layer = nn.Linear(self.hidden_dim, 1)

        """ Probabilities 
    """
        self.probabilities = nn.Sigmoid()

    def forward(self, inputs):
        """
        Let B:= batch_size
            L:= window-padded sentence length
            D:= self.embed_dim
            S:= 2 * self.window_size + 1 (full window width)
            H:= self.hidden_dim

        inputs: a (B, L) tensor of token indices
        """
        B, L = inputs.size()

        """
    Reshaping.
    Takes in a (B, L) LongTensor
    Outputs a (B, L~, S) LongTensor
    """
        # First, get our word windows for each word in our input.
        token_windows = inputs.unfold(1, 2 * self.window_size + 1, 1)
        _, adjusted_length, _ = token_windows.size()

        # Good idea to do internal tensor-size sanity checks, at the least in comments!
        assert token_windows.size() == (B, adjusted_length, 2 * self.window_size + 1)

        """
    Embedding.
    Takes in a torch.LongTensor of size (B, L~, S) 
    Outputs a (B, L~, S, D) FloatTensor.
    """
        embedded_windows = self.embeds(token_windows)

        """
    Reshaping.
    Takes in a (B, L~, S, D) FloatTensor.
    Resizes it into a (B, L~, S*D) FloatTensor.
    -1 argument "infers" what the last dimension should be based on leftover axes.
    """
        embedded_windows = embedded_windows.view(B, adjusted_length, -1)

        """
    Layer 1.
    Takes in a (B, L~, S*D) FloatTensor.
    Maps it to a (B, L~, H) FloatTensor.
    """
        layer_1 = self.hidden_layer(embedded_windows)

        """
    Layer 2.
    Takes in a (B, L~, H) FloatTensor.
    Maps it to a (B, L~, 1) FloatTensor.
    """
        output = self.output_layer(layer_1)

        """
    Sigmoid.
    Takes in a (B, L~, 1) FloatTensor of unnormalized scores (logits).
    Outputs a (B, L~, 1) FloatTensor of probabilities in (0, 1).
    """
        output = self.probabilities(output)
        output = output.view(B, -1)

        return output
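
To make the shape bookkeeping in `forward` concrete, the following sketch runs the same sequence of operations on a random batch and prints the shape after each step. The sizes (`vocab_size = 23`, `embed_dim = 4`, `hidden_dim = 3`) and the standalone layer variables are assumptions chosen for illustration, not the notebook's settings.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumptions, not the notebook's hyperparameters).
B, L = 2, 9                      # batch size, window-padded sentence length
window_size = 2
full_window = 2 * window_size + 1
vocab_size, embed_dim, hidden_dim = 23, 4, 3

embeds = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
hidden_layer = nn.Sequential(nn.Linear(full_window * embed_dim, hidden_dim), nn.Tanh())
output_layer = nn.Linear(hidden_dim, 1)

inputs = torch.randint(1, vocab_size, (B, L))        # (B, L) token indices
token_windows = inputs.unfold(1, full_window, 1)     # (B, L~, full_window)
embedded = embeds(token_windows)                     # (B, L~, full_window, D)
flat = embedded.view(B, token_windows.size(1), -1)   # (B, L~, full_window * D)
hidden = hidden_layer(flat)                          # (B, L~, H)
probs = torch.sigmoid(output_layer(hidden))          # (B, L~, 1)
probs = probs.view(B, -1)                            # (B, L~)

for name, tensor in [("windows", token_windows), ("embedded", embedded),
                     ("flat", flat), ("hidden", hidden), ("probs", probs)]:
    print(name, tuple(tensor.shape))
```

With these sizes, `L~` is `9 - 5 + 1 = 5`, so the final tensor holds one probability per original word in each sentence.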

### Training 訓練

We are now ready to put everything together. Let's start with preparing our data and initializing our model. We can then initialize our optimizer and define our loss function. This time, instead of using one of the predefined loss functions as we did before, we will define our own loss function.

我們現在已經準備好把所有東西放在一起。讓我們先準備好我們的數據並初始化我們的模型。然後，我們可以初始化我們的優化器並定義我們的損失函數。這一次，我們將定義我們自己的損失函數，而不是像以前那樣使用預定義的損失函數。

In [108]:
# Prepare the data
data = list(zip(train_sentences, train_labels))
batch_size = 2
shuffle = True
window_size = 2
collate_fn = partial(custom_collate_fn, window_size=window_size, word_to_ix=word_to_ix)

# Instantiate a DataLoader
loader = DataLoader(data, batch_size=batch_size, shuffle=shuffle, collate_fn=collate_fn)

# Initialize a model
# It is useful to put all the model hyperparameters in a dictionary
model_hyperparameters = {
    "batch_size": 4,
    "window_size": 2,
    "embed_dim": 25,
    "hidden_dim": 25,
    "freeze_embeddings": False,
}

vocab_size = len(word_to_ix)
model = WordWindowClassifier(model_hyperparameters, vocab_size)

# Define an optimizer
learning_rate = 0.01
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)


# Define a loss function, which computes the binary cross-entropy loss
def loss_function(batch_outputs, batch_labels, batch_lengths):
    # Calculate the loss for the whole batch
    bceloss = nn.BCELoss()
    loss = bceloss(batch_outputs, batch_labels.float())

    # Rescale the loss. Remember that we have used lengths to store the
    # number of words in each training example
    loss = loss / batch_lengths.sum().float()

    return loss
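
As a sanity check on what `nn.BCELoss` computes, the sketch below compares it with the binary cross-entropy formula written out by hand. The probabilities and labels are made-up values for illustration only.

```python
import torch
import torch.nn as nn

# Hypothetical model outputs (probabilities) and gold labels for one batch.
probs = torch.tensor([[0.9, 0.2, 0.7], [0.1, 0.8, 0.4]])
labels = torch.tensor([[1, 0, 1], [0, 1, 0]])

# Built-in loss: with the default reduction it averages over every element.
builtin = nn.BCELoss()(probs, labels.float())

# The same quantity written out by hand.
y = labels.float()
manual = -(y * torch.log(probs) + (1 - y) * torch.log(1 - probs)).mean()

print(torch.allclose(builtin, manual))  # True
```

Because the default reduction is already a mean over all elements, the `loss_function` above divides an averaged loss by the total number of words in the batch, which is one reason the reported loss values are small.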

Unlike our earlier example, this time instead of passing all of our training data to the model at once in each epoch, we will be utilizing batches. Hence, in each training epoch iteration, we also iterate over the batches.

與我們之前的例子不同，這次我們不是在每個 `epoch` 中一次性將所有的訓練數據傳遞給模型，而是利用批次。因此，在每個 `epoch` 的迭代中，我們也會對批次進行迭代。
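
As a rough sketch of what batching implies for the number of parameter updates, assume the five training sentences seen in the `DataLoader` output earlier (batches of 2, 2, and 1) and the default behavior of keeping the last, smaller batch:

```python
import math

num_examples = 5    # inferred from the loader output above (2 + 2 + 1 sentences)
batch_size = 2
num_epochs = 1000

# The last, smaller batch is kept by default, hence the ceiling.
batches_per_epoch = math.ceil(num_examples / batch_size)
total_updates = batches_per_epoch * num_epochs

print(batches_per_epoch)  # 3
print(total_updates)      # 3000
```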

In [109]:
# Function that will be called in every epoch
def train_epoch(loss_function, optimizer, model, loader):
    # Keep track of the total loss over all batches in the epoch
    total_loss = 0
    for batch_inputs, batch_labels, batch_lengths in loader:
        # Clear the gradients
        optimizer.zero_grad()
        # Run a forward pass (calling the module invokes forward())
        outputs = model(batch_inputs)
        # Compute the batch loss
        loss = loss_function(outputs, batch_labels, batch_lengths)
        # Calculate the gradients
        loss.backward()
        # Update the parameters
        optimizer.step()
        total_loss += loss.item()

    return total_loss


# Function containing our main training loop
def train(loss_function, optimizer, model, loader, num_epochs=10000):
    # Iterate through each epoch and call our train_epoch function
    for epoch in range(num_epochs):
        epoch_loss = train_epoch(loss_function, optimizer, model, loader)
        if epoch % 100 == 0:
            print(epoch_loss)

Let's start training!

開始訓練吧！

In [110]:
num_epochs = 1000
train(loss_function, optimizer, model, loader, num_epochs=num_epochs)

0.2729293256998062
0.22947931662201881
0.18051733449101448
0.13749945163726807
0.08919842541217804
0.07945763878524303
0.05936894100159407
0.05264806840568781
0.0382453016936779
0.035388316959142685


### Prediction 預測

Let's see how good our model is at making predictions. We can start by creating our test data.

讓我們看看我們的模型在預測方面的表現如何。我們可以從創建我們的測試數據開始。

In [111]:
# Create test sentences
test_corpus = ["She comes from Paris"]
test_sentences = [s.lower().split() for s in test_corpus]
test_labels = [[0, 0, 0, 1]]

# Create a test loader
test_data = list(zip(test_sentences, test_labels))
batch_size = 1
shuffle = False
window_size = 2
collate_fn = partial(custom_collate_fn, window_size=window_size, word_to_ix=word_to_ix)
test_loader = torch.utils.data.DataLoader(test_data, batch_size=batch_size, shuffle=shuffle, collate_fn=collate_fn)

Let's loop over our test examples to see how well we are doing.

讓我們在測試實例上預測，看看我們做得如何。

In [112]:
for test_instance, labels, _ in test_loader:
    outputs = model(test_instance)
    print(labels)
    print(outputs)

tensor([[0, 0, 0, 1]])
tensor([[0.5522, 0.0443, 0.0483, 0.9403]], grad_fn=<ViewBackward0>)
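
The probabilities above can be turned into hard `LOCATION` / non-`LOCATION` decisions by thresholding. The sketch below uses the values printed above; the cutoff of `0.5` is an assumption, since the notebook does not specify one.

```python
import torch

# Values copied from the output above.
labels = torch.tensor([[0, 0, 0, 1]])
outputs = torch.tensor([[0.5522, 0.0443, 0.0483, 0.9403]])

threshold = 0.5  # assumed cutoff; not specified in the notebook
predictions = (outputs > threshold).long()
accuracy = (predictions == labels).float().mean().item()

print(predictions)  # tensor([[1, 0, 0, 1]])
print(accuracy)     # 0.75
```

With this cutoff, "paris" is correctly tagged as a location, while "she" is narrowly flagged as one as well; a different threshold would change that outcome.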
