# Representing text

If we want to solve Natural Language Processing (NLP) tasks with neural networks, we need some way to represent text as tensors. Computers already represent textual characters as numbers that map to fonts on your screen using encodings such as ASCII or UTF-8.


<img alt="Image showing diagram mapping a character to an ASCII and binary representation" src="images/2-represent-text-as-tensor-checkpoint-1.png" align="middle" />

We understand what each letter **represents**, and how all characters come together to form the words of a sentence. However, computers by themselves do not have such an understanding, and a neural network has to learn the meaning during training.

Therefore, we can use different approaches when representing text:
* **Character-level representation**, where we represent text by treating each character as a number. Given that we have $C$ different characters in our text corpus, the word *Hello* would be represented by a $5\times C$ tensor. Each letter corresponds to one row of the tensor, a one-hot vector of length $C$.
* **Word-level representation**, where we create a **vocabulary** of all words in our text corpus and then represent each word using one-hot encoding. This approach is somewhat better, because a single letter carries little meaning on its own; by using higher-level semantic units (words) we simplify the task for the neural network. However, a large vocabulary means dealing with high-dimensional sparse tensors. For example, with a vocabulary of 10,000 different words, each word's one-hot encoding has length 10,000, with a single non-zero entry.
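
As a rough illustration of the character-level case, the sketch below one-hot encodes *Hello* using plain Python lists. The helper `one_hot` and the toy alphabet (just the distinct characters of the word itself) are assumptions made for this example; a real corpus would have a much larger $C$.

```python
def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec

word = "Hello"
# Toy "corpus": the alphabet is only the distinct characters of the word itself.
alphabet = sorted(set(word))
char_to_index = {ch: i for i, ch in enumerate(alphabet)}

# One row per character, one column per alphabet entry: a 5 x C matrix.
encoded = [one_hot(char_to_index[ch], len(alphabet)) for ch in word]

for ch, row in zip(word, encoded):
    print(ch, row)
```

Each row contains exactly one `1`, and the two `l` characters produce identical rows. The same idea applies at word level, except that the row length becomes the vocabulary size.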

To unify those approaches, we typically call an atomic piece of text **a token**. In some cases tokens can be letters, in other cases - words, or parts of words.

> For example, we can choose to tokenize *indivisible* as `in`-`#divis`-`#ible`, where the `#` sign indicates that the token continues the word started by the previous token. This would allow the root `divis` to always be represented by one token, corresponding to one core meaning.
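
One possible way to produce such a split is greedy longest-match-first lookup against a subword vocabulary. The sketch below is a simplified illustration, not the algorithm of any particular library; the function `subword_tokenize` and the contents of `toy_vocab` are hypothetical.

```python
def subword_tokenize(word, vocab, marker="#"):
    """Greedy longest-match-first split of `word` into pieces from `vocab`.

    Pieces that do not start the word are looked up with `marker` prepended.
    Returns None if the word cannot be fully covered by the vocabulary.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = marker + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return None  # no piece in the vocabulary fits at this position
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, chosen only for this example.
toy_vocab = {"in", "divis", "#divis", "#ible"}

print(subword_tokenize("indivisible", toy_vocab))
print(subword_tokenize("divisible", toy_vocab))
print(subword_tokenize("visible", toy_vocab))
```

With this toy vocabulary the first call yields `['in', '#divis', '#ible']`, the second yields `['divis', '#ible']`, and the third returns `None` because no piece covers the start of the word.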

The process of converting text into a sequence of tokens is called **tokenization**. Next, we need to assign a number to each token, which we can then feed into a neural network. This is called **vectorization**, and is normally done by building a token vocabulary.

Let's start by installing some required Python packages we'll use in this module.



In [1]:
!pip install -r https://raw.githubusercontent.com/MicrosoftDocs/pytorchfundamentals/main/nlp-pytorch/requirements.txt




# Text classification task

In this module, we will start with a simple text classification task based on the **AG_NEWS** sample dataset: classifying news headlines into one of 4 categories: _World, Sports, Business and Sci/Tech_. This dataset is available through PyTorch's `torchtext` module, so we can easily access it.

In [2]:
import torch
import torchtext
import os
import collections
os.makedirs('./data',exist_ok=True)
train_dataset, test_dataset = torchtext.datasets.AG_NEWS(root='./data')
classes = ['World', 'Sports', 'Business', 'Sci/Tech']

train.csv: 29.5MB [00:00, 121MB/s]                    
test.csv: 1.86MB [00:00, 104MB/s]                   


Here, `train_dataset` and `test_dataset` are iterators that return pairs of a label (the class number) and the corresponding text, for example:

In [3]:
next(train_dataset)

(3,
 "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.")

So, let's print out the next 5 news headlines from our dataset:

In [4]:
for i,x in zip(range(5),train_dataset):
    # AG_NEWS labels run from 1 to 4, so subtract 1 to index into `classes`
    print(f"**{classes[x[0]-1]}** -> {x[1]}\n")

**Business** -> Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\which has a reputation for making well-timed and occasionally\controversial plays in the defense industry, has quietly placed\its bets on another part of the market.

**Business** -> Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring crude prices plus worries\about the economy and the outlook for earnings are expected to\hang over the stock market next week during the depth of the\summer doldrums.

**Business** -> Iraq Halts Oil Exports from Main Southern Pipeline (Reuters) Reuters - Authorities have halted oil export\flows from the main pipeline in southern Iraq after\intelligence showed a rebel militia could strike\infrastructure, an oil official said on Saturday.

**Business** -> Oil prices soar to all-time record, posing new menace to US economy (AFP) AFP - Tearaway world oil prices, toppling records and straining wallets, present a new economic menace

Because datasets are iterators, if we want to use the data multiple times we need to convert it to a list:


In [5]:
train_dataset, test_dataset = torchtext.datasets.AG_NEWS(root='./data')
train_dataset = list(train_dataset)
test_dataset = list(test_dataset)

## Tokenization and Vectorization

Now we need to convert text into **numbers** that can be represented as tensors and fed into a neural network. The first step is to convert text to tokens: **tokenization**. If we use word-level representation, each word is represented by its own token. We will use the built-in tokenizer from the `torchtext` module:

In [6]:
tokenizer = torchtext.data.utils.get_tokenizer('basic_english')

We'll use the `torchtext` tokenizer to split the first 2 news articles into tokens. The `basic_english` option applies simple English-oriented normalization: it lowercases the text and separates common punctuation from words before splitting on whitespace. The result is a list of strings, one per word or punctuation mark.

In [7]:
first_sentence = train_dataset[0][1]
second_sentence = train_dataset[1][1]

f_tokens = tokenizer(first_sentence)
s_tokens = tokenizer(second_sentence)

print(f'\nfirst token list:\n{f_tokens}')
print(f'\nsecond token list:\n{s_tokens}')


first token list:
['wall', 'st', '.', 'bears', 'claw', 'back', 'into', 'the', 'black', '(', 'reuters', ')', 'reuters', '-', 'short-sellers', ',', 'wall', 'street', "'", 's', 'dwindling\\band', 'of', 'ultra-cynics', ',', 'are', 'seeing', 'green', 'again', '.']

second token list:
['carlyle', 'looks', 'toward', 'commercial', 'aerospace', '(', 'reuters', ')', 'reuters', '-', 'private', 'investment', 'firm', 'carlyle', 'group', ',', '\\which', 'has', 'a', 'reputation', 'for', 'making', 'well-timed', 'and', 'occasionally\\controversial', 'plays', 'in', 'the', 'defense', 'industry', ',', 'has', 'quietly', 'placed\\its', 'bets', 'on', 'another', 'part', 'of', 'the', 'market', '.']
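
To make this behaviour less opaque, here is a rough approximation of such a tokenizer using only the standard library. It is a simplified sketch, not the actual `torchtext` implementation; the function name `simple_english_tokenizer` and the exact set of punctuation it separates are assumptions for illustration.

```python
import re

def simple_english_tokenizer(text):
    """Lowercase, pad selected punctuation with spaces, split on whitespace."""
    text = text.lower()
    # Surround each listed punctuation mark with spaces so it becomes its own token.
    text = re.sub(r"([.,()!?'])", r" \1 ", text)
    return text.split()

tokens = simple_english_tokenizer("Wall St. Bears Claw Back Into the Black (Reuters)")
print(tokens)
```

Hyphens and backslashes are left untouched by this sketch, which is consistent with tokens such as `short-sellers` and `dwindling\\band` surviving as single tokens in the output above.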


Next, to convert text to numbers, we will need to build a vocabulary of all tokens. We first build the dictionary using the `Counter` object, and then create a `Vocab` object that would help us deal with vectorization:


In [8]:
counter = collections.Counter()
for (label, line) in train_dataset:
    counter.update(tokenizer(line))
vocab = torchtext.vocab.Vocab(counter, min_freq=1)
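
The idea behind a frequency-based vocabulary can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the `torchtext` implementation: the helper `build_vocab`, the single `<unk>` special token, and the toy corpus are all hypothetical.

```python
from collections import Counter

def build_vocab(token_lists, min_freq=1, specials=("<unk>",)):
    """Return (stoi, itos): special tokens first, then tokens by falling frequency."""
    counter = Counter()
    for tokens in token_lists:
        counter.update(tokens)
    # Most frequent tokens first; ties are broken alphabetically.
    ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, freq in ordered if freq >= min_freq]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos

# Toy corpus, already tokenized.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "end"]]
stoi, itos = build_vocab(corpus, min_freq=2)

unk = stoi["<unk>"]
# Tokens that were dropped by min_freq fall back to the unknown index.
encoded = [stoi.get(tok, unk) for tok in ["the", "dog", "sat"]]
decoded = [itos[i] for i in encoded]

print(itos)
print(encoded)
print(decoded)
```

In this sketch frequent tokens receive small indices, and any token rarer than `min_freq` is encoded as the unknown token, so decoding does not always recover the original text.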

To see how each token maps to the vocabulary, we loop through the tokens in each list and look up their index numbers in `vocab`. Each word or punctuation mark is displayed with its corresponding index. For example, the word 'the' appears in both sentences, and its unique index in the vocabulary is 3.

In [9]:
word_lookup = [[vocab[w], w] for w in f_tokens]
print(f'\nIndex lookup in 1st sentence:\n{word_lookup}')

word_lookup = [[vocab[w], w] for w in s_tokens]
print(f'\nIndex lookup in 2nd sentence:\n{word_lookup}')


Index lookup in 1st sentence:
[[432, 'wall'], [426, 'st'], [2, '.'], [1606, 'bears'], [14839, 'claw'], [114, 'back'], [67, 'into'], [3, 'the'], [849, 'black'], [14, '('], [28, 'reuters'], [15, ')'], [28, 'reuters'], [16, '-'], [50726, 'short-sellers'], [4, ','], [432, 'wall'], [375, 'street'], [17, "'"], [10, 's'], [67508, 'dwindling\\band'], [7, 'of'], [52259, 'ultra-cynics'], [4, ','], [43, 'are'], [4010, 'seeing'], [784, 'green'], [326, 'again'], [2, '.']]

Index lookup in 2nd sentence:
[[15875, 'carlyle'], [1073, 'looks'], [855, 'toward'], [1311, 'commercial'], [4251, 'aerospace'], [14, '('], [28, 'reuters'], [15, ')'], [28, 'reuters'], [16, '-'], [930, 'private'], [798, 'investment'], [321, 'firm'], [15875, 'carlyle'], [99, 'group'], [4, ','], [27658, '\\which'], [29, 'has'], [6, 'a'], [4460, 'reputation'], [12, 'for'], [565, 'making'], [52791, 'well-timed'], [9, 'and'], [80618, 'occasionally\\controversial'], [2126, 'plays'], [8, 'in'], [3, 'the'], [526, 'defense'], [242, 'indus

Using the vocabulary, we can easily encode our tokenized string into a sequence of numbers. Let's use the first news article as an example:

In [9]:
vocab_size = len(vocab)
print(f"Vocab size is {vocab_size}")

def encode(x):
    return [vocab.stoi[s] for s in tokenizer(x)]

vec = encode(first_sentence)
print(vec)

Vocab size is 95812
[432, 426, 2, 1606, 14839, 114, 67, 3, 849, 14, 28, 15, 28, 16, 50726, 4, 432, 375, 17, 10, 67508, 7, 52259, 4, 43, 4010, 784, 326, 2]


In this code, the torchtext `vocab.stoi` dictionary allows us to convert from a string representation into numbers (the name *stoi* stands for "**s**tring **to** **i**nteger"). To convert a numeric representation back into text, we can use `vocab.itos`, which performs the reverse lookup from index to string:

In [10]:
def decode(x):
    return [vocab.itos[i] for i in x]

decode(vec)

['wall',
 'st',
 '.',
 'bears',
 'claw',
 'back',
 'into',
 'the',
 'black',
 '(',
 'reuters',
 ')',
 'reuters',
 '-',
 'short-sellers',
 ',',
 'wall',
 'street',
 "'",
 's',
 'dwindling\\band',
 'of',
 'ultra-cynics',
 ',',
 'are',
 'seeing',
 'green',
 'again',
 '.']

## BiGrams, TriGrams and N-Grams

One limitation of word tokenization is that some words are part of multi-word expressions. For example, the expression _'hot dog'_ has a completely different meaning than the words 'hot' and 'dog' in other contexts. If we always represent the words 'hot' and 'dog' by the same vectors, it can confuse the model.

To address this, **N-gram representations** are sometimes used in document classification, where the frequency of each word, bi-word or tri-word is a useful feature for training classifiers. 
- In **bigram** representation, for example, we will add all _word pairs_ to the vocabulary, in addition to original words. 

To get an n-gram representation, we can use the `ngrams_iterator` function, which converts a sequence of tokens into a sequence of n-grams. In the code below, we will build a bigram vocabulary from our news dataset:


In [11]:
from torchtext.data.utils import ngrams_iterator

bi_counter = collections.Counter()
for (label, line) in train_dataset:
    bi_counter.update(ngrams_iterator(tokenizer(line),ngrams=2))
bi_vocab = torchtext.vocab.Vocab(bi_counter, min_freq=2)

print(f"Bigram vocab size = {len(bi_vocab)}")

Bigram vocab size = 481971
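
To see what ends up in such a vocabulary, the following plain-Python sketch generates the same kind of sequence: the original tokens followed by all adjacent word pairs joined by a space. The helper name `ngrams` and the example sentence are made up for illustration, and the sketch only assumes that `ngrams_iterator` behaves in roughly this way.

```python
def ngrams(tokens, n):
    """Yield every 1-gram up to n-gram of `tokens`, each joined with a space."""
    for size in range(1, n + 1):
        for i in range(len(tokens) - size + 1):
            yield " ".join(tokens[i:i + size])

tokens = ["i", "ate", "a", "hot", "dog"]
bigrams = list(ngrams(tokens, 2))
print(bigrams)
```

The expression `hot dog` now appears as its own vocabulary entry alongside `hot` and `dog`, at the cost of roughly doubling the number of items produced per sentence.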


In [12]:
def encode(x):
    # Looks up single tokens only; tokens below min_freq map to index 0 (the unknown token)
    return [bi_vocab.stoi[s] for s in tokenizer(x)]

encode(first_sentence)

[572,
 564,
 2,
 2326,
 49106,
 150,
 88,
 3,
 1143,
 14,
 32,
 15,
 32,
 16,
 443749,
 4,
 572,
 499,
 17,
 10,
 0,
 7,
 468770,
 4,
 52,
 7019,
 1050,
 442,
 2]

The main drawback of the N-gram approach is that the vocabulary size grows extremely fast. Here we pass the `min_freq` argument to the `Vocab` constructor to drop tokens that appear in the text only once. We can also increase `min_freq` further, because infrequent words and phrases usually have little effect on classification accuracy.

> **Note:** Try setting the `min_freq` parameter to a higher value, and observe how the size of the vocabulary changes.

In practice, n-gram vocabulary size is still too high to represent words as one-hot vectors, and thus we need to combine this representation with some dimensionality reduction techniques, such as *embeddings*, which we will discuss in a later unit.


## Feeding text into a neural network

We have learned how to represent each word by a number. Now, to build a text classification model, we need to feed the whole sentence (or whole news article) into a neural network. The problem is that articles and sentences vary in length, whereas fully-connected and convolutional layers in their basic form expect input of a fixed size. There are two ways we can handle this problem:

* Find a way to collapse a sentence into a fixed-length vector. In the next unit we will see how **Bag-of-Words** and **TF-IDF** representations help us to do that.
* Design special neural network architectures that can deal with variable-length sequences. We'll learn how recurrent neural networks (RNNs) for sequence modeling are implemented later in this module.
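
As a preview of the first option, the sketch below collapses token-index sequences of different lengths into count vectors of one fixed length. It is a minimal illustration using plain lists; the helper `bag_of_words`, the vocabulary size of 5, and the two index sequences are assumptions made for this example.

```python
def bag_of_words(token_ids, vocab_size):
    """Count how often each vocabulary index occurs in `token_ids`."""
    vec = [0] * vocab_size
    for idx in token_ids:
        vec[idx] += 1
    return vec

# Two hypothetical encoded sentences of different lengths.
short_sentence = [1, 2]
long_sentence = [1, 3, 3, 2, 1, 1]

short_vec = bag_of_words(short_sentence, vocab_size=5)
long_vec = bag_of_words(long_sentence, vocab_size=5)

print(short_vec)
print(long_vec)
```

Both results have length 5 regardless of sentence length, so they can be fed to a fixed-size input layer. The trade-off is that word order is discarded.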
