# Bag-of-Words and TF-IDF representations

In the previous unit, we learned how to represent text as numbers. In this unit, we'll explore some approaches for feeding variable-length text into a neural network by collapsing the input sequence into a fixed-length vector, which can then be passed to a classifier.

To begin with, let's load our AG News dataset and build the vocabulary, as we did in the previous unit. To keep things short, all those operations are combined into the `load_dataset` function of the accompanying Python module:

In [2]:
!pip install -r https://raw.githubusercontent.com/MicrosoftDocs/pytorchfundamentals/main/nlp-pytorch/requirements.txt
!wget -q https://raw.githubusercontent.com/MicrosoftDocs/pytorchfundamentals/main/nlp-pytorch/torchnlp.py

Collecting torch==1.8.1
  Using cached torch-1.8.1-cp38-cp38-manylinux1_x86_64.whl (804.1 MB)
ERROR: azureml-automl-dnn-nlp 1.45.0 has requirement transformers==4.6.0, but you'll have transformers 4.3.3 which is incompatible.
Installing collected packages: torch
  Attempting uninstall: torch
    Found existing installation: torch 1.12.0
    Uninstalling torch-1.12.0:
      Successfully uninstalled torch-1.12.0
Successfully installed torch-1.8.1


In [3]:
import torch
import torchtext
import os
import collections
from torchnlp import *
train_dataset, test_dataset, classes, vocab = load_dataset()
vocab_size = len(vocab)
print("Vocab size = ",vocab_size)


Loading dataset...
Building vocab...
Vocab size =  95812


## Bag of Words text representation

Because words represent meaning, sometimes we can figure out the **_meaning of a text_** by just looking at the individual words, regardless of their order in the sentence. For example, when classifying news, words like *weather*, *snow* are likely to indicate *weather forecast*, while words like *stocks*, *dollar* would count towards *financial news*.

The **Bag of Words** (BoW) vector representation is the most commonly used traditional vector representation. Each word is linked to a vector index, and the vector element at that index contains the number of occurrences of the word in a given document.

<img alt="Image showing how a bag of words vector representation is represented in memory." src="images/3-bow-tfidf-1.png" align="middle" />

> **Note**: You can also think of BoW as a sum of all one-hot-encoded vectors for individual words in the text.

Below is an example of how to generate a bag-of-words representation for a text, using the vectorization defined previously:

In [4]:
def to_bow(text,bow_vocab_size=vocab_size):
    res = torch.zeros(bow_vocab_size,dtype=torch.float32)
    for i in encode(text):
        if i<bow_vocab_size:
            res[i] += 1
    return res

print(f"sample text:\n{train_dataset[0][1]}")
print(f"\nBoW vector:\n{to_bow(train_dataset[0][1])}")

sample text:
Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.

BoW vector:
tensor([0., 0., 2.,  ..., 0., 0., 0.])
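
To make the earlier note about one-hot vectors concrete, here is a minimal sketch on a toy vocabulary. The four-word `toy_vocab` and the helper names are made up for illustration and are not part of `torchnlp`; plain Python lists stand in for tensors. It checks that summing one-hot vectors gives the same result as counting words directly.

```python
# Toy illustration: a BoW vector equals the sum of one-hot vectors.
# `toy_vocab`, `one_hot` and `bow_by_counting` are hypothetical names.
toy_vocab = {"snow": 0, "stocks": 1, "falls": 2, "rise": 3}

def one_hot(index, size):
    """Return a list of zeros with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec

def bow_by_counting(tokens, vocab):
    """Count how many times each vocabulary word occurs in `tokens`."""
    vec = [0] * len(vocab)
    for tok in tokens:
        vec[vocab[tok]] += 1
    return vec

tokens = "snow falls snow".split()

# Sum the one-hot vectors position by position.
one_hots = [one_hot(toy_vocab[t], len(toy_vocab)) for t in tokens]
bow_by_summing = [sum(column) for column in zip(*one_hots)]

print(bow_by_summing)                      # [2, 0, 1, 0]
print(bow_by_counting(tokens, toy_vocab))  # [2, 0, 1, 0]
```

Both approaches lose the word order in exactly the same way: "snow falls snow" and "snow snow falls" map to the same vector.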



> **Note:** Here we use the global `vocab_size` variable as the default size of the vocabulary. Since vocabularies are often very large, we can limit the vocabulary to the most frequent words. Try lowering the `vocab_size` value and running the code below to see how it affects accuracy. You should expect some drop in accuracy, though not a dramatic one, in exchange for higher performance.
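
The effect of limiting the vocabulary can be sketched on a toy corpus. The example below assumes that word ids are assigned in order of decreasing frequency, so that dropping ids above a limit keeps only the most frequent words; the corpus and the names `toy_vocab` and `to_bow_limited` are hypothetical and independent of the notebook's `vocab` and `to_bow`.

```python
from collections import Counter

# Toy corpus; in the notebook the vocabulary comes from AG News instead.
corpus = ["the stocks rise", "the snow falls", "the stocks fall", "the dollar"]

# Assign ids by decreasing frequency, so id 0 is the most frequent word.
counts = Counter(word for doc in corpus for word in doc.split())
toy_vocab = {word: idx for idx, (word, _) in enumerate(counts.most_common())}

def to_bow_limited(text, vocab, limit):
    """BoW vector of length `limit`; words with id >= limit are ignored."""
    vec = [0] * limit
    for word in text.split():
        idx = vocab[word]
        if idx < limit:
            vec[idx] += 1
    return vec

full = to_bow_limited("the stocks rise", toy_vocab, len(toy_vocab))
small = to_bow_limited("the stocks rise", toy_vocab, 2)
print(len(full), len(small))  # 7 2
print(small)                  # [1, 1]
```

The shorter vector is cheaper to store and multiply, but the rare word "rise" no longer contributes anything, which is where the accuracy drop comes from.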


## Training BoW classifier

Now that we have learned how to build a Bag-of-Words representation of our text, let's train a classifier on top of it. First, we need to convert our dataset for training so that all positional vector representations are converted to bag-of-words representations. This can be achieved by passing the `bowify` function as the `collate_fn` parameter to the standard torch `DataLoader`. The `collate_fn` parameter lets you apply your own function to each batch of samples as it is loaded by the `DataLoader`:

In [5]:
from torch.utils.data import DataLoader
import numpy as np 

# this collate function gets list of batch_size tuples, and needs to 
# return a pair of label-feature tensors for the whole minibatch
def bowify(b):
    return (
            torch.LongTensor([t[0]-1 for t in b]),
            torch.stack([to_bow(t[1]) for t in b])
    )

train_loader = DataLoader(train_dataset, batch_size=16, collate_fn=bowify, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16, collate_fn=bowify, shuffle=True)

Now let's define a simple classifier neural network that contains one linear layer. The size of the input vector equals `vocab_size`, and the output size corresponds to the number of classes (4). Because we are solving a classification task, the final activation function is `LogSoftmax()`.
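
To see what the log-softmax activation contributes, the following sketch recomputes it in plain Python for a single document. The `logits` values are made up, and `log_softmax` here is a hand-written helper for illustration, not the PyTorch implementation. The negative log-probability of the true class is the quantity that a negative log-likelihood loss such as `NLLLoss` averages over a minibatch.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

# Hypothetical raw outputs of the linear layer for one document (4 classes).
logits = [2.0, 0.5, -1.0, 0.0]
log_probs = log_softmax(logits)

# Exponentiating the log-probabilities gives values that sum to 1.
probs = [math.exp(lp) for lp in log_probs]
print(sum(probs))

# Negative log-likelihood when the true class is 0: it shrinks towards 0
# as the model puts more probability on the correct class.
true_class = 0
nll = -log_probs[true_class]
print(nll)
```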

In [10]:
net = torch.nn.Sequential(torch.nn.Linear(vocab_size,4),torch.nn.LogSoftmax(dim=1))

Now we will define a standard PyTorch training loop. Because our dataset is quite large, for teaching purposes we will train for only one epoch, and sometimes for less than an epoch (the `epoch_size` parameter lets us limit the number of samples used). We will also report the accumulated training accuracy during training; the reporting frequency, in minibatches, is set by the `report_freq` parameter.

In [8]:
def train_epoch(net,dataloader,lr=0.01,optimizer=None,loss_fn = torch.nn.NLLLoss(),epoch_size=None, report_freq=200):
    optimizer = optimizer or torch.optim.Adam(net.parameters(),lr=lr)
    net.train()
    total_loss,acc,count,i = 0,0,0,0
    for labels,features in dataloader:
        optimizer.zero_grad()
        out = net(features)
        loss = loss_fn(out,labels) # NLLLoss expects the log-probabilities produced by LogSoftmax
        loss.backward()
        optimizer.step()
        total_loss+=loss.item() # .item() gives a plain float, so no autograd history is accumulated
        _,predicted = torch.max(out,1)
        acc+=(predicted==labels).sum().item()
        count+=len(labels)
        i+=1
        if i%report_freq==0:
            print(f"{count}: acc={acc/count}")
        if epoch_size and count>epoch_size:
            break
    return total_loss/count, acc/count

Let's see how the classifier performs on the training dataset.

In [11]:
train_epoch(net,train_loader,epoch_size=15000)

3200: acc=0.784375
6400: acc=0.83421875
9600: acc=0.8473958333333333
12800: acc=0.860546875


(0.025977031000133263, 0.8640724946695096)

> The Bag-of-Words approach can be used in the same manner with an n-gram tokenizer. The vocabulary would be much bigger, however, and the network would therefore have too many parameters. In the next unit, we will see how bigram representation can be used together with embeddings.
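
A rough sense of the parameter growth can be had with a little arithmetic. The sketch below counts the weights and biases of a single linear layer and compares a unigram vocabulary with a combined unigram and bigram vocabulary built from one sentence. The helper names `linear_params` and `ngrams` are hypothetical; on a real corpus the bigram vocabulary would typically be far larger than the unigram one.

```python
def linear_params(n_inputs, n_classes):
    """Weights plus biases of a single fully connected layer."""
    return n_inputs * n_classes + n_classes

def ngrams(tokens, n):
    """All contiguous n-token windows of `tokens`."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Vocabulary size reported earlier in this unit, with 4 news classes.
print(linear_params(95812, 4))  # 383252

tokens = "wall street bears claw back into the black".split()
unigram_vocab = set(ngrams(tokens, 1))
bigram_vocab = set(ngrams(tokens, 2))
combined_size = len(unigram_vocab) + len(bigram_vocab)

print(len(unigram_vocab), len(bigram_vocab), combined_size)  # 8 7 15
print(linear_params(len(unigram_vocab), 4), linear_params(combined_size, 4))  # 36 64
```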


## Term Frequency / Inverse Document Frequency:  TF-IDF

In the BoW representation, word occurrences are evenly weighted, regardless of the word itself. However, frequent words such as *'a'*, *'in'*, and *'the'* are clearly much less important for classification than specialized terms. In fact, in most NLP tasks some words are more relevant than others.

**TF-IDF** stands for **term frequency–inverse document frequency**. It is a variation of bag of words in which, instead of a raw count (or a binary 0/1 value) indicating the appearance of a word in a document, a floating-point value is used, which is related to the **_frequency of word occurrence_** in the corpus.

The formula to calculate TF-IDF is:  $w_{ij} = tf_{ij}\times\log({N\over df_i})$

Here's the meaning of each parameter in the formula:
* $i$ is the word 
* $j$ is the document
* $w_{ij}$ is the weight or the importance of the word in the document
* $tf_{ij}$ is the number of occurrences of the word $i$ in the document $j$, i.e. the BoW value we have seen before
* $N$ is the number of documents in the collection
* $df_i$ is the number of documents containing the word $i$ in the whole collection
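
As a worked example of the formula, the sketch below computes $w_{ij}$ for a few words in a toy collection of three documents. The documents and the helper names (`toy_docs`, `term_freq`, `doc_freq`, `toy_tf_idf`) are made up for illustration and do not come from the notebook.

```python
import math

# Toy collection of N = 3 tokenized documents (made up for illustration).
toy_docs = [
    "the stocks rise".split(),
    "the snow falls".split(),
    "the stocks fall the dollar".split(),
]
n_docs = len(toy_docs)

def term_freq(word, doc):
    """tf_ij: number of occurrences of `word` in `doc`."""
    return doc.count(word)

def doc_freq(word):
    """df_i: number of documents that contain `word`."""
    return sum(1 for doc in toy_docs if word in doc)

def toy_tf_idf(word, doc):
    """w_ij = tf_ij * log(N / df_i); assumes df_i > 0."""
    return term_freq(word, doc) * math.log(n_docs / doc_freq(word))

last = toy_docs[2]
print(toy_tf_idf("the", last))     # 0.0, since "the" occurs in every document
print(toy_tf_idf("stocks", last))  # log(3/2)
print(toy_tf_idf("dollar", last))  # log(3)
```

Although "the" occurs twice in the last document, its weight is zero, while "dollar", which appears in only one document, gets the largest weight.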


<img alt="Diagram showing a table representing word frequency in documents." src="images/3-bow-tfidf-2.png" align="middle" />

The TF-IDF value $w_{ij}$ increases proportionally to the number of times a word appears in a document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently than others. For example, if a word appears in *every* document in the collection, then $df_i=N$, so $\log({N\over df_i})=\log 1=0$ and $w_{ij}=0$; such terms are disregarded completely.

First, let's compute the document frequency $df_i$ for each word $i$. We can represent it as a tensor of size `vocab_size`. We will limit the number of documents to $N=1000$ to speed up processing. For each input sentence, we compute the set of words (represented by their numbers) and increment the corresponding counters:

In [12]:
N = 1000
df = torch.zeros(vocab_size)
for _,line in train_dataset[:N]:
    for i in set(encode(line)):
        df[i] += 1
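
The role of `set(...)` in the loop above may be easier to see on toy data. The sketch below uses plain Python counters instead of tensors; `toy_docs`, `toy_doc_freq` and `toy_term_freq` are hypothetical names, kept distinct from the notebook's `df`.

```python
from collections import Counter

toy_docs = [
    "the stocks rise and the dollar falls".split(),
    "the snow falls".split(),
]

toy_doc_freq = Counter()
for doc in toy_docs:
    # set() keeps each word once per document, so a word repeated inside
    # one document still adds only 1 to its document frequency.
    toy_doc_freq.update(set(doc))

# For comparison: total occurrences across the whole collection.
toy_term_freq = Counter(word for doc in toy_docs for word in doc)

print(toy_doc_freq["the"], toy_term_freq["the"])  # 2 3
```

Without the `set`, the counter would hold total occurrences rather than the number of documents containing each word, and could exceed $N$.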

Now that we have document frequencies for each word, we can define a `tf_idf` function that takes a string and produces a TF-IDF vector. We will use `to_bow`, defined above, to calculate the term frequency vector, and multiply it by the inverse document frequency of the corresponding term. Remember that arithmetic operations on tensors, such as `*` and `/`, are applied element-wise, which allows us to implement the whole computation as a single tensor formula:

In [13]:
def tf_idf(s):
    bow = to_bow(s)
    return bow*torch.log((N+1)/(df+1))

print(tf_idf(train_dataset[0][1]))

tensor([0.0000, 0.0000, 0.0363,  ..., 0.0000, 0.0000, 0.0000])


> You may have noticed that we used a slightly different formula for TF-IDF, namely $\log({N+1\over df_i+1})$ instead of $\log({N\over df_i})$. This yields similar results, but prevents division by 0 in those cases when $df_i=0$.
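
The difference between the two variants can be checked numerically. This is a small sketch in plain Python with scalar document frequencies; `idf_plain` and `idf_smoothed` are hypothetical helper names, and `n_docs` mirrors the $N=1000$ used above.

```python
import math

n_docs = 1000  # same N as in the notebook

def idf_plain(doc_freq):
    """log(N / df); undefined when df == 0."""
    return math.log(n_docs / doc_freq)

def idf_smoothed(doc_freq):
    """log((N + 1) / (df + 1)); defined for every df >= 0."""
    return math.log((n_docs + 1) / (doc_freq + 1))

# For words that do occur, the two variants stay close to each other.
for d in (1, 10, 1000):
    print(d, round(idf_plain(d), 4), round(idf_smoothed(d), 4))

# A word never seen in the N documents has df == 0.
try:
    idf_plain(0)
except ZeroDivisionError:
    print("plain formula fails for df = 0")
print(idf_smoothed(0))  # log(1001)
```

In plain Python the unsmoothed formula raises an exception for $df_i=0$. With floating-point tensors the division would instead produce `inf`, and multiplying a zero count by it gives `nan`, which is just as unwelcome in a feature vector.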

Even though the TF-IDF representation assigns different weights to different words according to their importance, it is unable to correctly capture the meaning, largely because the order of words in the sentence is still not taken into account. As the famous linguist J. R. Firth said in 1935, “The complete meaning of a word is always contextual, and no study of meaning apart from context can be taken seriously”. We will learn in later units how to capture contextual information from text using language modeling.

