# Representing text
### If we want to solve Natural Language Processing (NLP) tasks with neural networks, we need some way to represent text as tensors. Computers already represent textual characters as numbers that map to fonts on your screen using encodings such as ASCII or UTF-8.



### We understand what each letter represents, and how all characters come together to form the words of a sentence. However, computers by themselves do not have such an understanding, and a neural network has to learn the meaning during training.

### Therefore, we can use different approaches when representing text:

### Character-level representation: we treat each character as a number. If our text corpus contains C different characters, the word Hello is represented by a 5 × C tensor, in which each letter corresponds to one row in one-hot encoding.

### Word-level representation: we create a vocabulary of all words in our text, and then represent each word using one-hot encoding. This approach is somewhat better, because a single letter does not carry much meaning, so by using higher-level semantic concepts (words) we simplify the task for the neural network. However, a large dictionary forces us to deal with high-dimensional sparse tensors. For example, with a vocabulary of 10,000 different words, each word's one-hot encoding has length 10,000.

### To unify those approaches, we typically call an atomic piece of text a token. In some cases tokens are letters, in other cases they are words or parts of words.
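
### As a rough illustration of the character-level representation described above, here is a minimal pure-Python sketch. The toy corpus "hello world" and the helper name one_hot_encode are made up for this example; a real pipeline would produce a tensor rather than nested lists.

```python
def one_hot_encode(text, alphabet):
    """Return one row per character; each row has a single 1 at that character's index."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    rows = []
    for ch in text:
        row = [0] * len(alphabet)
        row[index[ch]] = 1
        rows.append(row)
    return rows

# Toy corpus (an assumption for illustration): it contains C = 8 distinct characters
alphabet = sorted(set("hello world"))
encoded = one_hot_encode("hello", alphabet)

# 5 characters, each encoded as a row of length C
print(len(encoded), len(encoded[0]))
```

### Each row contains exactly one 1, and the two rows for the letter l are identical.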

### For example, we can choose to tokenize indivisible as in-#divis-#ible, where the # sign indicates that the token is a continuation of the previous word. This would allow the root divis to always be represented by one token, corresponding to one core meaning.

### The process of converting text into a sequence of tokens is called tokenization. Next, we need to assign a number to each token, which we can feed into a neural network. This is called vectorization, and is normally done by building a token vocabulary.
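
### To make the idea of word pieces more concrete, here is a small sketch of greedy longest-match splitting. The piece vocabulary, the # prefix convention and the function name subword_tokenize are assumptions made for illustration; real subword tokenizers learn their vocabulary from data and differ in details.

```python
def subword_tokenize(word, pieces):
    """Greedily split `word` into the longest known pieces, left to right.

    Pieces that continue a word are stored with a leading '#'.
    Returns None if the word cannot be covered by the known pieces.
    """
    tokens, start = [], 0
    while start < len(word):
        prefix = "#" if start > 0 else ""
        # Try the longest candidate first, then shrink it one character at a time
        for end in range(len(word), start, -1):
            candidate = prefix + word[start:end]
            if candidate in pieces:
                tokens.append(candidate)
                start = end
                break
        else:
            return None  # no known piece matches at this position
    return tokens

# Hypothetical subword vocabulary
pieces = {"in", "#divis", "#ible"}
print(subword_tokenize("indivisible", pieces))  # ['in', '#divis', '#ible']
```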

### Let's start by installing some required Python packages we'll use in this module.

In [None]:
#!pip install -r https://raw.githubusercontent.com/MicrosoftDocs/pytorchfundamentals/main/nlp-pytorch/requirements.txt

In [None]:
#!pip install opencv-python

# Text classification task
### In this module, we will start with a simple text classification task based on the AG_NEWS sample dataset: classifying news headlines into one of 4 categories: World, Sports, Business and Sci/Tech. This dataset is available through PyTorch's torchtext module, so we can easily access it.

In [2]:
import torch
torch.__version__

'1.11.0+cu102'

In [9]:
!pip install torchtext

Collecting torchtext
  Downloading torchtext-0.12.0-cp310-cp310-manylinux1_x86_64.whl (10.4 MB)
Collecting tqdm
  Downloading tqdm-4.64.0-py2.py3-none-any.whl (78 kB)
Installing collected packages: tqdm, torchtext
Successfully installed torchtext-0.12.0 tqdm-4.64.0


In [11]:
!pip install torchdata

Collecting torchdata
  Downloading torchdata-0.3.0-py3-none-any.whl (47 kB)
Installing collected packages: torchdata
Successfully installed torchdata-0.3.0


In [2]:
import torch
import torchtext
import os
import collections
os.makedirs('./data',exist_ok=True)
train_dataset, test_dataset = torchtext.datasets.AG_NEWS(root='./data')
classes = ['World', 'Sports', 'Business', 'Sci/Tech']

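### Here, train_dataset and test_dataset are iterable collections of pairs: a label (the class number) and the text. In recent torchtext versions they are datapipes rather than iterators, so calling next() on them directly raises a TypeError. To look at the first pair, we wrap the dataset in iter() first:

In [3]:
next(iter(train_dataset))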

### The class labels in this dataset are numbered from 1 to 4, so we subtract 1 to index into classes. Let's print out the first 5 news headlines from our dataset:

In [4]:
for i, x in zip(range(5), train_dataset):
    # Labels start at 1, list indices start at 0
    print(f"**{classes[x[0] - 1]}** -> {x[1]}\n")

**Business** -> Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.

**Business** -> Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\which has a reputation for making well-timed and occasionally\controversial plays in the defense industry, has quietly placed\its bets on another part of the market.

**Business** -> Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring crude prices plus worries\about the economy and the outlook for earnings are expected to\hang over the stock market next week during the depth of the\summer doldrums.

**Business** -> Iraq Halts Oil Exports from Main Southern Pipeline (Reuters) Reuters - Authorities have halted oil export\flows from the main pipeline in southern Iraq after\intelligence showed a rebel militia could strike\infrastructure, an oil official said on Saturday.

**Business** -> Oil prices soa

### Because the datasets are iterable objects that do not support indexing, we convert them to lists so that we can access individual items and go over the data multiple times:

In [6]:
train_dataset, test_dataset = torchtext.datasets.AG_NEWS(root='./data')
train_dataset = list(train_dataset)
test_dataset = list(test_dataset)

## Tokenization and Vectorization
### Now we need to convert text into numbers that can be represented as tensors to feed them into a neural network. The first step is to convert text to tokens, which is called tokenization. If we use word-level representation, each word is represented by its own token. We will use the built-in tokenizer from the torchtext module:

In [7]:
tokenizer = torchtext.data.utils.get_tokenizer('basic_english')

### We'll use this tokenizer to split the first 2 news articles into tokens. The basic_english option applies simple normalization for English text: it lowercases the input and separates punctuation marks from words. The result is a list of strings, one per word or punctuation mark.



In [8]:
first_sentence = train_dataset[0][1]
second_sentence = train_dataset[1][1]

f_tokens = tokenizer(first_sentence)
s_tokens = tokenizer(second_sentence)

print(f'\nfirst token list:\n{f_tokens}')
print(f'\nsecond token list:\n{s_tokens}')


first token list:
['wall', 'st', '.', 'bears', 'claw', 'back', 'into', 'the', 'black', '(', 'reuters', ')', 'reuters', '-', 'short-sellers', ',', 'wall', 'street', "'", 's', 'dwindling\\band', 'of', 'ultra-cynics', ',', 'are', 'seeing', 'green', 'again', '.']

second token list:
['carlyle', 'looks', 'toward', 'commercial', 'aerospace', '(', 'reuters', ')', 'reuters', '-', 'private', 'investment', 'firm', 'carlyle', 'group', ',', '\\which', 'has', 'a', 'reputation', 'for', 'making', 'well-timed', 'and', 'occasionally\\controversial', 'plays', 'in', 'the', 'defense', 'industry', ',', 'has', 'quietly', 'placed\\its', 'bets', 'on', 'another', 'part', 'of', 'the', 'market', '.']
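
### To get a feel for what such a tokenizer does, here is a rough pure-Python approximation. It is not torchtext's implementation: the single regular expression below was chosen only to reproduce the kind of splits visible in the token lists above, and the name toy_tokenize is hypothetical.

```python
import re

def toy_tokenize(text):
    """Lowercase the text and split punctuation marks into separate tokens."""
    text = text.lower()
    # Surround selected punctuation marks with spaces so that split() isolates them
    text = re.sub(r"([.,()'!?])", r" \1 ", text)
    return text.split()

print(toy_tokenize("Wall St. Bears Claw Back Into the Black (Reuters)"))
# ['wall', 'st', '.', 'bears', 'claw', 'back', 'into', 'the', 'black', '(', 'reuters', ')']
```

### Note that hyphenated words such as short-sellers stay in one piece, because the hyphen is not in the list of separated characters.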


### Next, to convert text to numbers, we need to build a vocabulary of all tokens. We first count the tokens using a Counter object, and then create a Vocab object that will help us deal with vectorization. Note that we call the torchtext.vocab.vocab factory function here: in recent torchtext versions, passing a Counter directly to the Vocab constructor does not build an index, and lookups then return token counts instead of indices.

In [10]:
counter = collections.Counter()
for (label, line) in train_dataset:
    counter.update(tokenizer(line))
# The lowercase vocab() factory assigns an index to every token that occurs at least min_freq times
vocab = torchtext.vocab.vocab(counter, min_freq=1)

### To see how each word maps to the vocabulary, we'll loop through each token in the list and look up its index in vocab. Each word or punctuation mark is displayed with its corresponding index. A token that appears several times, such as 'reuters', maps to the same unique index every time.

In [11]:
word_lookup = [[vocab[w], w] for w in f_tokens]
print(f'\nIndex lookup in 1st sentence:\n{word_lookup}')

word_lookup = [[vocab[w], w] for w in s_tokens]
print(f'\nIndex lookup in 2nd sentence:\n{word_lookup}')
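
### The difference between a token's frequency and its index is easy to mix up, so here is a small pure-Python sketch of the idea. The toy sentences and the names toy_counter and toy_index are made up for illustration, and numbering tokens in order of first appearance is just one possible choice.

```python
from collections import Counter

toy_token_lists = [
    ["wall", "st", ".", "bears", "claw", "back"],
    ["wall", "street", "bears", "are", "back", "."],
]

# Step 1: count how often each token occurs
toy_counter = Counter()
for tokens in toy_token_lists:
    toy_counter.update(tokens)

# Step 2: give every distinct token an index, in order of first appearance
toy_index = {token: i for i, token in enumerate(toy_counter)}

print(toy_counter["wall"])  # frequency of 'wall': 2
print(toy_index["wall"])    # index of 'wall': 0
print([toy_index[t] for t in toy_token_lists[1]])  # [0, 6, 3, 7, 5, 2]
```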

### Using the vocabulary, we can easily encode our tokenized string into a sequence of numbers. Let's use the first news article as an example:

In [34]:
vocab_size = len(vocab)
print(f"Vocab size of {vocab_size}")

def encode(x):
    # Tokenize the text, then look up the index of each token
    return [vocab[s] for s in tokenizer(x)]

vec = encode(first_sentence)
print(vec)

Vocab size of 95810


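### In this code, indexing vocab with a token converts a string into its number. In recent torchtext versions the full string-to-integer dictionary is returned by the vocab.get_stoi() method, which replaces the older vocab.stoi attribute (the name stoi stands for "string to integer"). To convert a numeric representation back into text, we can use vocab.lookup_tokens(), or the list returned by vocab.get_itos(), to perform the reverse lookup. This assumes that vocab was built with the torchtext.vocab.vocab factory function:

In [23]:
def decode(x):
    # Map each index back to its token
    return vocab.lookup_tokens(x)

print(decode(vec))

['wall', 'st', '.', 'bears', 'claw', 'back', 'into', 'the', 'black', '(', 'reuters', ')', 'reuters', '-', 'short-sellers', ',', 'wall', 'street', "'", 's', 'dwindling\\band', 'of', 'ultra-cynics', ',', 'are', 'seeing', 'green', 'again', '.']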
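
### The round trip between tokens and numbers can be sketched without torchtext as two lookups in opposite directions. The names toy_itos, toy_stoi, toy_encode and toy_decode are hypothetical and only mirror the idea of a string-to-integer and an integer-to-string mapping.

```python
toy_tokens = ["wall", "st", ".", "bears", "claw", "back"]

# Integer-to-string: a list, where the position is the index
toy_itos = sorted(set(toy_tokens))
# String-to-integer: the inverse mapping
toy_stoi = {token: i for i, token in enumerate(toy_itos)}

def toy_encode(tokens):
    return [toy_stoi[t] for t in tokens]

def toy_decode(ids):
    return [toy_itos[i] for i in ids]

ids = toy_encode(["bears", "claw", "back", "."])
print(ids)              # [2, 3, 1, 0]
print(toy_decode(ids))  # ['bears', 'claw', 'back', '.']
```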

## BiGrams, TriGrams and N-Grams
### One limitation of word tokenization is that some words are part of multi-word expressions. For example, the expression 'hot dog' has a completely different meaning than the words 'hot' and 'dog' in other contexts. If we always represent the words 'hot' and 'dog' by the same vectors, it can confuse the model.

### To address this, N-gram representations are sometimes used in document classification, where the frequency of each word, word pair (bigram) or word triple (trigram) is a useful feature for training classifiers.

### In a bigram representation, for example, we add all word pairs to the vocabulary, in addition to the original words.
### To get an n-gram representation, we can use the ngrams_iterator function, which converts a sequence of tokens into a sequence of n-grams. In the code below, we will build a bigram vocabulary from our news dataset:

In [36]:
from torchtext.data.utils import ngrams_iterator

bi_counter = collections.Counter()
for (label, line) in train_dataset:
    bi_counter.update(ngrams_iterator(tokenizer(line),ngrams=2))
# Use the vocab() factory so that every unigram and bigram gets its own index
bi_vocab = torchtext.vocab.vocab(bi_counter, min_freq=1)

print(f"Bigram vocab size = {len(bi_vocab)}")

Bigram vocab size = 1308842
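
### To see what kind of items end up in such a vocabulary, here is a pure-Python sketch of n-gram generation. It follows the behaviour described above (single tokens first, then word pairs joined by a space), but the function toy_ngrams is a hypothetical stand-in and not the torchtext implementation.

```python
def toy_ngrams(tokens, n):
    """Yield every single token, then every run of 2..n consecutive tokens."""
    for token in tokens:
        yield token
    for size in range(2, n + 1):
        for start in range(len(tokens) - size + 1):
            yield " ".join(tokens[start:start + size])

print(list(toy_ngrams(["hot", "dog", "stand"], 2)))
# ['hot', 'dog', 'stand', 'hot dog', 'dog stand']
```

### Here 'hot dog' becomes a vocabulary entry of its own, separate from 'hot' and 'dog'.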


In [38]:
def encode(x):
    # Look up unigrams and bigrams, matching how bi_vocab was built
    return [bi_vocab[s] for s in ngrams_iterator(tokenizer(x), ngrams=2)]

encode(first_sentence)

### The main drawback of the N-gram approach is that the vocabulary size starts to grow extremely fast. To limit it, we can pass the min_freq parameter when building the vocabulary to skip rare tokens: for example, min_freq=2 drops tokens that appear in the text only once. In recent torchtext versions this parameter belongs to the torchtext.vocab.vocab factory function; the Vocab constructor does not accept it. We can also increase min_freq even further, because infrequent words and phrases usually have little effect on the accuracy of classification.

### Note: Try setting the min_freq parameter to a higher value, and observe how the length of the vocabulary changes.
### In practice, the n-gram vocabulary is still too large to represent words as one-hot vectors, and thus we need to combine this representation with some dimensionality reduction techniques, such as embeddings, which we will discuss in a later unit.
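
### The effect of a minimum frequency threshold on vocabulary size can be tried out on a toy corpus without torchtext. The three sentences and the helper names with_bigrams and toy_vocab_size are made up for illustration; the numbers say nothing about the news dataset itself.

```python
from collections import Counter

toy_corpus = [
    "the hot dog stand",
    "the hot dog",
    "the dog is hot",
]

def with_bigrams(tokens):
    """Return the tokens followed by all pairs of neighbouring tokens."""
    return tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]

toy_bi_counter = Counter()
for line in toy_corpus:
    toy_bi_counter.update(with_bigrams(line.split()))

def toy_vocab_size(counter, min_freq):
    """Count the entries that occur at least min_freq times."""
    return sum(1 for freq in counter.values() if freq >= min_freq)

for min_freq in (1, 2, 3):
    print(min_freq, toy_vocab_size(toy_bi_counter, min_freq))
# 1 11
# 2 5
# 3 3
```

### Even in this tiny corpus, more than half of the entries occur only once and disappear as soon as min_freq is raised to 2.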