# Text classification task

As mentioned earlier, we will focus on a simple text classification task based on the **AG_NEWS** dataset, which is to classify news headlines into one of 4 categories: World, Sports, Business and Sci/Tech.

## The Dataset

This dataset is built into the [`torchtext`](https://github.com/pytorch/text) module, so we can access it easily.


In [1]:
import torch
import torchtext
import os
import collections
os.makedirs('./data',exist_ok=True)
train_dataset, test_dataset = torchtext.datasets.AG_NEWS(root='./data')
classes = ['World', 'Sports', 'Business', 'Sci/Tech']

Here, `train_dataset` and `test_dataset` contain collections that return pairs of label (class number, counted from 1 to 4) and text, for example:


In [2]:
list(train_dataset)[0]

(3,
 "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.")

Let's print the first 5 news headlines from our dataset:


In [5]:
for i,x in zip(range(5),train_dataset):
    # labels are 1-based, so subtract 1 to index into `classes`
    print(f"**{classes[x[0]-1]}** -> {x[1]}")


**Business** -> Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.
**Business** -> Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\which has a reputation for making well-timed and occasionally\controversial plays in the defense industry, has quietly placed\its bets on another part of the market.
**Business** -> Oil and Economy Cloud Stocks' Outlook (Reuters) Reuters - Soaring crude prices plus worries\about the economy and the outlook for earnings are expected to\hang over the stock market next week during the depth of the\summer doldrums.
**Business** -> Iraq Halts Oil Exports from Main Southern Pipeline (Reuters) Reuters - Authorities have halted oil export\flows from the main pipeline in southern Iraq after\intelligence showed a rebel militia could strike\infrastructure, an oil official said on Saturday.
**Business** -> Oil prices soar to

Because datasets are iterators, if we want to use the data multiple times we need to convert them to lists:


In [3]:
train_dataset, test_dataset = torchtext.datasets.AG_NEWS(root='./data')
train_dataset = list(train_dataset)
test_dataset = list(test_dataset)
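
The reason for the conversion can be illustrated with a plain Python generator. This is only a sketch: `make_dataset` and its two rows are hypothetical stand-ins, and how a real `torchtext` dataset object behaves on a second pass may depend on the library version.

```python
# A generator stands in for an iterator-style dataset (hypothetical data).
def make_dataset():
    yield (3, "Stocks rally on earnings")
    yield (2, "Home team wins the final")

stream = make_dataset()
first_pass = list(stream)    # consumes the iterator
second_pass = list(stream)   # nothing is left to read

materialized = list(make_dataset())  # a list can be traversed repeatedly
print(len(first_pass), len(second_pass), len(materialized))
```

Once the data is held in a list, it can also be indexed directly, which the later cells rely on (for example `train_dataset[0][1]`).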

## Tokenization

Now we need to convert text into **numbers** that can be represented as tensors. For a word-level representation, we need to do two things:
* use a **tokenizer** to split text into **tokens**
* build a **vocabulary** of those tokens.


In [4]:
tokenizer = torchtext.data.utils.get_tokenizer('basic_english')
tokenizer('He said: hello')

['he', 'said', 'hello']

In [5]:
counter = collections.Counter()
for (label, line) in train_dataset:
    counter.update(tokenizer(line))
vocab = torchtext.vocab.vocab(counter, min_freq=1)

Using the vocabulary, we can easily encode a tokenized string into a sequence of numbers:


In [19]:
vocab_size = len(vocab)
print(f"Vocab size is {vocab_size}")

stoi = vocab.get_stoi() # dict to convert tokens to indices

def encode(x):
    return [stoi[s] for s in tokenizer(x)]

encode('I love to play with my words')

Vocab size is 95810


[599, 3279, 97, 1220, 329, 225, 7368]
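
Since `stoi` is a plain dictionary, `encode` as written raises a `KeyError` for any token that never appeared in the training data. One common way around this is to reserve an index for unknown tokens. The sketch below uses only the standard library; `simple_tokenize`, `toy_stoi`, `safe_encode` and the two-sentence corpus are hypothetical names and data, not part of the notebook.

```python
import re
from collections import Counter

# Hypothetical toy corpus; the notebook itself uses AG_NEWS.
corpus = ["I love to play", "I love my words"]

def simple_tokenize(text):
    # Rough stand-in for a word tokenizer: lowercase runs of word characters.
    return re.findall(r"\w+", text.lower())

counter = Counter(tok for line in corpus for tok in simple_tokenize(line))

# Reserve index 0 for unknown tokens, then number the rest in first-seen order.
UNK = "<unk>"
toy_stoi = {UNK: 0}
for tok in counter:
    toy_stoi[tok] = len(toy_stoi)

def safe_encode(text):
    # Fall back to the <unk> index instead of raising KeyError.
    return [toy_stoi.get(tok, toy_stoi[UNK]) for tok in simple_tokenize(text)]

print(safe_encode("I love chess"))  # [1, 2, 0]
```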

## Bag of Words text representation

Because words carry meaning, we can sometimes figure out the meaning of a text just by looking at the individual words, regardless of their order in the sentence. For example, when classifying news, words like *weather* and *snow* are likely to indicate a *weather forecast*, while words like *stocks* and *dollar* would count towards *financial news*.

**Bag of Words** (BoW) is the most commonly used traditional vector representation. Each word is linked to a vector index, and the vector element contains the number of occurrences of that word in a given document.

![Image showing how a bag of words vector representation is laid out in memory.](../../../../../translated_images/bag-of-words-example.606fc1738f1d7ba98a9d693e3bcd706c6e83fa7bf8221e6e90d1a206d82f2ea4.pcm.png)

> **Note**: You can also think of BoW as the sum of all one-hot-encoded vectors for the individual words in the text.

Below is an example of how to generate a bag of words representation using the Scikit Learn Python library:
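
The "sum of one-hot vectors" view can be checked on a tiny example. This is a minimal sketch in plain Python; the four-word `toy_vocab` and the helper names are hypothetical.

```python
toy_vocab = ["dog", "hot", "likes", "my"]   # hypothetical 4-word vocabulary
index = {word: i for i, word in enumerate(toy_vocab)}

def one_hot(word):
    # Vector with a single 1 at the word's index.
    vec = [0] * len(toy_vocab)
    vec[index[word]] = 1
    return vec

def bow_from_one_hot(tokens):
    # Element-wise sum of the one-hot vectors of all tokens.
    total = [0] * len(toy_vocab)
    for tok in tokens:
        total = [a + b for a, b in zip(total, one_hot(tok))]
    return total

print(bow_from_one_hot(["my", "dog", "likes", "hot", "dog"]))  # [2, 1, 1, 1]
```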


In [7]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
corpus = [
        'I like hot dogs.',
        'The dog ran fast.',
        'Its hot outside.',
    ]
vectorizer.fit_transform(corpus)
vectorizer.transform(['My dog likes hot dogs on a hot day.']).toarray()

array([[1, 1, 0, 2, 0, 0, 0, 0, 0]], dtype=int64)

To compute the bag-of-words vector from the vector representation of our AG_NEWS dataset, we can use the following function:


In [20]:
vocab_size = len(vocab)

def to_bow(text,bow_vocab_size=vocab_size):
    res = torch.zeros(bow_vocab_size,dtype=torch.float32)
    for i in encode(text):
        if i<bow_vocab_size:
            res[i] += 1
    return res

print(to_bow(train_dataset[0][1]))

tensor([2., 1., 2.,  ..., 0., 0., 0.])


> **Note:** Here we use the global `vocab_size` variable to set the default vocabulary size. Because the vocabulary is often quite large, we can limit it to the most frequent words. Try lowering the `vocab_size` value and running the code below to see how it affects accuracy. You should expect a small drop in accuracy, but not a dramatic one, in exchange for better performance.
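
Cutting the vocabulary by index only keeps the most frequent words if indices were assigned in order of decreasing frequency, so it is worth checking how the vocabulary was ordered. The sketch below makes that ordering explicit with `Counter.most_common`; `toy_counter`, `top_k_stoi` and `to_bow_list` are hypothetical names, and plain lists stand in for tensors.

```python
from collections import Counter

# Hypothetical token counts standing in for the AG_NEWS counter.
toy_counter = Counter("the cat sat on the mat the cat slept".split())

def top_k_stoi(counter, k):
    # most_common sorts by decreasing count, so indices 0..k-1
    # belong to the k most frequent tokens.
    return {tok: i for i, (tok, _) in enumerate(counter.most_common(k))}

def to_bow_list(tokens, stoi, size):
    vec = [0] * size
    for tok in tokens:
        i = stoi.get(tok)   # tokens outside the top k are skipped
        if i is not None:
            vec[i] += 1
    return vec

small_stoi = top_k_stoi(toy_counter, k=2)
print(small_stoi)                                                   # {'the': 0, 'cat': 1}
print(to_bow_list("the cat sat on the mat".split(), small_stoi, 2))  # [2, 1]
```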


## Train BoW classifier

Now that we know how to build a bag-of-words representation of our text, let's train a classifier on top of it. First, we need to convert our dataset for training so that all positional vector representations become bag-of-words representations. We can do this by passing the `bowify` function as the `collate_fn` parameter to the standard torch `DataLoader`:


In [21]:
from torch.utils.data import DataLoader
import numpy as np 

# this collate function gets list of batch_size tuples, and needs to 
# return a pair of label-feature tensors for the whole minibatch
def bowify(b):
    return (
            torch.LongTensor([t[0]-1 for t in b]),
            torch.stack([to_bow(t[1]) for t in b])
    )

train_loader = DataLoader(train_dataset, batch_size=16, collate_fn=bowify, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16, collate_fn=bowify, shuffle=True)

Now let's define a simple classifier neural network with a single linear layer. The size of the input vector is `vocab_size`, and the output size is the number of classes (4). Because we are solving a classification task, the final activation function is `LogSoftmax()`.


In [22]:
net = torch.nn.Sequential(torch.nn.Linear(vocab_size,4),torch.nn.LogSoftmax(dim=1))
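
`LogSoftmax` outputs log-probabilities, which is the input that a negative log-likelihood loss such as `NLLLoss` expects; together they amount to the usual cross-entropy loss. The arithmetic can be sketched by hand with the standard library. The `logits` values below are hypothetical, and the helper functions are illustrative, not PyTorch's implementation.

```python
import math

def log_softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]

def nll(log_probs, target):
    # Negative log-likelihood of the correct class.
    return -log_probs[target]

logits = [2.0, 0.5, -1.0, 0.0]   # hypothetical scores for the 4 classes
log_probs = log_softmax(logits)
loss = nll(log_probs, target=0)

probs = [math.exp(lp) for lp in log_probs]
print(round(sum(probs), 6), loss)   # the probabilities sum to 1
```

The loss is small when the network assigns high probability to the correct class and grows as that probability approaches zero.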

Now we define a standard PyTorch training loop. Because our dataset is quite large, for teaching purposes we train for only one epoch, and sometimes for less than an epoch (the `epoch_size` parameter lets us limit training). We also report the accumulated training accuracy during training; the `report_freq` parameter sets how often, in minibatches, it is reported.


In [24]:
def train_epoch(net,dataloader,lr=0.01,optimizer=None,loss_fn = torch.nn.NLLLoss(),epoch_size=None, report_freq=200):
    # Trains for at most one epoch; returns (summed batch loss / samples seen, accuracy)
    optimizer = optimizer or torch.optim.Adam(net.parameters(),lr=lr)
    net.train()
    total_loss,acc,count,i = 0,0,0,0
    for labels,features in dataloader:
        optimizer.zero_grad()
        out = net(features)
        loss = loss_fn(out,labels) # NLLLoss applied to log-probabilities
        loss.backward()
        optimizer.step()
        total_loss+=loss.item() # .item() accumulates plain floats, not graph-attached tensors
        _,predicted = torch.max(out,1)
        acc+=(predicted==labels).sum().item()
        count+=len(labels)
        i+=1
        if i%report_freq==0:
            print(f"{count}: acc={acc/count}")
        if epoch_size and count>epoch_size:
            break
    return total_loss/count, acc/count

In [25]:
train_epoch(net,train_loader,epoch_size=15000)

3200: acc=0.8028125
6400: acc=0.8371875
9600: acc=0.8534375
12800: acc=0.85765625


(0.026090790722161722, 0.8620069296375267)

## BiGrams, TriGrams and N-Grams

One limitation of the bag of words approach is that some words are part of multi-word expressions. For example, the expression 'hot dog' has a completely different meaning from the words 'hot' and 'dog' in other contexts. If we always represent the words 'hot' and 'dog' with the same vectors, it can confuse our model.

To address this, **N-gram representations** are often used in document classification, where the frequency of each word, bi-word or tri-word is a useful feature for training classifiers. In a bigram representation, for example, we add all word pairs to the vocabulary in addition to the original words.

Below is an example of how to generate a bigram bag of words representation using Scikit Learn:
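
Before turning to a library, the idea of adding word pairs on top of single words can be sketched in a few lines of plain Python. The helper `word_ngrams` is a hypothetical name used only for illustration.

```python
def word_ngrams(tokens, max_n=2):
    # Return all n-grams of length 1..max_n, joined with spaces.
    grams = []
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

print(word_ngrams(["i", "like", "hot", "dogs"]))
# ['i', 'like', 'hot', 'dogs', 'i like', 'like hot', 'hot dogs']
```

With `max_n=2` the expression 'hot dogs' becomes a feature of its own, separate from 'hot' and 'dogs'.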


In [26]:
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern=r'\b\w+\b', min_df=1)
corpus = [
        'I like hot dogs.',
        'The dog ran fast.',
        'Its hot outside.',
    ]
bigram_vectorizer.fit_transform(corpus)
print("Vocabulary:\n",bigram_vectorizer.vocabulary_)
bigram_vectorizer.transform(['My dog likes hot dogs on a hot day.']).toarray()


Vocabulary:
 {'i': 7, 'like': 11, 'hot': 4, 'dogs': 2, 'i like': 8, 'like hot': 12, 'hot dogs': 5, 'the': 16, 'dog': 0, 'ran': 14, 'fast': 3, 'the dog': 17, 'dog ran': 1, 'ran fast': 15, 'its': 9, 'outside': 13, 'its hot': 10, 'hot outside': 6}


array([[1, 0, 1, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
      dtype=int64)

The main drawback of the N-gram approach is that the vocabulary size grows extremely fast. In practice, we need to combine the N-gram representation with dimensionality reduction techniques such as *embeddings*, which we discuss in the next unit.

To use an N-gram representation on our **AG News** dataset, we need to build a special ngram vocabulary:


In [27]:
counter = collections.Counter()
for (label, line) in train_dataset:
    l = tokenizer(line)
    counter.update(torchtext.data.utils.ngrams_iterator(l,ngrams=2))
    
bi_vocab = torchtext.vocab.vocab(counter, min_freq=1)

print("Bigram vocabulary length = ",len(bi_vocab))

Bigram vocabulary length =  1308842


We could then use the same code as above to train the classifier, but it would be very memory-inefficient. In the next unit, we will train a bigram classifier using embeddings.

> **Note:** You can keep only those ngrams that occur in the text more than a specified number of times. This ensures that infrequent bigrams are omitted and reduces the dimensionality significantly. To do this, set the `min_freq` parameter to a higher value and observe how the vocabulary length changes.
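
The effect of a frequency threshold can be previewed on a small counter before rebuilding the full vocabulary. This is a sketch with hypothetical counts; `filter_by_min_freq` is an illustrative helper that keeps entries occurring at least `min_freq` times, not a `torchtext` function.

```python
from collections import Counter

# Hypothetical bigram counts.
bigram_counts = Counter({"hot dogs": 5, "the dog": 3, "dog ran": 1, "its hot": 1})

def filter_by_min_freq(counter, min_freq):
    # Keep only entries that occur at least min_freq times.
    return {gram: c for gram, c in counter.items() if c >= min_freq}

for threshold in (1, 2, 5):
    print(threshold, len(filter_by_min_freq(bigram_counts, threshold)))
```

In natural text a large share of bigrams occur only once, so even a threshold of 2 typically removes many entries.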


## Term Frequency-Inverse Document Frequency (TF-IDF)

In the BoW representation, word occurrences are weighted equally regardless of the word itself. However, frequent words such as *a* and *in* are clearly much less important for classification than specialized terms. In most NLP tasks, some words are more relevant than others.

**TF-IDF** stands for **term frequency–inverse document frequency**. It is a variation of bag of words where, instead of a binary 0/1 value indicating whether a word appears in a document, a floating-point value is used that is related to the frequency of the word's occurrence in the corpus.

More formally, the weight $w_{ij}$ of a word $i$ in document $j$ is defined as:
$$
w_{ij} = tf_{ij}\times\log({N\over df_i})
$$
where
* $tf_{ij}$ is the number of occurrences of $i$ in $j$, i.e. the BoW value we have seen before
* $N$ is the number of documents in the collection
* $df_i$ is the number of documents in the whole collection that contain the word $i$

The TF-IDF value $w_{ij}$ increases proportionally to the number of times a word appears in a document and is offset by the number of documents in the corpus that contain the word, which helps adjust for the fact that some words appear more frequently than others. For example, if a word appears in *every* document in the collection, then $df_i=N$ and $w_{ij}=0$, so such terms are disregarded completely.

You can easily create a TF-IDF vectorization of text using Scikit Learn:
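
The formula can be worked through directly on a three-document toy corpus. This is a minimal sketch of the definition above in plain Python; the function name `tf_idf` and the pre-tokenized `docs` are hypothetical.

```python
import math

# Toy corpus, already tokenized.
docs = [
    ["i", "like", "hot", "dogs"],
    ["the", "dog", "ran", "fast"],
    ["its", "hot", "outside"],
]

def tf_idf(term, doc, docs):
    tf = doc.count(term)                      # tf_ij: occurrences of the term in this document
    df = sum(1 for d in docs if term in d)    # df_i: documents containing the term
    if df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)      # tf_ij * log(N / df_i)

print(tf_idf("dogs", docs[0], docs))   # 1 * log(3/1)
print(tf_idf("hot", docs[0], docs))    # 1 * log(3/2)
```

Here 'hot' gets a lower weight than 'dogs' because it appears in two of the three documents. Note that Scikit Learn's `TfidfVectorizer` by default uses a smoothed variant of the IDF term and normalizes each vector, so its numbers will differ from this plain formula.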


In [28]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(ngram_range=(1,2))
vectorizer.fit_transform(corpus)
vectorizer.transform(['My dog likes hot dogs on a hot day.']).toarray()

array([[0.43381609, 0.        , 0.43381609, 0.        , 0.65985664,
        0.43381609, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        ]])

## Conclusion

Even though TF-IDF representations assign frequency-based weights to different words, they cannot represent meaning or word order. As the famous linguist J. R. Firth said in 1935, “The complete meaning of a word is always contextual, and no study of meaning apart from context can be taken seriously.” Later in the course we will learn how to capture contextual information from text using language modeling.


