# Text classification task

In this module, we will start with a simple text classification task based on the **[AG_NEWS](http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html)** dataset: we will classify news headlines into one of 4 categories: World, Sports, Business and Sci/Tech.

## The Dataset

To load the dataset, we will use the **[TensorFlow Datasets](https://www.tensorflow.org/datasets)** API.


In [1]:
import tensorflow as tf
from tensorflow import keras
import tensorflow_datasets as tfds

# In this tutorial, we will be training a lot of models. In order to use GPU memory cautiously,
# we will set tensorflow option to grow GPU memory allocation when required.
physical_devices = tf.config.list_physical_devices('GPU') 
if len(physical_devices)>0:
    tf.config.experimental.set_memory_growth(physical_devices[0], True)

dataset = tfds.load('ag_news_subset')

We can access the training and test portions of the dataset using `dataset['train']` and `dataset['test']` respectively:


In [3]:
ds_train = dataset['train']
ds_test = dataset['test']

print(f"Length of train dataset = {len(ds_train)}")
print(f"Length of test dataset = {len(ds_test)}")

Length of train dataset = 120000
Length of test dataset = 7600


Let's print the first 5 news headlines from our dataset:


In [4]:
classes = ['World', 'Sports', 'Business', 'Sci/Tech']

for i,x in zip(range(5),ds_train):
    print(f"{x['label']} ({classes[x['label']]}) -> {x['title']} {x['description']}")

3 (Sci/Tech) -> b'AMD Debuts Dual-Core Opteron Processor' b'AMD #39;s new dual-core Opteron chip is designed mainly for corporate computing applications, including databases, Web services, and financial transactions.'
1 (Sports) -> b"Wood's Suspension Upheld (Reuters)" b'Reuters - Major League Baseball\\Monday announced a decision on the appeal filed by Chicago Cubs\\pitcher Kerry Wood regarding a suspension stemming from an\\incident earlier this season.'
2 (Business) -> b'Bush reform may have blue states seeing red' b'President Bush #39;s  quot;revenue-neutral quot; tax reform needs losers to balance its winners, and people claiming the federal deduction for state and local taxes may be in administration planners #39; sights, news reports say.'
3 (Sci/Tech) -> b"'Halt science decline in schools'" b'Britain will run out of leading scientists unless science education is improved, says Professor Colin Pillinger.'
1 (Sports) -> b'Gerrard leaves practice' b'London, England (Sports Network

## Text vectorization

Now we need to convert text into **numbers** that can be represented as tensors. If we want a word-level representation, we need to do two things:

* Use a **tokenizer** to split text into **tokens**.
* Build a **vocabulary** of those tokens.
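
The two steps above can be sketched in plain Python. This is only an illustration, not how Keras implements them: the helper names `tokenize` and `build_vocabulary` and the sample sentences are assumptions made for this example. The reserved entries `''` (padding) and `'[UNK]'` (unknown word) follow the convention of the Keras `TextVectorization` layer.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def build_vocabulary(texts, max_tokens=None):
    """Map the most frequent tokens to integer indices.

    Index 0 is reserved for padding and index 1 for unknown words.
    max_tokens (hypothetical parameter) caps the total vocabulary size,
    reserved entries included.
    """
    counts = Counter(tok for text in texts for tok in tokenize(text))
    vocab = ["", "[UNK]"]
    room = None if max_tokens is None else max_tokens - len(vocab)
    vocab += [tok for tok, _ in counts.most_common(room)]
    return {tok: idx for idx, tok in enumerate(vocab)}

sample_texts = ["The dog ran fast.", "The dog likes the park."]
vocab = build_vocabulary(sample_texts, max_tokens=5)
print(vocab)
```

With `max_tokens=5` only the three most frequent words are kept next to the two reserved entries; every other word would later be mapped to `[UNK]`.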

### Limiting vocabulary size

In the AG News dataset, the vocabulary is rather large: more than 100k words. Generally, we don't need words that rarely appear in the text: only a few sentences contain them, and the model cannot learn much from them. It therefore makes sense to limit the vocabulary to a smaller size by passing an argument to the vectorizer constructor.

Both of these steps can be handled by the **TextVectorization** layer. Let's create the vectorizer object, then call the `adapt` method to go through the text and build the vocabulary:


In [5]:
vocab_size = 50000
vectorizer = keras.layers.experimental.preprocessing.TextVectorization(max_tokens=vocab_size)
vectorizer.adapt(ds_train.take(500).map(lambda x: x['title']+' '+x['description']))

> **Note** that we are using only a subset of the whole dataset to build the vocabulary. We do this to speed up execution and not keep you waiting. However, we take the risk that some words from the whole dataset will not be included in the vocabulary and will be ignored during training. Thus, using the whole vocabulary size and running through the whole dataset during `adapt` should increase the final accuracy, but not significantly.

Now we can inspect the actual vocabulary:


In [6]:
vocab = vectorizer.get_vocabulary()
vocab_size = len(vocab)
print(vocab[:10])
print(f"Length of vocabulary: {vocab_size}")

['', '[UNK]', 'the', 'to', 'a', 'in', 'of', 'and', 'on', 'for']
Length of vocabulary: 5335


Using the vectorizer, we can easily encode any text into a sequence of numbers:


In [7]:
vectorizer('I love to play with my words')

<tf.Tensor: shape=(7,), dtype=int64, numpy=array([ 112, 3695,    3,  304,   11, 1041,    1], dtype=int64)>
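
In the output above, the last index is 1, which is the position of `[UNK]` in the vocabulary printed earlier: a word that is not in the vocabulary is replaced by the unknown token. The sketch below imitates that behaviour in plain Python. The vocabulary `toy_vocab` and the function `encode` are hypothetical, chosen only for illustration.

```python
import re

# Hypothetical toy vocabulary; index 0 is padding and index 1 is the unknown token.
toy_vocab = {"": 0, "[UNK]": 1, "i": 2, "love": 3, "to": 4, "play": 5}

def encode(text, vocab):
    """Turn text into a list of token indices, using the [UNK] index for unseen words."""
    tokens = re.findall(r"\w+", text.lower())
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

encoded = encode("I love to play with my words", toy_vocab)
print(encoded)  # [2, 3, 4, 5, 1, 1, 1]
```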

## Bag-of-words text representation

Because words carry meaning, we can sometimes figure out what a text is about just by looking at the individual words, regardless of their order in the sentence. For example, when classifying news, words like *weather* and *snow* are likely to indicate a *weather forecast*, while words like *stocks* and *dollar* would count towards *financial news*.

**Bag-of-words** (BoW) is the simplest traditional vector representation to understand. Each word is linked to a vector index, and each vector element contains the number of times that word occurs in a given document.

![Image showing how a bag of words vector representation is represented in memory.](../../../../../translated_images/bag-of-words-example.606fc1738f1d7ba98a9d693e3bcd706c6e83fa7bf8221e6e90d1a206d82f2ea4.pcm.png) 

> **Note**: You can also think of BoW as the sum of the one-hot-encoded vectors of all the words in the text.
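
The note above can be checked with a small plain-Python sketch. The five-word vocabulary `toy_vocab` and the helper names are assumptions made for this illustration; words outside the vocabulary are simply skipped here.

```python
from collections import Counter

toy_vocab = ["dog", "dogs", "fast", "hot", "like"]  # hypothetical 5-word vocabulary
index = {word: i for i, word in enumerate(toy_vocab)}

def one_hot(word):
    """Vector with a single 1 at the word's index."""
    vec = [0] * len(toy_vocab)
    vec[index[word]] = 1
    return vec

def bow_from_one_hot(tokens):
    """Sum the one-hot vectors of all known tokens, element by element."""
    vectors = [one_hot(t) for t in tokens if t in index]
    if not vectors:
        return [0] * len(toy_vocab)
    return [sum(column) for column in zip(*vectors)]

def bow_from_counts(tokens):
    """Count occurrences directly; this gives the same vector."""
    counts = Counter(tokens)
    return [counts[word] for word in toy_vocab]

tokens = ["my", "dog", "likes", "hot", "dogs", "on", "a", "hot", "day"]
print(bow_from_one_hot(tokens))  # [1, 1, 0, 2, 0]
print(bow_from_counts(tokens))   # [1, 1, 0, 2, 0]
```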

Below is an example of how to generate a bag-of-words representation using the Scikit Learn Python library:


In [8]:
from sklearn.feature_extraction.text import CountVectorizer
sc_vectorizer = CountVectorizer()
corpus = [
        'I like hot dogs.',
        'The dog ran fast.',
        'Its hot outside.',
    ]
sc_vectorizer.fit_transform(corpus)
sc_vectorizer.transform(['My dog likes hot dogs on a hot day.']).toarray()

array([[1, 1, 0, 2, 0, 0, 0, 0, 0]], dtype=int64)

We can also use the Keras vectorizer defined above, converting each word number into a one-hot encoding and adding all those vectors up:


In [9]:
def to_bow(text):
    return tf.reduce_sum(tf.one_hot(vectorizer(text),vocab_size),axis=0)

to_bow('My dog likes hot dogs on a hot day.').numpy()

array([0., 5., 0., ..., 0., 0., 0.], dtype=float32)

> **Note**: You may be surprised that the result differs from the previous example. The reason is that in the Keras example the length of the vector corresponds to the size of the vocabulary built from the AG News data, while in the Scikit Learn example we built the vocabulary from the sample text on the fly.


## Training the BoW classifier

Now that we have learned how to build the bag-of-words representation of our text, let's train a classifier that uses it. First, we need to convert our dataset to a bag-of-words representation. This can be done with the `map` function in the following way:


In [11]:
batch_size = 128

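# Join title and description with a space so the last word of the title
# does not merge with the first word of the description.
ds_train_bow = ds_train.map(lambda x: (to_bow(x['title']+' '+x['description']),x['label'])).batch(batch_size)
ds_test_bow = ds_test.map(lambda x: (to_bow(x['title']+' '+x['description']),x['label'])).batch(batch_size)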

Now let's define a simple classifier neural network that contains one linear layer. The input size is `vocab_size`, and the output size corresponds to the number of classes (4). Because we are solving a classification task, the final activation function is **softmax**:


In [12]:
model = keras.models.Sequential([
    keras.layers.Dense(4,activation='softmax',input_shape=(vocab_size,))
])
model.compile(loss='sparse_categorical_crossentropy',optimizer='adam',metrics=['acc'])
model.fit(ds_train_bow,validation_data=ds_test_bow)



<keras.callbacks.History at 0x20c70a947f0>

Since we have 4 classes, random guessing would give about 25% accuracy, so an accuracy above 80% is a good result.
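
To make the model above more concrete, here is a minimal sketch of what a single dense layer with softmax computes for one BoW vector. It is written in plain Python rather than Keras, and the weights, biases and input are hypothetical numbers for a 3-word vocabulary and 2 classes, not values from the trained model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dense_softmax(bow, weights, biases):
    """One linear layer followed by softmax.

    weights[c][i] is the weight of vocabulary entry i for class c.
    """
    logits = [sum(w * x for w, x in zip(row, bow)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)

# Hypothetical numbers: 3-word vocabulary, 2 classes.
bow = [2.0, 0.0, 1.0]
weights = [[0.5, -1.0, 0.0],
           [-0.5, 1.0, 0.0]]
biases = [0.0, 0.0]
probs = dense_softmax(bow, weights, biases)
print(probs)
```

The first word votes for class 0 and appears twice, so class 0 receives the higher probability. Training adjusts such weights so that words typical of a category push the score of that category up.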

## Training the classifier as one network

Because the vectorizer is also a Keras layer, we can define a network that includes it and train it end-to-end. This way we don't need to vectorize the dataset using `map`; we can pass the original dataset directly to the input of the network.

> **Note**: We still have to apply `map` to our dataset to convert the dictionary fields (such as `title`, `description` and `label`) to tuples. However, when loading data from disk, we can build a dataset with the required structure from the start.


In [13]:
def extract_text(x):
    return x['title']+' '+x['description']

def tupelize(x):
    return (extract_text(x),x['label'])

inp = keras.Input(shape=(1,),dtype=tf.string)
x = vectorizer(inp)
x = tf.reduce_sum(tf.one_hot(x,vocab_size),axis=1)
out = keras.layers.Dense(4,activation='softmax')(x)
model = keras.models.Model(inp,out)
model.summary()

model.compile(loss='sparse_categorical_crossentropy',optimizer='adam',metrics=['acc'])
model.fit(ds_train.map(tupelize).batch(batch_size),validation_data=ds_test.map(tupelize).batch(batch_size))


Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 1)]               0         
                                                                 
 text_vectorization (TextVec  (None, None)             0         
 torization)                                                     
                                                                 
 tf.one_hot (TFOpLambda)     (None, None, 5335)        0         
                                                                 
 tf.math.reduce_sum (TFOpLam  (None, 5335)             0         
 bda)                                                            
                                                                 
 dense_2 (Dense)             (None, 4)                 21344     
                                                                 
Total params: 21,344
Trainable params: 21,344
Non-trainable p

<keras.callbacks.History at 0x20c721521f0>
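
The parameter count in the summary above can be checked by hand: a dense layer has one weight for every (input, output) pair plus one bias per output. A short worked calculation, using the vocabulary size and class count reported earlier:

```python
vocab_size = 5335   # vocabulary length reported by the vectorizer
num_classes = 4

weights = vocab_size * num_classes  # one weight per (word, class) pair
biases = num_classes                # one bias per class
total_params = weights + biases
print(total_params)  # 21344
```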

## Bigrams, trigrams and n-grams

One limitation of the bag-of-words approach is that some words are part of multi-word expressions. For example, the phrase 'hot dog' has a completely different meaning from the words 'hot' and 'dog' in other contexts. If we always represent the words 'hot' and 'dog' with the same vectors, it can confuse the model.

To address this, **n-gram representations** are often used in document classification, where the frequency of each word, bi-word or tri-word is a useful feature for training classifiers. In bigram representations, for example, we add all word pairs to the vocabulary, in addition to the original words.

Below is an example of how to generate a bigram bag-of-words representation using Scikit Learn:


In [14]:
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), token_pattern=r'\b\w+\b', min_df=1)
corpus = [
        'I like hot dogs.',
        'The dog ran fast.',
        'Its hot outside.',
    ]
bigram_vectorizer.fit_transform(corpus)
print("Vocabulary:\n",bigram_vectorizer.vocabulary_)
bigram_vectorizer.transform(['My dog likes hot dogs on a hot day.']).toarray()


Vocabulary:
 {'i': 7, 'like': 11, 'hot': 4, 'dogs': 2, 'i like': 8, 'like hot': 12, 'hot dogs': 5, 'the': 16, 'dog': 0, 'ran': 14, 'fast': 3, 'the dog': 17, 'dog ran': 1, 'ran fast': 15, 'its': 9, 'outside': 13, 'its hot': 10, 'hot outside': 6}


array([[1, 0, 1, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
      dtype=int64)
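
To see what happens behind this call, the sketch below rebuilds the same unigram-plus-bigram vocabulary in plain Python. It is an illustration under simplifying assumptions (lowercasing, the same `\b\w+\b` token pattern, alphabetical indexing), not Scikit Learn's actual implementation; the helper names `ngrams` and `to_vector` are hypothetical.

```python
import re

def ngrams(text, max_n=2):
    """Return all 1..max_n word n-grams of the text, in order of appearance."""
    words = re.findall(r"\b\w+\b", text.lower())
    grams = []
    for n in range(1, max_n + 1):
        grams += [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return grams

corpus = ["I like hot dogs.", "The dog ran fast.", "Its hot outside."]

# Vocabulary: every distinct n-gram, indexed in alphabetical order.
vocab = sorted({g for doc in corpus for g in ngrams(doc)})
index = {g: i for i, g in enumerate(vocab)}

def to_vector(text):
    """Count how often each vocabulary n-gram occurs in the text."""
    vec = [0] * len(vocab)
    for g in ngrams(text):
        if g in index:
            vec[index[g]] += 1
    return vec

print(len(vocab))  # 18
print(to_vector("My dog likes hot dogs on a hot day."))
```

Ten distinct words already produce 18 vocabulary entries once bigrams are added, and the bigram 'hot dogs' now has its own position in the vector, separate from 'hot' and 'dogs'.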

The main drawback of the n-gram approach is that the vocabulary size grows extremely fast. In practice, we need to combine the n-gram representation with a dimensionality reduction technique, such as *embeddings*, which we will discuss in the next unit.

To use an n-gram representation with our **AG News** dataset, we need to pass the `ngrams` parameter to the `TextVectorization` constructor. The bigram vocabulary is **significantly larger**: in our case, it is more than 1.3 million tokens! It therefore makes sense to limit the bigram tokens to a reasonable number as well.

We could use the same code as above to train the classifier, but it would be very memory-inefficient. In the next unit, we will train a bigram classifier using embeddings. In the meantime, you can experiment with bigram classifier training in this notebook and see if you can get higher accuracy.


## Automatically calculating BoW vectors

In the example above, we calculated BoW vectors by hand, summing the one-hot encodings of the individual words. However, recent versions of TensorFlow can calculate BoW vectors automatically if we pass the `output_mode='count'` parameter to the vectorizer constructor. This makes defining and training our model significantly easier:


In [15]:
model = keras.models.Sequential([
    keras.layers.experimental.preprocessing.TextVectorization(max_tokens=vocab_size,output_mode='count'),
    keras.layers.Dense(4,input_shape=(vocab_size,), activation='softmax')
])
print("Training vectorizer")
model.layers[0].adapt(ds_train.take(500).map(extract_text))
model.compile(loss='sparse_categorical_crossentropy',optimizer='adam',metrics=['acc'])
model.fit(ds_train.map(tupelize).batch(batch_size),validation_data=ds_test.map(tupelize).batch(batch_size))

Training vectorizer


<keras.callbacks.History at 0x20c725217c0>

## Term frequency - inverse document frequency (TF-IDF)

In the BoW representation, word occurrences are weighted in the same way regardless of the word itself. However, it is clear that frequent words such as *a* and *in* are much less important for classification than specialized terms. In most NLP tasks, some words are more relevant than others.

**TF-IDF** stands for **term frequency - inverse document frequency**. It is a variation of bag-of-words where, instead of a binary 0/1 value indicating whether a word appears in a document, a floating-point value is used that is related to how frequently the word occurs in the corpus.

More formally, the weight $w_{ij}$ of a word $i$ in the document $j$ is defined as:
$$
w_{ij} = tf_{ij}\times\log({N\over df_i})
$$
where
* $tf_{ij}$ is the number of occurrences of $i$ in $j$, i.e. the BoW value we have seen before
* $N$ is the number of documents in the collection
* $df_i$ is the number of documents in the whole collection that contain the word $i$

The TF-IDF value $w_{ij}$ increases with the number of times a word appears in a document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently than others. For example, if a word appears in *every* document in the collection, then $df_i=N$ and $w_{ij}=0$, so such words are disregarded completely.
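
The formula can be applied directly in a few lines of plain Python. This is a minimal sketch of the definition given here, with a hypothetical three-document corpus and helper names chosen for illustration. Note that Scikit Learn's `TfidfVectorizer` by default uses a smoothed IDF and normalizes each vector, so its numbers will differ from the ones produced by this sketch.

```python
import math
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def tf_idf(corpus):
    """Return one {word: weight} dict per document, using w = tf * log(N / df)."""
    docs = [tokenize(doc) for doc in corpus]
    n_docs = len(docs)
    # df[word] = number of documents that contain the word
    df = {}
    for doc in docs:
        for word in set(doc):
            df[word] = df.get(word, 0) + 1
    weights = []
    for doc in docs:
        w = {}
        for word in set(doc):
            tf = doc.count(word)  # occurrences of the word in this document
            w[word] = tf * math.log(n_docs / df[word])
        weights.append(w)
    return weights

toy_corpus = [
    "the dog likes hot dogs",
    "the dog ran fast",
    "the weather is hot hot hot",
]
weights = tf_idf(toy_corpus)
print(weights[2])
```

The word 'the' occurs in every document, so its weight is 0 everywhere, while 'weather', which occurs in only one document, gets the highest weight per occurrence.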

You can easily create a TF-IDF vectorization of text using Scikit Learn:


In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
# Use a separate name so the Keras `vectorizer` defined earlier is not overwritten
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1,2))
tfidf_vectorizer.fit_transform(corpus)
tfidf_vectorizer.transform(['My dog likes hot dogs on a hot day.']).toarray()

array([[0.43381609, 0.        , 0.43381609, 0.        , 0.65985664,
        0.43381609, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        ]])

In Keras, the `TextVectorization` layer can compute TF-IDF frequencies automatically if you pass the `output_mode='tf-idf'` parameter. Let's repeat the code we used above to see whether using TF-IDF increases accuracy:


In [17]:
model = keras.models.Sequential([
    keras.layers.experimental.preprocessing.TextVectorization(max_tokens=vocab_size,output_mode='tf-idf'),
    keras.layers.Dense(4,input_shape=(vocab_size,), activation='softmax')
])
print("Training vectorizer")
model.layers[0].adapt(ds_train.take(500).map(extract_text))
model.compile(loss='sparse_categorical_crossentropy',optimizer='adam',metrics=['acc'])
model.fit(ds_train.map(tupelize).batch(batch_size),validation_data=ds_test.map(tupelize).batch(batch_size))

Training vectorizer


<keras.callbacks.History at 0x20c729dfd30>

## Conclusion

Even though TF-IDF assigns different weights to different words based on how often they occur, it cannot represent their meaning or their order. As the famous linguist J. R. Firth said in 1935, "The complete meaning of a word is always contextual, and no study of meaning apart from context can be taken seriously." Later in this course, we will learn how to capture contextual information from text using language modeling.


