Author: Kaveh Mahdavi <kavehmahdavi74@yahoo.com>
License: BSD 3 clause
last update: 28/12/2022

# News Classification

I explore different neural network architectures for dealing with natural language text by using:
* bag-of-words
* embeddings
* recurrent neural network

In [28]:
import os

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
import sys
import tensorflow as tf
from tensorflow import keras
import tensorflow_datasets as tfds
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
# To avoid reserving all GPU memory up front, enable memory growth so TensorFlow allocates GPU memory only as needed (first GPU only).
physical_devices = tf.config.list_physical_devices('GPU')
if len(physical_devices) > 0:
    tf.config.experimental.set_memory_growth(physical_devices[0], True)

In [55]:
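# Helper functions
def get_bag_of_words(text, vocab_size):
    # Relies on the global `vectorizer` defined later in the notebook:
    # token ids -> one-hot vectors -> summed into a single count vector.
    return tf.reduce_sum(tf.one_hot(vectorizer(text), vocab_size), axis=0)

def extract_text(x):
    # Join title and description with a space so the boundary words stay separate.
    return x['title'] + ' ' + x['description']

def tupelize(x):
    # Convert a dataset record into a (text, label) pair for model.fit.
    return (extract_text(x), x['label'])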


## Represent text

To solve Natural Language Processing (NLP) tasks with an artificial neural network (ANN), I need a way to represent text as tensors.

* **Character-level representation:** I represent text by treating each character as a number. Given that the text corpus has C different characters, the word Hello can be represented by a tensor of shape C×5, where each of the 5 letters corresponds to a one-hot vector of length C.
* **Word-level representation:** I create a vocabulary of all words in the text and then represent each word using one-hot encoding. This approach usually works better than character-level representation because a single letter carries little meaning by itself. Using higher-level semantic units (words) simplifies the task for the neural network. However, with a large vocabulary, the one-hot tensors become high-dimensional and sparse.
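
To make the character-level case concrete, here is a minimal plain-Python sketch of one-hot encoding. The helper `one_hot_chars` and the toy corpus are made up for illustration and are not part of this notebook. The result is stored as 5 rows of length C, i.e. the transpose of the C×5 layout described above.

```python
def one_hot_chars(text, alphabet):
    """Return one row per character; each row has length C = len(alphabet)."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    rows = []
    for ch in text:
        row = [0] * len(alphabet)
        row[index[ch]] = 1  # exactly one position is set per character
        rows.append(row)
    return rows

# Toy corpus (an assumption for this sketch), so C is tiny here.
alphabet = sorted(set("Hello world"))
encoded = one_hot_chars("Hello", alphabet)
print(len(encoded), len(alphabet))  # number of characters, then C
```

The same idea applies at the word level, except that the index runs over the vocabulary instead of the alphabet, which is why the vectors become much longer and sparser.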

### Load Dataset

In [2]:
dataset = tfds.load('ag_news_subset')

Downloading and preparing dataset 11.24 MiB (download: 11.24 MiB, generated: 35.79 MiB, total: 47.03 MiB) to /home/kaveh/tensorflow_datasets/ag_news_subset/1.0.0...


Dataset ag_news_subset downloaded and prepared to /home/kaveh/tensorflow_datasets/ag_news_subset/1.0.0. Subsequent calls will reuse this data.


In [14]:
classes = ['World', 'Sports', 'Business', 'Sci/Tech']
ds_train = dataset['train']
ds_test = dataset['test']

print("Size of train dataset: {}".format(len(ds_train)))
print("Size of test dataset:  {}".format(len(ds_test)))

Size of train dataset: 120000
Size of test dataset:  7600


In [20]:
for i, x in zip(range(3), ds_train):
    print(f"{x['label']} ({classes[x['label']]}) -> {x['title']} {x['description']}")

3 (Sci/Tech) -> b'AMD Debuts Dual-Core Opteron Processor' b'AMD #39;s new dual-core Opteron chip is designed mainly for corporate computing applications, including databases, Web services, and financial transactions.'
1 (Sports) -> b"Wood's Suspension Upheld (Reuters)" b'Reuters - Major League Baseball\\Monday announced a decision on the appeal filed by Chicago Cubs\\pitcher Kerry Wood regarding a suspension stemming from an\\incident earlier this season.'
2 (Business) -> b'Bush reform may have blue states seeing red' b'President Bush #39;s  quot;revenue-neutral quot; tax reform needs losers to balance its winners, and people claiming the federal deduction for state and local taxes may be in administration planners #39; sights, news reports say.'


## Simple Classifier ANN

### I. Approach

#### Vectorize Text

I convert text into numbers so that it can be represented as tensors. For a word-level representation, this takes two steps:
* Use a tokenizer to split text into tokens.
* Build a vocabulary of those tokens.

I ignore words that rarely occur in the text: only a few sentences contain them, so the model cannot learn much from them.
To do this, I limit the vocabulary size by passing the `max_tokens` argument to the `TextVectorization` constructor.
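
The two steps and the vocabulary cap can be sketched in plain Python as follows. This is only a rough stand-in for what `TextVectorization` does, assuming lowercasing, punctuation stripping and whitespace splitting. `tokenize`, `build_vocabulary`, `vectorize` and the toy corpus are hypothetical names used for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumed standardization: lowercase, strip punctuation, split on whitespace.
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def build_vocabulary(texts, max_tokens):
    """Keep only the most frequent tokens, capped at max_tokens entries."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    # Index 0 is reserved for padding and index 1 for unknown words,
    # mirroring the '' and '[UNK]' entries of a TextVectorization vocabulary.
    return ["", "[UNK]"] + [tok for tok, _ in counts.most_common(max_tokens - 2)]

def vectorize(text, vocabulary):
    index = {tok: i for i, tok in enumerate(vocabulary)}
    return [index.get(tok, 1) for tok in tokenize(text)]  # rare/unseen words map to 1

toy_corpus = ["The cat sat on the mat.", "The cat sat.", "The cat ran!"]
toy_vocab = build_vocabulary(toy_corpus, max_tokens=5)
print(toy_vocab)                            # ['', '[UNK]', 'the', 'cat', 'sat']
print(vectorize("the dog sat", toy_vocab))  # [2, 1, 4]
```

Words that fall outside the cap ("on", "mat", "ran") and words never seen ("dog") all collapse to the same unknown index, which is the price of a smaller vocabulary.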

In [56]:
# Vectorize & build a vocabulary from the first 500 training examples
vectorizer = keras.layers.TextVectorization(max_tokens=50000)
vectorizer.adapt(ds_train.take(500).map(lambda x: x['title'] + ' ' + x['description']))

In [57]:
vocabulary = vectorizer.get_vocabulary()
vocabulary_size = len(vocabulary)
print(vocabulary[:15])
print(f"Number of vocabulary: {vocabulary_size}")
vectorizer('I love artificial intelligence')

['', '[UNK]', 'the', 'to', 'a', 'in', 'of', 'and', 'on', 'for', '39s', 'with', 'that', 'its', 'as']
Number of vocabulary: 5335


<tf.Tensor: shape=(4,), dtype=int64, numpy=array([ 112, 3695, 5071, 3908])>

#### Bag of Words

I convert each word number into a one-hot encoding and add all those vectors up, which gives one vector of word counts per text.
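
As a rough illustration of the idea, the sketch below builds a bag-of-words vector in plain Python. Incrementing a counter at each token index gives the same result as summing the one-hot vectors. `bag_of_words` is a hypothetical helper, separate from the `get_bag_of_words` function used in this notebook.

```python
def bag_of_words(token_ids, vocab_size):
    """Sum of one-hot vectors: position i counts how often token i occurs."""
    counts = [0] * vocab_size
    for token_id in token_ids:
        counts[token_id] += 1  # same as adding the one-hot vector for token_id
    return counts

print(bag_of_words([2, 1, 4, 2], vocab_size=5))  # [0, 1, 2, 0, 1]
```

Note that word order is lost: `[2, 4]` and `[4, 2]` produce the same vector, so the classifier only sees which words occur and how often.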

In [33]:
batch_size = 128
# Join title and description with a space, matching the text the vectorizer was adapted on
ds_train_bow = ds_train.map(
    lambda x: (get_bag_of_words(x['title'] + ' ' + x['description'], vocabulary_size), x['label'])).batch(batch_size)
ds_test_bow = ds_test.map(
    lambda x: (get_bag_of_words(x['title'] + ' ' + x['description'], vocabulary_size), x['label'])).batch(batch_size)

<BatchDataset element_spec=(TensorSpec(shape=(None, 5335), dtype=tf.float32, name=None), TensorSpec(shape=(None,), dtype=tf.int64, name=None))>

#### Build a Classifier

In [31]:
model = keras.models.Sequential([
    keras.layers.Dense(4, activation='softmax', input_shape=(vocabulary_size,))
])

model.summary()

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['acc'])
model.fit(ds_train_bow, validation_data=ds_test_bow)



<keras.callbacks.History at 0x7f4813659af0>

### II. Approach

In [60]:
inp = keras.Input(shape=(1,), dtype=tf.string)
x = vectorizer(inp)
# Sum the one-hot vectors over the token axis to get one bag-of-words vector per example
x = tf.reduce_sum(tf.one_hot(x, vocabulary_size), axis=1)
out = keras.layers.Dense(4, activation='softmax')(x)
model = keras.models.Model(inp, out)
model.summary()

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['acc'])
model.fit(ds_train.map(tupelize).batch(batch_size), validation_data=ds_test.map(tupelize).batch(batch_size))


In [None]:
model = keras.models.Sequential([
    # output_mode='count' makes the layer emit bag-of-words count vectors directly
    keras.layers.TextVectorization(max_tokens=vocabulary_size, output_mode='count'),
    keras.layers.Dense(4, input_shape=(vocabulary_size,), activation='softmax')
])
print("Training vectorizer")
model.layers[0].adapt(ds_train.take(500).map(extract_text))
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['acc'])
model.fit(ds_train.map(tupelize).batch(batch_size), validation_data=ds_test.map(tupelize).batch(batch_size))

Training vectorizer
