<a href="https://colab.research.google.com/github/oscar-defelice/TextClassifierModels/blob/main/CNN/TextClassifierCNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Installation and Configuration

Google Colab offers free GPU and even TPU. For the purpose of simpler setup, we will stick to GPU. Attention models are quite big, so we need to be aware that we are constrained by 12 GB of VRAM in Google Colab as Tesla K80 is used.

First, let's check if you have GPU enabled in your session here in Colab. You can do it by running the following code.

In [1]:
import tensorflow as tf

# Get the GPU device name.
device_name = tf.test.gpu_device_name()

# The device name should look like the following:
if device_name == '/device:GPU:0':
    print('Found GPU at: {}'.format(device_name))
else:    
    raise SystemError('GPU device not found')

Found GPU at: /device:GPU:0


If you do not have the GPU enabled, just go to:

`Edit -> Notebook Settings -> Hardware accelerator -> Set to GPU`

To fine-tune our model, we need a couple of libraries to install first. 
TensorFlow 2 is already preinstalled, so the missing ones are [transformers](https://github.com/huggingface/transformers) and [TensorFlow datasets](https://github.com/tensorflow/datasets). This allows us to very easily import already pre-trained models for TensorFlow 2 and fine-tune with Keras API. 

In [2]:
%%bash
pip install -q transformers tensorflow_datasets==4.0.1 

In addition, we install [Tensorflow.JS](https://www.tensorflow.org/js). This will be useful to export our model once trained to deploy it on a web app.

In [3]:
%%bash
pip install -q tensorflowjs

# Loading AG News dataset

We will use [ag_news dataset](https://www.tensorflow.org/datasets/catalog/ag_news_subset).

The AG's news topic classification dataset is constructed by Xiang Zhang (xiang.zhang@nyu.edu) from the complete dataset of 1 million of news. It is used as a text classification benchmark in the following paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015).

The AG's news topic classification dataset is constructed by choosing 4 largest classes from the original corpus. Each class contains 30,000 training samples and 1,900 testing samples. The total number of training samples is 120,000 and testing 7,600.

Although we could load data very quickly just with `tensorflow_datasets` library, with the following code

```python
import tensorflow_datasets as tfds

(ds_train, ds_test), ds_info = tfds.load('ag_news_subset', 
          split = (tfds.Split.TRAIN, tfds.Split.TEST),
          as_supervised=True,
          with_info=True
          )

print('info', ds_info)
```

Note how the code above returns a dictionary 
```python
FeaturesDict({
  'description': Text(shape=(), dtype=tf.string),
  'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=4),
  'title': Text(shape=(), dtype=tf.string),
})
```

We want to operate some preprocessing, thus we are going to load data through the usual pandas dataframe from csv, and then convert to a tensorflow dataset, which is the robust, and ready-to-parallel computing format we want to use.

In [4]:
%%bash
wget https://s3.amazonaws.com/fast-ai-nlp/ag_news_csv.tgz
mkdir -p data && tar -xvzf ag_news_csv.tgz -C data/

ag_news_csv/
ag_news_csv/train.csv
ag_news_csv/readme.txt
ag_news_csv/test.csv
ag_news_csv/classes.txt


--2020-10-26 10:51:13--  https://s3.amazonaws.com/fast-ai-nlp/ag_news_csv.tgz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.64.158
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.64.158|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11784419 (11M) [application/x-tar]
Saving to: ‘ag_news_csv.tgz.1’

     0K .......... .......... .......... .......... ..........  0% 1.65M 7s
    50K .......... .......... .......... .......... ..........  0% 1.68M 7s
   100K .......... .......... .......... .......... ..........  1% 1.68M 7s
   150K .......... .......... .......... .......... ..........  1%  144M 5s
   200K .......... .......... .......... .......... ..........  2% 1.70M 5s
   250K .......... .......... .......... .......... ..........  2% 94.5M 4s
   300K .......... .......... .......... .......... ..........  3%  178M 4s
   350K .......... .......... .......... .......... ..........  3%  164M 3s
   400K .......... .......... .......... .....

In [5]:
import pandas as pd

train_data = pd.read_csv('/content/data/ag_news_csv/train.csv', engine='python', encoding='utf-8', header =None, names=['Class Index',	'Title',	'Description'])
test_data = pd.read_csv('/content/data/ag_news_csv/test.csv', engine='python', encoding='utf-8', header = None, names=['Class Index',	'Title',	'Description'])

print('Training set summary\n')
print(train_data.describe())
print('Test set summary\n')
print(test_data.describe())

Training set summary

         Class Index
count  120000.000000
mean        2.500000
std         1.118039
min         1.000000
25%         1.750000
50%         2.500000
75%         3.250000
max         4.000000
Test set summary

       Class Index
count  7600.000000
mean      2.500000
std       1.118108
min       1.000000
25%       1.750000
50%       2.500000
75%       3.250000
max       4.000000


Now let's explore the examples for fine-tunning. We can just take the top 5 examples and labels by `ds_train.take(5)`, so that we can explore the dataset without the need to iterate over whole 25000 examples in train dataset. 

In [6]:
train_data.head()

Unnamed: 0,Class Index,Title,Description
0,3,Wall St. Bears Claw Back Into the Black (Reuters),"Reuters - Short-sellers, Wall Street's dwindli..."
1,3,Carlyle Looks Toward Commercial Aerospace (Reu...,Reuters - Private investment firm Carlyle Grou...
2,3,Oil and Economy Cloud Stocks' Outlook (Reuters),Reuters - Soaring crude prices plus worries\ab...
3,3,Iraq Halts Oil Exports from Main Southern Pipe...,Reuters - Authorities have halted oil export\f...
4,3,"Oil prices soar to all-time record, posing new...","AFP - Tearaway world oil prices, toppling reco..."


In [7]:
test_data.head()

Unnamed: 0,Class Index,Title,Description
0,3,Fears for T N pension after talks,Unions representing workers at Turner Newall...
1,4,The Race is On: Second Private Team Sets Launc...,"SPACE.com - TORONTO, Canada -- A second\team o..."
2,4,Ky. Company Wins Grant to Study Peptides (AP),AP - A company founded by a chemistry research...
3,4,Prediction Unit Helps Forecast Wildfires (AP),AP - It's barely dawn when Mike Fitzpatrick st...
4,4,Calif. Aims to Limit Farm-Related Smog (AP),AP - Southern California's smog-fighting agenc...


## Import libraries

Here we import all the libraries we need to build and then train our model.

In [8]:
import re
import string
import numpy as np

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split

import tensorflow as tf
import tensorflow_datasets as tfds
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, MaxPool1D, Dropout, Dense, GlobalMaxPool1D, Embedding, Activation
from keras.utils import to_categorical
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn import preprocessing
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Data Preprocessing

Here we have to prepare out data, with some text preprocessing, so add special tokens, removing punctuation and stopwords if necessary, and so on.

### Rename columns

In [9]:
# rename labels

labels = {1:'World News', 2:'Sports News', 3:'Business News', 4:'Science-Technology News'}

train_data['label'] = train_data['Class Index'].map(labels)
test_data['label'] = test_data['Class Index'].map(labels)

train_data = train_data.drop(columns=['Class Index'])
test_data = test_data.drop(columns=['Class Index'])

### Remove punctuation

First of all, we define a function to remove punctuation.

In [None]:
# The old and loved remove punctuation function

def remove_punc(text):
    text = re.sub('\[.*?\]', '', text)
    text = re.sub('https?://\S+|www\.\S+', '', text)
    text = re.sub('<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub('\n', '', text)
    text = re.sub('\w*\d\w*', '', text)
    return text

In [None]:
train_data['Text'] = train_data['Description'].apply(lambda x: remove_punc(x))
test_data['Text'] = test_data['Description'].apply(lambda x: remove_punc(x))

In [None]:
train_data.head()

Unnamed: 0,Class Index,Title,Description,Text
0,3,Wall St. Bears Claw Back Into the Black (Reuters),"Reuters - Short-sellers, Wall Street's dwindli...",Reuters Shortsellers Wall Streets dwindlingba...
1,3,Carlyle Looks Toward Commercial Aerospace (Reu...,Reuters - Private investment firm Carlyle Grou...,Reuters Private investment firm Carlyle Group...
2,3,Oil and Economy Cloud Stocks' Outlook (Reuters),Reuters - Soaring crude prices plus worries\ab...,Reuters Soaring crude prices plus worriesabou...
3,3,Iraq Halts Oil Exports from Main Southern Pipe...,Reuters - Authorities have halted oil export\f...,Reuters Authorities have halted oil exportflo...
4,3,"Oil prices soar to all-time record, posing new...","AFP - Tearaway world oil prices, toppling reco...",AFP Tearaway world oil prices toppling record...


### Clean text

Hence, making use of nltk tokeniser, we clean our text:

1. Lowercase our texts.

2. Remove stopwords.

In [None]:
# data cleaning and remove stopwords

def data_cleaner(text):        
    lower_case = text.lower()
    tokens=word_tokenize(lower_case)
    return (" ".join(tokens)).strip()

def remove_stopwords (text):        
    list1=[word for word in text.split() if word not in stopwords.words('english')]
    return " ".join(list1)

train_data['Text'] = train_data['Text'].apply(lambda x: data_cleaner(x))
test_data['Text'] = test_data['Text'].apply(lambda x: data_cleaner(x))

train_data['Text'] = train_data['Text'].apply(lambda x: remove_stopwords(x))
test_data['Text'] = test_data['Text'].apply(lambda x: remove_stopwords(x))

In [10]:
train_data['Text'] = train_data['Description']
test_data['Text'] = test_data['Description']

### Tokenise

We make use of Keras tokeniser to assign to each word a number.

In [11]:
#@title Tokeniser configuration

max_len = 75#@param {type:"integer"}

In [12]:
tokeniser = Tokenizer()
tokeniser.fit_on_texts(train_data['Text'])

tokenised_text = tokeniser.texts_to_sequences(train_data['Text'])
tokenised_text = pad_sequences(tokenised_text, maxlen=max_len)

In [13]:
encoded_labels = preprocessing.LabelEncoder()
y = encoded_labels.fit_transform(train_data['label'])
y = to_categorical(y)

In [14]:
tokenised_text_test = tokeniser.texts_to_sequences(test_data['Text'])
tokenised_text_test = pad_sequences(tokenised_text_test, maxlen=max_len)

In [15]:
y_test = encoded_labels.transform(test_data['label'])
y_test = to_categorical(y_test)

In [16]:
vocab_size = len(tokeniser.word_index) + 1

We also export the dictionary word-to-index to a `json` file, this will be needed in order to convert a text to-be-classified in a format that the model can digest.

In [17]:
# Export the ditionary word-to-index to a json file
import json 
with open( 'word_dict.json' , 'w' ) as file:    
    json.dump(tokeniser.word_index , file )

### Convert data to TensorFlow datasets

We convert data to Tensorflow dataset, in order to feed the Keras model.

In [18]:
train_dataset = tf.data.Dataset.from_tensor_slices((tokenised_text, y))
test_dataset = tf.data.Dataset.from_tensor_slices((tokenised_text_test, y_test))

# Building the model

In [22]:
#@title Model Parameters
#@markdown Here we give a minimal set of parameters for model configuration.

emb_dim = 64 #@param {type:"integer"}
dropout_rate = 0.3#@param {type: "number"}
n_labels = len(labels)

learning_rate = 0.0006#@param {type: "number"}

loss = tf.keras.losses.CategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.CategoricalAccuracy('accuracy')
opt = tf.keras.optimizers.Adam(learning_rate = learning_rate)

We are now ready to build our model, making use of Keras `Sequential` object.

In [23]:
# build the model
keras_model = Sequential()
keras_model.add(Embedding(vocab_size, output_dim = emb_dim, input_length=max_len))
keras_model.add(Dropout(dropout_rate))
keras_model.add(Conv1D(50, 3, activation='relu', padding='same', strides=1))
keras_model.add(MaxPool1D())
keras_model.add(Dropout(dropout_rate))
keras_model.add(Conv1D(100, 3, activation='relu', padding='same', strides=1))
keras_model.add(MaxPool1D())
keras_model.add(Dropout(dropout_rate))
keras_model.add(Conv1D(200, 3, activation='relu', padding='same', strides=1))
keras_model.add(GlobalMaxPool1D())
keras_model.add(Dropout(dropout_rate))
keras_model.add(Dense(100))
keras_model.add(Activation('relu'))
keras_model.add(Dropout(dropout_rate))
keras_model.add(Dense(n_labels))
keras_model.add(Activation('softmax'))
keras_model.compile(loss=loss, metrics=[metric], optimizer=opt)
keras_model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 75, 64)            4079296   
_________________________________________________________________
dropout_5 (Dropout)          (None, 75, 64)            0         
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 75, 50)            9650      
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 37, 50)            0         
_________________________________________________________________
dropout_6 (Dropout)          (None, 37, 50)            0         
_________________________________________________________________
conv1d_4 (Conv1D)            (None, 37, 100)           15100     
_________________________________________________________________
max_pooling1d_3 (MaxPooling1 (None, 18, 100)          

# Training

We are now ready to launch the training, though the `fit` method, we set the numebr of epochs and launch the model.

In [24]:
#@title Model Training
#@markdown We can move the slider to set number of epochs 
#@markdown and give a different batch size.

number_of_epochs = 7 #@param {type: "slider", min: 1, max: 12}
batch_size = 64 #@param ["2", "8", "16", "32", "64", "128", "256", "512"] {type:"raw", allow-input: true}

# train dataset
ds_train_encoded = train_dataset.shuffle(10000).batch(batch_size)

# test dataset
ds_test_encoded = test_dataset.batch(batch_size)

keras_model.fit(ds_train_encoded, epochs=number_of_epochs, validation_data=ds_test_encoded)

Epoch 1/7
Epoch 2/7
Epoch 3/7
Epoch 4/7
Epoch 5/7
Epoch 6/7
Epoch 7/7


<tensorflow.python.keras.callbacks.History at 0x7fe6fab297f0>

## Evaluation

In [25]:
y_pred = to_categorical(np.argmax(keras_model.predict(tokenised_text_test), axis=1))

print(classification_report(y_test, y_pred, target_names=labels.values(), digits=4))

                         precision    recall  f1-score   support

             World News     0.8657    0.8616    0.8636      1900
            Sports News     0.8713    0.8905    0.8808      1900
          Business News     0.9518    0.9763    0.9639      1900
Science-Technology News     0.9285    0.8884    0.9080      1900

              micro avg     0.9042    0.9042    0.9042      7600
              macro avg     0.9043    0.9042    0.9041      7600
           weighted avg     0.9043    0.9042    0.9041      7600
            samples avg     0.9042    0.9042    0.9042      7600



# Predictions

Finally, we can now try to make predictions with our model.

The function `encode` below takes a text stream and returns the sequence of word indices. 

The encoded text is the input of the `predict` method of our model.

In [26]:
def encode(text):
  text = tokeniser.texts_to_sequences(text)
  return pad_sequences(text, maxlen=max_len)

Here some examples. 
Feel free to add as many other sentences as you like.

In [27]:
additional_sentence = "Apple iphone 12 is out!" #@param {type:"string"}

my_sentences = ['President Bush wants the war in Iraq, again', 
                "LeBron James wins the NBA championship with Los Angeles Lakers", 
                "Eni stock action value rise up to 14$",
                "Futures in New York held near $41 a barrel after Saudi Oil Minister Prince Abdulaziz Bin Salman called on the OPEC+ alliance to be proactive in the face of uncertain demand. Yet a draft statement from the meeting made no mention of any changes to the current deal, which calls for production cuts to be eased from January. The market is also looking out for any signs that a stimulus deal can still be agreed in Washington ahead of the election while a resurgence in the pandemic threatens any recovery.",
                additional_sentence
                ]

encode(my_sentences);

`predict` method returns a vector of probablities.

In [28]:
keras_model.predict(encode(my_sentences))

array([[2.4516769e-34, 2.4753018e-34, 9.8879884e-30, 1.0000000e+00],
       [0.0000000e+00, 0.0000000e+00, 1.0000000e+00, 0.0000000e+00],
       [1.0000000e+00, 5.2323827e-27, 1.5954859e-30, 6.1629427e-29],
       [1.0000000e+00, 1.8224776e-11, 7.9000709e-11, 3.5611195e-11],
       [0.0000000e+00, 1.0000000e+00, 0.0000000e+00, 0.0000000e+00]],
      dtype=float32)

In order to get the predicted class, one can call the `argmax` function.

In [29]:
for i, sen in enumerate(my_sentences):
  print(i, sen)
  print(encoded_labels.classes_[np.argmax(keras_model.predict(encode(my_sentences))[i])])

0 President Bush wants the war in Iraq, again
World News
1 LeBron James wins the NBA championship with Los Angeles Lakers
Sports News
2 Eni stock action value rise up to 14$
Business News
3 Futures in New York held near $41 a barrel after Saudi Oil Minister Prince Abdulaziz Bin Salman called on the OPEC+ alliance to be proactive in the face of uncertain demand. Yet a draft statement from the meeting made no mention of any changes to the current deal, which calls for production cuts to be eased from January. The market is also looking out for any signs that a stimulus deal can still be agreed in Washington ahead of the election while a resurgence in the pandemic threatens any recovery.
Business News
4 Apple iphone 12 is out!
Science-Technology News


# Save model and export in JavaScript

In order to convert our model using Tensorflow.js, we have to save the trained model.

In [30]:
#save Keras model
saved_model_path = "modelCNN.h5"

keras_model.save(saved_model_path)

Hence, we are ready to convert the saved model.

In [31]:
%%bash
tensorflowjs_converter --input_format=keras modelCNN.h5 ./model/

2020-10-26 11:53:42.024115: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1


Since we have not only the model, but also weights files, we zip everything to make it ready to download.

In [32]:
%%bash
zip -r model.zip ./model

  adding: model/ (stored 0%)
  adding: model/group1-shard4of4.bin (deflated 8%)
  adding: model/group1-shard2of4.bin (deflated 7%)
  adding: model/model.json (deflated 82%)
  adding: model/group1-shard1of4.bin (deflated 7%)
  adding: model/group1-shard3of4.bin (deflated 7%)
