# Text Classification Using a 1D CNN

This notebook shows how to use a 1D convolutional neural network (CNN) to classify texts. We use the [20 Newsgroups dataset](http://qwone.com/~jason/20Newsgroups/), restricted to the 4 classes in the science category. Read more about this dataset in the [scikit-learn documentation](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html). After classification, we apply several local explanation methods to explain why the CNN predicts a particular class.

In [1]:
import sys
sys.path.append("..")
from analysis import *
from pprint import pprint
import json, csv

Using TensorFlow backend.


## Set up the project

The results from this notebook will be saved at `result_folder + project_name`.

In [2]:
result_folder = "../results/"
project_name = "4Newsgroups"
model_name = 'model2'

## Download word embeddings

In this experiment, we use GloVe (300 dimensions) as a non-trainable embedding matrix. It can be downloaded from [here](https://nlp.stanford.edu/projects/glove/). After downloading, set the path to the embedding file (.txt format) in `analysis/settings.py`.

The function `get_embedding_matrix` takes three input parameters:

- `emb_path`: the path of the embedding to use (set in `analysis/settings.py`)
- `max_len`: the maximum length of input text to be processed. Longer texts are truncated, while shorter texts are padded to `max_len`.
- `pad_initialisation`: one of two values, `uniform` or `zeros`, indicating how to initialise the vector of the PAD token. (The default is `uniform`.)

It returns six outputs in the following order:

- `emb_matrix`: a numpy array of word vectors with shape `(vocab_size, emb_dim)`, such as (400000, 300)
- `vocab_size`: the number of words in the vocabulary
- `emb_dim`: the size of a word vector
- `max_len`: the maximum length of input text to be processed
- `word_index`: a list of all words in the vocabulary
- `word2index`: a dictionary mapping each word to its index

In [3]:
# Download embedding 
emb_matrix, vocab_size, emb_dim, max_len, word_index, word2index = get_embedding_matrix(EMBEDDINGS_PATH['glove-300'], max_len = 150)

Loading Embeddings Model
Done. 400000  words loaded!
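
To make the return values more concrete, here is a minimal sketch of how a GloVe-format text file could be turned into an embedding matrix with a PAD row. It is not the implementation of `get_embedding_matrix`: the helper name `load_glove_sketch`, the `<PAD>` token string, the choice of index 0 for PAD, and the uniform range are all assumptions made for illustration.

```python
import io
import numpy as np

def load_glove_sketch(handle, pad_initialisation="uniform", seed=0):
    """Parse GloVe-format text: one 'word v1 v2 ...' entry per line (hypothetical helper)."""
    words, vectors = [], []
    for line in handle:
        parts = line.rstrip().split(" ")
        words.append(parts[0])
        vectors.append(np.array([float(x) for x in parts[1:]], dtype="float32"))
    emb_dim = len(vectors[0])

    # Initialise the PAD vector; the [-0.25, 0.25] range is an assumption.
    if pad_initialisation == "uniform":
        rng = np.random.default_rng(seed)
        pad_vector = rng.uniform(-0.25, 0.25, emb_dim).astype("float32")
    elif pad_initialisation == "zeros":
        pad_vector = np.zeros(emb_dim, dtype="float32")
    else:
        raise ValueError("pad_initialisation must be 'uniform' or 'zeros'")

    # Assumption: PAD occupies index 0, real words start at index 1.
    word_index = ["<PAD>"] + words
    emb_matrix = np.vstack([pad_vector] + vectors)
    word2index = {word: i for i, word in enumerate(word_index)}
    return emb_matrix, word_index, word2index

# Tiny in-memory stand-in for a GloVe file with 3-dimensional vectors.
toy_glove = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
toy_matrix, toy_word_index, toy_word2index = load_glove_sketch(toy_glove, "zeros")
print(toy_matrix.shape, toy_word2index)
```

Keeping `word_index` and `word2index` in sync with the row order of the matrix is what lets the embedding layer look up a vector directly from a word index.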


## Data preparation
We load the 20 Newsgroups data using the function `load_20newsgroups`, which takes three inputs:

- `ratio`: a list of three numbers [a, b, c] summing to one, where a, b, and c are the proportions of training, validation, and test data, respectively.
- `remove`: a tuple of 20 Newsgroups components to strip from the input texts
- `categories`: a list of classes to download. In this experiment, we choose only 4 classes.

This function returns `target_names`, a list of class labels, together with the input texts and corresponding class labels of the training, validation, and test datasets.

In [4]:
# Download 20Newsgroup data
target_names, text_train, y_train, text_validate, y_validate, text_test, y_test = load_20newsgroups(ratio = [0.6, 0.2, 0.2], remove=('headers', 'footers'), categories = ['sci.crypt', 'sci.electronics', 'sci.med', 'sci.space'])
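
The `ratio` argument can be pictured with the following sketch of a three-way split. The helper `split_by_ratio`, the shuffling, and the rounding rule are assumptions for illustration; `load_20newsgroups` may split the data differently.

```python
import random

def split_by_ratio(items, ratio, seed=0):
    """Shuffle items and cut them into train/validation/test parts (hypothetical helper)."""
    if abs(sum(ratio) - 1.0) > 1e-9:
        raise ValueError("ratio must sum to one")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * ratio[0])
    n_validate = round(len(shuffled) * ratio[1])
    train = shuffled[:n_train]
    validate = shuffled[n_train:n_train + n_validate]
    test = shuffled[n_train + n_validate:]  # the remainder goes to the test set
    return train, validate, test

# 3952 is the total number of examples reported by this notebook (2371 + 790 + 791).
train_ids, validate_ids, test_ids = split_by_ratio(range(3952), [0.6, 0.2, 0.2])
print(len(train_ids), len(validate_ids), len(test_ids))
```

Giving the remainder to the last part guarantees that every example lands in exactly one split even when the proportions do not divide the dataset evenly.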

Then we preprocess the datasets by converting each input text into a vector of word indices using the function `get_data_matrix`. In addition, we convert the class labels into one-hot encoding using the function `to_categorical` from Keras.

In [5]:
# Data preparation
utils.__log__("Start prepare training data")
X_train, y_train = get_data_matrix(text_train, word2index, max_len), y_train
utils.__log__("Start prepare validation data")
X_validate, y_validate = get_data_matrix(text_validate, word2index, max_len), y_validate
utils.__log__("Start prepare testing data")
X_test, y_test = get_data_matrix(text_test, word2index, max_len), y_test
utils.__log__("Finish")
y_train_onehot, y_validate_onehot, y_test_onehot = to_categorical(y_train), to_categorical(y_validate), to_categorical(y_test) 
understand_data(target_names, y_train, y_test, y_validate)

Start prepare training data at 2019-08-22 18:00:50.553996 (from last timestamp 0:01:13.309860 )




Start prepare validation data at 2019-08-22 18:00:56.392229 (from last timestamp 0:00:05.838233 )




Start prepare testing data at 2019-08-22 18:00:58.106238 (from last timestamp 0:00:01.714009 )




Finish at 2019-08-22 18:00:59.856579 (from last timestamp 0:00:01.750341 )
The dataset has 4 classes: ['sci.crypt', 'sci.electronics', 'sci.med', 'sci.space']
Training data has 2371 examples: Counter({2: 616, 3: 610, 1: 586, 0: 559})
Validation data has 790 examples: Counter({0: 214, 1: 202, 3: 198, 2: 176})
Testing data has 791 examples: Counter({0: 218, 2: 198, 1: 196, 3: 179})
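
As a rough illustration of the preprocessing above, the sketch below maps texts to fixed-length index vectors and builds one-hot labels with numpy. The helper names, the whitespace tokenisation, and the PAD/UNK indices are assumptions; `get_data_matrix` and `to_categorical` may behave differently in detail.

```python
import numpy as np

def texts_to_index_matrix(texts, word2index, max_len, pad_id=0, unk_id=1):
    """Convert texts to a (num_texts, max_len) matrix of word indices (hypothetical helper)."""
    matrix = np.full((len(texts), max_len), pad_id, dtype="int64")
    for row, text in enumerate(texts):
        tokens = text.lower().split()[:max_len]        # truncate long texts
        ids = [word2index.get(tok, unk_id) for tok in tokens]
        matrix[row, :len(ids)] = ids                    # short texts stay padded
    return matrix

def one_hot(labels, num_classes):
    """Return one row per label with a single 1 at the label's position."""
    return np.eye(num_classes, dtype="float32")[np.asarray(labels)]

toy_vocab = {"<PAD>": 0, "<UNK>": 1, "rocket": 2, "launch": 3}
X_toy = texts_to_index_matrix(["Rocket launch today", "launch"], toy_vocab, max_len=4)
Y_toy = one_hot([3, 0], num_classes=4)
print(X_toy)
print(Y_toy)
```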


## Model creation, training, and testing

We create a model using the class `CNNModel`, which receives several parameters. Some of them come from the `get_embedding_matrix` function. In the code below:

- We use three filter sizes [2, 3, 4], each with 50 filters.
- All filters use the `relu` activation function.
- For the classification part, we have one hidden layer with 150 nodes (relu activation) followed by the final output layer (softmax activation).
- The sixth parameter, False, means that we freeze the embedding weights. Changing it to True allows training to adjust the word embedding weights.

In [6]:
cnn_model = CNNModel(vocab_size, word_index, word2index, emb_dim, emb_matrix, False, max_len, target_names, \
                     filters = [(2, 50), (3, 50), (4, 50)], \
                     filter_activations = 'relu', \
                     dense = [150, len(target_names)], \
                     dense_activations = ['relu', 'softmax'])

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, None)         0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 150, 300)     120000600   input_1[0][0]                    
__________________________________________________________________________________________________
conv1d_1 (Conv1D)               (None, 149, 50)      30050       embedding_1[0][0]                
__________________________________________________________________________________________________
conv1d_2 (Conv1D)               (None, 148, 50)      45050       embedding_1[0][0]                
__________________________________________________________________________________________________
conv1d_3 (
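
The output shapes and parameter counts in the model summaries of this notebook can be checked by hand. The sketch below assumes stride 1 and no padding for each `Conv1D` layer, which is consistent with the reported shapes; the helper `conv1d_summary` is introduced here only for the arithmetic.

```python
def conv1d_summary(seq_len, emb_dim, kernel_size, n_filters):
    """Return (output length, parameter count) of a 1D convolution, stride 1, no padding."""
    out_len = seq_len - kernel_size + 1
    n_params = kernel_size * emb_dim * n_filters + n_filters  # weights + one bias per filter
    return out_len, n_params

# Filter configuration used in this notebook: sizes 2, 3, 4 with 50 filters each.
for kernel_size, n_filters in [(2, 50), (3, 50), (4, 50)]:
    out_len, n_params = conv1d_summary(150, 300, kernel_size, n_filters)
    print(f"kernel={kernel_size}: output (None, {out_len}, {n_filters}), params={n_params}")
```

The same arithmetic applied to the embedding layer gives 120000600 / 300 = 400002 rows, two more than the 400000 words loaded, presumably to account for special tokens such as PAD.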

Then we train the model using the (preprocessed) training and validation data.

In [7]:
project_path = result_folder + project_name
cnn_model.train(project_path, model_name, X_train, y_train_onehot, X_validate, y_validate_onehot)

Model training ... at 2019-08-22 18:01:54.567641 (from last timestamp 0:00:54.711062 )


Epoch 00001: val_loss improved from inf to 0.75839, saving model to ../results/4Newsgroups/model2.h5

Epoch 00002: val_loss improved from 0.75839 to 0.40800, saving model to ../results/4Newsgroups/model2.h5

Epoch 00003: val_loss improved from 0.40800 to 0.34723, saving model to ../results/4Newsgroups/model2.h5

Epoch 00004: val_loss improved from 0.34723 to 0.27213, saving model to ../results/4Newsgroups/model2.h5

Epoch 00005: val_loss improved from 0.27213 to 0.26422, saving model to ../results/4Newsgroups/model2.h5

Epoch 00006: val_loss improved from 0.26422 to 0.24120, saving model to ../results/4Newsgroups/model2.h5

Epoch 00007: val_loss improved from 0.24120 to 0.23920, saving model to ../results/4Newsgroups/model2.h5

Epoch 00008: val_loss improved from 0.23920 to 0.23256, saving model to ../results/4Newsgroups/model2.h5

Epoch 00009: val_loss did not improve from 0.23256

Epoch 00010: val_loss improved from 0.23256 to 0.23191, saving model to ../results/4Newsgroups/model2.h5

Epoch 00011: val_loss did not improve from 0.23191

Epoch 00012: val_loss did not improve from 0.23191

Epoch 00013: val_loss did not improve from 0.23191

Preparing for further analysis ... at 2019-08-22 18:04:02.525872 (from last timestamp 0:02:07.958231 )
Feature extraction model:
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
text_indexes (InputLayer)       (None, None)         0                                            
__________________________________________________________________________________________________
embedding_2 (Embedding)         (None, 150, 300)     120000600   text_indexes[0][0]               
__________________________________________________________________________________________________
conv1d_4 (Conv1D)               (None, 149, 50)      30050       embedding_2[0][0]                
__________________________________________________________________________________________________
conv1d_5 (Conv1D)         



Partial model starting from embedded text matrix:
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
embedded_text_input_383 (InputL (None, 150, 300)     0                                            
__________________________________________________________________________________________________
conv1d_7 (Conv1D)               (None, 149, 50)      30050       embedded_text_input_383[0][0]    
__________________________________________________________________________________________________
conv1d_8 (Conv1D)               (None, 148, 50)      45050       embedded_text_input_383[0][0]    
__________________________________________________________________________________________________
conv1d_9 (Conv1D)               (None, 147, 50)      60050       embedded_text_input_383[0][0]    
___________________________________________________________

Done at 2019-08-22 18:04:10.412133 (from last timestamp 0:00:00.195527 )

Creating decision trees ...

Pruning decision trees ...


Done at 2019-08-22 18:04:12.404775 (from last timestamp 0:00:01.992642 )


<keras.callbacks.History at 0x1d6b2eb9390>

Finally, we test the trained model using the test dataset and report the classification performance.

In [11]:
prediction_test = cnn_model.predict(X_test, batch_size = 128)
print(classification_report(y_test, prediction_test, target_names=target_names))

                 precision    recall  f1-score   support

      sci.crypt       0.97      0.89      0.93       218
sci.electronics       0.87      0.92      0.89       196
        sci.med       0.94      0.95      0.94       198
      sci.space       0.93      0.94      0.94       179

       accuracy                           0.93       791
      macro avg       0.93      0.93      0.93       791
   weighted avg       0.93      0.93      0.93       791
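
For readers unfamiliar with the columns of the report, the sketch below computes precision, recall, and F1 for one class from raw label lists. The toy labels are made up for illustration and are unrelated to the results above; the helper `per_class_scores` is hypothetical.

```python
from collections import Counter

def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, f1) for a single class label."""
    pairs = Counter(zip(y_true, y_pred))
    tp = pairs[(label, label)]
    fp = sum(n for (t, p), n in pairs.items() if p == label and t != label)
    fn = sum(n for (t, p), n in pairs.items() if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

toy_true = [0, 0, 0, 1, 1, 2]
toy_pred = [0, 0, 1, 1, 1, 2]
scores_class0 = per_class_scores(toy_true, toy_pred, label=0)
scores_class1 = per_class_scores(toy_true, toy_pred, label=1)
print(scores_class0, scores_class1)
```

The `macro avg` row of the report is the unweighted mean of these per-class values, while `weighted avg` weights each class by its support.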



## Explanation examples

First, we select one input text from the test dataset to be an example.

In [38]:
index = 80
input_text, actual_class = text_test[index], y_test[index]
print(f"Input text: {input_text}")

Input text: In article <1993Apr15.160415.8559@magnus.acs.ohio-state.edu> ashall@magnus.acs.ohio-state.edu (Andrew S Hall) writes:
>I am postive someone will correct me if I am wrong, but doesn't the Fifth
>also cover not being forced to do actions that are self-incriminating?
>e.g. The police couldn't demand that you silently take them to where the
>body is buried or where the money is hidden.

But they can make you piss in a jar, and possibly provide DNA, semen,
and hair samples or to undergo tests for gunpowder residues on your hand.

(BTW, that was why the chemical engineer arrested in the WTC explosion
thrust his hands into a toilet filled with urine as the cops were breaking
down the door -- the nitrogen in the urine would mask any residue from
explosives.  I found it interesting the news reported his acts, but not
his reasons).

Somewhere, perhaps a privacy group, they discussed the legal ramifications
of using a password like

  I shot Jimmy Hoffa and his body is in a storage lo

Second, we convert the input text into an array of word indices and then predict using the trained CNN.

In [39]:
X_input = utils.get_data_matrix([input_text], word2index, cnn_model.max_len, use_tqdm = False)
predicted_class = cnn_model.predict(X_input)
print(f"The predicted class is {target_names[predicted_class]} (class_id = {predicted_class})")
print(f"The actual class is {target_names[actual_class]} (class_id = {actual_class})")

The predicted class is sci.med (class_id = 2)
The actual class is sci.crypt (class_id = 0)


Then, we explain the prediction using several local explanation techniques.

- To use Grad-CAM-Text, call `explain_prediction_heatmap`. Setting `is_support` to False reports counter-evidence against the predicted class instead of evidence for it.

In [40]:
explain_prediction_heatmap(cnn_model, input_text, actual_class, is_support = True)

Input text: In article <1993Apr15.160415.8559@magnus.acs.ohio-state.edu> ashall@magnus.acs.ohio-state.edu (Andrew S Hall) writes:
>I am postive someone will correct me if I am wrong, but doesn't the Fifth
>also cover not being forced to do actions that are self-incriminating?
>e.g. The police couldn't demand that you silently take them to where the
>body is buried or where the money is hidden.

But they can make you piss in a jar, and possibly provide DNA, semen,
and hair samples or to undergo tests for gunpowder residues on your hand.

(BTW, that was why the chemical engineer arrested in the WTC explosion
thrust his hands into a toilet filled with urine as the cops were breaking
down the door -- the nitrogen in the urine would mask any residue from
explosives.  I found it interesting the news reported his acts, but not
his reasons).

Somewhere, perhaps a privacy group, they discussed the legal ramifications
of using a password like

  I shot Jimmy Hoffa and his body is in a storage lo

----------------------------------------------------------------
Non-overlapping ngrams evidence:
to undergo tests for (location: [93, 94, 95, 96])
filled with urine as (location: [123, 124, 125, 126])
explosion thrust his hands (location: [116, 117, 118, 119])
provide dna , semen (location: [84, 85, 86, 87])
gunpowder residues on your (location: [97, 98, 99, 100])


[('to undergo tests for', [93, 94, 95, 96], 3.061652251460805),
 ('filled with urine as', [123, 124, 125, 126], 0.9063269341561986),
 ('explosion thrust his hands', [116, 117, 118, 119], 0.5732848181139784),
 ('provide dna , semen', [84, 85, 86, 87], 0.5532439973400114),
 ('gunpowder residues on your', [97, 98, 99, 100], 0.5313845255337272)]
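
The "non-overlapping n-grams" in the output can be understood as a selection step over scored n-grams. The sketch below shows one plausible greedy way to do it; this is an assumption for illustration, not the code behind `explain_prediction_heatmap`, and the overlapping candidate with its score is made up.

```python
def top_non_overlapping(scored_ngrams, k):
    """Greedily keep the k highest-scoring n-grams whose positions do not overlap."""
    chosen, used_positions = [], set()
    ranked = sorted(scored_ngrams, key=lambda item: item[2], reverse=True)
    for text, positions, score in ranked:
        if len(chosen) >= k:
            break
        if used_positions.isdisjoint(positions):
            chosen.append((text, positions, score))
            used_positions.update(positions)
    return chosen

candidates = [
    ("to undergo tests for", [93, 94, 95, 96], 3.06),
    ("or to undergo tests", [92, 93, 94, 95], 2.50),   # hypothetical overlapping candidate
    ("filled with urine as", [123, 124, 125, 126], 0.91),
]
selected = top_non_overlapping(candidates, k=2)
print([text for text, _, _ in selected])
```

Skipping overlapping candidates keeps the evidence list from repeating nearly the same phrase several times.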

- To use LIME, follow the code below.

In [41]:
explainer = LimeTextExplainer(class_names=target_names)
exp = explainer.explain_instance(input_text, cnn_model.text2proba, num_features=10, labels=[int(predicted_class)])
print(exp.as_list(label=int(predicted_class)))


[('undergo', 0.31298029136070904), ('password', -0.23383510463571303), ('urine', 0.17105287549307016), ('tests', 0.1705421175596789), ('residues', 0.10084355160438846), ('semen', 0.09634871273170983), ('privacy', -0.09217619132801623), ('hair', 0.08015507411512592), ('to', -0.053246490847526515), ('for', -0.02716605749078148)]
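
The list returned by `exp.as_list` pairs each word with a signed weight for the explained label (here the predicted class, sci.med): positive weights support that label and negative weights count against it. A small sketch that separates the two groups, using the values printed above rounded to four decimals:

```python
# Weights copied from the LIME output above, rounded to four decimals.
lime_weights = [
    ("undergo", 0.3130), ("password", -0.2338), ("urine", 0.1711),
    ("tests", 0.1705), ("residues", 0.1008), ("semen", 0.0963),
    ("privacy", -0.0922), ("hair", 0.0802), ("to", -0.0532), ("for", -0.0272),
]

# Words pushing towards the explained label, strongest first.
supporting = sorted((item for item in lime_weights if item[1] > 0),
                    key=lambda item: item[1], reverse=True)
# Words pushing away from the explained label, strongest first.
opposing = sorted((item for item in lime_weights if item[1] < 0),
                  key=lambda item: item[1])

print("supporting:", [word for word, _ in supporting])
print("opposing:", [word for word, _ in opposing])
```

Here the strongest opposing words, "password" and "privacy", are the ones that would be expected to point towards the actual class, sci.crypt.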


- To use LRP-epsilon, call `explain_example_innvestigate` and specify `lrp.epsilon` as the method. To use DeepLIFT, change `lrp.epsilon` to `deep_lift.wrapper`. `explain_level` can be either `word` or `ngram`.

In [42]:
explain_example_innvestigate(cnn_model, input_text, 'lrp.epsilon', explain_level = "word", is_support = True)

Input text: In article <1993Apr15.160415.8559@magnus.acs.ohio-state.edu> ashall@magnus.acs.ohio-state.edu (Andrew S Hall) writes:
>I am postive someone will correct me if I am wrong, but doesn't the Fifth
>also cover not being forced to do actions that are self-incriminating?
>e.g. The police couldn't demand that you silently take them to where the
>body is buried or where the money is hidden.

But they can make you piss in a jar, and possibly provide DNA, semen,
and hair samples or to undergo tests for gunpowder residues on your hand.

(BTW, that was why the chemical engineer arrested in the WTC explosion
thrust his hands into a toilet filled with urine as the cops were breaking
down the door -- the nitrogen in the urine would mask any residue from
explosives.  I found it interesting the news reported his acts, but not
his reasons).

Somewhere, perhaps a privacy group, they discussed the legal ramifications
of using a password like

  I shot Jimmy Hoffa and his body is in a storage lo

----------------------------------------------------------------
Non-overlapping ngrams evidence:
nitrogen (location: [136])
i (location: [14])
undergo (location: [94])
wtc (location: [115])
> (location: [13])


[('nitrogen', [136], 0.42666155),
 ('i', [14], 0.30809563),
 ('undergo', [94], 0.21874759),
 ('wtc', [115], 0.21835871),
 ('>', [13], 0.18928789)]
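
To give an idea of what LRP-epsilon computes, the sketch below applies the epsilon rule to a single bias-free dense layer in numpy. It is a simplified illustration with made-up weights, not the iNNvestigate implementation used by `explain_example_innvestigate`.

```python
import numpy as np

def lrp_epsilon_dense(activations, weights, relevance_out, epsilon=1e-6):
    """Redistribute output relevance to the inputs of a bias-free dense layer.

    activations: shape (n_in,), weights: shape (n_in, n_out),
    relevance_out: shape (n_out,). Returns relevance of shape (n_in,).
    """
    z = activations @ weights                              # pre-activations, (n_out,)
    stabilised = z + epsilon * np.where(z >= 0, 1.0, -1.0)  # avoid division by zero
    s = relevance_out / stabilised
    return activations * (weights @ s)

a = np.array([1.0, 2.0, 0.5])
W = np.array([[0.2, -0.1],
              [0.4, 0.3],
              [-0.6, 0.8]])
R_out = np.array([0.0, 1.0])   # explain the second output neuron only
R_in = lrp_epsilon_dense(a, W, R_out)
print(R_in, R_in.sum())
```

With a small epsilon and no bias, the input relevances sum to almost exactly the output relevance, which is the conservation property that makes the per-word scores comparable.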

- To use decision trees, use the following code.

In [43]:
explain_prediction_global(cnn_model, cnn_model.pruned_tree_list, input_text, print_results = False)

Unnamed: 0,Filter ID,Class identity,Class name,N-grams,Positions
0,79,2,sci.med,"dna , semen","[[85, 86, 87]]"
1,123,2,sci.med,or to undergo tests,"[[92, 93, 94, 95]]"
2,134,2,sci.med,or to undergo tests,"[[92, 93, 94, 95]]"
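
If the table above is exported as CSV text, it can be post-processed with the standard library, for example to see which filters matched each n-gram. This is only a sketch: whether `explain_prediction_global` returns a DataFrame or another structure is not shown here, so the CSV form is an assumption.

```python
import csv
import io
import json

# The rows printed above, reproduced as CSV text.
raw_table = '''Unnamed: 0,Filter ID,Class identity,Class name,N-grams,Positions
0,79,2,sci.med,"dna , semen","[[85, 86, 87]]"
1,123,2,sci.med,or to undergo tests,"[[92, 93, 94, 95]]"
2,134,2,sci.med,or to undergo tests,"[[92, 93, 94, 95]]"
'''

filters_by_ngram = {}
positions_by_ngram = {}
for row in csv.DictReader(io.StringIO(raw_table)):
    ngram = row["N-grams"]
    filters_by_ngram.setdefault(ngram, []).append(int(row["Filter ID"]))
    positions_by_ngram[ngram] = json.loads(row["Positions"])  # list of position lists

print(filters_by_ngram)
print(positions_by_ngram)
```

Grouping by n-gram shows that two different filters (123 and 134) fire on the same phrase, "or to undergo tests".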


- To draw the generated decision trees, use the following code. The results are saved in the project path folder.

In [44]:
draw_tree_list(cnn_model.pruned_tree_list, cnn_model, folder = project_path)



