### Layers

We start by importing the layers we need:

- **LSTM (Long Short-Term Memory) layer.** A recurrent layer that learns features from sequential data:
<img width="550" src="./LSTM-model.png">

    In our model we are only interested in the last hidden state $h_{699}$. Occasionally we also build models that consume all hidden states ($h_0, \dots, h_{699}$), for example attention models for machine translation or automatic data correction (sequence-to-sequence models of this kind are commonly trained with **teacher forcing**).
    

- **ReLU activation layer.** One of the most common layers for adding non-linearity to the network. Without activations, a stack of layers collapses into a single affine transform and is no more expressive than a single-layer network.



- **Softmax activation layer.** The usual top layer of a network for multi-class classification. It turns the final scores into a probability distribution, and combined with the cross-entropy loss its gradient has a simple closed form (predicted probabilities minus the one-hot label), which keeps ***back-propagation*** into the previous layer simple.


- **(Word) Embedding layer.** We will initialize our own embedding layer with pretrained weights (a form of **transfer learning**) instead of learning an embedding from scratch in our network.


- **BatchNormalization layer.** Empirically, it can improve numerical stability during training.
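
The remark about softmax and back-propagation can be checked numerically. The sketch below uses plain Python rather than TensorFlow, and the helper names (`softmax`, `cross_entropy`, `analytic_gradient`, `numeric_gradient`) are illustrative, not part of this notebook. It compares the closed-form gradient of the cross-entropy loss with respect to the logits, softmax output minus the one-hot label, against a finite-difference estimate.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, one_hot):
    # Categorical cross-entropy between softmax(logits) and a one-hot target.
    probs = softmax(logits)
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs))

def analytic_gradient(logits, one_hot):
    # Closed form: d(loss)/d(logit_i) = softmax(logits)_i - y_i
    return [p - y for p, y in zip(softmax(logits), one_hot)]

def numeric_gradient(logits, one_hot, eps=1e-6):
    # Central finite differences, used only to check the closed form.
    grad = []
    for i in range(len(logits)):
        up = list(logits)
        up[i] += eps
        down = list(logits)
        down[i] -= eps
        diff = cross_entropy(up, one_hot) - cross_entropy(down, one_hot)
        grad.append(diff / (2 * eps))
    return grad

logits = [2.0, -1.0, 0.5, 0.0, 1.5]   # made-up scores, one per class
target = [0, 0, 1, 0, 0]              # one-hot label
print(analytic_gradient(logits, target))
print(numeric_gradient(logits, target))
```

The two printed lists should agree to several decimal places, which is why pairing softmax with cross-entropy makes the top-layer gradient cheap to compute.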
 

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pickle
import random

from tensorflow.keras.layers import LSTM, Activation, Dense, Dropout, Input, BatchNormalization, Embedding
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.models import load_model

from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split

import os
import string
import re

os.environ["TF_FORCE_GPU_ALLOW_GROWTH"]="true"

### Define Constants

In [2]:
bbc_business = os.path.sep.join(["data_set","bbc","business"])
bbc = os.path.sep.join(["data_set","bbc"]) 
WORD_EMBEDDING_DIMENSION = 50
MAX_VOCAB_SIZE = 10000
MAX_LENGTH = 500
NUM_CLASSES = 5

### Prepare our Training Data as a Python List

In [55]:
contents = []
labels = []

def remove_punctuation(paragraph):
    for punc in string.punctuation:
        paragraph = paragraph.replace(punc,"")
    return paragraph

def preprocess_data(folder_path):
    for dir_path, dir_names, file_names in os.walk(folder_path):
        # Skip the root folder itself; only its category subfolders hold articles.
        if dir_path != folder_path:
            print(f"{len(file_names)} files in {dir_path} have been loaded")
            for file_name in file_names:
                file_path = os.path.sep.join([dir_path, file_name])
                category = file_path.split(os.path.sep)[-2]
                with open(file_path, "r", encoding="ISO-8859-1") as f:
                    content = f.read().strip()
                    content = remove_punctuation(content)
                    content = re.sub(r"(\n)+", " ", content)
                    content = content.lower()
                    
                    contents.append(content)
                    labels.append(category)
                    
preprocess_data(bbc)

510 files in data_set\bbc\business have been loaded
386 files in data_set\bbc\entertainment have been loaded
417 files in data_set\bbc\politics have been loaded
511 files in data_set\bbc\sport have been loaded
401 files in data_set\bbc\tech have been loaded


### Understand a bit more About our Dataset

In [5]:
print(f"we have total of {len(contents)} training data")

nums=np.array([len(content.split()) for content in contents])

max_num_of_words = np.max(nums)
min_num_of_words = np.min(nums)
total_num_of_words = np.sum(nums)
average_num_of_words = total_num_of_words//len(contents)

for threshold in np.arange(500, 1601, 100):
    print(f"{len([num for num in nums if num < threshold])} of paragraph has number of words less than {threshold}")

print(f"max number of words: {max_num_of_words}")
print(f"min number of words: {min_num_of_words}")
print(f"number of words: {total_num_of_words}")
print(f"average number of words: {average_num_of_words}")

we have total of 2225 training data
1764 of paragraph has number of words less than 500
1982 of paragraph has number of words less than 600
2095 of paragraph has number of words less than 700
2146 of paragraph has number of words less than 800
2191 of paragraph has number of words less than 900
2203 of paragraph has number of words less than 1000
2209 of paragraph has number of words less than 1100
2210 of paragraph has number of words less than 1200
2212 of paragraph has number of words less than 1300
2216 of paragraph has number of words less than 1400
2216 of paragraph has number of words less than 1500
2217 of paragraph has number of words less than 1600
max number of words: 4416
min number of words: 89
number of words: 851028
average number of words: 382
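
These counts help when choosing a maximum sequence length. As a small worked example, the sketch below turns the counts printed above into coverage fractions; `TOTAL_DOCS`, `docs_below` and `coverage` are names introduced here for illustration only.

```python
# Counts copied from the output above: documents with fewer than N words.
TOTAL_DOCS = 2225
docs_below = {500: 1764, 600: 1982, 700: 2095, 800: 2146, 900: 2191, 1000: 2203}

def coverage(threshold):
    # Fraction of documents whose word count is below `threshold`.
    return docs_below[threshold] / TOTAL_DOCS

for threshold in sorted(docs_below):
    print(f"length {threshold}: {coverage(threshold):.1%} of documents fit without truncation")
```

With a length of 500, roughly four out of five articles fit in full; the remaining ones are truncated.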


### Split our Dataset into Training and Testing/Validation Sets

The validation dataset is used to test whether our model can "generalize" to data it has never seen. A trained model can perform very well on the training dataset but poorly on new data. This phenomenon is called **over-fitting**.

In [6]:
X_train, X_test, Y_train, Y_test = train_test_split(contents, labels, test_size=0.15)
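
The split above is random and unseeded, so the class proportions and the exact test set change from run to run. One option is a stratified, seeded split; scikit-learn's `train_test_split` accepts `stratify=` and `random_state=` for this. The plain-Python sketch below only illustrates the idea; `stratified_split` is a hypothetical helper, not what this notebook uses.

```python
import random
from collections import Counter, defaultdict

def stratified_split(items, labels, test_fraction=0.15, seed=0):
    # Group sample indices by label so each class is split separately.
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)  # fixed seed -> reproducible split
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    def pick(idx, seq):
        return [seq[i] for i in idx]

    return (pick(train_idx, items), pick(test_idx, items),
            pick(train_idx, labels), pick(test_idx, labels))

# Toy data: 40 "sport" and 20 "tech" documents.
toy_labels = ["sport"] * 40 + ["tech"] * 20
toy_items = [f"doc {i}" for i in range(60)]
X_tr, X_te, y_tr, y_te = stratified_split(toy_items, toy_labels)
print(Counter(y_tr), Counter(y_te))
```

Each class contributes the same fraction to the test set, so a rare category cannot end up missing from validation by chance.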

### Transform Dataset to fit the Model
Since the top layer of our network, the softmax activation layer, takes a batch of data and returns a batch of probability vectors, we encode each label (a string) as a one-hot array, i.e. an array with exactly one non-zero entry. During training the model compares its output with this true answer and adjusts its weights.

In [7]:
labelBinarizer = LabelBinarizer()
Y_train = labelBinarizer.fit_transform(Y_train)
Y_test = labelBinarizer.transform(Y_test)
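
To make the encoding concrete, here is a minimal plain-Python sketch of what a one-hot label encoder does with more than two classes: sort the distinct labels, then emit one row per label with a single 1. The helpers `fit_classes`, `one_hot` and `decode` are illustrative stand-ins, not the scikit-learn implementation.

```python
def fit_classes(labels):
    # Sorted unique labels; the position in this list is the class index.
    return sorted(set(labels))

def one_hot(labels, classes):
    # One row per label, with a single 1 at the label's class index.
    index_of = {label: i for i, label in enumerate(classes)}
    rows = []
    for label in labels:
        row = [0] * len(classes)
        row[index_of[label]] = 1
        rows.append(row)
    return rows

def decode(row, classes):
    # Inverse step used at prediction time: take the position of the largest entry.
    return classes[max(range(len(row)), key=row.__getitem__)]

classes = fit_classes(["sport", "tech", "business", "politics", "entertainment", "tech"])
encoded = one_hot(["tech", "business"], classes)
print(classes)
print(encoded)
print(decode([0.1, 0.05, 0.05, 0.7, 0.1], classes))
```

The `decode` step is the same argmax lookup that turns a softmax output back into a category name.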

### Tokenizer and Embedding Layer

We will not feed the model the raw string of a whole paragraph. Each word in a paragraph carries meaning, so we transform each word into a vector whose components encode that meaning. What we will do:

$$\text{word} \mapsto  \underbrace{\overbrace{\text{word_to_index}}^{\large \text{tokenizer}}(\text{word})}_{\Large \in \, \mathbb N} \mapsto  \underbrace{\text{embedding}(\text{word_to_index}(\text{word}))}_{\Large \in \,\mathbb R^{50}} $$

We will define an embedding matrix that maps an integer to a vector (array) of length 50. The term **matrix** here just means a NumPy array used as a lookup table (row $i$ is the vector for index $i$); it is not meant as a linear map that turns column vectors into other column vectors.

In [8]:
tokenizer = Tokenizer(num_words=MAX_VOCAB_SIZE)

tokenizer.fit_on_texts(X_train)
training_word_to_index = tokenizer.word_index

sequences_train = tokenizer.texts_to_sequences(X_train)
sequences_test =  tokenizer.texts_to_sequences(X_test)

X_train = np.array(sequence.pad_sequences(sequences_train, maxlen=MAX_LENGTH, padding='post'))
X_test = np.array(sequence.pad_sequences(sequences_test, maxlen=MAX_LENGTH, padding='post'))
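
The word-to-index and padding steps can be sketched in plain Python. This is a simplified stand-in for the Keras utilities, with hypothetical helper names (`build_word_index`, `texts_to_sequences`, `pad_post`). One detail worth knowing: `pad_sequences` truncates from the front by default (`truncating='pre'`), so with `padding='post'` a sequence longer than the maximum keeps its last tokens. The sketch mirrors that behaviour.

```python
from collections import Counter

def build_word_index(texts, num_words):
    # Rank words by frequency; index 1 is the most frequent word, 0 is reserved for padding.
    counts = Counter(word for text in texts for word in text.split())
    ranked = [word for word, _ in counts.most_common()]
    # Simplification: keep only indices below num_words (a vocabulary cap).
    return {word: i for i, word in enumerate(ranked, start=1) if i < num_words}

def texts_to_sequences(texts, word_index):
    # Words outside the vocabulary are skipped rather than mapped to a placeholder.
    return [[word_index[w] for w in text.split() if w in word_index] for text in texts]

def pad_post(seq, maxlen):
    # Keep the last `maxlen` tokens of long sequences, then right-pad short ones with 0.
    # Assumes maxlen >= 1.
    seq = seq[-maxlen:]
    return seq + [0] * (maxlen - len(seq))

texts = ["the cat sat on the mat", "the dog sat"]
word_index = build_word_index(texts, num_words=5)
sequences = texts_to_sequences(texts, word_index)
padded = [pad_post(seq, maxlen=4) for seq in sequences]
print(word_index)
print(sequences)
print(padded)
```

If the opening of an article matters more than its ending, passing `truncating='post'` to `pad_sequences` keeps the first tokens instead.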

### Create Embedding Layer Using Pretrained Weight

Note that our matrix is built only from the training set; we are not interested in words that fall outside our training data.

In [9]:
def define_embedding_layer():
    print("Loading word vectors...")
    word_to_vec = {}
    embedding_file_path = os.path.sep.join(["word_embedding", "glove.6B.{}d.txt".format(WORD_EMBEDDING_DIMENSION)])
    with open(embedding_file_path, encoding="utf8") as f:
        for line in f:
            values = line.split()
            word = values[0]
            vec = np.array(values[1:], dtype="float32")
            word_to_vec[word] = vec
    
    vocab_size = max(
        MAX_VOCAB_SIZE,
        len(training_word_to_index) + 1
    )
    embedding_matrix = np.zeros((vocab_size, WORD_EMBEDDING_DIMENSION))

    # for embedding matrix, we are just interested in words in our training set:
    for word, index in training_word_to_index.items():
        word_vec = word_to_vec.get(word)
        if word_vec is not None:
            embedding_matrix[index] = word_vec

    training_word_embedding_layer = Embedding(
        vocab_size,
        WORD_EMBEDDING_DIMENSION,
        weights=[embedding_matrix],
        input_length=MAX_LENGTH
    )
    
    print("Done!")
    return training_word_embedding_layer

training_word_embedding_layer = define_embedding_layer()

Loading word vectors...
Done!
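
The lookup that the embedding layer performs can be illustrated without TensorFlow. The sketch below uses made-up 3-dimensional vectors in place of the 50-dimensional GloVe file, and the names `parse_glove_line`, `build_embedding_matrix` and the sample words are assumptions for illustration. It shows that row 0 (padding) and words missing from the pretrained file stay as zero vectors.

```python
def parse_glove_line(line):
    # Each line is a word followed by its vector components, separated by spaces.
    parts = line.split()
    return parts[0], [float(x) for x in parts[1:]]

def build_embedding_matrix(word_index, word_to_vec, dim):
    # Row i holds the vector of the word with index i; row 0 (padding) stays zero.
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, index in word_index.items():
        vec = word_to_vec.get(word)
        if vec is not None:  # words missing from the pretrained file keep a zero row
            matrix[index] = vec
    return matrix

# Made-up 3-dimensional vectors standing in for the pretrained file.
fake_glove_lines = ["market 0.1 0.2 0.3", "goal -0.5 0.0 0.5"]
word_to_vec = dict(parse_glove_line(line) for line in fake_glove_lines)
word_index = {"market": 1, "goal": 2, "zuber": 3}  # hypothetical tokenizer output

matrix = build_embedding_matrix(word_index, word_to_vec, dim=3)
sequence_of_ids = [2, 1, 3, 0]
embedded = [matrix[i] for i in sequence_of_ids]  # the lookup an embedding layer performs
print(embedded)
```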


### Example of Transformed Training data that will be fed into LSTM Model

In [10]:
print(X_train[0])

[3038 1215 1037    6 1849  812 1850 3038 2124    5 1037 2037  520    6
 1849    6   30   64 1851 9161  133   15  151 2216 2572 1110   20    1
  999 1538 3038  126  379    3 5593 3039 1069    2  138    1  670   20
    1 2936  566  212   15  124 4004   25    5  884  235    6    1  155
  670    2  521 6869 4336    4 3310   30  297    3    4 1199   41  109
  737    3  446 5951 2037   76  216    3   25 1852   28    5   92  394
  216    2  794   24   27  368  633 6349   92  109   11    1   76 1183
  683   43  748    2   88 1784    2    1  459   32  411   12   13  350
    2  183   54    4   86  663   27  225 3197  648  117   62  670   13
    1  664  262    4   50   13    5  292    3 8239   32   13   81   72
   27  503    2   44    4   88   70  670 3311  228   92 4758   48    1
  107 1111    3  180 1037  664  683 3844   98   64 1936   10  126 8240
  133  514 2866    4 2525 1110    6  999   16    5  493   50   13    5
  237  520    7  256   63 5594   54 2526    4 9162    2  138    1 1335
 2037 

### Define Model

Based on the size of the training data and after some numerical experiments, we settled on the following structure. If more data becomes available in the future, we will need to adjust and retrain the model to take advantage of it.

In [11]:
def build_model():
    inputs = Input(name='inputs',shape=(MAX_LENGTH,))
    x = training_word_embedding_layer(inputs)
    # x = LSTM(128, return_sequences=True)(x)
    x = LSTM(128)(x)
    x = Dense(256,name='FC1')(x)
    x = Activation('relu')(x)
    x = BatchNormalization()(x)
    x = Dropout(0.5)(x)
    
#     x = LSTM(128)(x)
#     x = Dense(256,name='FC2')(x)
#     x = Activation('relu')(x)
#     x = BatchNormalization()(x)
#     x = Dropout(0.5)(x) 

    x = Dense(NUM_CLASSES, name='out_layer')(x)
    x = Activation('softmax')(x)
    model = Model(inputs=inputs, outputs=x)
    return model

model = build_model()
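
As a rough size check, the parameter counts of the layers above can be worked out by hand. The sketch assumes standard Keras defaults (an LSTM with bias and four gates, batch normalization with four vectors per feature); the helper names are introduced here for illustration. The embedding layer is left out of the total because its size depends on the vocabulary found in the training data, though it holds at least 10000 × 50 weights.

```python
def embedding_params(vocab_size, dim):
    # One vector of length `dim` per vocabulary index.
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias vector.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus bias vector.
    return input_dim * units + units

def batch_norm_params(features):
    # gamma and beta are trainable; moving mean and variance are not.
    return 4 * features

layer_params = {
    "LSTM(128)": lstm_params(50, 128),
    "FC1": dense_params(128, 256),
    "BatchNormalization": batch_norm_params(256),
    "out_layer": dense_params(256, 5),
}
for name, count in layer_params.items():
    print(f"{name}: {count}")
print("total without embedding:", sum(layer_params.values()))
print("embedding, lower bound:", embedding_params(10000, 50))
```

Calling `model.summary()` on the real model is the authoritative way to confirm these numbers.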

In [12]:
from tensorflow.keras.utils import plot_model 
plot_model(model, to_file='model1.png')
model.compile(
    loss='categorical_crossentropy', 
    optimizer=RMSprop(), 
    metrics=['acc']
)

### Check the Shape of the Data Before Training

In [13]:
print(X_train.shape)
print(len(Y_train))
print(X_test.shape)
print(len(Y_test))

(1891, 500)
1891
(334, 500)
334


### Train the Model on the GPU of my RTX 3090 Graphics Card with CUDA 11

In [14]:
model.fit(X_train, 
          Y_train,
          validation_data = (X_test, Y_test),
          batch_size=64,
          epochs=50
         )

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


<tensorflow.python.keras.callbacks.History at 0x1e081995760>
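
The notebook imports `EarlyStopping` but trains for all 50 epochs. In Keras, a callback such as `EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True)` can be passed to `model.fit` through `callbacks=[...]` to stop once validation loss stops improving. The sketch below is a simplified plain-Python version of the patience rule, with made-up loss values; `early_stopping_epoch` is a hypothetical helper, not the Keras implementation.

```python
def early_stopping_epoch(val_losses, patience=3):
    # Return the 1-based epoch at which training would stop,
    # or None if the loss kept improving often enough.
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

made_up_losses = [1.2, 0.8, 0.6, 0.55, 0.57, 0.58, 0.60, 0.61]
print(early_stopping_epoch(made_up_losses, patience=3))
```

Here the best loss occurs at epoch 4, and training stops three non-improving epochs later.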

### Save the Model

In [15]:
model.save("./output/classifier.hdf5")
plot_model(model, to_file='model2.png')

# saving
with open('./output/tokenizer.pickle', 'wb') as handle:
    pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)
    
with open('./output/labelBinarizer.pickle', 'wb') as handle:
    pickle.dump(labelBinarizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

### Retrieve the Model from Local Storage

In [16]:
model = load_model("./output/classifier.hdf5")

tokenizer = None
labelBinarizer = None

with open('./output/tokenizer.pickle', 'rb') as handle:
    tokenizer = pickle.load(handle)
    
with open('./output/labelBinarizer.pickle', 'rb') as handle:
    labelBinarizer = pickle.load(handle)

### Define Prediction Method with Human Readable Result

In [72]:
def predict_paragraph_category(paragraph):
    seq = np.array(
        sequence.pad_sequences(
            tokenizer.texts_to_sequences([paragraph.strip()]), 
            maxlen=MAX_LENGTH, 
            padding='post')
    )
    probabilities = model.predict(seq)[0]
    index = np.argmax(probabilities)
    return labelBinarizer.classes_[index], probabilities[index]

### Example of Output

In [73]:
predict_paragraph_category(contents[10])

('entertainment', 0.9367099)

In [75]:
wrong_answer = []

for index in random.sample(range(0, 2000), 100):
    content = contents[index]
    label = labels[index]
    prediction, score = predict_paragraph_category(content)
        
    if label != prediction:
        wrong_answer.append(index)
    
    print(index)
    print(f"[paragraph] {content[0:100]}...")
    print("[prediction]", prediction)
    print("[answer]", label)
    print("[confidence]", score)
    print("------------")

# 100 samples were drawn, so the number of correct answers is already a percentage.
print(f"accuracy: {100 - len(wrong_answer)}%")

1616
[paragraph] double injury blow strikes wales wales centre sonny parker and number eight ryan jones will miss sat...
[prediction] business
[answer] sport
[confidence] 0.49221826
------------
1091
[prediction] politics
[answer] politics
[confidence] 0.9740305
------------
1963
[paragraph] players sought for 1m prize uk gamers are getting a chance to take part in a 1m tournament thanks to...
[prediction] tech
[answer] tech
[confidence] 0.771528
------------
1826
[paragraph] microsoft seeking spyware trojan microsoft is investigating a trojan program that attempts to switch...
[prediction] tech
[answer] tech
[confidence] 0.98780453
------------
587
[paragraph] baby becomes new oscar favourite clint eastwoods boxing drama million dollar baby has become the new...
[prediction] entertainment
[answer] entertainment
[confidence] 0.94615114
------------
299
[paragraph] jj agrees 25bn guidant deal pharmaceutical giant johnson  johnson has agreed to buy medical technolo...
[prediction] busine

1455
[paragraph] wenger handed summer war chest arsenal boss arsene wenger has been guaranteed transfer funds to boos...
[prediction] sport
[answer] sport
[confidence] 0.95015264
------------
725
[paragraph] housewives lift channel 4 ratings the debut of us television hit desperate housewives has helped lif...
[prediction] entertainment
[answer] entertainment
[confidence] 0.94845647
------------
950
[paragraph] kelly trails new discipline power teachers could get more powers to remove unruly pupils from classe...
[prediction] politics
[answer] politics
[confidence] 0.9850677
------------
1172
[paragraph] howard and blair tax pledge clash tony blair has said voters will have to wait for labours manifesto...
[prediction] politics
[answer] politics
[confidence] 0.9699224
------------
136
[paragraph] bank set to leave rates on hold uk interest rates are set to remain on hold at 475 following the lat...
[prediction] business
[answer] business
[confidence] 0.99884725
------------
1241
[parag

706
[paragraph] john peel replacement show begins the permanent replacement for late dj john peels bbc radio 1 show ...
[prediction] entertainment
[answer] entertainment
[confidence] 0.90526944
------------
714
[paragraph] wife swap makers sue us copycat the british producers of us wife swap are taking legal action agains...
[prediction] business
[answer] entertainment
[confidence] 0.58350396
------------
1196
[paragraph] taxes must be trusted  kennedy public trust in taxes is breaking down because labour and tories are ...
[prediction] politics
[answer] politics
[confidence] 0.99901795
------------
1957
[paragraph] gates opens biggest gadget fair bill gates has opened the consumer electronics show ces in las vegas...
[prediction] tech
[answer] tech
[confidence] 0.9979461
------------
401
[paragraph] us interest rate rise expected us interest rates are expected to rise for the fifth time since june ...
[prediction] business
[answer] business
[confidence] 0.99894804
------------
67
[par
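
The accuracy of such a sample is the share of predictions that match the true label. The sketch below applies a small helper, `accuracy_percent` (a name introduced here), to the first five records of the output above. Note that the sample is drawn from `contents`, which also contains the documents used for training, so a figure computed this way is likely to be more optimistic than one computed on the held-out test set alone.

```python
def accuracy_percent(answers, predictions):
    # Share of positions where the prediction equals the true label, as a percentage.
    if not answers or len(answers) != len(predictions):
        raise ValueError("answers and predictions must be non-empty and of equal length")
    correct = sum(a == p for a, p in zip(answers, predictions))
    return 100.0 * correct / len(answers)

# First five records of the output above (indices 1616, 1091, 1963, 1826, 587).
answers = ["sport", "politics", "tech", "tech", "entertainment"]
predictions = ["business", "politics", "tech", "tech", "entertainment"]
print(f"accuracy: {accuracy_percent(answers, predictions)}%")
```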

### Prediction from Latest News (outside of the dataset)

In [76]:
paragraph = remove_punctuation("The Competition and Markets Authority found local competition concerns regarding fuel in 37 areas in the UK. Zuber and Mohsin Issa, and TDR Capital, agreed to buy Asda for £6.8bn last year. However, they also own 395 UK petrol stations while Asda owns 323.")                         
                               
predict_paragraph_category(paragraph)

('business', 0.99860305)