# Reuters_B - Newswire Topic Classification
The Reuters dataset contains 11,228 newswires from Reuters, labeled over 46 topics. As with the IMDB dataset, each wire is encoded as a sequence of integer word indices.

Our task is to create a neural network that can classify which topic a piece of text came from. Instead of an embedding layer, this notebook feeds the network a fixed-length binary word vector for each newswire.

Approach: vectorise each sequence of word indices as a binary (one-hot style) word vector.

In [1]:
import numpy as np

In [2]:
from keras.datasets import reuters
from keras.preprocessing.sequence import pad_sequences
maxlen=1500 # width of the binary word vector built below (passed to Tokenizer as num_words)
vocab_size=1000 # word indices at or above this value are replaced by oov_char
(x_train, y_train), (x_test, y_test) = reuters.load_data(path="reuters.npz",
                                                         num_words=vocab_size, # keep only word indices below vocab_size
                                                         skip_top=5, # replace indices below 5 with oov_char
                                                         maxlen=None,
                                                         test_split=0.2,
                                                         seed=113,
                                                         start_char=1,
                                                         oov_char=2,
                                                         index_from=3)

Using TensorFlow backend.


In [3]:
x_train.shape

(8982,)

In [4]:
x_test.shape

(2246,)

In [5]:
x_train[0]

[2,
 2,
 2,
 8,
 43,
 10,
 447,
 5,
 25,
 207,
 270,
 5,
 2,
 111,
 16,
 369,
 186,
 90,
 67,
 7,
 89,
 5,
 19,
 102,
 6,
 19,
 124,
 15,
 90,
 67,
 84,
 22,
 482,
 26,
 7,
 48,
 2,
 49,
 8,
 864,
 39,
 209,
 154,
 6,
 151,
 6,
 83,
 11,
 15,
 22,
 155,
 11,
 15,
 7,
 48,
 9,
 2,
 2,
 504,
 6,
 258,
 6,
 272,
 11,
 15,
 22,
 134,
 44,
 11,
 15,
 16,
 8,
 197,
 2,
 90,
 67,
 52,
 29,
 209,
 30,
 32,
 132,
 6,
 109,
 15,
 17,
 12]
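
The run of 2s at the start of this output follows from the `skip_top` and `num_words` arguments: as we understand the loader, any index outside the range `skip_top <= index < num_words` is replaced by `oov_char`. The sketch below applies that rule to a made-up sequence. The helper `filter_sequence` and the toy values are hypothetical and only illustrate the rule; they are not part of Keras.

```python
def filter_sequence(seq, num_words=1000, skip_top=5, oov_char=2):
    """Replace indices outside [skip_top, num_words) with oov_char."""
    return [w if skip_top <= w < num_words else oov_char for w in seq]

# Made-up sequence: 1 stands for the start marker, 4 for a very frequent word,
# and 1200 for a rare word outside the 1000-word vocabulary.
toy = [1, 4, 8, 43, 1200, 5]
filtered = filter_sequence(toy)
print(filtered)  # [2, 2, 8, 43, 2, 5]
```

Note that an index equal to `skip_top` (here 5) is kept, which is consistent with the 5s that appear in the output above.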

In [6]:
#x_train = pad_sequences(x_train, maxlen=maxlen)
#x_test =  pad_sequences(x_test, maxlen=maxlen)

In [7]:
from keras.preprocessing.text import Tokenizer

In [8]:
tokenizer = Tokenizer(num_words=maxlen)

* With `mode='binary'`, `sequences_to_matrix` produces a binarised word-count vector: each column represents a word index, and a column is set to 1 if that word appears in the newswire, regardless of how many times it appears.
* In simple terms we call it a one-hot word vector, although several columns can be 1 at once, so "multi-hot" or "binary bag-of-words" is the more precise name.
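
To make the encoding concrete, here is a minimal NumPy sketch of the same idea. The function `to_binary_matrix` and the toy sequences are our own illustration, not the Keras implementation.

```python
import numpy as np

def to_binary_matrix(sequences, num_words):
    """Row i gets a 1 in column j if index j occurs anywhere in sequences[i]."""
    matrix = np.zeros((len(sequences), num_words))
    for row, seq in enumerate(sequences):
        for index in seq:
            if index < num_words:  # indices beyond the matrix width are ignored
                matrix[row, index] = 1.0
    return matrix

# Index 5 occurs twice in the first sequence but still yields a single 1.
toy_sequences = [[2, 5, 5, 7], [3]]
toy_matrix = to_binary_matrix(toy_sequences, num_words=8)
print(toy_matrix.tolist())
# [[0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0],
#  [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]]
```

Word order and word counts are both discarded, which is why a plain fully connected network can consume the result directly.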

In [9]:
x_train = tokenizer.sequences_to_matrix(x_train, mode='binary')
x_test = tokenizer.sequences_to_matrix(x_test, mode='binary')

In [10]:
x_train.shape

(8982, 1500)

In [11]:
x_test.shape

(2246, 1500)

In [12]:
x_train

array([[ 0.,  0.,  1., ...,  0.,  0.,  0.],
       [ 0.,  0.,  1., ...,  0.,  0.,  0.],
       [ 0.,  0.,  1., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  1., ...,  0.,  0.,  0.],
       [ 0.,  0.,  1., ...,  0.,  0.,  0.],
       [ 0.,  0.,  1., ...,  0.,  0.,  0.]])

In [13]:
y_train

array([ 3,  4,  3, ..., 25,  3, 25])

In [14]:
y_train.shape

(8982,)

In [15]:
y_train.max()

45

In [16]:
from keras.utils import np_utils # one hot encode the y-label

In [17]:
y_train = np_utils.to_categorical(y_train, 46)
y_test = np_utils.to_categorical(y_test, 46)

In [18]:
y_train

array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])

In [19]:
y_train.shape

(8982, 46)

In [20]:
y_test.shape

(2246, 46)
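
The label arrays now have one column per topic. As a rough illustration of what one-hot encoding the labels does, the NumPy sketch below builds the same kind of matrix by hand. The helper `one_hot` and the toy labels are hypothetical, written only for this example.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (n_samples, num_classes) matrix with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.shape[0], num_classes))
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded

toy_labels = [3, 4, 45]
toy_encoded = one_hot(toy_labels, num_classes=46)
print(toy_encoded.shape)                    # (3, 46)
print(toy_encoded.argmax(axis=1).tolist())  # [3, 4, 45]
```

Taking `argmax` along each row recovers the original integer label, so no information is lost by the encoding.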

### Building the Model: Fully Connected Layers

In [21]:
from keras.models import Sequential
from keras.layers import Dense, Dropout
model = Sequential()
model.add(Dense(256, input_shape=(maxlen,), activation='relu')) 
model.add(Dropout(0.5)) 
model.add(Dense(46, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 256)               384256    
_________________________________________________________________
dropout_1 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 46)                11822     
Total params: 396,078
Trainable params: 396,078
Non-trainable params: 0
_________________________________________________________________
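
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The short sketch below redoes that arithmetic; `dense_params` is a hypothetical helper for illustration only.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

hidden = dense_params(1500, 256)  # input vector of width 1500 -> 256 units
output = dense_params(256, 46)    # 256 hidden units -> 46 topics
print(hidden, output, hidden + output)  # 384256 11822 396078
```

The dropout layer contributes no parameters, so the two dense layers account for the full total.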


In [22]:
history = model.fit(x_train, y_train, epochs=5, batch_size=32, validation_split=0.1)

Train on 8083 samples, validate on 899 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
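
The "8083 / 899" split reported above follows from `validation_split=0.1`. Assuming Keras holds out the last fraction of the samples and truncates the split index to an integer (our understanding of how `validation_split` behaves), the numbers can be reproduced as follows:

```python
n_samples = 8982          # size of x_train
validation_split = 0.1

# Index where the training portion ends; the remainder is used for validation.
split_at = int(n_samples * (1.0 - validation_split))
n_train, n_val = split_at, n_samples - split_at
print(n_train, n_val)  # 8083 899
```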


In [23]:
score = model.evaluate(x_test, y_test)



In [24]:
score

[0.85140501125943735, 0.79697239536954589]

In [25]:
print('The test accuracy is:', round(score[1]*100,2))

The test accuracy is: 79.7
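
For intuition about what the accuracy figure measures: a prediction counts as correct when the topic with the highest softmax probability matches the true topic. The NumPy sketch below computes this on made-up values; the function `accuracy` and the toy arrays are hypothetical and are not taken from the model above.

```python
import numpy as np

def accuracy(probabilities, one_hot_labels):
    """Fraction of rows where the most probable class matches the label."""
    predicted = np.argmax(probabilities, axis=1)
    actual = np.argmax(one_hot_labels, axis=1)
    return float(np.mean(predicted == actual))

# Four made-up samples over three classes; the third one is misclassified.
toy_probs = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.3, 0.6],
                      [0.5, 0.4, 0.1],
                      [0.2, 0.2, 0.6]])
toy_true = np.array([[1, 0, 0],
                     [0, 0, 1],
                     [0, 1, 0],
                     [0, 0, 1]])
toy_acc = accuracy(toy_probs, toy_true)
print(toy_acc)  # 0.75
```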


### References:

1. [keras/examples/reuters_mlp.py](https://github.com/keras-team/keras/blob/master/examples/reuters_mlp.py)