# Openhack
## Machine Learning for Hackers
### Text Classification in Keras — A Simple Reuters News Classifier
https://towardsdatascience.com/text-classification-in-keras-part-1-a-simple-reuters-news-classifier-9558d34d01d3

In [3]:
import keras
from keras.datasets import reuters

Using TensorFlow backend.


### Get data
Load the Reuters dataset from Keras and split it into 80% training data and 20% test data.

In [44]:
(x_train, y_train), (x_test, y_test) = reuters.load_data(num_words=None, test_split=0.2)
word_index = reuters.get_word_index(path="reuters_word_index.json")

In [5]:
print('# of Training Samples: {}'.format(len(x_train)))
print('# of Test Samples: {}'.format(len(x_test)))

# of Training Samples: 8982
# of Test Samples: 2246


In [6]:
num_classes = max(y_train) + 1
print('# of Classes: {}'.format(num_classes))

# of Classes: 46


The training data is tokenized at word level (each word has been replaced with a number), so a string is represented by a list of integers. The mapping from word to number is stored in `word_index`. A reverse mapping (number to word) is created in `index_to_word`.

In [12]:
index_to_word = {}
for key, value in word_index.items():
    index_to_word[value] = key

Use the reverse mapping to print a sample from the dataset.

In [17]:
print(' '.join([index_to_word[x] for x in x_train[0]]))
print(f'\nClass: {y_train[0]}')

the wattie nondiscriminatory mln loss for plc said at only ended said commonwealth could 1 traders now april 0 a after said from 1985 and from foreign 000 april 0 prices its account year a but in this mln home an states earlier and rise and revs vs 000 its 16 vs 000 a but 3 psbr oils several and shareholders and dividend vs 000 its all 4 vs 000 1 mln agreed largely april 0 are 2 states will billion total and against 000 pct dlrs

Class: 3
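
One caveat worth checking: with its default arguments, `reuters.load_data` shifts every word index by `index_from=3` and reserves 0, 1 and 2 for padding, start-of-sequence and out-of-vocabulary markers, whereas `word_index` is not shifted. A direct lookup can therefore return the wrong words, which may be why the sample above reads as scrambled. Below is a minimal sketch of an offset-aware decoder. The toy vocabulary and the names `toy_word_index` and `decode` are made up for illustration and are not part of Keras.

```python
# Toy stand-in for reuters.get_word_index(); the real mapping is far larger.
toy_word_index = {'the': 1, 'of': 2, 'said': 3, 'oil': 4}

INDEX_FROM = 3  # default offset applied by reuters.load_data


def decode(sequence, word_index, index_from=INDEX_FROM):
    """Map shifted indices back to words; reserved ids (0, 1, 2) become '?'."""
    index_to_word = {idx: word for word, idx in word_index.items()}
    return ' '.join(index_to_word.get(i - index_from, '?') for i in sequence)


# 1 is the start marker; 4 -> 'the', 7 -> 'oil', 6 -> 'said'
encoded = [1, 4, 7, 6]
print(decode(encoded, toy_word_index))  # ? the oil said
```

With the real data, the same idea would be `index_to_word.get(x - 3, '?')` inside the list comprehension.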


In [31]:
from keras.preprocessing.text import Tokenizer
max_words = 10000

### Preprocess
Vectorize the x data using the Keras `Tokenizer` in binary mode. Each list of integers is converted to a fixed-size vector, where each position in the vector corresponds to a word in the corpus. The length of the vector is set by `max_words`, which is the maximum number of words to keep, based on word frequency. Only the most common `max_words - 1` words are kept. An entry in the vector is 1 if the word occurs in the string and 0 if not. Similarly, the labels are one-hot encoded.

In [45]:
tokenizer = Tokenizer(num_words=max_words)
x_train = tokenizer.sequences_to_matrix(x_train, mode='binary')
x_test = tokenizer.sequences_to_matrix(x_test, mode='binary')
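
To make the binary mode concrete, here is a plain-Python sketch of the encoding described above. The helper name `to_binary_vector` and the tiny vocabulary size are assumptions for illustration. The real `sequences_to_matrix` call returns a NumPy array with one row per sample.

```python
def to_binary_vector(sequence, num_words):
    """Return a list of length num_words with 1.0 at every index present in sequence."""
    vector = [0.0] * num_words
    for index in sequence:
        # Indices outside the kept vocabulary are dropped.
        if 0 <= index < num_words:
            vector[index] = 1.0
    return vector


# Word 3 appears twice, but binary mode only records presence; 12 is out of range.
print(to_binary_vector([1, 3, 3, 12], num_words=8))
# [0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
```

Note that word order and word counts are lost in this representation; only the set of words survives.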

In [46]:
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

#### Example of the input data

In [50]:
print(f'Binary sequence: {x_train[0]}')
print(f'Length of input vector: {len(x_train[0])}')

Binary sequence: [0. 1. 0. ... 0. 0. 0.]
Length of input vector: 10000


#### Example of labels

In [53]:
print(f'One-hot encoded label: {y_train[0]}')
print(f'\nLength of one-hot vector, corresponding to the number of classes: {len(y_train[0])}')

One-hot encoded label: [0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

Length of one-hot vector, corresponding to the number of classes: 46
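
The one-hot encoding shown above can be reproduced by hand. This is a small sketch with hypothetical helper names (`one_hot`, `from_one_hot`) and is not the Keras implementation of `to_categorical`.

```python
def one_hot(label, num_classes):
    """Return a list of zeros with a single 1.0 at position `label`."""
    vector = [0.0] * num_classes
    vector[label] = 1.0
    return vector


def from_one_hot(vector):
    """Recover the integer label from a one-hot vector."""
    return vector.index(1.0)


encoded = one_hot(3, 46)
print(len(encoded), from_one_hot(encoded))  # 46 3
```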


In [55]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.callbacks import TensorBoard

### Building and training a model
Use TensorBoard to visualize training progress. Start it with `tensorboard --logdir path_to_current_dir/Graph`.

In [56]:
tbCallBack = TensorBoard(log_dir='./Graph', histogram_freq=0, write_graph=True, write_images=True)

#### The actual model
Create an ANN with dense connections, ReLU activation, dropout and softmax classification. The lines below might generate warnings depending on your TensorFlow and Keras versions; they are deprecation notices and can be ignored.

In [58]:
model = Sequential()
model.add(Dense(512, input_shape=(max_words,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes))
model.add(Activation('softmax'))

Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
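
As a quick sanity check on the size of this network, the parameter count can be worked out by hand: a dense layer has one weight per input-unit pair plus one bias per unit, while the activation and dropout layers add no parameters. The helper `dense_params` is a hypothetical name used only for this arithmetic.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


max_words = 10000
num_classes = 46

hidden = dense_params(max_words, 512)    # first Dense layer
output = dense_params(512, num_classes)  # second Dense layer
print(hidden, output, hidden + output)   # 5120512 23598 5144110
```

Almost all parameters sit in the first layer, which is one reason dropout is applied right after it.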


Calculate the loss with categorical cross-entropy and use the Adam optimizer for training.

In [59]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.metrics_names)

['loss', 'acc']
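
For intuition about the loss value, here is a plain-Python sketch of softmax followed by categorical cross-entropy for a single sample. The function names and the clipping constant `eps` are illustrative choices, not the Keras internals.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


# A model that guesses uniformly over 46 classes has loss ln(46), about 3.83.
uniform = softmax([0.0] * 46)
label = [0.0] * 46
label[3] = 1.0
print(categorical_crossentropy(label, uniform))
```

This gives a useful baseline: a test loss well below ln(46) means the model is doing much better than uniform guessing.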


The batch size defines how many examples are fed through the network at once. The number of epochs defines how many times the entire training set is run through the network.

In [62]:
batch_size = 32
epochs = 3
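
These two settings together determine how many weight updates the training performs. The arithmetic below is a sketch that assumes the training call holds out the last 10% of the 8982 training samples for validation (`validation_split=0.1`), which matches the 8083/899 split reported in the training log.

```python
import math

n_samples = 8982          # training samples reported above
validation_split = 0.1    # fraction held out for validation during fit
batch_size = 32
epochs = 3

n_train = int(n_samples * (1 - validation_split))
n_val = n_samples - n_train
steps_per_epoch = math.ceil(n_train / batch_size)  # the last batch is smaller

print(n_train, n_val)                              # 8083 899
print(steps_per_epoch, steps_per_epoch * epochs)   # 253 759
```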

In [63]:
history = model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, verbose=1, validation_split=0.1, callbacks=[tbCallBack])
score = model.evaluate(x_test, y_test, batch_size=batch_size, verbose=1)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Train on 8083 samples, validate on 899 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Test loss: 1.0902739759013786
Test accuracy: 0.799198575297956


You can watch the training in real time in TensorBoard. The model starts to overfit after about 3 epochs. What can we do to reduce overfitting? Options include early stopping, dropout, reducing network complexity and regularization. The easiest solution in this case is to stop early: keep `epochs = 3`, as set above, and run the training again if you trained for longer.
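
Early stopping can be sketched in a few lines of plain Python: track the best validation loss and stop once it has not improved for `patience` consecutive epochs. The function name and the loss values below are made up for illustration and were not measured on this model. In Keras, the `keras.callbacks.EarlyStopping` callback (with its `monitor` and `patience` arguments) provides this behavior.

```python
def early_stopping_epoch(val_losses, patience=1):
    """Return the 1-based epoch with the best validation loss, stopping once
    the loss has failed to improve for `patience` consecutive epochs."""
    best_loss = float('inf')
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch


# Illustrative validation losses only: improving for 3 epochs, then rising.
illustrative_val_losses = [1.30, 1.05, 0.98, 1.02, 1.10, 1.25]
print(early_stopping_epoch(illustrative_val_losses, patience=2))  # 3
```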

### Test the model on your own data
You need to preprocess the text in the same way as the training data, using the same word map. This part is not commented further, since it repeats the steps above.

In [81]:
test_sentence = 'Hackers are changing the world at Openhack'
tokenized = test_sentence.split()

In [83]:
tokenized_with_word_index = []
for t in tokenized:
    if t in word_index:
        tokenized_with_word_index.append(word_index[t])
    else:
        # Word not in the original corpus: assign an index past the vocabulary.
        # sequences_to_matrix skips any index >= max_words, so unknown words are effectively dropped.
        tokenized_with_word_index.append(max(word_index.values()) + 1)
test_sequence = tokenizer.sequences_to_matrix([tokenized_with_word_index], mode='binary')
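
Two details are easy to miss when encoding your own text: the keys of `word_index` are lowercase, and the training sequences from `reuters.load_data` are shifted by `index_from=3` by default. The sketch below shows one way to account for both, under the assumption that unknown or too-rare words are simply dropped. `encode_sentence` and the toy vocabulary are hypothetical and not part of Keras.

```python
# Toy stand-in for reuters.get_word_index(); indices here are invented.
toy_word_index = {'the': 1, 'world': 2, 'are': 3, 'at': 4}


def encode_sentence(sentence, word_index, num_words, index_from=3):
    """Lowercase, split on whitespace, and shift indices like load_data does.
    Words that are unknown or fall outside the vocabulary are skipped."""
    ids = []
    for token in sentence.lower().split():
        idx = word_index.get(token)
        if idx is not None and idx + index_from < num_words:
            ids.append(idx + index_from)
    return ids


print(encode_sentence('Hackers are changing the world at Openhack',
                      toy_word_index, num_words=10000))
# [6, 4, 5, 7]
```

The resulting list could then be passed to `tokenizer.sequences_to_matrix([...], mode='binary')` as in the cell above.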

In [90]:
model.predict_classes(test_sequence)

array([3])
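
Depending on your version, `predict_classes` may not be available on `Sequential` models, since newer TensorFlow/Keras releases dropped it. The equivalent is to take the index of the largest probability returned by `model.predict`, for example `np.argmax(model.predict(test_sequence), axis=-1)`. The plain-Python sketch below shows the same idea on made-up probabilities; `predict_class` is a hypothetical helper name.

```python
def predict_class(probabilities):
    """Return the index of the highest probability (argmax)."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)


# Invented softmax output for a 5-class problem, for illustration only.
illustrative_probs = [0.05, 0.10, 0.15, 0.60, 0.10]
print(predict_class(illustrative_probs))  # 3
```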