# Openhack
## Machine Learning for Hackers
### Text Classification in Keras — A Simple Reuters News Classifier
https://towardsdatascience.com/text-classification-in-keras-part-1-a-simple-reuters-news-classifier-9558d34d01d3

In [1]:
import keras
from keras.datasets import imdb

Using TensorFlow backend.


### Get data
Load the IMDB dataset from Keras. It comes pre-split into 50% training data and 50% test data.

In [2]:
NUM_WORDS=20000 # only use the top 20000 words
INDEX_FROM=3   # word index offset

train,test = imdb.load_data(num_words=NUM_WORDS, index_from=INDEX_FROM)
x_train,y_train = train
x_test,y_test = test


In [3]:
#x_train, y_train

In [4]:
print('# of Training Samples: {}'.format(len(x_train)))
print('# of Test Samples: {}'.format(len(x_test)))

# of Training Samples: 25000
# of Test Samples: 25000


In [5]:
num_classes = max(y_train) + 1
print('# of Classes: {}'.format(num_classes))

# of Classes: 2


The training data is tokenized at word level (each word has been replaced with a number), so a string is represented by a list of integers. The mapping from words to numbers is stored in `word_to_id`. A reverse mapping (number to word) is created in `id_to_word`.

In [6]:
word_to_id = imdb.get_word_index()
word_to_id = {k:(v+INDEX_FROM) for k,v in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2

id_to_word = {value:key for key,value in word_to_id.items()}
print(' '.join(id_to_word[id] for id in x_train[0] ))

<START> this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert <UNK> is an amazing actor and now the same being director <UNK> father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for retail and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also congratulations to the two little boy's that played the <UNK> of norman and paul they were just brilliant children are often left out of the praising list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be p

The cell above uses the reverse mapping to print a sample from the dataset.
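To make the index offset concrete, here is a minimal sketch on a toy vocabulary. The `raw_index` dictionary and its ranks are hypothetical stand-ins for what `imdb.get_word_index()` returns, and `decode` is an illustrative helper, not part of Keras.

```python
# Toy stand-in for imdb.get_word_index(); the words and ranks are made up.
INDEX_FROM = 3
raw_index = {"the": 1, "film": 2, "was": 3, "brilliant": 4}

# Shift every id by INDEX_FROM to make room for the reserved tokens.
word_to_id = {word: rank + INDEX_FROM for word, rank in raw_index.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2

# Reverse mapping: id -> word.
id_to_word = {idx: word for word, idx in word_to_id.items()}

def decode(sequence):
    """Map a list of ids back to text; ids without an entry become <UNK>."""
    return " ".join(id_to_word.get(idx, "<UNK>") for idx in sequence)

encoded = [1, 4, 5, 6, 7, 2]
decoded = decode(encoded)
print(decoded)
```

Without the shift, the most frequent words would collide with the ids 0, 1 and 2 that are reserved for padding, sequence start and unknown words.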

In [7]:
from keras.preprocessing.text import Tokenizer
max_words = 10000

### Preprocess
Tokenize the input data using a binary tokenizer. Each list of integers is converted to a fixed-size vector, where each position in the vector corresponds to a word in the corpus. The length of the vector is set by `max_words`, the maximum number of words to keep based on word frequency, so only the most common words are kept. An entry in the vector is 1 if the word occurs in the string and 0 if not. Similarly, the labels are one-hot encoded.

In [8]:
tokenizer = Tokenizer(num_words=max_words)
x_train = tokenizer.sequences_to_matrix(x_train, mode='binary')
x_test = tokenizer.sequences_to_matrix(x_test, mode='binary')
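The binary encoding can be sketched in plain Python. This is an illustration of the idea, not the Keras implementation; `sequences_to_binary_matrix` is a hypothetical helper, and the assumption that ids at or above `num_words` are skipped reflects how `sequences_to_matrix` is understood to behave.

```python
def sequences_to_binary_matrix(sequences, num_words):
    """Multi-hot encode: row[i] is 1.0 if word id i occurs in the sequence."""
    matrix = []
    for seq in sequences:
        row = [0.0] * num_words
        for idx in seq:
            if idx < num_words:  # assumption: ids outside the vocabulary are skipped
                row[idx] = 1.0
        matrix.append(row)
    return matrix

# Two toy "reviews"; id 9 lies outside a vocabulary of size 6 and is dropped.
toy = sequences_to_binary_matrix([[1, 4, 4, 9], [2, 3]], num_words=6)
print(toy)
```

Note that this representation discards both word order and word counts: the repeated id 4 in the first toy sequence still yields a single 1.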

In [9]:
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
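One-hot encoding of the labels follows the same pattern. The sketch below shows the idea in plain Python; `to_one_hot` is a hypothetical helper used for illustration, not the function Keras calls internally.

```python
def to_one_hot(labels, num_classes):
    """Turn integer class labels into one-hot rows."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0  # only the position of the true class is set
        rows.append(row)
    return rows

one_hot = to_one_hot([1, 0], num_classes=2)
print(one_hot)
```

With two classes, label 1 becomes `[0.0, 1.0]` and label 0 becomes `[1.0, 0.0]`, which matches the shape the softmax output layer produces.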

#### Example of the input data

In [10]:
print(f'Binary sequence: {x_train[0]}')
print(f'Length of input vector: {len(x_train[0])}')

Binary sequence: [0. 1. 1. ... 0. 0. 0.]
Length of input vector: 10000


#### Example of labels

In [11]:
print(f'One-hot encoded labels: {y_train[0]}, {y_train[1]}')
print(f'\nLength of one-hot vector, corresponding to the number of classes: {len(y_train[0])}')

One-hot encoded labels: [0. 1.], [1. 0.]

Length of one-hot vector, corresponding to the number of classes: 2


In [12]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.callbacks import TensorBoard

### Building and training a model
Use TensorBoard to visualize progress. To start it, run `tensorboard --logdir path_to_current_dir/Graph`.

In [13]:
tbCallBack = TensorBoard(log_dir='./Graph', histogram_freq=0, write_graph=True, write_images=True)

#### The actual model
Create an ANN with dense connections, a ReLU activation function, dropout and softmax classification. The lines below might generate warnings depending on your TensorFlow and Keras versions; these are deprecation notices and can be ignored.

In [14]:
model = Sequential()
model.add(Dense(512, input_shape=(max_words,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes))
model.add(Activation('softmax'))

Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
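To get a feel for the size of this network, the number of trainable parameters can be worked out by hand. The sketch below assumes the layer sizes used above (10000 inputs, 512 hidden units, 2 classes); `dense_params` is a hypothetical helper.

```python
def dense_params(n_inputs, n_units):
    """A dense layer has one weight per input-unit pair plus one bias per unit."""
    return n_inputs * n_units + n_units

max_words, hidden_units, num_classes = 10000, 512, 2

hidden = dense_params(max_words, hidden_units)    # 10000 * 512 + 512
output = dense_params(hidden_units, num_classes)  # 512 * 2 + 2
total = hidden + output

print(hidden, output, total)
```

The activation and dropout layers add no parameters, so almost all of the roughly 5.1 million weights sit in the first dense layer.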


Calculate the loss with categorical cross-entropy and use the Adam optimizer for training.

In [15]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.metrics_names)

['loss', 'acc']
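As a worked example of the loss, the following sketch computes softmax and categorical cross-entropy for a single sample in plain Python. The logits are made-up numbers, and the helper functions are illustrative rather than the Keras implementations.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

probs = softmax([2.0, 0.0])  # hypothetical network output, favouring class 0
loss_correct = categorical_crossentropy([1.0, 0.0], probs)  # true class is 0
loss_wrong = categorical_crossentropy([0.0, 1.0], probs)    # true class is 1

print(probs, loss_correct, loss_wrong)
```

The loss is small when the network puts high probability on the true class and grows quickly when it is confidently wrong.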


The batch size defines how many examples are fed through the network at once. The number of epochs defines how many times the entire training set is run through the network.

In [22]:
batch_size = 150
epochs = 3
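These two settings determine how many weight updates the training performs. The arithmetic below is a sketch that assumes the 25000 training samples loaded earlier and the `validation_split=0.1` passed to `fit` in the next cell.

```python
import math

train_samples = 25000
validation_split = 0.1
batch_size = 150
epochs = 3

val_samples = round(train_samples * validation_split)  # held out for validation
fit_samples = train_samples - val_samples              # actually used for training
steps_per_epoch = math.ceil(fit_samples / batch_size)  # batches per pass over the data
total_updates = steps_per_epoch * epochs

print(fit_samples, val_samples, steps_per_epoch, total_updates)
```

A larger batch size means fewer, smoother updates per epoch; a smaller one means more, noisier updates.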

In [23]:
history = model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, verbose=1, validation_split=0.1, callbacks=[tbCallBack])
score = model.evaluate(x_test, y_test, batch_size=batch_size, verbose=1)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Train on 22500 samples, validate on 2500 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3
Test loss: 0.6504178770780563
Test accuracy: 0.8690800015926361


You can watch the training in real time in TensorBoard. As we can see, the model starts to overfit after ~3 epochs. What can we do to reduce overfitting? Options include early stopping, dropout, reducing network complexity, and regularization. The easiest solution in this case is to set `epochs = 3`, as in the cell above, and run the training again.
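Early stopping automates the choice of epoch count. The sketch below shows the underlying logic in plain Python on made-up validation losses; `early_stopping_epoch` is a hypothetical helper. In practice Keras offers a `keras.callbacks.EarlyStopping` callback (with arguments such as `monitor` and `patience`) that can be passed to `fit` alongside the TensorBoard callback.

```python
def early_stopping_epoch(val_losses, patience=1):
    """Return (stop_epoch, best_epoch), both 1-based.

    Training stops once the validation loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Hypothetical validation losses: improving at first, then overfitting.
hypothetical_val_losses = [0.35, 0.30, 0.31, 0.36, 0.42]
stop_epoch, best_epoch = early_stopping_epoch(hypothetical_val_losses, patience=2)
print(stop_epoch, best_epoch)
```

A higher `patience` tolerates noisy validation curves at the cost of a few extra epochs of training.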

### Test the model on your own data
You need to preprocess the text in the same way as the training data, using the same word map. This part is not commented further, since it basically repeats the earlier steps.

In [42]:
test_sentence = 'I was happily disappointed it was s'
tokenized = test_sentence.split()
# Id for words missing from the original corpus; it lies above max_words, so sequences_to_matrix should ignore it
unknown_id = max(word_to_id.values()) + 1
# Note: the word index is lowercase, so capitalized tokens such as 'I' are also treated as unknown
tokenized_with_word_index = [word_to_id.get(t, unknown_id) for t in tokenized]
test_sequence = tokenizer.sequences_to_matrix([tokenized_with_word_index], mode='binary')
# predict_classes was removed in newer TensorFlow releases; there, use model.predict(test_sequence).argmax(axis=-1)
model.predict_classes(test_sequence)

array([0])
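A slightly more robust way to encode your own text is sketched below. It lowercases the input, strips surrounding punctuation, and maps unseen words to the `<UNK>` id instead of an out-of-range id. The `encode_sentence` helper and the ids in `toy_word_to_id` are hypothetical; with the real data you would pass the `word_to_id` dictionary built earlier. This changes which features are set compared with the cell above, so predictions may differ.

```python
import string

def encode_sentence(sentence, word_to_id, unknown_token="<UNK>"):
    """Lowercase, split on whitespace, strip punctuation, and look up ids."""
    unknown_id = word_to_id[unknown_token]
    tokens = [t.strip(string.punctuation) for t in sentence.lower().split()]
    return [word_to_id.get(t, unknown_id) for t in tokens if t]

# Toy vocabulary with made-up ids, standing in for the real word_to_id.
toy_word_to_id = {"<PAD>": 0, "<START>": 1, "<UNK>": 2, "i": 13, "was": 16, "it": 12}

encoded_sentence = encode_sentence("I was happily disappointed, it was!", toy_word_to_id)
print(encoded_sentence)
```

The resulting list of ids can then be passed to `tokenizer.sequences_to_matrix` in the same way as `tokenized_with_word_index`.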