# deepNeuralText

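A text classifier based on a deep recurrent neural network (RNN) with Long Short-Term Memory (LSTM) units. The network includes an embedding layer, so the input is first transformed into padded sequences of fixed length. PoS tagging is also included to improve classification, but is not essential for the network to run. As configured below, the output layer is a single sigmoid unit trained with binary cross-entropy, so the labels are expected to be 0 or 1; a multi-class setup would need a different output layer and loss.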

Version: 0.2

Author: Michal Pikusa (pikusa.michal@gmail.com)

In [16]:
import numpy as np
import nltk
import keras
from keras.layers import Dense, Embedding
from keras.layers import LSTM, GlobalMaxPool1D, Dropout
from keras.preprocessing import text, sequence
from keras.wrappers.scikit_learn import KerasClassifier
from keras.models import Sequential
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

Function for loading the data. The input file holds one text to classify per line, so the function returns a list of lines with surrounding whitespace stripped.

In [2]:
def load_data(infile):
    # Read one entry per line; the with-block closes the file afterwards
    with open(infile, 'r') as text_file:
        return [line.strip() for line in text_file]

Function for Part-of-Speech tagging with NLTK. Each line of the input is first tokenized and then tagged. Each resulting tag is joined to its token with an underscore to create a single entity.

In [4]:
def pos_tag(docs):
    tagged_sentences = []
    for item in docs:
        tokenized = nltk.word_tokenize(item)
        tagged = nltk.pos_tag(tokenized)
        # Join each token with its tag, e.g. ('dog', 'NN') -> 'dog_NN'
        combined = ['_'.join([token, tag]) for token, tag in tagged]
        tagged_sentences.append(' '.join(combined))
    return tagged_sentences
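
To make the output format concrete, the sketch below applies the same joining step to hand-written `(token, tag)` pairs. The pairs and the helper name `combine_tagged` are assumptions for illustration; in the notebook the pairs come from `nltk.pos_tag`, which needs NLTK's tokenizer and tagger data to be available.

```python
def combine_tagged(tagged):
    """Join (token, tag) pairs into one whitespace-separated string."""
    return ' '.join('_'.join([token, tag]) for token, tag in tagged)

# Hypothetical tagger output for the sentence "The dog barks"
example_tagged = [('The', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')]
combined_example = combine_tagged(example_tagged)
print(combined_example)  # The_DT dog_NN barks_VBZ
```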

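Function for encoding the data. A tokenizer is fitted on all input lines to build a vocabulary capped at 20,000 words. Each line is then converted to a sequence of vocabulary indices, which is padded or truncated to a fixed length of 128. Note that the default `Tokenizer` filters remove underscores, so a tagged token such as `dog_NN` is split into `dog` and `nn` unless the `filters` argument is overridden.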

In [3]:
def encode_data(docs):
    tokenizer = text.Tokenizer(num_words=20000)
    tokenizer.fit_on_texts(docs)
    tokenized = tokenizer.texts_to_sequences(docs)
    docs = sequence.pad_sequences(tokenized, maxlen=128)
    return docs
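
The encoding step can be sketched in plain Python to show what the resulting sequences contain. This is a simplified stand-in, not the Keras implementation: it splits on whitespace only, and the helper names `build_vocab`, `to_sequences`, and `pad_pre` are hypothetical. It assumes the Keras defaults of frequency-ranked indices starting at 1 (0 is reserved for padding) and of padding and truncating at the start of each sequence.

```python
from collections import Counter

def build_vocab(docs, num_words):
    """Map the most frequent words to indices 1..num_words-1 (0 is padding)."""
    counts = Counter(word for doc in docs for word in doc.lower().split())
    most_common = counts.most_common(num_words - 1)
    return {word: idx for idx, (word, _) in enumerate(most_common, start=1)}

def to_sequences(docs, vocab):
    """Replace each known word by its index; drop out-of-vocabulary words."""
    return [[vocab[w] for w in doc.lower().split() if w in vocab] for doc in docs]

def pad_pre(seq, maxlen):
    """Keep the last maxlen items and left-pad with zeros to a fixed length."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

toy_docs = ["the cat sat", "the cat sat on the mat"]
vocab = build_vocab(toy_docs, num_words=20000)
padded = [pad_pre(s, maxlen=5) for s in to_sequences(toy_docs, vocab)]
print(padded)
```

The short document is left-padded with zeros, while the long one loses its first word, so every row has the same length and can be stacked into one array.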

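Function for building the model with Keras. The model-building function is nested inside an outer function so that the hyperparameters can be bound first and the resulting zero-argument callable can be passed to the Keras scikit-learn wrapper (`KerasClassifier`).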

In [13]:
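def build_model(embed_size,max_length,vocab_size):
    # The inner function takes no arguments; it reads the hyperparameters
    # from the enclosing scope when the wrapper calls it.
    def create_model():
        model = Sequential()
        model.add(Embedding(vocab_size, embed_size, input_length=max_length))
        # return_sequences=True keeps every time step for the pooling layer
        model.add(LSTM(50, return_sequences=True))
        model.add(GlobalMaxPool1D())
        model.add(Dropout(0.1))
        model.add(Dense(50, activation="relu"))
        model.add(Dropout(0.1))
        # Single sigmoid unit: binary classification
        model.add(Dense(1, activation="sigmoid"))
        model.compile(loss='binary_crossentropy',
                      optimizer='adam',
                      metrics=['accuracy'])
        return model
    return create_model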
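
The nesting is a closure: the outer call fixes the hyperparameters, and the returned zero-argument function can then be called whenever a fresh model is needed. A minimal sketch of the pattern without Keras, where `make_builder` and the dictionary standing in for a model are purely illustrative:

```python
def make_builder(embed_size, max_length, vocab_size):
    """Return a zero-argument function that remembers the hyperparameters."""
    def build():
        # A dict stands in for the compiled Keras model in this sketch
        return {'embed_size': embed_size,
                'max_length': max_length,
                'vocab_size': vocab_size}
    return build

builder = make_builder(128, 128, 20000)  # hyperparameters bound here
model_a = builder()                      # called later, with no arguments
model_b = builder()                      # each call yields a new object
print(model_a == model_b, model_a is model_b)  # True False
```

Because every call builds a new object, each cross-validation fold can start from an untrained model rather than reusing weights from a previous fold.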

Load the data.

In [6]:
docs = load_data('corpus.txt')
labels = load_data('labels.txt')

Tag the data (this step can be commented out to improve speed).

In [7]:
docs = pos_tag(docs)

Encode the data, make sure that both the data and the labels are integers, and finally reshape the labels to fit the network's output layer.

In [8]:
docs_encoded = np.array(encode_data(docs))
# Labels are read from file as strings, so cast them to integers
labels_encoded = np.array(labels).astype(int)
train_data = docs_encoded.astype(int)
# Column vector of shape (n_samples, 1) to match the single output unit
train_labels = labels_encoded.reshape(len(labels_encoded), 1)

Evaluate the model with 10-fold cross-validation, and print the mean accuracy across folds together with its standard deviation.

In [14]:
estimator = KerasClassifier(build_fn=build_model(128,128,20000),epochs=3,batch_size=32,verbose=0)
folds = KFold(n_splits=10, shuffle=True, random_state=128)
results = cross_val_score(estimator=estimator,X=train_data,y=train_labels,cv=folds)
print("Baseline: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

Baseline: 85.77% (1.07%)
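
The printed figure is the mean of the ten per-fold accuracies, with their standard deviation in parentheses. The sketch below shows how such a summary arises, using a plain-Python k-fold split and made-up fold scores. `kfold_indices` is a hypothetical helper that does not reproduce scikit-learn's exact splitting, and the values in `fold_scores` are invented, not the notebook's results. `statistics.pstdev` is used because NumPy's `std` computes the population standard deviation by default.

```python
import random
import statistics

def kfold_indices(n_samples, n_splits, seed):
    """Shuffle sample indices and yield (train, test) index lists per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

splits = list(kfold_indices(n_samples=20, n_splits=10, seed=128))

# Hypothetical per-fold accuracies, one per split
fold_scores = [0.84, 0.86, 0.85, 0.87, 0.86, 0.85, 0.88, 0.84, 0.86, 0.85]
mean_acc = statistics.mean(fold_scores)
std_acc = statistics.pstdev(fold_scores)
print("Baseline: %.2f%% (%.2f%%)" % (mean_acc * 100, std_acc * 100))
```

Each sample lands in the test set of exactly one fold, so every text is scored once by a model that did not see it during training.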
