Here, I'm going to use fastText to attempt to classify this dataset. Some experimentation I did with this can be found [here](https://frankkloster.github.io/2018/11/20/comparison-of-deep-learning-techniques-for-text-classification.html).

# Preliminaries

In [None]:
from collections import defaultdict

# Numerical, data and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Keras libraries
import keras
from keras.layers import Activation, Conv1D, Dense, Dropout, GlobalAveragePooling1D, Embedding
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.callbacks import EarlyStopping

# Scikit-learn, just to split everything into training/validation/testing sets.
from sklearn.model_selection import train_test_split

In [None]:
df = pd.read_csv('../input/train.csv')

Our training data is now loaded into memory. Let's have a look.

In [None]:
print(df.head())

Our prediction target is the author. Let's translate the labels into a standard one-hot encoding.

In [None]:
author_dict = {'EAP': 0, 'HPL': 1, 'MWS': 2}
y = np.array([author_dict[x] for x in df.author])
y = to_categorical(y)
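
To make the one-hot step concrete, here is a minimal NumPy sketch of what `to_categorical` produces for integer labels. The sample labels are invented for illustration and are not taken from the dataset.

```python
import numpy as np

# Same mapping as above; the sample labels are made up for illustration.
author_dict = {'EAP': 0, 'HPL': 1, 'MWS': 2}
sample_authors = ['HPL', 'EAP', 'MWS', 'EAP']

indices = np.array([author_dict[a] for a in sample_authors])

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(len(author_dict), dtype='float32')[indices]
print(one_hot)
```

Each row contains a single 1 in the column of its author, which is the shape `categorical_crossentropy` expects.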

Let's translate our text into something Keras can work with. Specifically, we are going to turn each text into a sequence of integer token indices, using 1-grams and 2-grams as tokens.

In [None]:
def preprocess(text):
    text = text.replace("' ", " ' ")
    signs = set(',.:;"?!')
    prods = set(text) & signs
    if not prods:
        return text

    for sign in prods:
        text = text.replace(sign, ' {} '.format(sign) )
    return text

def add_ngram(q, n_gram_max):
    '''
    Creates a list of n-grams, up to n_gram_max.
    
    q -> (1-grams of q) + (2-grams of q) + ... + (n_gram_max-grams of q)
    '''
    ngrams = []
    for n in range(2, n_gram_max+1):
        for w_index in range(len(q)-n+1):
            ngrams.append('--'.join(q[w_index:w_index+n]))
    return q + ngrams

def create_docs(df, n_gram_max=2):
    '''
    Preprocesses text located into dataframe, creating all n-grams up to n_gram_max.
    '''
    docs = []
    for doc in df.text:
        doc = preprocess(doc).split()
        docs.append(' '.join(add_ngram(doc, n_gram_max)))
    
    return docs

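def create_vector(df, n_gram_max=2, min_count=2, maxlen=256):
    '''
    Tokenizes the n-gram documents into padded integer sequences for training.
    '''
    X = create_docs(df, n_gram_max)

    # First pass: count tokens to find how many occur at least min_count times.
    tokenizer = Tokenizer(lower=False, filters='')
    tokenizer.fit_on_texts(X)
    num_words = sum(1 for v in tokenizer.word_counts.values() if v >= min_count)

    # Second pass: restrict the vocabulary to the most frequent tokens.
    # Keras only keeps token indices below num_words.
    tokenizer = Tokenizer(num_words=num_words, lower=False, filters='')
    tokenizer.fit_on_texts(X)
    X = tokenizer.texts_to_sequences(X)

    # Pad (or truncate) every sequence to exactly maxlen indices.
    X = pad_sequences(sequences=X, maxlen=maxlen)

    return X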
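
To see what the n-gram step produces, here is a small standalone sketch on a toy sentence. `toy_ngrams` is a hypothetical helper that mirrors the logic of `add_ngram` above, and the sentence is invented for illustration.

```python
def toy_ngrams(tokens, n_gram_max=2):
    """Return the tokens plus all joined n-grams for n = 2..n_gram_max."""
    ngrams = []
    for n in range(2, n_gram_max + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append('--'.join(tokens[i:i + n]))
    return tokens + ngrams

tokens = 'the raven spoke'.split()
print(toy_ngrams(tokens))
# ['the', 'raven', 'spoke', 'the--raven', 'raven--spoke']
print(toy_ngrams(tokens, n_gram_max=3))
# ['the', 'raven', 'spoke', 'the--raven', 'raven--spoke', 'the--raven--spoke']
```

Joining with `--` turns every n-gram into a single whitespace-free token, so the tokenizer treats it like any other word.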

In [None]:
maxlen = 256

X = create_vector(df, maxlen=maxlen)

Now our dataset is in the proper form. All that is left is to split it into training and testing sets.

In [None]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Here, we create an auxiliary function to visualize how well our model is learning, and whether it is starting to overfit.

In [None]:
def plot_scores(history):
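    # Older Keras logs accuracy as 'acc'; newer versions use 'accuracy'.
    acc = history.history.get('acc', history.history.get('accuracy'))
    val_acc = history.history.get('val_acc', history.history.get('val_accuracy'))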

    loss = history.history['loss']
    val_loss = history.history['val_loss']

    epochs = range(1, len(loss) + 1)

    plt.figure(figsize=(20, 10))

    plt.subplot(121)
    plt.plot(epochs, acc, 'bo', label='Training accuracy')
    plt.plot(epochs, val_acc, 'b', label='Validation accuracy')
    plt.title("Training and validation accuracy")
    plt.legend()

    plt.subplot(122)
    plt.plot(epochs, loss, 'bo', label='Training loss')
    plt.plot(epochs, val_loss, 'b', label='Validation loss')
    plt.title("Training and validation loss")
    plt.legend()
    plt.show()

# FastText Model
Let's try out FastText! In my experiments, changing the few hyperparameters did not change the results much, so I won't be doing a standard grid search over them.

In [None]:
model = Sequential()

# Size of the embedding table; it must be larger than the largest token index in X.
max_features = 20000
batch_size = 32
embedding_dims = 15
epochs = 30

model.add(Embedding(max_features,
                    embedding_dims,
                    input_length=maxlen))
model.add(GlobalAveragePooling1D())
model.add(Dense(3, activation='softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
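
The three layers above amount to very little computation: look up a vector per token, average over the sequence, and apply a softmax classifier. Here is a NumPy sketch of that forward pass, using toy sizes and random weights as stand-ins for the learned ones (all names and sizes below are illustrative assumptions). Note that padding positions are averaged in as well, since no masking is used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration, much smaller than the real model.
vocab_size, embedding_dims, n_classes, maxlen = 50, 15, 3, 8

# Randomly initialised stand-ins for the learned weights.
E = rng.normal(size=(vocab_size, embedding_dims))   # embedding table
W = rng.normal(size=(embedding_dims, n_classes))    # dense kernel
b = np.zeros(n_classes)                             # dense bias

def forward(token_ids):
    """Embedding lookup -> mean over positions -> softmax."""
    embedded = E[token_ids]                  # (batch, maxlen, embedding_dims)
    pooled = embedded.mean(axis=1)           # (batch, embedding_dims)
    logits = pooled @ W + b                  # (batch, n_classes)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

batch = rng.integers(0, vocab_size, size=(4, maxlen))
probs = forward(batch)
print(probs.shape)        # (4, 3)
print(probs.sum(axis=1))  # each row sums to 1
```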

In [None]:
model.summary()
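
As a quick check on the summary, the parameter count can be worked out by hand from the hyperparameters above. The pooling layer has no weights, so only the embedding table and the dense layer contribute.

```python
max_features, embedding_dims, n_classes = 20000, 15, 3

embedding_params = max_features * embedding_dims        # one vector per token index
dense_params = embedding_dims * n_classes + n_classes   # kernel plus bias

print(embedding_params, dense_params, embedding_params + dense_params)
# 300000 48 300048
```

Nearly all of the model's capacity sits in the embedding table.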

In [None]:
history = model.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    validation_data=(x_test, y_test),
                    callbacks=[EarlyStopping(patience=2, monitor='val_loss')])
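
With `patience=2`, training stops once the validation loss has failed to improve for two epochs in a row. The rule can be sketched in plain Python as follows. `stopping_epoch` is a hypothetical helper, the loss values are invented, and the sketch ignores options such as `min_delta`.

```python
def stopping_epoch(val_losses, patience=2):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0      # improvement: reset the counter
        else:
            wait += 1                 # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# Invented validation losses, for illustration only.
losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.59]
print(stopping_epoch(losses))  # 5
```

In this made-up run the loss would have improved again at epoch 6, which is the usual trade-off of a small patience value.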

In [None]:
plot_scores(history)

We are getting a validation accuracy of around 85%. Not bad!