# Tweet Classification: Airline Complaints

## Objective: train a classifier to classify tweets as complaints or not

We have labeled 4,960 tweets [here](https://docs.google.com/spreadsheets/d/1rU3Gt81fwjHAcB0-a0N3rwsfquKQJjxNK838lhsCDCg/edit#gid=65146049) with binary labels of **complaint** (1) or **not a complaint** (0).

The first step in classification is to represent each tweet numerically while retaining its semantic information. We do this with [FastText via Gensim](https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/FastText_Tutorial.ipynb).
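
FastText builds each word vector from the word's character n-grams, which helps with the misspellings and unseen tokens that are common in tweets. The following is a minimal, library-free sketch of how those subword units can be enumerated. The helper name `char_ngrams` is hypothetical, the 3-to-6 range is assumed to mirror the library defaults, and the real implementation additionally hashes n-grams into buckets.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Enumerate the character n-grams of a word wrapped in boundary markers."""
    wrapped = "<" + word + ">"  # "<" and ">" mark the start and end of the word
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


# Trigrams only, to keep the output short
print(char_ngrams("delta", min_n=3, max_n=3))
# ['<de', 'del', 'elt', 'lta', 'ta>']
```

Because a misspelling such as "dleta" still shares some n-grams with "delta", the two words can end up with related vectors.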

## Load Modules

In [25]:
from gensim.models import FastText
from gensim.test.utils import common_texts
import gensim
import keras
import matplotlib.pylab as plt
from gensim.models.word2vec import LineSentence

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

from nltk.tokenize import RegexpTokenizer

import csv
import os
import pandas as pd
import numpy as np

## Train Embeddings with FastText

In [None]:
embed_train_file = '../data/tweet_sample_2M_noRT.txt'

In [None]:
embed_train_data = LineSentence(embed_train_file)

In [None]:
model_gensim = FastText(size=45)

In [None]:
# build the vocabulary
model_gensim.build_vocab(embed_train_data)

# train the model
model_gensim.train(embed_train_data, total_examples=model_gensim.corpus_count, epochs=model_gensim.iter)

## Save the model to a file and load it back

In [None]:
# saving a model trained via Gensim's fastText implementation
model_gensim.save('../models/gensim_FT_45_cbow.dat')

In [4]:
fasttext100 = FastText.load('../models/gensim_FT_100_cbow.dat')
print(fasttext100)

FastText(vocab=142971, size=100, alpha=0.025)


### Get the word vector for a word

In [5]:
fasttext100.wv["delta"]



array([-2.6900437 , -0.7829389 ,  0.8260516 , -0.8935343 ,  4.5002947 ,
       -2.0800745 ,  0.8185378 , -2.0441744 , -1.131624  , -3.5746562 ,
       -1.1319908 , -0.55258864,  0.6499821 , -2.4114673 ,  2.1817873 ,
        3.349079  , -0.00708565,  3.561728  , -1.7320576 ,  3.3835554 ,
       -1.682171  , -2.6499684 , -3.4524546 , -1.6946793 ,  0.09061141,
       -2.4246001 , -2.6531866 , -2.423885  ,  2.8988311 , -2.6459887 ,
        3.896811  , -0.50217664,  1.1331049 ,  0.4293008 , -0.69297755,
        1.0952688 ,  3.4780877 , -1.5056956 ,  3.2781224 , -0.8678973 ,
        0.8762853 ,  2.0280821 ,  0.40427354, -2.909961  , -0.37729537,
        2.6488566 ,  2.0457501 ,  0.67427605,  0.1820736 , -2.3562267 ,
       -0.6233044 , -0.4258164 , -2.7493412 , -0.05465397,  1.4000791 ,
        0.8776595 ,  2.2817457 ,  0.24748203,  0.09730937, -2.567825  ,
       -1.403272  , -0.7174882 , -4.0232043 ,  1.8875126 ,  1.2348607 ,
        0.69388485, -1.2996528 ,  0.40212575, -1.057632  , -2.21

## Import data

In [7]:
tweets = pd.read_csv("../data/Marketing Research Labeled Tweets_ - tweet_sample_5k_FULL.csv")

In [8]:
tweets.head(10)

| | label | tweet_text |
|---|---|---|
| 0 | 0 | two airports, one green grass and one sandy co... |
| 1 | 0 | bismillahi majreha wa mursaha inna robbi la gh... |
| 2 | 0 | @americanair i understand |
| 3 | 0 | @jae_nita @delta i'll make it up to you come t... |
| 4 | 1 | @jetblue why are your employees so rude today ... |
| 5 | 0 | @usairways dg: “it is a pic of a woman…&amp; s... |
| 6 | 0 | @icelandair awesome thanks for these recommend... |
| 7 | 0 | @emirates good idea!;) |
| 8 | 0 | @americanair voila! careers site feedback page... |
| 9 | 0 | @airlineflyer @baltiausa should be a relief fo... |


In [9]:
tweets.groupby("label").count()

| label | tweet_text |
|---|---|
| 0 | 3621 |
| 1 | 1338 |
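
The classes are imbalanced, so raw accuracy needs a reference point. As a quick sketch using only the counts reported by the `groupby` output, the accuracy of a trivial classifier that always predicts the majority class can be computed as follows (the variable names are illustrative):

```python
# Label counts copied from the groupby output: 0 = not a complaint, 1 = complaint
label_counts = {0: 3621, 1: 1338}

total = sum(label_counts.values())
majority_label = max(label_counts, key=label_counts.get)
baseline_accuracy = label_counts[majority_label] / total

print("Always predicting %s gives accuracy %.3f" % (majority_label, baseline_accuracy))
```

Any trained model should beat this figure (roughly 0.73) on the validation set before its accuracy can be considered meaningful.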


In [10]:
tokenizer = RegexpTokenizer(r'\w+')
tweets["tokens"] = tweets["tweet_text"].apply(tokenizer.tokenize)
tweets.head()

| | label | tweet_text | tokens |
|---|---|---|---|
| 0 | 0 | two airports, one green grass and one sandy co... | [two, airports, one, green, grass, and, one, s... |
| 1 | 0 | bismillahi majreha wa mursaha inna robbi la gh... | [bismillahi, majreha, wa, mursaha, inna, robbi... |
| 2 | 0 | @americanair i understand | [americanair, i, understand] |
| 3 | 0 | @jae_nita @delta i'll make it up to you come t... | [jae_nita, delta, i, ll, make, it, up, to, you... |
| 4 | 1 | @jetblue why are your employees so rude today ... | [jetblue, why, are, your, employees, so, rude,... |
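
The token lists show that `@` signs disappear and that `i'll` is split into `i` and `ll`. This follows from the `\w+` pattern, which keeps only runs of letters, digits, and underscores. A small sketch with the standard `re` module reproduces the behavior, on the assumption that `RegexpTokenizer` in its default (non-gap) mode returns every match of its pattern:

```python
import re

# Same pattern that is passed to RegexpTokenizer in the cell above
WORD_PATTERN = re.compile(r"\w+")

sample = "@jae_nita @delta i'll make it up"
tokens = WORD_PATTERN.findall(sample)
print(tokens)
# ['jae_nita', 'delta', 'i', 'll', 'make', 'it', 'up']
```

Hashtags lose their `#` in the same way, so `#notimpressed` and `notimpressed` become the same token.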


In [11]:
all_words = [word for tokens in tweets["tokens"] for word in tokens]
sentence_lengths = [len(tokens) for tokens in tweets["tokens"]]
VOCAB = sorted(list(set(all_words)))
print("%s words total, with a vocabulary size of %s" % (len(all_words), len(VOCAB)))
print("Max sentence length is %s" % max(sentence_lengths))

78062 words total, with a vocabulary size of 14514
Max sentence length is 33


## CNN Classifier

In [12]:
EMBEDDING_DIM = 100
MAX_SEQUENCE_LENGTH = 33
VOCAB_SIZE = len(VOCAB)

VALIDATION_SPLIT=.2
tokenizer = Tokenizer(num_words=VOCAB_SIZE)
tokenizer.fit_on_texts(tweets["tweet_text"].tolist())
sequences = tokenizer.texts_to_sequences(tweets["tweet_text"].tolist())

In [13]:
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

cnn_data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
labels = to_categorical(np.asarray(tweets["label"]))

indices = np.arange(cnn_data.shape[0])
np.random.shuffle(indices)
cnn_data = cnn_data[indices]
labels = labels[indices]
num_validation_samples = int(VALIDATION_SPLIT * cnn_data.shape[0])

embedding_weights = np.zeros((len(word_index)+1, EMBEDDING_DIM))
for word,index in word_index.items():
    # Look words up through the .wv attribute; indexing the model object directly is deprecated
    embedding_weights[index,:] = fasttext100.wv[word] if word in fasttext100.wv else np.random.rand(EMBEDDING_DIM)
embedding_weights.shape

Found 14966 unique tokens.




(14967, 100)
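
The embedding matrix has one more row than there are tokens because the Keras `Tokenizer` numbers words from 1, which leaves index 0 free for padding. As a rough pure-Python illustration of what `pad_sequences` does with its default settings (zeros added at the front, over-long sequences cut from the front), consider the sketch below. The helper name `pad_pre` is hypothetical and is not part of Keras.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad (or left-truncate) each sequence to exactly maxlen items.

    Assumes maxlen is a positive integer.
    """
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]  # keep only the last maxlen items
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded


print(pad_pre([[1, 2], [1, 2, 3, 4, 5]], maxlen=4))
# [[0, 0, 1, 2], [2, 3, 4, 5]]
```

Row 0 of `embedding_weights` stays all zeros, so padded positions contribute a zero vector to the convolution inputs.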

In [14]:
from keras.layers import Dense, Input, Flatten, Dropout, Add, Concatenate
from keras.layers import Conv1D, MaxPooling1D, Embedding
from keras.layers import LSTM, Bidirectional
from keras.models import Model

def ConvNet(embeddings, max_sequence_length, num_words, embedding_dim, labels_index, trainable=False, extra_conv=True):
    
    embedding_layer = Embedding(num_words,
                            embedding_dim,
                            weights=[embeddings],
                            input_length=max_sequence_length,
                            trainable=trainable)

    sequence_input = Input(shape=(max_sequence_length,), dtype='int32')
    embedded_sequences = embedding_layer(sequence_input)

    # Yoon Kim model (https://arxiv.org/abs/1408.5882)
    convs = []
    filter_sizes = [3,4,5]

    for filter_size in filter_sizes:
        l_conv = Conv1D(filters=128, kernel_size=filter_size, activation='relu')(embedded_sequences)
        l_pool = MaxPooling1D(pool_size=3)(l_conv)
        convs.append(l_pool)

    l_merge = Concatenate(axis=1)(convs)

    # Alternative branch: a single Conv1D layer with max pooling (not the Yoon Kim model)
    conv = Conv1D(filters=128, kernel_size=3, activation='relu')(embedded_sequences)
    pool = MaxPooling1D(pool_size=3)(conv)

    if extra_conv:
        # Yoon Kim-style branch: concatenated outputs of the 3-, 4-, and 5-wide filters
        x = Dropout(0.5)(l_merge)
    else:
        # Single-convolution branch
        x = Dropout(0.5)(pool)
    x = Flatten()(x)
    x = Dense(128, activation='relu')(x)
    #x = Dropout(0.5)(x)

    preds = Dense(labels_index, activation='softmax')(x)

    model = Model(sequence_input, preds)
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['acc'])

    return model
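
To see how many features reach the `Dense` layer, the tensor shapes can be worked out by hand. This sketch assumes the Keras defaults used in `ConvNet` (`'valid'` padding, convolution stride 1, pooling stride equal to `pool_size`); the helper names are illustrative.

```python
def conv_out_len(length, kernel_size):
    """Output length of a 'valid' 1D convolution with stride 1."""
    return length - kernel_size + 1


def pool_out_len(length, pool_size):
    """Output length of 'valid' max pooling with stride equal to pool_size."""
    return (length - pool_size) // pool_size + 1


MAX_SEQUENCE_LENGTH = 33
FILTERS = 128

# One branch per filter size, each followed by MaxPooling1D(pool_size=3)
branch_lengths = [pool_out_len(conv_out_len(MAX_SEQUENCE_LENGTH, k), 3) for k in (3, 4, 5)]
merged_steps = sum(branch_lengths)  # Concatenate(axis=1) stacks the time steps
flattened_features = merged_steps * FILTERS

print(branch_lengths, merged_steps, flattened_features)
# [10, 10, 9] 29 3712
```

With `extra_conv=False`, only the single 3-wide branch is used, which gives 10 × 128 = 1280 flattened features instead.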

## Training the Neural Network

In [15]:
x_train = cnn_data[:-num_validation_samples]
y_train = labels[:-num_validation_samples]
x_val = cnn_data[-num_validation_samples:]
y_val = labels[-num_validation_samples:]

model = ConvNet(embedding_weights, MAX_SEQUENCE_LENGTH, len(word_index)+1, EMBEDDING_DIM, 
                len(list(tweets["label"].unique())), False)
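
The split takes the last 20% of the shuffled rows as the validation set. A small standard-library sketch of the same shuffle-then-slice idea follows. The function name `tail_split` and the fixed `seed` are additions for reproducibility and are not part of the notebook, which shuffles without a seed.

```python
import random


def tail_split(items, validation_split, seed=0):
    """Shuffle items, then hold out the last fraction as the validation set.

    Assumes the validation set ends up non-empty.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = int(validation_split * len(shuffled))
    return shuffled[:-n_val], shuffled[-n_val:]


train_rows, val_rows = tail_split(range(4959), validation_split=0.2)
print(len(train_rows), len(val_rows))
# 3968 991
```

These sizes match the "Train on 3968 samples, validate on 991 samples" line that Keras prints during training.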

## First Attempt with Embedding Vector Length 50

In [None]:
model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=30, batch_size=200)

## Second Attempt with Embedding Vector Length 100

In [38]:
model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=30, batch_size=200)

Train on 3968 samples, validate on 991 samples
Epoch 1/30
...
Epoch 30/30


<keras.callbacks.History at 0x1a2a9975c0>

In [35]:
class AccuracyHistory(keras.callbacks.Callback):
    """Record the training accuracy reported at the end of each epoch."""
    def on_train_begin(self, logs=None):
        self.acc = []

    def on_epoch_end(self, epoch, logs=None):
        self.acc.append((logs or {}).get('acc'))

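In [45]:
history = AccuracyHistory()
# The callback only records accuracy if it is passed to fit(). Without this step,
# history.acc is never created and the plot below fails with
# AttributeError: 'AccuracyHistory' object has no attribute 'acc'
model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=30, batch_size=200, callbacks=[history])

In [42]:
# Plot one point per recorded epoch instead of a hard-coded range
plt.plot(range(1, len(history.acc) + 1), history.acc)
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.show()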

## Third Attempt with Embedding Vector Length 25 and a Larger Embedding Training Corpus (2M Instead of 1.5M Tweets)

In [None]:
model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=30, batch_size=200)

## Fourth Attempt with Embedding Vector Length 45 and a Larger Embedding Training Corpus (2M Instead of 1.5M Tweets)

In [None]:
model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=30, batch_size=150)

## Load Data and Add Labels
We will use fastText's supervised learning functionality to classify our tweets. fastText expects the training data for a supervised classifier in the following format:

    __label__complaint @jetblue why are your employees so rude today at dallas-fort worth? tons of attitude on simple questions. #notimpressed
    __label__notcomplaint @icelandair awesome thanks for these recommendations! @sarahamil and i are very excited!!

The fastText model trains on a plain-text file with one sentence per line, where each sentence is preceded by a label carrying the ``__label__`` prefix.
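
As a minimal sketch of producing lines in this format from the labeled tweets, the snippet below maps the binary labels to the two label names shown in the example. The helper name `to_fasttext_line` and the sample rows are illustrative, and collapsing whitespace is an assumption made here to keep each tweet on a single line.

```python
# Map the binary labels to the label names used in the format example
LABEL_NAMES = {1: "complaint", 0: "notcomplaint"}


def to_fasttext_line(label, text):
    """Format one (label, tweet) pair as a fastText supervised training line."""
    clean_text = " ".join(text.split())  # newlines inside a tweet would break the format
    return "__label__%s %s" % (LABEL_NAMES[label], clean_text)


sample_rows = [
    (1, "@jetblue why are your employees so rude today"),
    (0, "@emirates good idea!;)"),
]
lines = [to_fasttext_line(label, text) for label, text in sample_rows]
print("\n".join(lines))
```

Writing `lines` to a text file, one entry per line, gives a file in the layout described here.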


## References

1) **Bag of Tricks for Efficient Text Classification** https://arxiv.org/pdf/1607.01759.pdf 

2) **Convolutional Neural Networks for Sentence Classification** https://arxiv.org/abs/1408.5882

3) **How to solve 90% of NLP problems: a step-by-step guide** https://blog.insightdatascience.com/how-to-solve-90-of-nlp-problems-a-step-by-step-guide-fda605278e4e 

4) **fastText Classification** https://github.com/facebookresearch/fastText#text-classification

5) **Gensim Wrapper for fastText** https://radimrehurek.com/gensim/models/fasttext.html