## LSTM

The following notebook contains the implementation for the LSTM network. The code and experiments ran on the selected datasets is shown in the notebook. 

In [1]:
import pandas as pd
import numpy as np
from keras.models import Model, Input
from keras.layers import LSTM, Embedding, Dense, TimeDistributed, Dropout, Bidirectional
from keras.layers.merge import add
from nltk.tokenize import word_tokenize
from random import shuffle
%matplotlib inline
import matplotlib.pyplot as plt
from numpy import zeros
from keras.models import Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from sklearn.model_selection import KFold
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to C:\Users\Meghna
[nltk_data]     Patel\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

The prepare_data function takes the name of the desired input file to be processed, and returns the processed data (X), the processed labels (y), the word-index pairs (word2idx), the length of the longest sentence (maxLen) and total number of unique labels (totalTags). The purpose of this function is to read the input file, and process it for input into the neural network. 

In [2]:
def prepare_data(input_data):
    # read the input data
    if (input_data == 'subjectivity'):
        data = pd.read_csv("data/subjectivity.txt", delimiter = "\t", 
                           header = None, names=['tag','sentence'])
    
    elif (input_data == 'mpqa'):
        data = pd.read_csv("data/mpqa.txt", delimiter = "\t", 
                           header = None, encoding='latin-1',names=['tag','sentence'])
        
    elif (input_data == 'bbc'):
        data = pd.read_csv("data/bbc_text.txt", delimiter = "\t", 
                           header = None, names=['tag','sentence'])
    
    elif (input_data == 'rt-polarity'):
        data = pd.read_csv("data/rt-polarity.txt", delimiter = "\t", 
                           header = None, encoding='latin-1',names=['tag','sentence'])
        
    df = pd.DataFrame(columns = ['Sentence#', 'Word', 'Tag']) # define an empty dataframe 
    
    # tokenize word and pair with its corresonding tag, store this information in a df 
    for id, sent in data.iterrows():
        tokens=[word.lower() for word in nltk.word_tokenize(sent[1])]
        for tk in tokens:
            sid = 'Sentence:'+str(id) 
            new_row = {'Sentence#': sid, 'Word': tk, 'Tag': sent[0]}
            df = df.append(new_row, ignore_index=True)
    
    # build a word to index and tag to index list 
    words = list(set(df['Word'].values))
    words.append('UNK')
    totalWords = len(words)

    tags = list(set(df["Tag"].values))
    totalTags = len(tags)
    agg_func = lambda s: [(w, t) for w, t in zip(s["Word"].values.tolist(),s["Tag"].values.tolist())]
    sentencesData = df.groupby("Sentence#").apply(agg_func)
    sentencesData=[s for s in sentencesData]

    largest_sen = max([len(sen) for sen in sentencesData])
    maxLen = largest_sen
    word2idx = {w: i + 1 for i, w in enumerate(words)}
    tag2idx = {t: i for i, t in enumerate(tags)}
    print(tag2idx)
    
    # make all sentences equal size by adding the UNK token at the end of each sentences whose size is less
    # than maximum sentence lenght
    
    X = [[word2idx[w[0]] for w in s] for s in sentencesData]
    X = pad_sequences(maxlen=maxLen, sequences=X, padding="post", value=word2idx['UNK'])#totalWords)
    Y = [[tag2idx[w[1]] for w in s] for s in sentencesData]
    
    # additional padding specific to the datasets 
    if input_data == 'bbc':
        Y = pad_sequences(maxlen=maxLen, sequences=Y, padding="post", value=tag2idx['sport'])
    elif input_data == 'mpqa':
        Y = pad_sequences(maxlen=maxLen, sequences=Y, padding="post", value=tag2idx[0])
    else:
        Y = pad_sequences(maxlen=maxLen, sequences=Y, padding="post", value=tag2idx['pos'])
    Y = [to_categorical(tagIdx, num_classes=totalTags) for tagIdx in Y]
    y = np.array(Y)
    
    return X, y, word2idx, maxLen, totalTags

The below function builds the embedding matrix using pre-trained Google News vectors

In [3]:
def build_embedding_matrix(word2idx):
    embeddings_index = dict()
    # Reading embedding file
    f = open('GoogleNews-vectors-negative300.bin','rb') # load the pre-trained Google News vectors 
    # iterate the file line by line add words and coefficient accordingly
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:])
        embeddings_index[word] = coefs
    f.close()
    
    vocab_size = len(word2idx) + 1
    embedding_matrix = zeros((vocab_size, 300)) # initialize an empty embedding matrix 
    for word, idx_word in word2idx.items():
        embedding_vector = embeddings_index.get(word.lower()) 
        if embedding_vector is not None:
            embedding_matrix[idx_word] = embedding_vector
    input = Input(shape=(maxLen,))
    
    return embedding_matrix, vocab_size, input

## LSTM model

define_lstm_model is a helper function to define the lstm model. This function takes as input the vocabulary size, the embeddings matrix (defined using pre-trained Google News vectors), the maximum length (input length), total tags, input and data. For the activation function we use softmax activation. We use adam as the optimization algorithm. 

In [4]:
def define_lstm_model(vocab_size, embedding_matrix, maxLen, totalTags, input):
    model = Embedding(vocab_size, 300, weights=[embedding_matrix], input_length=maxLen, trainable=False)(input)  
    model = Bidirectional(LSTM(units=50, return_sequences=True, recurrent_dropout=0.1))(model)  
    out = TimeDistributed(Dense(totalTags, activation="softmax"))(model)
    model = Model(input, out)
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model

run_lstm is a helper function that trains the model using 10-fold cross validation. At each iter, we reserve 10% of the data as validation set. We keep track of the classification accuracy on the test set for each iter and return the mean (average) classification accuracy at the end. 

In [5]:
def run_lstm(model, X, y):
    kf = KFold(n_splits=10)
    kf.get_n_splits(X)
    KFold(n_splits=10, random_state=None, shuffle=False)
    scores = []
    for train_index, test_index in kf.split(X): # 10-fold CV, build train-test set according to split indices   
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        # train the model, keep 10% of the training-set as the validation set 
        history = model.fit(X_train, y_train, batch_size=32, epochs=2, validation_split=0.1, verbose=1)
        scores.append(model.evaluate(X_test,y_test)) # keep track of accuracy for each CV iter 
    return np.mean(scores, axis = 0)[1] # return average (mean) accuracy 

## Run model on datasets

In this section we run our experiements on the selected datasets.

#### Subjectivity dataset

In [6]:
X, Y, word2idx, maxLen, totalTags = prepare_data('subjectivity') # prepare the dataset 
# {'neg': 0, 'pos': 1}

FileNotFoundError: [Errno 2] No such file or directory: 'data/subjectivity.txt'

In [10]:
embedding_matrix, vocab_size, input = build_embedding_matrix(word2idx) # build embedding matrix 

In [10]:
model = define_lstm_model(vocab_size, embedding_matrix, maxLen, totalTags, input)

In [11]:
%%capture
acc = run_lstm(model,X,Y)

In [12]:
print('The mean classification accuracy for the Subjectivity dataset using LSTM is: ' , np.round(acc*100, 2), '%')

The mean classification accuracy for the Subjectivity dataset using LSTM is: 87.66 %


#### rt-polarity dataset

In [13]:
X, Y, word2idx, maxLen, totalTags = prepare_data('rt-polarity') # prepare the dataset 

{'neg': 0, 'pos': 1}


In [14]:
embedding_matrix, vocab_size, input = build_embedding_matrix(word2idx) # build embedding matrix 

In [15]:
model = define_lstm_model(vocab_size, embedding_matrix, maxLen, totalTags, input)

In [16]:
%%capture
acc = run_lstm(model,X,Y)

In [17]:
print('The mean classification accuracy for the rt-polarity dataset using LSTM is: ' , np.round(acc*100, 2), '%')

The mean classification accuracy for the rt-polarity dataset using LSTM is: 79.73%
