# Dakota Murray
## Applying ML to CL
### Assignment 2


In this assignment, I use Keras to implement an artificial neural network for Parts of Speech (PoS) tagging and evaluate the result. I use data from the Penn treebank corpus in the NLTK library. I draw heavily from a tutorial at the following link, but make alterations to the code in the hopes of improving performance and better understanding the methodology. 

https://becominghuman.ai/part-of-speech-tagging-tutorial-with-the-keras-deep-learning-library-d7f93fa05537


I use a standard **sequential Neural Network architecture**, rather than something fancy like a recurrent neural net. I feal that, for a first attempt at implementing a PoS tagger that this basic architecture is preferable. 

It seemed that the Schmitt paper only used the term itself and its suffix to determine the parts of speech. I used these features, but also considered many more. 

- The term itself
- Number of terms in the sentence
- Whether the term was the first in the sentence
- Whether the term was the last in the sentence
- Whether the term was capitalized (i.e.: bill vs. Bill)
- Whether the word was all capitalized (i.e.: HELLO)
- Whether the word all lowercase (i.e.: hello)
- The 1, 2, 3, 4, and 5-character prefix (if long enough) was available
- The 1, 2, 3, 4, and 5-character suffic (if long enough) was available
- Whether the term was all alphanumeric (i.e.: had no puntuation)
- The token before the term (if not the first term)
- The token that appears after the term (if not the last term)
- Whether the term has a hyphen (as in "-")
- Whether the term has a period (e.g., U.S.A, Dr.)
- Whether the term has an aporstrophe (e.g., "Jane Smith's")
- whether the term was entirely numeric
- Whether the term had any capitalization inside the text (e.g., PhD)


Overall, I have included way more features than those used in the Schmitt paper. I am hoping that these new features will mosly help in certain special cases which Schmitt originally didn't especially account for, such as apostrophes denoting proper nouns.

I experimneted with many different settings for the neural network classifier. The network I settled on had an input layer, an output layer, and a single hidden layer. I experimented with adding a second hidden layer, but didn't notice any increase in efficiency and so it wasn't worth the training slowdown. I used 512 nodes in the hidden layer; this is what the tutorial used and I didn't notice much difference using other values, and so I kept this as is. I was worried about overfitting, and so I set the "dropout" pretty high. Dropout is a technique where some portion of nodes are randomly ignored during each stage of training—this helps distribute waeights to other nodes and makes the architecture less likely to overfit. Rather than sigmoud activation that was used in Schmitt's paper, I use relu activiation, mostly because this seems to be the most-oftenly used in the tutorials and descriptions I read through.

For evaluation, I used standard accuracy, as this is what was used in the Schmitt paper and so allows my results to be comparable. 

My final result is an accuracy on the test set of about 96.3%, which is comparable to that in the Schmitt paper. I guess that this means that all of my extra features didn't really help much, and that really the term + suffix is the most important feature. 

In [90]:
import re

def sentence_to_features(sentence_terms, index):
    """ 
        This function takes as input a list of words 
        that represent a sentence, and outputs a 
        dictionary contianing the features to use for 
        parts of speech tagging
    """
    term = sentence_terms[index]
    return {
        'nb_terms': len(sentence_terms),
        'term': term,
        'is_first': index == 0,
        'is_last': index == len(sentence_terms) - 1,
        'is_capitalized': term[0].upper() == term[0],
        'is_all_caps': term.upper() == term,
        'is_all_lower': term.lower() == term,
        'prefix-1': term[0],
        'prefix-2': term[:2],
        'prefix-3': term[:3],
        'prefix-4': term[:4],
        'prefix-5': term[:5],
        'suffix-1': term[-1],
        'suffix-2': term[-2:],
        'suffix-3': term[-3:],
        'suffix-4': term[-4:],
        'suffix-5': term[-5:],
        'is_alphanumeric': int(bool((re.match('^(?=.*[0-9]$)(?=.*[a-zA-Z])',sentence_terms[index])))),
        'prev_word': '' if index == 0 else sentence_terms[index - 1],
        'next_word': '' if index == len(sentence_terms) - 1 else sentence_terms[index + 1],
        'has_hyphen': "'" in sentence_terms[index],
        'has_period': "." in sentence_terms[index],
        'has_apost': term.replace("'", "") != term,
        'is_numeric': sentence_terms[index].isdigit(),
        'capitals_inside': sentence_terms[index][1:].lower() != sentence_terms[index][1:]
    }


def untag(tagged_sentence):
    """ 
    Helper function to remove the tag for each tagged term.
    """
    return [w for w, _ in tagged_sentence]


def corpus_to_data(tagged_sentences):
    """
    Given a corpus of PoS tagged sentences, produce a feature dataset
    with associated features, X, and output data, y
    """
    X, y = [], []

    # iterate through every sentence and produce features
    for pos_tags in tagged_sentences:
        for index, (term, tag) in enumerate(pos_tags):
            # Create the features
            X.append(sentence_to_features(untag(pos_tags), index))
            y.append(tag)
            
    return X, y


First, load the data and produce a training, testing, and validation corpuses. 

The training data is used to train the neural network model. I am using 80% of the total tagged dataset in order to train the model. 

I use the remining 20% of the dataset as the testing data, which I use to produce an accuracy score at the end. 

The validation data is used during training. Basically, this validation data is used to calculate loss of the model and tune the neural net weights. I use 25% of the training data as the validation set, which is then excluded from the remainder of the training data. 

In [91]:
# Split the dataset for training and testing
from nltk.corpus import treebank
sentences = treebank.tagged_sents(tagset='universal')

train_test_cutoff = int(.80 * len(sentences)) 
training_sentences = sentences[:train_test_cutoff]
testing_sentences = sentences[train_test_cutoff:]

train_val_cutoff = int(.25 * len(training_sentences))
validation_sentences = training_sentences[:train_val_cutoff]
training_sentences = training_sentences[train_val_cutoff:]

# Build the feature sets for each partition
X_train, y_train = corpus_to_data(training_sentences)
X_test, y_test = corpus_to_data(testing_sentences)
X_val, y_val = corpus_to_data(testing_sentences)

Keras only likes its data to be a numeric, vectorized form. However, up until now, the features I created are in the form of a dictionary of numbers and tokens. I need a way to turn these dictionaries into vectors.

Fortunately, sklearn comes with some pretty great tools for vectorization. using sklearn's DictVectorizer, I produced a vectorspace that can fit the entire training and testing sets. The vectorizer is fit using all of the data to ensure that they share a common vector space. 

In [93]:
# Next we need to vectorize our inputs...
from sklearn.feature_extraction import DictVectorizer

# Fit our DictVectorizer with our set of features
dict_vectorizer = DictVectorizer(sparse=False)
dict_vectorizer.fit(X_train + X_test + X_val)

# Convert dict features to vectors
X_train = dict_vectorizer.transform(X_train)
X_test = dict_vectorizer.transform(X_test)
X_val = dict_vectorizer.transform(X_val)

We can have a look at one of these, and we see that it is just a big numeric matrix

In [112]:
X_test

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

Similarly, we also need to encode the labels (the classification tags). This is done using one-hot encoding. There are 12 total tags, so each vector is of size 12. Fortuantely, sklearn also has a good function for doing this.

In [94]:
# Now we encode ouor output vector, y
from sklearn.preprocessing import LabelEncoder

# Fit LabelEncoder with our list of classes
label_encoder = LabelEncoder()
label_encoder.fit(y_train + y_test + y_val)

# Encode class values as integers
y_train = label_encoder.transform(y_train)
y_test = label_encoder.transform(y_test)
y_val = label_encoder.transform(y_val)

We then use a Keras utility in order create the dummy variables for the classifications

In [95]:
# Convert integers to dummy variables (one hot encoded)
# Use keras module to make it happen
from keras.utils import np_utils

y_train = np_utils.to_categorical(y_train)
y_test = np_utils.to_categorical(y_test)
y_val = np_utils.to_categorical(y_val)

Again, we can look at the data and see a series of numeric vectors with a single "1" and filled with 0

In [117]:
y_test

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       ...,
       [0., 1., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.]], dtype=float32)

Here, a function is created which will create the keras model given a set of input parameters. This function is used in the actual model building below. This is also where I define the architecture of the Keras model classifeir. As stated above, I use a model with a single hidden layer, and use relu activation for each layer, and softmax for the final output layer. 

In [101]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation

def build_model(input_dim, hidden_neurons, output_dim):
    """
    This function takes a set of arguments as input and outputs the compuled 
    (but not trained) Keras model. 
    """
    model = Sequential([
        Dense(hidden_neurons, input_dim=input_dim),
        Activation('relu'),
        Dropout(0.40),
        Dense(hidden_neurons),
        Activation('relu'),
        Dropout(0.40),
        Dense(output_dim, activation='softmax')
    ])

    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

Here, I format a set of data that is used as the model parameters for the Keras classifier. I tried experimenting with several different parameters, but they had a trivial impact on the final accuracy, so I mosly settled on what was provided by the main tutorial I was following. 

In [105]:
from keras.wrappers.scikit_learn import KerasClassifier

model_params = {
    'build_fn': build_model,
    'input_dim': X_train.shape[1],
    'hidden_neurons': 512,
    'output_dim': y_train.shape[1],
    'epochs': 5,
    'batch_size': 256,
    'verbose': 1,
    'validation_data': (X_val, y_val),
    'shuffle': True
}

clf = KerasClassifier(**model_params)

Here I actually do the training. The history of the training is shown—it seems like the accuracy mosly levels off after the second epoch of training. 

In [106]:
hist = clf.fit(X_train, y_train)

Train on 61107 samples, validate on 20039 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Finally, I print the accuracy of the model, which is just the accuracy of the trained model's performance at classifying the test data. The final score is very similar to that reported by the Schmitt paper, so all this facny training didn't seem to improve that much over the basic original.

In [107]:
score = clf.score(X_test, y_test)
print(score)

0.9633215230300913
