# Dakota Murray
## Applying ML to CL
### Assignment 2 Part 2

In this assignment, I chose to attempt to classify the Switchboard Dialogue Act Corpus into three categories: "question", "answer", and "statement". I manually aggregated dialogue acts into one of three categories, though some dialogue acts did not fit easily onto one, and as such were classifeid into "other" and were removed from further training/testing. 

The only features that I used for this classifier were the stemmed tokens from the utterance itself. Becuse utterances were short, I chose not to use parts-of-speech tagged tokens, as this would have made the resulting feature space much too sparse. Accuracy would likely be improved if additional features were included, such as the length of the utterance, its index in the transcript, and the characteristics of the speaker (education, sex, etc.), but I do not include these features here. 

Once the features are vectorized (using a basic bag-of-words model), the training and evaluation process follow that of the parts-of-speech tagger almost identically. 

The final accuracy is about 93%, which seems fairly good. However, examining the confusion matrix, its clear from the confuion matrix that this is perhaps misleading–the vast majority of labels are under the "statement" category, so the accuracy is dominated by performance on this one category. Still, such a classifeir could prove useful for high-level classification of dialogue acts, and with some more tweaking (both of the categories and of the features) I see no reason why this cannot perform better. 

I am using a slightly different version of the Switchboard corpus code which works better with python 3. This code is stored on the github page linked below,

https://github.com/cgpotts/swda


This first bit of code below defines some helper functions for use elsewhere in the notebook. 

In [1]:
def full_to_simple_tag(tag):
    """
    This fuction takes as input a single tag as that defined for the Switchboard 
    Dialogue Act corpus, and returns the custom aggregated label
    """
    if tag in ['^g', 'qh', 'qo', 'bh', 'qy^d', 'qw', 'qy', '^g', 'qw^d']:
        return('question')
    elif tag in ['aa', 'ny', 'nn', 'na', '^h', 'ng', 'no', 'arp_nd', 'ar', 'aap_am']:
        return('answer')
    elif tag in ['sd', 'sv', 'ba', 'fc', 'bk', 'h', 'fo_o_fw_by_bc', '^q', 'bf', 'ad', 'b^m', 'br', 'fp', 'qrr', 'oo_co_cc', 'fa', 'ft']:
        return('statement')
    else:
        return('other')
    

def corpus_to_data(utterances):
    """
    Iterates through the utterances and builds the feature and label sets
    """
    X, y = [], []

    for utt in utterances:
        tag = full_to_simple_tag(utt.act_tag)
        if tag != "other":
            X.append(utt.text)
            y.append(tag)
        
    return X, y
    

This code uses the functions that came along with the Switchboard Dialogue Act corpus in order to load the data and create a list of utterances. 

In [2]:
from swda import CorpusReader
corpus = CorpusReader('swda')

utterances = []
# consider Question, Answer, or Statement
for trans in corpus.iter_transcripts():
    for utt in trans.utterances:
        utterances.append(utt)

transcript 1155


With the utterances loaded, I then use the helper functions I creaed in the first cell in order to construct the training, testing, and valiation corpuses. 

In [3]:
train_test_cutoff = int(.80 * len(utterances)) 
training_utt = utterances[:train_test_cutoff]
testing_utt = utterances[train_test_cutoff:]

train_val_cutoff = int(.25 * len(training_utt))
validation_utt = training_utt[:train_val_cutoff]
training_utt = training_utt[train_val_cutoff:]

# Build the feature sets for each partition
X_train, y_train = corpus_to_data(training_utt)
X_test, y_test = corpus_to_data(testing_utt)
X_val, y_val = corpus_to_data(testing_utt)

Whereas I used the dictionary vectorizer for the parts of speech tagger, here I instead use a Count Vectorizer, which simply tokenizes a sentence, stems the tokens, and counts them. The result is a simple bag-of-words model. As before, I train the vectorizer on all teh training data in order to create a common features space.

In [4]:
# Next we need to vectorize our inputs...
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.snowball import EnglishStemmer

stemmer = EnglishStemmer()
analyzer = CountVectorizer().build_analyzer()

def stemmed_words(doc):
    return (stemmer.stem(w) for w in analyzer(doc))


# Fit our DictVectorizer with our set of features
dict_vectorizer = CountVectorizer(analyzer=stemmed_words)

dict_vectorizer.fit(X_train + X_test + X_val)

# Convert dict features to vectors
X_train = dict_vectorizer.transform(X_train)
X_test = dict_vectorizer.transform(X_test)
X_val = dict_vectorizer.transform(X_val)

As before, I encode the labels using the sklearn LabelEncoder package. This code is identical to the parts-of-speech tagger. I also use the np_utils functions from the Keras library in order to convert the label vecors into the categoricals expected by the Keras Classifier. 

In [5]:
# Now we encode ouor output vector, y
from sklearn.preprocessing import LabelEncoder

# Fit LabelEncoder with our list of classes
label_encoder = LabelEncoder()
label_encoder.fit(y_train + y_test + y_val)

# Encode class values as integers
y_train = label_encoder.transform(y_train)
y_test = label_encoder.transform(y_test)
y_val = label_encoder.transform(y_val)

# Convert integers to dummy variables (one hot encoded)
# Use keras module to make it happen
from keras.utils import np_utils

y_train = np_utils.to_categorical(y_train)
y_test = np_utils.to_categorical(y_test)
y_val = np_utils.to_categorical(y_val)

Using TensorFlow backend.


From here, I now use the same exact Keras setup as before. I reduce the number of neurons somewhat, as this is a less complex problem. I again use a `relu` activation, and again use sogtmax to produce the final bounded output. 

In [6]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation

def build_model(input_dim, hidden_neurons, output_dim):
    """
    This function takes a set of arguments as input and outputs the compuled 
    (but not trained) Keras model. 
    """
    model = Sequential([
        Dense(hidden_neurons, input_dim=input_dim),
        Activation('relu'),
        Dropout(0.40),
        Dense(hidden_neurons),
        Activation('relu'),
        Dropout(0.40),
        Dense(output_dim, activation='softmax')
    ])

    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

from keras.wrappers.scikit_learn import KerasClassifier

model_params = {
    'build_fn': build_model,
    'input_dim': X_train.shape[1],
    'hidden_neurons': 256,
    'output_dim': y_train.shape[1],
    'epochs': 5,
    'batch_size': 128,
    'verbose': 1,
    'validation_data': (X_val, y_val),
    'shuffle': True
}

clf = KerasClassifier(**model_params)

Finally, I train the model using the training data. 

In [7]:
hist = clf.fit(X_train, y_train)

Instructions for updating:
Colocations handled automatically by placer.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
Instructions for updating:
Use tf.cast instead.
Train on 81755 samples, validate on 27391 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Here, I produce a raw accuracy score of the data. We get a reasonable score of 93 percent, which seems pretty good for a natural language classification with short utterances.

In [8]:
score = clf.score(X_test, y_test)
print(score)

0.9288087328824864


However, upon closer look at the classification, we see that much of this accuracy comes from classification of the "statemenet" instances—as these comprised the majority of the dataset. Indeex, a classifier that always returned "statemenet" would also have pretty good prformance. Still, the other categories had more correct than incorrect assignments and the classifier did better than chance—I would still call this successful, though more tweaking and work could definintely improve it. 

In [9]:
from sklearn.metrics import confusion_matrix
y_pred = clf.predict(X_test)
confusion_matrix(y_test.argmax(axis=1), y_pred)



array([[ 3228,    11,   586],
       [   49,   942,   842],
       [  295,   167, 21271]])