## Introduction

In this task, I used the Switchboard Dialog Act Corpus (SwDA) to develop a dialog act classifier that distinguishes three types of dialog act (**starting a conversation, closing a conversation, and requesting information**) from all other types. These three are the dialog acts people most commonly use when interacting with a machine conversation system, so the ability to recognize and distinguish them correctly is crucial for such a system to succeed. This is the motivation for developing a classifier that targets these three types of dialog act.

In the following discussion, I first explain how I obtained and processed the corpus data, then describe the model specification, and finally discuss model performance.

## Data Preparation 

As mentioned, the corpus I used was the Switchboard Dialog Act corpus. However, I did not use the original dataset; I used a reformatted version of the SwDA corpus produced by Sanjay Meena instead [1]. This version provides the conversational transcripts as plain English text, without any NLP notation or markers. Plain English text is much easier to transform into language features than text carrying NLP notation, as the code below shows.

In fact, the only language feature I planned to use was word vectors, which the model's embedding layer generates from integer word indices. The only preprocessing steps I needed were therefore to tokenize the conversational transcripts and to make all conversational parts the same length. The latter was done by padding, i.e., adding empty words to short conversational parts until they all reach a common fixed length (`MAX_TEXT_LENGTH`, set to 80 tokens in the code below).
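
To make the padding step concrete, here is a minimal sketch in plain NumPy. It assumes the default behavior of Keras's `pad_sequences` (zeros added at the front, and over-long sequences truncated from the front); the helper name `pad_to_length` and the toy sequences are hypothetical and only serve as illustration.

```python
import numpy as np

def pad_to_length(seqs, maxlen, pad_value=0):
    """Left-pad (or left-truncate) integer sequences to a fixed length."""
    out = np.full((len(seqs), maxlen), pad_value, dtype=np.int64)
    for row, seq in enumerate(seqs):
        if not seq:
            continue  # an empty sequence stays all padding
        trimmed = seq[-maxlen:]            # keep only the last maxlen tokens
        out[row, -len(trimmed):] = trimmed  # right-align the tokens
    return out

# Hypothetical token-index sequences of different lengths.
toy = [[5, 2], [7, 1, 9, 4, 3], []]
padded = pad_to_length(toy, maxlen=4)
print(padded)
```

The index 0 is reserved for the "empty word", which is why real word indices produced by a tokenizer should start at 1.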

After tokenization and padding, I paired each conversational part with its corresponding dialog act tag: one of starting conversation, info request, closing conversation, and **other dialog act**, labeled 0, 1, 2, and 3 respectively (see [2] for the tag types). The training, validation, and test sets made up 72%, 8%, and 20% of the data, respectively.


In [170]:
import random
import re
import numpy as np
import pandas as pd
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from sklearn.preprocessing import OneHotEncoder

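def recode_tag(tag):
    """
    Recode a SwDA act tag into one of four classes.
    0 = greeting (Conventional-opening, fp)
    1 = info request
    2 = goodbye (Conventional-closing, fc)
    3 = other tags
    """
    # Question tags (qy, qw, qo, qr, qrr) and the ^d / ^g question modifiers.
    if re.search(r'(qy|qw|qo|qr|qrr|\^d|\^g)', tag):
        return 1
    # In the SWBD-DAMSL tag set, fp is conventional-opening and fc is conventional-closing.
    elif re.search('fp', tag):
        return 0
    elif re.search('fc', tag):
        return 2
    else:
        return 3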

# Load only the columns needed and drop rows with missing values.
corpus = pd.read_csv('switchboard_complete.csv', usecols=['caller', 'clean_text', 'act_tag', 'act_label_1'])
corpus = corpus.dropna().reset_index()

DICT_SIZE = 20000        # vocabulary size kept by the tokenizer
MAX_TEXT_LENGTH = 80     # every sequence is padded or truncated to this many tokens

# Map words to integer indices, then pad each sequence to MAX_TEXT_LENGTH.
tokenizer = Tokenizer(num_words=DICT_SIZE)
tokenizer.fit_on_texts(corpus['clean_text'])
corpus['text'] = tokenizer.texts_to_sequences(corpus['clean_text'])
corpus['text'] = list(sequence.pad_sequences(corpus['text'], maxlen=MAX_TEXT_LENGTH))
corpus['tag'] = corpus['act_tag'].apply(recode_tag)

In [211]:
# split data: 80% train+validation vs. 20% test, then 10% of the former for validation
all_idx = list(range(len(corpus)))
train_idx = random.sample(all_idx, k=int(np.ceil(len(corpus) * 0.8)))
test_idx = sorted(set(all_idx).difference(train_idx))

val_idx = random.sample(train_idx, k=int(np.ceil(len(train_idx) * 0.1)))
train_idx = sorted(set(train_idx).difference(val_idx))

def split_xy(data, idx):
    # x: (n_samples, MAX_TEXT_LENGTH) matrix of padded word indices
    x = np.stack(data.loc[idx, 'text'].values, axis=0)
    # y: one-hot labels; fixing the categories keeps 4 columns in every split
    encoder = OneHotEncoder(categories=[[0, 1, 2, 3]])
    y = encoder.fit_transform(np.reshape(data.loc[idx, 'tag'].values, [-1, 1])).toarray()
    return x, y

train_x, train_y = split_xy(corpus, train_idx)
val_x, val_y = split_xy(corpus, val_idx)
test_x, test_y = split_xy(corpus, test_idx)
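
The 72% / 8% / 20% proportions follow from the two-stage split: 80% of the data is set aside for training and validation, and 10% of that portion (0.8 × 0.1 = 8% of the total) becomes the validation set, leaving 0.8 × 0.9 = 72% for training. The sketch below reproduces this arithmetic on a hypothetical dataset of 1000 items; the function name `split_indices` and the fixed seed are assumptions made for illustration, not part of the original notebook.

```python
import math
import random

def split_indices(n, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle indices 0..n-1 and cut them into train / validation / test."""
    rng = random.Random(seed)           # local generator, reproducible
    idx = list(range(n))
    rng.shuffle(idx)
    n_trainval = math.ceil(n * train_frac)       # train + validation
    n_val = math.ceil(n_trainval * val_frac)     # validation share of that
    val = idx[:n_val]
    train = idx[n_val:n_trainval]
    test = idx[n_trainval:]
    return train, val, test

train, val, test = split_indices(1000)
print(len(train), len(val), len(test))
```

Seeding the generator is a design choice that makes the split repeatable across runs, which helps when comparing models on the same test set.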



## Model Specification

Since conversational parts vary in length, it is natural to use a sequence model for this kind of data. Here I specify an LSTM neural network with 128 hidden units as my model. The inputs to the LSTM are word vectors of fixed length 32, produced by passing the words of each conversational part through an embedding layer. The output is the probability that the conversational part belongs to each of the four dialog act classes. The other training parameters were set to commonly used values.

In [215]:
from keras.models import Sequential
from keras.layers import LSTM, Dense, Embedding
from keras.wrappers.scikit_learn import KerasClassifier


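def build_model(dict_size, embedding_size, text_length, num_class):
    model = Sequential()
    # Map each word index to a trainable vector of size embedding_size.
    model.add(Embedding(dict_size, embedding_size, input_length=text_length))
    # Keep only the final LSTM state as the summary of the whole sequence.
    model.add(LSTM(128, return_sequences=False))
    # One probability per dialog act class.
    model.add(Dense(num_class, activation='softmax'))
    # Categorical cross-entropy matches a softmax output over one-hot labels.
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model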

model_params = {
    # model spec
    'build_fn': build_model,
    'dict_size': DICT_SIZE,
    'embedding_size': 32,
    'text_length': MAX_TEXT_LENGTH,
    'num_class':4,    
    # training spec 
    'epochs': 3,
    'batch_size': 64,
    'verbose': 1,
    'validation_data': (val_x, val_y),
    'shuffle': True
}

mdl = KerasClassifier(**model_params)
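
As a rough sanity check on the architecture above, the number of trainable parameters can be worked out by hand. The sketch below assumes the standard LSTM formulation (four gates, each with input weights, recurrent weights, and one bias vector); the helper names are hypothetical, and Keras's `model.summary()` reports the actual counts for a built model.

```python
def embedding_params(vocab_size, embedding_size):
    # One embedding vector per vocabulary entry.
    return vocab_size * embedding_size

def lstm_params(input_size, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * (units * (input_size + units) + units)

def dense_params(input_size, num_class):
    # Weight matrix plus one bias per output class.
    return input_size * num_class + num_class

# Values taken from the model specification: 20000-word dictionary,
# 32-dimensional embeddings, 128 LSTM units, 4 output classes.
counts = {
    'embedding': embedding_params(20000, 32),
    'lstm': lstm_params(32, 128),
    'dense': dense_params(128, 4),
}
total = sum(counts.values())
print(counts, total)
```

Under these assumptions, most of the parameters sit in the embedding layer, so the dictionary size is the main lever on model size.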

## Result and Discussion

From the cells below, we can see that the LSTM classifier reached 98.65% accuracy on the training data, 98.55% on the validation data, and 98.52% on the test data. People use very specific words and phrases to express these three kinds of dialog act, for example "hello" to open a conversation, "see you" to close one, and "What", "Why", or "Where" for info requests, so it is not surprising that a barely tuned model can achieve such high accuracy.

In [213]:
hist = mdl.fit(train_x, train_y)

Train on 155905 samples, validate on 17323 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


In [214]:
mdl.score(test_x, test_y)



0.9851694914978966
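
A single accuracy figure does not show how well each of the four classes is recognized, so a per-class breakdown can be a useful complement. The sketch below is one way to compute a confusion matrix and per-class recall with NumPy. The labels are hypothetical toy values, not results from this model, and the function names are assumptions made for illustration.

```python
import numpy as np

def confusion_matrix(true_labels, pred_labels, num_class):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((num_class, num_class), dtype=np.int64)
    for t, p in zip(true_labels, pred_labels):
        cm[t, p] += 1
    return cm

def per_class_recall(cm):
    """Fraction of each true class that was predicted correctly."""
    support = cm.sum(axis=1)
    # Classes with no true examples keep a recall of 0 instead of dividing by zero.
    return np.divide(np.diag(cm), support,
                     out=np.zeros(len(cm)), where=support > 0)

# Hypothetical toy labels: mostly class 3 ("other"), predicted as class 3 throughout.
y_true = np.array([3, 3, 3, 3, 3, 3, 3, 3, 1, 0])
y_pred = np.array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3])

cm = confusion_matrix(y_true, y_pred, num_class=4)
accuracy = np.trace(cm) / cm.sum()
recall = per_class_recall(cm)
print(accuracy, recall)
```

In this toy case the overall accuracy is 0.8 even though classes 0 and 1 are never recognized. For the real data, integer labels can be recovered from the one-hot `test_y` with `test_y.argmax(axis=1)`.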

## References

[1] http://sanjaymeena.io/tech/nlp/Simplified-Switchboard-Corpus/
[2] https://web.stanford.edu/~jurafsky/ws97/manual.august1.html