### HW2: Task 2 - Dialogue act detection

#### Khandokar Md. Nayem (knayem) 

#### Approach:
We classify text with RNNs to identify the goal, or dialogue act, of an utterance in a conversation. As the dataset, we use the Switchboard Dialog Act Corpus. We generate a sub-corpus from it by extracting only the sentences that represent `greeting`, `goodbye`, and `request`. We saved the sub-corpus in a file named "swda_small.txt". This text file is attached.  

We preprocess each sentence by removing punctuation and lowercasing each word. Then we take the stem of each word and create a token list for each sentence. We create a word embedding for each word using Word2Vec, which learns from the window of words surrounding each word (the N-gram context). It gives us a vector of length 100 for each word.  
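
The preprocessing steps above can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the notebook's exact function: the name `preprocess` and the `stem` parameter are assumptions made here, and the identity function stands in for NLTK's `LancasterStemmer`. Unlike the notebook version, it does not strip a leading class label from the line.

```python
import string

def preprocess(utterance, stem=lambda w: w):
    """Tokenize an utterance roughly the way the report describes.

    `stem` is a placeholder: the notebook uses NLTK's LancasterStemmer,
    while the identity function here keeps the sketch dependency-free.
    """
    tokens = utterance.split()
    # Remove punctuation characters inside each token.
    tokens = ["".join(c for c in t if c not in string.punctuation) for t in tokens]
    # Lowercase and drop tokens that became empty.
    tokens = [t.lower() for t in tokens if t]
    # Keep one-letter tokens only when they are the words "i" or "a".
    tokens = [t for t in tokens if len(t) > 1 or t in ("i", "a")]
    return [stem(t) for t in tokens]

print(preprocess("Well, I guess that's it. Bye-bye!"))
# ['well', 'i', 'guess', 'thats', 'it', 'byebye']
```

Passing a real stemmer, for example `stem=LancasterStemmer().stem`, would bring the sketch closer to the notebook's behavior.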

We randomly assign about 80% of the dataset to the training set and the remaining 20% to the test set. The output labels are one-hot encoded.
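
As an illustration of the split and the label encoding, here is a small pure-Python sketch. The helper names `one_hot` and `split_train_test`, the fixed seed, and the toy samples are assumptions for this example. The notebook instead draws train or test independently for each line with probabilities 0.8 and 0.2, so its split sizes vary from run to run.

```python
import random

CLASSES = ["greeting", "goodbye", "request"]

def one_hot(label, classes=CLASSES):
    """Return a list with 1.0 at the label's index and 0.0 elsewhere."""
    return [1.0 if c == label else 0.0 for c in classes]

def split_train_test(samples, test_fraction=0.2, seed=0):
    """Shuffle with a fixed seed and reserve `test_fraction` of the samples for testing."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical toy data: (label, tokens) pairs.
samples = [("greeting", ["hello"]), ("goodbye", ["bye"]), ("request", ["pleas", "go"]),
           ("greeting", ["hi"]), ("goodbye", ["see", "you"])]
train, test = split_train_test(samples)
print(len(train), len(test))   # 4 1
print(one_hot("goodbye"))      # [0.0, 1.0, 0.0]
```

Fixing the seed makes the split reproducible, which helps when comparing two models on the same test set.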

We use the Keras Sequential model for this task. To classify the dialogues, we experimented with two different models. The first is a single LSTM layer fed with the pre-trained word embeddings. We use a dropout of 0.0 and last pooling (only the final hidden state is passed on) for our network. Then we add a dense output layer with sigmoid as the activation function, as we are trying to classify the dialogue acts. We use cross-entropy as the loss function and Adam as the optimizer.
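
To make the loss concrete, the following sketch computes categorical cross-entropy for a single one-hot label. The logits are made-up numbers, and the function names are chosen here for illustration. It is a plain re-derivation of the formula, not Keras's internal implementation.

```python
import math

def sigmoid(x):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def categorical_cross_entropy(one_hot_label, probs, eps=1e-7):
    """-sum(y * log(p)); only the true class contributes for a one-hot label."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(one_hot_label, probs))

logits = [2.0, -1.0, 0.5]               # hypothetical pre-activation outputs
scores = [sigmoid(z) for z in logits]   # independent per-class scores in (0, 1)
label = [1.0, 0.0, 0.0]                 # one-hot "greeting"
loss = categorical_cross_entropy(label, scores)
print(round(loss, 4))                   # 0.1269
```

Sigmoid scores are computed independently for each class and need not sum to 1, whereas a softmax output would be normalized across the three classes.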

The second is a single GRU (Gated Recurrent Unit) layer, wrapped in a bidirectional layer in the code, fed with the same pre-trained word embeddings. Like Hamed et al., we use a dropout of 0.2 and last pooling for our network. Then we add a dense output layer with sigmoid activation, again with cross-entropy as the loss function and Adam as the optimizer.

While training each model, we hold out 5% of the training set as a validation set. We train for 100 epochs with a batch size of 64. 
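
The size of the validation set follows from simple arithmetic. Keras's `validation_split` holds out the last fraction of the training arrays, before any shuffling. The sketch below reproduces the resulting sizes for the 3144 training sentences reported in the notebook's shape printout. The helper name `validation_split_sizes` is an assumption for this example.

```python
def validation_split_sizes(n_samples, validation_split):
    """Sizes of the (train, validation) parts when the last fraction is held out."""
    split_at = int(n_samples * (1.0 - validation_split))
    return split_at, n_samples - split_at

print(validation_split_sizes(3144, 0.05))  # (2986, 158)
```

These numbers match the "Train on 2986 samples, validate on 158 samples" line in the training output.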

#### Accuracy:
After evaluating the models on the test set, we found an accuracy of 78.51% in both cases. This is almost the same as the accuracy reported in the paper.
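
Since both models report exactly the same accuracy, it can be useful to compare that number against a trivial baseline that always predicts the most frequent class. The sketch below shows one way to compute such a baseline. The label list is made up and is not the real test-set distribution, and `majority_baseline_accuracy` is a name chosen here.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Hypothetical label list, NOT the real test-set distribution.
toy_labels = ["greeting"] * 2 + ["goodbye"] * 7 + ["request"] * 1
print(majority_baseline_accuracy(toy_labels))   # 0.7

# 78.51% on the 777 test sentences corresponds to 610 correct predictions.
print(round(100 * 610 / 777, 2))                # 78.51
```

Running the function on the actual test labels would show how far the reported accuracy lies above the majority-class rate.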

#### Ideas for improvement:
Here we used a small subset of SwDA and considered only 3 dialogue acts, so our dataset is pretty small. A larger dataset could significantly improve the result. Also, we only use Word2Vec's N-gram context as the vectorization scheme. Other linguistic features, such as dependency trees and lemmas, could be useful for improving the accuracy. In the future, we will incorporate these features into the vectorization scheme and investigate them.

In [19]:
from __future__ import print_function

import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"   # see issue #152
os.environ["CUDA_VISIBLE_DEVICES"]="1"

from keras.layers import Input, LSTM, GRU, Dense, Dropout, Bidirectional
from keras.models import Model, Sequential
from keras.utils.np_utils import to_categorical

import numpy as np

import csv

import gensim
import nltk
from nltk.stem.lancaster import LancasterStemmer
import string

### File definitions

In [2]:
stemmer = LancasterStemmer()
ignore_words = []

# Set the directory you want to start from
rootDir = './swda'

file_name_sm = './swda_small.txt'

import tensorflow as tf
from keras.backend.tensorflow_backend import set_session
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.33
set_session(tf.Session(config=config))

### Create Train File for 3 classes 

In [3]:
# 'w' creates the file if it is missing and truncates it otherwise,
# so re-running this cell never duplicates lines in the sub-corpus.
with open(file_name_sm, 'w') as train_file:
    
    for dirName, subdirList, fileList in os.walk(rootDir):
        print('Found directory:', dirName)
        for fname in fileList:
            f=os.path.join(dirName,fname)

            with open(f, 'r') as csvfile:
                spamreader = csv.reader(csvfile)
                for row in spamreader:
                    # row[4] holds the dialogue-act tag and row[8] the utterance text.
                    # greeting: open question ('qo') or conventional opening ('fp')
                    if row[4]=='qo' or row[4]=='fp':
                        train_file.write(' '.join(['greeting',row[8],'\n'] ))

                    # goodbye: conventional closing ('fc')
                    elif row[4]=='fc':
                        train_file.write(' '.join(['goodbye',row[8],'\n'] ))

                    # request: action directive ('ad')
                    elif row[4]=='ad':
                        train_file.write(' '.join(['request',row[8],'\n'] ))
    

Found directory: ./swda
Found directory: ./swda/sw13utt
Found directory: ./swda/sw06utt
Found directory: ./swda/sw11utt
Found directory: ./swda/sw01utt
Found directory: ./swda/sw03utt
Found directory: ./swda/sw00utt
Found directory: ./swda/sw02utt
Found directory: ./swda/sw07utt
Found directory: ./swda/sw12utt
Found directory: ./swda/sw10utt
Found directory: ./swda/sw08utt
Found directory: ./swda/sw09utt
Found directory: ./swda/sw04utt
Found directory: ./swda/sw05utt


### Word2Vec Model and Tokenize

In [7]:
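def preProcessing(line):
    """Turn one 'label utterance' line into a list of stemmed tokens."""
    line = line.split()[1:]  # drop the leading class label
    # strip punctuation characters from every token
    line = [''.join(c for c in s if c not in string.punctuation) for s in line]
    line = [s.lower() for s in line if s]  # lowercase, skipping tokens that became empty
    line = [s for s in line if len(s)>1 or s=='i' or s=='a']  # drop one-letter tokens except 'i' and 'a'
    line = [stemmer.stem(w) for w in line if w not in ignore_words]
    return line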

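class MySentences(object):
    """Re-iterable stream of tokenized sentences read from one text file."""
    def __init__(self, filename):
        self.filename = filename
 
    def __iter__(self):
        # Re-open the file on every pass: Word2Vec iterates over the corpus several times.
        with open(self.filename, 'r') as corpus:
            for line in corpus:
                yield preProcessing(line)
 
sentences = MySentences(file_name_sm) # a memory-friendly iterator
# gensim's default embedding dimensionality is 100; min_count=1 keeps every word
model = gensim.models.Word2Vec(sentences, min_count=1)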

### Prepare Train and Test Data

In [8]:
Classes = np.array(['greeting','goodbye','request'])
Max_RNN = 100

train_set={'x':[],'y':[]}
dev_set={'x':[],'y':[]}
test_set={'x':[],'y':[]}

for line in open(file_name_sm, 'r'):
    tag = line.split()[0]
    
    ws = preProcessing(line)
    # Build a (Max_RNN, 100) matrix: one embedding row per token,
    # zero-padded (or truncated) to exactly Max_RNN rows.
    for e in np.arange(Max_RNN):
        if e < len(ws):
            w = ws[e]
            vec = np.array([model.wv[w]],ndmin=2) if e==0 else np.append(vec, [model.wv[w]],axis=0)
        else:
            vec = np.zeros((1,100)) if e==0 else np.append(vec, np.zeros((1,100)),axis=0)
            
        
#     print('U-', vec.shape)
    
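    # Randomly route each sentence to train (0, p=0.8) or test (1, p=0.2);
    # without a fixed seed the split is only approximately 80/20 and changes per run.
    DataType = np.random.choice(2, 1, p=[0.8, 0.2])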
    if DataType == 0:
        if len(train_set['x'])<1:
            train_set['x'] = np.array([vec],ndmin=2)
    #         train_set['y'] = np.array( [np.array([1*(Classes == line[0])], ndmin=2).T] )
            train_set['y'] = np.array(  np.where(Classes==tag) ,ndmin=2)
#             print(train_set['y'],tag)
        else:
            train_set['x'] = np.append(train_set['x'], [vec], axis=0)
    #         train_set['y'] = np.append(train_set['y'], [np.array([1*(Classes == line[0])], ndmin=2).T],  axis=0 )
            train_set['y'] = np.append(train_set['y'],  np.where(Classes==tag),  axis=0 )
#             print(train_set['y'],tag)
            
    elif DataType == 1:
        if len(test_set['x'])<1:
            test_set['x'] = np.array([vec],ndmin=2)
            test_set['y'] = np.array(  np.where(Classes==tag) ,ndmin=2)
        else:
            test_set['x'] = np.append(test_set['x'], [vec], axis=0)
            test_set['y'] = np.append(test_set['y'],  np.where(Classes==tag),  axis=0 )
            
#     print('X-',train_set['x'].shape, train_set['y'].shape)
    
    
# Convert labels to categorical one-hot encoding
# one_hot_labels = keras.utils.to_categorical(train_set['y'], num_classes=3)
train_set['y'] = to_categorical(train_set['y'], num_classes=Classes.size)
test_set['y'] = to_categorical(test_set['y'], num_classes=Classes.size)
print('Q-', train_set['x'].shape )
print('Q-', train_set['y'].shape )
print('Q-', test_set['x'].shape )
print('Q-', test_set['y'].shape )

Q- (3144, 100, 100)
Q- (3144, 3)
Q- (777, 100, 100)
Q- (777, 3)
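
The shapes above come from padding or truncating every sentence to `Max_RNN` rows of 100-dimensional vectors. The following pure-Python sketch shows the same idea on a tiny example. The function name `pad_or_truncate` and the 3-dimensional toy vectors are assumptions made for illustration. With NumPy, a similar result could be reached by filling a preallocated `np.zeros((Max_RNN, 100))` array instead of calling `np.append` repeatedly.

```python
def pad_or_truncate(vectors, max_len, dim):
    """Return exactly `max_len` rows: the first rows of `vectors`, then zero rows."""
    rows = [list(v) for v in vectors[:max_len]]
    rows.extend([0.0] * dim for _ in range(max_len - len(rows)))
    return rows

# Hypothetical 3-dimensional embeddings for a two-word sentence.
sentence = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
padded = pad_or_truncate(sentence, max_len=4, dim=3)
print(len(padded), len(padded[0]))  # 4 3
print(padded[2])                    # [0.0, 0.0, 0.0]
```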


### LSTM Model

In [17]:
# create the model

model = Sequential()

model.add(LSTM(Max_RNN, return_sequences=False, input_shape=(Max_RNN,100)))

model.add(Dense(Classes.size, activation='sigmoid'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

model.fit(train_set['x'], train_set['y'],validation_split=0.05, shuffle=True, epochs=100, batch_size=64)
# Final evaluation of the model
scores = model.evaluate(test_set['x'], test_set['y'], verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_9 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dense_5 (Dense)              (None, 3)                 303       
Total params: 80,703
Trainable params: 80,703
Non-trainable params: 0
_________________________________________________________________
None




Train on 2986 samples, validate on 158 samples
Epoch 1/100
...
Epoch 100/100
Accuracy: 78.51%
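
The parameter counts in the summary above can be checked by hand. An LSTM layer has four gate blocks, each with input weights, recurrent weights, and a bias. The sketch below recomputes the counts for 100 units on 100-dimensional inputs. The helper names are chosen here for illustration.

```python
def lstm_params(units, input_dim):
    """4 gate blocks, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(n_in, n_out):
    """Weight matrix plus one bias per output unit."""
    return n_in * n_out + n_out

print(lstm_params(100, 100))                         # 80400
print(dense_params(100, 3))                          # 303
print(lstm_params(100, 100) + dense_params(100, 3))  # 80703
```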


### GRU Model

In [22]:
# create the model

model = Sequential()
model.add(Bidirectional(GRU(Max_RNN, return_sequences=False), input_shape=(Max_RNN,100)))
model.add(Dropout(0.2))
model.add(Dense(Classes.size, activation='sigmoid'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

model.fit(train_set['x'], train_set['y'],validation_split=0.05, shuffle=True, epochs=100, batch_size=64)
# Final evaluation of the model
scores = model.evaluate(test_set['x'], test_set['y'], verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bidirectional_3 (Bidirection (None, 200)               120600    
_________________________________________________________________
dropout_7 (Dropout)          (None, 200)               0         
_________________________________________________________________
dense_6 (Dense)              (None, 3)                 603       
Total params: 121,203
Trainable params: 121,203
Non-trainable params: 0
_________________________________________________________________
None




Train on 2986 samples, validate on 158 samples
Epoch 1/100
...
Epoch 100/100
Accuracy: 78.51%
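
The bidirectional GRU's parameter count can be verified the same way. A GRU has three gate blocks, and the bidirectional wrapper holds two independent copies whose outputs are concatenated into 200 values. The sketch assumes the classic GRU formulation with one bias vector per gate, which matches the summary above. Keras versions that default to `reset_after=True` use two bias vectors per gate and would report a different number. The helper names are hypothetical.

```python
def gru_params(units, input_dim):
    """3 gate blocks, each with input weights, recurrent weights and one bias."""
    return 3 * (units * (input_dim + units) + units)

def dense_params(n_in, n_out):
    """Weight matrix plus one bias per output unit."""
    return n_in * n_out + n_out

bidirectional = 2 * gru_params(100, 100)   # forward and backward copies
print(bidirectional)                       # 120600
print(dense_params(200, 3))                # 603
print(bidirectional + dense_params(200, 3))  # 121203
```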
