### HW2: Task1 - PoS-tagging

#### Khandokar Md. Nayem (knayem)

#### Approach:
We build a neural network PoS tagger, using resources from NLTK as the corpus for supervised training and development. We use the Brown corpus as the dataset, with all of its categories. As PoS (part-of-speech) labels we use the `universal` tagset, which has 12 tags in total: `.`, `ADP`, `ADV`, `PRT`, `X`, `VERB`, `DET`, `NOUN`, `NUM`, `CONJ`, `ADJ`, `PRON`.
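
As a small illustration of how these 12 tags can be turned into class indices, here is a minimal sketch. The names `TAG_TO_INDEX`, `encode_tags` and `decode_tags` are hypothetical helpers, not part of the notebook; sorting the tags is an assumption made here so that the mapping stays the same between runs.

```python
# The 12 universal PoS tags listed above.
UNIVERSAL_TAGS = ['.', 'ADP', 'ADV', 'PRT', 'X', 'VERB', 'DET',
                  'NOUN', 'NUM', 'CONJ', 'ADJ', 'PRON']

# Sorting gives a deterministic order, so a tag always maps to the same index.
TAG_TO_INDEX = {tag: i for i, tag in enumerate(sorted(UNIVERSAL_TAGS))}
INDEX_TO_TAG = {i: tag for tag, i in TAG_TO_INDEX.items()}


def encode_tags(tags):
    """Map a list of tag strings to integer class indices."""
    return [TAG_TO_INDEX[t] for t in tags]


def decode_tags(indices):
    """Map integer class indices back to tag strings."""
    return [INDEX_TO_TAG[i] for i in indices]


print(encode_tags(['DET', 'NOUN', 'VERB', '.']))
```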

We use a Word2Vec model to get a vector representation of each word. Word2Vec learns a word's vector from the words in a window around it, so it captures N-gram-like context. We train the model on the whole Brown corpus and obtain a 100-dimensional vector representation of each word.
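
To make the idea of window-based context concrete, the sketch below lists the (center, context) word pairs that fall inside a small window. It is only an illustration of the context notion, not gensim's training procedure; `context_pairs` and the window size used here are hypothetical.

```python
def context_pairs(sentence, window=2):
    """Return (center, context) pairs for words at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs


# Toy sentence; every word is paired with its immediate neighbours.
sentence = ['the', 'jury', 'said', 'it']
for pair in context_pairs(sentence, window=1):
    print(pair)
```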

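The Brown corpus is too large to train on in full here, so we take only the first 5% of the sentences and randomly assign about 80% of them to the training set and the rest to the test set. Every sentence is padded or truncated to 100 tokens; padded positions get a zero vector and the label `X`. Each word is represented by its Word2Vec vector, and the label for each word is a one-hot encoded vector.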
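
The notebook code pads each sentence to a fixed length and one-hot encodes the labels. The sketch below shows both steps on toy values. The helper names `pad_tags` and `one_hot`, and the numbers used (4 classes, length 5, padding class 3), are assumptions for illustration only.

```python
def pad_tags(tag_ids, max_len, pad_id):
    """Truncate or right-pad a list of class indices to exactly max_len entries."""
    clipped = tag_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))


def one_hot(index, num_classes):
    """Return a list with 1.0 at position `index` and 0.0 elsewhere."""
    row = [0.0] * num_classes
    row[index] = 1.0
    return row


# Toy example: a 3-token sentence padded to 5 steps, with class 3 as padding.
padded = pad_tags([2, 0, 1], max_len=5, pad_id=3)
labels = [one_hot(i, num_classes=4) for i in padded]
print(padded)
print(labels[0])
```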

We use the Keras `Sequential` model because it is easy to use. The task depends on word-to-word relations, both local and long-range, which is why we use a bidirectional GRU (Gated Recurrent Unit). We also experimented with LSTM, but GRU trains faster. The output layer is a dense layer with a sigmoid activation applied at every time step, giving one score per tag; softmax is the more common choice for a categorical task like this one. As the loss function we use cross-entropy, and we train with the Adam optimizer. We apply no dropout (rate 0.0).
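
The output activation and loss described above can be worked through on a single time step. The following sketch compares sigmoid and softmax on the same hypothetical scores and computes the categorical cross-entropy for the softmax case. It is plain Python for illustration and does not reproduce Keras internals.

```python
import math


def sigmoid(logits):
    """Element-wise sigmoid: each score is squashed to (0, 1) independently."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def softmax(logits):
    """Softmax: scores become a probability distribution that sums to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(one_hot_target, probs):
    """Categorical cross-entropy for one sample with a one-hot target."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs) if t > 0)


logits = [2.0, 0.5, -1.0]   # hypothetical scores for three classes
target = [1.0, 0.0, 0.0]    # the first class is the correct one

print(sum(sigmoid(logits)))   # not constrained to sum to 1
print(sum(softmax(logits)))   # sums to 1
print(cross_entropy(target, softmax(logits)))
```

Both activations rank the classes in the same order here; they differ in whether the outputs form a probability distribution.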

During training we shuffle the samples randomly. We also hold out 5% of the training data as a validation set, which helps detect overfitting, since recurrent models tend to overfit. We use a batch size of 64 and train for 100 epochs. Finally, we evaluate on the test set to get the final accuracy.
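
The training setup above (5% validation hold-out, shuffling, batch size 64) can be sketched in plain Python as follows. Keras takes the validation samples from the end of the arrays before shuffling; the helpers `split_validation` and `shuffled_batches` and the sample count of 1000 are hypothetical stand-ins.

```python
import random


def split_validation(samples, validation_split):
    """Hold out the last `validation_split` fraction of the samples."""
    n_val = round(len(samples) * validation_split)
    cut = len(samples) - n_val
    return samples[:cut], samples[cut:]


def shuffled_batches(samples, batch_size, rng):
    """Yield the samples in random order, `batch_size` at a time (one epoch)."""
    order = list(range(len(samples)))
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        yield [samples[i] for i in order[start:start + batch_size]]


samples = list(range(1000))   # stand-in for 1000 training sentences
train, val = split_validation(samples, 0.05)
batches = list(shuffled_batches(train, 64, random.Random(0)))
print(len(train), len(val), len(batches))
```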

#### Result:
The best accuracy we get is 98.67%. This is significantly higher than the paper's reported accuracy (~94%).
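
Because the notebook pads every sentence to a fixed length, accuracy can be measured either over all positions or over real tokens only. The sketch below shows the difference on toy data. `token_accuracy`, the padding id 9 and the example predictions are all hypothetical, and the numbers say nothing about the result reported above.

```python
def token_accuracy(true_ids, pred_ids, lengths=None):
    """Fraction of correctly tagged positions.

    If `lengths` is given, only the first lengths[i] positions of sentence i
    are counted, which excludes padding.
    """
    correct = total = 0
    for i, (t_row, p_row) in enumerate(zip(true_ids, pred_ids)):
        n = len(t_row) if lengths is None else lengths[i]
        for t, p in zip(t_row[:n], p_row[:n]):
            correct += int(t == p)
            total += 1
    return correct / total


# Two toy sentences padded to length 5 with the padding id 9.
true_ids = [[1, 2, 9, 9, 9], [3, 4, 5, 9, 9]]
pred_ids = [[1, 0, 9, 9, 9], [3, 4, 0, 9, 9]]
lengths = [2, 3]

print(token_accuracy(true_ids, pred_ids))            # all positions
print(token_accuracy(true_ids, pred_ids, lengths))   # real tokens only
```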

#### Ideas for improvement:
We train on only 5% of the Brown corpus; we should use the whole corpus. Since the corpus is large, memory-efficient minibatching would have to be used. Batch normalization may also be a good idea to keep the recurrent cells active.
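
One possible shape for memory-efficient minibatching is a generator that encodes sentences only when a batch is requested, so the full corpus is never held as one array. This is a minimal sketch under that assumption; `batch_generator` and the toy `encode` function are hypothetical.

```python
def batch_generator(sentences, tags, batch_size, encode):
    """Yield (inputs, labels) batches, encoding each sentence lazily."""
    for start in range(0, len(sentences), batch_size):
        chunk = range(start, min(start + batch_size, len(sentences)))
        x = [encode(sentences[i]) for i in chunk]
        y = [tags[i] for i in chunk]
        yield x, y


# Toy stand-in for an embedding lookup: encode each word by its length.
def encode(sentence):
    return [len(word) for word in sentence]


sentences = [['a', 'bb']] * 5
tags = [[0, 1]] * 5
for x, y in batch_generator(sentences, tags, batch_size=2, encode=encode):
    print(len(x), x[0], y[0])
```

Keras 2 offers `fit_generator` for generator-based training, which expects a generator that loops indefinitely; the sketch above makes a single pass and would need wrapping in a loop for that use.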

Also, we use only Word2Vec context (N-gram) vectors as the vectorization scheme. Other linguistic features, such as dependency trees and lemmas, could help improve the accuracy. In future work we will incorporate these features into the vectorization scheme and investigate their effect.

In [1]:
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"   # see issue #152
os.environ["CUDA_VISIBLE_DEVICES"]="1"

import nltk
from nltk.stem.lancaster import LancasterStemmer
import json
import datetime
import numpy as np

from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Flatten
from keras.layers.embeddings import Embedding
from keras.utils.np_utils import to_categorical
from keras.layers import Input, LSTM, GRU, Dense
from keras.layers import Bidirectional

import gensim

Using TensorFlow backend.


### NLTK data download

In [2]:
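# Uncomment on the first run to fetch the required NLTK data.
# nltk.download('brown')
# nltk.download('averaged_perceptron_tagger')
# nltk.download('universal_tagset')

# Limit this process to about a third of the GPU memory (TensorFlow 1.x API).
import tensorflow as tf
from keras.backend.tensorflow_backend import set_session
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.33
set_session(tf.Session(config=config))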

In [3]:
from nltk.corpus import brown
brown.categories()

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

### NLTK PoS tags

In [4]:
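# Sentences as lists of (word, tag) pairs, with tags mapped to the universal tagset.
tagg_sent = brown.tagged_sents(tagset='universal')

# Collect the distinct tags. A set has no guaranteed order, so the class
# indices (and the order of the array printed below) can differ between sessions.
flat_list = [item[1] for sublist in tagg_sent for item in sublist]
Classes = np.array(list(set(flat_list)))
Classes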

array(['ADV', 'PRT', 'CONJ', 'ADP', 'NUM', 'NOUN', 'PRON', 'VERB', 'X',
       'DET', '.', 'ADJ'], dtype='<U4')

In [5]:
# Split each tagged sentence into a lower-cased word list and a parallel tag list.
result_data = [[word.lower() for word, _ in sent] for sent in tagg_sent]
pos_data = [[tag for _, tag in sent] for sent in tagg_sent]

### Word2Vec Model

In [6]:
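# Train Word2Vec on the lower-cased sentences (the default vector size is 100).
# min_count=1 keeps every word in the vocabulary; the default (5) drops rare
# words, and looking one of them up later raises a KeyError.
model = gensim.models.Word2Vec(result_data,min_count=1)
len(model.wv.vocab)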

14221

### Prepare Train and Test dataset 

In [7]:
Max_RNN = 100   # time steps per sentence; longer sentences are truncated
Emb_dim = 100   # size of the Word2Vec vectors

class_index = {c: i for i, c in enumerate(Classes)}
pad_class = class_index['X']   # padded positions are labelled 'X'

train_set = {'x': [], 'y': []}
test_set = {'x': [], 'y': []}

# Use only the first 5% of the sentences.
for index in range(len(pos_data) // 20):
    part, line = pos_data[index], result_data[index]

    # Start from an all-padding sentence: zero vectors with the 'X' label.
    vec = np.zeros((Max_RNN, Emb_dim))
    pos = np.full(Max_RNN, pad_class)
    for e, (w, p) in enumerate(zip(line[:Max_RNN], part[:Max_RNN])):
        vec[e] = model.wv[w]
        pos[e] = class_index[p]

    # Each sentence goes to the training set with probability 0.8, else to the test set.
    target = train_set if np.random.rand() < 0.8 else test_set
    target['x'].append(vec)
    target['y'].append(pos)

train_set['x'] = np.array(train_set['x'])
test_set['x'] = np.array(test_set['x'])

# Convert integer labels to one-hot vectors: shape (sentences, Max_RNN, classes).
train_set['y'] = to_categorical(np.array(train_set['y']), num_classes=Classes.size)
test_set['y'] = to_categorical(np.array(test_set['y']), num_classes=Classes.size)

print('Q-', train_set['x'].shape )
print('Q-', train_set['y'].shape )
print('Q-', test_set['x'].shape )
print('Q-', test_set['y'].shape )

KeyError: "word 'atlanta's' not in vocabulary"
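
The `KeyError` above is what an embedding lookup raises for a word that is missing from the vocabulary. One hedged way to guard against it is to fall back to a zero vector. In the sketch below a plain dictionary stands in for the trained word vectors, and `lookup_vector` is a hypothetical helper, not a gensim function.

```python
def lookup_vector(word, vectors, dim):
    """Return the embedding for `word`, or a zero vector if it is missing."""
    try:
        return vectors[word]
    except KeyError:
        return [0.0] * dim


# Toy vocabulary with 3-dimensional vectors.
toy_vectors = {'atlanta': [0.1, 0.2, 0.3]}
print(lookup_vector('atlanta', toy_vectors, 3))
print(lookup_vector("atlanta's", toy_vectors, 3))
```

A zero vector treats every unknown word the same way; keeping rare words in the vocabulary avoids the fallback altogether.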

### Model 

In [None]:
# Create the model: a bidirectional GRU that emits one output per time step,
# followed by a dense layer that scores every tag at each step.
# A separate name keeps the Word2Vec `model` from being overwritten.
tagger = Sequential()
# tagger.add(LSTM(Max_RNN, input_shape=(Max_RNN,100)))
tagger.add(Bidirectional(GRU(Max_RNN, return_sequences=True), input_shape=(Max_RNN,100)))
tagger.add(Dense(Classes.size, activation='sigmoid'))
tagger.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
tagger.summary()

# `epochs` replaces the deprecated `nb_epoch` argument in Keras 2.
tagger.fit(train_set['x'], train_set['y'], validation_split=0.05, shuffle=True, epochs=100, batch_size=64)
# Final evaluation of the model on the held-out test set
scores = tagger.evaluate(test_set['x'], test_set['y'], verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))