## Dependency parscer

1. ablation study
 - Try to delete only the 12 dep features and check UAS
 - Try to delete only the 18 pos features and check UAS  

2. comparison study
 - Try to use (1) glove embedding (smallest)
 - nn.Embedding (train from scratch)
3. compare 2-3 sentences with spaCy and see whether our neural network gives the same dependency

what i do 
1. add option to delete pos or dep in Parser class
2. use glove from gensim library to embedding and use nn.embadding from scratch (skipgram)
3. put sentence in parser to compare




In [1]:
import sys
import numpy as np
import time
import os
import logging
from collections import Counter
from datetime import datetime
import math

from tqdm import tqdm  #gimmick for progressbar when you train
import pickle #saving and loading models

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import nn, optim

## 1. Parsing function

We gonna start with a class `Parsing`, representing a parser for each sentence.  For each sentence, we need the `stack`, `buffer`, and the `dependencies`.

In [2]:
#basically, it takes the current state of the buffer, stack, dependencies
#tell us how SHIFT, LA, RA changes these three objects

class Parsing(object):
    
    #init stack, buffer, dep
    def __init__(self, sentence):  
        self.sentence = sentence     #['The', 'cat', 'sat]  #conll format which is already in the tokenized form
        self.stack    = ['ROOT']
        self.buffer   = sentence[:]  #in the beginning, everything is inside the buffer
        self.dep      = []           #maintains a list of tuples of dep
    
    #parse function that tells me how shift, la, ra changes these three objects
    def parse_step(self, transition):     #transition could be either S, LA, RA
        if transition == 'S':
            #get the top guy in the buffer and put in stack
            head = self.buffer.pop(0)
            self.stack.append(head)
        elif transition == 'LA':  #stack = [ROOT, He, has] ==> append to dep (has, he) and then He is gone from the stack [ROOT, has]
            dependent = self.stack.pop(-2)  #He
            self.dep.append((self.stack[-1], dependent))  #(has, he)
        elif transition == 'RA':
            #can you guys try to this???
            dependent = self.stack.pop()  #stack = [ROOT, has, control] ==> dep (has, control), control will be gone fromt he stack [ROOT, has]
            self.dep.append((self.stack[-1], dependent))
        else:
            print(f"Bad transition: {transition}")
    
    #given some series of transition, it gonna for-loop the parse function
    def parse(self, transitions):
        for t in transitions:
            self.parse_step(t)
        return self.dep
    
    #check whether things are finished - no need to do anymore functions....
    def is_completed(self):
        return (len(self.buffer) == 0) and (len(self.stack) == 1)  #so buffer is empty and ROOT is the only guy in stack

### Minibatch parsing

We gonna create a minibatch loader that loads a bunch of sentences, and perform parse accordingly.  For now, we will assume a very dump model to predict the transitions.

In [3]:
def minibatch_parse(sentences, model, batch_size):
    dep = []  #all the resulting dep
    
    #init Parsing instance for each sentence in the batch
    partial_parses = [Parsing(sentence) for sentence in sentences]  #in tokenized form
    #Parsing(['The', 'cat', 'sat']), Parsing(['Chaky', 'is', 'mad'])
    
    unfinished_parses = partial_parses[:]
    
    #while we still have sentence
    while unfinished_parses:  #if there are still a Parsing object
    
        #take a certain batch of sentence
        minibatch = unfinished_parses[:batch_size] #number of Parsing object
        
        #create a dummy model to tell us what's the next transition for each sentence
        transitions = model.predict(minibatch) 
        #transitions = [S, S, .....]
        #minibatch   = [Parsing(sentence1), Parsing(sentence2)]
        
                
        # for transition predicted this dummy model
        for transition, partial_parse in zip(transitions, minibatch):
            #parse step
            #transition: S
            #partial_parse: Parsing(sentence)
            partial_parse.parse_step(transition)
            
        #remove any sentence is finish
        unfinished_parses[:] = [p for p in unfinished_parses if not p.is_completed()]
    
    dep = [parse.dep for parse in partial_parses]
    
    return dep

## 2. Load data

We used English Penn Treebank dataset in CoNLL format.

CoNLL is the conventional name for TSV formats in NLP (TSV - tab-separated values, i.e., CSV with <TAB> as separator).
It originates from a series of shared tasks organized at the Conferences of Natural Language Learning (hence the name)

In CoNLL formats,
- every word (token) is represented in one line
- every sentence is separated from the next by an empty line
- every column represents one annotation

There are many formats, in our case, our conll file has 10 columns, the important columns are:
- 1:  word
- 4:  pos
- 6:  head of the dependency
- 7:  type of dependency

In [4]:
def read_conll(filename):
    
    examples = []
    
    with open(filename) as f:
        i = 0
        word, pos, head, dep = [], [], [], []
        for line in f.readlines():
            i = i+1
            wa = line.strip().split('\t')  #['1', 'In', '_', 'ADP', 'IN', '_', '5', 'case', '_', '_']
            #In <--------  5th guy
            #     case
            
            if len(wa) == 10:  #if all the columns are there
                word.append(wa[1].lower())
                pos.append(wa[4])
                head.append(int(wa[6]))
                dep.append(wa[7])
            
            #the row is not exactly 10, it means new sentence
            elif len(word) > 0:  #if there is somethign inside the word
                examples.append({'word': word, 'pos': pos, 'head': head, 'dep': dep})  #in the sentence level
                word, pos, head, dep = [], [], [], [] #clear word, pos, head, dep
        
        if len(word) > 0:  #if there is somethign inside the word
            examples.append({'word': word, 'pos': pos, 'head': head, 'dep': dep})  #in the sentence level

    return examples                

In [5]:
def load_data():
    print("1. Loading data")
    train_set = read_conll("../data/train.conll")
    dev_set   = read_conll("../data/dev.conll")
    test_set   = read_conll("../data/test.conll")
    
    #make my dataset smaller because my mac cannot handle it
    train_set = train_set[:1000]
    dev_set   = dev_set[:500]
    test_set  = test_set[:500]
    
    return train_set, dev_set, test_set

### Testing the load function

In [6]:
train_set, dev_set, test_set = load_data()

1. Loading data


In [7]:
len(train_set), len(dev_set), len(test_set)

(1000, 500, 500)

To understand, we can draw these in a dependency tree, with the help of spaCy.  **Note** that spaCy do not draw the ROOT for us, but imagine the head of "plays" is ROOT.

In [8]:
#we eventually gonna make the dependency
#so maybe we can cheat a little bit, and see the answer

import spacy
from spacy import displacy #displacy is for visualization

nlp = spacy.load("en_core_web_sm")
doc = nlp("Ms. Haag plays Elianti .")
options = {"collapse_punct": False}

displacy.render(doc, options = options, style="dep", jupyter=True)

## 3. Parser

In [114]:
P_PREFIX = '<p>:' #indicating pos tags
D_PREFIX = '<d>:' #indicating dependency tags
UNK      = '<UNK>'
NULL     = '<NULL>'
ROOT     = '<ROOT>'

class Parser(object):

    def __init__(self, dataset,d_op):
        
        #set the root dep
        self.root_dep = 'root'
        self.d_op = d_op
                
        #get all the dep of the dataset as list, e.g., ['root', 'acl', 'nmod', 'nmod:npmod']
        all_dep = [self.root_dep] + list(set([w for ex in dataset
                                               for w in ex['dep']
                                               if w != self.root_dep]))
        
        #1. put dep into tok2id lookup table, with D_PREFIX so we know it is dependency
        #{'D_PREFIX:root': 0, 'D_PREFIX:acl': 1, 'D_PREFIX:nmod': 2, ..., 'D_PREFIX:<NULL>': 30}
        tok2id = {D_PREFIX + l: i for (i, l) in enumerate(all_dep)}
        tok2id[D_PREFIX + NULL] = self.D_NULL = len(tok2id)
        
        #we are using "unlabeled" where we do not label with the dependency
        #thus the number of dependency relation is 1
        trans = ['L', 'R', 'S']
        self.n_deprel = 1   #because we are not predicting the relations, we are only predicting S, L, R
        
        #create a simple lookup table mapping action and id
        #e.g., tran2id: {'L': 0, 'R': 1, 'S': 2}
        #e.g., id2tran: {0: 'L', 1: 'R', 2: 'S'}
        self.n_trans = len(trans)
        self.tran2id = {t: i for (i, t) in enumerate(trans)}  #use for easy coding
        self.id2tran = {i: t for (i, t) in enumerate(trans)}
        
        #2. put pos tags into tok2id lookup table, with P_PREFIX so we know it is pos
        tok2id.update(build_dict([P_PREFIX + w for ex in dataset for w in ex['pos']],
                                  offset=len(tok2id)))
        tok2id[P_PREFIX + UNK]  = self.P_UNK  = len(tok2id)  #also remember the pos tags of unknown
        tok2id[P_PREFIX + NULL] = self.P_NULL = len(tok2id)
        tok2id[P_PREFIX + ROOT] = self.P_ROOT = len(tok2id)
        
        #now tok2id:  {'P_PREFIX:root': 0, 'P_PREFIX:acl': 1, ..., 'P_PREFIX:JJR': 62, 'P_PREFIX:<UNK>': 63, 'P_PREFIX:<NULL>': 64, 'P_PREFIX:<ROOT>': 65}
        
        #3. put word into tok2id lookup table
        tok2id.update(build_dict([w for ex in dataset for w in ex['word']],
                                  offset=len(tok2id)))
        tok2id[UNK]  = self.UNK = len(tok2id)
        tok2id[NULL] = self.NULL = len(tok2id)
        tok2id[ROOT] = self.ROOT = len(tok2id)
        
        #now tok2id: {'D_PREFIX:root': 0, 'D_PREFIX:acl': 1, 'D_PREFIX:nmod': 2, ..., 'memory': 340, 'mr.': 341, '<UNK>': 342, '<NULL>': 343, '<ROOT>': 344}
        
        #create id2tok
        self.tok2id = tok2id
        self.id2tok = {v: k for (k, v) in tok2id.items()}

        if d_op == 'dep':
            self.n_features = 18 + 18 
        elif d_op == 'pos': 
            self.n_features = 18 + 12
        else:
            self.n_features = 18 + 18 + 12

        self.n_tokens = len(tok2id)
        
    #utility function, in case we want to convert token to id
    #function to turn train set with words to train set with id instead using tok2id
    def numericalize(self, examples):
        numer_examples = []
        for ex in examples:
            word = [self.ROOT] + [self.tok2id[w] if w in self.tok2id
                                  else self.UNK for w in ex['word']]
            pos  = [self.P_ROOT] + [self.tok2id[P_PREFIX + w] if P_PREFIX + w in self.tok2id
                                   else self.P_UNK for w in ex['pos']]
            head = [-1] + ex['head']
            dep  = [-1] + [self.tok2id[D_PREFIX + w] if D_PREFIX + w in self.tok2id
                            else -1 for w in ex['dep']]
            numer_examples.append({'word': word, 'pos': pos,
                                 'head': head, 'dep': dep})
        return numer_examples
    
    #function to extract features to form a feature embedding matrix
    def extract_features(self, stack, buf, arcs, ex):
             
        #ex['word']:  [55, 32, 33, 34, 35, 30], i.e., ['root', 'ms.', 'haag', 'plays', 'elianti', '.']
        #ex['pos']:   [29, 14, 14, 16, 14, 17], i.e., ['NNP', 'NNP', 'VBZ', 'NNP', '.']
        #ex['head']:  [-1, 2, 3, 0, 3, 3]  or ['root', 'compound', 'nsubj', 'root', 'dobj', 'punct']}
        #ex['dep']:   [-1, 1, 2, 0, 6, 12] or ['compound', 'nsubj', 'root', 'dobj', 'punct']

        #stack     :  [0]
        #buffer    :  [1, 2, 3, 4, 5]
        
        if stack[0] == "ROOT":
            stack[0] = 0  #start the stack with [ROOT]
            
        p_features = [] #pos features (2a, 2b, 2c) - 18
        d_features = [] #dep features (3b, 3c) - 12
        
        #last 3 things on the stack as features
        #if the stack is less than 3, then we simply append NULL from the left
        features = [self.NULL] * (3 - len(stack)) + [ex['word'][x] for x in stack[-3:]]
        
        # next 3 things on the buffer as features
        #if the buffer is less than 3, simply append NULL
        #the reason why NULL is appended on end because buffer is read left to right
        features += [ex['word'][x] for x in buf[:3]] + [self.NULL] * (3 - len(buf))
        
        #corresponding pos tags
        p_features = [self.P_NULL] * (3 - len(stack)) + [ex['pos'][x] for x in stack[-3:]]
        p_features += [ex['pos'][x] for x in buf[:3]] + [self.P_NULL] * (3 - len(buf))
        
        #get leftmost children based on the dependency arcs
        def get_lc(k):
            return sorted([arc[1] for arc in arcs if arc[0] == k and arc[1] < k])

        #get right most children based on the dependency arcs
        def get_rc(k):
            return sorted([arc[1] for arc in arcs if arc[0] == k and arc[1] > k],
                          reverse=True)
        
        #get the leftmost and rightmost children of the top two words, thus we loop 2 times
        for i in range(2):
            if i < len(stack):
                k = stack[-i-1] #-1, -2 last two in the stack
                
                #the first and second lefmost/rightmost children of the top two words (i=1, 2) on the stack
                lc = get_lc(k)  
                rc = get_rc(k)
                
                #the leftmost of leftmost/rightmost of rightmost children of the top two words on the stack:
                llc = get_lc(lc[0]) if len(lc) > 0 else []
                rrc = get_rc(rc[0]) if len(rc) > 0 else []

                #(leftmost of first word on stack, rightmost of first word, 
                # leftmost of the second word on stack, rightmost of second, 
                # leftmost of leftmost, rightmost of rightmost
                features.append(ex['word'][lc[0]] if len(lc) > 0 else self.NULL)
                features.append(ex['word'][rc[0]] if len(rc) > 0 else self.NULL)
                features.append(ex['word'][lc[1]] if len(lc) > 1 else self.NULL)
                features.append(ex['word'][rc[1]] if len(rc) > 1 else self.NULL)
                features.append(ex['word'][llc[0]] if len(llc) > 0 else self.NULL)
                features.append(ex['word'][rrc[0]] if len(rrc) > 0 else self.NULL)

                #corresponding pos
                p_features.append(ex['pos'][lc[0]] if len(lc) > 0 else self.P_NULL)
                p_features.append(ex['pos'][rc[0]] if len(rc) > 0 else self.P_NULL)
                p_features.append(ex['pos'][lc[1]] if len(lc) > 1 else self.P_NULL)
                p_features.append(ex['pos'][rc[1]] if len(rc) > 1 else self.P_NULL)
                p_features.append(ex['pos'][llc[0]] if len(llc) > 0 else self.P_NULL)
                p_features.append(ex['pos'][rrc[0]] if len(rrc) > 0 else self.P_NULL)
            
                #corresponding dep
                d_features.append(ex['dep'][lc[0]] if len(lc) > 0 else self.D_NULL)
                d_features.append(ex['dep'][rc[0]] if len(rc) > 0 else self.D_NULL)
                d_features.append(ex['dep'][lc[1]] if len(lc) > 1 else self.D_NULL)
                d_features.append(ex['dep'][rc[1]] if len(rc) > 1 else self.D_NULL)
                d_features.append(ex['dep'][llc[0]] if len(llc) > 0 else self.D_NULL)
                d_features.append(ex['dep'][rrc[0]] if len(rrc) > 0 else self.D_NULL)
                
            else:
                #attach NULL when they don't exist
                features += [self.NULL] * 6
                p_features += [self.P_NULL] * 6
                d_features += [self.D_NULL] * 6
                
        if self.d_op == 'dep':
            features += p_features 
        elif self.d_op == 'pos': 
            features +=  d_features
        else:
            features += p_features + d_features
        assert len(features) == self.n_features  #assert they are 18 + 18 + 12
        
        return features

    #generate training examples
    #from the training sentences and their gold parse trees 
    def create_instances(self, examples):  #examples = word, pos, head, dep
        all_instances = []
        
        for i, ex in enumerate(examples):
            #Ms. Haag plays Elianti .
            #e.g., ex['word]: [344, 163, 99, 164, 165, 68]
            #here 344 stands for ROOT
            #Chaky - I cheated and take a look
            n_words = len(ex['word']) - 1  #excluding the root
            
            #arcs = {(head, tail, dependency label)}
            stack = [0]
            buf = [i + 1 for i in range(n_words)]  #[1, 2, 3, 4, 5]
            arcs = []
            instances = []
            
            #because that's the maximum number of shift, leftarcs, rightarcs you can have
            #this will determine the sample size of each training example
            #if given five words, we will get a sample of (10, 48) where 10 comes from 5 * 2, and 48 is n_features
            #but this for loop can be break if there is nothing left....
            for i in range(n_words * 2):  #maximum times you can do either S, L, R
                
                #get the gold transition based on the parse trees
                #gold_t can be either shift(2), leftarc(0), or rightarc(1)
                gold_t = self.get_oracle(stack, buf, ex)
                
                #if gold_t is None, no need to extract features.....
                if gold_t is None:
                    break
                
                #make sure when the model predicts, we inform the current state of stack and buffer, so
                #the model is not allowed to make any illegal action, e.g., buffer is empty but trying to pop
                legal_labels = self.legal_labels(stack, buf)                
                assert legal_labels[gold_t] == 1
                
                #extract all the 48 features 
                features = self.extract_features(stack, buf, arcs, ex)
                instances.append((features, legal_labels, gold_t))
                
                #shift 
                if gold_t == 2:
                    stack.append(buf[0])
                    buf = buf[1:]
                #left arc 
                elif gold_t == 0:
                    arcs.append((stack[-1], stack[-2], gold_t))
                    stack = stack[:-2] + [stack[-1]]
                #right arc
                else:
                    arcs.append((stack[-2], stack[-1], gold_t - self.n_deprel))
                    stack = stack[:-1]
                    
            else:
                all_instances += instances

        return all_instances
    
    #provide an one hot encoding of the labels
    def legal_labels(self, stack, buf):
        labels =  ([1] if len(stack) > 2  else [0]) * self.n_deprel  #left arc but you cannot do ROOT <--- He
        labels += ([1] if len(stack) >= 2 else [0]) * self.n_deprel  #right arc because ROOT --> He
        labels += [1] if len(buf) > 0 else [0]  #shift
        return labels
    
    #a simple function to check punctuation POS tags
    def punct(self, pos):
        return pos in ["''", ",", ".", ":", "``", "-LRB-", "-RRB-"]
    
    #decide whether to shift, leftarc, or rightarc, based on gold parse trees
    #this is needed to create training examples which contain samples and ground truth
    def get_oracle(self, stack, buf, ex):
        
        #leave if the stack is only 1, thus nothing to predict....
        if len(stack) < 2:
            return self.n_trans - 1
        
        #predict based on the last two words on the stack
        #stack: [ROOT, he, has]
        i0 = stack[-1] #has
        i1 = stack[-2] #he
        
        #get the head and dependency
        h0 = ex['head'][i0]
        h1 = ex['head'][i1]
        d0 = ex['dep'][i0]
        d1 = ex['dep'][i1]
        
        #either shift, left arc or right arc
        #"Shift" = 2; "LA" = 0; "RA" = 1
        #if head of the second last word is the last word, then leftarc
        if (i1 > 0) and (h1 == i0):
            return 0  #action is left arc ---> gold_t
        #if head of the last word is the second last word, then rightarc
        #make sure nothing in the buffer has head with the last word on the stack
        #otherwise, we lose the last word.....
        elif (i1 >= 0) and (h0 == i1) and \
                (not any([x for x in buf if ex['head'][x] == i0])):
            return 1  #right arc
        #otherwise shift, if something is left in buffer, otherwise, do nothing....
        else:
            return None if len(buf) == 0 else 2  #shift
        
    def parse(self, dataset, eval_batch_size=5000):
        sentences = []
        sentence_id_to_idx = {}
        
        for i, example in enumerate(dataset):
            
            #example['word']=[188, 186, 186, ..., 59]
            #n_words=37
            #sentence=[1, 2, 3, 4, 5,.., 37]
            
            n_words = len(example['word']) - 1
            sentence = [j + 1 for j in range(n_words)]            
            sentences.append(sentence)
            
            #mapping the object unique id to the i            
            #The id is the object's memory address
            sentence_id_to_idx[id(sentence)] = i
            
        model = ModelWrapper(self, dataset, sentence_id_to_idx)
        dependencies = minibatch_parse(sentences, model, eval_batch_size)
        
        UAS = all_tokens = 0.0
        with tqdm(total=len(dataset)) as prog:
            for i, ex in enumerate(dataset):
                head = [-1] * len(ex['word'])
                for h, t, in dependencies[i]:
                    head[t] = h
                for pred_h, gold_h, gold_l, pos in \
                        zip(head[1:], ex['head'][1:], ex['dep'][1:], ex['pos'][1:]):
                        assert self.id2tok[pos].startswith(P_PREFIX)
                        pos_str = self.id2tok[pos][len(P_PREFIX):]
                        if (not self.punct(pos_str)):
                            UAS += 1 if pred_h == gold_h else 0
                            all_tokens += 1
                prog.update(i + 1)
        UAS /= all_tokens
        return UAS, dependencies



In [10]:
class ModelWrapper(object):
    def __init__(self, parser, dataset, sentence_id_to_idx):
        self.parser = parser
        self.dataset = dataset
        self.sentence_id_to_idx = sentence_id_to_idx

    def predict(self, partial_parses):
        mb_x = [self.parser.extract_features(p.stack, p.buffer, p.dep,
                                             self.dataset[self.sentence_id_to_idx[id(p.sentence)]])
                for p in partial_parses]
        mb_x = np.array(mb_x).astype('int32')
        mb_x = torch.from_numpy(mb_x).long()
        mb_l = [self.parser.legal_labels(p.stack, p.buffer) for p in partial_parses]

        pred = self.parser.model(mb_x)
        pred = pred.detach().numpy()
        
        #we need to multiply 10000 with legal labels, to force the model not to make any impossible prediction
        #other, when we parse sequentially, sometimes there is nothing in the buffer or stack, thus error....        
        pred = np.argmax(pred + 10000 * np.array(mb_l).astype('float32'), 1)
        pred = ["S" if p == 2 else ("LA" if p == 0 else "RA") for p in pred]
        
        return pred

In [11]:
#a simple function to create ids.....
def build_dict(keys, offset=0):
    #keys = ['P_PREFIX:IN', 'P_PREFIX:DT', 'P_PREFIX:NNP', 'P_PREFIX:CD', so on...]
    #offset is needed because this tok2id has something already inside....
    count = Counter()
    for key in keys:
        count[key] += 1
    
    #most_common = [('P_PREFIX:NN', 70), ('P_PREFIX:IN', 57), ... , ('P_PREFIX:JJR', 1)]
    #we use most_common in case we only want some maximum pos tags....
    mc = count.most_common()
    
    #{'P_PREFIX:NN': 31, 'P_PREFIX:IN': 32, .., 'P_PREFIX:JJR': 62} 
    return {w[0]: index + offset for (index, w) in enumerate(mc)}

In [12]:
def get_minibatches(data, minibatch_size, shuffle=True):
    data_size = len(data[0])
    indices = np.arange(data_size)
    if shuffle:
        np.random.shuffle(indices)
    for minibatch_start in np.arange(0, data_size, minibatch_size):
        minibatch_indices = indices[minibatch_start:minibatch_start + minibatch_size]
        yield [_minibatch(d, minibatch_indices) for d in data]

def _minibatch(data, minibatch_idx):
    return data[minibatch_idx] if type(data) is np.ndarray else [data[i] for i in minibatch_idx]

def minibatches(data, batch_size):
    x = np.array([d[0] for d in data])
    y = np.array([d[2] for d in data])
    one_hot = np.zeros((y.size, 3))
    one_hot[np.arange(y.size), y] = 1
    return get_minibatches([x, one_hot], batch_size)

In [13]:
class ParserModel(nn.Module):

    def __init__(self, embeddings, n_features=48,
                 hidden_size=400, n_classes=3, dropout_prob=0.5):

        super(ParserModel, self).__init__()
        self.n_features   = n_features
        self.n_classes    = n_classes
        self.dropout_prob = dropout_prob
        self.embed_size   = embeddings.shape[1]
        self.hidden_size  = hidden_size
        self.pretrained_embeddings = nn.Embedding(embeddings.shape[0], self.embed_size)
        self.pretrained_embeddings.weight = nn.Parameter(torch.tensor(embeddings))

        self.embed_to_hidden = nn.Linear(n_features * self.embed_size, hidden_size)
        self.dropout = nn.Dropout(p=dropout_prob)
        self.hidden_to_logits = nn.Linear(hidden_size, n_classes)

    def embedding_lookup(self, t):
        #t:  batch_size, n_features
        batch_size = t.size()[0]
                    
        x = self.pretrained_embeddings(t)        
        x = x.reshape(-1, self.n_features * self.embed_size)
        # x = (1024, 48 * 50)

        return x

    def forward(self, t):
        # t: (1024, 48)
        embeddings = self.embedding_lookup(t)  
    
        # embeddings: (1024, 48 * 50)
        hidden = self.embed_to_hidden(embeddings)
    
        # hidden: (1024, 200)
        hidden_activations = F.relu(hidden)
        # hidden_activations: (1024, 200)
        thin_net = self.dropout(hidden_activations)
        # thin_net: (1024, 200)
        logits = self.hidden_to_logits(thin_net)
        # logits: (1024, 3)

        return logits

In [14]:
#just a class to get the average.....
class AverageMeter(object):
    """Computes and stores the average and current value"""
    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

In [15]:
def train(parser, train_data, dev_data, output_path, batch_size=1024, n_epochs=10, lr=0.0005):
    
    best_dev_UAS = 0
    
    optimizer = optim.Adam(parser.model.parameters(), lr=0.001)
    loss_func = nn.CrossEntropyLoss()

    for epoch in range(n_epochs):
        print("Epoch {:} out of {:}".format(epoch + 1, n_epochs))
        dev_UAS = train_for_epoch(
            parser, train_data, dev_data, optimizer, loss_func, batch_size)
        if dev_UAS > best_dev_UAS:
            best_dev_UAS = dev_UAS
            print("New best dev UAS! Saving model.")
            torch.save(parser.model.state_dict(), output_path)
        print("")


def train_for_epoch(parser, train_data, dev_data, optimizer, loss_func, batch_size):
    
    parser.model.train()  # Places model in "train" mode, i.e. apply dropout layer
    n_minibatches = math.ceil(len(train_data) / batch_size)
    loss_meter = AverageMeter()

    with tqdm(total=(n_minibatches)) as prog:
        for i, (train_x, train_y) in enumerate(minibatches(train_data, batch_size)):
            
            #train_x:  batch_size, n_features
            #train_y:  batch_size, target(=3)
            
            optimizer.zero_grad() 
            loss = 0.
            train_x = torch.from_numpy(train_x).long()  #long() for int so embedding works....
            train_y = torch.from_numpy(train_y.nonzero()[1]).long()  #get the index with 1 because torch expects label to be single integer

            # Forward pass: compute predicted logits.
            logits = parser.model(train_x)
            # Compute loss
            loss = loss_func(logits, train_y)
            # Compute gradients of the loss w.r.t model parameters.
            loss.backward()
            # Take step with optimizer.
            optimizer.step()

            prog.update(1)
            loss_meter.update(loss.item())

    print("Average Train Loss: {}".format(loss_meter.avg))
    print("Evaluating on dev set",)
    parser.model.eval()  # Places model in "eval" mode, i.e. don't apply dropout layer
        
    dev_UAS, _ = parser.parse(dev_data)
    print("- dev UAS: {:.2f}".format(dev_UAS * 100.0))
    return dev_UAS

In [70]:
def prepare(parser,word_vectors,train_set,dev_set,test_set):

    train_set = parser.numericalize(train_set)
    dev_set   = parser.numericalize(dev_set)
    test_set  = parser.numericalize(test_set)

    

    #we use random.normal instead of zeros, to keep the embedding matrix arbitrary in case word vectors don't exist....
    embeddings_matrix = np.asarray(np.random.normal(0, 0.9, (parser.n_tokens, 50)), dtype='float32')

    for token in parser.tok2id:
            i = parser.tok2id[token]
            if token in word_vectors:
                embeddings_matrix[i] = word_vectors[token]
            elif token.lower() in word_vectors:
                embeddings_matrix[i] = word_vectors[token.lower()]


    

    train_examples = parser.create_instances(train_set)

    return dev_set,test_set,embeddings_matrix,train_examples

In [17]:
parser = Parser(train_set,d_op = '')
parser_cpos = Parser(train_set,d_op = 'pos')
parser_cdep = Parser(train_set,d_op = 'dep')

In [18]:
word_vectors = {}
for line in open("../data/en-cw.txt").readlines():
    we = line.strip().split() #we = word embeddings - first column: word;  the rest is embedding
    word_vectors[we[0]] = [float(x) for x in we[1:]] #{word: [list of 50 numbers], nextword: [another list], so on...}


In [19]:
Parsers =[(parser,"normal",48),(parser_cpos,"cut_pos",30),(parser_cdep,"cut_dep",36)]


for parser_t in Parsers:

    dev_setn,test_setn,embeddings_matrix,train_examples = prepare(parser_t[0],word_vectors,train_set,dev_set,test_set)

    # training
    #create directory if it does not exist for saving the weights...
    output_dir = "output/{:%Y%m%d_%H%M%S}/".format(datetime.now())
    output_path = output_dir + "model.weights"
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
        
    print(80 * "=")
    print(f"TRAINING {parser_t[1]}")
    print(80 * "=")
        
    model = ParserModel(embeddings_matrix,n_features=parser_t[2])
    parser_t[0].model = model

    start = time.time()
    train(parser_t[0], train_examples, dev_setn, output_path,
        batch_size=1024, n_epochs=10, lr=0.0005)


    print(80 * "=")
    print(f"TESTING {parser_t[1]}")
    print(80 * "=")

    print("Restoring the best model weights found on the dev set")
    parser_t[0].model.load_state_dict(torch.load(output_path))
    print("Final evaluation on test set",)
    parser_t[0].model.eval()
    UAS, dependencies = parser_t[0].parse(test_setn)
    print("- test UAS: {:.2f}".format(UAS * 100.0))
    print("Done!")
    
    



TRAINING normal
Epoch 1 out of 10


100%|██████████| 48/48 [00:03<00:00, 13.85it/s]


Average Train Loss: 0.5765360531707605
Evaluating on dev set


125250it [00:00, 11383981.10it/s]      


- dev UAS: 56.83
New best dev UAS! Saving model.

Epoch 2 out of 10


100%|██████████| 48/48 [00:03<00:00, 14.27it/s]


Average Train Loss: 0.2965432734539111
Evaluating on dev set


125250it [00:00, 9632487.00it/s]       


- dev UAS: 67.01
New best dev UAS! Saving model.

Epoch 3 out of 10


100%|██████████| 48/48 [00:03<00:00, 13.07it/s]


Average Train Loss: 0.24040098302066326
Evaluating on dev set


125250it [00:00, 11384721.22it/s]      


- dev UAS: 69.95
New best dev UAS! Saving model.

Epoch 4 out of 10


100%|██████████| 48/48 [00:03<00:00, 14.47it/s]


Average Train Loss: 0.2018256727606058
Evaluating on dev set


125250it [00:00, 11383734.42it/s]      


- dev UAS: 71.89
New best dev UAS! Saving model.

Epoch 5 out of 10


100%|██████████| 48/48 [00:03<00:00, 14.72it/s]


Average Train Loss: 0.1808547287558516
Evaluating on dev set


125250it [00:00, 11384474.50it/s]      


- dev UAS: 72.99
New best dev UAS! Saving model.

Epoch 6 out of 10


100%|██████████| 48/48 [00:03<00:00, 15.13it/s]


Average Train Loss: 0.1593743677561482
Evaluating on dev set


125250it [00:00, 11383981.10it/s]      


- dev UAS: 74.62
New best dev UAS! Saving model.

Epoch 7 out of 10


100%|██████████| 48/48 [00:03<00:00, 13.72it/s]


Average Train Loss: 0.14333658898249269
Evaluating on dev set


125250it [00:00, 10435146.42it/s]      


- dev UAS: 74.94
New best dev UAS! Saving model.

Epoch 8 out of 10


100%|██████████| 48/48 [00:03<00:00, 13.48it/s]


Average Train Loss: 0.13249690442656478
Evaluating on dev set


125250it [00:00, 12522026.46it/s]      


- dev UAS: 75.51
New best dev UAS! Saving model.

Epoch 9 out of 10


100%|██████████| 48/48 [00:03<00:00, 13.42it/s]


Average Train Loss: 0.12130653237303098
Evaluating on dev set


125250it [00:00, 9631957.17it/s]       


- dev UAS: 76.57
New best dev UAS! Saving model.

Epoch 10 out of 10


100%|██████████| 48/48 [00:03<00:00, 13.54it/s]


Average Train Loss: 0.10641626066838701
Evaluating on dev set


125250it [00:00, 11383734.42it/s]      


- dev UAS: 77.10
New best dev UAS! Saving model.

TESTING normal
Restoring the best model weights found on the dev set
Final evaluation on test set


125250it [00:00, 11383981.10it/s]      


- test UAS: 77.27
Done!
TRAINING cut_pos
Epoch 1 out of 10


100%|██████████| 48/48 [00:02<00:00, 20.56it/s]


Average Train Loss: 0.6318283757815758
Evaluating on dev set


125250it [00:00, 12522324.94it/s]      


- dev UAS: 51.08
New best dev UAS! Saving model.

Epoch 2 out of 10


100%|██████████| 48/48 [00:02<00:00, 21.38it/s]


Average Train Loss: 0.35350226797163486
Evaluating on dev set


125250it [00:00, 11383981.10it/s]      


- dev UAS: 55.08
New best dev UAS! Saving model.

Epoch 3 out of 10


100%|██████████| 48/48 [00:02<00:00, 21.73it/s]


Average Train Loss: 0.28512999415397644
Evaluating on dev set


125250it [00:00, 11383487.75it/s]      


- dev UAS: 59.90
New best dev UAS! Saving model.

Epoch 4 out of 10


100%|██████████| 48/48 [00:02<00:00, 21.60it/s]


Average Train Loss: 0.24917958086977401
Evaluating on dev set


125250it [00:00, 11383487.75it/s]      


- dev UAS: 61.65
New best dev UAS! Saving model.

Epoch 5 out of 10


100%|██████████| 48/48 [00:02<00:00, 20.44it/s]


Average Train Loss: 0.2183188833296299
Evaluating on dev set


125250it [00:00, 11383981.10it/s]      


- dev UAS: 60.79

Epoch 6 out of 10


100%|██████████| 48/48 [00:02<00:00, 19.98it/s]


Average Train Loss: 0.1951972267900904
Evaluating on dev set


125250it [00:00, 11383981.10it/s]      


- dev UAS: 62.95
New best dev UAS! Saving model.

Epoch 7 out of 10


100%|██████████| 48/48 [00:02<00:00, 17.61it/s]


Average Train Loss: 0.17611691045264402
Evaluating on dev set


125250it [00:00, 11383734.42it/s]      


- dev UAS: 63.92
New best dev UAS! Saving model.

Epoch 8 out of 10


100%|██████████| 48/48 [00:02<00:00, 21.63it/s]


Average Train Loss: 0.1604892654965321
Evaluating on dev set


125250it [00:00, 10435146.42it/s]      


- dev UAS: 64.88
New best dev UAS! Saving model.

Epoch 9 out of 10


100%|██████████| 48/48 [00:02<00:00, 20.78it/s]


Average Train Loss: 0.14753959933295846
Evaluating on dev set


125250it [00:00, 10435146.42it/s]      


- dev UAS: 66.27
New best dev UAS! Saving model.

Epoch 10 out of 10


100%|██████████| 48/48 [00:02<00:00, 18.41it/s]


Average Train Loss: 0.13502877221132317
Evaluating on dev set


125250it [00:00, 11383487.75it/s]      


- dev UAS: 63.67

TESTING cut_pos
Restoring the best model weights found on the dev set
Final evaluation on test set


125250it [00:00, 10435768.30it/s]      


- test UAS: 68.93
Done!
TRAINING cut_dep
Epoch 1 out of 10


100%|██████████| 48/48 [00:03<00:00, 15.77it/s]


Average Train Loss: 0.5305716153234243
Evaluating on dev set


125250it [00:00, 10435353.70it/s]      


- dev UAS: 58.06
New best dev UAS! Saving model.

Epoch 2 out of 10


100%|██████████| 48/48 [00:02<00:00, 16.66it/s]


Average Train Loss: 0.2823456625143687
Evaluating on dev set


125250it [00:00, 11383981.10it/s]      


- dev UAS: 62.24
New best dev UAS! Saving model.

Epoch 3 out of 10


100%|██████████| 48/48 [00:02<00:00, 16.23it/s]


Average Train Loss: 0.22581337330242
Evaluating on dev set


125250it [00:00, 10435146.42it/s]      


- dev UAS: 66.02
New best dev UAS! Saving model.

Epoch 4 out of 10


100%|██████████| 48/48 [00:02<00:00, 17.13it/s]


Average Train Loss: 0.19108643662184477
Evaluating on dev set


125250it [00:00, 11383734.42it/s]      


- dev UAS: 67.98
New best dev UAS! Saving model.

Epoch 5 out of 10


100%|██████████| 48/48 [00:02<00:00, 16.69it/s]


Average Train Loss: 0.16690226954718432
Evaluating on dev set


125250it [00:00, 10435560.99it/s]      


- dev UAS: 69.50
New best dev UAS! Saving model.

Epoch 6 out of 10


100%|██████████| 48/48 [00:02<00:00, 16.92it/s]


Average Train Loss: 0.1480982556628684
Evaluating on dev set


125250it [00:00, 10435146.42it/s]      


- dev UAS: 68.16

Epoch 7 out of 10


100%|██████████| 48/48 [00:02<00:00, 16.74it/s]


Average Train Loss: 0.13131928397342563
Evaluating on dev set


125250it [00:00, 10434939.14it/s]      


- dev UAS: 71.63
New best dev UAS! Saving model.

Epoch 8 out of 10


100%|██████████| 48/48 [00:02<00:00, 16.25it/s]


Average Train Loss: 0.11866455773512523
Evaluating on dev set


125250it [00:00, 10435768.30it/s]      


- dev UAS: 71.23

Epoch 9 out of 10


100%|██████████| 48/48 [00:02<00:00, 17.39it/s]


Average Train Loss: 0.10632103386645515
Evaluating on dev set


125250it [00:00, 11383734.42it/s]      


- dev UAS: 71.97
New best dev UAS! Saving model.

Epoch 10 out of 10


100%|██████████| 48/48 [00:02<00:00, 16.28it/s]


Average Train Loss: 0.0930035215181609
Evaluating on dev set


125250it [00:00, 11383487.75it/s]      


- dev UAS: 72.75
New best dev UAS! Saving model.

TESTING cut_dep
Restoring the best model weights found on the dev set
Final evaluation on test set


125250it [00:00, 11383981.10it/s]      

- test UAS: 74.52
Done!





## embbed

In [48]:
def get_embed(word,model):
    try:
        index = word2index[word]
    except:
        index = word2index['<UNK>']
    
    word = torch.LongTensor([index])

    center_embed  = model.embedding_center_word(word)
    outside_embed = model.embedding_outside_word(word)
    
    embed = (center_embed + outside_embed) / 2
    
    return  embed.detach().numpy()

In [49]:
def pre_embedmode(parser,train_set,dev_set,test_set,model):

    train_set = parser.numericalize(train_set)
    dev_set   = parser.numericalize(dev_set)
    test_set  = parser.numericalize(test_set)

    try:
        vocabm =  list(model.key_to_index.keys())
        # change chaky code to use embedding
        embeddings_matrix = np.zeros((len(parser.tok2id) + 1, model.vector_size), dtype='float32')
        for token in parser.tok2id:
            i = parser.tok2id[token]
            if token in vocabm:
                embeddings_matrix[i] = model[token]
            elif token.lower() in vocabm:
                embeddings_matrix[i] = model[token.lower()]

    except:
        embeddings_matrix = np.zeros((len(parser.tok2id) + 1, emb_size), dtype='float32')
        for token in parser.tok2id:
            i = parser.tok2id[token]
            if token in vocabs:
                embeddings_matrix[i] = get_embed(token,model)
            elif token.lower() in vocabs:
                embeddings_matrix[i] = get_embed(token.lower(),model)


    

    train_examples = parser.create_instances(train_set)

    return dev_set,test_set,embeddings_matrix,train_examples

### GloVe

In [50]:
from gensim.test.utils import datapath
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec

#you have to put this file in some python/gensim directory; just run it and it will inform where to put....
glove_file = datapath('glove.6B.50d.txt')
Gmodel = KeyedVectors.load_word2vec_format(glove_file, binary=False, no_header=True)

### skipgram (nn.embedding from scratch)

In [51]:
class Skipgram(nn.Module):
    
    def __init__(self, voc_size, emb_size):
        super(Skipgram, self).__init__()
        self.embedding_center_word  = nn.Embedding(voc_size, emb_size)  #is a lookup table mapping all ids in voc_size, into some vector of size emb_size
        self.embedding_outside_word = nn.Embedding(voc_size, emb_size)
    
    def forward(self, center_word, outside_word, all_vocabs):
        #center_word, outside_word: (batch_size, 1)
        #all_vocabs: (batch_size, voc_size)
        
        #convert them into embedding
        center_word_embed  = self.embedding_center_word(center_word)     #(batch_size, 1, emb_size)
        outside_word_embed = self.embedding_outside_word(outside_word)   #(batch_size, 1, emb_size)
        all_vocabs_embed   = self.embedding_outside_word(all_vocabs)     #(batch_size, voc_size, emb_size)
        
        #bmm is basically @ or .dot , but across batches (i.e., ignore the batch dimension)
        top_term = outside_word_embed.bmm(center_word_embed.transpose(1, 2)).squeeze(2)
        #(batch_size, 1, emb_size) @ (batch_size, emb_size, 1) = (batch_size, 1, 1) ===> (batch_size, 1)
        
        top_term_exp = torch.exp(top_term)  #exp(uo vc)
        #(batch_size, 1)
        
        lower_term = all_vocabs_embed.bmm(center_word_embed.transpose(1, 2)).squeeze(2)
         #(batch_size, voc_size, emb_size) @ (batch_size, emb_size, 1) = (batch_size, voc_size, 1) = (batch_size, voc_size)
         
        lower_term_sum = torch.sum(torch.exp(lower_term), 1) #sum exp(uw vc)
        #(batch_size, 1)
        
        loss_fn = -torch.mean(torch.log(top_term_exp / lower_term_sum))
        #(batch_size, 1) / (batch_size, 1) ==mean==> scalar
        
        return loss_fn

In [52]:
def prepare_sequence(seq, word2index):
    #map(function, list of something)
    #map will look at each of element in this list, and apply this function
    idxs = list(map(lambda w: word2index[w] if word2index.get(w) is not None else word2index["<UNK>"], seq))
    return torch.LongTensor(idxs)

In [53]:
def random_batch(batch_size, pair, window_size=2):
                 
    #only get a batch, not the entire list
    random_index = np.random.choice(range(len(pair)), batch_size, replace=False)
             
    #appending some list of inputs and labels
    random_inputs, random_labels = [], []   
    for index in random_index:
        random_inputs.append([pair[index][0]])  #outside words, this will be a shape of (1, ) --> (1, 1) for modeling
        random_labels.append([pair[index][1]])
        
    return np.array(random_inputs), np.array(random_labels)

In [54]:
def pair_data(corpus_tokenized,mode,window_size=2):

    pairs = []

    #for each corpus
    for sent in corpus_tokenized:
        #for each sent ["apple", "banana", "fruit"]
        # # start from window_size end at window_size before last
        for i in range(window_size,len(sent)-window_size):
            center_word = word2index[sent[i]]
            # outside words 
            outside_words = [word2index[sent[j]] for j in range(max(0, i - window_size), min(len(sent), i + window_size + 1)) if j != i]
            for o in outside_words:
                if mode == "skipgram":
                    # append outside word as input center word as output
                    pairs.append([center_word,o])
                elif mode == "cbow":
                    pairs.append([o,center_word])
    
    return pairs

load data for train

In [55]:
def flatten(l):
    return [item for sublist in l for item in sublist]

In [56]:


def numericalize_str(corpus_tokenized):

    #2.1 get all the unique words

    vocabs  = list(set(flatten(corpus_tokenized)))  #vocabs is a term defining all unique words your system know

    #2.2 assign id to all these vocabs
    word2index = {v: idx for idx, v in enumerate(vocabs)}

    #add <UNK>, which is a very normal token exists in the world
    vocabs.append('<UNK>') #chaky, can it be ##UNK, or UNKKKKKK, or anything

    #now we have a way to know what is the id of <UNK>
    word2index['<UNK>'] = len(word2index)  #usually <UNK> is 0

    #create index2word dictionary
    #2 min    
    index2word = {v:k for k, v in word2index.items()}

    return vocabs,word2index,index2word


In [57]:
# add more data 
import pandas as pd

# Read the CSV file which contain spotify song lyric 
df = pd.read_csv("C:\\Users\\ASUS\\My_Journal\\Text\\My-NLP\\spotify_millsongdata.csv")

# Randomly select 10 song
sample = df.sample(30)

In [58]:
import spacy

corpus = sample["text"]
#load 
nlp = spacy.load("en_core_web_sm")
# reduce space , lower all character and use spacy to tokenize
sparcy_tokenized = [nlp((' '.join(lyric.split())).lower()) for lyric in corpus]
# convert scapy token to str
corpus_tokenized = [[str(word) for word in sublist] for sublist in sparcy_tokenized]

In [59]:
vocabs,word2index,index2word = numericalize_str(corpus_tokenized) 

train skip gram

In [60]:
window_size = 5
skipgram_data = pair_data(corpus_tokenized,mode = "skipgram",window_size = window_size)
batch_size = 8 
emb_size   = 50
learning_rate = 0.0001
voc_size   = len(vocabs)
all_vocabs = prepare_sequence(list(vocabs), word2index).expand(batch_size, voc_size)
skip_gram_model = Skipgram(voc_size, emb_size)
Skipgram_optimizer  = optim.Adam(skip_gram_model.parameters(), lr=learning_rate)
num_epochs = 5000

In [61]:
import time


start_time = time.time()
pre_time = start_time

#for epoch
for epoch in range(num_epochs):

    #get random batch
    input_batch, label_batch = random_batch(batch_size, skipgram_data, window_size)
    input_batch = torch.LongTensor(input_batch)
    label_batch = torch.LongTensor(label_batch)
    
    # print(input_batch.shape, label_batch.shape, all_vocabs.shape)
    
    #loss = model
    loss = skip_gram_model(input_batch, label_batch, all_vocabs)

    #backpropagate
    loss.backward()

    #update alpha
    Skipgram_optimizer.step()
    
    

    #print epoch loss
    if (epoch + 1) % 1000 == 0:
        curr_time = time.time()
        print(f"Epoch {epoch+1} | Loss: {loss:.6f} | Time: {curr_time-pre_time:.2f} sec")
        pre_time = curr_time

print(f"total time : {curr_time-start_time:.2f} sec")

Epoch 1000 | Loss: 15.521820 | Time: 9.89 sec
Epoch 2000 | Loss: 15.051564 | Time: 9.58 sec
Epoch 3000 | Loss: 8.072100 | Time: 9.31 sec
Epoch 4000 | Loss: 11.207620 | Time: 9.22 sec
Epoch 5000 | Loss: 6.522009 | Time: 9.10 sec
total time : 47.09 sec


test parsing

In [62]:
parser = Parser(train_set,d_op = '')

In [63]:
embed_model = [(skip_gram_model,"skip gram"),(Gmodel,"GloVe")]

for emodel in embed_model:
    dev_setn,test_setn,embeddings_matrix,train_examples = pre_embedmode(parser,train_set,dev_set,test_set,emodel[0])

    # training
    #create directory if it does not exist for saving the weights...
    output_dir = "output/{:%Y%m%d_%H%M%S}/".format(datetime.now())
    output_path = output_dir + "model.weights"
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
        
    print(80 * "=")
    print(f"TRAINING {emodel[1]}")
    print(80 * "=")
        
    model = ParserModel(embeddings_matrix,n_features=48)
    parser.model = model

    start = time.time()
    train(parser, train_examples, dev_setn, output_path,
        batch_size=1024, n_epochs=10, lr=0.0005)


    print(80 * "=")
    print(f"TESTING {emodel[1]}")
    print(80 * "=")

    print("Restoring the best model weights found on the dev set")
    parser.model.load_state_dict(torch.load(output_path))
    print("Final evaluation on test set",)
    parser.model.eval()
    UAS, dependencies = parser.parse(test_setn)
    print("- test UAS: {:.2f}".format(UAS * 100.0))
    print("Done!")

    


TRAINING skip gram
Epoch 1 out of 10


100%|██████████| 48/48 [00:03<00:00, 14.17it/s]


Average Train Loss: 0.5999997624506553
Evaluating on dev set


125250it [00:00, 9632487.00it/s]       


- dev UAS: 56.63
New best dev UAS! Saving model.

Epoch 2 out of 10


100%|██████████| 48/48 [00:03<00:00, 14.05it/s]


Average Train Loss: 0.2893669254456957
Evaluating on dev set


125250it [00:00, 10435146.42it/s]      


- dev UAS: 64.74
New best dev UAS! Saving model.

Epoch 3 out of 10


100%|██████████| 48/48 [00:03<00:00, 15.20it/s]


Average Train Loss: 0.22070631633202234
Evaluating on dev set


125250it [00:00, 11383981.10it/s]      


- dev UAS: 68.32
New best dev UAS! Saving model.

Epoch 4 out of 10


100%|██████████| 48/48 [00:02<00:00, 16.09it/s]


Average Train Loss: 0.18032371330385408
Evaluating on dev set


125250it [00:00, 10435146.42it/s]      


- dev UAS: 69.59
New best dev UAS! Saving model.

Epoch 5 out of 10


100%|██████████| 48/48 [00:03<00:00, 15.62it/s]


Average Train Loss: 0.15260283959408602
Evaluating on dev set


125250it [00:00, 10435560.99it/s]      


- dev UAS: 71.54
New best dev UAS! Saving model.

Epoch 6 out of 10


100%|██████████| 48/48 [00:03<00:00, 14.88it/s]


Average Train Loss: 0.1276858615068098
Evaluating on dev set


125250it [00:00, 11383981.10it/s]      


- dev UAS: 71.75
New best dev UAS! Saving model.

Epoch 7 out of 10


100%|██████████| 48/48 [00:03<00:00, 14.82it/s]


Average Train Loss: 0.10646391442666452
Evaluating on dev set


125250it [00:00, 11384474.50it/s]      


- dev UAS: 72.42
New best dev UAS! Saving model.

Epoch 8 out of 10


100%|██████████| 48/48 [00:03<00:00, 14.90it/s]


Average Train Loss: 0.08934885046134393
Evaluating on dev set


125250it [00:00, 11383734.42it/s]      


- dev UAS: 72.09

Epoch 9 out of 10


100%|██████████| 48/48 [00:03<00:00, 14.05it/s]


Average Train Loss: 0.07381294077883165
Evaluating on dev set


125250it [00:00, 8348349.29it/s]       


- dev UAS: 71.90

Epoch 10 out of 10


100%|██████████| 48/48 [00:03<00:00, 13.94it/s]


Average Train Loss: 0.06216947144518296
Evaluating on dev set


125250it [00:00, 9632310.38it/s]       


- dev UAS: 71.23

TESTING skip gram
Restoring the best model weights found on the dev set
Final evaluation on test set


125250it [00:00, 10435146.42it/s]      


- test UAS: 73.70
Done!
TRAINING GloVe
Epoch 1 out of 10


100%|██████████| 48/48 [00:03<00:00, 13.77it/s]


Average Train Loss: 0.609759691481789
Evaluating on dev set


125250it [00:00, 11384967.95it/s]      


- dev UAS: 55.56
New best dev UAS! Saving model.

Epoch 2 out of 10


100%|██████████| 48/48 [00:03<00:00, 13.71it/s]


Average Train Loss: 0.31320780764023465
Evaluating on dev set


125250it [00:00, 8944335.07it/s]       


- dev UAS: 64.51
New best dev UAS! Saving model.

Epoch 3 out of 10


100%|██████████| 48/48 [00:03<00:00, 13.97it/s]


Average Train Loss: 0.23464122352500758
Evaluating on dev set


125250it [00:00, 12522026.46it/s]      


- dev UAS: 68.01
New best dev UAS! Saving model.

Epoch 4 out of 10


100%|██████████| 48/48 [00:03<00:00, 13.92it/s]


Average Train Loss: 0.19608110841363668
Evaluating on dev set


125250it [00:00, 11384967.95it/s]      


- dev UAS: 71.11
New best dev UAS! Saving model.

Epoch 5 out of 10


100%|██████████| 48/48 [00:03<00:00, 14.05it/s]


Average Train Loss: 0.16717577601472536
Evaluating on dev set


125250it [00:00, 11383734.42it/s]      


- dev UAS: 72.49
New best dev UAS! Saving model.

Epoch 6 out of 10


100%|██████████| 48/48 [00:03<00:00, 13.90it/s]


Average Train Loss: 0.1423601646286746
Evaluating on dev set


125250it [00:00, 10435560.99it/s]      


- dev UAS: 73.75
New best dev UAS! Saving model.

Epoch 7 out of 10


100%|██████████| 48/48 [00:03<00:00, 14.20it/s]


Average Train Loss: 0.12098399891207616
Evaluating on dev set


125250it [00:00, 9631957.17it/s]       


- dev UAS: 74.06
New best dev UAS! Saving model.

Epoch 8 out of 10


100%|██████████| 48/48 [00:03<00:00, 14.03it/s]


Average Train Loss: 0.10324204573407769
Evaluating on dev set


125250it [00:00, 10435768.30it/s]      


- dev UAS: 74.88
New best dev UAS! Saving model.

Epoch 9 out of 10


100%|██████████| 48/48 [00:03<00:00, 13.54it/s]


Average Train Loss: 0.08801894262433052
Evaluating on dev set


125250it [00:00, 11385461.43it/s]      


- dev UAS: 75.79
New best dev UAS! Saving model.

Epoch 10 out of 10


100%|██████████| 48/48 [00:03<00:00, 13.83it/s]


Average Train Loss: 0.07635472000886996
Evaluating on dev set


125250it [00:00, 12524713.33it/s]      


- dev UAS: 75.08

TESTING GloVe
Restoring the best model weights found on the dev set
Final evaluation on test set


125250it [00:00, 11383734.42it/s]      

- test UAS: 76.31
Done!





## Spacy

In [68]:
#we eventually gonna make the dependency
#so maybe we can cheat a little bit, and see the answer

import spacy
from spacy import displacy #displacy is for visualization
nlp = spacy.load("en_core_web_sm")
test_sent = ["It's a god awful small affair","I sometimes wish I'd never been born at all","I'm on high ground"]
for sent in test_sent:
    doc = nlp(sent)
    options = {"collapse_punct": False}
    displacy.render(doc, options = options, style="dep", jupyter=True)

In [71]:
word_vectors = {}
for line in open("../data/en-cw.txt").readlines():
    we = line.strip().split() #we = word embeddings - first column: word;  the rest is embedding
    word_vectors[we[0]] = [float(x) for x in we[1:]] #
parser = Parser(train_set,d_op = '')
dev_setn,test_setn,embeddings_matrix,train_examples = prepare(parser,word_vectors,train_set,dev_set,test_set)

In [119]:
# get previous model 
parser = Parser(train_set,d_op = '')
model = ParserModel(embeddings_matrix)
model.eval()

weight_path = 'output\\normal\\model.weights'
model.load_state_dict(torch.load(weight_path))
parser.model = model
parser.model.eval()

ParserModel(
  (pretrained_embeddings): Embedding(5157, 50)
  (embed_to_hidden): Linear(in_features=2400, out_features=400, bias=True)
  (dropout): Dropout(p=0.5, inplace=False)
  (hidden_to_logits): Linear(in_features=400, out_features=3, bias=True)
)

In [88]:
# tokenize sentences
sentences = [nlp(sent) for sent in test_sent]
sentences = [[str(word) for word in sublist] for sublist in sentences]
sentences

[['It', "'s", 'a', 'god', 'awful', 'small', 'affair'],
 ['I', 'sometimes', 'wish', 'I', "'d", 'never', 'been', 'born', 'at', 'all'],
 ['I', "'m", 'on', 'high', 'ground']]

In [125]:
# create dataset i do it like this because i want to do
dataseto = [{'word':sentence,'dep':['det']*len(sentence),'pos':['NN']*len(sentence), 'head':[1]*len(sentence)} for sentence in sentences]

In [127]:
# numerical
sentences_n = parser.numericalize(dataseto)

for sent in sentences_n:
    print(sent['word'])

[5156, 5154, 93, 90, 3770, 5154, 554, 5154]
[5156, 5154, 1282, 5154, 5154, 901, 628, 141, 5154, 109, 170]
[5156, 5154, 740, 105, 380, 1293]


In [130]:
_,dependencies = parser.parse(sentences_n)

6it [00:00, ?it/s]                   


In [139]:
dependencies

[[(5, 4), (5, 6), (3, 5), (7, 3), (2, 7), (1, 2), (0, 1)],
 [(8, 7),
  (8, 6),
  (8, 5),
  (8, 9),
  (4, 8),
  (3, 4),
  (2, 3),
  (2, 10),
  (1, 2),
  (0, 1)],
 [(4, 3), (2, 4), (1, 2), (5, 1), (0, 5)]]

In [136]:
[[(parser.id2tok[x], parser.id2tok[y]) for x, y in inner_list] for inner_list in dependencies]


[[('<d>:amod', '<d>:advcl'),
  ('<d>:amod', '<d>:nsubj'),
  ('<d>:iobj', '<d>:amod'),
  ('<d>:case', '<d>:iobj'),
  ('<d>:nsubjpass', '<d>:case'),
  ('<d>:appos', '<d>:nsubjpass'),
  ('<d>:root', '<d>:appos')],
 [('<d>:nmod:npmod', '<d>:case'),
  ('<d>:nmod:npmod', '<d>:nsubj'),
  ('<d>:nmod:npmod', '<d>:amod'),
  ('<d>:nmod:npmod', '<d>:acl'),
  ('<d>:advcl', '<d>:nmod:npmod'),
  ('<d>:iobj', '<d>:advcl'),
  ('<d>:nsubjpass', '<d>:iobj'),
  ('<d>:nsubjpass', '<d>:det'),
  ('<d>:appos', '<d>:nsubjpass'),
  ('<d>:root', '<d>:appos')],
 [('<d>:advcl', '<d>:iobj'),
  ('<d>:nsubjpass', '<d>:advcl'),
  ('<d>:appos', '<d>:nsubjpass'),
  ('<d>:amod', '<d>:appos'),
  ('<d>:root', '<d>:amod')]]

# Conclusion 

test result

|condition|UAS|
|--------|-----|
|normal| 77.27  |
|normal - pos|  68.93   |
|normal - dep|  74.52   |
|GloVe| 76.31   |
|skip gram|  73.70   |




code so looooooooooooong.   
I'm dizzy.


