# Lab 6 - Dependency parsing using machine learning techniques
Laboration 6 in EDAN20 @ LTH - http://cs.lth.se/edan20/coursework/assignment-6/

Author: Jonatan Kronander

## Objectives
The objectives of this assignment are to:
* Extract feature vectors and train a classifier
* Write a statistical dependency parser
* Understand how to design parameter sets
* Write a short report on your results

## Parsing the corpus
Code from lab 5

In [1]:
def extract(stack, queue, graph, feature_names, sentence):
    """
    Extract the features from one sentence returns the features in a dictionary 
    format compatible with scikit-learn. 
    In total, the feature sets will then have six, respectively ten and 14, parameters.
    """
    features = list()
    
    for feature in feature_names:
        # ORDER: stack0_POS, stack1_POS, stack0_word, stack1_word, queue0_POS, queue1_POS, 
        # queue0_word, queue1_word, can-re, can-la
        if feature == "pos_stack_0":
            if stack:
                features.append(stack[0]["postag"])
            else:
                features.append("nil")
        if feature == "pos_stack_1":
            if len(stack)>1:
                features.append(stack[1]["postag"])
            else:
                features.append("nil")
        if feature == "word_stack_0":
            if stack:
                features.append(stack[0]["form"])
            else:
                features.append("nil")
        if feature == "word_stack_1":
            if len(stack)>1:
                features.append(stack[1]["form"])
            else:
                features.append("nil")
        if feature == "pos_queue_0":
            if queue:
                features.append(queue[0]["postag"])
            else:
                features.append("nil")
        if feature == "pos_queue_1":
            if len(queue)>1:
                features.append(queue[1]["postag"])
            else:
                features.append("nil")
        if feature == "word_queue_0":
            if queue:
                features.append(queue[0]["form"])
            else:
                features.append("nil")
        if feature == "word_queue_1":
            if len(queue)>1:
                features.append(queue[1]["form"])
            else:
                features.append("nil")
        # For the third model, you will extract at least two more features, one of them being
        # the part of speech and the word form of the word following the top of the stack in 
        # the sentence order.
        if feature == "word_fl":
            if sentence[int(queue[0]["id"])-1]:
                features.append(sentence[int(queue[0]["id"])-1]["form"])
            else:
                features.append("nil")
        if feature == "pos_fl":
            if sentence[int(queue[0]["id"])-1]:
                features.append(sentence[int(queue[0]["id"])-1]["postag"])
            else:
                features.append("nil")
        if feature == "word_stack_2":
            if len(stack)>2:
                features.append(stack[2]["form"])
            else:
                features.append("nil")
        if feature == "pos_stack_2":
            if len(stack)>2:
                features.append(stack[2]["postag"])
            else:
                features.append("nil")
        if feature == "word_stack_3":
            if len(stack)>3:
                features.append(stack[3]["form"])
            else:
                features.append("nil")
        if feature == "pos_stack_3":
            if len(stack)>3:
                features.append(stack[3]["postag"])
            else:
                features.append("nil")
        if feature == "word_queue_2":
            if len(queue)>2:
                features.append(queue[2]["form"])
            else:
                features.append("nil")
        if feature == "pos_queue_2":
            if len(queue)>2:
                features.append(queue[2]["postag"])
            else:
                features.append("nil")
        if feature == "word_queue_3":
            if len(queue)>3:
                features.append(queue[3]["form"])
            else:
                features.append("nil")
        if feature == "pos_queue_3":
            if len(queue)>3:
                features.append(queue[3]["postag"])
            else:
                features.append("nil")
        if feature == "id":
            if stack:
                features.append(stack[0]["id"])
            else:
                features.append("nil")
        

    
    # These sets will include two additional Boolean parameters, 
    # "can do left arc" and "can do reduce"
    features.append(str(transition.can_reduce(stack, graph)))
    features.append(str(transition.can_leftarc(stack, graph)))
    
    return features

def extract_features(sentences, feature_names):
    """
    BASED ON: https://github.com/pnugues/ilppp/blob/master/programs/labs/chunking/chunker_python/ml_chunker.py
    Builds X matrix and y vector
    X is a list of dictionaries and y is a list
    :param sentences:
    :param w_size:
    :return:
    """
    
    X_l = []
    y_l = []
    
    for i,sentence in enumerate(sentences):
        stack = []
        queue = list(sentence)
        graph = {}
        graph['heads'] = {}
        graph['heads']['0'] = '0'
        graph['deprels'] = {}
        graph['deprels']['0'] = 'ROOT'
        
        while queue:
            X_l.append(extract(stack, queue, graph, feature_names, sentence))
            stack, queue, graph, trans = dparser.reference(stack, queue, graph)
            y_l.append(trans)
        stack, graph = transition.empty_stack(stack, graph)
        
    return X_l, y_l


In [2]:
import transition
import conll
import dparser

column_names_2006 = ['id', 'form', 'lemma', 'cpostag', 'postag', 'feats', 'head', 'deprel', 'phead', 'pdeprel']

sentences = conll.read_sentences("train.conll")
    
formatted_corpus = conll.split_rows(sentences, column_names_2006)

feature_names_1 = ['pos_stack_0', 'word_stack_0',
                    'pos_queue_0','word_queue_0']

feature_names_2 = ['pos_stack_0', 'pos_stack_1', 'word_stack_0', 'word_stack_1',
                    'pos_queue_0', 'pos_queue_1', 'word_queue_0', 'word_queue_1']

feature_names_3 = ['pos_stack_0', 'pos_stack_1', 'word_stack_0', 'word_stack_1',
                    'pos_queue_0', 'pos_queue_1', 'word_queue_0', 'word_queue_1',
                    'word_queue_2', 'pos_queue_2', 'word_queue_3', 'pos_queue_3',
                    'word_fl', 'pos_fl']

X1, y1 = extract_features(formatted_corpus, feature_names_1)
X2, y2 = extract_features(formatted_corpus, feature_names_2)
X3, y3 = extract_features(formatted_corpus, feature_names_3)

#### Training the classifiers

In [3]:
feature_names_1_t = ['pos_stack_0', 'word_stack_0',
                    'pos_queue_0','word_queue_0',
                    'can-re', 'can-la']

feature_names_2_t = ['pos_stack_0', 'pos_stack_1', 'word_stack_0', 'word_stack_1',
                    'pos_queue_0', 'pos_queue_1', 'word_queue_0', 'word_queue_1',
                    'can-re', 'can-la']

feature_names_3_t = ['pos_stack_0', 'pos_stack_1', 'word_stack_0', 'word_stack_1',
                     'pos_queue_0', 'pos_queue_1', 'word_queue_0', 'word_queue_1',
                     'word_queue_2', 'pos_queue_2', 'word_queue_3', 'pos_queue_3',
                     'word_fl', 'pos_fl', 
                     'can-re', 'can-la']

In [4]:
def create_feature_dict(X, feature_names):
    """
    Creates a feature dict from feature names and feature list
    :param: feature list
    :param: feature names
    :return: dict
    """
    X_train = []
    for el in X:
        X_dict = {key: el[i] for i,key in enumerate(feature_names)}
        X_train.append(X_dict)
        
    return X_train

In [5]:
X1_dict = create_feature_dict(X1, feature_names_1_t)
X2_dict = create_feature_dict(X2, feature_names_2_t)
X3_dict = create_feature_dict(X3, feature_names_3_t)

Training with features 1:

In [6]:
from sklearn import linear_model
from sklearn.feature_extraction import DictVectorizer

In [7]:
print("Encoding the features...")
# Vectorize the feature matrix and carry out a one-hot encoding
vec = DictVectorizer(sparse=True)
X = vec.fit_transform(X1_dict)

print("Training the model...")
classifier = linear_model.LogisticRegression(penalty='l2', dual=True, solver='liblinear')
model1 = classifier.fit(X, y1)
print(model1)

Encoding the features...
Training the model...
LogisticRegression(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)


Training with features 2:

In [8]:
print("Encoding the features...")
# Vectorize the feature matrix and carry out a one-hot encoding
vec = DictVectorizer(sparse=True)
X = vec.fit_transform(X2_dict)

print("Training the model...")
classifier = linear_model.LogisticRegression(penalty='l2', dual=True, solver='liblinear')
model2 = classifier.fit(X, y2)
print(model2)

Encoding the features...
Training the model...
LogisticRegression(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)


Training with features 3:

In [9]:
print("Encoding the features...")
# Vectorize the feature matrix and carry out a one-hot encoding
vec = DictVectorizer(sparse=True)
X = vec.fit_transform(X3_dict)

print("Training the model...")
classifier = linear_model.LogisticRegression(penalty='l2', dual=True, solver='liblinear')
model3 = classifier.fit(X, y3)
print(model3)

Encoding the features...
Training the model...
LogisticRegression(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)


In [10]:
import pickle

In [24]:
with open('model1.pickle', 'wb') as handle:
    pickle.dump(model1, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open('model2.pickle', 'wb') as handle:
    pickle.dump(model2, handle, protocol=pickle.HIGHEST_PROTOCOL)
    
with open('model3.pickle', 'wb') as handle:
    pickle.dump(model3, handle, protocol=pickle.HIGHEST_PROTOCOL)


In [12]:
#with open('model1.pickle', 'rb') as handle:
#    model1 = pickle.load(handle)

#with open('model2.pickle', 'rb') as handle:
#    model2 = pickle.load(handle)

#with open('model3.pickle', 'rb') as handle:
#    model3 = pickle.load(handle)

#### Then we want to embed the data in Nivre's parser to compute their respective efficiencies. We proceed sentence by sentence, and word by word. For a certain state, it will predict the next action using our classifier. We will then execute the corresponding action: la, ra, re, or sh. If an action is not possible, we will carry out a shift.

In [13]:
def parse_ml(stack, queue, graph, trans):
    """
    The parsing function, parse_ml(), takes the the stack, queue, graph, 
    and the transition predicted by the classifier, and carries out the 
    transition. You can use this model and complete it:
    """
    if stack and trans[:2] == 'ra':
        stack, queue, graph = dparser.transition.right_arc(stack, queue, graph, trans[3:])
        return stack, queue, graph, 'ra'
    if stack and trans[:2] == 'la':
        stack, queue, graph = dparser.transition.left_arc(stack, queue, graph, trans[3:])
        return stack, queue, graph, 'la'
    if stack and trans[:2] == 're':
        stack, queue, graph = dparser.transition.reduce(stack, queue, graph)
        return stack, queue, graph, 're'
    else:
        stack, queue, graph = transition.shift(stack, queue, graph)
        return stack, queue, graph, 'sh'
           

In [14]:
def extract_features_test(sentences, feature_names, model):
    """
    :param sentences:
    :param w_size:
    :return:
    """

    for sentence in sentences:
        stack = []
        queue = list(sentence)
        graph = {}
        graph['heads'] = {}
        graph['heads']['0'] = '0'
        graph['deprels'] = {}
        graph['deprels']['0'] = 'ROOT'
        while queue:
            features = extract(stack, queue, graph, feature_names, sentence)
            features_dict = create_feature_dict([features], feature_names)
            
            # Vectorize the test sentence and one hot encoding
            X_test = vec.transform(features_dict)
            # Predicts the chunks and returns numbers
            trans_pred = classifier.predict(X_test)

            #trans_nr = classifier.predict()
            
            stack, queue, graph, trans = parse_ml(stack, queue, graph, trans_pred[0])
        
        stack, graph = dparser.transition.empty_stack(stack, graph)
        
        for i,word in enumerate(sentence):
            word['head'] = graph['heads'].get(str(i), str(0))
            word['deprel'] = graph['deprels'].get(str(i), str(0))
        
    return sentences 
    

In [15]:
# ORDER: stack0_POS, stack1_POS, stack0_word, stack1_word, queue0_POS, queue1_POS, 
# queue0_word, queue1_word, can-re, can-la
sentences_test = conll.read_sentences("test_blind.conll")
    
formatted_corpus_test = conll.split_rows(sentences_test, column_names_2006)

In [16]:
sentences_test = extract_features_test(formatted_corpus_test, feature_names_1_t, model1)

In [17]:
f_out = open('out1.conll', 'w')
for sentence in sentences_test:
    for word in sentence[1:]:
        for col in column_names_2006[:-1]:
            if col in word:
                    f_out.write(word[col] + '\t')
            else:
                f_out.write('_\t')
        col = column_names_2006[-1]
        if col in word:
            f_out.write(word[col] + '\n')
        else:
            f_out.write('_\n')
    f_out.write('\n')
f_out.close()

And model 1 and 2

In [18]:
sentences_test2 = extract_features_test(formatted_corpus_test, feature_names_2_t, model2)

f_out = open('out2.conll', 'w')
for sentence in sentences_test2:
    for word in sentence[1:]:
        for col in column_names_2006[:-1]:
            if col in word:
                    f_out.write(word[col] + '\t')
            else:
                f_out.write('_\t')
        col = column_names_2006[-1]
        if col in word:
            f_out.write(word[col] + '\n')
        else:
            f_out.write('_\n')
    f_out.write('\n')
f_out.close()

In [19]:
sentences_test3 = extract_features_test(formatted_corpus_test, feature_names_3_t, model3)

f_out = open('out3.conll', 'w')
for sentence in sentences_test3:
    for word in sentence[1:]:
        for col in column_names_2006[:-1]:
            if col in word:
                    f_out.write(word[col] + '\t')
            else:
                f_out.write('_\t')
        col = column_names_2006[-1]
        if col in word:
            f_out.write(word[col] + '\n')
        else:
            f_out.write('_\n')
    f_out.write('\n')
f_out.close()


## EVALUATION
You need to reach a labelled attachment score of 75 to pass this lab.

In [20]:
import subprocess

In [21]:
cmd_1 = "perl eval.py -g test.conll -s out1.conll"
print(cmd_1)

p = subprocess.Popen(cmd_1, stdout=subprocess.PIPE, shell=True)
out_1, err = p.communicate() 
print(str(out_1,'utf-8'))

perl eval.py -g test.conll -s out1.conll
  Labeled   attachment score: 3289 / 5021 * 100 = 65.50 %
  Unlabeled attachment score: 3627 / 5021 * 100 = 72.24 %
  Label accuracy score:       3535 / 5021 * 100 = 70.40 %


  Evaluation of the results in out1.conll
  vs. gold standard test.conll:

  Legend: '.S' - the beginning of a sentence, '.E' - the end of a sentence

  Number of non-scoring tokens: 635

  The overall accuracy and its distribution over CPOSTAGs

  -----------+-------+-------+------+-------+------+-------+-------
  Accuracy   | words | right |   %  | right |   %  | both  |   %
             |       | head  |      |  dep  |      | right |
  -----------+-------+-------+------+-------+------+-------+-------
  total      |  5021 |  3627 |  72% |  3535 |  70% |  3289 |  66%
  -----------+-------+-------+------+-------+------+-------+-------
  NN         |  1109 |   835 |  75% |   789 |  71% |   767 |  69%
  PR         |   689 |   482 |  70% |   382 |  55% |   347 |  50%
  PO    

In [22]:
cmd_2 = "perl eval.py -g test.conll -s out2.conll"
print(cmd_2)

p = subprocess.Popen(cmd_2, stdout=subprocess.PIPE, shell=True)
out_2, err = p.communicate() 
print(str(out_2,'utf-8'))

perl eval.py -g test.conll -s out2.conll
  Labeled   attachment score: 3549 / 5021 * 100 = 70.68 %
  Unlabeled attachment score: 3927 / 5021 * 100 = 78.21 %
  Label accuracy score:       3730 / 5021 * 100 = 74.29 %


  Evaluation of the results in out2.conll
  vs. gold standard test.conll:

  Legend: '.S' - the beginning of a sentence, '.E' - the end of a sentence

  Number of non-scoring tokens: 635

  The overall accuracy and its distribution over CPOSTAGs

  -----------+-------+-------+------+-------+------+-------+-------
  Accuracy   | words | right |   %  | right |   %  | both  |   %
             |       | head  |      |  dep  |      | right |
  -----------+-------+-------+------+-------+------+-------+-------
  total      |  5021 |  3927 |  78% |  3730 |  74% |  3549 |  71%
  -----------+-------+-------+------+-------+------+-------+-------
  NN         |  1109 |   872 |  79% |   824 |  74% |   804 |  72%
  PR         |   689 |   509 |  74% |   411 |  60% |   372 |  54%
  PO    

In [23]:
cmd_3 = "perl eval.py -g test.conll -s out3.conll"
print(cmd_3)

p = subprocess.Popen(cmd_3, stdout=subprocess.PIPE, shell=True)
out_3, err = p.communicate() 
print(str(out_3,'utf-8'))

perl eval.py -g test.conll -s out3.conll
  Labeled   attachment score: 3915 / 5021 * 100 = 77.97 %
  Unlabeled attachment score: 4239 / 5021 * 100 = 84.43 %
  Label accuracy score:       4100 / 5021 * 100 = 81.66 %


  Evaluation of the results in out3.conll
  vs. gold standard test.conll:

  Legend: '.S' - the beginning of a sentence, '.E' - the end of a sentence

  Number of non-scoring tokens: 635

  The overall accuracy and its distribution over CPOSTAGs

  -----------+-------+-------+------+-------+------+-------+-------
  Accuracy   | words | right |   %  | right |   %  | both  |   %
             |       | head  |      |  dep  |      | right |
  -----------+-------+-------+------+-------+------+-------+-------
  total      |  5021 |  4239 |  84% |  4100 |  82% |  3915 |  78%
  -----------+-------+-------+------+-------+------+-------+-------
  NN         |  1109 |   958 |  86% |   938 |  85% |   909 |  82%
  PR         |   689 |   519 |  75% |   436 |  63% |   392 |  57%
  PO    

## Reading
Read the article: Globally Normalized Transition-Based Neural Networks by Andor and al. (2016) [pdf] and write in a few sentences how it relates to your work in this assignment. https://www.aclweb.org/anthology/P16-1231/

We have done a small piece of what the paper have done and with other ml methods. The authors did part-of speech tagging, dependency parsing and sentence compression results, whilest we only did dependency parsing. See section 4.2 in paper. Implementing a neural network the authors got a significant better result of above 90% (though on anohter dataset).