# lab 3 - Extracting noun groups using machine learning techniques

## Objectives
The objectives of this assignment are to:

- Write a program to detect partial syntactic structures
- Understand the principles of supervised machine learning techniques applied to language processing
- Use a popular machine learning toolkit: scikit-learn
- Write a short report of 1 to 2 pages on the assignment

## Choosing a training and a test sets

In [94]:
from urllib.request import urlopen

b_train_text = urlopen("http://fileadmin.cs.lth.se/cs/Education/EDAN20/corpus/conll2000/train.txt").read() # Open file and read
train_text = str(b_train_text,'utf-8')
b_test_text = urlopen("http://fileadmin.cs.lth.se/cs/Education/EDAN20/corpus/conll2000/test.txt").read() # Open file and read
test_text = str(b_test_text,'utf-8')

In [95]:
print("---TEXT EXAMPLE TRAIN---\n",train_text[:200], "\n ---TEXT EXAMPLE TEST--- \n",test_text[:200])

---TEXT EXAMPLE TRAIN---
 Confidence NN B-NP
in IN B-PP
the DT B-NP
pound NN I-NP
is VBZ B-VP
widely RB I-VP
expected VBN I-VP
to TO I-VP
take VB I-VP
another DT B-NP
sharp JJ I-NP
dive NN I-NP
if IN B-SBAR
trade NN B-NP
figur 
 ---TEXT EXAMPLE TEST--- 
 Rockwell NNP B-NP
International NNP I-NP
Corp. NNP I-NP
's POS B-NP
Tulsa NNP I-NP
unit NN I-NP
said VBD B-VP
it PRP B-NP
signed VBD B-VP
a DT B-NP
tentative JJ I-NP
agreement NN I-NP
extending VBG B-


In [96]:
import sklearn

## Baseline

Most statistical algorithms for language processing start with a so-called baseline. The baseline figure corresponds to the application of a minimal technique that is used to assess the difficulty of a task and for comparison with further programs.

#### 1. Read the baseline proposed by the organizers of the CoNLL 2000 shared task (In the Results Sect.). What do you think of it?


They get pretty high score overall but no method applying advanced ml methods with deep neural networks. (Which is understandable since the conference was held at year 2000.) 

#### 2. Implement this baseline program. You may either create a completely new program or start from an existing program that you will modify. https://github.com/pnugues/ilppp/tree/master/programs/labs/chunking/chunker_python/


Complete the train function so that it computes the chunk distribution for each part of speech. You will use the train file to derive your distribution and you will store the results in a dictionary. Below, you have an excerpt of the expected results:


In [97]:
column_names = ['form', 'pos', 'chunk']

In [98]:
# train_corpus = conll_reader.read_sentences(train_text)
sentences_train = train_text.split('\n\n') 
sentences_train.remove(sentences_train[-1]) # Last element needs to be removed

In [99]:
# train_corpus = conll_reader.split_rows(train_corpus, column_names)
train_corpus = []
for sentence in sentences_train:
    rows = sentence.split('\n')
    sentence = [dict(zip(column_names, row.split())) for row in rows]
    train_corpus.append(sentence)

In [100]:
# train_corpus = conll_reader.read_sentences(train_text)
sentences_test = test_text.split('\n\n')
sentences_test.remove(sentences_test[-1]) # Last element needs to be removed

In [101]:
# train_corpus = conll_reader.split_rows(train_corpus, column_names)
test_corpus = []
for sentence in sentences_test:
    rows = sentence.split('\n')
    sentence = [dict(zip(column_names, row.split())) for row in rows]
    test_corpus.append(sentence)

In [102]:
train_corpus[1]

[{'form': 'Chancellor', 'pos': 'NNP', 'chunk': 'O'},
 {'form': 'of', 'pos': 'IN', 'chunk': 'B-PP'},
 {'form': 'the', 'pos': 'DT', 'chunk': 'B-NP'},
 {'form': 'Exchequer', 'pos': 'NNP', 'chunk': 'I-NP'},
 {'form': 'Nigel', 'pos': 'NNP', 'chunk': 'B-NP'},
 {'form': 'Lawson', 'pos': 'NNP', 'chunk': 'I-NP'},
 {'form': "'s", 'pos': 'POS', 'chunk': 'B-NP'},
 {'form': 'restated', 'pos': 'VBN', 'chunk': 'I-NP'},
 {'form': 'commitment', 'pos': 'NN', 'chunk': 'I-NP'},
 {'form': 'to', 'pos': 'TO', 'chunk': 'B-PP'},
 {'form': 'a', 'pos': 'DT', 'chunk': 'B-NP'},
 {'form': 'firm', 'pos': 'NN', 'chunk': 'I-NP'},
 {'form': 'monetary', 'pos': 'JJ', 'chunk': 'I-NP'},
 {'form': 'policy', 'pos': 'NN', 'chunk': 'I-NP'},
 {'form': 'has', 'pos': 'VBZ', 'chunk': 'B-VP'},
 {'form': 'helped', 'pos': 'VBN', 'chunk': 'I-VP'},
 {'form': 'to', 'pos': 'TO', 'chunk': 'I-VP'},
 {'form': 'prevent', 'pos': 'VB', 'chunk': 'I-VP'},
 {'form': 'a', 'pos': 'DT', 'chunk': 'B-NP'},
 {'form': 'freefall', 'pos': 'NN', 'chunk': '

In [103]:
def count_pos(corpus):
    """
    Computes the part-of-speech distribution
    in a CoNLL 2000 file
    :param corpus:
    :return:
    """
    pos_cnt = {}
    for sentence in corpus:
        for row in sentence:
            if row['pos'] in pos_cnt:
                pos_cnt[row['pos']] += 1
            else:
                pos_cnt[row['pos']] = 1
    return pos_cnt

In [104]:
def train(corpus):
    """
    Computes the chunk distribution by pos
    The result is stored in a dictionary
    :param corpus:
    :return:
    """
    pos_cnt = count_pos(corpus)

    # We compute the chunk distribution by POS
    """
    Fill in code to compute the chunk distribution for each part of speech
    """
    chunk_dist = {key: {} for key in pos_cnt.keys()}
    for sentence in corpus:
        for row in sentence:
            if row['chunk'] in chunk_dist[row['pos']]:
                chunk_dist[row['pos']][row['chunk']] += 1
            else:
                chunk_dist[row['pos']][row['chunk']] = 1
        
    print("Example of probdist for JJR: ", chunk_dist['JJR'])
    # We determine the best association
    """
    Fill in code so that for each part of speech, you select the most frequent chunk.
    You will build a dictionary with key values:
    pos_chunk[pos] = most frequent chunk for pos
    """
    pos_ret = {key: "" for key in pos_cnt.keys()}
    for pos in chunk_dist:
        max_value = 0
        max_chunk = ""
        for chunk in chunk_dist[pos]:
            if max_value < chunk_dist[pos][chunk]:
                max_value = chunk_dist[pos][chunk]
                max_chunk = chunk
        pos_ret[pos] = max_chunk
    
    return pos_ret

In [105]:
model = train(train_corpus)

print("Example of train model for NN: ",model['NN'])

Example of probdist for JJR:  {'B-NP': 382, 'B-ADJP': 111, 'I-ADJP': 45, 'B-ADVP': 63, 'I-ADVP': 17, 'B-VP': 2, 'I-NP': 204, 'I-VP': 11, 'O': 16, 'B-PP': 2}
Example of train model for NN:  I-NP


In [106]:
def predict(model, corpus):
    """
    Predicts the chunk from the part of speech
    Adds a pchunk column
    :param model:
    :param corpus:
    :return:
    """
    """
    We add a predicted chunk column: pchunk
    """
    for sentence in corpus:
        for row in sentence:
            if 'pos' in row:
                row['pchunk'] = model[row['pos']]
            else:
                continue
            
    return corpus

In [107]:
predicted = predict(model, test_corpus)

print(predicted[50])

[{'form': 'In', 'pos': 'IN', 'chunk': 'B-PP', 'pchunk': 'B-PP'}, {'form': 'the', 'pos': 'DT', 'chunk': 'B-NP', 'pchunk': 'B-NP'}, {'form': 'same', 'pos': 'JJ', 'chunk': 'I-NP', 'pchunk': 'I-NP'}, {'form': 'statement', 'pos': 'NN', 'chunk': 'I-NP', 'pchunk': 'I-NP'}, {'form': ',', 'pos': ',', 'chunk': 'O', 'pchunk': 'O'}, {'form': 'US', 'pos': 'PRP', 'chunk': 'B-NP', 'pchunk': 'B-NP'}, {'form': 'Facilities', 'pos': 'NNPS', 'chunk': 'I-NP', 'pchunk': 'I-NP'}, {'form': 'also', 'pos': 'RB', 'chunk': 'B-ADVP', 'pchunk': 'B-ADVP'}, {'form': 'said', 'pos': 'VBD', 'chunk': 'B-VP', 'pchunk': 'B-VP'}, {'form': 'it', 'pos': 'PRP', 'chunk': 'B-NP', 'pchunk': 'B-NP'}, {'form': 'had', 'pos': 'VBD', 'chunk': 'B-VP', 'pchunk': 'B-VP'}, {'form': 'bought', 'pos': 'VBN', 'chunk': 'I-VP', 'pchunk': 'I-VP'}, {'form': 'back', 'pos': 'RB', 'chunk': 'B-ADVP', 'pchunk': 'B-ADVP'}, {'form': '112,000', 'pos': 'CD', 'chunk': 'B-NP', 'pchunk': 'I-NP'}, {'form': 'of', 'pos': 'IN', 'chunk': 'B-PP', 'pchunk': 'B-PP'}

In [108]:
def eval(predicted):
    """
    Evaluates the predicted chunk accuracy
    :param predicted:
    :return:
    """
    word_cnt = 0
    correct = 0
    for sentence in predicted:
        for row in sentence:
            word_cnt += 1
            if row['chunk'] == row['pchunk']:
                correct += 1
    return correct / word_cnt

In [109]:
accuracy = eval(predicted)
print(accuracy)

0.7729066846782194


In [110]:
f_out = open('out', 'w')
    # We write the word (form), part of speech (pos),
    # gold-standard chunk (chunk), and predicted chunk (pchunk)
for sentence in predicted:
    for row in sentence:
        f_out.write(row['form'] + ' ' + row['pos'] + ' ' + row['chunk'] + ' ' + row['pchunk'] + '\n')
    f_out.write('\n')
f_out.close()

In [111]:
import subprocess

cmd_1 = "perl conlleval.txt <out"
print(cmd_1)

p = subprocess.Popen(cmd_1, stdout=subprocess.PIPE, shell=True)
out_1, err = p.communicate() 
print(str(out_1,'utf-8'))

perl conlleval.txt <out
processed 47377 tokens with 23852 phrases; found: 26992 phrases; correct: 19592.
accuracy:  77.29%; precision:  72.58%; recall:  82.14%; FB1:  77.07
             ADJP: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
             ADVP: precision:  44.33%; recall:  77.71%; FB1:  56.46  1518
            CONJP: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
             INTJ: precision:  50.00%; recall:  50.00%; FB1:  50.00  2
              LST: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
               NP: precision:  79.87%; recall:  86.80%; FB1:  83.19  13500
               PP: precision:  74.73%; recall:  97.07%; FB1:  84.45  6249
              PRT: precision:  75.00%; recall:   8.49%; FB1:  15.25  12
             SBAR: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
               VP: precision:  60.53%; recall:  74.22%; FB1:  66.68  5711



## Using Machine Learning

In this exercise, you will apply and extend the ml_chunker.py program. You will start from the original program you downloaded and modify it so that you understand how to improve the performance of your chunker. You will not add new features to the feature vector.



In [138]:
b_train_text = urlopen("http://fileadmin.cs.lth.se/cs/Education/EDAN20/corpus/conll2000/train.txt").read() # Open file and read
train_text = str(b_train_text,'utf-8')
train_text = train_text.strip()
train_sentences = train_text.split('\n\n')

b_test_text = urlopen("http://fileadmin.cs.lth.se/cs/Education/EDAN20/corpus/conll2000/test.txt").read() # Open file and read
test_text = str(b_train_text,'utf-8').strip()
test_sentences = test_text.split('\n\n')

In [139]:
def extract_features_sent(sentence, w_size, feature_names):
    """
    Extract the features from one sentence
    returns X and y, where X is a list of dictionaries and
    y is a list of symbols
    :param sentence: string containing the CoNLL structure of a sentence
    :param w_size:
    :return:
    """

    # We pad the sentence to extract the context window more easily
    start = "BOS BOS BOS\n"
    end = "\nEOS EOS EOS"
    start *= w_size
    end *= w_size
    sentence = start + sentence
    sentence += end

    # Each sentence is a list of rows
    sentence = sentence.splitlines()
    padded_sentence = list()
    for line in sentence:
        line = line.split()
        padded_sentence.append(line)
    # print(padded_sentence)

    # We extract the features and the classes
    # X contains is a list of features, where each feature vector is a dictionary
    # y is the list of classes
    X = list()
    y = list()
    for i in range(len(padded_sentence) - 2 * w_size):
        # x is a row of X
        x = list()
        # The words in lower case
        for j in range(2 * w_size + 1):
            x.append(padded_sentence[i + j][0].lower())
        # The POS
        for j in range(2 * w_size + 1):
            x.append(padded_sentence[i + j][1])
        # The chunks (Up to the word)
        """
        for j in range(w_size):
            feature_line.append(padded_sentence[i + j][2])
        """
        # We represent the feature vector as a dictionary
        X.append(dict(zip(feature_names, x)))
        # The classes are stored in a list
        y.append(padded_sentence[i + w_size][2])
    return X, y

In [140]:
def extract_features(sentences, w_size, feature_names):
    """
    Builds X matrix and y vector
    X is a list of dictionaries and y is a list
    :param sentences:
    :param w_size:
    :return:
    """
    X_l = []
    y_l = []
    for sentence in sentences:
        X, y = extract_features_sent(sentence, w_size, feature_names)
        X_l.extend(X)
        y_l.extend(y)
    return X_l, y_l

In [141]:
feature_names = ['word_n2', 'word_n1', 'word', 'word_p1', 'word_p2','pos_n2', 'pos_n1', 'pos', 'pos_p1', 'pos_p2']
w_size = 2  # The size of the context window to the left and right of the word
   
print("Extracting the features...")

X_dict, y = extract_features(train_sentences, w_size, feature_names)

Extracting the features...


In [142]:
from sklearn.feature_extraction import DictVectorizer

print("Encoding the features...")
# Vectorize the feature matrix and carry out a one-hot encoding
vec = DictVectorizer(sparse=True)
X = vec.fit_transform(X_dict)
# The statement below will swallow a considerable memory
#X = vec.fit_transform(X_dict).toarray()
#print(vec.get_feature_names())

Encoding the features...


In [143]:
from sklearn import linear_model

print("Training the model...")
classifier = linear_model.LogisticRegression(penalty='l2', dual=True, solver='liblinear')
model = classifier.fit(X, y)
print(model)

Training the model...
LogisticRegression(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)


We apply the model to the test set

Here we carry out a chunk tag prediction and we report the per tag error.

This is done for the whole corpus without regard for the sentence structure.

In [144]:
from sklearn import metrics

print("Predicting the chunks in the test set...")
X_test_dict, y_test = extract_features(test_sentences, w_size, feature_names)

# Vectorize the test set and one-hot encoding
X_test = vec.transform(X_test_dict) # Possible to add: .toarray()
y_test_predicted = classifier.predict(X_test)
print("Classification report for classifier %s:\n%s\n" % (classifier, metrics.classification_report(y_test, y_test_predicted)))

Predicting the chunks in the test set...
Classification report for classifier LogisticRegression(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False):
             precision    recall  f1-score   support

     B-ADJP       0.93      0.83      0.88      2060
     B-ADVP       0.90      0.90      0.90      4227
    B-CONJP       0.93      0.73      0.82        56
     B-INTJ       0.64      0.23      0.33        31
      B-LST       1.00      0.10      0.18        10
       B-NP       0.98      0.98      0.98     55081
       B-PP       0.97      0.99      0.98     21281
      B-PRT       0.91      0.81      0.86       556
     B-SBAR       0.91      0.90      0.91      2207
      B-UCP       0.00      0.00      0.00         2
       B-VP       0.97      0.97      0.97     21467
     I-ADJP       0.

  'precision', 'predicted', average, warn_for)


Here we tag the test set and we save it.

This prediction is redundant with the piece of code above, but we need to predict one sentence at a time to have the same corpus structure.

In [145]:
def predict_ml(test_sentences, feature_names, f_out):
    for test_sentence in test_sentences:
        X_test_dict, y_test = extract_features_sent(test_sentence, w_size, feature_names)
        # Vectorize the test sentence and one hot encoding
        X_test = vec.transform(X_test_dict)
        # Predicts the chunks and returns numbers
        y_test_predicted = classifier.predict(X_test)
        # Appends the predicted chunks as a last column and saves the rows
        rows = test_sentence.splitlines()
        rows = [rows[i] + ' ' + y_test_predicted[i] for i in range(len(rows))]
        for row in rows:
            f_out.write(row + '\n')
        f_out.write('\n')
    f_out.close()

In [146]:
print("Predicting the test set...")
f_out = open('outml', 'w')
predict_ml(test_sentences, feature_names, f_out)

Predicting the test set...


In [147]:
cmd_2 = "perl conlleval.txt <outml"
print(cmd_2)

p = subprocess.Popen(cmd_2, stdout=subprocess.PIPE, shell=True)
out_1, err = p.communicate() 
print(str(out_1,'utf-8'))

perl conlleval.txt <outml
processed 211727 tokens with 106978 phrases; found: 108043 phrases; correct: 102094.
accuracy:  97.14%; precision:  94.49%; recall:  95.43%; FB1:  94.96
             ADJP: precision:  87.39%; recall:  80.78%; FB1:  83.96  1904
             ADVP: precision:  88.35%; recall:  89.33%; FB1:  88.84  4274
            CONJP: precision:  67.86%; recall:  67.86%; FB1:  67.86  56
             INTJ: precision:  63.64%; recall:  22.58%; FB1:  33.33  11
              LST: precision: 100.00%; recall:  10.00%; FB1:  18.18  1
               NP: precision:  94.33%; recall:  95.65%; FB1:  94.98  55852
               PP: precision:  97.10%; recall:  98.56%; FB1:  97.82  21601
              PRT: precision:  91.43%; recall:  80.58%; FB1:  85.66  490
             SBAR: precision:  90.61%; recall:  90.48%; FB1:  90.55  2204
              UCP: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
               VP: precision:  94.72%; recall:  95.52%; FB1:  95.12  21650



#### What is the feature vector that corresponds to the ml_chunker.py program? Is it the same Kudoh and Matsumoto used in their experiment?

The feature vector in ml_chunker is similar but not exactly the same. Missing features: The values of the two previous chunk tags in the first part of the window: c i-2 , c i-1

#### What is the performance of the chunker?

See above, Accuracy 94.96%, compared with 77.29% when we used Maximum likelihood method. 

#### Remove the lexical features (the words) from the feature vector and measure the performance. You should observe a decrease.

In [126]:
def extract_features_sent_re(sentence, w_size, feature_names):
    """
    Extract the features from one sentence
    returns X and y, where X is a list of dictionaries and
    y is a list of symbols
    :param sentence: string containing the CoNLL structure of a sentence
    :param w_size:
    :return:
    """

    # We pad the sentence to extract the context window more easily
    start = "BOS BOS BOS\n"
    end = "\nEOS EOS EOS"
    start *= w_size
    end *= w_size
    sentence = start + sentence
    sentence += end

    # Each sentence is a list of rows
    sentence = sentence.splitlines()
    padded_sentence = list()
    for line in sentence:
        line = line.split()
        padded_sentence.append(line)
    # print(padded_sentence)

    # We extract the features and the classes
    # X contains is a list of features, where each feature vector is a dictionary
    # y is the list of classes
    X = list()
    y = list()
    for i in range(len(padded_sentence) - 2 * w_size):
        # x is a row of X
        x = list()
        # The words in lower case
        """
        for j in range(2 * w_size + 1):
            x.append(padded_sentence[i + j][0].lower())
        """
        # The POS
        for j in range(2 * w_size + 1):
            x.append(padded_sentence[i + j][1])
        # The chunks (Up to the word)
        """
        for j in range(w_size):
            X.append(padded_sentence[i + j][2])
        """
        # We represent the feature vector as a dictionary
        X.append(dict(zip(feature_names, x)))
        # The classes are stored in a list
        y.append(padded_sentence[i + w_size][2])
    return X, y

def extract_features_re(sentences, w_size, feature_names):
    """
    Builds X matrix and y vector
    X is a list of dictionaries and y is a list
    :param sentences:
    :param w_size:
    :return:
    """
    X_l = []
    y_l = []
    for sentence in sentences:
        X, y = extract_features_sent_re(sentence, w_size, feature_names)
        X_l.extend(X)
        y_l.extend(y)
    return X_l, y_l

def predict_ml_re(test_sentences, feature_names, f_out):
    for test_sentence in test_sentences:
        X_test_dict, y_test = extract_features_sent_re(test_sentence, w_size, feature_names)
        # Vectorize the test sentence and one hot encoding
        X_test = vec.transform(X_test_dict)
        # Predicts the chunks and returns numbers
        y_test_predicted = classifier.predict(X_test)
        # Appends the predicted chunks as a last column and saves the rows
        rows = test_sentence.splitlines()
        rows = [rows[i] + ' ' + y_test_predicted[i] for i in range(len(rows))]
        for row in rows:
            f_out.write(row + '\n')
        f_out.write('\n')
    f_out.close()

In [127]:
feature_names = ['pos_n2', 'pos_n1', 'pos', 'pos_p1', 'pos_p2']
w_size = 2  # The size of the context window to the left and right of the word
   
print("Extracting the features...")

X_dict, y = extract_features_re(train_sentences, w_size, feature_names)

print("Encoding the features...")
# Vectorize the feature matrix and carry out a one-hot encoding
vec = DictVectorizer(sparse=True)
X = vec.fit_transform(X_dict)
# The statement below will swallow a considerable memory
#X = vec.fit_transform(X_dict).toarray()
#print(vec.get_feature_names())

print("Training the model...")
classifier = linear_model.LogisticRegression(penalty='l2', dual=True, solver='liblinear')
model = classifier.fit(X, y)
print(model)

print("Predicting the test set...")
f_out = open('outml_re', 'w')
predict_ml_re(test_sentences, feature_names, f_out)

cmd_2 = "perl conlleval.txt <outml_re"
print(cmd_2)

p = subprocess.Popen(cmd_2, stdout=subprocess.PIPE, shell=True)
out_1, err = p.communicate() 
print(str(out_1,'utf-8'))

Extracting the features...
Encoding the features...
Training the model...
LogisticRegression(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
Predicting the test set...
perl conlleval.txt <outml_re
processed 211727 tokens with 106978 phrases; found: 109526 phrases; correct: 93969.
accuracy:  92.22%; precision:  85.80%; recall:  87.84%; FB1:  86.81
             ADJP: precision:  63.09%; recall:  51.70%; FB1:  56.83  1688
             ADVP: precision:  69.67%; recall:  72.89%; FB1:  71.25  4422
            CONJP: precision:   0.00%; recall:   0.00%; FB1:   0.00  13
             INTJ: precision:  54.55%; recall:  19.35%; FB1:  28.57  11
              LST: precision:   0.00%; recall:   0.00%; FB1:   0.00  0
               NP: precision:  86.52%; recall:  88.98%; FB1:  87.73  56649
               P

#### What is the classifier used in the program? 

Logistic Regression is used.

#### Decision trees

In [124]:
feature_names = ['word_n2', 'word_n1', 'word', 'word_p1', 'word_p2','pos_n2', 'pos_n1', 'pos', 'pos_p1', 'pos_p2']
w_size = 2  # The size of the context window to the left and right of the word
   
print("Extracting the features...")

X_dict, y = extract_features(train_sentences, w_size, feature_names)

print("Encoding the features...")
# Vectorize the feature matrix and carry out a one-hot encoding
vec = DictVectorizer(sparse=True)
X = vec.fit_transform(X_dict)
# The statement below will swallow a considerable memory
#X = vec.fit_transform(X_dict).toarray()
#print(vec.get_feature_names())

Extracting the features...
Encoding the features...


In [125]:
from sklearn import tree

print("Training the model...")
classifier = tree.DecisionTreeClassifier()
model = classifier.fit(X, y)
print(model)

print("Predicting the test set...")
f_out = open('outml_decision_tree', 'w')
predict_ml(test_sentences, feature_names, f_out)

cmd_2 = "perl conlleval.txt <outml_decision_tree"
print(cmd_2)

p = subprocess.Popen(cmd_2, stdout=subprocess.PIPE, shell=True)
out_1, err = p.communicate() 
print(str(out_1,'utf-8'))

Training the model...


KeyboardInterrupt: 

#### Perceptron

In [None]:
from sklearn.linear_model import Perceptron

print("Training the model...")
classifier = Perceptron()
model = classifier.fit(X, y)
print(model)

print("Predicting the test set...")
f_out = open('outml_Perceptron', 'w')
predict_ml(test_sentences, feature_names, f_out)

cmd_2 = "perl conlleval.txt <outml_Perceptron"
print(cmd_2)

p = subprocess.Popen(cmd_2, stdout=subprocess.PIPE, shell=True)
out_1, err = p.communicate() 
print(str(out_1,'utf-8'))

## Improving the Chunker
#### Implement one of these two options, the first one being easier.

Complement the feature vector used in the previous section with the two dynamic features, c i-2 , c i-1 , and train a new model. You will need to modify the extract_features_sent and predict functions.
In his experiments, your teacher obtained a F1 score of 92.65 with logistic regression and a lbfgs solver and automatic multiclass;

#### You need to reach a global F1 score of 92 to pass this laboratory.

In [None]:
def extract_features_sent_im(sentence, w_size, feature_names):
    """
    Extract the features from one sentence
    returns X and y, where X is a list of dictionaries and
    y is a list of symbols
    :param sentence:
    :param w_size:
    :return:
    """

    # We pad the sentence to extract the context window more easily
    start = "BOS BOS BOS\n"
    end = "\nEOS EOS EOS"
    start *= w_size
    end *= w_size
    sentence = start + sentence
    sentence += end

    # Each sentence is a list of rows
    sentence = sentence.splitlines()
    padded_sentence = list()
    for line in sentence:
        line = line.split()
        padded_sentence.append(line)
    # print(padded_sentence)

    # We extract the features and the classes
    # X contains is a list of features, where each feature vector is a dictionary
    # y is the list of classes
    X = list()
    y = list()
    for i in range(len(padded_sentence) - 2 * w_size):
        # x is a row of X
        x = list()
        # The words in lower case
        for j in range(2 * w_size + 1):
            x.append(padded_sentence[i + j][0].lower())
        # The POS
        for j in range(2 * w_size + 1):
            x.append(padded_sentence[i + j][1])
        # The chunks (Up to the word)
        for j in range(w_size):
            x.append(padded_sentence[i + j][2])
        # We represent the feature vector as a dictionary
        X.append(dict(zip(feature_names, x)))
        # The classes are stored in a list
        y.append(padded_sentence[i + w_size][2])
    return X, y

def extract_features_im(sentences, w_size, feature_names):
    """
    Builds X matrix and y vector
    X is a list of dictionaries and y is a list
    :param sentences:
    :param w_size:
    :return:
    """
    X_l = []
    y_l = []
    for sentence in sentences:
        X, y = extract_features_sent_im(sentence, w_size, feature_names)
        X_l.extend(X)
        y_l.extend(y)
    return X_l, y_l

def predict_ml_im(test_sentences, feature_names, f_out):
    for test_sentence in test_sentences:
        X_test_dict, y_test = extract_features_sent(test_sentence, w_size, feature_names)
        # Vectorize the test sentence and one hot encoding
        X_test = vec.transform(X_test_dict)
        # Predicts the chunks and returns numbers
        y_test_predicted = classifier.predict(X_test)
        # Appends the predicted chunks as a last column and saves the rows
        rows = test_sentence.splitlines()
        rows = [rows[i] + ' ' + y_test_predicted[i] for i in range(len(rows))]
        for row in rows:
            f_out.write(row + '\n')
        f_out.write('\n')
    f_out.close()
    
    

In [None]:

feature_names = ['word_n2', 'word_n1', 'word', 'word_p1', 'word_p2',
                 'pos_n2', 'pos_n1', 'pos', 'pos_p1', 'pos_p2',
                 'c_n2', 'c_n1']

print("Extracting the features...")

X_dict, y = extract_features_im(train_sentences, w_size, feature_names)

print("Encoding the features...")
# Vectorize the feature matrix and carry out a one-hot encoding
vec = DictVectorizer(sparse=True)
X = vec.fit_transform(X_dict)
# The statement below will swallow a considerable memory
#X = vec.fit_transform(X_dict).toarray()
#print(vec.get_feature_names())

print("Training the model...")
classifier = linear_model.LogisticRegression(penalty='l2', dual=True, solver='liblinear')
model = classifier.fit(X, y)
print(model)

print("Predicting the test set...")
f_out = open('outml_improved', 'w')
predict_ml_im(test_sentences, feature_names, f_out)

cmd_2 = "perl conlleval.txt <outml_improved"
print(cmd_2)

p = subprocess.Popen(cmd_2, stdout=subprocess.PIPE, shell=True)
out_1, err = p.communicate() 
print(str(out_1,'utf-8'))

#### If you know what beam search is, apply it using the probability output of logistic regression or the score if you use support vector machines.
With the same classifier and a beam diameter of 5, your teacher obtained 92.87.

## Reading 

#### You will read the article, Contextual String Embeddings for Sequence Labeling by Akbik et al. (2018) and you will outline the main differences between their system and yours. A LSTM is a type of recurrent neural network, while CRF is a sort of beam search. https://www.aclweb.org/anthology/C18-1139

First off they used a recurrent neural network instead of logistic regression. Also they already in abstract they mension that "Our proposed embeddings have the distinct properties that they (a) are trained without any explicit notion of words and thus fundamentally model words as sequences of characters, and (b) are contextualized by their surrounding text, meaning that the same word will have different embeddings depending on its contextual use.". We train our model with explicit notion of words / sequence of words not charaters.  

#### You will tell the performance they reach on the corpus you used in this laboratory.

They scored 96.72±0.05