# Assignment #4: Extracting syntactic groups using machine-learning techniques
Author: Pierre Nugues

In this assignment, you will create a system to extract syntactic groups from a text. You will apply it to the CoNLL 2000 dataset. In addition, you will try to link a few extracted named entities to real things using wikipedia.

## Objectives

The objectives of this assignment are to:
* Write a program to detect partial syntactic structures
* Extract named entities and link them to real things using Wikipedia
* Understand the principles of supervised machine learning techniques applied to language processing
* Use a popular machine learning toolkit: scikit-learn
* Write a short report of 2 to 3 pages on the assignment

## Choosing a training and a test sets

As annotated data and annotation scheme, you will use the data available from [CoNLL 2000](https://www.clips.uantwerpen.be/conll2000/chunking/).
1. Download both the training and test sets and decompress them.
2. Local copies are also available here: [train.txt](https://fileadmin.cs.lth.se/cs/Education/EDAN20/corpus/conll2000/train.txt) and [test.txt](https://fileadmin.cs.lth.se/cs/Education/EDAN20/corpus/conll2000/test.txt)
3. Read the description of the CoNLL 2000 task

CoNLL 2000 is an early dataset and contrary to many current ones, it has no development set.

## Loading the corpus

### The datasets

You may need to adjust the paths to load the datasets from your machine.

In [16]:
train_file = 'C:/Users/fanny/Desktop/Språkteknologi/train.txt'
test_file = 'C:/Users/fanny/Desktop/Språkteknologi/test.txt'

#### Reading the files

Read the functions below to load the datasets. They store the corpus in a list of sentences. Each sentence is a list of rows, where each row is a dictionary.

In [17]:
def read_sentences(file):
    """
    Creates a list of sentences from the corpus
    Each sentence is a string
    :param file:
    :return:
    """
    f = open(file).read().strip()
    sentences = f.split('\n\n')
    return sentences

In [18]:
def split_rows(sentences, column_names):
    """
    Creates a list of sentence where each sentence is a list of lines
    Each line is a dictionary of columns
    :param sentences:
    :param column_names:
    :return:
    """
    new_sentences = []
    for sentence in sentences:
        rows = sentence.split('\n')
        sentence = [dict(zip(column_names, row.split())) for row in rows]
        new_sentences.append(sentence)
    return new_sentences

### Loading dictionaries

The CoNLL 2000 files have three columns: The wordform, `form`, its part of speech, `pos`, and the tag denoting the syntactic group also called the chunk tag, `chunk`.

In [19]:
column_names = ['form', 'pos', 'chunk']

We load the corpus in a list of dictionaries

In [20]:
train_sentences = read_sentences(train_file)
train_corpus = split_rows(train_sentences, column_names)
train_corpus[:2]

[[{'form': 'Confidence', 'pos': 'NN', 'chunk': 'B-NP'},
  {'form': 'in', 'pos': 'IN', 'chunk': 'B-PP'},
  {'form': 'the', 'pos': 'DT', 'chunk': 'B-NP'},
  {'form': 'pound', 'pos': 'NN', 'chunk': 'I-NP'},
  {'form': 'is', 'pos': 'VBZ', 'chunk': 'B-VP'},
  {'form': 'widely', 'pos': 'RB', 'chunk': 'I-VP'},
  {'form': 'expected', 'pos': 'VBN', 'chunk': 'I-VP'},
  {'form': 'to', 'pos': 'TO', 'chunk': 'I-VP'},
  {'form': 'take', 'pos': 'VB', 'chunk': 'I-VP'},
  {'form': 'another', 'pos': 'DT', 'chunk': 'B-NP'},
  {'form': 'sharp', 'pos': 'JJ', 'chunk': 'I-NP'},
  {'form': 'dive', 'pos': 'NN', 'chunk': 'I-NP'},
  {'form': 'if', 'pos': 'IN', 'chunk': 'B-SBAR'},
  {'form': 'trade', 'pos': 'NN', 'chunk': 'B-NP'},
  {'form': 'figures', 'pos': 'NNS', 'chunk': 'I-NP'},
  {'form': 'for', 'pos': 'IN', 'chunk': 'B-PP'},
  {'form': 'September', 'pos': 'NNP', 'chunk': 'B-NP'},
  {'form': ',', 'pos': ',', 'chunk': 'O'},
  {'form': 'due', 'pos': 'JJ', 'chunk': 'B-ADJP'},
  {'form': 'for', 'pos': 'IN', 'ch

## Baseline chunker

Most statistical algorithms for language processing start with a so-called baseline. The baseline performance corresponds to the application of a minimal technique that is used to assess the difficulty of a task and for comparison with further programs.

You will implement the baseline proposed by the organizers of the
        <a href="https://www.clips.uantwerpen.be/conll2000/chunking/">CoNLL 2000 shared task</a>, Sect. <i>Results</i>.
1. Read it;
2. In the report, you will tell what do you think of it.

### Auxiliary functions

A function to count the parts of speech

In [21]:
def count_pos(corpus):
    """
    Computes the part-of-speech distribution
    in a CoNLL 2000 file
    :param corpus:
    :return:
    """
    pos_cnt = {}
    for sentence in corpus:
        for row in sentence:
            if row['pos'] in pos_cnt:
                pos_cnt[row['pos']] += 1
            else:
                pos_cnt[row['pos']] = 1
    return pos_cnt

We first collect all the parts of speech and we count them. CoNLL uses the Penn Treebank tagset seen during the course, where `NN` means common noun, singular, `ÌN` means preposition, `DT` means determiner, etc.

In [22]:
pos_cnt = count_pos(train_corpus)
pos_cnt

{'NN': 30147,
 'IN': 22764,
 'DT': 18335,
 'VBZ': 4648,
 'RB': 6607,
 'VBN': 4763,
 'TO': 5081,
 'VB': 6017,
 'JJ': 13085,
 'NNS': 13619,
 'NNP': 19884,
 ',': 10770,
 'CC': 5372,
 'POS': 1769,
 '.': 8827,
 'VBP': 2868,
 'VBG': 3272,
 'PRP$': 1881,
 'CD': 8315,
 '``': 1531,
 "''": 1493,
 'VBD': 6745,
 'EX': 206,
 'MD': 2167,
 '#': 36,
 '(': 274,
 '$': 1750,
 ')': 281,
 'NNPS': 420,
 'PRP': 3820,
 'JJS': 374,
 'WP': 529,
 'RBR': 321,
 'JJR': 853,
 'WDT': 955,
 'WRB': 478,
 'RBS': 191,
 'PDT': 55,
 'RP': 83,
 ':': 1047,
 'FW': 38,
 'WP$': 35,
 'SYM': 6,
 'UH': 15}

### Chunk distribution

You will compute the chunk distribution for each part of speech. You will use the training file to derive the distribution and you will store the results in a dictionary that you will call `chunk_dist`. Below, you have an excerpt of the expected results:
```
{'JJR':
{'I-ADVP': 17, 'I-ADJP': 45, 'I-NP': 204, 'B-ADVP': 63,
'B-PP': 2, 'B-ADJP': 111, 'B-NP': 382, 'B-VP': 2,
'I-VP': 11, 'O': 16},
'CC':
{'B-ADVP': 3, 'O': 3676, 'I-VP': 104, 'B-CONJP': 6,
'I-ADVP': 30, 'I-UCP': 2, 'I-PP': 24, 'I-ADJP': 26,
'I-NP': 1409, 'B-ADJP': 2, 'B-NP': 18, 'B-PP': 70,
'I-PRT': 1, 'B-VP': 1},
'NN':
{'B-LST': 2, 'I-INTJ': 2, 'B-ADVP': 38, 'O': 37,
'I-ADVP': 11, 'B-INTJ': 1, 'I-UCP': 2, 'B-UCP': 2,
'I-VP': 77, 'B-PRT': 2, 'I-ADJP': 41, 'I-NP': 24456,
'B-ADJP': 44, 'B-NP': 5160, 'B-PP': 15, 'B-VP': 257},
...
```

In [23]:
chunk_dist={}

for sen in train_corpus:
    for rad in sen:
        if rad['pos'] in chunk_dist.keys():
            if rad['chunk'] in chunk_dist[rad['pos']].keys():
                chunk_dist[rad['pos']][rad['chunk']] +=1
            else:
                 chunk_dist[rad['pos']][rad['chunk']] =1 
        else:
            chunk_dist[rad['pos']]={rad['chunk']:1}

In [24]:
chunk_dist['NN']

{'B-NP': 5160,
 'I-NP': 24456,
 'B-VP': 257,
 'B-ADJP': 44,
 'B-ADVP': 38,
 'O': 37,
 'B-PP': 15,
 'I-ADVP': 11,
 'I-ADJP': 41,
 'I-VP': 77,
 'B-INTJ': 1,
 'B-LST': 2,
 'B-UCP': 2,
 'I-UCP': 2,
 'B-PRT': 2,
 'I-INTJ': 2}

### Selecting the POS-chunk associations

For each part of speech, select the best association. In the example above, you will have (NN, I-NP) as it is the most frequent association. You will store the results in a dictionary that you will call `pos_chunk`

In [25]:
pos_chunk={}
for poss in chunk_dist:
    max_chunk=max(chunk_dist[poss],key=chunk_dist[poss].get)
    pos_chunk[poss]=max_chunk

In [26]:
pos_chunk['NN']

'I-NP'

In [27]:
pos_chunk['JJR']

'B-NP'

### Prediction

Using the resulting associations, apply your chunker to the test file. You will write a `predict(model, corpus)` function, where `model` will be your associations and `corpus`, the test corpus. You will format the test corpus as a dictionary, where you will add a `pchunk` key for each row with a value that will correspond to the predicted chunk (`p` is for _predicted_).

In [28]:
def predict(model,cor):
    for sen in cor:
        for rad in sen:
            typnn=rad['pos']
            enannan=model[typnn]
            rad['pchunk']=enannan
    return cor

We load the test corpus

In [29]:
test_sentences = read_sentences(test_file)
test_corpus = split_rows(test_sentences, column_names)
test_corpus[:1]

[[{'form': 'Rockwell', 'pos': 'NNP', 'chunk': 'B-NP'},
  {'form': 'International', 'pos': 'NNP', 'chunk': 'I-NP'},
  {'form': 'Corp.', 'pos': 'NNP', 'chunk': 'I-NP'},
  {'form': "'s", 'pos': 'POS', 'chunk': 'B-NP'},
  {'form': 'Tulsa', 'pos': 'NNP', 'chunk': 'I-NP'},
  {'form': 'unit', 'pos': 'NN', 'chunk': 'I-NP'},
  {'form': 'said', 'pos': 'VBD', 'chunk': 'B-VP'},
  {'form': 'it', 'pos': 'PRP', 'chunk': 'B-NP'},
  {'form': 'signed', 'pos': 'VBD', 'chunk': 'B-VP'},
  {'form': 'a', 'pos': 'DT', 'chunk': 'B-NP'},
  {'form': 'tentative', 'pos': 'JJ', 'chunk': 'I-NP'},
  {'form': 'agreement', 'pos': 'NN', 'chunk': 'I-NP'},
  {'form': 'extending', 'pos': 'VBG', 'chunk': 'B-VP'},
  {'form': 'its', 'pos': 'PRP$', 'chunk': 'B-NP'},
  {'form': 'contract', 'pos': 'NN', 'chunk': 'I-NP'},
  {'form': 'with', 'pos': 'IN', 'chunk': 'B-PP'},
  {'form': 'Boeing', 'pos': 'NNP', 'chunk': 'B-NP'},
  {'form': 'Co.', 'pos': 'NNP', 'chunk': 'I-NP'},
  {'form': 'to', 'pos': 'TO', 'chunk': 'B-VP'},
  {'form':

We predict the groups. You should have added a `pchunk` key

In [30]:
predicted_test_corpus = predict(pos_chunk, test_corpus)
predicted_test_corpus[:1]

[[{'form': 'Rockwell', 'pos': 'NNP', 'chunk': 'B-NP', 'pchunk': 'I-NP'},
  {'form': 'International', 'pos': 'NNP', 'chunk': 'I-NP', 'pchunk': 'I-NP'},
  {'form': 'Corp.', 'pos': 'NNP', 'chunk': 'I-NP', 'pchunk': 'I-NP'},
  {'form': "'s", 'pos': 'POS', 'chunk': 'B-NP', 'pchunk': 'B-NP'},
  {'form': 'Tulsa', 'pos': 'NNP', 'chunk': 'I-NP', 'pchunk': 'I-NP'},
  {'form': 'unit', 'pos': 'NN', 'chunk': 'I-NP', 'pchunk': 'I-NP'},
  {'form': 'said', 'pos': 'VBD', 'chunk': 'B-VP', 'pchunk': 'B-VP'},
  {'form': 'it', 'pos': 'PRP', 'chunk': 'B-NP', 'pchunk': 'B-NP'},
  {'form': 'signed', 'pos': 'VBD', 'chunk': 'B-VP', 'pchunk': 'B-VP'},
  {'form': 'a', 'pos': 'DT', 'chunk': 'B-NP', 'pchunk': 'B-NP'},
  {'form': 'tentative', 'pos': 'JJ', 'chunk': 'I-NP', 'pchunk': 'I-NP'},
  {'form': 'agreement', 'pos': 'NN', 'chunk': 'I-NP', 'pchunk': 'I-NP'},
  {'form': 'extending', 'pos': 'VBG', 'chunk': 'B-VP', 'pchunk': 'B-VP'},
  {'form': 'its', 'pos': 'PRP$', 'chunk': 'B-NP', 'pchunk': 'B-NP'},
  {'form': 'c

### Accuracy

We can evaluate the performance of the baseline with the tag accuracy: the percentage of words that receive the correct tag.

In [31]:
def eval(predicted):
    """
    Evaluates the predicted chunk accuracy
    :param predicted:
    :return:
    """
    word_cnt = 0
    correct = 0
    for sentence in predicted:
        for row in sentence:
            word_cnt += 1
            if row['chunk'] == row['pchunk']:
                correct += 1
    return correct / word_cnt

In [32]:
accuracy = eval(predicted_test_corpus)
accuracy

0.7729066846782194

### The CoNLL evaluation

The accuracy is very misleading as it is biased by the most frequent tags. It is not a good way to evaluate chunking. Instead, CoNLL computes the F1 score of all the chunks with a specific evaluation script. This F1 score is the harmonic mean of precision and recall.

#### Saving the corpus

To use the CoNLL evaluation script, you will store your results in an output file that has four columns. The three first columns will be the input columns from the test file: 
* word, 
* part of speech, and 
* gold-standard chunk. 

You will append the predicted chunk as the 4th column. Your output file should look like the excerpt below:
```
Rockwell NNP B-NP I-NP
International NNP I-NP I-NP
Corp. NNP I-NP I-NP
's POS B-NP B-NP
Tulsa NNP I-NP I-NP
unit NN I-NP I-NP
said VBD B-VP B-VP
it PRP B-NP B-NP
```
In this corpus, the column separator is the space. In most recent corpora, the separator would be a tabulation.

You will use a `save_results(output_dict, keys, output_file)` function, where the keys will be `['form', 'pos', 'chunk', 'pchunk']`

In [33]:
keys = ['form', 'pos', 'chunk', 'pchunk']

In [34]:
def save_results(output_dict, keys, output_file):
    f_out = open(output_file, 'w')
    # We write the word (form), part of speech (pos),
    # gold-standard chunk (chunk), and predicted chunk (pchunk)
    for sentence in output_dict:
        for row in sentence:
            for key in keys:
                f_out.write(row[key] + ' ')
            f_out.write('\n')
        f_out.write('\n')
    f_out.close()
    return

In [35]:
save_results(predicted_test_corpus, keys, 'out')

The CoNLL 2000 evaluation script will use these two last columns, chunk and predicted chunk, to compute the performance.

### Evaluation Procedure

To evaluate your results, you have two options:
1. Use the original conlleval script here:  <a href="https://www.clips.uantwerpen.be/conll2000/chunking/"><tt>conlleval.txt</tt></a>.
2. Use a Python translation of it. 

You will use the second option and you will describe the results you obtained in your report

#### The Python translation of the `conlleval`script

Run the cell below to install the script from https://github.com/kaniblu/conlleval:

In [36]:
!pip install conlleval

Collecting conlleval
  Downloading conlleval-0.2-py3-none-any.whl (5.4 kB)
Installing collected packages: conlleval
Successfully installed conlleval-0.2


#### Evaluation

We compute the baseline score. Check that this score corresponds to the one reported in the CoNLL description.

In [37]:
import conlleval
lines = open('out').read().splitlines()
res = conlleval.evaluate(lines)
baseline_score = res['overall']['chunks']['evals']['f1']

In [38]:
baseline_score

0.770671072299583

### The official script

You may want to double-check your results with the original CoNLL script. It is more complex to use and this part is optional. I suggest not to try it before you are done with the program.

To run the original script, read these items:
* <tt>conlleval.txt</tt> is the official CoNLL Perl script. It expects the two last columns of the test set to be the manually assigned chunk (gold standard) and the predicted chunk.
* <tt>conlleval.txt</tt> was written for Unix and if you run Windows, you will have to use a terminal command. In the File menu of the notebook, select New and then Terminal.
* Start it like this: ` $ conlleval.txt <out` where the `out` file contains both the gold and predicted chunk tags. `conlleval.txt` is a Perl script.
* Perl is installed on most Unix distributions. If it is not installed on your machine, you need to install it. Make also sure that you have the execution rights. Otherwise change them with: `$ chmod +x conlleval.txt`
* The `conlleval.txt` script expects the new lines to be `\n` as in Unix. If you run your Python program on Windows, your new lines will be `\r\n`. To have the correct new lines, add this parameter to `open()`: `newline='\n’` like this: `f_out = open('out', ‘w’, newline='\n’)`
* The complete description of the CoNLL 2000 evaluation script is available here: [https://www.clips.uantwerpen.be/conll2000/chunking/output.html](https://www.clips.uantwerpen.be/conll2000/chunking/output.html)

## Using Machine Learning: A first ML program

In this exercise, you will apply and explore a machine-learning program.

The program that won the CoNLL 2000 shared task (Kudoh and Matsumoto, 2000) used a window of five words around the chunk tag to identify, $c_i$. They built a feature vector consisting of:
1. The values of the five words in this window: $w_{i-2}, w_{i-1}, w_{i}, w_{i+1}, w_{i+2}$
2. The values of the five parts of speech in this window: $t_{i-2}, t_{i-1}, t_{i}, t_{i+1}, t_{i+2}$
3. The values of the two previous chunk tags in the first part of the window: $c_{i-2}, c_{i-1}$

The two last parameters (3.) are said to be dynamic because the program computes them at run-time. Read [Kudoh and Matsumoto's paper](https://www.clips.uantwerpen.be/conll2000/pdf/14244kud.pdf) and the Yamcha (http://www.chasen.org/~taku/software/yamcha/) software site. We would call this architecture recurrent now.

You will start with a given code that uses the two first sets of features (1. and 2.) and add yourself the last one (3.) to improve the performance of your chunker. Kudoh and Matsumoto trained a classifier based on support vector machines. You will use logistic regression.

### Imports

In [47]:
import bs4
import os
import requests
from sklearn.feature_extraction import DictVectorizer
from sklearn import svm
from sklearn import linear_model
from sklearn import metrics
from sklearn import tree
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV
import time

### Feature extraction

#### Functions

A first function to extract features from one sentence

In [39]:
def extract_features_sent_static(sentence, w_size, feature_names):
    """
    Extract the features from one sentence
    returns X and y, where X is a list of dictionaries and
    y is a list of symbols
    :param sentence: string containing the CoNLL structure of a sentence
    :param w_size:
    :return:
    """

    # We pad the sentence to extract the context window more easily
    start = [{'form': '__bos__', 'pos': '__bos__', 'chunk': '__bos__'}]
    end = [{'form': '__eos__', 'pos': '__eos__', 'chunk': '__eos__'}]
    start *= w_size
    end *= w_size
    padded_sentence = start + sentence
    padded_sentence += end

    # We extract the features and the classes
    # X contains is a list of features, where each feature vector is a dictionary
    # y is the list of classes
    X = list()
    y = list()
    for i in range(len(padded_sentence) - 2 * w_size):
        # x is a row of X
        x = list()
        # The words in lower case
        for j in range(2 * w_size + 1):
            x.append(padded_sentence[i + j]['form'].lower())
        # The POS
        for j in range(2 * w_size + 1):
            x.append(padded_sentence[i + j]['pos'])
        # The chunks (Up to the word)
        """
        for j in range(w_size):
            feature_line.append(padded_sentence[i + j]['chunk'])
        """
        # We represent the feature vector as a dictionary
        X.append(dict(zip(feature_names, x)))
        # The classes are stored in a list
        y.append(padded_sentence[i + w_size]['chunk'])
    return X, y

And from all the sentences

In [40]:
def extract_features_static(sentences, w_size, feature_names):
    """
    Builds X matrix and y vector
    X is a list of dictionaries and y is a list
    :param sentences:
    :param w_size:
    :return:
    """
    X_l = []
    y_l = []
    for sentence in sentences:
        X, y = extract_features_sent_static(sentence, w_size, feature_names)
        X_l.extend(X)
        y_l.extend(y)
    return X_l, y_l

#### Applying the feature extraction

The size of the window and the names of the features

In [41]:
w_size = 2  # The size of the context window to the left and right of the word
feature_names = ['word_n2', 'word_n1', 'word', 'word_p1', 'word_p2',
                 'pos_n2', 'pos_n1', 'pos', 'pos_p1', 'pos_p2']

We read the corpus and format it as a dictionary

In [42]:
train_sentences = read_sentences(train_file)
train_corpus = split_rows(train_sentences, column_names)

In [43]:
train_corpus[:2]

[[{'form': 'Confidence', 'pos': 'NN', 'chunk': 'B-NP'},
  {'form': 'in', 'pos': 'IN', 'chunk': 'B-PP'},
  {'form': 'the', 'pos': 'DT', 'chunk': 'B-NP'},
  {'form': 'pound', 'pos': 'NN', 'chunk': 'I-NP'},
  {'form': 'is', 'pos': 'VBZ', 'chunk': 'B-VP'},
  {'form': 'widely', 'pos': 'RB', 'chunk': 'I-VP'},
  {'form': 'expected', 'pos': 'VBN', 'chunk': 'I-VP'},
  {'form': 'to', 'pos': 'TO', 'chunk': 'I-VP'},
  {'form': 'take', 'pos': 'VB', 'chunk': 'I-VP'},
  {'form': 'another', 'pos': 'DT', 'chunk': 'B-NP'},
  {'form': 'sharp', 'pos': 'JJ', 'chunk': 'I-NP'},
  {'form': 'dive', 'pos': 'NN', 'chunk': 'I-NP'},
  {'form': 'if', 'pos': 'IN', 'chunk': 'B-SBAR'},
  {'form': 'trade', 'pos': 'NN', 'chunk': 'B-NP'},
  {'form': 'figures', 'pos': 'NNS', 'chunk': 'I-NP'},
  {'form': 'for', 'pos': 'IN', 'chunk': 'B-PP'},
  {'form': 'September', 'pos': 'NNP', 'chunk': 'B-NP'},
  {'form': ',', 'pos': ',', 'chunk': 'O'},
  {'form': 'due', 'pos': 'JJ', 'chunk': 'B-ADJP'},
  {'form': 'for', 'pos': 'IN', 'ch

In [44]:
X_dict, y = extract_features_static(train_corpus, w_size, feature_names)
X_dict[:2]

[{'word_n2': '__bos__',
  'word_n1': '__bos__',
  'word': 'confidence',
  'word_p1': 'in',
  'word_p2': 'the',
  'pos_n2': '__bos__',
  'pos_n1': '__bos__',
  'pos': 'NN',
  'pos_p1': 'IN',
  'pos_p2': 'DT'},
 {'word_n2': '__bos__',
  'word_n1': 'confidence',
  'word': 'in',
  'word_p1': 'the',
  'word_p2': 'pound',
  'pos_n2': '__bos__',
  'pos_n1': 'NN',
  'pos': 'IN',
  'pos_p1': 'DT',
  'pos_p2': 'NN'}]

In [45]:
y[:2]

['B-NP', 'B-PP']

### Feature encoding

In [49]:
# Vectorize the feature matrix and carry out a one-hot encoding
vec = DictVectorizer(sparse=True)
X = vec.fit_transform(X_dict)

### Training the model

In [50]:
classifier = linear_model.LogisticRegression()
model = classifier.fit(X, y)
model

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()

### Predicting the test set

We read the sentences and create a dictionary

In [51]:
test_sentences = read_sentences(test_file)
test_corpus = split_rows(test_sentences, column_names)
test_corpus[:2]

[[{'form': 'Rockwell', 'pos': 'NNP', 'chunk': 'B-NP'},
  {'form': 'International', 'pos': 'NNP', 'chunk': 'I-NP'},
  {'form': 'Corp.', 'pos': 'NNP', 'chunk': 'I-NP'},
  {'form': "'s", 'pos': 'POS', 'chunk': 'B-NP'},
  {'form': 'Tulsa', 'pos': 'NNP', 'chunk': 'I-NP'},
  {'form': 'unit', 'pos': 'NN', 'chunk': 'I-NP'},
  {'form': 'said', 'pos': 'VBD', 'chunk': 'B-VP'},
  {'form': 'it', 'pos': 'PRP', 'chunk': 'B-NP'},
  {'form': 'signed', 'pos': 'VBD', 'chunk': 'B-VP'},
  {'form': 'a', 'pos': 'DT', 'chunk': 'B-NP'},
  {'form': 'tentative', 'pos': 'JJ', 'chunk': 'I-NP'},
  {'form': 'agreement', 'pos': 'NN', 'chunk': 'I-NP'},
  {'form': 'extending', 'pos': 'VBG', 'chunk': 'B-VP'},
  {'form': 'its', 'pos': 'PRP$', 'chunk': 'B-NP'},
  {'form': 'contract', 'pos': 'NN', 'chunk': 'I-NP'},
  {'form': 'with', 'pos': 'IN', 'chunk': 'B-PP'},
  {'form': 'Boeing', 'pos': 'NNP', 'chunk': 'B-NP'},
  {'form': 'Co.', 'pos': 'NNP', 'chunk': 'I-NP'},
  {'form': 'to', 'pos': 'TO', 'chunk': 'B-VP'},
  {'form':

We extract the features

In [52]:
X_test_dict, y_test = extract_features_static(test_corpus, w_size, feature_names)
X_test_dict[:2]

[{'word_n2': '__bos__',
  'word_n1': '__bos__',
  'word': 'rockwell',
  'word_p1': 'international',
  'word_p2': 'corp.',
  'pos_n2': '__bos__',
  'pos_n1': '__bos__',
  'pos': 'NNP',
  'pos_p1': 'NNP',
  'pos_p2': 'NNP'},
 {'word_n2': '__bos__',
  'word_n1': 'rockwell',
  'word': 'international',
  'word_p1': 'corp.',
  'word_p2': "'s",
  'pos_n2': '__bos__',
  'pos_n1': 'NNP',
  'pos': 'NNP',
  'pos_p1': 'NNP',
  'pos_p2': 'POS'}]

In [53]:
y_test[:2]

['B-NP', 'I-NP']

We vectorize the features

In [54]:
X_test = vec.transform(X_test_dict)  # Possible to add: .toarray()

And we predict the test set

In [55]:
y_test_predicted = classifier.predict(X_test)
y_test_predicted[:2]

array(['B-NP', 'I-NP'], dtype='<U7')

We now add the predicted chunks to the sentences

In [56]:
inx = 0
for sentence in test_corpus:
    for word in sentence:
        word['pchunk'] = y_test_predicted[inx]
        inx += 1

The index should be equal to the length of the prediction

In [57]:
print(inx)
len(y_test_predicted)

47377


47377

In [58]:
test_corpus[:2]

[[{'form': 'Rockwell', 'pos': 'NNP', 'chunk': 'B-NP', 'pchunk': 'B-NP'},
  {'form': 'International', 'pos': 'NNP', 'chunk': 'I-NP', 'pchunk': 'I-NP'},
  {'form': 'Corp.', 'pos': 'NNP', 'chunk': 'I-NP', 'pchunk': 'I-NP'},
  {'form': "'s", 'pos': 'POS', 'chunk': 'B-NP', 'pchunk': 'B-NP'},
  {'form': 'Tulsa', 'pos': 'NNP', 'chunk': 'I-NP', 'pchunk': 'I-NP'},
  {'form': 'unit', 'pos': 'NN', 'chunk': 'I-NP', 'pchunk': 'I-NP'},
  {'form': 'said', 'pos': 'VBD', 'chunk': 'B-VP', 'pchunk': 'B-VP'},
  {'form': 'it', 'pos': 'PRP', 'chunk': 'B-NP', 'pchunk': 'B-NP'},
  {'form': 'signed', 'pos': 'VBD', 'chunk': 'B-VP', 'pchunk': 'B-VP'},
  {'form': 'a', 'pos': 'DT', 'chunk': 'B-NP', 'pchunk': 'B-NP'},
  {'form': 'tentative', 'pos': 'JJ', 'chunk': 'I-NP', 'pchunk': 'I-NP'},
  {'form': 'agreement', 'pos': 'NN', 'chunk': 'I-NP', 'pchunk': 'I-NP'},
  {'form': 'extending', 'pos': 'VBG', 'chunk': 'B-VP', 'pchunk': 'B-VP'},
  {'form': 'its', 'pos': 'PRP$', 'chunk': 'B-NP', 'pchunk': 'B-NP'},
  {'form': 'c

In [59]:
save_results(test_corpus, keys, 'out')

#### Evaluating the performance

In [60]:
lines = open('out').read().splitlines()
res = conlleval.evaluate(lines)
simple_ml_score = res['overall']['chunks']['evals']['f1']

In [61]:
simple_ml_score

0.9157688948047087

### Questions on the ML program

Please read these questions and answer them in your report:
1. What is the feature vector that corresponds to the chunking program above? Is it the same Kudoh and Matsumoto used in their experiment?
2. What is the performance of the chunker?
3. Remove the lexical features (the words) from the feature vector and measure the performance. You should observe a decrease.
4. What is the classifier used in the program? 
5. As an optional task, you may increase the number of iterations, try two other classifiers from sklearn and measure their performance: decision trees, perceptron, support vector machines, etc. Be aware that support vector machines take a long time to train: up to one hour.

## Using Machine Learning: Adding all the features from Kudoh and Matsumoto

Complement the feature vector used in the previous section with the two dynamic features, $c_{i-2}, c_{i-1}$, and train a new model. You will need to write a new `extract_features_sent_dyn` and `predict` functions. 
In his experiments, your teacher obtained a F1 score of 92.65 with logistic regression and the default parameters from sklearn, i.e. `linear_model.LogisticRegression()`;

**A frequent mistake in the labs** is to use the gold-standard chunks from the test set. Be aware that  when you predict the test set, you do not know the dynamic features in advance and you must  not use the ones from the test file. You will use the two previous chunk tags that you have predicted.

You need to reach a global F1 score of 92 to pass this laboratory.

Write an `extract_features_sent_dyn(sentence, w_size, feature_names)` function to extract the features from one sentence.

In [62]:
def extract_features_sent_dyn(sentence, w_size, feature_names):
    """
    Extract the features from one sentence
    returns X and y, where X is a list of dictionaries and
    y is a list of symbols
    :param sentence: string containing the CoNLL structure of a sentence
    :param w_size:
    :return:
    """

    # We pad the sentence to extract the context window more easily
    start = [{'form': '__bos__', 'pos': '__bos__', 'chunk': '__bos__'}]
    end = [{'form': '__eos__', 'pos': '__eos__', 'chunk': '__eos__'}]
    start *= w_size
    end *= w_size
    padded_sentence = start + sentence
    padded_sentence += end
    
    
    # We extract the features and the classes
    # X contains is a list of features, where each feature vector is a dictionary
    # y is the list of classes
    X = list()
    y = list()
    for i in range(len(padded_sentence) - 2 * w_size):
        # x is a row of X
        x = list()
        # The words in lower case
        for j in range(2 * w_size + 1):
            x.append(padded_sentence[i + j]['form'].lower())
        # The POS
        for j in range(2 * w_size + 1):
            x.append(padded_sentence[i + j]['pos'])
        # The chunks (Up to the word)
     
        for j in range(w_size):
            x.append(padded_sentence[i + j]['chunk'])
   
        # We represent the feature vector as a dictionary
        X.append(dict(zip(feature_names, x)))
        # The classes are stored in a list
        y.append(padded_sentence[i + w_size]['chunk'])

    return X, y

We apply `extract_features_sent_dyn` to all the sentences

In [63]:
def extract_features_dyn(sentences, w_size, feature_names):
    """
    Builds X matrix and y vector
    X is a list of dictionaries and y is a list
    :param sentences:
    :param w_size:
    :return:
    """
    X_l = []
    y_l = []
    for sentence in sentences:
        X, y = extract_features_sent_dyn(sentence, w_size, feature_names)
        X_l.extend(X)
        y_l.extend(y)
    return X_l, y_l

In [64]:
feature_names_dyn = ['word_n2', 'word_n1', 'word', 'word_p1', 'word_p2',
                     'pos_n2', 'pos_n1', 'pos', 'pos_p1', 'pos_p2', 'chunk_n2',
                     'chunk_n1']

In [65]:
train_sentences = read_sentences(train_file)
train_corpus = split_rows(train_sentences, column_names)
train_corpus[:2]

[[{'form': 'Confidence', 'pos': 'NN', 'chunk': 'B-NP'},
  {'form': 'in', 'pos': 'IN', 'chunk': 'B-PP'},
  {'form': 'the', 'pos': 'DT', 'chunk': 'B-NP'},
  {'form': 'pound', 'pos': 'NN', 'chunk': 'I-NP'},
  {'form': 'is', 'pos': 'VBZ', 'chunk': 'B-VP'},
  {'form': 'widely', 'pos': 'RB', 'chunk': 'I-VP'},
  {'form': 'expected', 'pos': 'VBN', 'chunk': 'I-VP'},
  {'form': 'to', 'pos': 'TO', 'chunk': 'I-VP'},
  {'form': 'take', 'pos': 'VB', 'chunk': 'I-VP'},
  {'form': 'another', 'pos': 'DT', 'chunk': 'B-NP'},
  {'form': 'sharp', 'pos': 'JJ', 'chunk': 'I-NP'},
  {'form': 'dive', 'pos': 'NN', 'chunk': 'I-NP'},
  {'form': 'if', 'pos': 'IN', 'chunk': 'B-SBAR'},
  {'form': 'trade', 'pos': 'NN', 'chunk': 'B-NP'},
  {'form': 'figures', 'pos': 'NNS', 'chunk': 'I-NP'},
  {'form': 'for', 'pos': 'IN', 'chunk': 'B-PP'},
  {'form': 'September', 'pos': 'NNP', 'chunk': 'B-NP'},
  {'form': ',', 'pos': ',', 'chunk': 'O'},
  {'form': 'due', 'pos': 'JJ', 'chunk': 'B-ADJP'},
  {'form': 'for', 'pos': 'IN', 'ch

In [66]:
X_dict, y = extract_features_dyn(train_corpus, w_size, feature_names_dyn)

In [67]:
X_dict[:3]

[{'word_n2': '__bos__',
  'word_n1': '__bos__',
  'word': 'confidence',
  'word_p1': 'in',
  'word_p2': 'the',
  'pos_n2': '__bos__',
  'pos_n1': '__bos__',
  'pos': 'NN',
  'pos_p1': 'IN',
  'pos_p2': 'DT',
  'chunk_n2': '__bos__',
  'chunk_n1': '__bos__'},
 {'word_n2': '__bos__',
  'word_n1': 'confidence',
  'word': 'in',
  'word_p1': 'the',
  'word_p2': 'pound',
  'pos_n2': '__bos__',
  'pos_n1': 'NN',
  'pos': 'IN',
  'pos_p1': 'DT',
  'pos_p2': 'NN',
  'chunk_n2': '__bos__',
  'chunk_n1': 'B-NP'},
 {'word_n2': 'confidence',
  'word_n1': 'in',
  'word': 'the',
  'word_p1': 'pound',
  'word_p2': 'is',
  'pos_n2': 'NN',
  'pos_n1': 'IN',
  'pos': 'DT',
  'pos_p1': 'NN',
  'pos_p2': 'VBZ',
  'chunk_n2': 'B-NP',
  'chunk_n1': 'B-PP'}]

You will now vectorize the training set

In [68]:
vecc=DictVectorizer(sparse=True)
Xmat=vecc.fit_transform(X_dict)

And create and fit the model

In [69]:
clasifierr=linear_model.LogisticRegression()
model=clasifierr.fit(Xmat,y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [70]:
model

LogisticRegression()

### Prediction

You will finally predict the test set. We load the corpus again.

In [71]:
test_sentences = read_sentences(test_file)
test_corpus = split_rows(test_sentences, column_names)
test_corpus[:2]

[[{'form': 'Rockwell', 'pos': 'NNP', 'chunk': 'B-NP'},
  {'form': 'International', 'pos': 'NNP', 'chunk': 'I-NP'},
  {'form': 'Corp.', 'pos': 'NNP', 'chunk': 'I-NP'},
  {'form': "'s", 'pos': 'POS', 'chunk': 'B-NP'},
  {'form': 'Tulsa', 'pos': 'NNP', 'chunk': 'I-NP'},
  {'form': 'unit', 'pos': 'NN', 'chunk': 'I-NP'},
  {'form': 'said', 'pos': 'VBD', 'chunk': 'B-VP'},
  {'form': 'it', 'pos': 'PRP', 'chunk': 'B-NP'},
  {'form': 'signed', 'pos': 'VBD', 'chunk': 'B-VP'},
  {'form': 'a', 'pos': 'DT', 'chunk': 'B-NP'},
  {'form': 'tentative', 'pos': 'JJ', 'chunk': 'I-NP'},
  {'form': 'agreement', 'pos': 'NN', 'chunk': 'I-NP'},
  {'form': 'extending', 'pos': 'VBG', 'chunk': 'B-VP'},
  {'form': 'its', 'pos': 'PRP$', 'chunk': 'B-NP'},
  {'form': 'contract', 'pos': 'NN', 'chunk': 'I-NP'},
  {'form': 'with', 'pos': 'IN', 'chunk': 'B-PP'},
  {'form': 'Boeing', 'pos': 'NNP', 'chunk': 'B-NP'},
  {'form': 'Co.', 'pos': 'NNP', 'chunk': 'I-NP'},
  {'form': 'to', 'pos': 'TO', 'chunk': 'B-VP'},
  {'form':

Let us extract the static features from one sentence

In [72]:
X_test_dict, y_test = extract_features_static([test_corpus[0]], w_size, feature_names)
X_test_dict[:2]

[{'word_n2': '__bos__',
  'word_n1': '__bos__',
  'word': 'rockwell',
  'word_p1': 'international',
  'word_p2': 'corp.',
  'pos_n2': '__bos__',
  'pos_n1': '__bos__',
  'pos': 'NNP',
  'pos_p1': 'NNP',
  'pos_p2': 'NNP'},
 {'word_n2': '__bos__',
  'word_n1': 'rockwell',
  'word': 'international',
  'word_p1': 'corp.',
  'word_p2': "'s",
  'pos_n2': '__bos__',
  'pos_n1': 'NNP',
  'pos': 'NNP',
  'pos_p1': 'NNP',
  'pos_p2': 'POS'}]

This $\mathbf{X}\_{\textrm{dict}}$ is incomplete. For the prediction, we need to reinject dynamically the two previously predicted tags to have the full feature vector. Write this code here. 

This part is probably the most difficult of the lab. You may want to write it first for one sentence, and then for the test corpus. The prediction will take a longer time and you may want to include a progress bar with this snippet: 
```
from tqdm import tqdm
for test_sentence in tqdm(test_corpus):
```

In [73]:
y_test_predicted_dyn = []

In [74]:
def predic(X_test_dict):
    # Vectorize the test sentence and one hot encoding
    X_test = vecc.transform(X_test_dict)
    # Predicts the chunks and returns numbers
    y_test_predicted = clasifierr.predict(X_test)
    return y_test_predicted

In [76]:
y_test_predicted_dyn = []

X_test_dict, y_test = extract_features_static(test_corpus, w_size, feature_names)  
ypre=[]                     
c1='__bos__'
c2='__bos__'
X_test_dict[0]['chunk_n2']=c2
X_test_dict[0]['chunk_n1']=c1
pred=predic(X_test_dict[0])
y_test_predicted_dyn.append(pred[0])
#ypre.append(pred[0])
for i in range (1,len(X_test_dict)):
    c2=c1
    
    c1=pred
    
    X_test_dict[i]['chunk_n2']=c2
    X_test_dict[i]['chunk_n1']=c1
    pred=predic(X_test_dict[i])
    y_test_predicted_dyn.append(pred[0])
 #   ypre.append(pred[0])

In [77]:
y_test_predicted_dyn[:3]

['B-NP', 'I-NP', 'I-NP']

In [78]:
inx = 0
for sentence in test_corpus:
    for word in sentence:
        word['pchunk'] = y_test_predicted_dyn[inx]
        inx += 1

In [79]:
save_results(test_corpus, keys, 'out')

#### Evaluation

In [80]:
lines = open('out').read().splitlines()
res = conlleval.evaluate(lines)
improved_ml_score = res['overall']['chunks']['evals']['f1']
improved_ml_score

0.9263263766658284

### Optional improvement

As an optional task, you can try to improve the score with beam search. If you know this technique, apply it using the probability output of logistic regression.

With the same classifier and a beam diameter of 5, your teacher obtained 92.87.

## Collecting the entities

You will now collect all the named entities from the training set meeting the two conditions:
1. Defined as NP chunks and 
2. Starting with a `NNP` (proper noun) or a `NNPS` (proper noun, plural) tag. 

As an example, in the first sentence of `train_corpus`, you will extract `('September', )` and `('July', 'and', 'August')`. You will set all the tuples in a set that you will call `ne_set`.

In [81]:
train_corpus[0]

[{'form': 'Confidence', 'pos': 'NN', 'chunk': 'B-NP'},
 {'form': 'in', 'pos': 'IN', 'chunk': 'B-PP'},
 {'form': 'the', 'pos': 'DT', 'chunk': 'B-NP'},
 {'form': 'pound', 'pos': 'NN', 'chunk': 'I-NP'},
 {'form': 'is', 'pos': 'VBZ', 'chunk': 'B-VP'},
 {'form': 'widely', 'pos': 'RB', 'chunk': 'I-VP'},
 {'form': 'expected', 'pos': 'VBN', 'chunk': 'I-VP'},
 {'form': 'to', 'pos': 'TO', 'chunk': 'I-VP'},
 {'form': 'take', 'pos': 'VB', 'chunk': 'I-VP'},
 {'form': 'another', 'pos': 'DT', 'chunk': 'B-NP'},
 {'form': 'sharp', 'pos': 'JJ', 'chunk': 'I-NP'},
 {'form': 'dive', 'pos': 'NN', 'chunk': 'I-NP'},
 {'form': 'if', 'pos': 'IN', 'chunk': 'B-SBAR'},
 {'form': 'trade', 'pos': 'NN', 'chunk': 'B-NP'},
 {'form': 'figures', 'pos': 'NNS', 'chunk': 'I-NP'},
 {'form': 'for', 'pos': 'IN', 'chunk': 'B-PP'},
 {'form': 'September', 'pos': 'NNP', 'chunk': 'B-NP'},
 {'form': ',', 'pos': ',', 'chunk': 'O'},
 {'form': 'due', 'pos': 'JJ', 'chunk': 'B-ADJP'},
 {'form': 'for', 'pos': 'IN', 'chunk': 'B-PP'},
 {'fo

You can write a two-pass procedure. For each sentence of the corpus:
1. In the first pass, you will collect the start indices of the noun groups (starting with a `B-NP`) which are also proper nouns (`NNP`or `NNPS`). For the first sentence, it will result in the list `[16, 30]`;
2. In the second pass, you will collect the segments (`B-NP` followed by a sequence of `I-NP`), starting at each index. For the first sentence, it will result in the tuples `('September',)`and `('July', 'and', 'August')`

Should you have a better solution, please use it.

In [183]:
ne_set=[]
for corpuss in train_corpus:
    ind=[]
    for i in range(len(corpuss)):       
            if (corpuss[i]['pos']=='NNP' or corpuss[i]['pos']=='NNPS') and corpuss[i]['chunk']=='B-NP' :
                ind.append(i)  
    for j in ind:
        temp2=[]
        temp=corpuss[j]['form']
        temp2.append(temp)
        k=j+1
        while k < (len(corpuss)) and corpuss[k]['chunk']=='I-NP':
            temp2.append(corpuss[k]['form'])
            k+=1 
        ne_set.append(tuple(temp2))
ne_set = set(ne_set)

In [185]:
len(ne_set)

4348

In [186]:
list(sorted(ne_set))[:25]

[('1-2-3', 'Release', '3'),
 ('67-year-old', 'Mrs.', 'Thi'),
 ('A.', 'Foster', 'Higgins', '&', 'Co.'),
 ('A.', 'Schulman'),
 ('A.F.', 'Delchamps', 'Jr.'),
 ('A.G.', 'Edwards', '&', 'Sons', 'Inc.'),
 ('A.P.', 'Green'),
 ('A.P.', 'Green', 'Industries'),
 ('A.P.', 'Green', 'Industries', 'Inc.'),
 ('ABC',),
 ('AC&R', 'ADVERTISING'),
 ('AC&R', 'Advertising'),
 ('AC&R\\/CCL', 'Advertising'),
 ('ADN',),
 ('AEW',),
 ('AGS', 'Computers'),
 ('AMR',),
 ('AMR', 'Corp.'),
 ('AMR', 'shareholders'),
 ('AMRO', 'Bank'),
 ('ANB', 'Investment', 'Management', 'Co.'),
 ('AON',),
 ('ARA', 'Services', 'Inc.'),
 ('ARCO',),
 ('ARTY', 'FAX')]

### Creating a small set

To run the subsequent experiments faster, you will limit the dataset to the entities starting with letter _K_. I chose this letter, because it corresponded to one of the smallest sets. You will call the resulting set: `ne_small_set`. Feel free to use the full set after you have completed this assignment.

In [187]:
ne_small_set=[]
for ne in ne_set:
    if ne[0][0].lower()=='k':
        ne_small_set.append(ne)

In [188]:
sorted(ne_small_set)

[('K', 'mart', 'Corp.', 'Chairman', 'Joseph', 'E.', 'Antonini'),
 ('K-H', 'Corp.'),
 ('KIM',),
 ('KLM', 'Royal', 'Dutch', 'Airlines'),
 ('KPMG', 'Peat', 'Marwick'),
 ('KTXL',),
 ('Kabul',),
 ('Kacy', 'McClelland'),
 ('Kaitaia',),
 ('Kajima',),
 ('Kansas',),
 ('Kansas', 'Power'),
 ('Kansas', 'and', 'Texas'),
 ('Karen', 'Olshan'),
 ('Kary', 'Moss'),
 ('Kasler', 'Corp.'),
 ('Kate', 'Michelman'),
 ('Kathe', 'Dillmann'),
 ('Kathie', 'Huff'),
 ('Kathie', 'Roberts'),
 ('Kathryn', 'McGrath'),
 ('Kathy', 'Stanwick'),
 ('Katonah',),
 ('Kawasaki', 'Steel'),
 ('Ke', 'Zaishuo'),
 ('Kean', 'forces'),
 ('Keefe', ',', 'Bruyette', '&', 'Woods', 'Inc.'),
 ('Keihin', 'Electric', 'Express', 'Railway', 'Co'),
 ('Keith', 'L.', 'Fogg'),
 ('Keith', 'Mulrooney'),
 ('Keizaikai',),
 ('Keizaikai', 'Corp.'),
 ('Kemper',),
 ('Kenneth', 'A.', 'Eldred'),
 ('Kenneth', 'Abraham'),
 ('Kenneth', 'H.', 'Olsen'),
 ('Kenneth', 'T.', 'Rosen'),
 ('Kent', 'Neal'),
 ('Kenton',),
 ('Kentucky',),
 ('Kentucky', 'Fried', 'Chicken',

## Resolving the entities

You will now implement a simple method to link the named entities from the previous exercise to unique pages  or identifiers in Wikipedia and Wikidata.

First, look at a few entities in your small set and find:
1. A few entities that you think are in wikipedia, 
2. Entities that will not be in wikipedia, and 
3. Entities that you think are ambiguous: An entity that may correspond to two or more things. 

You will describe your findings in the report.

### A function to lookup entities

Read the function below and try to understand what it means. You will describe it in your report.

In [189]:
def wikipedia_lookup(ner, base_url='https://en.wikipedia.org/wiki/'):
    try:
        url_en = base_url + ' '.join(ner)
        html_doc = requests.get(url_en).text
        parse_tree = bs4.BeautifulSoup(html_doc, 'html.parser')
        entity_id = parse_tree.find("a", {"accesskey": "g"})['href']
        head_id, entity_id = os.path.split(entity_id)
        return entity_id
    except:
        pass
        # print('Not found in: ', base_url)
    entity_id = 'UNK'
    return entity_id

Write a function to run the lookup and keep the resolved entities (only the resolved entities). You will call it `ne_ids_en`

In [195]:
from tqdm import tqdm
ne_ids_en=[]
for w in tqdm(sorted(ne_small_set)):
    if wikipedia_lookup(w)!='UNK':
        ne_ids_en.append((w,wikipedia_lookup(w)))

100%|██████████| 76/76 [00:33<00:00,  2.29it/s]


In [196]:
ne_ids_en

[(('KIM',), 'Q224736'),
 (('KLM', 'Royal', 'Dutch', 'Airlines'), 'Q181912'),
 (('KPMG', 'Peat', 'Marwick'), 'Q493751'),
 (('KTXL',), 'Q6339344'),
 (('Kabul',), 'Q5838'),
 (('Kaitaia',), 'Q257417'),
 (('Kajima',), 'Q1081154'),
 (('Kansas',), 'Q1558'),
 (('Kate', 'Michelman'), 'Q785671'),
 (('Katonah',), 'Q2888777'),
 (('Kawasaki', 'Steel'), 'Q6379829'),
 (('Kemper',), 'Q126993'),
 (('Kenneth', 'Abraham'), 'Q59268486'),
 (('Kenneth', 'H.', 'Olsen'), 'Q454315'),
 (('Kenton',), 'Q358393'),
 (('Kentucky',), 'Q1603'),
 (('Khost',), 'Q386682'),
 (('Kidder', 'Peabody'), 'Q3196386'),
 (('Kirin',), 'Q297659'),
 (('Kirin', 'Brewery'), 'Q13403399'),
 (('Kitchen',), 'Q43164'),
 (('Knoxville',), 'Q185582'),
 (('Kobe', 'Steel'), 'Q1730802'),
 (('Kodak',), 'Q486269'),
 (('Kraft', 'General', 'Foods'), 'Q327751'),
 (('Krenz',), 'Q21512656'),
 (('Kuala', 'Lumpur'), 'Q1865'),
 (('Kurds',), 'Q12223'),
 (('Kurt', 'Hager'), 'Q95367'),
 (('Kwek', 'Hong', 'Png'), 'Q10559663'),
 (('Ky',), 'Q225951'),
 (('Ky.',)

Sometimes, entities need a confirmation. You will apply the same resolution with the Swedish wikipedia this time, `https://sv.wikipedia.org/wiki/`.

In [197]:
from tqdm import tqdm
ne_ids_sv=[]
for w in tqdm(sorted(ne_small_set)):
    idd=wikipedia_lookup(w,'https://sv.wikipedia.org/wiki/')
    if id !='UNK':
        ne_ids_sv.append((w,idd))

100%|██████████| 76/76 [00:14<00:00,  5.37it/s]


In [198]:
ne_ids_sv

[(('K', 'mart', 'Corp.', 'Chairman', 'Joseph', 'E.', 'Antonini'), 'UNK'),
 (('K-H', 'Corp.'), 'UNK'),
 (('KIM',), 'Q224736'),
 (('KLM', 'Royal', 'Dutch', 'Airlines'), 'Q181912'),
 (('KPMG', 'Peat', 'Marwick'), 'UNK'),
 (('KTXL',), 'UNK'),
 (('Kabul',), 'Q5838'),
 (('Kacy', 'McClelland'), 'UNK'),
 (('Kaitaia',), 'UNK'),
 (('Kajima',), 'UNK'),
 (('Kansas',), 'Q1558'),
 (('Kansas', 'Power'), 'UNK'),
 (('Kansas', 'and', 'Texas'), 'UNK'),
 (('Karen', 'Olshan'), 'UNK'),
 (('Kary', 'Moss'), 'UNK'),
 (('Kasler', 'Corp.'), 'UNK'),
 (('Kate', 'Michelman'), 'UNK'),
 (('Kathe', 'Dillmann'), 'UNK'),
 (('Kathie', 'Huff'), 'UNK'),
 (('Kathie', 'Roberts'), 'UNK'),
 (('Kathryn', 'McGrath'), 'UNK'),
 (('Kathy', 'Stanwick'), 'UNK'),
 (('Katonah',), 'UNK'),
 (('Kawasaki', 'Steel'), 'UNK'),
 (('Ke', 'Zaishuo'), 'UNK'),
 (('Kean', 'forces'), 'UNK'),
 (('Keefe', ',', 'Bruyette', '&', 'Woods', 'Inc.'), 'UNK'),
 (('Keihin', 'Electric', 'Express', 'Railway', 'Co'), 'UNK'),
 (('Keith', 'L.', 'Fogg'), 'UNK'),
 ((

You will compute the intersection of the two sets. You will assign it to a list that you will sort and that you will call: `confirmed_ne_en_sv`.

In [207]:
confirmed_ne_en_sv = sorted(set(ne_ids_sv).intersection(set(ne_ids_en)))

In [208]:
confirmed_ne_en_sv

[(('KIM',), 'Q224736'),
 (('KLM', 'Royal', 'Dutch', 'Airlines'), 'Q181912'),
 (('Kabul',), 'Q5838'),
 (('Kansas',), 'Q1558'),
 (('Kenton',), 'Q358393'),
 (('Kentucky',), 'Q1603'),
 (('Khost',), 'Q386682'),
 (('Kirin',), 'Q297659'),
 (('Kodak',), 'Q486269'),
 (('Kuala', 'Lumpur'), 'Q1865')]

The first items in your list should look like:
```
[(('KIM',), 'Q224736'),
 (('KLM', 'Royal', 'Dutch', 'Airlines'), 'Q181912'),
 ...
]
```

## Submission

When you have written all the code and run all the cells, fill in your ID and as well as the name of the notebook.

In [211]:
STIL_ID = ["fa0188ni-s", "mo0708ho-s"] # Write your stil ids as a list
CURRENT_NOTEBOOK_PATH = os.path.join(os.getcwd(), 
                                     "4-chunker_solution.ipynb") # Write the name of your notebook

The submission code will send your answer. It consists of the baseline score, the improved machine-learning score, and the confirmed entities.

In [212]:
import json
ANSWER = json.dumps({'baseline_score': baseline_score,
                    'improved_ml_score': improved_ml_score,
                    'confirmed_ne_en_sv': confirmed_ne_en_sv})
ANSWER

'{"baseline_score": 0.770671072299583, "improved_ml_score": 0.9263263766658284, "confirmed_ne_en_sv": [[["KIM"], "Q224736"], [["KLM", "Royal", "Dutch", "Airlines"], "Q181912"], [["Kabul"], "Q5838"], [["Kansas"], "Q1558"], [["Kenton"], "Q358393"], [["Kentucky"], "Q1603"], [["Khost"], "Q386682"], [["Kirin"], "Q297659"], [["Kodak"], "Q486269"], [["Kuala", "Lumpur"], "Q1865"]]}'

Now the moment of truth:
1. Save your notebook and
2. Run the cells below

In [213]:
SUBMISSION_NOTEBOOK_PATH = CURRENT_NOTEBOOK_PATH + ".submission.bz2"

In [214]:
import bz2
ASSIGNMENT = 4
API_KEY = "f581ba347babfea0b8f2c74a3a6776a7"

# Copy and compress current notebook
with bz2.open(SUBMISSION_NOTEBOOK_PATH, mode="wb") as fout:
    with open(CURRENT_NOTEBOOK_PATH, "rb") as fin:
        fout.write(fin.read())

In [215]:
res = requests.post("https://vilde.cs.lth.se/edan20checker/submit", 
                    files={"notebook_file": open(SUBMISSION_NOTEBOOK_PATH, "rb")}, 
                    data={
                        "stil_id": STIL_ID,
                        "assignment": ASSIGNMENT,
                        "answer": ANSWER,
                        "api_key": API_KEY,
                    },
               verify=True)

# from IPython.display import display, JSON
res.json()

{'msg': None,
 'status': 'correct',
 'signature': 'deebb8ca24054b798a2eac0efa08b5ac813157215a028adaad6dae4ee227d153cf59f5a149aea331ddd349f82358ba414b722288129b790ab44c08e40c9e927a',
 'submission_id': '372f1b6b-20e7-4df7-95f6-e4c943ed20dc'}

## Turning in your assignment

Now your are done with the program. To complete this assignment, you will:
1. Write a short individual report on your program. Do not forget to answer all the question in the notebook.
2. Read the article, <a href="https://www.aclweb.org/anthology/C18-1139"><i>Contextual String Embeddings for Sequence Labeling</i></a> by Akbik et al. (2018) and outline the main differences between their system and yours. A LSTM is a type of recurrent neural network, while CRF is a sort of beam search. You will tell the performance they reach on the corpus you used in this laboratory.

Submit your report as well as your notebook (for archiving purposes) to Canvas: https://canvas.education.lu.se/. To write your report, you can either
1. Write directly your text in Canvas, or
2. Use Latex and Overleaf (www.overleaf.com). This will probably help you structure your text. You will then upload a PDF file in Canvas.

The submission deadline is October 8, 2021.