# Assignment #4: Extracting syntactic groups using machine-learning techniques

#### Author: Hicham Mohamad (hi8826mo-s)

## Table of Contents
1. [Loading the corpus](#t1)
2. [Baseline chunker](#t2)
3. [Using Machine Learning: A first ML program](#t3)
4. [Using Machine Learning: Adding all the features from Kudoh and Matsumoto](#t4)
5. [Collecting the entities](#t5)
6. [Resolving the entities](#t6)
7. [Submission](#t7)
8. [Reading](#t8)

In this assignment, you will create a system to extract **syntactic groups** from a text. You will apply it to the **CoNLL 2000** dataset. In addition, you will try to link a few extracted **named entities** to real things using Wikipedia.

## Objectives

The objectives of this assignment are to:
* Write a program to detect **partial syntactic** structures
* Extract **named entities** and link them to real things using **Wikipedia**
* Understand the principles of **supervised machine learning** techniques applied to language processing
* Use a popular machine learning toolkit: **scikit-learn**
* Write a short report of 2 to 3 pages on the assignment

## Choosing the training and test sets

* As annotated data and annotation scheme, you will use the data available from [CoNLL 2000](https://www.clips.uantwerpen.be/conll2000/chunking/).
* Download both the training and test sets and decompress them.
* Local copies are also available here: [train.txt](https://fileadmin.cs.lth.se/cs/Education/EDAN20/corpus/conll2000/train.txt) and [test.txt](https://fileadmin.cs.lth.se/cs/Education/EDAN20/corpus/conll2000/test.txt)
* Read the description of the CoNLL 2000 task

## Loading the corpus <a name='t1'/>

### The datasets

You may need to adjust the paths to load the datasets.

In [1]:
#train_file = '../../corpus/conll2000/train.txt'
train_file = 'conll2000/train.txt'
#test_file = '../../corpus/conll2000/test.txt'
test_file = 'conll2000/test.txt'

#### Reading the files

Read the functions below to load the datasets. They store the corpus in a **list of sentences**. Each sentence is a list of rows, where each row is a dictionary.

In [2]:
def read_sentences(file):
    """
    Creates a list of sentences from the corpus
    Each sentence is a string
    :param file:
    :return:
    """
    # The context manager closes the file once it has been read
    with open(file) as f:
        text = f.read().strip()
    sentences = text.split('\n\n')
    return sentences

In [3]:
def split_rows(sentences, column_names):
    """
    Creates a list of sentences where each sentence is a list of lines
    Each line is a dictionary of columns
    :param sentences:
    :param column_names:
    :return:
    """
    new_sentences = []
    for sentence in sentences:
        rows = sentence.split('\n')
        sentence = [dict(zip(column_names, row.split())) for row in rows]
        new_sentences.append(sentence)
    return new_sentences

### Loading dictionaries

**NOTE:** The train and test data consist of **three columns** separated by spaces. Each word has been put on a separate line and there is an empty line after each sentence. The first column contains the **current word**, the second its **part-of-speech** (pos) tag as derived by the Brill tagger and the third its **chunk** tag as derived from the WSJ corpus. 

The **chunk tags** contain the name of the chunk type, for example I-NP for noun phrase words and I-VP for verb phrase words. Most chunk types have **two types of chunk** tags, B-CHUNK for the first word of the chunk and I-CHUNK for each other word in the chunk.
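To make the B-/I- convention concrete, the sketch below groups the words of a sentence into bracketed chunks. It is only an illustration: `bracket_chunks` is a hypothetical helper that is not part of the assignment, and it assumes rows stored as dictionaries with `form` and `chunk` keys, as in the rest of this notebook.

```python
def bracket_chunks(rows):
    """Render a sentence as bracketed chunks, e.g. [NP the pound]."""
    groups = []  # list of (chunk type or None, list of words)
    for row in rows:
        prefix, _, ctype = row['chunk'].partition('-')
        if prefix == 'I' and groups and groups[-1][0] == ctype:
            # An I- tag continues the open chunk of the same type
            groups[-1][1].append(row['form'])
        elif prefix in ('B', 'I'):
            # A B- tag (or an I- tag with no matching open chunk) starts a chunk
            groups.append((ctype, [row['form']]))
        else:
            # The O tag marks a word outside any chunk
            groups.append((None, [row['form']]))
    return ' '.join(
        '[{} {}]'.format(ctype, ' '.join(words)) if ctype else ' '.join(words)
        for ctype, words in groups)


example_rows = [
    {'form': 'Confidence', 'chunk': 'B-NP'},
    {'form': 'in', 'chunk': 'B-PP'},
    {'form': 'the', 'chunk': 'B-NP'},
    {'form': 'pound', 'chunk': 'I-NP'},
    {'form': 'is', 'chunk': 'B-VP'},
    {'form': 'widely', 'chunk': 'I-VP'},
    {'form': 'expected', 'chunk': 'I-VP'},
]
bracketed = bracket_chunks(example_rows)
print(bracketed)
```

The B- tag on `the` is what separates the noun phrase *the pound* from the preceding groups; without it, adjacent chunks of the same type could not be told apart.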

The CoNLL 2000 files have three columns

In [4]:
column_names = ['form', 'pos', 'chunk']

We load the corpus, **the training dataset**

In [5]:
# create a list of sentences
train_sentences = read_sentences(train_file)

# create a list of sentences
# each sentence is a list of lines
# each line is a dictionary of columns
train_corpus = split_rows(train_sentences, column_names)

train_corpus[:2]

[[{'form': 'Confidence', 'pos': 'NN', 'chunk': 'B-NP'},
  {'form': 'in', 'pos': 'IN', 'chunk': 'B-PP'},
  {'form': 'the', 'pos': 'DT', 'chunk': 'B-NP'},
  {'form': 'pound', 'pos': 'NN', 'chunk': 'I-NP'},
  {'form': 'is', 'pos': 'VBZ', 'chunk': 'B-VP'},
  {'form': 'widely', 'pos': 'RB', 'chunk': 'I-VP'},
  {'form': 'expected', 'pos': 'VBN', 'chunk': 'I-VP'},
  {'form': 'to', 'pos': 'TO', 'chunk': 'I-VP'},
  {'form': 'take', 'pos': 'VB', 'chunk': 'I-VP'},
  {'form': 'another', 'pos': 'DT', 'chunk': 'B-NP'},
  {'form': 'sharp', 'pos': 'JJ', 'chunk': 'I-NP'},
  {'form': 'dive', 'pos': 'NN', 'chunk': 'I-NP'},
  {'form': 'if', 'pos': 'IN', 'chunk': 'B-SBAR'},
  {'form': 'trade', 'pos': 'NN', 'chunk': 'B-NP'},
  {'form': 'figures', 'pos': 'NNS', 'chunk': 'I-NP'},
  {'form': 'for', 'pos': 'IN', 'chunk': 'B-PP'},
  {'form': 'September', 'pos': 'NNP', 'chunk': 'B-NP'},
  {'form': ',', 'pos': ',', 'chunk': 'O'},
  {'form': 'due', 'pos': 'JJ', 'chunk': 'B-ADJP'},
  {'form': 'for', 'pos': 'IN', 'ch

## Baseline chunker <a name='t2'/>

Most **statistical algorithms** for language processing start with a so-called baseline. The **baseline performance** corresponds to the application of a minimal technique that is used to assess the difficulty of a task and for comparison with further programs.

You will implement the **baseline** proposed by the organizers of the <a href="https://www.clips.uantwerpen.be/conll2000/chunking/">CoNLL 2000 shared task</a>, Sect. <i>Results</i>.
1. Read it;
2. In the report, say what you think of it.

### Auxiliary functions

A function to count the **parts of speech**

In [6]:
def count_pos(corpus):
    """
    Computes the part-of-speech distribution
    in a CoNLL 2000 file
    :param corpus:
    :return:
    """
    pos_cnt = {}
    for sentence in corpus:
        for row in sentence:
            if row['pos'] in pos_cnt:
                pos_cnt[row['pos']] += 1
            else:
                pos_cnt[row['pos']] = 1
    return pos_cnt

We first collect all the **parts of speech** (pos) and we count them.

In [7]:
pos_cnt = count_pos(train_corpus)
pos_cnt

{'NN': 30147,
 'IN': 22764,
 'DT': 18335,
 'VBZ': 4648,
 'RB': 6607,
 'VBN': 4763,
 'TO': 5081,
 'VB': 6017,
 'JJ': 13085,
 'NNS': 13619,
 'NNP': 19884,
 ',': 10770,
 'CC': 5372,
 'POS': 1769,
 '.': 8827,
 'VBP': 2868,
 'VBG': 3272,
 'PRP$': 1881,
 'CD': 8315,
 '``': 1531,
 "''": 1493,
 'VBD': 6745,
 'EX': 206,
 'MD': 2167,
 '#': 36,
 '(': 274,
 '$': 1750,
 ')': 281,
 'NNPS': 420,
 'PRP': 3820,
 'JJS': 374,
 'WP': 529,
 'RBR': 321,
 'JJR': 853,
 'WDT': 955,
 'WRB': 478,
 'RBS': 191,
 'PDT': 55,
 'RP': 83,
 ':': 1047,
 'FW': 38,
 'WP$': 35,
 'SYM': 6,
 'UH': 15}

### Chunk distribution

- You will compute the chunk distribution for each **part of speech**. You will use the **training file** to derive the distribution and you will store the results in a **dictionary**. Below, you have an excerpt of the expected results:
```
{'JJR':
{'I-ADVP': 17, 'I-ADJP': 45, 'I-NP': 204, 'B-ADVP': 63,
'B-PP': 2, 'B-ADJP': 111, 'B-NP': 382, 'B-VP': 2,
'I-VP': 11, 'O': 16},
'CC':
{'B-ADVP': 3, 'O': 3676, 'I-VP': 104, 'B-CONJP': 6,
'I-ADVP': 30, 'I-UCP': 2, 'I-PP': 24, 'I-ADJP': 26,
'I-NP': 1409, 'B-ADJP': 2, 'B-NP': 18, 'B-PP': 70,
'I-PRT': 1, 'B-VP': 1},
'NN':
{'B-LST': 2, 'I-INTJ': 2, 'B-ADVP': 38, 'O': 37,
'I-ADVP': 11, 'B-INTJ': 1, 'I-UCP': 2, 'B-UCP': 2,
'I-VP': 77, 'B-PRT': 2, 'I-ADJP': 41, 'I-NP': 24456,
'B-ADJP': 44, 'B-NP': 5160, 'B-PP': 15, 'B-VP': 257},
...
```

In [8]:
# Write your code here
# a dictionary to store the results in
chunk_dist = {k: {} for k in pos_cnt.keys( )}

# compute the chunk distribution for each part of speech
for sentence in train_corpus:
    for row in sentence:
        pos = row['pos']
        chunk = row['chunk']
        
        if chunk in chunk_dist[pos]:
            chunk_dist[pos][chunk] += 1
        else:
            chunk_dist[pos][chunk] = 1
        
            
#print(chunk_dist)
    

In [9]:
chunk_dist['NN']

{'B-NP': 5160,
 'I-NP': 24456,
 'B-VP': 257,
 'B-ADJP': 44,
 'B-ADVP': 38,
 'O': 37,
 'B-PP': 15,
 'I-ADVP': 11,
 'I-ADJP': 41,
 'I-VP': 77,
 'B-INTJ': 1,
 'B-LST': 2,
 'B-UCP': 2,
 'I-UCP': 2,
 'B-PRT': 2,
 'I-INTJ': 2}

### Selecting the POS-chunk associations

- For each part of speech, select the **best association**. In the example above, you will have (NN, I-NP) as it is the most frequent. You will store the results in a **dictionary** that you will call `pos_chunk`

In [10]:
# Write your code here
# a dictionary to store the best association in
pos_chunk = {} 

for pos in chunk_dist:
    most_freq = 0
    most_freq_chunk = ''
    for chunk in chunk_dist[pos]:
        if chunk_dist[pos][chunk] >= most_freq:
            most_freq = chunk_dist[pos][chunk]
            most_freq_chunk = chunk
        
    # select the best association for each pos    
    pos_chunk[pos] = most_freq_chunk
        
            
#print(pos_chunk)

In [11]:
pos_chunk['NN']

'I-NP'
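As a possible alternative to the explicit counting loops above, the same two steps can be sketched with `collections.Counter`. The names `toy_corpus`, `chunk_distribution`, and `best_associations` are hypothetical and chosen for this illustration; the toy corpus only stands in for `train_corpus`.

```python
from collections import Counter, defaultdict

# A tiny stand-in for train_corpus, in the same list-of-dictionaries format
toy_corpus = [
    [{'form': 'the', 'pos': 'DT', 'chunk': 'B-NP'},
     {'form': 'pound', 'pos': 'NN', 'chunk': 'I-NP'}],
    [{'form': 'trade', 'pos': 'NN', 'chunk': 'B-NP'},
     {'form': 'figures', 'pos': 'NNS', 'chunk': 'I-NP'},
     {'form': 'dive', 'pos': 'NN', 'chunk': 'I-NP'}],
]


def chunk_distribution(corpus):
    """Count the chunk tags observed for each part of speech."""
    dist = defaultdict(Counter)
    for sentence in corpus:
        for row in sentence:
            dist[row['pos']][row['chunk']] += 1
    return dist


def best_associations(dist):
    """Keep the most frequent chunk tag of each part of speech."""
    return {pos: counts.most_common(1)[0][0] for pos, counts in dist.items()}


toy_dist = chunk_distribution(toy_corpus)
toy_pos_chunk = best_associations(toy_dist)
print(toy_pos_chunk)
```

One difference to keep in mind: when two chunk tags are tied, `most_common` keeps the first one encountered, whereas the loop with `>=` keeps the last one.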

### Prediction

- Using the resulting **associations**, apply your chunker to the **test file**. You will write a `predict(model, corpus)` function, where `model` will be your associations and `corpus`, the test corpus. You will format the test corpus as a **dictionary**, where you will add a `pchunk` key for each row with a value that will correspond to the **predicted chunk**.

In [12]:
# Write your code here
def predict(model, corpus):
    for sentence in corpus:
        for row in sentence:
            # add a pchunk key for each row
            # with a value that will correspond to the predicted chunk
            row['pchunk'] = model[row['pos']]
            
    return corpus
    

We load the **test corpus**

In [13]:
test_sentences = read_sentences(test_file)
test_corpus = split_rows(test_sentences, column_names)
test_corpus[:1]

[[{'form': 'Rockwell', 'pos': 'NNP', 'chunk': 'B-NP'},
  {'form': 'International', 'pos': 'NNP', 'chunk': 'I-NP'},
  {'form': 'Corp.', 'pos': 'NNP', 'chunk': 'I-NP'},
  {'form': "'s", 'pos': 'POS', 'chunk': 'B-NP'},
  {'form': 'Tulsa', 'pos': 'NNP', 'chunk': 'I-NP'},
  {'form': 'unit', 'pos': 'NN', 'chunk': 'I-NP'},
  {'form': 'said', 'pos': 'VBD', 'chunk': 'B-VP'},
  {'form': 'it', 'pos': 'PRP', 'chunk': 'B-NP'},
  {'form': 'signed', 'pos': 'VBD', 'chunk': 'B-VP'},
  {'form': 'a', 'pos': 'DT', 'chunk': 'B-NP'},
  {'form': 'tentative', 'pos': 'JJ', 'chunk': 'I-NP'},
  {'form': 'agreement', 'pos': 'NN', 'chunk': 'I-NP'},
  {'form': 'extending', 'pos': 'VBG', 'chunk': 'B-VP'},
  {'form': 'its', 'pos': 'PRP$', 'chunk': 'B-NP'},
  {'form': 'contract', 'pos': 'NN', 'chunk': 'I-NP'},
  {'form': 'with', 'pos': 'IN', 'chunk': 'B-PP'},
  {'form': 'Boeing', 'pos': 'NNP', 'chunk': 'B-NP'},
  {'form': 'Co.', 'pos': 'NNP', 'chunk': 'I-NP'},
  {'form': 'to', 'pos': 'TO', 'chunk': 'B-VP'},
  {'form':

We predict the groups. You should have added a `pchunk` key

In [14]:
# apply def predict(model, corpus) where model is the best association
predicted_test_corpus = predict(pos_chunk, test_corpus)
predicted_test_corpus[:1]

[[{'form': 'Rockwell', 'pos': 'NNP', 'chunk': 'B-NP', 'pchunk': 'I-NP'},
  {'form': 'International', 'pos': 'NNP', 'chunk': 'I-NP', 'pchunk': 'I-NP'},
  {'form': 'Corp.', 'pos': 'NNP', 'chunk': 'I-NP', 'pchunk': 'I-NP'},
  {'form': "'s", 'pos': 'POS', 'chunk': 'B-NP', 'pchunk': 'B-NP'},
  {'form': 'Tulsa', 'pos': 'NNP', 'chunk': 'I-NP', 'pchunk': 'I-NP'},
  {'form': 'unit', 'pos': 'NN', 'chunk': 'I-NP', 'pchunk': 'I-NP'},
  {'form': 'said', 'pos': 'VBD', 'chunk': 'B-VP', 'pchunk': 'B-VP'},
  {'form': 'it', 'pos': 'PRP', 'chunk': 'B-NP', 'pchunk': 'B-NP'},
  {'form': 'signed', 'pos': 'VBD', 'chunk': 'B-VP', 'pchunk': 'B-VP'},
  {'form': 'a', 'pos': 'DT', 'chunk': 'B-NP', 'pchunk': 'B-NP'},
  {'form': 'tentative', 'pos': 'JJ', 'chunk': 'I-NP', 'pchunk': 'I-NP'},
  {'form': 'agreement', 'pos': 'NN', 'chunk': 'I-NP', 'pchunk': 'I-NP'},
  {'form': 'extending', 'pos': 'VBG', 'chunk': 'B-VP', 'pchunk': 'B-VP'},
  {'form': 'its', 'pos': 'PRP$', 'chunk': 'B-NP', 'pchunk': 'B-NP'},
  {'form': 'c

### Accuracy

We can evaluate the **performance of the baseline** with the tag accuracy: the percentage of words that receive the correct tag.

In [15]:
def eval(predicted):
    """
    Evaluates the predicted chunk accuracy
    :param predicted:
    :return:
    """
    word_cnt = 0
    correct = 0
    for sentence in predicted:
        for row in sentence:
            word_cnt += 1
            if row['chunk'] == row['pchunk']:
                correct += 1
    return correct / word_cnt

In [16]:
accuracy = eval(predicted_test_corpus)
accuracy

0.7729066846782194

### The CoNLL evaluation

The accuracy is very misleading as it is **biased** by the most frequent tags. It is not a good way to evaluate chunking. Instead, CoNLL computes the **F1 score** of all the chunks with a specific **evaluation script**.
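The gap between the two measures can be illustrated on a toy example. The sketch below is a simplified stand-in for the evaluation script, not a reimplementation of it: `chunk_set` and `chunk_f1` are hypothetical names, and a chunk counts as correct only if its type and both its boundaries match the gold standard.

```python
def chunk_set(tags):
    """Return the set of (type, start, end) chunks in an IOB tag sequence."""
    chunks, start, ctype = set(), None, None
    # The final 'O' is a sentinel that closes a chunk ending the sentence
    for i, tag in enumerate(list(tags) + ['O']):
        prefix, _, name = tag.partition('-')
        # A chunk ends at an O tag, a B- tag, or a change of type
        if start is not None and (prefix != 'I' or name != ctype):
            chunks.add((ctype, start, i))
            start = None
        if start is None and prefix in ('B', 'I'):
            start, ctype = i, name
    return chunks


def chunk_f1(gold, pred):
    """Precision, recall, and F1 computed over whole chunks."""
    g, p = chunk_set(gold), chunk_set(pred)
    correct = len(g & p)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = ['B-NP', 'I-NP', 'B-VP', 'B-NP', 'I-NP', 'I-NP']
pred = ['B-NP', 'I-NP', 'B-VP', 'B-NP', 'B-NP', 'I-NP']

tag_accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
precision, recall, f1 = chunk_f1(gold, pred)
print(tag_accuracy, precision, recall, f1)
```

A single wrong tag out of six leaves the tag accuracy at 5/6, but it splits the last noun phrase in two, so only two of the four predicted chunks are correct and the chunk-level F1 drops to 4/7.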

#### Saving the corpus

To use the **CoNLL evaluation script**, you will store your results in an **output file** that has four columns. The first three columns will be the input columns from the test file:
* word, 
* part of speech, and 
* gold-standard chunk. 

You will append the **predicted chunk** as the 4th column. Your output file should look like the excerpt below:
```
Rockwell NNP B-NP I-NP
International NNP I-NP I-NP
Corp. NNP I-NP I-NP
's POS B-NP B-NP
Tulsa NNP I-NP I-NP
unit NN I-NP I-NP
said VBD B-VP B-VP
it PRP B-NP B-NP
```
The separator is the space.

You will use a `save_results(output_dict, keys, output_file)` function, where the keys will be `['form', 'pos', 'chunk', 'pchunk']`

In [17]:
keys = ['form', 'pos', 'chunk', 'pchunk']

In [18]:
def save_results(output_dict, keys, output_file):
    # We write the word (form), part of speech (pos),
    # gold-standard chunk (chunk), and predicted chunk (pchunk),
    # separated by single spaces, with an empty line after each sentence
    with open(output_file, 'w') as f_out:
        for sentence in output_dict:
            for row in sentence:
                f_out.write(' '.join(row[key] for key in keys) + '\n')
            f_out.write('\n')

In [19]:
save_results(predicted_test_corpus, keys, 'out')

The **CoNLL 2000 evaluation script** will use the last two columns, **chunk** and **predicted chunk**, to compute the performance.

### Evaluation

To evaluate your results, you have two options:
1. Use the original conlleval script here:  <a href="https://www.clips.uantwerpen.be/conll2000/chunking/"><tt>conlleval.txt</tt></a>.
2. Use a Python translation of it. 

You will use the second option and you will **describe the results you obtained in your report**.

#### The Python translation

Install the script with:
```
pip install conlleval
```
from https://github.com/kaniblu/conlleval
and run the cell below

In [20]:
import conlleval
lines = open('out').read().splitlines()

# use CoNLL 2000 evaluate
res = conlleval.evaluate(lines)
baseline_score = res['overall']['chunks']['evals']['f1']

In [21]:
baseline_score

0.770671072299583

### The official script

You may want to double-check your results with the **original CoNLL script**. It is more complex to use however:
* <tt>conlleval.txt</tt> is the official CoNLL Perl script. It expects the two last columns of the test set to be the manually assigned chunk (gold standard) and the predicted chunk.
* <tt>conlleval.txt</tt> was written for Unix and if you run Windows, you will have to use a terminal command. In the File menu of the notebook, select New and then Terminal.
* Start it like this: ` $ conlleval.txt <out` where the `out` file contains both the gold and predicted chunk tags. `conlleval.txt` is a Perl script.
* Perl is installed on most Unix distributions. If it is not installed on your machine, you need to install it. Also make sure that you have the execution rights. Otherwise change them with: `$ chmod +x conlleval.txt`
* The `conlleval.txt` script expects the new lines to be `\n` as in Unix. If you run your Python program on Windows, your new lines will be `\r\n`. To have the correct new lines, add this parameter to `open()`: `newline='\n'` like this: `f_out = open('out', 'w', newline='\n')`
* The complete description of the CoNLL 2000 evaluation script is available here: [https://www.clips.uantwerpen.be/conll2000/chunking/output.html](https://www.clips.uantwerpen.be/conll2000/chunking/output.html)

## Using Machine Learning: A first ML program <a name='t3'/>

In this exercise, you will apply and explore a machine-learning program.

The program that won the CoNLL 2000 shared task (Kudoh and Matsumoto, 2000) used a **window of five words** centered on the word whose chunk tag, $c_i$, is to be identified. They built a **feature vector** consisting of:
1. The values of the **five words** in this window: $w_{i-2}, w_{i-1}, w_{i}, w_{i+1}, w_{i+2}$
2. The values of the **five parts of speech** in this window: $t_{i-2}, t_{i-1}, t_{i}, t_{i+1}, t_{i+2}$
3. The values of the **two previous chunk** tags in the first part of the window: $c_{i-2}, c_{i-1}$

The last two parameters (3.) are said to be **dynamic** because the program computes them at run-time. Read [Kudoh and Matsumoto's paper](http://www.clips.uantwerpen.be/conll2000/pdf/14244kud.pdf) and the [Yamcha](http://www.chasen.org/~taku/software/yamcha/) software site.

You will start with the given code, which uses the first two sets of features (1. and 2.), and you will add the last one (3.) yourself to improve the performance of your chunker. Kudoh and Matsumoto trained a classifier based on **support vector machines**. You will use **logistic regression**.

### Imports

In [22]:
import bs4
import os
import requests
from sklearn.feature_extraction import DictVectorizer
from sklearn import svm
from sklearn import linear_model
from sklearn import metrics
from sklearn import tree
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import GridSearchCV
import time

### Feature extraction

#### Functions

A first function to extract features **from one sentence**

In [23]:
def extract_features_sent_static(sentence, w_size, feature_names):
    """
    Extract the features from one sentence
    returns X and y, where X is a list of dictionaries and
    y is a list of symbols
    :param sentence: list of rows (dictionaries), one per word of the sentence
    :param w_size:
    :return:
    """

    # We pad the sentence to extract the context window more easily
    start = [{'form': 'BOS', 'pos': 'BOS', 'chunk': 'BOS'}]
    end = [{'form': 'EOS', 'pos': 'EOS', 'chunk': 'EOS'}]
    start *= w_size
    end *= w_size
    padded_sentence = start + sentence
    padded_sentence += end

    # We extract the features and the classes
    # X is a list of feature vectors, where each feature vector is a dictionary
    # y is the list of classes
    X = list()
    y = list()
    for i in range(len(padded_sentence) - 2 * w_size):
        # x is a row of X
        x = list()
        
        # The words in lower case
        for j in range(2 * w_size + 1):
            x.append(padded_sentence[i + j]['form'].lower())
            
        # The POS
        for j in range(2 * w_size + 1):
            x.append(padded_sentence[i + j]['pos'])
            
        # The chunks (Up to the word)
        """
        for j in range(w_size):
            feature_line.append(padded_sentence[i + j]['chunk'])
        """
        # We represent the feature vector as a dictionary
        X.append(dict(zip(feature_names, x)))
        
        # The classes are stored in a list
        y.append(padded_sentence[i + w_size]['chunk'])
        
    return X, y

And **from all the sentences**

In [24]:
def extract_features_static(sentences, w_size, feature_names):
    """
    Builds X matrix and y vector
    X is a list of dictionaries and y is a list
    :param sentences:
    :param w_size:
    :return:
    """
    X_l = []
    y_l = []
    for sentence in sentences:
        X, y = extract_features_sent_static(sentence, w_size, feature_names)
        X_l.extend(X)
        y_l.extend(y)
    return X_l, y_l

#### Applying the feature extraction

The size of the window and the names of the features

In [25]:
w_size = 2  # The size of the context window to the left and right of the word
feature_names = ['word_n2', 'word_n1', 'word', 'word_p1', 'word_p2',
                 'pos_n2', 'pos_n1', 'pos', 'pos_p1', 'pos_p2']

We read the **training** corpus and format it as a **dictionary**

In [26]:
train_sentences = read_sentences(train_file)
train_corpus = split_rows(train_sentences, column_names)

In [27]:
train_corpus[:2]

[[{'form': 'Confidence', 'pos': 'NN', 'chunk': 'B-NP'},
  {'form': 'in', 'pos': 'IN', 'chunk': 'B-PP'},
  {'form': 'the', 'pos': 'DT', 'chunk': 'B-NP'},
  {'form': 'pound', 'pos': 'NN', 'chunk': 'I-NP'},
  {'form': 'is', 'pos': 'VBZ', 'chunk': 'B-VP'},
  {'form': 'widely', 'pos': 'RB', 'chunk': 'I-VP'},
  {'form': 'expected', 'pos': 'VBN', 'chunk': 'I-VP'},
  {'form': 'to', 'pos': 'TO', 'chunk': 'I-VP'},
  {'form': 'take', 'pos': 'VB', 'chunk': 'I-VP'},
  {'form': 'another', 'pos': 'DT', 'chunk': 'B-NP'},
  {'form': 'sharp', 'pos': 'JJ', 'chunk': 'I-NP'},
  {'form': 'dive', 'pos': 'NN', 'chunk': 'I-NP'},
  {'form': 'if', 'pos': 'IN', 'chunk': 'B-SBAR'},
  {'form': 'trade', 'pos': 'NN', 'chunk': 'B-NP'},
  {'form': 'figures', 'pos': 'NNS', 'chunk': 'I-NP'},
  {'form': 'for', 'pos': 'IN', 'chunk': 'B-PP'},
  {'form': 'September', 'pos': 'NNP', 'chunk': 'B-NP'},
  {'form': ',', 'pos': ',', 'chunk': 'O'},
  {'form': 'due', 'pos': 'JJ', 'chunk': 'B-ADJP'},
  {'form': 'for', 'pos': 'IN', 'ch

In [28]:
X_dict, y = extract_features_static(train_corpus, w_size, feature_names)
X_dict[:2]

[{'word_n2': 'bos',
  'word_n1': 'bos',
  'word': 'confidence',
  'word_p1': 'in',
  'word_p2': 'the',
  'pos_n2': 'BOS',
  'pos_n1': 'BOS',
  'pos': 'NN',
  'pos_p1': 'IN',
  'pos_p2': 'DT'},
 {'word_n2': 'bos',
  'word_n1': 'confidence',
  'word': 'in',
  'word_p1': 'the',
  'word_p2': 'pound',
  'pos_n2': 'BOS',
  'pos_n1': 'NN',
  'pos': 'IN',
  'pos_p1': 'DT',
  'pos_p2': 'NN'}]

In [29]:
y[:2]

['B-NP', 'B-PP']

### Feature encoding

In [30]:
# Vectorize the feature matrix and carry out a one-hot encoding
vec = DictVectorizer(sparse=True)
X = vec.fit_transform(X_dict)
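To show what the one-hot encoding of feature dictionaries amounts to, here is a minimal pure-Python sketch. It is not the scikit-learn implementation: `fit_one_hot` and `transform_one_hot` are hypothetical helpers that produce dense lists, and the columns follow the order of first appearance, whereas `DictVectorizer` returns a sparse matrix and sorts its feature names by default.

```python
def fit_one_hot(dicts):
    """Assign a column index to every 'name=value' pair seen in training."""
    vocabulary = {}
    for d in dicts:
        for name, value in d.items():
            vocabulary.setdefault('{}={}'.format(name, value), len(vocabulary))
    return vocabulary


def transform_one_hot(dicts, vocabulary):
    """Encode each dictionary as a 0/1 vector over the known columns."""
    rows = []
    for d in dicts:
        row = [0] * len(vocabulary)
        for name, value in d.items():
            col = vocabulary.get('{}={}'.format(name, value))
            if col is not None:  # pairs unseen in training are ignored
                row[col] = 1
        rows.append(row)
    return rows


toy_train = [{'word': 'confidence', 'pos': 'NN'},
             {'word': 'in', 'pos': 'IN'}]
vocab = fit_one_hot(toy_train)
X_toy = transform_one_hot(toy_train, vocab)
X_unseen = transform_one_hot([{'word': 'pound', 'pos': 'NN'}], vocab)
print(vocab)
print(X_toy, X_unseen)
```

The last call illustrates why the vectorizer is fitted on the training set only: a test word that was never seen in training simply contributes no active column, while its part of speech still does.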

### Training the model

In [31]:
classifier = linear_model.LogisticRegression()
model = classifier.fit(X, y)
model

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

### Predicting the test set

We read the sentences and create a dictionary

In [32]:
test_sentences = read_sentences(test_file)
test_corpus = split_rows(test_sentences, column_names)
test_corpus[:2]

[[{'form': 'Rockwell', 'pos': 'NNP', 'chunk': 'B-NP'},
  {'form': 'International', 'pos': 'NNP', 'chunk': 'I-NP'},
  {'form': 'Corp.', 'pos': 'NNP', 'chunk': 'I-NP'},
  {'form': "'s", 'pos': 'POS', 'chunk': 'B-NP'},
  {'form': 'Tulsa', 'pos': 'NNP', 'chunk': 'I-NP'},
  {'form': 'unit', 'pos': 'NN', 'chunk': 'I-NP'},
  {'form': 'said', 'pos': 'VBD', 'chunk': 'B-VP'},
  {'form': 'it', 'pos': 'PRP', 'chunk': 'B-NP'},
  {'form': 'signed', 'pos': 'VBD', 'chunk': 'B-VP'},
  {'form': 'a', 'pos': 'DT', 'chunk': 'B-NP'},
  {'form': 'tentative', 'pos': 'JJ', 'chunk': 'I-NP'},
  {'form': 'agreement', 'pos': 'NN', 'chunk': 'I-NP'},
  {'form': 'extending', 'pos': 'VBG', 'chunk': 'B-VP'},
  {'form': 'its', 'pos': 'PRP$', 'chunk': 'B-NP'},
  {'form': 'contract', 'pos': 'NN', 'chunk': 'I-NP'},
  {'form': 'with', 'pos': 'IN', 'chunk': 'B-PP'},
  {'form': 'Boeing', 'pos': 'NNP', 'chunk': 'B-NP'},
  {'form': 'Co.', 'pos': 'NNP', 'chunk': 'I-NP'},
  {'form': 'to', 'pos': 'TO', 'chunk': 'B-VP'},
  {'form':

We extract the **features**

In [33]:
X_test_dict, y_test = extract_features_static(test_corpus, w_size, feature_names)
X_test_dict[:2]

[{'word_n2': 'bos',
  'word_n1': 'bos',
  'word': 'rockwell',
  'word_p1': 'international',
  'word_p2': 'corp.',
  'pos_n2': 'BOS',
  'pos_n1': 'BOS',
  'pos': 'NNP',
  'pos_p1': 'NNP',
  'pos_p2': 'NNP'},
 {'word_n2': 'bos',
  'word_n1': 'rockwell',
  'word': 'international',
  'word_p1': 'corp.',
  'word_p2': "'s",
  'pos_n2': 'BOS',
  'pos_n1': 'NNP',
  'pos': 'NNP',
  'pos_p1': 'NNP',
  'pos_p2': 'POS'}]

In [34]:
y_test[:2]

['B-NP', 'I-NP']

We **vectorize** the features

In [35]:
X_test = vec.transform(X_test_dict)  # Possible to add: .toarray()

And we **predict** the test set

In [36]:
y_test_predicted = classifier.predict(X_test)
y_test_predicted[:2]

array(['B-NP', 'I-NP'], dtype='<U7')

We now add the **predicted chunks** to the sentences

In [37]:
inx = 0
for sentence in test_corpus:
    for word in sentence:
        word['pchunk'] = y_test_predicted[inx]
        inx += 1

The **index** should be equal to the length of the prediction

In [38]:
print(inx)
len(y_test_predicted)

47377


47377

In [39]:
test_corpus[:2]

[[{'form': 'Rockwell', 'pos': 'NNP', 'chunk': 'B-NP', 'pchunk': 'B-NP'},
  {'form': 'International', 'pos': 'NNP', 'chunk': 'I-NP', 'pchunk': 'I-NP'},
  {'form': 'Corp.', 'pos': 'NNP', 'chunk': 'I-NP', 'pchunk': 'I-NP'},
  {'form': "'s", 'pos': 'POS', 'chunk': 'B-NP', 'pchunk': 'B-NP'},
  {'form': 'Tulsa', 'pos': 'NNP', 'chunk': 'I-NP', 'pchunk': 'I-NP'},
  {'form': 'unit', 'pos': 'NN', 'chunk': 'I-NP', 'pchunk': 'I-NP'},
  {'form': 'said', 'pos': 'VBD', 'chunk': 'B-VP', 'pchunk': 'B-VP'},
  {'form': 'it', 'pos': 'PRP', 'chunk': 'B-NP', 'pchunk': 'B-NP'},
  {'form': 'signed', 'pos': 'VBD', 'chunk': 'B-VP', 'pchunk': 'B-VP'},
  {'form': 'a', 'pos': 'DT', 'chunk': 'B-NP', 'pchunk': 'B-NP'},
  {'form': 'tentative', 'pos': 'JJ', 'chunk': 'I-NP', 'pchunk': 'I-NP'},
  {'form': 'agreement', 'pos': 'NN', 'chunk': 'I-NP', 'pchunk': 'I-NP'},
  {'form': 'extending', 'pos': 'VBG', 'chunk': 'B-VP', 'pchunk': 'B-VP'},
  {'form': 'its', 'pos': 'PRP$', 'chunk': 'B-NP', 'pchunk': 'B-NP'},
  {'form': 'c

In [40]:
save_results(test_corpus, keys, 'out')

#### Evaluating the performance

In [41]:
lines = open('out').read().splitlines()
res = conlleval.evaluate(lines)
simple_ml_score = res['overall']['chunks']['evals']['f1']

In [42]:
simple_ml_score

0.915119639107748

### Question on the ML program

1. What is the **feature vector** that corresponds to the <tt>ml_chunker.py</tt> program? Is it the same as the one Kudoh
    and Matsumoto used in their experiment?
2. What is the **performance** of the chunker?
3. Remove the lexical features (the words) from the feature vector and measure the performance. You should
    observe a decrease.
4. What is the **classifier** used in the program? 
5. As an optional task, you may try **two other classifiers from sklearn** and measure their performance: decision trees, perceptron, support vector machines, etc. Be aware that support vector machines take a long time to train: up to one hour.
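For question 3, one possible way to remove the lexical features without rewriting the extraction functions is to filter the feature dictionaries before vectorizing them. This is only a sketch: `drop_lexical_features` is a hypothetical helper, and it assumes the feature names used in this notebook, where every lexical feature name starts with `word`.

```python
def drop_lexical_features(X_dict):
    """Return copies of the feature dictionaries without the word features."""
    return [{name: value for name, value in x.items()
             if not name.startswith('word')}
            for x in X_dict]


toy_X_dict = [{'word_n2': 'bos', 'word_n1': 'bos', 'word': 'confidence',
               'word_p1': 'in', 'word_p2': 'the',
               'pos_n2': 'BOS', 'pos_n1': 'BOS', 'pos': 'NN',
               'pos_p1': 'IN', 'pos_p2': 'DT'}]
toy_X_pos_only = drop_lexical_features(toy_X_dict)
print(toy_X_pos_only)
```

The same filter would have to be applied to both the training and the test dictionaries, with a new vectorizer fitted on the filtered training set.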

## Using Machine Learning: Adding all the features from Kudoh and Matsumoto <a name='t4'/>

Complement the feature vector used in the previous section with the **two dynamic features**, $c_{i-2}, c_{i-1}$, and train a new model. You will need to write new `extract_features_sent_dyn` and `predict` functions.
In his experiments, your teacher obtained an F1 score of 92.65 with **logistic regression** and the default parameters from sklearn, i.e. `linear_model.LogisticRegression()`.

**A frequent mistake in the labs** is to use the gold-standard chunks from the test set. Be aware that when you predict the test set, you do not know the dynamic features in advance and you must not use the ones from the test file. You will **use the two previous chunk tags that you have predicted**.

You need to reach a **global F1 score** of 92 to pass this laboratory.
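The warning about gold-standard chunks can be made concrete with a sketch of greedy left-to-right prediction, in which each predicted tag is fed back as a feature for the next words. This is one possible shape for the prediction loop, not the reference solution: `greedy_predict` and `toy_tagger` are hypothetical names, and `toy_tagger` is a hand-written rule standing in for a trained classifier. With the objects of this notebook, the stand-in would presumably wrap something like `vec_dyn.transform([features])` followed by `classifier.predict(...)[0]`.

```python
W_SIZE = 2
FEATURE_NAMES = ['word_n2', 'word_n1', 'word', 'word_p1', 'word_p2',
                 'pos_n2', 'pos_n1', 'pos', 'pos_p1', 'pos_p2',
                 'chunk_n2', 'chunk_n1']


def greedy_predict(sentence, predict_tag):
    """Tag one sentence from left to right, reusing the predicted tags."""
    start = [{'form': 'BOS', 'pos': 'BOS'}] * W_SIZE
    end = [{'form': 'EOS', 'pos': 'EOS'}] * W_SIZE
    padded = start + sentence + end
    # The history of predicted tags is seeded with the padding symbol
    history = ['BOS'] * W_SIZE
    for i in range(len(sentence)):
        window = padded[i:i + 2 * W_SIZE + 1]
        x = [row['form'].lower() for row in window]
        x += [row['pos'] for row in window]
        # Dynamic features: the two previously PREDICTED tags, never the gold ones
        x += history[-W_SIZE:]
        history.append(predict_tag(dict(zip(FEATURE_NAMES, x))))
    return history[W_SIZE:]


def toy_tagger(features):
    """Hand-written stand-in for a trained classifier (assumption)."""
    if features['pos'] == 'DT':
        return 'B-NP'
    if features['pos'] in ('JJ', 'NN'):
        return 'I-NP' if features['chunk_n1'].endswith('NP') else 'B-NP'
    return 'O'


toy_sentence = [{'form': 'a', 'pos': 'DT'},
                {'form': 'tentative', 'pos': 'JJ'},
                {'form': 'agreement', 'pos': 'NN'}]
toy_tags = greedy_predict(toy_sentence, toy_tagger)
print(toy_tags)
```

Because the `chunk` column of the input rows is never read, this loop cannot leak gold-standard tags into the features. The price is that an early mistake can propagate to the following words.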

In [43]:
### write your code here
#  function to extract dynamic features from one sentence
def extract_features_sent_dyn(sentence, w_size, feature_names, prediction=False):
    """
    Extract the dynamic features from one sentence
    returns X and y, where X is a list of dictionaries and
    y is a list of symbols
    :param sentence: list of rows (dictionaries), one per word of the sentence
    :param w_size:
    :return:
    """

    # We pad the sentence to extract the context window more easily
    start = [{'form': 'BOS', 'pos': 'BOS', 'chunk': 'BOS'}]
    end = [{'form': 'EOS', 'pos': 'EOS', 'chunk': 'EOS'}]
    start *= w_size
    end *= w_size
    padded_sentence = start + sentence
    padded_sentence += end

    # We extract the dynamic features and the classes
    # X is a list of feature vectors, where each feature vector is a dictionary
    # y is the list of classes
    X = list()
    y = list()
    for i in range(len(padded_sentence) - 2 * w_size):
        # x is a row of X
        x = list()
        
        # The words in lower case
        for j in range(2 * w_size + 1):
            x.append(padded_sentence[i + j]['form'].lower())
            
        # The POS
        for j in range(2 * w_size + 1):
            x.append(padded_sentence[i + j]['pos'])
            
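        # The chunk tags of the two previous words (the dynamic features).
        # At training time, they come from the gold standard; with
        # prediction=True they are left out, as they must then be
        # replaced by the predicted tags
        if not prediction:
            for j in range(w_size):
                x.append(padded_sentence[i + j]['chunk'])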
                
        # We represent the feature vector as a dictionary
        X.append(dict(zip(feature_names, x)))
        
        # The classes are stored in a list
        y.append(padded_sentence[i + w_size]['chunk'])
        
    return X, y

In [44]:
def extract_features_dyn(sentences, w_size, feature_names):
    """
    Builds X matrix and y vector
    X is a list of dictionaries and y is a list
    :param sentences:
    :param w_size:
    :return:
    """
    X_l = []
    y_l = []
    for sentence in sentences:
        X, y = extract_features_sent_dyn(sentence, w_size, feature_names)
        X_l.extend(X)
        y_l.extend(y)
    return X_l, y_l

In [45]:
feature_names_dyn = ['word_n2', 'word_n1', 'word', 'word_p1', 'word_p2',
                     'pos_n2', 'pos_n1', 'pos', 'pos_p1', 'pos_p2', 'chunk_n2',
                     'chunk_n1']

**NOTE:** The goal of this task is to come forward with machine learning methods which 
after a training phase can recognize the **chunk segmentation** of the test data as well as 
possible. The training data can be used for training the text chunker. 

The chunkers will be evaluated with the **F rate**, which is a combination of the **precision** 
and **recall** rates: $F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$ [Rij79]. The precision and recall numbers will be computed over all types of chunks.

In [46]:
train_sentences = read_sentences(train_file)
train_corpus = split_rows(train_sentences, column_names)
train_corpus[:2]

[[{'form': 'Confidence', 'pos': 'NN', 'chunk': 'B-NP'},
  {'form': 'in', 'pos': 'IN', 'chunk': 'B-PP'},
  {'form': 'the', 'pos': 'DT', 'chunk': 'B-NP'},
  {'form': 'pound', 'pos': 'NN', 'chunk': 'I-NP'},
  {'form': 'is', 'pos': 'VBZ', 'chunk': 'B-VP'},
  {'form': 'widely', 'pos': 'RB', 'chunk': 'I-VP'},
  {'form': 'expected', 'pos': 'VBN', 'chunk': 'I-VP'},
  {'form': 'to', 'pos': 'TO', 'chunk': 'I-VP'},
  {'form': 'take', 'pos': 'VB', 'chunk': 'I-VP'},
  {'form': 'another', 'pos': 'DT', 'chunk': 'B-NP'},
  {'form': 'sharp', 'pos': 'JJ', 'chunk': 'I-NP'},
  {'form': 'dive', 'pos': 'NN', 'chunk': 'I-NP'},
  {'form': 'if', 'pos': 'IN', 'chunk': 'B-SBAR'},
  {'form': 'trade', 'pos': 'NN', 'chunk': 'B-NP'},
  {'form': 'figures', 'pos': 'NNS', 'chunk': 'I-NP'},
  {'form': 'for', 'pos': 'IN', 'chunk': 'B-PP'},
  {'form': 'September', 'pos': 'NNP', 'chunk': 'B-NP'},
  {'form': ',', 'pos': ',', 'chunk': 'O'},
  {'form': 'due', 'pos': 'JJ', 'chunk': 'B-ADJP'},
  {'form': 'for', 'pos': 'IN', 'ch

In [47]:
X_dict, y = extract_features_dyn(train_corpus, w_size, feature_names_dyn)

In [48]:
X_dict[:3]

[{'word_n2': 'bos',
  'word_n1': 'bos',
  'word': 'confidence',
  'word_p1': 'in',
  'word_p2': 'the',
  'pos_n2': 'BOS',
  'pos_n1': 'BOS',
  'pos': 'NN',
  'pos_p1': 'IN',
  'pos_p2': 'DT',
  'chunk_n2': 'BOS',
  'chunk_n1': 'BOS'},
 {'word_n2': 'bos',
  'word_n1': 'confidence',
  'word': 'in',
  'word_p1': 'the',
  'word_p2': 'pound',
  'pos_n2': 'BOS',
  'pos_n1': 'NN',
  'pos': 'IN',
  'pos_p1': 'DT',
  'pos_p2': 'NN',
  'chunk_n2': 'BOS',
  'chunk_n1': 'B-NP'},
 {'word_n2': 'confidence',
  'word_n1': 'in',
  'word': 'the',
  'word_p1': 'pound',
  'word_p2': 'is',
  'pos_n2': 'NN',
  'pos_n1': 'IN',
  'pos': 'DT',
  'pos_p1': 'NN',
  'pos_p2': 'VBZ',
  'chunk_n2': 'B-NP',
  'chunk_n1': 'B-PP'}]

You will now **vectorize the training set**

In [49]:
# Write your code
# Vectorize the feature matrix and carry out a one-hot encoding
vec_dyn = DictVectorizer(sparse=True)
X = vec_dyn.fit_transform(X_dict)
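
To picture what the one-hot encoding produces, here is a rough pure-Python sketch. It is not scikit-learn's implementation: the helper `one_hot_encode` is hypothetical and returns dense lists, while `DictVectorizer(sparse=True)` returns a sparse matrix. The `name=value` column labels mimic the way `DictVectorizer` names the columns of string-valued features.

```python
def one_hot_encode(dicts):
    """Create one column per (feature, value) pair and encode each dict as a 0/1 row."""
    columns = sorted({f'{name}={value}' for d in dicts for name, value in d.items()})
    col_idx = {col: i for i, col in enumerate(columns)}
    rows = []
    for d in dicts:
        row = [0] * len(columns)
        for name, value in d.items():
            row[col_idx[f'{name}={value}']] = 1
        rows.append(row)
    return columns, rows

# Two toy feature dictionaries (made up for the illustration)
toy = [{'word': 'the', 'pos': 'DT'}, {'word': 'pound', 'pos': 'NN'}]
columns, rows = one_hot_encode(toy)
print(columns)
for row in rows:
    print(row)
```

With the real corpus, the number of columns grows with the vocabulary, which is why a sparse representation is preferable.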

And fit the **dynamic model**

In [50]:
# Write your code
classifier = linear_model.LogisticRegression()
model = classifier.fit(X, y)

In [51]:
model

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

### Prediction

You will finally predict the test set. We **load the corpus** again.

In [52]:
# We read the sentences and create a dictionary
test_sentences = read_sentences(test_file)
test_corpus = split_rows(test_sentences, column_names)
test_corpus[:2]

[[{'form': 'Rockwell', 'pos': 'NNP', 'chunk': 'B-NP'},
  {'form': 'International', 'pos': 'NNP', 'chunk': 'I-NP'},
  {'form': 'Corp.', 'pos': 'NNP', 'chunk': 'I-NP'},
  {'form': "'s", 'pos': 'POS', 'chunk': 'B-NP'},
  {'form': 'Tulsa', 'pos': 'NNP', 'chunk': 'I-NP'},
  {'form': 'unit', 'pos': 'NN', 'chunk': 'I-NP'},
  {'form': 'said', 'pos': 'VBD', 'chunk': 'B-VP'},
  {'form': 'it', 'pos': 'PRP', 'chunk': 'B-NP'},
  {'form': 'signed', 'pos': 'VBD', 'chunk': 'B-VP'},
  {'form': 'a', 'pos': 'DT', 'chunk': 'B-NP'},
  {'form': 'tentative', 'pos': 'JJ', 'chunk': 'I-NP'},
  {'form': 'agreement', 'pos': 'NN', 'chunk': 'I-NP'},
  {'form': 'extending', 'pos': 'VBG', 'chunk': 'B-VP'},
  {'form': 'its', 'pos': 'PRP$', 'chunk': 'B-NP'},
  {'form': 'contract', 'pos': 'NN', 'chunk': 'I-NP'},
  {'form': 'with', 'pos': 'IN', 'chunk': 'B-PP'},
  {'form': 'Boeing', 'pos': 'NNP', 'chunk': 'B-NP'},
  {'form': 'Co.', 'pos': 'NNP', 'chunk': 'I-NP'},
  {'form': 'to', 'pos': 'TO', 'chunk': 'B-VP'},
  {'form':

Let us extract the **static features** from **one sentence**

In [53]:
X_test_dict, y_test = extract_features_static([test_corpus[0]], w_size, feature_names)
X_test_dict[:2]

[{'word_n2': 'bos',
  'word_n1': 'bos',
  'word': 'rockwell',
  'word_p1': 'international',
  'word_p2': 'corp.',
  'pos_n2': 'BOS',
  'pos_n1': 'BOS',
  'pos': 'NNP',
  'pos_p1': 'NNP',
  'pos_p2': 'NNP'},
 {'word_n2': 'bos',
  'word_n1': 'rockwell',
  'word': 'international',
  'word_p1': 'corp.',
  'word_p2': "'s",
  'pos_n2': 'BOS',
  'pos_n1': 'NNP',
  'pos': 'NNP',
  'pos_p1': 'NNP',
  'pos_p2': 'POS'}]

This $\mathbf{X}\_{\textrm{dict}}$ is incomplete. For the prediction, we need to **reinject dynamically the two previously predicted tags** to have the **full feature vector**. Write this code here. 

This part is probably the most difficult of the lab. You may want to write it first for **one sentence**, and then for the test corpus. The prediction will take a longer time and you may want to include a **progress bar** with this snippet: 
```
from tqdm import tqdm
for test_sentence in tqdm(test_corpus):
```

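**NOTE:** Since the **chunk labels** are not given in the test data, they are decided dynamically during tagging: each predicted label becomes a feature of the words that follow. This technique can be regarded as a sort of **Dynamic Programming (DP) matching**, in which the best answer is found by maximizing the total certainty score for the combination of tags. A greedy left-to-right loop, which keeps only the single best tag at each step, is the simplest approximation of this search.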
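
The reinjection of the two previous tags can be sketched independently of the classifier. In the sketch below, `greedy_decode` and `toy_predict` are hypothetical names, and `toy_predict` is a hand-written rule standing in for the trained model so that the example runs on its own.

```python
def greedy_decode(static_features, predict_one):
    """Tag one sentence left to right, feeding back the two previous predictions."""
    previous = ['BOS', 'BOS']           # c(i-2) and c(i-1) before the first word
    predicted = []
    for features in static_features:
        full = dict(features)           # copy, so the input dictionaries stay untouched
        full['chunk_n2'] = previous[0]
        full['chunk_n1'] = previous[1]
        tag = predict_one(full)
        predicted.append(tag)
        previous = [previous[1], tag]   # shift the window of previous tags
    return predicted

def toy_predict(features):
    """Made-up rule replacing the classifier: determiners and nouns form NP chunks."""
    if features['pos'] in ('DT', 'NN'):
        return 'I-NP' if features['chunk_n1'] in ('B-NP', 'I-NP') else 'B-NP'
    return 'O'

toy_sentence = [{'pos': 'DT'}, {'pos': 'NN'}, {'pos': 'VBZ'}, {'pos': 'NN'}]
toy_tags = greedy_decode(toy_sentence, toy_predict)
print(toy_tags)
```

In the notebook, `predict_one` would presumably be a small wrapper such as `lambda f: model.predict(vec_dyn.transform(f))[0]`.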

#### Experiment1: Prediction for one sentence

y_test_predicted_dyn = []

# Write your code here
# Work on a single sentence first; the loop over the whole corpus comes later
test_sentence = test_corpus[0]

# Extract the features dynamically
X_test_dict, y_test = extract_features_sent_dyn(test_sentence, w_size, 
                                                feature_names, True)

# Copy each feature dictionary so that X_test_dict is left untouched
test_data = [dict(features) for features in X_test_dict]

# The first two words have no predicted predecessors yet
test_data[0]['chunk_n2'] = 'BOS'
test_data[0]['chunk_n1'] = 'BOS'
if len(test_data) > 1:
    test_data[1]['chunk_n2'] = 'BOS'

for i in range(len(test_data)):
    # Predict the chunk of word i from its now complete feature dictionary
    chunk_prediction = model.predict(vec_dyn.transform(test_data[i]))[0]
    y_test_predicted_dyn.append(chunk_prediction)

    # Reinject the prediction into the features of the next two words
    if i + 1 < len(test_data):
        test_data[i + 1]['chunk_n1'] = chunk_prediction
    if i + 2 < len(test_data):
        test_data[i + 2]['chunk_n2'] = chunk_prediction


In [461]:
#data_length

In [460]:
#print(len(y_test_predicted_dyn))

In [457]:
#X_test_dict[:4]

In [459]:
#y_test_predicted_dyn[:3]

#### Experiment2: Prediction for one sentence

In [56]:
y_test_predicted_dyn = []

In [57]:
#initialization of the two previously predicted chunks
y_test_predicted_labels = ['BOS', 'BOS']
#y_test_predicted_labels = []

X_test_dict, y_test = extract_features_sent_dyn(test_corpus[0], w_size, feature_names, True)
X_test_dict[:2]

    
#inject the two previously predicted tags
for word in X_test_dict:
    word['chunk_n2'] = y_test_predicted_labels[-2]
    word['chunk_n1'] = y_test_predicted_labels[-1]
    #print('chunks ', x)

        
    # vectorize the test sentence/features
    #vec_dyn = DictVectorizer(sparse=True)
    X_test = vec_dyn.transform(word)
        
    # predict the chunks in the test set
    predicted = model.predict(X_test)
    #predicted = model.predict(X_test)[0]
    #print(y_test_predicted_dyn[0])
    #print('predicted', predicted)
        
    
    #print('labels 1 and 2 before ', y_test_predicted_labels)
    y_test_predicted_dyn.append(predicted[0])
    
    # update the chunk labels c(i-2) and c(i-1)
    y_test_predicted_labels[0] = y_test_predicted_labels[1]
    y_test_predicted_labels[1] = predicted[0]
    #print('labels 1-2 after', y_test_predicted_labels)
        
 #append the predicted chunks as a last column
#y_test_predicted_labels = y_test_predicted_dyn[2:]

In [58]:
#X_test_dict[:4]

In [59]:
#print(len(y_test_predicted_dyn)) # 28
#print(y_test_predicted_dyn[:3])
# ['B-NP' 'I-NP' 'I-NP']

In [60]:
y_test_predicted_dyn = []

In [61]:
# write your code here
from tqdm import tqdm

for test_sentence in tqdm(test_corpus):
#for test_sentence in test_corpus:

    # initialize the two previously predicted chunks
    c = ['BOS', 'BOS']
    
    # extract the features dynamically
    X_test_dict, y_test = extract_features_sent_dyn(test_sentence, w_size, 
                                                    feature_names, True)
   # X_test_dict, y_test = extract_features_dyn(test_sentence, w_size, feature_names)
    
    # Iterate over the words in each sentence
    # inject the two previously predicted tags
    for word in X_test_dict:
        # overwrite the gold chunk tags with the two previously predicted ones
        word['chunk_n2'] = c[-2]
        word['chunk_n1'] = c[-1]
        #print('chunks ', word)
        
        # vectorize the test sentence/features
        X_test = vec_dyn.transform(word)
        
        # predict the chunk in this iteration
        #y_test_predicted_dyn = classifier.predict(X_test)
        predicted_chunk = model.predict(X_test)
        
        # append the result
        y_test_predicted_dyn.append(predicted_chunk[0])
        
        # update the chunk labels for future use
        #y_test_predicted_labels.append(y_test_predicted_dyn[0])
        #c = predicted_chunk[0:2]
        c[-2] = c[-1]
        c[-1] = predicted_chunk[0]
        #print('The two previously predicted tags ', c)
        
    # append the predicted chunks as a last column
    #y_test_predicted_labels = y_test_predicted_dyn[2:]
    #print(y_test_predicted_labels)
    


100%|██████████████████████████████████████████████████████████████████████████████| 2012/2012 [00:22<00:00, 88.87it/s]


In [62]:
#X_test_dict[:4]

In [63]:
print(len(y_test_predicted_dyn))
y_test_predicted_dyn[:3]

47377


['B-NP', 'I-NP', 'I-NP']

In [64]:
print(len(X_test_dict))

28


In [65]:
inx = 0
for sentence in test_corpus:
    for word in sentence:
        word['pchunk'] = y_test_predicted_dyn[inx]
        inx += 1

In [66]:
save_results(test_corpus, keys, 'out')

#### Evaluation

In [67]:
lines = open('out').read().splitlines()
res = conlleval.evaluate(lines)
improved_ml_score = res['overall']['chunks']['evals']['f1']
improved_ml_score

0.9231961786642086
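
The score above is computed over whole chunks, not over individual tags. The following sketch shows the idea on a toy pair of tag sequences. It is a simplified reading of the IOB2 scheme and not the exact logic of `conlleval`; the function name `extract_chunks` and both tag sequences are made up.

```python
def extract_chunks(tags):
    """Return the set of (start, end, type) spans encoded by a list of IOB2 tags."""
    chunks, start, ctype = set(), None, None
    for i, tag in enumerate(tags + ['O']):       # the sentinel closes the last chunk
        # A chunk begins at a B- tag, or at an I- tag whose type differs from the open chunk
        begins = tag.startswith('B-') or (tag.startswith('I-') and tag[2:] != ctype)
        if start is not None and (tag == 'O' or begins):
            chunks.add((start, i, ctype))
            start, ctype = None, None
        if begins:
            start, ctype = i, tag[2:]
    return chunks

gold_tags = ['B-NP', 'I-NP', 'B-VP', 'B-NP', 'O']
pred_tags = ['B-NP', 'I-NP', 'B-VP', 'I-VP', 'O']

gold_chunks = extract_chunks(gold_tags)
pred_chunks = extract_chunks(pred_tags)
n_correct = len(gold_chunks & pred_chunks)       # a chunk counts only if span and type match
chunk_precision = n_correct / len(pred_chunks)
chunk_recall = n_correct / len(gold_chunks)
chunk_f1 = 2 * chunk_precision * chunk_recall / (chunk_precision + chunk_recall)
print(sorted(gold_chunks), sorted(pred_chunks), round(chunk_f1, 2))
```

Note that a single wrong tag (`I-VP` instead of `B-NP`) costs two gold chunks here, which is why chunk-level scores are usually lower than tag accuracy.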

### Optional improvement

As an optional task, you can try to improve the score with **beam search**. If you know this technique, apply it using the probability output of **logistic regression**.

With the same classifier and a beam diameter of 5, your teacher obtained 92.87.
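
For readers who do not know beam search, here is a minimal sketch under stated assumptions. The function `beam_search` and the probability table in `toy_proba` are hypothetical: `toy_proba` stands in for the probability output of the classifier and its numbers are invented so that the greedy choice and the best sequence differ.

```python
import math

def beam_search(static_features, predict_proba, beam_size=5):
    """Keep the beam_size most probable tag sequences while tagging left to right."""
    beam = [(0.0, [])]                       # (log probability, tags so far)
    for features in static_features:
        candidates = []
        for logp, tags in beam:
            history = ['BOS', 'BOS'] + tags
            full = dict(features, chunk_n2=history[-2], chunk_n1=history[-1])
            for tag, p in predict_proba(full).items():
                if p > 0:
                    candidates.append((logp + math.log(p), tags + [tag]))
        # Prune: keep only the best partial sequences
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
    return beam[0][1]

def toy_proba(features):
    """Invented distribution over the next tag, depending only on the previous tag."""
    table = {'BOS': {'B-NP': 0.6, 'B-VP': 0.4},
             'B-NP': {'I-NP': 0.55, 'B-VP': 0.45},
             'B-VP': {'B-NP': 0.9, 'I-VP': 0.1}}
    return table.get(features['chunk_n1'], {'O': 1.0})

two_words = [{'pos': 'NN'}, {'pos': 'NN'}]
greedy_tags = beam_search(two_words, toy_proba, beam_size=1)
beam_tags = beam_search(two_words, toy_proba, beam_size=2)
print(greedy_tags, beam_tags)
```

With a beam of 1, the search is the greedy loop and ends on a sequence of probability 0.33; with a beam of 2, it recovers the sequence of probability 0.36 that starts with the locally less likely tag. In the notebook, `predict_proba` could wrap `model.predict_proba` together with `model.classes_`.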

## Collecting the entities <a name='t5'/>

You will now collect all the **named entities** from the training set, defined as **NP chunks** starting with a `NNP` (proper noun) or a `NNPS` (proper noun, plural) tag. As an example, in the first sentence of `train_corpus`, you will extract `('September',)` and `('July', 'and', 'August')`. You will store all the **tuples** in a **set** that you will call `ne_set`.

**NOTE:** Parts of speech are useful features for labeling **named entities**
like people or organizations in **information extraction**.

In [68]:
train_corpus[:10]

[[{'form': 'Confidence', 'pos': 'NN', 'chunk': 'B-NP'},
  {'form': 'in', 'pos': 'IN', 'chunk': 'B-PP'},
  {'form': 'the', 'pos': 'DT', 'chunk': 'B-NP'},
  {'form': 'pound', 'pos': 'NN', 'chunk': 'I-NP'},
  {'form': 'is', 'pos': 'VBZ', 'chunk': 'B-VP'},
  {'form': 'widely', 'pos': 'RB', 'chunk': 'I-VP'},
  {'form': 'expected', 'pos': 'VBN', 'chunk': 'I-VP'},
  {'form': 'to', 'pos': 'TO', 'chunk': 'I-VP'},
  {'form': 'take', 'pos': 'VB', 'chunk': 'I-VP'},
  {'form': 'another', 'pos': 'DT', 'chunk': 'B-NP'},
  {'form': 'sharp', 'pos': 'JJ', 'chunk': 'I-NP'},
  {'form': 'dive', 'pos': 'NN', 'chunk': 'I-NP'},
  {'form': 'if', 'pos': 'IN', 'chunk': 'B-SBAR'},
  {'form': 'trade', 'pos': 'NN', 'chunk': 'B-NP'},
  {'form': 'figures', 'pos': 'NNS', 'chunk': 'I-NP'},
  {'form': 'for', 'pos': 'IN', 'chunk': 'B-PP'},
  {'form': 'September', 'pos': 'NNP', 'chunk': 'B-NP'},
  {'form': ',', 'pos': ',', 'chunk': 'O'},
  {'form': 'due', 'pos': 'JJ', 'chunk': 'B-ADJP'},
  {'form': 'for', 'pos': 'IN', 'ch

You can write a **two-pass procedure**. For each sentence of the corpus:
1. In the first pass, you will collect the **start indices** of the noun groups which are also proper nouns. For the first sentence, it will result in the list `[16, 30]`;
2. In the second pass, you will collect **the segments**, starting at each index. For the first sentence, it will result in the tuples `('September',)` and `('July', 'and', 'August')`.

Should you have a better solution, please use it.
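
The two passes can be sketched on a single sentence as follows. The helper name `collect_entities` is hypothetical, and the toy sentence is a shortened, made-up variant of the first training sentence, so its indices differ from `[16, 30]`.

```python
def collect_entities(sentence):
    """Return the start indices and the segments of the proper-noun NP chunks."""
    # First pass: indices where an NP chunk begins with a proper noun
    starts = [i for i, row in enumerate(sentence)
              if row['chunk'] == 'B-NP' and row['pos'] in ('NNP', 'NNPS')]
    # Second pass: extend each segment while the following rows are inside the chunk
    segments = []
    for start in starts:
        end = start + 1
        while end < len(sentence) and sentence[end]['chunk'] == 'I-NP':
            end += 1
        segments.append(tuple(row['form'] for row in sentence[start:end]))
    return starts, segments

toy_sentence = [
    {'form': 'figures', 'pos': 'NNS', 'chunk': 'B-NP'},
    {'form': 'for', 'pos': 'IN', 'chunk': 'B-PP'},
    {'form': 'September', 'pos': 'NNP', 'chunk': 'B-NP'},
    {'form': ',', 'pos': ',', 'chunk': 'O'},
    {'form': 'July', 'pos': 'NNP', 'chunk': 'B-NP'},
    {'form': 'and', 'pos': 'CC', 'chunk': 'I-NP'},
    {'form': 'August', 'pos': 'NNP', 'chunk': 'I-NP'},
]
starts, segments = collect_entities(toy_sentence)
print(starts, segments)
```

Using a slice for the second pass avoids modifying the loop index while the segment is being built.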

#### Collect the start indices 1

In [91]:
startidx = []
ne_set = set()
#ne_set = []
#segment = ()
#inx = []

for sentence in train_corpus:
    idx = [i for i in range(len(sentence)) 
           if (sentence[i]['pos'] in ['NNP', 'NNPS']) 
           and (sentence[i]['chunk'] =='B-NP')]
    #print('idx ', idx)
    #if len(idx) != 0:
    startidx.append(idx)
    
    for i in idx:
        segment = ()
        #print(sentence[i])
        segment = (sentence[i]['form'],)
        #print('Hi', segment)
        #print(sentence[i+1]['chunk'])
        
        while i<(len(sentence)-1) and sentence[i+1]['chunk'] == 'I-NP':
            #if sentence[i+1]['chunk'] == 'I-NP':
            segment += (sentence[i+1]['form'],)
                #segment = (sentence[i]['form'], sentence[i+1]['form'])
                #segment = (sentence[i+1]['form'],)
            #else:
            #print(sentence[i]['form'])
            #segment = (sentence[i]['form'],)
            i += 1
    
        ne_set.add(segment)
        #ne_set.append(segment)
        #ne_set.update(segment)
            #print(inx)
            #ne_set.add(word['form'])

In [92]:
        
print(len(ne_set))

4348


In [93]:
list(ne_set)[:10]

[('Avdel',),
 ('London', 'brokers', 'UBS', 'Phillips', '&', 'Drew'),
 ('Reuben', 'Mark'),
 ('Nynex',),
 ('Foreign', 'Minister', 'Hans-Dietrich', 'Genscher'),
 ('Lehman', 'Management', 'Co'),
 ('Barron',),
 ('Papua', 'New', 'Guinea'),
 ('First', 'Gibraltar', 'Bank', 'F.S.B.'),
 ('Barry', 'R.', 'Ostrager')]

In [94]:
if ('Streetspeak',) in ne_set:
    print('yes')

yes


#### Collecting the start indices 2

In [85]:
ne_set2 = set()
startIndices = []
for sentence in train_corpus:    
    inx = []
    duplicates = []
    #print(len(train_corpus[0]))
    #for word in train_corpus[0]:
    res = [idx for idx, val in enumerate(sentence) 
                   if (sentence[idx]['pos'] in ['NNP', 'NNPS']) 
                   and (sentence[idx]['chunk'] =='B-NP')]
    #inx.append(res)
    startIndices.append(res)    

In [86]:
print(len(startIndices))
print(len(startidx))
print(startIndices[:10])
print(startidx[:10])

8936
8936
[[16, 30], [4], [], [], [20, 27], [20, 23, 26, 32], [37], [11, 22], [0, 6, 17], []]
[[16, 30], [4], [], [], [20, 27], [20, 23, 26, 32], [37], [11, 22], [0, 6, 17], []]


# Pair each sentence with its own start indices; zip keeps the pairing correct
# even when two sentences have identical lists of indices
for sentence, inx in zip(train_corpus, startIndices):
    for i in inx:
        segment = (sentence[i]['form'],)
        
        # Extend the segment while the following words stay inside the chunk
        while i < (len(sentence) - 1) and sentence[i+1]['chunk'] == 'I-NP':
            segment += (sentence[i+1]['form'],)
            i += 1
    
        ne_set2.add(segment)

#print(len(startIndices)) 
#print(sentence[startIndices[10]])
print(len(ne_set2))
list(ne_set2)[:10]

len(ne_set)

list(ne_set)[:10]

### Creating a small set

To run the subsequent experiments faster, you will **limit the dataset** to the entities starting with the letter `K`. I chose this letter because it corresponds to one of the smallest sets. You will call the resulting set: `ne_small_set`. Feel free to use the full set after you have completed this assignment.
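
One possible way to express this filter is a set comprehension on the first token of each tuple. The four entities in `ne_set_demo` are a hand-picked sample, and the variable names are made up so that they do not clash with those of the assignment.

```python
# A small sample of entity tuples (hand-picked for the illustration)
ne_set_demo = {('Kodak',), ('Avdel',), ('Kuala', 'Lumpur'), ('Reuben', 'Mark')}

# Keep the entities whose first token starts with the letter K
ne_small_demo = {ne for ne in ne_set_demo if ne[0].startswith('K')}
print(sorted(ne_small_demo))
```

A comprehension keeps the result a set; converting it to a sorted list afterwards gives a reproducible order for the lookups.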

In [102]:
print(list(ne_set)[0])
print('Hi ', str(list(ne_set)[0][0][0]))
print(str(list(ne_set)[0][0]))
if str(list(ne_set)[0][0][0]) == 'A':
    print('yes')
else:
    print('NO')

('Avdel',)
Hi  A
Avdel
yes


In [105]:
# Write your code here
#filter_result = list(filter(lambda name: len(name) <= 7, my_names))
ne_small_set = list(filter(lambda neK: str(neK[0][0]) == 'K', list(ne_set)))
len(ne_small_set)

76

In [104]:
ne_small_set

[('Keihin', 'Electric', 'Express', 'Railway', 'Co'),
 ('Keizaikai', 'Corp.'),
 ('Ko', 'Shioya'),
 ('Kenneth', 'H.', 'Olsen'),
 ('Keizaikai',),
 ('Kollmorgen',),
 ('KPMG', 'Peat', 'Marwick'),
 ('Kenneth', 'Abraham'),
 ('Kringle', 'fares'),
 ('Knoxville',),
 ('Kathie', 'Huff'),
 ('Kumagai-Gumi',),
 ('Kawasaki', 'Steel'),
 ('Krenz',),
 ('Kevin', 'Logan'),
 ('Ke', 'Zaishuo'),
 ('Ky.',),
 ('Keith', 'Mulrooney'),
 ('Kansas', 'and', 'Texas'),
 ('Kean', 'forces'),
 ('Kobe', 'Steel'),
 ('Kate', 'Michelman'),
 ('Kajima',),
 ('Kremlin', 'wrangling'),
 ('Kleinwort', 'Benson', 'Government', 'Securities', 'Inc.'),
 ('Kodak',),
 ('Kacy', 'McClelland'),
 ('Kroger', 'Co'),
 ('Kathy', 'Stanwick'),
 ('Kleinwort', 'Benson', 'North', 'America'),
 ('Kurds',),
 ('K', 'mart', 'Corp.', 'Chairman', 'Joseph', 'E.', 'Antonini'),
 ('Kansas',),
 ('Khost',),
 ('Kurt', 'Hager'),
 ('Kary', 'Moss'),
 ('Keefe', ',', 'Bruyette', '&', 'Woods', 'Inc.'),
 ('Kentucky', 'Fried', 'Chicken', 'stores'),
 ('Kidder', ',', 'Peabody

## Resolving the entities <a name='t6'/>

You will now implement a simple **method** to find the **named entities** from the previous exercise in **Wikipedia** and **Wikidata**.

First, look at a few entities in your set and find:
1. a few entities that you think are in Wikipedia, 
2. entities that will not be in Wikipedia, and 
3. entities that you think are ambiguous, that is, entities that may correspond to two or more things. 

You will describe your findings in the report.

### A function to lookup entities

- Read the function below and try to understand what it means. You will describe it in your report.

#### NOTE: Web Scraping 
When we **scrape the web**, we write code that sends a request to the server that’s hosting the page we specified: Wikipedia and Wikidata. Generally, our code downloads that page’s source code, just as a browser would. But instead of displaying the page visually, it filters through the page looking for the **HTML elements** we’ve specified and extracts whatever content we’ve instructed it to extract. Using Python and the `Beautiful Soup` library is one of the most popular approaches to web scraping.

When we visit a web page, our web browser makes a **request** to a web server. This request is called a `GET` request, since we’re getting files from the server. The server then sends back files that tell our browser how to render the page for us. When we perform web scraping, we’re interested in the **main content** of the web page, so we look at the HTML.

The first thing we’ll need to do to scrape a web page is to download the page. We can download pages using the Python **requests library**. The requests library will make a `GET` request to a web server, which will download the HTML contents of a given web page for us.

**Why web scraping, and when is it needed?**

Web scraping is needed to unlock more powerful analysis when data isn’t available in an organized format. This could be useful for a variety of personal projects. You might, for example, want to scrape a sports website to analyze statistics associated with your favorite team.

But web scraping can also be important for **data analysts** and **data scientists** in a business context. There’s an awful lot of data out on the web that simply isn’t available unless you scrape it (or painstakingly copy it into a spreadsheet by hand for analysis). When that data might contain valuable insights for your company or your industry, you’ll have to turn to web scraping.

In [106]:
def wikipedia_lookup(ner, base_url='https://en.wikipedia.org/wiki/'):
    try:
        # Build the page URL from the tokens of the named entity
        url_en = base_url + ' '.join(ner)
        # Download the page with the requests module and read its HTML source
        html_doc = requests.get(url_en).text
        
        # Parse the page with the Beautiful Soup library
        parse_tree = bs4.BeautifulSoup(html_doc, 'html.parser')
        
        # Find the first <a> tag (a link) whose accesskey attribute is "g"
        # and read its href attribute, which holds the target of the link.
        # If no such tag exists, find() returns None and the subscript raises an exception
        entity_id = parse_tree.find("a", {"accesskey": "g"})['href']
        
        # Keep the last component of the link as the entity identifier
        head_id, entity_id = os.path.split(entity_id)
        return entity_id
    except Exception:
        # The page could not be retrieved or does not contain the link
        pass
        # print('Not found in: ', base_url)
    entity_id = 'UNK'
    return entity_id
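
To see the parsing step in isolation, without any network access, the sketch below extracts the same kind of link from a hand-written HTML fragment. It uses the standard library `html.parser` instead of Beautiful Soup, the class name `WikidataLinkParser` is hypothetical, and the fragment only imitates the shape of a Wikipedia page; it is an assumption, not a copy of the real markup.

```python
from html.parser import HTMLParser
import posixpath

class WikidataLinkParser(HTMLParser):
    """Remember the href of the first <a> tag whose accesskey is "g"."""
    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'a' and attrs.get('accesskey') == 'g' and self.href is None:
            self.href = attrs.get('href')

# Hand-written fragment imitating two navigation links (assumed layout)
html_doc = '''
<ul>
  <li><a href="/wiki/Main_Page" accesskey="z">Main page</a></li>
  <li><a href="https://www.wikidata.org/wiki/Special:EntityPage/Q486269" accesskey="g">Wikidata item</a></li>
</ul>
'''

parser = WikidataLinkParser()
parser.feed(html_doc)
# Keep the last path component, or fall back to 'UNK' when no link was found
entity_id = posixpath.basename(parser.href) if parser.href else 'UNK'
print(entity_id)
```

The fallback to `'UNK'` mirrors the behavior of `wikipedia_lookup` when the page or the link is missing.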

- Write a function to run the lookup and keep the **resolved entities** (only the resolved entities). You will call it `ne_ids_en`

In [109]:
#Write your code here
#from tqdm import tqdm
ne_ids_en = []
for ne in tqdm(ne_small_set):
    ne_id = wikipedia_lookup(ne, base_url='https://en.wikipedia.org/wiki/')
    ne_ids_en.append(ne_id)

100%|██████████████████████████████████████████████████████████████████████████████████| 76/76 [00:49<00:00,  1.39it/s]


In [110]:
ne_ids_en

['UNK',
 'UNK',
 'UNK',
 'Q454315',
 'UNK',
 'UNK',
 'Q493751',
 'Q59268486',
 'UNK',
 'Q185582',
 'UNK',
 'UNK',
 'Q6379829',
 'Q21512656',
 'UNK',
 'UNK',
 'Q225951',
 'UNK',
 'UNK',
 'UNK',
 'Q1730802',
 'Q785671',
 'Q1081154',
 'UNK',
 'UNK',
 'Q486269',
 'UNK',
 'UNK',
 'UNK',
 'UNK',
 'Q12223',
 'UNK',
 'Q1558',
 'Q386682',
 'Q95367',
 'UNK',
 'UNK',
 'UNK',
 'UNK',
 'UNK',
 'Q225951',
 'Q327751',
 'UNK',
 'Q1603',
 'Q745099',
 'UNK',
 'Q1865',
 'Q3196386',
 'Q2888777',
 'UNK',
 'UNK',
 'Q358393',
 'Q297659',
 'UNK',
 'Q6339344',
 'Q13403399',
 'UNK',
 'Q224736',
 'Q5838',
 'UNK',
 'UNK',
 'UNK',
 'Q181912',
 'UNK',
 'Q257417',
 'UNK',
 'Q10559663',
 'UNK',
 'Q43164',
 'UNK',
 'UNK',
 'UNK',
 'Q126993',
 'UNK',
 'UNK',
 'UNK']

Sometimes, entities need a **confirmation**. You will apply the resolution with the **Swedish Wikipedia**.

In [111]:
# Write your code here
ne_ids_sv = []
for ne in tqdm(ne_small_set):
    ne_id = wikipedia_lookup(ne, base_url='https://sv.wikipedia.org/wiki/')
    ne_ids_sv.append(ne_id)

100%|██████████████████████████████████████████████████████████████████████████████████| 76/76 [00:44<00:00,  1.57it/s]


In [112]:
ne_ids_sv

['UNK',
 'UNK',
 'UNK',
 'UNK',
 'UNK',
 'UNK',
 'UNK',
 'UNK',
 'UNK',
 'Q232749',
 'UNK',
 'UNK',
 'UNK',
 'UNK',
 'UNK',
 'UNK',
 'UNK',
 'UNK',
 'UNK',
 'UNK',
 'UNK',
 'UNK',
 'UNK',
 'UNK',
 'UNK',
 'Q486269',
 'UNK',
 'Q153417',
 'UNK',
 'UNK',
 'UNK',
 'UNK',
 'Q1558',
 'Q386682',
 'UNK',
 'UNK',
 'UNK',
 'UNK',
 'UNK',
 'UNK',
 'UNK',
 'Q12857502',
 'UNK',
 'Q1603',
 'UNK',
 'UNK',
 'Q1865',
 'UNK',
 'UNK',
 'UNK',
 'UNK',
 'Q358393',
 'Q297659',
 'UNK',
 'UNK',
 'UNK',
 'UNK',
 'Q224736',
 'Q5838',
 'UNK',
 'UNK',
 'UNK',
 'Q181912',
 'UNK',
 'UNK',
 'UNK',
 'UNK',
 'UNK',
 'Q952431',
 'UNK',
 'UNK',
 'UNK',
 'UNK',
 'UNK',
 'UNK',
 'UNK']

- You will compute the **intersection** of the two sets. You will assign it to a **list** that you will **sort** and that you will call: `confirmed_ne_en_sv`.

In [146]:
# Write your code here
# Indices of the entities whose English id is resolved and also appears among the Swedish ids
sv_ids = set(ne_ids_sv)
inter_inx = [i for i, ne_id in enumerate(ne_ids_en) 
             if ne_id != 'UNK' and ne_id in sv_ids]
inter = [ne_ids_en[i] for i in inter_inx]
confirmed_ne = [ne_small_set[i] for i in inter_inx]
# Pair each entity with its id and sort the pairs by entity
confirmed_ne_en_sv = sorted(zip(confirmed_ne, inter), key=lambda ne: ne[0])
#print(confirmed_ne_en_sv)

In [148]:
print(inter)
print(inter_inx)
print(confirmed_ne)
print('\n', confirmed_ne_en_sv)

['Q486269', 'Q1558', 'Q386682', 'Q1603', 'Q1865', 'Q358393', 'Q297659', 'Q224736', 'Q5838', 'Q181912']
[25, 32, 33, 43, 46, 51, 52, 57, 58, 62]
[('Kodak',), ('Kansas',), ('Khost',), ('Kentucky',), ('Kuala', 'Lumpur'), ('Kenton',), ('Kirin',), ('KIM',), ('Kabul',), ('KLM', 'Royal', 'Dutch', 'Airlines')]

 [(('KIM',), 'Q224736'), (('KLM', 'Royal', 'Dutch', 'Airlines'), 'Q181912'), (('Kabul',), 'Q5838'), (('Kansas',), 'Q1558'), (('Kenton',), 'Q358393'), (('Kentucky',), 'Q1603'), (('Khost',), 'Q386682'), (('Kirin',), 'Q297659'), (('Kodak',), 'Q486269'), (('Kuala', 'Lumpur'), 'Q1865')]


In [149]:
confirmed_ne_en_sv

[(('KIM',), 'Q224736'),
 (('KLM', 'Royal', 'Dutch', 'Airlines'), 'Q181912'),
 (('Kabul',), 'Q5838'),
 (('Kansas',), 'Q1558'),
 (('Kenton',), 'Q358393'),
 (('Kentucky',), 'Q1603'),
 (('Khost',), 'Q386682'),
 (('Kirin',), 'Q297659'),
 (('Kodak',), 'Q486269'),
 (('Kuala', 'Lumpur'), 'Q1865')]

The **first items** in your list should look like:
```
[(('KIM',), 'Q224736'),
 (('KLM', 'Royal', 'Dutch', 'Airlines'), 'Q181912'),
 ...
]
```
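
The confirmation step can also be written as a single function. In this sketch, the name `confirm` is hypothetical; the four entities and their identifiers are copied from the English and Swedish lookup results shown earlier in this section.

```python
def confirm(entities, ids_a, ids_b, unknown='UNK'):
    """Pair each entity with its id when the id is resolved in both lists, sorted by entity."""
    resolved_b = set(ids_b) - {unknown}
    return sorted((ne, ne_id) for ne, ne_id in zip(entities, ids_a)
                  if ne_id in resolved_b)

toy_entities = [('Krenz',), ('Kobe', 'Steel'), ('Kodak',), ('Kansas',)]
toy_ids_en = ['Q21512656', 'Q1730802', 'Q486269', 'Q1558']
toy_ids_sv = ['UNK', 'UNK', 'Q486269', 'Q1558']

toy_confirmed = confirm(toy_entities, toy_ids_en, toy_ids_sv)
print(toy_confirmed)
```

Removing `'UNK'` from the second set is what prevents two unresolved lookups from counting as an agreement.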

## Submission <a name='t7'/>

When you have written all the code and run all the cells, fill in your ID as well as the name of the notebook.

In [150]:
STIL_ID = ["hi8826mo-s"] # Write your stil ids as a list
CURRENT_NOTEBOOK_PATH = os.path.join(os.getcwd(), 
                                     "4-Chunker_HichamMohamad.ipynb") # Write the name of your notebook

The submission code will send your answer. It consists of the baseline score, the improved machine-learning score, and the confirmed entities.

In [151]:
import json
ANSWER = json.dumps({'baseline_score': baseline_score,
                    'improved_ml_score': improved_ml_score,
                    'confirmed_ne_en_sv': confirmed_ne_en_sv})
ANSWER

'{"baseline_score": 0.770671072299583, "improved_ml_score": 0.9231961786642086, "confirmed_ne_en_sv": [[["KIM"], "Q224736"], [["KLM", "Royal", "Dutch", "Airlines"], "Q181912"], [["Kabul"], "Q5838"], [["Kansas"], "Q1558"], [["Kenton"], "Q358393"], [["Kentucky"], "Q1603"], [["Khost"], "Q386682"], [["Kirin"], "Q297659"], [["Kodak"], "Q486269"], [["Kuala", "Lumpur"], "Q1865"]]}'

Now the moment of truth:
1. Save your notebook and
2. Run the cells below

In [152]:
SUBMISSION_NOTEBOOK_PATH = CURRENT_NOTEBOOK_PATH + ".submission.bz2"

In [153]:
import bz2
ASSIGNMENT = 4
API_KEY = "f581ba347babfea0b8f2c74a3a6776a7"

# Copy and compress current notebook
with bz2.open(SUBMISSION_NOTEBOOK_PATH, mode="wb") as fout:
    with open(CURRENT_NOTEBOOK_PATH, "rb") as fin:
        fout.write(fin.read())

In [154]:
res = requests.post("https://vilde.cs.lth.se/edan20checker/submit", 
                    files={"notebook_file": open(SUBMISSION_NOTEBOOK_PATH, "rb")}, 
                    data={
                        "stil_id": STIL_ID,
                        "assignment": ASSIGNMENT,
                        "answer": ANSWER,
                        "api_key": API_KEY,
                    },
               verify=True)

# from IPython.display import display, JSON
res.json()

{'msg': None,
 'status': 'correct',
 'signature': 'fbe032d32551fb78f8d8ad3808f9f349cc84d400a572246141e6c13a24168ddd778bc15629f4642ea65aab153385a07e11a557e3f403171f59916570280f8df9',
 'submission_id': 'ff442bcc-eb41-492e-ad81-0b3e75be3de9'}

## Reading <a name='t8'/>

You will read the article <a href="https://www.aclweb.org/anthology/C18-1139"><i>Contextual String Embeddings for Sequence Labeling</i></a> by Akbik et al. (2018) and you will outline the **main differences** between their system and yours. An LSTM is a type of **recurrent neural network**, while a CRF can loosely be compared to **beam search**, as it selects the tag sequence of a sentence jointly rather than one tag at a time. You will report the performance they reach on the corpus you used in this laboratory.