# Lab 3 - Extracting noun groups using machine learning techniques

## Objectives
The objectives of this assignment are to:

- Write a program to detect partial syntactic structures
- Understand the principles of supervised machine learning techniques applied to language processing
- Use a popular machine learning toolkit: scikit-learn
- Write a short report of 1 to 2 pages on the assignment

## Choosing a training set and a test set

In [14]:
from urllib.request import urlopen

b_train_text = urlopen("http://fileadmin.cs.lth.se/cs/Education/EDAN20/corpus/conll2000/train.txt").read() # Open file and read
train_text = str(b_train_text,'utf-8')
b_test_text = urlopen("http://fileadmin.cs.lth.se/cs/Education/EDAN20/corpus/conll2000/test.txt").read() # Open file and read
test_text = str(b_test_text,'utf-8')

In [20]:
print("---TEXT EXAMPLE TRAIN---\n",train_text[:200], "\n ---TEXT EXAMPLE TEST--- \n",test_text[:200])

---TEXT EXAMPLE TRAIN---
 Confidence NN B-NP
in IN B-PP
the DT B-NP
pound NN I-NP
is VBZ B-VP
widely RB I-VP
expected VBN I-VP
to TO I-VP
take VB I-VP
another DT B-NP
sharp JJ I-NP
dive NN I-NP
if IN B-SBAR
trade NN B-NP
figur 
 ---TEXT EXAMPLE TEST--- 
 Rockwell NNP B-NP
International NNP I-NP
Corp. NNP I-NP
's POS B-NP
Tulsa NNP I-NP
unit NN I-NP
said VBD B-VP
it PRP B-NP
signed VBD B-VP
a DT B-NP
tentative JJ I-NP
agreement NN I-NP
extending VBG B-


In [21]:
import sklearn

## Baseline

Most statistical algorithms for language processing start with a so-called baseline. The baseline figure corresponds to the application of a minimal technique that is used to assess the difficulty of a task and for comparison with further programs.

#### 1. Read the baseline proposed by the organizers of the CoNLL 2000 shared task (in the Results section). What do you think of it?


The scores are fairly high overall, but none of the methods uses advanced machine learning such as deep neural networks, which is understandable since the shared task was held in 2000.

#### 2. Implement this baseline program. You may either create a completely new program or start from an existing program that you will modify: https://github.com/pnugues/ilppp/tree/master/programs/labs/chunking/chunker_python/


Complete the train function so that it computes the chunk distribution for each part of speech. You will use the train file to derive your distribution and you will store the results in a dictionary. Below, you have an excerpt of the expected results:


In [42]:
column_names = ['form', 'pos', 'chunk']

In [43]:
# Inline replacement for conll_reader.read_sentences(train_text):
# sentences are separated by blank lines
sentences_train = train_text.split('\n\n')

In [44]:
# Inline replacement for conll_reader.split_rows(train_corpus, column_names):
# each row becomes a dictionary keyed by column name
train_corpus = []
for sentence in sentences_train:
    rows = sentence.split('\n')
    sentence = [dict(zip(column_names, row.split())) for row in rows]
    train_corpus.append(sentence)

In [45]:
# Inline replacement for conll_reader.read_sentences(test_text):
# sentences are separated by blank lines
sentences_test = test_text.split('\n\n')

In [46]:
# Inline replacement for conll_reader.split_rows(test_corpus, column_names):
# each row becomes a dictionary keyed by column name
test_corpus = []
for sentence in sentences_test:
    rows = sentence.split('\n')
    sentence = [dict(zip(column_names, row.split())) for row in rows]
    test_corpus.append(sentence)
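
The training and test cells above repeat the same splitting logic. If the text ends with a blank line, `split('\n\n')` also leaves an empty final sentence whose only row is `{}`, which is why `count_pos` has to skip empty rows. A minimal sketch of a shared reader that filters those rows up front is shown below. The name `read_corpus` and the inline `sample` string are hypothetical and only serve the illustration.

```python
def read_corpus(text, column_names=('form', 'pos', 'chunk')):
    """
    Splits a CoNLL 2000 text into sentences, each a list of row dictionaries.
    Blank lines and empty sentences are skipped.
    (Hypothetical helper, not part of the lab's conll_reader module.)
    """
    corpus = []
    for block in text.strip().split('\n\n'):
        # Keep only the non-blank lines of the sentence
        rows = [line.split() for line in block.split('\n') if line.strip()]
        if rows:
            corpus.append([dict(zip(column_names, row)) for row in rows])
    return corpus


# Tiny inline sample in the same three-column format as the corpus files
sample = "the DT B-NP\npound NN I-NP\n\nis VBZ B-VP\n\n"
sample_corpus = read_corpus(sample)
print(sample_corpus)
```

With such a helper, `train_corpus` and `test_corpus` could each be built with a single call, and later functions would no longer need to guard against empty rows.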

In [76]:
train_corpus[1]

[{'form': 'Chancellor', 'pos': 'NNP', 'chunk': 'O'},
 {'form': 'of', 'pos': 'IN', 'chunk': 'B-PP'},
 {'form': 'the', 'pos': 'DT', 'chunk': 'B-NP'},
 {'form': 'Exchequer', 'pos': 'NNP', 'chunk': 'I-NP'},
 {'form': 'Nigel', 'pos': 'NNP', 'chunk': 'B-NP'},
 {'form': 'Lawson', 'pos': 'NNP', 'chunk': 'I-NP'},
 {'form': "'s", 'pos': 'POS', 'chunk': 'B-NP'},
 {'form': 'restated', 'pos': 'VBN', 'chunk': 'I-NP'},
 {'form': 'commitment', 'pos': 'NN', 'chunk': 'I-NP'},
 {'form': 'to', 'pos': 'TO', 'chunk': 'B-PP'},
 {'form': 'a', 'pos': 'DT', 'chunk': 'B-NP'},
 {'form': 'firm', 'pos': 'NN', 'chunk': 'I-NP'},
 {'form': 'monetary', 'pos': 'JJ', 'chunk': 'I-NP'},
 {'form': 'policy', 'pos': 'NN', 'chunk': 'I-NP'},
 {'form': 'has', 'pos': 'VBZ', 'chunk': 'B-VP'},
 {'form': 'helped', 'pos': 'VBN', 'chunk': 'I-VP'},
 {'form': 'to', 'pos': 'TO', 'chunk': 'I-VP'},
 {'form': 'prevent', 'pos': 'VB', 'chunk': 'I-VP'},
 {'form': 'a', 'pos': 'DT', 'chunk': 'B-NP'},
 {'form': 'freefall', 'pos': 'NN', 'chunk': '

In [110]:
def count_pos(corpus):
    """
    Computes the part-of-speech distribution
    in a CoNLL 2000 file
    :param corpus: list of sentences, each a list of row dictionaries
    :return: dictionary mapping each part of speech to its frequency
    """
    pos_cnt = {}
    for sentence in corpus:
        for row in sentence:
            # Skip the empty rows produced by blank lines
            if not row:
                continue
            pos_cnt[row['pos']] = pos_cnt.get(row['pos'], 0) + 1
    return pos_cnt

In [114]:
def train(corpus):
    """
    Computes the chunk distribution by pos
    The result is stored in a dictionary
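    :param corpus: list of sentences, each a list of row dictionaries
    :return: dictionary mapping each part of speech to its most frequent chunk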
    """
    pos_cnt = count_pos(corpus)
    # We compute the chunk distribution by POS
    chunk_dist = {key: {} for key in pos_cnt.keys()}
    
    
    """
    Fill in code to compute the chunk distribution for each part of speech
    """

    # We determine the best association
    pos_chunk = {}
    """
    Fill in code so that for each part of speech, you select the most frequent chunk.
    You will build a dictionary with key values:
    pos_chunk[pos] = most frequent chunk for pos
    """
    return pos_chunk

In [112]:
model = train(train_corpus)

In [113]:
print(model)

{}
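
The empty dictionary printed above is what the template returns while the two placeholder sections of `train` are still unfilled. One possible way to fill them in is sketched below: count how often each chunk tag occurs with each part of speech, then keep the most frequent tag. The names `train_sketch` and `toy_corpus` are hypothetical, and the toy rows are invented for illustration rather than taken from the corpus.

```python
def train_sketch(corpus):
    """
    Returns a dictionary mapping each part of speech
    to its most frequent chunk tag in the corpus.
    (Illustrative sketch, not the reference solution.)
    """
    # chunk_dist[pos][chunk] = number of times pos was tagged with chunk
    chunk_dist = {}
    for sentence in corpus:
        for row in sentence:
            if not row:  # skip empty rows left by blank lines
                continue
            pos_counts = chunk_dist.setdefault(row['pos'], {})
            pos_counts[row['chunk']] = pos_counts.get(row['chunk'], 0) + 1

    # For each part of speech, keep the chunk with the highest count
    pos_chunk = {}
    for pos, counts in chunk_dist.items():
        pos_chunk[pos] = max(counts, key=counts.get)
    return pos_chunk


# Hypothetical toy corpus in the same row format as train_corpus
toy_corpus = [
    [{'form': 'the', 'pos': 'DT', 'chunk': 'B-NP'},
     {'form': 'pound', 'pos': 'NN', 'chunk': 'I-NP'}],
    [{'form': 'trade', 'pos': 'NN', 'chunk': 'B-NP'},
     {'form': 'dive', 'pos': 'NN', 'chunk': 'I-NP'},
     {}],
]
toy_model = train_sketch(toy_corpus)
print(toy_model)  # {'DT': 'B-NP', 'NN': 'I-NP'}
```

In this sketch, ties go to the chunk tag seen first, because `max` returns the first maximal key in dictionary order.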


In [91]:
def predict(model, corpus):
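    """
    Predicts the chunk from the part of speech
    Adds a pchunk column
    :param model: dictionary mapping each part of speech to a chunk
    :param corpus: list of sentences, each a list of row dictionaries
    :return: the corpus with a pchunk key added to each row
    """
    # We add a predicted chunk column: pchunk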
    for sentence in corpus:
        for row in sentence:
            print(model)
            print(row['pos'])
            row['pchunk'] = model[row['pos']]
    return corpus

In [92]:
predicted = predict(model, test_corpus)

{}
NNP


KeyError: 'NNP'
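
The `KeyError` follows directly from the empty model: `model['NNP']` has no entry to return. Even with a trained model, the same error would appear for a part of speech never seen in training, or for an empty row with no `'pos'` key. A hedged sketch of a more tolerant variant is given below. The name `predict_sketch` and the fallback tag `'O'` are assumptions made for illustration, not part of the lab template.

```python
def predict_sketch(model, corpus, default_chunk='O'):
    """
    Adds a pchunk column with the chunk predicted from the part of speech.
    Parts of speech missing from the model receive default_chunk
    (an assumed fallback, chosen here for illustration).
    """
    for sentence in corpus:
        for row in sentence:
            if not row:  # skip empty rows left by blank lines
                continue
            row['pchunk'] = model.get(row['pos'], default_chunk)
    return corpus


# Hypothetical toy model and corpus, for illustration only
toy_model = {'DT': 'B-NP', 'NN': 'I-NP'}
toy_test = [[{'form': 'a', 'pos': 'DT', 'chunk': 'B-NP'},
             {'form': 'Tulsa', 'pos': 'NNP', 'chunk': 'I-NP'}]]
toy_predicted = predict_sketch(toy_model, toy_test)
print([row['pchunk'] for row in toy_predicted[0]])  # ['B-NP', 'O']
```

The fallback only hides missing entries. The underlying fix is still to complete `train` so that the model is not empty.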