# Lab4-Assignment about Named Entity Recognition and Classification

This notebook describes the assignment of Lab 4 of the text mining course. We assume you have successfully completed Lab1, Lab2 and Lab3 as well. Lab2 is especially important for completing this assignment.

**Learning goals**
* go from a linguistic input format to a representation in a feature space
* work with pretrained word embeddings
* train a supervised classifier (SVM)
* evaluate a supervised classifier (SVM)
* learn how to interpret the system output and the evaluation results
* be able to propose future improvements based on the observed results


## Credits
This notebook was originally created by [Marten Postma](https://martenpostma.github.io) and [Filip Ilievski](http://ilievski.nl) and adapted by Piek Vossen.

## [Points: 18] Exercise 1 (NERC): Training and evaluating an SVM using CoNLL-2003

**[4 points] a) Load the CoNLL-2003 training data using the *ConllCorpusReader* and create for both *train.txt* and *test.txt*:**

    [2 points] -a list of dictionaries representing the features for each training instance, e.g.,
    ```
    [
    {'words': 'EU', 'pos': 'NNP'}, 
    {'words': 'rejects', 'pos': 'VBZ'},
    ...
    ]
    ```

    [2 points] -the NERC labels associated with each training instance, e.g.,
    ```
    [
    'B-ORG', 
    'O',
    ....
    ]
    ```

In [90]:
from nltk.corpus.reader import ConllCorpusReader
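### Adapt the path to point to the CONLL2003 folder on your local machine
# The CoNLL-2003 columns are word, POS, chunk and NER tag. Declaring the fourth
# column as 'chunk' makes iob_words() return the NER tag as the third element.
train = ConllCorpusReader('/Users/lmps/github/ba-text-mining/lab_sessions/lab4/CONLL2003', 'train.txt', ['words', 'pos', 'ignore', 'chunk'])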
training_features = []
training_gold_labels = []

for token, pos, ne_label in train.iob_words():
    a_dict = {
        'words': token, 'pos': pos
    }
    training_features.append(a_dict)
    training_gold_labels.append(ne_label)

In [91]:
### Adapt the path to point to the CONLL2003 folder on your local machine
test = ConllCorpusReader('/Users/lmps/github/ba-text-mining/lab_sessions/lab4/CONLL2003', 'test.txt', ['words', 'pos', 'ignore', 'chunk'])

test_features = []
test_gold_labels = []
for token, pos, ne_label in test.iob_words():
    a_dict = {
        'words': token, 'pos': pos
    }
    test_features.append(a_dict)
    test_gold_labels.append(ne_label)

**[2 points] b) provide descriptive statistics about the training and test data:**
* How many instances are in train and test?
* Provide a frequency distribution of the NERC labels, i.e., how many times does each NERC label occur?
* Discuss to what extent the training and test data are balanced (an equal number of instances for each NERC label) and to what extent the training and test data differ.

Tip: you can use the following `Counter` functionality to generate a frequency list from a list:
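As a minimal sketch of the idea, the snippet below counts a small made-up label list (the labels are illustrative and not taken from CoNLL-2003) and prints each label with its absolute and relative frequency.

```python
from collections import Counter

# Toy label sequence (hypothetical, not taken from CoNLL-2003)
labels = ['O', 'B-ORG', 'O', 'B-PER', 'I-PER', 'O']

label_counts = Counter(labels)
total = sum(label_counts.values())

# most_common() orders the labels from most to least frequent
for label, count in label_counts.most_common():
    print(f"{label}\t{count}\t{count / total:.1%}")
```

Dividing by the sum over all labels keeps the percentages adding up to 100%.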

In [71]:
from collections import Counter 
import pandas as pd

# How many instances are in train and test?
test_len = len(test.iob_words())
train_len = len(train.iob_words())
print("How many instances are in train and test?\n")
print("Number of instances in the training data is: {}\tPercentage is: {}\nNumber of instances in the test data is: {}\tPercentage is: {}\n".format(
    train_len, train_len/(test_len + train_len) * 100, test_len, test_len/(test_len + train_len) * 100))

# Provide a frequency distribution of the NERC labels
print("Provide a frequency distribution of the NERC labels\n")
train_labels = Counter(training_gold_labels)
test_labels = Counter(test_gold_labels)

print("test labels are: {}\ntrain labels are: {}".format(test_labels.items(), train_labels.items()))
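# Denominator: the 'O' label plus the four B- labels. The I- labels are left out,
# so each percentage is relative to entity starts plus non-entity tokens.
counted_labels = ['O', 'B-LOC', 'B-PER', 'B-ORG', 'B-MISC']
test_sum = sum(test_labels[label] for label in counted_labels)
train_sum = sum(train_labels[label] for label in counted_labels)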
print("Test Data Instances For Each Label")
print("LOC % Test Data", ((test_labels['B-LOC']/test_sum) * 100))
print("PER % Test Data", ((test_labels['B-PER']/test_sum) * 100))
print("ORG % Test Data", ((test_labels['B-ORG']/test_sum) * 100))
print("MISC % Test Data", ((test_labels['B-MISC']/test_sum) * 100))
print("O % Test Data", ((test_labels['O']/test_sum)) * 100)
print()
print("Train Data Instances For Each Label")
print("LOC % Train Data", ((train_labels['B-LOC']/train_sum) * 100))
print("PER % Train Data", ((train_labels['B-PER']/train_sum) * 100))
print("ORG % Train Data", ((train_labels['B-ORG']/train_sum) * 100))
print("MISC % Train Data", ((train_labels['B-MISC']/train_sum) * 100))
print("O % Train Data", ((train_labels['O']/train_sum)) * 100)
print()
print("The label distribution is very similar in the training and test data, but neither set is balanced: 'O' covers about 87-88% of the counted instances. The remaining 12-13% are entity starts, with LOC, PER and ORG at roughly 3-4% each and MISC at under 2%.")

How many instances are in train and test?

Number of instances in the training data is: 203621	Percentage is: 81.43015964423968
Number of instances in the test data is: 46435	Percentage is: 18.569840355761

Provide a frequency distribution of the NERC labels

test labels are: dict_items([('O', 38323), ('B-LOC', 1668), ('B-PER', 1617), ('I-PER', 1156), ('I-LOC', 257), ('B-MISC', 702), ('I-MISC', 216), ('B-ORG', 1661), ('I-ORG', 835)])
train labels are: dict_items([('B-ORG', 6321), ('O', 169578), ('B-MISC', 3438), ('B-PER', 6600), ('I-PER', 4528), ('B-LOC', 7140), ('I-ORG', 3704), ('I-MISC', 1155), ('I-LOC', 1157)])
Test Data Instances For Each Label
LOC % Test Data 3.7934092924882314
PER % Test Data 3.6774237565668284
ORG % Test Data 3.77748970912647
MISC % Test Data 1.5965067885651905
O % Test Data 87.15517045325328

Train Data Instances For Each Label
LOC % Train Data 3.69800649481813
PER % Train Data 3.418325331344489
ORG % Train Data 3.273823396883109
MISC % Train Data 1.78063674

**[2 points] c) Concatenate the train and test features (the list of dictionaries) into one list. Load it using the *DictVectorizer*. Afterwards, split it back to training and test.**

Tip: You’ve concatenated train and test into one list and then you’ve applied the DictVectorizer.
The order of the rows is maintained, so you can use an index (the number of training instances) to split `the_array` back into train and test. Do NOT use `from sklearn.model_selection import train_test_split` here.
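The concatenate, vectorize and split-back steps can be sketched on a tiny example. The `toy_*` names and the three feature dictionaries are made up for illustration; only `DictVectorizer` itself comes from the assignment.

```python
from sklearn.feature_extraction import DictVectorizer

# Toy feature dictionaries (hypothetical), in the same format as in exercise 1a
toy_train = [{'words': 'EU', 'pos': 'NNP'}, {'words': 'rejects', 'pos': 'VBZ'}]
toy_test = [{'words': 'German', 'pos': 'JJ'}]

# Fit on train + test together so both parts share the same feature columns
toy_vec = DictVectorizer()
toy_matrix = toy_vec.fit_transform(toy_train + toy_test)  # sparse matrix

# Row order is preserved, so the number of training rows is the split point
n_train = len(toy_train)
toy_train_X = toy_matrix[:n_train]
toy_test_X = toy_matrix[n_train:]
print(toy_train_X.shape, toy_test_X.shape)
```

Each distinct `feature=value` pair becomes one column, which is why both parts end up with the same number of columns.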


In [94]:
from sklearn.feature_extraction import DictVectorizer
import numpy as np

In [126]:
vec = DictVectorizer()
concatenation = training_features + test_features
the_array = vec.fit_transform(concatenation)
# the_array = np.array(the_array)
print(the_array.shape)
train = the_array[:len(training_features)]
print(train.shape)
test = the_array[len(training_features):]
print(test.shape)



(250056, 27361)
(203621, 27361)
(46435, 27361)


**[4 points] d) Train the SVM using the train features and labels and evaluate on the test data. Provide a classification report (sklearn.metrics.classification_report)**. Training (`lin_clf.fit`) might take a while. On my computer, it took 1min 53s, which is acceptable. Training models normally takes much longer. If it takes more than 5 minutes, you can use a subset for training. Describe the results:

* Which NERC labels does the classifier perform well on? Why do you think this is the case?
* Which NERC labels does the classifier perform poorly on? Why do you think this is the case?
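The train-and-evaluate loop can be sketched on toy data as follows. The four instances and the `toy_*` names are made up, and the model is scored on its own training data purely to keep the example short. Note that `classification_report` takes the gold labels first and the predictions second.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import classification_report
from sklearn.svm import LinearSVC

# Toy instances (hypothetical), same feature format as in exercise 1a
toy_features = [
    {'words': 'EU', 'pos': 'NNP'},
    {'words': 'rejects', 'pos': 'VBZ'},
    {'words': 'Peter', 'pos': 'NNP'},
    {'words': 'says', 'pos': 'VBZ'},
]
toy_labels = ['B-ORG', 'O', 'B-PER', 'O']

toy_X = DictVectorizer().fit_transform(toy_features)
toy_clf = LinearSVC().fit(toy_X, toy_labels)

# Demo only: predicting on the training instances themselves
toy_pred = toy_clf.predict(toy_X)

# Argument order is (y_true, y_pred); swapping them swaps precision and recall
report = classification_report(toy_labels, toy_pred, zero_division=0)
print(report)
```

In the report, the support column counts the gold instances per label, so it can be checked against the frequency distribution from exercise 1b.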

In [97]:
from sklearn import svm

In [98]:
lin_clf = svm.LinearSVC()

In [108]:
##### [ YOUR CODE SHOULD GO HERE ]
model = lin_clf.fit(train, training_gold_labels)

In [131]:
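from sklearn.metrics import classification_report

predicted = model.predict(test)
# classification_report expects the gold labels first (y_true) and the predictions second (y_pred).
# The report shown below was produced with the two arguments reversed, so its precision and
# recall columns are swapped and its support column counts predicted labels.
print(classification_report(test_gold_labels, predicted))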

              precision    recall  f1-score   support

       B-LOC       0.78      0.81      0.79      1592
      B-MISC       0.66      0.78      0.72       596
       B-ORG       0.52      0.79      0.63      1088
       B-PER       0.44      0.86      0.58       821
       I-LOC       0.53      0.62      0.57       220
      I-MISC       0.59      0.57      0.58       223
       I-ORG       0.47      0.70      0.56       555
       I-PER       0.87      0.33      0.48      3028
           O       0.98      0.98      0.98     38312

    accuracy                           0.92     46435
   macro avg       0.65      0.72      0.65     46435
weighted avg       0.93      0.92      0.92     46435



**[6 points] e) Train a model that uses the embeddings of these words as inputs. Test again on the same data as in 1d. Generate a classification report and compare the results with the classifier you built in 1d.**
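One possible way to set this up is sketched below, under loud assumptions: `toy_embeddings` is a hand-written dictionary of 3-dimensional vectors standing in for a real pretrained model (which would typically be loaded from disk, for instance with gensim), and the helper `embed` is hypothetical. Tokens without an embedding get a zero vector, which is one common choice but not the only one.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical 3-dimensional embeddings; a real pretrained model would replace this dict
toy_embeddings = {
    'EU': np.array([0.9, 0.1, 0.0]),
    'rejects': np.array([0.0, 0.8, 0.2]),
    'German': np.array([0.7, 0.2, 0.1]),
    'call': np.array([0.1, 0.9, 0.3]),
}
dim = 3


def embed(tokens, embeddings, dim):
    """Stack one vector per token; unknown tokens get a zero vector."""
    return np.array([embeddings.get(tok, np.zeros(dim)) for tok in tokens])


toy_train_tokens = ['EU', 'rejects', 'German', 'call']
toy_train_labels = ['B-ORG', 'O', 'B-MISC', 'O']
toy_test_tokens = ['EU', 'boycott']  # 'boycott' has no embedding here

emb_train_X = embed(toy_train_tokens, toy_embeddings, dim)
emb_test_X = embed(toy_test_tokens, toy_embeddings, dim)

# Dense embedding vectors replace the one-hot rows used in exercise 1d
emb_clf = LinearSVC().fit(emb_train_X, toy_train_labels)
print(emb_clf.predict(emb_test_X))
```

Because the features are now dense vectors, no `DictVectorizer` is needed for the word itself; the gold labels and the evaluation step stay the same as in 1d.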




## [Points: 10] Exercise 2 (NERC): feature inspection using the [Annotated Corpus for Named Entity Recognition](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus)
**[6 points] a. Perform the same steps as in the previous exercise. Make sure you end up for both the training part (*df_train*) and the test part (*df_test*) with:**
* the features representation using **DictVectorizer**
* the NERC labels in a list

Please note that this is the same setup as in the previous exercise:
* load both train and test using:
    * list of dictionaries for features
    * list of NERC labels
* combine train and test features in a list and represent them using one hot encoding
* train using the training features and NERC labels
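Going from a DataFrame to the two lists can be sketched as below. The column names `word`, `pos` and `tag` and the three rows are assumptions for illustration; check `kaggle_dataset.columns` for the actual names in your copy of the data.

```python
import pandas as pd

# Toy frame (hypothetical rows); the column names are assumed, not confirmed
toy_df = pd.DataFrame({
    'word': ['Thousands', 'of', 'London'],
    'pos': ['NNS', 'IN', 'NNP'],
    'tag': ['O', 'O', 'B-geo'],
})

# One feature dictionary per row, same format as in exercise 1
toy_feats = [{'words': w, 'pos': p} for w, p in zip(toy_df['word'], toy_df['pos'])]

# One NERC label per row, in the same order as the features
toy_tags = toy_df['tag'].tolist()
print(toy_feats[2], toy_tags[2])
```

The same two lists can then be built once for `df_train` and once for `df_test` before vectorizing them together.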

In [None]:
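import pandas

##### Adapt the path to point to your local copy of NERC_datasets
path = '/Users/piek/Desktop/ONDERWIJS/data/nerc_datasets/kaggle/ner_v2.csv'
# Skip malformed rows; on_bad_lines requires pandas >= 1.3 (older versions used error_bad_lines=False)
kaggle_dataset = pandas.read_csv(path, on_bad_lines='skip')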

In [None]:
len(kaggle_dataset)

In [None]:
df_train = kaggle_dataset[:100000]
df_test = kaggle_dataset[100000:120000]
print(len(df_train), len(df_test))

## End of this notebook