# Lab4-Assignment about Named Entity Recognition and Classification

This notebook describes the assignment of Lab 4 of the text mining course. We assume you have successfully completed Lab1, Lab2 and Lab3 as well. Lab2 is especially important for completing this assignment.

**Learning goals**
* going from linguistic input format to representing it in a feature space
* working with pretrained word embeddings
* train a supervised classifier (SVM)
* evaluate a supervised classifier (SVM)
* learn how to interpret the system output and the evaluation results
* be able to propose future improvements based on the observed results


## Credits
This notebook was originally created by [Marten Postma](https://martenpostma.github.io) and [Filip Ilievski](http://ilievski.nl) and adapted by Piek Vossen.

## [Points: 18] Exercise 1 (NERC): Training and evaluating an SVM using CoNLL-2003

**[4 points] a) Load the CoNLL-2003 training data using the *ConllCorpusReader* and create for both *train.txt* and *test.txt*:**

    [2 points] -a list of dictionaries representing the features for each training instance, e.g.,
    ```
    [
    {'words': 'EU', 'pos': 'NNP'}, 
    {'words': 'rejects', 'pos': 'VBZ'},
    ...
    ]
    ```

    [2 points] -the NERC labels associated with each training instance, e.g.,
    ```
    [
    'B-ORG', 
    'O',
    ....
    ]
    ```

In [1]:
from nltk.corpus.reader import ConllCorpusReader
### Adapt the path to point to the CONLL2003 folder on your local machine
train = ConllCorpusReader('/Users/bella/PycharmProjects/TextMining/ba-text-mining/lab_sessions/lab4/nerc_datasets/CONLL2003', 'train.txt', ['words', 'pos', 'ignore', 'chunk'])
training_features = []
training_gold_labels = []

for token, pos, ne_label in train.iob_words():
    a_dict = {
       # add features
        'words': token,
        'pos': pos
    }
    training_features.append(a_dict)
    training_gold_labels.append(ne_label)

In [2]:
### Adapt the path to point to the CONLL2003 folder on your local machine
test = ConllCorpusReader('/Users/bella/PycharmProjects/TextMining/ba-text-mining/lab_sessions/lab4/nerc_datasets/CONLL2003', 'test.txt', ['words', 'pos', 'ignore', 'chunk'])

test_features = []
test_gold_labels = []
for token, pos, ne_label in test.iob_words():
    a_dict = {
        # add features
        'words': token,
        'pos': pos
    }
    test_features.append(a_dict)
    test_gold_labels.append(ne_label)


**[2 points] b) provide descriptive statistics about the training and test data:**
* How many instances are in train and test?
* Provide a frequency distribution of the NERC labels, i.e., how many times does each NERC label occur?
* Discuss to what extent the training and test data is balanced (equal amount of instances for each NERC label) and to what extent the training and test data differ?

Tip: you can use the following `Counter` functionality to generate a frequency list from a list:

In [3]:
from collections import Counter 
import numpy as np
import matplotlib.pyplot as plt

# number of instances
print("Instances in train:")
print(len(training_features))
print("Instances in test:")
print(len(test_features))

#  frequency distribution of the NERC labels

print("\nFrequency distribution in train:")
train_count = Counter(training_gold_labels)
# sort to facilitate comparison
train_count = dict( sorted(train_count.items(), key=lambda x: x[0].lower()) )
print(train_count.keys())
print(train_count.values())
# add proportion for comparison
list_train = list(train_count.values())
total_train = sum(list_train)
print([round(x / total_train, 3) for x in list_train])

print("Frequency distribution in test:")
test_count = Counter(test_gold_labels)
# sort to facilitate comparison
test_count = dict( sorted(test_count.items(), key=lambda x: x[0].lower()) )
print(test_count.keys())
print(test_count.values())
# add proportion for comparison
list_test = list(test_count.values())
total_test = sum(list_test)
print([round(x / total_test, 3) for x in list_test])

Instances in train:
203621
Instances in test:
46435

Frequency distribution in train:
dict_keys(['B-LOC', 'B-MISC', 'B-ORG', 'B-PER', 'I-LOC', 'I-MISC', 'I-ORG', 'I-PER', 'O'])
dict_values([7140, 3438, 6321, 6600, 1157, 1155, 3704, 4528, 169578])
[0.035, 0.017, 0.031, 0.032, 0.006, 0.006, 0.018, 0.022, 0.833]
Frequency distribution in test:
dict_keys(['B-LOC', 'B-MISC', 'B-ORG', 'B-PER', 'I-LOC', 'I-MISC', 'I-ORG', 'I-PER', 'O'])
dict_values([1668, 702, 1661, 1617, 257, 216, 835, 1156, 38323])
[0.036, 0.015, 0.036, 0.035, 0.006, 0.005, 0.018, 0.025, 0.825]


### Discussion 1b
The training data contains 203621 instances and the test data 46435. The frequency distribution of the training data shows that the large majority of tokens are not part of a named entity (O, proportion 0.833). The least common labels are I-MISC and I-LOC (1155 and 1157 instances, about 0.006 each), i.e. the non-initial tokens of miscellaneous and location entities. Thus, the training data has a very uneven frequency distribution across the labels. The test data is similarly uneven: the most common label is again O (0.825), which is over 100 times more frequent than the least common label, I-MISC (216 instances, 0.005), followed by I-LOC (257 instances).

As can be seen in the proportions above, the distributions in the training and test data are relatively similar. There are some small differences, e.g. the proportion of B-MISC is slightly higher in the training data (0.017 vs. 0.015), whereas B-ORG is more common in the test data (0.036 vs. 0.031).
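The imbalance can be quantified directly from the counts printed above. The following is a small sketch that uses only those printed counts; the helper name `imbalance` is introduced here for illustration.

```python
labels = ['B-LOC', 'B-MISC', 'B-ORG', 'B-PER', 'I-LOC', 'I-MISC', 'I-ORG', 'I-PER', 'O']
# Counts copied from the frequency output above
train_counts = dict(zip(labels, [7140, 3438, 6321, 6600, 1157, 1155, 3704, 4528, 169578]))
test_counts = dict(zip(labels, [1668, 702, 1661, 1617, 257, 216, 835, 1156, 38323]))


def imbalance(counts):
    """Return the most common label, the least common label, and their count ratio."""
    most = max(counts, key=counts.get)
    least = min(counts, key=counts.get)
    return most, least, counts[most] / counts[least]


for name, counts in [('train', train_counts), ('test', test_counts)]:
    most, least, ratio = imbalance(counts)
    print(f"{name}: {most} is {ratio:.0f} times more frequent than {least}")
```

In both splits the most common label is O and the least common is I-MISC, with a ratio well above 100.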

**[2 points] c) Concatenate the train and test features (the list of dictionaries) into one list. Load it using the *DictVectorizer*. Afterwards, split it back to training and test.**

Tip: You’ve concatenated train and test into one list and then you’ve applied the DictVectorizer.
The order of the rows is maintained. You can hence use an index (number of training instances) to split the_array back into train and test. Do NOT use `from sklearn.model_selection import train_test_split` here.
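The idea in the tip can be tried on a toy example first. This is a minimal sketch with made-up feature dictionaries (`toy_train` and `toy_test` are hypothetical, not taken from CoNLL-2003); it only illustrates that fitting on the combined list gives train and test the same columns and that slicing by the number of training instances recovers the split.

```python
from sklearn.feature_extraction import DictVectorizer

# Toy feature dictionaries (hypothetical data for illustration only)
toy_train = [{'words': 'EU', 'pos': 'NNP'}, {'words': 'rejects', 'pos': 'VBZ'}]
toy_test = [{'words': 'German', 'pos': 'JJ'}]

toy_vec = DictVectorizer()
# One-hot encode train and test together so both share the same feature columns
toy_combined = toy_vec.fit_transform(toy_train + toy_test)

# Row order is preserved, so the first len(toy_train) rows are the training rows
n_train = len(toy_train)
toy_train_input = toy_combined[:n_train]
toy_test_input = toy_combined[n_train:]

print(toy_train_input.shape, toy_test_input.shape)
print(sorted(toy_vec.vocabulary_))
```

Each distinct `feature=value` pair becomes one column, so a word that occurs only in the test list still gets a column. That column is simply all zeros in the training rows.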


In [4]:
from sklearn.feature_extraction import DictVectorizer

In [5]:
vec = DictVectorizer()
features_combined = training_features + test_features
train_and_test = vec.fit_transform(features_combined)#.toarray()
train_input = train_and_test[:len(training_features)]
test_input = train_and_test[len(training_features):]

**[4 points] d) Train the SVM using the train features and labels and evaluate on the test data. Provide a classification report (sklearn.metrics.classification_report).**
Training (*lin_clf.fit*) might take a while. On my computer, it took 1min 53s, which is acceptable. Training models normally takes much longer. If it takes more than 5 minutes, you can use a subset for training. Describe the results:
* Which NERC labels does the classifier perform well on? Why do you think this is the case?
* Which NERC labels does the classifier perform poorly on? Why do you think this is the case?

In [6]:
from sklearn import svm

In [7]:
lin_clf = svm.LinearSVC()

In [8]:
lin_clf.fit(train_input, training_gold_labels)

In [9]:
test_pred = lin_clf.predict(test_input)
from sklearn.metrics import classification_report
report = classification_report(test_gold_labels, test_pred)
print(report)

              precision    recall  f1-score   support

       B-LOC       0.81      0.78      0.79      1668
      B-MISC       0.78      0.66      0.72       702
       B-ORG       0.79      0.52      0.63      1661
       B-PER       0.86      0.44      0.58      1617
       I-LOC       0.62      0.53      0.57       257
      I-MISC       0.57      0.59      0.58       216
       I-ORG       0.70      0.47      0.56       835
       I-PER       0.33      0.87      0.48      1156
           O       0.98      0.98      0.98     38323

    accuracy                           0.92     46435
   macro avg       0.72      0.65      0.65     46435
weighted avg       0.94      0.92      0.92     46435



### Discussion 1d
* The classifier performs well on the labels O and B-LOC. O is by far the most common label in the data and B-LOC is the most common entity label, so the classifier sees many examples of both and is able to predict them well. The F1-score for the O label is 0.98 and for B-LOC it is 0.79. It is best to look at the F1-score because of the class imbalance seen in the frequency distribution above.
* The classifier performs poorly on the labels I-PER and I-ORG (F1-scores of 0.48 and 0.56). The poor result for I-PER is relatively unexpected, because one would expect the two least represented classes, I-MISC and I-LOC, to be the ones predicted worst. I-PER has a very high recall (0.87) but a low precision (0.33), so it is being over-predicted: many tokens that are not I-PER receive that label, which pulls the F1-score down. B-PER has a low recall (0.44), so possibly part of the over-prediction comes from person tokens for which the word and POS features alone cannot separate the B- from the I- label. I-MISC and I-LOC are predicted less often relative to their frequency, as seen in their lower recall scores (0.59 and 0.53).
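The relation between precision, recall and F1 for I-PER can be checked with a little arithmetic. The sketch below uses the rounded values from the report above, so the estimated number of predictions is approximate; the helper `f1` is defined here for illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded I-PER values copied from the classification report above
i_per_precision, i_per_recall, i_per_support = 0.33, 0.87, 1156

print(round(f1(i_per_precision, i_per_recall), 2))

# recall = TP / gold instances, precision = TP / predicted instances
true_positives = i_per_recall * i_per_support
predicted = true_positives / i_per_precision
print(round(predicted), "predicted I-PER tokens vs.", i_per_support, "gold I-PER tokens")
```

Under these rounded values the classifier assigns I-PER to roughly two and a half times as many tokens as actually carry that label, which is what the combination of high recall and low precision indicates.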

**[6 points] e) Train a model that uses the embeddings of these words as inputs. Test again on the same data as in 1d. Generate a classification report and compare the results with the classifier you built in 1d.**
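The lookup used for this exercise maps each token to its embedding and falls back to a zero vector for tokens that are not in the embedding vocabulary. A minimal sketch with a toy 3-dimensional dictionary follows; `toy_embeddings` and `embed` are hypothetical stand-ins, whereas the notebook itself uses the 300-dimensional Word2Vec model.

```python
# Toy embeddings with made-up values (the real model has 300 dimensions)
toy_embeddings = {
    'EU': [0.1, 0.2, 0.3],
    'rejects': [0.0, -0.5, 0.4],
}
DIM = 3


def embed(token, embeddings, dim):
    """Return the token's vector, or a zero vector if the token is out of vocabulary."""
    return embeddings.get(token, [0.0] * dim)


tokens = ['EU', 'rejects', 'Brexit']
vectors = [embed(t, toy_embeddings, DIM) for t in tokens]
out_of_vocabulary = [t for t in tokens if t not in toy_embeddings]

print(vectors)
print("OOV tokens:", out_of_vocabulary)
```

All out-of-vocabulary tokens share the same zero vector, so the classifier cannot tell them apart. Counting them gives an idea of how much information is lost.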

In [10]:
import gensim
word_embedding_model = gensim.models.KeyedVectors.load_word2vec_format('/Users/bella/PycharmProjects/TextMining/ba-text-mining/lab_sessions/lab2/GoogleNews-vectors-negative300.bin.gz', binary=True)
# adapt the path to point to the local copy of the nerc_datasets folder
train = ConllCorpusReader('/Users/bella/PycharmProjects/TextMining/ba-text-mining/lab_sessions/lab4/nerc_datasets/CONLL2003',
                          'train.txt', # this will load the file 'train.txt', for the exercise you also need to load 'test.txt'
                          ['words', 'pos', 'ignore', 'chunk'])

input_vectors=[]
labels=[]
for token, pos, ne_label in train.iob_words():

    if token!='' and token!='DOCSTART':
        if token in word_embedding_model:
            vector=word_embedding_model[token]
        else:
            vector=[0]*300
        input_vectors.append(vector)
        labels.append(ne_label)



In [11]:
lin_clf2 = svm.LinearSVC()
lin_clf2.fit(input_vectors, labels)

In [12]:
test = ConllCorpusReader('/Users/bella/PycharmProjects/TextMining/ba-text-mining/lab_sessions/lab4/nerc_datasets/CONLL2003',
                         'test.txt', # this will load the file 'test.txt'
                         ['words', 'pos', 'ignore', 'chunk'])

test_input_vectors=[]
test_labels=[]
for token, pos, ne_label in test.iob_words():

    if token!='' and token!='DOCSTART':
        if token in word_embedding_model:
            vector=word_embedding_model[token]
        else:
            # out-of-vocabulary token: fall back to a zero vector
            vector=[0]*300
        test_input_vectors.append(vector)
        test_labels.append(ne_label)



In [13]:
test_pred2 = lin_clf2.predict(test_input_vectors)
report2 = classification_report(test_labels, test_pred2)
print(report2)

              precision    recall  f1-score   support

       B-LOC       0.76      0.80      0.78      1668
      B-MISC       0.72      0.70      0.71       702
       B-ORG       0.69      0.64      0.66      1661
       B-PER       0.75      0.67      0.71      1617
       I-LOC       0.51      0.42      0.46       257
      I-MISC       0.60      0.54      0.57       216
       I-ORG       0.48      0.33      0.39       835
       I-PER       0.59      0.50      0.54      1156
           O       0.97      0.99      0.98     38323

    accuracy                           0.93     46435
   macro avg       0.68      0.62      0.64     46435
weighted avg       0.92      0.93      0.92     46435



### Discussion 1e
The results of the classifiers we built in 1d and 1e are relatively similar. In 1e we used pretrained word embeddings instead of one-hot features. The weighted average F1-score is 0.92 for both classifiers, and the accuracy of the embedding classifier is marginally better, with a score of 0.93 compared to 0.92 in 1d. However, the macro averages of precision, recall and F1 are lower in 1e than in 1d (0.68, 0.62 and 0.64 vs. 0.72, 0.65 and 0.65), and the weighted average precision is also lower in 1e (0.92 vs. 0.94), so overall the classifier in 1d performed marginally better. The weighted average recall, which equals the accuracy, was 0.01 higher in 1e than in 1d. This is because the classifier in 1e has clearly better recall on labels such as B-MISC (0.70 instead of 0.66) and B-ORG (0.64 instead of 0.52). The decrease in the precision averages in 1e compared to 1d is most likely due to the fact that the classifier in 1e predicted I-ORG less reliably: its precision for that label is 0.48 compared to 0.70 in 1d, and its recall also dropped from 0.47 to 0.33.
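The difference between the macro and the weighted average can be reproduced from the per-label F1-scores and supports in the two reports. The sketch below uses the rounded values from those reports; the helper names `macro_avg` and `weighted_avg` are introduced here for illustration.

```python
# Per-label support and rounded F1-scores copied from the two reports
# Label order: B-LOC, B-MISC, B-ORG, B-PER, I-LOC, I-MISC, I-ORG, I-PER, O
support = [1668, 702, 1661, 1617, 257, 216, 835, 1156, 38323]
f1_1d = [0.79, 0.72, 0.63, 0.58, 0.57, 0.58, 0.56, 0.48, 0.98]
f1_1e = [0.78, 0.71, 0.66, 0.71, 0.46, 0.57, 0.39, 0.54, 0.98]


def macro_avg(scores):
    """Unweighted mean: every label counts equally."""
    return sum(scores) / len(scores)


def weighted_avg(scores, weights):
    """Mean weighted by support: frequent labels dominate."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


for name, scores in [('1d', f1_1d), ('1e', f1_1e)]:
    print(name, round(macro_avg(scores), 2), round(weighted_avg(scores, support), 2))

print("share of O in the weights:", round(support[-1] / sum(support), 3))
```

Because O accounts for more than 80% of the test tokens, the weighted average stays at 0.92 for both classifiers, while the macro average shows the differences on the entity labels.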


## [Points: 10] Exercise 2 (NERC): feature inspection using the [Annotated Corpus for Named Entity Recognition](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus)
**[6 points] a. Perform the same steps as in the previous exercise. Make sure you end up for both the training part (*df_train*) and the test part (*df_test*) with:**
* the features representation using **DictVectorizer**
* the NERC labels in a list

Please note that this is the same setup as in the previous exercise:
* load both train and test using:
    * list of dictionaries for features
    * list of NERC labels
* combine train and test features in a list and represent them using one hot encoding
* train using the training features and NERC labels

In [1]:
import pandas

In [2]:
##### Adapt the path to point to your local copy of NERC_datasets
path = '/Users/bella/PycharmProjects/TextMining/ba-text-mining/lab_sessions/lab4/nerc_datasets/kaggle/ner_v2.csv'
# error_bad_lines is deprecated since pandas 1.3 and removed in pandas 2.0; there, use on_bad_lines='skip' instead
kaggle_dataset = pandas.read_csv(path, error_bad_lines=False)



  kaggle_dataset = pandas.read_csv(path, error_bad_lines=False)
Skipping line 281837: expected 25 fields, saw 34



In [3]:
len(kaggle_dataset)

1050795

In [10]:
kaggle_dataset["word"][2]
kaggle_dataset["pos"][2]
kaggle_dataset["tag"][2]

'O'

In [11]:
df_train = kaggle_dataset[:100000]
df_test = kaggle_dataset[100000:120000]
print(len(df_train), len(df_test))

100000 20000


In [12]:
from nltk.corpus.reader import ConllCorpusReader
### Adapt the path to point to the CONLL2003 folder on your local machine
#train = ConllCorpusReader('C:\\Users\\Gebruiker\\ba-text-mining-1\\lab_sessions\\lab4\\CONLL2003\\CONLL2003', 'train.txt', ['words', 'pos', 'ignore', 'chunk'])

training_features = []
training_gold_labels = []

for i in range(len(df_train)):
    token = df_train["word"][i]
    pos = df_train["pos"][i]
    tag = df_train["tag"][i]
    a_dict = {
       # add features
        'word': token,
        'pos': pos
    }
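    training_features.append(a_dict)
    # also store the gold label; without this line the label list stays empty
    training_gold_labels.append(tag)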

In [13]:
df_test.reset_index(inplace=True, drop=True)
df_test

Unnamed: 0,id,lemma,next-lemma,next-next-lemma,next-next-pos,next-next-shape,next-next-word,next-pos,next-shape,next-word,...,prev-prev-lemma,prev-prev-pos,prev-prev-shape,prev-prev-word,prev-shape,prev-word,sentence_idx,shape,word,tag
0,100000,"""",death,to,TO,lowercase,to,NN,capitalized,Death,...,demonstr,NNS,capitalized,Demonstrators,lowercase,chanting,4544.0,punct,"""",O
1,100001,death,to,america,NNP,capitalized,America,TO,lowercase,to,...,chant,VBG,lowercase,chanting,punct,"""",4544.0,capitalized,Death,O
2,100002,to,america,"""",``,punct,"""",NNP,capitalized,America,...,"""",``,punct,"""",capitalized,Death,4544.0,lowercase,to,O
3,100003,america,"""",march,VBD,lowercase,marched,``,punct,"""",...,death,NN,capitalized,Death,lowercase,to,4544.0,capitalized,America,B-geo
4,100004,"""",march,through,IN,lowercase,through,VBD,lowercase,marched,...,to,TO,lowercase,to,capitalized,America,4544.0,punct,"""",I-geo
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19995,119995,reject,a,packag,NN,lowercase,package,DT,lowercase,a,...,voter,NNS,lowercase,voters,lowercase,narrowly,5469.0,lowercase,rejected,O
19996,119996,a,packag,of,IN,lowercase,of,NN,lowercase,package,...,narrowli,RB,lowercase,narrowly,lowercase,rejected,5469.0,lowercase,a,O
19997,119997,packag,of,measur,NNS,lowercase,measures,IN,lowercase,of,...,reject,VBD,lowercase,rejected,lowercase,a,5469.0,lowercase,package,O
19998,119998,of,measur,includ,VBG,lowercase,including,NNS,lowercase,measures,...,a,DT,lowercase,a,lowercase,package,5469.0,lowercase,of,O


In [14]:
from nltk.corpus.reader import ConllCorpusReader
### Adapt the path to point to the CONLL2003 folder on your local machine
#train = ConllCorpusReader('C:\\Users\\Gebruiker\\ba-text-mining-1\\lab_sessions\\lab4\\CONLL2003\\CONLL2003', 'train.txt', ['words', 'pos', 'ignore', 'chunk'])

test_features = []
test_gold_labels = []

for i in range(len(df_test)):
    token = df_test["word"][i]
    pos = df_test["pos"][i]
    tag = df_test["tag"][i]
    a_dict = {
       # add features
        'word': token,
        'pos': pos
    }
    test_features.append(a_dict)
    test_gold_labels.append(tag)

In [15]:
from collections import Counter
import numpy as np
import matplotlib.pyplot as plt

# number of instances
print("Instances in train:")
print(len(training_features))
print("Instances in test:")
print(len(test_features))

#  frequency distribution of the NERC labels

print("\nFrequency distribution in train:")
train_count = Counter(training_gold_labels)
# sort to facilitate comparison
train_count = dict( sorted(train_count.items(), key=lambda x: x[0].lower()) )
print(train_count.keys())
print(train_count.values())
# add proportion for comparison
list_train = list(train_count.values())
total_train = sum(list_train)
print([round(x / total_train, 3) for x in list_train])

print("Frequency distribution in test:")
test_count = Counter(test_gold_labels)
# sort to facilitate comparison
test_count = dict( sorted(test_count.items(), key=lambda x: x[0].lower()) )
print(test_count.keys())
print(test_count.values())
# add proportion for comparison
list_test = list(test_count.values())
total_test = sum(list_test)
print([round(x / total_test, 3) for x in list_test])

Instances in train:
100000
Instances in test:
20000

Frequency distribution in train:
dict_keys([])
dict_values([])
[]
Frequency distribution in test:
dict_keys(['B-art', 'B-geo', 'B-gpe', 'B-nat', 'B-org', 'B-per', 'B-tim', 'I-geo', 'I-gpe', 'I-nat', 'I-org', 'I-per', 'I-tim', 'O'])
dict_values([4, 741, 296, 8, 397, 333, 393, 156, 2, 4, 321, 319, 108, 16918])
[0.0, 0.037, 0.015, 0.0, 0.02, 0.017, 0.02, 0.008, 0.0, 0.0, 0.016, 0.016, 0.005, 0.846]


In [17]:
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()
features_combined = training_features + test_features
train_and_test = vec.fit_transform(features_combined)  #.toarray()
train_input = train_and_test[:len(training_features)]
test_input = train_and_test[len(training_features):]


In [19]:
from sklearn import svm
lin_clf = svm.LinearSVC()
lin_clf.fit(train_input, training_gold_labels)

ValueError: Found input variables with inconsistent numbers of samples: [100000, 0]
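The `ValueError` above reports 100000 feature rows against 0 labels, which is what `fit` raises when the label list is empty at the time it is called. One way to guard against this is to build features and labels in the same pass and check their lengths before vectorizing. The following is a minimal sketch: `build_instances` is an illustrative helper, and `rows` is a hypothetical stand-in for the output of `df_train.to_dict('records')`.

```python
def build_instances(rows):
    """Build feature dicts and gold labels in one pass so the two lists stay aligned."""
    features, labels = [], []
    for row in rows:
        features.append({'word': row['word'], 'pos': row['pos']})
        labels.append(row['tag'])
    # Fail early, before vectorizing or training, if the lists drifted apart
    assert len(features) == len(labels), "features and labels must have the same length"
    return features, labels


# Hypothetical stand-in for df_train.to_dict('records')
rows = [
    {'word': 'Death', 'pos': 'NN', 'tag': 'O'},
    {'word': 'to', 'pos': 'TO', 'tag': 'O'},
    {'word': 'America', 'pos': 'NNP', 'tag': 'B-geo'},
]

features, labels = build_instances(rows)
print(len(features), len(labels))
```

The same helper can be applied to the test rows, which avoids maintaining two nearly identical loops.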

**[4 points] b. Train and evaluate the model and provide the classification report:**
* use the SVM to predict NERC labels on the test data
* evaluate the performance of the SVM on the test data

Analyze the performance per NERC label.
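For the per-label analysis it helps to recall how precision and recall are derived from the gold and predicted labels. The sketch below computes both by hand on a made-up example; `gold` and `pred` are hypothetical toy lists, not model output, and the two helper functions are defined here for illustration.

```python
from collections import Counter

# Toy gold and predicted labels (hypothetical, for illustration only)
gold = ['O', 'B-geo', 'I-geo', 'O', 'B-per', 'O']
pred = ['O', 'B-geo', 'O', 'O', 'B-geo', 'O']

# Per label: correct predictions, number of predictions, number of gold instances
true_positives = Counter(g for g, p in zip(gold, pred) if g == p)
pred_count = Counter(pred)
gold_count = Counter(gold)


def precision(label):
    """Fraction of predictions of this label that are correct (0.0 if never predicted)."""
    return true_positives[label] / pred_count[label] if pred_count[label] else 0.0


def recall(label):
    """Fraction of gold instances of this label that were found (0.0 if absent from gold)."""
    return true_positives[label] / gold_count[label] if gold_count[label] else 0.0


for label in sorted(gold_count):
    print(label, round(precision(label), 2), round(recall(label), 2))
```

In this toy example B-geo has perfect recall but a precision of 0.5 because it is also predicted for a B-per token, while I-geo and B-per are never predicted at all.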

## End of this notebook