# **By Timofei Polivanov, Sami Rahali**

# Lab4-Assignment about Named Entity Recognition and Classification

This notebook describes the assignment of Lab 4 of the text mining course. We assume you have successfully completed Lab1, Lab2 and Lab3 as well. Lab2 is especially important for completing this assignment.

**Learning goals**
* going from linguistic input format to representing it in a feature space
* working with pretrained word embeddings
* train a supervised classifier (SVM)
* evaluate a supervised classifier (SVM)
* learn how to interpret the system output and the evaluation results
* be able to propose future improvements based on the observed results


## Credits
This notebook was originally created by [Marten Postma](https://martenpostma.github.io) and [Filip Ilievski](http://ilievski.nl) and adapted by Piek Vossen.

## [Points: 18] Exercise 1 (NERC): Training and evaluating an SVM using CoNLL-2003

**[4 points] a) Load the CoNLL-2003 training data using the *ConllCorpusReader* and create for both *train.txt* and *test.txt*:**

    [2 points]  -a list of dictionaries representing the features for each training instance, e.g.,
    ```
    [
    {'words': 'EU', 'pos': 'NNP'}, 
    {'words': 'rejects', 'pos': 'VBZ'},
    ...
    ]
    ```

    [2 points] -the NERC labels associated with each training instance, e.g.,
    ```
    [
    'B-ORG', 
    'O',
    ....
    ]
    ```

In [1]:
from nltk.corpus.reader import ConllCorpusReader
import numpy as np
import gensim
from sklearn import svm



### Adapt the path to point to the CONLL2003 folder on your local machine
train = ConllCorpusReader(r'nerc_datasets/CONLL2003', 'train.txt', ['words', 'pos', 'ignore', 'chunk'])

# Pretrained word2vec embeddings (300 dimensions); adapt the path to your local copy
word_embedding_model = gensim.models.KeyedVectors.load_word2vec_format(r"../../GoogleNews-vectors-negative300.bin.gz", binary=True)

# Keep (token, label) pairs, skipping empty tokens and document separators
valid_tokens = [(token, ne_label) for token, pos, ne_label in train.iob_words() if token and token != 'DOCSTART']

# One embedding row per token; tokens missing from the embedding model keep an all-zero vector
num_tokens = len(valid_tokens)
input_vectors = np.zeros((num_tokens, 300))
labels = np.empty(num_tokens, dtype=object)

for i, (token, ne_label) in enumerate(valid_tokens):
    if token in word_embedding_model:
        input_vectors[i] = word_embedding_model[token]
    labels[i] = ne_label

In [2]:
### Adapt the path to point to the CONLL2003 folder on your local machine
test = ConllCorpusReader(r'nerc_datasets/CONLL2003', 'test.txt', ['words', 'pos', 'ignore', 'chunk'])

# Keep (token, label) pairs, skipping empty tokens and document separators
test_valid_tokens = [(token, ne_label) for token, pos, ne_label in test.iob_words() if token and token != 'DOCSTART']

# One embedding row per token; tokens missing from the embedding model keep an all-zero vector
test_num_tokens = len(test_valid_tokens)
test_input_vectors = np.zeros((test_num_tokens, 300))
test_labels = np.empty(test_num_tokens, dtype=object)

for i, (token, ne_label) in enumerate(test_valid_tokens):
    if token in word_embedding_model:
        test_input_vectors[i] = word_embedding_model[token]
    test_labels[i] = ne_label

**[2 points] b) provide descriptive statistics about the training and test data:**
* How many instances are in train and test?
* Provide a frequency distribution of the NERC labels, i.e., how many times does each NERC label occur?
* Discuss to what extent the training and test data is balanced (equal amount of instances for each NERC label) and to what extent the training and test data differ?

Tip: you can use the following `Counter` functionality to generate a frequency list of a list:

In [3]:
from collections import Counter 

print('train data:')
print(f'Num of instances: {len(labels)}')
print(f'NERC labels frequency distribution: {Counter(labels)}')

print('test data:')
print(f'Num of instances: {len(test_labels)}')
print(f'NERC labels frequency distribution: {Counter(test_labels)}')

train data:
Num of instances: 203621
NERC labels frequency distribution: Counter({'O': 169578, 'B-LOC': 7140, 'B-PER': 6600, 'B-ORG': 6321, 'I-PER': 4528, 'I-ORG': 3704, 'B-MISC': 3438, 'I-LOC': 1157, 'I-MISC': 1155})
test data:
Num of instances: 46435
NERC labels frequency distribution: Counter({'O': 38323, 'B-LOC': 1668, 'B-ORG': 1661, 'B-PER': 1617, 'I-PER': 1156, 'I-ORG': 835, 'B-MISC': 702, 'I-LOC': 257, 'I-MISC': 216})


### Answer:

Both the training and test datasets are heavily imbalanced: the 'O' label accounts for more than 80% of the instances in both (about 83% in each). The entity labels also differ widely in frequency, from 1,155 (I-MISC) to 7,140 (B-LOC) instances in the training data and from 216 (I-MISC) to 1,668 (B-LOC) in the test data.

However, the relative distribution of the labels is very similar in both datasets, i.e. each class makes up about the same share of the test data as of the training data.

So the datasets are not balanced in terms of class distribution, but they mirror each other well in composition.
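
The imbalance and the similarity between train and test can be checked by turning the counts above into relative frequencies. The following is a minimal sketch that copies the counts printed by the previous cell; the helper name `relative_frequencies` is hypothetical and not part of the assignment code.

```python
from collections import Counter

# Label counts copied from the output of the cell above.
train_counts = Counter({'O': 169578, 'B-LOC': 7140, 'B-PER': 6600, 'B-ORG': 6321,
                        'I-PER': 4528, 'I-ORG': 3704, 'B-MISC': 3438,
                        'I-LOC': 1157, 'I-MISC': 1155})
test_counts = Counter({'O': 38323, 'B-LOC': 1668, 'B-ORG': 1661, 'B-PER': 1617,
                       'I-PER': 1156, 'I-ORG': 835, 'B-MISC': 702,
                       'I-LOC': 257, 'I-MISC': 216})


def relative_frequencies(counts):
    """Return each label's share of all instances (hypothetical helper)."""
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


train_freq = relative_frequencies(train_counts)
test_freq = relative_frequencies(test_counts)

# Print the shares side by side, most frequent training label first.
for label, _ in train_counts.most_common():
    print(f'{label:7s} train={train_freq[label]:.3f} test={test_freq[label]:.3f}')
```

Comparing shares rather than raw counts makes the two datasets directly comparable despite their different sizes.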

**[2 points] c) Concatenate the train and test features (the list of dictionaries) into one list. Load it using the *DictVectorizer*. Afterwards, split it back to training and test.**

Tip: You’ve concatenated train and test into one list and then you’ve applied the DictVectorizer.
The order of the rows is maintained. You can hence use an index (number of training instances) to split the_array back into train and test. Do NOT use `from sklearn.model_selection import train_test_split` here.
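
The idea behind the tip can be illustrated in plain Python. The sketch below is a simplified stand-in for the one-hot encoding that `DictVectorizer` applies to string-valued features, not its actual implementation; the function `one_hot_fit_transform` and the toy feature dictionaries are assumptions made for illustration.

```python
def one_hot_fit_transform(feature_dicts):
    """One-hot encode string-valued feature dicts (simplified stand-in for DictVectorizer)."""
    # Each distinct "feature=value" pair becomes one column.
    columns = sorted({f'{name}={value}' for d in feature_dicts for name, value in d.items()})
    index = {column: i for i, column in enumerate(columns)}
    rows = []
    for d in feature_dicts:
        row = [0] * len(columns)
        for name, value in d.items():
            row[index[f'{name}={value}']] = 1
        rows.append(row)
    return columns, rows


# Toy data, not taken from the corpus files.
train_dicts = [{'token': 'EU', 'pos': 'NNP'}, {'token': 'rejects', 'pos': 'VBZ'}]
test_dicts = [{'token': 'EU', 'pos': 'NNP'}, {'token': 'call', 'pos': 'NN'}]

columns, the_rows = one_hot_fit_transform(train_dicts + test_dicts)

# Row order is preserved, so the number of training instances is the split point.
num_train = len(train_dicts)
train_rows, test_rows = the_rows[:num_train], the_rows[num_train:]
print(columns)
print(train_rows)
print(test_rows)
```

Because the columns are built from the concatenated list, train and test rows share the same feature space, and a token that occurs only in the test part still gets a column.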


In [4]:
from sklearn.feature_extraction import DictVectorizer

In [5]:
train_feature_dicts = []
train_labels = []

for token, pos, ne_label in train.iob_words():
    if token and token != 'DOCSTART':
        train_feature_dicts.append({'token': token, 'pos': pos})
        train_labels.append(ne_label)

test_feature_dicts = []
test_labels = []

for token, pos, ne_label in test.iob_words():
    if token and token != 'DOCSTART':
        test_feature_dicts.append({'token': token, 'pos': pos})
        test_labels.append(ne_label)

all_features = train_feature_dicts + test_feature_dicts

vec = DictVectorizer(sparse=True)
the_array = vec.fit_transform(all_features)

num = len(train_feature_dicts)
X_train = the_array[:num]
X_test = the_array[num:]

In [6]:
X_train.shape

(203621, 27361)

In [7]:
X_test.shape

(46435, 27361)

**[4 points] d) Train the SVM using the train features and labels and evaluate on the test data. Provide a classification report (sklearn.metrics.classification_report).**
Training (*lin_clf.fit*) might take a while. On my computer, it took 1min 53s, which is acceptable. Training models normally takes much longer. If it takes more than 5 minutes, you can use a subset for training. Describe the results:
* Which NERC labels does the classifier perform well on? Why do you think this is the case?
* Which NERC labels does the classifier perform poorly on? Why do you think this is the case?
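
If training takes too long, one way to pick a training subset is to sample row indices reproducibly. This is a minimal sketch on toy data; the helper `subsample_indices` and the toy features and labels are hypothetical.

```python
import random


def subsample_indices(num_instances, num_keep, seed=0):
    """Pick a reproducible random subset of row indices (hypothetical helper)."""
    rng = random.Random(seed)
    keep = min(num_keep, num_instances)
    return sorted(rng.sample(range(num_instances), keep))


# Toy stand-ins for the training features and labels.
toy_features = [[i, i + 1] for i in range(10)]
toy_labels = ['O', 'B-PER', 'I-PER', 'O', 'O', 'B-LOC', 'O', 'O', 'B-ORG', 'O']

idx = subsample_indices(len(toy_labels), num_keep=4)
subset_features = [toy_features[i] for i in idx]
subset_labels = [toy_labels[i] for i in idx]
print(idx, subset_labels)
```

With the sparse matrix from the cells above, the same index list could presumably be used as `X_train[idx]` together with `[train_labels[i] for i in idx]`; that usage is an assumption and is not executed here. Random sampling keeps the label distribution of the subset close to that of the full training data, which simply taking the first rows does not guarantee.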

In [8]:
from sklearn import svm

In [9]:
lin_clf = svm.LinearSVC(dual=False)

In [10]:
lin_clf.fit(X_train, train_labels)



In [11]:
from sklearn.metrics import classification_report

predictions = lin_clf.predict(X_test)
report = classification_report(test_labels, predictions)
print(report)

              precision    recall  f1-score   support

       B-LOC       0.81      0.77      0.79      1668
      B-MISC       0.78      0.66      0.71       702
       B-ORG       0.79      0.52      0.62      1661
       B-PER       0.87      0.44      0.58      1617
       I-LOC       0.62      0.53      0.57       257
      I-MISC       0.59      0.59      0.59       216
       I-ORG       0.66      0.48      0.55       835
       I-PER       0.33      0.87      0.48      1156
           O       0.99      0.98      0.98     38323

    accuracy                           0.92     46435
   macro avg       0.71      0.65      0.65     46435
weighted avg       0.94      0.92      0.92     46435



### Answer:

#### Which NERC labels does the classifier perform well on? Why do you think this is the case?

The labels that perform well are 'O', 'B-LOC', and 'B-MISC', as they have the highest f1-scores (above 0.7). 'O' performs well because it is by far the most frequent label in both datasets, and because non-entity tokens are very distinct from entity tokens. B-LOC and B-MISC probably perform well because these tokens are often capitalized, and the POS tag may provide an additional clue. It also seems easier to detect an entity at its first token than at its later tokens.

#### Which NERC labels does the classifier perform poorly on? Why do you think this is the case?

The labels with poor performance are I-PER, I-ORG, and I-LOC, as they have the lowest f1-scores. For I-PER, the model produces many false positives, as shown by the low precision (0.33). Inside tokens are also harder to identify because they depend on the previous tokens and their context, which the classifier cannot see: each token is represented only by itself and its POS tag. ORG tokens can also overlap with common words; 'Apple', for example, can be the technology company or the fruit, depending on the context. Finally, I-LOC and I-MISC are underrepresented in the dataset.
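
The effect of an unbalanced precision and recall on the f1-score can be worked out directly, since f1 is the harmonic mean of the two. The sketch below uses the precision and recall values from the classification report above; the function name `f1_score_from` is hypothetical.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the classification report above.
i_per_f1 = f1_score_from(0.33, 0.87)
b_per_f1 = f1_score_from(0.87, 0.44)
print(f'I-PER f1: {i_per_f1:.2f}')
print(f'B-PER f1: {b_per_f1:.2f}')
```

The harmonic mean is pulled towards the lower of the two values, which is why I-PER ends up with a low f1-score despite its high recall.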

**[6 points] e) Train a model that uses the embeddings of these words as inputs. Test again on the same data as in 1d. Generate a classification report and compare the results with the classifier you built in 1d.**

In [12]:
lin_clf_1 = svm.LinearSVC(dual=False)

lin_clf_1.fit(input_vectors, labels)

In [13]:
predictions_1 = lin_clf_1.predict(test_input_vectors)
report_1 = classification_report(test_labels, predictions_1)
print(report_1)

              precision    recall  f1-score   support

       B-LOC       0.76      0.80      0.78      1668
      B-MISC       0.72      0.70      0.71       702
       B-ORG       0.69      0.64      0.66      1661
       B-PER       0.75      0.67      0.71      1617
       I-LOC       0.51      0.42      0.46       257
      I-MISC       0.60      0.54      0.57       216
       I-ORG       0.48      0.33      0.39       835
       I-PER       0.59      0.50      0.54      1156
           O       0.97      0.99      0.98     38323

    accuracy                           0.93     46435
   macro avg       0.68      0.62      0.64     46435
weighted avg       0.92      0.93      0.92     46435



### Answer:

Recall improved for all of the B- classes. Precision for the I-PER class increased substantially (0.33 -> 0.59), but its recall decreased substantially as well (0.87 -> 0.50). The I- classes still underperform, and most of them got worse; the f1-score of I-ORG, for example, dropped from 0.55 to 0.39. This may be because embeddings are effective at the word level but not at the sequence level, so the model still can't capture context that depends on the meaning of the sequence as a whole rather than on individual words. An organization name that consists of multiple words is an example of this, and the low scores for all I- classes, especially I-ORG, support this hypothesis.
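
The comparison can be made explicit by computing the per-label change in f1-score between the two classifiers. This is a small sketch that copies the f1-scores from the two classification reports above; the variable names are chosen for illustration.

```python
# F1-scores copied from the two classification reports above.
f1_one_hot = {'B-LOC': 0.79, 'B-MISC': 0.71, 'B-ORG': 0.62, 'B-PER': 0.58,
              'I-LOC': 0.57, 'I-MISC': 0.59, 'I-ORG': 0.55, 'I-PER': 0.48, 'O': 0.98}
f1_embeddings = {'B-LOC': 0.78, 'B-MISC': 0.71, 'B-ORG': 0.66, 'B-PER': 0.71,
                 'I-LOC': 0.46, 'I-MISC': 0.57, 'I-ORG': 0.39, 'I-PER': 0.54, 'O': 0.98}

# Positive values mean the embedding-based classifier scored higher.
deltas = {label: round(f1_embeddings[label] - f1_one_hot[label], 2) for label in f1_one_hot}

# Print the labels from the largest drop to the largest gain.
for label, delta in sorted(deltas.items(), key=lambda item: item[1]):
    print(f'{label:7s} {delta:+.2f}')
```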

## [Points: 10] Exercise 2 (NERC): feature inspection using the [Annotated Corpus for Named Entity Recognition](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus)
**[6 points] a. Perform the same steps as in the previous exercise. Make sure you end up for both the training part (*df_train*) and the test part (*df_test*) with:**
* the features representation using **DictVectorizer**
* the NERC labels in a list

Please note that this is the same setup as in the previous exercise:
* load both train and test using:
    * list of dictionaries for features
    * list of NERC labels
* combine train and test features in a list and represent them using one hot encoding
* train using the training features and NERC labels

In [14]:
import pandas

In [16]:
##### Adapt the path to point to your local copy of NERC_datasets
path = 'nerc_datasets/kaggle/ner_dataset.csv'
kaggle_dataset = pandas.read_csv(path, encoding="latin1")

In [17]:
len(kaggle_dataset)

1048575

In [18]:
data = kaggle_dataset.ffill()

df_train = data[:100000]
df_test = data[100000:120000]
print(len(df_train), len(df_test))

100000 20000


In [21]:
train_feature_dicts = []
train_labels = []

for _, row in df_train.iterrows():
    token = row['Word']
    pos = row['POS']
    ne_label = row['Tag']

    if token and token != 'DOCSTART':
        train_feature_dicts.append({'token': token, 'pos': pos})
        train_labels.append(ne_label)

In [22]:
test_feature_dicts = []
test_labels = []

for _, row in df_test.iterrows():
    token = row['Word']
    pos = row['POS']
    ne_label = row['Tag']

    if token and token != 'DOCSTART':
        test_feature_dicts.append({'token': token, 'pos': pos})
        test_labels.append(ne_label)

In [23]:
all_features = train_feature_dicts + test_feature_dicts

vec = DictVectorizer(sparse=True)
the_array = vec.fit_transform(all_features)

num = len(train_feature_dicts)
X_train = the_array[:num]
X_test = the_array[num:]

**[4 points] b. Train and evaluate the model and provide the classification report:**
* use the SVM to predict NERC labels on the test data
* evaluate the performance of the SVM on the test data

Analyze the performance per NERC label.

In [24]:
lin_clf_2 = svm.LinearSVC(dual=False)

lin_clf_2.fit(X_train, train_labels)

In [25]:
predictions_2 = lin_clf_2.predict(X_test)
report_2 = classification_report(test_labels, predictions_2)
print(report_2)

              precision    recall  f1-score   support

       B-art       0.00      0.00      0.00         4
       B-eve       0.00      0.00      0.00         0
       B-geo       0.80      0.76      0.78       741
       B-gpe       0.96      0.92      0.94       296
       B-nat       1.00      0.50      0.67         8
       B-org       0.64      0.51      0.57       397
       B-per       0.81      0.53      0.64       333
       B-tim       0.91      0.76      0.83       393
       I-art       0.00      0.00      0.00         0
       I-eve       0.00      0.00      0.00         0
       I-geo       0.74      0.50      0.60       156
       I-gpe       1.00      0.50      0.67         2
       I-nat       0.80      1.00      0.89         4
       I-org       0.65      0.44      0.53       321
       I-per       0.42      0.90      0.57       319
       I-tim       0.41      0.08      0.14       108
           O       0.98      0.99      0.99     16918

    accuracy              



### Answer:

Labels with good performance: B-gpe, B-geo, B-tim, O. B-gpe and B-geo probably perform well because they are distinct: location names are often capitalized and do not overlap with common words. B-tim covers tokens related to time, which mostly follow a strict format, like dd-mm-yyyy, and are therefore easy to recognize. O performs well for the same reasons as in the previous tests: these tokens are distinct and overrepresented in the dataset.

Labels with poor performance: I-tim, I-org, I-geo, I-per, B-art, B-eve, I-art, I-eve. Some of these are underrepresented in or missing from the test data (B-art has a support of 4; B-eve, I-art, and I-eve have a support of 0), so they have a precision and recall of 0.0. The rest all belong to the I- classes, which suggests that the model can't handle multi-word phrases, as in the previous tests. There are two reasons for this: the features don't connect a token to the previous/next token, and the linear model itself is too simple for this.

In conclusion, even though the overall accuracy is 0.94, it is skewed by the O class, which is overrepresented in the dataset. The macro average f1-score of 0.52 shows that there are imbalance issues, which is supported by the class frequencies in the test data: some classes are missing entirely, and some have fewer than 100 instances. Additionally, the model doesn't capture the context of a token, as it struggles with multi-word phrases, as explained above.
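
The gap between a macro average and a support-weighted average can be reproduced from the per-label rows of the report. The sketch below copies the f1-scores and supports printed above; because those values are rounded to two decimals, the results are approximations.

```python
# (f1-score, support) per label, copied from the classification report above.
report_rows = {
    'B-art': (0.00, 4), 'B-eve': (0.00, 0), 'B-geo': (0.78, 741), 'B-gpe': (0.94, 296),
    'B-nat': (0.67, 8), 'B-org': (0.57, 397), 'B-per': (0.64, 333), 'B-tim': (0.83, 393),
    'I-art': (0.00, 0), 'I-eve': (0.00, 0), 'I-geo': (0.60, 156), 'I-gpe': (0.67, 2),
    'I-nat': (0.89, 4), 'I-org': (0.53, 321), 'I-per': (0.57, 319), 'I-tim': (0.14, 108),
    'O': (0.99, 16918),
}

total_support = sum(support for _, support in report_rows.values())

# Macro average: every label counts equally, however rare it is.
macro_f1 = sum(f1 for f1, _ in report_rows.values()) / len(report_rows)

# Support-weighted average: frequent labels such as 'O' dominate.
weighted_f1 = sum(f1 * support for f1, support in report_rows.values()) / total_support

print(f'macro f1:    {macro_f1:.2f}')
print(f'weighted f1: {weighted_f1:.2f}')
```

The macro average treats the four labels with an f1-score of 0.00 the same as 'O', while the weighted average is driven almost entirely by the 16,918 'O' instances.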

To improve performance, the issues with the dataset have to be fixed, and some feature engineering is needed, for example adding features that connect inside tokens to the previous/next token. In addition, a more context-aware model, such as a CRF, should be applied.
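
A possible form of the proposed feature engineering is sketched below on a toy sentence. The function `add_context_features`, the feature names `prev_token`, `next_token`, and `is_capitalized`, and the `<START>`/`<END>` placeholders are all assumptions made for illustration; the sketch also treats its input as a single sequence and ignores sentence boundaries.

```python
def add_context_features(tokens, pos_tags):
    """Build one feature dict per token, including its neighbours (hypothetical helper)."""
    feature_dicts = []
    for i, (token, pos) in enumerate(zip(tokens, pos_tags)):
        features = {
            'token': token,
            'pos': pos,
            # Placeholder values mark the start and end of the sequence.
            'prev_token': tokens[i - 1] if i > 0 else '<START>',
            'next_token': tokens[i + 1] if i < len(tokens) - 1 else '<END>',
            # Stored as a string so that it is one-hot encoded like the other features.
            'is_capitalized': str(token[:1].isupper()),
        }
        feature_dicts.append(features)
    return feature_dicts


# Toy sentence, not taken from the dataset.
toy_tokens = ['John', 'Smith', 'visited', 'Paris']
toy_pos = ['NNP', 'NNP', 'VBD', 'NNP']

for features in add_context_features(toy_tokens, toy_pos):
    print(features)
```

Dictionaries of this shape can be passed to the same vectorization step as before, so the classifier at least sees which token precedes an inside token.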

## End of this notebook