# Named Entity Recognition and Classification


## (NERC): Training and evaluating an SVM using CoNLL-2003

In [1]:
from nltk.corpus.reader import ConllCorpusReader
### Adapt the path to point to the CONLL2003 folder on your local machine
train = ConllCorpusReader('', 'train.txt', ['words', 'pos', 'ignore', 'chunk'])
training_features = []
training_gold_labels = []

for token, pos, ne_label in train.iob_words():
    a_dict = {
        'words':token,
        'pos':pos                
    }
    training_features.append(a_dict)      #eg:NN
    training_gold_labels.append(ne_label)    #eg: B_ORG
    

In [2]:
### Adapt the path to point to the CONLL2003 folder on your local machine
test = ConllCorpusReader('/Users/gergonagy/Desktop/Text Mining for AI/ba-text-mining-master/lab_sessions/lab4', 'test.txt', ['words', 'pos', 'ignore', 'chunk'])

test_features = []
test_gold_labels = []

for token, pos, ne_label in test.iob_words():
    a_dict = {
        'words':token,
        'pos':pos
    }
    test_features.append(a_dict)
    test_gold_labels.append(ne_label)

In [3]:
from collections import Counter
Counter(training_gold_labels)

Counter({'O': 169578,
         'B-LOC': 7140,
         'B-PER': 6600,
         'B-ORG': 6321,
         'I-PER': 4528,
         'I-ORG': 3704,
         'B-MISC': 3438,
         'I-LOC': 1157,
         'I-MISC': 1155})

The label 'O' (Outside of named entities) significantly outnumbers all other labels with 169,578 instances. 
This is common in NER datasets because a large portion of text does not belong to any named entity.
The next most frequent labels are 'B-LOC', 'B-PER, and 'B-ORG' with several thousand instances each, but still far less than 'O'.
The least common labels are 'I-MISC'and 'I-LOC' , indicating these specific types of entities are less frequently mentioned or are part of longer named entities.

In [4]:
Counter(test_gold_labels)

Counter({'O': 38323,
         'B-LOC': 1668,
         'B-ORG': 1661,
         'B-PER': 1617,
         'I-PER': 1156,
         'I-ORG': 835,
         'B-MISC': 702,
         'I-LOC': 257,
         'I-MISC': 216})

The test data shows a similar pattern, with 'O' being overwhelmingly the most common label. However, the total number of instances for each label is proportionally smaller compared to the training data.
The distribution among entity labels in the test data mirrors that of the training data, with 'B-LOC', 'B-ORG', and 'B-PER'being more common and 'I-MISC' and 'I-LOC' being less common.

 The proportions of label occurrences between the training and test data are relatively consistent, suggesting that the test set is representative of the training set in terms of label distribution.
Neither the training nor the test data is balanced, with a significant skew towards the 'O' label. This imbalance reflects the nature of natural language, where named entities constitute a small portion of the text.


In [5]:
from sklearn.feature_extraction import DictVectorizer

In [6]:
merged_list = test_features + training_features #the pos tag lists
vec = DictVectorizer()
the_array = vec.fit_transform(merged_list) # one-hot encoded matrix


In [7]:
# splitting back, based on len
test_onehot = the_array[:len(test_features)]
training_onehot = the_array[len(test_features):]

In [8]:
test_onehot

<46435x27361 sparse matrix of type '<class 'numpy.float64'>'
	with 92870 stored elements in Compressed Sparse Row format>

In [9]:
training_onehot

<203621x27361 sparse matrix of type '<class 'numpy.float64'>'
	with 407242 stored elements in Compressed Sparse Row format>

In [10]:
from sklearn import svm

In [11]:
lin_clf = svm.LinearSVC()

In [12]:
lin_clf.fit(training_onehot,training_gold_labels)



In [13]:
pred = lin_clf.predict(test_onehot)
pred

array(['O', 'O', 'I-PER', ..., 'O', 'B-PER', 'O'], dtype='<U6')

In [14]:
import sklearn
from sklearn import metrics
from sklearn.metrics import classification_report

In [15]:
print(classification_report(test_gold_labels,pred))

              precision    recall  f1-score   support

       B-LOC       0.81      0.78      0.79      1668
      B-MISC       0.78      0.66      0.72       702
       B-ORG       0.79      0.52      0.63      1661
       B-PER       0.86      0.44      0.58      1617
       I-LOC       0.62      0.53      0.57       257
      I-MISC       0.57      0.59      0.58       216
       I-ORG       0.70      0.47      0.56       835
       I-PER       0.33      0.87      0.48      1156
           O       0.98      0.98      0.98     38323

    accuracy                           0.92     46435
   macro avg       0.72      0.65      0.65     46435
weighted avg       0.94      0.92      0.92     46435



The classifier performs the best on the following entity labels based on precision: B-LOC, B-MISC, B-ORG and B-PER, with the location showing high values for both recall and precision, in the case of organization, person and miscellaneous entities the results show that the classifier has low recall values with high precision values, meaning that the classifier is not good at finding all of the entities of those types but it identifies them accurately once found. The highest recall and precision values belong to the O(Outside) tokens.

The classifier performs poorly on the I- (Inside) tags of tokens, indicating that the classifier is inaccurate in detecting the inside parts or continuation of named entity tokens.

The overall accuracy of the model is 0.92, with macro average scores around 0.65, which means the performance of the classifier drops when considering each entity type with equal weight.

In [16]:
lin_clf_emb = svm.LinearSVC()

In [17]:
import gensim

In [18]:
word_embedding_model = gensim.models.KeyedVectors.load_word2vec_format('/Users/gergonagy/Desktop/Text Mining for AI/GoogleNews-vectors-negative300.bin.gz', binary=True)  

In [19]:
training_emb = []
for token, pos, ne_label in train.iob_words():
    training_emb.append(token)


In [22]:
training_inp_emb=[]
for i in training_emb:
    if token in word_embedding_model:
        vector=word_embedding_model[token]
    else:
        vector=[0]*300
    training_inp_emb.append(vector)

In [23]:
test_emb = []
for token, pos, ne_label in test.iob_words():
    test_emb.append(token)

In [28]:
test_inp_emb=[]
for i in test_emb:
    if token in word_embedding_model:
        vector=word_embedding_model[token]
    else:
        vector=[0]*300
    test_inp_emb.append(vector)

In [107]:
lin_clf_emb.fit(training_inp_emb,training_gold_labels)



In [108]:
pred_2 = lin_clf_emb.predict(test_inp_emb)
pred_2

array(['O', 'O', 'O', ..., 'O', 'O', 'O'], dtype='<U6')

In [109]:
sklearn.metrics.classification_report(test_gold_labels,pred_2)

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


'              precision    recall  f1-score   support\n\n       B-LOC       0.00      0.00      0.00      1668\n      B-MISC       0.00      0.00      0.00       702\n       B-ORG       0.00      0.00      0.00      1661\n       B-PER       0.00      0.00      0.00      1617\n       I-LOC       0.00      0.00      0.00       257\n      I-MISC       0.00      0.00      0.00       216\n       I-ORG       0.00      0.00      0.00       835\n       I-PER       0.00      0.00      0.00      1156\n           O       0.83      1.00      0.90     38323\n\n    accuracy                           0.83     46435\n   macro avg       0.09      0.11      0.10     46435\nweighted avg       0.68      0.83      0.75     46435\n'

## (NERC): feature inspection using the [Annotated Corpus for Named Entity Recognition](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus)


In [29]:
import pandas

In [42]:
##### Adapt the path to point to your local copy of NERC_datasets
path = '/Users/tothhannapanna/Desktop/Text Mining/ba-text-mining-master/lab_sessions/lab4/kaggle/ner_v2.csv'
kaggle_dataset = pandas.read_csv(path, error_bad_lines=False)




  kaggle_dataset = pandas.read_csv(path, error_bad_lines=False)
b'Skipping line 281837: expected 25 fields, saw 34\n'


In [55]:
len(kaggle_dataset)
df_train.head()

Unnamed: 0,id,lemma,next-lemma,next-next-lemma,next-next-pos,next-next-shape,next-next-word,next-pos,next-shape,next-word,...,prev-prev-lemma,prev-prev-pos,prev-prev-shape,prev-prev-word,prev-shape,prev-word,sentence_idx,shape,word,tag
0,0,thousand,of,demonstr,NNS,lowercase,demonstrators,IN,lowercase,of,...,__start2__,__START2__,wildcard,__START2__,wildcard,__START1__,1.0,capitalized,Thousands,O
1,1,of,demonstr,have,VBP,lowercase,have,NNS,lowercase,demonstrators,...,__start1__,__START1__,wildcard,__START1__,capitalized,Thousands,1.0,lowercase,of,O
2,2,demonstr,have,march,VBN,lowercase,marched,VBP,lowercase,have,...,thousand,NNS,capitalized,Thousands,lowercase,of,1.0,lowercase,demonstrators,O
3,3,have,march,through,IN,lowercase,through,VBN,lowercase,marched,...,of,IN,lowercase,of,lowercase,demonstrators,1.0,lowercase,have,O
4,4,march,through,london,NNP,capitalized,London,IN,lowercase,through,...,demonstr,NNS,lowercase,demonstrators,lowercase,have,1.0,lowercase,marched,O


In [44]:
df_train = kaggle_dataset[:100000]
df_test = kaggle_dataset[100000:120000]
print(len(df_train), len(df_test))

100000 20000


In [59]:
def extract_features(df):
    features = []
    exclude_columns = ['tag', 'sentence_idx', 'id']            # tag -> different list, sentence_idx and id -> not necessary
    for index, row in df.iterrows():
        feature = {col: row[col] for col in df.columns if col not in exclude_columns}
        features.append(feature)
    return features

def extract_labels(df):
    return df['tag'].tolist()                    # extracts the 'tag' column from the dataframe


train_features = extract_features(df_train)
train_labels = extract_labels(df_train)
test_features = extract_features(df_test)
test_labels = extract_labels(df_test)


In [60]:
from sklearn.feature_extraction import DictVectorizer

# combining train and test to vectorize
vec = DictVectorizer()
all_features = train_features + test_features  
all_features_vectorized = vec.fit_transform(all_features)

# splitting the back into train and test sets
train_features_vectorized = all_features_vectorized[:len(train_features)]
test_features_vectorized = all_features_vectorized[len(train_features):]


In [62]:
#from sklearn import svm
#from sklearn.metrics import classification_report

# training 
lin_clf = svm.LinearSVC()
lin_clf.fit(train_features_vectorized, train_labels)

# predicting on the test set
pred = lin_clf.predict(test_features_vectorized)

# eval
print(classification_report(test_labels, pred))




              precision    recall  f1-score   support

       B-art       0.00      0.00      0.00         4
       B-eve       0.00      0.00      0.00         0
       B-geo       0.87      0.84      0.85       741
       B-gpe       0.87      0.94      0.90       296
       B-nat       0.80      0.50      0.62         8
       B-org       0.73      0.66      0.70       397
       B-per       0.81      0.81      0.81       333
       B-tim       0.93      0.84      0.88       393
       I-geo       0.97      0.96      0.97       156
       I-gpe       0.67      1.00      0.80         2
       I-nat       1.00      1.00      1.00         4
       I-org       0.95      0.93      0.94       321
       I-per       0.95      0.98      0.96       319
       I-tim       0.98      0.87      0.92       108
           O       0.99      0.99      0.99     16918

    accuracy                           0.97     20000
   macro avg       0.77      0.76      0.76     20000
weighted avg       0.97   

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


High-Performing Entities:
Entities like B-geo (geographical entities), B-gpe (geopolitical entities), I-geo (continuation of geographical entities), I-org (continuation of organizations), I-per (continuation of persons), and I-tim (continuation of time) show high precision and recall scores,  showing that the model is very effective at identifying and classifying them accurately.
The O class, representing tokens outside of named entities, also shows high precision and recall, which is critical for the model's overall accuracy since these tokens likely make up a large portion of the dataset.
Low-Performing Entities:
Some specific entity types like B-art (artifacts) and B-eve (events) have zero precision and recall, indicating the model was unable to correctly identify these entities in the test set. The reason could be very small number of examples in the training data or difficulty in distinguishing these entities based on the features provided to the model.


The macro average for precision, recall, and F1-score is around 0.76, indicating that when you consider each entity type equally, the performance is lower. This metric highlights the model's struggle with less frequently occurring or more challenging entity types. The weighted average accounts for label imbalance by weighting the scores by the number of true instances for each label, resulting in scores close to the overall accuracy. This metric shows high performance (around 0.97).


## End of this notebook