# 4-2. **Named Entity Recognition**

Named entity recognition (NER) is the subtask of information extraction that locates named entity mentions in unstructured text and classifies them into predefined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, and percentages.

Sequence models such as HMMs, MEMMs, and CRFs are commonly applied to named entity prediction. In this lab, we train and test a named entity recognizer with one such sequence model, a CRF.

In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

In [2]:
!wget https://raw.githubusercontent.com/kimtwan/NLP_lecture/master/data/ner_dataset.csv

--2023-10-17 15:19:08--  https://raw.githubusercontent.com/kimtwan/NLP_lecture/master/data/ner_dataset.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15208261 (15M) [text/plain]
Saving to: ‘ner_dataset.csv’


2023-10-17 15:19:08 (123 MB/s) - ‘ner_dataset.csv’ saved [15208261/15208261]



In [3]:
# read IOB tagged NER dataset as dataframe
df = pd.read_csv('ner_dataset.csv', encoding = 'ISO-8859-1')
df.head()

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,,of,IN,O
2,,demonstrators,NNS,O
3,,have,VBP,O
4,,marched,VBN,O


In the data, you can see the different types of entities:
* geo = Geographical Entity
* org = Organization
* per = Person
* gpe = Geopolitical Entity
* tim = Time indicator
* art = Artifact
* eve = Event
* nat = Natural Phenomenon
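
The dataset stores these types in IOB format: the first token of an entity is tagged `B-<type>`, the following tokens of the same entity are tagged `I-<type>`, and everything else is tagged `O`. As a minimal sketch of that scheme (the helper `spans_to_iob` and the example sentence are made up for illustration and are not part of the dataset tooling):

```python
def spans_to_iob(tokens, spans):
    """Convert (start, end, type) token spans into IOB tags (end is exclusive)."""
    tags = ['O'] * len(tokens)
    for start, end, ent_type in spans:
        tags[start] = 'B-' + ent_type          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = 'I-' + ent_type          # remaining tokens of the entity
    return tags

# Made-up example sentence with three entities.
tokens = ['Barack', 'Obama', 'visited', 'New', 'York', 'on', 'Monday']
spans = [(0, 2, 'per'), (3, 5, 'geo'), (6, 7, 'tim')]
for token, tag in zip(tokens, spans_to_iob(tokens, spans)):
    print(token, tag)

# Eight entity types, each with a B- and an I- variant, plus the single 'O' tag.
entity_types = ['geo', 'org', 'per', 'gpe', 'tim', 'art', 'eve', 'nat']
n_tags = 2 * len(entity_types) + 1
print(n_tags)
```

With eight entity types this gives 17 possible tags, which matches the number of distinct tags found in the dataset.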

## Data Preprocessing
The ‘Sentence #’ column is NaN for every token except the first token of each sentence, so we fill each NaN with the preceding value.
The dataset has 47959 sentences containing 35172 unique words, tagged with 17 distinct tags.

In [4]:
# forward-fill NaN values (fillna(method='ffill') is deprecated in recent pandas)
df = df.ffill()
df['Sentence #'].nunique(), df.Word.nunique(), df.Tag.nunique()

(47959, 35172, 17)
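
To make the forward fill concrete, here is a pure-Python sketch of what it does to the ‘Sentence #’ column. The helper `forward_fill` and the toy column are hypothetical and only illustrate the idea; pandas implements this differently internally.

```python
def forward_fill(values):
    """Replace each None with the most recent non-None value seen so far."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

# In the raw CSV only the first token of each sentence carries a sentence label.
sentence_col = ['Sentence: 1', None, None, 'Sentence: 2', None]
print(forward_fill(sentence_col))
```

After filling, every token row carries the label of its sentence, which is what makes the later `groupby('Sentence #')` possible.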

In [5]:
df.groupby('Tag').size().reset_index(name='counts')

Unnamed: 0,Tag,counts
0,B-art,402
1,B-eve,308
2,B-geo,37644
3,B-gpe,15870
4,B-nat,201
5,B-org,20143
6,B-per,16990
7,B-tim,20333
8,I-art,297
9,I-eve,253


We now train a CRF model for named entity recognition on our dataset using sklearn-crfsuite. As mentioned above, MEMMs and CRFs are commonly used to label sequential data, and named entity recognition is one such labeling task.

In [6]:
!pip install -q -U sklearn_crfsuite

## Conditional random fields (CRF)

In [7]:
import sklearn_crfsuite
from sklearn_crfsuite import scorers
from sklearn_crfsuite.utils import flatten
from collections import Counter

In [8]:
# Retrieving sentences with their POS and tags.
class SentenceGetter(object):
    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        self.empty = False
        agg_func = lambda s: [(w, p, t) for w, p, t in zip(s['Word'].values.tolist(),
                                                           s['POS'].values.tolist(),
                                                           s['Tag'].values.tolist())]
        self.grouped = self.data.groupby('Sentence #').apply(agg_func)
        self.sentences = [s for s in self.grouped]

    def get_next(self):
        # return the next sentence, or None once all sentences are consumed
        try:
            s = self.grouped['Sentence: {}'.format(self.n_sent)]
            self.n_sent += 1
            return s
        except KeyError:
            return None

getter = SentenceGetter(df)
sentences = getter.sentences

In [9]:
# or simply..
grp = df.groupby('Sentence #').apply(lambda s: [(w, p, t) for w, p, t in zip(s['Word'].values.tolist(),
                                                           s['POS'].values.tolist(),
                                                           s['Tag'].values.tolist())])

sentences = [s for s in grp]

We extract more features (word parts, simplified POS tags, lower/title/upper flags, features of nearby words) and convert them to the sklearn-crfsuite format: each sentence is converted to a list of dicts, one dict per token. The following code is taken from the [official sklearn-crfsuite tutorial](https://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html).

In [10]:
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]

    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True

    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True

    return features

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

def sent2tokens(sent):
    return [token for token, postag, label in sent]
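
To show the data shape this produces, the sketch below applies a deliberately reduced feature function to a made-up three-token sentence. `toy_features` is a hypothetical stand-in for `word2features` that keeps only a few of its features; the sentence and its tags are invented for the example.

```python
def toy_features(sent, i):
    """A cut-down feature dict for token i of a (word, POS, tag) sentence."""
    word, postag = sent[i][0], sent[i][1]
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word.istitle()': word.istitle(),
        'postag': postag,
    }
    if i > 0:
        features['-1:word.lower()'] = sent[i - 1][0].lower()
    else:
        features['BOS'] = True   # beginning of sentence
    if i < len(sent) - 1:
        features['+1:word.lower()'] = sent[i + 1][0].lower()
    else:
        features['EOS'] = True   # end of sentence
    return features

sent = [('London', 'NNP', 'B-geo'), ('is', 'VBZ', 'O'), ('big', 'JJ', 'O')]
x_seq = [toy_features(sent, i) for i in range(len(sent))]  # one dict per token
y_seq = [tag for _, _, tag in sent]                        # one label per token
print(x_seq[0])
print(y_seq)
```

Each sentence therefore becomes a list of feature dicts paired with an equally long list of labels, and the full training set is a list of such sentences.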

In [11]:
#data splitting for training and testing
X = [sent2features(s) for s in sentences]
y = [sent2labels(s) for s in sentences]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

In [12]:
# train a CRF model for named entity recognition using sklearn-crfsuite
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)
# to prevent 'keep_tempfiles' AttributeError in scikit-learn >= 0.24
try:
    crf.fit(X_train, y_train) # This will take about 3 mins
except AttributeError:
    pass

The tag “O” (outside) is by far the most common tag, so including it would make our results look much better than they actually are. We therefore exclude “O” when computing the classification metrics.

In [13]:
# collect every tag except 'O' so the report covers entity tags only
classes = sorted(set(df.Tag.values) - {'O'})
classes

['B-art',
 'B-eve',
 'B-geo',
 'B-gpe',
 'B-nat',
 'B-org',
 'B-per',
 'B-tim',
 'I-art',
 'I-eve',
 'I-geo',
 'I-gpe',
 'I-nat',
 'I-org',
 'I-per',
 'I-tim']
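
A small worked example of why “O” is excluded: on a made-up sequence where most tokens are “O”, plain accuracy looks high even though most entity tokens are missed. The tags and the helper `entity_recall` below are illustrative only and unrelated to the scores of the trained model.

```python
def entity_recall(y_true, y_pred, ignore='O'):
    """Fraction of non-'O' gold tokens whose tag was predicted exactly."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != ignore]
    return sum(t == p for t, p in pairs) / len(pairs)

# Invented gold and predicted tags: 7 of 10 tokens are 'O'.
y_true = ['O', 'O', 'O', 'O', 'O', 'O', 'B-per', 'I-per', 'O', 'B-geo']
y_pred = ['O', 'O', 'O', 'O', 'O', 'O', 'B-per', 'O',     'O', 'O']

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print('accuracy including O:', accuracy)
print('recall on entity tags:', entity_recall(y_true, y_pred))
```

Here 8 of 10 tokens are tagged correctly, but only 1 of the 3 entity tokens is recovered, which is the gap that restricting the report to entity tags exposes.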

In [14]:
#evaluation
y_pred = crf.predict(X_test)
y_test_flat = flatten(y_test)
y_pred_flat = flatten(y_pred)

In [None]:
print(classification_report(y_test_flat, y_pred_flat, labels=classes))

              precision    recall  f1-score   support

       B-art       0.45      0.12      0.19       143
       B-eve       0.59      0.42      0.49       106
       B-geo       0.86      0.91      0.88     12447
       B-gpe       0.97      0.94      0.95      5284
       B-nat       0.82      0.42      0.56        78
       B-org       0.80      0.73      0.76      6615
       B-per       0.85      0.82      0.84      5652
       B-tim       0.93      0.88      0.90      6856
       I-art       0.11      0.03      0.05       105
       I-eve       0.38      0.22      0.28        93
       I-geo       0.82      0.81      0.81      2520
       I-gpe       0.91      0.62      0.74        69
       I-nat       1.00      0.43      0.61        23
       I-org       0.82      0.80      0.81      5597
       I-per       0.85      0.90      0.87      5674
       I-tim       0.84      0.74      0.79      2207

   micro avg       0.86      0.85      0.85     53469
   macro avg       0.75   

The transition weights below show what our classifier learned. The beginning of a geographical entity (B-geo) is very likely to be followed by a token inside a geographical entity (I-geo), whereas transitions into the inside of an organization name (I-org) from the beginning of a different entity type (for example B-geo, B-gpe, or B-per) are heavily penalized.

In [15]:
# wrap the transition weights in a Counter to list the 10 highest-weighted transitions
Counter(crf.transition_features_).most_common(10)

[(('B-nat', 'I-nat'), 6.934503),
 (('I-art', 'I-art'), 6.260215),
 (('B-art', 'I-art'), 5.881224),
 (('I-eve', 'I-eve'), 5.847777),
 (('B-eve', 'I-eve'), 5.586673),
 (('I-tim', 'I-tim'), 5.204188),
 (('I-org', 'I-org'), 4.782243),
 (('I-gpe', 'I-gpe'), 4.699609),
 (('B-tim', 'I-tim'), 4.636703),
 (('B-org', 'I-org'), 4.282602)]

In [16]:
def print_transitions(trans_features):
    for (label_from, label_to), weight in trans_features:
        print('%-6s -> %-7s %0.6f' % (label_from, label_to, weight))
print('Top likely transitions:')
print_transitions(Counter(crf.transition_features_).most_common(20))
print('\nTop unlikely transitions:')
print_transitions(Counter(crf.transition_features_).most_common()[-20:])

Top likely transitions:
B-nat  -> I-nat   6.934503
I-art  -> I-art   6.260215
B-art  -> I-art   5.881224
I-eve  -> I-eve   5.847777
B-eve  -> I-eve   5.586673
I-tim  -> I-tim   5.204188
I-org  -> I-org   4.782243
I-gpe  -> I-gpe   4.699609
B-tim  -> I-tim   4.636703
B-org  -> I-org   4.282602
O      -> O       3.813956
B-per  -> I-per   3.698815
I-geo  -> I-geo   3.685166
B-gpe  -> I-gpe   3.597376
B-geo  -> I-geo   3.516476
I-per  -> I-per   3.245863
I-nat  -> I-nat   2.954009
I-geo  -> B-art   1.973397
O      -> B-tim   1.748999
O      -> B-per   1.620428

Top unlikely transitions:
I-org  -> I-geo   -4.259782
I-org  -> I-per   -4.327937
B-geo  -> B-geo   -4.426926
B-per  -> I-org   -4.427218
B-geo  -> I-gpe   -4.435073
B-per  -> I-geo   -4.466408
B-tim  -> B-tim   -4.518613
B-org  -> I-geo   -4.575173
B-geo  -> I-per   -4.793920
B-org  -> I-per   -5.036090
B-geo  -> I-org   -5.070524
B-gpe  -> I-geo   -5.210003
B-gpe  -> I-org   -5.287803
B-gpe  -> B-gpe   -5.607401
O      -> I-per  
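
To illustrate how such transition weights influence decoding, here is a minimal Viterbi sketch over four tags. Only the two transition weights copied from the output above (B-geo -> I-geo and B-geo -> I-org) are real; the emission scores are invented, and all other transitions are assumed to be 0. sklearn-crfsuite performs decoding internally, so this is a sketch of the idea and not its implementation.

```python
TAGS = ['O', 'B-geo', 'I-geo', 'I-org']
# Two weights taken from the learned transitions above; the rest default to 0.
TRANS = {('B-geo', 'I-geo'): 3.516476, ('B-geo', 'I-org'): -5.070524}

def viterbi(emissions, trans, tags):
    """Return the highest-scoring tag path for a list of emission score dicts."""
    # best[t] = (score of the best path ending in tag t, that path)
    best = {t: (emissions[0].get(t, 0.0), [t]) for t in tags}
    for emit in emissions[1:]:
        new_best = {}
        for cur in tags:
            # pick the best previous tag, taking the transition weight into account
            score, path = max(
                (best[prev][0] + trans.get((prev, cur), 0.0), best[prev][1])
                for prev in tags
            )
            new_best[cur] = (score + emit.get(cur, 0.0), path + [cur])
        best = new_best
    return max(best.values())[1]

# Invented emission scores: on its own, the second token slightly prefers I-org.
emissions = [{'B-geo': 2.0}, {'I-org': 1.0, 'I-geo': 0.5}]
print(viterbi(emissions, TRANS, TAGS))  # with the learned transition weights
print(viterbi(emissions, {}, TAGS))     # with all transition weights set to 0
```

With the transition weights the decoder prefers B-geo followed by I-geo, while without them it follows the emission scores and picks I-org for the second token.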

## Named Entity Recognition with spaCy

The following lines show how to run named entity recognition with [spaCy](https://spacy.io/) to identify the names of things such as persons, organizations, or locations. spaCy’s pre-trained English named entity recognizer has been trained on the [OntoNotes 5 corpus](https://catalog.ldc.upenn.edu/LDC2013T19) and supports the following entity types: https://spacy.io/api/annotation#section-named-entities

In [17]:
import spacy

In [18]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.6.0/en_core_web_sm-3.6.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [19]:
import spacy
from spacy import displacy
from collections import Counter
import en_core_web_sm

In [20]:
# loading pre-trained model of NER
nlp = en_core_web_sm.load()

In [21]:
!pip install -q wikipedia
import wikipedia

  Preparing metadata (setup.py) ... done
  Building wheel for wikipedia (setup.py) ... done


We will fetch a Wikipedia page (OpenAI, https://en.wikipedia.org/wiki/OpenAI) to test NER with spaCy.
spaCy finds 582 entities in the page.

In [22]:
# fetch the Wikipedia page for OpenAI and run the spaCy pipeline on its text
wikip = wikipedia.page('OpenAI')
article = nlp(wikip.content)
len(article.ents)

582

In [23]:
# count how often each entity type occurs in the Wikipedia page
labels = [x.label_ for x in article.ents]
Counter(labels)

Counter({'GPE': 88,
         'NORP': 5,
         'ORG': 189,
         'PRODUCT': 11,
         'PERSON': 102,
         'DATE': 85,
         'FAC': 7,
         'MONEY': 17,
         'CARDINAL': 48,
         'ORDINAL': 5,
         'WORK_OF_ART': 8,
         'TIME': 3,
         'LOC': 3,
         'PERCENT': 6,
         'EVENT': 2,
         'LANGUAGE': 3})

In [24]:
# get the 10 most frequent strings recognised as named entities
items = [x.text for x in article.ents]
Counter(items).most_common(10)

[('OpenAI', 74),
 ('AI', 41),
 ('Microsoft', 13),
 ('GPT-2', 13),
 ('GPT-3', 13),
 ('OpenAI Global', 9),
 ('Google', 8),
 ('Sam Altman', 7),
 ('Musk', 7),
 ('API', 7)]

In [25]:
sentences = [x for x in article.sents]
print(sentences[0])

OpenAI is an American artificial intelligence (AI) organization consisting of the non-profit OpenAI, Inc. registered in Delaware and its for-profit subsidiary corporation OpenAI Global, LLC.


In [26]:
type(sentences[0])

spacy.tokens.span.Span

In [27]:
# display each tag of the sentence
print([(x, x.ent_iob_, x.ent_type_) for x in sentences[0]])

[(OpenAI, 'B', 'GPE'), (is, 'O', ''), (an, 'O', ''), (American, 'B', 'NORP'), (artificial, 'O', ''), (intelligence, 'O', ''), ((, 'O', ''), (AI, 'B', 'ORG'), (), 'O', ''), (organization, 'O', ''), (consisting, 'O', ''), (of, 'O', ''), (the, 'O', ''), (non, 'O', ''), (-, 'O', ''), (profit, 'O', ''), (OpenAI, 'B', 'ORG'), (,, 'I', 'ORG'), (Inc., 'I', 'ORG'), (registered, 'O', ''), (in, 'O', ''), (Delaware, 'B', 'GPE'), (and, 'O', ''), (its, 'O', ''), (for, 'O', ''), (-, 'O', ''), (profit, 'O', ''), (subsidiary, 'O', ''), (corporation, 'O', ''), (OpenAI, 'B', 'PRODUCT'), (Global, 'I', 'PRODUCT'), (,, 'O', ''), (LLC, 'B', 'ORG'), (., 'O', '')]
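
The token-level `ent_iob_` and `ent_type_` values printed above can be merged back into entity spans. The sketch below does this in plain Python on a few triples copied from that output; `merge_iob` is a hypothetical helper, and in practice spaCy already exposes the merged spans through `doc.ents`.

```python
def merge_iob(triples):
    """Group (text, iob, type) token triples into (entity_text, type) spans."""
    spans, current, cur_type = [], [], None
    for text, iob, ent_type in triples:
        if iob == 'B' or (iob == 'I' and not current):
            # a new entity starts; close the previous one if it is still open
            if current:
                spans.append((' '.join(current), cur_type))
            current, cur_type = [text], ent_type
        elif iob == 'I':
            current.append(text)               # continue the open entity
        else:
            # 'O' closes any open entity
            if current:
                spans.append((' '.join(current), cur_type))
            current, cur_type = [], None
    if current:
        spans.append((' '.join(current), cur_type))
    return spans

# A few (text, ent_iob_, ent_type_) triples copied from the printed sentence.
triples = [('in', 'O', ''), ('Delaware', 'B', 'GPE'), ('and', 'O', ''),
           ('OpenAI', 'B', 'PRODUCT'), ('Global', 'I', 'PRODUCT'), (',', 'O', ''),
           ('LLC', 'B', 'ORG'), ('.', 'O', '')]
print(merge_iob(triples))
```

Joining tokens with a single space is a simplification: it does not reproduce the original whitespace around punctuation, which spaCy's own spans preserve.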


In [28]:
# render the entities of the whole article; the parsed Doc is passed directly
# instead of re-parsing the string representation of the sentence list
displacy.render(article, jupyter=True, style='ent')

In [29]:
!python -m spacy download ko_core_news_sm

Collecting ko-core-news-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/ko_core_news_sm-3.6.0/ko_core_news_sm-3.6.0-py3-none-any.whl (14.7 MB)
Installing collected packages: ko-core-news-sm
Successfully installed ko-core-news-sm-3.6.0
✔ Download and installation successful
You can now load the package via spacy.load('ko_core_news_sm')


In [30]:
import ko_core_news_sm
nlp = ko_core_news_sm.load()

In [31]:
wikipedia.set_lang('ko')

# Exercise

Get a Korean Wikipedia page (https://ko.wikipedia.org/wiki/오픈AI) and display its named entities using displacy.render().

In [None]:
# Please complete this