# Named Entity Identification (NEI) using SVM
**Problem statement:** Label each word in the input sentence as NE/non-NE  
**Assumptions:** 
- Tokens carrying any of the tags `B-PER (1), I-PER (2), B-ORG (3), I-ORG (4), B-LOC (5), I-LOC (6), B-MISC (7), I-MISC (8)` are each labeled as an NE separately; multi-word entities are not merged into a single span.  
For example, the sentence `The Delhi High Court ...` has the ground-truth tags `The_0 Delhi_1 High_1 Court_1 ...` instead of `The_0 (Delhi High Court)_1 ...`
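
The tag collapsing described above can be sketched as follows. This is only an illustration: the helper name `to_binary` and the example annotation are hypothetical, and tag id 0 is assumed to be the "outside" (`O`) tag.

```python
# Tag ids as listed in the assumptions; index 0 is assumed to be the "outside" (O) tag.
TAG_NAMES = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

def to_binary(tag_ids):
    """Collapse the entity tag ids into NE (1) / non-NE (0), one label per token."""
    return [1 if t > 0 else 0 for t in tag_ids]

tokens = ["The", "Delhi", "High", "Court"]
tag_ids = [0, 3, 4, 4]  # hypothetical annotation: O, B-ORG, I-ORG, I-ORG
binary = to_binary(tag_ids)

print(" ".join(f"{w}_{b}" for w, b in zip(tokens, binary)))
# The_0 Delhi_1 High_1 Court_1
```

Each token keeps its own label, so entity boundaries (where one NE ends and the next begins) are not recoverable from the binary tags alone.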

## Install Dependencies

In [1]:
! pip install datasets


In [2]:
import nltk
nltk.download('punkt')
nltk.download("stopwords")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

## Start

In [68]:
%reset -f

## Imports

In [69]:
import numpy as np
from sklearn.svm import SVC
from string import punctuation
from tqdm.notebook import tqdm
from datasets import load_dataset
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report

## Constants

In [70]:
SEED = 0
D = 6 # number of features used
SW = stopwords.words("english")
PUNCT = list(punctuation)

## Functions

### Data

In [71]:
def createData(data):

    words = [] # stores the str
    features = [] # feature array, one vector per word in the corpus
    labels = [] # labels (0/1)

    for d in tqdm(data):

        tokens = d["tokens"]
        tags = d["ner_tags"]

        l = len(tokens)
        for i in range(l):

            x = vectorize(w = tokens[i], scaled_position = (i/l))

            if tags[i] > 0:
                y = 1
            else:
                y = 0

            features.append(x)
            labels.append(y)

        words += tokens

    words = np.asarray(words, dtype = "object")
    features = np.asarray(features, dtype = np.float32)
    labels = np.asarray(labels, dtype = np.float32)

    return words, features, labels

### Model

#### Feature Engineering (word $w$ (`str`) $\to$ feature vector $x \in \mathbb{R}^d$)
- Capitalization [`0/1`]
- Is all caps (e.g., acronyms like 'USA') [`0/1`]
- Length of the token [`int`]
- Is stopword (using NLTK's English stopword list, 179 stopwords) [`0/1`]
- Is punctuation [`0/1`]
- (Scaled) position in sentence [`float`]
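
As a rough, stand-alone illustration of the six features listed above, the sketch below builds the vectors for a short toy sentence. The function `toy_features` and the tiny `TOY_STOPWORDS` set are hypothetical stand-ins (the notebook itself uses NLTK's stopword list), so the stopword flag here will not match NLTK's for every word.

```python
from string import punctuation

# Hypothetical stand-in for NLTK's English stopword list (assumption, illustration only)
TOY_STOPWORDS = {"the", "is", "of", "a", "in"}
PUNCT_CHARS = set(punctuation)  # single punctuation characters

def toy_features(token, index, sentence_length):
    """Return the six features in the order of the list above."""
    return [
        float(token[0].isupper()),              # capitalization
        float(token.isupper()),                 # all caps
        float(len(token)),                      # token length
        float(token.lower() in TOY_STOPWORDS),  # stopword
        float(token in PUNCT_CHARS),            # punctuation
        index / sentence_length,                # scaled position in [0, 1)
    ]

tokens = ["The", "USA", "is", "big", "."]
rows = [toy_features(t, i, len(tokens)) for i, t in enumerate(tokens)]
for t, r in zip(tokens, rows):
    print(t, r)
```

Note that a sentence-initial word such as `The` gets the capitalization flag even though it is not an entity; in this sketch only the stopword flag and the position 0 distinguish it from a capitalized name.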

In [72]:
def vectorize(w, scaled_position):
    # w : str : a token

    v = np.zeros(D).astype(np.float32)

    # If first character in uppercase
    if w[0].isupper():
        title = 1
    else:
        title = 0

    # All characters in uppercase
    if w.isupper():
        allcaps = 1
    else:
        allcaps = 0

    # Is stopword
    if w.lower() in SW:
        sw = 1
    else:
        sw = 0

    # Is punctuation
    if w in PUNCT:
        punct = 1
    else:
        punct = 0

    # Build vector
    v[0] = title
    v[1] = allcaps
    v[2] = len(w)
    v[3] = sw
    v[4] = punct
    v[5] = scaled_position

    return v

In [73]:
def infer(model, scaler, s):
    # s: sentence

    tokens = word_tokenize(s)
    features = []

    l = len(tokens)
    for i in range(l):
        f = vectorize(w = tokens[i], scaled_position = (i/l))
        features.append(f)

    features = np.asarray(features, dtype = np.float32)

    scaled = scaler.transform(features)

    pred = model.predict(scaled)

    return pred, tokens, features

## Data (CoNLL 2003) [[huggingface]](https://huggingface.co/datasets/conll2003) [[original]](https://www.clips.uantwerpen.be/conll2003/ner/)
The dataset has labels for persons, locations, organizations, and miscellaneous entities that do not belong to the previous three groups (4 classes).

In [74]:
data = load_dataset("conll2003") # of type datasets.dataset_dict.DatasetDict
data_train = data["train"] # 14,041 rows (type: datasets.arrow_dataset.Dataset)
data_val   = data["validation"] # 3250 rows
data_test  = data["test"] # 3453 rows

# columns: 'id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'

Reusing dataset conll2003 (/root/.cache/huggingface/datasets/conll2003/conll2003/1.0.0/40e7cb6bcc374f7c349c83acd1e9352a4f09474eb691f64f364ee62eb65d0ca6)



In [75]:
words_train, X_train, y_train = createData(data_train)
words_val, X_val, y_val       = createData(data_val)
words_test, X_test, y_test    = createData(data_test)


In [76]:
print(X_train.shape)
print(X_val.shape)
print(X_test.shape)

(203621, 6)
(51362, 6)
(46435, 6)


In [77]:
# Print some examples of named entities in y_val
nes = words_val[y_val == 1]
for ne in np.random.choice(nes, size = 15):
    print(ne)

Kraft
M.
Glenn
Old
Joel
Hampshire
CHISINAU
Tse-Tung
Chemical
Staunton
DUBLIN
American
Cricket
Tour
Francis


In [78]:
# Standardize the features so that each one contributes on a comparable scale to the SVM's decision function
scaler = StandardScaler()

# Fit only on the training data (i.e. compute mean and std)
scaler = scaler.fit(X_train)

# Use the train data fit values to scale val and test
X_train = scaler.transform(X_train)
X_val   = scaler.transform(X_val)
X_test  = scaler.transform(X_test)
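
To make the fit-on-train, transform-everywhere pattern above concrete, here is a minimal pure-Python sketch of standardizing a single feature column. The helper names and the toy numbers are hypothetical; the sketch uses the population standard deviation, which is what `StandardScaler` uses, and it is not a replacement for the scaler itself.

```python
from statistics import fmean, pstdev

def fit_standardizer(column):
    """Compute mean and population standard deviation from the training column only."""
    mean = fmean(column)
    std = pstdev(column)
    # Guard against division by zero for a constant column
    return mean, (std if std > 0 else 1.0)

def standardize(column, mean, std):
    """Apply the training statistics to any split."""
    return [(x - mean) / std for x in column]

train_lengths = [2.0, 4.0, 6.0, 8.0]  # toy token lengths (hypothetical)
val_lengths = [5.0, 10.0]

mean, std = fit_standardizer(train_lengths)
train_scaled = standardize(train_lengths, mean, std)
val_scaled = standardize(val_lengths, mean, std)  # reuses the train mean and std

print(mean, std)
print(val_scaled)
```

Reusing the training statistics keeps information about the validation and test distributions out of the model, which is why the scaler is fit on the training split only.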

In [79]:
model = SVC(C = 1.0, kernel = "linear", class_weight = "balanced", random_state = SEED, verbose = True)
model.fit(X_train, y_train)

[LibSVM]

SVC(C=1.0, break_ties=False, cache_size=200, class_weight='balanced', coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=0, shrinking=True, tol=0.001,
    verbose=True)
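
The `class_weight = "balanced"` option reweights classes inversely to their frequency, using `n_samples / (n_classes * count)` for each class. The sketch below reproduces that formula in plain Python on a toy label list; the function name and the labels are hypothetical, and the resulting numbers are not the weights used in this notebook.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}

# Toy labels with a 4:1 imbalance between non-NE (0) and NE (1)
toy_labels = [0] * 8 + [1] * 2
weights = balanced_class_weights(toy_labels)

print(weights)  # {0: 0.625, 1: 2.5}
```

The rarer NE class receives the larger weight, so misclassifying an NE token costs more during training. That is consistent with the high recall and lower precision for class 1 in the validation report.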

In [80]:
y_pred_val = model.predict(X_val)

In [81]:
print(classification_report(y_true = y_val, y_pred = y_pred_val))

              precision    recall  f1-score   support

         0.0       0.99      0.96      0.98     42759
         1.0       0.82      0.97      0.89      8603

    accuracy                           0.96     51362
   macro avg       0.91      0.96      0.93     51362
weighted avg       0.96      0.96      0.96     51362
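
For readers who want to see how the per-class numbers in the report relate to each other, here is a small sketch that computes precision, recall, and F1 from raw labels and then checks the class-1 F1 against the rounded precision and recall shown above. The helper names and the toy labels are hypothetical.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def precision_recall_f1(y_true, y_pred, positive=1):
    """Count true/false positives and false negatives for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score_from(precision, recall)

# Toy example: 2 true positives, 1 false positive, 1 false negative
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0, 0, 0]
toy_p, toy_r, toy_f1 = precision_recall_f1(toy_true, toy_pred)

# Class 1.0 in the report: precision 0.82, recall 0.97
reported_f1 = round(f1_score_from(0.82, 0.97), 2)
print(reported_f1)  # 0.89
```

Because the report prints values rounded to two decimals, recomputing F1 from the printed precision and recall can differ from the printed F1 in the last digit for other rows.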



In [82]:
# A few examples

examples = [
    "Delhi is the capital of India.",
    "US Vice President Kamala Harris, PM Modi talk up Indo-US ties at 1st in-person meeting.",
    "Covid-19 India Live News: National Task Force drops Ivermectin, HCQ drugs from Covid-19 treatment protocol; India logs 31,382 new cases.",
    "US Rules Out Adding India Or Japan To Security Alliance With Australia And UK" # all words are capitalized
]

for e in examples:
    pred, tokens, features = infer(model, scaler, e)
    annotated = []
    for w, p in zip(tokens, pred):
        annotated.append(f"{w}_{int(p)}")
    print(" ".join(annotated))
    print()

Delhi_1 is_0 the_0 capital_0 of_0 India_1 ._0

US_1 Vice_1 President_1 Kamala_1 Harris_1 ,_0 PM_1 Modi_1 talk_0 up_0 Indo-US_1 ties_0 at_0 1st_0 in-person_0 meeting_0 ._0

Covid-19_1 India_1 Live_1 News_1 :_0 National_1 Task_1 Force_1 drops_0 Ivermectin_1 ,_0 HCQ_1 drugs_0 from_0 Covid-19_1 treatment_0 protocol_0 ;_0 India_1 logs_0 31,382_0 new_0 cases_0 ._0

US_1 Rules_1 Out_0 Adding_1 India_1 Or_0 Japan_1 To_0 Security_1 Alliance_1 With_0 Australia_1 And_0 UK_1



## References
1. [Kapociute-Dzikiene, J., Nøklestad, A., Johannessen, J. B., & Krupavicius, A. (2013). Exploring features for named entity recognition in Lithuanian text corpus.](https://aclanthology.org/W13-5611.pdf)
2. [Král, P. (2011). Features for named entity recognition in Czech.](https://www.researchgate.net/publication/256605620_Features_for_named_entity_recognition_in_Czech_language)
3. [Malarkodi, C. S., & Devi, S. L. (2020, May). A Deeper Study on Features for Named Entity Recognition. In Proceedings of the WILDRE5–5th Workshop on Indian Language Data: Resources and Evaluation (pp. 66-72).](https://aclanthology.org/2020.wildre-1.12.pdf)