# Named Entity Recognition

This notebook describes named entity recognition (NER) for code-mixed data, experimenting with different machine learning classification algorithms that use word, character and lexical features. The algorithms used for NER in this notebook are Decision Tree, Long Short-Term Memory (LSTM), and Conditional Random Field (CRF). I have implemented the models using the scikit-learn and Keras libraries.

# Dataset

In this notebook we train our NER model on code-mixed data, mainly Hindi-English. The dataset consists of Hindi-English code-mixed tweets from the last 8 years, on themes such as legislative issues and sports, from the Indian subcontinent's point of view. For more information on the dataset, refer to http://aclweb.org/anthology/W18-2405

In [1]:
import pandas as pd
import numpy as np
import csv,sys,re

### Load the NER dataset into a Pandas DataFrame

First, the data is loaded into a Pandas DataFrame. This can be done easily using the read_csv function, specifying that the separator is a comma. It's also useful to keep the blank lines, which are helpful later for determining the sentence breaks.

Once the data is in a DataFrame, easy access to the columns allows a couple of useful things to be done: group the data by the "tag" column to see the distribution of each tag, and extract the classes as a list (disregarding 'O' and the NaN values that come from blank lines).
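The loading steps described above can be tried on a tiny in-memory file. This is a minimal sketch, assuming a two-column layout (`word`, `tag`) and made-up tokens; the real `annotatedData.csv` also carries a sentence-number column. It shows how keeping blank lines yields all-NaN rows that mark sentence breaks, and how the tag distribution and class list fall out of the DataFrame.

```python
import io

import pandas as pd

# Hypothetical sample: two sentences separated by one blank line.
raw = "word,tag\nModi,B-Per\nji,Other\n\nDelhi,B-Loc\nme,Other\n"

# skip_blank_lines=False keeps the blank line as a row of NaN values.
df = pd.read_csv(io.StringIO(raw), skip_blank_lines=False)

# Every NaN row starts a new sentence, so a running count gives a sentence id.
is_break = df["word"].isna()
df["sent_id"] = is_break.cumsum()
tokens = df[~is_break]

# Tag distribution (NaN rows are already excluded).
tag_counts = tokens.groupby("tag").size().to_dict()

# Entity classes only: drop the non-entity tag, which this dataset calls "Other".
classes = sorted(t for t in tokens["tag"].unique() if t != "Other")

# Rebuild each sentence as a list of (word, tag) pairs.
sentences = [list(zip(g["word"], g["tag"])) for _, g in tokens.groupby("sent_id")]

print(tag_counts)
print(classes)
print(sentences)
```

Deriving the sentence id from the blank rows avoids relying on a separate sentence-number column, which is useful when that column is missing or incomplete.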

In [2]:

ner_data = pd.read_csv("annotatedData.csv", sep=",", header=None, skip_blank_lines=False, encoding="utf-8")
ner_data = ner_data[1:]
ner_data.columns = ["sen_num", "word", "tag"]

# Explore the distribution of NE tags in the dataset
tag_distribution = ner_data.groupby("tag").size().reset_index(name='counts')
print(tag_distribution)

     tag  counts
0  B-Loc     762
1  B-Org    1432
2  B-Per    2138
3  I-Loc      31
4  I-Org      90
5  I-Per     554
6  Other   63499


In [3]:
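# Extract the classes as a list, dropping 'O' and NaN values.
# Note: this dataset labels non-entity tokens 'Other' rather than 'O', so 'Other' stays in the list.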
classes = list(filter(lambda x: x not in ["O", np.nan], list(ner_data["tag"].unique())))

print(classes)

['B-Per', 'Other', 'B-Org', 'I-Org', 'B-Loc', 'I-Per', 'I-Loc']


# Feature Extraction

### Features corresponding to every word

The feature set consists of word-, character- and lexical-level information: character n-grams of size 2 and 3 for suffixes; patterns for punctuation, emoticons, numbers and numbers inside strings; social-media-specific characters such as ‘#’ and ‘@’; and previous tag information. The same features of the previous and next tokens are used as context features.

In [4]:
def word2features(sent, i, word2idx, tag2idx, word2Suff2idx, word3Suff2idx, wordLower2idx, binaryIdx):
        word = sent[i][0]  
        features = {
            'bias': 1.0,
            'word': word2idx[word],
            'word.lower()': wordLower2idx[word.lower()],
            'word[-3:]': word3Suff2idx[word[-3:]],
            'word[-2:]': word2Suff2idx[word[-2:]],
            'word.isupper()': binaryIdx[str(word.isupper())],
            'word.istitle()': binaryIdx[str(word.istitle())],
            'word.isdigit()': binaryIdx[str(word.isdigit())],
            'word.startsWith#()': binaryIdx[str(word.startswith("#"))],
            'word.startsWith@()': binaryIdx[str(word.startswith("@"))],
            'word.1stUpper()': binaryIdx[str(word[0].isupper())],
            'word.isAlpha()': binaryIdx[str(word.isalpha())],
            'word.Tag': tag2idx[sent[i][1]],
        }
        if i > 0:
            word1 = sent[i-1][0]
            features.update({
                '-1:word': word2idx[word1],
                '-1:word.lower()': wordLower2idx[word1.lower()],
                '-1:word.isupper()': binaryIdx[str(word1.isupper())],
                '-1:word.istitle()': binaryIdx[str(word1.istitle())],
                '-1:word.isdigit()': binaryIdx[str(word1.isdigit())],
                '-1:word.startsWith#()': binaryIdx[str(word1.startswith("#"))],
                '-1:word.startsWith@()': binaryIdx[str(word1.startswith("@"))],
                '-1:word.1stUpper()': binaryIdx[str(word1[0].isupper())],
                '-1:word.isAlpha()': binaryIdx[str(word1.isalpha())],
            })
        else:
            features['BOS'] = binaryIdx[str("True")]

        if i < len(sent)-1:
            word1 = sent[i+1][0]
            features.update({
                '+1:word': word2idx[word1],
                '+1:word.lower()': wordLower2idx[word1.lower()],
                '+1:word.isupper()': binaryIdx[str(word1.isupper())],
                '+1:word.istitle()': binaryIdx[str(word1.istitle())],
                '+1:word.isdigit()': binaryIdx[str(word1.isdigit())],
                '+1:word.startsWith#()': binaryIdx[str(word1.startswith("#"))],
                '+1:word.startsWith@()': binaryIdx[str(word1.startswith("@"))],
                '+1:word.1stUpper()': binaryIdx[str(word1[0].isupper())],
                '+1:word.isAlpha()': binaryIdx[str(word1.isalpha())],
            })
        else:
            features['EOS'] = binaryIdx[str("True")]

        return features

In [5]:
class SentenceGetter(object):
            
    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        self.empty = False
        agg_func = lambda s: [(w, t) for w, t in zip(s["Word"].values.tolist(),
                                                           s["Tag"].values.tolist())]
        self.grouped = self.data.groupby("Sent").apply(agg_func)
        self.sentences = [s for s in self.grouped]
    
    def get_next(self):
        try:
            s = self.grouped["Sentence: {}".format(self.n_sent)]
            self.n_sent += 1
            return s
        except KeyError:
            return None


In [6]:
def sent2features(sent,word2idx,tag2idx,word2Suff2idx,word3Suff2idx,wordLower2idx,binaryIdx):
#    print (sent)
    return list(word2features(sent, i,word2idx,tag2idx,word2Suff2idx,word3Suff2idx,wordLower2idx,binaryIdx) for i in range(len(sent)))

In [21]:
def sent2labels(sent):
    return [label for token, label in sent]

The feature vector for each sentence uses the following features:
- Character N-Grams
- Word N-Gram
- Capitalization
- Mentions and Hashtags
- Numbers in String
- Previous Word Tag
- Common Symbols

In [7]:
def numericFeatures():
    data = pd.read_csv("annotatedData.csv", encoding="latin1")
    # Forward-fill missing cells with the value from the row above
    data = data.ffill()

    words = list(set(data["Word"].values))
    words.append("ENDPAD")
    tags = list(set(data["Tag"].values))
    print(tags)

    # Lookup tables that map each categorical value to an integer id
    word2idx = {w: i for i, w in enumerate(words)}
    tag2idx = {t: i for i, t in enumerate(tags)}
    word2Suff2idx = {w[-2:]: i for i, w in enumerate(words)}
    word3Suff2idx = {w[-3:]: i for i, w in enumerate(words)}
    wordLower2idx = {w.lower(): i for i, w in enumerate(words)}
    binaryIdx = {"True": 1, "False": 0}

    getter = SentenceGetter(data)
    sentences = getter.sentences

    # One list of per-token feature dicts for each sentence
    X = [sent2features(s, word2idx, tag2idx, word2Suff2idx, word3Suff2idx, wordLower2idx, binaryIdx) for s in sentences]
    return X

In [8]:
featureVec = numericFeatures()
csv_columns = ['+1:word', '+1:word.1stUpper()', '+1:word.isAlpha()', '+1:word.isdigit()', '+1:word.istitle()','+1:word.isupper()', '+1:word.lower()', '+1:word.startsWith#()', '+1:word.startsWith@()', 'BOS', '-1:word', '-1:word.1stUpper()', '-1:word.isAlpha()', '-1:word.isdigit()', '-1:word.istitle()', '-1:word.isupper()','-1:word.lower()', '-1:word.startsWith#()', '-1:word.startsWith@()', 'EOS', 'bias', 'word', 'word.1stUpper()', 'word.isAlpha()', 'word.isdigit()', 'word.istitle()','word.isupper()', 'word.lower()', 'word.startsWith#()', 'word.startsWith@()', 'word[-2:]', 'word[-3:]', 'word.Tag']
print(len(csv_columns))

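# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open('featureVector.csv', 'w', newline='') as ofile:
    writer = csv.DictWriter(ofile, csv_columns)
    writer.writeheader()
    for sen in featureVec:
        for word in sen:
            # Keys missing from a token's dict (e.g. 'BOS', 'EOS') are written as empty cells
            writer.writerow(word)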

['I-Per', 'B-Per', 'I-Org', 'B-Loc', 'Other', 'I-Loc', 'B-Org']
33


# Language Model Classifier

We experimented with different classifiers for identifying named entities. Further, we determine the effect of each feature and model parameter by running several experiments, first with subsets of the feature vector and then with all features together, while varying the parameters of the classifiers, such as the maximum depth of the tree for the Decision Tree model.
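One parameter that matters here is the class weight, because the tag distribution printed earlier is heavily skewed towards "Other". The sketch below computes "balanced" weights from those counts with the formula `n_samples / (n_classes * count)`, which is the heuristic scikit-learn documents for `class_weight="balanced"`. It is offered as a reference point, not as the way the weights used in this notebook were derived; the function name `balanced_class_weights` is hypothetical.

```python
# Counts copied from the tag distribution printed earlier in this notebook.
tag_counts = {
    "B-Loc": 762,
    "B-Org": 1432,
    "B-Per": 2138,
    "I-Loc": 31,
    "I-Org": 90,
    "I-Per": 554,
    "Other": 63499,
}


def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {tag: total / (n_classes * n) for tag, n in counts.items()}


weights = balanced_class_weights(tag_counts)
for tag, w in sorted(weights.items(), key=lambda kv: kv[1]):
    print("{:6s} {:8.3f}".format(tag, w))
```

Rare tags such as I-Loc receive weights several orders of magnitude larger than "Other", which is the effect hand-tuned weights try to approximate.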

In [9]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_predict
from sklearn.ensemble import RandomForestClassifier


X = pd.read_csv('./featureVector.csv')

y = X['word.Tag']

# removing the Tag column from X to keep it as feature only.
X.drop('word.Tag', axis=1, inplace=True)

# handling the NaN and inf values in the dataset
X=X.astype('float32')
y=y.astype('float32')
X = np.nan_to_num(X)


X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

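# For a single-output problem, class_weight is a flat {class_index: weight} mapping.
# Wrapping a list of dicts in dict() collapses to {0: 1}, i.e. no re-weighting at all.
# The weights below are the original values, assumed to be listed in class-index order.
dtc = DecisionTreeClassifier(max_depth=32, class_weight={0: 1, 1: 50, 2: 18, 3: 1940, 4: 70, 5: 3, 6: 25})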
gnb = GaussianNB()
svm = SVC(gamma='auto')
clf = RandomForestClassifier(max_depth=10)

# fit
dtc.fit(X_train, y_train)
gnb.fit(X_train, y_train)
#svm.fit(X_train, y_train)
clf.fit(X_train, y_train)

# predict
y_pred = dtc.predict(X_test)
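# Names must follow the tag2idx order (index 0 first), i.e. the tag list printed by
# numericFeatures() in this session; any other order mislabels the report rows.
target_names = ['I-Per', 'B-Per', 'I-Org', 'B-Loc', 'Other', 'I-Loc', 'B-Org']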

# print
print ("Results for Decision tree..")

print(classification_report(y_test, y_pred, target_names=target_names))


# f1 score
score = f1_score(y_test, y_pred, average='weighted')  # signature is (y_true, y_pred)
print( "Decision Tree F1 score: {:.2f}".format(score))


print ("Results for Naive Bayes...")
y_pred = gnb.predict(X_test)
print(classification_report(y_test, y_pred, target_names=target_names))

# f1 score
score = f1_score(y_test, y_pred, average='weighted')  # signature is (y_true, y_pred)
print ("Naive Bayes F1 score: {:.2f}".format(score))


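# SVM: training is commented out above, so there are no SVM predictions to report.
# Uncomment svm.fit(...) and the lines below together to evaluate it.
# y_pred = svm.predict(X_test)
# print("Results for SVM...")
# print(classification_report(y_test, y_pred, target_names=target_names))
# score = f1_score(y_test, y_pred, average='weighted')
# print("SVM F1 score: {:.2f}".format(score))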



print( "Results for Random Forest...")
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred, target_names=target_names))

# f1 score
score = f1_score(y_test, y_pred, average='weighted')  # signature is (y_true, y_pred)
print ("Random Forest F1 score: {:.2f}".format(score))

# # Cross validation on Data
# pred = cross_val_predict(estimator=dtc, X=X, y=y, cv=5)
# print(classification_report(y, pred, target_names=target_names))



Results for Decision tree..
              precision    recall  f1-score   support

       I-Loc       0.38      0.39      0.39       153
       B-Org       0.67      0.65      0.66       645
       I-Per       0.20      0.22      0.21        23
       Other       0.51      0.52      0.52       202
       B-Per       0.97      0.97      0.97     16653
       I-Org       0.42      0.50      0.45        10
       B-Loc       0.61      0.62      0.62       350

   micro avg       0.94      0.94      0.94     18036
   macro avg       0.54      0.55      0.55     18036
weighted avg       0.94      0.94      0.94     18036

Decision Tree F1 score: 0.94
Results for Naive Bayes...
              precision    recall  f1-score   support

       I-Loc       0.07      0.24      0.11       153
       B-Org       0.28      0.48      0.35       645
       I-Per       0.01      0.30      0.01        23
       Other       0.07      0.35      0.11       202
       B-Per       0.97      0.79      0.87     


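The gap between the macro average (0.55) and the weighted average (0.94) in the Decision Tree report follows directly from the support column. The sketch below recomputes both from the seven per-class rows of that report, copied in row order. It is plain arithmetic on the printed numbers, not a re-run of the model.

```python
# Per-class F1 and support, copied row by row from the Decision Tree report.
f1 = [0.39, 0.66, 0.21, 0.52, 0.97, 0.45, 0.62]
support = [153, 645, 23, 202, 16653, 10, 350]

# Macro average: every class counts equally.
macro_f1 = sum(f1) / len(f1)

# Weighted average: every class counts in proportion to its support.
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / sum(support)

# Share of the test tokens that belong to the largest class.
largest_share = max(support) / sum(support)

print("macro F1:    {:.2f}".format(macro_f1))
print("weighted F1: {:.2f}".format(weighted_f1))
print("largest class share: {:.2%}".format(largest_share))
```

Because one class accounts for more than nine tenths of the test tokens, the weighted score mostly reflects that class, while the macro score gives a better picture of how the rare entity tags are handled.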

In [45]:
from sklearn_crfsuite import CRF
from sklearn.model_selection import cross_val_predict
from sklearn_crfsuite.metrics import flat_classification_report

data = pd.read_csv("annotatedData.csv", encoding="latin1")
data = data.ffill()  # forward-fill; fillna(method="ffill") is deprecated in recent pandas

getter = SentenceGetter(data)
sentences = getter.sentences

X1 = numericFeatures()
Y1 = [sent2labels(s) for s in sentences]



crf = CRF(algorithm='l2sgd',
          c2 = 0.1,
          max_iterations = 1000,
          all_possible_transitions = False)


pred = cross_val_predict(estimator = crf, X = X1, y = Y1, cv = 2)
report = flat_classification_report(y_pred = pred, y_true = Y1)
print(report)

crf.fit(X1, Y1)

['I-Per', 'B-Per', 'I-Org', 'B-Loc', 'Other', 'I-Loc', 'B-Org']
              precision    recall  f1-score   support

       B-Loc       0.00      0.00      0.00       795
       B-Org       0.00      0.00      0.00      1528
       B-Per       0.03      0.42      0.05      2362
       I-Loc       0.00      0.00      0.00        31
       I-Org       0.00      0.00      0.00        96
       I-Per       0.00      0.00      0.00       571
       Other       0.91      0.50      0.65     66760

   micro avg       0.48      0.48      0.48     72143
   macro avg       0.13      0.13      0.10     72143
weighted avg       0.85      0.48      0.60     72143



CRF(algorithm='l2sgd', all_possible_states=None,
  all_possible_transitions=False, averaging=None, c=None, c1=None, c2=0.1,
  calibration_candidates=None, calibration_eta=None,
  calibration_max_trials=None, calibration_rate=None,
  calibration_samples=None, delta=None, epsilon=None, error_sensitive=None,
  gamma=None, keep_tempfiles=None, linesearch=None, max_iterations=1000,
  max_linesearch=None, min_freq=None, model_filename=None,
  num_memories=None, pa_type=None, period=None, trainer_cls=None,
  variance=None, verbose=False)
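The CRF cell above reuses the integer-coded feature dicts built for the other classifiers, including the `word.Tag` entry that holds the gold label. As far as the python-crfsuite documentation describes it, a numeric feature value is read as a weight for that feature, whereas a string value becomes a categorical `key:value` feature, so a word id of 5000 is not treated as "word number 5000". The sketch below shows an alternative featurizer that keeps values as strings and booleans and leaves the tag out of the features. The function names and the sample sentence are hypothetical, and the result is not claimed to reproduce or improve the scores above.

```python
def token_features(sent, i):
    """Build a feature dict for token i of a sentence of (word, tag) pairs."""
    word = sent[i][0]
    feats = {
        "bias": 1.0,
        "word.lower": word.lower(),   # string value: categorical feature
        "suffix3": word[-3:],
        "suffix2": word[-2:],
        "is_upper": word.isupper(),   # boolean value: on/off feature
        "is_title": word.istitle(),
        "is_digit": word.isdigit(),
        "is_hashtag": word.startswith("#"),
        "is_mention": word.startswith("@"),
    }
    if i > 0:
        feats["-1:word.lower"] = sent[i - 1][0].lower()
    else:
        feats["BOS"] = True           # beginning of sentence
    if i < len(sent) - 1:
        feats["+1:word.lower"] = sent[i + 1][0].lower()
    else:
        feats["EOS"] = True           # end of sentence
    return feats


def sent_to_features(sent):
    return [token_features(sent, i) for i in range(len(sent))]


def sent_to_labels(sent):
    return [tag for _, tag in sent]


# Hypothetical sentence in the (word, tag) format produced by SentenceGetter.
sample = [("Modi", "B-Per"), ("ji", "Other"), ("#Delhi", "B-Loc")]
X_sample = sent_to_features(sample)
y_sample = sent_to_labels(sample)
print(X_sample[0])
print(y_sample)
```

Lists built this way, one per sentence, have the same nested shape as `X1` and `Y1` in the cell above.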

In [49]:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding, TimeDistributed, Dropout

dataset = pd.read_csv('featureVector.csv', header=0)
val = dataset.values
val=val.astype('float32')
val = np.nan_to_num(val)

X = val[:,:32]
Y = val[:,32]

# print X.shape, Y.shape

X = np.reshape(X, (X.shape[0], X.shape[1], 1))
print(X.shape)

model = Sequential()
model.add(LSTM(100, input_shape=(32, 1)))
model.add(Dropout(0.3))
model.add(Dense(7,activation='softmax')) #7 class classification.
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics = ['accuracy'])
model.fit(X, Y, epochs=5, batch_size=32, validation_split = 0.2, verbose=1)

model.summary()

Using TensorFlow backend.


(72143, 32, 1)
Train on 57714 samples, validate on 14429 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 100)               40800     
_________________________________________________________________
dropout_1 (Dropout)          (None, 100)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 7)                 707       
Total params: 41,507
Trainable params: 41,507
Non-trainable params: 0
_________________________________________________________________


In [50]:
from keras.layers import Bidirectional

model1 = Sequential()
model1.add(Bidirectional(LSTM(100, input_shape=(32, 1))))
model1.add(Dropout(0.3))
model1.add(Dense(7,activation='softmax')) #7 class classification.
model1.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics = ['accuracy'])
model1.fit(X, Y, epochs=5, batch_size=32, validation_split = 0.2, verbose=1)

model1.summary()

Train on 57714 samples, validate on 14429 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bidirectional_1 (Bidirection (None, 200)               81600     
_________________________________________________________________
dropout_2 (Dropout)          (None, 200)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 7)                 1407      
Total params: 83,007
Trainable params: 83,007
Non-trainable params: 0
_________________________________________________________________
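The parameter counts in the two summaries can be checked by hand. An LSTM layer has four gates, each with an input kernel, a recurrent kernel and a bias, and a bidirectional wrapper holds two such layers. The sketch below redoes that arithmetic for the shapes used here (100 units, 1 input feature per step, 7 output classes); the helper names are hypothetical.

```python
def lstm_params(units, input_dim):
    """4 gates x (input kernel + recurrent kernel + bias)."""
    return 4 * (units * input_dim + units * units + units)


def dense_params(n_inputs, n_outputs):
    """Weight matrix plus one bias per output."""
    return n_inputs * n_outputs + n_outputs


# Unidirectional model: LSTM(100) -> Dropout -> Dense(7)
lstm = lstm_params(units=100, input_dim=1)
dense = dense_params(100, 7)
total = lstm + dense

# Bidirectional model: two LSTMs whose outputs are concatenated (200 values)
bi_lstm = 2 * lstm_params(units=100, input_dim=1)
bi_dense = dense_params(200, 7)
bi_total = bi_lstm + bi_dense

print(lstm, dense, total)
print(bi_lstm, bi_dense, bi_total)
```

Dropout adds no parameters, which is why both summaries list it with a count of 0.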
