# Named Entity Recognition

This notebook describes named entity recognition (NER) for code-mixed data, experimenting with different machine learning classification algorithms that use word, character and lexical features. The algorithms used for NER in this notebook are Decision Tree, Long Short-Term Memory (LSTM), and Conditional Random Field (CRF). The models are implemented with the scikit-learn and Keras libraries.

# DATA-SET 

In this notebook we train our NER model on code-mixed data, mainly Hindi-English. The corpus consists of Hindi-English code-mixed tweets from the last 8 years, on themes such as legislative issues and sports, from the Indian subcontinent's point of view. For more information on the dataset, refer to http://aclweb.org/anthology/W18-2405

In [1]:
import pandas as pd
import numpy as np
import csv,sys,re

### Load the NER dataset into a Pandas DataFrame

First, the data is loaded into a Pandas DataFrame. This can be done easily using the read_csv function, specifying that the separator is a comma. It is also useful to keep the blank lines, which help later when determining the sentence breaks.

Once the data is in a DataFrame, easy access to its columns allows a couple of useful things to be done: group the data by the "tag" column to see the distribution of each tag, and extract the classes as a list (disregarding the non-entity tag and blank lines with NaN values).

In [2]:

ner_data = pd.read_csv("./Twitterdata/annotatedData.csv", sep=",", header=None, skip_blank_lines=False, encoding="utf-8")
ner_data = ner_data[1:]
ner_data.columns = ["sen_num", "word", "tag"]

# Explore the distribution of NE tags in the dataset
tag_distribution = ner_data.groupby("tag").size().reset_index(name='counts')
print(tag_distribution)

     tag  counts
0  B-Loc     762
1  B-Org    1432
2  B-Per    2138
3  I-Loc      31
4  I-Org      90
5  I-Per     554
6  Other   63499


In [3]:
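# Extract the useful classes as a list, skipping the non-entity tag and NaN values.
# Non-entity tokens are labelled "Other" in this dataset, so filtering on "O" alone would not match.
classes = [t for t in ner_data["tag"].unique() if pd.notna(t) and t not in ("O", "Other")]

print(classes)

['B-Per', 'B-Org', 'I-Org', 'B-Loc', 'I-Per', 'I-Loc']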


# Feature Extraction

### Features corresponding to every word

The feature set consists of word-, character- and lexical-level information: character n-grams of size 2 and 3 for suffixes; patterns for punctuation, emoticons, numbers, and numbers inside strings; social-media-specific characters such as ‘#’ and ‘@’; and previous tag information. The same features of the previous and next tokens are used as context features.
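As an illustration of the surface patterns listed above, here is a minimal sketch using only the standard library. The function name surface_features, the three regular expressions and the sample tokens are assumptions made for this example; they are not taken from the notebook, whose own feature function follows in the next cell.

```python
import re

# Hypothetical patterns, chosen for illustration only.
EMOTICON_RE = re.compile(r"^[:;=8][\-o*']?[)(\]\[dDpP/\\|]$")
DIGIT_INSIDE_RE = re.compile(r"[A-Za-z]+\d+|\d+[A-Za-z]+")
PUNCT_RE = re.compile(r"^[^\w\s]+$")


def surface_features(token):
    """Return surface-level features for a single token."""
    return {
        "suffix2": token[-2:],                      # character bigram suffix
        "suffix3": token[-3:],                      # character trigram suffix
        "is_hashtag": token.startswith("#"),
        "is_mention": token.startswith("@"),
        "is_number": token.isdigit(),
        "has_digit_inside": bool(DIGIT_INSIDE_RE.search(token)),
        "is_emoticon": bool(EMOTICON_RE.match(token)),
        "is_punct": bool(PUNCT_RE.match(token)),
    }


# Made-up tokens that exercise each pattern.
for tok in ["#Delhi", "@user", "2019", "match2day", ":)", "!!"]:
    print(tok, surface_features(tok))
```

The notebook's own function encodes comparable information as integer indices rather than raw strings and booleans, so that the result can be written to a numeric CSV.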

In [5]:
def word2features(sent, i, word2idx, tag2idx, word2Suff2idx, word3Suff2idx, wordLower2idx, binaryIdx):
        # Build the index-encoded feature dict for token i of sent, a list of (word, tag) pairs.
        word = sent[i][0]  
        features = {
            'bias': 1.0,
            'word': word2idx[word],
            'word.lower()': wordLower2idx[word.lower()],
            'word[-3:]': word3Suff2idx[word[-3:]],
            'word[-2:]': word2Suff2idx[word[-2:]],
            'word.isupper()': binaryIdx[str(word.isupper())],
            'word.istitle()': binaryIdx[str(word.istitle())],
            'word.isdigit()': binaryIdx[str(word.isdigit())],
            'word.startsWith#()': binaryIdx[str(word.startswith("#"))],
            'word.startsWith@()': binaryIdx[str(word.startswith("@"))],
            'word.1stUpper()': binaryIdx[str(word[0].isupper())],
            'word.isAlpha()': binaryIdx[str(word.isalpha())],
            'word.Tag': tag2idx[sent[i][1]],
        }
        if i > 0:
            word1 = sent[i-1][0]
            features.update({
                '-1:word': word2idx[word1],
                '-1:word.lower()': wordLower2idx[word1.lower()],
                '-1:word.istitle()': binaryIdx[str(word1.istitle())],
                '-1:word.isupper()': binaryIdx[str(word1.isupper())],
                '-1:word.isdigit()': binaryIdx[str(word1.isdigit())],
                '-1:word.startsWith#()': binaryIdx[str(word1.startswith("#"))],
                '-1:word.startsWith@()': binaryIdx[str(word1.startswith("@"))],
                '-1:word.1stUpper()': binaryIdx[str(word1[0].isupper())],
                '-1:word.isAlpha()': binaryIdx[str(word1.isalpha())],
            })
        else:
            features['BOS'] = binaryIdx[str("True")]

        if i < len(sent)-1:
            word1 = sent[i+1][0]
            features.update({
                '+1:word': word2idx[word1],
                '+1:word.lower()': wordLower2idx[word1.lower()],
                '+1:word.istitle()': binaryIdx[str(word1.istitle())],
                '+1:word.isupper()': binaryIdx[str(word1.isupper())],
                '+1:word.isdigit()': binaryIdx[str(word1.isdigit())],
                '+1:word.startsWith#()': binaryIdx[str(word1.startswith("#"))],
                '+1:word.startsWith@()': binaryIdx[str(word1.startswith("@"))],
                '+1:word.1stUpper()': binaryIdx[str(word1[0].isupper())],
                '+1:word.isAlpha()': binaryIdx[str(word1.isalpha())],
            })
        else:
            features['EOS'] = binaryIdx[str("True")]

        return features

In [4]:
class SentenceGetter(object):
            
    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        self.empty = False
        agg_func = lambda s: [(w, t) for w, t in zip(s["Word"].values.tolist(),
                                                           s["Tag"].values.tolist())]
        self.grouped = self.data.groupby("Sent").apply(agg_func)
        self.sentences = [s for s in self.grouped]
    
    def get_next(self):
        try:
            s = self.grouped["Sentence: {}".format(self.n_sent)]
            self.n_sent += 1
            return s
        except KeyError:
            return None
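The grouping step performed by SentenceGetter can be sketched without pandas. The rows below are made-up (sentence id, word, tag) triples, not taken from the dataset, and the sketch assumes that rows belonging to the same sentence are contiguous, which is what itertools.groupby requires.

```python
from itertools import groupby
from operator import itemgetter

# Toy rows in (sentence id, word, tag) form; the values are invented.
rows = [
    (1, "Kohli", "B-Per"),
    (1, "ne", "Other"),
    (1, "century", "Other"),
    (2, "Delhi", "B-Loc"),
    (2, "mein", "Other"),
]

# Group consecutive rows that share a sentence id into lists of (word, tag) pairs.
sentences = [
    [(word, tag) for _, word, tag in group]
    for _, group in groupby(rows, key=itemgetter(0))
]

for sentence in sentences:
    print(sentence)
```

Each inner list has the same shape as one entry of SentenceGetter.sentences, which is what the feature functions in this notebook consume.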


In [6]:
def sent2features(sent, word2idx, tag2idx, word2Suff2idx, word3Suff2idx, wordLower2idx, binaryIdx):
    # One feature dict per token in the sentence.
    return [word2features(sent, i, word2idx, tag2idx, word2Suff2idx, word3Suff2idx, wordLower2idx, binaryIdx)
            for i in range(len(sent))]

The feature vector corresponding to each sentence uses the following features:
- Character N-Grams
- Word N-Gram
- Capitalization
- Mentions and Hashtags
- Numbers in String
- Previous Word Tag
- Common Symbols

In [7]:
def numericFeatures():
    data = pd.read_csv("./Twitterdata/annotatedData.csv", encoding="latin1")
    # Forward-fill missing values (fillna(method="ffill") is deprecated in recent pandas).
    data = data.ffill()

    words = list(set(data["Word"].values))
    words.append("ENDPAD")
    tags = list(set(data["Tag"].values))
    print(tags)

    # Lookup tables that map words, tags, suffixes and lowercased words to integer ids.
    word2idx = {w: i for i, w in enumerate(words)}
    tag2idx = {t: i for i, t in enumerate(tags)}
    word2Suff2idx = {w[-2:]: i for i, w in enumerate(words)}
    word3Suff2idx = {w[-3:]: i for i, w in enumerate(words)}
    wordLower2idx = {w.lower(): i for i, w in enumerate(words)}
    binaryIdx = {"True": 1, "False": 0}

    # Group the rows into sentences, then build one feature dict per token.
    getter = SentenceGetter(data)
    sentences = getter.sentences

    X = [sent2features(s, word2idx, tag2idx, word2Suff2idx, word3Suff2idx, wordLower2idx, binaryIdx)
         for s in sentences]
    return X

In [8]:
featureVec = numericFeatures()
csv_columns = ['+1:word', '+1:word.1stUpper()', '+1:word.isAlpha()', '+1:word.isdigit()', '+1:word.istitle()','+1:word.isupper()', '+1:word.lower()', '+1:word.startsWith#()', '+1:word.startsWith@()', 'BOS', '-1:word', '-1:word.1stUpper()', '-1:word.isAlpha()', '-1:word.isdigit()', '-1:word.istitle()', '-1:word.isupper()','-1:word.lower()', '-1:word.startsWith#()', '-1:word.startsWith@()', 'EOS', 'bias', 'word', 'word.1stUpper()', 'word.isAlpha()', 'word.isdigit()', 'word.istitle()','word.isupper()', 'word.lower()', 'word.startsWith#()', 'word.startsWith@()', 'word[-2:]', 'word[-3:]', 'word.Tag']
print(len(csv_columns))

with open('featureVector.csv', 'w', newline='') as ofile:
    writer = csv.DictWriter(ofile, csv_columns)
    writer.writeheader()
    for sen in featureVec:
        for word in sen:
            # Keys missing from a token's dict (e.g. BOS/EOS) are written as empty cells.
            writer.writerow(word)

['I-Per', 'Other', 'B-Loc', 'I-Loc', 'I-Org', 'B-Org', 'B-Per']
33
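The tag list printed above comes from a Python set, and the iteration order of a set of strings can change from one interpreter run to the next. A minimal sketch of a reproducible alternative, assuming the seven tag names shown in this notebook: sort the tags before enumerating them, and keep the inverse mapping so that report rows can be labelled in index order. The names idx2tag and target_names are used here for illustration.

```python
# Tag names as they appear in this notebook's output.
tags = ["B-Per", "Other", "B-Org", "I-Org", "B-Loc", "I-Per", "I-Loc"]

# Sorting before enumerating gives the same mapping on every run.
tag2idx = {tag: idx for idx, tag in enumerate(sorted(set(tags)))}
idx2tag = {idx: tag for tag, idx in tag2idx.items()}

# Names listed in index order, suitable for labelling report rows.
target_names = [idx2tag[i] for i in range(len(idx2tag))]

print(tag2idx)
print(target_names)
```

With a mapping built this way, the names passed to a classification report can be derived from the mapping itself instead of being typed by hand.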


# Language Model Classifier

We experimented with different classifiers for tagging each token. We then determine the effect of each feature and model parameter by running several experiments, using some subsets of the feature vector at a time and then all features together, while varying the parameters of the classifier models, such as the maximum depth of the tree for the Decision Tree model.
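One parameter worth examining on data this skewed is the per-class weight. The sketch below uses the tag counts printed earlier in this notebook and the "balanced" heuristic, weight = n_samples / (n_classes * count), which is the formula scikit-learn applies for class_weight="balanced". Computing the weights from the full dataset rather than from the training split is a simplification made for this example. The last lines show a pitfall when building such a mapping with dict().

```python
# Tag counts copied from the distribution printed earlier in this notebook.
tag_counts = {
    "B-Loc": 762, "B-Org": 1432, "B-Per": 2138,
    "I-Loc": 31, "I-Org": 90, "I-Per": 554, "Other": 63499,
}

n_samples = sum(tag_counts.values())
n_classes = len(tag_counts)

# "Balanced" heuristic: rare tags get large weights, frequent tags small ones.
balanced = {tag: n_samples / (n_classes * count) for tag, count in tag_counts.items()}

for tag, weight in sorted(balanced.items(), key=lambda kv: kv[1]):
    print(f"{tag:6s} {weight:8.2f}")

# Pitfall: dict() over a list of two-key dicts reads each dict as one
# (key, value) pair, so the result collapses to a single entry.
collapsed = dict([{0: 1, 1: 1}, {0: 1, 1: 50}])
print(collapsed)
```

For a single-output classifier the weights belong in one mapping keyed by the integer label, so the tag names above would first need to be translated through the tag-to-index mapping.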

In [14]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, classification_report
from sklearn.model_selection import train_test_split, cross_val_predict


X = pd.read_csv('./featureVector.csv')

y = X['word.Tag']

# removing the Tag column from X to keep it as feature only.
X.drop('word.Tag', axis=1, inplace=True)

# handling the NaN and inf values in the dataset
X=X.astype('float32')
y=y.astype('float32')
X = np.nan_to_num(X)


X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

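# NOTE: dict() over a list of two-key dicts keeps only the entry {0: 1}, so every class is
# effectively weighted equally here; per-class weights need a single {label: weight} mapping.
dtc = DecisionTreeClassifier(max_depth=32, class_weight=dict([{0:1,1:1}, {0:1,1:50}, {0:1,1:18},{0:1,1:1940}, {0:1,1:70},{0:1,1:3},{0:1,1:25}]))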
gnb = GaussianNB()
svm = SVC(gamma='auto')
clf = RandomForestClassifier(max_depth=10)

# fit
dtc.fit(X_train, y_train)
gnb.fit(X_train, y_train)
#svm.fit(X_train, y_train)
clf.fit(X_train, y_train)

# predict
y_pred = dtc.predict(X_test)
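# NOTE: target_names must follow the integer label order (the tag list printed by numericFeatures());
# this hard-coded order differs from the list printed above, so report rows may be mislabelled.
target_names = ['I-Loc', 'B-Org', 'I-Per', 'Other', 'B-Per', 'I-Org', 'B-Loc']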

# print
print ("Results for Decision tree..")

print(classification_report(y_test, y_pred, target_names=target_names))


# f1 score
score = f1_score(y_test, y_pred, average='weighted')
print( "Decision Tree F1 score: {:.2f}".format(score))


print ("Results for Naive Bayes...")
y_pred = gnb.predict(X_test)
print(classification_report(y_test, y_pred, target_names=target_names))

# f1 score
score = f1_score(y_test, y_pred, average='weighted')
print ("Naive Bayes F1 score: {:.2f}".format(score))


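# SVM: training is disabled above (svm.fit is commented out), so the report is skipped too.
# Without a fresh predict call, y_pred would still hold the Naive Bayes predictions.
# y_pred = svm.predict(X_test)
# print("Results for SVM...")
# print(classification_report(y_test, y_pred, target_names=target_names))
# score = f1_score(y_test, y_pred, average='weighted')
# print("SVM F1 score: {:.2f}".format(score))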



print( "Results for Random Forest...")
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred, target_names=target_names))

# f1 score
score = f1_score(y_test, y_pred, average='weighted')
print ("random Forest F1 score: {:.2f}".format(score))

# # Cross validation on Data
# pred = cross_val_predict(estimator=dtc, X=X, y=y, cv=5)
# print(classification_report(pred, y, target_names))



Results for Decision tree..
              precision    recall  f1-score   support

       I-Loc       0.34      0.33      0.34       153
       B-Org       0.97      0.97      0.97     16653
       I-Per       0.48      0.47      0.47       202
       Other       0.00      0.00      0.00        10
       B-Per       0.15      0.22      0.18        23
       I-Org       0.57      0.58      0.57       350
       B-Loc       0.63      0.61      0.62       645

   micro avg       0.94      0.94      0.94     18036
   macro avg       0.45      0.45      0.45     18036
weighted avg       0.94      0.94      0.94     18036

Decision Tree F1 score: 0.94
Results for Naive Bayes...
              precision    recall  f1-score   support

       I-Loc       0.06      0.16      0.08       153
       B-Org       0.97      0.77      0.86     16653
       I-Per       0.06      0.27      0.09       202
       Other       0.00      0.30      0.01        10
       B-Per       0.01      0.22      0.01     
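The gap between the macro and weighted averages in the decision tree report above can be reproduced by hand. The sketch below takes the (f1-score, support) pairs from that report in row order: the macro average treats every class equally, while the weighted average scales each class by its support, so the single class with 16653 test tokens dominates it.

```python
# (f1-score, support) per class, in the row order of the decision tree report above.
rows = [
    (0.34, 153), (0.97, 16653), (0.47, 202), (0.00, 10),
    (0.18, 23), (0.57, 350), (0.62, 645),
]

total_support = sum(support for _, support in rows)

# Macro average: plain mean of the per-class scores.
macro_f1 = sum(f1 for f1, _ in rows) / len(rows)

# Weighted average: per-class scores weighted by the number of test tokens.
weighted_f1 = sum(f1 * support for f1, support in rows) / total_support

print(f"support={total_support} macro={macro_f1:.2f} weighted={weighted_f1:.2f}")
```

Because the weighted score is driven almost entirely by the most frequent class, the macro average is the more informative figure for judging performance on the entity tags.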

