# Assignment 4: Named entity recognition

Create a model for Named Entity Recognition on the CoNLL 2002 dataset.  
Your quality metric = f1_macro

In your solution you should use: RandomForest, Gradient Boosting (xgboost, lightgbm, catboost)   
Tutorials:  
1. https://github.com/Microsoft/LightGBM/tree/master/examples/python-guide
1. https://github.com/catboost/tutorials 

The more baselines you beat, the better your score.
 
baseline 1 [3 points]: 0.0604      random labels  
baseline 2 [5 points]: 0.3966      PoS features + logistic regression  
baseline 3 [8 points]: 0.8122      word2vec cbow embedding + baseline 2 + svm    

[1 point] using feature engineering (creating features not presented in the baselines)

! Your results must be reproducible. You should explicitly set all seeds and random_states in your model.  
! Remember to use a proper training pipeline.  

bonus, think about:  
1. [1 point] Why did we select f1 score with macro averaging as our classification quality measure? What other metrics are suitable?   

We use the F-score because the dataset is imbalanced. Metrics such as accuracy are a poor choice here, since class O alone makes up about 0.85 of the dataset. Why F1 with macro averaging in particular? Because it computes the F-score for each class independently and then averages the results, so every class counts equally. Micro-averaged F1 effectively weights each class by its number of elements, which we do not want because the dominant class is O. We could also use AUC or other macro-averaged metrics.
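To illustrate the argument above, here is a minimal sketch in plain Python (no scikit-learn) that compares accuracy with macro-averaged F1 for a degenerate model that always predicts `O`. The toy labels and the helper `f1_per_class` are made up for illustration and are not part of the assignment data. For single-label multiclass problems micro-averaged F1 equals accuracy, so the accuracy value also stands in for micro F1.

```python
def f1_per_class(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for every label seen."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Toy labels (hypothetical): 85 "O" tokens and 15 entity tokens.
y_true = ["O"] * 85 + ["B-geo"] * 10 + ["B-per"] * 5
# A degenerate model that always predicts the majority class.
y_pred = ["O"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
per_class = f1_per_class(y_true, y_pred)
macro_f1 = sum(per_class.values()) / len(per_class)

print(f"accuracy (= micro F1 here): {accuracy:.3f}")
print(f"macro F1:                   {macro_f1:.3f}")
```

The majority-class model looks good by accuracy but scores zero F1 on every entity class, which pulls the macro average down sharply.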

In [0]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn import model_selection
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import LabelEncoder
from sklearn import metrics

import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


SEED=1338

In [2]:
df = pd.read_csv('ner_short.csv', index_col=0)
df.head()

Unnamed: 0,next-next-pos,next-next-word,next-pos,next-word,pos,prev-pos,prev-prev-pos,prev-prev-word,prev-word,sentence_idx,word,tag
0,NNS,demonstrators,IN,of,NNS,__START1__,__START2__,__START2__,__START1__,1.0,Thousands,O
1,VBP,have,NNS,demonstrators,IN,NNS,__START1__,__START1__,Thousands,1.0,of,O
2,VBN,marched,VBP,have,NNS,IN,NNS,Thousands,of,1.0,demonstrators,O
3,IN,through,VBN,marched,VBP,NNS,IN,of,demonstrators,1.0,have,O
4,NNP,London,IN,through,VBN,VBP,NNS,demonstrators,have,1.0,marched,O


In [3]:
# number of sentences
df.sentence_idx.max()

1500.0

In [4]:
# class distribution
df.tag.value_counts(normalize=True )

O        0.852828
B-geo    0.027604
B-gpe    0.020935
B-org    0.020247
I-per    0.017795
B-tim    0.016927
B-per    0.015312
I-org    0.013937
I-geo    0.005383
I-tim    0.004247
B-art    0.001376
I-gpe    0.000837
I-art    0.000748
B-eve    0.000628
I-eve    0.000508
B-nat    0.000449
I-nat    0.000239
Name: tag, dtype: float64

In [0]:
# sentence length
tdf = df.set_index('sentence_idx')
tdf['length'] = df.groupby('sentence_idx').tag.count()
df = tdf.reset_index(drop=False)

In [0]:
# encode categorical variables (the encoder is refit for every column)
le = LabelEncoder()
df['pos'] = le.fit_transform(df.pos)
df['next-pos'] = le.fit_transform(df['next-pos'])
df['next-next-pos'] = le.fit_transform(df['next-next-pos'])
df['prev-pos'] = le.fit_transform(df['prev-pos'])
df['prev-prev-pos'] = le.fit_transform(df['prev-prev-pos'])

In [7]:
df.head()

Unnamed: 0,sentence_idx,next-next-pos,next-next-word,next-pos,next-word,pos,prev-pos,prev-prev-pos,prev-prev-word,prev-word,word,tag,length
0,1.0,18,demonstrators,9,of,18,39,40,__START2__,__START1__,Thousands,O,48
1,1.0,33,have,18,demonstrators,9,18,39,__START1__,Thousands,of,O,48
2,1.0,32,marched,33,have,18,9,18,Thousands,of,demonstrators,O,48
3,1.0,9,through,32,marched,33,18,9,of,demonstrators,have,O,48
4,1.0,16,London,9,through,32,33,18,demonstrators,have,marched,O,48


In [8]:
# splitting
y = LabelEncoder().fit_transform(df.tag)

df_train, df_test, y_train, y_test = model_selection.train_test_split(df, y, stratify=y, 
                                                                      test_size=0.25, random_state=SEED, shuffle=True)
print('train', df_train.shape[0])
print('test', df_test.shape[0])

train 50155
test 16719
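The split above combines `stratify=y` with a fixed `random_state`. As a rough illustration of what those two options buy, the sketch below does a seeded, per-class split in plain Python. It is not how scikit-learn implements `train_test_split`; the function name `stratified_split` and the toy labels are assumptions made for this example only.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size, seed):
    """Split indices so each class keeps roughly the same share in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # a local, seeded generator keeps the split reproducible
    train, test = [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_size)
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)


# Toy labels (hypothetical) with a dominant "O" class.
labels = ["O"] * 80 + ["B-geo"] * 12 + ["B-per"] * 8
train_idx, test_idx = stratified_split(labels, test_size=0.25, seed=1338)

print(len(train_idx), len(test_idx))
# The same seed gives the same split on every call.
print(stratified_split(labels, 0.25, 1338) == (train_idx, test_idx))
```

Without stratification, a rare class such as `I-nat` could end up entirely in one part of the split, which would make the macro-averaged score on the other part misleading.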


In [0]:
# some wrappers to work with word2vec
from gensim.models.word2vec import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.base import TransformerMixin
from collections import defaultdict

   
class Word2VecWrapper(TransformerMixin):
    def __init__(self, window=5,negative=5, size=100, iter=100, is_cbow=False, random_state=SEED):
        self.window_ = window
        self.negative_ = negative
        self.size_ = size
        self.iter_ = iter
        self.is_cbow_ = is_cbow
        self.w2v = None
        self.random_state = random_state
        
    def get_size(self):
        return self.size_

    def fit(self, X, y=None):
        """
        X: list of strings
        """
        sentences_list = [x.split() for x in X]
        self.w2v = Word2Vec(sentences_list, 
                            window=self.window_,
                            negative=self.negative_, 
                            size=self.size_, 
                            iter=self.iter_,
                            sg=not self.is_cbow_, seed=self.random_state)

        return self
    
    def has(self, word):
        return word in self.w2v

    def transform(self, X):
        """
        X: a list of words
        """
        if self.w2v is None:
            raise Exception('model not fitted')
        return np.array([self.w2v[w] if w in self.w2v else np.zeros(self.size_) for w in X ])
    


In [10]:
%%time
# here we exploit that word2vec is an unsupervised learning algorithm
# so we can train it on the whole dataset (subject to discussion)

sentences_list = [x.strip() for x in ' '.join(df.word).split('.')]

w2v_cbow = Word2VecWrapper(window=5, negative=5, size=300, iter=300, is_cbow=True, random_state=SEED)
w2v_cbow.fit(sentences_list)

CPU times: user 41.7 s, sys: 388 ms, total: 42.1 s
Wall time: 23.6 s
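One detail of the wrapper worth spelling out: `transform` returns an all-zero vector for any word the model does not know. That would be the case for padding markers such as `__START1__` if they never occur in the `word` column, and for words dropped by gensim's default `min_count` of 5. The sketch below mimics that lookup with a plain dictionary; `embed_words` and the toy vectors are hypothetical and only illustrate the fallback.

```python
def embed_words(words, vectors, dim):
    """Look up each word; unknown words get an all-zero vector of length dim."""
    zero = [0.0] * dim
    return [list(vectors.get(word, zero)) for word in words]


# Toy 3-dimensional "embeddings" (made-up numbers).
toy_vectors = {
    "London": [0.2, -0.1, 0.7],
    "marched": [0.5, 0.3, -0.4],
}

rows = embed_words(["London", "__START1__", "marched"], toy_vectors, dim=3)
for row in rows:
    print(row)
```

Because every unknown word maps to the same zero vector, the downstream classifier cannot tell a sentence boundary from a rare word using the embedding alone; the POS features have to carry that information.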


In [11]:
%%time
# baseline 1 
# random labels
from sklearn.preprocessing import OneHotEncoder
from sklearn.dummy import DummyClassifier


columns = ['pos','next-pos','next-next-pos','prev-pos','prev-prev-pos']

model = Pipeline([
    ('enc', OneHotEncoder()),
    ('est', DummyClassifier(random_state=SEED)),
])

model.fit(df_train[columns], y_train)

print('train', metrics.f1_score(y_train, model.predict(df_train[columns]), average='macro'))
print('test', metrics.f1_score(y_test, model.predict(df_test[columns]), average='macro'))


train 0.058085999075392906
test 0.05733334797725607
CPU times: user 117 ms, sys: 13 ms, total: 130 ms
Wall time: 139 ms


In [12]:
%%time
# baseline 2 
# pos features + one hot encoding + logistic regression
from sklearn.preprocessing import OneHotEncoder


columns = ['pos','next-pos','next-next-pos','prev-pos','prev-prev-pos']

model = Pipeline([
    ('enc', OneHotEncoder()),
    ('est', LogisticRegressionCV(Cs=5, cv=5, n_jobs=-1, scoring='f1_macro', 
                             penalty='l2', solver='newton-cg', multi_class='multinomial', random_state=SEED)),
])

model.fit(df_train[columns], y_train)

print('train', metrics.f1_score(y_train, model.predict(df_train[columns]), average='macro'))
print('test', metrics.f1_score(y_test, model.predict(df_test[columns]), average='macro'))

train 0.462228322587577
test 0.3869903611637815
CPU times: user 21.4 s, sys: 189 ms, total: 21.6 s
Wall time: 17min 31s


In [13]:
%%time
# baseline 3
# use word2vec cbow embedding + baseline 2 + svm
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.svm import LinearSVC
import scipy.sparse as sp

embeding = w2v_cbow
encoder_pos = OneHotEncoder()
X_train = sp.hstack([
    embeding.transform(df_train.word),
    embeding.transform(df_train['next-word']),
    embeding.transform(df_train['next-next-word']),
    embeding.transform(df_train['prev-word']),
    embeding.transform(df_train['prev-prev-word']),
    encoder_pos.fit_transform(df_train[['pos','next-pos','next-next-pos','prev-pos','prev-prev-pos']])
])
X_test = sp.hstack([
    embeding.transform(df_test.word),
    embeding.transform(df_test['next-word']),
    embeding.transform(df_test['next-next-word']),
    embeding.transform(df_test['prev-word']),
    embeding.transform(df_test['prev-prev-word']),
    encoder_pos.transform(df_test[['pos','next-pos','next-next-pos','prev-pos','prev-prev-pos']])
])

model = model_selection.GridSearchCV(LinearSVC(penalty='l2', multi_class='ovr', random_state=SEED), 
                                    {'C': np.logspace(-4, 0, 5)}, 
                                    cv=3, scoring='f1_macro', n_jobs=-1, verbose=1)
model.fit(X_train, y_train)

print('train', metrics.f1_score(y_train, model.predict(X_train), average='macro'))
print('test', metrics.f1_score(y_test, model.predict(X_test), average='macro'))

Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed: 14.2min finished


train 0.954575529712029
test 0.8269826514181652
CPU times: user 2min 40s, sys: 1.38 s, total: 2min 42s
Wall time: 16min 53s


# And now my code

In [0]:
columns = ['pos','next-pos','next-next-pos','prev-pos','prev-prev-pos']

In [15]:
%%time 
#simple OneHotEncoder + RandomForest
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder
import scipy.sparse as sp
from sklearn.model_selection import  GridSearchCV

#df_train, df_test, y_train, y_test = model_selection.train_test_split(df[columns], y, stratify=y, 
#
#                                                                      test_size=0.25, random_state=SEED, shuffle=True)
hyper_params = { 
    'n_estimators': [100, 125],
    'max_depth' : [12, 18],
    'criterion': ['gini', 'entropy'],
    'class_weight' :[None, 'balanced']
}

model = Pipeline([
    ('enc', OneHotEncoder()),
    ('clf', GridSearchCV(RandomForestClassifier(random_state=SEED), 
                         param_grid=hyper_params, 
                         scoring='f1_macro',
                         verbose=1,
                         n_jobs=-1))
])

model.fit(df_train[columns], y_train)

print('train', metrics.f1_score(y_train, model.predict(df_train[columns]), average='macro'))
print('test', metrics.f1_score(y_test, model.predict(df_test[columns]), average='macro'))

Fitting 3 folds for each of 16 candidates, totalling 48 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  48 out of  48 | elapsed:  2.5min finished


train 0.6133386284740426
test 0.5158293900073481
CPU times: user 9.49 s, sys: 156 ms, total: 9.65 s
Wall time: 2min 36s


In [16]:
best_params = model['clf'].best_params_
best_params

{'class_weight': 'balanced',
 'criterion': 'entropy',
 'max_depth': 18,
 'n_estimators': 100}
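The log line "Fitting 3 folds for each of 16 candidates, totalling 48 fits" follows directly from the size of the grid. The sketch below enumerates the same `hyper_params` dictionary with `itertools.product`; it only reproduces the counting and is not a substitute for `GridSearchCV`.

```python
from itertools import product

hyper_params = {
    "n_estimators": [100, 125],
    "max_depth": [12, 18],
    "criterion": ["gini", "entropy"],
    "class_weight": [None, "balanced"],
}
n_folds = 3  # the number of folds reported in the log above

# Cartesian product of all value lists: one dict per candidate configuration.
names = list(hyper_params)
candidates = [dict(zip(names, values)) for values in product(*hyper_params.values())]
total_fits = len(candidates) * n_folds

print(len(candidates), "candidates,", total_fits, "fits")
```

Each extra value in any list multiplies the number of fits, which is why the later cells restrict the grid when the feature matrix includes the embeddings.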

In [17]:
%%time
#OneHotEncoder + w2v_cbow + RF
import scipy.sparse as sp

df_train, df_test, y_train, y_test = model_selection.train_test_split(df, y, stratify=y, 
                                                                     test_size=0.25, random_state=SEED, shuffle=True)

embeding = w2v_cbow
encoder_pos = OneHotEncoder()
X_train = sp.hstack([
    embeding.transform(df_train.word),
    embeding.transform(df_train['next-word']),
    embeding.transform(df_train['next-next-word']),
    embeding.transform(df_train['prev-word']),
    embeding.transform(df_train['prev-prev-word']),
    encoder_pos.fit_transform(df_train[['pos','next-pos','next-next-pos','prev-pos','prev-prev-pos']])
])

X_test = sp.hstack([
    embeding.transform(df_test.word),
    embeding.transform(df_test['next-word']),
    embeding.transform(df_test['next-next-word']),
    embeding.transform(df_test['prev-word']),
    embeding.transform(df_test['prev-prev-word']),
    encoder_pos.transform(df_test[['pos','next-pos','next-next-pos','prev-pos','prev-prev-pos']])
])

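# Reuse the best RF hyperparameters from the previous step; running GridSearchCV on these features takes too long.
# Note: random_state is not passed here, so the scores below can vary slightly between runs.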
model = RandomForestClassifier(n_estimators=best_params['n_estimators'], 
                               max_depth = best_params['max_depth'],
                               criterion=best_params['criterion'],
                               class_weight = best_params['class_weight'])
model.fit(X_train, y_train)

print('train', metrics.f1_score(y_train, model.predict(X_train), average='macro'))
print('test', metrics.f1_score(y_test, model.predict(X_test), average='macro'))

train 0.9569947337927774
test 0.8447522167790877
CPU times: user 9min 8s, sys: 661 ms, total: 9min 9s
Wall time: 9min 9s


## Some new features

We add the following features to each token:

1) Capitalisation type (1 - Word; 2 - word; 3 - WORD; 4 - other) - category

2) Presence/absence of digits - bool

3) Presence/absence of special symbols - bool

4) Length in characters - int

5) Whether the token is the first word in its sentence - bool
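To make features 1-4 from the list above concrete, here is a small standalone sketch that computes them for a handful of sample tokens. The function `token_features` and the sample tokens are hypothetical; the category names `Capit`, `capit`, `CAPIT` and `other` follow the ones used later in the notebook.

```python
import string


def token_features(word):
    """Surface features for a single token: capitalisation, digits, symbols, length."""
    if word.istitle():
        cap = "Capit"   # Word
    elif word.islower():
        cap = "capit"   # word
    elif word.isupper():
        cap = "CAPIT"   # WORD
    else:
        cap = "other"   # mixed case or no cased characters at all
    return {
        "capitalisation": cap,
        "digits": any(ch in string.digits for ch in word),
        "symbols": any(ch in string.punctuation for ch in word),
        "length_word": len(word),
    }


for token in ["London", "marched", "NATO", "iPhone", "2004", "U.S."]:
    print(token, token_features(token))
```

Tokens without any cased characters, such as `2004`, fall into the `other` category, so the digit flag is what distinguishes them from mixed-case words.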

In [0]:
df.insert(2, 'capitalisation', 'other')
df.insert(3, 'digits',  False)
df.insert(4, 'symbols', False)
df.insert(5, 'length_word', 0)
df.insert(6, 'position', False)

In [19]:
df.head()

Unnamed: 0,sentence_idx,next-next-pos,capitalisation,digits,symbols,length_word,position,next-next-word,next-pos,next-word,pos,prev-pos,prev-prev-pos,prev-prev-word,prev-word,word,tag,length
0,1.0,18,other,False,False,0,False,demonstrators,9,of,18,39,40,__START2__,__START1__,Thousands,O,48
1,1.0,33,other,False,False,0,False,have,18,demonstrators,9,18,39,__START1__,Thousands,of,O,48
2,1.0,32,other,False,False,0,False,marched,33,have,18,9,18,Thousands,of,demonstrators,O,48
3,1.0,9,other,False,False,0,False,through,32,marched,33,18,9,of,demonstrators,have,O,48
4,1.0,16,other,False,False,0,False,London,9,through,32,33,18,demonstrators,have,marched,O,48


In [20]:
# Cast some columns to the categorical dtype (astype returns a copy; df itself is not modified here).
df.astype({'pos': 'category', 'capitalisation':'category', 'tag':'category'}).dtypes

sentence_idx       float64
next-next-pos        int64
capitalisation    category
digits                bool
symbols               bool
length_word          int64
position              bool
next-next-word      object
next-pos             int64
next-word           object
pos               category
prev-pos             int64
prev-prev-pos        int64
prev-prev-word      object
prev-word           object
word                object
tag               category
length               int64
dtype: object

In [0]:
import string

def find_digits(word):
  '''Return True if the token contains at least one digit.'''
  return bool(set(string.digits) & set(word))

def find_cap(word):
  '''Return the capitalisation type of the token.'''
  if word.istitle():
    return 'Capit'
  elif word.islower():
    return 'capit'
  elif word.isupper():
    return 'CAPIT'
  return 'other'

def find_symbols(word):
  '''Return True if the token contains any punctuation character.'''
  return bool(set(string.punctuation) & set(word))
  
def first_position(sentence_id, word_id): 
  """Return True if the token is the first one in its sentence."""
  # Relies on the global df having the default 0..n-1 index.
  if word_id == 0:
    return True
  return bool(df.sentence_idx[word_id-1] != sentence_id)

In [0]:
for i, row in df.iterrows():
  digit = find_digits(row.word)  # contains digits or not
  df.at[i, 'digits'] = digit
    
  length = len(row.word)  # length in characters
  df.at[i,'length_word'] = length
    
  cap = find_cap(row.word)  # capitalisation type
  df.at[i, 'capitalisation'] = cap
    
  sym = find_symbols(row.word)  # contains special symbols or not
  df.at[i, 'symbols'] = sym
    
  first = first_position(row.sentence_idx, i)  # first token in the sentence or not
  df.at[i, 'position'] = first

In [23]:
df.head()

Unnamed: 0,sentence_idx,next-next-pos,capitalisation,digits,symbols,length_word,position,next-next-word,next-pos,next-word,pos,prev-pos,prev-prev-pos,prev-prev-word,prev-word,word,tag,length
0,1.0,18,Capit,False,False,9,True,demonstrators,9,of,18,39,40,__START2__,__START1__,Thousands,O,48
1,1.0,33,capit,False,False,2,False,have,18,demonstrators,9,18,39,__START1__,Thousands,of,O,48
2,1.0,32,capit,False,False,13,False,marched,33,have,18,9,18,Thousands,of,demonstrators,O,48
3,1.0,9,capit,False,False,4,False,through,32,marched,33,18,9,of,demonstrators,have,O,48
4,1.0,16,capit,False,False,7,False,London,9,through,32,33,18,demonstrators,have,marched,O,48
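The `position` flag depends only on whether `sentence_idx` changes between consecutive rows, so it can also be computed in a single pass over the id column without looking anything up in a global dataframe. The sketch below shows that idea on a toy list of ids; `sentence_start_flags` is a hypothetical helper, not part of the notebook.

```python
def sentence_start_flags(sentence_ids):
    """Return True where a token's sentence id differs from the previous token's id."""
    flags = []
    previous = object()  # sentinel that never equals a real id
    for current in sentence_ids:
        flags.append(current != previous)
        previous = current
    return flags


# Toy ids (made up): three sentences of length 3, 2 and 1.
ids = [1.0, 1.0, 1.0, 2.0, 2.0, 3.0]
print(sentence_start_flags(ids))
```

This assumes the rows are ordered by sentence, as they are in the table above.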


In [25]:
%%time
# OneHotEncoder + POS features + my features + w2v_cbow + RandomForest with best_params
df_train, df_test, y_train, y_test = model_selection.train_test_split(df, y, stratify=y, 
                                                                      test_size=0.25, random_state=SEED, shuffle=True)
embeding = w2v_cbow
encoder_pos = OneHotEncoder()
X_train = sp.hstack([
    embeding.transform(df_train.word),
    embeding.transform(df_train['next-word']),
    embeding.transform(df_train['next-next-word']),
    embeding.transform(df_train['prev-word']),
    embeding.transform(df_train['prev-prev-word']),
    encoder_pos.fit_transform(df_train[['capitalisation', 'pos','next-pos','next-next-pos','prev-pos','prev-prev-pos']])
])

X_test = sp.hstack([
    embeding.transform(df_test.word),
    embeding.transform(df_test['next-word']),
    embeding.transform(df_test['next-next-word']),
    embeding.transform(df_test['prev-word']),
    embeding.transform(df_test['prev-prev-word']),
    encoder_pos.transform(df_test[['capitalisation', 'pos','next-pos','next-next-pos','prev-pos','prev-prev-pos']])
])

model = RandomForestClassifier(n_estimators=best_params['n_estimators'], 
                               max_depth = best_params['max_depth'],
                               criterion=best_params['criterion'],
                               class_weight = best_params['class_weight'])
model.fit(X_train, y_train)

print('train', metrics.f1_score(y_train, model.predict(X_train), average='macro'))
print('test', metrics.f1_score(y_test, model.predict(X_test), average='macro'))

train 0.977223528588559
test 0.8589386985821288
CPU times: user 9min 2s, sys: 239 ms, total: 9min 3s
Wall time: 9min 3s


All baselines are beaten!

Now let's try once more with GridSearchCV (just out of curiosity).

In [0]:
def model_pipline(model, X, y, hyper_params, SEED=SEED): #This is from my homework2
  '''
  Runs the pipeline from the seminar:
  fit without tuning first, then tune hyperparameters with cross-validation.
  GridSearchCV selects the best variant of the model.
  
  Returns a list of f1_macro scores:
  train score before tuning, best cross-validation score, score on the test data.
  Also returns the best hyperparameters and the best model.
  '''
  f1_scores = []

  X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, stratify=y, 
                                                                      test_size=0.25, random_state=SEED, shuffle=True)

  model.fit(X_train, y_train)
  f1_scores.append(metrics.f1_score(y_train, model.predict(X_train), average='macro'))

  print('Simple fit done.')
  grid_search = GridSearchCV(model, cv=3, param_grid=hyper_params, scoring='f1_macro', verbose=1, n_jobs=-1)
  grid_search.fit(X_train, y_train)
  print('GridSearchCV done.')
  f1_scores.append(grid_search.best_score_)
  
  model = grid_search.best_estimator_
  f1_scores.append(metrics.f1_score(y_test, model.predict(X_test), average='macro'))

  best_parms = grid_search.best_params_

  return f1_scores, best_parms, model

In [0]:
embeding = w2v_cbow
encoder_pos = OneHotEncoder()
X = sp.hstack([
    embeding.transform(df.word),
    embeding.transform(df['next-word']),
    embeding.transform(df['next-next-word']),
    embeding.transform(df['prev-word']),
    embeding.transform(df['prev-prev-word']),
    encoder_pos.fit_transform(df[['capitalisation', 'pos','next-pos','next-next-pos','prev-pos','prev-prev-pos']])
])

In [32]:
%%time
# For the record, this cell took about an hour to run on my machine :)
hyper_params = {'n_estimators' : [100, 300]} #I don't use more params because it is time consuming
scores, best_prms, best_estimator = model_pipline(RandomForestClassifier(), X=X, y=y, hyper_params=hyper_params, SEED=SEED)
print('Train f1_score {}; valid f1_score {}; test f1_score {}'.format(scores[0], scores[1], scores[2]))
print('Best hyperparams: {}'.format(best_prms))

Simple fit done.
Fitting 3 folds for each of 2 candidates, totalling 6 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   6 out of   6 | elapsed: 45.5min finished


GridSearchCV done.
Train f1_score 0.9905000160691169; valid f1_score 0.7759169058979338; test f1_score 0.9091273092985467
Best hyperparams: {'n_estimators': 300}
CPU times: user 27min 42s, sys: 1.72 s, total: 27min 43s
Wall time: 1h 13min 11s
