# Assignment 4: Named entity recognition

Create a model for Named Entity Recognition on the CoNLL 2002 dataset.  
Your quality metric = f1_macro

In your solution you should use: RandomForest, Gradient Boosting (xgboost, lightgbm, catboost)   
Tutorials:  
1. https://github.com/Microsoft/LightGBM/tree/master/examples/python-guide
1. https://github.com/catboost/tutorials 

The more baselines you beat, the better your score.
 
baseline 1 [3 points]: 0.0604      random labels  
baseline 2 [5 points]: 0.3966      PoS features + logistic regression  
baseline 3 [8 points]: 0.8122      word2vec cbow embedding + baseline 2 + svm    

[1 point] using feature engineering (creating features not presented in the baselines)

! Your results must be reproducible. You should explicitly set all seeds (random_state) in your model.  
! Remember to use a proper training pipeline.  

bonus, think about:  
1. [1 point] Why did we select f1 score with macro averaging as our classification quality measure? What other metrics are suitable?   

In [0]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn import model_selection
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import LabelEncoder
from sklearn import metrics

import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

SEED=1337
np.random.seed(SEED)

In [2]:
df = pd.read_csv('ner_short.csv', index_col=0)
df.head(n=20)

Unnamed: 0,next-next-pos,next-next-word,next-pos,next-word,pos,prev-pos,prev-prev-pos,prev-prev-word,prev-word,sentence_idx,word,tag
0,NNS,demonstrators,IN,of,NNS,__START1__,__START2__,__START2__,__START1__,1.0,Thousands,O
1,VBP,have,NNS,demonstrators,IN,NNS,__START1__,__START1__,Thousands,1.0,of,O
2,VBN,marched,VBP,have,NNS,IN,NNS,Thousands,of,1.0,demonstrators,O
3,IN,through,VBN,marched,VBP,NNS,IN,of,demonstrators,1.0,have,O
4,NNP,London,IN,through,VBN,VBP,NNS,demonstrators,have,1.0,marched,O
5,TO,to,NNP,London,IN,VBN,VBP,have,marched,1.0,through,O
6,VB,protest,TO,to,NNP,IN,VBN,marched,through,1.0,London,B-geo
7,DT,the,VB,protest,TO,NNP,IN,through,London,1.0,to,O
8,NN,war,DT,the,VB,TO,NNP,London,to,1.0,protest,O
9,IN,in,NN,war,DT,VB,TO,to,protest,1.0,the,O


In [3]:
# number of sentences
df.sentence_idx.max()

1500.0

In [4]:
# class distribution
df.tag.value_counts(normalize=True)

O        0.852828
B-geo    0.027604
B-gpe    0.020935
B-org    0.020247
I-per    0.017795
B-tim    0.016927
B-per    0.015312
I-org    0.013937
I-geo    0.005383
I-tim    0.004247
B-art    0.001376
I-gpe    0.000837
I-art    0.000748
B-eve    0.000628
I-eve    0.000508
B-nat    0.000449
I-nat    0.000239
Name: tag, dtype: float64

In [5]:
np.unique(df['tag'])

array(['B-art', 'B-eve', 'B-geo', 'B-gpe', 'B-nat', 'B-org', 'B-per',
       'B-tim', 'I-art', 'I-eve', 'I-geo', 'I-gpe', 'I-nat', 'I-org',
       'I-per', 'I-tim', 'O'], dtype=object)

In [0]:
# sentence length
tdf = df.set_index('sentence_idx')
tdf['length'] = df.groupby('sentence_idx').tag.count()
df = tdf.reset_index(drop=False)

In [0]:
# encode categorical variables

le = LabelEncoder()
df['pos'] = le.fit_transform(df.pos)
df['next-pos'] = le.fit_transform(df['next-pos'])
df['next-next-pos'] = le.fit_transform(df['next-next-pos'])
df['prev-pos'] = le.fit_transform(df['prev-pos'])
df['prev-prev-pos'] = le.fit_transform(df['prev-prev-pos'])

In [8]:
df.head()

Unnamed: 0,sentence_idx,next-next-pos,next-next-word,next-pos,next-word,pos,prev-pos,prev-prev-pos,prev-prev-word,prev-word,word,tag,length
0,1.0,18,demonstrators,9,of,18,39,40,__START2__,__START1__,Thousands,O,48
1,1.0,33,have,18,demonstrators,9,18,39,__START1__,Thousands,of,O,48
2,1.0,32,marched,33,have,18,9,18,Thousands,of,demonstrators,O,48
3,1.0,9,through,32,marched,33,18,9,of,demonstrators,have,O,48
4,1.0,16,London,9,through,32,33,18,demonstrators,have,marched,O,48


In [9]:
# splitting
y = LabelEncoder().fit_transform(df.tag)

df_train, df_test, y_train, y_test = model_selection.train_test_split(
    df, y, stratify=y, test_size=0.25, random_state=SEED, shuffle=True
)
print('train', df_train.shape[0])
print('test', df_test.shape[0])

train 50155
test 16719
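
The split above shuffles individual tokens, so tokens from one sentence (and their overlapping context windows) can land in both train and test. A minimal sketch of a sentence-level alternative follows. It assumes only a list of `sentence_idx` values, one per token; `split_by_sentence` is a hypothetical helper that is not part of the original notebook, and unlike the split above it does not stratify by tag.

```python
import random


def split_by_sentence(sentence_ids, test_size=0.25, seed=1337):
    """Return (train_rows, test_rows): row indices split so that every
    sentence lands entirely in one part. Hypothetical helper."""
    unique_ids = sorted(set(sentence_ids))
    rng = random.Random(seed)  # local RNG keeps the split reproducible
    rng.shuffle(unique_ids)
    n_test = max(1, round(len(unique_ids) * test_size))
    test_ids = set(unique_ids[:n_test])
    train_rows = [i for i, s in enumerate(sentence_ids) if s not in test_ids]
    test_rows = [i for i, s in enumerate(sentence_ids) if s in test_ids]
    return train_rows, test_rows


# toy data: 4 sentences with 3, 2, 4 and 1 tokens
sentence_ids = [1.0, 1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 3.0, 4.0]
train_rows, test_rows = split_by_sentence(sentence_ids)
```

With a DataFrame the row indices could be applied as `df.iloc[train_rows]` and `df.iloc[test_rows]`. scikit-learn's `GroupShuffleSplit` offers the same kind of group-aware split.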


In [0]:
# some wrappers to work with word2vec
from gensim.models.word2vec import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.base import TransformerMixin
from collections import defaultdict

   
class Word2VecWrapper(TransformerMixin):
    def __init__(self, window=5, negative=5, size=100, iter=100, is_cbow=False, random_state=SEED):
        self.window_ = window
        self.negative_ = negative
        self.size_ = size
        self.iter_ = iter
        self.is_cbow_ = is_cbow
        self.w2v = None
        self.random_state = random_state
        
    def get_size(self):
        return self.size_

    def fit(self, X, y=None):
        """
        X: list of strings
        """
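        sentences_list = [x.split() for x in X]
        # NOTE: this is the gensim < 4.0 API; in gensim >= 4.0 the arguments are
        # vector_size= and epochs=, and word lookups go through self.w2v.wv
        self.w2v = Word2Vec(sentences_list, 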
                            window=self.window_,
                            negative=self.negative_, 
                            size=self.size_, 
                            iter=self.iter_,
                            sg=not self.is_cbow_, seed=self.random_state)

        return self
    
    def has(self, word):
        return word in self.w2v

    def transform(self, X):
        """
        X: a list of words
        """
        if self.w2v is None:
            raise Exception('model not fitted')
        return np.array([self.w2v[w] if w in self.w2v else np.zeros(self.size_) for w in X ])

In [11]:
%%time
# here we exploit that word2vec is an unsupervised learning algorithm
# so we can train it on the whole dataset (subject to discussion)

sentences_list = [x.strip() for x in ' '.join(df.word).split('.')]

w2v_cbow = Word2VecWrapper(window=5, negative=5, size=300, iter=300, is_cbow=True, random_state=SEED)
w2v_cbow.fit(sentences_list)

CPU times: user 44.7 s, sys: 459 ms, total: 45.1 s
Wall time: 25.2 s


In [12]:
%%time
# baseline 1 
# random labels
from sklearn.preprocessing import OneHotEncoder
from sklearn.dummy import DummyClassifier


columns = ['pos','next-pos','next-next-pos','prev-pos','prev-prev-pos']

model = Pipeline([
    ('enc', OneHotEncoder()),
    ('est', DummyClassifier(random_state=SEED)),
])

model.fit(df_train[columns], y_train)

print('train', metrics.f1_score(y_train, model.predict(df_train[columns]), average='macro'))
print('test', metrics.f1_score(y_test, model.predict(df_test[columns]), average='macro'))


train 0.05887736725599869
test 0.060439542712750365
CPU times: user 130 ms, sys: 14 ms, total: 144 ms
Wall time: 144 ms


In [0]:
%%time
# baseline 2 
# pos features + one hot encoding + logistic regression
from sklearn.preprocessing import OneHotEncoder


columns = ['pos','next-pos','next-next-pos','prev-pos','prev-prev-pos']

model = Pipeline([
    ('enc', OneHotEncoder()),
    ('est', LogisticRegressionCV(Cs=5, cv=5, n_jobs=-1, scoring='f1_macro', 
                             penalty='l2', solver='newton-cg', multi_class='multinomial', random_state=SEED)),
])

model.fit(df_train[columns], y_train)

print('train', metrics.f1_score(y_train, model.predict(df_train[columns]), average='macro'))
print('test', metrics.f1_score(y_test, model.predict(df_test[columns]), average='macro'))

In [0]:
%%time
# baseline 3
# use word2vec cbow embedding + baseline 2 + svm
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.svm import LinearSVC
import scipy.sparse as sp

embeding = w2v_cbow
encoder_pos = OneHotEncoder()
X_train = sp.hstack([
    embeding.transform(df_train.word),
    embeding.transform(df_train['next-word']),
    embeding.transform(df_train['next-next-word']),
    embeding.transform(df_train['prev-word']),
    embeding.transform(df_train['prev-prev-word']),
    encoder_pos.fit_transform(df_train[['pos','next-pos','next-next-pos','prev-pos','prev-prev-pos']])
])
X_test = sp.hstack([
    embeding.transform(df_test.word),
    embeding.transform(df_test['next-word']),
    embeding.transform(df_test['next-next-word']),
    embeding.transform(df_test['prev-word']),
    embeding.transform(df_test['prev-prev-word']),
    encoder_pos.transform(df_test[['pos','next-pos','next-next-pos','prev-pos','prev-prev-pos']])
])

model = model_selection.GridSearchCV(LinearSVC(penalty='l2', multi_class='ovr', random_state=SEED), 
                                    {'C': np.logspace(-4, 0, 5)}, 
                                    cv=3, scoring='f1_macro', n_jobs=-1, verbose=1)
model.fit(X_train, y_train)

print('train', metrics.f1_score(y_train, model.predict(X_train), average='macro'))
print('test', metrics.f1_score(y_test, model.predict(X_test), average='macro'))

Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:  5.0min finished


train 0.95846030477073
test 0.8122113864662406
CPU times: user 1min 56s, sys: 5.51 s, total: 2min 1s
Wall time: 7min 2s


###### Beating the baselines

First, we lemmatize the words and train word2vec on the lemmas: the vocabulary shrinks, the specific word form seems to matter little for NER, and the vectors should be more accurate.

Second, we add a capitalization feature (does the word start with an uppercase letter?) for every word in the context window. The feature is set only if the word is not \__START1\__ or \__START2\__ and, for the current word, only if the preceding word is not \__START1\__, i.e. the word is not sentence-initial. The presumably few cases where the first word of a sentence is a named entity can be neglected, and a little noise should not hurt much. Named entities mostly start with a capital letter, but not every capitalized word is a named entity, so we add the feature for the context words as well, since there are probably regularities there.
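
The capitalization feature described above can be sketched on a plain token list. This is an illustration under assumptions, not the notebook's pandas code: `capitalization_features` and `is_capitalized` are hypothetical helpers, and sentences are padded with the dataset's `__START1__`/`__START2__` markers on the left and, as an assumption, `__END1__`/`__END2__` on the right.

```python
PAD_LEFT = ['__START2__', '__START1__']
PAD_RIGHT = ['__END1__', '__END2__']  # assumed right-padding markers


def is_capitalized(token):
    """Return 1 if the token is a real word starting with an uppercase letter."""
    if token in PAD_LEFT or token in PAD_RIGHT:
        return 0
    return int(token[:1].isupper())


def capitalization_features(tokens):
    """For each token return flags for the window
    (prev-prev, prev, current, next, next-next).

    The sentence-initial word gets 0 for itself, because its capital
    letter comes from its position rather than from being a name."""
    padded = PAD_LEFT + tokens + PAD_RIGHT
    rows = []
    for i in range(2, len(padded) - 2):
        flags = [is_capitalized(t) for t in padded[i - 2:i + 3]]
        if i == 2:  # first word of the sentence
            flags[2] = 0
        rows.append(flags)
    return rows


sentence = ['Thousands', 'marched', 'through', 'London', 'to', 'protest']
features = capitalization_features(sentence)
```

Here `features[3]` describes `London`: only the flag for the current word is set, while the row for `marched` flags both `Thousands` (as a context word) and `London`.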

In [13]:
import nltk
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()

def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

for prefix in ['prev-prev-', 'prev-', '', 'next-', 'next-next-']:
    df[f'{prefix}lemma'] = df[f'{prefix}word'].apply(lambda wordform: lemmatizer.lemmatize(wordform.lower(), pos=get_wordnet_pos(nltk.pos_tag([wordform.lower()])[0][1])))

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [0]:
for prefix in ['prev-prev-', 'prev-', 'next-', 'next-next-']:
    df[prefix + 'uppercase'] = df.apply(lambda row: 1 if row[f'{prefix}word'] not in ['__START1__', '__START2__'] and row[f'{prefix}word'][0].isupper() else 0, axis=1)
df['uppercase'] = df.apply(lambda row: 1 if row['prev-word'] != '__START1__' and row['word'][0].isupper() else 0, axis=1)

In [15]:
df.head(n=20)

Unnamed: 0,sentence_idx,next-next-pos,next-next-word,next-pos,next-word,pos,prev-pos,prev-prev-pos,prev-prev-word,prev-word,word,tag,length,prev-prev-lemma,prev-lemma,lemma,next-lemma,next-next-lemma,prev-prev-uppercase,prev-uppercase,next-uppercase,next-next-uppercase,uppercase
0,1.0,18,demonstrators,9,of,18,39,40,__START2__,__START1__,Thousands,O,48,__start2__,__start1__,thousand,of,demonstrator,0,0,0,0,0
1,1.0,33,have,18,demonstrators,9,18,39,__START1__,Thousands,of,O,48,__start1__,thousand,of,demonstrator,have,0,1,0,0,0
2,1.0,32,marched,33,have,18,9,18,Thousands,of,demonstrators,O,48,thousand,of,demonstrator,have,march,1,0,0,0,0
3,1.0,9,through,32,marched,33,18,9,of,demonstrators,have,O,48,of,demonstrator,have,march,through,0,0,0,0,0
4,1.0,16,London,9,through,32,33,18,demonstrators,have,marched,O,48,demonstrator,have,march,through,london,0,0,0,1,0
5,1.0,28,to,16,London,9,32,33,have,marched,through,O,48,have,march,through,london,to,0,0,1,0,0
6,1.0,29,protest,28,to,16,9,32,marched,through,London,B-geo,48,march,through,london,to,protest,0,0,0,0,1
7,1.0,7,the,29,protest,28,16,9,through,London,to,O,48,through,london,to,protest,the,0,1,0,0,0
8,1.0,15,war,7,the,29,28,16,London,to,protest,O,48,london,to,protest,the,war,1,0,0,0,0
9,1.0,9,in,15,war,7,29,28,to,protest,the,O,48,to,protest,the,war,in,0,0,0,0,0


In [16]:
%%time

sentences_list = [x.strip() for x in ' '.join(df.lemma).split('.')]

w2v_cbow_lemmas = Word2VecWrapper(window=5, negative=5, size=300, iter=300, is_cbow=True, random_state=SEED)
w2v_cbow_lemmas.fit(sentences_list)

CPU times: user 45.2 s, sys: 405 ms, total: 45.6 s
Wall time: 25.4 s


In [17]:
y = LabelEncoder().fit_transform(df.tag)

df_train, df_test, y_train, y_test = model_selection.train_test_split(
    df, y, stratify=y, test_size=0.25, random_state=SEED, shuffle=True
)
print('train', df_train.shape[0])
print('test', df_test.shape[0])

train 50155
test 16719


In [18]:
%%time
# beating baselines 1-2
# use RandomForestClassifier on PoS + capitalization features
from sklearn.ensemble import RandomForestClassifier

columns = [f'{prefix}{feature}'
    for prefix in ['prev-prev-', 'prev-', '', 'next-', 'next-next-']
    for feature in ['pos', 'uppercase']
]

encoder_pos = OneHotEncoder()
encoder_pos.fit(df[columns])
X_train = encoder_pos.transform(df_train[columns])
X_test = encoder_pos.transform(df_test[columns])

# the parameters were chosen by grid search; fitting the whole grid at once takes too long,
# so they were tuned one at a time by uncommenting entries in param_grid
model = model_selection.GridSearchCV(
    estimator=RandomForestClassifier(
        random_state=SEED, max_features=None, verbose=1, criterion='entropy',
        min_samples_leaf=1, min_samples_split=2, n_estimators=500
    ),
    param_grid={
        # 'criterion': ['gini', 'entropy'],
        # 'min_samples_split': [2, 3, 4],
        # 'min_samples_leaf': [1, 2, 3, 4, 5, 6],
     },
     scoring='f1_macro', cv=3, n_jobs=-1, verbose=1, refit=True
)

model.fit(X_train, y_train)
# print(model.best_params_)
print('train', metrics.f1_score(y_train, model.predict(X_train), average='macro'))
print('test', metrics.f1_score(y_test, model.predict(X_test), average='macro'))

Fitting 3 folds for each of 1 candidates, totalling 3 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:  6.0min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 500 out of 500 | elapsed:  4.0min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 500 out of 500 | elapsed:    4.0s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


train 0.7804723072356119
test 0.6426730786726875
CPU times: user 4min 4s, sys: 349 ms, total: 4min 4s
Wall time: 10min 6s


[Parallel(n_jobs=1)]: Done 500 out of 500 | elapsed:    1.5s finished
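
The comments in the cell above say the parameters were tuned one at a time rather than over the full grid. A minimal sketch of such a sequential (coordinate-wise) search is given below. `sequential_search` and `toy_score` are hypothetical: in the notebook the scoring step would be a cross-validated `f1_macro` run, and the toy scores here are made up for illustration.

```python
def sequential_search(score_fn, grid, defaults):
    """Tune parameters one at a time: for each parameter try every
    candidate value while the others stay fixed at their current best."""
    best = dict(defaults)
    best_score = score_fn(best)
    for name, candidates in grid.items():
        for value in candidates:
            trial = dict(best, **{name: value})
            score = score_fn(trial)
            if score > best_score:
                best, best_score = trial, score
    return best, best_score


def toy_score(params):
    """Made-up stand-in for a cross-validated score."""
    score = 0.60
    score += 0.02 if params['criterion'] == 'entropy' else 0.0
    score -= 0.01 * (params['min_samples_split'] - 2)
    score -= 0.005 * (params['min_samples_leaf'] - 1)
    return score


grid = {
    'criterion': ['gini', 'entropy'],
    'min_samples_split': [2, 3, 4],
    'min_samples_leaf': [1, 2, 3, 4, 5, 6],
}
defaults = {'criterion': 'gini', 'min_samples_split': 2, 'min_samples_leaf': 1}
best_params, best_score = sequential_search(toy_score, grid, defaults)
```

For this grid the sequential search needs 12 evaluations (1 + 2 + 3 + 6) instead of the 36 of a full grid. The price is that it can miss combinations that are only good together.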


In [0]:
# CatBoost

# !pip install catboost
# from sklearn.preprocessing import OneHotEncoder
# from catboost import CatBoostClassifier, Pool, cv

# NUM_OF_FEATS = 5
# columns = ['pos', 'next-pos', 'next-next-pos', 'prev-pos', 'prev-prev-pos']
# model = CatBoostClassifier(
#     custom_loss=['TotalF1:average=Macro'],  # CatBoost metric name; calling metrics.f1_score(average='macro') here raises a TypeError
#     random_seed=SEED,
#     logging_level='Silent'
# )

# # cv_params = model.get_params()
# # cv_params.update({
# #     'random_seed': SEED
# # })

# # cv_data = cv(
# #     Pool(X_train, y_train, cat_features=range(NUM_OF_FEATS)), cv_params, plot=True
# # )

# model.fit(
#     df_train[columns], y_train,
#     cat_features=range(NUM_OF_FEATS),
#     eval_set=(df_test[columns], y_test),
#     verbose=True
# )

# print('train', metrics.f1_score(y_train, model.predict(df_train[columns]), average='macro'))
# print('test', metrics.f1_score(y_test, model.predict(df_test[columns]), average='macro'))

In [60]:
# LGBM
# import lightgbm as lgb

# columns = ['pos', 'next-pos', 'next-next-pos', 'prev-pos', 'prev-prev-pos']
# lgb_train = lgb.Dataset(df_train[columns], y_train)
# lgb_eval = lgb.Dataset(df_test[columns], y_test, reference=lgb_train)

# params = {
#     'boosting_type': 'gbdt',
#     'objective': 'multiclass',
#     'num_class': len(np.unique(y)),
#     'metric': 'multi_logloss',
#     'num_leaves': 127,
#     'learning_rate': 0.005,
#     'feature_fraction': 0.6,
#     'bagging_fraction': 0.6,
#     'bagging_freq': 1,
#     'verbose': 50,
#     'seed': SEED
# }

# print('Starting training...')
# # train
# gbm = lgb.train(
#     params,
#     lgb_train,
#     num_boost_round=5000,
#     valid_sets=lgb_eval,
#     early_stopping_rounds=100
# )

# print('Saving model...')
# # save model to file
# gbm.save_model('model.txt')

Starting training...
[1]	valid_0's multi_logloss: 0.736673
Training until validation scores don't improve for 100 rounds.
[2]	valid_0's multi_logloss: 0.732001
[3]	valid_0's multi_logloss: 0.72757
[4]	valid_0's multi_logloss: 0.722838
[5]	valid_0's multi_logloss: 0.717367
[6]	valid_0's multi_logloss: 0.712229
[7]	valid_0's multi_logloss: 0.708249
[8]	valid_0's multi_logloss: 0.703345
[9]	valid_0's multi_logloss: 0.697759
[10]	valid_0's multi_logloss: 0.692245
[11]	valid_0's multi_logloss: 0.68868
[12]	valid_0's multi_logloss: 0.684151
[13]	valid_0's multi_logloss: 0.680346
[14]	valid_0's multi_logloss: 0.674944
[15]	valid_0's multi_logloss: 0.670183
[16]	valid_0's multi_logloss: 0.666086
[17]	valid_0's multi_logloss: 0.662804
[18]	valid_0's multi_logloss: 0.65925
[19]	valid_0's multi_logloss: 0.655896
[20]	valid_0's multi_logloss: 0.651731
[21]	valid_0's multi_logloss: 0.647756
[22]	valid_0's multi_logloss: 0.644395
[23]	valid_0's multi_logloss: 0.640758
[24]	valid_0's multi_logloss: 0

<lightgbm.basic.Booster at 0x7f921fada860>

In [61]:
# train_preds = gbm.predict(df_train[columns])
# predictions = []
# for x in train_preds:
#     predictions.append(np.argmax(x))
# print('train', metrics.f1_score(y_train, predictions, average='macro'))

# test_preds = gbm.predict(df_test[columns])
# predictions = []
# for x in test_preds:
#     predictions.append(np.argmax(x))
# print('test', metrics.f1_score(y_test, predictions, average='macro'))

train 0.7383378668271513
test 0.5966941731442842


In [22]:
# beating baseline 3: RandomForest on lemma word2vec embeddings + PoS + capitalization
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder
import scipy.sparse as sp

embeding = w2v_cbow_lemmas
encoder_pos = OneHotEncoder()
lemma_columns = [
    f'{prefix}lemma'
    for prefix in ['prev-prev-', 'prev-', '', 'next-', 'next-next-']
]

columns = [
    f'{prefix}{feature}'
    for prefix in ['prev-prev-', 'prev-', '', 'next-', 'next-next-']
    for feature in ['pos', 'uppercase']
]
encoder_pos.fit(df[columns])

X_train = sp.hstack([
    embeding.transform(df_train[lemma_column])
    for lemma_column in lemma_columns
] + [encoder_pos.transform(df_train[columns])])
X_test = sp.hstack([
    embeding.transform(df_test[lemma_column])
    for lemma_column in lemma_columns
] + [encoder_pos.transform(df_test[columns])])

model = RandomForestClassifier(
    random_state=SEED, max_features=None, verbose=1, criterion='entropy',
    min_samples_leaf=1, min_samples_split=2, n_estimators=10, n_jobs=-1
)

model.fit(X_train, y_train)

print('train', metrics.f1_score(y_train, model.predict(X_train), average='macro'))
print('test', metrics.f1_score(y_test, model.predict(X_test), average='macro'))

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed: 11.5min finished
[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  10 out of  10 | elapsed:    0.8s finished


train 0.9901597349666811


[Parallel(n_jobs=2)]: Using backend ThreadingBackend with 2 concurrent workers.


test 0.837889956055061


[Parallel(n_jobs=2)]: Done  10 out of  10 | elapsed:    0.3s finished


Test f1_macro 0.8379 > 0.8122 (baseline 3).

In [69]:
# # LGBM
# import lightgbm as lgb
# import scipy.sparse as sp

# embeding = w2v_cbow
# encoder_pos = OneHotEncoder()
# columns = ['pos', 'next-pos', 'next-next-pos', 'prev-pos', 'prev-prev-pos']

# X_train = sp.hstack([
#     embeding.transform(df_train.word),
#     embeding.transform(df_train['next-word']),
#     embeding.transform(df_train['next-next-word']),
#     embeding.transform(df_train['prev-word']),
#     embeding.transform(df_train['prev-prev-word']),
#     encoder_pos.fit_transform(df_train[columns])
# ])
# X_test = sp.hstack([
#     embeding.transform(df_test.word),
#     embeding.transform(df_test['next-word']),
#     embeding.transform(df_test['next-next-word']),
#     embeding.transform(df_test['prev-word']),
#     embeding.transform(df_test['prev-prev-word']),
#     encoder_pos.transform(df_test[columns])
# ])

# lgb_train = lgb.Dataset(X_train, y_train)
# lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

# params = {
#     'boosting_type': 'gbdt',
#     'objective': 'multiclass',
#     'num_class': len(np.unique(y)),
#     'metric': 'multi_logloss',
#     'num_leaves': 127,
#     'learning_rate': 0.01,
#     'feature_fraction': 0.7,
#     'bagging_fraction': 0.6,
#     'bagging_freq': 1,
#     'verbose': 50,
#     'seed': SEED
# }

# print('Starting training...')
# # train
# gbm = lgb.train(
#     params,
#     lgb_train,
#     num_boost_round=100,
#     valid_sets=lgb_eval,
#     early_stopping_rounds=100
# )

# print('Saving model...')
# # save model to file
# gbm.save_model('model.txt')

Starting training...
[1]	valid_0's multi_logloss: 0.716336
Training until validation scores don't improve for 100 rounds.
[2]	valid_0's multi_logloss: 0.693827
[3]	valid_0's multi_logloss: 0.673643
[4]	valid_0's multi_logloss: 0.656817
[5]	valid_0's multi_logloss: 0.641113
[6]	valid_0's multi_logloss: 0.62601
[7]	valid_0's multi_logloss: 0.613716
[8]	valid_0's multi_logloss: 0.601476
[9]	valid_0's multi_logloss: 0.589819
[10]	valid_0's multi_logloss: 0.57886
[11]	valid_0's multi_logloss: 0.568466
[12]	valid_0's multi_logloss: 0.558747
[13]	valid_0's multi_logloss: 0.549919
[14]	valid_0's multi_logloss: 0.540853
[15]	valid_0's multi_logloss: 0.532099
[16]	valid_0's multi_logloss: 0.524481
[17]	valid_0's multi_logloss: 0.516384
[18]	valid_0's multi_logloss: 0.508678
[19]	valid_0's multi_logloss: 0.501272
[20]	valid_0's multi_logloss: 0.494033
[21]	valid_0's multi_logloss: 0.487396
[22]	valid_0's multi_logloss: 0.481031
[23]	valid_0's multi_logloss: 0.474665
[24]	valid_0's multi_logloss: 

<lightgbm.basic.Booster at 0x7f921dd07e10>

In [71]:
# train_preds = gbm.predict(X_train)
# predictions = []
# for x in train_preds:
#     predictions.append(np.argmax(x))
# print('train', metrics.f1_score(y_train, predictions, average='macro'))

# test_preds = gbm.predict(X_test)
# predictions = []
# for x in test_preds:
#     predictions.append(np.argmax(x))
# print('test', metrics.f1_score(y_test, predictions, average='macro'))

train 0.7680029478652147
test 0.39954632571069604


[1 point] Why did we select f1 score with macro averaging as our classification quality measure? What other metrics are suitable?

The F1 score is a more general and representative metric than precision or recall alone, since a high value of one does not guarantee a high value of the other. F1 takes both into account with equal weight (β = 1), so a high F1 score, like ours, means the classifier does well on both precision and recall (though not necessarily equally well on each).

Averaging is needed because this is a multiclass problem rather than a binary one. As the class frequencies above show, the classes are heavily imbalanced, and the test set keeps the same distribution (see stratify=y). With micro averaging, even a constant classifier that always predicts the most frequent class would score high, because F1 is computed over all predictions pooled together rather than per class. Macro averaging instead averages the per-class F1 scores with equal weight, which matters much more in our task: we want to predict not only the absence of a named entity but also its presence in the rare cases, so this metric reflects the quality of the algorithm much better. Weighted averaging is not a good fit either, since predicting the presence of an entity matters to us no less than predicting its absence.

Macro-averaged (per-class) accuracy, precision, or recall could in principle be used as well, but each of them captures only part of the classifier's behavior. Multiclass log-loss can also be used, since its logarithm term is computed for each class separately; note that without class weights it is still dominated by the most frequent class.
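
To make the argument concrete, here is a small pure-Python sketch (not the scikit-learn implementation) that scores a constant "always `O`" predictor on a toy sample where 85% of the tokens are `O`, roughly as in the class distribution above. The label counts are illustrative assumptions, and the helper names are hypothetical.

```python
def f1_per_class(y_true, y_pred, label):
    """F1 for one label treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0  # convention: F1 is 0 when there are no true positives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_per_class(y_true, y_pred, l) for l in labels) / len(labels)


def micro_f1(y_true, y_pred):
    """For single-label multiclass data micro-F1 equals accuracy."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = ['O'] * 85 + ['B-geo'] * 6 + ['B-per'] * 5 + ['B-org'] * 4
y_pred = ['O'] * len(y_true)  # constant predictor: always the majority class

micro = micro_f1(y_true, y_pred)
macro = macro_f1(y_true, y_pred)
print(f'micro F1 = {micro:.3f}, macro F1 = {macro:.3f}')
```

The constant predictor keeps a micro F1 of 0.85, while its macro F1 drops to about 0.23, because three of the four classes contribute an F1 of zero.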