#Named Entity Recognition
The aim is the most effective recognition of proper names based on the "NKJP_org.csv" file.

The method I have adopted is CRF (Conditional Random Fields). Such a model is designed to determine the conditional probability (Y/X), where Y in our case is a specific class of proper name (e.g. "country"), and X will be individual features of a given or neighboring word. Such features will be defined by me.

In [1]:
!pip install sklearn-crfsuite

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sklearn-crfsuite
  Downloading sklearn_crfsuite-0.3.6-py2.py3-none-any.whl (12 kB)
Collecting python-crfsuite>=0.8.3
  Downloading python_crfsuite-0.9.9-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.0/1.0 MB[0m [31m35.0 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: python-crfsuite, sklearn-crfsuite
Successfully installed python-crfsuite-0.9.9 sklearn-crfsuite-0.3.6


In [2]:
!pip install scikit-learn==0.23

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting scikit-learn==0.23
  Downloading scikit_learn-0.23.0-cp38-cp38-manylinux1_x86_64.whl (7.2 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.2/7.2 MB[0m [31m46.6 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: scikit-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 1.0.2
    Uninstalling scikit-learn-1.0.2:
      Successfully uninstalled scikit-learn-1.0.2
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
yellowbrick 1.5 requires scikit-learn>=1.0.0, but you have scikit-learn 0.23.0 which is incompatible.
imbalanced-learn 0.8.1 requires scikit-learn>=0.24, but you have scikit-learn 0.23.0 which is incompatible.[0m[31m
[0mSuccessfully installed scikit-learn-0.23.0


In [3]:
import pandas as pd
import numpy as np
import csv
import sklearn_crfsuite
from sklearn_crfsuite import metrics

In [5]:
#Uploading the file to pandas DataFrame
df = pd.read_csv('NKJP_org.csv', names = ['word', 'tag', 'label1', 'label2'], sep = '\t', quoting = csv.QUOTE_NONE, header = None, skip_blank_lines=False)
df

Unnamed: 0,word,tag,label1,label2
0,Zatrzasnął,praet:sg:m1:perf,,
1,drzwi,subst:pl:acc:n:pt,,
2,od,prep:gen:nwok,,
3,mieszkania,subst:sg:gen:n:ncol,,
4,",",interp,,
...,...,...,...,...
1301532,.,interp,,
1301533,.,interp,,
1301534,.,interp,,
1301535,.,interp,,


In [6]:
df['label1'] = df['label1'].fillna('O') #completing the column with the "O" label in case of null values
df['label'] = df['label2'].fillna(df['label1']) #creating one column with labels

#removing columns 'label1' and 'label2' which are no longer needed
del df['label1']
del df['label2']

df

Unnamed: 0,word,tag,label
0,Zatrzasnął,praet:sg:m1:perf,O
1,drzwi,subst:pl:acc:n:pt,O
2,od,prep:gen:nwok,O
3,mieszkania,subst:sg:gen:n:ncol,O
4,",",interp,O
...,...,...,...
1301532,.,interp,O
1301533,.,interp,O
1301534,.,interp,O
1301535,.,interp,O


In [7]:
#parts of the tags are separated by ":", 
#so it is possible to split a column into several separate columns
df = df.join(df['tag'].str.split(':', expand = True).add_prefix('tag'))
df

Unnamed: 0,word,tag,label,tag0,tag1,tag2,tag3,tag4,tag5,tag6
0,Zatrzasnął,praet:sg:m1:perf,O,praet,sg,m1,perf,,,
1,drzwi,subst:pl:acc:n:pt,O,subst,pl,acc,n,pt,,
2,od,prep:gen:nwok,O,prep,gen,nwok,,,,
3,mieszkania,subst:sg:gen:n:ncol,O,subst,sg,gen,n,ncol,,
4,",",interp,O,interp,,,,,,
...,...,...,...,...,...,...,...,...,...,...
1301532,.,interp,O,interp,,,,,,
1301533,.,interp,O,interp,,,,,,
1301534,.,interp,O,interp,,,,,,
1301535,.,interp,O,interp,,,,,,


In [8]:
df = df.replace({None: ''})
df = df.fillna('')

In [9]:
#Removal of website addresses.
#They confuse the tags - the tag repeats the website address
df.drop(df[df['word'].str.contains(r'html|http|www|.pl')].index, inplace = True)

In [10]:
#In case of ":" the tag was ":interp" instead of "interp"
df.loc[df['tag1'] == 'interp', ['tag0']] = 'interp'
df.loc[df['tag1'] == 'interp', ['tag1']] = ''

#For symbols like ":)))" the tag was specified as ":)))sym" instead of "sym"
df.loc[df['tag1'] == 'sym', ['tag0']] = 'sym'
df.loc[df['tag1'] == 'sym', ['tag1']] = ''

####Label counts

In [11]:
df['label'].value_counts()

O             1233027
forename        13191
surname         12974
orgName         11195
settlement       8381
country          8088
geogName         4518
date             4420
persName         1149
addName           962
region            784
time              547
placeName         377
district          318
bloc              139
Name: label, dtype: int64

####Tag counts

In [12]:
df['tag'].value_counts().head(20)

interp               218929
                      85662
part                  68298
conj                  41999
fin:sg:ter:imperf     30216
subst:sg:gen:f        29373
prep:loc:nwok         27124
subst:sg:nom:m1       23407
subst:sg:gen:m3       23012
adv:pos               21967
subst:sg:nom:f        21893
comp                  20425
prep:gen              17753
prep:acc              17191
adv                   16425
prep:loc              15948
subst:sg:acc:f        15476
dig                   14784
subst:sg:loc:m3       13953
subst:sg:acc:m3       13913
Name: tag, dtype: int64

####The counts of the first parts of the tag

In [13]:
df['tag0'].value_counts().head(20)

subst      326815
interp     223420
adj        120424
prep       115770
            85664
part        68369
fin         59445
praet       53261
adv         42277
conj        41999
comp        20425
inf         19207
dig         14784
ppas        13472
ppron3      13355
ger         11848
brev        11080
num          8527
ppron12      8102
aglt         7598
Name: tag0, dtype: int64

####The counts of the first parts of the tag

In [14]:
df['tag1'].value_counts().head(20)

          483030
sg        469048
pl        165533
loc        43915
gen        33325
acc        25176
pos        21967
inst       13515
perf       12248
imperf     11808
pun         8203
com         2946
npun        2877
nom         2853
dat         2610
sup          939
nwok          38
wok           33
voc            4
subst          2
Name: tag1, dtype: int64

In [15]:
df

Unnamed: 0,word,tag,label,tag0,tag1,tag2,tag3,tag4,tag5,tag6
0,Zatrzasnął,praet:sg:m1:perf,O,praet,sg,m1,perf,,,
1,drzwi,subst:pl:acc:n:pt,O,subst,pl,acc,n,pt,,
2,od,prep:gen:nwok,O,prep,gen,nwok,,,,
3,mieszkania,subst:sg:gen:n:ncol,O,subst,sg,gen,n,ncol,,
4,",",interp,O,interp,,,,,,
...,...,...,...,...,...,...,...,...,...,...
1301532,.,interp,O,interp,,,,,,
1301533,.,interp,O,interp,,,,,,
1301534,.,interp,O,interp,,,,,,
1301535,.,interp,O,interp,,,,,,


In [16]:
#Converting the DataFrame to a list of lists, 
#so that each sentence will be placed in a separate list.
## Sentences in the csv file are separated from each other by a blank line,
#the same as in the DataFrame

df_list = df.values.tolist()
df_list2 = []

for i in range(len(df_list)):
  if i == 0 or (df_list[i][0] != '' and df_list[i-1][0] == ''):
    new_list = []
    new_list.append(df_list[i])
  elif df_list[i][0] != '' and df_list[i-1][0] != '':
    new_list.append(df_list[i])
  elif df_list[i][0] == '':
    df_list2.append(new_list)
  else:
    print('error')

In [17]:
# Division into training and test sets in the proportion of 80-20
import random
random.shuffle(df_list2) 

ratio = int(len(df_list2)/5) # 20%
test = df_list2[:ratio]
train = df_list2[ratio:]

In [18]:
train[0]

[['Kto', 'subst:sg:nom:m1', 'O', 'subst', 'sg', 'nom', 'm1', '', '', ''],
 ['ma', 'fin:sg:ter:imperf', 'O', 'fin', 'sg', 'ter', 'imperf', '', '', ''],
 ['decydować', 'inf:imperf', 'O', 'inf', 'imperf', '', '', '', '', ''],
 ['o', 'prep:loc', 'O', 'prep', 'loc', '', '', '', '', ''],
 ['tym',
  'subst:sg:loc:n:ncol',
  'O',
  'subst',
  'sg',
  'loc',
  'n',
  'ncol',
  '',
  ''],
 [',', 'interp', 'O', 'interp', '', '', '', '', '', ''],
 ['kto', 'subst:sg:nom:m1', 'O', 'subst', 'sg', 'nom', 'm1', '', '', ''],
 ['jest', 'fin:sg:ter:imperf', 'O', 'fin', 'sg', 'ter', 'imperf', '', '', ''],
 ['politykiem',
  'subst:sg:inst:m1',
  'O',
  'subst',
  'sg',
  'inst',
  'm1',
  '',
  '',
  ''],
 ['posierpniowym',
  'adj:sg:inst:m1:pos',
  'O',
  'adj',
  'sg',
  'inst',
  'm1',
  'pos',
  '',
  ''],
 ['?', 'interp', 'O', 'interp', '', '', '', '', '', '']]

In [19]:
test[0]

[['–', 'interp', 'O', 'interp', '', '', '', '', '', ''],
 ['Przypuszczam',
  'fin:sg:pri:imperf',
  'O',
  'fin',
  'sg',
  'pri',
  'imperf',
  '',
  '',
  ''],
 [',', 'interp', 'O', 'interp', '', '', '', '', '', ''],
 ['że', 'comp', 'O', 'comp', '', '', '', '', '', ''],
 ['grzecznie', 'adv:pos', 'O', 'adv', 'pos', '', '', '', '', ''],
 ['.', 'interp', 'O', 'interp', '', '', '', '', '', '']]

In [20]:
print('Training dataset size:', len(train))
print('Test dataset size:', len(test))

Training dataset size: 68542
Test dataset size: 17135


###Solution no. 1

In [21]:
#Function that determines the abbreviated shape of the word.
#It will be one of the features used in the model
def shape(word):
    shape = ""
    for letter in word:
        if letter.isdigit():
            shape = shape + "d"
        elif letter.isalpha():
            if letter.isupper():
                shape = shape + "X"
            else:
                shape = shape + "x"
        else:
            shape = shape + letter
    if len(shape) > 4:
      shape_start = shape[:2]
      shape_end = shape[-2:]
      shape_middle = ''.join(sorted(set(shape[2:-2])))
      shape = shape_start+shape_middle+shape_end
      return shape   
    else:
      return shape

In [22]:
print('shape("Lamp"):', shape('Koło'))
print('shape("augmentation"):', shape('rabarbar'))
print('shape("Boy-Żeleński"):', shape('Boy-Żeleński'))

shape("Lamp"): Xxxx
shape("augmentation"): xxxxx
shape("Boy-Żeleński"): Xx-Xxxx


In [23]:
#Features definition
def word2features(sent, i):
    word = sent[i][0]
    tag = sent[i][1]
    
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[:-2]': word[:-2],
        'word.istitle()': word.istitle(),
        'word.isupper()': word.isupper(),
        'word_shape': shape(word),
        'digit': any([char.isdigit() for char in word]),
        'hyphen': '-' in word,
        'tag': tag
    
    }
    if i > 0:
        word1 = sent[i-1][0]
        tag1 = sent[i-1][1]
        tag_a_1 = sent[i-1][3]
        tag_b_1 = sent[i-1][4]
        tag_c_1 = sent[i-1][5]
        tag_d_1 = sent[i-1][6]
        tag_e_1 = sent[i-1][7]
        tag_f_1 = sent[i-1][8]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word[:-2]': word1[:-2],
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:word_shape': shape(word1),
            '-1:digit': any([char.isdigit() for char in word1]),
            '-1:hyphen': '-' in word1,
            '-1:tag': tag1
        })
    else:
        features['BOS'] = True
        
    if i < len(sent)-1:
        word1 = sent[i+1][0]
        tag1 = sent[i+1][1]
        tag_a_1 = sent[i+1][3]
        tag_b_1 = sent[i+1][4]
        tag_c_1 = sent[i+1][5]
        tag_d_1 = sent[i+1][6]
        tag_e_1 = sent[i+1][7]
        tag_f_1 = sent[i+1][8]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word[:-2]': word1[:-2],
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:word_shape': shape(word1),
            '+1:digit': any([char.isdigit() for char in word1]),
            '+1:hyphen': '-' in word1,
            '+1:tag': tag1
        })
    else:
        features['EOS'] = True
                
    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label, tag1, tag2, tag3, tag4, tag5, tag6, tag7 in sent]

def sent2tokens(sent):
    return [token for token, postag, label, tag1, tag2, tag3, tag4, tag5, tag6, tag7 in sent]

In [24]:
X_train = [sent2features(s) for s in train]
y_train = [sent2labels(s) for s in train]

X_test = [sent2features(s) for s in test]
y_test = [sent2labels(s) for s in test]

In [25]:
X_train[0]

[{'bias': 1.0,
  'word.lower()': 'kto',
  'word[:-2]': 'K',
  'word.istitle()': True,
  'word.isupper()': False,
  'word_shape': 'Xxx',
  'digit': False,
  'hyphen': False,
  'tag': 'subst:sg:nom:m1',
  'BOS': True,
  '+1:word.lower()': 'ma',
  '+1:word[:-2]': '',
  '+1:word.istitle()': False,
  '+1:word.isupper()': False,
  '+1:word_shape': 'xx',
  '+1:digit': False,
  '+1:hyphen': False,
  '+1:tag': 'fin:sg:ter:imperf'},
 {'bias': 1.0,
  'word.lower()': 'ma',
  'word[:-2]': '',
  'word.istitle()': False,
  'word.isupper()': False,
  'word_shape': 'xx',
  'digit': False,
  'hyphen': False,
  'tag': 'fin:sg:ter:imperf',
  '-1:word.lower()': 'kto',
  '-1:word[:-2]': 'K',
  '-1:word.istitle()': True,
  '-1:word.isupper()': False,
  '-1:word_shape': 'Xxx',
  '-1:digit': False,
  '-1:hyphen': False,
  '-1:tag': 'subst:sg:nom:m1',
  '+1:word.lower()': 'decydować',
  '+1:word[:-2]': 'decydow',
  '+1:word.istitle()': False,
  '+1:word.isupper()': False,
  '+1:word_shape': 'xxxxx',
  '+1:digit

In [26]:
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs', 
    c1=0.1, 
    c2=0.1, 
    max_iterations=100, 
    all_possible_transitions=True
)

In [27]:
crf.fit(X_train, y_train)



CRF(algorithm='lbfgs', all_possible_transitions=True, c1=0.1, c2=0.1,
    keep_tempfiles=None, max_iterations=100)

In [28]:
labels = list(crf.classes_)
labels

['O',
 'surname',
 'orgName',
 'forename',
 'country',
 'settlement',
 'persName',
 'geogName',
 'region',
 'date',
 'time',
 'district',
 'placeName',
 'addName',
 'bloc']

In [29]:
y_pred = crf.predict(X_test)
print('Total precision: ', 
      round(metrics.flat_precision_score(y_test, y_pred, average = 'weighted', labels = labels), 5))
print('Total recall: ', 
      round(metrics.flat_recall_score(y_test, y_pred, average = 'weighted', labels = labels), 5))
print('Total f1 score: ', 
      round(metrics.flat_f1_score(y_test, y_pred, average = 'weighted', labels = labels), 5))

Total precision:  0.98511
Total recall:  0.98599
Total f1 score:  0.98521


In [30]:
sorted_labels = sorted(labels)

print(metrics.flat_classification_report(
    y_test, y_pred, labels=sorted_labels, digits=3
))



              precision    recall  f1-score   support

           O      0.992     0.997     0.995    228605
     addName      0.683     0.318     0.434       176
        bloc      0.250     0.025     0.045        40
     country      0.933     0.840     0.884      1648
        date      0.899     0.857     0.877       858
    district      0.880     0.314     0.463        70
    forename      0.929     0.878     0.903      2650
    geogName      0.780     0.544     0.641       895
     orgName      0.806     0.733     0.768      2192
    persName      0.921     0.662     0.770       210
   placeName      0.841     0.529     0.649        70
      region      0.870     0.486     0.624       179
  settlement      0.851     0.730     0.786      1648
     surname      0.867     0.913     0.889      2585
        time      0.881     0.698     0.779       106

    accuracy                          0.986    241932
   macro avg      0.826     0.635     0.701    241932
weighted avg      0.985   

In [31]:
labels = list(crf.classes_)
labels.remove('O')
labels

['surname',
 'orgName',
 'forename',
 'country',
 'settlement',
 'persName',
 'geogName',
 'region',
 'date',
 'time',
 'district',
 'placeName',
 'addName',
 'bloc']

In [32]:
#After removing "O" label:
y_pred = crf.predict(X_test)
print('After removing "O" label:', end = '\n\n')
print('Total precision: ', 
      round(metrics.flat_precision_score(y_test, y_pred, average = 'weighted', labels = labels), 5))
print('Total recall: ', 
      round(metrics.flat_recall_score(y_test, y_pred, average = 'weighted', labels = labels), 5))
print('Total f1 score: ', 
      round(metrics.flat_f1_score(y_test, y_pred, average = 'weighted', labels = labels), 5))

After removing "O" label:

Total precision:  0.86853
Total recall:  0.7893
Total f1 score:  0.82285


In [33]:
sorted_labels = sorted(labels)

print(metrics.flat_classification_report(
    y_test, y_pred, labels=sorted_labels, digits=3
))



              precision    recall  f1-score   support

     addName      0.683     0.318     0.434       176
        bloc      0.250     0.025     0.045        40
     country      0.933     0.840     0.884      1648
        date      0.899     0.857     0.877       858
    district      0.880     0.314     0.463        70
    forename      0.929     0.878     0.903      2650
    geogName      0.780     0.544     0.641       895
     orgName      0.806     0.733     0.768      2192
    persName      0.921     0.662     0.770       210
   placeName      0.841     0.529     0.649        70
      region      0.870     0.486     0.624       179
  settlement      0.851     0.730     0.786      1648
     surname      0.867     0.913     0.889      2585
        time      0.881     0.698     0.779       106

   micro avg      0.873     0.789     0.829     13327
   macro avg      0.814     0.609     0.680     13327
weighted avg      0.869     0.789     0.823     13327



###Solution no. 2

In [58]:
def word2features(sent, i):
    word = sent[i][0]
    tag_a = sent[i][3]
    tag_b = sent[i][4]
    tag_c = sent[i][5]
    tag_d = sent[i][6]
    tag_e = sent[i][7]
    tag_f = sent[i][8]
    tag_g = sent[i][9]
    
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[:-2]': word[:-2],
        'word.istitle()': word.istitle(),
        'word.isupper()': word.isupper(),
        'word_shape': shape(word),
        'digit': any([char.isdigit() for char in word]),
        'hyphen': '-' in word,
        'tag_a': tag_a,
        'tag_b': tag_b,
        'tag_c': tag_c,
        'tag_d': tag_d,
        'tag_e': tag_e
    
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        tag_a_1 = sent[i-1][3]
        tag_b_1 = sent[i-1][4]
        tag_c_1 = sent[i-1][5]
        tag_d_1 = sent[i-1][6]
        tag_e_1 = sent[i-1][7]
        tag_f_1 = sent[i-1][8]
        tag_g_1 = sent[i-1][9]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word[:-2]': word1[:-2],
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:word_shape': shape(word1),
            '-1:digit': any([char.isdigit() for char in word1]),
            '-1:hyphen': '-' in word1,
            '-1:tag_a': tag_a_1,
            '-1:tag_b': tag_b_1,
            '-1:tag_c': tag_c_1,
            '-1:tag_d': tag_d_1,
            '-1:tag_e': tag_e_1
        })
    else:
        features['BOS'] = True
        
    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        tag_a_1 = sent[i+1][3]
        tag_b_1 = sent[i+1][4]
        tag_c_1 = sent[i+1][5]
        tag_d_1 = sent[i+1][6]
        tag_e_1 = sent[i+1][7]
        tag_f_1 = sent[i+1][8]
        tag_g_1 = sent[i+1][9]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word[:-2]': word1[:-2],
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:word_shape': shape(word1),
            '+1:digit': any([char.isdigit() for char in word1]),
            '+1:hyphen': '-' in word1,
            '+1:tag_a': tag_a_1,
            '+1:tag_b': tag_b_1,
            '+1:tag_c': tag_c_1,
            '+1:tag_d': tag_d_1,
            '+1:tag_e': tag_e_1
        })
    else:
        features['EOS'] = True
                
    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label, tag1, tag2, tag3, tag4, tag5, tag6, tag7 in sent]

def sent2tokens(sent):
    return [token for token, postag, label, tag1, tag2, tag3, tag4, tag5, tag6, tag7 in sent]

In [59]:
X_train = [sent2features(s) for s in train]
y_train = [sent2labels(s) for s in train]

X_test = [sent2features(s) for s in test]
y_test = [sent2labels(s) for s in test]

In [60]:
X_train[0]

[{'bias': 1.0,
  'word.lower()': 'kto',
  'word[:-2]': 'K',
  'word.istitle()': True,
  'word.isupper()': False,
  'word_shape': 'Xxx',
  'digit': False,
  'hyphen': False,
  'tag_a': 'subst',
  'tag_b': 'sg',
  'tag_c': 'nom',
  'tag_d': 'm1',
  'tag_e': '',
  'BOS': True,
  '+1:word.lower()': 'ma',
  '+1:word[:-2]': '',
  '+1:word.istitle()': False,
  '+1:word.isupper()': False,
  '+1:word_shape': 'xx',
  '+1:digit': False,
  '+1:hyphen': False,
  '+1:tag_a': 'fin',
  '+1:tag_b': 'sg',
  '+1:tag_c': 'ter',
  '+1:tag_d': 'imperf',
  '+1:tag_e': ''},
 {'bias': 1.0,
  'word.lower()': 'ma',
  'word[:-2]': '',
  'word.istitle()': False,
  'word.isupper()': False,
  'word_shape': 'xx',
  'digit': False,
  'hyphen': False,
  'tag_a': 'fin',
  'tag_b': 'sg',
  'tag_c': 'ter',
  'tag_d': 'imperf',
  'tag_e': '',
  '-1:word.lower()': 'kto',
  '-1:word[:-2]': 'K',
  '-1:word.istitle()': True,
  '-1:word.isupper()': False,
  '-1:word_shape': 'Xxx',
  '-1:digit': False,
  '-1:hyphen': False,
  '-

In [61]:
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs', 
    c1=0.1, 
    c2=0.1, 
    max_iterations=100, 
    all_possible_states =True,
    all_possible_transitions=True
)

In [62]:
crf.fit(X_train, y_train)



CRF(algorithm='lbfgs', all_possible_states=True, all_possible_transitions=True,
    c1=0.1, c2=0.1, keep_tempfiles=None, max_iterations=100)

In [63]:
labels = list(crf.classes_)
labels

['O',
 'surname',
 'orgName',
 'forename',
 'country',
 'settlement',
 'persName',
 'geogName',
 'region',
 'date',
 'time',
 'district',
 'placeName',
 'addName',
 'bloc']

In [64]:
y_pred = crf.predict(X_test)
print('Total precision: ', 
      round(metrics.flat_precision_score(y_test, y_pred, average = 'weighted', labels = labels), 5))
print('Total recall: ', 
      round(metrics.flat_recall_score(y_test, y_pred, average = 'weighted', labels = labels), 5))
print('Total f1 score: ', 
      round(metrics.flat_f1_score(y_test, y_pred, average = 'weighted', labels = labels), 5))

Total precision:  0.98553
Total recall:  0.98633
Total f1 score:  0.98558


In [65]:
sorted_labels = sorted(labels)

print(metrics.flat_classification_report(
    y_test, y_pred, labels=sorted_labels, digits=3
))



              precision    recall  f1-score   support

           O      0.992     0.997     0.995    228605
     addName      0.732     0.341     0.465       176
        bloc      0.333     0.050     0.087        40
     country      0.933     0.850     0.889      1648
        date      0.897     0.859     0.877       858
    district      0.885     0.329     0.479        70
    forename      0.926     0.877     0.901      2650
    geogName      0.785     0.547     0.645       895
     orgName      0.815     0.740     0.775      2192
    persName      0.910     0.676     0.776       210
   placeName      0.814     0.500     0.619        70
      region      0.892     0.464     0.610       179
  settlement      0.855     0.738     0.792      1648
     surname      0.862     0.922     0.891      2585
        time      0.864     0.717     0.784       106

    accuracy                          0.986    241932
   macro avg      0.833     0.640     0.706    241932
weighted avg      0.986   

In [66]:
labels = list(crf.classes_)
labels.remove('O')
labels

['surname',
 'orgName',
 'forename',
 'country',
 'settlement',
 'persName',
 'geogName',
 'region',
 'date',
 'time',
 'district',
 'placeName',
 'addName',
 'bloc']

In [67]:
#After removing "O" label:
y_pred = crf.predict(X_test)
print('After removing "O" label:', end = '\n\n')
print('Total precision: ', 
      round(metrics.flat_precision_score(y_test, y_pred, average = 'weighted', labels = labels), 5))
print('Total recall: ', 
      round(metrics.flat_recall_score(y_test, y_pred, average = 'weighted', labels = labels), 5))
print('Total f1 score: ', 
      round(metrics.flat_f1_score(y_test, y_pred, average = 'weighted', labels = labels), 5))

After removing "O" label:

Total precision:  0.86972
Total recall:  0.79493
Total f1 score:  0.82613


In [68]:
sorted_labels = sorted(labels)

print(metrics.flat_classification_report(
    y_test, y_pred, labels=sorted_labels, digits=3
))



              precision    recall  f1-score   support

     addName      0.732     0.341     0.465       176
        bloc      0.333     0.050     0.087        40
     country      0.933     0.850     0.889      1648
        date      0.897     0.859     0.877       858
    district      0.885     0.329     0.479        70
    forename      0.926     0.877     0.901      2650
    geogName      0.785     0.547     0.645       895
     orgName      0.815     0.740     0.775      2192
    persName      0.910     0.676     0.776       210
   placeName      0.814     0.500     0.619        70
      region      0.892     0.464     0.610       179
  settlement      0.855     0.738     0.792      1648
     surname      0.862     0.922     0.891      2585
        time      0.864     0.717     0.784       106

   micro avg      0.874     0.795     0.832     13327
   macro avg      0.822     0.615     0.685     13327
weighted avg      0.870     0.795     0.826     13327



###Conclusion

I presented two different solutions to the problem of proper name recognition using the CRF method. The solutions differed from each other in the used features of the words. In each of them I used features of a given word and neighboring words. These included: a word written in lowercase letters, a word without the last two characters, information whether the word begins with a capital letter and whether the word contains a hyphen or a digit.

In the case of the first solution, I also took into account a tag describing a given word containing various information about it, e.g. part of speech, number, gender, etc. In the second version, however, I divided this information into separate features. This operation allowed for slightly better results. The f1 score improved from 0.822 to 0.826.