In [1]:
#|hide
#|default_exp ner_crf

In [2]:
#| hide
%matplotlib inline
from nbdev.showdoc import *

In [19]:
#| export
import nltk
import sklearn
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import LabelBinarizer
import sklearn_crfsuite as crfsuite
from sklearn_crfsuite import metrics
import joblib
import os
import eli5
import scipy
from sklearn.metrics import make_scorer
from sklearn.model_selection import RandomizedSearchCV

# NER with Conditional Random Fields (CRF)
(follows: https://github.com/nlptown/nlp-notebooks/blob/master/Named%20Entity%20Recognition%20with%20Conditional%20Random%20Fields.ipynb)

This notebook is about **sequence labelling**. CRF is a powerful technique dating from before Deep Learning became popular. The goal is to label each word in a text with a word class. POS tagging takes care of parts of speech (POS) tagging or labelling. NER relates words to more generic named entities: Names, places, organizations, etc. But, depending on the context of the research, there might be  more specialized entities (i.e. diseases, symptomes in health, etc.) in a text that need labelling in order to get the most important information out for analytics, search or matching applications.

For this we use `sklearn-crfsuite` for Conditional Random Fields (CRF): Sequence labelling. `sklearn-crfsuite` is a wrapper around `python-crfsuite`, which is a Python binding of **CRFSuite**. The reason to use `sklearn-crfsuite` are its handy utility functions for evaluating the output of the model.

I guess, due to the upsurge of Deep Learning the `sklearn-crfsuite` is not updated anymore. So instead of installing the library with 'pip install sklearn-crfsuite', we should use a patched version of the library in order to generate the evaluation reports.

As an alternative we could use a variant library: spaCy-crfsuite.

In [None]:
#| hide
# Run from the commandline
#%pip install git+https://github.com/MeMartijn/updated-sklearn-crfsuite.git\#egg=sklearn_crfsuite

Data comes from NLTK. It is the CoNLL-2002 data. Spanish and Dutch texts labellled with 4 types of entities: locations (LOC), persons (PER), organizations (ORG), and miscellaneous entities (MISC). The data is split up in three chunks: `train_sents` and two little test chunks: `dev_sents` and `test_sents`.

In [4]:
#| export
#nltk.download('conll2002') # Just run this line once
train_sents = list(nltk.corpus.conll2002.iob_sents('ned.train'))
dev_sents = list(nltk.corpus.conll2002.iob_sents('ned.testa'))
test_sents = list(nltk.corpus.conll2002.iob_sents('ned.testb'))

Let's have a look at the data. They are a list of tokenized sentences: the string, POS tag and it's entity tag. Nowadays the POS tag is not used in deep learning, but with CRF it provides useful information: Nouns are more common denoting entities than verbs, so the POS tags carry useful information.

In [6]:
#| hide
train_sents[0]

[('De', 'Art', 'O'),
 ('tekst', 'N', 'O'),
 ('van', 'Prep', 'O'),
 ('het', 'Art', 'O'),
 ('arrest', 'N', 'O'),
 ('is', 'V', 'O'),
 ('nog', 'Adv', 'O'),
 ('niet', 'Adv', 'O'),
 ('schriftelijk', 'Adj', 'O'),
 ('beschikbaar', 'Adj', 'O'),
 ('maar', 'Conj', 'O'),
 ('het', 'Art', 'O'),
 ('bericht', 'N', 'O'),
 ('werd', 'V', 'O'),
 ('alvast', 'Adv', 'O'),
 ('bekendgemaakt', 'V', 'O'),
 ('door', 'Prep', 'O'),
 ('een', 'Art', 'O'),
 ('communicatiebureau', 'N', 'O'),
 ('dat', 'Conj', 'O'),
 ('Floralux', 'N', 'B-ORG'),
 ('inhuurde', 'V', 'O'),
 ('.', 'Punc', 'O')]

### Feature extraction

How does CRF work? Deep learning neural nets just learn their relevant features from the input texts themselves. CRFs learn the relationship between **the features we give them** and the **label of a token in a given context**. They **do not** learn these features themselves, the quality of the model highly depends on the **features** we present to them. That is why we concentrate in this notebook on a method that collects the features for every token.

What kind of information should we work with? The word itself, its POS tag, completely uppercase, starts with a capital or digit, bigram or trigram characters of its ending. We give every token a **bias** feature which always has the same value. It is used to calculate the relative frequency of each label type in the training data.

In our notebook 00_word_embeddings we trained word embeddings on Dutch wikipedia and clustered these in 500 clusters. We read these clusters from file `read_clusters`, and map each word to the ID of the cluster it is in. Useful for NER because most entity types cluster together. This allows CRF to generalize above word-level.

We also want the CRF to look at the **context** of the tokens. Giving it information (the words, POS tags, capital, complete uppercase) on the two words to the left and to the right (windowing). If there is no left or right context, we return BOS (beginning of sentence) or EOS (end of sentence).

So, this all boils down to:

- use a method to collect the features for every token:
  - the word itself + POS tag
  - completely uppercase? starts with digit? starts with capital?
  - character bigram or trigram the word ends with
  - use a `bias` feature that always has the same value (through it the CRF model can learn the relative freq. of each label type in the training data)

- we use the word embeddings to give the model more information about the meaning of a word (500 wikipedia clusters of word embeddings). We read those from file and map each word to the ID of the cluster it is in. Define `read_clusters`.

- we want the CRF to look at the context of a token. We provide the CRF with the 2 words on either side of the token: their case, POS tags. If there is no left or right, we give it that information: BOS or EOS.

In [7]:
#| export
# Making use of the Wikipedia word embeddings
def read_clusters(cluster_file):
  word2cluster = {}
  with open(cluster_file) as i:
    for line in i:
      word, cluster = line.strip().split('\t')
      word2cluster[word] = cluster
  return word2cluster

# Using features of the words AND looking at the context of a token (neigbours +/- 2)
def word2features(sent, i, word2cluster):
  word = sent[i][0]
  postag = sent[i][1]
  features = [
    'bias',
    'word.lower=' + word.lower(),
    'word[-3]=' + word[-3:], # looking at the last 3 chars of the token
    'word[-2]=' + word[-2:], # looking at the last 2 chars of the token
    'word.isupper=%s' % word.isupper(),
    'word.istitle=%s' % word.istitle(),
    'word.isdigit=%s' % word.isdigit(),
    'word.cluster=%s' % word2cluster[word.lower()] if word.lower() in word2cluster else '0'
    'postag=' + postag
  ]
  # Look at the first neighbour token to the left
  if i > 0:
    word1 = sent[i-1][0]
    postag1 = sent[i-1][1]
    features.extend([
      '-1:word.lower=' + word1.lower(),
      '-1:word.istitle=%s' % word1.istitle(),
      '-1:word.isupper=%s' % word1.isupper(),
      '-1:postag=' + postag1
    ])
  else:
    features.append('BOS')
  # Look at the second neighbour to the left
  if i > 1: 
    word2 = sent[i-2][0]
    postag2 = sent[i-2][1]
    features.extend([
      '-2:word.lower=' + word2.lower(),
      '-2:word.istitle=%s' % word2.istitle(),
      '-2:word.isupper=%s' % word2.isupper(),
      '-2:postag=' + postag2
    ])
  # look at the first neigbour to the right
  if i < len(sent)-1:
    word1 = sent[i+1][0]
    postag1 = sent[+1][0]
    features.extend([
      '+1:word.lower=' + word1.lower(),
      '+1:word.istitle=%s' % word1.istitle(),
      "+1:word.isupper=%s" % word1.isupper(),
      '+1:postag=' + postag1
    ])
  else:
    features.append('EOS')
  # Look at the second neighbour to the right
  if i < len(sent)-2:
    word2 = sent[i+2][0]
    postag2 = sent[+2][0]
    features.extend([
      '+2:word.lower=' + word2.lower(),
      '+2:word.istitle=%s' % word2.istitle(),
      "+2:word.isupper=%s" % word2.isupper(),
      '+2:postag=' + postag2
    ])
  return features

# Now we define the functions to do all the work
def sent2features(sent, word2cluster):
  return [word2features(sent, i, word2cluster) for i in range(len(sent))]

def sent2labels(sent):
  return [label for token, postag, label in sent]

def sent2tokens(sent):
  return [token for token, postag, label in sent]

word2cluster = read_clusters('/home/peter/Documents/data/nlp/clusters_nl.tsv')

Let's try the `sent2features` function out using the first word from the first training_sent:

In [8]:
#| hide
train_sents[0][0]

('De', 'Art', 'O')

In [9]:
#| export
sent2features(train_sents[0], word2cluster)[0]

['bias',
 'word.lower=de',
 'word[-3]=De',
 'word[-2]=De',
 'word.isupper=False',
 'word.istitle=True',
 'word.isdigit=False',
 'word.cluster=38',
 'BOS',
 '+1:word.lower=tekst',
 '+1:word.istitle=False',
 '+1:word.isupper=False',
 '+1:postag=tekst',
 '+2:word.lower=van',
 '+2:word.istitle=False',
 '+2:word.isupper=False',
 '+2:postag=van']

Now we assign our training and test sets the appropriate labels:

In [10]:
#| export
X_train = [sent2features(s, word2cluster) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_dev = [sent2features(s, word2cluster) for s in dev_sents]
y_dev = [sent2labels(s) for s in dev_sents]

X_test = [sent2features(s, word2cluster) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]


Next we create a CRF model and start the training using the standard `lbfgs` algorithm for parameter estimation and run it for 100 iterations.

When done, we save the model using `joblib`.

In [11]:
#| export
crf = crfsuite.CRF(
    verbose='true',
    algorithm='lbfgs',
    max_iterations=100
)

crf.fit(X_train, y_train, X_dev=X_dev, y_dev=y_dev)

loading training data to CRFsuite: 100%|██████████| 15806/15806 [00:01<00:00, 9429.46it/s] 





loading dev data to CRFsuite: 100%|██████████| 2895/2895 [00:00<00:00, 8787.71it/s]



Holdout group: 2

Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 0
0....1....2....3....4....5....6....7....8....9....10
Number of features: 171320
Seconds required: 0.407

L-BFGS optimization
c1: 0.000000
c2: 1.000000
num_memories: 6
max_iterations: 100
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

Iter 1   time=0.28  loss=104683.26 active=171320 precision=0.100  recall=0.111  F1=0.105  Acc(item/seq)=0.901 0.496  feature_norm=1.00
Iter 2   time=0.16  loss=96793.85 active=171320 precision=0.100  recall=0.111  F1=0.105  Acc(item/seq)=0.901 0.496  feature_norm=1.15
Iter 3   time=0.16  loss=92785.91 active=171320 precision=0.100  recall=0.111  F1=0.105  Acc(item/seq)=0.901 0.496  feature_norm=1.26
Iter 4   time=0.16  loss=87079.17 active=171320 precision=0.100  recall=0.111  F1=0.105  Acc(item/seq)=0.901 0.496  feature_norm=1.46
Iter 5   time=0.16  loss=74874.43 active=17

CRF(algorithm='lbfgs', max_iterations=100, verbose='true')

Let's see whether we can write our model to file:

In [13]:
#| export
OUTPUT_PATH = '/home/peter/Documents/data/nlp/models'
OUTPUT_FILE = 'crf_model'

if not os.path.exists(OUTPUT_PATH):
  os.mkdir(OUTPUT_PATH)

joblib.dump(crf, os.path.join(OUTPUT_PATH, OUTPUT_FILE))

['/home/peter/Documents/data/nlp/models/crf_model']

With our model saved, we now can evaluate the output of our CRF model. We will load our model from file and test it on the full test set.

We will have a look at the first sentence.

In [14]:
#| export
crf = joblib.load(os.path.join(OUTPUT_PATH, OUTPUT_FILE))
y_pred = crf.predict(X_test)

example_sent = test_sents[0]
print("Sentence:", ' '.join(sent2tokens(example_sent)))
print("Predicted:", ' '.join(crf.predict([sent2features(example_sent, word2cluster)])[0]))
print("Correct:", ' '.join(sent2labels(example_sent)))

Sentence: Dat is in Italië , Spanje of Engeland misschien geen probleem , maar volgens ' Der Kaiser ' in Duitsland wel .
Predicted: O O O B-LOC O B-LOC O B-LOC O O O O O O O B-MISC I-MISC O O B-LOC O O
Correct: O O O B-LOC O B-LOC O B-LOC O O O O O O O B-PER I-PER O O B-LOC O O


We are now ready to evalaute the whole test set. We print out a classification report for all labels except 'O'. They are the majority of labels anyway, so they will skew the results (they are most probably assigned correctly).

In [15]:
#| export
labels = list(crf.classes_)
labels.remove('O')
y_pred = crf.predict(X_test)
sorted_labels = sorted(
  labels,
  key=lambda name: (name[1:], name[0])
)
# The following code only runs with the updated metrics.py module in `sklearn_crfsuite` library.
# Here: pip install git+https://github.com/MeMartijn/updated-sklearn-crfsuite.git\#egg=sklearn_crfsuite
print(metrics.flat_classification_report(y_test, y_pred, labels=sorted_labels))

              precision    recall  f1-score   support

       B-LOC       0.83      0.82      0.83       774
       I-LOC       0.35      0.45      0.40        49
      B-MISC       0.81      0.61      0.70      1187
      I-MISC       0.54      0.41      0.46       410
       B-ORG       0.79      0.70      0.74       882
       I-ORG       0.75      0.61      0.67       551
       B-PER       0.82      0.88      0.85      1098
       I-PER       0.90      0.95      0.93       807

   micro avg       0.80      0.74      0.77      5758
   macro avg       0.72      0.68      0.70      5758
weighted avg       0.80      0.74      0.76      5758



We have good scores, especially `B-LOC` and `B-PER` score very good.

Next we will use the `eli5` library to have a look at the most likely transitions the CRF model has identified. `Eli5` helps us to explain the predictions of our CRF model

In [18]:
eli5.show_weights(crf, top=20)

From \ To,O,B-LOC,I-LOC,B-MISC,I-MISC,B-ORG,I-ORG,B-PER,I-PER
O,3.67,4.606,0.0,4.124,0.0,4.478,0.0,3.713,0.0
B-LOC,-0.282,-0.417,7.194,0.0,0.0,0.0,0.0,-0.79,0.0
I-LOC,-0.879,-0.289,5.668,0.0,0.0,0.0,0.0,0.0,0.0
B-MISC,-1.258,0.809,0.0,-0.164,7.135,0.787,0.0,0.819,0.0
I-MISC,-1.562,0.0,0.0,-0.208,7.003,1.42,0.0,-0.587,0.0
B-ORG,-0.253,0.0,0.0,-0.826,0.0,0.0,7.58,0.147,0.0
I-ORG,-0.528,0.0,0.0,0.0,0.0,0.0,6.693,0.231,0.0
B-PER,0.207,-0.406,0.0,-0.711,0.0,0.0,0.0,-1.505,8.565
I-PER,0.38,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6.5

Weight?,Feature,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,Unnamed: 5_level_0,Unnamed: 6_level_0,Unnamed: 7_level_0,Unnamed: 8_level_0
Weight?,Feature,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
Weight?,Feature,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
Weight?,Feature,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3
Weight?,Feature,Unnamed: 2_level_4,Unnamed: 3_level_4,Unnamed: 4_level_4,Unnamed: 5_level_4,Unnamed: 6_level_4,Unnamed: 7_level_4,Unnamed: 8_level_4
Weight?,Feature,Unnamed: 2_level_5,Unnamed: 3_level_5,Unnamed: 4_level_5,Unnamed: 5_level_5,Unnamed: 6_level_5,Unnamed: 7_level_5,Unnamed: 8_level_5
Weight?,Feature,Unnamed: 2_level_6,Unnamed: 3_level_6,Unnamed: 4_level_6,Unnamed: 5_level_6,Unnamed: 6_level_6,Unnamed: 7_level_6,Unnamed: 8_level_6
Weight?,Feature,Unnamed: 2_level_7,Unnamed: 3_level_7,Unnamed: 4_level_7,Unnamed: 5_level_7,Unnamed: 6_level_7,Unnamed: 7_level_7,Unnamed: 8_level_7
Weight?,Feature,Unnamed: 2_level_8,Unnamed: 3_level_8,Unnamed: 4_level_8,Unnamed: 5_level_8,Unnamed: 6_level_8,Unnamed: 7_level_8,Unnamed: 8_level_8
+3.796,word.istitle=False,,,,,,,
+3.386,word.isupper=False,,,,,,,
+2.956,word.cluster=195,,,,,,,
+2.893,+1:postag=+,,,,,,,
+2.722,word.cluster=177,,,,,,,
+2.668,word.cluster=158,,,,,,,
+2.648,word.cluster=370,,,,,,,
+2.539,word.cluster=415,,,,,,,
+2.288,word.cluster=178,,,,,,,
+2.204,"-1:word.lower=""",,,,,,,

Weight?,Feature
+3.796,word.istitle=False
+3.386,word.isupper=False
+2.956,word.cluster=195
+2.893,+1:postag=+
+2.722,word.cluster=177
+2.668,word.cluster=158
+2.648,word.cluster=370
+2.539,word.cluster=415
+2.288,word.cluster=178
+2.204,"-1:word.lower="""

Weight?,Feature
+3.972,word.cluster=325
+3.803,word.cluster=375
+3.589,word.cluster=68
+3.380,word.cluster=139
+2.089,word.cluster=143
+2.016,word.cluster=476
+2.016,-1:word.lower=in
+1.879,word.cluster=102
+1.844,word[-2]=ië
+1.617,+1:postag=ronde

Weight?,Feature
+1.721,word.cluster=238
+1.640,+2:word.lower=m
+1.381,word.cluster=161
+1.167,"+2:word.lower=,"
+1.061,-1:word.lower=col
+1.004,word.cluster=38
+0.994,word[-2]=rk
+0.967,+1:postag=Mutations
+0.937,word[-2]=al
+0.846,+1:postag=Hoedenverhalen

Weight?,Feature
+3.166,word.cluster=23
+2.695,word.cluster=39
+2.653,word.cluster=100
+2.521,word.cluster=338
+2.141,word.cluster=294
+2.070,word[-2]=se
+1.822,word.cluster=11
+1.729,+2:postag=1
+1.726,word.lower=sport
+1.715,word.lower=buitenland

Weight?,Feature
+1.857,-2:word.lower=ronde
+1.763,-1:word.isupper=True
+1.555,-1:word.lower=ronde
+1.448,word.cluster=37
+1.354,+2:postag=wiel
+1.322,word.cluster=325
+1.300,-1:word.istitle=True
+1.288,-1:postag=Num
+1.260,word.cluster=1
+1.237,+1:word.lower=ned

Weight?,Feature
+2.880,word.cluster=228
+2.812,word.cluster=424
+2.375,+1:postag=Morgen
+2.193,word.cluster=187
+2.159,word[-3]=com
+2.123,word.lower=quizpeople
+2.077,word[-3]=ple
+1.935,word.cluster=83
+1.920,+1:word.lower=morgen
+1.775,word.cluster=250

Weight?,Feature
+1.374,word.lower=morgen
+1.343,word.cluster=403
+1.304,-1:word.lower=vlaams
+1.304,word.cluster=321
+1.275,word.cluster=413
+1.153,-1:word.lower=radio
+1.097,word[-3]=gen
+1.092,word.cluster=187
+1.045,-1:postag=Misc
+1.042,word[-3]=ion

Weight?,Feature
+3.660,word.cluster=489
+3.036,word.cluster=301
+3.030,word.cluster=204
+3.015,word.cluster=3
+2.875,word.cluster=246
+2.592,word.cluster=6
+2.496,word.cluster=337
+2.434,word.cluster=326
+2.366,word.cluster=296
+2.204,word.cluster=87

Weight?,Feature
+1.969,-1:word.lower=van
+1.545,word.cluster=3
+1.443,word.cluster=450
+1.303,+1:word.lower=(
+1.296,+2:word.lower=die
+1.269,+2:postag=(
+1.254,word.cluster=249
+1.234,word.cluster=6
+1.122,word.cluster=388
+1.034,word.cluster=337


## Finding the optimal hyperparameters

So far we've trained a model with the default parameters. It's unlikely that these will give us the best performance possible. Therefore we're going to search automatically for the best hyperparameter settings by iteratively training different models and evaluating them. Eventually we'll pick the best one.

Here we'll focus on two parameters: c1 and c2. These are the parameters for L1 and L2 regularization, respectively. Regularization prevents overfitting on the training data by adding a penalty to the loss function. In L1 regularization, this penalty is the sum of the absolute values of the weights; in L2 regularization, it is the sum of the squared weights. L1 regularization performs a type of feature selection, as it assigns 0 weight to irrelevant features. L2 regularization, by contrast, makes the weight of irrelevant features small, but not necessarily zero. L1 regularization is often called the Lasso method, L2 is called the Ridge method, and the linear combination of both is called Elastic Net regularization.

We define the parameter space for c1 and c2 and use the flat F1-score to compare the individual models. We'll rely on three-fold cross validation to score each of the 50 candidates. We use a randomized search, which means we're not going to try out all specified parameter settings, but instead, we'll let the process sample randomly from the distributions we've specified in the parameter space. It will do this 50 (n_iter) times. This process takes a while, but it's worth the wait.

In [20]:
#| export
crf = crfsuite.CRF(
  algorithm='lbfgs',
  max_iterations=100,
  all_possible_transitions=True,
  keep_tempfiles=True
)

params_space = {
    'c1': scipy.stats.expon(scale=0.5),
    'c2': scipy.stats.expon(scale=0.05),
}

f1_scorer = make_scorer(metrics.flat_f1_score,
                        average='weighted', labels=labels)

rs = RandomizedSearchCV(crf, params_space,
                        cv=3,
                        verbose=1,
                        n_jobs=-1,
                        n_iter=50,
                        scoring=f1_scorer)
rs.fit(X_train, y_train)

Fitting 3 folds for each of 50 candidates, totalling 150 fits


RandomizedSearchCV(cv=3,
                   estimator=CRF(algorithm='lbfgs',
                                 all_possible_transitions=True,
                                 keep_tempfiles=True, max_iterations=100),
                   n_iter=50, n_jobs=-1,
                   param_distributions={'c1': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f1a3be2ec70>,
                                        'c2': <scipy.stats._distn_infrastructure.rv_frozen object at 0x7f19ed1d5f70>},
                   scoring=make_scorer(flat_f1_score, average=weighted, labels=['B-ORG', 'B-MISC', 'B-PER', 'I-PER', 'B-LOC', 'I-MISC', 'I-ORG', 'I-LOC']),
                   verbose=1)

In [21]:
#| export
print('best params:', rs.best_params_)
print('best CV score:', rs.best_score_)
print('model size: {:0.2f}M'.format(rs.best_estimator_.size_ / 1000000))

best params: {'c1': 0.03718596914962556, 'c2': 0.01236631335837559}
best CV score: 0.7542030812940546
model size: 1.54M


In [22]:
#| export
best_crf = rs.best_estimator_
y_pred = best_crf.predict(X_test)
print(metrics.flat_classification_report(
    y_test, y_pred, labels=sorted_labels, digits=3
))

              precision    recall  f1-score   support

       B-LOC      0.849     0.853     0.851       774
       I-LOC      0.394     0.571     0.467        49
      B-MISC      0.836     0.618     0.711      1187
      I-MISC      0.649     0.415     0.506       410
       B-ORG      0.813     0.718     0.762       882
       I-ORG      0.779     0.641     0.703       551
       B-PER      0.832     0.901     0.865      1098
       I-PER      0.885     0.969     0.925       807

   micro avg      0.822     0.755     0.787      5758
   macro avg      0.755     0.711     0.724      5758
weighted avg      0.818     0.755     0.780      5758



In [23]:
#| hide
import nbdev; nbdev.nbdev_export()