University of Zagreb\
Faculty of Electrical Engineering and Computing

## Text Analysis and Retrieval 2021/2022
https://www.fer.unizg.hr/predmet/apt/

------------------------------

### Basics of NLP

*Version: 1.1*

(c) 2022 Josip Jukić, Jan Šnajder

Submission deadline: **April 6, 2022, 23:59 CET** 

------------------------------

### Instructions

Hello visitor, this lab assignment consists of three parts. Your task boils down to filling out the missing parts of code and evaluating the cells. These parts are indicated by the "YOUR CODE HERE" template.

Each subtask is supplemented by several tests that you can run. Apart from that, there are additional tests that will be executed after submission. If your solution is valid and it passes all of the visible tests, there shouldn't be any problems with the additional tests.

**IMPORTANT: Don't change the names of the predefined methods or random seeds**, because the tests won't be executed properly.

You're required to do this assignment **on your own**.

If you run into problems, please contact josip.jukic@fer.hr to arrange office hours.

## Tasks

### 1. Preprocessing

In [1]:
import spacy
import numpy as np
import pandas as pd

We will use [spaCy](https://spacy.io/) extensively in this assignment. You are advised to study the main aspects of this tool. You can go through the basics [here](https://spacy.io/usage/spacy-101). We recommend that you go through the procedures that we covered in the lectures: tokenization, lemmatization, part-of-speech (POS) tagging, and named entity recognition (NER).

Furthermore, we will rely on [NumPy](https://numpy.org/) and [pandas](https://pandas.pydata.org/) libraries. If you are not familiar with those libraries, we advise you to go through [this tutorial](https://www.hackerearth.com/practice/machine-learning/data-manipulation-visualisation-r-python/tutorial-data-manipulation-numpy-pandas-python/tutorial/).

In [2]:
# Load spacy model
nlp = spacy.load("en_core_web_sm")

#### (a)
Process the example below with spaCy. Tokenize the document and gather the tokens in a list. Finally, print the tokens.

In [3]:
ex1_a1 = (
    "A wizard is never late, Frodo Baggins. "
    "Nor is he early; he arrives precisely when he means to."
)

In [4]:
def tokenizer(text):
    return [token for token in nlp(text)]

tokens = tokenizer(ex1_a1)
print(tokens)

[A, wizard, is, never, late, ,, Frodo, Baggins, ., Nor, is, he, early, ;, he, arrives, precisely, when, he, means, to, .]


#### (b)
Implement `sentencizer` using [spaCy](https://spacy.io/usage/linguistic-features).

In [5]:
def sentencizer(text):
    """
    Receives a string as an input,
    splits the document to sentences and gathers them in a list.
    """
    return [sentence.text for sentence in nlp(text).sents]

In [6]:
assert sentencizer("Sentence no. 1. Sentence no. 2.") == [
    "Sentence no. 1.",
    "Sentence no. 2.",
]
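
To see why sentence segmentation is left to spaCy, it may help to look at what a purely punctuation-based rule does with the test sentence above. The function `naive_sentencizer` below is a hypothetical baseline written only for this illustration; it is not part of the assignment or of spaCy.

```python
import re


def naive_sentencizer(text):
    """Split after every period that is followed by whitespace (toy baseline)."""
    return re.split(r"(?<=\.)\s+", text.strip())


naive = naive_sentencizer("Sentence no. 1. Sentence no. 2.")
# The abbreviation "no." is mistaken for a sentence end.
print(naive)  # ['Sentence no.', '1.', 'Sentence no.', '2.']
```

The abbreviation breaks the rule, which is the kind of case a trained pipeline is meant to handle.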

#### (c)

Implement `lemmatizer` using [spaCy](https://spacy.io/usage/linguistic-features).

In [7]:
def lemmatizer(text):
    """
    Receives a string as an input and lemmatizes it.
    The lemmas are returned in a list.
    """
    return [token.lemma_ for token in nlp(text)]

In [8]:
assert lemmatizer(ex1_a1) == [
    "a",
    "wizard",
    "be",
    "never",
    "late",
    ",",
    "Frodo",
    "Baggins",
    ".",
    "nor",
    "be",
    "he",
    "early",
    ";",
    "he",
    "arrive",
    "precisely",
    "when",
    "he",
    "mean",
    "to",
    ".",
]

#### (d)

Implement the `ngrams` method. You might find the [`tee`](https://www.geeksforgeeks.org/python-itertools-tee/) function from the `itertools` package useful, but you're not obliged to use it. The method should return a generator. Please refer to the [link](https://wiki.python.org/moin/Generators) if you aren't familiar with Python generators.

In [9]:
from itertools import tee


def ngrams(sequence, n, **kwargs):
    """
    Receives a list of tokens and generates n-grams.
    """
    # Slide a window of width n over the sequence, one position at a time.
    for start in range(len(sequence) - n + 1):
        yield tuple(sequence[start : start + n])

In [10]:
assert list(ngrams(lemmatizer(ex1_a1), 2)) == [
    ("a", "wizard"),
    ("wizard", "be"),
    ("be", "never"),
    ("never", "late"),
    ("late", ","),
    (",", "Frodo"),
    ("Frodo", "Baggins"),
    ("Baggins", "."),
    (".", "nor"),
    ("nor", "be"),
    ("be", "he"),
    ("he", "early"),
    ("early", ";"),
    (";", "he"),
    ("he", "arrive"),
    ("arrive", "precisely"),
    ("precisely", "when"),
    ("when", "he"),
    ("he", "mean"),
    ("mean", "to"),
    ("to", "."),
]
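
For comparison, here is a minimal sketch of the `tee`-based variant hinted at in the task description. The name `ngrams_tee` is an assumption made for this example. Unlike a slicing approach, it does not need `len()` or indexing, so it also accepts plain iterators.

```python
from itertools import tee


def ngrams_tee(sequence, n):
    """Yield n-grams as tuples; works for any iterable, not only lists."""
    # Make n independent iterators over the same input.
    iterators = tee(sequence, n)
    # Advance the i-th iterator by i positions so the copies are staggered.
    for shift, iterator in enumerate(iterators):
        for _ in range(shift):
            next(iterator, None)
    # zip stops as soon as the most advanced iterator is exhausted.
    yield from zip(*iterators)


tokens = ["a", "wizard", "be", "never", "late"]
bigrams = list(ngrams_tee(tokens, 2))
trigrams = list(ngrams_tee(iter(tokens), 3))
print(bigrams)
print(trigrams)
```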


### 2. News classification

#### (a)
Load the prepared BBC news data to a `pandas` dataframe named `df_bbc`. Explore the dataset structure.

In [11]:
import pandas as pd


df_bbc = pd.read_csv('bbc.csv')

display(df_bbc)
print(df_bbc[['type']].nunique())

Unnamed: 0,news,type
0,New 'yob' targets to be unveiled\n \n Fifty ne...,politics
1,Newcastle line up Babayaro\n \n Newcastle mana...,sport
2,Europe backs digital TV lifestyle\n \n How peo...,tech
3,Fears raised over ballet future\n \n Fewer chi...,entertainment
4,Barkley fit for match in Ireland\n \n England ...,sport
...,...,...
195,Wales 'must learn health lessons'\n \n The new...,politics
196,Clarke to press on with ID cards\n \n New Home...,politics
197,Artists' secret postcards on sale\n \n Postcar...,entertainment
198,Lopez misses UK charity premiere\n \n Jennifer...,entertainment


type    5
dtype: int64


#### (b)
To make the classification task a bit more challenging, we want to remove the news title from the text.\
Additionally, we will replace all whitespaces with single spaces. Implement title removal and whitespace replacement in `clean_text`.\
E.g., "This \n is  \t an &nbsp;&nbsp;&nbsp;&nbsp; example. " -> "This is an example."

In [12]:
def clean_text(text):
    """
    Removes news title and replaces all whitespaces with single spaces.
    Returns preprocessed text.
    """
    # Join with a space so that words on adjacent lines are not glued together.
    text_wo_title = ' '.join(text.split('\n')[1:])
    return ' '.join(text_wo_title.split())

In [13]:
assert (
    clean_text("Breaking news\nClever Hans \t learns  to integrate.")
    == "Clever Hans learns to integrate."
)
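
The same cleaning step can also be sketched with a regular expression. This is only one possible variant, and the name `clean_text_re` is hypothetical. It treats everything up to the first newline as the title and relies on `\s` matching Unicode whitespace, including the no-break space that `&nbsp;` stands for.

```python
import re


def clean_text_re(text):
    """Drop the first line (the title) and collapse whitespace runs."""
    # partition splits at the first newline only; the body may contain more.
    _title, _sep, body = text.partition("\n")
    # \s matches Unicode whitespace too, including the no-break space U+00A0.
    return re.sub(r"\s+", " ", body).strip()


cleaned = clean_text_re("Title\nThis \n is  \t an \u00a0\u00a0 example. ")
print(cleaned)  # This is an example.
```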


In [14]:
df_bbc["text"] = df_bbc.news.apply(clean_text)

#### (c)
(1) Implement an abstract pipeline in `preprocess_pipe`. The method receives a sequence of texts and a pipe function, which is used to preprocess documents in combination with the spaCy model `nlp` that we loaded at the beginning. We recommend that you use [`pipe`](https://spacy.io/usage/processing-pipelines).\
(2) Implement `lemmatize_pipe` that collects lemmas and returns a list of n-grams ranging from `ngram_min` to `ngram_max`. Additionally, **truncate** the documents to `max_len` tokens and **remove the stop words**. Refer to the tests below to see how this method should behave.

In [15]:
def lemmatize_pipe(doc, max_len, ngram_min, ngram_max):
    """
    Removes stop words, truncates the document to `max_len` tokens,
    and returns lemma n-grams in range [`ngram_min`, `ngram_max`].
    """
    tokens = [token for token in doc if not token.is_stop][:max_len]
    # Lowercase the lemmas, but keep the original casing of proper nouns.
    lemmas = [
        token.lemma_ if token.pos_ == "PROPN" else token.lemma_.lower()
        for token in tokens
    ]
    result = []
    # Cover every n in the inclusive range, not only its two endpoints.
    for n in range(ngram_min, ngram_max + 1):
        result.extend(ngrams(lemmas, n))
    return result
        
def preprocess_pipe(texts, pipe_fn):
    l = []
    for doc in nlp.pipe(texts):
        l.append(pipe_fn(doc))
    return l

In [16]:
from functools import partial


pipe_fn = partial(lemmatize_pipe, max_len=100, ngram_min=1, ngram_max=2)

ex2_c1 = ["Text no. 1", "Text no. 2"]
sol2_c1 = [
    [("text",), (".",), ("1",), ("text", "."), (".", "1")],
    [("text",), (".",), ("2",), ("text", "."), (".", "2")],
]

assert preprocess_pipe(ex2_c1, pipe_fn) == sol2_c1

ex2_c2 = [
    "It’s a dangerous business, Frodo, going out your door.",
    "You step onto the road, and if you don’t keep your feet, there’s no knowing where you might be swept off to.",
]
sol2_c2 = [
    [
        ("dangerous",),
        ("business",),
        (",",),
        ("Frodo",),
        (",",),
        ("go",),
        ("door",),
        (".",),
        ("dangerous", "business"),
        ("business", ","),
        (",", "Frodo"),
        ("Frodo", ","),
        (",", "go"),
        ("go", "door"),
        ("door", "."),
    ],
    [
        ("step",),
        ("road",),
        (",",),
        ("foot",),
        (",",),
        ("know",),
        ("sweep",),
        (".",),
        ("step", "road"),
        ("road", ","),
        (",", "foot"),
        ("foot", ","),
        (",", "know"),
        ("know", "sweep"),
        ("sweep", "."),
    ],
]

assert preprocess_pipe(ex2_c2, pipe_fn) == sol2_c2
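
The tests above only use `ngram_min=1` and `ngram_max=2`. To show what an inclusive range means for a wider setting, here is a small sketch on a plain list of strings, with no spaCy involved. The helper `ngram_range` and its `max_len` default are assumptions made for this example.

```python
def ngram_range(items, ngram_min, ngram_max, max_len=None):
    """Return all n-grams of `items` for n in [ngram_min, ngram_max]."""
    items = list(items)[:max_len]  # max_len=None keeps everything
    result = []
    for n in range(ngram_min, ngram_max + 1):  # inclusive upper bound
        result.extend(tuple(items[i : i + n]) for i in range(len(items) - n + 1))
    return result


lemmas = ["step", "road", ",", "foot"]
grams = ngram_range(lemmas, 1, 3)
print(grams)
```

With four items this yields four unigrams, three bigrams, and two trigrams, in that order.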

In [17]:
from functools import partial
from sklearn.model_selection import train_test_split


pipe_fn = partial(lemmatize_pipe, max_len=100, ngram_min=1, ngram_max=2)

df_bbc["lemmas"] = preprocess_pipe(df_bbc.text, pipe_fn)
df_bbc_train, df_bbc_test = train_test_split(
    df_bbc[["lemmas", "type"]], test_size=0.2, random_state=42
)

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# Load vectorizers
count_vectorizer = CountVectorizer(tokenizer=lambda doc: doc, lowercase=False, min_df=3)
tfidf_vectorizer = TfidfVectorizer(tokenizer=lambda doc: doc, lowercase=False, min_df=3)

#### (d)
Implement `train_lr`. Run `test_performance` with the count and TF-IDF vectorizers. Compare the results.

In [19]:
from sklearn.linear_model import LogisticRegression as LR


def train_lr(df_train, vectorizer, lr_kwargs={"max_iter": 1000, "solver": "lbfgs"}):
    """
    Receives the train set `df_train` as pd.DataFrame and extracts lemma n-grams
    with their corresponding labels (news type).
    The text is vectorized and used to train a logistic regression with
    training arguments passed as `lr_kwargs`.
    Returns the fitted model.
    """
    X = vectorizer.fit_transform(df_train['lemmas'])
    y = df_train['type']
    
    # Forward all training arguments, not only `max_iter` and `solver`.
    model = LR(**lr_kwargs)
    model.fit(X, y)
    
    return model

In [20]:
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score


def test_performance(model, df_test, vectorizer):
    X_test, y_test = df_test.lemmas, df_test.type
    X_vec = vectorizer.transform(X_test)
    y_pred = model.predict(X_vec)
    print(classification_report(y_pred=y_pred, y_true=y_test))
    return f1_score(y_pred=y_pred, y_true=y_test, average="macro")
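
`test_performance` returns the macro-averaged $F_1$ score, i.e., the unweighted mean of the per-class scores, so small classes count as much as large ones. The following sketch recomputes it by hand on made-up labels. The helper `per_class_f1` and the toy data are assumptions for illustration, not part of the assignment.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} computed from true/false positive counts."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for true, pred in zip(y_true, y_pred):
        if true == pred:
            tp[true] += 1
        else:
            fp[pred] += 1  # predicted label gains a false positive
            fn[true] += 1  # true label gains a false negative
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


y_true = ["sport", "sport", "sport", "tech", "tech", "politics"]
y_pred = ["sport", "sport", "tech", "tech", "sport", "politics"]
scores = per_class_f1(y_true, y_pred)
macro_f1 = sum(scores.values()) / len(scores)
print(scores, round(macro_f1, 3))
```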

In [21]:
## Count vectorizer scenario
lr = train_lr(df_bbc_train, count_vectorizer)
f1 = test_performance(lr, df_bbc_test, count_vectorizer)
print(f"f1 = {f1:.3f}")

               precision    recall  f1-score   support

     business       0.92      1.00      0.96        11
entertainment       1.00      0.83      0.91         6
     politics       1.00      1.00      1.00         8
        sport       0.92      1.00      0.96        12
         tech       1.00      0.67      0.80         3

     accuracy                           0.95        40
    macro avg       0.97      0.90      0.93        40
 weighted avg       0.95      0.95      0.95        40

f1 = 0.925


In [22]:
## TF-IDF vectorizer scenario
lr = train_lr(df_bbc_train, tfidf_vectorizer)
f1 = test_performance(lr, df_bbc_test, tfidf_vectorizer)
print(f"f1 = {f1:.3f}")

               precision    recall  f1-score   support

     business       0.85      1.00      0.92        11
entertainment       1.00      0.67      0.80         6
     politics       1.00      1.00      1.00         8
        sport       0.92      1.00      0.96        12
         tech       1.00      0.67      0.80         3

     accuracy                           0.93        40
    macro avg       0.95      0.87      0.90        40
 weighted avg       0.93      0.93      0.92        40

f1 = 0.895
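
To make the difference between the two vectorizers concrete, here is a minimal pure-Python sketch of TF-IDF weighting. It assumes the smoothed inverse document frequency, $\ln\frac{1+n}{1+\mathrm{df}(t)} + 1$, and the L2 row normalization that scikit-learn documents as the `TfidfVectorizer` defaults. The function `tfidf_rows` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts, as a count vectorizer would give
        weights = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows


docs = [["goal", "match", "goal"], ["match", "election"], ["election", "vote"]]
rows = tfidf_rows(docs)
print(rows[0])
```

Terms that are frequent in a document but rare across the collection, such as "goal" here, receive the largest weights.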


### 3. Named entity recognition

Named entity recognition (NER) is an NLP task that seeks to classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, quantities, monetary values, percentages, etc. Refer to [Jurafsky \& Martin, Speech and Language Processing, Chapter 17](https://web.stanford.edu/~jurafsky/slp3/17.pdf) for additional information.

In this task, we will try out two approaches:
1. **classification**, where we classify named entities for each word in a document,
2. and **sequence labeling**, a more natural way to solve NER.

First, let's see spaCy's visualization tool `displacy` in action. We will take the first document from our data frame and render named entities with spaCy's default NER model. Although there are some minor inaccuracies, spaCy's NER model generally performs very well (~90% accuracy).

In [23]:
from spacy import displacy


doc = nlp(df_bbc.news.iloc[0])
displacy.render(doc, style="ent", jupyter=True)

#### (a)
We want to use spaCy's default model to produce silver standard NER labels for our BBC news dataset. The first step is to implement `entity_pipe`, a method that extracts POS tags and NER labels, which we will pass as an argument to `preprocess_pipe`. `entity_pipe` receives a spaCy document, extracts triplets in the form of (token, POS tag, named entity label), and returns the list of collected triplets. Refer to [spaCy's documentation for NER](https://spacy.io/usage/linguistic-features#named-entities).

In [24]:
def entity_pipe(doc):
    tmp = []
    for token in doc:
        elem = (token.text, token.tag_, token.ent_iob_ if token.ent_iob_ == 'O' else token.ent_iob_ + '-' + token.ent_type_)
        tmp.append(elem)
    return tmp

In [25]:
from functools import partial


ex3_a1 = [
    "One does not simply walk into Mordor.",
    "What about second breakfast?",
]
sol3_a1 = [
    [
        ("One", "PRP", "O"),
        ("does", "VBZ", "O"),
        ("not", "RB", "O"),
        ("simply", "RB", "O"),
        ("walk", "VB", "O"),
        ("into", "IN", "O"),
        ("Mordor", "NNP", "B-ORG"),
        (".", ".", "O"),
    ],
    [
        ("What", "WP", "O"),
        ("about", "IN", "O"),
        ("second", "JJ", "B-ORDINAL"),
        ("breakfast", "NN", "O"),
        ("?", ".", "O"),
    ],
]
assert preprocess_pipe(ex3_a1, entity_pipe) == sol3_a1
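
The labels in these triplets follow the IOB scheme: `B-` marks the first token of an entity, `I-` a continuation, and `O` a token outside any entity. As a rough sketch of how such labels map back to entity spans, consider the hypothetical helper `iob_to_spans` below. The example triplets are written by hand and are not spaCy output.

```python
def iob_to_spans(triplets):
    """Group (token, POS, IOB-label) triplets into (entity text, type) pairs."""
    spans, current, current_type = [], [], None
    for token, _pos, label in triplets:
        prefix, _, ent_type = label.partition("-")
        if prefix == "I" and current and ent_type == current_type:
            current.append(token)  # continue the open entity
            continue
        if current:  # any other label closes the open entity
            spans.append((" ".join(current), current_type))
            current, current_type = [], None
        if prefix in ("B", "I"):  # a stray I- tag also starts a new entity
            current, current_type = [token], ent_type
    if current:
        spans.append((" ".join(current), current_type))
    return spans


example = [
    ("Frodo", "NNP", "B-PERSON"),
    ("Baggins", "NNP", "I-PERSON"),
    ("left", "VBD", "O"),
    ("the", "DT", "O"),
    ("Shire", "NNP", "B-LOC"),
    (".", ".", "O"),
]
spans = iob_to_spans(example)
print(spans)  # [('Frodo Baggins', 'PERSON'), ('Shire', 'LOC')]
```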

We will use only the first 50 documents to reduce the computational cost.

In [26]:
df_bbc_trunc = df_bbc[:50].copy()

df_bbc_trunc["tags"] = preprocess_pipe(df_bbc_trunc["text"], entity_pipe)
data = sum(df_bbc_trunc["tags"], [])
tokens, pos, tags = zip(*data)
df_iob = pd.DataFrame({"token": tokens, "POS": pos, "tag": tags})
df_iob.head()

Unnamed: 0,token,POS,tag
0,Fifty,CD,B-CARDINAL
1,new,JJ,O
2,areas,NNS,O
3,getting,VBG,O
4,special,JJ,O


#### (b)
Vectorize the data in `df_iob` with [`DictVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html). You can transform the dataframe to a list of dictionaries with [`to_dict`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_dict.html). The structure should look like so: [{column -> value}, … , {column -> value}]. Refer to the linked documentation to see how to utilize the `orient` argument.
After vectorization, split the data using [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html), with `test_size=0.5` and `shuffle=False` to preserve the sentence structure. We are trying to classify named entities, so you can simply use the `tag` column from `df_iob` to extract labels. You can keep them in the string format.

In [27]:
from sklearn.feature_extraction import DictVectorizer


d = df_iob[['token', 'POS']].to_dict(orient='records')
vectorizer = DictVectorizer()
X = vectorizer.fit_transform(d)
y = df_iob['tag']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, shuffle=False)
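
For string-valued features, `DictVectorizer` performs one-hot encoding and names the resulting columns `column=value`. The toy function `one_hot_records` below is a hypothetical stand-in that mimics this behavior with dense Python lists, whereas the real class returns a sparse matrix by default.

```python
def one_hot_records(records):
    """Map each {column: string value} record to a dense 0/1 row."""
    # One output column per distinct "column=value" pair, in sorted order.
    feature_names = sorted({f"{k}={v}" for rec in records for k, v in rec.items()})
    index = {name: j for j, name in enumerate(feature_names)}
    rows = []
    for rec in records:
        row = [0] * len(feature_names)
        for k, v in rec.items():
            row[index[f"{k}={v}"]] = 1
        rows.append(row)
    return feature_names, rows


records = [
    {"token": "Fifty", "POS": "CD"},
    {"token": "new", "POS": "JJ"},
    {"token": "special", "POS": "JJ"},
]
names, rows = one_hot_records(records)
print(names)
print(rows)
```

Each token is described only by its own surface form and POS tag, so the classifier sees no context from neighboring tokens.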

You can train your classifier now. For this purpose, let's choose Multinomial Naïve Bayes (MNB). Since MNB can learn incrementally, notice that we train our model with [`partial_fit`](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB.partial_fit) to reduce the computational complexity.

In [28]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report


classes = np.unique(df_iob.tag.values).tolist()
nb = MultinomialNB()
nb.partial_fit(X_train, y_train, classes)

print(classification_report(y_pred=nb.predict(X_test), y_true=y_test, labels=classes))



               precision    recall  f1-score   support

   B-CARDINAL       0.49      0.36      0.42        83
       B-DATE       0.51      0.18      0.26       157
      B-EVENT       0.00      0.00      0.00         4
        B-FAC       0.00      0.00      0.00         2
        B-GPE       0.87      0.42      0.57       205
   B-LANGUAGE       0.00      0.00      0.00         0
        B-LAW       0.00      0.00      0.00         1
        B-LOC       0.00      0.00      0.00        25
      B-MONEY       0.00      0.00      0.00        44
       B-NORP       0.00      0.00      0.00        56
    B-ORDINAL       0.00      0.00      0.00        14
        B-ORG       0.61      0.14      0.23       217
    B-PERCENT       0.00      0.00      0.00        33
     B-PERSON       0.86      0.13      0.23       189
    B-PRODUCT       0.00      0.00      0.00         4
   B-QUANTITY       0.00      0.00      0.00         4
       B-TIME       0.00      0.00      0.00         8
B-WORK_OF

For non-sparse classes, the $F_1$ score should be close to $1$. A possible explanation is that the silver labels were produced automatically by spaCy's NER model, a statistical model whose predictions tend to be more regular than human annotations and are therefore easier to learn. To check how the classifier performs on human-annotated data, let's explore the next dataset, "ner.csv".

In [29]:
df_ner = pd.read_csv("ner.csv", encoding="ISO-8859-1")
# Fill NaNs with preceding values (for the "Sentence #" column).
df_ner.ffill(inplace=True)

Repeat the same procedure as in **(b)** with [`DictVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.DictVectorizer.html) on `df_clf`. Use [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html), with `test_size=0.5` and `shuffle=False`.

In [30]:
df_clf = df_ner[["Word", "POS", "Tag"]]

d = df_clf[['Word', 'POS']].to_dict(orient='records')
vectorizer = DictVectorizer()
X = vectorizer.fit_transform(d)
y = df_clf['Tag']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, shuffle=False)

classes = np.unique(df_clf.Tag.values).tolist()

In [31]:
nb = MultinomialNB()
nb.partial_fit(X_train, y_train, classes)

MultinomialNB()

Let's drop the `O` tag, since it is the most frequent tag and it is hard to interpret the performance quality when it is included. This will give us a more realistic $F_1$ score. If you wish, you can compare the results by setting `labels=classes` instead of `labels=new_classes`. If your classifier performs terribly, that is expected, so don't worry.
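
The effect of a dominant `O` tag can be illustrated on toy data. In the sketch below, a degenerate "model" predicts `O` for every token. The helper `micro_f1` and the label counts are made up for this example.

```python
def micro_f1(y_true, y_pred, labels):
    """Micro-averaged F1 restricted to `labels`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == p and t in labels for t, p in pairs)
    fp = sum(t != p and p in labels for t, p in pairs)
    fn = sum(t != p and t in labels for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data: 95 `O` tokens and 5 entity tokens; the "model" always predicts `O`.
entity_tags = ["B-per", "I-per", "B-geo", "B-org", "B-tim"]
y_true = ["O"] * 95 + entity_tags
y_pred = ["O"] * 100

with_o = micro_f1(y_true, y_pred, labels=set(entity_tags) | {"O"})
without_o = micro_f1(y_true, y_pred, labels=set(entity_tags))
print(f"with O: {with_o:.2f}, without O: {without_o:.2f}")
```

With `O` included the score looks excellent even though not a single entity was found. Without it, the score drops to zero.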

In [32]:
# Drop the `O` tag by name instead of relying on its position in the sorted list.
new_classes = [c for c in classes if c != "O"]
print(
    classification_report(y_pred=nb.predict(X_test), y_true=y_test, labels=new_classes)
)



              precision    recall  f1-score   support

       B-art       0.00      0.00      0.00        27
       B-eve       0.00      0.00      0.00        14
       B-geo       0.41      0.94      0.57      1813
       B-gpe       0.96      0.68      0.80       772
       B-nat       0.00      0.00      0.00        12
       B-org       0.65      0.33      0.44       917
       B-per       0.81      0.44      0.57       879
       B-tim       0.87      0.63      0.73       943
       I-art       0.00      0.00      0.00        16
       I-eve       0.00      0.00      0.00        14
       I-geo       0.90      0.23      0.37       387
       I-gpe       0.00      0.00      0.00        20
       I-nat       0.00      0.00      0.00         2
       I-org       0.72      0.28      0.40       781
       I-per       0.64      0.31      0.41       915
       I-tim       0.00      0.00      0.00       310

   micro avg       0.57      0.52      0.55      7822
   macro avg       0.37   

Let's try to improve the performance with the sequence labeling approach. Specifically, we're going to use CRF. First, we have to prepare the sentence-level dataset.

In [33]:
from collections import Counter

import sklearn_crfsuite
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics


sentences = df_ner.groupby("Sentence #").Word.agg(lambda s: " ".join(s)).values.tolist()
processed = preprocess_pipe(sentences, entity_pipe)

#### (c)
Implement missing features in `token2features`:
- -1:token.lower() = preceding token in lowercase
- -1:token.istitle() = is the preceding token in title case
- -1:token.isupper() = is the preceding token in uppercase
- -1:postag = POS tag of the preceding token

Analogously, add the same features for succeeding tokens.

In [34]:
def token2features(sent, i):
    token = sent[i][0]
    postag = sent[i][1]

    features = {
        "bias": 1.0,
        "token.lower()": token.lower(),
        "token[-3:]": token[-3:],
        "token[-2:]": token[-2:],
        "token.isupper()": token.isupper(),
        "token.istitle()": token.istitle(),
        "token.isdigit()": token.isdigit(),
        "postag": postag,
        "postag[:2]": postag[:2]
    }
    if i > 0:
        features.update(
            {
                "-1:token.lower()": sent[i-1][0].lower(),
                "-1:token.istitle()": sent[i-1][0].istitle(),
                "-1:token.isupper()": sent[i-1][0].isupper(),
                "-1:postag": sent[i-1][1],
            }
        )
    else:
        features["BOS"] = True
        
    if i < len(sent) - 1:
        features.update(
            {
                "+1:token.lower()": sent[i+1][0].lower(),
                "+1:token.istitle()": sent[i+1][0].istitle(),
                "+1:token.isupper()": sent[i+1][0].isupper(),
                "+1:postag": sent[i+1][1],
            }
        )
    else:
        features["EOS"] = True
    return features


def sent2features(sent):
    return [token2features(sent, i) for i in range(len(sent))]


def sent2labels(sent):
    return [label for _, _, label in sent]


def sent2tokens(sent):
    return [token for token, _, _ in sent]

In [35]:
ex3_b1 = [
    ("Thousands", "NNS", "B-CARDINAL"),
    ("of", "IN", "O"),
    ("demonstrators", "NNS", "O"),
    ("have", "VBP", "O"),
    ("marched", "VBN", "O"),
    ("through", "IN", "O"),
    ("London", "NNP", "B-GPE"),
    ("to", "TO", "O"),
    ("protest", "VB", "O"),
    ("the", "DT", "O"),
    ("war", "NN", "O"),
    ("in", "IN", "O"),
    ("Iraq", "NNP", "B-GPE"),
    ("and", "CC", "O"),
    ("demand", "VB", "O"),
    ("the", "DT", "O"),
    ("withdrawal", "NN", "O"),
    ("of", "IN", "O"),
    ("British", "JJ", "B-NORP"),
    ("troops", "NNS", "O"),
    ("from", "IN", "O"),
    ("that", "DT", "O"),
    ("country", "NN", "O"),
    (".", ".", "O"),
]

sol3_b1 = {
    "bias": 1.0,
    "token.lower()": "through",
    "token[-3:]": "ugh",
    "token[-2:]": "gh",
    "token.isupper()": False,
    "token.istitle()": False,
    "token.isdigit()": False,
    "postag": "IN",
    "postag[:2]": "IN",
    "-1:token.lower()": "marched",
    "-1:token.istitle()": False,
    "-1:token.isupper()": False,
    "-1:postag": "VBN",
    "+1:token.lower()": "london",
    "+1:token.istitle()": True,
    "+1:token.isupper()": False,
    "+1:postag": "NNP",
}

assert sent2features(ex3_b1)[5] == sol3_b1

In [36]:
X = [sent2features(s) for s in processed]
y = [sent2labels(s) for s in processed]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, shuffle=False)

If the training lasts longer than ~10 minutes, you can reduce `max_iterations`.

In [37]:
crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100, all_possible_transitions=True
)
crf.fit(X_train, y_train)



CRF(algorithm='lbfgs', all_possible_transitions=True, c1=0.1, c2=0.1,
    keep_tempfiles=None, max_iterations=100)

CRF should heavily outperform our previous attempt with the classifier. Check the performance without the `O` tag. If you wish, you can see how $F_1$ changes if you include the `O` tag, simply by setting `labels=classes` in `flat_classification_report`. The benefits of solving NER as a sequence labeling task should be obvious after you inspect the margin of improvement.

In [38]:
new_classes = [s.upper() for s in new_classes]

y_pred = crf.predict(X_test)
print(metrics.flat_classification_report(y_test, y_pred, labels=new_classes))



              precision    recall  f1-score   support

       B-ART       0.00      0.00      0.00         0
       B-EVE       0.00      0.00      0.00         0
       B-GEO       0.00      0.00      0.00         0
       B-GPE       0.88      0.94      0.91      1749
       B-NAT       0.00      0.00      0.00         0
       B-ORG       0.76      0.70      0.72       740
       B-PER       0.00      0.00      0.00         0
       B-TIM       0.00      0.00      0.00         0
       I-ART       0.00      0.00      0.00         0
       I-EVE       0.00      0.00      0.00         0
       I-GEO       0.00      0.00      0.00         0
       I-GPE       0.86      0.87      0.86       370
       I-NAT       0.00      0.00      0.00         0
       I-ORG       0.76      0.82      0.79       878
       I-PER       0.00      0.00      0.00         0
       I-TIM       0.00      0.00      0.00         0

   micro avg       0.83      0.86      0.84      3737
   macro avg       0.20   

Let's explore the top (un)likely transitions. Can you spot any expected patterns?

In [39]:
top_n_trans = 20


def print_transitions(trans_features):
    for (label_from, label_to), weight in trans_features:
        print("%-14s -> %-14s: %0.5f" % (label_from, label_to, weight))


print("Top likely transitions:")
print_transitions(Counter(crf.transition_features_).most_common(top_n_trans))
print("\nTop unlikely transitions:")
print_transitions(Counter(crf.transition_features_).most_common()[-top_n_trans:])

Top likely transitions:
I-FAC          -> I-FAC         : 6.83986
I-CARDINAL     -> I-CARDINAL    : 6.62248
I-EVENT        -> I-EVENT       : 6.57318
B-PERSON       -> I-PERSON      : 6.32764
B-TIME         -> I-TIME        : 6.06083
I-GPE          -> I-GPE         : 6.05518
I-ORG          -> I-ORG         : 6.02829
B-PERCENT      -> I-PERCENT     : 5.97228
B-CARDINAL     -> I-CARDINAL    : 5.93175
B-LOC          -> I-LOC         : 5.89972
B-EVENT        -> I-EVENT       : 5.83768
I-PERSON       -> I-PERSON      : 5.82728
I-MONEY        -> I-MONEY       : 5.74150
B-QUANTITY     -> I-QUANTITY    : 5.57306
B-MONEY        -> I-MONEY       : 5.56706
I-DATE         -> I-DATE        : 5.51564
B-FAC          -> I-FAC         : 5.50932
B-WORK_OF_ART  -> I-WORK_OF_ART : 5.47687
B-DATE         -> I-DATE        : 5.40929
I-TIME         -> I-TIME        : 5.40436

Top unlikely transitions:
B-GPE          -> I-ORG         : -1.91419
B-NORP         -> B-ORG         : -1.95154
O              -> I-PER
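
In a linear-chain CRF, transition weights like these are added to the per-token (state) scores, and the best label sequence is found with the Viterbi algorithm. The following is a minimal sketch of that decoding step. All scores are hypothetical numbers chosen for illustration and are not taken from the fitted model.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions: one {label: score} dict per token (state scores);
    transitions: {(label_from, label_to): weight}, missing pairs count as 0.
    """
    best = {lab: (emissions[0][lab], [lab]) for lab in labels}
    for scores in emissions[1:]:
        new_best = {}
        for lab in labels:
            # Pick the best predecessor for `lab` at this position.
            prev = max(labels, key=lambda p: best[p][0] + transitions.get((p, lab), 0.0))
            total = best[prev][0] + transitions.get((prev, lab), 0.0) + scores[lab]
            new_best[lab] = (total, best[prev][1] + [lab])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


labels = ["O", "B-PER", "I-PER"]
# Hypothetical scores: token-level evidence alone prefers "O" for the 2nd token.
emissions = [
    {"O": 0.0, "B-PER": 2.0, "I-PER": 0.0},
    {"O": 1.0, "B-PER": 0.0, "I-PER": 0.5},
]
transitions = {("B-PER", "I-PER"): 3.0, ("O", "I-PER"): -3.0}

independent = [max(labels, key=scores.get) for scores in emissions]
joint = viterbi(emissions, transitions, labels)
print(independent, joint)  # ['B-PER', 'O'] ['B-PER', 'I-PER']
```

Labeling each token independently ends the entity after one token, while the strong `B-PER -> I-PER` transition lets the joint decoding continue it.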

Additionally, let's take a look at the most important features for specific tags.

In [40]:
top_n_feat = 30


def print_state_features(state_features):
    for (attr, label), weight in state_features:
        print("%0.5f %-14s %s" % (weight, label, attr))


print("Top positive:")
print_state_features(Counter(crf.state_features_).most_common(top_n_feat))

print()

print("Top negative:")
print_state_features(Counter(crf.state_features_).most_common()[-top_n_feat:])

Top positive:
5.61942 B-PERSON       -1:token.lower():mr.
4.99355 O              bias
4.96132 B-DATE         token[-3:]:day
4.45379 B-LOC          token.lower():asia
4.37708 B-CARDINAL     token.lower():millions
4.36302 O              BOS
4.22766 B-ORDINAL      token[-2:]:th
4.21113 I-DATE         token[-2:]:0s
4.19905 B-NORP         token.istitle()
4.09768 O              token.lower():president
3.73460 B-NORP         token.lower():shi'ite
3.68081 B-ORG          token.lower():taliban
3.67184 B-GPE          token.lower():ukrainian
3.56562 O              token.lower():minister
3.49082 B-ORG          token.lower():cholera
3.46878 B-LOC          token.lower():siberia
3.43886 B-PERSON       -1:token.lower():minister
3.43404 O              +1:token.lower():pacific
3.42720 B-NORP         token.lower():baluchistan
3.41009 B-CARDINAL     token.lower():dozens
3.39178 O              -1:token.lower():late
3.38578 B-ORG          token.lower():commonwealth
3.34223 I-DATE         -1:token.lower():las

Let's conclude this assignment with an overview of CRF feature importance using the `eli5` library.

In [41]:
import eli5

eli5.show_weights(crf, top=10)

AttributeError: module 'jinja2.ext' has no attribute 'with_'