# Libraries For NER Task

This notebook explores the Hugging Face Transformers and Flair libraries for the **named entity recognition (NER) task** using the [annotated NER corpus](https://www.kaggle.com/datasets/abhinavwalia95/entity-annotated-corpus) dataset.

## Setup

First, let's prepare the data by transforming it into a list of sentences, which we'll feed as input to the models. These steps were previously explained and performed in last week's exploratory data analysis (EDA) notebook.

In [1]:
import pandas as pd

# Load the dataset
ner_dataset = pd.read_csv('/Users/julia/Datasets/ner_corpus/ner_dataset.csv', 
    encoding='latin1', on_bad_lines='warn', low_memory=False)

# Train-test split
train = ner_dataset[:838862]
test = ner_dataset[838862:]

# Define a function to perform data processing steps
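def process_data(df):
    """Convert token-per-row data into a list of sentence strings."""

    # Forward-fill missing values (the sentence label is set only on each sentence's first row)
    df_prepared = df.ffill(axis=0)
    # Turn labels such as 'Sentence: 1' into the integer 1
    df_prepared['Sentence #'] = df_prepared['Sentence #'].str.replace('Sentence: ', '', regex=False)
    df_prepared['Sentence #'] = df_prepared['Sentence #'].astype(int)

    # Join each sentence's words with spaces, then remove the space before periods
    sentences = df_prepared.groupby('Sentence #')['Word'].apply(lambda x: ' '.join(x))
    sentences = sentences.str.replace(' .', '.', regex=False)
    sentence_list = sentences.tolist()

    return sentence_list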

# Process data for the training set
train_list = process_data(train)
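
To make the processing steps concrete, here is a plain-Python sketch of the same forward-fill, group, and join logic on a few made-up rows. The rows and the helper name `rows_to_sentences` are illustrative assumptions, not part of the dataset or the notebook's pipeline.

```python
# Made-up rows in the dataset's layout: the sentence label appears only on
# the first token of each sentence and is missing (None) on the others.
toy_rows = [
    ("Sentence: 1", "They"),
    (None, "marched"),
    (None, "."),
    ("Sentence: 2", "Police"),
    (None, "arrived"),
    (None, "."),
]

def rows_to_sentences(rows):
    """Forward-fill the sentence label, group words by sentence, join them."""
    grouped = {}    # sentence number -> list of words
    current = None  # most recent sentence number seen (the "forward fill")
    for label, word in rows:
        if label is not None:
            current = int(label.replace("Sentence: ", ""))
        if current is None:
            continue  # rows before the first label belong to no sentence
        grouped.setdefault(current, []).append(word)
    # Join words with spaces, then remove the space before a period,
    # mirroring the str.replace(' .', '.') step in process_data.
    return [" ".join(grouped[n]).replace(" .", ".") for n in sorted(grouped)]

toy_sentences = rows_to_sentences(toy_rows)
print(toy_sentences)  # ['They marched.', 'Police arrived.']
```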

`train_list` is a list of sentences. Let's print the first five:

In [2]:
train_list[0:5]

['Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country.',
 'Families of soldiers killed in the conflict joined the protesters who carried banners with such slogans as " Bush Number One Terrorist " and " Stop the Bombings. "',
 'They marched from the Houses of Parliament to a rally in Hyde Park.',
 'Police put the number of marchers at 10,000 while organizers claimed it was 1,00,000.',
 "The protest comes on the eve of the annual conference of Britain 's ruling Labor Party in the southern English seaside resort of Brighton."]

## Hugging Face & Flair

In this section, we'll **compare the usage and performance of Hugging Face and Flair** to help us select a library for establishing a model baseline. Note that at this point, we are not training or fine-tuning any models yet; we're simply running pretrained models in inference mode.

Here are details about each:
- **Hugging Face**: We'll use the Hugging Face Transformers library's pipelines ([documentation](https://huggingface.co/docs/transformers/main_classes/pipelines)), a simple API for quickly running a model. Specifically, we'll use `TokenClassificationPipeline` ([documentation](https://huggingface.co/docs/transformers/v4.24.0/en/main_classes/pipelines#transformers.TokenClassificationPipeline)), which performs NER. This pipeline accepts [Hugging Face models that were fine-tuned on a token classification task](https://huggingface.co/models?pipeline_tag=token-classification), and we'll use the default NER model, [dbmdz/bert-large-cased-finetuned-conll03-english](https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english). This model was presumably fine-tuned on the CoNLL-2003 NER dataset ([Hugging Face dataset card](https://huggingface.co/datasets/conll2003) | [Papers With Code](https://paperswithcode.com/dataset/conll-2003)), which has four entity types: locations, organizations, people, and miscellaneous.
- **Flair**: Flair ([GitHub](https://github.com/flairNLP/flair)) is a state-of-the-art NLP library built on top of PyTorch. Its [flair/ner-english-large](https://huggingface.co/flair/ner-english-large) model reports an F1 score of 94.36 on CoNLL-03, and its [flair/ner-english-fast](https://huggingface.co/flair/ner-english-fast) model reports 92.92. Both models predict 4 entity tags: PER, LOC, ORG, and MISC. Another English model, [flair/ner-english-ontonotes-large](https://huggingface.co/flair/ner-english-ontonotes-large), can predict 18 entity tags. In this section, we'll use `flair/ner-english-fast`.

Set up **Hugging Face Transformers**:

In [3]:
from transformers import AutoTokenizer, AutoModelForTokenClassification, TokenClassificationPipeline

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")

pipe = TokenClassificationPipeline(model=model, tokenizer=tokenizer, ignore_labels=["O"], aggregation_strategy='simple')
output = pipe(train_list[0:5])
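
With `aggregation_strategy='simple'`, the pipeline returns grouped entities (hence the `entity_group` key used later) instead of one prediction per token. The sketch below illustrates the general idea of merging token-level B-/I- tags into entities. It is a simplified illustration, not the library's implementation, which also has to deal with subword tokens and scores. The helper `group_bio` and the hand-written tags are assumptions made for this example.

```python
def group_bio(tokens, tags):
    """Merge token-level BIO tags into (label, text) entity groups."""
    groups = []
    current_label, current_words = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        # An I- tag with the same label continues the open entity.
        if prefix == "I" and label == current_label:
            current_words.append(token)
            continue
        # Anything else closes the open entity, if there is one.
        if current_label is not None:
            groups.append((current_label, " ".join(current_words)))
        if prefix in ("B", "I"):
            current_label, current_words = label, [token]
        else:
            current_label, current_words = None, []
    if current_label is not None:
        groups.append((current_label, " ".join(current_words)))
    return groups

# Hand-written tags for illustration only (not actual model output).
tokens = "They marched from the Houses of Parliament to a rally in Hyde Park .".split()
tags = ["O", "O", "O", "O", "B-ORG", "I-ORG", "I-ORG",
        "O", "O", "O", "O", "B-LOC", "I-LOC", "O"]

entities = group_bio(tokens, tags)
print(entities)  # [('ORG', 'Houses of Parliament'), ('LOC', 'Hyde Park')]
```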

Set up **Flair**:

In [4]:
from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load('flair/ner-english-fast')



2022-12-02 12:02:11,154 loading file /Users/julia/.flair/models/ner-english-fast/4c58e7191ff952c030b82db25b3694b58800b0e722ff15427f527e1631ed6142.e13c7c4664ffe2bbfa8f1f5375bd0dced866b8c1dd7ff89a6d705518abf0a611
2022-12-02 12:02:12,972 SequenceTagger predicts: Dictionary with 20 tags: <unk>, O, S-ORG, S-MISC, B-PER, E-PER, S-LOC, B-ORG, E-ORG, I-PER, S-PER, B-MISC, I-MISC, E-MISC, I-ORG, B-LOC, E-LOC, I-LOC, <START>, <STOP>
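
The tag dictionary in the log above uses single (`S-`), begin (`B-`), inside (`I-`), and end (`E-`) prefixes, whereas the `seqeval` example later in this notebook uses plain `B-`/`I-` tags. If token-level tags from the two schemes ever need to be compared, one option is to map the first onto the second. This is a minimal sketch assuming the usual meaning of these prefixes; `bioes_to_bio` is a hypothetical helper, not part of Flair.

```python
def bioes_to_bio(tags):
    """Map S-/B-/I-/E- prefixed tags onto plain B-/I- tags."""
    # S (single-token entity) and B both start an entity; E ends one,
    # so in the B-/I- scheme it is just another inside tag.
    prefix_map = {"S": "B", "B": "B", "I": "I", "E": "I"}
    converted = []
    for tag in tags:
        prefix, sep, label = tag.partition("-")
        if sep and prefix in prefix_map:
            converted.append(f"{prefix_map[prefix]}-{label}")
        else:
            converted.append(tag)  # "O" and special tags pass through unchanged
    return converted

# Hand-written tag sequence for illustration only.
example_tags = ["O", "S-LOC", "O", "B-ORG", "I-ORG", "E-ORG", "B-PER", "E-PER"]
converted_tags = bioes_to_bio(example_tags)
print(converted_tags)
# ['O', 'B-LOC', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'B-PER', 'I-PER']
```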


Both models agree on **sentence #0**:

In [5]:
sentence0 = Sentence(train_list[0])
tagger.predict(sentence0)
print(sentence0, 'from Flair')
print([(o['entity_group'], o['word']) for o in output[0]], 'from Hugging Face')

Sentence: "Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country ." → ["London"/LOC, "Iraq"/LOC, "British"/MISC] from Flair
[('LOC', 'London'), ('LOC', 'Iraq'), ('MISC', 'British')] from Hugging Face


Flair did a better job on **sentence #1** by recognizing each slogan as a whole entity, whereas the Hugging Face model tagged only "Bush":

In [6]:
sentence1 = Sentence(train_list[1])
tagger.predict(sentence1)
print(sentence1, 'from Flair')
print([(o['entity_group'], o['word']) for o in output[1]], 'from Hugging Face')

Sentence: "Families of soldiers killed in the conflict joined the protesters who carried banners with such slogans as " Bush Number One Terrorist " and " Stop the Bombings . "" → ["Bush Number One Terrorist"/MISC, "Stop the Bombings"/MISC] from Flair
[('MISC', 'Bush')] from Hugging Face


Flair did a better job on **sentence #2** by recognizing that "Houses of Parliament" is an organization and not a location:

In [7]:
sentence2 = Sentence(train_list[2])
tagger.predict(sentence2)
print(sentence2, 'from Flair')
print([(o['entity_group'], o['word']) for o in output[2]], 'from Hugging Face')

Sentence: "They marched from the Houses of Parliament to a rally in Hyde Park ." → ["Houses of Parliament"/ORG, "Hyde Park"/LOC] from Flair
[('LOC', 'Houses of'), ('ORG', 'Parliament'), ('LOC', 'Hyde Park')] from Hugging Face


Both models agree that **sentence #3** contains no entities:

In [8]:
sentence3 = Sentence(train_list[3])
tagger.predict(sentence3)
print(sentence3, 'from Flair')
print([(o['entity_group'], o['word']) for o in output[3]], 'from Hugging Face')

Sentence: "Police put the number of marchers at 10,000 while organizers claimed it was 1,00,000 ." from Flair
[] from Hugging Face


Both models agree on **sentence #4**:

In [9]:
sentence4 = Sentence(train_list[4])
tagger.predict(sentence4)
print(sentence4, 'from Flair')
print([(o['entity_group'], o['word']) for o in output[4]], 'from Hugging Face')

Sentence: "The protest comes on the eve of the annual conference of Britain ' s ruling Labor Party in the southern English seaside resort of Brighton ." → ["Britain"/LOC, "Labor Party"/ORG, "English"/MISC, "Brighton"/LOC] from Flair
[('LOC', 'Britain'), ('ORG', 'Labor Party'), ('MISC', 'English'), ('LOC', 'Brighton')] from Hugging Face
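
The side-by-side checks above can also be done programmatically by treating each model's output as a set of `(label, text)` pairs. The sketch below uses the sentence #2 results printed earlier, typed in by hand; the variable names are illustrative, and a set-based comparison ignores repeated mentions of the same entity within a sentence.

```python
# Sentence #2 predictions, copied by hand from the printed outputs above.
flair_entities = {("ORG", "Houses of Parliament"), ("LOC", "Hyde Park")}
hf_entities = {("LOC", "Houses of"), ("ORG", "Parliament"), ("LOC", "Hyde Park")}

agreed = flair_entities & hf_entities      # predicted by both models
only_flair = flair_entities - hf_entities  # predicted by Flair only
only_hf = hf_entities - flair_entities     # predicted by Hugging Face only

print("agreed:", sorted(agreed))
print("only Flair:", sorted(only_flair))
print("only Hugging Face:", sorted(only_hf))
# agreed: [('LOC', 'Hyde Park')]
# only Flair: [('ORG', 'Houses of Parliament')]
# only Hugging Face: [('LOC', 'Houses of'), ('ORG', 'Parliament')]
```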


From these spot checks, Flair's `ner-english-fast` model appears to do a better job at NER than the `bert-large-cased-finetuned-conll03-english` model hosted on Hugging Face. In addition, we found Flair's documentation and usability preferable to those of Hugging Face's Transformers library. For these reasons, **we'll proceed with using Flair to establish a model baseline** in a future notebook.

## Evaluation Metrics

Looking ahead, we may want to establish metrics for evaluating a model's performance. For NER tasks, it is common to report the **precision, recall, and F1 score for each entity type**. To calculate these, we're going to use `seqeval` ([GitHub](https://github.com/chakki-works/seqeval)).

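Here's an example of how it works. Note that the metrics are reported **per entity type** (MISC, PER) as well as **overall** (the micro, macro, and weighted averages). A predicted entity counts as correct only if both its type and its boundaries match the true entity exactly, which is why MISC scores 0.00 in this example: the predicted MISC span starts one token too early.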

In [10]:
from seqeval.metrics import classification_report

y_true = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
y_pred = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]

print(classification_report(y_true, y_pred))

              precision    recall  f1-score   support

        MISC       0.00      0.00      0.00         1
         PER       1.00      1.00      1.00         1

   micro avg       0.50      0.50      0.50         2
   macro avg       0.50      0.50      0.50         2
weighted avg       0.50      0.50      0.50         2
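
To see where the per-type numbers in the report come from, the sketch below recomputes them by hand in plain Python. It mirrors the exact-match idea (an entity is correct only if its type, start, and end all agree) and is not `seqeval`'s actual implementation; the helper names `extract_entities` and `entity_scores` are assumptions made for this example.

```python
from collections import defaultdict

def extract_entities(tags):
    """Return the set of (label, start, end) spans in one BIO-tagged sequence."""
    entities, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a final span
        prefix, _, tag_label = tag.partition("-")
        continues = prefix == "I" and tag_label == label
        if not continues:
            if label is not None:
                entities.add((label, start, i - 1))
            start, label = (i, tag_label) if prefix in ("B", "I") else (None, None)
    return entities

def entity_scores(y_true, y_pred):
    """Exact-match precision, recall, and F1 per entity type."""
    true_count, pred_count, hit_count = defaultdict(int), defaultdict(int), defaultdict(int)
    for true_tags, pred_tags in zip(y_true, y_pred):
        true_ents, pred_ents = extract_entities(true_tags), extract_entities(pred_tags)
        for label, _, _ in true_ents:
            true_count[label] += 1
        for label, _, _ in pred_ents:
            pred_count[label] += 1
        for label, _, _ in true_ents & pred_ents:  # identical type and boundaries
            hit_count[label] += 1
    scores = {}
    for label in sorted(set(true_count) | set(pred_count)):
        precision = hit_count[label] / pred_count[label] if pred_count[label] else 0.0
        recall = hit_count[label] / true_count[label] if true_count[label] else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = (precision, recall, f1)
    return scores

y_true = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
y_pred = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]

scores = entity_scores(y_true, y_pred)
print(scores)  # {'MISC': (0.0, 0.0, 0.0), 'PER': (1.0, 1.0, 1.0)}
```

The true MISC entity spans tokens 3 to 5 while the predicted one spans tokens 2 to 5, so the two never match and MISC gets no credit, whereas the PER entity matches exactly.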



Also see scikit-learn's [precision_recall_fscore_support](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html).
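
One caveat: scikit-learn's function expects flat label arrays, so nested tag sequences would first need flattening, and the result then scores individual tokens rather than whole entities. The plain-Python sketch below shows how different the token-level view can look on the same toy data as the `seqeval` example; it computes simple token accuracy only and is not a substitute for either library.

```python
y_true = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
y_pred = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]

# Flatten the per-sentence tag lists into one list of token tags each.
flat_true = [tag for sentence in y_true for tag in sentence]
flat_pred = [tag for sentence in y_pred for tag in sentence]

# Count positions where the predicted tag equals the true tag.
matches = sum(t == p for t, p in zip(flat_true, flat_pred))
token_accuracy = matches / len(flat_true)

print(f"{matches} of {len(flat_true)} tokens match")  # 8 of 10 tokens match
print(token_accuracy)  # 0.8
```

Eight of ten tokens agree, yet only one of the two entities is recovered exactly, which is why entity-level scores are usually the more informative ones for NER.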