# Named Entity Recognition using Flair on CONLL-2003
## Experiment description
This notebook contains an ML fabric flow for Named Entity Recognition
using the [flair NLP package](https://github.com/flairNLP/flair/).

Note: This example implements all the required data objects. For a clean notebook which uses the objects already implemented in the `ner_sample` Python package [click here](flair_ner_clean.ipynb).

##### Jupyter helpers:

In [None]:
%reload_ext autoreload
%autoreload 2

Define imports

In [None]:
from pathlib import Path
from typing import List, Tuple
import copy

import flair
import requests
import torch
from flair.data import Corpus, Sentence
from flair.datasets import CONLL_03
from flair.embeddings import TokenEmbeddings, WordEmbeddings, StackedEmbeddings, PooledFlairEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer
from seqeval.metrics import f1_score, accuracy_score
from tqdm.notebook import tqdm

from ner_sample import ExperimentRunner
from ner_sample.data import DataLoader
from ner_sample.evaluation import Evaluator, EvaluationMetrics
from ner_sample.experimentation import MlflowExperimentation
from ner_sample.models import BaseModel


### Dataset loading

The first step is to implement a `DataLoader`. A `DataLoader` defines the logic for obtaining a dataset: it can fetch the dataset from a local folder or from a remote location such as the web, S3, Blob Storage or similar.

To implement a `DataLoader`, two main functions need to be created:
- `download_dataset`: downloads the dataset to the local machine (it should be implemented so that it downloads only once, and afterwards checks whether the dataset already exists locally)
- `get_dataset`: loads the dataset, ready for modeling, into the experiment code itself.

Note:
- Each dataset should have a name and a version, which record exactly which data was used in the experiment. This makes it possible to reproduce the experiment.
- The dataset obtained should already be ready for modeling. Any train/test split should be done before the dataset is loaded. We don't want random folds created on every load, because then our experiments would not be comparable (each one would use a different subset).
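A minimal, stdlib-only sketch of the "download only once" rule for a multi-fold dataset. The function `download_folds_once` and the injectable `fetch` callable are hypothetical (they are not part of `ner_sample`); injecting `fetch` lets the caching logic run without network access. Note that an existing fold only skips *that* fold, so a partially downloaded dataset still gets completed.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterable, List


def download_folds_once(
    base_url: str,
    folds: Iterable[str],
    target_dir: Path,
    fetch: Callable[[str], str],
) -> List[str]:
    """Download each fold into target_dir unless the file already exists.

    Returns the names of the folds that were actually fetched.
    """
    target_dir.mkdir(parents=True, exist_ok=True)
    fetched: List[str] = []
    for fold in folds:
        dataset_file = target_dir / fold
        if dataset_file.exists():
            continue  # already downloaded: skip this fold, keep checking the others
        dataset_file.write_text(fetch(base_url + fold))
        fetched.append(fold)
    return fetched


# Usage with a fake fetcher, so no network access is needed
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "conll_03"
    fake_fetch = lambda url: f"contents of {url}\n"
    folds = ["eng.train", "eng.testa", "eng.testb"]
    print(download_folds_once("https://example.org/", folds, target, fake_fetch))
    print(download_folds_once("https://example.org/", folds, target, fake_fetch))  # -> []
```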

> In this case, the dataset, *CONLL03*, is downloaded from GitHub.
> It has three folds: *eng.train* (train), *eng.testa* (dev) and *eng.testb* (test). We use the `flair` package to represent the dataset as a flair `Corpus`.

> Additional parameters (**downsample** in this case, which samples a fraction of the dataset) can be passed as kwargs to the superclass `__init__` method, and they will be logged automatically.
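One way a base class can log extra keyword arguments "automatically" is to keep everything passed to its `__init__` in a dict that the experiment logger reads later. This is a hypothetical sketch: `ParamLoggingBase`, `log_params` and `ToyDataLoader` are illustrative names, not the actual `ner_sample.data.DataLoader` implementation.

```python
from typing import Any, Dict


class ParamLoggingBase:
    """Hypothetical base class that keeps every constructor kwarg for later logging."""

    def __init__(self, dataset_name: str, dataset_version: str, **extra_params: Any):
        self.dataset_name = dataset_name
        self.dataset_version = dataset_version
        # Name and version are always recorded; extras (e.g. downsample) ride along
        self.params: Dict[str, Any] = {
            "dataset_name": dataset_name,
            "dataset_version": dataset_version,
            **extra_params,
        }

    def log_params(self) -> Dict[str, Any]:
        # A real implementation would forward these to MLflow or a similar backend
        return dict(self.params)


class ToyDataLoader(ParamLoggingBase):
    def __init__(self, dataset_name="conll_03", dataset_version="1", downsample=0.05):
        self.downsample = downsample
        super().__init__(dataset_name=dataset_name,
                         dataset_version=dataset_version,
                         downsample=downsample)


print(ToyDataLoader().log_params())
# {'dataset_name': 'conll_03', 'dataset_version': '1', 'downsample': 0.05}
```

The subclass only has to forward its own extras to `super().__init__`; nothing else needs to know which parameters exist.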

In [None]:
class ConllDataLoader(DataLoader):
    def __init__(
        self,
        dataset_name="conll_03",
        dataset_version="1",
        local_data_path="../data/processed/",
        dataset_path="https://raw.githubusercontent.com/glample/tagger/master/dataset/",
        downsample=0.05
    ):
        """
        Data Loader for the CONLL 03 dataset.
        download_dataset downloads the three datasets (train, testa and testb) from Github
        get_dataset returns a flair Corpus object holding the three datasets.
        """
        self.folds = ("eng.train", "eng.testa", "eng.testb")
        self.local_data_path = local_data_path
        self.dataset_path = dataset_path
        self.downsample = downsample
        super().__init__(dataset_name=dataset_name, 
                         dataset_version=dataset_version,
                         downsample=downsample)

    def download_dataset(self) -> None:
        if self.dataset_name == "conll_03" and self.dataset_version == "1":
            local_path = Path(self.local_data_path, self.dataset_name).resolve()
            local_path.mkdir(parents=True, exist_ok=True)

            for fold in self.folds:
                dataset_file = Path(local_path, fold)
                if dataset_file.exists():
                    # Skip only this fold; keep checking the remaining ones
                    print(f"Fold {fold} already exists, skipping download")
                    continue

                response = requests.get(self.dataset_path + fold)
                response.raise_for_status()  # fail loudly on HTTP errors
                with open(dataset_file, "w") as f:
                    f.write(response.text)
                print(f"Finished writing fold {fold} to {local_path}")

            print(
                f"Dataset {self.dataset_name} version {self.dataset_version} is ready in {local_path}"
            )

        else:
            raise ValueError("Selected dataset was not found")

    def get_dataset(self) -> Tuple:
        try:
            corpus = CONLL_03(base_path=self.local_data_path, in_memory=True)
            corpus = corpus.downsample(self.downsample)  # Just for example purposes

            train = corpus  # includes train and dev

            test = corpus.test

            # Copy labels to a new tag (Flair overrides the ner tag during prediction)
            for sentence in test:
                for token in sentence.tokens:
                    token.annotation_layers["gold_ner"] = copy.deepcopy(
                        token.annotation_layers["ner"]
                    )
                    token.annotation_layers["ner"][0].value = "O"

            return train, test

        except FileNotFoundError:
            print(
                f"Dataset {self.dataset_name} with version {self.dataset_version} "
                f"not found in {self.local_data_path}; run download_dataset() first"
            )
            raise


### Load data
Once we have implemented our `DataLoader`, we can instantiate it and call `download_dataset()` followed by `get_dataset()`. This way the notebook can run anywhere, since the data is fetched on demand.

In [None]:
data_loader = ConllDataLoader(dataset_name="conll_03", dataset_version="1")
data_loader.download_dataset()
train_corpus, test = data_loader.get_dataset()  # train_corpus is a flair Corpus containing train and dev
train = train_corpus.train
dev = train_corpus.dev

print(f"Train set type: {type(train)}")
print(f"Test set type: {type(test)}")

In [None]:
len(test)

In [None]:
print(f"First sample in the train set:\n {train[0]}")

### Experimentation

The next phase is defining the experiment logger. The default logger uses MLflow, but the API is generic and can be extended to any experiment logging mechanism.
The experimentation class collects all the parameters and metrics the experiment emits along the way (from the dataset name and version, through the model hyperparameters, up to the final metric values).

To use the default logger, call `MlflowExperimentation()`.

> Note: If you plan to use MLflow hosted on Databricks, follow these steps:
1. Pass `tracking_uri='databricks'` to the `MlflowExperimentation` object
2. See [this doc on how to create a personal access token](https://docs.databricks.com/dev-tools/api/latest/authentication.html#token-management)
3. See [this doc on setting up databricks-cli](https://docs.microsoft.com/en-us/azure/databricks/dev-tools/cli/)
4. [Create a new experiment in MLflow](https://docs.microsoft.com/en-us/azure/databricks/applications/mlflow/) (if needed)
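Because the experiment logger is meant to be swappable, it helps to picture it as a small interface. The sketch below is an assumption about what such an interface could look like (`ExperimentLogger`, `log_param`, `log_metric` and `InMemoryLogger` are hypothetical names, not the `ner_sample` API). An in-memory backend like this can be handy in unit tests where MLflow is not available.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class ExperimentLogger(ABC):
    """Hypothetical minimal interface an experiment logger could expose."""

    @abstractmethod
    def log_param(self, key: str, value: Any) -> None: ...

    @abstractmethod
    def log_metric(self, key: str, value: float) -> None: ...

    def log_params(self, params: Dict[str, Any]) -> None:
        # Convenience helper shared by every backend
        for key, value in params.items():
            self.log_param(key, value)


class InMemoryLogger(ExperimentLogger):
    """Stores everything in dicts; useful for tests or dry runs."""

    def __init__(self) -> None:
        self.params: Dict[str, Any] = {}
        self.metrics: Dict[str, float] = {}

    def log_param(self, key: str, value: Any) -> None:
        self.params[key] = value

    def log_metric(self, key: str, value: float) -> None:
        self.metrics[key] = float(value)


logger = InMemoryLogger()
logger.log_params({"dataset_name": "conll_03", "dataset_version": "1"})
logger.log_metric("f1", 0.5)  # dummy value for illustration only
print(logger.params, logger.metrics)
```

An MLflow-backed subclass would forward the two abstract methods to `mlflow.log_param` and `mlflow.log_metric`.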


In [None]:
experimentation = MlflowExperimentation()

### Modeling

The next step is writing our actual model, with optional preprocessing and postprocessing.

The class to implement is `BaseModel`, which exposes the sklearn-style `fit` and `predict` functions that need to be implemented.

Note:
- The model class needs to define which parameters should be logged, either by adding keys and values to the `self.hyper_params` dict or by passing the variables to the superclass constructor.
- The base class contains fields for DataProcessors: `preprocessor` and `postprocessor`. Use these if you want preprocessing or postprocessing to happen inside the model call. This makes it easier to operationalize the model in a new environment, without having to provide all the preprocessing and postprocessing scripts separately.
- It is also possible to pass the `Experimentation` object if it is needed during training (for example, to store values for each epoch).
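The preprocessor/postprocessor notes amount to wrapping the model's own prediction between two optional steps, so that the deployed object carries its whole pipeline. A minimal sketch under assumptions: `SketchBaseModel`, `_predict` and the plain-callable processors are hypothetical stand-ins for the `BaseModel` and DataProcessor classes in `ner_sample`, whose actual signatures may differ.

```python
from typing import Any, Callable, Dict, Optional

Processor = Callable[[Any], Any]


class SketchBaseModel:
    """Hypothetical sklearn-style base class with optional pre/postprocessing."""

    def __init__(self, preprocessor: Optional[Processor] = None,
                 postprocessor: Optional[Processor] = None, **hyper_params: Any):
        self.preprocessor = preprocessor
        self.postprocessor = postprocessor
        self.hyper_params: Dict[str, Any] = dict(hyper_params)  # values to log

    def fit(self, X, y=None) -> None:
        raise NotImplementedError

    def _predict(self, X):
        raise NotImplementedError

    def predict(self, X):
        # preprocess -> model-specific prediction -> postprocess
        if self.preprocessor is not None:
            X = self.preprocessor(X)
        predictions = self._predict(X)
        if self.postprocessor is not None:
            predictions = self.postprocessor(predictions)
        return predictions


class CapitalizedTagger(SketchBaseModel):
    """Toy model: tags capitalized tokens as 'ENT' and everything else as 'O'."""

    def fit(self, X, y=None) -> None:
        pass  # nothing to learn in this toy example

    def _predict(self, X):
        return [["ENT" if tok[:1].isupper() else "O" for tok in sent] for sent in X]


# The tokenizer travels with the model, so callers can pass raw strings
model = CapitalizedTagger(preprocessor=lambda texts: [t.split() for t in texts])
print(model.predict(["In Penny Lane there is a barber"]))
```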

> This example uses the `flair` framework to create a NER model. All the model hyperparameters are stored as instance attributes and passed to the parent class constructor for logging. Some hyperparameters are also collected from the underlying PyTorch model (in `get_hyper_params`) to better cover the hyperparameters of each run.
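The `get_hyper_params` trick keeps only those attributes of the underlying model whose values are simple scalars, which are safe to log as parameters. The stand-alone sketch below reproduces that filter on a hypothetical `FakeTagger` object so it runs without flair or torch; `collect_scalar_attributes` is an illustrative name.

```python
from typing import Any, Dict

LOGGABLE_TYPES = (bool, float, int, str)


def collect_scalar_attributes(obj: Any, **explicit: Any) -> Dict[str, Any]:
    """Merge explicit hyperparameters with the scalar attributes of obj."""
    scalars = {
        name: value
        for name, value in vars(obj).items()
        if type(value) in LOGGABLE_TYPES  # exact type check, as in get_hyper_params
    }
    params = dict(explicit)
    params.update(scalars)  # attribute values win on name clashes, as in the notebook
    return params


class FakeTagger:
    def __init__(self):
        self.hidden_size = 256
        self.use_crf = False
        self.tag_type = "ner"
        self.embeddings = object()       # not a scalar: skipped
        self.tag_dictionary = {"O": 0}   # not a scalar: skipped


print(collect_scalar_attributes(FakeTagger(), max_epochs=10))
```

Two design consequences carry over to `FlairNERModel`: the exact `type(...) in (...)` check skips subclasses of the scalar types, and because the tagger's attributes are merged last, they override a constructor argument with the same name (such as `hidden_size`).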

In [None]:
class FlairNERModel(BaseModel):
    def __init__(
        self,
        corpus: Corpus,
        hidden_size: int = 256,
        pooling: str = "min",
        word_embeddings: str = "glove",
        train_with_dev: bool = True,
        max_epochs: int = 10,
    ):
        """
        NER detector using the Flair NLP package.
        Source: https://github.com/flairNLP/flair/blob/master/resources/docs/EXPERIMENTS.md
        All class inputs (except for the corpus) are model hyper parameters.
        They are then directed to the base class and get logged into the experiment logger
        """
        self.tag_type = "ner"
        self.tag_dictionary = None
        self.tagger = None
        self.embeddings = None

        self.hidden_size = hidden_size
        self.pooling = pooling
        self.word_embeddings = word_embeddings
        self.train_with_dev = train_with_dev
        self.max_epochs = max_epochs

        self.set_tagger_definition(corpus)

        hyper_params = self.get_hyper_params(
            hidden_size=hidden_size,
            pooling=pooling,
            word_embeddings=word_embeddings,
            train_with_dev=train_with_dev,
            max_epochs=max_epochs,
        )

        super().__init__(**hyper_params)

    def fit(self, X, y=None) -> None:
        # X is a flair Corpus (train + dev); y is unused because labels live in the sentences
        trainer: ModelTrainer = ModelTrainer(self.tagger, X)

        trainer.train(
            "models/taggers/flair-ner",
            train_with_dev=self.train_with_dev,
            max_epochs=self.max_epochs,
        )

    def predict(self, X):
        tagged_sentences = []
        for sentence in tqdm(X):
            self.tagger.predict(sentence)
            tagged_sentences.append(sentence)
        print(f"Tagged {len(tagged_sentences)} sentences")
        return tagged_sentences

    def get_hyper_params(self, **hyper_params):
        basic_params = {
            param_name: param_value
            for (param_name, param_value) in self.tagger.__dict__.items()
            if type(param_value) in (bool, float, int, str)
        }
        hyper_params.update(basic_params)
        return hyper_params

    def set_embeddings_definition(self):
        """
        Sets the embedding layers used by this tagger
        """
        # initialize embeddings
        embedding_types: List[TokenEmbeddings] = [
            # Word embeddings (default = GloVe)
            WordEmbeddings(self.word_embeddings),
            # contextual string embeddings, forward
            PooledFlairEmbeddings("news-forward", pooling=self.pooling),
            # contextual string embeddings, backward
            PooledFlairEmbeddings("news-backward", pooling=self.pooling),
        ]
        self.embeddings: StackedEmbeddings = StackedEmbeddings(
            embeddings=embedding_types
        )

    def set_tagger_definition(self, corpus: Corpus):
        """
        Builds the Flair SequenceTagger (the full model) and stores it in self.tagger
        :param corpus: Used only for setting the tag_dictionary
        """

        if not self.embeddings:
            self.set_embeddings_definition()
        self.tag_dictionary = corpus.make_tag_dictionary(tag_type=self.tag_type)

        tagger: SequenceTagger = SequenceTagger(
            hidden_size=self.hidden_size,
            embeddings=self.embeddings,
            tag_dictionary=self.tag_dictionary,
            tag_type=self.tag_type,
            use_crf=False,
        )
        self.tagger = tagger


### Model training

The model we just created can be instantiated and fitted here. Alternatively, we can postpone fitting to the final step, which runs a full experiment cycle.

> In this example, we skip training as it takes a lot of time. Instead, we load a pre-trained model directly from `flair`.

In [None]:
model = FlairNERModel(corpus=train_corpus)

TRAIN = False

if TRAIN:
    model.fit(train_corpus)
else:
    # Simulate training has finished by downloading a pretrained model
    model.tagger = SequenceTagger.load('ner')

#### Prediction

Once we have a fitted model, we can run the experiment to validate its performance and log results. Before that, let's verify that we get something meaningful when calling the model:

In [None]:
example_sentence = Sentence("In Penny Lane, there is a barber showing photographs")

model.predict([example_sentence])
for token in example_sentence.tokens:
    print(f" {token.text} | {token.get_tag('ner')}")

Looks good!

## Model evaluation
In this phase, we define how the model should be evaluated. There are two main building blocks:
- `Evaluator`: holds the logic for how evaluation takes place. The function to implement is `evaluate`.
- `EvaluationMetrics`: holds the actual metric values. The function to implement is `get_metrics`.

> In this example we implement `NEREvaluator` and `NEREvaluationMetrics` with our task-specific logic and metrics. We use the `seqeval` package to calculate the **F1** and **accuracy** metrics for the NER task (F1 is computed over entity spans, accuracy over individual tags).
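To make the difference between the two metrics concrete, here is a small stdlib-only sketch of entity-level F1 and token-level accuracy. It is a simplified illustration of the idea behind these scores, not `seqeval`'s implementation, and it assumes IOB2 tags (`B-`/`I-`/`O`). Flair's CoNLL-03 loader may use a different scheme such as BIOES, which this sketch does not handle.

```python
from typing import List, Optional, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one sentence of IOB2 tags (e.g. B-PER, I-PER, O).

    Lenient rule: an I-X that follows O or a different type opens a new span.
    """
    spans: Set[Span] = set()
    start, current = 0, None  # type: int, Optional[str]
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" closes the last open span
        prefix, _, entity_type = tag.partition("-")
        continues = prefix == "I" and entity_type == current
        if current is not None and not continues:
            spans.add((start, i, current))
            current = None
        if prefix in ("B", "I") and not continues:
            start, current = i, entity_type
    return spans


def entity_f1(golds: List[List[str]], preds: List[List[str]]) -> float:
    """Micro-averaged F1 over exactly matching (start, end, type) spans."""
    true_pos = n_gold = n_pred = 0
    for gold_tags, pred_tags in zip(golds, preds):
        gold_spans, pred_spans = extract_spans(gold_tags), extract_spans(pred_tags)
        true_pos += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def token_accuracy(golds: List[List[str]], preds: List[List[str]]) -> float:
    """Fraction of individual tags that match, including O tags."""
    pairs = [(g, p) for gs, ps in zip(golds, preds) for g, p in zip(gs, ps)]
    return sum(g == p for g, p in pairs) / len(pairs) if pairs else 0.0


golds = [["B-PER", "I-PER", "O", "B-LOC"]]
preds = [["B-PER", "O", "O", "B-LOC"]]  # PER span truncated, LOC correct
print(entity_f1(golds, preds))       # 0.5  (1 of 2 spans matched on each side)
print(token_accuracy(golds, preds))  # 0.75 (3 of 4 tags match)
```

A partially tagged entity counts as a full miss for F1 but still earns partial credit in token accuracy, and the many `O` tags inflate accuracy further, so accuracy tends to be higher than F1 on NER.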

In [None]:
class NEREvaluationMetrics(EvaluationMetrics):
    """
    This class holds the metrics calculated during the experiment run
    """

    def __init__(self, f1, accuracy):
        self.f1 = f1
        self.accuracy = accuracy
        super().__init__()

    def __repr__(self):
        return f"F1 score: {self.f1}, Accuracy score: {self.accuracy}"

    def get_metrics(self):
        """
        Return a dict with f1 and accuracy values
        """
        return {"f1": self.f1, "accuracy": self.accuracy}


class NEREvaluator(Evaluator):
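    """
    This class holds the logic for evaluating a prediction outcome.
    y_test is None in our case: the gold labels are stored on each
    sentence under the "gold_ner" tag (see ConllDataLoader.get_dataset)
    """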

    def evaluate(self, y_test, predictions) -> NEREvaluationMetrics:

        golds = []
        predicted = []
        for sentence in predictions:
            gold_tags = [token.get_tag("gold_ner").value for token in sentence.tokens]
            golds.append(gold_tags)
            predicted_tags = [token.get_tag("ner").value for token in sentence.tokens]
            predicted.append(predicted_tags)

        f1 = f1_score(golds, predicted)
        accuracy = accuracy_score(golds, predicted)
        return NEREvaluationMetrics(f1=f1, accuracy=accuracy)


evaluator = NEREvaluator()

### Running an experiment

To run the full experiment, we use the `ExperimentRunner` class. This class evaluates the model on a test dataset, calculates metrics, collects all params and metrics, and logs them to the experiment logger; it acts as an experiment orchestrator.
In addition to all the collected params and metrics, one can pass additional params to the `ExperimentRunner` call, and these will be collected too.

> In many cases the `ExperimentRunner` class can be used without any modification. If modifications are needed, make sure that you implement the various functions, and verify that the different params and metrics are logged correctly (in `__init__`).

Instantiate the `ExperimentRunner` object, while passing all the previous building blocks.

Finally, we call `experiment_runner.evaluate()` to perform prediction on the supplied test set, calculate metrics and store everything in the experiment logger.
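Conceptually, `evaluate()` strings the earlier building blocks together. The sketch below is a hypothetical, stripped-down orchestrator (`run_experiment` and its arguments are illustrative, not the real `ExperimentRunner` signature; in particular, the real evaluator also receives `y_test`) that shows the order of operations: predict on the test set, compute metrics, then log parameters and metrics together.

```python
from typing import Any, Callable, Dict, Iterable, List


def run_experiment(
    predict: Callable[[List[Any]], List[Any]],
    evaluate: Callable[[List[Any]], Dict[str, float]],
    X_test: Iterable[Any],
    params: Dict[str, Any],
    log: Callable[[str, Dict[str, Any]], None],
) -> Dict[str, float]:
    """Hypothetical orchestration: predict -> score -> log params and metrics."""
    predictions = predict(list(X_test))
    metrics = evaluate(predictions)
    log("params", params)
    log("metrics", metrics)
    return metrics


records = []
metrics = run_experiment(
    predict=lambda xs: [x.upper() for x in xs],              # stand-in model
    evaluate=lambda preds: {"n_predictions": float(len(preds))},  # stand-in metric
    X_test=["a", "b"],
    params={"experiment_name": "Experiment", "dataset_version": "1"},
    log=lambda kind, values: records.append((kind, values)),
)
print(metrics, records)
```

Keeping the runner ignorant of what the model, evaluator and logger do internally is what lets each block be swapped independently.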

In [None]:
experiment_runner = ExperimentRunner(
    model=model,
    X_train=train,
    X_test=test,
    data_loader=data_loader,
    log_experiment=True,
    experiment_logger=experimentation,
    evaluator=evaluator,
    experiment_name="Experiment",
    data_scientist="Omri"
)

results = experiment_runner.evaluate()
print(results)


### Summary

This example flow demonstrates how to use the different building blocks in this framework.

**Possible next steps:**
1. Implement the different modules in the Python package, and use them in other notebooks / scripts / modules
2. Run `mlflow ui` from this notebook's path and observe the different parameters and metrics stored
3. Create a notebook template for your experiment, which can be used to generate new notebooks containing the needed objects (loading data, experimentation, evaluation, running the experiment)

To start the MLflow UI, run `!mlflow ui` and open http://localhost:5000/#/ in your browser. Note that the cell keeps running until you interrupt it.

In [None]:
!mlflow ui

Open http://localhost:5000/#/