#### Named Entity Recognition (NER)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/nastyachizhikova/doc_test/blob/main/source/notebooks/NER.ipynb)

# Table of contents 

1. [Introduction to the task](#1.-Introduction-to-the-task)

2. [Get started with the model](#2.-Get-started-with-the-model)

3. [Use the model for prediction](#3.-Use-the-model-for-prediction)

    3.1. [Predict using Python](#3.1-Predict-using-Python)
    
    3.2. [Predict using Python pipeline](#3.2-Predict-using-Python-pipeline)
    
    3.3. [Predict using CLI](#3.3-Predict-using-CLI)
     
4. [Train the model on your data](#4.-Train-the-model-on-your-data)
    
    4.1. [Train your model from Python](#4.1-Train-your-model-from-Python)
    
    4.2. [Train your model from CLI](#4.2-Train-your-model-from-CLI)
    
5. [Models list](#5.-Models-list)

6. [NER-tags list](#6.-NER-tags-list)


# 1. Introduction to the task

**Named Entity Recognition (NER)** is the task of assigning a tag from a predefined set to each token in a given sequence. In other words, NER consists of identifying named entities in a text and classifying them into types (e.g. person name, organization, location).

The **BIO encoding scheme** is commonly used for NER. It uses three tag prefixes: B for the beginning of an entity, I for the inside of an entity, and O for non-entity tokens. For B and I tags, the second part of the tag gives the entity type (e.g. `B-PER`).

Here is an example of a tagged sequence:

| Elon | Musk | founded | Tesla| in | 2003 | . |
| --- | --- | --- | --- | --- | --- | --- |
| B-PER | I-PER | O | B-ORG | O | B-DATE | O |

Here we can see three extracted named entities: *Elon Musk* (a person's name), *Tesla* (the name of an organization) and *2003* (a date). To see more examples, try out our [Demo](https://demo.deeppavlov.ai/#/en/ner).
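
To make the BIO scheme concrete, here is a minimal sketch that groups BIO-tagged tokens back into entities. The helper name `bio_to_spans` is hypothetical and not part of DeepPavlov. Treating a stray `I-` tag as the start of a new entity is an assumption of this sketch.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None
    for token, tag in zip(tokens, tags):
        # "B-PER" -> ("B", "-", "PER"); "O" -> ("O", "", "")
        prefix, _, entity_type = tag.partition("-")
        if prefix == "B" or (prefix == "I" and entity_type != current_type):
            # A new entity starts here (a stray I- tag is treated like B-).
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], entity_type
        elif prefix == "I":
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # An "O" tag closes any open entity.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


tokens = ["Elon", "Musk", "founded", "Tesla", "in", "2003", "."]
tags = ["B-PER", "I-PER", "O", "B-ORG", "O", "B-DATE", "O"]
print(bio_to_spans(tokens, tags))
# [('Elon Musk', 'PER'), ('Tesla', 'ORG'), ('2003', 'DATE')]
```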

The set of entity types depends on the domain of your dataset. The list of tags used in DeepPavlov's models can be found in the [table](#6.-NER-tags-list).

# 2. Get started with the model

First make sure you have the DeepPavlov Library installed.
[More info about the first installation](https://deeppavlov-test.readthedocs.io/en/latest/notebooks/Get%20Started%20with%20DeepPavlov.html)

In [1]:
!pip install -q deeppavlov

Then make sure that all the required packages for the model are installed.

In [None]:
!python -m deeppavlov install ner_ontonotes_bert_torch

`ner_ontonotes_bert_torch` here is the name of the model's *config_file*. [What is a Config File?](https://deeppavlov-test.readthedocs.io/en/latest/notebooks/Config%20File.html) 

The configuration file defines the model and describes its hyperparameters. To use another model, change the name of the *config_file* here and in the cells below.
The full list of NER models with their config names can be found in the [table](#5.-Models-list).


# 3. Use the model for prediction

## 3.1 Predict using Python

After [installing](#2.-Get-started-with-the-model) the model, build it from the config and predict.

In [None]:
from deeppavlov import configs, build_model

ner_model = build_model(configs.ner.ner_ontonotes_bert_torch, download=True)

In [None]:
ner_model(['Bob Ross lived in Florida', 'Elon Musk founded Tesla'])

[[['Bob', 'Ross', 'lived', 'in', 'Florida'],
  ['Elon', 'Musk', 'founded', 'Tesla']],
 [['B-PERSON', 'I-PERSON', 'O', 'O', 'B-GPE'],
  ['B-PERSON', 'I-PERSON', 'O', 'B-ORG']]]
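
The output above is a pair of batches: the first element holds the tokens of each input sentence, the second holds the predicted tags. A short sketch of pairing them up is shown below. To keep it runnable without the model, the literal output shown above stands in for the call to `ner_model`.

```python
# The literal output shown above, standing in for `ner_model([...])`.
tokens_batch, tags_batch = [
    [["Bob", "Ross", "lived", "in", "Florida"],
     ["Elon", "Musk", "founded", "Tesla"]],
    [["B-PERSON", "I-PERSON", "O", "O", "B-GPE"],
     ["B-PERSON", "I-PERSON", "O", "B-ORG"]],
]

# Pair each token with its tag, sentence by sentence.
tagged_sentences = [
    list(zip(tokens, tags)) for tokens, tags in zip(tokens_batch, tags_batch)
]

for sentence in tagged_sentences:
    print(" ".join(f"{token}/{tag}" for token, tag in sentence))
```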

## 3.2 Predict using Python pipeline

Alternatively, you can describe and build the prediction pipeline directly in Python, without using the config file.

In [1]:
from pathlib import Path

from deeppavlov import Element, Model
from deeppavlov.core.data.simple_vocab import SimpleVocabulary
from deeppavlov.download import download_resource
from deeppavlov.models.classifiers.proba2labels import Proba2Labels
from deeppavlov.models.preprocessors.torch_transformers_preprocessor import TorchTransformersNerPreprocessor
from deeppavlov.models.torch_bert.torch_transformers_sequence_tagger import TorchTransformersSequenceTagger


transformer = "bert-base-cased"
model_path = Path('./ner_ontonotes_bert_torch/' + transformer)

download_resource(
    'http://files.deeppavlov.ai/v1/ner/ner_ontonotes_bert_torch.tar.gz',
    {'./ner_ontonotes_bert_torch'}
)

preprocessor = TorchTransformersNerPreprocessor(
    vocab_file=transformer,
    do_lower_case=False,
    max_seq_length=512,
    max_subword_length=15,
    token_masking_prob=0.0,
)
 
classes_vocab = SimpleVocabulary(
    save_path=model_path/'tag.dict',
    load_path=model_path/'tag.dict',
    pad_with_zeros=True,
    unk_token=["O"]
)

tagger = TorchTransformersSequenceTagger(
    n_tags=classes_vocab.len,
    return_probas=False,
    use_crf=True,
    attention_probs_keep_prob=0.5,
    encoder_layer_ids=[-1],
    pretrained_bert='bert-base-cased',
    save_path=model_path/'model',
    load_path=model_path/'model',
    optimizer='AdamW',
    optimizer_parameters={'lr': 2e-05, 
                          "weight_decay": 1e-06, 
                          "betas": [0.9, 0.999],
                          "eps": 1e-06},
    clip_norm=1.0,
    min_learning_rate=1e-07,
    learning_rate_drop_patience=30,
    learning_rate_drop_div=1.5,
    load_before_drop=True,
)

ner_model = Model(
    x=['x'],
    out=["x_tokens", "y_pred"],
    pipe=[
        Element(component=preprocessor, x=['x'], out=["x_tokens", "x_subword_tokens", "x_subword_tok_ids", "startofword_markers", "attention_mask"]),
        Element(component=classes_vocab, x=["y"], out=["y_ind"]),
        Element(component=tagger, x=["x_subword_tok_ids", "attention_mask", "startofword_markers"], out=["y_pred_ind"]),
        Element(component=classes_vocab, x=["y_pred_ind"], out=["y_pred"])
    ]
)

In [None]:
ner_model(['Bob Ross lived in Florida', 'Elon Musk founded Tesla'])

[[['Bob', 'Ross', 'lived', 'in', 'Florida'],
  ['Elon', 'Musk', 'founded', 'Tesla']],
 [['B-PERSON', 'I-PERSON', 'O', 'O', 'B-GPE'],
  ['B-PERSON', 'I-PERSON', 'O', 'B-ORG']]]

## 3.3 Predict using CLI

You can also get predictions in interactive mode through the CLI.

In [None]:
! python -m deeppavlov interact ner_ontonotes_bert_torch [-d]

`-d` is an optional download flag (the CLI counterpart of `download=True` in Python code). It downloads the pre-trained model along with the embeddings and all other files needed to run the model.

Or make predictions for samples from a file. If `-f` is omitted, samples are read from *stdin*.

In [None]:
! python -m deeppavlov predict ner_ontonotes_bert_torch -f <file-name>

# 4. Train the model on your data


## 4.1 Train your model from Python

### Provide your data path

To train the model on your data, you need to change the path to the training data in the *config_file*. 

Parse the *config_file* and change the path to your data from Python.

In [None]:
from deeppavlov import configs, train_model
from deeppavlov.core.commands.utils import parse_config

model_config = parse_config(configs.ner.ner_ontonotes_bert_torch)

#  dataset that the model was trained on
print(model_config['dataset_reader']['data_path'])

~/.deeppavlov/downloads/ontonotes/


Provide a *data_path* to your own dataset. 

In [7]:
# download and unzip a new example dataset
!wget http://files.deeppavlov.ai/deeppavlov_data/conll2003_v2.tar.gz
!tar -xzvf "conll2003_v2.tar.gz"

In [6]:
# provide a path to the train file
model_config["dataset_reader"]["data_path"] = "contents/train.txt"


### Train dataset format

To train the model, you need a text file with a dataset in the following format:

```
EU B-ORG
rejects O
the O
call O
of O
Germany B-LOC
to O
boycott O
lamb O
from O
Great B-LOC
Britain I-LOC
. O

China B-LOC
says O
time O
right O
for O
Taiwan B-LOC
talks O
. O
```


The source text is **tokenized** and **tagged**. Each line holds one token followed by its tag in BIO markup. Tags are separated from tokens by **whitespace**. Sentences are separated by **empty lines**.
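
As an illustration of this format, the sketch below parses such text into sentences and writes it back. It is not DeepPavlov's own dataset reader. The helper names are hypothetical, and taking the first column as the token and the last column as the tag is an assumption that also tolerates files with extra columns.

```python
from typing import List, Tuple

Sentence = Tuple[List[str], List[str]]  # (tokens, tags)


def read_tagged_sentences(text: str) -> List[Sentence]:
    """Parse 'token tag' lines; empty lines separate sentences."""
    sentences: List[Sentence] = []
    tokens: List[str] = []
    tags: List[str] = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            # An empty line closes the current sentence, if any.
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        parts = line.split()
        if len(parts) < 2:
            raise ValueError(f"line {line_no}: expected 'token tag', got {line!r}")
        tokens.append(parts[0])   # first column: the token
        tags.append(parts[-1])    # last column: the BIO tag
    if tokens:
        sentences.append((tokens, tags))
    return sentences


def format_tagged_sentences(sentences: List[Sentence]) -> str:
    """Write sentences back in the same 'token tag' format."""
    blocks = [
        "\n".join(f"{token} {tag}" for token, tag in zip(tokens, tags))
        for tokens, tags in sentences
    ]
    return "\n\n".join(blocks) + "\n"


sample = """\
EU B-ORG
rejects O

China B-LOC
says O
"""
parsed = read_tagged_sentences(sample)
print(parsed)
# [(['EU', 'rejects'], ['B-ORG', 'O']), (['China', 'says'], ['B-LOC', 'O'])]
```

Calling `format_tagged_sentences(parsed)` reproduces `sample`, which gives a quick round-trip check for a converter written against your own annotation format.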


### Train the model using new config

In [None]:
ner_model = train_model(model_config)

Use your model for prediction.

In [None]:
ner_model(['Bob Ross lived in Florida', 'Elon Musk founded Tesla'])

[[['Bob', 'Ross', 'lived', 'in', 'Florida'],
  ['Elon', 'Musk', 'founded', 'Tesla']],
 [['B-PERSON', 'I-PERSON', 'O', 'O', 'B-GPE'],
  ['B-PERSON', 'I-PERSON', 'O', 'B-ORG']]]

## 4.2 Train your model from CLI

In [None]:
! python -m deeppavlov train ner_ontonotes_bert_torch
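
The command above trains with the data path stored in the named config. One possible way to train on your own data from the CLI is to save a modified config as a JSON file and pass that file to `train`. This sketch assumes that the CLI accepts a path to a config file in place of a config name and that the parsed config contains only JSON-compatible values. The helper name and the file name `my_ner_config.json` are hypothetical, and a small dict stands in for the result of `parse_config` so the example runs without DeepPavlov installed.

```python
import copy
import json
from pathlib import Path


def save_config_with_data_path(config: dict, data_path: str, out_file: str) -> Path:
    """Copy `config`, point its dataset reader at `data_path`, and save it as JSON."""
    updated = copy.deepcopy(config)  # leave the caller's config untouched
    updated["dataset_reader"]["data_path"] = data_path
    out = Path(out_file)
    out.write_text(json.dumps(updated, indent=2), encoding="utf-8")
    return out


# Stand-in for the dict returned by parse_config(...); a real config has many more keys.
example_config = {"dataset_reader": {"data_path": "~/.deeppavlov/downloads/ontonotes/"}}

saved = save_config_with_data_path(example_config, "contents/train.txt", "my_ner_config.json")
print(saved)
```

Under the assumption above, training would then be started with `python -m deeppavlov train my_ner_config.json`.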

# 5. Models list

The table lists the NER models available in the DeepPavlov Library.

| Config name  | Dataset | Language | Model Size | F1 score |
| :--- | --- | --- | --- | ---: |
| [ner_ontonotes_bert](https://github.com/deepmipt/DeepPavlov/blob/dev/deeppavlov/configs/ner/ner_ontonotes_bert.json) | Ontonotes | En | 1.3 GB | 87.9 |
| [ner_ontonotes_bert_probas](https://github.com/deepmipt/DeepPavlov/blob/dev/deeppavlov/configs/ner/ner_ontonotes_bert_probas.json) | Ontonotes | En | ? | ? |
| [ner_ontonotes_bert_mult](https://github.com/deepmipt/DeepPavlov/blob/dev/deeppavlov/configs/ner/ner_ontonotes_bert_mult.json) | Ontonotes | Multi | 2.0 GB | 87.2 |
| [ner_rus_bert](https://github.com/deepmipt/DeepPavlov/blob/dev/deeppavlov/configs/ner/ner_rus_bert.json) | Collection3 | Ru | 2.0 GB | 97.7 |
| [ner_rus_bert_probas](https://github.com/deepmipt/DeepPavlov/blob/dev/deeppavlov/configs/ner/ner_rus_bert_probas.json) | Collection3 | Ru | ? | ? |
| [ner_rus_convers_distilrubert_2L](https://github.com/deepmipt/DeepPavlov/blob/dev/deeppavlov/configs/ner/ner_rus_convers_distilrubert_2L.json) | Collection3 | Ru | ? | ? |
| [ner_rus_convers_distilrubert_6L](https://github.com/deepmipt/DeepPavlov/blob/dev/deeppavlov/configs/ner/ner_rus_convers_distilrubert_6L.json) | Collection3 | Ru | ? | ? |


# 6. NER-tags list

The table lists the NER entity tags used in DeepPavlov's NER models.

| Tag          | Description                                            |
| ------------ | ------------------------------------------------------ |
| **PERSON**       | People including fictional                             |
| **NORP**         | Nationalities or religious or political groups         |
| **FACILITY**     | Buildings, airports, highways, bridges, etc.           |
| **ORGANIZATION** | Companies, agencies, institutions, etc.                |
| **GPE**          | Countries, cities, states                              |
| **LOCATION**     | Non-GPE locations, mountain ranges, bodies of water    |
| **PRODUCT**      | Vehicles, weapons, foods, etc. (Not services)          |
| **EVENT**        | Named hurricanes, battles, wars, sports events, etc.   |
| **WORK OF ART**  | Titles of books, songs, etc.                           |
| **LAW**          | Named documents made into laws                         |
| **LANGUAGE**     | Any named language                                     |
| **DATE**         | Absolute or relative dates or periods                  |
| **TIME**         | Times smaller than a day                               |
| **PERCENT**      | Percentage (including “%”)                             |
| **MONEY**        | Monetary values, including unit                        |
| **QUANTITY**     | Measurements, as of weight or distance                 |
| **ORDINAL**      | “first”, “second”                                      |
| **CARDINAL**     | Numerals that do not fall under another type           |