#### Named Entity Recognition (NER)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/nastyachizhikova/doc_test/blob/main/source/notebooks/NER.ipynb)

The Named Entity Recognition (NER) task can be formulated as follows:

*Given a sequence of tokens (words and possibly punctuation marks), assign each token in the sequence a tag from a predefined set of tags.*

Here is an example of a tagged sequence:

![NER-tagged sentences](https://github.com/nastyachizhikova/doc_test/blob/main/source/notebooks/images/NER_demo.jpg?raw=true)

The list of possible types of NER entities may vary depending on your dataset domain. The list of tags used in DeepPavlov's models can be found in the [table](#5.-NER-tags-list).

# Table of contents 

1. [Get Started](#1.-Get-Started)

2. [Use the model for prediction](#2.-Use-the-model-for-prediction)

    2.1. [Predict using Python](#2.1-Predict-using-Python)
    
    2.2. [Predict using CLI](#2.2-Predict-using-CLI)

3. [Train the model on your data](#3.-Train-the-model-on-your-data)
    
    3.1. [Train your model from Python](#3.1-Train-your-model-from-Python)
    
    3.2. [Train your model from CLI](#3.2-Train-your-model-from-CLI)
    
4. [Models list](#4.-Models-list)

5. [NER-tags list](#5.-NER-tags-list)


# 1. Get Started 

First make sure you have the DeepPavlov library installed.
[More info about the first installation](https://deeppavlov-test.readthedocs.io/en/latest/notebooks/Get%20Started%20with%20DeepPavlov.html)

In [1]:
!pip install -q deeppavlov

Before using the model, make sure that all of its required packages are installed by running the following command.

In [None]:
!python -m deeppavlov install ner_rus_bert_torch

`ner_rus_bert_torch` here is the name of the model's *config_file*. [What is a Config File?](https://deeppavlov-test.readthedocs.io/en/latest/notebooks/Config%20File.html) 

You can change the NER model you are using by changing the name of the *config_file*.
The full list of NER models with their config names can be found in the [table](#4.-Models-list).


# 2. Use the model for prediction


## 2.1 Predict using Python

Build the model from the config and predict.

**input**: list of sentences (strings)

**output**: [list of token lists, list of the corresponding NER-tag lists], with one inner list per input sentence

In [None]:
from deeppavlov import configs, build_model

ner_model = build_model(configs.ner.ner_ontonotes_bert_torch, download=True)

In [None]:
ner_model(['Bob Ross lived in Florida', 'Elon Musk founded Tesla'])

[[['Bob', 'Ross', 'lived', 'in', 'Florida'],
  ['Elon', 'Musk', 'founded', 'Tesla']],
 [['B-PERSON', 'I-PERSON', 'O', 'O', 'B-GPE'],
  ['B-PERSON', 'I-PERSON', 'O', 'B-ORG']]]
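
The output keeps tokens and tags in two parallel lists, so a small post-processing step is usually needed to get whole entities. The sketch below groups BIO-tagged tokens into `(text, type)` pairs. `extract_entities` is a hypothetical helper written for this example, not part of DeepPavlov, and joining tokens with a single space is an assumption about how the entity text should be rebuilt.

```python
def extract_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        # "B-X" splits into ("B", "-", "X"); a bare "O" gives ("O", "", "").
        prefix, _, label = tag.partition("-")
        if prefix == "I" and label == current_type:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
            continue
        # Any other tag closes the entity collected so far.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))
        if prefix in ("B", "I"):
            # "B-X" opens a new entity; a stray "I-X" is treated the same way.
            current_tokens, current_type = [token], label
        else:
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Output copied from the prediction cell above.
tokens_batch = [['Bob', 'Ross', 'lived', 'in', 'Florida'],
                ['Elon', 'Musk', 'founded', 'Tesla']]
tags_batch = [['B-PERSON', 'I-PERSON', 'O', 'O', 'B-GPE'],
              ['B-PERSON', 'I-PERSON', 'O', 'B-ORG']]

for tokens, tags in zip(tokens_batch, tags_batch):
    print(extract_entities(tokens, tags))
```

With a live model, the same loop can be run over `ner_model(sentences)` by unpacking its two returned lists.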


## 2.2 Predict using CLI

In [None]:
$ python -m deeppavlov interact ner_ontonotes_bert_torch [-d]

`-d` is an optional flag that downloads the pre-trained model along with the embeddings and all other files needed to run it.

# 3. Train the model on your data


## 3.1 Train your model from Python

### Provide your data path

To train the model on your data, you need to change the path to the training data in the *config_file*. 

Parse the *config_file* and change the path to your data from Python.

In [None]:
from deeppavlov import configs, train_model
from deeppavlov.core.commands.utils import parse_config

model_config = parse_config(configs.ner.ner_ontonotes_bert_torch)

# path to the dataset that the model was trained on
print(model_config['dataset_reader']['data_path'])

~/.deeppavlov/downloads/ontonotes/


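Provide a *data_path* to your own dataset. The path below is a placeholder; for the CoNLL-style reader used by this config, it is expected to point to a directory containing `train.txt`, `valid.txt` and `test.txt`.

In [5]:
model_config["dataset_reader"]["data_path"] = "/content/my_ner_data/"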
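
A sketch of preparing such a dataset location from Python may help here. It assumes that the config's dataset reader looks for `train.txt`, `valid.txt` and `test.txt` inside `data_path`, and it writes them in the token-and-tag layout described under "Training dataset format". The directory name `my_ner_data`, the helper `write_split` and the toy sentences are all hypothetical.

```python
from pathlib import Path


def write_split(path, sentences):
    """Write (tokens, tags) pairs as one 'token tag' line per token."""
    lines = []
    for tokens, tags in sentences:
        lines.extend(f"{token} {tag}" for token, tag in zip(tokens, tags))
        lines.append("")  # an empty line separates sentences
    Path(path).write_text("\n".join(lines), encoding="utf-8")


# Toy data for illustration only.
sentences = [
    (["EU", "rejects"], ["B-ORG", "O"]),
    (["China", "says"], ["B-LOC", "O"]),
]

data_dir = Path("my_ner_data")  # hypothetical directory name
data_dir.mkdir(exist_ok=True)

# Real datasets need disjoint train/valid/test splits;
# the same toy sentences are reused here only to keep the sketch short.
for name in ("train.txt", "valid.txt", "test.txt"):
    write_split(data_dir / name, sentences)

print(sorted(p.name for p in data_dir.glob("*.txt")))
```

The resulting directory path is what would then be assigned to `model_config["dataset_reader"]["data_path"]`.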


### Training dataset format

To train the neural network, you need to have a dataset in the following format:

```
EU B-ORG
rejects O
the O
call O
of O
Germany B-LOC
to O
boycott O
lamb O
from O
Great B-LOC
Britain I-LOC
. O

China B-LOC
says O
time O
right O
for O
Taiwan B-LOC
talks O
. O
```


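The source text is **tokenized** and **tagged**, with one token per line. Each token is followed by its tag in BIO markup: `B-` marks the first token of an entity, `I-` marks a token that continues it, and `O` marks a token outside any entity. Tags are separated from tokens by **whitespace**. Sentences are separated by **empty lines**.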
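
To check a file against this format before training, it can be read back into sentences. The following is a minimal sketch of such a parser, not DeepPavlov's own dataset reader; `parse_tagged_text` is a hypothetical name, and taking the last whitespace-separated field as the tag is an assumption that also tolerates extra columns between the token and the tag.

```python
def parse_tagged_text(text):
    """Split token-per-line text into a list of (tokens, tags) sentences."""
    sentences, tokens, tags = [], [], []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            # An empty line ends the current sentence.
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        tokens.append(parts[0])   # first field: the token
        tags.append(parts[-1])    # last field: its BIO tag
    if tokens:
        # The last sentence may not be followed by an empty line.
        sentences.append((tokens, tags))
    return sentences


sample = """China B-LOC
says O
time O

Great B-LOC
Britain I-LOC
. O
"""

for tokens, tags in parse_tagged_text(sample):
    print(list(zip(tokens, tags)))
```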


### Train the model using new config

In [None]:
ner_model = train_model(model_config)

Use your model for prediction.

In [None]:
ner_model(['Bob Ross lived in Florida', 'Elon Musk founded Tesla'])

[[['Bob', 'Ross', 'lived', 'in', 'Florida'],
  ['Elon', 'Musk', 'founded', 'Tesla']],
 [['B-PERSON', 'I-PERSON', 'O', 'O', 'B-GPE'],
  ['B-PERSON', 'I-PERSON', 'O', 'B-ORG']]]


## 3.2 Train your model from CLI

In [None]:
$ python -m deeppavlov train ner_ontonotes_bert_torch

# 4. Models list

| Config name  | Dataset | Language | Model Size | F1 score |
| :--- | --- | --- | --- | ---: |
| [ner_rus_bert_torch](https://github.com/deepmipt/DeepPavlov/blob/0.17.2/deeppavlov/configs/ner/ner_rus_bert_torch.json)| Collection3   | Ru | 2.0 GB | 97.7 |
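
The F1 score column is easier to interpret with the metric spelled out. The sketch below assumes an entity-level F1 in percent, where a predicted entity counts as correct only if both its boundaries and its type match the gold annotation. Whether the library computes the reported score in exactly this way is an assumption, and `spans` and `entity_f1` are hypothetical helper names.

```python
def spans(tags):
    """Return the set of (start, end, type) entity spans in one BIO sequence."""
    found, start, kind = set(), None, None
    # The trailing "O" is a sentinel that closes an entity ending the sequence.
    for i, tag in enumerate(list(tags) + ["O"]):
        prefix, _, label = tag.partition("-")
        if prefix == "I" and label == kind:
            continue  # still inside the open entity
        if kind is not None:
            found.add((start, i, kind))
        start, kind = (i, label) if prefix in ("B", "I") else (None, None)
    return found


def entity_f1(gold_batch, pred_batch):
    """Entity-level F1 in percent over a batch of tag sequences."""
    true_pos = n_gold = n_pred = 0
    for gold, pred in zip(gold_batch, pred_batch):
        gold_spans, pred_spans = spans(gold), spans(pred)
        true_pos += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    if true_pos == 0:
        return 0.0
    precision, recall = true_pos / n_pred, true_pos / n_gold
    return 100 * 2 * precision * recall / (precision + recall)


gold = [['B-PERSON', 'I-PERSON', 'O', 'O', 'B-GPE']]
pred = [['B-PERSON', 'I-PERSON', 'O', 'O', 'O']]  # the GPE entity is missed
print(round(entity_f1(gold, pred), 1))
```

In this toy case one of the two gold entities is found and nothing spurious is predicted, so precision is 1.0, recall is 0.5, and the F1 is about 66.7.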

# 5. NER-tags list

| Tag          | Description                                            |
| ------------ | ------------------------------------------------------ |
| **PERSON**       | People including fictional                             |
| **NORP**         | Nationalities or religious or political groups         |
| **FACILITY**     | Buildings, airports, highways, bridges, etc.           |
| **ORGANIZATION** | Companies, agencies, institutions, etc.                |
| **GPE**          | Countries, cities, states                              |
| **LOCATION**     | Non-GPE locations, mountain ranges, bodies of water    |
| **PRODUCT**      | Vehicles, weapons, foods, etc. (Not services)          |
| **EVENT**        | Named hurricanes, battles, wars, sports events, etc.   |
| **WORK OF ART**  | Titles of books, songs, etc.                           |
| **LAW**          | Named documents made into laws                         |
| **LANGUAGE**     | Any named language                                     |
| **DATE**         | Absolute or relative dates or periods                  |
| **TIME**         | Times smaller than a day                               |
| **PERCENT**      | Percentage (including “%”)                             |
| **MONEY**        | Monetary values, including unit                        |
| **QUANTITY**     | Measurements, as of weight or distance                 |
| **ORDINAL**      | “first”, “second”                                      |
| **CARDINAL**     | Numerals that do not fall under another type           |