# Named Entity Recognition (NER)

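Named Entity Recognition (NER) is the task of finding spans of text that mention entities such as people, organizations, and locations, and assigning each span a type.
This page lists the currently available NER configs and shows how to use them from Python and from the command line.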

## Table of contents
1. [Models list](#1.-Models-list)

2. [Using the model from Python](#2.-Using-the-model-from-Python)

    2.1. [Using the pre-trained model for prediction](#2.1-Using-the-pre-trained-model-for-prediction)

    2.2. [Train the model on your data](#2.2-Train-the-model-on-your-data)

3. [Using the model from the command line](#3.-Using-the-model-from-the-command-line)

    3.1. [Using the pre-trained model for prediction](#3.1-Using-the-pre-trained-model-for-prediction)

    3.2. [Train the model on your data](#3.2-Train-the-model-on-your-data)


# 1. Models list

| Model    | Dataset | Language |
| :--- | :--- | ---: |
| [ner_rus_bert_torch](ner/ner_rus_bert_torch.json)    | Collection3   | Ru |


# 2. Using the model from Python 

First, make sure the DeepPavlov library is installed.
\#TODO: link to installation page

In [1]:
!pip install -q deeppavlov



## 2.1 Using the pre-trained model for prediction

Build the model using its *config_file* name (in this case, *ner_ontonotes_bert_torch*). Passing `download=True` also downloads the pre-trained model files.

What is a config file? # link to the tutorial

You can change the NER model you are using by changing the name of the *config_file*.
The full list of NER models with their config names can be found **here**.

In [None]:
from deeppavlov import configs, build_model

ner_model = build_model(configs.ner.ner_ontonotes_bert_torch, download=True)


### Predict

**input**: list of sentences (strings)

**output_format**: [list of token lists, list of the corresponding NER-tag lists], with one inner list per input sentence

In [5]:
ner_model(['Bob Ross lived in Florida', 'Elon Musk founded Tesla'])

[[['Bob', 'Ross', 'lived', 'in', 'Florida'],
  ['Elon', 'Musk', 'founded', 'Tesla']],
 [['B-PERSON', 'I-PERSON', 'O', 'O', 'B-GPE'],
  ['B-PERSON', 'I-PERSON', 'O', 'B-ORG']]]
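
The raw output keeps tokens and tags in parallel lists, which is awkward to read for longer sentences. As a minimal sketch (the helper `extract_entities` is hypothetical and not part of DeepPavlov), the BIO tags can be merged into whole entities in plain Python:

```python
def extract_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != current_type):
            # A new entity starts here (a stray I- tag is treated like B-).
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-"):
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # An "O" tag closes any open entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Model output copied from the prediction cell above.
tokens_batch = [['Bob', 'Ross', 'lived', 'in', 'Florida'],
                ['Elon', 'Musk', 'founded', 'Tesla']]
tags_batch = [['B-PERSON', 'I-PERSON', 'O', 'O', 'B-GPE'],
              ['B-PERSON', 'I-PERSON', 'O', 'B-ORG']]

for tokens, tags in zip(tokens_batch, tags_batch):
    print(extract_entities(tokens, tags))
```

With a live model, the same loop could be fed by unpacking the result of `ner_model(...)` into `tokens_batch` and `tags_batch`.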


## 2.2 Train the model on your data


### Provide your data path

To train the model on your data, you need to change the path to the training data in the *config_file*. 

You can do that manually by editing the JSON config file.
Alternatively, you can parse the *config_file* and change the path to your data from Python.

In [6]:
from deeppavlov import configs, train_model
from deeppavlov.core.commands.utils import parse_config

model_config = parse_config(configs.ner.ner_ontonotes_bert_torch)

# path to the dataset that the model was trained on
print(model_config['dataset_reader']['data_path'])

~/.deeppavlov/downloads/ontonotes/


You can provide a *data_path* to your dataset...

In [7]:
model_config["dataset_reader"]["data_path"] = "/content/my_ner_data/"  # hypothetical path to your tagged dataset
# model_config["dataset_reader"]["data_url"] = None

...or give a link to your data in the *data_url* parameter.

In [8]:
model_config["dataset_reader"]["data_path"] = ''
# model_config["dataset_reader"]["data_url"] = "http://files.deeppavlov.ai/faq/school/faq_school_en.csv"
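
For the manual route, editing the JSON config file can also be scripted with the standard library. The sketch below is only an illustration: it writes a tiny stand-in config to a temporary directory (a real config has many more keys), and both the file name `my_ner_config.json` and the new data path are hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Stand-in config with only the keys used in this tutorial.
config_path = Path(tempfile.mkdtemp()) / "my_ner_config.json"  # hypothetical file name
config_path.write_text(
    json.dumps({"dataset_reader": {"data_path": "~/.deeppavlov/downloads/ontonotes/"}}),
    encoding="utf-8",
)

# Load the config, change the data path, and save it back to disk.
config = json.loads(config_path.read_text(encoding="utf-8"))
config["dataset_reader"]["data_path"] = "/path/to/my_ner_data/"  # hypothetical path
config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")

print(json.loads(config_path.read_text(encoding="utf-8"))["dataset_reader"]["data_path"])
```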


### Training dataset format

To train the neural network, you need to have a dataset in the following format:

```
EU B-ORG
rejects O
the O
call O
of O
Germany B-LOC
to O
boycott O
lamb O
from O
Great B-LOC
Britain I-LOC
. O

China B-LOC
says O
time O
right O
for O
Taiwan B-LOC
talks O
. O
```


The source text is **tokenized** and **tagged**. Each token has a tag in BIO markup: `B-` begins an entity, `I-` continues it, and `O` marks a token outside any entity. Tags are separated from tokens by **whitespace**. Sentences are separated by **empty lines**.
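
To make the format concrete, here is a minimal sketch of a reader for it. The function `parse_bio_text` is a hypothetical helper, not DeepPavlov's own dataset reader, and it assumes exactly two whitespace-separated columns per line.

```python
def parse_bio_text(text):
    """Parse 'token tag' lines into a list of (tokens, tags) sentences."""
    sentences, tokens, tags = [], [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # An empty line ends the current sentence.
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        token, tag = line.split()  # raises ValueError on malformed lines
        tokens.append(token)
        tags.append(tag)
    if tokens:
        sentences.append((tokens, tags))
    return sentences


sample = """EU B-ORG
rejects O

China B-LOC
says O
"""
parsed = parse_bio_text(sample)
print(parsed)
```

Running a check like this over your files before training is a cheap way to catch lines with a missing tag or extra columns.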

The dataset is a text file or a set of text files. It must be split into three parts: train, test, and validation.
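
If your tagged data is not yet split, the three parts can be produced with a few lines of Python. This is a sketch under stated assumptions: the 80/10/10 ratio, the fixed seed, and the output file names `train.txt`, `valid.txt`, and `test.txt` are choices made for illustration, so check them against the dataset reader used by your config.

```python
import random
import tempfile
from pathlib import Path


def split_sentences(sentences, valid_share=0.1, test_share=0.1, seed=42):
    """Shuffle sentences and split them into train/valid/test lists."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_share)
    n_test = int(len(shuffled) * test_share)
    return {
        "valid": shuffled[:n_valid],
        "test": shuffled[n_valid:n_valid + n_test],
        "train": shuffled[n_valid + n_test:],
    }


def write_split(split, out_dir):
    """Write each part as '<name>.txt', one 'token tag' pair per line."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, part in split.items():
        # Sentences are separated by an empty line, as in the format above.
        blocks = ["\n".join(f"{tok} {tag}" for tok, tag in zip(tokens, tags))
                  for tokens, tags in part]
        (out_dir / f"{name}.txt").write_text("\n\n".join(blocks) + "\n", encoding="utf-8")


# Ten toy sentences, each a (tokens, tags) pair.
sentences = [([f"word{i}", "."], ["O", "O"]) for i in range(10)]
split = split_sentences(sentences)
out_dir = tempfile.mkdtemp()
write_split(split, out_dir)
print({name: len(part) for name, part in split.items()})
```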


### Train the model

In [None]:
ner_model = train_model(model_config)

Use the trained model for prediction:

In [None]:
ner_model(['Bob Ross lived in Florida', 'Elon Musk founded Tesla'])

[[['Bob', 'Ross', 'lived', 'in', 'Florida'],
  ['Elon', 'Musk', 'founded', 'Tesla']],
 [['B-PERSON', 'I-PERSON', 'O', 'O', 'B-GPE'],
  ['B-PERSON', 'I-PERSON', 'O', 'B-ORG']]]


# 3. Using the model from the command line

Before using the model, make sure that all required packages are installed by running:

In [None]:
!python -m deeppavlov install ner_ontonotes_bert_torch


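## 3.1 Using the pre-trained model for prediction

The optional `-d` flag downloads the pre-trained model files before the interactive session starts.

In [None]:
!python -m deeppavlov interact ner_ontonotes_bert_torch [-d]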


## 3.2 Train the model on your data

In [None]:
!python -m deeppavlov train ner_ontonotes_bert_torch