# DeepPavlov: Transfer Learning with BERT

Today we will cover the following tasks:
* classification
* tagging (Named Entity Recognition)
* question answering (Stanford Question Answering Dataset)

and zero-shot transfer from English to 103 other languages.

## BERT input representation
Text preprocessing for BERT splits the text into subtokens (WordPieces). BERT then represents each subtoken internally as the sum of three vectors:
* subtoken embedding
* segment embedding
* position embedding

<img src="https://github.com/deepmipt/dp_tutorials/blob/master/img/BERT_input.png?raw=1" width="75%" />
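
The sum of three vectors can be sketched in plain Python. This is a toy illustration, not BERT's actual code: the table sizes, the random initialisation and the example ids are assumptions made for the example, and a real model learns these tables and applies further steps (such as normalisation) after the sum.

```python
import random

random.seed(0)

# Toy sizes chosen for illustration only.
VOCAB_SIZE, NUM_SEGMENTS, MAX_POSITIONS, HIDDEN_SIZE = 10, 2, 16, 4

def random_table(rows, cols):
    """Build a rows x cols lookup table of random floats (learned in a real model)."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

token_table = random_table(VOCAB_SIZE, HIDDEN_SIZE)
segment_table = random_table(NUM_SEGMENTS, HIDDEN_SIZE)
position_table = random_table(MAX_POSITIONS, HIDDEN_SIZE)

def embed(token_ids, segment_ids):
    """Return one vector per subtoken: token + segment + position embedding."""
    vectors = []
    for position, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        vectors.append([
            t + s + p
            for t, s, p in zip(token_table[tok], segment_table[seg], position_table[position])
        ])
    return vectors

# Hypothetical ids for "[CLS] a b [SEP] c [SEP]": the first segment is 0, the second is 1.
token_ids = [0, 5, 6, 1, 7, 1]
segment_ids = [0, 0, 0, 0, 1, 1]
vectors = embed(token_ids, segment_ids)
print(len(vectors), len(vectors[0]))
```

Note that the two `[SEP]` subtokens share the same subtoken embedding but end up with different vectors, because their segment and position embeddings differ.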

## BERT for text classification
To use BERT for a text classification task, we only need to train one additional dense layer on top of the output of the last BERT Transformer layer for the special `[CLS]` token.

<img src="https://github.com/deepmipt/dp_tutorials/blob/master/img/BERT_classification.png?raw=1" width="75%" />
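
A minimal sketch of such a classification head, assuming a softmax over the dense layer's outputs. The 4-dimensional `[CLS]` vector, the weights and the bias are made-up numbers for illustration; in a real model the vector comes from BERT and the weights are learned.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(cls_vector, weights, bias):
    """Apply one dense layer to the [CLS] vector and return class probabilities."""
    logits = [
        sum(w * h for w, h in zip(row, cls_vector)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)

# Hypothetical [CLS] output and a two-class head (one weight row per class).
cls_vector = [0.5, -1.0, 0.25, 2.0]
weights = [[0.1, 0.2, 0.3, 0.4],
           [-0.4, 0.3, -0.2, 0.1]]
bias = [0.0, 0.1]
probs = classify(cls_vector, weights, bias)
print(probs)
```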

Install the DeepPavlov library:

In [None]:
! pip install deeppavlov

Install the requirements for the BERT-based classification model trained on the Kaggle [Detecting Insults in Social Commentary](https://www.kaggle.com/c/detecting-insults-in-social-commentary) dataset:

In [None]:
! python -m deeppavlov install insults_kaggle_bert

Download the pre-trained model and interact with it from the CLI:


In [None]:
! python -m deeppavlov interact -d insults_kaggle_bert

Interact with the text classification model through the DeepPavlov Python API:

In [None]:
from deeppavlov import build_model, configs

model = build_model(configs.classifiers.insults_kaggle_bert, download=False)  # set download=True if the model has not been downloaded yet

In [None]:
model(['hey, how are you?', 'You are so stupid!'])

### Data format for classification

Let's look at the training data for the insults classification model. The data path is given in the `dataset_reader` section of the model configuration file.

In [None]:
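import json
from pprint import pprint

# Load the model configuration (a JSON file) into a dict; the file is closed afterwards.
with open(configs.classifiers.insults_kaggle_bert) as config_file:
    model_config = json.load(config_file)

pprint(model_config['dataset_reader'])
pprint(model_config['metadata']['variables'])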

There are three .csv files:

In [None]:
! ls ~/.deeppavlov/downloads/insults_data/

In [None]:
! head ~/.deeppavlov/downloads/insults_data/train.csv

If you want to train the model on your own data, create a configuration file, set `data_path` to the folder containing train.csv, valid.csv and test.csv, and change `MODEL_PATH` to the location where the trained model should be saved. Details are in the [documentation](http://docs.deeppavlov.ai/en/master/features/models/classifiers.html#how-to-train-on-other-datasets).
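
One way to make these two changes is to edit a copy of the loaded configuration dict. The sketch below is only an illustration: `customize_config` is a hypothetical helper, and `toy_config` is a stand-in with placeholder values rather than the real configuration file. Only the two keys named above are taken from the text.

```python
import copy

def customize_config(config, data_path, model_path):
    """Return a copy of a config dict pointing at custom data and a custom save path."""
    new_config = copy.deepcopy(config)  # leave the original config untouched
    new_config["dataset_reader"]["data_path"] = data_path
    new_config["metadata"]["variables"]["MODEL_PATH"] = model_path
    return new_config

# Toy stand-in for the dict loaded from the real configuration file (placeholder values).
toy_config = {
    "dataset_reader": {"data_path": "path/to/original/data"},
    "metadata": {"variables": {"MODEL_PATH": "path/to/original/model"}},
}

my_config = customize_config(toy_config, "./my_data", "./my_model")
print(my_config["dataset_reader"]["data_path"])
print(my_config["metadata"]["variables"]["MODEL_PATH"])
```

Applied to the real `model_config` dict, the result could be saved as a new JSON file or passed on for training.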

Train the model with the CLI:
```
! python -m deeppavlov train config_name
```
or in Python:
```
from deeppavlov import train_model
model = train_model(model_config)
```

## BERT for tagging (Named Entity Recognition)

BERT can be used for tagging tasks such as Named Entity Recognition and Part of Speech tagging.
We train only one dense layer on top of the output of the last BERT Transformer layer for each token. Optionally, a CRF layer can be added on top of the dense layer, as in the widely used BiLSTM + CRF tagging architecture.

Named Entity Recognition:

For example, we want to extract persons' and organizations' names from the text. Then for the input text:

    Yan Goodfellow works for Google Brain

a NER model needs to provide the following sequence of tags:

    B-PER I-PER    O     O   B-ORG  I-ORG

The *B-* and *I-* prefixes mark the beginning and the inside of an entity, while *O* marks tokens outside any entity. Markup with this prefix scheme is called *BIO markup*. It makes it possible to distinguish consecutive entities of the same type.
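
To make the scheme concrete, here is a small sketch that groups BIO tags back into entities. `bio_to_spans` is a hypothetical helper written for this tutorial, not part of DeepPavlov, and its lenient handling of a stray *I-* tag is a design choice of the sketch.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) pairs."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-")
        continues = tag.startswith("I-") and current_type == tag[2:]
        if current_tokens and not continues:
            # The running entity ends at a B- tag, an O tag, or a type change.
            spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        if starts_new or (tag.startswith("I-") and not current_tokens):
            # An I- tag with no open entity is treated leniently as a start.
            current_type, current_tokens = tag[2:], [token]
        elif continues:
            current_tokens.append(token)
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans

tokens = ["Yan", "Goodfellow", "works", "for", "Google", "Brain"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG"]
print(bio_to_spans(tokens, tags))

# Two adjacent entities of the same type are kept apart by the second B- tag.
print(bio_to_spans(["Anna", "Boris"], ["B-PER", "B-PER"]))
```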

Here is how input is preprocessed for tagging:

<img src="https://github.com/deepmipt/dp_tutorials/blob/master/img/BERT_NER.png?raw=1" width="75%" />
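
Because BERT works on subtokens while tags are given per word, the two have to be aligned. The sketch below shows one common approach, under the assumption that only the first subtoken of each word receives the word's tag. The toy vocabulary, the greedy `wordpiece` splitter and the `X` placeholder for ignored positions are all hypothetical and only approximate what a real BERT tokenizer and tagger do.

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first split of one word into subtokens."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            # Non-initial pieces carry the "##" continuation prefix.
            candidate = word[start:end] if start == 0 else "##" + word[start:end]
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no vocabulary entry matches this word
        pieces.append(piece)
        start = end
    return pieces

def align_tags(words, tags, vocab):
    """Tokenize words and mark the first subtoken of each word as the one to tag."""
    subtokens, mask, subtoken_tags = ["[CLS]"], [0], ["X"]
    for word, tag in zip(words, tags):
        pieces = wordpiece(word, vocab)
        subtokens.extend(pieces)
        mask.extend([1] + [0] * (len(pieces) - 1))
        subtoken_tags.extend([tag] + ["X"] * (len(pieces) - 1))
    subtokens.append("[SEP]")
    mask.append(0)
    subtoken_tags.append("X")
    return subtokens, mask, subtoken_tags

# Hypothetical toy vocabulary; a real BERT vocabulary is far larger.
vocab = {"Yan", "Good", "##fellow", "works", "for", "Google", "Brain"}
words = ["Yan", "Goodfellow", "works", "for", "Google", "Brain"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG"]
subtokens, mask, subtoken_tags = align_tags(words, tags, vocab)
print(subtokens)
print(mask)
print(subtoken_tags)
```

The mask has exactly one `1` per input word, so the number of predicted tags matches the number of words regardless of how the words are split.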

In [None]:
! python -m deeppavlov interact ner_ontonotes_bert -d

Data for the Named Entity Recognition task is usually stored in CoNLL files.
A typical CoNLL file with NER data contains lines with token (word or punctuation symbol) and tag pairs, separated by whitespace. In many cases additional information, such as POS tags, is included between the token and the tag. Documents are separated by lines **starting** with the **-DOCSTART-** token, and sentences are separated by an empty line. Example:

    -DOCSTART- -X- -X- O

    EU NNP B-NP B-ORG
    rejects VBZ B-VP O
    German JJ B-NP B-MISC
    call NN I-NP O
    to TO B-VP O
    boycott VB I-VP O
    British JJ B-NP B-MISC
    lamb NN I-NP O
    . . O O

    Peter NNP B-NP B-PER
    Blackburn NNP I-NP I-PER
    
    
If you want to train a model on your own data, you can convert the data to this CoNLL format or implement your own `dataset_reader`. As with classification, the model can be trained with the CLI:
```
! python -m deeppavlov train config_name
```
or in Python:
```
from deeppavlov import train_model
model = train_model(model_config)
```
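
For reference, the CoNLL layout described above can be read with a few lines of Python. This is a minimal sketch, not DeepPavlov's own `dataset_reader`: `parse_conll` is a hypothetical helper that assumes the token is the first column and the NER tag is the last one.

```python
def parse_conll(text):
    """Parse CoNLL-style text into a list of sentences of (token, tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            # Blank lines and document markers both end the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        # The token is the first column and the NER tag is the last one;
        # anything in between (POS tags, chunk tags) is ignored here.
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences

sample = """-DOCSTART- -X- -X- O

EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
"""
sentences = parse_conll(sample)
for sentence in sentences:
    print(sentence)
```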

## BERT for Question Answering (Stanford Question Answering Dataset)

BERT can also be used for extractive Question Answering. For example, given the context:
```markdown
In meteorology, precipitation is any product of the condensation of atmospheric water vapor that falls under gravity. The main forms of precipitation include drizzle, rain, sleet, snow, graupel and hail… Precipitation forms as smaller droplets coalesce via collision with other rain drops or ice crystals **within a cloud**. Short, intense periods of rain in scattered locations are called “showers”.
```
and the question:
```
Where do water droplets collide with ice crystals to form precipitation?
```
the answer (**within a cloud**, highlighted in the context) is always a span of the context.

To solve this task with BERT, all we need is to train two dense layers that predict the answer start and answer end positions:

<img src="https://github.com/deepmipt/dp_tutorials/blob/master/img/BERT_QA.png?raw=1" width="50%" />
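
Given per-token start and end scores, an answer span still has to be chosen. The sketch below picks the pair with the highest summed score subject to start <= end. The scores are made up for illustration, `best_span` and its `max_length` limit are hypothetical, and the decoding used inside the actual model may differ.

```python
def best_span(start_scores, end_scores, max_length=30):
    """Pick the (start, end) pair with the highest summed score, with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for start, s_score in enumerate(start_scores):
        # Only consider ends at or after the start, within the length limit.
        last = min(len(end_scores), start + max_length)
        for end in range(start, last):
            score = s_score + end_scores[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best, best_score

# Made-up scores, as if produced by the two dense layers for six tokens.
tokens = ["ice", "crystals", "within", "a", "cloud", "."]
start_scores = [0.1, 0.2, 3.0, 0.1, 0.3, 0.0]
end_scores = [0.0, 2.5, 0.2, 0.1, 2.8, 0.1]

(start, end), score = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])
print(start, end, answer)
```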

Download the pre-trained model and interact with it:

In [None]:
from deeppavlov import build_model, configs

model = build_model(configs.squad.squad_bert, download=True)

In [None]:
model(['In meteorology, precipitation is any product of the condensation of atmospheric water vapor that falls under gravity. The main forms of precipitation include drizzle, rain, sleet, snow, graupel and hail… Precipitation forms as smaller droplets coalesce via collision with other rain drops or ice crystals within a cloud. Short, intense periods of rain in scattered locations are called “showers”.'], 
      ['Where do water droplets collide with ice crystals to form precipitation?'])

The model returns the answer, its position in the context (in characters), and a confidence score.

To train a model on your own data, put the data into JSON files in SQuAD format: https://rajpurkar.github.io/SQuAD-explorer/

These JSON files contain paragraphs, questions and answers.
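
As an illustration, a single example in SQuAD v1.1 layout can be built like this. The title, the `id` value and the shortened context are made up for the sketch; the nesting of `data`, `paragraphs`, `qas` and `answers` follows the published SQuAD files.

```python
import json

context = ("Precipitation forms as smaller droplets coalesce via collision "
           "with other rain drops or ice crystals within a cloud.")
answer_text = "within a cloud"

squad_data = {
    "version": "1.1",
    "data": [{
        "title": "Precipitation",
        "paragraphs": [{
            "context": context,
            "qas": [{
                "id": "example-0001",  # hypothetical identifier
                "question": "Where do water droplets collide with ice crystals to form precipitation?",
                "answers": [{
                    "text": answer_text,
                    # Character offset of the answer inside the context.
                    "answer_start": context.index(answer_text),
                }],
            }],
        }],
    }],
}

serialized = json.dumps(squad_data, ensure_ascii=False, indent=2)
print(serialized)
```

The `answer_start` offset must point at the exact position of `text` inside `context`, which is why it is computed here rather than typed by hand.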


## Zero-shot Transfer from English to 103 languages

BERT was originally trained only for English, but a multilingual model covering English and 103 other languages was released later. This makes it possible to train a model on one language and then apply it to the other 103 languages. This technique is called zero-shot transfer because no training data in the target language is used.

<img src="https://github.com/deepmipt/dp_tutorials/blob/master/img/BERT_multilingual.png?raw=1" width="75%" />

We will cover two examples:
* NER transfer from Ontonotes dataset (English -> 103)
* QA transfer from SQuAD dataset (English -> 103)

These models are also available at [demo.ipavlov.ai](https://demo.ipavlov.ai/#multiLang)

### Zero-shot multilingual NER

Download the model and interact with it:

In [None]:
from deeppavlov import build_model, configs

model = build_model(configs.ner.ner_ontonotes_bert_mult, download=True)

In [None]:
model(['Curling World Championship will be held in Antananarivo'])

In [None]:
model(['Чемпионат мира по кёрлингу пройдёт в Антананариву']) # Чемпионат мира по кёрлингу == Curling World Championship

### Zero-shot multilingual QA
Get the configuration file, download the model and interact with it:


In [None]:
! wget https://raw.githubusercontent.com/deepmipt/DeepPavlov/squad_multilingual_configs/deeppavlov/configs/squad/squad_bert_multilingual_freezed_emb.json

In [None]:
from deeppavlov import build_model, configs

model = build_model('./squad_bert_multilingual_freezed_emb.json', download=True)

In [None]:
# Adjacent string literals are joined, so no indentation whitespace leaks into the context.
context = (
    'Su área de distribución comprende casi toda Sudamérica al este de los Andes en las '
    'cuencas del río Orinoco, del Amazonas y del Río de la Plata; cubriendo desde el este '
    'de Venezuela y la Guyana hasta Uruguay y el norte y centro de Argentina. Pueden vivir '
    'en diferentes tipos de hábitat, pero muestran preferencia por algunos en concreto. '
    'Suelen encontrarse cerca de lagos, ríos, marismas o manglares.'
)
model([context], ['What countries do capybara live in?'])

As you can see, the model works even when the context and the question are in different languages!

### Zero-shot transfer performance

Results for Zero-Shot NER from English to Russian:

| model                           | Overall (Span F-1) | PER (Span F-1) | LOC (Span F-1) | ORG (Span F-1) |
|---------------------------------|--------------------|----------------|----------------|----------------|
| RuBERT NER                      | 97.7               | 98.3           | 99.7           | 94.9           |
| Zero-shot Multilingual BERT NER | 79.4               | 95.7           | 82.6           | 55.7           |

Results for Zero-Shot QA from English to Russian:

| model                            | F-1   |
|----------------------------------|-------|
| RuBERT QA | 84.6 |
| Zero-shot Multilingual BERT QA   | 77.36 |

Results for Zero-Shot QA from Russian to English:

| model                            | F-1   |
|----------------------------------|-------|
| BERT QA | 88.49 |
| Zero-shot Multilingual BERT QA   | 75.26 |