# EXAMPLE - 5

**Tasks :- NER tagging, POS tagging**

**Tasks Description**

``NER`` :- This is a Named Entity Recognition task in which each word of a sentence is tagged with the entity label it belongs to. Words that do not belong to any entity are labeled "O".

``POS`` :- This is a Part of Speech tagging task. A part of speech is a category of words that have similar grammatical properties. Each word of the sentence is tagged with its part of speech label. Words that do not belong to any part of speech label are labeled "O".
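To make the two labelings concrete, the sketch below pairs each token of one sentence with a POS tag and a NER tag. The sentence and tags follow the commonly quoted opening sentence of the coNLL 2003 training set; the variable and function names are illustrative and are not part of the library.

```python
# One tokenized sentence with one label per token for each task.
tokens   = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
pos_tags = ["NNP", "VBZ", "JJ", "NN", "TO", "VB", "JJ", "NN", "."]
ner_tags = ["B-ORG", "O", "B-MISC", "O", "O", "O", "B-MISC", "O", "O"]

# Sequence labeling needs exactly one label per token.
assert len(tokens) == len(pos_tags) == len(ner_tags)

def entity_tokens(tokens, tags):
    """Return (token, tag) pairs whose tag is not the outside label "O"."""
    return [(tok, tag) for tok, tag in zip(tokens, tags) if tag != "O"]

named = entity_tokens(tokens, ner_tags)
print(named)
```

Every token carries a POS tag, while only the tokens inside an entity carry a NER tag other than "O".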

**Conversational Utility** :- In a conversational AI context, determining the syntactic parts of a sentence can help in extracting noun phrases or important keyphrases from it.
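As a rough illustration of how POS tags can feed noun-phrase extraction, the sketch below groups consecutive determiner, adjective and noun tokens into candidate phrases. It is a simple heuristic written for this example, assuming Penn Treebank style tags; ``noun_phrases`` is a hypothetical helper, not a function of the library.

```python
def noun_phrases(tokens, pos_tags):
    """Group maximal runs of determiner/adjective/noun tokens that contain a noun."""
    np_tags = {"DT", "JJ", "NN", "NNS", "NNP", "NNPS"}
    noun_tags = {"NN", "NNS", "NNP", "NNPS"}
    phrases, current, has_noun = [], [], False
    for tok, tag in zip(tokens, pos_tags):
        if tag in np_tags:
            current.append(tok)
            has_noun = has_noun or tag in noun_tags
        else:
            # A non-phrase tag closes the current run; keep it only if it has a noun.
            if current and has_noun:
                phrases.append(" ".join(current))
            current, has_noun = [], False
    if current and has_noun:
        phrases.append(" ".join(current))
    return phrases

tokens = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."]
pos_tags = ["NNP", "VBZ", "JJ", "NN", "TO", "VB", "JJ", "NN", "."]
phrases = noun_phrases(tokens, pos_tags)
print(phrases)
```

A production system would more likely use chunk tags or a grammar-based chunker; the point here is only that per-token POS labels are enough to recover simple phrases.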

**Data** :- In this example, we use the <a href="https://www.clips.uantwerpen.be/conll2003/ner/">coNLL 2003</a> data, which is in BIO-tagged format with the POS and NER tags separated by spaces.

The data is already present in the ``coNLL_data`` directory.

# Step - 1: Transforming data

The raw data is in BIO-tagged format with the POS and NER tags separated by spaces.
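The sketch below shows one way such a file could be read into sentences. It is not the library's ``coNLL_ner_pos_to_tsv`` implementation; ``parse_conll`` is a hypothetical helper, and it assumes space-separated columns with the word first, the POS tag second and the NER tag last, with blank lines between sentences and optional ``-DOCSTART-`` marker lines.

```python
def parse_conll(lines):
    """Parse coNLL-style lines into sentences of (word, pos, ner) triples.

    Assumption: the word is the first column, the POS tag the second,
    and the NER tag the last; blank lines separate sentences.
    """
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        # Blank lines and document markers end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split()
        current.append((cols[0], cols[1], cols[-1]))
    if current:
        sentences.append(current)
    return sentences

sample = """-DOCSTART- -X- O O

EU NNP B-NP B-ORG
rejects VBZ B-VP O

Peter NNP B-NP B-PER
Blackburn NNP I-NP I-PER
""".splitlines()

sentences = parse_conll(sample)
print(sentences)
```

Taking the last column as the NER tag keeps the sketch usable whether or not a chunk column sits between the POS and NER tags.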

We already provide a sample transformation function, ``coNLL_ner_pos_to_tsv``, to convert this data to the required tsv format.

Running the data transformation saves the required train, dev and test tsv files under the ``data`` directory in the root of the library. For more details on the data transformation process, refer to <a href="https://multi-task-nlp.readthedocs.io/en/latest/data_transformations.html">data transformations</a> in the documentation.

The transformation file, which is already created as ``transform_file_conll.yml``, should have the following details.

```
transform1:
  transform_func: coNLL_ner_pos_to_tsv
  read_file_names:
    - coNLL_train.txt
    - coNLL_testa.txt
    - coNLL_testb.txt
  read_dir: coNLL_data
  save_dir: ../../data
```

The following command runs the data transformation for the tasks.

In [None]:
# In PowerShell (PS): if you are in the root directory, run the cd command below first; if you are already inside the SRL_DEPENDENCY directory, skip it
cd .\SRL_NER
# Main command
python ..\data_transformations.py --transform_file transform_file_conll.yml

# Step - 2: Data Preparation

For more details on the data preparation process, refer to <a href="https://multi-task-nlp.readthedocs.io/en/latest/training.html#running-data-preparation">data preparation</a> in the documentation.

Define the tasks file for training a single model on the NER and POS tasks. The file is already created as ``tasks_file_conll.yml``.
```
conllner:
  model_type: BERT
  config_name: bert-base-uncased
  dropout_prob: 0.2
  label_map_or_file: ../../data/ner_coNLL_train_label_map.joblib
  metrics:
  - seqeval_f1_score
  - seqeval_precision
  - seqeval_recall
  loss_type: NERLoss
  task_type: NER
  file_names:
  - ner_coNLL_train.tsv
  - ner_coNLL_testa.tsv
  - ner_coNLL_testb.tsv

conllpos:
    model_type: BERT
    config_name: bert-base-uncased
    dropout_prob: 0.2
    label_map_or_file: ../../data/pos_coNLL_train_label_map.joblib
    metrics:
    - seqeval_f1_score
    - seqeval_precision
    - seqeval_recall
    loss_type: NERLoss
    task_type: NER
    file_names:
    - pos_coNLL_train.tsv
    - pos_coNLL_testa.tsv
    - pos_coNLL_testb.tsv
```

In [None]:
# If you are already in the SRL_DEPENDENCY_MODEL directory, skip this cd command
cd .\SRL_NER
# Main command
python ..\data_preparation.py --task_file tasks_file_conll.yml --data_dir .\data_to_prepare --max_seq_len 50

# Step - 3: Running Training

In [None]:
# If you are already in the SRL_DEPENDENCY_MODEL directory, skip this cd command
cd .\SRL_NER
# Main command
python ..\train.py --data_dir .\data_to_prepare\dmis-lab\biobert-base-cased-v1.2_prepared_data --task_file tasks_file_conll.yml --out_dir conll_ner_pos_biobert_base --epochs 1 --train_batch_size 32 --eval_batch_size 32 --grad_accumulation_steps 1 --log_per_updates 50 --max_seq_len 50 --eval_while_train --test_while_train

# Step - 4: Inferring

You can import and use ``inferPipeline`` to get predictions for the required tasks.
The trained model and the maximum sequence length to be used need to be specified.

For more details on inferring, refer to <a href="https://multi-task-nlp.readthedocs.io/en/latest/infering.html">infer pipeline</a> in the documentation.

In [6]:
import sys
sys.path.insert(1, '../')

from infer_pipeline import inferPipeline

pipe = inferPipeline(modelPath="./conll_ner_pos_biobert_base/multi_task_model_0_1123.pt", maxSeqLen=50)

Some weights of the model checkpoint at dmis-lab/biobert-base-cased-v1.2 were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [10]:
sample_sentence_1 = "normal lhx4 splicing was abolished by this intronic mutation , which segregates in a dominant and fully penetrant manner over three generations and activates two exonic cryptic splice sites , thereby predicting two different proteins deleted in their homeodomain sequence."

sample_sentence_2 = "to determine how activators interact with the complex and to examine the importance of these interactions , relative to other potential targeting mechanisms , for swi/snf function , we found to identify and mutate amino acids ee(437 and 438)."

samples = [ [sample_sentence_1], [sample_sentence_2] ]
tasks = ['conllSRL', 'conllNER']
pipe.infer(samples, tasks)

Eval: 2it [00:00, 10.51it/s]                       


AssertionError: length of sample and result list not same
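
The tasks file shown in Step 2 defines the tasks ``conllner`` and ``conllpos``, whereas the call above requests ``conllSRL`` and ``conllNER``. Whether this mismatch is what triggers the assertion is not confirmed here. As a minimal pre-flight check, assuming the names passed to ``pipe.infer`` must match the trained task names exactly (including case), something like the following could be run first; ``unknown_tasks`` is a hypothetical helper, not part of the library.

```python
# Task names as they appear in tasks_file_conll.yml (Step 2).
trained_tasks = {"conllner", "conllpos"}

def unknown_tasks(requested, known):
    """Return the requested task names that are not in the known set (case-sensitive)."""
    return [name for name in requested if name not in known]

requested = ["conllSRL", "conllNER"]
missing = unknown_tasks(requested, trained_tasks)
print(missing)
```

If the list is non-empty, the requested names differ from the ones in the tasks file and are worth checking before calling the pipeline.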