## EXAMPLE - 3

**Tasks :- Answerability detection**

**Tasks Description**

``answerability`` :- This is modeled as a sentence pair classification task where the first sentence is a query and the second sentence is a context passage. The objective is to determine whether the query can be answered from the context passage.

**Conversational Utility** :- This can be a useful component for building a system based on question answering or machine comprehension. In such systems, it is important to determine whether the given query can be answered from the given context passage before extracting or abstracting an answer from it. Performing question answering for a query that is not answerable from the context could lead to incorrect answer extraction.

**Data** :- In this example, we use the <a href="https://msmarco.blob.core.windows.net/msmarcoranking/triples.train.small.tar.gz">MSMARCO triples</a> data, from which we derive sentence pairs and labels.
The data consists of triplets where the first entry is the query, the second is a context passage from which the query can be answered (positive passage), and the third is a context passage from which the query cannot be answered (negative passage).

The data is transformed into sentence pair classification format, with each query-positive context pair labeled as 1 (answerable) and each query-negative context pair labeled as 0 (non-answerable).
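
The labeling scheme can be sketched in a few lines of Python. This is only an illustration of the idea, not the library's transformation code; the function name ``triple_to_pairs`` and the example sentences are made up for this sketch.

```python
from typing import List, Tuple


def triple_to_pairs(query: str, positive: str, negative: str) -> List[Tuple[str, str, int]]:
    """Turn one (query, positive passage, negative passage) triple
    into two labeled sentence pairs."""
    return [
        (query, positive, 1),  # query can be answered from this passage
        (query, negative, 0),  # query cannot be answered from this passage
    ]


# Invented example triple, only to show the shape of the output.
pairs = triple_to_pairs(
    "what is the capital of france",
    "Paris is the capital of France.",
    "The Seine flows through several towns.",
)
for query, passage, label in pairs:
    print(label, query, "|", passage)
```

Each triple therefore yields exactly two samples, so the resulting classification data is balanced between the two labels.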

The data can be downloaded using the following ``wget`` command and extracted using the ``tar`` command. Note that the download is fairly large (7.4GB).

In [None]:
!wget https://msmarco.blob.core.windows.net/msmarcoranking/triples.train.small.tar.gz -P msmarco_data

In [None]:
!tar -xvzf msmarco_data/triples.train.small.tar.gz -C msmarco_data/

In [None]:
!rm msmarco_data/triples.train.small.tar.gz

# Step - 1: Transforming data

The extracted file ``triples.train.small.tsv`` is in tab-separated format, where each line holds one triple: the query, the positive passage and the negative passage.

We already provide a sample transformation function ``msmarco_answerability_detection_to_tsv`` to convert this data to the required tsv format. It writes the data in sentence pair classification format, with each query-positive context pair labeled as 1 (answerable) and each query-negative context pair labeled as 0 (non-answerable).

Running the data transformation saves the required train, dev and test tsv files under the ``data`` directory in the root of the library. For more details on the data transformation process, refer to <a href="https://multi-task-nlp.readthedocs.io/en/latest/data_transformations.html">data transformations</a> in the documentation.

The transformation file, already created as ``transform_file_snli.yml``, should have the following details.

```
transform1:
  transform_func: msmarco_answerability_detection_to_tsv
  transform_params:
    data_frac : 0.02
  read_file_names:
    - triples.train.small.tsv
  read_dir : msmarco_data
  save_dir: ../../data
  
```
Run the data transformation with this transform file to generate the tsv files for the task.
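
To illustrate what a setting like ``data_frac : 0.02`` could mean, here is a minimal sketch that keeps a random 2% of the triples. It is not the library's own code: it assumes ``data_frac`` is the fraction of triples retained, and the function name ``sample_triples``, the ``seed`` argument and the toy rows are hypothetical.

```python
import csv
import io
import random


def sample_triples(tsv_text, data_frac, seed=42):
    """Parse tab-separated (query, positive, negative) rows and
    keep a random subset of roughly `data_frac` of them."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    triples = [row for row in reader if len(row) == 3]  # skip malformed rows
    # Keep at least one row when any exist, never more than are available.
    keep = min(len(triples), max(1, round(len(triples) * data_frac)))
    return random.Random(seed).sample(triples, keep)


# Toy stand-in for triples.train.small.tsv: 100 rows of three columns.
toy_tsv = "\n".join(
    f"query {i}\tpositive passage {i}\tnegative passage {i}" for i in range(100)
)
sampled = sample_triples(toy_tsv, data_frac=0.02)
print(len(sampled))
```

Working on a small fraction keeps the example quick to run, since the full triples file is several gigabytes.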

# Step - 2: Data Preparation

For more details on the data preparation process, refer to <a href="https://multi-task-nlp.readthedocs.io/en/latest/training.html#running-data-preparation">data preparation</a> in the documentation.

We define a tasks file for training a single model for the answerability task. The file is already created as ``tasks_file_answerability.yml``.
```
answerability:
    model_type: BERT
    config_name: bert-base-uncased
    dropout_prob: 0.2
    class_num: 2
    metrics:
    - classification_accuracy
    loss_type: CrossEntropyLoss
    task_type: SentencePairClassification
    file_names:
    - msmarco_answerability_train.tsv
    - msmarco_answerability_dev.tsv
    - msmarco_answerability_test.tsv
```
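
The ``classification_accuracy`` metric named in the tasks file is, in the usual sense, the fraction of pairs whose predicted class matches the gold label. A minimal sketch of that definition follows; the function name ``accuracy`` is hypothetical, and the library may compute or scale the value differently.

```python
def accuracy(gold, pred):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    matches = sum(1 for g, p in zip(gold, pred) if g == p)
    return matches / len(gold)


# 1 = answerable, 0 = non-answerable (illustrative labels only).
gold = [1, 0, 1, 1]
pred = [1, 0, 0, 1]
score = accuracy(gold, pred)
print(score)
```

Because each triple contributes one positive and one negative pair, the classes are balanced and accuracy is a reasonable headline metric for this task.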

In [None]:
!python ../../data_preparation.py \
    --task_file 'tasks_file_answerability.yml' \
    --data_dir '../../data' \
    --max_seq_len 324
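
As a rough picture of what ``--max_seq_len`` controls, the sketch below builds a BERT-style pair input, ``[CLS] query [SEP] passage [SEP]``, and truncates it to a maximum length. It makes several assumptions that are not taken from the library: whitespace tokenization stands in for the real BERT tokenizer, the longer side is trimmed first, and the function name ``encode_pair`` is hypothetical.

```python
def encode_pair(query, passage, max_seq_len):
    """Build tokens and segment ids for a sentence pair, truncated to max_seq_len."""
    if max_seq_len < 3:
        raise ValueError("max_seq_len must leave room for [CLS] and two [SEP] tokens")
    q_tokens = query.lower().split()
    p_tokens = passage.lower().split()
    budget = max_seq_len - 3  # positions left after the three special tokens
    # Drop tokens from the end of the longer side until the pair fits.
    while len(q_tokens) + len(p_tokens) > budget:
        if len(p_tokens) >= len(q_tokens):
            p_tokens.pop()
        else:
            q_tokens.pop()
    tokens = ["[CLS]"] + q_tokens + ["[SEP]"] + p_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the query and the first [SEP]; segment 1 covers the rest.
    type_ids = [0] * (len(q_tokens) + 2) + [1] * (len(p_tokens) + 1)
    return tokens, type_ids


query = "what is the capital of france"
passage = " ".join(f"w{i}" for i in range(10))
tokens, type_ids = encode_pair(query, passage, max_seq_len=12)
print(tokens)
print(type_ids)
```

Passages in this dataset are much longer than queries, which is why a comparatively large value such as 324 is used here.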

# Step - 3: Running train

The following command starts training for the task. The log file reporting the loss and metrics, along with the tensorboard logs, will be saved in a time-stamped directory.

For more details on the training process, refer to <a href="https://multi-task-nlp.readthedocs.io/en/latest/training.html#running-train">running training</a> in the documentation.

In [None]:
!python ../../train.py \
    --data_dir '../../data/bert-base-uncased_prepared_data' \
    --task_file 'tasks_file_answerability.yml' \
    --out_dir 'msmarco_answerability_bert_base' \
    --epochs 3 \
    --train_batch_size 8 \
    --eval_batch_size 16 \
    --grad_accumulation_steps 2 \
    --log_per_updates 250 \
    --max_seq_len 324 \
    --save_per_updates 16000 \
    --eval_while_train \
    --test_while_train \
    --silent
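
With gradient accumulation, gradients from several batches are summed before each optimizer step, so ``--train_batch_size 8`` combined with ``--grad_accumulation_steps 2`` behaves like a batch of 16 samples per update. The arithmetic can be sketched as follows; the sample count of 10,000 is a made-up figure, and how the library itself counts updates for ``--log_per_updates`` and ``--save_per_updates`` is not stated here.

```python
import math


def effective_batch_size(train_batch_size, grad_accumulation_steps):
    """Number of samples contributing to a single optimizer step."""
    return train_batch_size * grad_accumulation_steps


def optimizer_steps_per_epoch(num_samples, train_batch_size, grad_accumulation_steps):
    """Optimizer steps in one pass over the data, counting a final partial group as a step."""
    batches = math.ceil(num_samples / train_batch_size)
    return math.ceil(batches / grad_accumulation_steps)


batch = effective_batch_size(8, 2)
steps = optimizer_steps_per_epoch(10_000, 8, 2)  # 10,000 is a hypothetical sample count
print(batch, steps)
```

Accumulation is a common way to reach a larger effective batch size when a full batch with long sequences does not fit in GPU memory.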

# Step - 4: Inferring

You can import and use the ``inferPipeline`` to get predictions for the required tasks.
The trained model and the maximum sequence length to be used need to be specified.

For more details on inferring, refer to <a href="https://multi-task-nlp.readthedocs.io/en/latest/infering.html">infer pipeline</a> in the documentation.

In [None]:
import sys
sys.path.insert(1, '../../')
from infer_pipeline import inferPipeline