# Example - 4

**Tasks :- Query type detection**

**Tasks Description**

``querytype`` :- This is a single sentence classification task to determine what type (category) of answer is expected for the given query. The queries are divided into 5 major classes according to the answer expected for them.

**Conversational Utility** :- When returning a response to a query, knowing what kind of answer is expected helps in both curating the answer and cross-verifying it against the expected type.

**Data** :- In this example, we are using the <a href="https://microsoft.github.io/msmarco/">MSMARCO QnA</a> data. Queries are divided into 5 query types - NUMERIC, LOCATION, ENTITY, DESCRIPTION, PERSON.

The data can be downloaded using the following ``wget`` commands and decompressed using the ``gunzip`` command.

In [None]:
!wget https://msmarco.blob.core.windows.net/msmarco/train_v2.1.json.gz -P msmarco_qna_data
!wget https://msmarco.blob.core.windows.net/msmarco/dev_v2.1.json.gz -P msmarco_qna_data
!wget https://msmarco.blob.core.windows.net/msmarco/eval_v2.1_public.json.gz -P msmarco_qna_data

In [None]:
!gunzip msmarco_qna_data/train_v2.1.json.gz
!gunzip msmarco_qna_data/dev_v2.1.json.gz
!gunzip msmarco_qna_data/eval_v2.1_public.json.gz
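
If the ``gunzip`` command is not available (for example on Windows), the same decompression can be sketched with the Python standard library. The helper name ``gunzip_file`` is hypothetical and is not part of this library.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(gz_path, remove_source=True):
    """Decompress a .gz file next to itself and return the new path.

    Hypothetical helper, shown only as an alternative to the shell command.
    """
    gz_path = Path(gz_path)
    out_path = gz_path.with_suffix("")  # drops only the trailing ".gz"
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # streams in chunks, so large files are fine
    if remove_source:
        gz_path.unlink()  # mirror gunzip, which removes the archive
    return out_path
```

For instance, ``gunzip_file('msmarco_qna_data/train_v2.1.json.gz')`` would leave ``train_v2.1.json`` in the same directory.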

# Step - 1: Transforming data

The data is in *JSON* format and contains various fields for each sample. We only consider the ``query`` and ``query_type`` fields in this example. The data is fairly large, hence ``data_frac`` is set to 0.2 by default. You can increase it if you want to use more of the data.

A sample transformation function, ``msmarco_query_type_to_tsv``, is already provided to convert this data to the required tsv format.

Running the data transformation will save the required train, dev and test tsv files under the ``data`` directory in the root of the library. For more details on the data transformation process, refer to <a href="https://multi-task-nlp.readthedocs.io/en/latest/data_transformations.html">data transformations</a> in the documentation.
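
To make the idea concrete, here is a minimal sketch of what such a transformation does conceptually: sample a fraction of the queries and write one tab-separated row per query. It is not the library's ``msmarco_query_type_to_tsv`` implementation. The function names, the column-oriented layout of the input dictionary, and the ``uid, label, query`` column order are all assumptions made for illustration.

```python
import csv
import io
import random


def query_type_rows(data, data_frac=0.2, seed=42):
    """Return (uid, query_type, query) rows for a random fraction of queries.

    Assumes `data` is column-oriented: data["query"] and data["query_type"]
    map the same sample keys to their values.
    """
    keys = sorted(data["query"])
    random.Random(seed).shuffle(keys)  # seeded, so the sample is reproducible
    keep = keys[: int(len(keys) * data_frac)]
    return [(k, data["query_type"][k], data["query"][k]) for k in keep]


def rows_to_tsv(rows):
    """Serialise rows as tab-separated text (column order is an assumption)."""
    buf = io.StringIO()
    csv.writer(buf, delimiter="\t", lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Toy stand-in for a loaded JSON file; not real MSMARCO data.
toy = {
    "query": {
        "0": "how tall is mount everest",
        "1": "who wrote hamlet",
        "2": "where is the eiffel tower",
        "3": "what is photosynthesis",
        "4": "what company makes the iphone",
    },
    "query_type": {
        "0": "NUMERIC",
        "1": "PERSON",
        "2": "LOCATION",
        "3": "DESCRIPTION",
        "4": "ENTITY",
    },
}

rows = query_type_rows(toy, data_frac=0.4)  # keeps 2 of the 5 toy queries
tsv_text = rows_to_tsv(rows)
```

With the real files, ``data`` would first be read with ``json.load``, and the provided transformation function remains the supported route.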

The transformation file needs the following details. It is already created as ``transform_file_querytype.yml``.

```
transform1:
  transform_func: msmarco_query_type_to_tsv
  transform_params:
    data_frac : 0.2
  read_file_names:
    - train_v2.1.json
    - dev_v2.1.json
    - eval_v2.1_public.json

  read_dir: msmarco_qna_data
  save_dir: ../../data
```

The following command can be used to run the data transformation for the task.

In [None]:
!python ../../data_transformations.py \
    --transform_file 'transform_file_querytype.yml'

# Step - 2: Data Preparation

For more details on the data preparation process, refer to <a href="https://multi-task-nlp.readthedocs.io/en/latest/training.html#running-data-preparation">data preparation</a> in the documentation.

Defining the tasks file for training a single model for the query type detection task. The file is already created at ``tasks_file_querytype.yml``.
```
querytype:
    model_type: BERT
    config_name: bert-base-uncased
    dropout_prob: 0.2
    label_map_or_file:
    - DESCRIPTION
    - ENTITY
    - LOCATION
    - NUMERIC
    - PERSON
    metrics:
    - classification_accuracy
    loss_type: CrossEntropyLoss
    task_type: SingleSenClassification
    file_names:
    - querytype_train_v2.1.tsv
    - querytype_dev_v2.1.tsv
    - querytype_eval_v2.1_public.tsv
```
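
As a rough illustration of how the five labels relate to a 5-way classifier, the sketch below maps each label to a class index and decodes a score vector back to a label. That the list order in ``label_map_or_file`` defines the class index is an assumption here, and ``decode`` is a hypothetical helper rather than a library function.

```python
# Labels in the same order as in the tasks file above.
labels = ["DESCRIPTION", "ENTITY", "LOCATION", "NUMERIC", "PERSON"]

# Assumption: class index follows list order.
label_to_id = {label: idx for idx, label in enumerate(labels)}
id_to_label = {idx: label for label, idx in label_to_id.items()}


def decode(scores):
    """Return the label whose score is highest (argmax over classes)."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return id_to_label[best]


# Hypothetical scores for a query such as "how tall is mount everest".
predicted = decode([0.10, 0.20, 0.05, 0.60, 0.05])
```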

In [None]:
!python ../../data_preparation.py \
    --task_file 'tasks_file_querytype.yml' \
    --data_dir '../../data' \
    --max_seq_len 60
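
The effect of ``--max_seq_len 60`` can be pictured with the following sketch, which truncates or pads a list of token ids to a fixed length and builds the matching attention mask. This is not the library's preparation code: the function name is hypothetical, the input ids are made up, and the special token ids (101, 102, 0) are the usual ``[CLS]``, ``[SEP]`` and ``[PAD]`` ids of the ``bert-base-uncased`` vocabulary, passed as defaults only for illustration.

```python
def pad_or_truncate(token_ids, max_seq_len=60, cls_id=101, sep_id=102, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_seq_len long."""
    body = token_ids[: max_seq_len - 2]  # leave room for [CLS] and [SEP]
    ids = [cls_id] + body + [sep_id]
    mask = [1] * len(ids)  # 1 marks real tokens
    n_pad = max_seq_len - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding


# A short, made-up query of three tokens is padded up to 60 positions.
short_ids, short_mask = pad_or_truncate([2129, 4206, 2003])

# A made-up query of 100 tokens is cut down to 60 positions.
long_ids, long_mask = pad_or_truncate(list(range(1000, 1100)))
```

Queries are usually short, so a small value such as 60 keeps batches compact while rarely truncating anything.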

# Step - 3: Running train

The following command will start the training for the task. The log file reporting the loss and metrics, along with the tensorboard logs, will be present in a time-stamped directory.

For more details on the training process, refer to <a href= "https://multi-task-nlp.readthedocs.io/en/latest/training.html#running-train">running training</a> in the documentation.

In [None]:
!python ../../train.py \
    --data_dir '../../data/bert-base-uncased_prepared_data' \
    --task_file 'tasks_file_querytype.yml' \
    --out_dir 'msmarco_querytype_bert_base' \
    --epochs 4 \
    --train_batch_size 64 \
    --eval_batch_size 64 \
    --grad_accumulation_steps 1 \
    --log_per_updates 100 \
    --max_seq_len 60 \
    --eval_while_train \
    --test_while_train \
    --silent
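
To get a feel for how long such a run is, the arithmetic below estimates the number of optimizer updates from the flags above. The sample count of 100000 is a hypothetical figure, not the size of the transformed data, and rounding partial batches up is an assumption about how the trainer counts steps.

```python
import math


def num_updates(num_samples, epochs=4, train_batch_size=64, grad_accumulation_steps=1):
    """Estimate optimizer updates for a run (partial batches rounded up)."""
    batches_per_epoch = math.ceil(num_samples / train_batch_size)
    updates_per_epoch = math.ceil(batches_per_epoch / grad_accumulation_steps)
    return updates_per_epoch * epochs


# Hypothetical training set of 100000 queries.
total_updates = num_updates(100000)

# With --log_per_updates 100, roughly this many log entries would be written.
log_entries = total_updates // 100
```

Raising ``--grad_accumulation_steps`` lowers the number of updates per epoch, since several batches are accumulated before each update.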

# Step - 4: Inferring

You can import and use the ``inferPipeline`` to get predictions for the required tasks.
The trained model and the maximum sequence length to be used need to be specified.

For more details on inference, refer to <a href="https://multi-task-nlp.readthedocs.io/en/latest/infering.html">infer pipeline</a> in the documentation.

In [1]:
import sys
sys.path.insert(1, '../../')
from infer_pipeline import inferPipeline

Using TensorFlow backend.
