# EXAMPLE - 7

**Tasks :- Query similarity**

**Tasks Description**

``Query similarity`` :- This is a sentence pair classification task which determines whether the two sentences (queries) in a sample are similar, that is, whether they ask the same thing.

**Conversational Utility** :- In a conversational AI context, this task determines whether the second sentence is similar to the first. Additionally, the predicted probability can be used as a similarity score between the two sentences.
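
As a minimal sketch of how a probability can double as a similarity score, the snippet below applies a softmax to the two class logits of a sentence pair and reads off the probability of the "similar" class. The logit values and the choice of class index 1 for "similar" are assumptions made for illustration; they are not taken from the library.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_score(logits, similar_class=1):
    """Probability of the 'similar' class, used as a similarity score.

    Assumption: index 1 is the 'similar' class.
    """
    return softmax(logits)[similar_class]


# Hypothetical logits for one question pair: [not similar, similar]
example_logits = [-1.2, 2.3]
score = similarity_score(example_logits)
is_similar = score >= 0.5  # 0.5 is an illustrative threshold, tune per use case
print(f"similarity score: {score:.3f}, similar: {is_similar}")
```

A threshold other than 0.5 may suit applications where false matches are more costly than missed ones.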

**Data** :- In this example, we are using the Quora Question Pairs (QQP) data, which contains question pairs and labels indicating whether the two questions are duplicates.

The data can be downloaded using the following ``wget`` command. The file is a plain ``.tsv``, so no unzipping is required.

In [None]:
!wget qim.fs.quoracdn.net/quora_duplicate_questions.tsv -P qqp_data/

# Step - 1 Data Transformations

Defining the transform file (``transform_file_qqp.yml``)

```
sample_transform:
  transform_func: qqp_query_similarity_to_tsv
  read_file_names:
    - quora_duplicate_questions.tsv
  read_dir: qqp_data
  save_dir: ../../data
```

In [None]:
!python ../../data_transformations.py \
    --transform_file 'transform_file_qqp.yml'
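
To make the transformation step more concrete, here is a rough sketch of what a function such as ``qqp_query_similarity_to_tsv`` might do: read the Quora TSV, keep the two questions and the duplicate label, and split the rows into train, dev and test sets. This is not the library's implementation. The output column order (id, label, first question, second question), the split fractions and all helper names are assumptions for illustration.

```python
import csv
import io
import random

# Tiny inline sample using the column names of the Quora file
SAMPLE = (
    "id\tqid1\tqid2\tquestion1\tquestion2\tis_duplicate\n"
    "0\t1\t2\tHow do I learn Python?\tWhat is the best way to learn Python?\t1\n"
    "1\t3\t4\tWhat is AI?\t\t0\n"  # missing second question, will be skipped
)


def quora_rows_to_pairs(tsv_text):
    """Parse Quora-style TSV text into (uid, label, question1, question2) tuples."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    pairs = []
    for row in reader:
        # Collapse internal whitespace so each sample stays on one output line
        q1 = " ".join((row.get("question1") or "").split())
        q2 = " ".join((row.get("question2") or "").split())
        label = (row.get("is_duplicate") or "").strip()
        if not q1 or not q2 or label not in ("0", "1"):
            continue  # drop incomplete or malformed rows
        pairs.append((row["id"], label, q1, q2))
    return pairs


def split_pairs(pairs, dev_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle deterministically and split into train, dev and test lists."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_frac)
    n_test = int(len(shuffled) * test_frac)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test


def pairs_to_tsv(pairs):
    """Serialize tuples as tab-separated lines (assumed output layout)."""
    return "".join("\t".join(fields) + "\n" for fields in pairs)


print(pairs_to_tsv(quora_rows_to_pairs(SAMPLE)), end="")
```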

# Step - 2 Data Preparation

Defining the task file (``tasks_file_qqp.yml``) for query similarity detection with QQP data

```
querysimilarity:
    model_type: BERT
    config_name: bert-base-uncased
    dropout_prob: 0.2
    metrics:
    - classification_accuracy
    loss_type: CrossEntropyLoss
    class_num: 2
    task_type: SentencePairClassification
    file_names:
    - qqp_query_similarity_train.tsv
    - qqp_query_similarity_dev.tsv
    - qqp_query_similarity_test.tsv
```

In [None]:
!python ../../data_preparation.py \
    --task_file 'tasks_file_qqp.yml' \
    --data_dir '../../data' \
    --max_seq_len 200
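
The sketch below illustrates, under simplifying assumptions, what preparing a sentence pair for a BERT-style model with a maximum sequence length involves: the pair is joined as ``[CLS] A [SEP] B [SEP]``, truncated to fit, and padded. It uses whitespace splitting instead of the real WordPiece tokenizer and returns tokens rather than vocabulary ids, so it illustrates the layout only and is not the library's preparation code. The function name ``encode_pair`` is hypothetical.

```python
def encode_pair(sentence_a, sentence_b, max_seq_len=200):
    """Build BERT-style inputs for a sentence pair (whitespace tokens only)."""
    if max_seq_len < 3:
        raise ValueError("max_seq_len must leave room for [CLS] and two [SEP]")

    tokens_a = sentence_a.lower().split()
    tokens_b = sentence_b.lower().split()

    # Reserve three positions for the special tokens, then trim the
    # longer sentence one token at a time until the pair fits.
    budget = max_seq_len - 3
    while len(tokens_a) + len(tokens_b) > budget:
        longer = tokens_a if len(tokens_a) >= len(tokens_b) else tokens_b
        longer.pop()

    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and the first [SEP]; segment 1 the rest.
    type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    attention_mask = [1] * len(tokens)

    # Pad every sequence to the same fixed length.
    pad = max_seq_len - len(tokens)
    tokens += ["[PAD]"] * pad
    type_ids += [0] * pad
    attention_mask += [0] * pad
    return tokens, type_ids, attention_mask


tokens, type_ids, attention_mask = encode_pair(
    "How do I learn Python ?", "Best way to learn Python ?", max_seq_len=16
)
print(tokens)
```

The same ``max_seq_len`` should be used for preparation, training and inference so that inputs have a consistent shape.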

# Step - 3 Running train

The following command starts training for the task. The log file reporting the loss and metrics, along with the TensorBoard logs, will be written to a time-stamped directory.

For more details about the training process, refer to <a href="https://multi-task-nlp.readthedocs.io/en/latest/training.html#running-train">running training</a> in the documentation.

In [None]:
!python ../../train.py \
    --data_dir '../../data/bert-base-uncased_prepared_data' \
    --task_file 'tasks_file_qqp.yml' \
    --out_dir 'qqp_query_similarity_bert_base' \
    --epochs 3 \
    --train_batch_size 32 \
    --eval_batch_size 32 \
    --grad_accumulation_steps 2 \
    --log_per_updates 100 \
    --save_per_updates 3000 \
    --limit_save 6 \
    --max_seq_len 200 \
    --eval_while_train \
    --test_while_train \
    --silent
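
To get a feel for how the batch size, gradient accumulation and checkpoint settings interact, the following back-of-the-envelope sketch computes the number of optimizer updates and checkpoints for a run. It assumes the common convention that one update is made per ``grad_accumulation_steps`` batches and that only the most recent ``limit_save`` checkpoints are kept; the library's exact accounting may differ. The dataset size used here is hypothetical.

```python
import math


def training_schedule(num_examples, batch_size, accumulation_steps,
                      epochs, save_every, keep_last):
    """Estimate update and checkpoint counts for a training run."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    # Assumption: gradients from `accumulation_steps` batches form one update.
    updates_per_epoch = math.ceil(batches_per_epoch / accumulation_steps)
    total_updates = updates_per_epoch * epochs
    checkpoints_written = total_updates // save_every
    return {
        "effective_batch_size": batch_size * accumulation_steps,
        "updates_per_epoch": updates_per_epoch,
        "total_updates": total_updates,
        "checkpoints_written": checkpoints_written,
        # Assumption: older checkpoints beyond `keep_last` are removed.
        "checkpoints_kept": min(checkpoints_written, keep_last),
    }


# Flag values from the command above; 320000 training pairs is a made-up figure.
schedule = training_schedule(
    num_examples=320000, batch_size=32, accumulation_steps=2,
    epochs=3, save_every=3000, keep_last=6,
)
for name, value in schedule.items():
    print(f"{name}: {value}")
```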

# Step - 4 Inferring

You can import and use the ``inferPipeline`` to get predictions for the required tasks.
The trained model and the maximum sequence length to be used need to be specified.

For more details about inferring, refer to <a href="https://multi-task-nlp.readthedocs.io/en/latest/infering.html">infer pipeline</a> in the documentation.

In [None]:
import sys
sys.path.insert(1, '../../')
from infer_pipeline import inferPipeline