# Tutorial: Running Your Own Task in MT-DNN
Running your own task with MT-DNN takes three steps:
1. Add your task to the task definition config.
2. Prepare your task data in the correct schema.
3. Specify your task name in the arguments to train.py.

## Step 1 - Add your task to the task definition config
MT-DNN adopts [YAML](https://en.wikipedia.org/wiki/YAML) as its config file format.
Here is an example task definition config:
<pre>snlisample:
  data_format: PremiseAndOneHypothesis
  enable_san: true
  labels:
  - contradiction
  - neutral
  - entailment
  metric_meta:
  - ACC
  loss: CeCriterion
  kd_loss: MseCriterion
  adv_loss: SymKlCriterion
  n_class: 3
  task_type: Classification</pre>
  
We will use this "snlisample" task as an example to explain these fields and add them step by step. You can also find the full config file under the "tutorials" folder.


### Add task definition for your task

<pre>snlisample:
  task_type: Classification
  n_class: 3</pre> 

Specify the task type of your task by choosing one of the following:
1. Classification
2. Regression
3. Ranking
4. Span
5. SeqenceLabeling
6. MaskLM

More details are in [data_utils/task_def.py](../data_utils/task_def.py).

Also specify the total number of classes in your task in the "n_class" field.
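
As a quick sanity check before training, the two fields above can be validated with a few lines of Python. This is a minimal sketch and not part of MT-DNN: `check_task_def` is a hypothetical helper, the dictionary simply mirrors the YAML snippet, and the set of allowed names is copied from the list above.

```python
# Task types copied from the list above (identifiers spelled as in this tutorial).
TASK_TYPES = {
    "Classification", "Regression", "Ranking",
    "Span", "SeqenceLabeling", "MaskLM",
}


def check_task_def(name, task_def):
    """Return a list of problems found in one task definition (hypothetical helper)."""
    problems = []
    task_type = task_def.get("task_type")
    if task_type not in TASK_TYPES:
        problems.append(f"{name}: unknown task_type {task_type!r}")
    n_class = task_def.get("n_class")
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(n_class, int) or isinstance(n_class, bool) or n_class < 1:
        problems.append(f"{name}: n_class must be a positive integer, got {n_class!r}")
    return problems


# The dictionary mirrors the YAML snippet for "snlisample".
snlisample = {"task_type": "Classification", "n_class": 3}
print(check_task_def("snlisample", snlisample))
print(check_task_def("broken", {"task_type": "Classify", "n_class": 0}))
```

An empty list means no problems were found. The second call reports both a misspelled task type and an invalid class count.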


### Add data information for your task

<pre>snlisample:
  data_format: PremiseAndOneHypothesis
  enable_san: true
  labels:
  - contradiction
  - neutral
  - entailment
  n_class: 3
  task_type: Classification
  </pre> 
  
Choose the data format that matches your task. Four data formats are currently supported, each corresponding to a different kind of task:
1. "PremiseOnly": a single text, i.e. the premise. The data format is "id" \t "label" \t "premise".
2. "PremiseAndOneHypothesis": two texts, i.e. one premise and one hypothesis. The data format is "id" \t "label" \t "premise" \t "hypothesis".
3. "PremiseAndMultiHypothesis": one text as the premise and multiple candidate texts as hypotheses. The data format is "id" \t "label" \t "premise" \t "hypothesis_1" \t "hypothesis_2" \t ... \t "hypothesis_n".
4. "Seqence": sequence tagging. The data format is "id" \t "label" \t "premise".

More details are in [data_utils/task_def.py](../data_utils/task_def.py).
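
To make the column layouts above concrete, here is a small parser for one tab-separated line. It is an illustrative sketch only, not MT-DNN's own loader: `parse_row` and the dictionary keys it returns are hypothetical, and the sample line is made up.

```python
def parse_row(line, data_format):
    """Split one tab-separated line into named fields (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if data_format in ("PremiseOnly", "Seqence"):
        expected = 3
    elif data_format == "PremiseAndOneHypothesis":
        expected = 4
    elif data_format == "PremiseAndMultiHypothesis":
        expected = None  # id, label, premise, then one or more hypotheses
    else:
        raise ValueError(f"unknown data format: {data_format!r}")

    if expected is None:
        if len(fields) < 4:
            raise ValueError(f"expected at least 4 columns, got {len(fields)}")
    elif len(fields) != expected:
        raise ValueError(f"expected {expected} columns, got {len(fields)}")

    row = {"id": fields[0], "label": fields[1], "premise": fields[2]}
    if data_format == "PremiseAndOneHypothesis":
        row["hypothesis"] = fields[3]
    elif data_format == "PremiseAndMultiHypothesis":
        row["hypotheses"] = fields[3:]
    return row


# A made-up example line in the "PremiseAndOneHypothesis" layout.
line = "0\tentailment\tA man is playing a guitar.\tA person is playing an instrument.\n"
print(parse_row(line, "PremiseAndOneHypothesis"))
```

Running a check like this over every line of a file catches missing or extra tabs before preprocessing.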
 
The code uses the file name suffix ("_train", "_dev" and "_test") to tell which split a data set belongs to. So:
1. Make sure your train set is named "TASK_train" (replace TASK with your task name).
2. Make sure the names of your dev set and test set end with "_dev" and "_test".
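
The naming rule above can be written down in a few lines. This is a sketch with hypothetical helper names (`split_names`, `split_of`); it deals only with the base names described above and says nothing about file extensions.

```python
SPLIT_SUFFIXES = ("_train", "_dev", "_test")


def split_names(task):
    """Base names expected for the three splits of one task."""
    return [task + suffix for suffix in SPLIT_SUFFIXES]


def split_of(name):
    """Return "train", "dev" or "test" for a data set name, or None if no suffix matches."""
    for suffix in SPLIT_SUFFIXES:
        if name.endswith(suffix):
            return suffix[1:]
    return None


print(split_names("snlisample"))
print(split_of("snlisample_dev"))
```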

If you prefer readable text labels, list the labels that occur in your data set under the "labels" field.
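
One way to picture the "labels" field is as a lookup between label text and integer class ids. The sketch below assumes that each label's id is its position in the list; that ordering is an assumption of this example and is not stated in this tutorial.

```python
# Labels in the same order as in the "snlisample" config.
labels = ["contradiction", "neutral", "entailment"]

# Assumption: the class id of a label is its position in the list.
label_to_id = {label: index for index, label in enumerate(labels)}
id_to_label = dict(enumerate(labels))

print(label_to_id)
print(id_to_label[2])
```

Whatever the mapping, the number of labels should agree with "n_class" (3 in this example).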

### Add hyper-parameters for your task

<pre>snlisample:
  data_format: PremiseAndOneHypothesis
  enable_san: true
  labels:
  - contradiction
  - neutral
  - entailment
  n_class: 3
  task_type: Classification
  </pre>

   
Set "enable_san" to "true" if you would like to use a Stochastic Answer Network ([SAN](https://www.aclweb.org/anthology/P18-1157.pdf)) for your task.
Different tasks can also use different dropout probabilities: set the probability for your task in the "dropout_p" field. For more examples, refer to the other GLUE task definition files.

### Add metric and loss for your task

<pre>snlisample:
  data_format: PremiseAndOneHypothesis
  enable_san: true
  labels:
  - contradiction
  - neutral
  - entailment
  metric_meta:
  - ACC
  loss: CeCriterion
  kd_loss: MseCriterion
  adv_loss: SymKlCriterion
  n_class: 3
  task_type: Classification
  </pre>
  
For more details about metrics, refer to [data_utils/metrics.py](../data_utils/metrics.py).

You can choose "loss", "kd_loss" (knowledge distillation) and "adv_loss" (adversarial training) from the pre-defined losses in [data_utils/loss.py](../data_utils/loss.py). You can also implement your own losses in that file and specify them in the task config.

  

## Step 2 - Prepare your data in the correct format

Remember the "data_format" you set in the config? Prepare your data according to the matching format below (the same list as above):

1. "PremiseOnly": a single text, i.e. the premise. The data format is "id" \t "label" \t "premise".
2. "PremiseAndOneHypothesis": two texts, i.e. one premise and one hypothesis. The data format is "id" \t "label" \t "premise" \t "hypothesis".
3. "PremiseAndMultiHypothesis": one text as the premise and multiple candidate texts as hypotheses. The data format is "id" \t "label" \t "premise" \t "hypothesis_1" \t "hypothesis_2" \t ... \t "hypothesis_n".
4. "Seqence": sequence tagging. The data format is "id" \t "label" \t "premise".
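
As an illustration of the "PremiseAndOneHypothesis" layout, the sketch below writes two made-up examples to a tab-separated file. The `write_tsv` helper is hypothetical, the ".tsv" extension and the temporary output directory are assumptions of this example, and the rows are invented for demonstration.

```python
import os
import tempfile

# Made-up rows: (id, label, premise, hypothesis).
rows = [
    ("0", "entailment", "A man is playing a guitar.", "A person is playing an instrument."),
    ("1", "contradiction", "A woman is reading a book.", "The woman is asleep."),
]


def write_tsv(path, rows):
    """Write rows as tab-separated lines, refusing fields that would break the layout."""
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            for field in row:
                if "\t" in field or "\n" in field:
                    raise ValueError(f"field contains a tab or newline: {field!r}")
            handle.write("\t".join(row) + "\n")


# The file name follows the "TASK_train" rule; the ".tsv" extension is an assumption.
out_dir = tempfile.mkdtemp()
train_path = os.path.join(out_dir, "snlisample_train.tsv")
write_tsv(train_path, rows)

with open(train_path, encoding="utf-8") as handle:
    print(handle.read())
```

Rejecting embedded tabs and newlines up front keeps every line at exactly the expected number of columns.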

### Tokenize and convert to JSON

The training code reads tokenized data in JSON format. Use "prepro_std.py" to tokenize your data and convert it to JSON.
For example, to preprocess the data for the example task "snlisample", run:

<pre>python prepro_std.py --model bert-base-uncased --root_dir tutorials/ --task_def tutorials/tutorial_task_def.yml</pre>

Example Output:
<pre>
07/02/2020 08:42:26 Task snlisample
07/02/2020 08:42:26 tutorials/bert_base_uncased/snlisample_train.json
07/02/2020 08:42:26 tutorials/bert_base_uncased/snlisample_dev.json
07/02/2020 08:42:26 tutorials/bert_base_uncased/snlisample_test.json
</pre>

## Step 3 - Onboard your task into training!

1. Add your task config to the overall config that covers all tasks.
2. For multi-task learning, where your new task is trained jointly with existing tasks, append your task name and test set prefix to the train.py arguments: "--train_datasets EXISTING_TASKS,YOUR_NEW_TASK --test_datasets EXISTING_TASK_TEST_SETS,YOUR_NEW_TASK_SETS". For single-task fine-tuning, pass only your new task in these arguments.
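
The comma-separated arguments described above can be assembled programmatically. The sketch below only builds the command line and does not run it. `build_train_command` is a hypothetical helper, and "existing_task" and "all_tasks_def.yml" are placeholder names, not real tasks or files. Only flags that appear in this tutorial are used.

```python
import shlex


def build_train_command(task_def, data_dir, train_tasks, test_tasks):
    """Return the train.py command as a list of arguments (hypothetical helper)."""
    return [
        "python", "train.py",
        "--task_def", task_def,
        "--data_dir", data_dir,
        "--train_datasets", ",".join(train_tasks),
        "--test_datasets", ",".join(test_tasks),
    ]


# Single-task fine-tuning: only the new task is listed.
single = build_train_command(
    "tutorials/tutorial_task_def.yml",
    "tutorials/bert_base_uncased/",
    ["snlisample"],
    ["snlisample"],
)
print(shlex.join(single))

# Multi-task learning: the new task is appended to placeholder existing tasks.
multi = build_train_command(
    "all_tasks_def.yml",
    "data/",
    ["existing_task", "snlisample"],
    ["existing_task", "snlisample"],
)
print(shlex.join(multi))
```

Keeping the arguments in a list makes it easy to pass them to `subprocess.run` later without worrying about shell quoting.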


For example, to fine-tune the "snlisample" task on the sampled data in the tutorials folder, run:
<pre>
python train.py --task_def tutorials/tutorial_task_def.yml --data_dir tutorials/bert_base_uncased/ --train_datasets snlisample --test_datasets snlisample
</pre>

To add adversarial training, make sure "adv_loss" is defined in the task config file and add "--adv_train":
<pre>
python train.py --task_def tutorials/tutorial_task_def.yml --data_dir tutorials/bert_base_uncased/ --train_datasets snlisample --test_datasets snlisample --adv_train
</pre>

Example Output:
<pre>
07/02/2020 09:13:38 Total number of params: 109484547
07/02/2020 09:13:38 At epoch 0
07/02/2020 09:13:38 Task [ 0] updates[     1] train loss[1.25835] remaining[0:00:06]
predicting 0
07/02/2020 09:13:42 Task snlisample -- epoch 0 -- Dev ACC: 36.000
predicting 0
07/02/2020 09:13:42 [new test scores saved.]
07/02/2020 09:13:44 At epoch 1
predicting 0
07/02/2020 09:13:47 Task snlisample -- epoch 1 -- Dev ACC: 43.000
predicting 0
07/02/2020 09:13:48 [new test scores saved.]
</pre>
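
To track progress across epochs, the dev scores can be pulled out of a log like the one above. This is a minimal sketch: the regular expression is written against the line format shown in the example output, which may differ in other versions of the code, and `dev_scores` is a hypothetical helper.

```python
import re

# Matches lines such as "... Task snlisample -- epoch 0 -- Dev ACC: 36.000".
DEV_LINE = re.compile(r"Task (\S+) -- epoch (\d+) -- Dev (\w+): ([\d.]+)")

# A few lines copied from the example output above.
log = """\
07/02/2020 09:13:38 Task [ 0] updates[     1] train loss[1.25835] remaining[0:00:06]
07/02/2020 09:13:42 Task snlisample -- epoch 0 -- Dev ACC: 36.000
predicting 0
07/02/2020 09:13:47 Task snlisample -- epoch 1 -- Dev ACC: 43.000
"""


def dev_scores(text):
    """Return (task, epoch, metric, value) tuples for every dev score line."""
    scores = []
    for match in DEV_LINE.finditer(text):
        task, epoch, metric, value = match.groups()
        scores.append((task, int(epoch), metric, float(value)))
    return scores


for score in dev_scores(log):
    print(score)
```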