# Tutorial: Running Your Own Task in MT-DNN
Running your own task with MT-DNN takes 3 steps:
1. Add your task to the task definition config.
2. Prepare your task data in the correct schema.
3. Specify your task name in the args for train.py.

## Step 1 - Add your task to the task definition config
MT-DNN uses [YAML](https://en.wikipedia.org/wiki/YAML) as its config file format.
Here is an example task definition:
<pre>cola:
  data_format: PremiseOnly
  dropout_p: 0.05
  enable_san: false
  metric_meta:
  - ACC
  - MCC
  loss: CeCriterion
  n_class: 2
  task_type: Classification</pre>

We will take "mnli" as an example to explain these fields, adding them step by step.

### Add task definition for your task

<pre>mnli:
  task_type: Classification
  n_class: 3</pre> 

  Specify the task type of your task by choosing one of the following:
  1. Classification
  2. Regression
  3. Ranking
  4. Span
  5. SeqenceLabeling

  More details are in [data_utils/task_def.py](../data_utils/task_def.py).
  
Also specify the total number of classes in your task in the "n_class" field.


### Add data information for your task

<pre>mnli:
  task_type: Classification
  n_class: 3
  data_format: PremiseAndOneHypothesis 
  split_names: # optional when your sets are already named as TASK_train/TASK_dev/TASK_test
  - train
  - matched_dev
  - mismatched_dev
  - matched_test
  - mismatched_test
  labels: # optional when your labels are int or float
  - contradiction
  - neutral
  - entailment
  </pre> 
  
  Choose the data format that matches your task. Four data formats are currently supported, each corresponding to a different kind of task:
  1. "PremiseOnly": a single text, i.e. the premise. Data format is "id" \t "label" \t "premise".
  2. "PremiseAndOneHypothesis": two texts, i.e. one premise and one hypothesis. Data format is "id" \t "label" \t "premise" \t "hypothesis".
  3. "PremiseAndMultiHypothesis": one text as the premise and multiple candidate texts as hypotheses. Data format is "id" \t "label" \t "premise" \t "hypothesis_1" \t "hypothesis_2" \t ... \t "hypothesis_n".
  4. "Seqence": sequence tagging. Data format is "id" \t "label" \t "premise".
  
  More details are in [data_utils/task_def.py](../data_utils/task_def.py).
 
The code uses the suffix of a set's name ("_train", "_dev" or "_test") to determine what type of set it is. So:
  1. Make sure your train set is named "TASK_train" (replace TASK with your task name).
  2. Make sure the names of your dev and test sets end with "_dev" and "_test" respectively.
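
The naming convention above can be illustrated with a small sketch. This is not the MT-DNN implementation; `split_kind` is a hypothetical helper that only shows how a set name maps to a set type by its suffix.

```python
def split_kind(dataset_name: str) -> str:
    """Return 'train', 'dev' or 'test' based on the suffix of the set name.

    Hypothetical helper for illustration; not part of MT-DNN.
    """
    for kind in ("train", "dev", "test"):
        if dataset_name.endswith("_" + kind):
            return kind
    raise ValueError(f"cannot infer the set type from {dataset_name!r}")


# Set names built from the task name "mnli" and the split names in the config.
for name in ("mnli_train", "mnli_matched_dev", "mnli_mismatched_test"):
    print(name, "->", split_kind(name))
```

A name such as "mnli_validation" would raise a `ValueError` here, which is the kind of mismatch the naming rules are meant to prevent.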

If you prefer readable (text) labels, list the labels that occur in your data set in the "labels" field.
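
As a rough illustration of what a "labels" list provides, the sketch below maps each text label to an integer class index. It assumes that indices follow the order in which the labels are listed; that ordering is an assumption made for this example, not something this tutorial states about MT-DNN.

```python
# Labels as listed under "labels" in the mnli config.
labels = ["contradiction", "neutral", "entailment"]

# Assumption for illustration: class indices follow the listed order.
label_to_id = {label: idx for idx, label in enumerate(labels)}
id_to_label = dict(enumerate(labels))

print(label_to_id["neutral"])  # 1
print(id_to_label[2])          # entailment
```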

### Add hyper-parameters for your task

<pre>mnli:
  task_type: Classification
  n_class: 3
  data_format: PremiseAndOneHypothesis 
  split_names: # optional when your sets are already named as TASK_train/TASK_dev/TASK_test
  - train
  - matched_dev
  - mismatched_dev
  - matched_test
  - mismatched_test
  labels: # optional when your labels are int or float
  - contradiction
  - neutral
  - entailment
  dropout_p: 0.3
  enable_san: true
  </pre>

  
  Different tasks can use different dropout probabilities; set the one for your task in the "dropout_p" field.
  
  Set "enable_san" to "true" if you would like to use the Stochastic Answer Network ([SAN](https://www.aclweb.org/anthology/P18-1157.pdf)) for your task.

### Add metric and loss for your task

<pre>mnli:
  task_type: Classification
  n_class: 3
  data_format: PremiseAndOneHypothesis 
  split_names: # optional when your sets are already named as TASK_train/TASK_dev/TASK_test
  - train
  - matched_dev
  - mismatched_dev
  - matched_test
  - mismatched_test
  labels: # optional when your labels are int or float
  - contradiction
  - neutral
  - entailment
  dropout_p: 0.3
  enable_san: true
  metric_meta:
  - ACC
  loss: CeCriterion
  </pre>
  
  You can choose one or more metrics from: ACC, F1, MCC, Pearson, Spearman, AUC and SeqEval. More details are in [data_utils/metrics.py](../data_utils/metrics.py).
  
  You can choose the loss from: CeCriterion, RegCriterion, RankCeCriterion, SpanCeCriterion and SeqCeCriterion. More details are in [data_utils/loss.py](../data_utils/loss.py).
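
Before moving on, it can help to sanity-check a finished task definition against the values listed in Step 1. The sketch below does this on a plain Python dict that mirrors the mnli config. `check_task_def` is a hypothetical helper, not part of MT-DNN, and the allowed values are copied from this tutorial; the authoritative lists are in the files linked above.

```python
# Allowed values as listed in this tutorial (identifiers spelled as given there).
TASK_TYPES = {"Classification", "Regression", "Ranking", "Span", "SeqenceLabeling"}
DATA_FORMATS = {"PremiseOnly", "PremiseAndOneHypothesis",
                "PremiseAndMultiHypothesis", "Seqence"}
METRICS = {"ACC", "F1", "MCC", "Pearson", "Spearman", "AUC", "SeqEval"}
LOSSES = {"CeCriterion", "RegCriterion", "RankCeCriterion",
          "SpanCeCriterion", "SeqCeCriterion"}


def check_task_def(task_def):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if task_def.get("task_type") not in TASK_TYPES:
        problems.append(f"unknown task_type: {task_def.get('task_type')!r}")
    if task_def.get("data_format") not in DATA_FORMATS:
        problems.append(f"unknown data_format: {task_def.get('data_format')!r}")
    if task_def.get("loss") not in LOSSES:
        problems.append(f"unknown loss: {task_def.get('loss')!r}")
    for metric in task_def.get("metric_meta", []):
        if metric not in METRICS:
            problems.append(f"unknown metric: {metric!r}")
    labels = task_def.get("labels")
    if labels is not None and len(labels) != task_def.get("n_class"):
        problems.append("n_class does not match the number of labels")
    return problems


# The mnli definition from this step, written as a dict.
mnli = {
    "task_type": "Classification",
    "n_class": 3,
    "data_format": "PremiseAndOneHypothesis",
    "labels": ["contradiction", "neutral", "entailment"],
    "dropout_p": 0.3,
    "enable_san": True,
    "metric_meta": ["ACC"],
    "loss": "CeCriterion",
}
print(check_task_def(mnli))  # []
```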
  

## Step 2 - Prepare your data in the correct format

Remember the "data_format" you set in the config? Prepare your data according to the matching format below (the same formats listed in Step 1):

1. "PremiseOnly": a single text, i.e. the premise. Data format is "id" \t "label" \t "premise".
2. "PremiseAndOneHypothesis": two texts, i.e. one premise and one hypothesis. Data format is "id" \t "label" \t "premise" \t "hypothesis".
3. "PremiseAndMultiHypothesis": one text as the premise and multiple candidate texts as hypotheses. Data format is "id" \t "label" \t "premise" \t "hypothesis_1" \t "hypothesis_2" \t ... \t "hypothesis_n".
4. "Seqence": sequence tagging. Data format is "id" \t "label" \t "premise".
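
The formats above are plain tab-separated lines, so a short sketch is enough to produce them. `to_tsv_line` is a hypothetical helper, not part of MT-DNN, and the sample sentences are made up for illustration. It rejects tabs and newlines inside a field, since either would break the column layout.

```python
def to_tsv_line(sample_id, label, premise, *hypotheses):
    """Build one line: id \t label \t premise [\t hypothesis ...].

    Hypothetical helper for illustration; not part of MT-DNN.
    """
    fields = [str(sample_id), str(label), premise, *hypotheses]
    for field in fields:
        if "\t" in field or "\n" in field:
            raise ValueError("fields must not contain tabs or newlines")
    return "\t".join(fields)


# "PremiseOnly": no hypothesis column.
print(to_tsv_line(0, 1, "The book is on the table."))

# "PremiseAndOneHypothesis": exactly one hypothesis column.
print(to_tsv_line(1, "entailment",
                  "A man is playing a guitar.",
                  "A person is playing an instrument."))
```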

### Tokenize and convert to JSON

The training code reads tokenized data in JSON format. Use "prepro_std.py" to tokenize your data and convert it to JSON.


## Step 3 - Onboard your task into training!

1. Add your task definition to the overall config that holds all tasks.
2. Append your task name and test set prefixes to the train.py args: "--train_datasets EXISTING_TASKS,mnli --test_datasets EXISTING_TASK_TEST_SETS,mnli_mismatched,mnli_matched"
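
Both arguments are comma-separated lists, so appending a task is a matter of joining names. The sketch below builds the two arguments from the example above. `dataset_args` is a hypothetical helper, and using "cola" as the already configured task and test set is an assumption for illustration; only the `--train_datasets` and `--test_datasets` flags come from this tutorial.

```python
def dataset_args(existing_train, existing_test, task, test_prefixes):
    """Build the --train_datasets/--test_datasets arguments for train.py.

    Hypothetical helper for illustration; not part of MT-DNN.
    """
    train = ",".join([*existing_train, task])
    test = ",".join([*existing_test, *test_prefixes])
    return ["--train_datasets", train, "--test_datasets", test]


# Assumption: "cola" is the only task already configured.
args = dataset_args(["cola"], ["cola"], "mnli",
                    ["mnli_mismatched", "mnli_matched"])
print(" ".join(args))
```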
