## Question Paraphrasing using BART

BART is a sequence-to-sequence Transformer model.

*   BART uses a standard Transformer encoder-decoder architecture, like the original Transformer model used for neural machine translation, but also borrows ideas from BERT (which uses only the encoder) and GPT (which uses only the decoder).
*   BART is pre-trained as a denoising autoencoder: the input text is corrupted, and the model learns to reconstruct it by minimizing the cross-entropy loss between the decoder output and the original sequence.
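
To make the pre-training objective more concrete, here is a minimal sketch. It corrupts a token list by replacing a span with a single `<mask>` token (the "text infilling" noise described in the BART paper) and then scores a toy decoder output against the original sequence with cross-entropy. The function names and the toy probabilities are made up for illustration; this is not BART's actual implementation.

```python
import math

def infill_span(tokens, start, length, mask_token="<mask>"):
    """Replace tokens[start:start+length] with a single mask token."""
    return tokens[:start] + [mask_token] + tokens[start + length:]

def cross_entropy(predicted_probs, target_tokens):
    """Mean negative log-probability assigned to each target token."""
    losses = [-math.log(step[token]) for step, token in zip(predicted_probs, target_tokens)]
    return sum(losses) / len(losses)

original = ["What", "county", "is", "Raleigh", "in", "?"]
corrupted = infill_span(original, start=1, length=2)
print(corrupted)  # ['What', '<mask>', 'Raleigh', 'in', '?']

# Hypothetical decoder output: one probability table per target position,
# each giving the correct token a probability of 0.9.
toy_probs = [{token: 0.9} for token in original]
print(round(cross_entropy(toy_probs, original), 4))  # 0.1054
```

The encoder would see `corrupted`, and the loss is computed against `original`, so the model is rewarded for filling the masked span back in.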


BART paper: https://arxiv.org/pdf/1910.13461.pdf

References:
*   https://towardsdatascience.com/bart-for-paraphrasing-with-simple-transformers-7c9ea3dfdd8c
*   https://towardsdatascience.com/hyperparameter-optimization-for-optimum-transformer-models-b95a32b70949



In [None]:
pip install simpletransformers



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Data preprocessing

In [None]:
import warnings
import pandas as pd

def clean_unnecessary_spaces(out_string):
    if not isinstance(out_string, str):
        warnings.warn(f">>> {out_string} <<< is not a string.")
        out_string = str(out_string)
    out_string = (
        out_string.replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
        .replace(" ' ", "'")
        .replace(" n't", "n't")
        .replace(" 'm", "'m")
        .replace(" 's", "'s")
        .replace(" 've", "'ve")
        .replace(" 're", "'re")
    )
    return out_string
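
A quick way to see what this cleanup does is to run it on tokenized-looking text. The sketch below repeats the same replacement pairs in a compact loop so that it runs on its own; the name `clean` and the two sample sentences are only for illustration.

```python
# Same replacement pairs as clean_unnecessary_spaces, repeated here
# so that this snippet is runnable in isolation.
REPLACEMENTS = [
    (" .", "."), (" ?", "?"), (" !", "!"), (" ,", ","), (" ' ", "'"),
    (" n't", "n't"), (" 'm", "'m"), (" 's", "'s"), (" 've", "'ve"), (" 're", "'re"),
]

def clean(text):
    """Apply each replacement in order, as the chained .replace() calls do."""
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    return text

print(clean("What 's the capital of France ?"))  # What's the capital of France?
print(clean("I do n't know , sorry ."))          # I don't know, sorry.
```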

In [None]:
train_df = pd.read_csv('drive/My Drive/datasets/train.csv')
train_df = train_df.rename(
        columns={"question1": "input_text", "question2": "target_text"}
    )
train_df = train_df[["input_text", "target_text"]]
train_df["prefix"] = "paraphrase"
train_df.head()

Unnamed: 0,input_text,target_text,prefix
0,What company did Worldvision sell a portion of...,A portion of Worldvision's catalogue in 1990 w...,paraphrase
1,Approximately how many gems in Reverend Chaunc...,Approximately how many gems in Reverend Chaunc...,paraphrase
2,How long would Tesla spend gambling sometimes?,How long would Tesla spend gambling sometimes ...,paraphrase
3,How many bits are often in the primes used for...,How many bits are are typical for the primes u...,paraphrase
4,How many bits are typically used in the primes...,How many bits are frequently used for the prim...,paraphrase


In [None]:
eval_df = pd.read_csv('drive/My Drive/datasets/eval.csv')
eval_df = eval_df.rename(
        columns={"question1": "input_text", "question2": "target_text"}
    )
eval_df = eval_df[["input_text", "target_text"]]
eval_df["prefix"] = "paraphrase"
eval_df.head()

Unnamed: 0,input_text,target_text,prefix
0,"where are the harvard medical, dental and scho...","where are the harvard medical, dental, and sch...",paraphrase
1,which logo had the dw tardis insignia removed?,which logo had the dw tardis insignia for elim...,paraphrase
2,who did genghis khan unite before he began con...,who was genghis khan unite before he began con...,paraphrase
3,clergy are members of what group rather than o...,clergy are member of what group rather than of...,paraphrase
4,what does computational complexity theory most...,what does computational complexity theory most...,paraphrase


In [None]:
train_df = train_df.dropna()
eval_df = eval_df.dropna()

train_df["input_text"] = train_df["input_text"].apply(clean_unnecessary_spaces)
train_df["target_text"] = train_df["target_text"].apply(clean_unnecessary_spaces)

eval_df["input_text"] = eval_df["input_text"].apply(clean_unnecessary_spaces)
eval_df["target_text"] = eval_df["target_text"].apply(clean_unnecessary_spaces)

print(train_df)

                                             input_text  ...      prefix
0     What company did Worldvision sell a portion of...  ...  paraphrase
1     Approximately how many gems in Reverend Chaunc...  ...  paraphrase
2        How long would Tesla spend gambling sometimes?  ...  paraphrase
3     How many bits are often in the primes used for...  ...  paraphrase
4     How many bits are typically used in the primes...  ...  paraphrase
...                                                 ...  ...         ...
1057  what religion did the yuan discourage, to supp...  ...  paraphrase
1058  what did the early entrant program do for pote...  ...  paraphrase
1059         what church is organized into conferences?  ...  paraphrase
1060  what astronomers is also a university alumni m...  ...  paraphrase
1061  how long may the amazon rainforest be threaten...  ...  paraphrase

[1062 rows x 3 columns]


In [None]:
import os
from datetime import datetime
import logging
import pandas as pd
from sklearn.model_selection import train_test_split
from simpletransformers.seq2seq import Seq2SeqModel, Seq2SeqArgs

logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.ERROR)

### Hyperparameter selections

In [None]:
model_args = Seq2SeqArgs()
model_args.do_sample = True
model_args.eval_batch_size = 15  # Reduced from 64 to avoid CUDA out-of-memory errors. Reference: https://stackoverflow.com/questions/59129812/how-to-avoid-cuda-out-of-memory-in-pytorch
model_args.evaluate_during_training = True
model_args.evaluate_during_training_steps = 2500
model_args.evaluate_during_training_verbose = True
model_args.fp16 = False
model_args.learning_rate = 5e-5
model_args.max_length = 50
model_args.max_seq_length = 50
model_args.num_beams = None
model_args.num_return_sequences = 1
model_args.num_train_epochs = 2
model_args.overwrite_output_dir = True
model_args.reprocess_input_data = True
model_args.save_eval_checkpoints = False
model_args.save_steps = -1
model_args.top_k = 50
model_args.top_p = 0.95
model_args.train_batch_size = 8
model_args.use_multiprocessing = False
model_args.wandb_project = "Paraphrasing with BART"

In [None]:
model = Seq2SeqModel(
    encoder_decoder_type="bart",
    encoder_decoder_name="facebook/bart-large",
    args=model_args,
)

model.train_model(train_df, eval_data=eval_df)

INFO:simpletransformers.seq2seq.seq2seq_utils: Creating features from dataset file at cache_dir/


  0%|          | 0/1062 [00:00<?, ?it/s]

INFO:simpletransformers.seq2seq.seq2seq_model: Training started


Epoch:   0%|          | 0/2 [00:00<?, ?it/s]

0,1
Training loss,0.89816
lr,0.0
global_step,266.0
_runtime,556.0
_timestamp,1617312508.0
_step,6.0
eval_loss,0.76989
train_loss,0.70868


Running Epoch 0 of 2:   0%|          | 0/133 [00:00<?, ?it/s]

INFO:simpletransformers.seq2seq.seq2seq_model:Saving model into outputs/checkpoint-133-epoch-1
INFO:simpletransformers.seq2seq.seq2seq_utils: Creating features from dataset file at cache_dir/


  0%|          | 0/56 [00:00<?, ?it/s]

INFO:simpletransformers.seq2seq.seq2seq_model:{'eval_loss': 0.8990960121154785}
INFO:simpletransformers.seq2seq.seq2seq_model:Saving model into outputs/best_model


Running Epoch 1 of 2:   0%|          | 0/133 [00:00<?, ?it/s]

INFO:simpletransformers.seq2seq.seq2seq_model:Saving model into outputs/checkpoint-266-epoch-2
INFO:simpletransformers.seq2seq.seq2seq_utils: Creating features from dataset file at cache_dir/


  0%|          | 0/56 [00:00<?, ?it/s]

INFO:simpletransformers.seq2seq.seq2seq_model:{'eval_loss': 0.8781525790691376}
INFO:simpletransformers.seq2seq.seq2seq_model:Saving model into outputs/best_model
INFO:simpletransformers.seq2seq.seq2seq_model:Saving model into outputs/
INFO:simpletransformers.seq2seq.seq2seq_model: Training of facebook/bart-large model complete. Saved to outputs/.


(266,
 {'eval_loss': [0.8990960121154785, 0.8781525790691376],
  'global_step': [133, 266],
  'train_loss': [1.2838181257247925, 0.8490852117538452]})
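
The step counts in this log can be reproduced from the dataset size and the batch size. The sketch below assumes one optimizer step per batch (no gradient accumulation), which is an assumption rather than something the log states.

```python
import math

train_rows = 1062       # rows in train_df after dropna (from the printed shape)
train_batch_size = 8    # model_args.train_batch_size
num_train_epochs = 2    # model_args.num_train_epochs
eval_every = 2500       # model_args.evaluate_during_training_steps

# The last batch of an epoch is partial, hence the ceiling.
steps_per_epoch = math.ceil(train_rows / train_batch_size)
total_steps = steps_per_epoch * num_train_epochs
print(steps_per_epoch, total_steps)  # 133 266

# True here means the step-based evaluation interval is never reached.
print(total_steps < eval_every)  # True
```

Since 266 total steps is well below 2500, the two `eval_loss` values in the log appear to come from the evaluation at the end of each epoch rather than from the step interval.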

In [None]:
to_predict = [
    prefix + ": " + str(input_text)
    for prefix, input_text in zip(eval_df["prefix"].tolist(), eval_df["input_text"].tolist())
]
truth = eval_df["target_text"].tolist()

In [None]:
preds = model.predict(to_predict)

Generating outputs:   0%|          | 0/4 [00:00<?, ?it/s]

In [None]:
model.predict([
    "What type of competitors does the 1966 act help combat?",
    "What county is Raleigh in?",
    "What is the population of near by municipalities?",
    "Bell implemented Gray's design as a what?",
])

Generating outputs:   0%|          | 0/1 [00:00<?, ?it/s]

[' type type of competitors does the 1966 act help combat?',
 ' county county is Raleigh in?',
 ' is is the population of near by municipalities?',
 "BellBell implemented Gray's design as a what?"]

In [None]:
model.predict(["Who, in 2007, frustrated Elizabeth?"])

Generating outputs:   0%|          | 0/1 [00:00<?, ?it/s]

[',, in 2007, frustrated Elizabeth?']

In [None]:
model.predict(["In what year did the New Haven Black Panther trials take place in New Haven?"])

Generating outputs:   0%|          | 0/1 [00:00<?, ?it/s]

[' yearWhat year did the New Haven Black Panther trials take place in New Haven?']

In [None]:
model.predict(["What sea was created by the Alps?"])

Generating outputs:   0%|          | 0/1 [00:00<?, ?it/s]

['WhatWhat sea was created by the Alps?']

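The results are unsatisfactory: most outputs copy the input almost verbatim, often with a garbled or duplicated first word. I feel they can be improved by training on more data, as the current dataset is very small (1,062 training pairs).
Other datasets for paraphrasing can be found here: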


*   https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs
*   https://github.com/google-research-datasets/paws#paws-wiki

I have not tried combining these datasets, as that would require much more computation power and storage.



## Some potential problems with BART

1.   The generated sequence is almost identical to the original, with only minor differences in a word or two.
2.   Incorrect or awkward grammar.
3.   Might not perform as well on out-of-domain inputs (inputs unlike the training data).
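
The first problem can be measured roughly by comparing each generated string with its input. The sketch below uses the standard library's `difflib.SequenceMatcher` on two of the input/output pairs shown earlier. Character-level similarity is only a crude proxy chosen for illustration, and `similarity` is a hypothetical helper.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Character-level similarity in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()

# (input question, generated paraphrase) pairs taken from the predictions above.
pairs = [
    ("What county is Raleigh in?", " county county is Raleigh in?"),
    ("What sea was created by the Alps?", "WhatWhat sea was created by the Alps?"),
]

for source, generated in pairs:
    print(f"{similarity(source, generated):.2f}  {generated!r}")
```

Scores close to 1.0 indicate that the model is mostly copying its input instead of paraphrasing it.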



#### Using different hyperparameters

In [None]:
model_args = Seq2SeqArgs()
model_args.do_sample = True
model_args.eval_batch_size = 15  # Reduced from 64 to avoid CUDA out-of-memory errors. Reference: https://stackoverflow.com/questions/59129812/how-to-avoid-cuda-out-of-memory-in-pytorch
model_args.evaluate_during_training = True
model_args.evaluate_during_training_steps = 2500
model_args.evaluate_during_training_verbose = True
model_args.fp16 = False
model_args.learning_rate = 1e-5
model_args.max_length = 50
model_args.max_seq_length = 50
model_args.num_beams = None
model_args.num_return_sequences = 1
model_args.num_train_epochs = 3
model_args.overwrite_output_dir = True
model_args.reprocess_input_data = True
model_args.save_eval_checkpoints = False
model_args.save_steps = -1
model_args.top_k = 50
model_args.top_p = 0.95
model_args.train_batch_size = 8
model_args.use_multiprocessing = False
model_args.wandb_project = "Paraphrasing with BART"

In [None]:
model = Seq2SeqModel(
    encoder_decoder_type="bart",
    encoder_decoder_name="facebook/bart-large",
    args=model_args,
)

model.train_model(train_df, eval_data=eval_df)

INFO:simpletransformers.seq2seq.seq2seq_utils: Creating features from dataset file at cache_dir/


  0%|          | 0/1062 [00:00<?, ?it/s]

INFO:simpletransformers.seq2seq.seq2seq_model: Training started


Epoch:   0%|          | 0/3 [00:00<?, ?it/s]

Running Epoch 0 of 3:   0%|          | 0/133 [00:00<?, ?it/s]

INFO:simpletransformers.seq2seq.seq2seq_model:Saving model into outputs/checkpoint-133-epoch-1
INFO:simpletransformers.seq2seq.seq2seq_utils: Creating features from dataset file at cache_dir/


  0%|          | 0/56 [00:00<?, ?it/s]

INFO:simpletransformers.seq2seq.seq2seq_model:{'eval_loss': 0.8206565678119659}
INFO:simpletransformers.seq2seq.seq2seq_model:Saving model into outputs/best_model


Running Epoch 1 of 3:   0%|          | 0/133 [00:00<?, ?it/s]

INFO:simpletransformers.seq2seq.seq2seq_model:Saving model into outputs/checkpoint-266-epoch-2
INFO:simpletransformers.seq2seq.seq2seq_utils: Creating features from dataset file at cache_dir/


  0%|          | 0/56 [00:00<?, ?it/s]

INFO:simpletransformers.seq2seq.seq2seq_model:{'eval_loss': 0.7879861295223236}
INFO:simpletransformers.seq2seq.seq2seq_model:Saving model into outputs/best_model


Running Epoch 2 of 3:   0%|          | 0/133 [00:00<?, ?it/s]

INFO:simpletransformers.seq2seq.seq2seq_model:Saving model into outputs/checkpoint-399-epoch-3
INFO:simpletransformers.seq2seq.seq2seq_utils: Creating features from dataset file at cache_dir/


  0%|          | 0/56 [00:00<?, ?it/s]

INFO:simpletransformers.seq2seq.seq2seq_model:{'eval_loss': 0.7929145246744156}
INFO:simpletransformers.seq2seq.seq2seq_model:Saving model into outputs/
INFO:simpletransformers.seq2seq.seq2seq_model: Training of facebook/bart-large model complete. Saved to outputs/.


(399,
 {'eval_loss': [0.8206565678119659, 0.7879861295223236, 0.7929145246744156],
  'global_step': [133, 266, 399],
  'train_loss': [0.6722872853279114, 0.6619437336921692, 0.3720759153366089]})
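
Reading the returned history: the evaluation loss is lowest after the second epoch and rises slightly in the third, while the training loss keeps falling. That pattern may indicate the start of overfitting. A small sketch for picking the best epoch from this dictionary (the variable names are illustrative):

```python
# Training history as returned by train_model in the run above.
history = {
    "eval_loss": [0.8206565678119659, 0.7879861295223236, 0.7929145246744156],
    "global_step": [133, 266, 399],
    "train_loss": [0.6722872853279114, 0.6619437336921692, 0.3720759153366089],
}

# Index of the epoch with the lowest evaluation loss.
eval_losses = history["eval_loss"]
best_index = min(range(len(eval_losses)), key=eval_losses.__getitem__)

print(best_index + 1, history["global_step"][best_index])  # 2 266
```

This matches the log, where `outputs/best_model` is saved after epochs 1 and 2 but not after epoch 3.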

In [None]:
model.predict(["In what year did the New Haven Black Panther trials take place in New Haven?"])

Generating outputs:   0%|          | 0/1 [00:00<?, ?it/s]

['In what year did the New Haven Black Panther trials take place in New Haven?']
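
This output is character-for-character identical to the input, so the model has copied the question rather than paraphrased it. A minimal sketch for measuring how often that happens over a list of predictions follows. `copy_rate` is a hypothetical helper, not part of simpletransformers.

```python
def copy_rate(inputs, outputs):
    """Fraction of outputs equal to their input, ignoring case and outer whitespace."""
    same = sum(i.strip().lower() == o.strip().lower() for i, o in zip(inputs, outputs))
    return same / len(inputs)

inputs = ["In what year did the New Haven Black Panther trials take place in New Haven?"]
outputs = ["In what year did the New Haven Black Panther trials take place in New Haven?"]
print(copy_rate(inputs, outputs))  # 1.0
```

The same function could be applied to `to_predict` and `preds` to get a copy rate over the whole evaluation set.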

In [None]:
model.predict(["What type of competitors does the 1966 act help combat?",
               "What county is Raleigh in?","What is the population of near by municipalities?",
               "Bell implemented Gray's design as a what?"])

Generating outputs:   0%|          | 0/1 [00:00<?, ?it/s]

['What type of competitors does the 1966 act help combat?',
 'What county is Raleigh in?',
 'What is the population of near by municipalities?',
 "Bell implemented Gray's design as a what?"]