<a href="https://colab.research.google.com/github/krishnarevi/TSAI_END2.0_Session14/blob/main/part3_session14_BART.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install simpletransformers


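We will use the PAWS-Wiki labeled dataset as training data for our BART paraphrasing model. The MSRP and Quora duplicate-question datasets can optionally be added; the code for them is included below but commented out.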

We also define two helper functions: one loads a dataset and keeps only the paraphrase pairs, and one removes unnecessary spaces around punctuation and contractions in the training data.

In [2]:
import warnings

import pandas as pd


def load_data(
    file_path, input_text_column, target_text_column, label_column, keep_label=1
):
    # on_bad_lines="skip" replaces the deprecated error_bad_lines=False (pandas >= 1.3).
    df = pd.read_csv(file_path, sep="\t", on_bad_lines="skip")
    df = df.loc[df[label_column] == keep_label]
    df = df.rename(
        columns={input_text_column: "input_text", target_text_column: "target_text"}
    )
    df = df[["input_text", "target_text"]]
    df["prefix"] = "paraphrase"

    return df


def clean_unnecessary_spaces(out_string):
    if not isinstance(out_string, str):
        warnings.warn(f">>> {out_string} <<< is not a string.")
        out_string = str(out_string)
    out_string = (
        out_string.replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
        .replace(" ' ", "'")
        .replace(" n't", "n't")
        .replace(" 'm", "'m")
        .replace(" 's", "'s")
        .replace(" 've", "'ve")
        .replace(" 're", "'re")
    )
    return out_string
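
As a quick illustration of what the cleaning helper does, the following self-contained sketch applies the same chain of replacements to a tokenized sentence. The function name `clean_spaces_demo` and the sample sentence are made up for this example; the replacement pairs are copied from `clean_unnecessary_spaces` above.

```python
# Replacement pairs copied from clean_unnecessary_spaces, applied in the same order.
REPLACEMENTS = [
    (" .", "."), (" ?", "?"), (" !", "!"), (" ,", ","), (" ' ", "'"),
    (" n't", "n't"), (" 'm", "'m"), (" 's", "'s"), (" 've", "'ve"), (" 're", "'re"),
]


def clean_spaces_demo(text):
    """Hypothetical stand-in for the helper above, written as a loop."""
    for old, new in REPLACEMENTS:
        text = text.replace(old, new)
    return text


sample = "I do n't think it 's wrong , is it ?"  # made-up tokenized sentence
cleaned = clean_spaces_demo(sample)
print(cleaned)  # I don't think it's wrong, is it?
```

The replacements only re-attach punctuation and contractions that a tokenizer split off; they do not otherwise normalize whitespace.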

Next, we import the required modules and set up logging.

In [3]:
import os
from datetime import datetime
import logging

import pandas as pd
from sklearn.model_selection import train_test_split
from simpletransformers.seq2seq import Seq2SeqModel, Seq2SeqArgs



logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.ERROR)

Next, we load the dataset from Google Drive.

In [4]:
# Mount Google Drive to this Notebook instance.
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
# Google Data
train_df = pd.read_csv("/content/drive/MyDrive/TSAI_data/paws_wiki_labeled/train.tsv", sep="\t").astype(str)
eval_df = pd.read_csv("/content/drive/MyDrive/TSAI_data/paws_wiki_labeled/dev.tsv", sep="\t").astype(str)

train_df = train_df.loc[train_df["label"] == "1"]
eval_df = eval_df.loc[eval_df["label"] == "1"]

train_df = train_df.rename(
    columns={"sentence1": "input_text", "sentence2": "target_text"}
)
eval_df = eval_df.rename(
    columns={"sentence1": "input_text", "sentence2": "target_text"}
)

train_df = train_df[["input_text", "target_text"]]
eval_df = eval_df[["input_text", "target_text"]]

train_df["prefix"] = "paraphrase"
eval_df["prefix"] = "paraphrase"

# # MSRP Data
# train_df = pd.concat(
#     [
#         train_df,
#         load_data("data/msr_paraphrase_train.txt", "#1 String", "#2 String", "Quality"),
#     ]
# )
# eval_df = pd.concat(
#     [
#         eval_df,
#         load_data("data/msr_paraphrase_test.txt", "#1 String", "#2 String", "Quality"),
#     ]
# )

# # Quora Data

# # The Quora Dataset is not separated into train/test, so we do it manually the first time.
# df = load_data(
#     "data/quora_duplicate_questions.tsv", "question1", "question2", "is_duplicate"
# )
# q_train, q_test = train_test_split(df)

# q_train.to_csv("data/quora_train.tsv", sep="\t")
# q_test.to_csv("data/quora_test.tsv", sep="\t")

# # The code block above only needs to be run once.
# # After that, the two lines below are sufficient to load the Quora dataset.

# # q_train = pd.read_csv("data/quora_train.tsv", sep="\t")
# # q_test = pd.read_csv("data/quora_test.tsv", sep="\t")

# train_df = pd.concat([train_df, q_train])
# eval_df = pd.concat([eval_df, q_test])

train_df = train_df[["prefix", "input_text", "target_text"]]
eval_df = eval_df[["prefix", "input_text", "target_text"]]

train_df = train_df.dropna()
eval_df = eval_df.dropna()

train_df["input_text"] = train_df["input_text"].apply(clean_unnecessary_spaces)
train_df["target_text"] = train_df["target_text"].apply(clean_unnecessary_spaces)

eval_df["input_text"] = eval_df["input_text"].apply(clean_unnecessary_spaces)
eval_df["target_text"] = eval_df["target_text"].apply(clean_unnecessary_spaces)
train_df = train_df.reset_index(drop=True)
eval_df = eval_df.reset_index(drop=True)

print(train_df.head())
len(train_df)
len(eval_df)

INFO:numexpr.utils:NumExpr defaulting to 4 threads.


       prefix  ...                                        target_text
0  paraphrase  ...  The 1975 -- 76 season of the National Basketba...
1  paraphrase  ...  The results are high when comparable flow rate...
2  paraphrase  ...  It is the seat of the district of Zerendi in A...
3  paraphrase  ...  William Henry Harman was born in Waynesboro, V...
4  paraphrase  ...  Given a discrete set of probabilities formula ...

[5 rows x 3 columns]


3539
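
To make the preparation steps in the cell above easier to follow, here is a minimal sketch of the same filter, rename, and select sequence on a tiny in-memory table. The three rows are invented for illustration; only the column names (`sentence1`, `sentence2`, `label`) follow the files loaded above. The label is compared against the string `"1"` because the cell above casts every column with `.astype(str)`.

```python
import pandas as pd

# Invented toy rows that mimic the column layout of the TSV files.
toy = pd.DataFrame(
    {
        "id": ["1", "2", "3"],
        "sentence1": ["A is B .", "C is D .", "E is F ."],
        "sentence2": ["B is A .", "D is not C .", "F is E ."],
        "label": ["1", "0", "1"],
    }
)

# Keep only rows labeled as paraphrases, then rename and select the text columns.
pairs = toy.loc[toy["label"] == "1"].rename(
    columns={"sentence1": "input_text", "sentence2": "target_text"}
)
pairs = pairs[["input_text", "target_text"]].reset_index(drop=True)

# Add the constant prefix column as the first column.
pairs.insert(0, "prefix", "paraphrase")
print(pairs)
```

Only the two rows labeled `"1"` survive, so the model is trained on positive paraphrase pairs alone.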

Then we set up the model and its hyperparameters. We start from the pre-trained facebook/bart-large checkpoint and fine-tune it on the paraphrase pairs loaded above.
After training, we generate paraphrases for each sentence in the evaluation data.

In [6]:

model_args = Seq2SeqArgs()
model_args.eval_batch_size = 8
model_args.evaluate_during_training = True
model_args.evaluate_during_training_steps = 2500
model_args.evaluate_during_training_verbose = True
model_args.fp16 = False
model_args.learning_rate = 5e-5
model_args.max_seq_length = 50
model_args.num_train_epochs = 1
model_args.overwrite_output_dir = True
model_args.reprocess_input_data = True
model_args.save_eval_checkpoints = False
model_args.save_steps = -1
model_args.train_batch_size = 32
model_args.use_multiprocessing = False

model_args.do_sample = True
model_args.num_beams = None
model_args.num_return_sequences = 3
model_args.max_length = 64
model_args.top_k = 50
model_args.top_p = 0.95

model_args.wandb_project = "Paraphrasing with BART"


model = Seq2SeqModel(
    encoder_decoder_type="bart",
    encoder_decoder_name="facebook/bart-large",
    args=model_args,
)

model.train_model(train_df, eval_data=eval_df)


INFO:simpletransformers.seq2seq.seq2seq_utils: Creating features from dataset file at cache_dir/


  0%|          | 0/21829 [00:00<?, ?it/s]

INFO:simpletransformers.seq2seq.seq2seq_model: Training started


Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Running Epoch 0 of 1:   0%|          | 0/683 [00:00<?, ?it/s]

INFO:simpletransformers.seq2seq.seq2seq_model:Saving model into outputs/checkpoint-683-epoch-1
INFO:simpletransformers.seq2seq.seq2seq_utils: Creating features from dataset file at cache_dir/


  0%|          | 0/3539 [00:00<?, ?it/s]

INFO:simpletransformers.seq2seq.seq2seq_model:{'eval_loss': 0.48814797061695203}
INFO:simpletransformers.seq2seq.seq2seq_model:Saving model into outputs/best_model
INFO:simpletransformers.seq2seq.seq2seq_model:Saving model into outputs/
INFO:simpletransformers.seq2seq.seq2seq_model: Training of facebook/bart-large model complete. Saved to outputs/.


(683,
 {'eval_loss': [0.48814797061695203],
  'global_step': [683],
  'train_loss': [0.48659810423851013]})
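
The step counts in the log above can be checked with a little arithmetic. The sketch below assumes one optimizer step per batch (that is, no gradient accumulation, which is not set anywhere in the configuration above) and takes the example counts from the progress bars in the log.

```python
import math

train_examples = 21829  # from the "Creating features" progress bar before training
train_batch_size = 32   # model_args.train_batch_size
eval_examples = 3539    # from the evaluation progress bar
eval_batch_size = 8     # model_args.eval_batch_size

# The last, partially filled batch still counts as a step, hence the ceiling.
steps_per_epoch = math.ceil(train_examples / train_batch_size)
eval_batches = math.ceil(eval_examples / eval_batch_size)

print(steps_per_epoch)  # 683
print(eval_batches)     # 443
```

Since `evaluate_during_training_steps` is 2500 and a single epoch has only 683 steps, the step-based evaluation is never reached in this run; the one `eval_loss` value in the log is reported after the checkpoint saved at the end of the epoch.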

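The cell below generates paraphrases for every sentence in the evaluation data and saves them, together with the reference targets, to a timestamped file in predictions/. The last cell then reloads the fine-tuned model from outputs/ so that you can test it interactively on any sentence.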

In [7]:

to_predict = [
    prefix + ": " + str(input_text)
    for prefix, input_text in zip(
        eval_df["prefix"].tolist(), eval_df["input_text"].tolist()
    )
]
truth = eval_df["target_text"].tolist()

preds = model.predict(to_predict)

# Saving the predictions if needed
os.makedirs("predictions", exist_ok=True)

with open(f"predictions/predictions_{datetime.now()}.txt", "w") as f:
    for i, text in enumerate(eval_df["input_text"].tolist()):
        f.write(str(text) + "\n\n")

        f.write("Truth:\n")
        f.write(truth[i] + "\n\n")

        f.write("Prediction:\n")
        for pred in preds[i]:
            f.write(str(pred) + "\n")
        f.write(
            "________________________________________________________________________________\n"
        )

Generating outputs:   0%|          | 0/443 [00:00<?, ?it/s]

To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)
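
The cell above interpolates `datetime.now()` directly into the file name, which produces names containing a space and colons. That works on Colab, but colons are not allowed in file names on Windows. A possible alternative, sketched here with a hypothetical helper `prediction_filename`, is to format the timestamp explicitly:

```python
from datetime import datetime


def prediction_filename(now=None):
    """Hypothetical helper: build a predictions path without spaces or colons."""
    now = now or datetime.now()
    return f"predictions/predictions_{now.strftime('%Y%m%d_%H%M%S')}.txt"


# Example with a fixed, made-up timestamp so the result is reproducible.
example = prediction_filename(datetime(2021, 8, 15, 9, 30, 5))
print(example)  # predictions/predictions_20210815_093005.txt
```

Names built this way also sort chronologically in a directory listing.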


In [None]:
import logging

from simpletransformers.seq2seq import Seq2SeqModel


logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.ERROR)

model = Seq2SeqModel(
    encoder_decoder_type="bart", encoder_decoder_name="outputs"
)


while True:
    original = input("Enter text to paraphrase: ")
    to_predict = [original]

    preds = model.predict(to_predict)

    print("Original")
    print(original)

    print()
    print("Predictions >>>")
    for pred in preds[0]:
        print(pred)

    print("---------------------------------------------------------")
    print()

Enter text to paraphrase: In mathematical astronomy, his fame is due to the introduction of the astronomical globe, and his early contributions to understanding the movement of the planets.


Generating outputs:   0%|          | 0/1 [00:00<?, ?it/s]

Original
In mathematical astronomy, his fame is due to the introduction of the astronomical globe, and his early contributions to understanding the movement of the planets.

Predictions >>>
In mathematical astronomy, his fame is due to the introduction of the astronomical globe and his early contributions to understanding the movement of the planets.
In mathematical astronomy, his fame is due to the introduction of the astronomical globe and his early contributions to understanding the movement of the planets.
In mathematical astronomy, his fame is due to the introduction of the astronomical globe and his early contributions to understanding the movement of the planets.
---------------------------------------------------------

Enter text to paraphrase: Why are people obsessed with Cara Delevingne?


Generating outputs:   0%|          | 0/1 [00:00<?, ?it/s]

Original
Why are people obsessed with Cara Delevingne?

Predictions >>>
Why are people obsessed with Cara Delevingne?
Why are people obsessed with Cara Delevingne?
Why are people obsessed with Cara Delevingne?
---------------------------------------------------------

Enter text to paraphrase: Earl St Vincent was a British ship that was captured in 1803 and became a French trade man.


Generating outputs:   0%|          | 0/1 [00:00<?, ?it/s]

Original
Earl St Vincent was a British ship that was captured in 1803 and became a French trade man.

Predictions >>>
Earl St Vincent was a British ship captured in 1803 and became a French trade man.
Earl St Vincent was a British ship captured in 1803 and became a French trade man.
Earl St Vincent was a British ship that was captured in 1803 and became a French trade man.
---------------------------------------------------------

Enter text to paraphrase: Worcester is a town and county city of Worcestershire in England.


Generating outputs:   0%|          | 0/1 [00:00<?, ?it/s]

Original
Worcester is a town and county city of Worcestershire in England.

Predictions >>>
Worcester is a town and county city of Worcestershire in England.
Worcester is a town and county city of Worcestershire in England.
Worcester is a town and county city of Worcestershire in England.
---------------------------------------------------------

Enter text to paraphrase: The goal of any Deep Learning model is to take in an input and generate the correct output.


Generating outputs:   0%|          | 0/1 [00:00<?, ?it/s]

Original
The goal of any Deep Learning model is to take in an input and generate the correct output.

Predictions >>>
The goal of any deep learning model is to take in an input and generate the correct output.
The goal of any deep learning model is to take in an input and generate the correct output.
The goal of any deep learning model is to take in an input and generate the correct output.
---------------------------------------------------------

Enter text to paraphrase: A recording of folk songs done for the Columbia society in 1942 was largely arranged by Pjetër Dungu.


Generating outputs:   0%|          | 0/1 [00:00<?, ?it/s]

Original
A recording of folk songs done for the Columbia society in 1942 was largely arranged by Pjetër Dungu.

Predictions >>>
A recording of folk songs done for the Columbia Society in 1942 was largely arranged by Pjetër Dungu.
A recording of folk songs done for the Columbia Society in 1942 was largely arranged by Pjetër Dungu.
A recording of folk songs done for the Columbia Society in 1942 was largely arranged by Pjetër Dungu.
---------------------------------------------------------
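
Each input above yields three predictions because the model was configured with `do_sample = True`, `num_return_sequences = 3`, `top_k = 50`, and `top_p = 0.95`. The following sketch illustrates, on a made-up five-token distribution, what top-k followed by top-p (nucleus) filtering does conceptually before a token is sampled. It is a simplified pure-Python illustration written for this notebook, not the implementation used by the transformers library, and the function name `top_k_top_p_filter` is an assumption of this example.

```python
def top_k_top_p_filter(probs, top_k=50, top_p=0.95):
    """Return the renormalized probabilities of the tokens kept by top-k, then top-p."""
    # Sort token indices by probability, highest first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Top-k: keep only the k most probable tokens.
    order = order[:top_k]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


toy_probs = [0.5, 0.3, 0.1, 0.06, 0.04]  # made-up next-token distribution
filtered = top_k_top_p_filter(toy_probs, top_k=3, top_p=0.75)
print(filtered)
```

With these toy values only tokens 0 and 1 remain, with renormalized probabilities of about 0.625 and 0.375. When the model's distribution is sharply peaked at every step, sampling from the filtered set can return the same sequence several times.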

