# Paraphrasing


## BART (Bidirectional and Auto-Regressive Transformer)

- https://huggingface.co/facebook/bart-base


In [1]:
# imports
from transformers import BartTokenizer, BartForConditionalGeneration

# Load pre-trained BART model and tokenizer
model_name = "facebook/bart-base"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

# Set up input sentences
sentences = [
    "She was a storm, not the kind you run from, but the kind you chase.",
    "In the end, we only regret the chances we didn't take.",
    "She wasn't looking for a knight, she was looking for a sword.",
    "I dreamt I am running on sand in the night",
]

# Paraphrase the sentences
for sentence in sentences:
    # Tokenize the input sentence
    input_ids = tokenizer.encode(sentence, return_tensors="pt")

    # Generate paraphrased sentence
    paraphrase_ids = model.generate(
        input_ids, num_beams=5, max_length=100, early_stopping=True
    )

    # Decode and print the paraphrased sentence
    paraphrase = tokenizer.decode(paraphrase_ids[0], skip_special_tokens=True)
    print(f"Original: {sentence}")
    print(f"Paraphrase: {paraphrase}")
    print()

Original: She was a storm, not the kind you run from, but the kind you chase.
Paraphrase: She was a storm, not the kind you run from, but the kind that you chase.

Original: In the end, we only regret the chances we didn't take.
Paraphrase: In the end, we only regret the chances we didn't take.

Original: She wasn't looking for a knight, she was looking for a sword.
Paraphrase: She wasn't looking at a knight, she was looking for a sword.

Original: I dreamt I am running on sand in the night
Paraphrase: I dreamt I am running on sand in the night
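
In the output above, the BART base checkpoint returns each sentence almost unchanged, which is what one would expect from a model that has not been fine-tuned for paraphrasing. One rough way to put a number on this is to compare each original with its output at the word level. The sketch below uses Python's standard `difflib.SequenceMatcher`; the helper name `word_similarity` is made up for illustration, and the ratio only measures surface overlap, not whether the meaning was preserved.

```python
from difflib import SequenceMatcher


def word_similarity(original: str, paraphrase: str) -> float:
    """Return a 0..1 similarity ratio over the lowercase word sequences."""
    a = original.lower().split()
    b = paraphrase.lower().split()
    return SequenceMatcher(None, a, b).ratio()


# Two (original, output) pairs copied from the BART run above
pairs = [
    (
        "In the end, we only regret the chances we didn't take.",
        "In the end, we only regret the chances we didn't take.",
    ),
    (
        "She wasn't looking for a knight, she was looking for a sword.",
        "She wasn't looking at a knight, she was looking for a sword.",
    ),
]

for original, paraphrase in pairs:
    score = word_similarity(original, paraphrase)
    print(f"{score:.2f}  {paraphrase}")
```

A score of 1.0 means the model simply copied the input; values close to 1.0 indicate only a word or two changed.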



## T5 (Text-To-Text Transfer Transformer)

- https://huggingface.co/t5-base
- https://huggingface.co/docs/transformers/model_doc/t5


In [1]:
# imports
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load pre-trained T5 Large model and tokenizer
tokenizer = T5Tokenizer.from_pretrained("t5-large", model_max_length=1024)
model = T5ForConditionalGeneration.from_pretrained("t5-large")

# Set up input sentences
sentences = [
    "She was a storm, not the kind you run from, but the kind you chase.",
    "In the end, we only regret the chances we didn't take.",
    "She wasn't looking for a knight, she was looking for a sword.",
    "I dreamt I am running on sand in the night",
]

# Paraphrase the sentences
for sentence in sentences:
    # Tokenize the input sentence (note: no task prefix is added to the input)
    input_ids = tokenizer.encode(sentence, return_tensors="pt")

    # Generate paraphrased sentence
    paraphrase_ids = model.generate(
        input_ids, num_beams=5, max_length=100, early_stopping=True
    )

    # Decode and print the paraphrased sentence
    paraphrase = tokenizer.decode(paraphrase_ids[0], skip_special_tokens=True)
    print(f"Original: {sentence}")
    print(f"Paraphrase: {paraphrase}")
    print()

Original: She was a storm, not the kind you run from, but the kind you chase.
Paraphrase: She was a storm. She was a storm. She was a storm. She was a storm. She was a storm. She was a storm. She was a storm. She was a storm. She was a storm. She was a storm. She was a storm. She was a storm. She was a storm.....

Original: In the end, we only regret the chances we didn't take.
Paraphrase: take. We only regret the chances we didn't take.. We only regret the chances we didn't take.... only regret the chances we didn't take....... We only regret the chances we didn't take...... only regret the chances we didn't take. . . ...

Original: She wasn't looking for a knight, she was looking for a sword.
Paraphrase: a knight, she wasn't looking for a sword.. a sword. a sword.a sword. a sword. She wasn't looking for a knight, she wasn't looking for a sword. a sword...... She was looking for a sword........ a sword.

Original: I dreamt I am running on sand in the night
Paraphrase: I am running on 
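
The T5 outputs above loop on the same phrase. That is not very surprising, since the checkpoint is called without a task prefix and was not trained to generate paraphrases. A simple way to flag such degenerate generations automatically is to measure what fraction of word n-grams repeat an earlier n-gram. The sketch below is a plain-Python heuristic; the function name `repeated_ngram_fraction` and the choice of trigrams are illustrative assumptions, not part of the Transformers library.

```python
from collections import Counter


def repeated_ngram_fraction(text: str, n: int = 3) -> float:
    """Fraction of word n-grams that duplicate an earlier n-gram in the text."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repeat
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(ngrams)


looping = "She was a storm. " * 5
normal = "She was looking for a sword, not a knight."

print(f"looping: {repeated_ngram_fraction(looping):.2f}")
print(f"normal:  {repeated_ngram_fraction(normal):.2f}")
```

A value near zero suggests ordinary text, while a high value suggests the decoder is stuck in a loop.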

## Pegasus Paraphrase


In [2]:
# imports
from transformers import PegasusTokenizer, PegasusForConditionalGeneration

# load pre-trained Pegasus Paraphrase model and tokenizer
tokenizer = PegasusTokenizer.from_pretrained("tuner007/pegasus_paraphrase")
model = PegasusForConditionalGeneration.from_pretrained("tuner007/pegasus_paraphrase")

# input sentences
sentences = [
    "She was a storm, not the kind you run from, but the kind you chase.",
    "She wasn't looking for a knight, she was looking for a sword.",
    "In the end, we only regret the chances we didn't take.",
    "I dreamt I am running on sand in the night",
    "Long long ago, there lived a king and a queen. For a long time, they had no children.",
    "I am typing the best article on paraphrasing with Transformers.",
]

# Paraphrase the sentences
for sentence in sentences:
    # Tokenize the input sentence
    input_ids = tokenizer.encode(sentence, return_tensors="pt")

    # Generate paraphrased sentence
    paraphrase_ids = model.generate(
        input_ids, num_beams=5, max_length=100, early_stopping=True
    )

    # Decode and print the paraphrased sentence
    paraphrase = tokenizer.decode(paraphrase_ids[0], skip_special_tokens=True)
    print(f"Original: {sentence}")
    print(f"Paraphrase: {paraphrase}")
    print()

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at tuner007/pegasus_paraphrase and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Original: She was a storm, not the kind you run from, but the kind you chase.
Paraphrase: She was a storm, not the kind you run from, but the kind you chase.

Original: She wasn't looking for a knight, she was looking for a sword.
Paraphrase: She was looking for a sword, not a knight.

Original: In the end, we only regret the chances we didn't take.
Paraphrase: We regret the chances we didn't take.

Original: I dreamt I am running on sand in the night
Paraphrase: I ran on the sand in the night.

Original: Long long ago, there lived a king and a queen. For a long time, they had no children.
Paraphrase: They had no children for a long time.

Original: I am typing the best article on paraphrasing with Transformers.
Paraphrase: I am writing the best article on the subject.



In [4]:
# imports
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

# Load the Pegasus Paraphrase model and tokenizer
model_name = "tuner007/pegasus_paraphrase"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)


# Paraphrase a long text one sentence at a time, then join the results
def paraphrase_paragraph(text):

    # Split the text into sentences
    sentences = text.split(".")
    paraphrases = []

    for sentence in sentences:
        # Clean up sentences

        # remove extra whitespace
        sentence = sentence.strip()

        # filter out empty sentences
        if len(sentence) == 0:
            continue

        # Tokenize the sentence
        inputs = tokenizer.encode_plus(
            sentence, return_tensors="pt", truncation=True, max_length=512
        )

        input_ids = inputs["input_ids"]
        attention_mask = inputs["attention_mask"]

        # paraphrase
        paraphrase = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            num_beams=4,
            max_length=100,
            early_stopping=True,
        )[0]
        paraphrased_text = tokenizer.decode(paraphrase, skip_special_tokens=True)

        paraphrases.append(paraphrased_text)

    # Combine the paraphrases
    combined_paraphrase = " ".join(paraphrases)

    return combined_paraphrase


# Example usage
text = "As Sir Henry and I sat at breakfast, the sunlight flooded in through the high mullioned windows, throwing watery patches of color from the coats of arms which covered them. The dark panelling glowed like bronze in the golden rays, and it was hard to realize that this was indeed the chamber which had struck such a gloom into our souls upon the evening before. But the evening before, Sir Henry's nerves were still handled the stimulant of suspense, and he came to breakfast, his cheeks flushed in the exhilaration of the early chase."
paraphrase = paraphrase_paragraph(text)
print(paraphrase)

As Sir Henry and I sat at breakfast, the sunlight flooded in through the high windows, causing watery patches of color from the coats of arms. The dark panelling glowed like bronze in the golden rays, and it was hard to see that it was the chamber which had struck such a gloom into our souls the evening before. The evening before, Sir Henry's nerves were still handled and he came to breakfast, his cheeks flushed from the excitement of the early chase.
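
`paraphrase_paragraph` splits on every period, which works for the passage above but would also break on decimals such as "3.5". As a minimal sketch of a slightly more careful alternative, the regular expression below splits only where sentence-ending punctuation is followed by whitespace and an uppercase letter. The helper name `split_sentences` is hypothetical, and an abbreviation followed by a capitalised word ("Dr. Watson") would still be split, so this is a heuristic rather than a full sentence tokenizer.

```python
import re

# Split after ., ! or ? when followed by whitespace and an uppercase letter
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text: str) -> list:
    """Split text into sentences, keeping the closing punctuation."""
    parts = _SENTENCE_BOUNDARY.split(text.strip())
    return [part for part in parts if part]


sample = (
    "The room cost 3.5 pounds a night. It was hard to believe! "
    "Was it the same chamber?"
)
for sentence in split_sentences(sample):
    print(sentence)
```

Because the punctuation stays attached to each sentence, the pieces can be passed to the model and joined back with spaces without losing sentence endings.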


In [5]:
# importing the PEGASUS Transformer model
import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model_name = "tuner007/pegasus_paraphrase"
torch_device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(torch_device)


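# Generate `num_return_sequences` paraphrase candidates for one input text
def get_response(input_text, num_return_sequences):
    # NOTE: prepare_seq2seq_batch is deprecated; the library recommends
    # calling the tokenizer directly (its regular __call__ method) instead
    batch = tokenizer.prepare_seq2seq_batch(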
        [input_text],
        truncation=True,
        padding="longest",
        max_length=60,
        return_tensors="pt",
    ).to(torch_device)
    translated = model.generate(
        **batch,
        max_length=60,
        num_beams=10,
        num_return_sequences=num_return_sequences,
        temperature=1.5
    )
    tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
    return tgt_text

In [6]:
# test input sentence
text = "I will be showing you how to build a web application in Python using the SweetViz and its dependent library."

# printing response
get_response(text, 5)

`prepare_seq2seq_batch` is deprecated and will be removed in version 5 of HuggingFace Transformers. Use the regular
`__call__` method to prepare your inputs and targets.

Here is a short example:

model_inputs = tokenizer(src_texts, text_target=tgt_texts, ...)

If you either need to use different keyword arguments for the source and target texts, you should do two calls like
this:

model_inputs = tokenizer(src_texts, ...)
labels = tokenizer(text_target=tgt_texts, ...)
model_inputs["labels"] = labels["input_ids"]

See the documentation of your specific tokenizer for more details on the specific arguments to the tokenizer of choice.
For a more complete example, see the implementation of `prepare_seq2seq_batch`.



['I will show you how to use the SweetViz and its dependent library to build a web application.',
 'I will show you how to use the SweetViz library to build a web application.',
 'I will show you how to build a web application using the SweetViz and its dependent library.',
 'I will show you how to use the SweetViz and its dependent library to build a web application in Python.',
 'I will show you how to build a web application in Python using the SweetViz library.']
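
`get_response` returns several candidates, so a follow-up step is to choose one. As a hedged sketch, the helper below picks the candidate whose vocabulary overlaps least with the original, on the assumption that the goal is the most reworded output. The names `jaccard_overlap` and `most_different` are hypothetical, and word overlap says nothing about whether the meaning survived, so the result still needs a human check.

```python
def _words(text: str) -> set:
    """Lowercase word set with surrounding punctuation stripped."""
    return {w.strip(".,!?;:") for w in text.lower().split()} - {""}


def jaccard_overlap(a: str, b: str) -> float:
    """Word-set overlap between two strings (1.0 means same vocabulary)."""
    words_a, words_b = _words(a), _words(b)
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def most_different(original: str, candidates: list) -> str:
    """Pick the candidate whose vocabulary overlaps least with the original."""
    return min(candidates, key=lambda c: jaccard_overlap(original, c))


original = (
    "I will be showing you how to build a web application in Python "
    "using the SweetViz and its dependent library."
)
# Two of the candidates printed above
candidates = [
    "I will show you how to use the SweetViz library to build a web application.",
    "I will show you how to build a web application in Python using the SweetViz library.",
]
print(most_different(original, candidates))
```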

In [8]:
# Paragraph of text
context = "I will be showing you how to build a web application in Python using the SweetViz and its dependent library. Data science combines multiple fields, including statistics, scientific methods, artificial intelligence (AI), and data analysis, to extract value from data. Those who practice data science are called data scientists, and they combine a range of skills to analyze data collected from the web, smartphones, customers, sensors, and other sources to derive actionable insights."

# Split the input paragraph into a list of sentences
# (requires the sentence-splitter package: pip install sentence-splitter)
from sentence_splitter import SentenceSplitter

splitter = SentenceSplitter(language="en")

sentence_list = splitter.split(context)
sentence_list

['I will be showing you how to build a web application in Python using the SweetViz and its dependent library.',
 'Data science combines multiple fields, including statistics, scientific methods, artificial intelligence (AI), and data analysis, to extract value from data.',
 'Those who practice data science are called data scientists, and they combine a range of skills to analyze data collected from the web, smartphones, customers, sensors, and other sources to derive actionable insights.']

In [7]:
!pip install sentence-splitter

Collecting sentence-splitter
  Downloading sentence_splitter-1.4-py2.py3-none-any.whl.metadata (2.8 kB)
Downloading sentence_splitter-1.4-py2.py3-none-any.whl (44 kB)
Installing collected packages: sentence-splitter
Successfully installed sentence-splitter-1.4


In [3]:
from sentence_splitter import SentenceSplitter
import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer

model_name = "tuner007/pegasus_paraphrase"
torch_device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(torch_device)


def get_response(input_text, num_return_sequences):
    batch = tokenizer.prepare_seq2seq_batch(
        [input_text],
        truncation=True,
        padding="longest",
        max_length=60,
        return_tensors="pt",
    ).to(torch_device)
    translated = model.generate(
        **batch,
        max_length=60,
        num_beams=10,
        num_return_sequences=num_return_sequences,
        temperature=1.5
    )
    tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
    return tgt_text


# context = input("Enter Paragraph to be Paraphrased: ")
# print(context)

context = "Born in England in 1996, Tom Holland joined the London production of Billy Elliot the Musical in 2008. He soon found success in film, drawing strong reviews for his performance in The Impossible (2012). Tapped to take over the iconic role of Peter Parker/Spider-Man for the big screen, Holland made his debut as the superhero in Captain America: Civil War (2016), before earning the chance to carry his own feature with Spider-Man: Homecoming (2017)."

splitter = SentenceSplitter(language="en")

sentence_list = splitter.split(context)

# Paraphrase each sentence, requesting a single candidate per sentence
paraphrase = []

for sentence in sentence_list:
    candidates = get_response(sentence, 1)
    paraphrase.append(candidates)

# Flatten each one-element candidate list into a plain string
paraphrase2 = [" ".join(x) for x in paraphrase]

# Join the paraphrased sentences back into a single paragraph
paraphrased_text = " ".join(paraphrase2)

print("Paragraph before Paraphrased \n" + context)
print("\n ------------------------------------------- \n")
print("Paragraph after Paraphrased \n" + paraphrased_text)

Paragraph before Paraphrased 
Born in England in 1996, Tom Holland joined the London production of Billy Elliot the Musical in 2008. He soon found success in film, drawing strong reviews for his performance in The Impossible (2012). Tapped to take over the iconic role of Peter Parker/Spider-Man for the big screen, Holland made his debut as the superhero in Captain America: Civil War (2016), before earning the chance to carry his own feature with Spider-Man: Homecoming (2017).

 ------------------------------------------- 

Paragraph after Paraphrased 
In 2008 Tom Holland joined the London production of Billy Elliot the Musical. He got good reviews for his performance in The Impossible. After taking over the role of Spider-Man for the big screen, Holland made his debut as the superhero in Captain America: Civil War, before earning the chance to carry his own feature with Spider-Man: Homecoming.
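
Comparing the two paragraphs above shows that the paraphrase dropped several details, including the birth year and the film release years. A quick, admittedly crude, way to catch this kind of loss is to check which numbers from the original no longer appear in the output. The helper name `missing_numbers` is made up for this sketch, and it only looks at digit sequences, not at names or other facts.

```python
import re


def missing_numbers(original: str, paraphrase: str) -> list:
    """Return numbers that appear in the original but not in the paraphrase."""
    found = set(re.findall(r"\d+", paraphrase))
    missing = []
    for number in re.findall(r"\d+", original):
        if number not in found and number not in missing:
            missing.append(number)
    return missing


# First sentence of the paragraph above and its paraphrase
original = (
    "Born in England in 1996, Tom Holland joined the London production "
    "of Billy Elliot the Musical in 2008."
)
paraphrase = (
    "In 2008 Tom Holland joined the London production of Billy Elliot the Musical."
)
print(missing_numbers(original, paraphrase))  # ['1996']
```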

