**Installing the required packages and importing libraries**

In [None]:
from google.colab import drive
drive.mount('/content/drive')
!pip install -q simpletransformers rouge_score datasets evaluate torch accelerate tqdm nltk
import numpy as np
import pandas as pd
import datasets, nltk, torch, evaluate, warnings
nltk.download("punkt")
from datasets import Dataset, DatasetDict
from nltk.tokenize import sent_tokenize
from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, get_scheduler, pipeline, AutoTokenizer
from torch.utils.data import DataLoader
from torch.optim import AdamW
from accelerate import Accelerator
from tqdm.auto import tqdm
warnings.filterwarnings("ignore")

Mounted at /content/drive

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


**Reading the dataset**

In [None]:
df = pd.read_csv('/content/drive/MyDrive/text_summarization_dataset.csv')
df.head(5)

Unnamed: 0,headline,title,text
0,"\nMuddle the mint leaves, brown sugar and lime...",How to Make a Mojito Diablo,"Use a muddler, a pestle or the back of a spoo..."
1,"\nBuy resurrecting wings from the shop.,\nUse ...",How to Resurrect in Temple Run,You'll need 500 coins collected from your run...
2,"\nRinse your hands in vinegar.,\nMake a paste ...",How to Get a Bad Smell off Your Hands6,Vinegar is good for removing smells such as f...
3,\nApply a small amount of cleaning or metal po...,How to Remove a Scratch on Glass Cooktops2,";\n,\n\n\nThis procedure will test your cleani..."
4,"\nFind your birth animal.,\nRead about your zo...",How to Read Your Chinese Horoscope,Consult the chart below to find the year of y...


**Checking the shape of the dataset**

In [None]:
df.shape

(5000, 3)

**Checking for missing values**

In [None]:
df.isnull().sum()

headline    18
title        0
text        22
dtype: int64

**Removing missing values and resetting index**

In [None]:
df = df.dropna()
df = df.reset_index(drop=True)
df.head(3)

Unnamed: 0,headline,title,text
0,"\nMuddle the mint leaves, brown sugar and lime...",How to Make a Mojito Diablo,"Use a muddler, a pestle or the back of a spoo..."
1,"\nBuy resurrecting wings from the shop.,\nUse ...",How to Resurrect in Temple Run,You'll need 500 coins collected from your run...
2,"\nRinse your hands in vinegar.,\nMake a paste ...",How to Get a Bad Smell off Your Hands6,Vinegar is good for removing smells such as f...


**Checking for and removing duplicates**

In [None]:
print(df.shape)
df = df.drop_duplicates()
print(df.shape)

(4978, 3)
(4978, 3)


**Selecting a reasonable dataset size for model training**

In [None]:
# .copy() avoids pandas' SettingWithCopyWarning when columns are modified later
df = df.iloc[:3000, :].copy()
df.head(3)

Unnamed: 0,headline,title,text
0,"\nMuddle the mint leaves, brown sugar and lime...",How to Make a Mojito Diablo,"Use a muddler, a pestle or the back of a spoo..."
1,"\nBuy resurrecting wings from the shop.,\nUse ...",How to Resurrect in Temple Run,You'll need 500 coins collected from your run...
2,"\nRinse your hands in vinegar.,\nMake a paste ...",How to Get a Bad Smell off Your Hands6,Vinegar is good for removing smells such as f...


**Cleaning the title column**

In [None]:
# Remove digits from the 'title' column (raw string so that \d is not treated as a string escape)
df['title'] = df['title'].str.replace(r'\d+', '', regex=True)
# Append a colon to the end of each title
df['title'] = df['title'].apply(lambda x: x + ':')

**Cleaning text and headline columns**

In [None]:
# Remove numbers and special characters (commas included), keeping only letters, whitespace, full stops and apostrophes
df['headline'] = df['headline'].str.replace(r'[^a-zA-Z\s\'.]', '', regex=True).replace('\n', ' ', regex=True).replace('  ', ' ', regex=True)
df['text'] = df['text'].str.replace(r'[^a-zA-Z\s\'.]', '', regex=True).replace('\n', ' ', regex=True).replace('  ', ' ', regex=True)
df.head(2)

Unnamed: 0,headline,title,text
0,Muddle the mint leaves brown sugar and lime j...,How to Make a Mojito Diablo:,Use a muddler a pestle or the back of a spoon...
1,Buy resurrecting wings from the shop. Use the...,How to Resurrect in Temple Run:,You'll need coins collected from your runs. T...
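The effect of the three chained replacements is easier to see on a single string. The following is a minimal sketch using only the standard `re` module; the helper name `clean_text` and the sample string are made up for illustration and are not part of the notebook.

```python
import re

def clean_text(raw: str) -> str:
    """Mimic the three chained replacements applied to 'headline' and 'text'."""
    # 1. Drop everything that is not a letter, whitespace, an apostrophe or a full stop
    kept = re.sub(r"[^a-zA-Z\s'.]", "", raw)
    # 2. Turn newlines into spaces
    no_newlines = re.sub("\n", " ", kept)
    # 3. Replace each pair of spaces with one space (single pass, so longer runs only shrink)
    return re.sub("  ", " ", no_newlines)

sample = "\nBuy 2 wings, now!\nUse them."  # hypothetical input
print(repr(clean_text(sample)))
```

Note that the leading newline becomes a leading space, and that a run of four spaces would only shrink to two, because the last replacement runs once.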


**Building a longer summary by concatenating the title and headline columns**

In [None]:
df['summary'] = df['title'] + ' ' + df['headline']
df.head(3)

Unnamed: 0,headline,title,text,summary
0,Muddle the mint leaves brown sugar and lime j...,How to Make a Mojito Diablo:,Use a muddler a pestle or the back of a spoon...,How to Make a Mojito Diablo: Muddle the mint ...
1,Buy resurrecting wings from the shop. Use the...,How to Resurrect in Temple Run:,You'll need coins collected from your runs. T...,How to Resurrect in Temple Run: Buy resurrect...
2,Rinse your hands in vinegar. Make a paste of ...,How to Get a Bad Smell off Your Hands:,Vinegar is good for removing smells such as f...,How to Get a Bad Smell off Your Hands: Rinse ...


**Dropping the headline and title columns**

In [None]:
df = df.iloc[:, 2:]
df.head(2)

Unnamed: 0,text,summary
0,Use a muddler a pestle or the back of a spoon...,How to Make a Mojito Diablo: Muddle the mint ...
1,You'll need coins collected from your runs. T...,How to Resurrect in Temple Run: Buy resurrect...


**Making train, test and validation splits**

In [None]:
from sklearn.model_selection import train_test_split
train_old, test = train_test_split(df, test_size = 0.2, random_state = 1)
train, val = train_test_split(train_old, test_size = 0.2, random_state = 1)
print (train.shape, test.shape, val.shape)

(1920, 2) (600, 2) (480, 2)
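The two successive 80/20 splits give roughly a 64/20/16 train/test/validation division. The arithmetic can be sketched as below, assuming the held-out share is rounded up and the remainder goes to the training side, which is how `train_test_split` handles a float `test_size` as far as I know; `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

def split_sizes(n_rows: int, test_size: float = 0.2) -> tuple[int, int, int]:
    """Return (train, test, val) row counts for two chained splits."""
    # First split: hold out the test set from the full data
    n_test = math.ceil(n_rows * test_size)
    n_train_old = n_rows - n_test
    # Second split: hold out the validation set from the remaining rows
    n_val = math.ceil(n_train_old * test_size)
    n_train = n_train_old - n_val
    return n_train, n_test, n_val

print(split_sizes(3000))  # train, test, val
```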


**Converting dataset to arrow format for faster training and removing newly made index column**

In [None]:
train = Dataset.from_pandas(train)
test = Dataset.from_pandas(test)
val = Dataset.from_pandas(val)

dataset = DatasetDict()

dataset['train'] = train
dataset['test'] = test
dataset['val'] = val

dataset = dataset.remove_columns(["__index_level_0__"])
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'summary'],
        num_rows: 1920
    })
    test: Dataset({
        features: ['text', 'summary'],
        num_rows: 600
    })
    val: Dataset({
        features: ['text', 'summary'],
        num_rows: 480
    })
})

**Filtering the dataset to keep examples whose summaries are longer than two words**

In [None]:
dataset = dataset.filter(lambda x: len(x["summary"].split()) > 2)

Filter:   0%|          | 0/1920 [00:00<?, ? examples/s]

Filter:   0%|          | 0/600 [00:00<?, ? examples/s]

Filter:   0%|          | 0/480 [00:00<?, ? examples/s]

**Defining model checkpoint and initializing the tokenizer**

In [None]:
model_checkpoint = "facebook/bart-large"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.63k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

**Defining tokenization function and tokenizing the dataset**

In [None]:
max_input_length = 1024
max_target_length = 100


def preprocess_function(examples):
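    # Tokenize the source texts, truncating to the model's maximum input length
    model_inputs = tokenizer(examples["text"], max_length=max_input_length, truncation=True)
    # Tokenize the target summaries, truncating to a shorter maximum length
    labels = tokenizer(examples["summary"], max_length=max_target_length, truncation=True)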
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_datasets = dataset.map(preprocess_function, batched=True)

Map:   0%|          | 0/1920 [00:00<?, ? examples/s]

Map:   0%|          | 0/600 [00:00<?, ? examples/s]

Map:   0%|          | 0/480 [00:00<?, ? examples/s]

**Loading the ROUGE metric and computing scores for a sample pair**

In [None]:
rouge_score = evaluate.load("rouge")

generated_summary = "I absolutely loved reading the Hunger Games"
reference_summary = "I loved reading the Hunger Games"

scores = rouge_score.compute(
    predictions=[generated_summary], references=[reference_summary])
scores

Downloading builder script:   0%|          | 0.00/6.27k [00:00<?, ?B/s]

{'rouge1': 0.923076923076923,
 'rouge2': 0.7272727272727272,
 'rougeL': 0.923076923076923,
 'rougeLsum': 0.923076923076923}
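The `rouge1` and `rouge2` values above can be reproduced by hand as an F1 score over overlapping unigrams and bigrams. The sketch below is a simplification that assumes lowercase whitespace tokenization and no stemming; the real `rouge_score` tokenizer is more involved, and `ngram_f1` is a hypothetical helper rather than part of the `evaluate` API.

```python
from collections import Counter

def ngram_f1(prediction: str, reference: str, n: int) -> float:
    """F1 over clipped n-gram overlap between a prediction and a reference."""
    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    pred, ref = ngrams(prediction), ngrams(reference)
    # Counter intersection keeps the minimum count of each shared n-gram
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

generated = "I absolutely loved reading the Hunger Games"
reference = "I loved reading the Hunger Games"
print(ngram_f1(generated, reference, 1))  # 12/13, matching rouge1
print(ngram_f1(generated, reference, 2))  # 16/22, matching rouge2
```

For unigrams, 6 of the 7 generated tokens appear in the 6-token reference, so precision is 6/7 and recall is 1; for bigrams, 4 of 6 generated bigrams match the 5 reference bigrams.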

**Removing the original text columns from the tokenized dataset, extracting sample features and converting the tokenized dataset to torch format**

In [None]:
tokenized_datasets = tokenized_datasets.remove_columns(dataset["train"].column_names)
features = [tokenized_datasets["train"][i] for i in range(2)]
tokenized_datasets.set_format("torch")

**Defining data postprocessing function**

In [None]:
def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    # ROUGE expects a newline after each sentence
    preds = ["\n".join(nltk.sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(nltk.sent_tokenize(label)) for label in labels]

    return preds, labels

**Initializing the tokenizer, data collator, model, optimizer and accelerator**

In [None]:
model_checkpoint = "facebook/bart-large"
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
optimizer = AdamW(model.parameters(), lr=2e-5)
accelerator = Accelerator()
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

pytorch_model.bin:   0%|          | 0.00/1.02G [00:00<?, ?B/s]

**Preparing train and evaluation data loaders**

In [None]:
batch_size = 2
train_dataloader = DataLoader(tokenized_datasets["train"], shuffle=True, collate_fn=data_collator,
    batch_size=batch_size)

eval_dataloader = DataLoader(tokenized_datasets["val"], collate_fn=data_collator, batch_size=batch_size)

model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader)

**Defining training arguments**

In [None]:
num_train_epochs = 10
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

# get_scheduler only configures the learning-rate schedule; generation options such as
# num_beams or early_stopping are not scheduler arguments and belong to model.generate()
lr_scheduler = get_scheduler("linear", optimizer=optimizer, num_warmup_steps=0,
    num_training_steps=num_training_steps)
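With a "linear" schedule and zero warmup steps, the learning rate falls in a straight line from its initial value to zero over the whole run. The sketch below illustrates that rule in plain Python, assuming the usual linear-with-warmup definition; `linear_lr` is a hypothetical helper and not the library implementation. The step count of 9600 comes from 10 epochs of 960 batches (1920 training rows at batch size 2).

```python
def linear_lr(step: int, base_lr: float, num_training_steps: int,
              num_warmup_steps: int = 0) -> float:
    """Learning rate at a given optimizer step under linear warmup and decay."""
    if step < num_warmup_steps:
        # Warmup phase: ramp up linearly from 0 to base_lr
        return base_lr * step / max(1, num_warmup_steps)
    # Decay phase: shrink linearly so the rate reaches 0 at the final step
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

total_steps = 10 * 960  # epochs * batches per epoch
for step in (0, 2400, 4800, 9600):
    print(step, linear_lr(step, 2e-5, total_steps))
```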

**Defining the name and output directory for trained model**

In [None]:
model_name = "text_summarization_accelerate_own"
output_dir = "/content/drive/MyDrive/text summarization model/"

**Model training**

In [None]:
progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for step, batch in enumerate(train_dataloader):
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            generated_tokens = accelerator.unwrap_model(model).generate(
                batch["input_ids"],
                attention_mask=batch["attention_mask"],
            )

            generated_tokens = accelerator.pad_across_processes(
                generated_tokens, dim=1, pad_index=tokenizer.pad_token_id
            )
            labels = batch["labels"]

            # If we did not pad to max length, we need to pad the labels too
            labels = accelerator.pad_across_processes(
                batch["labels"], dim=1, pad_index=tokenizer.pad_token_id
            )

            generated_tokens = accelerator.gather(generated_tokens).cpu().numpy()
            labels = accelerator.gather(labels).cpu().numpy()

            # Replace -100 in the labels as we can't decode them
            labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
            if isinstance(generated_tokens, tuple):
                generated_tokens = generated_tokens[0]
            decoded_preds = tokenizer.batch_decode(
                generated_tokens, skip_special_tokens=True
            )
            decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

            decoded_preds, decoded_labels = postprocess_text(
                decoded_preds, decoded_labels
            )

            rouge_score.add_batch(predictions=decoded_preds, references=decoded_labels)

    # Compute metrics
    result = rouge_score.compute()
    # Scale the aggregated ROUGE scores to percentages and round them
    result = {key: value * 100 for key, value in result.items()}
    result = {k: round(v, 4) for k, v in result.items()}
    print(f"Epoch {epoch}:", result)

    # Save the model and tokenizer to output_dir
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)

  0%|          | 0/9600 [00:00<?, ?it/s]

Non-default generation parameters: {'early_stopping': True, 'num_beams': 4, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}


Epoch 0: {'rouge1': 27.9829, 'rouge2': 13.0631, 'rougeL': 24.9562, 'rougeLsum': 27.1676}


Non-default generation parameters: {'early_stopping': True, 'num_beams': 4, 'no_repeat_ngram_size': 3, 'forced_bos_token_id': 0, 'forced_eos_token_id': 2}


Epoch 1: {'rouge1': 26.7131, 'rouge2': 12.5356, 'rougeL': 23.8072, 'rougeLsum': 25.8921}
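The evaluation loop replaces `-100` in the labels before decoding because the collator pads label sequences with `-100`, a value the loss ignores but the tokenizer cannot decode. A minimal pure-Python sketch of that round trip follows; the token ids are invented for illustration, and the pad id of 1 is an assumption for BART (real code should read `tokenizer.pad_token_id`).

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss
PAD_TOKEN_ID = 1     # assumption: BART's pad id; use tokenizer.pad_token_id in practice

def pad_labels(batch: list[list[int]], fill: int = IGNORE_INDEX) -> list[list[int]]:
    """Right-pad every label sequence to the longest one in the batch."""
    longest = max(len(seq) for seq in batch)
    return [seq + [fill] * (longest - len(seq)) for seq in batch]

def restore_pad(batch: list[list[int]], pad_id: int = PAD_TOKEN_ID) -> list[list[int]]:
    """Swap the ignore index back to a real token id so the batch can be decoded."""
    return [[pad_id if tok == IGNORE_INDEX else tok for tok in seq] for seq in batch]

labels = [[0, 713, 16, 2], [0, 713, 2]]  # hypothetical token ids
padded = pad_labels(labels)
print(padded)
print(restore_pad(padded))
```

Decoding with `skip_special_tokens=True` then drops the restored pad tokens, so the padding never shows up in the reference text passed to ROUGE.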


In [None]:
dataset["test"][5]["text"]

" You are probably angry or hurt which leads to you wanting to get away from this person whom you either loved or even still love but don't wish to remain with. It's a contradictory thing. If you are angry and still forced to be around this person it can lead to a blowup. It can also lead to sniping and arguments that bleed out any remaining good from a relationship.   Tell him or her you are feeling unhappy in the relationship and need some space to think and rid yourself of anger. It might take a firm tone to get this request across but do so and expect they other respects you enough to give you the time to think. Don't set aside a week to think and then hang out with them the next day. Distance yourself completely. Don't make or accept phone calls or texts. Don't see them or if you can't avoid that don't give them too much of your time. Make this time all about you even if you miss them. If you miss them too much try to put it into perspective. Make a pros and cons list. Make a list

In [None]:
dataset["test"][5]["summary"]

"How to Break Up with Someone Who Just Doesn't Get It:  Ask for space to allow you to work through your anger and to be certain of your decision. Assess what isn't working in the relationship. Consider whether or not you're willing to give a second chance. Be sure that you've worked through your anger as outlined in the previous section. Talk to your partner about what has led to this. Confirm the break up with firmness. Be ready for the possible responses to your firm breakup speech. Reiterate your reasons for the breakup if needed. Move on. Be kind to your former partner. Have others intervene on your behalf if your ex won't stop calling and contacting you. Realize that you may feel weary and shocked for a while."

**Taking input from user and generating summary from the trained model**

In [None]:
summarizer = pipeline("summarization", model="/content/drive/MyDrive/text summarization model/")

user_input = input("Please enter the text:")

# Note: this keeps the first 1024 characters, not 1024 tokens
truncated_input = user_input[:1024]

output = summarizer(truncated_input)
summary_text = output[0]['summary_text']

print("Summary:", summary_text)

Please enter the text: You are probably angry or hurt which leads to you wanting to get away from this person whom you either loved or even still love but don't wish to remain with. It's a contradictory thing. If you are angry and still forced to be around this person it can lead to a blowup. It can also lead to sniping and arguments that bleed out any remaining good from a relationship.   Tell him or her you are feeling unhappy in the relationship and need some space to think and rid yourself of anger. It might take a firm tone to get this request across but do so and expect they other respects you enough to give you the time to think. Don't set aside a week to think and then hang out with them the next day. Distance yourself completely. Don't make or accept phone calls or texts. Don't see them or if you can't avoid that don't give them too much of your time. Make this time all about you even if you miss them. If you miss them too much try to put it into perspective. Make a pros and c