<a href="https://colab.research.google.com/github/mayank-soni/text_summary/blob/transformer_train/notebooks/transformer_train_david.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install requirements

In [None]:
! pip install transformers datasets
! pip install rouge-score nltk
! pip install huggingface_hub

# Set parameters

In [2]:
model_checkpoint = 'sshleifer/distilbart-cnn-12-6'
dataset_name = 'xsum'
metric_name = 'rouge'

# Loading data

In [5]:
import transformers

In [6]:
from datasets import load_from_disk
raw_datasets_t = load_from_disk('train_data')
raw_datasets_v = load_from_disk('validation_data')

In [7]:
print(raw_datasets_t)
print(raw_datasets_v)

Dataset({
    features: ['article', 'highlights', 'id'],
    num_rows: 1020
})
Dataset({
    features: ['article', 'highlights', 'id'],
    num_rows: 1232
})


In [8]:
import random
import pandas as pd
import datasets
from IPython.display import display, HTML


def show_random_elements(dataset, num_examples=5, random_seed=36):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    random.seed(random_seed)
    picks = random.sample(range(len(dataset)), num_examples)
    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    #display(HTML(df.to_html()))
    return df

In [9]:
data = show_random_elements(raw_datasets_t)
data.head()

Unnamed: 0,article,highlights,id
0,Siem de Jong played 45 minutes for Newcastle U...,Siem de Jong has made just one Premier League ...,f7b25ae2d51010ec62051aa98b16cd296e30ea8e
1,Reigning champion Novak Djokovic dug deep to a...,Novak Djokovic came from a set down to beat Al...,c85f506937c58a9c2d0b01a8f4d3ba8bc9dba746
2,Real Madrid’s La Liga and Champions League cha...,Luka Modric had to be replaced with a knee com...,7a186935a187d02a0103a15008e5eea42d6d7128
3,The Irish Football Association is hoping that ...,Northern Ireland beat Finland 2-1 in their Eur...,76aeceff1520b88a584a3235daf944b5cec41419
4,A young father who died in a paragliding accid...,Kyle Wittstock crashed into a garage door when...,0aa62c258c24ccec5d59272d0c7c04df9630d588


# Load metric

In [10]:
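from datasets import load_metric
# Note: load_metric is deprecated in recent releases of datasets; the evaluate library's evaluate.load() is its successor
metric = load_metric(metric_name)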


# Pre-process data

In [13]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

ReadTimeout: ignored

In [12]:
prefix_models = ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]
if model_checkpoint in prefix_models:
  prefix = "summarize: "
else:
  prefix = ''

In [13]:

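def preprocess_function(data):
  # Prepend the task prefix (empty for non-T5 checkpoints) to every article
  inputs = [prefix + doc for doc in data["article"]]
  # Tokenize articles as inputs and highlights as labels, truncating anything longer than the tokenizer's maximum length
  tokenized_data = tokenizer(text=inputs, truncation=True, text_target=data['highlights'])
  return tokenized_data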


In [14]:
preprocess_function(raw_datasets_t[:2])
preprocess_function(raw_datasets_v[:2])

{'input_ids': [[0, 1640, 16256, 43, 3399, 22965, 585, 307, 14, 24, 34, 4639, 63, 5436, 9, 1393, 12255, 6926, 611, 6, 442, 123, 4973, 7, 671, 7, 5, 2414, 1320, 480, 13176, 22, 5087, 27495, 9505, 72, 6926, 611, 21, 3456, 71, 10, 8951, 2366, 461, 303, 14, 37, 1153, 2021, 1897, 1476, 136, 39, 320, 6096, 6, 12372, 925, 4473, 3937, 4, 264, 1238, 5, 14177, 1393, 9, 16004, 69, 30, 5, 14599, 8, 26963, 69, 471, 136, 10, 2204, 11, 39, 4243, 184, 23, 8951, 18, 21860, 1016, 13243, 11, 772, 4, 34192, 6, 5, 10591, 4482, 968, 2234, 9826, 39, 27495, 5436, 8, 685, 258, 498, 4, 280, 2425, 37, 2039, 5, 191, 12, 12211, 19627, 1764, 25, 157, 25, 80, 7757, 13780, 968, 4694, 4, 125, 37, 197, 28, 441, 7, 3511, 149, 5, 1136, 6, 8, 10591, 161, 14, 24, 40, 27673, 63, 7404, 13, 123, 7, 3511, 11, 70, 2836, 1061, 4, 20, 403, 136, 6926, 611, 362, 10, 1233, 1004, 94, 186, 6, 77, 5, 8951, 641, 9, 1659, 585, 14, 1103, 74, 45, 28, 1658, 136, 123, 4, 22, 1620, 38, 33, 26, 31, 5, 1786, 6, 38, 222, 45, 6225, 1897, 2134, 60,

In [15]:
tokenized_datasets_t = raw_datasets_t.map(preprocess_function, batched=True)
tokenized_datasets_v = raw_datasets_v.map(preprocess_function, batched=True)


# Fine-tuning

TODO: Check whether the unused weights are problematic

In [16]:
from transformers import TFAutoModelForSeq2SeqLM, DataCollatorForSeq2Seq
model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint, from_pt=True)


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBartForConditionalGeneration: ['model.encoder.embed_tokens.weight', 'model.decoder.embed_tokens.weight']
- This IS expected if you are initializing TFBartForConditionalGeneration from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBartForConditionalGeneration from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBartForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBartForConditionalGeneration for predictions without further training.


In [33]:
#batch_size = 8
batch_size = 1
learning_rate = 2e-5
weight_decay = 0.01
num_train_epochs = 1

In [34]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")
generation_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf", pad_to_multiple_of=128)

In [19]:
#tokenized_datasets["train"]
print(tokenized_datasets_t)
print(tokenized_datasets_v)

Dataset({
    features: ['article', 'highlights', 'id', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 1020
})
Dataset({
    features: ['article', 'highlights', 'id', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 1232
})


TODO: Understand why validation set is processed twice, once for validation and once for generation

In [35]:
train_dataset = model.prepare_tf_dataset(
    tokenized_datasets_t,
    batch_size=batch_size,
    shuffle=True,
    collate_fn=data_collator,
)

validation_dataset = model.prepare_tf_dataset(
    tokenized_datasets_v,
    batch_size=batch_size,
    shuffle=False,
    collate_fn=data_collator,
)

generation_dataset = model.prepare_tf_dataset(
    tokenized_datasets_v,
    batch_size=batch_size,
    shuffle=False,
    collate_fn=generation_data_collator
)
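
On the TODO above: one plausible reason for building a separate `generation_dataset` (an assumption, not something this notebook verifies) is that generation compiled with XLA has to retrace for every new input shape, so padding with `pad_to_multiple_of=128` keeps the number of distinct sequence lengths small. The sketch below uses a hypothetical helper, `padded_length`, and made-up token counts to show the effect of rounding lengths up to a multiple of 128.

```python
def padded_length(length, multiple=128):
    """Round length up to the nearest multiple (hypothetical helper)."""
    return ((length + multiple - 1) // multiple) * multiple


# Made-up token counts for six tokenized articles (illustrative only).
raw_lengths = [87, 130, 255, 256, 301, 512]

padded = [padded_length(n) for n in raw_lengths]
print(padded)

# Number of distinct input shapes a compiled function would have to handle.
print(len(set(raw_lengths)), "distinct raw lengths")
print(len(set(padded)), "distinct padded lengths")
```

With `batch_size = 1`, the plain `data_collator` pads each batch only to its own longest sequence, so nearly every batch can have a different shape; rounding up trades some extra padding for fewer shapes.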


In [36]:
from transformers import AdamWeightDecay
import tensorflow as tf

optimizer = AdamWeightDecay(learning_rate=learning_rate, weight_decay_rate=weight_decay)
model.compile(optimizer=optimizer)

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


Consider adding a KerasMetricCallback, a callback for computing advanced metrics. Common NLP metrics such as ROUGE are hard to fit into the compiled training loop because they depend on decoding predictions and labels back to strings with the tokenizer and then calling arbitrary Python functions to compute the metric. The KerasMetricCallback wraps a metric function and reports the metrics as training progresses.
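
To illustrate why such a metric works on decoded strings rather than on token ids, here is a minimal sketch of a ROUGE-1-style F-measure. It is a simplification written for this note, not the `rouge_score` implementation: the function name `rouge1_f` is hypothetical, tokens are split on whitespace, and no stemming is applied.

```python
from collections import Counter


def rouge1_f(prediction, reference):
    """Unigram-overlap F-measure on whitespace tokens (simplified ROUGE-1)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each reference token can be matched at most once.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Five of the six unigrams match, so precision = recall = 5/6.
score = rouge1_f("the cat sat on the mat", "the cat lay on the mat")
print(round(score, 3))
```

Because this is ordinary Python over strings, it cannot be traced into the compiled Keras training step, which is the gap the callback fills.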

In [37]:
import numpy as np
import nltk

def metric_fn(eval_predictions):
    predictions, labels = eval_predictions
    decoded_predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    for label in labels:
        label[label < 0] = tokenizer.pad_token_id  # Replace masked label tokens
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # Rouge expects a newline after each sentence
    decoded_predictions = [
        "\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_predictions
    ]
    decoded_labels = [
        "\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels
    ]
    result = metric.compute(
        predictions=decoded_predictions, references=decoded_labels, use_stemmer=True
    )
    # Extract a few results
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    # Add mean generated length
    prediction_lens = [
        np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions
    ]
    result["gen_len"] = np.mean(prediction_lens)

    return result

In [52]:
import tensorflow as tf
from transformers.keras_callbacks import PushToHubCallback, KerasMetricCallback
from tensorflow.keras.callbacks import TensorBoard

# Memory growth must be set before the GPUs are initialized, so configure it
# before calling model.fit(). Ideally this block runs at the top of the
# notebook, before the model is loaded; otherwise TensorFlow raises the
# RuntimeError caught below.
gpus = tf.config.list_physical_devices('GPU')
if gpus:
  try:
    # Currently, memory growth needs to be the same across GPUs
    for gpu in gpus:
      tf.config.experimental.set_memory_growth(gpu, True)
    logical_gpus = tf.config.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    print(e)

tensorboard_callback = TensorBoard(log_dir="./summarization_model_save/logs")
# Pass callbacks=[tensorboard_callback] to fit() to enable TensorBoard logging
model.fit(train_dataset, validation_data=validation_dataset, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Physical devices cannot be modified after being initialized


In [None]:
'''
from transformers.keras_callbacks import PushToHubCallback, KerasMetricCallback
from tensorflow.keras.callbacks import TensorBoard

tensorboard_callback = TensorBoard(log_dir="./summarization_model_save/logs")

push_to_hub_callback = PushToHubCallback(
    output_dir="./summarization_model_save",
    tokenizer=tokenizer,
    hub_model_id=push_to_hub_model_id,
)

metric_callback = KerasMetricCallback(
    metric_fn, eval_dataset=generation_dataset, predict_with_generate=True, use_xla_generation=True
)

callbacks = [metric_callback, tensorboard_callback, push_to_hub_callback]

model.fit(
    train_dataset, validation_data=validation_dataset, epochs=1) 
    #callbacks=callbacks) 
'''

In [None]:
'''
from transformers import AutoTokenizer, TFAutoModelForSeq2SeqLM

# You can of course substitute your own username and model here if you've trained and uploaded it!
#model_name = 'Rocketknight1/t5-small-finetuned-xsum'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)
'''

In [1]:
document = 'The full cost of damage in Newton Stewart, one of the areas worst affected, is still being assessed.\nRepair work is ongoing in Hawick and many roads in Peeblesshire remain badly affected by standing water.\nTrains on the west coast mainline face disruption due to damage at the Lamington Viaduct.\nMany businesses and householders were affected by flooding in Newton Stewart after the River Cree overflowed into the town.\nFirst Minister Nicola Sturgeon visited the area to inspect the damage.\nThe waters breached a retaining wall, flooding many commercial properties on Victoria Street - the main shopping thoroughfare.\nJeanette Tate, who owns the Cinnamon Cafe which was badly affected, said she could not fault the multi-agency response once the flood hit.\nHowever, she said more preventative work could have been carried out to ensure the retaining wall did not fail.\n"It is difficult but I do think there is so much publicity for Dumfries and the Nith - and I totally appreciate that - but it is almost like we\'re neglected or forgotten," she said.\n"That may not be true but it is perhaps my perspective over the last few days.\n"Why were you not ready to help us a bit more when the warning and the alarm alerts had gone out?"\nMeanwhile, a flood alert remains in place across the Borders because of the constant rain.\nPeebles was badly hit by problems, sparking calls to introduce more defences in the area.\nScottish Borders Council has put a list on its website of the roads worst affected and drivers have been urged not to ignore closure signs.\nThe Labour Party\'s deputy Scottish leader Alex Rowley was in Hawick on Monday to see the situation first hand.\nHe said it was important to get the flood protection plan right but backed calls to speed up the process.\n"I was quite taken aback by the amount of damage that has been done," he said.\n"Obviously it is heart-breaking for people who have been forced out of their homes and the impact on businesses."\nHe said it was important that "immediate steps" were taken to protect the areas most vulnerable and a clear timetable put in place for flood prevention plans.\nHave you been affected by flooding in Dumfries and Galloway or the Borders? Tell us about your experience of the situation and how it was handled. Email us on selkirk.news@bbc.co.uk or dumfries@bbc.co.uk.'
if 't5' in model_checkpoint: 
    document = "summarize: " + document
tokenized = tokenizer([document], return_tensors='np')
out = model.generate(**tokenized, max_length=128)

NameError: ignored

In [54]:
# decode() does not need the (deprecated) as_target_tokenizer() context manager
print(tokenizer.decode(out[0]))

</s><s><s><s>The full cost of damage in Newton Stewart, Peeblesshire, is still being assessed.
Many businesses and householders were affected by flooding after the River Cree overflowed into the town.
The waters breached a retaining wall, flooding many of the area.
Labour Party's deputy Scottish leader Alex Rowley was in Hawick on Monday to see the situation first hand.
He backed calls for more defences in the area to improve infrastructure.
But Rowley said he was taken aback by the damage.</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>
