<a href="https://colab.research.google.com/github/mayank-soni/text_summary/blob/transformer_train/transformer_train_david_(5_Dec).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install requirements

In [6]:
! pip install transformers datasets
! pip install rouge-score nltk
! pip install huggingface_hub

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting datasets
  Downloading datasets-2.7.1-py3-none-any.whl (451 kB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting xxhash
  Downloading xxhash-3.1.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.14-py38-none-

# Set parameters

In [None]:
model_checkpoint = 'sshleifer/distilbart-cnn-12-6'
dataset_name = 'xsum'
metric_name = 'rouge'

# Loading data

In [None]:
import transformers

In [8]:
from datasets import load_from_disk
raw_datasets_t = load_from_disk('train_data')
raw_datasets_v = load_from_disk('validation_data')

In [None]:
print(raw_datasets_t)
print(raw_datasets_v)

Dataset({
    features: ['article', 'highlights', 'id'],
    num_rows: 1020
})
Dataset({
    features: ['article', 'highlights', 'id'],
    num_rows: 1232
})


In [None]:
import random
import pandas as pd
import datasets
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=5, random_seed=36):
    """Return a DataFrame of `num_examples` rows sampled reproducibly from `dataset`."""
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    random.seed(random_seed)
    picks = random.sample(range(len(dataset)), num_examples)
    df = pd.DataFrame(dataset[picks])
    # Map ClassLabel columns from integer ids to their string names
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    # display(HTML(df.to_html()))  # uncomment to render the table inline
    return df

In [None]:
data = show_random_elements(raw_datasets_t)
data.head()

Unnamed: 0,article,highlights,id
0,Siem de Jong played 45 minutes for Newcastle U...,Siem de Jong has made just one Premier League ...,f7b25ae2d51010ec62051aa98b16cd296e30ea8e
1,Reigning champion Novak Djokovic dug deep to a...,Novak Djokovic came from a set down to beat Al...,c85f506937c58a9c2d0b01a8f4d3ba8bc9dba746
2,Real Madrid’s La Liga and Champions League cha...,Luka Modric had to be replaced with a knee com...,7a186935a187d02a0103a15008e5eea42d6d7128
3,The Irish Football Association is hoping that ...,Northern Ireland beat Finland 2-1 in their Eur...,76aeceff1520b88a584a3235daf944b5cec41419
4,A young father who died in a paragliding accid...,Kyle Wittstock crashed into a garage door when...,0aa62c258c24ccec5d59272d0c7c04df9630d588


# Load metric

In [None]:
from datasets import load_metric
# metric_name is set to 'rouge' in the parameters cell
metric = load_metric(metric_name)


# Pre-process data

In [None]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


In [None]:
prefix_models = ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]
if model_checkpoint in prefix_models:
  prefix = "summarize: "
else:
  prefix = ''

In [None]:

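def preprocess_function(data):
  # Prepend the task prefix (empty unless the checkpoint is a T5 model)
  inputs = [prefix + doc for doc in data["article"]]
  # Tokenize articles as inputs and highlights as labels, truncating to the model's maximum length
  tokenized_data = tokenizer(text=inputs, truncation=True, text_target=data['highlights'])
  return tokenized_data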


In [None]:
preprocess_function(raw_datasets_t[:2])
preprocess_function(raw_datasets_v[:2])

{'input_ids': [[0, 1640, 16256, 43, 3399, 22965, 585, 307, 14, 24, 34, 4639, 63, 5436, 9, 1393, 12255, 6926, 611, 6, 442, 123, 4973, 7, 671, 7, 5, 2414, 1320, 480, 13176, 22, 5087, 27495, 9505, 72, 6926, 611, 21, 3456, 71, 10, 8951, 2366, 461, 303, 14, 37, 1153, 2021, 1897, 1476, 136, 39, 320, 6096, 6, 12372, 925, 4473, 3937, 4, 264, 1238, 5, 14177, 1393, 9, 16004, 69, 30, 5, 14599, 8, 26963, 69, 471, 136, 10, 2204, 11, 39, 4243, 184, 23, 8951, 18, 21860, 1016, 13243, 11, 772, 4, 34192, 6, 5, 10591, 4482, 968, 2234, 9826, 39, 27495, 5436, 8, 685, 258, 498, 4, 280, 2425, 37, 2039, 5, 191, 12, 12211, 19627, 1764, 25, 157, 25, 80, 7757, 13780, 968, 4694, 4, 125, 37, 197, 28, 441, 7, 3511, 149, 5, 1136, 6, 8, 10591, 161, 14, 24, 40, 27673, 63, 7404, 13, 123, 7, 3511, 11, 70, 2836, 1061, 4, 20, 403, 136, 6926, 611, 362, 10, 1233, 1004, 94, 186, 6, 77, 5, 8951, 641, 9, 1659, 585, 14, 1103, 74, 45, 28, 1658, 136, 123, 4, 22, 1620, 38, 33, 26, 31, 5, 1786, 6, 38, 222, 45, 6225, 1897, 2134, 60,

In [None]:
tokenized_datasets_t = raw_datasets_t.map(preprocess_function, batched=True)
tokenized_datasets_v = raw_datasets_v.map(preprocess_function, batched=True)


# Fine-tuning

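TODO: check whether the unused PyTorch weights reported in the warning below (`model.encoder.embed_tokens.weight`, `model.decoder.embed_tokens.weight`) are problematic.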

In [None]:
from transformers import TFAutoModelForSeq2SeqLM, DataCollatorForSeq2Seq
model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint, from_pt=True)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBartForConditionalGeneration: ['model.encoder.embed_tokens.weight', 'model.decoder.embed_tokens.weight']
- This IS expected if you are initializing TFBartForConditionalGeneration from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBartForConditionalGeneration from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBartForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBartForConditionalGeneration for predictions without further training.


In [None]:
#batch_size = 8
batch_size = 1
learning_rate = 2e-5
weight_decay = 0.01
num_train_epochs = 1
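
As a rough sanity check on these settings, the number of optimizer steps per epoch follows from the dataset size and the batch size. The sketch below uses the 1,020 training rows printed earlier. `steps_per_epoch` is a hypothetical helper, not a library function, and it assumes the final partial batch is dropped when shuffling, which is believed to be the default behaviour of `prepare_tf_dataset`.

```python
import math

def steps_per_epoch(num_rows: int, batch_size: int, drop_remainder: bool) -> int:
    """Number of batches one pass over the data yields (hypothetical helper)."""
    if drop_remainder:
        return num_rows // batch_size
    return math.ceil(num_rows / batch_size)

train_rows = 1020  # size of the training split printed earlier

# Compare the batch size in use (1) with the commented-out alternative (8)
for bs in (1, 8):
    print(bs, steps_per_epoch(train_rows, bs, drop_remainder=True))
```

With `batch_size = 1` every row is its own step, so one epoch is 1,020 updates. With the commented-out `batch_size = 8` the last four rows would be dropped under this assumption.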

In [None]:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")
generation_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf", pad_to_multiple_of=128)

In [None]:
#tokenized_datasets["train"]
print(tokenized_datasets_t)
print(tokenized_datasets_v)

Dataset({
    features: ['article', 'highlights', 'id', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 1020
})
Dataset({
    features: ['article', 'highlights', 'id', 'input_ids', 'attention_mask', 'labels'],
    num_rows: 1232
})


TODO: Understand why validation set is processed twice, once for validation and once for generation

In [None]:
train_dataset = model.prepare_tf_dataset(
    tokenized_datasets_t,
    batch_size=batch_size,
    shuffle=True,
    collate_fn=data_collator,
)

validation_dataset = model.prepare_tf_dataset(
    tokenized_datasets_v,
    batch_size=batch_size,
    shuffle=False,
    collate_fn=data_collator,
)

generation_dataset = model.prepare_tf_dataset(
    tokenized_datasets_v,
    batch_size=batch_size,
    shuffle=False,
    collate_fn=generation_data_collator
)

You're using a BartTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
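
One plausible reason for building a separate generation dataset is its `pad_to_multiple_of=128` collator. XLA compiles one program per distinct input shape, so rounding every batch up to a multiple of 128 limits how many shapes generation sees. A minimal sketch of that rounding, with made-up token counts (`padded_length` is a hypothetical helper):

```python
def padded_length(num_tokens: int, multiple: int = 128) -> int:
    """Round a sequence length up to the next multiple (hypothetical helper)."""
    return -(-num_tokens // multiple) * multiple

raw_lengths = [57, 130, 255, 256, 700, 1024]  # illustrative values only
padded = [padded_length(n) for n in raw_lengths]

print(padded)
print(len(set(raw_lengths)), "distinct raw shapes ->", len(set(padded)), "padded shapes")
```

The plain `data_collator` pads only to the longest sequence in each batch, which is cheaper for training and validation loss but would produce many more distinct shapes during generation.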


In [None]:
from transformers import AdamWeightDecay
import tensorflow as tf

optimizer = AdamWeightDecay(learning_rate=learning_rate, weight_decay_rate=weight_decay)
model.compile(optimizer=optimizer, run_eagerly=True)

print("After compiling model :",tf.executing_eagerly())

No loss specified in compile() - the model's internal loss computation will be used as the loss. Don't panic - this is a common way to train TensorFlow models in Transformers! To disable this behaviour please pass a loss argument, or explicitly pass `loss=None` if you do not want your model to compute a loss.


After compiling model : True
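
`AdamWeightDecay` applies weight decay directly to the weights rather than adding an L2 term to the loss. The scalar sketch below shows one such update using the learning rate and decay rate set earlier. It follows the textbook Adam formulas with assumed defaults (`beta_1=0.9`, `beta_2=0.999`, `epsilon=1e-7`) and is not the library's actual implementation; `adamw_step` is a hypothetical name.

```python
import math

def adamw_step(w, grad, m, v, t, lr=2e-5, weight_decay=0.01,
               beta_1=0.9, beta_2=0.999, epsilon=1e-7):
    """One decoupled-weight-decay Adam step for a single scalar weight."""
    m = beta_1 * m + (1 - beta_1) * grad          # first-moment estimate
    v = beta_2 * v + (1 - beta_2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta_1 ** t)                 # bias correction
    v_hat = v / (1 - beta_2 ** t)
    w = w - lr * weight_decay * w                 # decay applied to the weight itself
    w = w - lr * m_hat / (math.sqrt(v_hat) + epsilon)
    return w, m, v

w, m, v = adamw_step(w=1.0, grad=0.5, m=0.0, v=0.0, t=1)
print(w)
```

The decay term shrinks the weight by `lr * weight_decay * w` regardless of the gradient statistics, which is the property that distinguishes it from L2 regularisation under Adam.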


Consider adding a `KerasMetricCallback`, a callback for computing advanced metrics. Common NLP metrics such as ROUGE are hard to fit into a compiled training loop because they depend on decoding predictions and labels back to strings with the tokenizer and on calling arbitrary Python functions to compute the metric. The `KerasMetricCallback` wraps a metric function and reports its results as training progresses.
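
To make concrete what a ROUGE score measures on decoded strings, here is a simplified ROUGE-1 F-measure based on unigram overlap. It is only a sketch: whitespace splitting and lower-casing stand in for the tokenisation and stemming the real metric performs, and `rouge1_f` is a hypothetical name.

```python
from collections import Counter

def rouge1_f(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F-measure from clipped unigram overlap."""
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared word
    overlap = sum((pred_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f("arsenal beat united in the cup", "arsenal beats man utd in fa cup"))
```

Because the score is computed on words, the model's token ids must first be decoded to text, which is why this step cannot run inside the compiled training graph.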

In [None]:
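import numpy as np
import nltk

# sent_tokenize needs the Punkt sentence tokenizer data (newer NLTK releases may need "punkt_tab" instead)
nltk.download("punkt", quiet=True)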

def metric_fn(eval_predictions):
    predictions, labels = eval_predictions
    decoded_predictions = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    for label in labels:
        label[label < 0] = tokenizer.pad_token_id  # Replace masked label tokens
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # Rouge expects a newline after each sentence
    decoded_predictions = [
        "\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_predictions
    ]
    decoded_labels = [
        "\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels
    ]
    result = metric.compute(
        predictions=decoded_predictions, references=decoded_labels, use_stemmer=True
    )
    # Extract a few results
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    # Add mean generated length
    prediction_lens = [
        np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions
    ]
    result["gen_len"] = np.mean(prediction_lens)

    return result

In [None]:
import tensorflow as tf
from transformers.keras_callbacks import KerasMetricCallback
from tensorflow.keras.callbacks import TensorBoard

# tensorboard_callback = TensorBoard(log_dir="./summarization_model_save/logs")

# Compute ROUGE on generated summaries at the end of each epoch
metric_callback = KerasMetricCallback(
    metric_fn,
    eval_dataset=generation_dataset,
    predict_with_generate=True,
    use_xla_generation=True,
)
callbacks = [metric_callback]  # add tensorboard_callback here to enable logging

model.fit(
    train_dataset,
    validation_data=validation_dataset,
    epochs=num_train_epochs,
    verbose=1,
    callbacks=callbacks,
)





KeyboardInterrupt: ignored

In [None]:
!mkdir -p saved_model
model.save_pretrained('saved_model/my_model')

new_model = TFAutoModelForSeq2SeqLM.from_pretrained('saved_model/my_model')
new_model.summary()

All model checkpoint layers were used when initializing TFBartForConditionalGeneration.

All the layers of TFBartForConditionalGeneration were initialized from the model checkpoint at saved_model/my_model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBartForConditionalGeneration for predictions without further training.


Model: "tf_bart_for_conditional_generation_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 model (TFBartMainLayer)     multiple                  305510400 
                                                                 
 final_logits_bias (BiasLaye  multiple                 50264     
 r)                                                              
                                                                 
Total params: 305,560,664
Trainable params: 305,510,400
Non-trainable params: 50,264
_________________________________________________________________


In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
from google.colab import auth
auth.authenticate_user()
project_id = 'text-summarisation-370107'
!gcloud config set project {project_id}
!gsutil ls

In [None]:
!pip install gsutil
!gsutil config

In [None]:
bucket_name = 'sports-model'
!gsutil -m cp -r /content/data/pretrained_model/* gs://{bucket_name}/

In [None]:
predicted = []
# Assumption: the texts to summarise are the validation articles
for document in raw_datasets_v['article']:
  # Single article used for a manual test (kept for reference):
  # document = '''"But the physical and emotional toll of a first return home for three years, as the British Open champion, to collect a Greg Norman medal, a third Australian PGA Championship and adulation even when he ate, caught up with Smith in Melbourne.The world No.3 called his reintroduction to the sandbelt in the opening round of the Australian Open on Thursday "pretty rubbish".Hopes of turning that around with a course more from Victoria to Kingston Heath on Friday didn't materialise, a bogey on is opening hole the start of another lacklustre round which included taking another unplayable lie after a drive he thought was "perfect" on the way to a second straight round of one-over par. His name hovered just a single shot below the cut line all afternoon before finally rising above just before the sun set meaning he gets to front up again on Saturday and try and work his way back into the tournament. That means trying to lift himself in to the top 30 for the second Saturday cut at the first-ever dual gender Australian Open, which could require some sort of turnaround for the Queenslander. . Smith was already thinking about the time off he was going to enjoy before the afternoon players teed off, before some late bogeys secured his passage to the weekend and at least one more round before a well-earned break. "I was just really uncomfortable all day kind of similar to yesterday just couldn't quite hit the ball out the middle of the club face for some reason or another," Smith said, more defeated than he was after his opening round. "My mind was a little bit foggy obviously a little bit tired as well. Last week had been such a big week so yeah, just pretty disappointing."I tend to really struggle on back-to-back weeks I think because I do put so much into that first week. 
  # Getting more mentally prepared for the week after is definitely something that can improve."I mean, a lot's changed in the three years since I've been here and you know, just (getting attention) going into shops and dinner and stuff is a lot different so it gets frustrating at times. But it is what it is."The disappointment was as much for those who had waited so long to see Smith, back in Melbourne for the first time since 2019, and playing Kingston Heath in conditions which made it "the easiest this place is gonna get"."I think I had a lot of adrenaline going, especially out in the golf course. I think you know, the crowds were awesome out there," he said. "Obviously I had friends and family that I want to play well for them and I think it's just kind of all hit me at once and just got a little bit tired but you know, I need to play better than that. "That was pretty rubbish out there today."His efforts nearly brought an abrupt end to his performances in 2022 which were the complete opposite of rubbish from Smith, a year in which he rivalled Rory McIlroy as the best player in the world.Five wins, a major breakthrough, victory at the Players Championship before his $140 million move to LIV golf. It was a year to be proud of, and one which had earned him the long break he'll now enjoy, in Australia, with family and friends he didn't see for so long, when his Open is finally done. ."I can't wait for a sleep. I've played a lot more golf than I thought I would have at the start of the year," Smith said. "I'm looking forward to four or five weeks off here and just kind of mentally reset I think. The brain's been going pretty hard the last few months, so yeah, it would be a good time to sit down on a beach somewhere and have a few margaritas. "SMITH BY THE NUMBERSCameron Smith's two Australian Open rounds,- 1-over 71 @ Victoria, 1-over 72 at Kingston Heath- seven birdies- nine bogeys- 20 pars- +2 on par 3s- +1 on par 4s- 1-under on par 5s'''
  # Truncate long articles to the model's maximum input length
  tokenized = tokenizer([document], return_tensors='np', truncation=True)
  out = model.generate(**tokenized, max_length=128)
  # Drop <s>, </s> and <pad> so they do not affect the ROUGE scores
  predicted.append(tokenizer.decode(out[0], skip_special_tokens=True))

predicted[0:3]

In [None]:
# Special tokens are kept here so the raw model output is visible
print(tokenizer.decode(out[0]))

</s><s><s><s>British Open champion Cameron Smith is struggling again in Australia.
The world No.3 leads the first round of one-over par at Kingston Heath.
Smith's first return to Australia has taken the toll on himself.
But the physical and emotional toll of a first return home caught up with Smith in Melbourne.
His name hovered just a single shot below the cut line all afternoon before finally rising above on Friday.</s><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad>




In [17]:
import pandas as pd 

raw = pd.DataFrame(raw_datasets_v['highlights'])
raw_list = raw.values.tolist()
raw_list[0:2] 

[["The lack of charges against Busch expedited the decision, a NASCAR official says .\nKurt Busch was accused of grabbing his ex-girlfriend by the throat, slamming her head .\nHe twice appealed NASCAR's indefinite suspension and lost ."],
 ['Arsenal beats Man Utd 2-1 in FA Cup quarterfinal .\nFormer Manchester United player Danny Welbeck scores winner .\nHolders Arsenal took the lead through Nacho Monreal before Wayne Rooney equalized .\nAngel Di Maria sent off for shoving referee in second half .']]

In [2]:
pip install rouge

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting rouge
  Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge
Successfully installed rouge-1.0.1


In [4]:
import rouge

In [None]:
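import numpy as np

def evaluate_summary(y_test, predicted):
    """Print average ROUGE-1, ROUGE-2 and ROUGE-L F-scores.

    :param y_test: reference summaries, string or list of strings
    :param predicted: generated summaries, string or list of strings
    """
    rouge_score = rouge.Rouge()
    # get_scores takes the hypotheses first, then the references
    scores = rouge_score.get_scores(predicted, y_test, avg=True)
    score_1 = round(scores['rouge-1']['f'], 2)
    score_2 = round(scores['rouge-2']['f'], 2)
    score_L = round(scores['rouge-l']['f'], 2)
    print("rouge1:", score_1, "| rouge2:", score_2, "| rougeL:",
          score_L, "--> avg rouge:", round(np.mean(
          [score_1, score_2, score_L]), 2))

# raw_list holds one-element lists, so unwrap each reference string first
references = [row[0] for row in raw_list]
evaluate_summary(references, predicted)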
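
The `rouge-l` value reported above is built on the longest common subsequence (LCS) between candidate and reference. A simplified word-level sketch follows. It ignores the sentence splitting the `rouge` package applies, and `lcs_length` and `rouge_l_f` are hypothetical names.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbour
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f(prediction: str, reference: str) -> float:
    """Simplified ROUGE-L F-measure over whitespace-separated words."""
    pred, ref = prediction.split(), reference.split()
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f("smith made the cut on friday", "smith narrowly made the cut"))
```

Unlike ROUGE-1 and ROUGE-2, the LCS rewards words that appear in the same order even when they are not adjacent, so a summary that preserves the sequence of events scores higher than one that merely shares vocabulary.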