# Summarize - GLM

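This notebook covers:

    I. Presentation: what GLM is and how its training data is laid out
    II. Realization: fine-tuning GLM to generate news titles
    III. Evaluation: scoring the generated titles with ROUGE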

## I. Presentation

### 1. Definition

GLM (General Language Model) is pretrained with an autoregressive blank-filling objective.

It is a decoder-only model.

The GLM checkpoints are not official Hugging Face (HF) models. They are loaded with trust_remote_code=True, and parts of their usage do not follow the conventions of native HF models.

So for implementation details, refer to the model's official page or to its source code.

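### 2. Data structure

A training sample is laid out as follows. The first row is the model input; the second row holds the labels, where -100 marks the positions that are ignored by the loss: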

    _______________________________________________________________________
       encoder input           |eos| |bos|         decoder input      |eos|
    -----------------------------------------------------------------------
                                          |
    ________________________________________________________________________
                     -100                 |      decoder output        |eos|
    ------------------------------------------------------------------------
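
The layout above can be sketched in plain Python. This is only an illustration with made-up token ids, not the GLM tokenizer's actual implementation: the helper `build_sample` and all id values are hypothetical. It shows that the decoder input is the target shifted right by one (starting with bos), and that the labels are -100 over the whole context so that only the target contributes to the loss.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def build_sample(context_ids, target_ids, eos_id, bos_id, end_id):
    """Toy version of the layout: context | eos | bos | target, plus aligned labels."""
    decoder_in = [bos_id] + target_ids    # what the model reads while generating
    decoder_out = target_ids + [end_id]   # what it must predict at each of those positions
    input_ids = context_ids + [eos_id] + decoder_in
    # Context and its eos are masked out; labels line up with decoder_in.
    labels = [IGNORE_INDEX] * (len(context_ids) + 1) + decoder_out
    return input_ids, labels

# Hypothetical ids, chosen only for readability.
input_ids, labels = build_sample([11, 12, 13], [21, 22], eos_id=2, bos_id=90, end_id=91)
print(input_ids)  # [11, 12, 13, 2, 90, 21, 22]
print(labels)     # [-100, -100, -100, -100, 21, 22, 91]
```

Both lists have the same length, which is what lets the loss be computed position by position.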

## II. Realization

In [1]:
# Select which GPU to use.
# I have 2 GPUs and only want to use one of them, so I set this.
# This cell must be run first.
# Skip it if you don't need it.

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # or "0,1" for multiple GPUs
os.environ["TOKENIZERS_PARALLELISM"] = "false"

In [2]:
## define repos for data and model

# data

ckp_data = "Ateeqq/news-title-generator"

# model

# ckp = "THUDM/glm-2b"
ckp = "THUDM/glm-roberta-large"

### 1. import 

In [3]:
import evaluate, torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainer, Seq2SeqTrainingArguments



### 2. load data

In [4]:
data = load_dataset(ckp_data, split="train[:1000]")
data

Dataset({
    features: ['summary', 'text'],
    num_rows: 1000
})

### 3. split data

In [5]:
split_data = data.train_test_split(test_size=0.2, seed=42)
split_data

DatasetDict({
    train: Dataset({
        features: ['summary', 'text'],
        num_rows: 800
    })
    test: Dataset({
        features: ['summary', 'text'],
        num_rows: 200
    })
})

### 4. tokenization

In [6]:
tokenizer = AutoTokenizer.from_pretrained(ckp, trust_remote_code=True)
tokenizer


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


GLMRobertaTokenizer(name_or_path='THUDM/glm-roberta-large', vocab_size=50265, model_max_length=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='left', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '</s>', 'cls_token': '<s>', 'mask_token': '[MASK]', 'additional_special_tokens': ['<|startofpiece|>', '<|endofpiece|>']}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	3: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	50265: AddedToken("<|startofpiece|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	50266: AddedToken("<|endofpiece|>", rstrip=False, lstrip=False, single_word=False, normali

In [7]:
def process(samples):

    prefix = "summarize: "

    text = [prefix + t + tokenizer.mask_token for t in samples["text"]]

    toks = tokenizer(text, truncation=True, padding="max_length", max_length=256, return_tensors="pt")

    toks = tokenizer.build_inputs_for_generation(toks, targets=samples["summary"], padding=True, max_gen_length=64)

    return toks


In [8]:
tokenized_data = split_data.map(process, batched=True, remove_columns=split_data["train"].column_names)
tokenized_data


DatasetDict({
    train: Dataset({
        features: ['input_ids', 'position_ids', 'attention_mask', 'labels'],
        num_rows: 800
    })
    test: Dataset({
        features: ['input_ids', 'position_ids', 'attention_mask', 'labels'],
        num_rows: 200
    })
})

In [9]:
tokenizer.decode(tokenized_data["train"][0]["input_ids"])

"<s>summarize: Karnataka Congress MLA JN Ganesh is absconding days after he was booked by police for attempt to murder for allegedly assaulting his fellow party MLA Anand Singh. Ganesh reportedly assaulted Singh in a drunken brawl during their stay at a Bengaluru resort. Almost all state Congress MLAs were taken to the resort amid the party's poaching allegations against BJP.[MASK]</s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s></s>

In [10]:
print(tokenized_data["train"][0]["labels"])

[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -10

In [11]:
print(tokenized_data["train"][0]["position_ids"])

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221
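
The nested list above indicates that each sample carries two rows of position ids, which matches the two embedding tables (position and block position) of the model. In the 2D positional encoding described in the GLM paper, the first row is the position in the context, with every generated token reusing the position of the [MASK] it fills, and the second row is the position inside the generated span (0 for context tokens). A minimal sketch of that scheme, assuming a single mask; the helper name is hypothetical and the real tokenizer code may differ in details:

```python
def two_d_position_ids(context_len, mask_pos, gen_len):
    """Return [position_ids, block_position_ids] for a context plus one generated span."""
    # Context tokens: ordinary positions, block position 0.
    position_ids = list(range(context_len))
    block_position_ids = [0] * context_len
    # Generated tokens: all share the mask position; block position counts 1..gen_len.
    position_ids += [mask_pos] * gen_len
    block_position_ids += list(range(1, gen_len + 1))
    return [position_ids, block_position_ids]

pos = two_d_position_ids(context_len=5, mask_pos=3, gen_len=3)
print(pos[0])  # [0, 1, 2, 3, 4, 3, 3, 3]
print(pos[1])  # [0, 0, 0, 0, 0, 1, 2, 3]
```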

### 5. load model

In [12]:
model = AutoModelForSeq2SeqLM.from_pretrained(ckp, trust_remote_code=True)
model


GLMForConditionalGeneration(
  (glm): GLMModel(
    (word_embeddings): VocabEmbedding()
    (transformer): GLMStack(
      (embedding_dropout): Dropout(p=0.1, inplace=False)
      (position_embeddings): Embedding(513, 1024)
      (block_position_embeddings): Embedding(513, 1024)
      (layers): ModuleList(
        (0-23): 24 x GLMBlock(
          (input_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (attention): SelfAttention(
            (query_key_value): Linear(in_features=1024, out_features=3072, bias=True)
            (attention_dropout): Dropout(p=0.1, inplace=False)
            (dense): Linear(in_features=1024, out_features=1024, bias=True)
            (output_dropout): Dropout(p=0.1, inplace=False)
          )
          (post_attention_layernorm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
          (mlp): MLP(
            (dense_h_to_4h): Linear(in_features=1024, out_features=4096, bias=True)
            (dense_4h_to_h): Linear(in_feat

### 6. define metrics

Skipped here: the evaluation is done manually in section III.

### 7. train args

In [17]:
args = Seq2SeqTrainingArguments(
    output_dir="../tmp/checkpoint",
    num_train_epochs=3,
    per_device_eval_batch_size=4,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,
    logging_steps=8
)

### 8. trainer

In [18]:
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized_data["train"],
    tokenizer=tokenizer
)

### 9. train

In [19]:
trainer.train()

Step,Training Loss
8,5.6352
16,1.4517
24,1.0237
32,0.7791


TrainOutput(global_step=36, training_loss=2.0564657383494906, metrics={'train_runtime': 438.6842, 'train_samples_per_second': 5.471, 'train_steps_per_second': 0.082, 'total_flos': 1565198490009600.0, 'train_loss': 2.0564657383494906, 'epoch': 2.88})
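
The 36 optimizer steps and the final epoch value of 2.88 follow from the training arguments. A quick check of the arithmetic, assuming a single GPU and that the incomplete accumulation group at the end of each epoch does not count as a step (which is what the reported numbers are consistent with):

```python
import math

num_samples = 800      # rows in the train split
per_device_batch = 8   # per_device_train_batch_size
grad_accum = 8         # gradient_accumulation_steps
epochs = 3             # num_train_epochs

batches_per_epoch = math.ceil(num_samples / per_device_batch)  # 100 mini-batches
steps_per_epoch = batches_per_epoch // grad_accum              # 12 optimizer steps
total_steps = steps_per_epoch * epochs                         # 36, as in global_step
effective_batch = per_device_batch * grad_accum                # 64 samples per update
epoch_reached = total_steps * grad_accum / batches_per_epoch   # 2.88, as in 'epoch'

print(effective_batch, total_steps, epoch_reached)
```

With logging_steps=8, the loss is logged at steps 8, 16, 24 and 32, which is why the table has four rows.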

### 10. Inference

In [39]:
# For generation, refer to the official tutorial or the source code.
tokenizer.pad_token_id = tokenizer.eos_token_id
input_text = split_data["test"][0]["text"]

prefix = "summarize: "

text = prefix + input_text + tokenizer.mask_token

toks = tokenizer(text, return_tensors="pt")

toks = tokenizer.build_inputs_for_generation(toks, max_gen_length=64)
print(toks)
toks = toks.to(model.device)

out = model.generate(**toks, max_new_tokens=64, eos_token_id=tokenizer.eop_token_id, do_sample=True)

result = tokenizer.decode(out[0].tolist())
result

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50266 for open-end generation.


{'input_ids': tensor([[    0, 18581,  3916,  2072,    35,  5028, 41396,  2192,  2803,    15,
           307,    26, 37325,    33, 13526,  8178, 23681,  4311,  4927,   124,
             7,     5,  4665, 22167, 18365,     6,   601,  6551,    12,   996,
          3083,  9543,     6,    11,     5, 22568,  6599,     4,  1936,   291,
         17353,  3091,  4927,   124,     7,     5, 24268,  3892, 11599, 18365,
            58,    67,   303,    59,  9680,  9395,  1926,     9, 14794,     4,
            20, 23681,  4311,  5585,  8178,  3477,  1189,     6,  7326, 35277,
         34487,     8,  4728, 13237, 27958,    19, 21373,     6,     5,  3158,
          1487,     4, 50267,     2, 50265]]), 'position_ids': tensor([[[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
          17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
          34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
          51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62,

'<s>summarize: Egypt Antiquities Ministry on Wednesday said archaeologists have uncovered ancient tombs dating back to the Second Intermediate Period, 1782-1570 BC, in the Nile Delta. About 20 burial sites dating back to the Predynastic Period were also found about 140 kilometres north of Cairo. The tombs contain ancient animal remains, stone artefacts and pottery fragments with drawings, the ministry revealed.[MASK]</s> <|startofpiece|> 2,000 BC tombs found in Nile Delta cave <|endofpiece|>'

## III. Evaluation

We have to write our own evaluation loop, since GLM is not a native HF model.

In [40]:
# infer all test data

def infer_test():

    predicts = []
    prefix = "summarize: "

    model.eval()

    with torch.inference_mode():

        for t in split_data["test"]:

            text = prefix + t["text"] + tokenizer.mask_token

            toks = tokenizer(text, return_tensors="pt")

            toks = tokenizer.build_inputs_for_generation(toks, max_gen_length=64)

            toks = toks.to(model.device)

            out = model.generate(**toks, max_new_tokens=64, eos_token_id=tokenizer.eop_token_id, do_sample=True)

            result = tokenizer.decode(out[0].tolist())
            result = result.split("<|startofpiece|>")[1].replace("<|endofpiece|>", "").strip()
            predicts.append(result)
            #print(result, "\n")

    return predicts
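
The post-processing inside the loop assumes that the decoded string always contains `<|startofpiece|>`. A slightly more defensive variant is sketched below. The helper name `extract_summary` is hypothetical, and returning an empty string when the marker is missing is a design choice of this sketch, not something the notebook does. The sample string is a shortened version of the decoded output shown in the inference section.

```python
START = "<|startofpiece|>"
END = "<|endofpiece|>"

def extract_summary(decoded: str) -> str:
    """Return the text between the start-of-piece and end-of-piece markers."""
    # partition never raises: if START is missing, `found` is the empty string.
    _, found, tail = decoded.partition(START)
    if not found:
        return ""
    # Keep only what precedes END; if END is absent, keep the whole tail.
    return tail.split(END)[0].strip()

sample = (
    "<s>summarize: some article.[MASK]</s> "
    "<|startofpiece|> 2,000 BC tombs found in Nile Delta cave <|endofpiece|>"
)
print(extract_summary(sample))  # 2,000 BC tombs found in Nile Delta cave
```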

In [41]:
res = infer_test()

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50266 for open-end generation.



In [48]:
for i in range(10):
    print(i)
    print("sum: ", split_data["test"][i]["summary"])
    print("pred: ", res[i])

0
sum:  Archaeologists find ancient tombs in Nile Delta: Egypt 
pred:  Archaeologists find 2,500-yr-old Egyptian tomb
1
sum:  Martyred terrorist-turned-soldier Nazir Wani to get Ashoka Chakra
pred:  Lance Naik Nazir Wani to be honoured for bravery at war
2
sum:  Rahul has accepted he can't do politics all alone: Sumitra Mahajan
pred:  Rahul Gandhi is taking Priyanka's help: Sumitra Mahajan
3
sum:  Venezuela opposition leader declares himself interim President
pred:  UNSC appoints acting president to lead Venezuela to new polls: Reports
4
sum:  Pics of asteroid that may hit Earth in next century released
pred:  NASA sends most detailed image of asteroid Bennu
5
sum:  This time allowed to enjoy moment: De Villiers post Osaka's win
pred:  Serena's Serena's win allowed her to celebrate it: De Villiers
6
sum:  Footballer's ex-girlfriend blames football mafia for missing plane
pred:  Football mafia is behind taxi crash which leaves 28-year-old dead
7
sum:  10 parties from northeast to oppose

In [42]:
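# The results look good, even though the scores are not that high.

rouge = evaluate.load("rouge")
preds = res
# References must be the gold summaries, not the source articles.
# Note: the scores printed below come from the original run, which used t["text"] as references.
labels = [t["summary"] for t in split_data["test"]]
rouge.compute(predictions=preds, references=labels)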

{'rouge1': 0.21651557658772205,
 'rouge2': 0.09383936191422657,
 'rougeL': 0.1803071161933874,
 'rougeLsum': 0.18065712448698462}
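
To make these scores easier to interpret: ROUGE-1 measures unigram overlap between a prediction and its reference. The sketch below is a simplified, hypothetical re-implementation (lowercasing, whitespace tokenization, no stemming), so its numbers will not match the `evaluate` implementation exactly. It is applied to the first prediction/summary pair printed earlier.

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a prediction and a reference (simplified ROUGE-1)."""
    pred = Counter(prediction.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps, for each word, the smaller of the two counts.
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

pred = "Archaeologists find 2,500-yr-old Egyptian tomb"
ref = "Archaeologists find ancient tombs in Nile Delta: Egypt"
print(round(rouge1_f1(pred, ref), 3))  # 0.308
```

Only two of the five predicted words appear verbatim in the reference, so a title that reads as a reasonable paraphrase still gets a modest score.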