# Open-ended generation eksperymenty


*   Model: distilgpt2"
*   Zbiór danych: wikitext



In [1]:
! pip install datasets transformers

Collecting datasets
  Downloading datasets-1.18.0-py3-none-any.whl (311 kB)
[K     |████████████████████████████████| 311 kB 5.4 MB/s 
[?25hCollecting transformers
  Downloading transformers-4.15.0-py3-none-any.whl (3.4 MB)
[K     |████████████████████████████████| 3.4 MB 31.0 MB/s 
Collecting fsspec[http]>=2021.05.0
  Downloading fsspec-2022.1.0-py3-none-any.whl (133 kB)
[K     |████████████████████████████████| 133 kB 2.3 MB/s 
Collecting aiohttp
  Downloading aiohttp-3.8.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)
[K     |████████████████████████████████| 1.1 MB 36.9 MB/s 
[?25hCollecting xxhash
  Downloading xxhash-2.0.2-cp37-cp37m-manylinux2010_x86_64.whl (243 kB)
[K     |████████████████████████████████| 243 kB 32.9 MB/s 
Collecting huggingface-hub<1.0.0,>=0.1.0
  Downloading huggingface_hub-0.4.0-py3-none-any.whl (67 kB)
[K     |████████████████████████████████| 67 kB 4.7 MB/s 
Collecting pyyaml
  Downloadin

**Przygotowanie zbioru danych, modelu i niezbędnych bibliotek**

In [2]:
from datasets import load_dataset, DatasetDict
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments, pipeline, set_seed
import torch.nn.functional as F
import torch
import math

In [3]:
raw_datasets_train = load_dataset("wikitext", "wikitext-103-v1", split='train')
raw_datasets_val = load_dataset("wikitext", "wikitext-103-v1", split='validation')
raw_datasets_test = load_dataset("wikitext", "wikitext-103-v1", split='test')

Downloading:   0%|          | 0.00/2.03k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.25k [00:00<?, ?B/s]

Downloading and preparing dataset wikitext/wikitext-103-v1 (download: 181.42 MiB, generated: 522.23 MiB, post-processed: Unknown size, total: 703.64 MiB) to /root/.cache/huggingface/datasets/wikitext/wikitext-103-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126...


Downloading:   0%|          | 0.00/190M [00:00<?, ?B/s]

0 examples [00:00, ? examples/s]

0 examples [00:00, ? examples/s]

0 examples [00:00, ? examples/s]

Dataset wikitext downloaded and prepared to /root/.cache/huggingface/datasets/wikitext/wikitext-103-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126. Subsequent calls will reuse this data.


Reusing dataset wikitext (/root/.cache/huggingface/datasets/wikitext/wikitext-103-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126)
Reusing dataset wikitext (/root/.cache/huggingface/datasets/wikitext/wikitext-103-v1/1.0.0/a241db52902eaf2c6aa732210bead40c090019a499ceb13bcbfa3f8ab646a126)


In [4]:
raw_datasets = {'train':raw_datasets_train,'validation':raw_datasets_val,'test':raw_datasets_test}

In [5]:
raw_datasets = DatasetDict(raw_datasets)

In [6]:
raw_datasets

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 1801350
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 3760
    })
    test: Dataset({
        features: ['text'],
        num_rows: 4358
    })
})

In [7]:
model_checkpoint = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_checkpoint)

Downloading:   0%|          | 0.00/762 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/0.99M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/446k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.29M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/336M [00:00<?, ?B/s]

**Przetworzenie danych**

In [8]:
def tokenize_function(examples):
    exp = [e[:1024] for e in examples['text']]
    return tokenizer(exp)

In [9]:
tokenized_datasets = raw_datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])

In [10]:
block_size = 128
def group_texts(examples):
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

In [11]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1,
    num_proc=4,
)

**Reimplementacja funkcji straty ScaleGrad**

In [12]:
#reimplementacja ScaleGrad
def getNovelMask(target, vocab_size):
    b,l = target.size()
    zeros = torch.zeros(b,l,vocab_size).to(target.device)
    ones = torch.ones(b,l,vocab_size).to(target.device)
    target_index = target.unsqueeze(1).expand(b,l,l).transpose(-2,-1).triu().transpose(-2,-1)
    target_index[target_index<=0] = 1
   
    matrix = zeros.scatter_add_(2, target_index, ones)
    matrix[:,:,0] = 0
    summ_true = torch.tensor(range(1,l+1)).unsqueeze(0).float().to(target.device)
    summ_now = torch.sum(matrix,dim=-1)
    diff = summ_true - summ_now
    matrix[:,:,0] = diff
    matrix = torch.cat((torch.zeros(b,1,vocab_size).to(target.device),matrix[:,:-1,:]),1)
    novel_mask = matrix < 1.

    return novel_mask

def sg_loss(inputs, labels, logits):
    inp = inputs
    pad = torch.ones((logits.size(-1)))
    target = labels
    target_to_loss = torch.flatten(labels)
    num = int(logits.size(-1)/target_to_loss.shape[0]) + 1
    target_to_loss_pad = target_to_loss.repeat(num)
    logits = logits
  
    # ScaleGrad
    probs = F.softmax(logits, dim=-1)
    novel_mask = getNovelMask(target,logits.size(-1))
    rep_mask = ~novel_mask

    new_probs = probs * novel_mask * gamma + probs * rep_mask + 1e-8
    new_probs = F.normalize(new_probs, p=1, dim=-1)
    l_probs_to_loss = torch.log(torch.argmax(new_probs,dim=-1))
    l_probs_to_loss = torch.flatten(l_probs_to_loss).clone().detach().requires_grad_(True)#torch.tensor(torch.flatten(l_probs_to_loss),requires_grad=True)
    l_probs_to_loss_pad = l_probs_to_loss.repeat(num)
    loss = -F.nll_loss(l_probs_to_loss_pad, target_to_loss_pad.long(), reduction='sum')
    ntokens = inp['input_ids'].numel()

    return loss / (ntokens*num)

**Przygotowanie treningu**

In [13]:
#dla ScaleGrad gamma = 0.2, dla MLE: gamma = 1.0
gamma = 0.2
class ScaleGradTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.get("labels")
        outputs = model(**inputs)
        logits = outputs.get('logits')
        loss = sg_loss(inputs,labels,logits)
        return (loss, outputs) if return_outputs else loss

In [14]:
batch_size = 1
model_name = model_checkpoint.split("/")[-1]
training_args = TrainingArguments(
    f"{model_name}-finetuned-wikitext2",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=False,
    num_train_epochs = 1.0,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    save_strategy='epoch',
)

In [15]:
trainer = ScaleGradTrainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
)

**Fintuning modelu**

In [16]:
trainer.train()

***** Running training *****
  Num examples = 424777
  Num Epochs = 1
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Gradient Accumulation steps = 1
  Total optimization steps = 424777


Epoch,Training Loss,Validation Loss


Epoch,Training Loss,Validation Loss
1,0.0001,0.000127


***** Running Evaluation *****
  Num examples = 895
  Batch size = 1
Saving model checkpoint to distilgpt2-finetuned-wikitext2/checkpoint-424777
Configuration saved in distilgpt2-finetuned-wikitext2/checkpoint-424777/config.json
Model weights saved in distilgpt2-finetuned-wikitext2/checkpoint-424777/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=424777, training_loss=0.00012487241219763328, metrics={'train_runtime': 17053.7879, 'train_samples_per_second': 24.908, 'train_steps_per_second': 24.908, 'total_flos': 1.3874106228277248e+16, 'train_loss': 0.00012487241219763328, 'epoch': 1.0})

In [17]:
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

***** Running Evaluation *****
  Num examples = 895
  Batch size = 1


Perplexity: 1.00


**Zapisanie modelu i zebranie wyników**

In [None]:
model.save_pretrained('pretrained_gpt2_scalegrad')
tokenizer.save_pretrained('pretrained_tokenizer_gpt2_scalegrad')

In [23]:
from transformers import pipeline, set_seed
generator = pipeline('text-generation', model='pretrained_gpt2_scalegrad', tokenizer='pretrained_tokenizer_gpt2_scalegrad')
set_seed(42) 
exp1 = generator("Robert Boulter is an English film, television and theatre actor.", max_length=100, num_return_sequences=1)
exp2 = generator('You’s patriotism, and Mei’s reflections on the quotidian are a few examples.', max_length=100, num_return_sequences=1)
exp3 = generator("Deep Learning is very interesting, beacause ", max_length=100, num_return_sequences=1)
text2 = 'You’s patriotism, and Mei’s reflections on the quotidian are a few examples.'
text1 = 'Robert Boulter is an English film, television and theatre actor.'
text3 = "Deep Learning is very interesting, beacause "
res1 = exp1[0]['generated_text']
res2 = exp2[0]['generated_text']
res3 = exp3[0]['generated_text']

loading configuration file pretrained_gpt2_scalegrad/config.json
Model config GPT2Config {
  "_name_or_path": "pretrained_gpt2_scalegrad",
  "_num_labels": 1,
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "id2label": {
    "0": "LABEL_0"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0
  },
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 6,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }


**Prezentacja wyników**

In [24]:
print('Przykład 1:')
print('Początek sekwencji: ',text1)
print('Wygenerowane przez model: ', res1)

Przykład 1:
Początek sekwencji:  Robert Boulter is an English film, television and theatre actor.
Wygenerowane przez model:  Robert Boulter is an English film, television and theatre actor. Actor and director was elected to British Columbia International Film Critics' Choice (BFI), the U.K. Film Festival. He earned recognition from the U.K. Academy of Drama Arts and Sciences at Queen's University as a producer in his native Scotland. His last film, In the Flesh (1983), was nominated by WME as Best New Actor. In 2010, he produced the script for the film I Can't Wait


In [25]:
print('Przykład 2:')
print('Początek sekwencji: ',text2)
print('Wygenerowane przez model: ', res2)

Przykład 2:
Początek sekwencji:  You’s patriotism, and Mei’s reflections on the quotidian are a few examples.
Wygenerowane przez model:  You’s patriotism, and Mei’s reflections on the quotidian are a few examples.


In [26]:
print('Przykład 3:')
print('Początek sekwencji: ',text3)
print('Wygenerowane przez model: ', res3)

Przykład 3:
Początek sekwencji:  Deep Learning is very interesting, beacause 
Wygenerowane przez model:  Deep Learning is very interesting, beacause ??? #pink is more than a question! We have developed an amazing world of learning and learning that is so fascinating! From being the youngest person to the oldest to being all the older! It’s the ultimate learning experience you can’t overlook, as we develop the highest level level of learning possible!


If you've made an investment in learning this way, you may already know about these very basic lessons. Our
