<a href="https://colab.research.google.com/github/nedokormysh/GB_NLP_intro/blob/lesson14/NLP_intro_hw_14.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Fine-tune BERT for the NER task

In [90]:
! pip install datasets transformers seqeval -q

In [91]:
task = "ner" # Should be one of "ner", "pos" or "chunk"
model_checkpoint = "distilbert-base-uncased"
batch_size = 16

## Loading the dataset

For training we use CoNLL-2003, an English corpus annotated with named-entity, part-of-speech, and chunk tags.

We load it with the `datasets` library because it is convenient.

In [92]:
from datasets import load_dataset, load_metric

In [93]:
datasets = load_dataset("conll2003")
print(len(datasets))




3


An example document:

In [94]:
datasets["train"][0]

{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}

In [95]:
datasets["train"].features[f"{task}_tags"]

Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)

In [96]:
label_list = datasets["train"].features[f"{task}_tags"].feature.names
label_list

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

In [97]:
datasets["train"][0]['tokens']

['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']

In [98]:
def extract_labels(item):
    words = item['tokens']
    word_labels = item['ner_tags']

    return {'tokens': words, 'tags': word_labels}

In [99]:
extract_labels(datasets["train"][0])

{'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}

In [100]:
from sklearn.model_selection import train_test_split
ner_data = [extract_labels(item) for item in datasets["train"]]
ner_train, ner_test = train_test_split(ner_data, test_size=0.1, random_state=1)

A sample of the data:

In [101]:
import pandas as pd
pd.options.display.max_colwidth = 300
pd.DataFrame(ner_train).sample(3)

| | tokens | tags |
|---|---|---|
| 2215 | [10., Kispest, 3, 1, 1, 1, 6, 7, 4] | [0, 3, 0, 0, 0, 0, 0, 0, 0] |
| 8999 | [National, League] | [7, 8] |
| 3386 | [Motor, gasoline, stocks, dipped, slightly, as, barges, left, for, Germany, ,, but, there, were, few, inflows, of, cargoes, .] | [0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0] |


Collect all label types into a list.

In [102]:
label_list = sorted({label for item in ner_train for label in item['tags']})
if 'O' in label_list:
    label_list.remove('O')
    label_list = ['O'] + label_list
label_list

[0, 1, 2, 3, 4, 5, 6, 7, 8]

Put the data into a [`DatasetDict`](https://huggingface.co/docs/datasets/package_reference/main_classes.html#datasetdict) object, the native Hugging Face container.

In [103]:
from datasets import Dataset, DatasetDict

In [104]:
ner_data = DatasetDict({
    'train': Dataset.from_pandas(pd.DataFrame(ner_train)),
    'test': Dataset.from_pandas(pd.DataFrame(ner_test))
})
ner_data

DatasetDict({
    train: Dataset({
        features: ['tokens', 'tags'],
        num_rows: 12636
    })
    test: Dataset({
        features: ['tokens', 'tags'],
        num_rows: 1405
    })
})

## Preprocessing the data

In [105]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [106]:
example = ner_train[5]
print(example["tokens"])

['"', 'My', 'application', 'this', 'year', 'has', 'been', 'strange', ',', '"', 'Norman', 'said', '.', '"']


In [107]:
tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
print(tokens)

['[CLS]', '"', 'my', 'application', 'this', 'year', 'has', 'been', 'strange', ',', '"', 'norman', 'said', '.', '"', '[SEP]']


To move from the word level to the subword-token level, the texts need one more preprocessing step.

In [108]:
len(example["tags"]), len(tokenized_input["input_ids"])

(14, 16)

Thankfully, the tokenizer returns outputs that have a `word_ids` method which can help us.

In [109]:
print(tokenized_input.word_ids())

[None, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, None]


As we can see, it returns a list with the same number of elements as our processed input ids, mapping special tokens to `None` and all other tokens to their respective word. This way, we can align the labels with the processed input ids.

In [110]:
word_ids = tokenized_input.word_ids()
aligned_labels = [-100 if i is None else example["tags"][i] for i in word_ids]
print(len(aligned_labels), len(tokenized_input["input_ids"]))

16 16


Here we set the labels of all special tokens to -100 (the index that PyTorch's cross-entropy loss ignores by default) and the labels of all other tokens to the label of the word they come from. Another strategy is to set the label only on the first token obtained from a given word and give a label of -100 to the other subtokens of that word. Both strategies are implemented below; switch between them with the `label_all_tokens` flag.

In [111]:
def tokenize_and_align_labels(examples, label_all_tokens=True):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples['tags']):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            # Special tokens have a word id that is None. We set the label to -100 so they are automatically
            # ignored in the loss function.
            if word_idx is None:
                label_ids.append(-100)
            # We set the label for the first token of each word.
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            # For the other tokens in a word, we set the label to either the current label or -100, depending on
            # the label_all_tokens flag.
            else:
                label_ids.append(label[word_idx] if label_all_tokens else -100)
            previous_word_idx = word_idx

        label_ids = [label_list.index(idx) if isinstance(idx, str) else idx for idx in label_ids]

        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

In [112]:
tokenize_and_align_labels(ner_data['train'][22:23])

{'input_ids': [[101, 8856, 5529, 5054, 1006, 5842, 1007, 3786, 4754, 21298, 2121, 1006, 2660, 1007, 1021, 1011, 1020, 1006, 1023, 1011, 1021, 1007, 1020, 1011, 1017, 1021, 1011, 1020, 1006, 1022, 1011, 1020, 1007, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[-100, 1, 2, 2, 0, 5, 0, 0, 1, 2, 2, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -100]]}
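The example sentence used earlier happens to contain no split words, so the difference between the two labelling strategies is not visible there. The following is a minimal sketch of the same alignment logic on a hand-written `word_ids` list; the sentence, its word ids, and the helper name `align_labels` are hypothetical and only serve to illustrate the `label_all_tokens` flag.

```python
def align_labels(word_ids, word_labels, label_all_tokens=True):
    """Map word-level labels onto subword tokens given their word ids."""
    aligned = []
    previous = None
    for word_idx in word_ids:
        if word_idx is None:
            # Special tokens such as [CLS] and [SEP] are ignored by the loss.
            aligned.append(-100)
        elif word_idx != previous:
            # First subword of a word always carries the word's label.
            aligned.append(word_labels[word_idx])
        else:
            # Later subwords either repeat the label or are ignored.
            aligned.append(word_labels[word_idx] if label_all_tokens else -100)
        previous = word_idx
    return aligned


# Hypothetical input: three words, where word 1 is split into three subwords.
word_ids = [None, 0, 1, 1, 1, 2, None]
word_labels = [0, 1, 0]

all_tokens = align_labels(word_ids, word_labels, label_all_tokens=True)
first_only = align_labels(word_ids, word_labels, label_all_tokens=False)
print(all_tokens)  # [-100, 0, 1, 1, 1, 0, -100]
print(first_only)  # [-100, 0, 1, -100, -100, 0, -100]
```

With `label_all_tokens=True` every subword contributes to the loss, so long words weigh more; with `False` each word contributes exactly once.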

In [113]:
tokenized_datasets = ner_data.map(tokenize_and_align_labels, batched=True)

Map:   0%|          | 0/12636 [00:00<?, ? examples/s]

Map:   0%|          | 0/1405 [00:00<?, ? examples/s]

## Fine-tuning the model

In [114]:
label_list

[0, 1, 2, 3, 4, 5, 6, 7, 8]

In [115]:
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

model = AutoModelForTokenClassification.from_pretrained(model_checkpoint, num_labels=len(label_list))
model.config.id2label = dict(enumerate(label_list))
model.config.label2id = {v: k for k, v in model.config.id2label.items()}

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForTokenClassification: ['vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN t

In [116]:
args = TrainingArguments(
    "ner",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=10,
    weight_decay=0.01,
    save_strategy='no',
    report_to='none',
)

In [117]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer)

In [118]:
metric = load_metric("seqeval")

In [119]:
example = ner_train[4]
labels = example['tags']
metric.compute(predictions=[labels], references=[labels])

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  avg = a.mean(axis)
  ret = ret.dtype.type(ret / rcount)


{'overall_precision': 0.0,
 'overall_recall': 0.0,
 'overall_f1': 0.0,
 'overall_accuracy': 1.0}

In [120]:
import numpy as np

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels, zero_division=0)
    return {
        # "precision": results["overall_precision"],
        # "recall": results["overall_recall"],
        # "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

In [121]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [122]:
trainer.evaluate()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.




{'eval_loss': 2.2215518951416016,
 'eval_accuracy': 0.07958658409072629,
 'eval_runtime': 4.2149,
 'eval_samples_per_second': 333.341,
 'eval_steps_per_second': 20.878}

At the start of training we freeze all parameters of the model except the last layer and see how well it trains.

In [123]:
model.parameters

<bound method Module.parameters of DistilBertForTokenClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout):

In [124]:
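for name, param in model.named_parameters():
    # Freeze everything except the classification head (the last layer)
    param.requires_grad = name.startswith("classifier")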

In [125]:
for name, param in model.named_parameters():
    if param.requires_grad:
        print(name)
        print(param)

We can now finetune our model by just calling the `train` method:

In [126]:
import logging
from transformers.trainer import logger as noisy_logger
noisy_logger.setLevel(logging.WARNING)

In [127]:
# trainer.train()

The model underfit: it seems that more layers need to be trained. We unfreeze all of them (although unfreezing only the top few might be the better choice) and train for another 20 epochs.

In [128]:
# Unfreeze all parameters
for param in model.parameters():
    param.requires_grad = True
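The alternative mentioned above, unfreezing only the top few layers, could be sketched as a rule on parameter names. This is an illustration rather than what the notebook ran: the helper `is_trainable` and the `top_k` value are hypothetical, and the name prefixes are assumed from the module tree printed earlier (six blocks under `distilbert.transformer.layer`, plus a `classifier` head).

```python
def is_trainable(param_name, num_layers=6, top_k=2):
    """Return True for the classifier head and the top `top_k` transformer blocks."""
    if param_name.startswith("classifier."):
        return True
    prefix = "distilbert.transformer.layer."
    if param_name.startswith(prefix):
        # The block index is the first path component after the prefix.
        layer_idx = int(param_name[len(prefix):].split(".")[0])
        return layer_idx >= num_layers - top_k
    return False  # embeddings and anything else stay frozen


# Hypothetical parameter names following the printed module tree.
names = [
    "distilbert.embeddings.word_embeddings.weight",
    "distilbert.transformer.layer.0.attention.q_lin.weight",
    "distilbert.transformer.layer.4.ffn.lin1.weight",
    "distilbert.transformer.layer.5.attention.out_lin.bias",
    "classifier.weight",
]
for name in names:
    print(name, is_trainable(name))

# With a real model the rule would be applied as:
# for name, param in model.named_parameters():
#     param.requires_grad = is_trainable(name)
```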

In [129]:
args = TrainingArguments(
    "ner",
    evaluation_strategy = "epoch",
    learning_rate=1e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=20,
    weight_decay=0.01,
    save_strategy='no',
    report_to='none',
)

In [130]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [131]:
trainer.train()



Epoch,Training Loss,Validation Loss,Accuracy
1,0.3381,0.081135,0.97575
2,0.0699,0.067725,0.978806
3,0.0495,0.055973,0.984557
4,0.0323,0.057646,0.984919
5,0.024,0.06062,0.985844
6,0.015,0.061848,0.985884
7,0.0122,0.061759,0.986407
8,0.0098,0.064467,0.986809
9,0.0076,0.066416,0.986689
10,0.0055,0.068102,0.986045




TrainOutput(global_step=15800, training_loss=0.025175261531449573, metrics={'train_runtime': 1327.975, 'train_samples_per_second': 190.305, 'train_steps_per_second': 11.898, 'total_flos': 3068814187430712.0, 'train_loss': 0.025175261531449573, 'epoch': 20.0})
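As a sanity check, the reported `global_step=15800` can be reproduced from the dataset size and the training arguments. The sketch assumes a single device and no gradient accumulation, neither of which is stated explicitly in the notebook.

```python
import math

train_rows = 12636   # num_rows of the train split shown earlier
batch_size = 16      # per_device_train_batch_size
epochs = 20          # num_train_epochs

# The last, smaller batch of each epoch still counts as one optimizer step.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 790 15800
```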

In [133]:
trainer.evaluate()

{'eval_loss': 0.07764552533626556,
 'eval_accuracy': 0.9872114533901714,
 'eval_runtime': 2.7333,
 'eval_samples_per_second': 514.036,
 'eval_steps_per_second': 32.196,
 'epoch': 20.0}
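The confusion-matrix cell that follows uses `true_labels` and `true_predictions`, which are not defined at top level in the cells shown (inside `compute_metrics` they are local variables). One way they might be obtained is sketched here on toy data; the helper `strip_ignored` is hypothetical, and the commented `trainer.predict` call is an assumption about how the original notebook produced its inputs.

```python
def argmax(scores):
    """Index of the largest score in a sequence."""
    return max(range(len(scores)), key=scores.__getitem__)


def strip_ignored(logits, labels, ignore_index=-100):
    """Take the argmax per token and drop positions labelled with ignore_index."""
    true_predictions, true_labels = [], []
    for sent_logits, sent_labels in zip(logits, labels):
        preds = [argmax(token_scores) for token_scores in sent_logits]
        true_predictions.append(
            [p for p, l in zip(preds, sent_labels) if l != ignore_index]
        )
        true_labels.append([l for l in sent_labels if l != ignore_index])
    return true_predictions, true_labels


# Toy input: one sentence, four tokens, three classes.
logits = [[[0.1, 0.2, 0.7], [0.9, 0.05, 0.05], [0.2, 0.6, 0.2], [0.3, 0.3, 0.4]]]
labels = [[-100, 0, 1, -100]]

true_predictions, true_labels = strip_ignored(logits, labels)
print(true_predictions, true_labels)  # [[0, 1]] [[0, 1]]

# With the trained model this could look like:
# output = trainer.predict(tokenized_datasets["test"])
# true_predictions, true_labels = strip_ignored(output.predictions, output.label_ids)
```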

In [134]:
from sklearn.metrics import confusion_matrix
import pandas as pd

In [135]:
cm = pd.DataFrame(
    confusion_matrix(sum(true_labels, []), sum(true_predictions, []), labels=label_list),
    index=label_list,
    columns=label_list
)
cm

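Rows are true labels, columns are predicted labels.

| true / predicted | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 19433 | 4 | 2 | 32 | 27 | 6 | 0 | 22 | 5 |
| 1 | 7 | 1158 | 2 | 8 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 3 | 966 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 16 | 11 | 0 | 950 | 9 | 8 | 0 | 12 | 0 |
| 4 | 24 | 0 | 3 | 6 | 496 | 1 | 4 | 0 | 6 |
| 5 | 7 | 0 | 0 | 15 | 2 | 966 | 5 | 1 | 0 |
| 6 | 3 | 0 | 0 | 0 | 1 | 3 | 100 | 0 | 0 |
| 7 | 15 | 6 | 0 | 13 | 0 | 8 | 0 | 380 | 3 |
| 8 | 8 | 0 | 2 | 0 | 3 | 0 | 2 | 3 | 99 |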
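Token accuracy is dominated by class 0 (`O`), so per-class figures are more informative. A short sketch that derives token-level recall and precision for each class from the matrix above; the numbers are copied from the output, and rows are taken as true labels because `confusion_matrix` receives the true labels as its first argument.

```python
cm = [
    [19433, 4, 2, 32, 27, 6, 0, 22, 5],
    [7, 1158, 2, 8, 0, 0, 0, 0, 0],
    [0, 3, 966, 0, 0, 0, 0, 0, 0],
    [16, 11, 0, 950, 9, 8, 0, 12, 0],
    [24, 0, 3, 6, 496, 1, 4, 0, 6],
    [7, 0, 0, 15, 2, 966, 5, 1, 0],
    [3, 0, 0, 0, 1, 3, 100, 0, 0],
    [15, 6, 0, 13, 0, 8, 0, 380, 3],
    [8, 0, 2, 0, 3, 0, 2, 3, 99],
]
n = len(cm)

# Recall: correct predictions divided by the row sum (all tokens truly in the class).
recall = [cm[i][i] / sum(cm[i]) for i in range(n)]
# Precision: correct predictions divided by the column sum (all tokens predicted as the class).
precision = [cm[i][i] / sum(cm[r][i] for r in range(n)) for i in range(n)]
# Overall accuracy: the diagonal divided by the total; close to the reported eval_accuracy.
accuracy = sum(cm[i][i] for i in range(n)) / sum(map(sum, cm))

for i in range(n):
    print(f"class {i}: precision={precision[i]:.3f} recall={recall[i]:.3f}")
print(f"accuracy={accuracy:.4f}")
```

These are per-token scores, not the entity-level scores that `seqeval` reports.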


In [None]:
model.save_pretrained('ner_bert.bin')
tokenizer.save_pretrained('ner_bert.bin')

# 2. Fine-tune GPT for text generation

Take the data from
https://www.kaggle.com/datasets/mrapplexz/bashim-quotes

and train a GPT model to generate quotes of its own.

In [147]:
!pip install transformers sentencepiece --quiet

In [148]:
! mkdir ~/.kaggle
from google.colab import files

files.upload()

! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json

Saving kaggle.json to kaggle.json


In [149]:
! kaggle datasets download mrapplexz/bashim-quotes

Downloading bashim-quotes.zip to /content
  0% 0.00/13.1M [00:00<?, ?B/s] 69% 9.00M/13.1M [00:00<00:00, 80.9MB/s]
100% 13.1M/13.1M [00:00<00:00, 104MB/s] 


In [150]:
! unzip -q '/content/bashim-quotes.zip'

In [152]:
DATASET_PATH = '/content/dataset.jsonl'

# read_json opens the file itself, so no explicit open() is needed
df = pd.read_json(DATASET_PATH, lines=True).set_index('id')
df.head(3)

id,date,rating,text
1,2004-08-30 11:24:00+00:00,22010.0,"<Ares> ppdv, все юниксы очень дружелюбны.. они просто очень разборчивы в друзьях ;)"
2,2004-08-30 11:25:00+00:00,25105.0,"<томатик_рад> а ты не чувствуешь красоту мира?\n<fox> честно говоря, я сейчас чувствую только отсутствие http.\n<томатик_рад> не туда смотришь, глянь вокруг!\n<fox> как я гляну, если http не работает? :/"
3,2004-08-30 11:27:00+00:00,7192.0,"<Дор> ""мышка, почему у тебя такие большие глаза?"" УЙДИ!!! я ХАРАКИРИ делаю!!!!!!"


In [153]:
df.drop(['date', 'rating'], axis=1, inplace=True)
df.head()

id,text
1,"<Ares> ppdv, все юниксы очень дружелюбны.. они просто очень разборчивы в друзьях ;)"
2,"<томатик_рад> а ты не чувствуешь красоту мира?\n<fox> честно говоря, я сейчас чувствую только отсутствие http.\n<томатик_рад> не туда смотришь, глянь вокруг!\n<fox> как я гляну, если http не работает? :/"
3,"<Дор> ""мышка, почему у тебя такие большие глаза?"" УЙДИ!!! я ХАРАКИРИ делаю!!!!!!"
4,"<PPDV[os2]> ""Мальчики, вы что больные, бегать в палату к девочкам?! - Если б мы были больные - мы б бегали к другим мальчикам"""
5,<Ohtori_Akio> мы - как разработчики - живём с субейзом под одбц. \n<Ohtori_Akio> лучше бы мы жили в пещере с гоблинами.


In [158]:
import pandas as pd
import json
import torch
import random
from transformers import AutoTokenizer, AutoModelForCausalLM, TextDataset, DataCollatorForLanguageModeling, Trainer, TrainingArguments

In [159]:
model_name = 'sberbank-ai/rugpt3small_based_on_gpt2'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [160]:
sep = '\n***\n'

prefix = sep.join([''] + random.sample(list(df['text']), k=5) + [''])

tokens = tokenizer(prefix, return_tensors='pt')
tokens = {k: v.to(model.device) for k, v in tokens.items()}
end_token_id = tokenizer.encode('***')[0]
print(prefix)


***
Dago:
у нас в офисе новое растение
Grey:
Еще одного бухгалтера взяли?
***
--> Lizaveta has joined this channel (679@62.105.15.72).
<Lizaveta> Эй ей
<VorpalBunny> здраствуйте Лизочка
<VorpalBunny> скажите, какого числа вы родились?
<Lizaveta> 17
<VorpalBunny> какого месяца?
<Lizaveta> 06
<VorpalBunny> какого года?
<Lizaveta> 1987
<VorpalBunny> какого хуя?
<Lizaveta> ??????????????????????
***
Функция Wolf() в полнолуние void на луну!
***
vision: в общем, если я не могу заставить организм спать, ща пойду мыть посуду
vision: а то там небось уже тараканы в царя горы играют
***
xxx: Знакомые рссказывали. Мужик садился в машину на гаражах. Куда-то запропастился навесной замок, он второпях просто замотал дверь проволокой и уехал.
Ночью воры вскрыли несколько гаражей - три влево и три вправо от гаража этого мужика. Его гараж - с проволокой вместо замка - не тронули, а на дверях красовалась надпись баллончиком "Суперзамки не взламываем"
***



In [161]:
size = tokens['input_ids'].shape[1]
output = model.generate(
    **tokens, 
    do_sample=False,  # deterministic beam search; `temperature` only affects sampling, so it is dropped
    max_length=size+50,
    repetition_penalty=5.,
    num_beams=10,
)
decoded = tokenizer.decode(output[0])
result = decoded[len(prefix):]
print(result)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp
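The printed result starts with `nbsp;` rather than `&nbsp;`, which suggests that cutting the decoded string at `len(prefix)` characters can be slightly off when decoding does not reproduce the prompt character for character. Slicing at the token level avoids this. The sketch uses a toy vocabulary and a toy `decode` (both hypothetical) that, like many tokenizers' clean-up step, removes the space before punctuation.

```python
def continuation_only(output_ids, prompt_len, decode):
    """Decode only the tokens generated after the prompt."""
    return decode(output_ids[prompt_len:])


# Hypothetical toy vocabulary and decoder with punctuation clean-up.
vocab = {0: "hello", 1: " ,", 2: " world", 3: " !", 4: " bye"}


def toy_decode(ids):
    text = "".join(vocab[i] for i in ids)
    return text.replace(" ,", ",").replace(" !", "!")


prompt = "hello , world !"
prompt_ids = [0, 1, 2, 3]
output_ids = prompt_ids + [4]  # the "model" generated one extra token

decoded = toy_decode(output_ids)
by_chars = decoded[len(prompt):]  # character slicing cuts into the continuation
by_tokens = continuation_only(output_ids, len(prompt_ids), toy_decode)
print(repr(by_chars), repr(by_tokens))  # 'ye' ' bye'

# With the real model the equivalent would be:
# result = tokenizer.decode(output[0][size:])
```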


Let's fine-tune the model.

In [162]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df.loc[:10000, 'text'], test_size=0.15)

In [163]:
import re

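def build_text_files(data_json, dest_path):
    """Clean each quote and write all of them to a single text file."""
    data = ''
    for texts in data_json:
        summary = str(texts).strip()
        summary = re.sub(r"", "", summary)  # empty pattern: a no-op as written
        summary = re.sub(r"<[\w+,\!, -]>", "", summary)  # drop one-character <x> tags
        summary = re.sub(r"<\w+>", "", summary)  # drop <nickname> speaker tags
        summary = re.sub(r"\s", " ", summary)  # turn every whitespace character into a space
        data += summary + "  "
    # The context manager closes the file, which the bare open() did not guarantee
    with open(dest_path, 'w', encoding='utf-8') as f:
        f.write(data)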
  
build_text_files(train,'./train_dataset.txt')
build_text_files(test,'./test_dataset.txt')

In [164]:
print("Train dataset length: "+ str(len(train)))
print("Test dataset length: "+ str(len(test)))

Train dataset length: 1666
Test dataset length: 294


In [165]:
train_path = './train_dataset.txt'
test_path = './test_dataset.txt'

def load_dataset(train_path, test_path, tokenizer):
    train_dataset = TextDataset(
          tokenizer=tokenizer,
          file_path=train_path,
          block_size=128)

    test_dataset = TextDataset(
          tokenizer=tokenizer,
          file_path=test_path,
          block_size=128)

    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False,
    )
    return train_dataset, test_dataset, data_collator

train_dataset, test_dataset, data_collator = load_dataset(train_path, test_path, tokenizer)



In [166]:
training_args = TrainingArguments(
    output_dir="./GPT/gpt2-train", 
    overwrite_output_dir=True, 
    num_train_epochs=3, 
    per_device_train_batch_size=4, 
    per_device_eval_batch_size=4,  
    eval_steps = 400, 
    save_steps=800, 
    warmup_steps=500,
    )

In [167]:
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset
)

In [168]:
trainer.train()



Step,Training Loss
500,4.2073


TrainOutput(global_step=531, training_loss=4.184442078326382, metrics={'train_runtime': 95.4413, 'train_samples_per_second': 22.192, 'train_steps_per_second': 5.564, 'total_flos': 138354130944000.0, 'train_loss': 4.184442078326382, 'epoch': 3.0})
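The training loss of a causal language model is a mean per-token cross-entropy in nats, so exponentiating it gives a rough training perplexity. The sketch below only does that arithmetic on the reported `training_loss`; it says nothing about held-out perplexity, which would require evaluating on the test split.

```python
import math

training_loss = 4.184442078326382  # value reported in TrainOutput above

# Perplexity is the exponential of the mean cross-entropy.
train_perplexity = math.exp(training_loss)
print(f"training perplexity: {train_perplexity:.1f}")
```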

In [None]:
trainer.save_model()
tokenizer.save_pretrained('gdrive/MyDrive/GPT/gpt2-train')
model.save_pretrained('gdrive/MyDrive/GPT/model_gpt2')

Load the model.

In [170]:
tokenizer = AutoTokenizer.from_pretrained("gdrive/MyDrive/GPT/gpt2-train")
model_new = AutoModelForCausalLM.from_pretrained("gdrive/MyDrive/GPT/model_gpt2")

In [171]:
size = tokens['input_ids'].shape[1]
output = model_new.generate(
    **tokens, 
    do_sample=False,  # deterministic beam search; `temperature` only affects sampling, so it is dropped
    max_length=size+100,
    repetition_penalty=5.,
    num_beams=10,
)
decoded = tokenizer.decode(output[0])
result = decoded[len(prefix):]
print(result)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


REal_SM[wrk]] is now known as [REal_SM[wrk]] <REal_SM[wrk]> А что это такое? * Ritsuko задумался о чем-то очень серьезном...   у меня есть один знакомый программист по имени Линуксоид. У него два высших образования : математическое и гуманитарное. И вот как-то раз его попросили написать программу для того,
