# JavaSpektrum 04/2023 - Example code for fine-tuning a simple GPT-2 model with ClearML


This notebook runs only on Linux machines because of its dependency on bitsandbytes. A CUDA environment is recommended for training the small model.
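
The platform requirement can be checked before running the remaining cells. The following is a minimal sketch and not part of the original notebook: the helper name `check_environment` is hypothetical, and it assumes that PyTorch, if installed, is importable as `torch`.

```python
import importlib.util
import platform


def check_environment():
    """Report whether the machine runs Linux and whether CUDA is visible to PyTorch."""
    report = {
        "is_linux": platform.system() == "Linux",
        # None means PyTorch is not installed, so CUDA could not be probed
        "cuda_available": None,
    }
    if importlib.util.find_spec("torch") is not None:
        import torch  # imported lazily so the check also works without PyTorch

        report["cuda_available"] = torch.cuda.is_available()
    return report


print(check_environment())
```

A value of `None` for `cuda_available` only says that the probe could not run; it does not mean that no GPU is present.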

[The source code is inspired by this repository](https://github.com/philschmid/fine-tune-GPT-2/tree/master)

[Download the data from here and store it as recipes.json](https://www.kaggle.com/sterby/german-recipes-dataset)

## Initialize ClearML

In [2]:
%env CLEARML_WEB_HOST=https://app.clear.ml
%env CLEARML_API_HOST=https://api.clear.ml
%env CLEARML_FILES_HOST=https://files.clear.ml
%env CLEARML_API_ACCESS_KEY=<put your key here>
%env CLEARML_API_SECRET_KEY=<put your secret here>

env: CLEARML_WEB_HOST=https://app.clear.ml
env: CLEARML_API_HOST=https://api.clear.ml
env: CLEARML_FILES_HOST=https://files.clear.ml
env: CLEARML_API_ACCESS_KEY=<put your key here>
env: CLEARML_API_SECRET_KEY=<put your secret here>
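
Before a task is created, it can help to verify that all five settings are present. This is a minimal sketch that assumes the variable names from the cell above; the helper `missing_clearml_settings` is hypothetical and not part of the ClearML API.

```python
import os

# Names taken from the %env lines in the cell above
REQUIRED_SETTINGS = (
    "CLEARML_WEB_HOST",
    "CLEARML_API_HOST",
    "CLEARML_FILES_HOST",
    "CLEARML_API_ACCESS_KEY",
    "CLEARML_API_SECRET_KEY",
)


def missing_clearml_settings(env=None):
    """Return the names of required settings that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]


missing = missing_clearml_settings()
if missing:
    print("Missing ClearML settings:", ", ".join(missing))
else:
    print("All ClearML settings are present.")
```

Passing a plain dictionary instead of `os.environ` makes the check easy to try out without touching the real environment.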


In [3]:
from clearml import Task

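# Task.init (unlike Task.create) attaches the task to the current process so that its output is logged
task = Task.init(project_name='JavaSpektrumArtikel', task_name='train_gpt-2', task_type=Task.TaskTypes.training)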

## Prepare the dataset for training

In [4]:
import re
import json
from sklearn.model_selection import train_test_split

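with open('recipes.json', encoding="utf-8") as f:
    data = json.load(f)

def build_text_files(data_json, dest_path):
    """Write the instructions of all recipes into one text file, one after another."""
    # The with block closes the file and flushes the written text
    with open(dest_path, 'w', encoding="utf-8") as f:
        for texts in data_json:
            summary = str(texts['Instructions']).strip()
            # Replace every whitespace character (newline, tab, ...) with a plain space
            summary = re.sub(r"\s", " ", summary)
            f.write(summary + "  ")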

train, test = train_test_split(data, test_size=0.15)

build_text_files(train, './train_dataset.txt')
build_text_files(test, './test_dataset.txt')

print(f"Train dataset length is {len(train)}")
print(f"Test dataset length is {len(test)}")


Train dataset length is 10361
Test dataset length is 1829
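
To see what `build_text_files` writes, the sketch below applies the same whitespace normalization to two made-up records. The sample records and the helper name `normalize_instructions` are illustrative assumptions; only the `Instructions` key and the regular expression come from the cell above.

```python
import re


def normalize_instructions(record):
    """Strip the text and replace each whitespace character with one space."""
    text = str(record["Instructions"]).strip()
    return re.sub(r"\s", " ", text)


# Made-up records that mimic the structure of recipes.json
samples = [
    {"Instructions": "Nudeln kochen.\nFleisch anbraten.\n"},
    {"Instructions": "\tZwiebeln schneiden.\r\nAlles mischen."},
]

# Recipes are joined with two spaces, as in build_text_files
corpus = "".join(normalize_instructions(record) + "  " for record in samples)
print(repr(corpus))
```

Because the pattern is `\s` and not `\s+`, each whitespace character is replaced individually, so a Windows line ending (`\r\n`) becomes two spaces.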


## Download tokenizer from HuggingFace, use pretrained `german-gpt2`

In [5]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("anonymous-german-nlp/german-gpt2")

train_path = './train_dataset.txt'
test_path = './test_dataset.txt'

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


## Load datasets with `TextDataset` and use `DataCollator` for the model

In [6]:
from transformers import TextDataset, DataCollatorForLanguageModeling

def load_dataset(train_path, test_path, tokenizer):
    # Each dataset splits its text file into examples of block_size tokens
    train_dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=train_path,
        block_size=128)

    test_dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=test_path,
        block_size=128)

    # mlm=False selects causal (next-token) language modeling instead of masked language modeling
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False,
    )
    return train_dataset, test_dataset, data_collator

train_dataset, test_dataset, data_collator = load_dataset(train_path, test_path, tokenizer)
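
`block_size=128` means that the tokenized file is cut into consecutive training examples of 128 tokens each. The following is a simplified sketch of that chunking idea in plain Python, not the library's implementation; the function name `chunk_into_blocks` is hypothetical, and dropping an incomplete trailing block is an assumption of this sketch.

```python
def chunk_into_blocks(token_ids, block_size):
    """Split a flat list of token ids into consecutive blocks of equal length.

    A trailing remainder shorter than block_size is dropped, so no padding is needed.
    """
    last_start = len(token_ids) - block_size + 1
    return [token_ids[i:i + block_size] for i in range(0, last_start, block_size)]


# Ten stand-in token ids cut into blocks of four: the last two ids are dropped
blocks = chunk_into_blocks(list(range(10)), block_size=4)
print(blocks)
```

Since the recipes are concatenated into one long text, a block can start in the middle of one recipe and end in the next.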



## Initialize `Trainer` with `TrainingArguments` and GPT-2 model


In [7]:
from transformers import Trainer, TrainingArguments, AutoModelForCausalLM

# GPT-2 is a causal language model, so it is loaded with AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("anonymous-german-nlp/german-gpt2")

training_args = TrainingArguments(
    output_dir="./gpt-rezepte",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    eval_steps=400,  # has an effect only when a step-based evaluation strategy is enabled
    save_steps=800,
    warmup_steps=500,
    prediction_loss_only=True,
    )
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)



## Train and save the model

In [8]:
trainer.train()

ClearML Task: created new task id=c122928e700b47da9228c2d4b45d6e01
2023-07-27 16:16:10,574 - clearml.Task - INFO - Storing jupyter notebook directly as code
ClearML results page: https://app.clear.ml/projects/432a958e01444ec18f42b051c34ab02f/experiments/c122928e700b47da9228c2d4b45d6e01/output/log


Unsupported key of type '<class 'int'>' found when connecting dictionary. It will be converted to str


| Step | Training Loss |
|------|---------------|
| 500  | 3.0157 |
| 1000 | 2.4668 |
| 1500 | 2.2561 |
| 2000 | 2.1195 |
| 2500 | 1.9509 |
| 3000 | 1.8901 |
| 3500 | 1.8373 |
| 4000 | 1.7751 |
| 4500 | 1.6884 |
| 5000 | 1.6396 |


TrainOutput(global_step=5910, training_loss=1.9978587813788864, metrics={'train_runtime': 620.8795, 'train_samples_per_second': 76.121, 'train_steps_per_second': 9.519, 'total_flos': 3087296004096000.0, 'train_loss': 1.9978587813788864, 'epoch': 3.0})
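
The step count in this output can be cross-checked with simple arithmetic. The sketch below assumes a single device and no gradient accumulation (neither is configured in `TrainingArguments` above) and estimates the number of training blocks from the reported throughput, so the example count is an approximation and not a logged value. The function name `expected_steps` is hypothetical.

```python
import math


def expected_steps(num_examples, batch_size, epochs):
    """Optimizer steps when every epoch runs ceil(num_examples / batch_size) batches."""
    return math.ceil(num_examples / batch_size) * epochs


# Values copied from the TrainOutput and TrainingArguments above
train_runtime = 620.8795
samples_per_second = 76.121
epochs = 3
batch_size = 8

# Estimated number of 128-token blocks in the training set
estimated_examples = round(samples_per_second * train_runtime / epochs)

print(estimated_examples)
print(expected_steps(estimated_examples, batch_size, epochs))
```

Under these assumptions the estimate reproduces the reported `global_step=5910`.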

In [9]:
trainer.save_model()

## Use and test the model

In [10]:
from transformers import pipeline

recipe_assistant = pipeline('text-generation',model='./gpt-rezepte', tokenizer='anonymous-german-nlp/german-gpt2')

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [11]:
generation = recipe_assistant('Die Nudeln Kochen, Fleisch anbraten')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [12]:
print(generation[0]["generated_text"])

Die Nudeln Kochen, Fleisch anbraten und mit der Sojasilie und dem Currypulver würzen.Den Mais in kleine Stücke schneiden und dazu geben. Gut verrühren und mit Salz und Zucker abschmecken. Die Nudeln


## Generate some predictions to calculate the BLEU score later

In [13]:
original_and_predictions = []
for test_dict in test[0:2]:
    original_recipe = test_dict["Instructions"]
    # Use the first 60 characters of the original recipe as the prompt
    generation = recipe_assistant(original_recipe[0:60])
    original_and_predictions.append((original_recipe, generation[0]["generated_text"]))



Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


## Evaluate the predictions with the average BLEU score

In [14]:
import statistics
from nltk.translate.bleu_score import sentence_bleu

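scores = []

# sentence_bleu expects a list of tokenized references and a tokenized hypothesis;
# raw strings would be compared character by character and score close to zero
for original, prediction in original_and_predictions:
    reference = original.split()
    candidate = prediction.split()
    scores.append(sentence_bleu([reference], candidate))

print(statistics.mean(scores))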

1.1440072913618045e-231


The hypothesis contains 0 counts of 2-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 3-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
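
The warnings refer to n-gram overlaps between hypothesis and reference. As an illustration of that building block, here is a simplified sketch of clipped n-gram precision on whitespace tokens. It is not NLTK's implementation, and it omits BLEU's brevity penalty and the geometric mean over n-gram orders; the function names and the two example sentences are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(reference, hypothesis, n):
    """Share of hypothesis n-grams that also occur in the reference.

    Each n-gram is counted at most as often as it appears in the reference.
    """
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return overlap / total


# Made-up example sentences, tokenized on whitespace
reference = "die nudeln kochen und abgießen".split()
hypothesis = "die nudeln kochen und würzen".split()

for n in (1, 2, 3):
    print(n, clipped_precision(reference, hypothesis, n))
```

If any of these precisions is zero, the unsmoothed BLEU score collapses to zero, which is the situation the warnings describe.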


## Log the BLEU score in ClearML

In [15]:
task.get_logger().report_scalar(title="Average BLEU score", series="score", value=statistics.mean(scores), iteration=0)