# Summary

In this notebook, we train a [Causal Language Model](https://huggingface.co/docs/transformers/v4.41.2/en/tasks/language_modeling#causal-language-modeling) using the Hugging Face [Transformers library](https://huggingface.co/docs/transformers/en/index). We will use the `run_clm.py` script to fine-tune the [GPT-2](https://huggingface.co/openai-community/gpt2) model on a custom dataset that is already available on the [Hugging Face Hub](https://huggingface.co/datasets).

In [2]:
%pip install --quiet -U huggingface_hub transformers accelerate evaluate datasets


In [3]:
import torch

print("Torch version:", torch.__version__)

Torch version: 2.3.0+cu121


In [4]:
import transformers

print("Transformers version:", transformers.__version__)

Transformers version: 4.41.1


In [10]:
from huggingface_hub import notebook_login

In [None]:
notebook_login()

# Test runs

In [5]:
from transformers import pipeline, set_seed

set_seed(47)

In [7]:
generator = pipeline('text-generation', model='gpt2')

In [8]:
generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Hello, I'm a language model, one of the most important languages, I use the GNU C# Language API on my workstation, and it"},
 {'generated_text': "Hello, I'm a language model, but it's not like every single part of my project is actually a language model—it seems much a mix"},
 {'generated_text': "Hello, I'm a language model, and you must be a language model too.\n\nMy goal is to create a framework that allows many different"},
 {'generated_text': "Hello, I'm a language model, my program is a syntax model.\n\nWhat is it that makes it so that I can understand more complex"},
 {'generated_text': "Hello, I'm a language model, not a programmer. I'm teaching all types in one language: PHP. No, I'm not just putting"}]
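
The two messages printed before the results are informational. `max_length` counts the prompt tokens together with the generated ones, and GPT-2 defines no padding token, so generation falls back to the end-of-text id 50256. A minimal sketch of an alternative set of arguments, assuming the pipeline forwards these keyword arguments to `generate`; the helper name `generation_kwargs` is hypothetical, and whether both messages disappear may depend on the installed Transformers version:

```python
GPT2_EOS_TOKEN_ID = 50256  # value reported in the log message above


def generation_kwargs(max_new_tokens: int, num_return_sequences: int) -> dict:
    """Collect keyword arguments for a text-generation pipeline call."""
    return {
        # Counts only newly generated tokens, unlike max_length,
        # which also includes the prompt tokens.
        "max_new_tokens": max_new_tokens,
        "num_return_sequences": num_return_sequences,
        # GPT-2 defines no pad token, so reuse the end-of-text id explicitly.
        "pad_token_id": GPT2_EOS_TOKEN_ID,
    }


kwargs = generation_kwargs(max_new_tokens=20, num_return_sequences=5)
print(kwargs)
# In the notebook: generator("Hello, I'm a language model,", **kwargs)
```

Using `max_new_tokens` keeps the length of the continuation independent of the prompt length, which makes outputs for different prompts easier to compare.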

# Train

In [9]:
!mkdir -p data
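
The training cell below calls `run_clm.py`, which is not installed by pip; it lives in the Transformers GitHub repository under `examples/pytorch/language-modeling/`. The example scripts check for a minimum Transformers version, so a copy taken from a different release may refuse to run. A minimal sketch for fetching a matching copy, assuming the installed release is tagged `v<version>` on GitHub and the notebook has network access (the helper names `run_clm_url` and `download_run_clm` are hypothetical):

```python
import urllib.request

# Assumption: run_clm.py sits at this path in the transformers repository
# and the release is tagged "v<version>" (e.g. v4.41.1).
RAW_BASE = "https://raw.githubusercontent.com/huggingface/transformers"
SCRIPT_PATH = "examples/pytorch/language-modeling/run_clm.py"


def run_clm_url(version: str) -> str:
    """Build the raw GitHub URL of run_clm.py for a given release version."""
    return f"{RAW_BASE}/v{version}/{SCRIPT_PATH}"


def download_run_clm(version: str, dest: str = "run_clm.py") -> str:
    """Download the script next to the notebook and return the local path."""
    urllib.request.urlretrieve(run_clm_url(version), dest)
    return dest


# Only builds and prints the URL; no network access happens here.
print(run_clm_url("4.41.1"))
```

In the notebook, `download_run_clm(transformers.__version__)` would place the script in the working directory. Development versions (for example a version string ending in `.dev0`) have no release tag, so this sketch would not work for a source install.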

In [19]:
!python run_clm.py \
    --model_name_or_path openai-community/gpt2 \
    --dataset_name ndamulelonemakh/zabantu-data \
    --dataset_config_name eng \
    --validation_split_percentage 10 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --do_train \
    --do_eval \
    --output_dir data/zaf-gpt2-v1

2024-05-28 10:11:18.200346: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-05-28 10:11:18.200403: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-05-28 10:11:18.201731: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
05/28/2024 10:11:22 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
accelerator_config={'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None, 'use_configured_state': False},
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
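
With `--do_eval`, the script evaluates the model and reports an evaluation loss; when the dataset has no validation split of its own, `--validation_split_percentage 10` holds out 10% of the training data for this purpose. For a causal language model the loss is commonly converted to perplexity, the exponential of the mean cross-entropy loss. A minimal sketch of that conversion (the function name and the sample loss values are illustrative, not results from this run):

```python
import math


def perplexity_from_loss(eval_loss: float) -> float:
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    try:
        return math.exp(eval_loss)
    except OverflowError:
        # Very large losses overflow a float; treat them as infinite perplexity.
        return float("inf")


# Illustrative loss values only, not measurements from this notebook.
for loss in (2.0, 3.0, 4.0):
    print(f"eval_loss={loss:.1f} -> perplexity={perplexity_from_loss(loss):.1f}")
```

Lower perplexity means the model assigns higher probability to the held-out text, so the value is a convenient way to compare the fine-tuned model against the base GPT-2 on the same split.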

In [None]:
# # Troubleshoot transformers version
# !pip uninstall transformers -y
# !git clone https://github.com/huggingface/transformers --depth 1
# !cd transformers && pip install .

# Test New Model

In [34]:
from transformers import AutoTokenizer, GPT2LMHeadModel, pipeline

In [35]:
new_model = GPT2LMHeadModel.from_pretrained('./data/zaf-gpt2-v1')
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")

In [36]:
generator_new = pipeline('text-generation', model=new_model, tokenizer=tokenizer)

In [38]:
generator_new("The president of South Africa,",
          max_length=30,
          num_return_sequences=5)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The president of South Africa, Dr Bimbo Mpumalanga, congratulated his supporters on the results and thanked them for their hard work in'},
 {'generated_text': 'The president of South Africa, President Jacob Zuma will lead delegation of Ministers to South Africa from 3 to 4 August 2007 to attend an African Union ('},
 {'generated_text': 'The president of South Africa, Dr Thabo Mbeki, will deliver the South African National Debate Programme at the SANDF Summit in Durban on'},
 {'generated_text': 'The president of South Africa, Mr Nair, will address the conference on the 25th anniversary of the arrest and conviction of apartheid leader Nelson Mandela on'},
 {'generated_text': 'The president of South Africa, Dr Mark Lekgotla, took this moment to congratulate South Africa on the contribution it made to Africa since democracy.'}]

In [39]:
generator_new("The springboks coach,",
          max_length=300,
          num_return_sequences=1)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "The springboks coach, Tshwane Pathanathwa, who won the men's 100m freestyle relay in the 100m final of the men's women's 100m butterfly final in the 100m freestyle final.Mr Molotswane Kwa-Pini as the Chief Financial Officer of the Public Provinces Water Board.Mr Mabuzini Htatshu as the Deputy Chairperson of the Department of Water Affairs.South African Airways, which will receive an additional R300 million for the fourth quarter of 2016/17, will become the fourth carrier to roll out hybrid passenger services while the country remains committed to supporting the country economy through low cost of fuel and sustainable use of natural resources.Issues In the Environment.It will be able to meet the high demand scenario and is expected to reach capacity before the end of.The NOPMA is an annual report made annually to the South African National Police Advisory Council and was commissioned to be submitted to the President in May.Dr Gail MzwuluMandela Zondo as the Chief Fi