## Installing Required Libraries
The following cell installs the two libraries this notebook depends on. `datasets` downloads and preprocesses datasets from the Hugging Face Hub, and `transformers` provides the tokenizers, pre-trained model architectures, and the `Trainer` API used below.


In [2]:
pip install datasets transformers

Note: you may need to restart the kernel to use updated packages.


## Loading the Dataset
We load the dataset using the `load_dataset` function from the `datasets` library.

In [3]:
from datasets import load_dataset
datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')

Generating test split:   0%|          | 0/4358 [00:00<?, ? examples/s]

Generating train split:   0%|          | 0/36718 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3760 [00:00<?, ? examples/s]

In [4]:
datasets["train"][10]

{'text': ' The game \'s battle system , the BliTZ system , is carried over directly from Valkyira Chronicles . During missions , players select each unit using a top @-@ down perspective of the battlefield map : once a character is selected , the player moves the character around the battlefield in third @-@ person . A character can only act once per @-@ turn , but characters can be granted multiple turns at the expense of other characters \' turns . Each character has a field and distance of movement limited by their Action Gauge . Up to nine characters can be assigned to a single mission . During gameplay , characters will call out if something happens to them , such as their health points ( HP ) getting low or being knocked out by enemy attacks . Each character has specific " Potentials " , skills unique to each character . They are divided into " Personal Potential " , which are innate skills that remain unaltered unless otherwise dictated by the story and can either help or impede

## Function to Display Random Dataset Elements
The `show_random_elements` function displays random samples from a dataset as a table. It takes a dataset and the number of examples to show. For any `ClassLabel` column, it converts integer labels into their human-readable names; WikiText-2 has only a `text` column, so that step does nothing here.


In [5]:
from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    # Draw unique row indices without replacement.
    picks = random.sample(range(len(dataset)), num_examples)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

## Displaying Random Samples from the Dataset
Here we use the previously defined `show_random_elements` function to view a random selection of samples from the training dataset. This helps us verify that the data was loaded correctly and provides insights into its structure before further processing.


In [6]:
show_random_elements(datasets["train"])

Unnamed: 0,text
0,
1,"Kieswetter 's performances in the 2009 season led to his inclusion in the England Performance Programme squad in November and December of that year , and he was part of the England Lions squad which toured the United Arab Emirates in early 2010 , along with Somerset team @-@ mate Peter Trego . His full England debut came shortly after in Bangladesh . Additionally , two of Somerset 's young players , Jos Buttler and Calum Haggett played for England Under @-@ 19s during the English winter . \n"
2,
3,= Mothers of the Disappeared = \n
4,Florida panther ( P. c. coryi ) \n
5,"John Balfour was the British ambassador in Argentina during the Perón regime , and describes Evita 's popularity : \n"
6,
7,
8,
9,


In [7]:
model_checkpoint = "openai-community/gpt2"
tokenizer_checkpoint = "openai-community/gpt2"

In [8]:
from transformers import AutoTokenizer
    
tokenizer = AutoTokenizer.from_pretrained(tokenizer_checkpoint)




## Tokenization Function
This cell defines `tokenize_function`, which runs the tokenizer over the `text` field of the examples it receives. Tokenization splits text into subword tokens and maps them to integer ids, the units a transformer model actually consumes. For each example the tokenizer returns `input_ids` and an `attention_mask`.


In [9]:
def tokenize_function(examples):
    return tokenizer(examples["text"])

## Applying Tokenization to the Dataset
This cell applies `tokenize_function` to every split of the dataset. With `batched=True`, `map` passes the function batches of examples rather than one example at a time, `num_proc=4` spreads the work over four processes, and `remove_columns=["text"]` drops the raw text so that only the tokenizer outputs remain.


In [10]:
tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])


In [11]:
tokenized_datasets["train"][1]


{'input_ids': [796, 569, 18354, 7496, 17740, 6711, 796, 220, 198],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
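
With `batched=True`, the mapped function receives a dictionary that maps each column name to a list of values and returns a dictionary of lists. The sketch below imitates that contract in plain Python. `toy_tokenizer` and `toy_batched_map` are hypothetical stand-ins written for this illustration (whitespace splitting, not GPT-2's byte-level BPE, and not the library's implementation); they only show the shapes involved.

```python
# Hypothetical shared vocabulary so ids stay consistent across batches.
_toy_vocab = {}

def toy_tokenizer(texts):
    """Stand-in tokenizer: split on whitespace and assign ids in order of first appearance."""
    input_ids = []
    for text in texts:
        ids = [_toy_vocab.setdefault(word, len(_toy_vocab)) for word in text.split()]
        input_ids.append(ids)
    return {
        "input_ids": input_ids,
        "attention_mask": [[1] * len(ids) for ids in input_ids],
    }

def toy_tokenize_function(examples):
    # `examples` is a dict of lists, e.g. {"text": ["...", "..."]}.
    return toy_tokenizer(examples["text"])

def toy_batched_map(rows, fn, batch_size=2, remove_columns=()):
    """Imitate map(batched=True): call fn on column-wise batches, then rebuild rows."""
    out_rows = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        result = fn(batch)
        merged = {k: v for k, v in batch.items() if k not in remove_columns}
        merged.update(result)
        n = len(next(iter(merged.values())))
        out_rows.extend({k: merged[k][i] for k in merged} for i in range(n))
    return out_rows

rows = [{"text": "the cat sat"}, {"text": ""}, {"text": "the dog sat"}]
tokenized = toy_batched_map(rows, toy_tokenize_function, remove_columns=("text",))
print(tokenized[0])  # {'input_ids': [0, 1, 2], 'attention_mask': [1, 1, 1]}
```

Empty strings, which are common in WikiText-2, simply produce empty `input_ids` and `attention_mask` lists.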

In [12]:
# block_size = tokenizer.model_max_length
block_size = 128

## Function: `group_texts`

The `group_texts` function turns tokenized examples of varying length into fixed-size blocks for language-model training. It concatenates all sequences in a batch, drops the remainder that does not fill a whole block of `block_size` tokens, splits the rest into blocks, and copies `input_ids` to `labels`.


In [13]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the small remainder so that every block has exactly block_size tokens.
    # Padding could be used instead if the model supported it; customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
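
To make the grouping concrete, here is the same logic run on a tiny, made-up batch. The function body is repeated so the sketch runs on its own; `block_size = 4` and the toy token ids are assumptions chosen for readability (the notebook uses 128 and real GPT-2 ids).

```python
block_size = 4  # tiny value for illustration only

def group_texts(examples):
    # Concatenate every column's lists into one long list.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Keep only as many tokens as fill whole blocks.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

# Hypothetical batch: three tokenized texts of lengths 3, 0, and 7 (10 tokens in total).
batch = {
    "input_ids": [[1, 2, 3], [], [4, 5, 6, 7, 8, 9, 10]],
    "attention_mask": [[1, 1, 1], [], [1, 1, 1, 1, 1, 1, 1]],
}
grouped = group_texts(batch)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(grouped["labels"])     # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Two things are visible here. Blocks can span the boundary between two original texts, and the last two tokens (9 and 10) are dropped because they do not fill a block. The labels are an unshifted copy of the inputs because the causal language models in Transformers shift the labels by one position internally when computing the loss.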

In [14]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)

In [15]:
tokenizer.decode(lm_datasets["train"][1]["input_ids"])

' game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " Calamaty Raven " . \n The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more forgiving for series newcomers . Character designer Raita Honjou and composer Hitoshi Sakimoto both returned from previous entries , along with Valkyria Chronicles II director Takeshi Oz'

## Loading the Model and Training Arguments

In [16]:
from transformers import AutoConfig, AutoModelForCausalLM

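# Load only the GPT-2 architecture settings, not the pre-trained weights.
config = AutoConfig.from_pretrained(model_checkpoint)
# from_config builds the model with randomly initialized weights, so training starts from scratch.
model = AutoModelForCausalLM.from_config(config)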

In [17]:
from transformers import Trainer, TrainingArguments


## Defining Training Arguments
The `TrainingArguments` object configures the training run. Here we set the output directory for checkpoints, evaluate at the end of every epoch, use a learning rate of 2e-5 with a weight decay of 0.01, and push the result to the Hugging Face Hub. Other settings, such as the batch size and the number of epochs, keep their defaults and can be adjusted to fit the available hardware.


In [18]:
training_args = TrainingArguments(
    output_dir=f"{model_checkpoint}-wikitext2",
    eval_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
    push_to_hub=True,
)

In [19]:
from huggingface_hub import notebook_login

notebook_login()

## Initializing the Trainer

The `Trainer` class wraps the training and evaluation loops. We pass it the model, the training arguments, and the grouped training and validation splits.

In [20]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
)

In [21]:
trainer.train()


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 6.6329 | 6.468777 |
| 2 | 6.2558 | 6.238887 |
| 3 | 6.0552 | 6.169648 |




TrainOutput(global_step=3501, training_loss=6.428747082873434, metrics={'train_runtime': 1580.4784, 'train_samples_per_second': 35.431, 'train_steps_per_second': 2.215, 'total_flos': 3657957801984000.0, 'train_loss': 6.428747082873434, 'epoch': 3.0})
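
The figures in `TrainOutput` can be cross-checked against each other. The notebook never states the batch size or the number of training blocks, so the sketch below infers both from the logged throughput; treat those two values as assumptions rather than confirmed settings.

```python
import math

# Figures copied from the TrainOutput log above.
train_runtime = 1580.4784       # seconds
samples_per_second = 35.431
steps_per_second = 2.215
num_epochs = 3

# Assumption: samples processed per optimizer step, inferred from the two throughput figures.
effective_batch = round(samples_per_second / steps_per_second)
# Assumption: training blocks per epoch, inferred from throughput x runtime / epochs.
blocks_per_epoch = round(samples_per_second * train_runtime / num_epochs)

# The last, partially filled batch of each epoch still counts as a step.
steps_per_epoch = math.ceil(blocks_per_epoch / effective_batch)
total_steps = steps_per_epoch * num_epochs
print(effective_batch, blocks_per_epoch, total_steps)  # 16 18666 3501
```

The result matches the logged `global_step=3501`. An effective batch of 16 would be consistent with the default per-device batch size of 8 on two devices, though the notebook does not say which hardware was used.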

## Evaluating the Model and Calculating Perplexity

We evaluate the trained model on the validation split with `trainer.evaluate()` and convert the resulting loss into perplexity, the exponential of the mean per-token cross-entropy loss. Lower perplexity is better.

In [22]:
import math
eval_results = trainer.evaluate()
print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")

Perplexity: 478.02
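
As a quick check of the relationship between loss and perplexity, the reported value can be reproduced from the epoch-3 validation loss in the training log. The helper name `perplexity` is introduced here only for illustration.

```python
import math

def perplexity(mean_nll):
    """Perplexity is the exponential of the mean per-token negative log-likelihood (in nats)."""
    return math.exp(mean_nll)

# Validation loss after epoch 3, copied from the training log above.
final_eval_loss = 6.169648
ppl = perplexity(final_eval_loss)
print(f"Perplexity: {ppl:.2f}")  # Perplexity: 478.02
```

For scale, a model that guessed uniformly over GPT-2's 50,257-token vocabulary would have a perplexity of 50,257, so a value near 478 means the model has learned some structure but remains far from a well-trained language model.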


In [24]:
trainer.push_to_hub("gpt2-model")

No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/MohamedTalaat91/gpt2-wikitext2/commit/c39ebbbc171839c218fc35248d9ed3939534cdec', commit_message='gpt2-model', commit_description='', oid='c39ebbbc171839c218fc35248d9ed3939534cdec', pr_url=None, repo_url=RepoUrl('https://huggingface.co/MohamedTalaat91/gpt2-wikitext2', endpoint='https://huggingface.co', repo_type='model', repo_id='MohamedTalaat91/gpt2-wikitext2'), pr_revision=None, pr_num=None)

In [35]:
trainer.save_model("gpt2-wikitext2")

No files have been modified since last commit. Skipping to prevent empty commit.


In [33]:
tokenizer.save_pretrained("gpt2-wikitext2")

('gpt2-wikitext2/tokenizer_config.json',
 'gpt2-wikitext2/special_tokens_map.json',
 'gpt2-wikitext2/vocab.json',
 'gpt2-wikitext2/merges.txt',
 'gpt2-wikitext2/added_tokens.json',
 'gpt2-wikitext2/tokenizer.json')

## Loading a Retrained Model and Tokenizer

This cell loads the model we just trained, together with its tokenizer, from the directory where they were saved. The cell after it tokenizes a short prompt for generation.


In [36]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the retrained model
model_path = '/kaggle/working/gpt2-wikitext2'
model = AutoModelForCausalLM.from_pretrained(model_path)

# Load the tokenizer associated with the model
tokenizer = AutoTokenizer.from_pretrained(model_path)

In [37]:
input_text = "year study of the distribution"
inputs = tokenizer(input_text, return_tensors="pt")


## Generating Text with the Model

We now generate a continuation of the prompt with the trained model. The arguments to `generate` control the output length and how each next token is sampled.


In [73]:
# Generate a continuation of the prompt
generated_ids = model.generate(
    **inputs,  # Passes both input_ids and attention_mask
    max_length=100,  # Total length in tokens, prompt included
    num_return_sequences=1,  # Number of texts to generate
    do_sample=True,  # Enable sampling (as opposed to greedy search)
    top_k=50,  # Sample only from the 50 most likely next tokens
    temperature=0.9,  # Values below 1 make sampling slightly less random
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token; reuse EOS
)


In [74]:
# Decode the generated text to human-readable format
generated_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)

print(generated_text)

year study of the distribution of the time of the only two part as the first , this time the American government which were still have been to have been a great part of the two months . 
 = = = = = 
 
 The same new year was named by the season of the United Kingdom of the storm and their own . The first , a single was released on the state as a new main series of the first time . The town was not be of a number of the island
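
The `top_k` and `temperature` arguments passed to `generate` shape how each next token is drawn. The sketch below shows the idea on a toy list of logits in plain Python. It is a simplified illustration, not the library's implementation; `sample_top_k` and `toy_logits` are hypothetical names.

```python
import math
import random

def sample_top_k(logits, top_k=50, temperature=0.9, rng=random):
    """Sample one token id from `logits` using temperature scaling and top-k filtering."""
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = [x / temperature for x in logits]
    # Keep only the k highest-scoring token ids.
    k = min(top_k, len(scaled))
    top_ids = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:k]
    # Softmax weights over the survivors (subtract the max for numerical stability).
    m = max(scaled[i] for i in top_ids)
    weights = [math.exp(scaled[i] - m) for i in top_ids]
    return rng.choices(top_ids, weights=weights, k=1)[0]

toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
rng = random.Random(0)  # fixed seed so the run is repeatable
draws = [sample_top_k(toy_logits, top_k=2, temperature=0.9, rng=rng) for _ in range(1000)]
print(sorted(set(draws)))  # [0, 1]
```

With `top_k=2`, only the two highest-scoring ids can ever be sampled, and the higher-scoring one is drawn more often. A lower temperature widens that gap, while a higher one narrows it.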
