# GPT-2

GPT-2 is a Transformer model pretrained on a very large corpus of English text in a self-supervised fashion. It was pretrained on raw text only, with no human labelling of any kind (which is why it can use large amounts of publicly available data); an automatic process generates the inputs and labels from the text itself. More precisely, it was trained to predict the next word in a sentence.
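
To make the "inputs and labels from raw text" idea concrete, here is a minimal sketch of how next-word training pairs can be derived without any manual labelling. The helper `next_token_pairs` is hypothetical, and whitespace splitting is only a stand-in for a real tokenizer (GPT-2 itself uses byte-pair encoding).

```python
def next_token_pairs(tokens, context_size):
    """Return (context, target) pairs where each target is the token following its context."""
    pairs = []
    for i in range(len(tokens) - context_size):
        context = tokens[i : i + context_size]
        target = tokens[i + context_size]  # the label comes from the text itself
        pairs.append((context, target))
    return pairs

# Assumption: whitespace tokens instead of GPT-2's byte-pair tokens.
tokens = "the quick brown fox jumps".split()
pairs = next_token_pairs(tokens, context_size=3)
for context, target in pairs:
    print(context, "->", target)
```

Every position in the text supplies its own label, which is why no annotation effort is needed.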

- https://huggingface.co/openai-community/gpt2

- https://github.com/openai/gpt-2?tab=readme-ov-file

- https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

  
From the paper's abstract:

Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on task-specific datasets. We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText. When conditioned on a document plus questions, the answers generated by the language model reach 55 F1 on the CoQA dataset - matching or exceeding the performance of 3 out of 4 baseline systems without using the 127,000+ training examples. The capacity of the language model is essential to the success of zero-shot task transfer and increasing it improves performance in a log-linear fashion across tasks. Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state of the art results on 7 out of 8 tested language modeling datasets in a zero-shot setting but still underfits WebText. Samples from the model reflect these improvements and contain coherent paragraphs of text. These findings suggest a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.


![image.png](nano.png)

https://paperswithcode.com/dataset/webtext

```
@article{radford2019language,
  title={Language Models are Unsupervised Multitask Learners},
  author={Radford, Alec and Wu, Jeff and Child, Rewon and Luan, David and Amodei, Dario and Sutskever, Ilya},
  year={2019}
}
```


# nanoGPT (credit to Andrej Karpathy, who created this excellent repository)

https://github.com/karpathy/nanoGPT

- https://github.com/triton-lang/triton/
- https://triton-lang.org/main/getting-started/installation.html
# nanoGPT - LoRA
https://github.com/danielgrittner/nanoGPT-LoRA

![image.png](image2.png)

In [None]:
#!pip install torch numpy transformers datasets tiktoken wandb tqdm scipy

In [1]:
from transformers import pipeline, set_seed
generator = pipeline('text-generation', model='gpt2')
set_seed(42)
generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)




Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Hello, I'm a language model, but what I'm really doing is making a human-readable document. There are other languages, but those are"},
 {'generated_text': "Hello, I'm a language model, not a syntax model. That's why I like it. I've done a lot of programming projects.\n"},
 {'generated_text': "Hello, I'm a language model, and I'll do it in no time!\n\nOne of the things we learned from talking to my friend"},
 {'generated_text': "Hello, I'm a language model, not a command line tool.\n\nIf my code is simple enough:\n\nif (use (string"},
 {'generated_text': "Hello, I'm a language model, I've been using Language in all my work. Just a small example, let's see a simplified example."}]

# Prepare Dataset

In [2]:
import os
import pickle
import requests
import numpy as np

In [9]:
ROOT_DIR = os.getcwd()
ROOT_DIR

'/mnt/d/repos2/nanoGPT'

In [5]:
input_file_path = "data/shakespeare_char/book.txt"
with open(input_file_path, 'r', encoding='utf-8') as f:  # the text contains non-ASCII characters
    data = f.read()
print(f"length of dataset in characters: {len(data):,}")

length of dataset in characters: 290,101


In [6]:
# get all the unique characters that occur in this text
chars = sorted(list(set(data)))
vocab_size = len(chars)
print("all the unique characters:", ''.join(chars))
print(f"vocab size: {vocab_size:,}")

# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }
def encode(s):
    return [stoi[c] for c in s] # encoder: take a string, output a list of integers
def decode(l):
    return ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string

all the unique characters: 
 !#$%()*,-./0123456789:;?ABCDEFGHIJKLMNOPQRSTUVWXYZ[]abcdefghijklmnopqrstuvwxyzçéêô —‘’“”•…™
vocab size: 93


In [10]:


# create the train and validation splits
n = len(data)
train_data = data[:int(n*0.9)]
val_data = data[int(n*0.9):]

# encode both to integers
train_ids = encode(train_data)
val_ids = encode(val_data)
print(f"train has {len(train_ids):,} tokens")
print(f"val has {len(val_ids):,} tokens")

train has 261,090 tokens
val has 29,011 tokens


In [15]:
train_ids[:20]

[41, 71, 68, 63, 58, 56, 73, 1, 32, 74, 73, 58, 67, 55, 58, 71, 60, 1, 58, 27]

In [16]:
data[:20]

'Project Gutenberg eB'

In [24]:
stoi["P"]

41
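
As a quick sanity check on the mapping, the sketch below rebuilds a tiny character vocabulary and confirms that `decode(encode(s))` returns the original string. It uses a short stand-in text rather than `book.txt`, so the ids differ from the ones printed above. It also shows that a character missing from the vocabulary raises `KeyError`, which matters when prompting a character-level model with new text.

```python
# Assumption: a short stand-in string instead of the full book.txt.
text = "Project Gutenberg eBook"
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    return [stoi[c] for c in s]  # string -> list of integer ids

def decode(ids):
    return "".join(itos[i] for i in ids)  # list of integer ids -> string

ids = encode("Project")
round_trip = decode(ids)
print(ids, "->", round_trip)

# A character outside the vocabulary has no id, so encode raises KeyError.
try:
    encode("Z")
    unknown_ok = True
except KeyError:
    unknown_ok = False
print("unknown character accepted:", unknown_ok)
```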

In [19]:
# export to bin files
train_ids = np.array(train_ids, dtype=np.uint16)
val_ids = np.array(val_ids, dtype=np.uint16)
train_ids.tofile(os.path.join(ROOT_DIR, "data" , "shakespeare_char",'train.bin'))
val_ids.tofile(os.path.join(ROOT_DIR, "data" , "shakespeare_char", 'val.bin'))

# save the meta information as well, to help us encode/decode later
meta = {
    'vocab_size': vocab_size,
    'itos': itos,
    'stoi': stoi,
}
with open(os.path.join(ROOT_DIR, "data" , "shakespeare_char", 'meta.pkl'), 'wb') as f:
    pickle.dump(meta, f)
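
The `.bin` files are raw `uint16` arrays with no header, so reading them back requires passing the same dtype. The sketch below writes a small stand-in file, memory-maps it, and draws a random batch of input/target windows shifted by one token. The `get_batch` helper here is a simplified NumPy-only illustration written for this note, not nanoGPT's own implementation, and the fake token ids are placeholders.

```python
import os
import tempfile

import numpy as np

# Stand-in for train.bin: 1,000 fake token ids written with the same dtype as above.
tmp_dir = tempfile.mkdtemp()
bin_path = os.path.join(tmp_dir, "train.bin")
np.arange(1000, dtype=np.uint16).tofile(bin_path)

# No header is stored, so the dtype must match the one used when writing.
data = np.memmap(bin_path, dtype=np.uint16, mode="r")

def get_batch(data, block_size, batch_size, rng):
    # Random start offsets, leaving room for the one-token shift of the targets.
    starts = rng.integers(0, len(data) - block_size, size=batch_size)
    x = np.stack([data[i : i + block_size] for i in starts]).astype(np.int64)
    y = np.stack([data[i + 1 : i + 1 + block_size] for i in starts]).astype(np.int64)
    return x, y

rng = np.random.default_rng(0)
x, y = get_batch(data, block_size=8, batch_size=4, rng=rng)
print(x.shape, y.shape)
```

`uint16` is sufficient here because the vocabulary (93 characters) is far below 65,536.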

# Train Model on GPU

In [21]:
#! python train.py config/train_shakespeare_char.py

### If you don't have a GPU
```
python train.py config/train_shakespeare_char.py --device=cpu --compile=False --eval_iters=20 --log_interval=1 --block_size=64 --batch_size=12 --n_layer=4 --n_head=4 --n_embd=128 --max_iters=2000 --lr_decay_iters=2000 --dropout=0.0
```
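
The CPU flags above shrink the network considerably. As a rough check, the sketch below estimates the parameter count from the configuration. `count_params` is a hypothetical helper, not part of nanoGPT. It assumes blocks without bias terms, a token embedding shared with the output head, and position embeddings excluded from the count. It also assumes the GPU config uses `n_layer=6` and `n_embd=384`, values that are not stated in this notebook.

```python
def count_params(n_layer, n_embd, vocab_size):
    """Estimate parameters for a GPT-style model (assumes no biases, tied embeddings)."""
    attention = 4 * n_embd**2   # fused q/k/v projection (3 d^2) plus output projection (d^2)
    mlp = 8 * n_embd**2         # d -> 4d and 4d -> d linear layers
    layer_norms = 2 * n_embd    # two LayerNorm weight vectors per block
    per_block = attention + mlp + layer_norms
    final_norm = n_embd
    token_embedding = vocab_size * n_embd  # shared with the output head
    return n_layer * per_block + final_norm + token_embedding

vocab_size = 93  # from the dataset preparation step

gpu_params = count_params(n_layer=6, n_embd=384, vocab_size=vocab_size)  # assumed GPU config
cpu_params = count_params(n_layer=4, n_embd=128, vocab_size=vocab_size)  # flags from the command above

print(f"GPU config: {gpu_params / 1e6:.2f}M")
print(f"CPU config: {cpu_params / 1e6:.2f}M")
```

Under these assumptions the GPU estimate comes to about 10.66M, in line with the `number of parameters: 10.66M` line in the sampling log, while the CPU configuration is roughly 0.80M.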

# Generate Samples from the Model

In [25]:
# GPU
! python sample.py --out_dir=out-shakespeare-char


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Overriding: out_dir = out-shakespeare-char
  checkpoint = torch.load(ckpt_path, map_location=device)
number of parameters: 10.66M
Loading meta from data/shakespeare_char/meta.pkl...

nover to you among now me. I wasn’t think you—were could be there?”

“No, you hear in truck about the house of the balls at the removies the garage mile heatte
driver.”

“What’s sound.”

“That’s business?”

“Jordan Baker name interestion and as ‘There’s then Baker’s
somebody is growns.”

“I’ll let here a moment that yourself many thing of the dead.”

“What’s you haven’t a cigar.” He said agained much over the
table. Finny-she didn’t knew that had never made me all affair the
dict of the corner.”
---------------

perceived with just still beside out its with toward about the
room.

“What’s a weathory?”

“Bring?”

Her voice foot. Well, what he say had pointed so now of the presence
in the coance. They’ll we’ve to get out on the enormous
Carray.”

“You see,” said the went of the next door. “I went into then t

In [None]:
# CPU
! python sample.py --out_dir=out-shakespeare-char --device=cpu

# Fine-Tune Model

In [None]:
! python train.py config/finetune_shakespeare.py

In [None]:
# Inference from gpt2-xl model
! python sample.py --init_from=gpt2-xl  --start="What is the answer to life, the universe, and everything?"  --num_samples=5 --max_new_tokens=100