### Project setup

In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer
import torch
from torch.utils.data import Dataset, random_split

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'

print(device)
model = model.to(torch.device(device))

cpu


### Test out GPT2 as-is

In [2]:
text = "Data scientists use VS Code as their tool"

# Move the prompt tensor to the same device as the model ('cuda' or 'cpu').
input_ids = tokenizer.encode(text, return_tensors='pt').to(device)


output = model.generate(input_ids, max_length=100, do_sample=True)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
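
The warning above appears because `generate` received only `input_ids`. A minimal sketch of one way to avoid it follows, assuming the tokenizer is called directly so that it also returns an attention mask. The helper name `generate_text` is hypothetical and not part of the notebook; the sampling arguments mirror the cell above.

```python
import torch


def generate_text(model, tokenizer, prompt, max_length=100):
    """Sample a continuation, passing the attention mask and pad token explicitly.

    Hypothetical helper for illustration; `model` and `tokenizer` are the
    objects loaded earlier in the notebook.
    """
    # Calling the tokenizer (instead of .encode) returns input_ids and attention_mask.
    enc = tokenizer(prompt, return_tensors="pt")
    input_ids = enc["input_ids"].to(model.device)
    attention_mask = enc["attention_mask"].to(model.device)

    # Stock GPT-2 has no pad token, so fall back to the end-of-sequence token.
    pad_id = tokenizer.pad_token_id
    if pad_id is None:
        pad_id = tokenizer.eos_token_id

    with torch.no_grad():
        output = model.generate(
            input_ids,
            attention_mask=attention_mask,
            pad_token_id=pad_id,
            max_length=max_length,
            do_sample=True,
        )
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

It would be called as `generate_text(model, tokenizer, text)`. Because sampling is enabled, the generated text differs from run to run.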


In [3]:
print(tokenizer.decode(output[0], skip_special_tokens=True))

Data scientists use VS Code as their toolset. VS Code gives you a clean, yet easily understandable syntax with just a few lines of code.

VS Code makes it much easier than ever to write your own code for VS Code applications. VS Code is an open source project (source on GitHub here), and even we can find it on GitHub!

We know about many other frameworks and libraries out there which make VS Code easy to use and maintain. Let's look at some of them


### Fine-tune model using Medium posts with 'Technology' tag
The data comes from [Kaggle](https://www.kaggle.com/datasets/fabiochiusano/medium-articles), filtered to posts carrying the `Technology` tag.

In [4]:
import pandas as pd
#data = pd.read_csv('./resources/medium-articles-technology.csv')
data = pd.read_csv(r'C:\Users\Store\CODE\AIML\0DataSets\medium_articles\medium-articles-technology.csv', encoding='utf-8')
data.head()

Unnamed: 0.1,Unnamed: 0,text,text_len
0,135143,⭐A Target Package is short for Target Package ...,9176
1,135145,‘WATCH’ ~ New Series HDTV! ~ The Good Fight Se...,11175
2,135146,⭐A Target Package is short for Target Package ...,9175
3,135152,⭐A Target Package is short for Target Package ...,9195
4,135153,⭐A Target Package is short for Target Package ...,9212


In [5]:
texts = pd.Series(data.text)

max_length = min(max(len(tokenizer.encode(text)) for text in texts), 1024)

Token indices sequence length is longer than the specified maximum sequence length for this model (2048 > 1024). Running this sequence through the model will result in indexing errors
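
The warning shows that at least one article is longer than GPT-2's 1024-token context, so the notebook's truncation discards everything past that point. As an alternative that the notebook does not use, long token sequences could be split into several windows. The sketch below is a plain-Python illustration; the function name `chunk_token_ids` and the `stride` parameter are assumptions made for this example.

```python
def chunk_token_ids(token_ids, max_length=1024, stride=0):
    """Split a list of token ids into windows of at most `max_length` tokens.

    Consecutive windows overlap by `stride` tokens. Hypothetical helper,
    not part of the original notebook.
    """
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")

    step = max_length - stride
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_length])
        # Stop once a window has reached the end of the sequence.
        if start + max_length >= len(token_ids):
            break
    return chunks


# Usage: a 2048-token article becomes two 1024-token training examples.
windows = chunk_token_ids(list(range(2048)), max_length=1024)
print(len(windows), [len(w) for w in windows])
```

If start and end markers are added around each window, as the dataset class in this notebook does for whole articles, `max_length` would need to leave room for those two extra tokens.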


In [6]:
tokenizer = AutoTokenizer.from_pretrained(model_name, bos_token='<|startoftext|>',
    eos_token='<|endoftext|>', pad_token='<|pad|>')
model = AutoModelForCausalLM.from_pretrained(model_name)
model.resize_token_embeddings(len(tokenizer))

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Embedding(50259, 768)

In [7]:
class TextsDataset(Dataset):
    """Tokenizes each text once and stores padded input ids and attention masks."""

    def __init__(self, txt_list, tokenizer, max_length):
        self.input_ids = []
        self.attn_masks = []
        for txt in txt_list:
            # Wrap each text in the start/end markers, then truncate or pad to max_length.
            encodings_dict = tokenizer('<|startoftext|>' + txt + '<|endoftext|>', truncation=True,
                max_length=max_length, padding="max_length")
            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
            self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        # Labels are built later in the data collator from the input ids.
        return self.input_ids[idx], self.attn_masks[idx]

In [8]:
dataset = TextsDataset(texts, tokenizer, max_length=max_length)
train_size = int(0.9 * len(dataset))
train_dataset, val_dataset = random_split(dataset, [train_size, len(dataset) - train_size])
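
`random_split` draws a different partition on every run unless its random generator is seeded. A small sketch of a reproducible 90/10 split follows; the helper name `split_train_val` and the seed value are assumptions chosen for illustration.

```python
import torch
from torch.utils.data import random_split


def split_train_val(dataset, train_fraction=0.9, seed=42):
    """Split a dataset into train/validation subsets with a fixed seed.

    Hypothetical helper; the seed value 42 is arbitrary.
    """
    train_size = int(train_fraction * len(dataset))
    val_size = len(dataset) - train_size
    # A dedicated generator keeps the split independent of the global RNG state.
    generator = torch.Generator().manual_seed(seed)
    return random_split(dataset, [train_size, val_size], generator=generator)
```

Calling `split_train_val(dataset)` twice returns subsets with identical indices, which makes validation numbers comparable across runs.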

In [10]:
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    logging_steps=100,
    save_steps=5000,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    warmup_steps=100,
    weight_decay=0.05,
    logging_dir='./logs',
    report_to='none',
    gradient_accumulation_steps=4,  # Accumulate 4 batches per update: effective batch size 4 * 4 = 16 per device
    fp16=False,                     # Set to True for mixed-precision training on a CUDA GPU
    learning_rate=5e-5,             # Peak learning rate, reached after the warmup steps
)
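
The batch size, accumulation steps, and epoch count together determine how many optimizer updates the run performs. The sketch below works through that arithmetic. It is an approximation: the exact rounding inside `Trainer` may differ between library versions, and the function name and example dataset size are assumptions.

```python
import math


def optimizer_steps(num_examples, per_device_batch_size, grad_accum_steps,
                    epochs, num_devices=1):
    """Estimate the total number of optimizer updates for a training run.

    Hypothetical helper; approximates the step count rather than
    reproducing the Trainer's internal logic exactly.
    """
    # Batches the dataloader yields per epoch (the last batch may be partial).
    batches_per_epoch = math.ceil(
        num_examples / (per_device_batch_size * num_devices))
    # One optimizer update happens every `grad_accum_steps` batches.
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return updates_per_epoch * epochs


# Hypothetical dataset size of 512 training examples with the settings above.
print(optimizer_steps(512, per_device_batch_size=4, grad_accum_steps=4, epochs=3))  # 96
```

A hypothetical 512 training examples would yield 96 updates, the same total as the progress bar in the training cell below, although the notebook does not state the real dataset size.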

In [11]:
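def collate(batch):
    # Each item is (input_ids, attention_mask); for causal LM training the labels copy the inputs.
    input_ids = torch.stack([item[0] for item in batch])
    attention_mask = torch.stack([item[1] for item in batch])
    return {'input_ids': input_ids, 'attention_mask': attention_mask, 'labels': input_ids.clone()}

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset,
                  eval_dataset=val_dataset, data_collator=collate)
trainer.train()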



  0%|          | 0/96 [00:00<?, ?it/s]
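
The collator in the training cell uses the padded input ids as labels, so the loss also covers the `<|pad|>` positions. A possible variation, not what the notebook does, is to set the labels at padded positions to `-100`, the index that the cross-entropy loss ignores. The function name below is hypothetical.

```python
import torch


def collate_with_masked_labels(batch):
    """Stack (input_ids, attention_mask) pairs and ignore padding in the loss.

    Hypothetical alternative to the notebook's collator.
    """
    input_ids = torch.stack([item[0] for item in batch])
    attention_mask = torch.stack([item[1] for item in batch])

    # Copy the inputs, then mark padded positions so the loss skips them.
    labels = input_ids.clone()
    labels[attention_mask == 0] = -100

    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
    }
```

It could be passed to `Trainer` through the `data_collator` argument. Because the notebook registers a dedicated `<|pad|>` token, the end-of-text marker keeps an attention mask of 1 and still contributes to the loss.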

In [11]:
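# Use the model's own device rather than hard-coding .cuda(), which fails on CPU-only machines.
input_ids = tokenizer.encode("<|startoftext|> " + text, return_tensors='pt').to(model.device)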

output = model.generate(input_ids, max_length=100, do_sample=True)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [12]:
print(tokenizer.decode(output[0], skip_special_tokens=True))

 Data scientists use VS Code as their tool of choice.

The code is easy to code, but it is so incredibly difficult to comprehend this concept and execute. But the more complex data scientists have, the better we would be able to better understand it, and learning it.
So, I understand like this. As well, I have been curious, how many different ways to communicate with friends, how many different ways to communicate with some of them, and how many various methods


### Compare outputs
The samples below compare text generated by the original GPT2 model and by the fine-tuned GPT2 model, both prompted with the starter sentence "Data scientists use VS Code as their tool".

**Original GPT2:**<br>
Data scientists use VS Code as their tool to produce scientific experiments that are able to be studied together by colleagues on the project's two continents. Unlike normal collaboration research, these experiments are divided according to the types of results they will show. "We're going to be using data from all areas that have a large share of clinical data from this project, which isn't even the largest area for this project. The reason we're doing that is simple: One way to use this data as a

**Fine-tuned GPT2 with Medium `Technology` posts:**<br>
Data scientists use VS Code as their tool of choice code. The code is written using the Language of choice made by the Code. The code is written using the Language as its foundation in the collection of applications and the resulting Collection. The resulting collection of applications is an example of the Code in the code of application. The resulting Code consists of different elements of the Application and an object from the code. The object contains various aspects of the application. The object consists of the application. The object

### Save the model

In [13]:
model.save_pretrained("./model/medium-tech")
tokenizer.save_pretrained("./model/medium-tech")

('./model/medium-tech/tokenizer_config.json',
 './model/medium-tech/special_tokens_map.json',
 './model/medium-tech/vocab.json',
 './model/medium-tech/merges.txt',
 './model/medium-tech/added_tokens.json',
 './model/medium-tech/tokenizer.json')