## Project setup

In [1]:
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer
import torch
from torch.utils.data import Dataset, random_split



In [2]:
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'

print(device)
model = model.to(torch.device(device))



cuda


## Test out GPT2 as-is

In [43]:
text = "Data scientists use VS Code as their tool"

# Move the prompt tokens to the same device as the model.
input_ids = tokenizer.encode(text, return_tensors='pt').to(device)

output = model.generate(input_ids, max_length=100, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Data scientists use VS Code as their tool to produce scientific experiments that are able to be studied together by colleagues on the project's two continents. Unlike normal collaboration research, these experiments are divided according to the types of results they will show.

"We're going to be using data from all areas that have a large share of clinical data from this project, which isn't even the largest area for this project. The reason we're doing that is simple: One way to use this data as a


## Fine-tune model using Medium posts with 'Technology' tag
The data comes from [Kaggle](https://www.kaggle.com/datasets/fabiochiusano/medium-articles?resource=download), filtered to posts with the `Technology` tag.

In [46]:
import pandas as pd
data = pd.read_csv('./resources/medium-articles-technology.csv')
data.head()

| Unnamed: 0 | text |
|---|---|
| 0 | AI creating Human-Looking Images and Tracking ... |
| 1 | The Sustainable Element-Technology Nexus that ... |
| 2 | Photo by rawpixel on Unsplash\n\nIt is very ea... |
| 3 | Despite the terrible SARS-CoV-2 pandemic curre... |
| 4 | We were engaged in a spirited debate about whe... |


In [48]:
texts = pd.Series(data.text)
print(texts)

0      AI creating Human-Looking Images and Tracking ...
1      The Sustainable Element-Technology Nexus that ...
2      Photo by rawpixel on Unsplash\n\nIt is very ea...
3      Despite the terrible SARS-CoV-2 pandemic curre...
4      We were engaged in a spirited debate about whe...
                             ...                        
109    The Overlooked Conservative Case for Reining i...
110    Last year I had just landed my first job as So...
111    Do not confuse it with Dark Mode\n\nThe hype o...
112    Revising What Makes Covid-19 Special: It’s Not...
113    The Link Between Flu Vaccines and Heart Diseas...
Name: text, Length: 114, dtype: object


In [51]:
# Longest tokenized text, capped at GPT-2's 1024-token context window.
max_length = min(max(len(tokenizer.encode(text)) for text in texts), 1024)
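
The length above is measured on the raw texts, while the dataset class further down wraps every text in `<|startoftext|>` and `<|endoftext|>` before tokenizing. A minimal sketch of a cap that leaves room for those two markers follows. The helper name `capped_max_length` and its parameters are hypothetical, and it assumes each marker encodes to a single token once it is registered as a special token.

```python
def capped_max_length(token_counts, model_limit=1024, n_special_tokens=2):
    """Return a padding length that fits the longest text plus its markers.

    token_counts: token counts of the raw texts (without markers).
    model_limit: context window of the model (1024 for GPT-2).
    n_special_tokens: markers added around each text (assumed one token each).
    """
    longest = max(token_counts) + n_special_tokens
    # Never exceed the model's context window; longer texts get truncated.
    return min(longest, model_limit)
```

In this notebook it could be called as `capped_max_length([len(tokenizer.encode(t)) for t in texts])`.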

In [54]:
tokenizer = AutoTokenizer.from_pretrained(model_name, bos_token='<|startoftext|>',
    eos_token='<|endoftext|>', pad_token='<|pad|>')
model = AutoModelForCausalLM.from_pretrained(model_name)
model.resize_token_embeddings(len(tokenizer))

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Embedding(50259, 768)

In [55]:
class TextsDataset(Dataset):
    """Tokenizes each text once, padded or truncated to max_length."""
    def __init__(self, txt_list, tokenizer, max_length):
        self.input_ids = []
        self.attn_masks = []
        for txt in txt_list:
            # Wrap each post in the start and end markers registered on the tokenizer.
            encodings_dict = tokenizer('<|startoftext|>' + txt + '<|endoftext|>', truncation=True,
                max_length=max_length, padding="max_length")
            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
            self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        # Each item is an (input_ids, attention_mask) pair.
        return self.input_ids[idx], self.attn_masks[idx]

In [56]:
dataset = TextsDataset(texts, tokenizer, max_length=max_length)
train_size = int(0.9 * len(dataset))
train_dataset, val_dataset = random_split(dataset, [train_size, len(dataset) - train_size])
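
Unless a seed is set elsewhere, `random_split` draws a different partition on every run, which makes the validation set hard to compare between runs. A minimal sketch of a reproducible 90/10 split is shown here. The helper `split_train_val` and the seed value 42 are illustrative assumptions, not part of the original notebook.

```python
import torch
from torch.utils.data import random_split


def split_train_val(dataset, train_fraction=0.9, seed=42):
    """Split a dataset into train and validation subsets, reproducibly."""
    train_size = int(train_fraction * len(dataset))
    val_size = len(dataset) - train_size
    # A dedicated generator keeps the split independent of the global RNG state.
    generator = torch.Generator().manual_seed(seed)
    return random_split(dataset, [train_size, val_size], generator=generator)
```

With the 114 texts loaded above, this gives 102 training items and 12 validation items, the same sizes as the unseeded split.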

In [57]:
training_args = TrainingArguments(
    output_dir='./results', num_train_epochs=1, logging_steps=100, save_steps=5000,
    per_device_train_batch_size=1, per_device_eval_batch_size=1,
    warmup_steps=10, weight_decay=0.05, logging_dir='./logs', report_to='none',
)

In [58]:
def collate(data):
    # Causal LM fine-tuning: the labels are the input ids themselves.
    input_ids = torch.stack([f[0] for f in data])
    return {'input_ids': input_ids, 'attention_mask': torch.stack([f[1] for f in data]), 'labels': input_ids}

Trainer(model=model, args=training_args, train_dataset=train_dataset,
        eval_dataset=val_dataset, data_collator=collate).train()

| Step | Training Loss |
|---|---|
| 100 | 9.6592 |


TrainOutput(global_step=102, training_loss=9.518519013535743, metrics={'train_runtime': 14.2915, 'train_samples_per_second': 7.137, 'train_steps_per_second': 7.137, 'total_flos': 53303574528000.0, 'train_loss': 9.518519013535743, 'epoch': 1.0})
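
The collator used for training copies the input ids into the labels unchanged, so the `<|pad|>` positions also count toward the loss. One common alternative, sketched here under the assumption that the attention mask is 0 exactly at padded positions, sets those labels to -100, the index that the cross-entropy loss in Hugging Face causal LM heads ignores. The function name `collate_with_masked_labels` is hypothetical, and this is not the collator that produced the loss reported above.

```python
import torch

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def collate_with_masked_labels(batch):
    """Stack (input_ids, attention_mask) pairs and mask padding in the labels."""
    input_ids = torch.stack([item[0] for item in batch])
    attention_mask = torch.stack([item[1] for item in batch])
    # Clone so that masking the labels does not alter the model inputs.
    labels = input_ids.clone()
    labels[attention_mask == 0] = IGNORE_INDEX
    return {
        'input_ids': input_ids,
        'attention_mask': attention_mask,
        'labels': labels,
    }
```

It could be passed to `Trainer` through the `data_collator` argument in place of the original collator.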

In [61]:
input_ids = tokenizer.encode("<|startoftext|> " + text, return_tensors='pt').to(device)

output = model.generate(input_ids, max_length=100, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


 Data scientists use VS Code as their tool of choice code. The code is written using the Language of choice made by the Code. The code is written using the Language as its foundation in the collection of applications and the resulting Collection. The resulting collection of applications is an example of the Code in the code of application. The resulting Code consists of different elements of the Application and an object from the code. The object contains various aspects of the application. The object consists of the application. The object
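
Both generation cells print the warning about the missing attention mask and pad token id. A minimal sketch of a call that passes both explicitly is shown here. The helper `generate_with_mask` is hypothetical, and it assumes a single unpadded prompt, for which the mask is all ones.

```python
import torch


def generate_with_mask(model, input_ids, pad_token_id, max_length=100):
    """Sample a continuation while passing an explicit attention mask."""
    # A single unpadded prompt attends to every token, so the mask is all ones.
    attention_mask = torch.ones_like(input_ids)
    with torch.no_grad():  # inference only, no gradients needed
        return model.generate(
            input_ids,
            attention_mask=attention_mask,
            pad_token_id=pad_token_id,
            max_length=max_length,
            do_sample=True,
        )
```

In this notebook it could be called as `generate_with_mask(model, input_ids, tokenizer.eos_token_id)`, which mirrors the fallback the warning describes.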


## Compare outputs
Generated text from the original GPT2 model and from the fine-tuned GPT2 model, both given the starter sentence "Data scientists use VS Code as their tool":

**Original GPT2:**<br>
Data scientists use VS Code as their tool to produce scientific experiments that are able to be studied together by colleagues on the project's two continents. Unlike normal collaboration research, these experiments are divided according to the types of results they will show. "We're going to be using data from all areas that have a large share of clinical data from this project, which isn't even the largest area for this project. The reason we're doing that is simple: One way to use this data as a

**Fine-tuned GPT2 with Medium `Technology` posts:**<br>
Data scientists use VS Code as their tool of choice code. The code is written using the Language of choice made by the Code. The code is written using the Language as its foundation in the collection of applications and the resulting Collection. The resulting collection of applications is an example of the Code in the code of application. The resulting Code consists of different elements of the Application and an object from the code. The object contains various aspects of the application. The object consists of the application. The object

## Save the model

In [62]:
model.save_pretrained("./model/medium-tech")
tokenizer.save_pretrained("./model/medium-tech")

Configuration saved in ./model/medium-tech/config.json
Model weights saved in ./model/medium-tech/pytorch_model.bin
tokenizer config file saved in ./model/medium-tech/tokenizer_config.json
Special tokens file saved in ./model/medium-tech/special_tokens_map.json


('./model/medium-tech/tokenizer_config.json',
 './model/medium-tech/special_tokens_map.json',
 './model/medium-tech/vocab.json',
 './model/medium-tech/merges.txt',
 './model/medium-tech/added_tokens.json',
 './model/medium-tech/tokenizer.json')
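
A saved directory like this can be loaded back with `from_pretrained`. The sketch below covers the model only, and the helper name `reload_model` is hypothetical. The tokenizer would be loaded the same way with `AutoTokenizer.from_pretrained` on the same directory.

```python
from transformers import AutoModelForCausalLM


def reload_model(model_dir, device='cpu'):
    """Load a causal LM saved with save_pretrained and prepare it for inference."""
    model = AutoModelForCausalLM.from_pretrained(model_dir)
    model = model.to(device)
    model.eval()  # switch off dropout for generation
    return model
```

For the directory used above, the call would be `reload_model("./model/medium-tech", device)`.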