<a href="https://colab.research.google.com/github/pierretfie/python_world/blob/main/brain_ai/brain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [8]:
#!git pull https://github.com/pierretfie/python_world.git
%cd /content/python_world/brain_ai/
!pip install -r requirements.txt

/content/python_world/brain_ai
Collecting datasets (from -r requirements.txt (line 1))
  Downloading datasets-3.0.2-py3-none-any.whl.metadata (20 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets->-r requirements.txt (line 1))
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets->-r requirements.txt (line 1))
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets->-r requirements.txt (line 1))
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-3.0.2-py3-none-any.whl (472 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m472.7/472.7 kB[0m [31m7.3 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m7.9 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading multiprocess-0.70.16-py310-none-

In [13]:

from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments
from datasets import Dataset
from os import path





# Load GPT-2 tokenizer and model
model_name = 'gpt2'
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model_path = '/content/python_world/brain_ai'
# Set the pad_token to eos_token to avoid padding issues
tokenizer.pad_token = tokenizer.eos_token

model = GPT2LMHeadModel.from_pretrained(model_name)

# Dataset for fine-tuning
data = {
    'text': [
        "User: Hello!\nBot: Hi there! How can I assist you?",
        "User: Hey!\nBot: Hello! How can I help you today?",
        "User: Hi\nBot: Hey! How are you?"
    ]
}

dataset = Dataset.from_dict(data)
# Tokenize the data with labels
def tokenize_function(example):
    # Tokenize the input text
    tokenized = tokenizer(example['text'], padding='max_length', truncation=True)
    # Use input_ids as labels for training (labels should be identical to input_ids)
    tokenized['labels'] = tokenized['input_ids'].copy()
    return tokenized

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Fine-tuning GPT-2
# Define training arguments
training_args = TrainingArguments(
    output_dir=path.expanduser(model_path),  # where to save the model
    num_train_epochs=3,  # number of training epochs
    per_device_train_batch_size=2,
    save_steps=10_000,
    save_total_limit=2,  # only keep the latest two models
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets
)

# Train the model
trainer.train()

# Test the fine-tuned model
input_text = "User: Hello!\nBot:"
inputs = tokenizer.encode(input_text, return_tensors='pt', padding='max_length', truncation=True)
attention_mask = (inputs != tokenizer.pad_token_id).long()

# Generate a response
outputs = model.generate(inputs, attention_mask=attention_mask, max_new_tokens=50, num_return_sequences=1, pad_token_id=tokenizer.eos_token_id)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)



Map:   0%|          | 0/3 [00:00<?, ? examples/s]

Step,Training Loss


This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (1024). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.


User: Hello!
Bot:

The following is a list of all the bots that have been created by the bot.


In [None]:
%cd /content/python_world
!git add .
!git commit -m "update"
!git push https://github.com/pierretfie/python_world.git