<a href="https://colab.research.google.com/github/kaballas/AutoGPT/blob/master/benchmark/notebooks/combined_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install datasets transformers nltk



In [2]:
import re

import nltk
from datasets import load_dataset
from nltk.corpus import stopwords

# Download NLTK stopwords
nltk.download('stopwords')

# Build the stopword set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(text):
    # Keep only letters, digits, whitespace and basic punctuation, then lowercase
    text = re.sub(r'[^A-Za-z0-9\s.,!?]', '', text).lower()
    # Split on whitespace and drop English stopwords
    return [word for word in text.split() if word not in STOP_WORDS]

# Load dataset from Hugging Face
dataset = load_dataset("Kaballas/HRMIS_MASTER", split="train")

# Preprocess the dataset
preprocessed_data = []
for example in dataset:
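    # Join with a space so the question's last word and the answer's first word stay separate tokens
    words = preprocess_text(example['questions'] + ' ' + example['answers'])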
    preprocessed_data.extend(words)

# Now preprocessed_data contains all the preprocessed words from the dataset
print(f"Total preprocessed words: {len(preprocessed_data)}")
print(f"First 10 words: {preprocessed_data[:10]}")

# Save preprocessed data to a text file
output_file = "preprocessed_data.txt"
with open(output_file, "w", encoding="utf-8") as file:
    for word in preprocessed_data:
        file.write(word + "\n")

print(f"Preprocessed data saved to {output_file}")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Total preprocessed words: 89845
First 10 words: ['purpose', 'parallel', 'pay', 'run', 'test', 'strategy', 'document?the', 'purpose', 'parallel', 'pay']
Preprocessed data saved to preprocessed_data.txt
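
A fused token such as 'document?the' in the output above is what results when a question and its answer are concatenated with no separator before splitting on whitespace. The sketch below reproduces the effect on a made-up question/answer pair. The tiny stopword set stands in for NLTK's English list, and the pair itself is hypothetical, not taken from the dataset.

```python
import re

# Tiny stand-in for NLTK's English stopword list (assumption, for illustration only)
STOP_WORDS = {"what", "is", "the", "of", "a", "to"}

def preprocess(text):
    """Same cleaning steps as preprocess_text, but with the stand-in stopword set."""
    text = re.sub(r"[^A-Za-z0-9\s.,!?]", "", text).lower()
    return [w for w in text.split() if w not in STOP_WORDS]

# Hypothetical question/answer pair, not taken from the dataset
question = "What is the purpose of the document?"
answer = "The purpose is to describe the test strategy."

fused = preprocess(question + answer)            # no separator between the two strings
separated = preprocess(question + " " + answer)  # single space between them

print(fused)      # contains the fused token 'document?the'
print(separated)  # 'document?' stays separate and 'the' is removed as a stopword
```

With a separator, the answer's leading stopword is filtered out like any other, so the fused token never reaches the training file.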


In [3]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model_name = 'gpt2'
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

def encode_input(text):
    return tokenizer.encode(text, return_tensors='pt')

def decode_output(tokens):
    return tokenizer.decode(tokens, skip_special_tokens=True)


In [4]:
import torch

def generate_story_gpt2(seed_text, max_length=100, top_k=50, top_p=0.95):
    input_ids = encode_input(seed_text)
    sample_outputs = model.generate(
        input_ids,
        do_sample=True,
        max_length=max_length,
        top_k=top_k,
        top_p=top_p,
        num_return_sequences=1
    )
    return decode_output(sample_outputs[0])

seed_text = "What data is used for Playbacks 1 and 2 in the Data environment?"
story = generate_story_gpt2(seed_text)
print(story)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


What data is used for Playbacks 1 and 2 in the Data environment?

For Playbacks 1 and 2 the data environment is used by default. You can change these parameters to whatever you want if you do not wish to.

Playbacks 1 & 2 Data Environment

If you need to use a different data environment, you can create one which does not yet support this setting.

Data Environment for Playbacks 1 and 2

If you want to use a different Data
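
`generate_story_gpt2` samples with `top_k=50` and `top_p=0.95`. Roughly, top-k keeps only the k most likely next tokens, and top-p then keeps the smallest group of those whose cumulative probability reaches p. The sketch below shows that idea on a toy five-token distribution in plain Python. It is not the library's implementation, and `top_k_top_p_filter` is a hypothetical helper introduced for illustration.

```python
def top_k_top_p_filter(probs, top_k=50, top_p=0.95):
    """Return {token_index: probability} after top-k, then top-p, renormalised to sum to 1."""
    # Rank token indices by probability, highest first, and keep the k best
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    k_mass = sum(probs[i] for i in ranked)

    # Walk down the ranking until the renormalised cumulative mass reaches top_p
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / k_mass
        if cumulative >= top_p:
            break

    # Renormalise over the surviving tokens so they form a distribution again
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}

# Toy next-token distribution over a 5-token vocabulary (made up for illustration)
toy_probs = [0.5, 0.3, 0.1, 0.05, 0.05]
filtered = top_k_top_p_filter(toy_probs, top_k=3, top_p=0.8)
print({i: round(p, 3) for i, p in filtered.items()})
```

Here top-k drops the two 0.05 tokens, and top-p then drops the 0.1 token because the first two already cover more than 80% of the remaining mass. Sampling happens only among the tokens that survive.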


In [5]:
from nltk.translate.bleu_score import sentence_bleu

# Example calculation of BLEU score
reference = "What data is used for Playbacks 1 and 2 in the Data environment?".split()
candidate = "What data is used for Playbacks 1 and 2 in the Data environment?".split()
bleu_score = sentence_bleu([reference], candidate)
print(f'BLEU score: {bleu_score}')


BLEU score: 1.0
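
Comparing a sentence with itself, as above, always scores 1.0, so it says little about generated text. To make the metric less of a black box, here is a minimal pure-Python sketch of unsmoothed, single-reference BLEU: a brevity penalty times the geometric mean of clipped n-gram precisions. `simple_bleu` is a hypothetical helper, and it is not guaranteed to match `sentence_bleu` numerically in every edge case.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def simple_bleu(reference, candidate, max_n=4):
    """Unsmoothed single-reference BLEU over 1..max_n-grams."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by how often it occurs in the reference
        overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        if total == 0 or overlap == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        precisions.append(overlap / total)
    # Brevity penalty: punish candidates shorter than the reference
    if len(candidate) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

reference = "What data is used for Playbacks 1 and 2 in the Data environment?".split()
shorter = "What data is used for Playbacks 1 and 2?".split()  # hypothetical candidate
unrelated = "Completely different words here".split()          # hypothetical candidate

print(simple_bleu(reference, reference))
print(simple_bleu(reference, shorter))
print(simple_bleu(reference, unrelated))
```

A more informative check would pass the expected answer as the reference and the model's generated text as the candidate.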


In [6]:
def filter_inappropriate_content(text):
    inappropriate_keywords = ['badword1', 'badword2']
    for word in inappropriate_keywords:
        if word in text:
            return True
    return False

story = generate_story_gpt2(seed_text)
if filter_inappropriate_content(story):
    print("Inappropriate content detected.")
else:
    print(story)



What data is used for Playbacks 1 and 2 in the Data environment?

There is a very large set of data used to determine Playback-related statistics for each game in a game on the system. Players are often asked to record and share playbacks. They are asked to perform the same tasks as their opponents. What is the reason for doing this data collection?

The reason for performing this data collection is to obtain all of the playback data. We use that data to
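
`filter_inappropriate_content` uses a case-sensitive substring test, so it misses capitalised variants and also fires when a keyword is embedded inside a longer word. A possible alternative is a case-insensitive whole-word match with a regular expression. This is a sketch only: `contains_blocked_word` is a hypothetical helper, and the keyword list is the same placeholder used above.

```python
import re

def contains_blocked_word(text, blocked_words):
    """True if any blocked word appears as a whole word, ignoring case."""
    if not blocked_words:
        return False
    # \b anchors the match at word boundaries; re.escape protects special characters
    pattern = r"\b(?:" + "|".join(re.escape(w) for w in blocked_words) + r")\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

blocked = ['badword1', 'badword2']  # placeholder keywords, as in the cell above

print(contains_blocked_word("This has BadWord1 in it.", blocked))  # True
print(contains_blocked_word("notbadword1here", blocked))           # False
print(contains_blocked_word("A clean sentence.", blocked))         # False
```

The second call is the case where a plain `in` check would report a match, because the keyword is only part of a longer token.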


In [7]:
!pip install transformers[torch]
!pip install accelerate -U



In [8]:
from transformers import TextDataset, DataCollatorForLanguageModeling, Trainer, TrainingArguments

def fine_tune_model(train_file):
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)
    model = GPT2LMHeadModel.from_pretrained(model_name)

    dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=train_file,
        block_size=128
    )

    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=False,
    )

    training_args = TrainingArguments(
        output_dir='./results',
        overwrite_output_dir=True,
        num_train_epochs=100,
        per_device_train_batch_size=32,
        save_steps=10_000,
        learning_rate=1e-4,  # Higher than the Trainer default of 5e-5
        weight_decay=0.01,  # Add weight decay for regularization
        save_total_limit=2,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=dataset,
    )

    trainer.train()
    output_dir='./results'
    # Save the model and tokenizer
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)

fine_tune_model('preprocessed_data.txt')




Step,Training Loss
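
`TextDataset` with `block_size=128` turns the training file into fixed-length examples. As far as the library source indicates, it tokenizes the whole file into one stream of ids, cuts that stream into consecutive blocks of `block_size`, and drops the final partial block. A minimal sketch of that chunking on plain integers follows. `chunk_token_ids` is a hypothetical helper and the 300-id stream is made up, so treat this as an illustration of the behaviour rather than the library code.

```python
def chunk_token_ids(token_ids, block_size=128):
    """Cut a flat list of token ids into full blocks of block_size, dropping the remainder."""
    return [
        token_ids[i:i + block_size]
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]

# Hypothetical token stream standing in for the tokenized training file
token_ids = list(range(300))
blocks = chunk_token_ids(token_ids, block_size=128)

print(len(blocks))               # 2 full blocks
print([len(b) for b in blocks])  # [128, 128]; the last 44 ids are dropped
```

Because the training file holds one word per line, each block is a run of roughly 128 tokens that can span several question/answer pairs.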


In [9]:
def add_story_structure(text):
    # Split the text into thirds by character count and tag each segment
    first_cut = len(text) // 3
    second_cut = 2 * len(text) // 3
    return f"<BOS> {text[:first_cut]} <MID> {text[first_cut:second_cut]} <EOS> {text[second_cut:]}"

structured_text = add_story_structure("What data is used for Playbacks 1 and 2 in the Data environment?...")
print(structured_text)


<BOS> What data is used for  <MID> Playbacks 1 and 2 in t <EOS> he Data environment?...


In [10]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',
    fp16=True,  # Enable mixed precision training
    per_device_train_batch_size=4,
    num_train_epochs=3,
)


In [11]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the tokenizer and model from the fine-tuned checkpoint in ./results (model_name is the base checkpoint)
model_name = 'gpt2'
tokenizer = GPT2Tokenizer.from_pretrained('./results')  # Path to the fine-tuned model
model = GPT2LMHeadModel.from_pretrained('./results')


In [12]:
import torch

def generate_story(seed_text, max_length=100, top_k=50, top_p=0.95):
    # Encode the input text
    input_ids = tokenizer.encode(seed_text, return_tensors='pt')

    # Generate text
    sample_outputs = model.generate(
        input_ids,
        do_sample=True,
        max_length=max_length,
        top_k=top_k,
        top_p=top_p,
        num_return_sequences=1
    )

    # Decode the generated text
    generated_text = tokenizer.decode(sample_outputs[0], skip_special_tokens=True)

    return generated_text

def format_story(story, line_length=80):
    words = story.split()
    formatted_story = ''
    line = ''
    for word in words:
        if len(line) + len(word) + 1 > line_length:
            formatted_story += line.strip() + '\n'
            line = ''
        line += word + ' '
    formatted_story += line.strip()
    return formatted_story

# Example usage
seed_text = "What data is used for Playbacks 1 and 2 in the Data environment?"
story = generate_story(seed_text, max_length=512)
formatted_story = format_story(story, line_length=80)
print(formatted_story)



What data is used for Playbacks 1 and 2 in the Data environment?the following
table describes the types data used playbacks in data environment. purpose
playbacks 1 2 development environment?the development environment used fix unit
test defects found production testing activities. often development environment
migrated test environment subject approval. often development environment
migrated test environment subject approval. application used migrate test
environment subject approval?the application used migrate test environment
subject approval. often development environment migrated test environment
unscrambled data performed parallel payroll environment?the development
environment used migrate test environment unscrambled data performed parallel
payroll environment highlighted blue figure 8 uat stage. data environment
longer available?the data environment longer available 2027 hours period low
usage minimize risk impact. main purpose test environment build stage?the test
environmen
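
The manual wrapping in `format_story` can also be done with the standard library's `textwrap` module. The sketch below is an alternative under that assumption, not a drop-in equivalent: `textwrap.fill` allows lines of exactly `line_length` characters, so break positions can differ slightly. `format_story_textwrap` is a hypothetical name.

```python
import textwrap

def format_story_textwrap(story, line_length=80):
    """Wrap text at word boundaries without splitting long or hyphenated words."""
    # Collapse all runs of whitespace first, as story.split() does in format_story
    normalized = " ".join(story.split())
    return textwrap.fill(
        normalized,
        width=line_length,
        break_long_words=False,
        break_on_hyphens=False,
    )

sample = "What data is used for Playbacks 1 and 2 in the Data environment?"
wrapped = format_story_textwrap(sample, line_length=20)
print(wrapped)
```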

In [None]:
from transformers import TrainerCallback
from tqdm import tqdm

class ProgressBarCallback(TrainerCallback):
    def __init__(self):
        super().__init__()
        self.epoch_pbar = None
        self.step_pbar = None

    def on_train_begin(self, args, state, control, **kwargs):
        self.epoch_pbar = tqdm(total=args.num_train_epochs, desc="Epochs")
        self.step_pbar = tqdm(total=state.max_steps, desc="Steps")

    def on_epoch_end(self, args, state, control, **kwargs):
        self.epoch_pbar.update(1)

    def on_step_end(self, args, state, control, **kwargs):
        self.step_pbar.update(1)

    def on_train_end(self, args, state, control, **kwargs):
        self.epoch_pbar.close()
        self.step_pbar.close()


In [None]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer, TextDataset, DataCollatorForLanguageModeling, Trainer, TrainingArguments

def fine_tune_model(train_file, model_name='gpt2', output_dir='./results'):
    tokenizer = GPT2Tokenizer.from_pretrained(model_name)
    model = GPT2LMHeadModel.from_pretrained(model_name)

    dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=train_file,
        block_size=128
    )

    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=False,
    )

    training_args = TrainingArguments(
        output_dir=output_dir,
        overwrite_output_dir=True,
        num_train_epochs=100,
        per_device_train_batch_size=64,
        save_steps=10_000,
        save_total_limit=2,
        learning_rate=1e-4,  # Higher than the Trainer default of 5e-5
        weight_decay=0.01,  # Add weight decay for regularization
        warmup_ratio=0.1,  # Warm up the learning rate over the first 10% of steps
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=dataset,
        callbacks=[ProgressBarCallback()]  # Add the custom progress bar callback
    )

    trainer.train()

    # Save the model and tokenizer
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)

    # Return the trainer so its log history can be inspected after training
    return trainer

# Fine-tune the model and save it
trainer = fine_tune_model('preprocessed_data.txt')


Epochs:   0%|          | 0/100 [00:00<?, ?it/s]
Steps:   0%|          | 0/200 [00:00<?, ?it/s]
Steps:   0%|          | 1/200 [00:02<02:57,  1.12it/s]

Step,Training Loss

Epochs:   1%|          | 1/100 [00:02<04:15,  2.58s/it]
Steps:   2%|▏         | 3/200 [00:04<03:43,  1.13s/it]
Epochs:   2%|▏         | 2/100 [00:04<03:45,  2.30s/it]
Steps:   2%|▎         | 5/200 [00:05<03:29,  1.07s/it]
Epochs:   3%|▎         | 3/100 [00:06<03:41,  2.28s/it]
Steps:   4%|▎         | 7/200 [00:07<03:45,  1.17s/it]
Epochs:   4%|▍         | 4/100 [00:08<03:02,  1.90s/it]
Steps:   4%|▍         | 9/200 [00:09<02:58,  1.07it/s]
Epochs:   5%|▌         | 5/100 [00:10<03:24,  2.15s/it]
Steps:   6%|▌         | 11/200 [00:11<03:34,  1.13s/it]
Epochs:   6%|▌         | 6/100 [00:12<03:18,  2.12s/it]
Steps:   6%|▋         | 13/200 [00:13<03:31,  1.13s/it]
Epochs:   7%|▋         | 7/100 [00:14<03:14,  2.09s/it]
Steps:   8%|▊         | 15/200 [00:15<03:29,  1.13s/it]
Epochs:   8%|▊         | 8/100 [00:16<02:49,  1.84s/it]
Steps:   8%|▊         | 17/200 [00:17<02:48,  1.08it/s]
Epochs:   9%|▉         | 9/100 [00:18<02:45,  1.82s/it]
Steps:  10%|▉         | 19/
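
The progress bar above reports 200 total steps for 100 epochs, which is 2 optimizer steps per epoch at a batch size of 64. `TrainingArguments` logs the training loss every `logging_steps` steps, with a default of 500, so a 200-step run never reaches the first logging point. That would be consistent with the empty "Step,Training Loss" table. The sketch below reproduces the arithmetic in plain Python. The helper names and the example count of 100 training blocks are assumptions for illustration, and it assumes a single device with no gradient accumulation.

```python
import math

def total_optimizer_steps(num_examples, batch_size, num_epochs):
    """Total steps for a single device with no gradient accumulation (assumption)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs

def steps_with_logged_loss(total_steps, logging_steps):
    """Steps at which a training-loss entry would be written."""
    return list(range(logging_steps, total_steps + 1, logging_steps))

# 100 blocks is a hypothetical dataset size that gives 2 steps per epoch at batch size 64
total = total_optimizer_steps(num_examples=100, batch_size=64, num_epochs=100)

print(total)                                    # 200
print(steps_with_logged_loss(total, 500))       # []
print(len(steps_with_logged_loss(total, 20)))   # 10
```

Passing a smaller value such as `logging_steps=20` to `TrainingArguments` would make loss entries appear in the table and in `trainer.state.log_history`.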

In [None]:
import matplotlib.pyplot as plt

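def plot_training_metrics(training_args, trainer):
    # Only some log entries carry a training loss; take step and loss from the
    # same entries so both lists always have the same length
    logs = [log for log in trainer.state.log_history if 'loss' in log]
    steps = [log['step'] for log in logs]
    losses = [log['loss'] for log in logs]

    plt.figure(figsize=(10, 5))
    plt.plot(steps, losses, label='Loss')
    plt.xlabel('Steps')
    plt.ylabel('Loss')
    plt.title('Training Loss Over Time')
    plt.legend()
    plt.show()

# Plot training metrics after training.
# `trainer` must be the Trainer instance used for training (fine_tune_model has to return it).
plot_training_metrics(training_args, trainer)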
