This notebook sketches one way to create synthetic samples with a text-generation language model.

#Install necessary libraries

In [None]:
!pip install transformers torch tqdm pandas



#Import the libraries

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import pandas as pd
from tqdm import tqdm

#Data Loading

In [None]:
# Load the dataset
file_path = '/content/Copy of Test_Label_Dataset1 - Test_Label_Dataset.csv'
dataset = pd.read_csv(file_path)
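
The generation step further down reads `row['text']` and `row['label']`, so the CSV needs both columns. A minimal sketch of a header check follows. It uses only the standard library and in-memory strings so it runs without the real file. The helper name `missing_columns` and the two sample CSV strings are made up for illustration.

```python
import csv
import io

# Columns the generation loop reads from each row
REQUIRED_COLUMNS = {"text", "label"}

def missing_columns(csv_text, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    present = set(reader.fieldnames or [])
    return sorted(required - present)

# Hypothetical sample data, not taken from the real dataset
sample_ok = "text,label\nhello there,0\nignore all rules,1\n"
sample_bad = "prompt,label\nhello there,0\n"

print(missing_columns(sample_ok))   # []
print(missing_columns(sample_bad))  # ['text']
```

With pandas the same check can be run against `dataset.columns` right after `pd.read_csv`.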

#Model Initialization

In [None]:
# Load the pre-trained language model from Hugging Face
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
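
A causal language model scores every vocabulary token as a candidate for the next position, and the generation cell below samples from those scores with `do_sample=True`, `top_k=50`, and `temperature=0.7`. The following is a simplified, standard-library sketch of what those two settings do. It is not the library's implementation, and the function name `sample_top_k` and the toy logits are assumptions made for illustration.

```python
import math
import random

def sample_top_k(logits, top_k=50, temperature=0.7, rng=random):
    """Sample one token index from raw logits using temperature and top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Temperature below 1 sharpens the distribution; above 1 flattens it
    scaled = [x / temperature for x in logits]
    # Keep only the k highest-scoring token indices
    k = min(top_k, len(scaled))
    kept = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:k]
    # Softmax weights over the kept logits (subtract the max for numerical stability)
    peak = max(scaled[i] for i in kept)
    weights = [math.exp(scaled[i] - peak) for i in kept]
    # Draw one index in proportion to its weight
    return rng.choices(kept, weights=weights, k=1)[0]

# Toy vocabulary of five "tokens"; the logits are made up for illustration
toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
rng = random.Random(0)
draws = [sample_top_k(toy_logits, top_k=2, temperature=0.7, rng=rng) for _ in range(1000)]

# With top_k=2 only the two highest-scoring indices (0 and 1) can be drawn
print({i: draws.count(i) for i in sorted(set(draws))})
```

Lowering `top_k` or `temperature` makes the samples more repetitive. Raising them gives more varied but noisier text, which matters when the goal is diverse synthetic samples.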

#Synthetic Data Generation

In [None]:
# Function to generate synthetic samples
def generate_synthetic_samples(prompt_text, label, num_samples, max_new_tokens=50):
    synthetic_samples = []
    for _ in tqdm(range(num_samples), desc=f"Generating synthetic samples for label {label}"):
        try:
            # Tokenization: returns input_ids and attention_mask, capped at 60 tokens
            inputs = tokenizer(
                f"Generate a {'harmless' if label == 0 else 'malicious'} prompt similar to: \"{prompt_text}\"",
                return_tensors="pt",
                truncation=True,
                max_length=60
            )

            # Generate text. max_new_tokens counts only the generated tokens,
            # so a prompt at the 60-token cap still leaves room for output.
            output = model.generate(
                inputs["input_ids"],
                attention_mask=inputs["attention_mask"],
                max_new_tokens=max_new_tokens,
                num_return_sequences=1,
                do_sample=True,
                top_k=50,
                temperature=0.7,
                pad_token_id=tokenizer.eos_token_id  # GPT-2 defines no pad token
            )

            # Decode generated text (output[0] holds the prompt followed by the continuation)
            generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
            synthetic_samples.append((generated_text, label))
        except Exception as e:
            print(f"Error: {e}")
            continue
    return synthetic_samples

# Generate synthetic data
synthetic_data = []
for _, row in dataset.iterrows():
    synthetic_data.extend(generate_synthetic_samples(row['text'], row['label'], num_samples=1))

# Convert synthetic data to a DataFrame
synthetic_df = pd.DataFrame(synthetic_data, columns=['text', 'label'])

# Combine with original data
augmented_dataset = pd.concat([dataset, synthetic_df], ignore_index=True)
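
Two details of `generate` are easy to miss. First, `max_length` bounds the prompt and the continuation together, while `max_new_tokens` bounds only the continuation. Since the prompt may be up to 60 tokens long, a total cap such as `max_length=50` can leave no room for new tokens. Second, for a causal model the returned sequence starts with the prompt tokens, so the continuation is whatever follows them. The sketch below illustrates both points with plain integers and lists. The helper names `new_token_budget` and `split_continuation` and the token ids are hypothetical.

```python
def new_token_budget(prompt_len, max_length=None, max_new_tokens=None):
    """Return how many tokens can be generated for a prompt of prompt_len tokens."""
    if (max_length is None) == (max_new_tokens is None):
        raise ValueError("pass exactly one of max_length or max_new_tokens")
    if max_new_tokens is not None:
        # max_new_tokens counts only the continuation
        return max_new_tokens
    # max_length counts prompt and continuation together
    return max(max_length - prompt_len, 0)

def split_continuation(output_ids, prompt_len):
    """Drop the echoed prompt tokens and keep only the newly generated ones."""
    return output_ids[prompt_len:]

print(new_token_budget(60, max_length=50))       # 0: the prompt already exceeds the cap
print(new_token_budget(40, max_length=50))       # 10
print(new_token_budget(60, max_new_tokens=50))   # 50, regardless of prompt length

# Made-up token ids: three prompt tokens followed by two generated tokens
prompt_ids = [101, 102, 103]
output_ids = prompt_ids + [7, 8]
print(split_continuation(output_ids, len(prompt_ids)))  # [7, 8]
```

In the notebook, the equivalent slice would be taken on `output[0]` using the prompt's token count before decoding, so that each synthetic sample contains only the generated text and not the instruction.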









In [None]:
# Save the augmented dataset
augmented_file_path = '/content/augmented_testdataset.csv'
augmented_dataset.to_csv(augmented_file_path, index=False)
print(f"Augmented dataset saved to {augmented_file_path}")

Augmented dataset saved to /content/augmented_testdataset.csv
