<a href="https://colab.research.google.com/github/jeffreylowzg/LLM_homework6/blob/jeffrey-commits/data_clean.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!pip install -U "huggingface_hub[cli]"
!pip install torch transformers[torch] numpy tqdm datasets peft accelerate

Download the dataset and save a 5% sample

In [None]:
from datasets import load_dataset
import pandas as pd
import os

# Load the dataset from Hugging Face
dataset = load_dataset("dmitva/human_ai_generated_text", split="train")

# Calculate 5% of the dataset size
sample_size = int(0.05 * len(dataset))

# Sample 5% of the data
sampled_dataset = dataset.shuffle(seed=42).select(range(sample_size))

# Convert to pandas DataFrame for easier handling
df = pd.DataFrame(sampled_dataset)

# Ensure the 'data' directory exists
os.makedirs("data", exist_ok=True)

# Save to a CSV file in the 'data' folder
df.to_csv("data/sample_5_percent.csv", index=False)

print("5% of the dataset has been saved to 'data/sample_5_percent.csv'")


Read the saved sample and split each row into two records: label 0 (human) and label 1 (AI)

In [None]:
import json

# Read the sampled CSV file
df = pd.read_csv("data/sample_5_percent.csv")

# Initialize an empty list to hold the new records
data = []

# Process each row to create two entries: one for human text, one for AI text
for _, row in df.iterrows():
    # Append the human text with label 0
    data.append({
        "text": row["human_text"],
        "instructions": row["instructions"],
        "label": 0
    })

    # Append the AI text with label 1
    data.append({
        "text": row["ai_text"],
        "instructions": row["instructions"],
        "label": 1
    })

# Save the processed data to a JSON Lines file (one JSON object per line)
outfile = "data/sample_5_percent.jsonl"
with open(outfile, "w") as f:
    for d in data:
        json.dump(d, f)
        f.write("\n")

print(f"The dataset has been saved to {outfile} with the specified format.")
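
As a quick sanity check on the record format described above, the sketch below builds the same two-records-per-row layout from a couple of made-up rows and round-trips it through JSON Lines in memory. The `rows` list and the `to_records` helper are hypothetical; only the field names come from the cell above.

```python
import io
import json
from collections import Counter

# Hypothetical rows standing in for the sampled CSV (field names match the columns used above)
rows = [
    {"instructions": "Write about summer.", "human_text": "I like summer.", "ai_text": "Summer is a season."},
    {"instructions": "Describe a car.", "human_text": "My car is old.", "ai_text": "A car is a vehicle."},
]

def to_records(row):
    """Expand one paired row into a human record (label 0) and an AI record (label 1)."""
    return [
        {"text": row["human_text"], "instructions": row["instructions"], "label": 0},
        {"text": row["ai_text"], "instructions": row["instructions"], "label": 1},
    ]

# Write one JSON object per line to an in-memory buffer instead of a file
buffer = io.StringIO()
for row in rows:
    for record in to_records(row):
        buffer.write(json.dumps(record) + "\n")

# Read the lines back and count the labels; the two classes should be exactly balanced
records = [json.loads(line) for line in buffer.getvalue().splitlines()]
label_counts = Counter(r["label"] for r in records)
print(label_counts)
```

Because every source row contributes one record per class, the counts for label 0 and label 1 should always match.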

Split the dataset into train and test sets

In [None]:
import json
from sklearn.model_selection import train_test_split

# Paths
original_data_path = "data/sample_5_percent.jsonl"
train_data_path = "data/train.jsonl"
test_data_path = "data/test.jsonl"

# Function to split JSONL file
def split_jsonl_file(input_path, train_path, test_path, test_size=0.2):
    with open(input_path, "r") as f:
        lines = [json.loads(line) for line in f]
    
    train_lines, test_lines = train_test_split(lines, test_size=test_size, random_state=42)
    
    # Save split datasets
    with open(train_path, "w") as train_file:
        for line in train_lines:
            train_file.write(json.dumps(line) + "\n")
    
    with open(test_path, "w") as test_file:
        for line in test_lines:
            test_file.write(json.dumps(line) + "\n")

if __name__ == "__main__":
    # Perform the split
    split_jsonl_file(original_data_path, train_data_path, test_data_path)
    print(f"Data split completed. Train: {train_data_path}, Test: {test_data_path}")
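
The split above shuffles individual records, so the human and AI texts written for the same instruction can end up on opposite sides of the split. One possible alternative, sketched below with the standard library only, keeps all records that share an instruction together. The `group_split` helper and the sample records are hypothetical and are not part of the notebook.

```python
import random
from collections import defaultdict

def group_split(records, test_size=0.2, seed=42):
    """Split records so that all records sharing an instruction land in the same split."""
    groups = defaultdict(list)
    for record in records:
        groups[record["instructions"]].append(record)

    keys = sorted(groups)              # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(keys)

    n_test = max(1, round(test_size * len(keys)))
    test_keys = set(keys[:n_test])

    train = [r for k in keys if k not in test_keys for r in groups[k]]
    test = [r for k in keys if k in test_keys for r in groups[k]]
    return train, test

# Hypothetical records: ten instructions, each with one human and one AI text
records = [
    {"text": f"{kind} answer {i}", "instructions": f"task {i}", "label": label}
    for i in range(10)
    for kind, label in (("human", 0), ("ai", 1))
]

train, test = group_split(records)
print(len(train), len(test))
```

With this grouping, no instruction appears in both splits, which gives a stricter estimate of how the classifier handles unseen prompts.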


In [1]:
!mkdir -p models/pythia-160m
!huggingface-cli download EleutherAI/pythia-160m --local-dir ./models/pythia-160m
# Note: "!cd" runs in a subshell and does not change the notebook's working directory; use "%cd" for that

Fetching 8 files:   0%|                                   | 0/8 [00:00<?, ?it/s]Downloading 'README.md' to 'models/pythia-160m/.cache/huggingface/download/README.md.2e8b2b93bf534833600f862c90c0b53fc6f76a46.incomplete'
Downloading 'special_tokens_map.json' to 'models/pythia-160m/.cache/huggingface/download/special_tokens_map.json.0204ed10c186a4c7c68f55dff8f26087a45898d6.incomplete'
Downloading 'pytorch_model.bin' to 'models/pythia-160m/.cache/huggingface/download/pytorch_model.bin.8d856725c4a8266f10568cc1269948fc851e2aed1fb230036af2eac68e9df072.incomplete'

README.md: 100%|███████████████████████████| 13.6k/13.6k [00:00<00:00, 26.4MB/s]
Download complete. Moving file to models/pythia-160m/README.md
Downloading 'config.json' to 'models/pythia-160m/.cache/huggingface/download/config.json.b8368ff94f3bcf3088de5e9912251fc0208ae524.incomplete'

special_tokens_map.json: 100%|███████████████| 99.0/99.0 [00:00<00:00, 1.16MB/s]
Download complete. Moving file to models/pythia-160m/special_to

In [None]:
# Do not hard-code the API key in the notebook; paste it at the prompt or set WANDB_API_KEY in the environment
!wandb login

Train and evaluate (freeze the first 6 layers; LoRA with r=16, alpha=32)

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from sklearn.metrics import accuracy_score
import numpy as np

# Paths for train and test data
train_data_path = "data/train.jsonl"
test_data_path = "data/test.jsonl"

# Specify the local directory where the model was downloaded
model_path = "./models/pythia-160m"

# Load the tokenizer and model for sequence classification
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path, num_labels=2)  # Binary classification

# Add padding token if it doesn't exist and set it as the pad token
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))  # Resize model embeddings to match the new pad token

# Explicitly set pad_token_id in model configuration
model.config.pad_token_id = tokenizer.pad_token_id

# LoRA Configuration
lora_config = LoraConfig(
    task_type="SEQ_CLS",   # Sequence classification
    inference_mode=False,
    r=16,                  # LoRA rank
    lora_alpha=32,         # Scaling factor
    lora_dropout=0.1       # Regularization
)

# Wrap the model with LoRA
model = get_peft_model(model, lora_config)

# Freeze the first few layers of GPT-NeoX
num_layers_to_freeze = 6  # Adjust based on model depth and dataset size

# For GPT-NeoX, transformer layers are in model.base_model.gpt_neox.layers
for layer in model.base_model.gpt_neox.layers[:num_layers_to_freeze]:
    for param in layer.parameters():
        param.requires_grad = False

# Report the trainable parameter count (expected: the classification head plus the LoRA adapters in the unfrozen layers)
model.print_trainable_parameters()

# Load the split datasets
train_dataset = load_dataset("json", data_files=train_data_path)["train"]
test_dataset = load_dataset("json", data_files=test_data_path)["train"]

# Preprocessing function for tokenization and label mapping
def preprocess_function(examples):
    # Tokenize the text
    inputs = tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)
    inputs["labels"] = examples["label"]  # Use label for classification
    return inputs

# Tokenize the datasets
tokenized_train_dataset = train_dataset.map(preprocess_function, batched=True)
tokenized_test_dataset = test_dataset.map(preprocess_function, batched=True)

# Define a function to compute accuracy
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # Take the highest probability class
    accuracy = accuracy_score(labels, predictions)
    return {"accuracy": accuracy}

# Set up training arguments
training_args = TrainingArguments(
    output_dir="./models/pythia-160m-finetuned-classifier-lora",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    save_strategy="epoch",     # Save the model at the end of each epoch
    evaluation_strategy="epoch",
    save_total_limit=2,
    logging_dir='./logs',
    logging_steps=100,
    load_best_model_at_end=True,
    learning_rate=1e-4,        # Adjusted for PEFT
    fp16=True,                 # Enable mixed precision training if supported
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_test_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()

# Save the final model
model.save_pretrained("./models/pythia-160m-finetuned-classifier-lora")
tokenizer.save_pretrained("./models/pythia-160m-finetuned-classifier-lora")

print("Model fine-tuning completed and saved to './models/pythia-160m-finetuned-classifier-lora'")

# Evaluate the model
eval_results = trainer.evaluate()
print(f"Evaluation Results: {eval_results}")
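
To make the LoRA settings in the training cell (r=16, lora_alpha=32, six frozen layers) more concrete, here is a rough back-of-the-envelope count of the adapter parameters. The model dimensions are assumptions that the notebook does not state: a hidden size of 768, 12 transformer layers, and adapters attached only to a fused `query_key_value` projection of shape 768 to 2304. The classification head is not counted. Treat the numbers as an estimate and compare them with what `print_trainable_parameters()` reports.

```python
def lora_param_count(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r

# Values taken from the configuration in this section
r, lora_alpha = 16, 32
frozen_layers = 6

# Assumptions (not stated in the notebook): hidden size, layer count,
# and a single adapted projection per layer of shape hidden -> 3 * hidden
hidden_size, num_layers = 768, 12

per_layer = lora_param_count(hidden_size, 3 * hidden_size, r)
adapter_params_all_layers = per_layer * num_layers
adapter_params_unfrozen = per_layer * (num_layers - frozen_layers)

# LoRA scales the low-rank update by lora_alpha / r
scaling = lora_alpha / r

print(per_layer, adapter_params_all_layers, adapter_params_unfrozen, scaling)
```

Under these assumptions, freezing the first six layers halves the number of adapter parameters that receive gradients.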


Evaluation on untrained model

In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from sklearn.metrics import accuracy_score
import numpy as np

# Specify the paths for the train and test datasets
train_data_path = "data/train.jsonl"
test_data_path = "data/test.jsonl"

# Specify the local directory where the model was downloaded
model_path = "./models/pythia-160m"

# Load the tokenizer and model for sequence classification
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path, num_labels=2)  # Binary classification

# Add padding token if it doesn't exist and set it as the pad token
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))  # Resize model embeddings to match the new pad token

# Explicitly set pad_token_id in model configuration
model.config.pad_token_id = tokenizer.pad_token_id

# LoRA Configuration
lora_config = LoraConfig(
    task_type="SEQ_CLS",   # Sequence classification
    inference_mode=False,
    r=16,                  # LoRA rank
    lora_alpha=32,         # Scaling factor
    lora_dropout=0.1       # Regularization
)

# Wrap the model with LoRA
model = get_peft_model(model, lora_config)

# Load the train and test datasets
train_dataset = load_dataset("json", data_files=train_data_path)["train"]
test_dataset = load_dataset("json", data_files=test_data_path)["train"]

# Preprocessing function for tokenization and label mapping
def preprocess_function(examples):
    # Tokenize the text
    inputs = tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)
    inputs["labels"] = examples["label"]  # Use label for classification
    return inputs

# Tokenize the datasets
tokenized_train_dataset = train_dataset.map(preprocess_function, batched=True)
tokenized_test_dataset = test_dataset.map(preprocess_function, batched=True)

# Define a function to compute accuracy
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # Take the highest probability class
    accuracy = accuracy_score(labels, predictions)
    return {"accuracy": accuracy}

# Set up evaluation arguments
evaluation_args = TrainingArguments(
    output_dir="./models/pythia-160m-eval",
    per_device_eval_batch_size=8,
    logging_dir='./logs',
    fp16=True,  # Enable mixed precision evaluation if supported
)

# Initialize the Trainer for evaluation only
trainer = Trainer(
    model=model,
    args=evaluation_args,
    train_dataset=tokenized_train_dataset,  # Optional: If you're training as well
    eval_dataset=tokenized_test_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

# Evaluate the untrained model
eval_results = trainer.evaluate()
print(f"Evaluation Results (Untrained Model): {eval_results}")


Print the hidden-state outputs that feed the classification head

In [14]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import json
import numpy as np
from datasets import load_dataset
from peft import PeftModel

# Path to the fine-tuned model and test data
model_path = "./models/pythia-160m-finetuned-classifier-lora"
base_model_path = "./models/pythia-160m"  # Base pre-trained model path
test_data_path = "data/test.jsonl"

# Load the tokenizer from the fine-tuned directory
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Add padding token if not already defined
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})

# Load the base model
base_model = AutoModelForSequenceClassification.from_pretrained(
    base_model_path,
    num_labels=2
)

# Resize the base model's embedding layer to match the tokenizer
base_model.resize_token_embeddings(len(tokenizer))

# Set the padding token ID in the model configuration
base_model.config.pad_token_id = tokenizer.pad_token_id

# Load the LoRA adapters into the resized base model
model = PeftModel.from_pretrained(base_model, model_path)

# Ensure the model is in evaluation mode
model.eval()

# Load the test dataset
test_dataset = load_dataset("json", data_files=test_data_path)["train"]

# Extract the text prompts from the dataset
test_prompts = test_dataset["text"][:10]  # Select only the first 10 inputs

# Tokenize the test prompts
inputs = tokenizer(
    test_prompts,
    padding=True,  # Enable padding
    truncation=True,
    max_length=128,
    return_tensors="pt"
)

# Move tensors to the appropriate device (GPU if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
inputs = {key: value.to(device) for key, value in inputs.items()}

# Pass the inputs through the model to get hidden states
with torch.no_grad():
    outputs = model.base_model(**inputs, output_hidden_states=True)
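    # Extract the last layer's hidden states (the input to the classification head)
    hidden_states = outputs.hidden_states[-1]
    # Take the hidden state at position 0 of each prompt. GPT-NeoX has no CLS token, so this is
    # simply the first token; the classification head itself pools the last non-padding token.
    pooled_embeddings = hidden_states[:, 0, :]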

# Convert embeddings to a numpy array for saving
pooled_embeddings_np = pooled_embeddings.cpu().numpy()

# Compute a softmax over the hidden dimensions of pooled_embeddings_np[1] (the second prompt)
softmax_values = torch.nn.functional.softmax(torch.tensor(pooled_embeddings_np[1]), dim=0).numpy()

# Get the top 10 highest values and their indices
top_10_indices = np.argsort(softmax_values)[-10:][::-1] 
top_10_values = softmax_values[top_10_indices]

# Print the top 10 highest values and their corresponding indices
print("Top 10 softmax values:")
for i, (index, value) in enumerate(zip(top_10_indices, top_10_values)):
    print(f"{i + 1}: Index {index}, Value {value}")


Some weights of GPTNeoXForSequenceClassification were not initialized from the model checkpoint at ./models/pythia-160m and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Top 10 softmax values:
1: Index 650, Value 0.13100165128707886
2: Index 544, Value 0.0750051960349083
3: Index 212, Value 0.06678420305252075
4: Index 759, Value 0.052072495222091675
5: Index 69, Value 0.03131868690252304
6: Index 632, Value 0.029305055737495422
7: Index 78, Value 0.025107571855187416
8: Index 423, Value 0.023854373022913933
9: Index 372, Value 0.023483315482735634
10: Index 286, Value 0.020786363631486893
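
The cell above reads the hidden state at position 0. For a causal model, a sequence-classification head usually reads the last non-padding position instead. The sketch below shows that selection with plain Python lists, assuming right padding. The `last_token_index` helper and the toy masks and hidden states are hypothetical.

```python
def last_token_index(attention_mask):
    """Index of the last non-padding position, assuming right padding (1 = token, 0 = pad)."""
    return sum(attention_mask) - 1

# Hypothetical batch of two attention masks, padded on the right to length 6
attention_masks = [
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
]

# Hypothetical hidden states: one small vector per position, filled with the position number
hidden_states = [
    [[float(pos)] * 3 for pos in range(6)],
    [[float(pos) + 10] * 3 for pos in range(6)],
]

# Pick the vector at the last real token of each sequence
pooled = [
    states[last_token_index(mask)]
    for states, mask in zip(hidden_states, attention_masks)
]
print(pooled)
```

With left padding the last real token is always at position -1, so the helper would need to change accordingly.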


Temp workspace for generative output

In [15]:
from datasets import load_dataset, DatasetDict
import json

# Load the dataset from Hugging Face
dataset = load_dataset("dmitva/human_ai_generated_text", split="train")
train_testvalid = dataset.train_test_split(test_size=0.2)
test_valid = train_testvalid['test'].train_test_split(test_size=0.2)
dataset = DatasetDict({
    'train': train_testvalid['train'],
    'test': test_valid['test'],
    'dev': test_valid['train']})

for split in ["train", "test", "dev"]:
    with open(f"data/processed_{split}.jsonl", "w") as f:
        # Process each row to create two entries: one for human text, one for AI text
        for row in dataset[split].iter(batch_size=1):
            # Build the human record with label 0
            human = {
                "text": row["human_text"][0],
                "instructions": row["instructions"][0],
                "label": 0
            }

            # Append the AI text with label 1
            ai = {
                "text": row["ai_text"][0],
                "instructions": row["instructions"][0],
                "label": 1
            }

            json.dump(human, f)
            f.write("\n")
            json.dump(ai, f)
            f.write("\n")



In [18]:
from transformers import AutoTokenizer
from functools import partial
import datasets

def _create_prompt(entry, instruct_key="instructions", response_key="text"): 
    label = "human" if entry["label"] == 0 else "AI"
    prompt = f"Based on the task instruction, determine if the response is written by a human, or AI generated.\nInstruction: {entry[instruct_key]}\n\nResponse: {entry[response_key]}\n\nThe response is written by: {label}"

    entry["input_text"] = prompt
    return entry

def preprocess_batch(batch, tokenizer, max_length):

    return tokenizer(
        batch["input_text"],
        max_length=max_length,
        truncation=True,
    )

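def preprocess_dataset(tokenizer: AutoTokenizer, max_length: int, dataset: datasets.Dataset):
    """Format and tokenize the dataset so it is ready for training.
    :param tokenizer (AutoTokenizer): Model tokenizer
    :param max_length (int): Maximum number of tokens to emit from the tokenizer
    :param dataset (datasets.Dataset): Dataset with "instructions", "text", and "label" columns
    :return: Tokenized dataset; examples that reach max_length (i.e. were truncated) are dropped
    """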
    
    dataset = dataset.map(_create_prompt)

    _preprocessing_function = partial(preprocess_batch, max_length=max_length, tokenizer=tokenizer)
    processed_dat = dataset.map(
        _preprocessing_function,
        batched=True,
        remove_columns=dataset.column_names,
    )

    processed_dat = processed_dat.filter(lambda sample: len(sample["input_ids"]) < max_length)
    
    return processed_dat
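
The prompt built by `_create_prompt` ends with the label word, which suits training. At inference time the same template would normally stop right after "The response is written by:" so the model has to supply the label. A minimal sketch of that variant follows; `PROMPT_TEMPLATE`, `build_prompt`, and the sample entry are hypothetical names introduced here for illustration.

```python
# Same wording as the training prompt above, without the trailing label
PROMPT_TEMPLATE = (
    "Based on the task instruction, determine if the response is written by a human, or AI generated.\n"
    "Instruction: {instruction}\n\n"
    "Response: {response}\n\n"
    "The response is written by:"
)

def build_prompt(entry, include_label):
    """Build the prompt; append the label word only for training examples."""
    prompt = PROMPT_TEMPLATE.format(instruction=entry["instructions"], response=entry["text"])
    if include_label:
        label = "human" if entry["label"] == 0 else "AI"
        prompt = f"{prompt} {label}"
    return prompt

# Hypothetical entry in the same format as the processed JSONL records
entry = {"instructions": "Describe a car.", "text": "A car is a vehicle.", "label": 1}

train_prompt = build_prompt(entry, include_label=True)
eval_prompt = build_prompt(entry, include_label=False)
print(train_prompt)
print("-" * 40)
print(eval_prompt)
```

Keeping one template for both cases means the evaluation prompt is always an exact prefix of the training prompt.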

Token IDs of the label words in the vocabulary

In [14]:
tokenizer.encode("human")

[13961]

In [11]:
tokenizer.encode("AI")

[18128]

In [4]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from datasets import load_dataset

# Paths to model and dataset
base_model_path = "./models/pythia-160m"
test_data_path = "data/test.jsonl"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model_path)
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_path,
)
base_model.resize_token_embeddings(len(tokenizer))
base_model.config.pad_token_id = tokenizer.pad_token_id

# Load the test dataset
test_dataset = load_dataset("json", data_files=test_data_path)["train"]
test_prompts = test_dataset["text"][:10]  # Select the first 10 prompts

# Tokenize the test prompts
inputs = tokenizer(
    test_prompts,
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="pt"
)

# Move tensors to the appropriate device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
base_model.to(device)
inputs = {key: value.to(device) for key, value in inputs.items()}

# Token IDs of the two label words (each encodes to a single token with this tokenizer)
human_token_id = tokenizer.encode("human")[0]
ai_token_id = tokenizer.encode("AI")[0]

# Run a forward pass to get the next-token logits
with torch.no_grad():
    outputs = base_model(**inputs)

# Compare the "human" and "AI" logits at the final position of each prompt.
# Note: with right padding, position -1 is a pad token for any prompt shorter than the longest one in the batch.
for i, prompt in enumerate(test_prompts):
    ai_logit = outputs.logits[i, -1, ai_token_id].item()
    human_logit = outputs.logits[i, -1, human_token_id].item()

    # Report which label word the model scores higher
    if human_logit > ai_logit:
        print(f"Prompt {i}: human > AI")
    else:
        print(f"Prompt {i}: AI > human")

    
torch.cuda.empty_cache()

Prompt 0: human > AI
Prompt 1: human > AI
Prompt 2: human > AI
Prompt 3: human > AI
Prompt 4: AI > human
Prompt 5: human > AI
Prompt 6: human > AI
Prompt 7: human > AI
Prompt 8: human > AI
Prompt 9: human > AI
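
The comparison above only reports which logit is larger. If a confidence value is useful, the two logits can be turned into probabilities with a softmax restricted to the two label tokens. The sketch below uses made-up logit values; `two_way_probabilities` is a hypothetical helper, not part of the notebook.

```python
import math

def two_way_probabilities(human_logit, ai_logit):
    """Softmax over just the two label logits; returns (p_human, p_ai)."""
    # Subtract the maximum before exponentiating for numerical stability
    m = max(human_logit, ai_logit)
    exp_human = math.exp(human_logit - m)
    exp_ai = math.exp(ai_logit - m)
    total = exp_human + exp_ai
    return exp_human / total, exp_ai / total

# Made-up logits for illustration only
p_human, p_ai = two_way_probabilities(2.0, 1.0)
print(f"p(human) = {p_human:.3f}, p(AI) = {p_ai:.3f}")
```

These probabilities are relative to the two label tokens only and ignore the rest of the vocabulary.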


In [7]:
import json
from datasets import load_dataset, Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

def load_test_data(file_path: str) -> Dataset:
    """Load the test data from a JSONL file into a Hugging Face Dataset."""
    dataset = load_dataset("json", data_files=file_path, split="train")
    return dataset

def make_few_shot_prompt(dev_ds: Dataset) -> str:
    """
    Create a few-shot learning prompt for classifying text as 'human' or 'AI' generated.
    
    Args:
        dev_ds (Dataset): Dataset containing text examples and labels.
        
    Returns:
        str: A few-shot prompt string.
    """
    examples = []
    # Labels follow the data preparation step: 0 = human, 1 = AI
    label_map = {0: "Human-generated", 1: "AI-generated"}

    for example in dev_ds.select(range(3)):  # Take three examples to build the prompt
        text = example["text"]
        label = example["label"]
        label_text = label_map.get(label, "Human-generated")  # Default to 'Human-generated'

        # Format the example with text and its corresponding label
        example_str = f"Text: {text}\nLabel: {label_text}\n"
        examples.append(example_str)

    # Combine examples into a few-shot prompt
    few_shot_prompt = "\n".join(examples) + "\nClassify the following text as 'Human-generated' or 'AI-generated':"
    return few_shot_prompt

def classify_text(prompt: str, text: str, model, tokenizer) -> str:
    """
    Classify a single text using the model and the prompt.
    
    Args:
        prompt (str): The few-shot learning prompt.
        text (str): The text to classify.
        model: The pre-trained language model.
        tokenizer: The tokenizer for the language model.
        
    Returns:
        str: The predicted label ('Human-generated' or 'AI-generated').
    """
    input_text = f"{prompt}\nText: {text}\nLabel:"
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512)
    inputs = {key: value.to(device) for key, value in inputs.items()}

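    # Generate a short continuation that should contain the label
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=5)

    # Decode only the newly generated tokens; the prompt itself already contains both label strings
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    prediction = tokenizer.decode(new_tokens, skip_special_tokens=True)

    # Extract the label from the generated continuation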
    if "AI-generated" in prediction:
        return "AI-generated"
    elif "Human-generated" in prediction:
        return "Human-generated"
    else:
        return "Unknown"

# Paths to model and dataset
test_data_path = "data/test.jsonl"
base_model_path = "models/pythia-160m"  # Use your pre-trained language model

# Load the test dataset
test_dataset = load_test_data(test_data_path)

# Limit the test dataset to 3 prompts
test_dataset = test_dataset.select(range(3))  # Select only the first 3 prompts

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(base_model_path)
model = AutoModelForCausalLM.from_pretrained(base_model_path)

# Move model to appropriate device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Generate the few-shot prompt using the first few examples
few_shot_prompt = make_few_shot_prompt(test_dataset)

# Print the prompt used
print("Few-Shot Prompt Used:")
print(few_shot_prompt)
print("=" * 80)

# Classify each example in the limited test dataset
results = []
for example in test_dataset:
    text = example["text"]
    true_label = example["label"]
    predicted_label = classify_text(few_shot_prompt, text, model, tokenizer)
    results.append({"text": text, "true_label": true_label, "predicted_label": predicted_label})

# Print the results
for result in results:
    print(f"Text: {result['text']}")
    print(f"True Label: {'Human-generated' if result['true_label'] == 0 else 'AI-generated'}")
    print(f"Predicted Label: {result['predicted_label']}\n")
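
Printing each prediction is fine for three examples, but a summary number helps once more examples are classified. The sketch below scores a results list in the same format as the one built above, counting "Unknown" predictions as wrong and reporting how often they occur. It assumes the label convention from the data preparation step (0 = human, 1 = AI); `LABEL_NAMES`, `accuracy`, `unknown_rate`, and the sample results are hypothetical.

```python
# Label convention from the data preparation step: 0 = human, 1 = AI
LABEL_NAMES = {0: "Human-generated", 1: "AI-generated"}

def accuracy(results):
    """Fraction of results whose predicted label matches the true label."""
    if not results:
        return 0.0
    correct = sum(
        1 for r in results if r["predicted_label"] == LABEL_NAMES[r["true_label"]]
    )
    return correct / len(results)

def unknown_rate(results):
    """Fraction of results where no label could be parsed from the model output."""
    if not results:
        return 0.0
    unknown = sum(1 for r in results if r["predicted_label"] == "Unknown")
    return unknown / len(results)

# Made-up results for illustration only
sample_results = [
    {"text": "a", "true_label": 0, "predicted_label": "Human-generated"},
    {"text": "b", "true_label": 1, "predicted_label": "AI-generated"},
    {"text": "c", "true_label": 1, "predicted_label": "Human-generated"},
    {"text": "d", "true_label": 0, "predicted_label": "Unknown"},
]

print(f"accuracy = {accuracy(sample_results):.2f}, unknown = {unknown_rate(sample_results):.2f}")
```

Tracking the unknown rate separately shows whether errors come from wrong labels or from outputs that contain no label at all.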


Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.
Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Few-Shot Prompt Used:
Text: He became the owner of his own brokerage firm and went on to lead a successful life. He was a living testament that one can rise above the circumstances of their life and achieve success against all odds. He rose from a tumultuous childhood and poverty to become a successful entrepreneur, stockbroker, and author. Chris had a great impact on society and his story will continue to be a source of hope and inspiration.. He wrote numerous books and appeared in major films, magnifying his reach and inspiring others.

Chris's journey of resilience serves a source of inspiration to many. He put in the work to earn his degree in business administration. He credited his strong will to stay in school and have a better life as a key factor in his success.

Thanks to Chris's relentless determination and hard work, he was able to break barriers within the world of banking and finance. 
Chris Gardner's story is an inspiring one. Despite facing numerous personal and profess