# MSAI 495 | Text Generation | Conversation Primer

### Business Goal / Case Statement

Help people start engaging conversations by generating topic-appropriate conversation starters with AI.

### Assignment Context

**Relevant Industry and/or Business Function:** Social/Messaging apps

**Description:**

The proposal for my Text Generation project is as follows:

```
1. Title
ConvoBot: An AI-Powered Conversation Starter Generator
2. Text Data Source
Dataset: Langame/conversation-starters from Hugging Face (17,470 conversation prompts)
Nature of the Problem:
This dataset contains conversation starters categorized by topics (video games, science, relationships, philosophy, etc.) with varying complexity levels from ice breakers to deep philosophical discussions. The challenge is to generate contextually appropriate, engaging conversation starters that match specific topics or social situations. This addresses the real-world problem of social anxiety and difficulty initiating meaningful conversations.
Data Characteristics:
	•	17,470 diverse prompts with topic tags
	•	Wide range of conversation depths (casual to profound)
	•	Multiple topic categories for targeted generation
	•	Varying prompt lengths and complexity
3. Model Architecture(s)
Primary Approach: Fine-tune a pre-trained transformer model (GPT-2 or T5)
	•	GPT-2 Medium/Large: For autoregressive generation of conversation starters
	•	T5-Base: For conditional generation based on topic inputs
```

The output of the dataset loading and exploration step is shown below:

```
/usr/local/lib/python3.11/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning:
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
  warnings.warn(
Dataset structure:
DatasetDict({
    train: Dataset({
        features: ['topics', 'prompt'],
        num_rows: 17470
    })
})

First few examples:
{'topics': [['video games'], ['science'], ['relationship'], ['personal', 'relationship', 'relationships', 'social', 'big talk', 'personal growth'], ['transhumanism', 'fun']], 'prompt': ['What was the most difficult aspect of mastering a video game?', 'What scientific or intellectual studies do you think would increase substantially if more people contributed around them?', 'What would you do if your partner disappeared?', 'Give eachother four praises and one critique', 'If a sufficiently advanced robot were to take your place in society and have children with a sufficiently advanced robot, would the children have any advantages over you?']}

Column names: ['topics', 'prompt']

Dataset size: 17470
```

Things to think about: "Interesting idea! Will the model be trained to generate similar-style starters, or context-aware responses based on an input topic? And will the training be conditioned on context like domain (gaming, dating, etc.)? Consider these two questions during implementation. Looking forward to seeing your work."

Walk me through, step by step, how I would fine-tune GPT-2 on this dataset to implement the core functionality. I want to do this in a Google Colab notebook. We can leave out extra criteria for now. I will prompt you with "next" when I am ready to proceed to the next step.

### The Data

**Dataset name:** [conversation-starters](https://huggingface.co/datasets/Langame/conversation-starters)

**Data characteristics**

* 17,470 diverse prompts with topic tags

* Wide range of conversation depths (casual to profound)

* Multiple topic categories for targeted generation

* Varying prompt lengths and complexity

### Model Architecture(s)

* GPT-2 Medium/Large: For autoregressive generation of conversation starters

### AI/ML Task(s)

Fine-tune a pre-trained transformer model (GPT-2)

## Step 1: Environment Setup and Installation

In [1]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


**Navigate to the Project Directory**: Change your current directory to the `src` folder:

In [2]:
import os
os.chdir('/content/drive/My Drive/Northwestern/MSAI_495/conversation_primer/src')

**Install Dependencies**: Install the required Python packages using `pip`:

In [3]:
!pip install -r requirements.txt



**Imports**:

In [4]:
import torch
import pandas as pd
import json
from transformers import (
    GPT2LMHeadModel,
    GPT2Tokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)
from datasets import Dataset
import numpy as np

## Step 2: Load and Explore the Dataset

In [5]:
# Load the dataset
from datasets import load_dataset

# Load the conversation starters dataset
dataset = load_dataset("Langame/conversation-starters")

# Explore the dataset structure
print("Dataset structure:")
print(dataset)
print("\nFirst few examples:")
print(dataset['train'][:5])

# Check the columns and data format
print("\nColumn names:", dataset['train'].column_names)
print("\nDataset size:", len(dataset['train']))

Dataset structure:
DatasetDict({
    train: Dataset({
        features: ['topics', 'prompt'],
        num_rows: 17470
    })
})

First few examples:
{'topics': [['video games'], ['science'], ['relationship'], ['personal', 'relationship', 'relationships', 'social', 'big talk', 'personal growth'], ['transhumanism', 'fun']], 'prompt': ['What was the most difficult aspect of mastering a video game?', 'What scientific or intellectual studies do you think would increase substantially if more people contributed around them?', 'What would you do if your partner disappeared?', 'Give eachother four praises and one critique', 'If a sufficiently advanced robot were to take your place in society and have children with a sufficiently advanced robot, would the children have any advantages over you?']}

Column names: ['topics', 'prompt']

Dataset size: 17470
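Since each row carries a list of topic tags, it can help to see how often each tag occurs before deciding how to condition generation. The sketch below counts tags over the five rows copied from the sample output above; `count_topics` is a hypothetical helper, and on the full dataset the same function could presumably be fed `dataset['train']['topics']`.

```python
from collections import Counter

# Rows copied from the sample output above (the 'topics' column only).
sample_topics = [
    ['video games'],
    ['science'],
    ['relationship'],
    ['personal', 'relationship', 'relationships', 'social', 'big talk', 'personal growth'],
    ['transhumanism', 'fun'],
]

def count_topics(topic_lists):
    """Count how many rows each topic tag appears in (hypothetical helper)."""
    counts = Counter()
    for tags in topic_lists:
        counts.update(set(tags))  # set() so a tag repeated within one row counts once
    return counts

topic_counts = count_topics(sample_topics)
tags_per_row = [len(tags) for tags in sample_topics]

print(topic_counts.most_common(3))
print("Average tags per row:", sum(tags_per_row) / len(tags_per_row))
```

Even this small sample contains both `relationship` and `relationships`, so normalizing near-duplicate tags may be worth considering before training.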


## Step 3: Data Preprocessing and Tokenization

Preprocess the conversation starters into a format suitable for GPT-2 training:

In [6]:
def format_conversation_data(examples):
    """
    Format a batch of rows as GPT-2 training text
    """
    formatted_texts = []

    for topics, prompt in zip(examples['topics'], examples['prompt']):
        # `topics` is the list of tag strings for one row, e.g. ['transhumanism', 'fun'].
        # Join every tag; fall back to "general" when a row has none.
        topic_str = ", ".join(str(t) for t in topics) if topics else "general"

        # <|startoftext|> is registered as a special token in Step 4;
        # <|endoftext|> is GPT-2's built-in end-of-sequence token.
        formatted_text = f"<|startoftext|>Topic: {topic_str}\nConversation Starter: {prompt}<|endoftext|>"
        formatted_texts.append(formatted_text)

    return {"text": formatted_texts}

# Re-format your dataset
print("Re-formatting dataset with improved structure...")
formatted_dataset = dataset['train'].map(format_conversation_data, batched=True)
formatted_dataset = formatted_dataset.remove_columns(['topics', 'prompt'])

print("Sample formatted text:")
print(formatted_dataset[0]['text'])

Re-formatting dataset with improved structure...
Sample formatted text:
<|startoftext|>Topic: video games
Conversation Starter: What was the most difficult aspect of mastering a video game?<|endoftext|>
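This format is what conditions generation on a topic: at inference time the model receives everything up to `Conversation Starter:` and completes the rest. A minimal sketch of keeping the training text and the inference prompt in sync follows; `build_prompt` and `build_training_text` are hypothetical helpers, not part of the notebook.

```python
BOS = "<|startoftext|>"
EOS = "<|endoftext|>"

def build_prompt(topic):
    """Prefix given to the model at inference time (hypothetical helper)."""
    return f"{BOS}Topic: {topic}\nConversation Starter:"

def build_training_text(topic, starter):
    """Full training example: the inference prefix plus the target and EOS."""
    return f"{build_prompt(topic)} {starter}{EOS}"

example = build_training_text(
    "video games",
    "What was the most difficult aspect of mastering a video game?",
)
print(example)
```

Deriving the training text from the same prefix function makes it harder for the two formats to drift apart when the template is changed later.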


## Step 4: Model and Tokenizer Setup

Load the GPT-2 model and tokenizer, and configure them for fine-tuning:

In [7]:
# Load GPT-2 model and tokenizer
model_name = "gpt2"  # You can also use "gpt2-medium" for better performance
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# Add padding token (GPT-2 doesn't have one by default)
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = model.config.eos_token_id

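# Add special tokens BEFORE resizing embeddings.
# Only <|startoftext|> appears in the Step 3 text format; <|topic|> and
# <|starter|> are registered here but not yet used.
special_tokens = {
    "additional_special_tokens": ["<|startoftext|>", "<|topic|>", "<|starter|>"]
}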
num_added_tokens = tokenizer.add_special_tokens(special_tokens)

# Resize model embeddings to accommodate new tokens
model.resize_token_embeddings(len(tokenizer))

print(f"Model loaded: {model_name}")
print(f"Added {num_added_tokens} special tokens")
print(f"New vocabulary size: {len(tokenizer)}")
print(f"Model parameters: {model.num_parameters():,}")

The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`


Model loaded: gpt2
Added 3 special tokens
New vocabulary size: 50260
Model parameters: 124,442,112
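The numbers printed above can be checked by hand. The sketch assumes the stock `gpt2` checkpoint has a 50,257-token vocabulary, 768-dimensional embeddings, and 124,439,808 parameters with tied input and output embeddings; these figures are recalled from the public model, not printed by this notebook.

```python
# Assumed figures for the stock "gpt2" checkpoint (not printed in the notebook).
BASE_VOCAB_SIZE = 50_257
HIDDEN_SIZE = 768
BASE_PARAMETERS = 124_439_808  # input and output embeddings share weights

num_added_tokens = 3  # "<|startoftext|>", "<|topic|>", "<|starter|>"

new_vocab_size = BASE_VOCAB_SIZE + num_added_tokens
# Each new token adds one embedding row of HIDDEN_SIZE values.
new_parameters = BASE_PARAMETERS + num_added_tokens * HIDDEN_SIZE

print(f"New vocabulary size: {new_vocab_size}")
print(f"Model parameters: {new_parameters:,}")
```

Under those assumptions the result agrees with the vocabulary size and parameter count reported by the cell.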


## Step 5: Tokenization Function

Create a function to tokenize formatted text data for training:

In [8]:
def tokenize_function(examples):
    """
    Tokenize the text data for GPT-2 training
    """
    # Tokenize the text
    tokenized = tokenizer(
        examples["text"],
        padding="max_length",  # Pad all sequences to max_length
        truncation=True,
        max_length=512,  # Adjust based on your needs and GPU memory
        return_tensors=None
    )

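    # Copy input_ids as labels. DataCollatorForLanguageModeling (Step 6) rebuilds
    # labels from input_ids at batch time, so this mainly documents intent.
    tokenized["labels"] = [list(ids) for ids in tokenized["input_ids"]]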

    return tokenized

# Re-apply the corrected tokenization
print("Re-tokenizing dataset with proper padding...")
tokenized_dataset = formatted_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=["text"]
)

print(f"Re-tokenization complete!")
print(f"Sample lengths should now be consistent: {len(tokenized_dataset[0]['input_ids'])}")

Re-tokenizing dataset with proper padding...


Re-tokenization complete!
Sample lengths should now be consistent: 512
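Padding every example to 512 tokens is simple but costly when the starters are short. The rough sketch below compares fixed-length padding with padding each batch to its longest member. The `lengths` values are illustrative assumptions, not measurements from the dataset, and both functions are hypothetical helpers.

```python
def padded_tokens_fixed(lengths, max_length):
    """Total tokens processed when every sequence is padded to max_length."""
    return len(lengths) * max_length

def padded_tokens_dynamic(lengths, batch_size):
    """Total tokens processed when each batch is padded to its longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += len(batch) * max(batch)
    return total

# Illustrative token counts for eight short examples (assumed, not measured).
lengths = [24, 31, 19, 45, 28, 22, 37, 26]

fixed = padded_tokens_fixed(lengths, max_length=512)
dynamic = padded_tokens_dynamic(lengths, batch_size=2)
real = sum(lengths)

print(f"Real tokens: {real}")
print(f"Fixed padding to 512: {fixed} tokens ({real / fixed:.1%} real)")
print(f"Dynamic padding: {dynamic} tokens ({real / dynamic:.1%} real)")
```

In the notebook, moving to dynamic padding would likely mean dropping `padding="max_length"` and the precomputed `labels` column in `tokenize_function` and letting the data collator pad each batch. Treat this as a possible optimization rather than a required change.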


## Step 6: Data Splitting and Data Collator Setup

Split dataset into training and validation sets, and set up the data collator:

In [9]:
# Split the dataset into train and validation sets
train_test_split = tokenized_dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = train_test_split['train']
eval_dataset = train_test_split['test']

print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(eval_dataset)}")

# Set up data collator for language modeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,  # GPT-2 uses causal language modeling, not masked language modeling
    return_tensors="pt"
)

Training samples: 15723
Validation samples: 1747
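One subtlety of reusing EOS as the pad token: with `mlm=False`, the collator builds labels from `input_ids` and sets every position equal to the pad token id to -100 so that the loss ignores it. When pad and EOS share an id, the genuine end-of-text token is masked as well. Below is a simplified pure-Python sketch of that masking rule; `mask_pad_labels` is a hypothetical stand-in for the collator, and every token id other than 50256 is made up.

```python
IGNORE_INDEX = -100
EOS_ID = 50256   # GPT-2's <|endoftext|> id
PAD_ID = EOS_ID  # mirrors tokenizer.pad_token = tokenizer.eos_token

def mask_pad_labels(input_ids, pad_id):
    """Copy input_ids to labels, ignoring every position that equals pad_id."""
    return [IGNORE_INDEX if tok == pad_id else tok for tok in input_ids]

# Made-up ids for a short example: 3 content tokens, the real EOS, then 2 pads.
input_ids = [101, 102, 103, EOS_ID, PAD_ID, PAD_ID]
labels = mask_pad_labels(input_ids, PAD_ID)

print(labels)
```

Because the real EOS is never a training target under this scheme, the model may not learn when to stop on its own. A dedicated pad token is one common workaround, and the sentence-trimming logic in Step 10 partly compensates for the issue.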


## Step 7: Training Arguments Configuration

In [10]:
# Define training arguments
training_args = TrainingArguments(
    output_dir="./gpt2-conversation-starters",
    overwrite_output_dir=True,
    num_train_epochs=1,  # Ignored while max_steps is positive (see below)
    per_device_train_batch_size=2,  # Reduced from 4
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=4,  # Increased to maintain effective batch size
    warmup_steps=100,  # Reduced from 500
    learning_rate=5e-5,
    logging_steps=50,   # More frequent logging
    eval_strategy="steps",
    eval_steps=200,     # More frequent evaluation
    save_steps=500,     # More frequent saving
    save_total_limit=2,
    prediction_loss_only=True,
    remove_unused_columns=False,
    dataloader_pin_memory=False,
    fp16=True,  # Mixed precision; needs a GPU runtime
    report_to="none",  # Disable W&B and other logging integrations
    max_steps=1000,  # A positive value overrides num_train_epochs
)

effective_batch_size = (
    training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps
)
steps_per_epoch = len(train_dataset) // effective_batch_size
# max_steps takes precedence over num_train_epochs when it is positive
total_steps = (
    training_args.max_steps
    if training_args.max_steps > 0
    else int(steps_per_epoch * training_args.num_train_epochs)
)

print("Training arguments configured!")
print(f"Effective batch size: {effective_batch_size}")
print(f"Total training steps: {total_steps}")

Training arguments configured!
Effective batch size: 8
Total training steps: 1000
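The arithmetic behind these settings can be sketched as follows. The sketch assumes the default linear schedule (warm up from 0 over `warmup_steps`, then decay linearly to 0 at the final step) and approximates steps per epoch with integer division, so the Trainer's own count may differ slightly. `linear_lr` is a hypothetical helper, not a library function.

```python
train_samples = 15_723
per_device_batch = 2
grad_accum = 4
max_steps = 1_000
warmup_steps = 100
peak_lr = 5e-5

effective_batch = per_device_batch * grad_accum
steps_per_epoch = train_samples // effective_batch
# A positive max_steps caps training regardless of num_train_epochs.
samples_seen = max_steps * effective_batch
epochs_covered = samples_seen / train_samples

def linear_lr(step):
    """Learning rate at an optimizer step under linear warmup and linear decay."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, max_steps - step)
    return peak_lr * remaining / (max_steps - warmup_steps)

print(f"Effective batch size: {effective_batch}")
print(f"Steps per epoch (approx.): {steps_per_epoch}")
print(f"Epochs covered by {max_steps} steps: {epochs_covered:.2f}")
print(f"LR at steps 50, 100, 550: {linear_lr(50):.1e}, {linear_lr(100):.1e}, {linear_lr(550):.1e}")
```

With `max_steps=1000`, training stops after roughly half an epoch, which is worth keeping in mind when interpreting the final loss.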


## Step 8: Initialize the Trainer

Set up the Trainer object that will handle the fine-tuning process:

In [11]:
# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
    processing_class=tokenizer,  # Replaces the deprecated tokenizer= argument (transformers >= 4.46)
)

print("Trainer initialized successfully!")
print(f"Training dataset size: {len(trainer.train_dataset)}")
print(f"Evaluation dataset size: {len(trainer.eval_dataset)}")

Trainer initialized successfully!
Training dataset size: 15723
Evaluation dataset size: 1747



## Step 9: Start Fine-Tuning

Begin the actual training process:

In [12]:
# Start training
print("Starting fine-tuning...")
print("This may take 30-60 minutes depending on your GPU...")

# Train the model
training_result = trainer.train()

print("Training completed!")
print(f"Final training loss: {training_result.training_loss:.4f}")

Starting fine-tuning...
This may take 30-60 minutes depending on your GPU...


`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Step,Training Loss,Validation Loss


KeyboardInterrupt: 
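Because the run above was interrupted, it may be useful to resume from the most recent checkpoint rather than restart. The Trainer writes checkpoints into `output_dir` as `checkpoint-<step>` folders. The hypothetical helper below picks the highest-numbered one, and the commented lines show how the result might be passed to `trainer.train`. With `save_steps=500`, no checkpoint exists if the interrupt happens before step 500, in which case the helper returns `None`.

```python
import os
import re
import tempfile

def find_latest_checkpoint(output_dir):
    """Return the path of the highest-step checkpoint-<N> folder, or None."""
    if not os.path.isdir(output_dir):
        return None
    pattern = re.compile(r"^checkpoint-(\d+)$")
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        path = os.path.join(output_dir, name)
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path

# Demonstration on a temporary directory with fake checkpoint folders.
with tempfile.TemporaryDirectory() as tmp:
    for step in (500, 1000):
        os.makedirs(os.path.join(tmp, f"checkpoint-{step}"))
    latest_name = os.path.basename(find_latest_checkpoint(tmp))
    print(latest_name)

# In the notebook (assumption: `trainer` from Step 8 is still in scope):
# checkpoint = find_latest_checkpoint("./gpt2-conversation-starters")
# trainer.train(resume_from_checkpoint=checkpoint)  # None starts from scratch
```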

## Step 10: Save and Test Fine-Tuned Model

In [None]:
# Save the fine-tuned model
print("Saving the fine-tuned model...")
trainer.save_model("./gpt2-conversation-starters-final")
tokenizer.save_pretrained("./gpt2-conversation-starters-final")
print("Model saved successfully!")

# Test your fine-tuned model
def generate_conversation_starter(topic, max_new_tokens=30):
    """
    Generate a conversation starter for a given topic
    """
    # Use the exact training format
    prompt = f"<|startoftext|>Topic: {topic}\nConversation Starter:"

    # Tokenize and move the tensors to the model's device
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Sample a short continuation
    model.eval()
    with torch.no_grad():
        outputs = model.generate(
            **inputs,  # input_ids and attention_mask
            max_new_tokens=max_new_tokens,  # Cap on newly generated tokens
            temperature=0.7,
            do_sample=True,
            top_p=0.9,
            repetition_penalty=1.2,
            no_repeat_ngram_size=3,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
            # early_stopping is omitted: it only applies to beam search
        )

    # Decode and extract just the conversation starter
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Extract the conversation starter part
    if "Conversation Starter:" in generated_text:
        starter = generated_text.split("Conversation Starter:")[-1].strip()
        # Clean up - stop at first sentence
        if "?" in starter:
            starter = starter.split("?")[0] + "?"
        elif "." in starter:
            starter = starter.split(".")[0] + "."
        return starter

    return "Could not generate conversation starter"

# Test the improved version
test_topics = ["relationships", "science", "video games", "philosophy"]

print("\n=== Testing Improved Generation ===")
for topic in test_topics:
    starter = generate_conversation_starter(topic)
    print(f"\nTopic: {topic}")
    print(f"Generated: {starter}")
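The `temperature=0.7` and `top_p=0.9` settings shape how each next token is sampled. The sketch below illustrates the idea in pure Python on a toy four-token distribution. It is a simplified illustration under stated assumptions, not the library's implementation, and all three function names are hypothetical.

```python
import math

def apply_temperature(logits, temperature):
    """Divide logits by the temperature; values below 1 sharpen the distribution."""
    return [x / temperature for x in logits]

def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalize over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# Toy distribution over four token ids (illustrative values).
toy_probs = [0.5, 0.3, 0.15, 0.05]
filtered = top_p_filter(toy_probs, top_p=0.9)
print(sorted(filtered))  # token ids that survive the nucleus cut

# Lowering the temperature shifts probability mass toward the top token.
toy_logits = [2.0, 1.0, 0.0]
p_default = softmax(toy_logits)[0]
p_cooler = softmax(apply_temperature(toy_logits, 0.7))[0]
print(p_cooler > p_default)
```

Lower temperature and lower `top_p` both make output more conservative. For conversation starters, somewhat higher values may give more varied prompts at the cost of occasional incoherence.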