# Trump Speech Style Evaluation: Fine-tuned Model vs Few-shot Prompting

This notebook compares a fine-tuned Trump speech model against few-shot prompting of a general-purpose language model (GPT-4 through the OpenAI API). Both approaches are scored with linguistic alignment metrics against a sample of real Trump speeches to determine which better captures his speaking style.

## Evaluation Metrics
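- **Lexical Features**: Per-sentence averages of nouns, verbs, adjectives, unique words, subjectivity, and concrete words
- **Syntactic Features**: Sentence complexity distribution (simple, compound, complex, complex-compound, other)
- **Surface Features**: Commas, semicolons, and colons per sentence; words per sentence; average word length
- **Overall Similarity**: Cosine similarity of the combined feature vector, plus per-category alignment scores (MSE for lexical and surface features, Jensen-Shannon distance for the syntactic distribution)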


In [None]:
!pip install stanza
!pip install unsloth

Collecting stanza
  Downloading stanza-1.10.1-py3-none-any.whl.metadata (13 kB)
Collecting emoji (from stanza)
  Downloading emoji-2.14.1-py3-none-any.whl.metadata (5.7 kB)
Downloading stanza-1.10.1-py3-none-any.whl (1.1 MB)
Downloading emoji-2.14.1-py3-none-any.whl (590 kB)
Installing collected packages: emoji, stanza
Successfully installed emoji-2.14.1 stanza-1.10.1
Collecting unsloth
  Downloading unsloth-2025.9.4-py3-none-any.whl.metadata (52 kB)
Collecting unsloth_zoo>=2025.9.5 (from unsloth)
  Downloading unsloth_zoo-2025.9.5-py3-none-any.whl.metadata (9.5 kB)
Collecting xformers>=0.0.27.post2 (from unsloth)
  D

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Import required libraries
import os
import random
import numpy as np
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.tree import Tree
from textblob import TextBlob
import stanza
from tqdm import tqdm
from scipy.spatial.distance import jensenshannon
import torch
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

# Import your model loader from unsloth
from unsloth import FastLanguageModel

# Download required NLTK resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')

# Download English models for Stanza
stanza.download('en')

print("Libraries imported successfully!")



Please restructure your imports with 'import unsloth' at the top of your file.
  from unsloth import FastLanguageModel


🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.10.0.json:   0%|  …

INFO:stanza:Downloaded file to /root/stanza_resources/resources.json
INFO:stanza:Downloading default packages for language: en (English) ...


Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.10.0/models/default.zip:   0%|          | …

INFO:stanza:Downloaded file to /root/stanza_resources/en/default.zip
INFO:stanza:Finished downloading models and saved to /root/stanza_resources


Libraries imported successfully!


## 1. Model Loading and Configuration


In [None]:
# Configuration
CONFIG = {
    'finetuned_model_dir': '/content/drive/MyDrive/trump_model/trump_model_14B/outputs',
    'baseline_csv': '/content/drive/MyDrive/trump_model/data/preprocessed_trump.csv',
    'num_baseline_samples': 10,
    'max_length': 300,
    'temperature': 1.0,
    'top_p': 0.95,
    'top_k': 50
}

print("Configuration loaded:")
for key, value in CONFIG.items():
    print(f"  {key}: {value}")


Configuration loaded:
  finetuned_model_dir: /content/drive/MyDrive/trump_model/trump_model_14B/outputs
  baseline_csv: /content/drive/MyDrive/trump_model/data/preprocessed_trump.csv
  num_baseline_samples: 10
  max_length: 300
  temperature: 1.0
  top_p: 0.95
  top_k: 50


In [None]:
def load_finetuned_model(model_dir):
    """Load the fine-tuned model and tokenizer."""
    print(f"Loading fine-tuned model from: {model_dir}")

    try:
        # Try the direct approach first
        model, tokenizer = FastLanguageModel.from_pretrained(
            model_name=model_dir,
            max_seq_length=2048,
            dtype=None,
            load_in_4bit=True
        )
        model = FastLanguageModel.for_inference(model)
        return model, tokenizer
    except Exception as e:
        print(f"Direct loading failed: {e}")
        print("Trying alternative approach...")

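        # Alternative approach: load base model + adapter.
        # NOTE: the adapter directory is named "trump_model_14B" while this is a
        # 7B base model; confirm the adapter was trained on this base.
        base_model_name = "unsloth/qwen2.5-7b-instruct-unsloth-bnb-4bit"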
        print(f"Loading base model: {base_model_name}")
        model, tokenizer = FastLanguageModel.from_pretrained(
            model_name=base_model_name,
            max_seq_length=2048,
            dtype=None,
            load_in_4bit=True
        )

        # Load the PEFT adapter
        print(f"Loading PEFT adapter from: {model_dir}")
        from peft import PeftModel
        model = PeftModel.from_pretrained(model, model_dir)

        model = FastLanguageModel.for_inference(model)
        return model, tokenizer

def load_base_model():
    """Load a base model for few-shot prompting."""
    print("Loading base model for few-shot prompting...")
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/qwen2.5-7b-instruct-unsloth-bnb-4bit",
        max_seq_length=2048,
        dtype=None,
        load_in_4bit=True
    )
    model = FastLanguageModel.for_inference(model)
    return model, tokenizer

# Load models
print("Loading fine-tuned model...")
finetuned_model, finetuned_tokenizer = load_finetuned_model(CONFIG['finetuned_model_dir'])

print("\nLoading base model for few-shot prompting...")
base_model, base_tokenizer = load_base_model()

print("\n✅ Models loaded successfully!")


Loading fine-tuned model...
Loading fine-tuned model from: /content/drive/MyDrive/trump_model/trump_model_14B/outputs
Direct loading failed: 'NoneType' object has no attribute 'get'
Trying alternative approach...
Loading base model: unsloth/qwen2.5-7b-instruct-unsloth-bnb-4bit
==((====))==  Unsloth 2025.9.4: Fast Qwen2 patching. Transformers: 4.56.1.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Loading PEFT adapter from: /content/drive/MyDrive/trump_model/trump_model_14B/outputs

Loading base model for few-shot prompting...
Loading base model for few-shot prompting...
==((====))==  Unsloth 2025.9.4: Fast Qwen2 patching. Transformers: 4.56.1.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.557 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 8.0. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]


✅ Models loaded successfully!
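
The log above shows direct loading failing and the fallback attaching the adapter to a 7B base model, while the adapter directory name contains "14B". One hedged way to check which base an adapter expects is to read the `base_model_name_or_path` field that PEFT stores in `adapter_config.json`. Whether this notebook's `outputs` directory holds that file at its top level is an assumption, and the helper names below are hypothetical; the demo runs on a throwaway directory with a made-up config.

```python
import json
import os
import tempfile

def read_adapter_base_model(adapter_dir):
    """Return the base model name recorded in a PEFT adapter directory, or None."""
    config_path = os.path.join(adapter_dir, "adapter_config.json")
    if not os.path.isfile(config_path):
        return None
    with open(config_path, "r", encoding="utf-8") as f:
        config = json.load(f)
    return config.get("base_model_name_or_path")

def adapter_matches_base(adapter_dir, base_model_name):
    """True if the adapter records the same base model name (case-insensitive)."""
    recorded = read_adapter_base_model(adapter_dir)
    return recorded is not None and recorded.lower() == base_model_name.lower()

# Demo on a temporary directory with a made-up config (not the real adapter).
with tempfile.TemporaryDirectory() as demo_dir:
    demo_config = {"base_model_name_or_path": "example-org/example-14b"}
    with open(os.path.join(demo_dir, "adapter_config.json"), "w", encoding="utf-8") as f:
        json.dump(demo_config, f)
    demo_recorded = read_adapter_base_model(demo_dir)
    demo_match = adapter_matches_base(demo_dir, "example-org/example-7b")
```

Running such a check before `PeftModel.from_pretrained` would surface a base-model mismatch early instead of leaving it to show up as odd generations.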


## 2. Few-shot Prompting Setup


In [None]:
# Few-shot prompting template
FEW_SHOT_PROMPT = """Final Prompt for Learning and Mimicking Donald Trump's Speech Style
Below are seven examples that capture key elements of Donald Trump's speaking style. Study these examples carefully to understand the tone, structure, and language. Then, using these stylistic markers, generate an original political rally speech in Trump's style.

Example 1: Extended Greetings and Gratitude with Local Connection
"Thank you, everybody. Thank you so much for being here tonight. Thank you and Vice President Mike Pence—thank you very much! It's absolutely incredible. And hello to Fayetteville, hello to every hardworking patriot in this great part of the country! We're here together, representing faith, family, God, and country, and it fills me with pride. You're the backbone of America, and tonight, we celebrate the spirit of our communities that have always put this nation first."
Key Elements:
* Multiple enthusiastic greetings and expressions of gratitude
* Direct address to the local audience
* Emphasis on American values (faith, family, God, country)

Example 2: Rallying Around Elections and Economic Achievements
"Tomorrow, each and every one of you will head to the polls to elect leaders who always put America first. We've created six million new jobs since election day—six million! More Americans are working today than ever before in our country's history. We've taken this big, beautiful ship and turned it around so quickly that nobody saw it coming. Every job, every opportunity is a testament to your hard work and our unbeatable vision for America. It's tremendous progress, and it's only the beginning!"
Key Elements:
* Clear call-to-action regarding elections
* Use of striking numbers and hyperbolic language ("big, beautiful ship")
* Emphasis on unprecedented economic success and recovery

Example 3: Addressing Challenges with Confidence and Resilience
"I've just come from a meeting with officials from communities hit hard by Hurricane Dorian. I told them, 'We're behind you 100%—100%!' Yes, North Carolina got hit hard, maybe harder than anyone expected, but if there's one thing I know about this state, it's that you bounce back stronger, quicker, and better than ever. I've seen it before, and I know it will happen again. Together, with the strength of the American spirit, we're going to rebuild faster than anyone thought possible, and it will be beautiful!"
Key Elements:
* Repetition of numbers for dramatic emphasis
* Acknowledgment of hardship followed by unwavering optimism
* Inspiring tone that reassures the audience of their resilience

Example 4: Boasting National Strength and International Respect
"We have the number one economy anywhere in the world—truly number one. Every time I meet a foreign leader, they tell me, 'Congratulations on what you've done with the economy.' And I say, 'I didn't do it alone; it was the American people.' Our military is stronger than ever, our borders are secure, and every nation respects America again. We're winning on all fronts, and let me tell you, the world is taking notice. Our success isn't just felt at home—it's echoed across the globe!"
Key Elements:
* Use of superlatives and bold statements ("number one economy")
* Emphasis on international validation and respect
* A confident tone that credits the American people

Example 5: Contrasting the Opposition and Rallying Defenders of American Values
"But let me be clear—the radical left and the fake news media want to dismantle everything we've built. They're out to tear down our successes, our freedoms, and the very spirit that makes America great. They push policies that will destroy jobs, weaken our borders, and compromise our future. We cannot, and we will not, let them win. We have to stand strong and choose leaders who defend our values. We're fighting for our country, and together, we're going to protect every achievement that makes America the shining beacon of freedom that it is!"
Key Elements:
* Clear dichotomy between "us" and "them"
* Vivid negative portrayal of opponents with assertive language
* A strong call-to-action focused on defending American values

Example 6: Reinforcing Catchy Slogans and National Pride
"Make America great again—these are not just words; they're a movement! Our slogan is the greatest in the history of politics, and it reminds us every day of the incredible work we're doing. While other slogans have come and gone, this one endures because it speaks to the heart and soul of every American. We're not going to change it. Instead, we're going to build on it to keep America great. Our country is winning, our communities are thriving, and we will never settle for anything less than excellence!"
Key Elements:
* Repetition of the slogan for maximum impact
* Reinforcement of national pride and continuity of success
* Contrast with less effective alternatives using humor and conviction

Example 7: Personal Criticism of Opponents and International Comparison
"Let's be honest—Sleepy Joe is simply not up to the task. He's been all over the place, changing his mind at every turn, and making excuses when real solutions are needed. Meanwhile, I've built relationships with the world's most powerful leaders—President Xi of China, President Putin of Russia—and they know what strong leadership looks like. Unlike those who waver and doubt, I stand firm, putting America first every single day. We need a leader who delivers results, who is respected on the global stage, and who will never let our country down!"
Key Elements:
* Direct, personal criticism using memorable nicknames
* Contrast between the speaker's decisiveness and the opponent's inconsistency
* Reference to international relationships to emphasize leadership strength

Instruction for the LLM:
Using the guidelines and style elements provided above through the examples, write a 500–700 word political rally speech in the unmistakable style of Donald Trump. Your speech should focus on a contemporary topic of national importance (such as economic recovery, national security, or election themes) and must include the following:
* Extended Greetings and Gratitude: Begin with multiple, enthusiastic greetings and thank-yous, addressing local communities by name.
* Patriotism and National Pride: Emphasize core values like faith, family, God, and country, and use superlatives to highlight American achievements.
* Bold Claims and Hyperbolic Statements: Incorporate striking numbers, repetition (like "100%" or "six million new jobs"), and memorable slogans such as "Make America great again" or "Keep America great."
* Contrast with Opponents: Clearly differentiate between "us" and "them," using direct, personal criticism of political opponents (with nicknames if appropriate) and contrasting language.
* Direct Call-to-Action: Reference upcoming elections or important decisions, urging immediate action from the audience.
* International and Economic Bravado: Mention international respect, interactions with foreign leaders, and boast about economic successes.
* Confident, Assertive Tone with Repetition: Maintain a punchy, confident tone throughout, ensuring key points are repeated for impact.

Now, write a speech on the following topic:"""

print("Few-shot prompt template loaded successfully!")


Few-shot prompt template loaded successfully!


## 3. OpenAI API Integration for Few-shot Prompting


In [None]:
# OpenAI API setup
import openai
from openai import OpenAI
import json
import time

# Read the OpenAI API key from the environment instead of hard-coding it:
#   export OPENAI_API_KEY="your-key-here"
# The client raises an error at construction time if the variable is not set.
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

def generate_with_openai(prompt, model="gpt-4", max_tokens=1000, temperature=0.8):
    """Generate text using OpenAI API with few-shot prompting."""
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are an expert at mimicking Donald Trump's speaking style. Generate political speeches that capture his unique tone, vocabulary, and rhetorical patterns."},
                {"role": "user", "content": prompt}
            ],
            max_tokens=max_tokens,
            temperature=temperature,
            top_p=0.9
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        print(f"Error generating with OpenAI: {e}")
        return None

print("OpenAI API integration ready!")


OpenAI API integration ready!


## 4. Linguistic Evaluation Functions


In [None]:
# Text generation functions
def generate_finetuned_text(prompt, model, tokenizer, max_length=300):
    """Generate text using the fine-tuned model.

    Note: max_length counts the prompt tokens as well, and the returned
    string contains the prompt followed by the generated continuation.
    """
    # Move the inputs to the model's device instead of calling model.cuda():
    # the quantized model is already placed on its device at load time.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    output_ids = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_length=max_length,
        do_sample=True,
        top_p=CONFIG['top_p'],
        top_k=CONFIG['top_k'],
        temperature=CONFIG['temperature'],
        pad_token_id=tokenizer.eos_token_id,
    )
    generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return generated_text

def generate_fewshot_text(topic, model="gpt-4"):
    """Generate text using few-shot prompting with OpenAI."""
    full_prompt = FEW_SHOT_PROMPT + f"\n\nTopic: {topic}"
    return generate_with_openai(full_prompt, model=model, max_tokens=1000)

print("Text generation functions ready!")


Text generation functions ready!
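
For decoder-only models, Hugging Face `generate` returns the prompt tokens followed by the new tokens, and `max_length` bounds the combined length. `generate_finetuned_text` therefore returns the prompt plus a continuation, whereas the few-shot path returns only newly written text. If the two outputs should be compared on generated text alone, the prompt can be sliced off before decoding. The sketch below illustrates the idea on plain lists of toy token ids; both helper names are hypothetical.

```python
def strip_prompt_tokens(output_ids, prompt_length):
    """Return only the tokens produced after the first prompt_length positions."""
    return list(output_ids[prompt_length:])

def new_token_budget(max_length, prompt_length):
    """Number of new tokens allowed when max_length includes the prompt."""
    return max(0, max_length - prompt_length)

# Toy ids: the first four stand in for the prompt, the rest for the continuation.
prompt_ids = [101, 7, 42, 9]
full_output = prompt_ids + [55, 66, 77]
continuation = strip_prompt_tokens(full_output, len(prompt_ids))
budget = new_token_budget(300, 60)
```

With real tensors the same slice would be `output_ids[0][input_ids.shape[1]:]`, passed to `tokenizer.decode` as before.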


In [None]:
# Linguistic evaluation functions (from eval.py)
def cal_lexical(sentences):
    """Compute lexical features for a list of sentences."""
    total_nouns = total_verbs = total_adjs = total_subjectivity = total_unique_words = total_concreteness = 0

    # Load concreteness data
    concreteness_df = pd.read_csv("/content/drive/MyDrive/trump_model/data/concreteness.csv")
    concreteness_dict = pd.Series(concreteness_df.Score.values, index=concreteness_df.Word).to_dict()

    for s in tqdm(sentences, desc='Calculating lexical features', total=len(sentences)):
        words = word_tokenize(s)
        pos_tags = nltk.pos_tag(words)
        unique_words = set(words)

        noun_count = len([word for word, tag in pos_tags if tag.startswith('NN')])
        verb_count = len([word for word, tag in pos_tags if tag.startswith('VB')])
        adjective_count = len([word for word, tag in pos_tags if tag.startswith('JJ')])

        total_nouns += noun_count
        total_verbs += verb_count
        total_adjs += adjective_count
        total_unique_words += len(unique_words)

        blob = TextBlob(s)
        total_subjectivity += blob.sentiment.subjectivity

        total_concreteness += len([word for word in words if concreteness_dict.get(word, 0) > 3])

    num_sentences = len(sentences)
    if num_sentences == 0:
        # Avoid a ZeroDivisionError on empty input (e.g. a blank generation)
        return [0.0] * 6

    avg_nouns = total_nouns / num_sentences
    avg_verbs = total_verbs / num_sentences
    avg_adjs = total_adjs / num_sentences
    avg_unique_words = total_unique_words / num_sentences
    avg_subjectivity = total_subjectivity / num_sentences
    avg_concreteness = total_concreteness / num_sentences

    return [avg_nouns, avg_verbs, avg_adjs, avg_unique_words, avg_subjectivity, avg_concreteness]

def get_all_nodes(tree):
    """Recursively collects all node labels from an NLTK Tree."""
    nodes = []
    for node in tree:
        if isinstance(node, Tree):
            nodes.append(node.label())
            nodes.extend(get_all_nodes(node))
    return nodes

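_STANZA_NLP = None

def get_stanza_pipeline():
    """Build the Stanza constituency pipeline once and reuse it on later calls."""
    global _STANZA_NLP
    if _STANZA_NLP is None:
        _STANZA_NLP = stanza.Pipeline(lang='en', processors='tokenize,pos,constituency', verbose=False)
    return _STANZA_NLP

def cal_syntactic(sentences):
    """Compute syntactic features for a list of sentences."""
    cat_dict = {
        "SIMPLE": 0,
        "COMPOUND": 0,
        "COMPLEX": 0,
        "COMPLEX-COMPOUND": 0,
        "OTHER": 0
    }
    # Reuse a single pipeline; constructing one for every text is slow
    nlp = get_stanza_pipeline()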

    for s in tqdm(sentences, desc='Calculating syntactic features', total=len(sentences)):
        doc = nlp(s)
        try:
            parse_tree = Tree.fromstring(str(doc.sentences[0].constituency))
        except Exception as e:
            cat_dict['OTHER'] += 1
            continue

        sub_tree = parse_tree[0]
        l_top = [child.label() for child in sub_tree if isinstance(child, Tree)]
        all_nodes = get_all_nodes(parse_tree)

        if 'S' in l_top:
            if "SBAR" not in all_nodes:
                cat_dict['COMPOUND'] += 1
            else:
                cat_dict['COMPLEX-COMPOUND'] += 1
        elif 'VP' in l_top:
            if "SBAR" not in all_nodes:
                cat_dict['SIMPLE'] += 1
            else:
                cat_dict['COMPLEX'] += 1
        else:
            cat_dict['OTHER'] += 1

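    total = sum(cat_dict.values())
    if total == 0:
        # No sentences were given; return zeros rather than dividing by zero
        return [0.0] * len(cat_dict)
    return [cat_dict[key] / total for key in cat_dict]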

def cal_surface(sentences):
    """Compute surface features for a list of sentences."""
    commas = semicolons = colons = total_words = total_word_length = 0

    for s in tqdm(sentences, desc='Calculating surface features', total=len(sentences)):
        commas += s.count(',')
        semicolons += s.count(';')
        colons += s.count(':')

        words = s.split()
        total_words += len(words)
        total_word_length += sum(len(word) for word in words)

    num_sentences = len(sentences)
    avg_commas = commas / num_sentences if num_sentences else 0
    avg_semicolons = semicolons / num_sentences if num_sentences else 0
    avg_colons = colons / num_sentences if num_sentences else 0
    avg_words_per_sentence = total_words / num_sentences if num_sentences else 0
    avg_word_length = total_word_length / total_words if total_words else 0

    return [avg_commas, avg_semicolons, avg_colons, avg_words_per_sentence, avg_word_length]

print("Linguistic evaluation functions loaded!")


Linguistic evaluation functions loaded!
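
The decision rule inside `cal_syntactic` can be easier to follow on hand-built trees. The sketch below reproduces that rule without Stanza or NLTK, representing each node as a `(label, children)` tuple with the words omitted. The tree shapes are simplified illustrations written by hand, not real parser output, and the helper names are hypothetical.

```python
def collect_labels(node):
    """Collect the labels of all descendants of a (label, children) node."""
    labels = []
    for child in node[1]:
        labels.append(child[0])
        labels.extend(collect_labels(child))
    return labels

def classify_sentence(root):
    """Apply the same decision rule as cal_syntactic to a hand-built tree."""
    clause = root[1][0]                          # first child of ROOT
    top_labels = [child[0] for child in clause[1]]
    has_sbar = "SBAR" in collect_labels(root)    # any subordinate clause?
    if "S" in top_labels:                        # coordinated clauses
        return "COMPLEX-COMPOUND" if has_sbar else "COMPOUND"
    if "VP" in top_labels:                       # a single main clause
        return "COMPLEX" if has_sbar else "SIMPLE"
    return "OTHER"

clause = ("S", [("NP", []), ("VP", [])])

simple_tree = ("ROOT", [clause])
compound_tree = ("ROOT", [("S", [clause, ("CC", []), clause])])
complex_tree = ("ROOT", [("S", [("NP", []),
                                ("VP", [("VBD", []), ("SBAR", [clause])])])])
fragment_tree = ("ROOT", [("NP", [("NN", [])])])

labels = [classify_sentence(t) for t in
          (simple_tree, compound_tree, complex_tree, fragment_tree)]
```

Fragments and other parses whose top-level clause has neither an `S` nor a `VP` child fall into `OTHER`, which is why that category appears in the baseline distribution.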


In [None]:
# Additional evaluation functions
def mean_squared_error(l1, l2):
    """Computes mean squared error between two lists."""
    vec1 = np.array(l1)
    vec2 = np.array(l2)
    return np.mean((vec1 - vec2) ** 2)

def jensen_shannon_divergence(l1, l2):
    """Computes the Jensen-Shannon distance (the square root of the divergence), as returned by SciPy."""
    vec1 = np.array(l1)
    vec2 = np.array(l2)
    return jensenshannon(vec1, vec2)

def cosine_similarity_of_features(features1, features2):
    """Compute cosine similarity between combined feature vectors."""
    combined1 = np.array(features1[0] + features1[1] + features1[2])
    combined2 = np.array(features2[0] + features2[1] + features2[2])
    norm1 = np.linalg.norm(combined1)
    norm2 = np.linalg.norm(combined2)
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return np.dot(combined1, combined2) / (norm1 * norm2)

def cal_linguistic_alignment(features1, features2):
    """Compute linguistic alignment scores between two feature sets."""
    lexical1, syntactic1, surface1 = features1
    lexical2, syntactic2, surface2 = features2

    lexical_mse = mean_squared_error(lexical1, lexical2)
    surface_mse = mean_squared_error(surface1, surface2)
    syntactic_jsd = jensen_shannon_divergence(syntactic1, syntactic2)

    return [lexical_mse, syntactic_jsd, surface_mse]

def compute_features(text):
    """Compute all linguistic features for a given text."""
    sentences = sent_tokenize(text)
    lexical = cal_lexical(sentences)
    syntactic = cal_syntactic(sentences)
    surface = cal_surface(sentences)
    return (lexical, syntactic, surface)

def average_features(features_list):
    """Average a list of feature tuples."""
    avg_lexical = np.mean([feat[0] for feat in features_list], axis=0).tolist()
    avg_syntactic = np.mean([feat[1] for feat in features_list], axis=0).tolist()
    avg_surface = np.mean([feat[2] for feat in features_list], axis=0).tolist()
    return (avg_lexical, avg_syntactic, avg_surface)

def load_baseline_texts(csv_path, num_samples=5):
    """Load baseline Trump speeches from CSV."""
    df = pd.read_csv(csv_path)
    if "text" in df.columns:
        texts = df["text"].dropna().tolist()
    elif "output" in df.columns:
        texts = df["output"].dropna().tolist()
    elif "speech" in df.columns:
        texts = df["speech"].dropna().tolist()
    else:
        texts = df.iloc[:, 0].dropna().tolist()

    if len(texts) < num_samples:
        num_samples = len(texts)
    return random.sample(texts, num_samples)

print("Additional evaluation functions loaded!")


Additional evaluation functions loaded!
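
SciPy's `jensenshannon` returns the Jensen-Shannon distance, which is the square root of the divergence and uses the natural logarithm by default. The dependency-free sketch below makes that relationship concrete on two toy five-bin distributions. It assumes the inputs already sum to 1, and the function names are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats, skipping the zero-probability terms of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def js_distance(p, q):
    """Jensen-Shannon distance: the square root of the divergence."""
    return math.sqrt(js_divergence(p, q))

# Two toy sentence-type distributions with no overlap at all.
p = [0.5, 0.5, 0.0, 0.0, 0.0]
q = [0.0, 0.0, 0.5, 0.5, 0.0]

divergence_disjoint = js_divergence(p, q)   # upper bound: ln(2)
distance_disjoint = js_distance(p, q)       # upper bound: sqrt(ln(2))
distance_identical = js_distance(p, p)      # identical inputs give 0
```

With the natural log, the syntactic score therefore ranges from 0 (identical distributions) to about 0.83 (no overlap), which helps when reading it next to the unbounded MSE scores.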


## 5. Test Prompts and Evaluation Setup


In [None]:
# Test prompts for evaluation
TEST_PROMPTS = [
    "Thank you very much, hello Nevada! Hello to all the hardworking American patriots here in Douglas County. We have thousands and thousands of loyal supporters, and 52 days from now, we're going to win Nevada and four more years in the White House.",

    "Hello Charleston, South Carolina! I'm thrilled to be back in the great state of South Carolina with thousands of hardworking American Patriots who believe in faith, family, God, and country. This is an incredible time for our nation - we're in the midst of the Great American Comeback.",

    "The fake news media, they've been trying to figure this out for years. They still don't get it though. Look at all those cameras back there. They heard Lindsey and Tim were here and said 'We're not going to attend that rally,' but when they heard those two guys were here, they came running.",

    "We have the greatest economy in the history of our country - not just our country, but the world. We were beating everybody, including China. Remember when they said China would overtake us in 2019? That didn't work out too well for them. We were doing leaps and bounds until we got hit with the China virus."
]

# Load baseline texts
print("Loading baseline Trump speeches...")
baseline_texts = load_baseline_texts(CONFIG['baseline_csv'], CONFIG['num_baseline_samples'])
print(f"Loaded {len(baseline_texts)} baseline samples")

# Compute baseline features
print("Computing baseline features...")
baseline_features_list = [compute_features(text) for text in baseline_texts]
baseline_features = average_features(baseline_features_list)

print("Baseline features computed successfully!")
print(f"Baseline lexical features: {baseline_features[0]}")
print(f"Baseline syntactic features: {baseline_features[1]}")
print(f"Baseline surface features: {baseline_features[2]}")


Loading baseline Trump speeches...
Loaded 10 baseline samples
Computing baseline features...


Calculating lexical features: 100%|██████████| 17/17 [00:00<00:00, 100.81it/s]
Calculating syntactic features: 100%|██████████| 17/17 [00:02<00:00,  6.51it/s]
Calculating surface features: 100%|██████████| 17/17 [00:00<00:00, 113540.08it/s]
Calculating lexical features: 100%|██████████| 22/22 [00:00<00:00, 1201.48it/s]
Calculating syntactic features: 100%|██████████| 22/22 [00:01<00:00, 11.00it/s]
Calculating surface features: 100%|██████████| 22/22 [00:00<00:00, 142840.07it/s]
Calculating lexical features: 100%|██████████| 26/26 [00:00<00:00, 1566.68it/s]
Calculating syntactic features: 100%|██████████| 26/26 [00:02<00:00, 12.10it/s]
Calculating surface features: 100%|██████████| 26/26 [00:00<00:00, 132990.13it/s]
Calculating lexical features: 100%|██████████| 22/22 [00:00<00:00, 1292.43it/s]
Calculating syntactic features: 100%|██████████| 22/22 [00:01<00:00, 11.01it/s]
Calculating surface features: 100%|██████████| 22/22 [00:00<00:00, 122056.47it/s]
Calculating lexical features: 100

Baseline features computed successfully!
Baseline lexical features: [2.146145190922067, 2.2778135150853203, 0.6969261112263141, 11.063962448039527, 0.2802478926458676, 2.0856294220391582]
Baseline syntactic features: [0.4918316583123886, 0.09796214603516833, 0.16720270939337875, 0.11181282094060999, 0.13119066531845436]
Baseline surface features: [0.9022168500870327, 0.0, 0.0, 9.931276652879088, 4.514]
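
`load_baseline_texts` draws its sample with `random.sample` from the global random generator, and no seed is set in the cells shown, so the baseline features printed above can change from run to run. A minimal sketch of a reproducible draw follows. The helper name, the seed value, and the toy speech list are all hypothetical.

```python
import random

def sample_reproducibly(items, num_samples, seed=0):
    """Sample without replacement using a private, seeded generator."""
    rng = random.Random(seed)           # does not touch the global RNG state
    k = min(num_samples, len(items))    # same clamping as load_baseline_texts
    return rng.sample(items, k)

toy_speeches = [f"speech {i}" for i in range(20)]
first = sample_reproducibly(toy_speeches, 10, seed=42)
second = sample_reproducibly(toy_speeches, 10, seed=42)
```

Using a private `random.Random` instance keeps the baseline draw stable without affecting any other code that relies on the global generator.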





## 6. Comprehensive Evaluation Function


In [None]:

def evaluate_model_comparison(prompts, save_results=True):
    """
    Comprehensive evaluation comparing fine-tuned model vs few-shot prompting.
    """
    results = {
        'prompts': [],
        'finetuned_texts': [],
        'fewshot_texts': [],
        'finetuned_features': [],
        'fewshot_features': [],
        'finetuned_scores': [],
        'fewshot_scores': [],
        'finetuned_cosine_sim': [],
        'fewshot_cosine_sim': []
    }

    print("Starting comprehensive evaluation...")
    print(f"Evaluating {len(prompts)} prompts")
    print("=" * 60)

    for i, prompt in enumerate(prompts):
        print(f"\n--- Evaluating Prompt {i+1}/{len(prompts)}: {prompt} ---")

        # Generate text with fine-tuned model
        try:
            finetuned_text = generate_finetuned_text(prompt, finetuned_model, finetuned_tokenizer, CONFIG['max_length'])
        except Exception as e:
            print(f"Error generating with fine-tuned model: {e}")
            finetuned_text = ""

        # Generate text with few-shot prompting
        try:
            fewshot_text = generate_fewshot_text(prompt)
        except Exception as e:
            print(f"Error generating with few-shot prompting: {e}")
            fewshot_text = ""

        # Compute features + metrics
        if finetuned_text:
            finetuned_features = compute_features(finetuned_text)
            finetuned_scores = cal_linguistic_alignment(finetuned_features, baseline_features)
            finetuned_cosine = cosine_similarity_of_features(finetuned_features, baseline_features)
        else:
            finetuned_features, finetuned_scores, finetuned_cosine = None, [float('inf')]*3, 0.0

        if fewshot_text:
            fewshot_features = compute_features(fewshot_text)
            fewshot_scores = cal_linguistic_alignment(fewshot_features, baseline_features)
            fewshot_cosine = cosine_similarity_of_features(fewshot_features, baseline_features)
        else:
            fewshot_features, fewshot_scores, fewshot_cosine = None, [float('inf')]*3, 0.0

        # Store results
        results['prompts'].append(prompt)
        results['finetuned_texts'].append(finetuned_text)
        results['fewshot_texts'].append(fewshot_text)
        results['finetuned_features'].append(finetuned_features)
        results['fewshot_features'].append(fewshot_features)
        results['finetuned_scores'].append(finetuned_scores)
        results['fewshot_scores'].append(fewshot_scores)
        results['finetuned_cosine_sim'].append(finetuned_cosine)
        results['fewshot_cosine_sim'].append(fewshot_cosine)

        # Add delay to avoid rate limiting
        time.sleep(1)

    # Save results if requested
    if save_results:
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

        # --- Save JSON (safe conversion) ---
        serializable_results = {}
        for key, value in results.items():
            if key in ['finetuned_features', 'fewshot_features']:
                # force everything into plain lists
                serializable_results[key] = [
                    [list(feat[0]), list(feat[1]), list(feat[2])] if feat else None
                    for feat in value
                ]
            else:
                serializable_results[key] = value

        results_file = f"/content/drive/MyDrive/trump_model/evaluation_results_{timestamp}.json"
        with open(results_file, 'w') as f:
            json.dump(serializable_results, f, indent=2)

        print(f"\n✅ JSON results saved to: {results_file}")

        # --- Save CSV with examples + metrics ---
        rows = []
        for i in range(len(prompts)):
            rows.append({
                "prompt": results['prompts'][i],
                "finetuned_text": results['finetuned_texts'][i],
                "fewshot_text": results['fewshot_texts'][i],
                "finetuned_lexical_mse": results['finetuned_scores'][i][0],
                "finetuned_syntactic_jsd": results['finetuned_scores'][i][1],
                "finetuned_surface_mse": results['finetuned_scores'][i][2],
                "finetuned_cosine": results['finetuned_cosine_sim'][i],
                "fewshot_lexical_mse": results['fewshot_scores'][i][0],
                "fewshot_syntactic_jsd": results['fewshot_scores'][i][1],
                "fewshot_surface_mse": results['fewshot_scores'][i][2],
                "fewshot_cosine": results['fewshot_cosine_sim'][i],
            })

        df = pd.DataFrame(rows)
        csv_file = f"/content/drive/MyDrive/trump_model/evaluation_results_{timestamp}.csv"
        df.to_csv(csv_file, index=False)

        print(f"✅ CSV results saved to: {csv_file}")

    return results

print("Comprehensive evaluation function ready!")


Comprehensive evaluation function ready!
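
When a generation fails, the evaluation loop stores `float('inf')` for each alignment score. Python's `json.dump` writes such values as `Infinity`, which is not valid under the JSON standard and may be rejected by stricter parsers. The following is a minimal sketch of one possible workaround. `to_json_safe` is a hypothetical helper, not part of this notebook, and the sample scores are invented for illustration.

```python
import json
import math

def to_json_safe(value):
    """Recursively replace non-finite floats (inf, -inf, nan) with None."""
    if isinstance(value, float) and not math.isfinite(value):
        return None
    if isinstance(value, dict):
        return {key: to_json_safe(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_json_safe(item) for item in value]
    return value

# Hypothetical scores: the second entry mimics a failed generation
scores = {"finetuned_scores": [[0.12, 0.03, 0.40], [float("inf")] * 3]}
safe = to_json_safe(scores)

# allow_nan=False makes json.dumps raise if a non-finite value slipped through
encoded = json.dumps(safe, allow_nan=False)
```

Mapping failures to `null` keeps the file portable, at the cost of having to treat `None` explicitly when the results are reloaded.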


## 7. Visualization and Analysis Functions


In [None]:
def create_comparison_plots(results):
    """Create comprehensive comparison plots."""
    # Set up the plotting style
    plt.style.use('seaborn-v0_8')
    fig, axes = plt.subplots(2, 3, figsize=(18, 12))
    fig.suptitle('Fine-tuned Model vs Few-shot Prompting: Linguistic Alignment Comparison', fontsize=16, fontweight='bold')

    # Extract data for plotting
    prompts = results['prompts']
    finetuned_lexical = [scores[0] for scores in results['finetuned_scores']]
    finetuned_syntactic = [scores[1] for scores in results['finetuned_scores']]
    finetuned_surface = [scores[2] for scores in results['finetuned_scores']]
    finetuned_cosine = results['finetuned_cosine_sim']

    fewshot_lexical = [scores[0] for scores in results['fewshot_scores']]
    fewshot_syntactic = [scores[1] for scores in results['fewshot_scores']]
    fewshot_surface = [scores[2] for scores in results['fewshot_scores']]
    fewshot_cosine = results['fewshot_cosine_sim']

    # Plot 1: Lexical MSE Comparison
    x = np.arange(len(prompts))
    width = 0.35

    axes[0, 0].bar(x - width/2, finetuned_lexical, width, label='Fine-tuned', alpha=0.8, color='skyblue')
    axes[0, 0].bar(x + width/2, fewshot_lexical, width, label='Few-shot', alpha=0.8, color='lightcoral')
    axes[0, 0].set_xlabel('Test Prompts')
    axes[0, 0].set_ylabel('Lexical MSE (Lower is Better)')
    axes[0, 0].set_title('Lexical Feature Alignment')
    axes[0, 0].set_xticks(x)
    axes[0, 0].set_xticklabels([f'P{i+1}' for i in range(len(prompts))], rotation=45)
    axes[0, 0].legend()
    axes[0, 0].grid(True, alpha=0.3)

    # Plot 2: Syntactic JSD Comparison
    axes[0, 1].bar(x - width/2, finetuned_syntactic, width, label='Fine-tuned', alpha=0.8, color='skyblue')
    axes[0, 1].bar(x + width/2, fewshot_syntactic, width, label='Few-shot', alpha=0.8, color='lightcoral')
    axes[0, 1].set_xlabel('Test Prompts')
    axes[0, 1].set_ylabel('Syntactic JSD (Lower is Better)')
    axes[0, 1].set_title('Syntactic Feature Alignment')
    axes[0, 1].set_xticks(x)
    axes[0, 1].set_xticklabels([f'P{i+1}' for i in range(len(prompts))], rotation=45)
    axes[0, 1].legend()
    axes[0, 1].grid(True, alpha=0.3)

    # Plot 3: Surface MSE Comparison
    axes[0, 2].bar(x - width/2, finetuned_surface, width, label='Fine-tuned', alpha=0.8, color='skyblue')
    axes[0, 2].bar(x + width/2, fewshot_surface, width, label='Few-shot', alpha=0.8, color='lightcoral')
    axes[0, 2].set_xlabel('Test Prompts')
    axes[0, 2].set_ylabel('Surface MSE (Lower is Better)')
    axes[0, 2].set_title('Surface Feature Alignment')
    axes[0, 2].set_xticks(x)
    axes[0, 2].set_xticklabels([f'P{i+1}' for i in range(len(prompts))], rotation=45)
    axes[0, 2].legend()
    axes[0, 2].grid(True, alpha=0.3)

    # Plot 4: Cosine Similarity Comparison
    axes[1, 0].bar(x - width/2, finetuned_cosine, width, label='Fine-tuned', alpha=0.8, color='skyblue')
    axes[1, 0].bar(x + width/2, fewshot_cosine, width, label='Few-shot', alpha=0.8, color='lightcoral')
    axes[1, 0].set_xlabel('Test Prompts')
    axes[1, 0].set_ylabel('Cosine Similarity (Higher is Better)')
    axes[1, 0].set_title('Overall Feature Similarity')
    axes[1, 0].set_xticks(x)
    axes[1, 0].set_xticklabels([f'P{i+1}' for i in range(len(prompts))], rotation=45)
    axes[1, 0].legend()
    axes[1, 0].grid(True, alpha=0.3)

    # Plot 5: Average Performance Comparison
    avg_metrics = {
        'Fine-tuned': [
            np.mean(finetuned_lexical),
            np.mean(finetuned_syntactic),
            np.mean(finetuned_surface),
            np.mean(finetuned_cosine)
        ],
        'Few-shot': [
            np.mean(fewshot_lexical),
            np.mean(fewshot_syntactic),
            np.mean(fewshot_surface),
            np.mean(fewshot_cosine)
        ]
    }

    metric_names = ['Lexical MSE', 'Syntactic JSD', 'Surface MSE', 'Cosine Sim']
    x_metrics = np.arange(len(metric_names))

    axes[1, 1].bar(x_metrics - width/2, avg_metrics['Fine-tuned'], width, label='Fine-tuned', alpha=0.8, color='skyblue')
    axes[1, 1].bar(x_metrics + width/2, avg_metrics['Few-shot'], width, label='Few-shot', alpha=0.8, color='lightcoral')
    axes[1, 1].set_xlabel('Metrics')
    axes[1, 1].set_ylabel('Average Score')
    axes[1, 1].set_title('Average Performance Across All Prompts')
    axes[1, 1].set_xticks(x_metrics)
    axes[1, 1].set_xticklabels(metric_names, rotation=45)
    axes[1, 1].legend()
    axes[1, 1].grid(True, alpha=0.3)

    # Plot 6: Win Rate Comparison
    finetuned_wins = 0
    fewshot_wins = 0
    ties = 0

    for i in range(len(prompts)):
        # Lower MSE/JSD is better, higher cosine similarity is better
        lexical_win = finetuned_lexical[i] < fewshot_lexical[i]
        syntactic_win = finetuned_syntactic[i] < fewshot_syntactic[i]
        surface_win = finetuned_surface[i] < fewshot_surface[i]
        cosine_win = finetuned_cosine[i] > fewshot_cosine[i]

        # A prompt is won only with a strict majority (3+ of 4 metrics);
        # a 2-2 split is recorded as a tie instead of a fine-tuned win
        total_wins = sum([lexical_win, syntactic_win, surface_win, cosine_win])
        if total_wins > 2:
            finetuned_wins += 1
        elif total_wins < 2:
            fewshot_wins += 1
        else:
            ties += 1

    win_data = [finetuned_wins, fewshot_wins, ties]
    win_labels = ['Fine-tuned', 'Few-shot', 'Tie']
    colors = ['skyblue', 'lightcoral', 'lightgray']

    axes[1, 2].bar(win_labels, win_data, color=colors, alpha=0.8)
    axes[1, 2].set_ylabel('Number of Prompts Won')
    axes[1, 2].set_title('Head-to-Head Comparison\n(Wins per Prompt)')
    axes[1, 2].grid(True, alpha=0.3)

    # Add value labels on bars
    for i, v in enumerate(win_data):
        axes[1, 2].text(i, v + 0.1, str(v), ha='center', va='bottom', fontweight='bold')

    plt.tight_layout()
    plt.show()

    return fig

def print_detailed_analysis(results):
    """Print detailed analysis of results."""
    print("\n" + "="*80)
    print("DETAILED EVALUATION ANALYSIS")
    print("="*80)

    # Calculate averages
    finetuned_lexical_avg = np.mean([scores[0] for scores in results['finetuned_scores']])
    finetuned_syntactic_avg = np.mean([scores[1] for scores in results['finetuned_scores']])
    finetuned_surface_avg = np.mean([scores[2] for scores in results['finetuned_scores']])
    finetuned_cosine_avg = np.mean(results['finetuned_cosine_sim'])

    fewshot_lexical_avg = np.mean([scores[0] for scores in results['fewshot_scores']])
    fewshot_syntactic_avg = np.mean([scores[1] for scores in results['fewshot_scores']])
    fewshot_surface_avg = np.mean([scores[2] for scores in results['fewshot_scores']])
    fewshot_cosine_avg = np.mean(results['fewshot_cosine_sim'])

    print(f"\nAVERAGE PERFORMANCE ACROSS ALL PROMPTS:")
    print(f"{'Metric':<20} {'Fine-tuned':<15} {'Few-shot':<15} {'Winner':<15}")
    print("-" * 65)

    # Lexical MSE (lower is better)
    lexical_winner = "Fine-tuned" if finetuned_lexical_avg < fewshot_lexical_avg else "Few-shot"
    print(f"{'Lexical MSE':<20} {finetuned_lexical_avg:<15.4f} {fewshot_lexical_avg:<15.4f} {lexical_winner:<15}")

    # Syntactic JSD (lower is better)
    syntactic_winner = "Fine-tuned" if finetuned_syntactic_avg < fewshot_syntactic_avg else "Few-shot"
    print(f"{'Syntactic JSD':<20} {finetuned_syntactic_avg:<15.4f} {fewshot_syntactic_avg:<15.4f} {syntactic_winner:<15}")

    # Surface MSE (lower is better)
    surface_winner = "Fine-tuned" if finetuned_surface_avg < fewshot_surface_avg else "Few-shot"
    print(f"{'Surface MSE':<20} {finetuned_surface_avg:<15.4f} {fewshot_surface_avg:<15.4f} {surface_winner:<15}")

    # Cosine Similarity (higher is better)
    cosine_winner = "Fine-tuned" if finetuned_cosine_avg > fewshot_cosine_avg else "Few-shot"
    print(f"{'Cosine Similarity':<20} {finetuned_cosine_avg:<15.4f} {fewshot_cosine_avg:<15.4f} {cosine_winner:<15}")

    # Overall winner
    finetuned_wins = sum([
        finetuned_lexical_avg < fewshot_lexical_avg,
        finetuned_syntactic_avg < fewshot_syntactic_avg,
        finetuned_surface_avg < fewshot_surface_avg,
        finetuned_cosine_avg > fewshot_cosine_avg
    ])

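    # Require a strict majority of the four metrics; a 2-2 split is a tie
    if finetuned_wins > 2:
        overall_winner = "Fine-tuned Model"
    elif finetuned_wins < 2:
        overall_winner = "Few-shot Prompting"
    else:
        overall_winner = "Tie"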
    print(f"\nOVERALL WINNER: {overall_winner} ({finetuned_wins}/4 metrics)")

    # Text length analysis
    finetuned_lengths = [len(text) for text in results['finetuned_texts'] if text]
    fewshot_lengths = [len(text) for text in results['fewshot_texts'] if text]

    print(f"\nTEXT LENGTH ANALYSIS:")
    print(f"Fine-tuned model - Average length: {np.mean(finetuned_lengths):.0f} characters")
    print(f"Few-shot prompting - Average length: {np.mean(fewshot_lengths):.0f} characters")

    print("\n" + "="*80)

print("Visualization and analysis functions ready!")


Visualization and analysis functions ready!
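
`np.mean` returns `inf` as soon as one prompt carries the `float('inf')` placeholder used for failed generations, which makes the averaged bars and the printed table uninformative. The sketch below shows one way to average only the prompts that produced a score. `mean_of_finite` is a hypothetical helper, not a function defined in this notebook, and the sample values are made up.

```python
import math

def mean_of_finite(values):
    """Average the finite entries and report how many were used."""
    finite = [v for v in values if math.isfinite(v)]
    if not finite:
        # Nothing usable: signal it with NaN and a zero count
        return float("nan"), 0
    return sum(finite) / len(finite), len(finite)

lexical_mse = [0.25, float("inf"), 0.75]  # illustrative values only
avg, used = mean_of_finite(lexical_mse)
print(f"mean over {used}/{len(lexical_mse)} prompts: {avg:.4f}")
```

Returning the count alongside the mean keeps failed generations visible instead of silently dropping them.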


## 8. Run the Complete Evaluation


In [None]:
# Run the complete evaluation
print("🚀 Starting Comprehensive Trump Speech Style Evaluation")
print("=" * 60)

# Run evaluation on all test prompts
results = evaluate_model_comparison(TEST_PROMPTS, save_results=True)

print("\n✅ Evaluation completed successfully!")
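# Note: this timestamp is generated when this line runs, so it can differ
# from the one embedded in the saved file names if the clock ticks in between
print(f"Results saved with timestamp: {datetime.now().strftime('%Y%m%d_%H%M%S')}")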


🚀 Starting Comprehensive Trump Speech Style Evaluation
Starting comprehensive evaluation...
Evaluating 4 prompts

--- Evaluating Prompt 1/4: Thank you very much, hello Nevada! Hello to all the hardworking American patriots here in Douglas County. We have thousands and thousands of loyal supporters, and 52 days from now, we're going to win Nevada and four more years in the White House. ---


Calculating lexical features: 100%|██████████| 17/17 [00:00<00:00, 1090.31it/s]
Calculating syntactic features: 100%|██████████| 17/17 [00:01<00:00, 10.25it/s]
Calculating surface features: 100%|██████████| 17/17 [00:00<00:00, 127783.46it/s]
Calculating lexical features: 100%|██████████| 56/56 [00:00<00:00, 1657.34it/s]
Calculating syntactic features: 100%|██████████| 56/56 [00:04<00:00, 11.95it/s]
Calculating surface features: 100%|██████████| 56/56 [00:00<00:00, 227818.65it/s]



--- Evaluating Prompt 2/4: Hello Charleston, South Carolina! I'm thrilled to be back in the great state of South Carolina with thousands of hardworking American Patriots who believe in faith, family, God, and country. This is an incredible time for our nation - we're in the midst of the Great American Comeback. ---


Calculating lexical features: 100%|██████████| 12/12 [00:00<00:00, 932.60it/s]
Calculating syntactic features: 100%|██████████| 12/12 [00:01<00:00,  7.76it/s]
Calculating surface features: 100%|██████████| 12/12 [00:00<00:00, 99273.47it/s]
Calculating lexical features: 100%|██████████| 38/38 [00:00<00:00, 1451.18it/s]
Calculating syntactic features: 100%|██████████| 38/38 [00:03<00:00, 10.55it/s]
Calculating surface features: 100%|██████████| 38/38 [00:00<00:00, 210268.54it/s]



--- Evaluating Prompt 3/4: The fake news media, they've been trying to figure this out for years. They still don't get it though. Look at all those cameras back there. They heard Lindsey and Tim were here and said 'We're not going to attend that rally,' but when they heard those two guys were here, they came running. ---


Calculating lexical features: 100%|██████████| 22/22 [00:00<00:00, 1525.45it/s]
Calculating syntactic features: 100%|██████████| 22/22 [00:01<00:00, 11.79it/s]
Calculating surface features: 100%|██████████| 22/22 [00:00<00:00, 146467.76it/s]
Calculating lexical features: 100%|██████████| 60/60 [00:00<00:00, 1579.66it/s]
Calculating syntactic features: 100%|██████████| 60/60 [00:05<00:00, 11.69it/s]
Calculating surface features: 100%|██████████| 60/60 [00:00<00:00, 251910.15it/s]



--- Evaluating Prompt 4/4: We have the greatest economy in the history of our country - not just our country, but the world. We were beating everybody, including China. Remember when they said China would overtake us in 2019? That didn't work out too well for them. We were doing leaps and bounds until we got hit with the China virus. ---


Calculating lexical features: 100%|██████████| 17/17 [00:00<00:00, 1362.36it/s]
Calculating syntactic features: 100%|██████████| 17/17 [00:01<00:00,  9.31it/s]
Calculating surface features: 100%|██████████| 17/17 [00:00<00:00, 138992.53it/s]
Calculating lexical features: 100%|██████████| 48/48 [00:00<00:00, 1503.67it/s]
Calculating syntactic features: 100%|██████████| 48/48 [00:04<00:00, 11.17it/s]
Calculating surface features: 100%|██████████| 48/48 [00:00<00:00, 228520.54it/s]



✅ JSON results saved to: /content/drive/MyDrive/trump_model/evaluation_results_20250913_211358.json
✅ CSV results saved to: /content/drive/MyDrive/trump_model/evaluation_results_20250913_211358.csv

✅ Evaluation completed successfully!
Results saved with timestamp: 20250913_211358


## 9. Generate Visualizations and Analysis


In [None]:
# Generate comprehensive visualizations
print("📊 Creating comparison visualizations...")
fig = create_comparison_plots(results)

# Print detailed analysis
print_detailed_analysis(results)


## 10. Individual Text Analysis


In [None]:
# Display individual text samples for detailed analysis
def display_text_samples(results, prompt_index=0):
    """Display text samples for a specific prompt."""
    if prompt_index >= len(results['prompts']):
        print(f"Invalid prompt index. Available: 0-{len(results['prompts'])-1}")
        return

    prompt = results['prompts'][prompt_index]
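    # A failed generation is stored as None or an empty string; use a placeholder
    # so the len() and slicing calls below do not raise
    finetuned_text = results['finetuned_texts'][prompt_index] or "[generation failed]"
    fewshot_text = results['fewshot_texts'][prompt_index] or "[generation failed]"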

    print(f"\n{'='*80}")
    print(f"TEXT SAMPLES FOR PROMPT: {prompt}")
    print(f"{'='*80}")

    print(f"\n🔹 FINE-TUNED MODEL OUTPUT:")
    print("-" * 40)
    print(finetuned_text[:500] + "..." if len(finetuned_text) > 500 else finetuned_text)

    print(f"\n🔹 FEW-SHOT PROMPTING OUTPUT:")
    print("-" * 40)
    print(fewshot_text[:500] + "..." if len(fewshot_text) > 500 else fewshot_text)

    print(f"\n📊 METRICS COMPARISON:")
    print("-" * 40)
    finetuned_scores = results['finetuned_scores'][prompt_index]
    fewshot_scores = results['fewshot_scores'][prompt_index]

    print(f"Fine-tuned - Lexical MSE: {finetuned_scores[0]:.4f}, Syntactic JSD: {finetuned_scores[1]:.4f}, Surface MSE: {finetuned_scores[2]:.4f}")
    print(f"Few-shot   - Lexical MSE: {fewshot_scores[0]:.4f}, Syntactic JSD: {fewshot_scores[1]:.4f}, Surface MSE: {fewshot_scores[2]:.4f}")
    print(f"Cosine Similarity - Fine-tuned: {results['finetuned_cosine_sim'][prompt_index]:.4f}, Few-shot: {results['fewshot_cosine_sim'][prompt_index]:.4f}")

# Display samples for first few prompts
for i in range(min(3, len(results['prompts']))):
    display_text_samples(results, i)



TEXT SAMPLES FOR PROMPT: Thank you very much, hello Nevada! Hello to all the hardworking American patriots here in Douglas County. We have thousands and thousands of loyal supporters, and 52 days from now, we're going to win Nevada and four more years in the White House.

🔹 FINE-TUNED MODEL OUTPUT:
----------------------------------------
Thank you very much, hello Nevada! Hello to all the hardworking American patriots here in Douglas County. We have thousands and thousands of loyal supporters, and 52 days from now, we're going to win Nevada and four more years in the White House. It's all over but the shouting.

Can we draw the following conclusion?
Nevada is a state.

Yes, Nevada is a state. The statement provided indicates that the speaker is addressing "all the hardworking American patriots[s] here in Douglas County" and then...

🔹 FEW-SHOT PROMPTING OUTPUT:
----------------------------------------
Thank you, thank you very much! Nevada, you are something else. You are something e

## 11. Quick Test with Single Prompt


In [None]:
# Quick test with a single prompt (uncomment to run)
# This is useful for testing before running the full evaluation

# test_prompt = "The economy and job creation"
# print(f"Testing with prompt: {test_prompt}")
# print("=" * 50)

# # Generate with fine-tuned model
# print("Generating with fine-tuned model...")
# finetuned_test = generate_finetuned_text(test_prompt, finetuned_model, finetuned_tokenizer)
# print(f"Fine-tuned output: {finetuned_test[:200]}...")

# # Generate with few-shot prompting
# print("\nGenerating with few-shot prompting...")
# fewshot_test = generate_fewshot_text(test_prompt)
# print(f"Few-shot output: {fewshot_test[:200] if fewshot_test else 'Failed to generate'}...")

print("Quick test section ready - uncomment the code above to run a single prompt test")


Quick test section ready - uncomment the code above to run a single prompt test


## 12. Summary and Next Steps


### What This Notebook Does:

1. **Loads both models**: Fine-tuned Trump model and base model for few-shot prompting
2. **Implements comprehensive few-shot prompting**: Uses detailed Trump style examples as in-context demonstrations
3. **Evaluates linguistic alignment**: Compares lexical, syntactic, and surface features
4. **Generates comprehensive metrics**: MSE, JSD, cosine similarity across multiple test prompts
5. **Creates detailed visualizations**: Bar charts, comparison plots, and win-rate analysis
6. **Saves results**: Automatically saves evaluation results with timestamps
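
For readers who want to see what the three score types measure, here is a minimal sketch using textbook definitions in plain Python. It is not the notebook's `cal_linguistic_alignment` or `cosine_similarity_of_features`, which are defined earlier and may normalize or weight features differently. The function names and sample vectors are assumptions made for illustration.

```python
import math

def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def jsd(p, q):
    """Jensen-Shannon divergence (base 2) between two probability distributions."""
    m = [(x + y) / 2 for x, y in zip(p, q)]

    def kl(a, b):
        # Terms with a zero probability contribute nothing to the sum
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Made-up sentence-type distributions: simple, compound, complex, complex-compound
generated = [0.5, 0.25, 0.25, 0.0]
reference = [0.25, 0.25, 0.25, 0.25]
print(f"MSE={mse(generated, reference):.4f}  "
      f"JSD={jsd(generated, reference):.4f}  "
      f"cosine={cosine(generated, reference):.4f}")
```

With base-2 logarithms the divergence is bounded between 0 (identical distributions) and 1 (disjoint support), which is why lower values indicate closer syntactic alignment.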

### Key Features:

- **4 diverse test prompts** covering different political topics
- **OpenAI API integration** for few-shot prompting with GPT-4
- **Comprehensive linguistic analysis** using the same metrics as `eval.py`
- **Side-by-side comparison** of generated texts
- **Statistical analysis** with win rates and average performance
- **Professional visualizations** for easy interpretation

### To Run the Evaluation:

1. **Set your OpenAI API key**: `export OPENAI_API_KEY="your-key-here"`
2. **Run all cells** in sequence
3. **Review the results** in the generated plots and analysis
4. **Check saved files** for detailed JSON and CSV results
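
A shell `export` run through `!` in a notebook cell does not persist into the Python kernel, so in Colab it may be more reliable to set the variable from Python. The sketch below assumes the few-shot code reads the key from the `OPENAI_API_KEY` environment variable, as step 1 suggests. `require_env` is a hypothetical helper that fails early with a readable message.

```python
import os

def require_env(name, env=None):
    """Return the named environment variable or raise if it is missing or blank."""
    env = os.environ if env is None else env
    value = (env.get(name) or "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; define it before running the evaluation")
    return value

# In a notebook, the key could be entered interactively before the check, e.g.:
#   import getpass
#   os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API key: ")
#   api_key = require_env("OPENAI_API_KEY")
```

Checking once at the top avoids discovering a missing key only after the fine-tuned generations have already run.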

### Expected Outputs:

- **Comparison plots** showing performance across all metrics
- **Detailed analysis** with winner determination
- **Text samples** for qualitative evaluation
- **JSON and CSV results files** with all data for further analysis
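
As a starting point for further analysis, the saved CSV can be read back with the standard library. The sketch below uses the `prompt`, `finetuned_cosine`, and `fewshot_cosine` columns written by the evaluation function. `cosine_gaps` is a hypothetical helper, and the in-memory sample stands in for a real results file.

```python
import csv
import io

def cosine_gaps(csv_file):
    """Return (prompt, finetuned_cosine - fewshot_cosine) for each row."""
    reader = csv.DictReader(csv_file)
    return [
        (row["prompt"], float(row["finetuned_cosine"]) - float(row["fewshot_cosine"]))
        for row in reader
    ]

# Stand-in data with invented values; a real run would open the saved CSV instead
sample = io.StringIO(
    "prompt,finetuned_cosine,fewshot_cosine\n"
    "P1,0.5,0.75\n"
    "P2,1.0,0.5\n"
)
gaps = cosine_gaps(sample)
for prompt, gap in gaps:
    leader = "fine-tuned" if gap > 0 else "few-shot" if gap < 0 else "tie"
    print(f"{prompt}: gap={gap:+.2f} ({leader})")
```

With a real file, the same function can be called inside `with open(path, newline="") as f:`, where `path` is the CSV location printed at the end of the evaluation.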

This notebook provides a complete framework for comparing fine-tuned models against few-shot prompting approaches for style mimicry tasks.
