<a href="https://colab.research.google.com/github/piesauce/llm-playbooks/blob/ateng%2FCH2_exercises/Ch2_Cosmopedia.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 2 Exercises - Cosmopedia Exploration

Load cosmopedia-100k, a subset of the Cosmopedia dataset, and explore the prompts as well as the resulting synthetic data. What does the quality of the synthetic data look like? Do you observe any factual or reasoning errors? Additionally, try varying the prompts and see if you can generate more diverse data.

Here's the link to Cosmopedia via HuggingFace Datasets: https://huggingface.co/datasets/HuggingFaceTB/cosmopedia-100k

In [3]:
# load the dataset
!pip install -q datasets

from datasets import load_dataset

dataset = load_dataset("HuggingFaceTB/cosmopedia-100k", split="train")

In [4]:
# look at examples from the dataset
for idx in range(5):
    sample = dataset[idx]
    print(f"Sample {idx+1}:")
    for k, v in sample.items():
        print(f"  {k}: {v}")
    print("-----")

Sample 1:
  prompt: Here is an extract from a webpage: "What can cause my settlement offer to be delayed?
When you’ve been injured in an Austin truck accident, one of the most common questions is how long it will take for the insurance company to make an offer to settle your case. The answer depends on a variety of factors.
The process starts with filing an insurance claim and providing evidence that shows exactly what happened during the accident and who was at fault. This can involve gathering key Austin truck accident evidence such as:
- Medical records
- Photographs or video footage of the crash scene
- Witness statements
- Other documents related to your injuries and damages.
Once this information has been collected by both sides, negotiations may begin between your Austin truck accident lawyer and the insurance company on how much compensation should be offered in exchange for settling the case out of court.
It is important to remember that every truck accident case is unique so 

As seen above, records in many synthetic data corpora follow a structure like:

- prompt: The original system or user prompt used to generate the text.
- response or text: The AI-generated answer or text snippet.
- Additional metadata (optional): temperature, model, date of generation, text_token_length, etc.
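To check which fields a corpus actually contains, a small helper can tally field names and value types over a handful of rows. This is a minimal sketch: `summarize_fields` and `toy_rows` are hypothetical names, and the toy rows are made up for illustration rather than taken from Cosmopedia.

```python
def summarize_fields(records):
    """Tally, per field, how many rows have it, which value types occur,
    and the longest string value seen (0 for non-string fields)."""
    summary = {}
    for row in records:
        for key, value in row.items():
            entry = summary.setdefault(
                key, {"count": 0, "types": set(), "max_chars": 0}
            )
            entry["count"] += 1
            entry["types"].add(type(value).__name__)
            if isinstance(value, str):
                entry["max_chars"] = max(entry["max_chars"], len(value))
    return summary


# Hypothetical rows for illustration only (not real Cosmopedia records)
toy_rows = [
    {"prompt": "Write a blog post about X.", "text": "X is ...", "text_token_length": 4},
    {"prompt": "Explain Y to a child.", "text": "Once upon a time ...", "text_token_length": 6},
]

field_summary = summarize_fields(toy_rows)
for name, info in field_summary.items():
    print(name, info["count"], sorted(info["types"]), info["max_chars"])
```

Assuming `dataset` is the object loaded earlier, `summarize_fields(dataset.select(range(100)))` should give the same kind of overview for real rows, and `dataset.column_names` lists the actual fields directly.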

## Assessing the Quality of Synthetic Data

Synthetic text from language models can be coherent and fluent, but it may exhibit:

- Factual inaccuracies: The model can “hallucinate” facts, misquote sources, or provide incorrect details.
- Reasoning errors: It might produce reasoning that seems logical on the surface but contains subtle contradictions or leaps.
- Stylistic uniformity: If prompts or generation hyperparameters are not varied, the text can start to sound repetitive, especially with the same tone or structure.

However, based on the 5 samples above, the synthetic text looks surprisingly coherent and well-structured. Each example fulfills the requested style or format (e.g., blog post, academic tone, children's story) and stays on-topic. The writing flows smoothly, with appropriate transitions and readable paragraph structure. It is also grammatically sound and largely stylistically consistent, both hallmarks of text generated by modern large language models.

- Factual Accuracy
    - In the LISA Pathfinder example (Sample 2), references to the mission’s purpose (testing drag-free flight, laser interferometry), plus mentions of ESA, NASA, and the timeline, seem consistent with known facts. The text states that LISA Pathfinder launched in December 2015, which matches the actual historical launch date (December 3, 2015). These details appear largely correct rather than random hallucinations.
    - In the “Angel Has Fallen” text (Sample 4), references to the film’s premise (Gerard Butler as Mike Banning, Morgan Freeman as President Trumbull, directed by Ric Roman Waugh) are also correct. The mention of the film’s tight schedule and the cinematographer Jules O’Loughlin ASC ACS lines up with real production constraints often cited in interviews.
    - The truck accident blog post (Sample 1) is generally coherent about how settlements might be delayed, what factors contribute (e.g., gathering evidence), and how legal negotiations work. It doesn’t contain glaring factual errors or absurd claims.
    - The mention of the “MoMath” event in Sample 3 is plausible, but a reader would need to confirm whether the references to the math talk and to Alex Kontorovich’s approach are fully accurate. At a glance, it appears consistent with typical math outreach events.
    - The children’s story (Sample 5) is mostly fictional/educational; it’s not heavily reliant on real-world facts beyond the general theme of “don’t judge by appearances.”

In short, no major factual inaccuracies jump out. However, one should keep in mind that synthetic data can still contain subtle errors. If these texts were integrated into a real publication or used as training data, a human fact-check would still be recommended.

- Reasoning and Logical Flow
    - The samples generally follow a coherent line of reasoning: They introduce a topic, expand on it, and draw conclusions (e.g., “here’s what can delay a settlement,” “here’s how LISA Pathfinder demonstrates technology for gravitational wave detection,” etc.).
    - The text in each sample stays on theme and transitions well among points (for example, Sample 2 logically moves from explaining gravitational waves to describing LISA Pathfinder’s technology demonstrations).
    - There are no blatant contradictions or nonsensical jumps in reasoning across the shown excerpts.

- Stylistic Observations
    - The texts follow the user instructions fairly closely (e.g., the performing arts post in Sample 4 avoids an opening such as “Hello dear readers…”, and the children’s story works in science content as requested).
    - The tone is consistent with each specified format (blog post, academic course unit, children’s story).
    - In some places, the text uses transitional phrases in a similar manner. This slight formulaic style can sometimes signal AI-generated text, but it remains coherent.

- Potential Limitations
    - Overconfidence: AI-generated text can sound quite certain, even on complex or specialized topics. Without thorough referencing, it’s easy to slip in smaller inaccuracies (e.g., the exact date or mission detail).
    - Repetitive Patterns: There are occasional repeated phrases (“It is crucial to…” or “One factor that can…”). This can indicate the text is pulling from a learned pattern or template.
    - Lack of Citations: Except for references to actual facts/figures, there are no formal citations or footnotes. This is common in synthetic text but also means verifying precise claims can be harder.

Again, note that we looked at only 5 samples here for illustration. :)
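Repetitive patterns like the ones mentioned above can be checked beyond a handful of samples by counting word n-grams that recur across texts. The sketch below is one simple way to do that, not a standard metric: `repeated_ngrams` and `toy_texts` are hypothetical names, and the toy sentences are invented around the repeated phrases quoted above.

```python
import re
from collections import Counter


def repeated_ngrams(texts, n=3, min_docs=2):
    """Return {ngram: number of texts containing it} for word n-grams
    that appear in at least `min_docs` different texts."""
    doc_counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        # Use a set so each n-gram counts at most once per text
        ngrams = {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        doc_counts.update(ngrams)
    return {gram: count for gram, count in doc_counts.items() if count >= min_docs}


# Invented sentences for illustration only
toy_texts = [
    "It is crucial to gather evidence early.",
    "It is crucial to verify every date.",
    "One factor that can delay a settlement is missing paperwork.",
]

common = repeated_ngrams(toy_texts, n=3, min_docs=2)
print(sorted(common.items()))
```

On real data, passing the `text` field of a few hundred rows and raising `min_docs` would surface the most template-like phrasing; the thresholds here are arbitrary starting points.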


## Varying the Prompts for More Diverse Outputs

In [5]:
!pip install transformers

from transformers import pipeline




In [6]:
# Load a text-generation model of your choice; for demonstration, we'll use GPT-2
# because it's easily available on the Hugging Face Hub: https://huggingface.co/openai-community/gpt2
generator = pipeline("text-generation", model="gpt2")


Device set to use cpu


In [7]:
generator

<transformers.pipelines.text_generation.TextGenerationPipeline at 0x79b571575690>

In [8]:
# Define a base snippet to expand on, similar to the examples above

base_snippet = (
    "Here is an extract from a webpage: "
    '"The LISA Pathfinder scientific collaboration will meet in Trento '
    'to discuss the outstanding success of the mission..."'
    "\n\n"
    "Write an extensive and detailed course unit suitable for a textbook targeted at college students."
)

# A list of style instructions to prepend to the base snippet
style_variations = [
    "In a highly technical, academic style:",
    "In a conversational, first-person narrative:",
    "As a short, playful summary under 150 words:",
    "Rewrite with bullet points and short paragraphs:",
]


In [10]:
# Generate text for each variation
# We'll loop over these variations, constructing a combined prompt each time.
# We'll also play with generation hyperparameters (e.g., temperature for creativity, top_k for sampling diversity) to see how it affects the output.

for idx, style_instruction in enumerate(style_variations):
    # Combine the style instruction + base snippet into a single prompt
    prompt = f"{style_instruction}\n\n{base_snippet}\n\n"

    # Generate text with specific parameters for diversity
    output = generator(
        prompt,
        max_length=300,         # cap on total length in tokens (prompt + generated text)
        temperature=1.2,        # higher temperature -> more randomness in sampling
        top_k=50,               # sample only from the 50 most likely tokens
        top_p=0.9,              # nucleus sampling, applied together with top_k
        num_return_sequences=1, # generate one sequence per prompt
        do_sample=True,         # enable sampling (rather than greedy decoding)
    )

    # Print or store the result
    print(f"--- Variation {idx+1} ---")
    print(f"PROMPT:\n{prompt}")
    print("GENERATED TEXT:\n", output[0]["generated_text"])
    print("------------------------------------------------\n")


Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


--- Variation 1 ---
PROMPT:
In a highly technical, academic style:

Here is an extract from a webpage: "The LISA Pathfinder scientific collaboration will meet in Trento to discuss the outstanding success of the mission..."

Write an extensive and detailed course unit suitable for a textbook targeted at college students.


GENERATED TEXT:
 In a highly technical, academic style:

Here is an extract from a webpage: "The LISA Pathfinder scientific collaboration will meet in Trento to discuss the outstanding success of the mission..."

Write an extensive and detailed course unit suitable for a textbook targeted at college students.


This section discusses the mission of the team:

How long is it taken to complete your project? A team would be expected to have about five-six weeks to complete a project at one school in the country. A school that does not have a complete project of the sort (for example, a research school) will typically take longer than this.

Why do they not do more?

Thes

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


--- Variation 2 ---
PROMPT:
In a conversational, first-person narrative:

Here is an extract from a webpage: "The LISA Pathfinder scientific collaboration will meet in Trento to discuss the outstanding success of the mission..."

Write an extensive and detailed course unit suitable for a textbook targeted at college students.


GENERATED TEXT:
 In a conversational, first-person narrative:

Here is an extract from a webpage: "The LISA Pathfinder scientific collaboration will meet in Trento to discuss the outstanding success of the mission..."

Write an extensive and detailed course unit suitable for a textbook targeted at college students.


This group consists of over 200 people from all levels of academia, from middle school through junior high, who are working in collaboration in a high-tech world.

What does this mean in practice?

A student on the project team will be given an overview of a project, written in English in a concise, easy-to-remember and authoritative way. The indivi

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


--- Variation 3 ---
PROMPT:
As a short, playful summary under 150 words:

Here is an extract from a webpage: "The LISA Pathfinder scientific collaboration will meet in Trento to discuss the outstanding success of the mission..."

Write an extensive and detailed course unit suitable for a textbook targeted at college students.


GENERATED TEXT:
 As a short, playful summary under 150 words:

Here is an extract from a webpage: "The LISA Pathfinder scientific collaboration will meet in Trento to discuss the outstanding success of the mission..."

Write an extensive and detailed course unit suitable for a textbook targeted at college students.


In this course:


* You will prepare a student's initial English-language textbook (the required two or three chapters will be required);


* You will introduce yourself and how your assignment came about.


* You will describe your course assignments and why their outcomes are important to you.


* You will ask students who have completed a course 

- temperature=1.2: Rescales the token probabilities before sampling. A value of 1.0 leaves the model's distribution unchanged; values above 1.0 flatten it, which yields more varied and less predictable output, while values below 1.0 sharpen it.
- top_k=50: The model samples the next token only from the 50 most probable tokens.
- top_p=0.9: Nucleus (top-p) sampling restricts sampling to the smallest set of tokens whose cumulative probability reaches 0.9. When top_k and top_p are both set, both filters apply.
- do_sample=True: Enables sampling (rather than always taking the highest-probability token).
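The effect of these parameters is easier to see on a toy distribution. The sketch below applies temperature scaling, then top-k, then top-p to four made-up logits; the function names are hypothetical, and the temperature, top-k, top-p ordering is an assumption about how the filters are combined, not a copy of the library's implementation.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_top_p_filter(probs, top_k=None, top_p=None):
    """Keep the top_k most probable tokens, then the smallest prefix of those
    whose renormalized cumulative probability reaches top_p.
    Returns {token_index: renormalized probability}."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    if top_k is not None:
        ranked = ranked[:top_k]
    if top_p is not None:
        total = sum(p for _, p in ranked)
        kept, cumulative = [], 0.0
        for index, p in ranked:
            kept.append((index, p))
            cumulative += p / total
            if cumulative >= top_p:
                break
        ranked = kept
    total = sum(p for _, p in ranked)
    return {index: p / total for index, p in ranked}


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens
base = softmax_with_temperature(toy_logits, temperature=1.0)
hot = softmax_with_temperature(toy_logits, temperature=1.2)
filtered = top_k_top_p_filter(base, top_k=3, top_p=0.8)

print("T=1.0:", [round(p, 3) for p in base])
print("T=1.2:", [round(p, 3) for p in hot])
print("after top_k=3, top_p=0.8:", {i: round(p, 3) for i, p in filtered.items()})
```

Raising the temperature moves probability mass from the most likely token to the rest, while the two filters shrink the set of tokens that can be sampled at all.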

### Overall Quality & Key Takeaways of Examples Above

- Prompt Echoing: Each output begins with the entire prompt. This is largely a default of the text-generation pipeline rather than a model failure: it returns the prompt followed by the continuation unless return_full_text=False is passed.
- Ignoring Instructions: The model disregards requests for technical detail, style constraints, or word limits. It rarely addresses the actual LISA Pathfinder context in a meaningful way.
- Hallucinations: Much of the generated text is random or off-topic, discussing school projects, course timelines, or digital libraries rather than gravitational wave technology.
- Not Factual: The output does not contain factual information about LISA, gravitational waves, or ESA. Instead, it inserts inaccurate or irrelevant details about “200 people in academia,” “the next chapter,” or “deadline for a working effort.”
- Limited Depth: None of the outputs amounts to a “detailed course unit,” “first-person narrative,” or “playful summary,” so they neither meet the instructions nor address the snippet meaningfully.
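Some of the takeaways above can be turned into rough automatic checks: does the output start with the prompt, does the continuation mention any expected topic words, and does it respect a word limit? The sketch below is a crude heuristic, not a quality metric; `check_output`, the keyword list, and the shortened prompt are all hypothetical, with the continuation modeled on Variation 1.

```python
def check_output(prompt, full_text, keywords, max_words=None):
    """Heuristic checks on one generation: prompt echo, topic keywords, length."""
    stripped_prompt = prompt.strip()
    stripped_text = full_text.lstrip()
    echoed = stripped_text.startswith(stripped_prompt)
    # Only judge the continuation, not the echoed prompt itself
    continuation = stripped_text[len(stripped_prompt):] if echoed else full_text
    lowered = continuation.lower()
    found = [kw for kw in keywords if kw.lower() in lowered]
    word_count = len(continuation.split())
    return {
        "echoes_prompt": echoed,
        "keywords_found": found,
        "word_count": word_count,
        "within_limit": max_words is None or word_count <= max_words,
    }


# Shortened, hypothetical prompt; continuation modeled on Variation 1
toy_prompt = (
    "In a highly technical, academic style:\n\n"
    "Write a course unit about LISA Pathfinder.\n\n"
)
toy_output = " " + toy_prompt + (
    "This section discusses the mission of the team:\n\n"
    "How long is it taken to complete your project?"
)

report = check_output(
    toy_prompt,
    toy_output,
    keywords=["LISA", "gravitational", "interferometry"],
    max_words=150,
)
print(report)
```

An empty `keywords_found` list on the continuation is a quick signal that the model drifted off-topic, although keyword matching will miss paraphrases and cannot judge factual accuracy.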

### Why is this Happening?

- GPT-2 is not instruction-tuned: it was trained only to predict the next token, so it has no notion of a request to fulfill.
- It’s an older, smaller model (the gpt2 checkpoint has roughly 124M parameters and a 1,024-token context window) with limited context handling and a tendency to produce irrelevant text.
- The prompts are given as direct text, which GPT-2 sees as “stuff to continue,” rather than instructions to follow.

### Possible Solutions

- Use Instruction-Tuned or Chat Models
    - Models like Flan-T5, Llama 2 Chat, GPT-NeoXT Chat Models, or OpenAI’s ChatGPT are often much better at following style requests.
- Prompt Engineering
    - Add clear separators, such as "Context:\n<text>\n\nInstruction:\n<what to do>\n\nAnswer:\n".
    - Set no_repeat_ngram_size to discourage repeated phrases, cap the output length with max_new_tokens, and drop the echoed prompt by passing return_full_text=False or by stripping the prompt prefix from the output.
- In-Context Examples
    - Provide a short example of what you want (an example input + desired output) inside the prompt so the model can mimic that style.
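The prompt-engineering and in-context suggestions above can be sketched as two small helpers: one that assembles a prompt with explicit separators and optional worked examples, and one that strips an echoed prompt from the output. The helper names, the example texts, and the values in `GENERATION_KWARGS` are assumptions chosen for illustration, not tuned settings.

```python
def build_prompt(context, instruction, examples=()):
    """Assemble a Context/Instruction/Answer prompt, optionally preceded by
    worked examples given as (context, instruction, answer) tuples."""
    parts = []
    for ex_context, ex_instruction, ex_answer in examples:
        parts.append(
            f"Context:\n{ex_context}\n\nInstruction:\n{ex_instruction}\n\n"
            f"Answer:\n{ex_answer}\n"
        )
    parts.append(f"Context:\n{context}\n\nInstruction:\n{instruction}\n\nAnswer:\n")
    return "\n".join(parts)


def strip_prompt_echo(prompt, generated_text):
    """Remove the prompt if the generated text starts with it verbatim."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    return generated_text


# Illustrative settings (assumed values, not recommendations)
GENERATION_KWARGS = {
    "max_new_tokens": 200,       # counts only newly generated tokens
    "no_repeat_ngram_size": 3,   # forbid repeating any 3-gram
    "return_full_text": False,   # ask the pipeline to omit the prompt
    "do_sample": True,
    "temperature": 0.8,
    "top_p": 0.9,
}

# Hypothetical in-context example showing the desired style
example = (
    "The Hubble Space Telescope was launched in 1990.",
    "Summarize in one playful sentence.",
    "Hubble has been snapping cosmic selfies since 1990!",
)
prompt = build_prompt(
    context="The LISA Pathfinder scientific collaboration will meet in Trento "
            "to discuss the outstanding success of the mission...",
    instruction="Summarize in one playful sentence.",
    examples=[example],
)
print(prompt)

# Assuming `generator` is the pipeline created earlier:
# output = generator(prompt, **GENERATION_KWARGS)
# print(output[0]["generated_text"])
```

Even with this structure, a base model like GPT-2 may still wander off-topic; the separators and the worked example mainly give it a pattern to continue, which is the most a non-instruction-tuned model can exploit.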


To summarize, the GPT-2 outputs across these variations:

- Mostly ignore your style requests,
- Begin with an echo of the prompt, and
- Produce random or irrelevant text about “projects” or “deadlines” instead of a coherent course unit about LISA Pathfinder.

Ignoring instructions and drifting off-topic are common with non-instruction-tuned models. If you switch to a more advanced or instruction-tuned model (or significantly adjust your prompt structure and sampling strategy), you’ll likely see more on-topic, less repetitive, and higher-quality text.