## Synthetic Data = Simulated Data
Synthetic data is data generated by a model or simulation, as opposed to  
data that occurs naturally or is collected by people.
<br/>
<br/>
<br/>

### Why Is It Useful?
- **Cost-effective Data Collection:** Generating synthetic data is often  
much cheaper than employing human annotators, making it an  
economical choice for data collection.

- **Overcoming Data Scarcity:** In domains where adequate real-world data  
is scarce or difficult to collect, synthetic data can fill the gap, enabling  
the training of machine learning models.

- **Large, Diverse Data:** Synthetic data can cover a wider distribution  
than human collectors typically produce, and it can be generated at scale.

- **Privacy Compliance:** It helps in scenarios where using real data could  
violate privacy regulations. Synthetic data can be used to create anonymized  
datasets that mimic real data characteristics without exposing sensitive  
information.


### Papers
**Self-Instruct: Aligning Language Models with Self-Generated Instructions**  


Finetuned a base GPT-3 model on noisy, self-generated synthetic data; the  
resulting model compares favorably to InstructGPT (the precursor to ChatGPT)

<br/>

**Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor**  


Generates synthetic data from GPT-3.5 to finetune a smaller model, T5

<br/>

**Textbooks Are All You Need II (phi-1.5; Microsoft)**  


Trained a model on 20B synthetic textbook quality tokens

<br/>

**Rephrasing the Web: A Recipe for Compute and Data Efficient Language Modeling**  


Rephrased documents via different prompt strategies, trained with a mix of  
synthetic and natural data, and showed 3x improvement in training speed

<br/>

**Self-Alignment with Instruction Backtranslation**  


Proposes a scalable process to self-improve a model, starting with a small  
set of seed data and iteratively generating synthetic data and finetuning



## Data quality is all you need

In [1]:
import instructor
from pydantic import BaseModel, Field, create_model
from openai import AsyncOpenAI, OpenAI
from enum import Enum
from typing import Iterable, Optional, List, Literal, Union
from langchain_community.document_loaders import WebBaseLoader
from langchain_experimental.text_splitter import SemanticChunker
from langchain.text_splitter import CharacterTextSplitter
from langchain_openai.embeddings import OpenAIEmbeddings
import asyncio
import json
import warnings
warnings.filterwarnings("ignore")

In [18]:
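import os
# `!export` runs in a throwaway subshell and would not reach this kernel,
# so set the key in the current process instead.
os.environ.setdefault('OPENAI_API_KEY', 'sk-' + 'your api key')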
client = instructor.patch(AsyncOpenAI())
MODEL='gpt-4-turbo-preview'

semantic_text_splitter = SemanticChunker(OpenAIEmbeddings())

In [3]:
class ChatRole(str, Enum):
    user: str = 'user'
    system: str = 'system'
    chat: str = 'chat'


class Message(BaseModel):
    role: Literal[ChatRole.user, ChatRole.system]
    content: str

class ChatSequence(BaseModel):
    messages: List[Message] = Field(..., description="A list of chat turns in sequence")

def create_chat_instruction(system_instruction, user_instruction):
    return ChatSequence(messages=[
        Message(role=ChatRole.system, content=system_instruction),
        Message(role=ChatRole.user, content=user_instruction)
    ])

def pretty_print(instance: BaseModel):
    # model_dump() is the pydantic v2 replacement for the deprecated dict()
    print(json.dumps(instance.model_dump(), indent=2))
    

### Create Synthetic Data for RAG Embedding Finetune

In [4]:
class Context(BaseModel):
    body: str = Field(..., description="The text body of the context chunk")
    id: Optional[int] = Field(..., description="Number identifier of the context chunk")
    title: Optional[str] = Field(None, description="The title of the context chunk")


class ContextQuestion(BaseModel):
    question: str = Field(..., description="A question where the context is the answer.")
    question_type: Optional[str] = Field(None, description="The type of question")
    question_level: Optional[str] = Field(None, description="How difficult the question is")

class ContextQuestions(BaseModel):
    questions: List[str] = Field(..., description="A sequence of questions where the context is the answer. The list should contain at least 5 questions.")
    question_type: Optional[str] = Field(..., description="The type of questions indicated by user instruction")
    question_level: Optional[str] = Field(..., description="How difficult the question is")

def with_chain_of_thought(model: type[BaseModel], action: str) -> type[BaseModel]:
    # Retrieve the original model's field types and defaults (pydantic v2 API)
    fields = {name: (field.annotation, field.default) for name, field in model.model_fields.items()}
    
    # Add the 'reasoning' field
    fields['reasoning'] = (str, Field(..., description=f"Let's think step by step to {action}"))
    
    # Create the new model with '_ChainOfThought' appended to the original model name
    new_model_name = f"{model.__qualname__}_ChainOfThought"
    new_model = create_model(new_model_name, **fields)
    
    return new_model

CoTContextQuestion = with_chain_of_thought(ContextQuestion, action="Generate unique questions for the context")
CoTContextQuestions = with_chain_of_thought(ContextQuestions, action="Generate unique questions for the context")


class RagQuestionContext(BaseModel):
    context: Context = Field(..., description="The context chunk")
    questions: Union[ContextQuestion, ContextQuestions, CoTContextQuestion, CoTContextQuestions] = Field(..., description="A sequence of questions where the context is the answer.")

class IterableRagQuestionContext(BaseModel):
    question_groups: List[RagQuestionContext] = Field(..., description="A sequence of context chunks and questions")


async def generate_questions(answers: List[Context], response_model=ContextQuestion) -> IterableRagQuestionContext:
    results = []
    for answer in answers:
        # format context
        context = answer.body
        instructions = create_chat_instruction(
            system_instruction="You are a helpful assistant and a world-class question master capable of generating questions of all skill levels for a given answer.",
            user_instruction=f"Create unique questions that the context contains the answer to. \n\nContext: \n{context} "

        )
        # TODO check temp and other params
        questions = client.chat.completions.create(
            model=MODEL,
            messages=instructions.dict()['messages'],
            response_model=response_model,
        )
        results.append(questions)
    results = await asyncio.gather(*results)
    results = IterableRagQuestionContext(question_groups=[RagQuestionContext(context=answers[i], questions=results[i]) for i in range(len(answers))])
    return results


In [5]:
loader = WebBaseLoader('https://openai.com/research/video-generation-models-as-world-simulators')
loader.requests_kwargs = {'verify':False}
data = loader.load()
# chunk it up
answers = semantic_text_splitter.create_documents([data[0].page_content])
answers = [Context(id=i, body=answer.page_content) for i, answer in enumerate(answers)]
answers

[Context(body='\n\n\nVideo generation models as world simulators\n\n\n\n\n\n\n\n\n\n\n\n\n\nCloseSearch Submit Skip to main contentSite NavigationResearchOverviewIndexGPT-4DALL·E 3SoraAPIOverviewPricingDocsChatGPTOverviewTeamEnterprisePricingTry ChatGPTSafetyCompanyAboutBlogCareersResidencyCharterSecurityCustomer storiesSearch Navigation quick links Log inTry ChatGPTMenu Mobile Navigation CloseSite NavigationResearchOverviewIndexGPT-4DALL·E 3SoraAPIOverviewPricingDocsChatGPTOverviewTeamEnterprisePricingTry ChatGPTSafetyCompanyAboutBlogCareersResidencyCharterSecurityCustomer stories Quick Links Log inTry ChatGPTSearch Submit ResearchVideo generation models as world simulatorsWe explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We leverage a transformer architecture that operates on spacetime patches of video and image latent codes. Ou

### Just Ask For It

In [6]:
questions = await generate_questions(answers)
pretty_print(questions)

{
  "question_groups": [
    {
      "context": {
        "body": "\n\n\nVideo generation models as world simulators\n\n\n\n\n\n\n\n\n\n\n\n\n\nCloseSearch Submit Skip to main contentSite NavigationResearchOverviewIndexGPT-4DALL\u00b7E 3SoraAPIOverviewPricingDocsChatGPTOverviewTeamEnterprisePricingTry ChatGPTSafetyCompanyAboutBlogCareersResidencyCharterSecurityCustomer storiesSearch Navigation quick links Log inTry ChatGPTMenu Mobile Navigation CloseSite NavigationResearchOverviewIndexGPT-4DALL\u00b7E 3SoraAPIOverviewPricingDocsChatGPTOverviewTeamEnterprisePricingTry ChatGPTSafetyCompanyAboutBlogCareersResidencyCharterSecurityCustomer stories Quick Links Log inTry ChatGPTSearch Submit ResearchVideo generation models as world simulatorsWe explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We leverage a transformer architecture that ope

### Add Chain of Thought Reasoning

In [7]:

questions = await generate_questions(answers, response_model=CoTContextQuestion)
pretty_print(questions)

{
  "question_groups": [
    {
      "context": {
        "body": "\n\n\nVideo generation models as world simulators\n\n\n\n\n\n\n\n\n\n\n\n\n\nCloseSearch Submit Skip to main contentSite NavigationResearchOverviewIndexGPT-4DALL\u00b7E 3SoraAPIOverviewPricingDocsChatGPTOverviewTeamEnterprisePricingTry ChatGPTSafetyCompanyAboutBlogCareersResidencyCharterSecurityCustomer storiesSearch Navigation quick links Log inTry ChatGPTMenu Mobile Navigation CloseSite NavigationResearchOverviewIndexGPT-4DALL\u00b7E 3SoraAPIOverviewPricingDocsChatGPTOverviewTeamEnterprisePricingTry ChatGPTSafetyCompanyAboutBlogCareersResidencyCharterSecurityCustomer stories Quick Links Log inTry ChatGPTSearch Submit ResearchVideo generation models as world simulatorsWe explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We leverage a transformer architecture that ope

### While You're At It, Ask For Many Questions

In [8]:
questions = await generate_questions(answers, response_model=CoTContextQuestions)
pretty_print(questions)

{
  "question_groups": [
    {
      "context": {
        "body": "\n\n\nVideo generation models as world simulators\n\n\n\n\n\n\n\n\n\n\n\n\n\nCloseSearch Submit Skip to main contentSite NavigationResearchOverviewIndexGPT-4DALL\u00b7E 3SoraAPIOverviewPricingDocsChatGPTOverviewTeamEnterprisePricingTry ChatGPTSafetyCompanyAboutBlogCareersResidencyCharterSecurityCustomer storiesSearch Navigation quick links Log inTry ChatGPTMenu Mobile Navigation CloseSite NavigationResearchOverviewIndexGPT-4DALL\u00b7E 3SoraAPIOverviewPricingDocsChatGPTOverviewTeamEnterprisePricingTry ChatGPTSafetyCompanyAboutBlogCareersResidencyCharterSecurityCustomer stories Quick Links Log inTry ChatGPTSearch Submit ResearchVideo generation models as world simulatorsWe explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We leverage a transformer architecture that ope

### Try Few Shot Prompts

In [9]:
async def few_shot_generate_questions(answers: List[Context], few_shot_examples: List[RagQuestionContext], response_model=ContextQuestion, question_types: List[str] = [None], question_levels: List[str] = [None]) -> IterableRagQuestionContext:
    results = []
    few_shot_context = ''
    for example in few_shot_examples:
        few_shot_context += f"Context: {example.context.body}\n"
        # Iterate over the question strings; iterating the model itself
        # would yield (field name, value) pairs instead
        for q in example.questions.questions:
            few_shot_context += f"Question: {q}\n"
        few_shot_context += "\n"
    
    rate_limit_buffer = 20
    gathered_results = []
    full_results = []
    completion_count = 0
    for answer in answers:
        for question_type in question_types:
            if question_type:
                type_hint = f"The question type should be {question_type}. "
            else:
                type_hint = ''
            for question_level in question_levels:
                # Rebuild the hint on every iteration so earlier levels do not accumulate
                topic = type_hint
                if question_level:
                    topic += f"The question level should be {question_level}. "
                # format context
                context = answer.body
                instructions = create_chat_instruction(
                    system_instruction="You are a helpful assistant and a world-class question master capable of generating questions of all skill levels for a given answer.",
                    user_instruction=f"Create unique questions that the context contains the answer to. {topic} Follow these examples and format:\n\n{few_shot_context}\n\nContext: \n{context} "

                )
                # TODO check temp and other params
                questions = client.chat.completions.create(
                    model=MODEL,
                    messages=instructions.dict()['messages'],
                    response_model=response_model,
                )
                results.append((answer, questions))
                full_results.append((answer, questions))
                completion_count += 1
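                if len(results) >= rate_limit_buffer:
                    gathered_results.extend(await asyncio.gather(*[coro for _, coro in results]))
                    results = []
    # Gather the final partial batch; skip it if the last batch was already flushed
    if results:
        gathered_results.extend(await asyncio.gather(*[coro for _, coro in results]))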
    # results keys and values are in the same order so we can map back
    results = [(orig_tuple[0], gathered) for gathered, orig_tuple in zip(gathered_results, full_results)]
    results = IterableRagQuestionContext(
        question_groups=[RagQuestionContext(
            context=context,
            questions=question,
        )
        for context, question in results]
    )
    return results
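
The batching above waits for each group of `rate_limit_buffer` requests to finish before sending more. One alternative, sketched here with a stand-in coroutine rather than a real API call, is to cap the number of in-flight requests with `asyncio.Semaphore`. The names `fake_completion` and `bounded_gather` and the limit of 3 are hypothetical and chosen only for illustration.

```python
import asyncio

async def fake_completion(prompt: str, tracker: dict) -> str:
    # Hypothetical stand-in for client.chat.completions.create(...)
    tracker["active"] += 1
    tracker["peak"] = max(tracker["peak"], tracker["active"])
    await asyncio.sleep(0.01)  # simulate network latency
    tracker["active"] -= 1
    return f"question for: {prompt}"

async def bounded_gather(prompts, limit: int, tracker: dict):
    semaphore = asyncio.Semaphore(limit)

    async def run_one(prompt):
        # At most `limit` coroutines are inside this block at any time
        async with semaphore:
            return await fake_completion(prompt, tracker)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(run_one(p) for p in prompts))

tracker = {"active": 0, "peak": 0}
prompts = [f"chunk {i}" for i in range(10)]
# Inside a notebook, use `await bounded_gather(...)` instead of asyncio.run
results = asyncio.run(bounded_gather(prompts, limit=3, tracker=tracker))
```

Because `gather` preserves input order, the results can be paired back with their contexts by position, just as the function above does.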


In [10]:
few_shot_examples = [RagQuestionContext(
    context=Context(id=40, body="We follow the methods of Yu et al. (2022a) and jointly pre-train our captioner with a CLIP and a language modeling objective using the above formulation on our dataset of (t, i) text and image pairs. The resulting model is indeed a good captioner, but exhibits the same problems we describe in section 2, such as a reluctance to describe details."),
    questions=ContextQuestions(questions=["What objectives is the captioner trained with?"], question_type=None, question_level=None)),
    RagQuestionContext(context=Context(id=503, body="""To evaluate caption blending ratios, we trained four image generation models using our descriptive synthetic captions at different blending ratios. We chose blends of 65%, 80%, 90% and 95% synthetic captions. Midway through the experiment, evaluations showed that the 65% blend was far behind the other blends on all evals and we dropped it."""),
                       questions=ContextQuestions(questions=["How can you evaluate caption blending ratios?"], question_type=None, question_level=None)),
    RagQuestionContext(context=Context(id=12, body="Annotating large-scale instruction data can be chal- lenging for humans because it requires 1) creativity to come up with novel tasks and 2) expertise for writing the solutions to each task. Here, we de- tail our process for SELF-INSTRUCT, which refers to the pipeline of generating tasks with a vanilla pretrained language model itself, filtering the generated data, and then conducting instruction tuning with this generated data in order to align the LM to follow instructions better."),
                       questions=ContextQuestions(questions=['Why is annotating instruction data challenging, particularly at large scale?'], question_type=None, question_level=None))
]

In [11]:
questions = await few_shot_generate_questions(answers, few_shot_examples, response_model=CoTContextQuestions)
questions.model_dump()

{'question_groups': [{'context': {'body': '\n\n\nVideo generation models as world simulators\n\n\n\n\n\n\n\n\n\n\n\n\n\nCloseSearch Submit Skip to main contentSite NavigationResearchOverviewIndexGPT-4DALL·E 3SoraAPIOverviewPricingDocsChatGPTOverviewTeamEnterprisePricingTry ChatGPTSafetyCompanyAboutBlogCareersResidencyCharterSecurityCustomer storiesSearch Navigation quick links Log inTry ChatGPTMenu Mobile Navigation CloseSite NavigationResearchOverviewIndexGPT-4DALL·E 3SoraAPIOverviewPricingDocsChatGPTOverviewTeamEnterprisePricingTry ChatGPTSafetyCompanyAboutBlogCareersResidencyCharterSecurityCustomer stories Quick Links Log inTry ChatGPTSearch Submit ResearchVideo generation models as world simulatorsWe explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We leverage a transformer architecture that operates on spacetime patches of vide

### Steer The Model

In [12]:
questions = await few_shot_generate_questions(answers, few_shot_examples, response_model=CoTContextQuestions, question_types=['open-ended', 'closed-ended'])
questions.model_dump()

{'question_groups': [{'context': {'body': '\n\n\nVideo generation models as world simulators\n\n\n\n\n\n\n\n\n\n\n\n\n\nCloseSearch Submit Skip to main contentSite NavigationResearchOverviewIndexGPT-4DALL·E 3SoraAPIOverviewPricingDocsChatGPTOverviewTeamEnterprisePricingTry ChatGPTSafetyCompanyAboutBlogCareersResidencyCharterSecurityCustomer storiesSearch Navigation quick links Log inTry ChatGPTMenu Mobile Navigation CloseSite NavigationResearchOverviewIndexGPT-4DALL·E 3SoraAPIOverviewPricingDocsChatGPTOverviewTeamEnterprisePricingTry ChatGPTSafetyCompanyAboutBlogCareersResidencyCharterSecurityCustomer stories Quick Links Log inTry ChatGPTSearch Submit ResearchVideo generation models as world simulatorsWe explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We leverage a transformer architecture that operates on spacetime patches of vide

In [13]:
questions = await few_shot_generate_questions(answers, few_shot_examples, response_model=CoTContextQuestions, question_types=['open-ended', 'closed-ended'], question_levels=['easy', 'medium', 'hard'])
questions.model_dump()

{'question_groups': [{'context': {'body': '\n\n\nVideo generation models as world simulators\n\n\n\n\n\n\n\n\n\n\n\n\n\nCloseSearch Submit Skip to main contentSite NavigationResearchOverviewIndexGPT-4DALL·E 3SoraAPIOverviewPricingDocsChatGPTOverviewTeamEnterprisePricingTry ChatGPTSafetyCompanyAboutBlogCareersResidencyCharterSecurityCustomer storiesSearch Navigation quick links Log inTry ChatGPTMenu Mobile Navigation CloseSite NavigationResearchOverviewIndexGPT-4DALL·E 3SoraAPIOverviewPricingDocsChatGPTOverviewTeamEnterprisePricingTry ChatGPTSafetyCompanyAboutBlogCareersResidencyCharterSecurityCustomer stories Quick Links Log inTry ChatGPTSearch Submit ResearchVideo generation models as world simulatorsWe explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We leverage a transformer architecture that operates on spacetime patches of vide
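
Each (question type, question level) pair is sent as its own request for every chunk, so the number of API calls grows multiplicatively. The sketch below shows how the steering hints are meant to expand, one hint per pair. `build_hints` is a hypothetical helper that mirrors the prompt strings used above, and the chunk count is an assumed value.

```python
from itertools import product
from typing import List, Optional

def build_hints(
    question_types: List[Optional[str]],
    question_levels: List[Optional[str]],
) -> List[str]:
    hints = []
    for question_type, question_level in product(question_types, question_levels):
        hint = ""
        if question_type:
            hint += f"The question type should be {question_type}. "
        if question_level:
            hint += f"The question level should be {question_level}. "
        hints.append(hint)
    return hints

hints = build_hints(["open-ended", "closed-ended"], ["easy", "medium", "hard"])
num_chunks = 4  # assumed chunk count, for illustration only
print(len(hints) * num_chunks)  # 24 requests in total
```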

### Filter Low Quality Data

In [14]:
async def filter_questions(questions: IterableRagQuestionContext) -> IterableRagQuestionContext:
    filtered_list = []
    for question_group in questions.question_groups:
        context = question_group.context.body
        instructions = create_chat_instruction(
            system_instruction="You are a helpful assistant and a world-class question master capable of determining if questions are good.",
            user_instruction=f"Filter out questions that are low quality, are not relevant to the context, or are incorrect given the answer. Return a list of booleans indicating whether each question should be removed. This list should map directly to the question sequence. \n\nContext: \n{context}\n\nQuestions: \n{question_group.questions} "
        )
        filtered_questions = client.chat.completions.create(
            model=MODEL,
            messages=instructions.dict()['messages'],
            response_model=FilterQuestions
        )
        filtered_list.append(filtered_questions)
    filtered_list = await asyncio.gather(*filtered_list)
    # keep only questions that are not filtered
    filtered_questions = []
    for question_group, filtered_group in zip(questions.question_groups, filtered_list):
        filtered_questions.append(RagQuestionContext(
            context=question_group.context,
            questions=CoTContextQuestions(questions=[question for f, question in enumerate(question_group.questions.questions) if not filtered_group.should_remove[f]],
                question_type=question_group.questions.question_type,
                question_level=question_group.questions.question_level,
                reasoning=question_group.questions.reasoning
            )
        ))
    return IterableRagQuestionContext(question_groups=filtered_questions)

class FilterQuestions(BaseModel):
    should_remove: List[bool] = Field(..., description="Whether or not to remove questions that are low quality, are not relevant to the context, or incorrect given the answer. Values should match question index")
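
The filter relies on the model returning exactly one boolean per question. A defensive check, sketched below on plain lists, could reject a mismatched mask before it is used for indexing. `apply_removal_mask` is a hypothetical helper, not part of the notebook.

```python
from typing import List

def apply_removal_mask(questions: List[str], should_remove: List[bool]) -> List[str]:
    # A mask of the wrong length cannot be mapped back onto the questions
    if len(questions) != len(should_remove):
        raise ValueError(
            f"expected {len(questions)} flags, got {len(should_remove)}"
        )
    return [q for q, remove in zip(questions, should_remove) if not remove]

kept = apply_removal_mask(
    ["What do spacetime patches represent?", "Who wrote this page?"],
    [False, True],
)
print(kept)  # ['What do spacetime patches represent?']
```

Raising on a mismatch makes it possible to retry the filter request instead of silently keeping or dropping the wrong questions.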
    

In [15]:
filtered_questions = await filter_questions(questions)
print(json.dumps(filtered_questions.model_dump(), indent=2))

{
  "question_groups": [
    {
      "context": {
        "body": "\n\n\nVideo generation models as world simulators\n\n\n\n\n\n\n\n\n\n\n\n\n\nCloseSearch Submit Skip to main contentSite NavigationResearchOverviewIndexGPT-4DALL\u00b7E 3SoraAPIOverviewPricingDocsChatGPTOverviewTeamEnterprisePricingTry ChatGPTSafetyCompanyAboutBlogCareersResidencyCharterSecurityCustomer storiesSearch Navigation quick links Log inTry ChatGPTMenu Mobile Navigation CloseSite NavigationResearchOverviewIndexGPT-4DALL\u00b7E 3SoraAPIOverviewPricingDocsChatGPTOverviewTeamEnterprisePricingTry ChatGPTSafetyCompanyAboutBlogCareersResidencyCharterSecurityCustomer stories Quick Links Log inTry ChatGPTSearch Submit ResearchVideo generation models as world simulatorsWe explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We leverage a transformer architecture that ope

In [16]:
def count_questions(questions: IterableRagQuestionContext) -> int:
    return sum([len(q.questions.questions) for q in questions.question_groups])

print(f'{count_questions(questions) - count_questions(filtered_questions)} questions were removed')

1 questions were removed
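
For embedding finetuning, the filtered groups eventually need to become (question, context) pairs. The sketch below flattens a `model_dump()`-style dictionary into JSON Lines. The `query` and `positive` key names are assumptions, since the expected format depends on the training library, and `example_dump` is toy data in the same shape as the output above.

```python
import json
from typing import Dict, Iterator, Tuple

def iter_pairs(dump: Dict) -> Iterator[Tuple[str, str]]:
    """Yield (question, context body) pairs from a model_dump()-style dict."""
    for group in dump["question_groups"]:
        body = group["context"]["body"]
        questions = group["questions"]
        # ContextQuestions holds a list under "questions";
        # ContextQuestion holds a single string under "question"
        if "questions" in questions:
            items = questions["questions"]
        else:
            items = [questions["question"]]
        for question in items:
            yield question, body

def to_jsonl(dump: Dict) -> str:
    lines = [json.dumps({"query": q, "positive": body}) for q, body in iter_pairs(dump)]
    return "\n".join(lines)

example_dump = {
    "question_groups": [
        {"context": {"body": "We train text-conditional diffusion models.", "id": 0, "title": None},
         "questions": {"questions": ["What kind of models are trained?"],
                       "question_type": None, "question_level": None}},
        {"context": {"body": "The architecture operates on spacetime patches.", "id": 1, "title": None},
         "questions": {"question": "What does the architecture operate on?",
                       "question_type": None, "question_level": None}},
    ]
}
print(to_jsonl(example_dump))
```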


## What's next?
- Add validation to make sure all fields are correct
- Adjust prompts and OpenAI parameters to the task
- Add reasoning chains to improve quality and diversity of the dataset
- Ask for variations of the same question, but be sure to remove data points that are too similar
- Use natural data as seed data and if possible use it in training
- Add stronger filtering to remove low quality samples
- Scale up the data flywheel
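
As one possible starting point for removing near-duplicate questions, the sketch below keeps a question only if it is not too similar to any question already kept. It uses `difflib.SequenceMatcher` from the standard library as a simple string-similarity measure. The helper name `drop_near_duplicates` and the 0.8 threshold are assumptions that would need tuning on real data, and an embedding-based similarity would likely work better for paraphrases.

```python
from difflib import SequenceMatcher
from typing import List

def drop_near_duplicates(questions: List[str], threshold: float = 0.8) -> List[str]:
    kept: List[str] = []
    for candidate in questions:
        normalized = candidate.lower().strip()
        # Compare against every question kept so far
        is_duplicate = any(
            SequenceMatcher(None, normalized, other.lower().strip()).ratio() >= threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(candidate)
    return kept

questions = [
    "What objectives is the captioner trained with?",
    "What objectives is the captioner trained on?",
    "Why is annotating instruction data challenging?",
]
unique = drop_near_duplicates(questions)
print(len(questions) - len(unique), "near-duplicate removed")
```

The comparison is quadratic in the number of questions, which is fine per context chunk but would need an index or blocking strategy at larger scale.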