# Data generation

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/extras/use_cases/data_generation.ipynb)

## Use case

Creating synthetic language data can be beneficial for multiple reasons:
- providing data augmentation
- obtaining domain-specific examples
- increasing data diversity
- enabling quick iteration and experimentation

## Quickstart

Let's see a very straightforward example of how we can use OpenAI functions for creating synthetic data in LangChain.

In [None]:
!pip install langchain openai 

# Set env var OPENAI_API_KEY or load from a .env file:
# import dotenv
# dotenv.load_dotenv()

In [2]:
from langchain.llms import OpenAI
from langchain.chains.synthetic_data import create_data_generation_chain, DatasetGenerator

In [7]:
# LLM
model = OpenAI(model_name="gpt-3.5-turbo", temperature=0.7)
chain = create_data_generation_chain(model)



In [8]:
chain({"fields": ["blue", "yellow"], "preferences": {}})

{'fields': ['blue', 'yellow'],
 'preferences': {},
 'text': 'The vibrant blue sky served as a breathtaking backdrop for the cheerful yellow sunflowers, creating a picturesque scene that radiated warmth and happiness.'}

In [9]:
chain({"fields": {"colors": ["blue", "yellow"]}, "preferences": {"style": "Make it in a style of a weather forecast."}})

{'fields': {'colors': ['blue', 'yellow']},
 'preferences': {'style': 'Make it in a style of a weather forecast.'},
 'text': "Good morning! Today's weather forecast brings a beautiful combination of colors, where the clear blue sky will gracefully blend with the vibrant yellow hues of the sun, creating a breathtaking spectacle that is sure to brighten up your day."}

In [7]:
chain({"fields": {"actor": "Tom Hanks", "movies": ["Forrest Gump", "Green Mile"]}, "preferences": None})

{'fields': {'actor': 'Tom Hanks', 'movies': ['Forrest Gump', 'Green Mile']},
 'preferences': None,
 'text': 'Tom Hanks is an iconic actor, renowned for his performances in beloved classics such as Forrest Gump and The Green Mile.'}

In [10]:
chain(
    {
        "fields": [
            {"actor": "Tom Hanks", "movies": ["Forrest Gump", "Green Mile"]},
            {"actor": "Mads Mikkelsen", "movies": ["Hannibal", "Another round"]}
        ],
        "preferences": {"minimum_length": 200, "style": "gossip"}
    }
)

{'fields': [{'actor': 'Tom Hanks', 'movies': ['Forrest Gump', 'Green Mile']},
  {'actor': 'Mads Mikkelsen', 'movies': ['Hannibal', 'Another round']}],
 'preferences': {'minimum_length': 200, 'style': 'gossip'},
 'text': 'In a surprising turn of events, Hollywood legend Tom Hanks, renowned for his mesmerizing performances in iconic films like "Forrest Gump" and "Green Mile," shares the limelight with the enigmatic and captivating Mads Mikkelsen, whose chilling portrayal of Hannibal Lecter in the psychological thriller "Hannibal" and his recent critically acclaimed role in "Another round" have solidified his status as a powerhouse actor in the industry, leaving audiences in awe of his undeniable talent.'}

As we can see, the generated examples are diverse and contain the information we asked for. Their style also reflects the given preferences quite well.
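
The calls above all share one input shape: a dict with `fields` and `preferences`, which comes back with an added `text` key. Below is a minimal sketch of looping over several field sets, assuming only that shape. `generate_texts` and `fake_chain` are hypothetical names introduced here, not part of LangChain; the stub stands in for the real chain so the snippet runs without an API key.

```python
from typing import Any, Callable, Dict, List, Optional


def generate_texts(
    chain: Callable[[Dict[str, Any]], Dict[str, Any]],
    field_sets: List[Any],
    preferences: Optional[Dict[str, Any]] = None,
) -> List[str]:
    """Call `chain` once per field set and collect the generated texts."""
    texts = []
    for fields in field_sets:
        # Same input shape as the quickstart calls above.
        result = chain({"fields": fields, "preferences": preferences or {}})
        texts.append(result["text"])
    return texts


def fake_chain(inputs: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical stand-in for the real chain: echoes inputs, adds a 'text'."""
    return {**inputs, "text": f"Generated from {inputs['fields']}"}


texts = generate_texts(fake_chain, [["blue", "yellow"], {"actor": "Tom Hanks"}])
print(texts)
```

To use the real chain, pass `chain` in place of `fake_chain`; the helper only relies on the `text` key seen in the outputs above.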

## Generating an example dataset for extraction benchmarking

In [11]:
inp = [
    {
        'Actor': 'Tom Hanks',
        'Film': [
            'Forrest Gump',
            'Saving Private Ryan',
            'The Green Mile',
            'Toy Story',
            'Catch Me If You Can']
    },
    {
        'Actor': 'Tom Hardy',
        'Film': [
            'Inception',
            'The Dark Knight Rises',
            'Mad Max: Fury Road',
            'The Revenant',
            'Dunkirk'
        ]
    }
]

generator = DatasetGenerator(model, {"style": "informal", "minimal length": 500})
dataset = generator(inp)

In [12]:
dataset

[{'fields': {'Actor': 'Tom Hanks',
   'Film': ['Forrest Gump',
    'Saving Private Ryan',
    'The Green Mile',
    'Toy Story',
    'Catch Me If You Can']},
  'preferences': {'style': 'informal', 'minimal length': 500},
  'text': 'Tom Hanks, an incredibly talented actor known for his iconic roles in films such as "Forrest Gump," "Saving Private Ryan," "The Green Mile," "Toy Story," and "Catch Me If You Can," has captivated audiences worldwide with his exceptional range of performances. Whether he is portraying a simple-minded yet endearing character like Forrest Gump, a brave and compassionate soldier in the midst of war like Captain John Miller, a gentle and empathetic prison guard like Paul Edgecomb, a lovable cowboy toy named Woody, or a charismatic and cunning con artist like Frank Abagnale Jr., Hanks effortlessly brings each character to life and leaves a lasting impression on viewers. With his undeniable charisma, versatility, and unmatched acting skills, Tom Hanks has undoubted

## Extraction from generated examples

In [27]:
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from langchain.output_parsers import PydanticOutputParser
from langchain.chains import create_extraction_chain_pydantic
from pydantic import BaseModel, Field
from typing import List

In [26]:
class Actor(BaseModel):
    Actor: str = Field(description="name of an actor")
    Film: List[str] = Field(description="list of names of films they starred in")

### Parsers

In [15]:
parser = PydanticOutputParser(pydantic_object=Actor)

prompt = PromptTemplate(
    template="Extract fields from a given text.\n{format_instructions}\n{text}\n",
    input_variables=["text"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

_input = prompt.format_prompt(text=dataset[0]["text"])
output = model(_input.to_string())

parser.parse(output)

Actor(Actor='Tom Hanks', Film=['Forrest Gump', 'Saving Private Ryan', 'The Green Mile', 'Toy Story', 'Catch Me If You Can'])

In [16]:
_input = prompt.format_prompt(text=dataset[1]["text"])
output = model(_input.to_string())

parser.parse(output)

Actor(Actor='Tom Hardy', Film=['Inception', 'The Dark Knight Rises', 'Mad Max: Fury Road', 'The Revenant', 'Dunkirk'])

### Extractors

In [24]:
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo")

extractor = create_extraction_chain_pydantic(pydantic_schema=Actor, llm=llm)
extractor.run(dataset[0]["text"])

[Actor(Actor='Tom Hanks', Film=['Forrest Gump', 'Saving Private Ryan', 'The Green Mile', 'Toy Story', 'Catch Me If You Can'])]

In [29]:
extractor.run(dataset[1]["text"])

[Actor(Actor='Tom Hardy', Film=['Inception', 'The Dark Knight Rises', 'Mad Max: Fury Road', 'The Revenant', 'Dunkirk'])]
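
Because each generated text comes from known `fields`, the extracted records can be scored against that ground truth. The following is a minimal sketch of one possible scoring scheme (exact actor match plus precision and recall over film titles). `film_scores` is a hypothetical helper, and the `extracted` dict is made-up illustration data rather than real model output. It assumes the extracted `Actor` object has been converted to a plain dict first, for example with pydantic's `.dict()`.

```python
from typing import Any, Dict


def film_scores(expected: Dict[str, Any], extracted: Dict[str, Any]) -> Dict[str, float]:
    """Compare a ground-truth record with an extracted one."""
    want = set(expected["Film"])
    got = set(extracted["Film"])
    hits = len(want & got)
    return {
        # 1.0 if the actor name matches exactly, else 0.0.
        "actor_match": float(expected["Actor"] == extracted["Actor"]),
        # Share of extracted films that are correct.
        "precision": hits / len(got) if got else 0.0,
        # Share of expected films that were found.
        "recall": hits / len(want) if want else 0.0,
    }


expected = {"Actor": "Tom Hardy", "Film": ["Inception", "Dunkirk", "The Revenant"]}
# Illustrative extraction result with one film missing (not real output).
extracted = {"Actor": "Tom Hardy", "Film": ["Inception", "Dunkirk"]}

scores = film_scores(expected, extracted)
print(scores)
```

Exact string matching is strict: a title returned as "Green Mile" would not match "The Green Mile", so some normalization may be needed before scoring.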