# Data generation

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/extras/use_cases/data_generation.ipynb)

## Use case

Creating synthetic language data can be beneficial for several reasons:
- providing data augmentation
- obtaining domain-specific examples
- increasing data diversity
- enabling quick iteration and experimentation

## Quickstart

Let's look at a simple example of how we can use an OpenAI model to create synthetic data in LangChain.

In [None]:
!pip install langchain openai 

# Set env var OPENAI_API_KEY or load from a .env file:
# import dotenv
# dotenv.load_dotenv()

In [1]:
from langchain.llms import OpenAI
from langchain.chains.synthetic_data import create_data_generation_chain, DatasetGenerator

In [4]:
# LLM
model = OpenAI(temperature=0.7)
chain = create_data_generation_chain(model)

In [5]:
chain({"fields": ["blue", "yellow"], "preferences": {}})

{'fields': ['blue', 'yellow'],
 'preferences': {},
 'text': 'The bright blue and cheery yellow contrasted against each other, creating a vibrant display of color.'}

In [6]:
chain({"fields": {"colors": ["blue", "yellow"]}, "preferences": {"style": "Make it in a style of a weather forecast."}})

{'fields': {'colors': ['blue', 'yellow']},
 'preferences': {'style': 'Make it in a style of a weather forecast.'},
 'text': "Today's forecast is bright and cheerful, with a mix of blue and yellow, making it a perfect day to get outside and enjoy the sun."}

In [7]:
chain({"fields": {"actor": "Tom Hanks", "movies": ["Forrest Gump", "Green Mile"]}, "preferences": None})

{'fields': {'actor': 'Tom Hanks', 'movies': ['Forrest Gump', 'Green Mile']},
 'preferences': None,
 'text': 'Tom Hanks is an iconic actor, renowned for his performances in beloved classics such as Forrest Gump and The Green Mile.'}

In [8]:
chain(
    {
        "fields": [
            {"actor": "Tom Hanks", "movies": ["Forrest Gump", "Green Mile"]},
            {"actor": "Mads Mikkelsen", "movies": ["Hannibal", "Another round"]}
        ],
        "preferences": {"minimum_length": 200, "style": "gossip"}
    }
)

{'fields': [{'actor': 'Tom Hanks', 'movies': ['Forrest Gump', 'Green Mile']},
  {'actor': 'Mads Mikkelsen', 'movies': ['Hannibal', 'Another round']}],
 'preferences': {'minimum_length': 200, 'style': 'gossip'},
 'text': '\nThe gossip around town is that Tom Hanks was a real hit in his roles as Forrest Gump and The Green Mile, while Mads Mikkelsen left audiences stunned with his impressive performances in Hannibal and Another Round.'}

As we can see, the generated examples are diverse and contain the information we asked for. Their style also reflects the given preferences quite well.
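
The claim that the generated text contains the requested information can also be checked programmatically. The following is a minimal sketch, not part of LangChain: the helper names `iter_values` and `missing_values` are hypothetical, and the check is a plain case-insensitive substring match, which is an assumption about what "contains the field" should mean. It is run here on the last result shown above.

```python
from typing import Any, Iterator, List


def iter_values(fields: Any) -> Iterator[str]:
    """Yield every leaf value of a nested dict/list structure as a string."""
    if isinstance(fields, dict):
        for value in fields.values():
            yield from iter_values(value)
    elif isinstance(fields, (list, tuple)):
        for item in fields:
            yield from iter_values(item)
    else:
        yield str(fields)


def missing_values(fields: Any, text: str) -> List[str]:
    """Return the field values that do not occur in text (case-insensitive)."""
    lowered = text.lower()
    return [value for value in iter_values(fields) if value.lower() not in lowered]


# The last chain result from above, copied verbatim.
result = {
    "fields": [
        {"actor": "Tom Hanks", "movies": ["Forrest Gump", "Green Mile"]},
        {"actor": "Mads Mikkelsen", "movies": ["Hannibal", "Another round"]},
    ],
    "text": "\nThe gossip around town is that Tom Hanks was a real hit in his roles as Forrest Gump and The Green Mile, while Mads Mikkelsen left audiences stunned with his impressive performances in Hannibal and Another Round.",
}

print(missing_values(result["fields"], result["text"]))  # prints []
```

A substring match is deliberately lenient: it accepts "The Green Mile" for the input "Green Mile", but it would miss paraphrases or abbreviations, so an empty list is a sanity check rather than a guarantee.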

## Generating an example dataset for extraction benchmarking

In [13]:
inp = [
    {
        'Actor': 'Tom Hanks',
        'Film': [
            'Forrest Gump',
            'Saving Private Ryan',
            'The Green Mile',
            'Toy Story',
            'Catch Me If You Can']
    },
    {
        'Actor': 'Tom Hardy',
        'Film': [
            'Inception',
            'The Dark Knight Rises',
            'Mad Max: Fury Road',
            'The Revenant',
            'Dunkirk'
        ]
    }
]

generator = DatasetGenerator(model, {"style": "informal", "minimal length": 500})
dataset = generator(inp)
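
Judging only by the output that follows this cell, the generator appears to return one record per input and to attach the same preferences to each of them. The sketch below illustrates that pattern under this assumption; it is not the library's implementation. `generate_dataset` and `echo_chain` are hypothetical names, and `echo_chain` is a stand-in that makes no LLM call.

```python
from typing import Any, Callable, Dict, List


def generate_dataset(
    chain: Callable[[Dict[str, Any]], Dict[str, Any]],
    inputs: List[Any],
    preferences: Dict[str, Any],
) -> List[Dict[str, Any]]:
    """Call chain once per input, passing the same preferences every time."""
    return [chain({"fields": fields, "preferences": preferences}) for fields in inputs]


def echo_chain(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical stand-in for a real chain: returns placeholder text, no LLM call."""
    return {**payload, "text": f"<generated text about {payload['fields']['Actor']}>"}


records = generate_dataset(
    echo_chain,
    [
        {"Actor": "Tom Hanks", "Film": ["Toy Story"]},
        {"Actor": "Tom Hardy", "Film": ["Dunkirk"]},
    ],
    {"style": "informal"},
)

print(len(records))        # prints 2
print(records[1]["text"])  # prints <generated text about Tom Hardy>
```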

In [14]:
dataset

[{'fields': {'Actor': 'Tom Hanks',
   'Film': ['Forrest Gump',
    'Saving Private Ryan',
    'The Green Mile',
    'Toy Story',
    'Catch Me If You Can']},
  'preferences': {'style': 'informal', 'minimal length': 500},
  'text': 'Tom Hanks is an incredible actor, having starred in some of the most iconic films of all time, such as Forrest Gump, Saving Private Ryan, The Green Mile, Toy Story, and Catch Me If You Can. His likable personality and natural acting skills have made him a household name and a much beloved actor for generations. Tom Hanks is an actor that will be remembered for years to come!'},
 {'fields': {'Actor': 'Tom Hardy',
   'Film': ['Inception',
    'The Dark Knight Rises',
    'Mad Max: Fury Road',
    'The Revenant',
    'Dunkirk']},
  'preferences': {'style': 'informal', 'minimal length': 500},
  'text': "Tom Hardy is an incredible actor, having appeared in a range of amazing films, such as Inception, The Dark Knight Rises, Mad Max: Fury Road, The Revenant and Dun

## Extraction from generated examples

In [45]:
from langchain.prompts import PromptTemplate
from langchain.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field
from typing import List

In [46]:
class Actor(BaseModel):
    Actor: str = Field(description="name of an actor")
    Film: List[str] = Field(description="list of names of films they starred in")
        
parser = PydanticOutputParser(pydantic_object=Actor)

prompt = PromptTemplate(
    template="Extract fields from a given text.\n{format_instructions}\n{text}\n",
    input_variables=["text"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

In [47]:
_input = prompt.format_prompt(text=dataset[0]["text"])
output = model(_input.to_string())

parser.parse(output)

Actor(Actor='Tom Hanks', Film=['Forrest Gump', 'Saving Private Ryan', 'The Green Mile', 'Toy Story', 'Catch Me If You Can'])

In [48]:
_input = prompt.format_prompt(text=dataset[1]["text"])
output = model(_input.to_string())

parser.parse(output)

Actor(Actor='Tom Hardy', Film=['Inception', 'The Dark Knight Rises', 'Mad Max: Fury Road', 'The Revenant', 'Dunkirk'])
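
Since the dataset was generated from known inputs, each extraction can be scored against the fields it was generated from. The following is a minimal sketch of one way to do that, assuming both sides are plain dicts with `Actor` and `Film` keys (a parsed `Actor` object would first need converting to a dict, for example with pydantic's `.dict()`). The function names and the choice of case-insensitive exact title matching are assumptions, and the `predicted` value is made up to show an imperfect extraction.

```python
from typing import Any, Dict, List


def film_scores(expected: List[str], predicted: List[str]) -> Dict[str, float]:
    """Precision and recall of predicted film titles against the expected ones."""
    expected_set = {title.lower() for title in expected}
    predicted_set = {title.lower() for title in predicted}
    hits = len(expected_set & predicted_set)
    precision = hits / len(predicted_set) if predicted_set else 0.0
    recall = hits / len(expected_set) if expected_set else 0.0
    return {"precision": precision, "recall": recall}


def score_record(expected: Dict[str, Any], predicted: Dict[str, Any]) -> Dict[str, Any]:
    """Score one extraction: actor name match plus film precision and recall."""
    scores: Dict[str, Any] = dict(film_scores(expected["Film"], predicted["Film"]))
    scores["actor_match"] = expected["Actor"].lower() == predicted["Actor"].lower()
    return scores


expected = {
    "Actor": "Tom Hardy",
    "Film": [
        "Inception",
        "The Dark Knight Rises",
        "Mad Max: Fury Road",
        "The Revenant",
        "Dunkirk",
    ],
}
# Made-up extraction result with two correct titles and one wrong one.
predicted = {"Actor": "Tom Hardy", "Film": ["Inception", "Dunkirk", "Warrior"]}

print(score_record(expected, predicted))
```

In this made-up case two of the three predicted titles are correct and two of the five expected titles are found, so precision is 2/3 and recall is 2/5. Averaging these scores over the whole generated dataset gives a simple benchmark figure for the extraction prompt.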