# Data generation

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/langchain-ai/langchain/blob/master/docs/extras/use_cases/data_generation.ipynb)

## Use case

Creating synthetic language data can be beneficial for multiple reasons:
- providing data augmentation
- obtaining domain-specific examples
- increasing data diversity
- enabling quick iteration and experimentation

## Quickstart

Let's look at a straightforward example of how we can use an OpenAI model to create synthetic data in LangChain.

In [None]:
!pip install langchain openai 

# Set env var OPENAI_API_KEY or load from a .env file:
# import dotenv
# dotenv.load_dotenv()

In [2]:
from langchain.llms import OpenAI
from langchain.chains.synthetic_data import create_data_generation_chain, DatasetGenerator

In [5]:
# LLM
model = OpenAI(temperature=0.7)
chain = create_data_generation_chain(model)

In [6]:
chain(["blue", "yellow"])

{'fields': ['blue', 'yellow'],
 'text': 'The beautiful sky was a mix of blue and yellow, creating an enchanting sunrise that filled the air with a sense of serenity.'}

In [7]:
chain([{"colors": ["blue", "yellow"]}])

{'fields': [{'colors': ['blue', 'yellow']}],
 'text': 'The vibrant colors of blue and yellow danced together in the field, creating a captivating display of beauty and energy.'}

In [8]:
chain([{"actor": "Tom Hanks", "movies": ["Forrest Gump", "Green Mile"]}])

{'fields': [{'actor': 'Tom Hanks', 'movies': ['Forrest Gump', 'Green Mile']}],
 'text': 'Tom Hanks has starred in some of the most beloved and iconic movies of all time, including Forrest Gump and The Green Mile. His portrayal of the characters in both films have left an indelible mark on audiences worldwide, and he has been recognized for his talent with numerous awards and accolades.'}

In [9]:
chain([{"actor": "Tom Hanks", "movies": ["Forrest Gump", "Green Mile"]}, {"actor": "Mads Mikkelsen", "movies": ["Hannibal", "Another round"]}])

{'fields': [{'actor': 'Tom Hanks', 'movies': ['Forrest Gump', 'Green Mile']},
  {'actor': 'Mads Mikkelsen', 'movies': ['Hannibal', 'Another round']}],
 'text': 'Tom Hanks and Mads Mikkelsen are two of the most acclaimed actors in recent decades, having starred in films such as Forrest Gump, Green Mile, Hannibal, and Another Round.'}

As we can see, the generated examples are diverse and contain the information we wanted them to have.
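Rather than checking by eye, the claim that each text contains the requested information can be verified programmatically. The following is a minimal sketch, not part of LangChain: the helpers `iter_values` and `missing_values` are hypothetical names introduced here, and the check assumes that a plain case-insensitive substring match is good enough. The sample text is the first two sentences' worth of the Tom Hanks output above, shortened to its first sentence.

```python
from typing import Any, Iterator, List


def iter_values(fields: Any) -> Iterator[str]:
    """Yield every leaf string from a nested list/dict structure of fields.

    Dictionary keys (e.g. "actor") are skipped; only values are yielded.
    """
    if isinstance(fields, str):
        yield fields
    elif isinstance(fields, dict):
        for value in fields.values():
            yield from iter_values(value)
    elif isinstance(fields, (list, tuple)):
        for item in fields:
            yield from iter_values(item)


def missing_values(example: dict) -> List[str]:
    """Return field values that do not occur (case-insensitively) in the text."""
    text = example["text"].lower()
    return [v for v in iter_values(example["fields"]) if v.lower() not in text]


# Same shape as the dictionaries returned by the chain in the cells above.
example = {
    "fields": [{"actor": "Tom Hanks", "movies": ["Forrest Gump", "Green Mile"]}],
    "text": (
        "Tom Hanks has starred in some of the most beloved and iconic movies "
        "of all time, including Forrest Gump and The Green Mile."
    ),
}

print(missing_values(example))  # prints []
```

An empty list means every requested value was found. A substring match will miss paraphrases or reformatted names, so a non-empty result is a prompt for manual review rather than proof of a bad example.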

## Generating an example dataset for extraction benchmarking

In [10]:
inp = [
    [
        {
            'Actor': 'Tom Hanks',
            'Film': [
                'Forrest Gump',
                'Saving Private Ryan',
                'The Green Mile',
                'Toy Story',
                'Catch Me If You Can']
        }
    ],
    [
        {
            'Actor': 'Tom Hardy',
            'Film': [
                'Inception',
                'The Dark Knight Rises',
                'Mad Max: Fury Road',
                'The Revenant',
                'Dunkirk'
            ]
        }
    ]
]

generator = DatasetGenerator(model)
dataset = generator(inp)

In [11]:
dataset

[{'fields': [{'Actor': 'Tom Hanks',
    'Film': ['Forrest Gump',
     'Saving Private Ryan',
     'The Green Mile',
     'Toy Story',
     'Catch Me If You Can']}],
  'text': 'Tom Hanks has had a long and storied career in Hollywood, highlighted by iconic roles in films such as Forrest Gump, Saving Private Ryan, The Green Mile, Toy Story, and Catch Me If You Can.'},
 {'fields': [{'Actor': 'Tom Hardy',
    'Film': ['Inception',
     'The Dark Knight Rises',
     'Mad Max: Fury Road',
     'The Revenant',
     'Dunkirk']}],
  'text': 'Tom Hardy is an incredibly versatile actor, having starred in films like Inception, The Dark Knight Rises, Mad Max: Fury Road, The Revenant, and Dunkirk, showing his extraordinary range and talent.'}]
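To use such a dataset for extraction benchmarking, each `text` can serve as the input and the corresponding `fields` as the expected output. The sketch below shows one possible way to score an extractor by recall over film titles. It is an illustration under stated assumptions, not a LangChain API: `expected_films`, `recall`, and `naive_extractor` are hypothetical helpers, and `naive_extractor` is only a placeholder for the real extraction system being benchmarked. The `dataset` literal repeats the output shown above.

```python
from typing import Callable, Set

# Copy of the generated dataset shown above.
dataset = [
    {
        "fields": [{"Actor": "Tom Hanks",
                    "Film": ["Forrest Gump", "Saving Private Ryan",
                             "The Green Mile", "Toy Story", "Catch Me If You Can"]}],
        "text": "Tom Hanks has had a long and storied career in Hollywood, "
                "highlighted by iconic roles in films such as Forrest Gump, "
                "Saving Private Ryan, The Green Mile, Toy Story, and Catch Me If You Can.",
    },
    {
        "fields": [{"Actor": "Tom Hardy",
                    "Film": ["Inception", "The Dark Knight Rises",
                             "Mad Max: Fury Road", "The Revenant", "Dunkirk"]}],
        "text": "Tom Hardy is an incredibly versatile actor, having starred in films "
                "like Inception, The Dark Knight Rises, Mad Max: Fury Road, "
                "The Revenant, and Dunkirk, showing his extraordinary range and talent.",
    },
]


def expected_films(example: dict) -> Set[str]:
    """Collect the gold 'Film' values for one generated example."""
    films: Set[str] = set()
    for record in example["fields"]:
        films.update(record["Film"])
    return films


def recall(example: dict, extract: Callable[[str], Set[str]]) -> float:
    """Fraction of gold film titles that the extractor recovered from the text."""
    gold = expected_films(example)
    if not gold:
        return 0.0
    return len(gold & extract(example["text"])) / len(gold)


# Placeholder extractor (assumption): looks up a fixed vocabulary of known titles.
# Replace this with the extraction system you actually want to benchmark.
vocabulary = set().union(*(expected_films(ex) for ex in dataset))


def naive_extractor(text: str) -> Set[str]:
    return {title for title in vocabulary if title in text}


scores = [recall(ex, naive_extractor) for ex in dataset]
print(scores)  # prints [1.0, 1.0]
```

The placeholder scores perfectly here only because it already knows every title and both texts quote them verbatim. A real extractor would receive just the text, which is what makes the comparison against `fields` informative.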