# Creating Data with Generative AI "Understanding Data"
## Transformer architecture, Examples, Prompt Engineering
### - Divyesh S. Rajput


***

Upon initial inspection, generative artificial intelligence (AI) may seem almost mystical. Yet, a closer examination reveals it as a statistically driven process, exhibiting both remarkable capabilities and notable limitations.

**Generative AI is a subclass of artificial intelligence that generates new content by extrapolating from patterns learned in existing data. A familiar example is the predictive search feature in Google Search, which relies on a language model trained on large volumes of user queries to anticipate the next words in a user's search.**

This predictive search, however, pales in comparison to the recent advancements in generative AI. Modern applications of this technology extend to the creation of new television episode scripts, academic articles, the synthesis of images from textual descriptions, and even the composition of music resembling the work of renowned artists.

Notwithstanding its celebrated advances, generative AI harbors potential pitfalls. For instance, chatbots utilizing generative AI can generate erroneous or harmful responses. The creation of deepfake videos, particularly of political figures or celebrities, poses risks of spreading misinformation. Additionally, these models can perpetuate and amplify existing human biases.

The implications of generative AI are far-reaching, affecting labor, industry, governance, and the very essence of human existence. To coexist harmoniously with this technology, a comprehensive understanding of its mechanics and inherent risks is crucial.

In the realm of machine learning models, which are a subset of AI, algorithms learn patterns and associations from data to execute specific tasks. This process involves several key steps:

- **Task Definition**: The initial step is to specify the task for the model, such as email classification, revenue prediction, customer segmentation, product recommendation, or image creation from text prompts.
- **Model Selection**: This involves choosing an appropriate model based on the defined task, characteristics and volume of available data, and the model's intended real-world application and users.
- **Data Collection and Preparation**: The next phase involves gathering relevant data and refining it by removing outliers and corrupt data, and organizing it in a structured format.
- **Data Splitting (Optional)**: Typically, a portion of the data (often 80%) is designated as training data to build the model’s knowledge base. The remaining data (usually 20%) serves as validation data, used to assess the model's learning efficacy against new information.
- **Model Training**: In this phase, the model learns from the training data by analyzing the relationships within it. Hyperparameters, such as the learning rate, can be adjusted to guide the learning process.
- **Model Evaluation**: Before deployment, the model’s performance is evaluated using metrics such as accuracy, precision, recall, and F1 score. This evaluation often includes testing the model on the validation set to gauge its accuracy.
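
The splitting and evaluation steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production workflow: the helper names `train_validation_split` and `classification_metrics`, the toy records, and the label lists are all hypothetical.

```python
import random

def train_validation_split(records, train_fraction=0.8, seed=0):
    """Shuffle a copy of the records and split it into training and validation parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical data: 10 records split 80/20
records = list(range(10))
train, validation = train_validation_split(records)
print(len(train), len(validation))

# Hypothetical true labels and model predictions on a validation set
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In practice a library such as scikit-learn provides equivalent utilities; the hand-written versions here only make the arithmetic behind each metric visible.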


Understanding these elements is vital in navigating the complexities and harnessing the potential of generative AI.


## Generative AI: Transforming the Real World

Generative AI is shifting machine learning from modeling existing data to synthesizing entirely new content. The introduction of the Variational Autoencoder (VAE) in 2013 and the Generative Adversarial Network (GAN) in 2014 marked a pivotal moment in image synthesis. The Transformer architecture followed in 2017, and the first Generative Pre-trained Transformer (GPT) model, released in 2018, opened the way to language generation that later models have made increasingly hard to distinguish from human writing.

Major tech giants, such as Meta and Google, are in a perpetual race, each vying to outdo the other with their advanced language models. Concurrently, the startup sphere is witnessing a surge in AI, particularly in generative models. These startups have attracted over $17 billion in funding, as tracked by Dealroom's dynamic visualization of the sector's financial landscape.

## How Companies Are Harnessing Generative AI

### Generative Text: Crafting Words with AI
- **AI Chatbots**: Take OpenAI's ChatGPT, a leading-edge personal assistant bot capable of tasks ranging from text summarization to content generation. Companies are customizing ChatGPT’s foundational model, fine-tuning it for specialized tasks.
- **Content Generation**: Jasper AI stands out with its AI platform, offering bespoke content solutions that align with a company’s branding, and facilitating integration into various platforms and business products.
- **Language Correction**: Grammarly's suite of applications provides integrated writing assistance across a multitude of apps and websites. Their plugins seamlessly fit into platforms like Microsoft Office, Google Docs, and Gmail, ensuring error-free writing.

The journey of Generative AI is more than just a technological feat; it's a creative and strategic revolution, redefining how we perceive and interact with machine intelligence in the real world.


## Generative Models

Generative models learn the probability distribution of the data itself by capturing the inherent structure of the input. Discriminative models, by contrast, learn only how to separate labeled outcomes. This gives generative models two distinct abilities:

1. **Probability Prediction**: Just like discriminative models, generative models can predict probabilities for data, but they employ a different methodology, focusing on learning from the data's structure itself.

2. **Data Generation**: A standout feature of generative models is their ability to create new data. They achieve this by extensively learning from the training data, enabling them to generate new data points that closely resemble the original data they were trained on.

Consider the task of analyzing Amazon product reviews to determine whether they are positive or negative. Each review, with its unique style, tone, and combination of words and phrases, is a distinct data point. A generative model comprehensively studies these reviews, capturing the nuances in language, patterns, structure, and the intricate relationships between different reviews. This process allows the model to not only understand the sentiment behind each review but also to generate new, realistic review texts that are similar to those it has been trained on.


![image.png](attachment:image.png)

Given the vast array of words and patterns present in all the reviews, a generative model dedicates itself to calculating the probability of these words and patterns appearing in either a positive or a negative review. This calculated probability is known as the joint probability, which refers to the likelihood of a specific combination of features occurring together in the dataset. 

For example, the model might identify that words like “broken” and “disappointed” frequently co-occur in negative reviews. Conversely, phrases such as “highly recommend” and “very satisfied” are often found in positive reviews. This kind of learning allows the model to understand the sentiment expressed in different combinations of words.

The true prowess of the model is showcased when it applies this accumulated knowledge to new, unseen reviews. Here, it assesses the likelihood of these reviews being positive or negative. What distinguishes generative models from their discriminative counterparts is their dual capability. Not only can they perform classification tasks - like sorting reviews into positive or negative categories - but they are also adept at generating entirely new data when prompted. This implies that our Amazon review model is not limited to categorizing existing reviews; it is also capable of creating new, realistic reviews by itself.
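
To make the dual capability concrete, here is a minimal sketch that uses a naive Bayes style word-count model as a stand-in for a generative model. The four toy reviews, the add-one smoothing, and the function names `log_joint`, `classify`, and `generate` are assumptions made for illustration; a real system would be trained on far more data and would model word order as well.

```python
import math
import random
from collections import Counter, defaultdict

# Hypothetical toy reviews
reviews = [
    ("arrived broken and i am disappointed", "negative"),
    ("broken on arrival very disappointed", "negative"),
    ("highly recommend very satisfied", "positive"),
    ("very satisfied would highly recommend", "positive"),
]

label_counts = Counter()
word_counts = defaultdict(Counter)
for text, label in reviews:
    label_counts[label] += 1
    word_counts[label].update(text.split())

vocabulary = {word for counts in word_counts.values() for word in counts}

def log_joint(text, label):
    """log P(label) + sum of log P(word | label), with add-one smoothing."""
    total = sum(word_counts[label].values())
    score = math.log(label_counts[label] / sum(label_counts.values()))
    for word in text.split():
        score += math.log((word_counts[label][word] + 1) / (total + len(vocabulary)))
    return score

def classify(text):
    """Pick the label whose joint probability with the text is highest."""
    return max(label_counts, key=lambda label: log_joint(text, label))

def generate(label, length=5, seed=0):
    """Sample words from P(word | label) to produce a new review-like string."""
    rng = random.Random(seed)
    words = list(word_counts[label])
    weights = [word_counts[label][w] for w in words]
    return " ".join(rng.choices(words, weights=weights, k=length))

print(classify("the box was broken"))
print(classify("would recommend"))
print(generate("negative"))
```

The same learned counts serve both purposes: `classify` compares joint probabilities across labels, while `generate` samples from the word distribution of a single label.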


## How generative AI contributes to solving data-related problems
Generative AI offers innovative solutions to a myriad of data-related challenges. Its unique capabilities contribute significantly to overcoming issues such as data scarcity, imbalance, privacy concerns, and model generalization. Here's a breakdown of how generative AI makes pivotal contributions across various data-related problems:

1. Data Augmentation for Improved Model Generalization
2. Addressing Imbalanced Datasets
3. Privacy-Preserving Data Generation
4. Data Scarcity Mitigation
5. Domain Adaptation and Transfer Learning
6. Anomaly Detection Improvement
7. Data Simulation for Forecasting
8. Enhancing Natural Language Processing (NLP) Tasks

Selecting a Generative AI technique to focus on depends on a few factors, such as your interest, the complexity you're comfortable with, and the kind of data you want to generate. Here are a few popular options, each with its unique features and applications:

**Generative Adversarial Networks (GANs):**

**What They Are**: GANs consist of two neural networks, the generator and the discriminator, which are trained simultaneously. The generator creates data, while the discriminator tries to tell generated data apart from real data.

**Use Cases**: Ideal for generating realistic images, art, or enhancing image resolution.

**Complexity**: Fairly complex, because training must keep the two networks in balance.


**Variational Autoencoders (VAEs):**

**What They Are**: VAEs compress data into a lower-dimensional probabilistic latent representation and then reconstruct it. New data is generated by sampling from that latent space and decoding the sample.

**Use Cases**: Good for image generation and modification, as well as anomaly detection.

**Complexity**: Moderate, with a strong basis in probability theory.


**Autoregressive Models (like GPT for text):**

**What They Are**: These models predict the next element in a sequence, using the previous elements.

**Use Cases**: Primarily used for text generation but also applicable to time-series data.

**Complexity**: Can range from simple to highly complex.
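
At the simple end of that range, an autoregressive model can be as small as a table of bigram counts. The sketch below assumes a hypothetical toy corpus and greedy decoding (always taking the most frequent next token); models like GPT replace the count table with a neural network but keep the same generate-one-token-at-a-time loop.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus
corpus = "the cat sat on the mat . the cat ate . the dog sat on the rug ."
tokens = corpus.split()

# Count how often each token follows each previous token
next_counts = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    next_counts[prev][nxt] += 1

def generate(start, max_tokens=6):
    """Greedy autoregressive generation: repeatedly append the most likely next token."""
    sequence = [start]
    for _ in range(max_tokens):
        candidates = next_counts.get(sequence[-1])
        if not candidates:
            break
        sequence.append(candidates.most_common(1)[0][0])
        if sequence[-1] == ".":
            break
    return sequence

print(generate("the"))
```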


**Transformer Models (like BERT, GPT):**

**What They Are**: An advanced class of sequence models built around self-attention mechanisms.

**Use Cases**: Extremely versatile, used for text generation, translation, summarization, and more.

**Complexity**: High, often requiring significant computational resources.


**Deep Belief Networks (DBNs):**

**What They Are**: A type of deep neural network with multiple layers of stochastic, latent variables.

**Use Cases**: Useful for feature extraction, classification, regression, and also generation of images and video sequences.

**Complexity**: Complex, with a foundation in unsupervised learning.

## Introduction to Transformer Models

### Historical Context

#### Evolution from RNNs and LSTMs

In the landscape of natural language processing (NLP), earlier models like Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks played a pivotal role. These models were designed to process sequences of data, such as text, by passing the representation of one word to the next, creating a form of memory about previous words. However, this sequential processing presented significant challenges:

- **Difficulty in Long-Range Dependencies:** RNNs and LSTMs struggled with long-range dependencies in text. When sequences got too long, these models often failed to maintain the context, losing relevant information from earlier text.
  
- **Sequential Processing Limitations:** The inherent sequential nature of these models meant they couldn't process words in parallel, leading to slower training and inference times, particularly for lengthy texts.


### Definition and Significance

#### Introduction of Transformer Models

The introduction of transformer models, as detailed in the 2017 paper "Attention Is All You Need" by Vaswani et al., was a game-changer in the field of NLP. These models changed how text is processed by employing a mechanism known as 'attention', which, unlike previous methods, allowed for parallel processing of entire sequences. This was a radical shift from the word-by-word processing of RNNs and LSTMs.

#### Why They Matter

- **Parallel Processing:** By processing entire sequences at once, transformers can handle vast amounts of text far more efficiently than sequential models. This leads to faster training times and the ability to handle longer sequences more effectively.

- **Better Contextual Understanding:** The self-attention mechanism in transformers enables the model to weigh and incorporate context from any part of the input sequence, not just the immediate surroundings. This allows for a more nuanced and comprehensive understanding of the text, especially beneficial for tasks requiring a deep understanding of context, like summarization or question answering.

- **Scalability and Flexibility:** Transformers are highly scalable, capable of handling tasks ranging from simple sentence classification to complex document summarization. Their architecture is versatile, allowing for adaptations like the addition of more layers to increase complexity and capacity for larger datasets.
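
The self-attention mechanism mentioned above can be sketched as scaled dot-product attention, softmax(QK^T / sqrt(d_k))V, in plain Python. This is a simplified illustration under stated assumptions: the three toy vectors are hypothetical, and the same vectors are used as queries, keys, and values, whereas a real Transformer first applies learned projections and uses several attention heads.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over a whole sequence."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        all_weights.append(weights)
        # Output is the attention-weighted average of the value vectors
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs, all_weights

# Hypothetical 3-token sequence with 2-dimensional vectors
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each query's computation is independent of the others, which is what allows an optimized implementation to process all positions in parallel as matrix operations.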

In summary, the introduction of transformer models marked a significant advancement in NLP, overcoming many limitations of earlier models and opening up new possibilities in text processing and generation.


## Introduction to Data Generation Using Generative AI

The advent of generative AI has ushered in a new era in the field of data science and artificial intelligence. This introduction will provide an overview of the data generation process using generative AI, exploring its context, significance, and fundamental principles.

### The Context of Generative AI in Data Generation

Generative AI refers to the subset of AI techniques focused on generating new data that mimics real data. It contrasts with discriminative AI, which is more about understanding and categorizing data. The context of generative AI in data generation is anchored in the need to address various challenges such as data scarcity, privacy concerns, and the desire to improve AI models’ performance and versatility.

### Significance of Generative AI in Data Science

1. **Augmenting Data**: Generative AI plays a pivotal role in augmenting datasets, especially in scenarios where acquiring real-world data is challenging, costly, or constrained by privacy issues.

2. **Improving Model Robustness**: By generating diverse datasets, these techniques allow for the training of more robust machine learning models that are less prone to overfitting and better generalize to new data.

3. **Enabling Innovation**: Generative AI is at the forefront of innovation, facilitating breakthroughs in fields ranging from drug discovery to artistic creation.

### Principles Behind Generative AI for Data Generation

1. **Learning Data Distributions**: At its core, generative AI involves learning the distribution of the training data and then using this learned information to generate new data points.

2. **Types of Models**: Key generative models include Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and Transformer models. Each model has a unique approach to learning and generating data.

3. **Transformer Models in Data Generation**: Transformer models, in particular, leverage self-attention mechanisms to efficiently process sequences of data, making them highly effective for tasks such as text generation. They are known for their ability to handle long-range dependencies and generate highly coherent and contextually relevant data.

4. **Balance of Realism and Novelty**: An essential principle in data generation is the balance between mimicking the training data (realism) and introducing novel elements (creativity), ensuring that the generated data is both realistic and diverse.

5. **Ethical Considerations**: It's crucial to acknowledge the ethical implications of data generation, including concerns about data authenticity, privacy, and potential biases in the generated data.
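
The first principle, learning a distribution and then sampling from it, can be shown with the simplest possible generative model. The sketch below assumes the data follows a normal distribution and uses hypothetical measurements; GANs, VAEs, and Transformers apply the same idea to far richer distributions.

```python
import random
import statistics

# Hypothetical "real" measurements we want to imitate
real_data = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0]

# "Training": estimate the parameters of an assumed normal distribution
mu = statistics.mean(real_data)
sigma = statistics.stdev(real_data)

# "Generation": draw new points from the learned distribution
rng = random.Random(0)
synthetic_data = [rng.gauss(mu, sigma) for _ in range(1000)]

print(round(mu, 3), round(sigma, 3))
print(round(statistics.mean(synthetic_data), 3))
```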

This introduction sets the stage for understanding the transformative role of generative AI in data generation, highlighting its significance and the principles that guide its application.



Let’s get familiar with the high-level architecture of the GPT transformer:
![image.png](attachment:image.png)

This diagram depicts the conventional flow of data in a Transformer model, where information moves from the bottom to the top. Initially, the input tokens undergo encoding through a series of steps. They are first processed by an Embedding layer, followed by a Positional Encoding layer. These two encodings are then combined, typically by element-wise addition. Following this, the encoded inputs undergo a sequence of N decoding steps, followed by normalization. Finally, the decoded data passes through a linear layer and a softmax function, resulting in a probability distribution used to select the next token.

In the subsequent sections, we will delve into each component of this architecture.

#### Embedding Layer
The Embedding layer transforms each token in the input sequence into a vector of length d_model. The input to the Transformer comprises batches of token sequences with a shape of (batch_size, seq_len). For each token, which is represented by a single number, the Embedding layer computes its embedding, resulting in a sequence of numbers with a length of d_model. Consequently, a tensor containing each embedding replaces the corresponding original token.

The rationale behind using an embedding instead of the original token is to give tokens with similar semantics similar vector representations. Take, for instance, the words "she" and "her". While these words convey similar meanings, referring to a female subject, their corresponding tokens may differ significantly. For instance, using OpenAI's tiktoken tokenizer, "she" corresponds to token 7091, while "her" corresponds to token 372. Initially, the embeddings for these tokens will differ substantially as well, given that the embedding layer's weights are initialized randomly and refined during training. However, as the tokens appear in similar contexts in the training data, their embeddings will tend to move closer together, reflecting their semantic similarity.
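
The shape change described above, from (batch_size, seq_len) to (batch_size, seq_len, d_model), can be seen in a small lookup-table sketch. The sizes, token ids, and helper names `make_embedding_table` and `embed` are hypothetical; in a deep learning framework the table would be a trainable weight matrix.

```python
import random

def make_embedding_table(vocab_size, d_model, seed=0):
    """Randomly initialised table with one d_model-length vector per token id."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(vocab_size)]

def embed(batch, table):
    """Replace every token id in a (batch_size, seq_len) batch by its vector."""
    return [[table[token_id] for token_id in sequence] for sequence in batch]

vocab_size, d_model = 10, 4          # hypothetical sizes
table = make_embedding_table(vocab_size, d_model)

batch = [[1, 5, 7], [2, 5, 0]]       # shape (batch_size=2, seq_len=3)
embedded = embed(batch, table)

# Prints batch_size, seq_len, d_model
print(len(embedded), len(embedded[0]), len(embedded[0][0]))
```

Note that token 5 maps to the same vector in both sequences: the embedding depends only on the token id, which is why position information has to be added separately.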

#### Positional Encoding

The Positional Encoding layer supplements the model with information concerning the absolute position and relative distance of each token within the sequence. Unlike recurrent neural networks (RNNs) or convolutional neural networks (CNNs), Transformers lack inherent knowledge of token positions within a sequence. Hence, to capture token order, Transformers rely on Positional Encoding.

Numerous methods exist for encoding token positions. One approach is to employ another embedding module, akin to the previous layer, but instead of token values, the positions of tokens are passed as input. Once again, initialization of weights in this embedding layer is random, with subsequent learning during training to discern token positions.

An alternative is to precompute the positional encoding values with a fixed function, such as the sinusoidal encodings used in the original Transformer paper, instead of utilizing a trainable Embedding. This offers a significant advantage: it reduces the number of parameters the model needs to train, which lowers training cost, a crucial factor especially for large language models.
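
As one example of precomputed values, the sketch below builds the sinusoidal encoding from "Attention Is All You Need", where even dimensions use a sine and odd dimensions a cosine of pos / 10000^(2k / d_model). Whether the model in the diagram uses exactly this scheme is an assumption; the function name and sizes are illustrative.

```python
import math

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) table of fixed sinusoidal position encodings."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same frequency
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = sinusoidal_positional_encoding(seq_len=4, d_model=6)
print(pe[0])
print([round(v, 3) for v in pe[1]])
```

Because the table depends only on `seq_len` and `d_model`, it can be computed once and reused, with no weights to learn.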

#### Decoder
As illustrated in the diagram summarizing the Transformer architecture, following the Embedding and Positional Encoding layers comes the Decoder module. The Decoder comprises N instances of a Decoder Layer, succeeded by a Layer Norm.
The Layer Norm takes an input of shape (batch_size, seq_len, d_model) and normalizes it over its last dimension. As a result of this step, each embedding vector starts out with a mean of zero and a standard deviation of one. During training, the distribution changes shape as the learnable scale and shift parameters, a_2 and b_2, are optimized.
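
A minimal sketch of this normalization for a single d_model-length vector follows. It assumes that a_2 is a per-dimension scale initialised to ones and b_2 a per-dimension shift initialised to zeros, and it uses the population standard deviation; framework implementations may differ in such details.

```python
import math

def layer_norm(vector, a_2=None, b_2=None, eps=1e-6):
    """Normalise one d_model-length vector, then apply scale a_2 and shift b_2."""
    d_model = len(vector)
    a_2 = a_2 if a_2 is not None else [1.0] * d_model   # scale starts at one
    b_2 = b_2 if b_2 is not None else [0.0] * d_model   # shift starts at zero
    mean = sum(vector) / d_model
    variance = sum((x - mean) ** 2 for x in vector) / d_model
    std = math.sqrt(variance)
    return [a * (x - mean) / (std + eps) + b for x, a, b in zip(vector, a_2, b_2)]

normalised = layer_norm([2.0, 4.0, 6.0, 8.0])
print([round(x, 3) for x in normalised])
```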


#### Generator

The **Generator** is a crucial part of the transformer architecture, typically found at the end of the decoder module. Its primary function is to transform the decoder's output into a final output sequence, such as a sequence of words for a language model. The process and importance are as follows:

1. **Softmax Layer**: This is the final layer of the generator. It takes the vectors output by the decoder and applies the softmax function to them. The softmax function converts these vectors into a probability distribution, which helps in selecting the next word in the generated sequence. The output probabilities correspond to the model's confidence in each possible next word.

2. **Linear Transformation**: Just before the softmax layer, there is a linear transformation. This fully connected layer maps the d_model-dimensional output of the decoder to one score (logit) per token in the vocabulary, which gives the softmax function an input of the right shape.

3. **Essential Role in Transformer Architecture**: The generator is key to producing coherent and contextually relevant text in transformer models. It's the component that effectively converts the complex, abstract representations of the input sequence into a readable and meaningful output.

In summary, the generator takes the abstract representations from the transformer decoder and turns them into a probabilistic form that can be used to generate human-readable text. This process is essential for tasks such as language translation, text summarization, and even generative tasks like creating original content.
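
The two generator steps, a linear projection followed by softmax, can be sketched as below. The four-word vocabulary, the weight matrix, and the decoder output vector are hypothetical values chosen for illustration, and the next token is picked greedily as the most probable one.

```python
import math

def linear(vector, weights, bias):
    """Project a d_model-length vector to one score (logit) per vocabulary entry."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weights, bias)]

def softmax(logits):
    """Convert logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy setup: d_model = 3 and a 4-word vocabulary
vocabulary = ["cat", "dog", "sat", "mat"]
weights = [
    [0.2, -0.1, 0.0],
    [0.0, 0.3, -0.2],
    [1.0, 0.5, 0.5],
    [-0.3, 0.1, 0.4],
]
bias = [0.0, 0.0, 0.0, 0.0]

decoder_output = [1.0, 2.0, 0.5]     # final decoder vector for the last position
probabilities = softmax(linear(decoder_output, weights, bias))
next_token = vocabulary[probabilities.index(max(probabilities))]
print(next_token)
```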




## Data Characteristics of Generative AI Using Transformer Models

Generative AI, particularly when powered by Transformer models, has redefined the frontiers of artificial data synthesis. These models, which leverage layers of self-attention mechanisms, are particularly adroit at discerning the nuanced patterns in large volumes of data. Here, we explore the multifaceted characteristics of the data generated by these transformative architectures.

### High Fidelity
One of the paramount features of data generated through Transformer models is its high fidelity: it closely reflects the intricate patterns inherent in the training dataset. Compared with earlier generative models, Transformers are better at capturing and replicating complex structures within the data. This attribute is not limited to textual data; when applied to tasks such as image and speech generation, the outputs of large models can be difficult to distinguish from authentic data.

**Python Example**: Consider a Transformer model trained on literary texts. When prompted to generate a continuation for a given passage, the model outputs text that mirrors the style and tone of the original author, often with a level of subtlety that belies its synthetic origin.


```python
from transformers import pipeline

# Initialize a text-generation pipeline with GPT-2 (as an example)
generator = pipeline('text-generation', model='gpt2')

# Prompt the model to generate text based on the opening line of a novel
prompt = "It was a bright cold day in April, and the clocks were striking thirteen."
generated_text = generator(prompt, max_length=50, num_return_sequences=1)

print(generated_text[0]['generated_text'])
```


```text
It was a bright cold day in April, and the clocks were striking thirteen. A young man was in the middle of the road on the north side of the road, and he passed out under a sheet of paper and a book in one of his
```


### Diversity and Variety

The outputs of Transformer models are not monolithic; they embody a rich tapestry of variance, essential for tasks that require a broad spectrum of perspectives. Diversity in the generated data helps downstream models avoid overfitting and generalize to new, unseen data.

**Python Example**: Generating diverse text responses for a chatbot to prevent repetitive interactions.

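```python
# do_sample=True enables sampling, which is needed to obtain several distinct continuations
prompt = "Customer: I'd like to check my order status."
responses = generator(prompt, max_length=50, num_return_sequences=5, do_sample=True)
for response in responses:
    print(response['generated_text'])
```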




```text
Customer: I'd like to check my order status. Please confirm you saw the email that sent. **Confirm email address** @gummi.com Email * Required field Email account Password * You must provide a valid email address. Password is
Customer: I'd like to check my order status.


I get in trouble for not checking my order.


Please email me as well to tell... my business can be traced to you...


Your name can be found at your
Customer: I'd like to check my order status. Your order, please, do not panic!


My order is shipped today! This is a nice, easy place to check your order and I am honored to see you. I received the
Customer: I'd like to check my order status. In the event:

"Yes, I'll be back at my place!"

"Ok, here you go, no problem. Just leave the phone disconnected."

I'm
Customer: I'd like to check my order status. I don't want your money.

Customer: I have 3 of these "FULL" items. They take out the order and ship back to me. My car is under review and
```


### Novelty and Uniqueness

While Transformers excel at mimicking the training data, they also possess the uncanny ability to innovate, to weave together elements in configurations that, though reminiscent of the training data, are unique unto themselves. This aspect is particularly beneficial in creative domains where the generation of novel content is prized.

### Consistency and Coherence

Despite the inherent randomness in data generation, Transformer models manage to maintain a high degree of consistency and coherence. This is especially evident in their ability to generate lengthy passages of text that remain contextually relevant and semantically sound throughout.

In conclusion, the data generated by Transformer models stands out for its fidelity, diversity, novelty, and coherence, which makes these models a strong choice for synthetic data generation.

## Application Areas for Data Generated by Transformer Models

The application of generative AI, particularly through the utilization of Transformer models, spans an array of domains, each benefiting from the unique capabilities of these systems to produce rich, context-aware, and diverse data.

### Natural Language Processing (NLP)
In the realm of NLP, Transformer models have revolutionized numerous tasks. Text completion, machine translation, and content creation are prime examples where these models excel. They don't merely translate words but understand and convey nuances, idioms, and the cultural context, enhancing the quality of machine-assisted communication significantly.

**Python Example**: Use a Transformer model to translate a sentence from English to French.

```python
from transformers import pipeline

# Initialize a translation pipeline
translator = pipeline('translation_en_to_fr', model='t5-base')

# Translate text from English to French
english_text = "Generative AI is transforming the field of machine learning."
french_translation = translator(english_text, max_length=40)

print(french_translation[0]['translation_text'])
```


```text
L'AI génératrice transforme le domaine de l'apprentissage automatique.
```


### Synthetic Data Generation

Data scarcity is a critical challenge in fields where data collection is constrained by ethical, legal, or practical barriers. Transformer models can mitigate this by generating high-quality synthetic data that can be used for model training. For example, in healthcare, generative AI can produce anonymized patient records, preserving privacy while enabling medical research.

### Art and Creativity

Generative AI transcends technical applications; it is also an accomplice to creativity. In art and music, Transformer models are used to compose music pieces, create digital art, and even write poetry. They act as catalysts for creativity, augmenting human artists by providing inspirational starting points or by collaboratively developing artistic works.

### Research and Development

In R&D, especially in domains like pharmaceuticals, generative models expedite the discovery process. They can predict molecular structures that might yield new medications, reducing the need for extensive physical trials and opening up possibilities for novel treatments.

The versatility of Transformer models in generative AI thus extends across a multitude of industries, reshaping existing paradigms and forging new paths in digital innovation.

***

## Data Generation Using Transformer Models

Transformer models, with their state-of-the-art capabilities, are adept at performing a variety of data generation tasks. In this section, we focus on two distinct applications: text generation and synthetic dataset creation.

### Text Generation Task

#### Defining the Task
The text generation task aims to produce coherent and contextually relevant pieces of text. A common use case is generating responses in a chatbot, where the input is a user's message and the output is a natural, human-like reply.

#### Format of Generated Text
The generated text will be in the form of sentences or paragraphs, based on the input prompt. The output format can vary from simple responses to complex narratives, depending on the prompt and the model's capabilities.

#### Python Example: Generating Text
Here, we use a pre-trained Transformer model from the Hugging Face `transformers` library to generate a text snippet.



```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load pre-trained model tokenizer (vocabulary)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Encode the context the generation is conditioned on; this returns input_ids and attention_mask
inputs = tokenizer('The quick brown fox jumps over the lazy dog', return_tensors='pt')

# Load pre-trained model (weights)
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Generate text until the output length (which includes the context length) reaches 50.
# GPT-2 has no pad token, so the end-of-sequence token is used for padding.
outputs = model.generate(**inputs, max_length=50, pad_token_id=tokenizer.eos_token_id)

# Decode and print the output for each sequence
for i, output in enumerate(outputs):
    print(f"Generated text {i+1}: {tokenizer.decode(output, skip_special_tokens=True)}")
```


```text
Generated text 1: The quick brown fox jumps over the lazy dog and runs off.

"I'm sorry, I'm sorry, I'm sorry," the fox says.

"I'm sorry, I'm sorry, I'm sorry," the fox says
```


#### Constraints for Generated Text

To ensure the utility of the generated text, certain constraints are applied:

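- **Length:** The `max_length` argument caps the total sequence length in tokens, prompt included; `max_new_tokens` caps only the newly generated tokens.
- **Diversity:** The `temperature` parameter controls the randomness of the output and therefore the diversity of the generated text. It only takes effect when sampling is enabled (`do_sample=True`); the greedy decoding used in the example above is deterministic.
- **Safety:** Content filters can be applied to the output to keep inappropriate content from reaching users; the example above applies none.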
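
To make the diversity control concrete, the sketch below applies temperature scaling and top-k filtering to a toy logit vector with NumPy. The logits and the helper names (`softmax`, `apply_temperature`, `top_k_filter`) are illustrative assumptions rather than part of the `transformers` API; they only mimic what a sampling-based decoder does to the next-token distribution.

```python
import numpy as np

def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def apply_temperature(logits, temperature):
    """Divide the logits by the temperature before applying the softmax."""
    return softmax(np.asarray(logits, dtype=float) / temperature)

def top_k_filter(logits, k):
    """Keep the k largest logits and mask the rest with -inf."""
    logits = np.asarray(logits, dtype=float)
    cutoff = np.sort(logits)[-k]
    return np.where(logits >= cutoff, logits, -np.inf)

# Hypothetical next-token logits for a five-word vocabulary
toy_logits = [2.0, 1.0, 0.5, 0.1, -1.0]

sharp = apply_temperature(toy_logits, temperature=0.5)  # more peaked
flat = apply_temperature(toy_logits, temperature=2.0)   # closer to uniform
top2 = softmax(top_k_filter(toy_logits, k=2))           # only two candidates remain

print("T=0.5:", np.round(sharp, 3))
print("T=2.0:", np.round(flat, 3))
print("top-2:", np.round(top2, 3))
```

A low temperature concentrates probability on the most likely token, a high temperature spreads it out, and top-k removes the tail entirely before sampling.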

### Synthetic Dataset Generation Task

#### Defining the Task

The synthetic dataset generation task involves creating artificial datasets that mimic the statistical properties of real-world data. This can be crucial for training machine learning models when actual datasets are inadequate or sensitive.

#### Format of the Synthetic Dataset

The synthetic dataset will typically have a tabular format, with rows representing individual records and columns representing features.

#### Python Example: Generating a Synthetic Dataset

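For illustrative purposes, we simulate the generation of a synthetic dataset using Python's pandas and numpy libraries. The placeholder function draws from fixed NumPy distributions rather than from a trained generative model.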


In [9]:
import pandas as pd
import numpy as np

# Assume we have a model that generates synthetic data following the distribution of the training data
def generate_synthetic_data(num_records):
    # This function is a placeholder for the actual model's generative function
    data = {
        'Feature1': np.random.normal(loc=0.0, scale=1.0, size=num_records),
        'Feature2': np.random.uniform(low=0.0, high=1.0, size=num_records),
        # ... more features can be added as per the model's capabilities
    }
    synthetic_df = pd.DataFrame(data)
    return synthetic_df

# Generate a synthetic dataset with 100 records
synthetic_dataset = generate_synthetic_data(100)
print(synthetic_dataset.head())

   Feature1  Feature2
0 -1.159953  0.786905
1  0.981290  0.940439
2  0.999376  0.111725
3  1.433815  0.084361
4 -0.166150  0.644720


#### Constraints for the Synthetic Dataset

When generating synthetic datasets, several constraints may be enforced to ensure data quality:

- **Statistical Representativeness:** The generated data should statistically reflect the real dataset's characteristics.
- **Privacy:** Techniques such as differential privacy limit how much the synthetic data can reveal about any real individual.
- **Domain Constraints:** The generated features must adhere to domain-specific rules, such as valid ranges for numerical values or categories.
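
Two of these constraints can be checked mechanically. The following sketch, built on made-up data, filters out rows that violate assumed valid ranges and then compares simple summary statistics between a "real" and a synthetic table. The column names, ranges, and helper functions (`enforce_ranges`, `compare_statistics`) are hypothetical, and privacy guarantees are not covered here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical "real" data: ages in years and a probability-like score
real_df = pd.DataFrame({
    "age": rng.normal(loc=40, scale=12, size=1000).clip(18, 90),
    "score": rng.uniform(0.0, 1.0, size=1000),
})

# Hypothetical synthetic draw that may fall outside the valid ranges
synthetic_df = pd.DataFrame({
    "age": rng.normal(loc=40, scale=12, size=1000),
    "score": rng.uniform(-0.05, 1.05, size=1000),
})

# Domain constraints: assumed valid (min, max) range per column
valid_ranges = {"age": (18, 90), "score": (0.0, 1.0)}

def enforce_ranges(df, ranges):
    """Drop rows whose values fall outside the allowed range of any column."""
    mask = pd.Series(True, index=df.index)
    for column, (low, high) in ranges.items():
        mask &= df[column].between(low, high)
    return df[mask].reset_index(drop=True)

def compare_statistics(real, synthetic):
    """Absolute difference in mean and standard deviation per column."""
    return pd.DataFrame({
        "mean_diff": (real.mean() - synthetic.mean()).abs(),
        "std_diff": (real.std() - synthetic.std()).abs(),
    })

clean_df = enforce_ranges(synthetic_df, valid_ranges)
print(f"Kept {len(clean_df)} of {len(synthetic_df)} synthetic rows")
print(compare_statistics(real_df, clean_df))
```

Matching means and standard deviations is only a first check; distribution-level tests and correlations between columns would be needed for a fuller picture of representativeness.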


## Evaluation and Justification of Generated Data Quality

### Assessing Effectiveness of Generative AI Technique

The effectiveness of a generative AI technique, particularly one involving Transformer models, hinges on the model's ability to produce data that is relevant and applicable to the task at hand. Evaluating this effectiveness involves several dimensions:

- **Fidelity:** How well does the generated data reflect the characteristics of the training dataset? For text, this could involve comparing the style, vocabulary, and syntax of generated sentences to those of the corpus it was trained on.

- **Diversity:** A robust model should generate diverse outputs, offering a variety of plausible results rather than repeating a limited set of patterns.

- **Novelty:** While the generated data should be similar to the training data, it should also provide new combinations and patterns that were not explicitly present in the training data.

- **Coherence and Consistency:** In text generation, coherence refers to the logical flow of content, while consistency pertains to maintaining context and style throughout the piece.
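
Diversity and novelty can be approximated with simple n-gram counts. The sketch below is one possible way to do so: `distinct_n` is the share of unique n-grams in the generated texts, and `novelty` is the share of generated n-grams that never occur in the training texts. Both helper names and the toy sentences are assumptions for illustration, and whitespace tokenization is a deliberate simplification.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def distinct_n(texts, n=2):
    """Diversity: share of unique n-grams among all n-grams in the texts."""
    all_ngrams = [g for text in texts for g in ngrams(text.lower().split(), n)]
    return len(set(all_ngrams)) / len(all_ngrams) if all_ngrams else 0.0

def novelty(generated_texts, training_texts, n=2):
    """Novelty: share of generated n-grams that never occur in the training texts."""
    seen = {g for text in training_texts for g in ngrams(text.lower().split(), n)}
    generated = [g for text in generated_texts for g in ngrams(text.lower().split(), n)]
    if not generated:
        return 0.0
    return sum(g not in seen for g in generated) / len(generated)

# Toy data (hypothetical)
training = ["the weather is cold today", "i enjoyed the new movie"]
repetitive = ["i am sorry i am sorry i am sorry"]
varied = ["the new movie is cold", "today i enjoyed the weather"]

print("distinct-2 (repetitive):", distinct_n(repetitive))
print("distinct-2 (varied):    ", distinct_n(varied))
print("novelty of varied text: ", novelty(varied, training))
```

A low distinct-n value flags repetitive output, while a novelty close to zero suggests the model is mostly copying its training data.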

### Python Example for Evaluation

To illustrate how one might assess fidelity, consider a simple example that compares each generated text with a training text using the cosine similarity between mean-pooled GPT-2 embeddings.


In [11]:
pip install sentence_transformers



In [15]:
from transformers import GPT2Tokenizer, GPT2Model
import torch
from scipy.spatial.distance import cosine

# Define example training and generated texts
training_texts = ["The weather is cold today.", "I enjoyed the new movie."]
generated_texts = ["It's a chilly day outside.", "The new film was delightful."]

# Load pre-trained model tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')

# Set padding token
tokenizer.pad_token = tokenizer.eos_token

# Encode text inputs
training_text_encodings = tokenizer(training_texts, padding=True, return_tensors='pt')
generated_text_encodings = tokenizer(generated_texts, padding=True, return_tensors='pt')

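# Compute sentence embeddings by averaging the last hidden state over all token positions
# (padding positions are included in this simple mean)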
with torch.no_grad():
    training_text_embeddings = model(**training_text_encodings).last_hidden_state.mean(dim=1)
    generated_text_embeddings = model(**generated_text_encodings).last_hidden_state.mean(dim=1)

# Compute cosine similarity
cosine_similarities = [1 - cosine(training.detach().numpy(), generated.detach().numpy())
                       for training, generated in zip(training_text_embeddings, generated_text_embeddings)]

# Assess the similarity scores
for i, score in enumerate(cosine_similarities):
    print(f"Similarity between training text '{training_texts[i]}' and generated text '{generated_texts[i]}': {score:.4f}")


Similarity between training text 'The weather is cold today.' and generated text 'It's a chilly day outside.': 0.9985
Similarity between training text 'I enjoyed the new movie.' and generated text 'The new film was delightful.': 0.9984
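
The example above averages hidden states over every position, padding included. A common refinement, sketched here with NumPy on made-up arrays, is to weight the mean by the attention mask so that padded positions are ignored. The array shapes mirror `last_hidden_state` (batch, tokens, hidden size) and `attention_mask` (batch, tokens); the values and the helper names are assumptions for illustration.

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, ignoring positions where the mask is 0."""
    mask = attention_mask[..., np.newaxis].astype(float)  # (batch, tokens, 1)
    summed = (hidden_states * mask).sum(axis=1)           # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)        # avoid division by zero
    return summed / counts

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy batch: 2 sequences, 3 token positions, hidden size 2.
# The second sequence has one padded position (mask = 0) holding a junk vector.
hidden = np.array([
    [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [1.0, 0.0], [0.0, 9.0]],
])
mask = np.array([[1, 1, 1], [1, 1, 0]])

plain = hidden.mean(axis=1)              # padding leaks into the average
pooled = masked_mean_pool(hidden, mask)  # padding is excluded

print("Unmasked similarity:", round(cosine_similarity(plain[0], plain[1]), 4))
print("Masked similarity:  ", round(cosine_similarity(pooled[0], pooled[1]), 4))
```

Scores very close to 1 should also be read with caution: mean-pooled GPT-2 states tend to give high cosine similarity for many sentence pairs, so scoring an unrelated pair as a baseline helps put a single value in context.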


## Validation of Generated Data

### Standards and Criteria for Validation

To validate the quality of the generated data, it's crucial to compare it against established benchmarks or criteria. This could involve:

- **Comparative Analysis:** Comparing the generated data with a high-quality, human-curated dataset. For textual data, this might mean assessing factors like grammaticality, vocabulary richness, and topic relevance.

- **Automated Metrics:** Using metrics such as BLEU (Bilingual Evaluation Understudy) for translation tasks, ROUGE (Recall-Oriented Understudy for Gisting Evaluation) for summarization, or precision and recall over the words that generated and reference content have in common.

- **Consistency Checks:** Ensuring that the generated data does not contradict known facts or established principles, particularly in domain-specific applications like medical or legal text generation.
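
Overlap-based metrics of this kind share one core computation. As a rough illustration, the sketch below computes unigram precision, recall, and F1 between a candidate and a reference, which is the idea behind ROUGE-1. It is a simplified stand-in with whitespace tokenization, not the official ROUGE implementation, and the function name and sentences are made up.

```python
from collections import Counter

def unigram_overlap(candidate, reference):
    """Return (precision, recall, f1) based on clipped unigram overlap."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per word (clipped matches)
    overlap = sum((cand_counts & ref_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat today"

precision, recall, f1 = unigram_overlap(candidate, reference)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Precision penalizes extra words in the candidate, while recall penalizes reference words the candidate misses.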

### Python Example: Validation Using Automated Metrics

In this example, let's use the BLEU score to evaluate the quality of machine-translated text against a reference translation.


In [17]:
pip install sacrebleu


In [18]:
from sacrebleu import corpus_bleu

# Sample machine-translated texts (hypotheses)
machine_translations = ["C'est une journée froide.", "J'ai aimé le nouveau film."]

# sacrebleu expects a list of reference streams: each inner list holds one
# reference per hypothesis, in the same order as the hypotheses
reference_translations = [["Il fait froid aujourd'hui.", "J'ai apprécié le nouveau film."]]

# Calculate BLEU score
bleu_score = corpus_bleu(machine_translations, reference_translations).score
print(f"BLEU score: {bleu_score:.2f}")


BLEU score: 31.93


**Interpreting the Score:** BLEU scores are scaled from 0 to 100, with higher scores indicating closer agreement with the reference translations. A score of 31.93 on two short sentences is modest. Nearly all of the matching n-grams come from the second translation, which differs from its reference by a single word; the first shares only the final period with its reference, although the two sentences mean nearly the same thing.

**Context of BLEU Scores:**

- **Scores Vary by Task:** Scores can vary significantly depending on the complexity of the text and the languages involved. For instance, translations between languages with similar structures may yield higher BLEU scores.
- **Relative, Not Absolute:** BLEU scores are best used to compare different translation models or approaches rather than as an absolute measure of translation quality. A BLEU score on its own doesn’t convey the full picture of translation quality and should be complemented with other qualitative evaluations.
- **Limitations:** BLEU doesn't capture semantic accuracy or fluency very well. A translation can be semantically correct and fluent and still receive a low BLEU score if it uses different wording from the reference texts, as the first sentence pair here shows.

In summary, a BLEU score of 31.93 reflects how much of the wording and phrasing the translations share with the references, not whether they are correct. For a more comprehensive evaluation, consider both qualitative reviews and additional quantitative metrics.
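
To see why different wording lowers the score, the sketch below computes the two ingredients of BLEU by hand: clipped n-gram precision and the brevity penalty. It is a simplified, unsmoothed, single-reference illustration on pre-tokenized sentences, not a replacement for `sacrebleu`, whose tokenization and smoothing differ. The function names are assumptions.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(hyp_tokens, ref_tokens, n):
    """Clipped n-gram matches and total hypothesis n-grams."""
    hyp = ngram_counts(hyp_tokens, n)
    ref = ngram_counts(ref_tokens, n)
    matched = sum((hyp & ref).values())  # each match is capped by the reference count
    return matched, sum(hyp.values())

def brevity_penalty(hyp_len, ref_len):
    """Penalize hypotheses that are shorter than the reference."""
    return 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)

def simple_bleu(hypothesis, reference, max_n=4):
    """Unsmoothed BLEU for one sentence pair; 0 if any n-gram order has no match."""
    hyp_tokens, ref_tokens = hypothesis.split(), reference.split()
    log_sum = 0.0
    for n in range(1, max_n + 1):
        matched, total = modified_precision(hyp_tokens, ref_tokens, n)
        if matched == 0 or total == 0:
            return 0.0
        log_sum += math.log(matched / total)
    bp = brevity_penalty(len(hyp_tokens), len(ref_tokens))
    return 100 * bp * math.exp(log_sum / max_n)

# Same meaning, different wording: no shared bigram, so the score collapses
print(simple_bleu("C'est une journée froide .", "Il fait froid aujourd'hui ."))
# One word differs: most n-grams still match
print(round(simple_bleu("J'ai aimé le nouveau film .", "J'ai apprécié le nouveau film ."), 2))
```

The first pair scores zero despite being a reasonable paraphrase, which is the limitation described above in miniature.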

## Potential Applications of Generated Data

### Enhancing Data Science Tasks

The generated data from Transformer models can be leveraged across various data science tasks:

- **Training Data Augmentation:** In cases of limited or imbalanced datasets, the synthetic data can supplement real data, aiding in training more robust models.
  
- **Model Validation and Testing:** Using generated data to validate the performance and reliability of other models, especially in scenarios where test data is scarce.
  
- **Simulating Scenarios for Decision Making:** In areas like finance or logistics, generated data can simulate various scenarios, assisting in strategic planning and decision-making processes.
  
- **Ethical AI Development:** By generating diverse and inclusive data, we can address biases present in existing datasets, fostering the development of more ethical AI systems.

### Advancing Research and Innovation

The application of this synthetic data is not just confined to practical data science tasks but also extends to innovative research areas:

- **Natural Language Understanding and Generation:** Advancing the capabilities of AI in understanding and generating human-like text.
  
- **Creative AI:** Exploring the boundaries of AI-generated content in art, literature, and music, potentially leading to new forms of creative expression.
  
- **Predictive Analytics:** Employing generated data in predictive models to forecast trends, behaviors, and outcomes in various domains.

Taken together, the evaluation methods and application areas above indicate where data generated by Transformer models can be useful, both in supporting practical data science work and in advancing AI research, provided its quality is validated for the task at hand.


## Data Generation Process with AG News Dataset

### 1. Loading GPT-2 Model:
   - Initialize by loading the GPT-2 tokenizer and PyTorch model (`GPT2LMHeadModel`) from the Hugging Face `transformers` library, which is essential for generating text.

### 2. Setting a Reproducible Seed:
   - To ensure consistent outputs across different executions, the seed is set to a specific value (e.g., `set_seed(42)`). The seed matters once sampling is enabled; greedy decoding is deterministic without it.

### 3. Developing a Text Generation Function:
   - Implement a function named `generate_news_text`. This function takes a news headline or topic as input and generates a corresponding news article or summary.
   - The function uses the GPT-2 tokenizer for input processing and passes generation arguments (like `max_length`, `no_repeat_ngram_size`, `top_k`) to `model.generate` for controlled text generation.

### 4. Integrating the AG News Dataset:
   - Load the "AG News" dataset, which consists of news articles categorized into topics such as World, Sports, Business, and Science/Technology.
   - Use the dataset to provide context or seed phrases for generating news-related text.

### 5. Interactive News Generation Loop:
   - Create an interactive loop that prompts users to enter a news category or topic.
   - Related headlines or topics from the AG News dataset can then serve as prompts. The loop shown below takes the simpler route of passing the user's input to the model directly.

### 6. Generating and Exhibiting News Content:
   - Upon receiving the user's input, the script uses the GPT-2 model to elaborate on the given topic, generating a news snippet or an article summary.
   - It presents the generated news content alongside the input headline or topic.

### 7. Handling Diverse News Queries:
   - If the user's input doesn't directly align with the AG News categories, the model still generates a continuation from the keywords provided, although very short prompts can drift off topic, as the sample outputs show.
   - Users are encouraged to explore a variety of topics within the news domain.
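
Steps 4 to 7 can be sketched without the model: map the user's input to an AG News category, pick a headline of that category as the seed, and fall back to the raw input otherwise. The label ids follow the AG News label order (World, Sports, Business, Sci/Tech); `sample_rows` is a small hand-written stand-in for rows of the dataset, which have a `text` and a `label` field, and `build_prompt` is a hypothetical helper.

```python
import random

# AG News label ids and their category names
AG_NEWS_LABELS = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}

# Hand-written stand-in rows (hypothetical headlines, same fields as the dataset)
sample_rows = [
    {"text": "Leaders meet for climate summit", "label": 0},
    {"text": "Home team wins the final in overtime", "label": 1},
    {"text": "Underdogs reach the semi-finals", "label": 1},
    {"text": "Stocks rally as oil prices ease", "label": 2},
    {"text": "New chip promises faster laptops", "label": 3},
]

def build_prompt(user_input, rows, rng=random):
    """Return a seed headline for a known category, else the raw user input."""
    wanted = user_input.strip().lower()
    label_ids = [i for i, name in AG_NEWS_LABELS.items() if name.lower() == wanted]
    if not label_ids:
        return user_input.strip()  # fall back to the user's own keywords
    candidates = [row["text"] for row in rows if row["label"] == label_ids[0]]
    return rng.choice(candidates) if candidates else user_input.strip()

rng = random.Random(42)
print(build_prompt("Sports", sample_rows, rng))  # a sports headline
print(build_prompt("gardening", sample_rows))    # unknown category: input is reused
```

The returned string could then be passed to a generation function in place of the raw user input, which gives the model a fuller, news-style context to continue.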


In [20]:
pip install torch datasets

In [22]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

# 1. Loading GPT-2 Model and Tokenizer (PyTorch version)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# 2. Setting Seed for Reproducibility
def set_seed(seed):
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

set_seed(42)

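# 3. Text Generation Function
def generate_news_text(prompt, max_length=100):
    """Continue `prompt` with GPT-2 and return the decoded text."""
    input_ids = tokenizer.encode(prompt, return_tensors='pt')
    # max_length counts prompt tokens plus generated tokens.
    # no_repeat_ngram_size=2 blocks any 2-gram from appearing twice.
    # top_k only takes effect when sampling is enabled (do_sample=True); with the
    # default greedy decoding used here it is ignored and the output is deterministic.
    generated_text_samples = model.generate(input_ids,
                                            max_length=max_length,
                                            no_repeat_ngram_size=2,
                                            top_k=50,
                                            num_return_sequences=1)
    generated_text = tokenizer.decode(generated_text_samples[0], skip_special_tokens=True)
    return generated_text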

# Example usage
prompt = "Latest technology trends"
generated_text = generate_news_text(prompt)
print("Generated Text:\n", generated_text)




Generated Text:
 Latest technology trends are changing the way we think about technology.

The world is changing, and we need to change how we use technology to make it better for everyone. We need a new way of thinking about how technology works, how it works for us, what it means for our lives. And we're going to need it.
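
The `no_repeat_ngram_size=2` argument explains why this output avoids the "I'm sorry" loop seen in the first example: once a pair of tokens has appeared, the decoder may not produce it again. The sketch below illustrates the idea on plain word lists; `banned_next_tokens` is a hypothetical helper written for this explanation, not the library's implementation.

```python
def banned_next_tokens(history, ngram_size):
    """Tokens that would complete an n-gram already present in `history`."""
    if ngram_size < 1 or len(history) < ngram_size - 1:
        return set()
    # The last (ngram_size - 1) tokens form the prefix of the next n-gram
    prefix = tuple(history[len(history) - (ngram_size - 1):])
    banned = set()
    for i in range(len(history) - ngram_size + 1):
        if tuple(history[i:i + ngram_size - 1]) == prefix:
            # This earlier n-gram starts with the same prefix: ban its last token
            banned.add(history[i + ngram_size - 1])
    return banned

history = "i am sorry i".split()
print(banned_next_tokens(history, ngram_size=2))  # {'am'}: "i am" already occurred
```

At each step the decoder removes the banned tokens from consideration and picks the best remaining one, which forces the text to move on.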


In [23]:
from datasets import load_dataset

# 4. Load the AG News Dataset
dataset = load_dataset("ag_news")



In [25]:
# 5. Interactive News Generation Loop
while True:
    user_input = input("Enter a news topic or 'exit' to quit: ")
    if user_input.lower() == 'exit':
        break

    # Generate and display news content
    generated_news = generate_news_text(user_input)
    print("Generated News Article:\n", generated_news)

Enter a news topic or 'exit' to quit: USA




Generated News Article:
 USA.

The team's first goal came in the second half, when the ball was headed into the box by a defender. The ball went through the legs of the defender and into a corner, but the keeper was able to get it back to the net. It was a great goal for the team, and it was the first time we've scored in a game in our history. We're very proud of that. I think we're going to be very happy with that."
.
Enter a news topic or 'exit' to quit: technology




Generated News Article:
 technology.com/products/product-detail/b-battery-power-supply-1-3-pack-2-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19-20-21-22-23-24-25-26-27-28-29-30-31-32-33-34-35-36-37-38-39-
Enter a news topic or 'exit' to quit: exit


### Another Example - Culinary World

In this example, we explore text generation in the culinary domain. Using GPT-2, a pre-trained Transformer language model, we will generate recipe ideas and culinary content.

To facilitate this, we have crafted a synthetic dataset, `synthetic_culinary_dataset.csv`, with an `Ingredients` column and a `Cuisines` column. The dataset supplies the ingredient or cuisine that is inserted into each prompt, giving the GPT-2 model a starting point for its culinary content.

The Python script below outlines the entire process, beginning with loading the necessary libraries and the GPT-2 model, setting a seed for reproducibility, and defining a function for generating text. It then loads our synthetic culinary dataset and enters into an interactive loop. In this loop, users can prompt the AI to generate recipes based on specific ingredients or cuisines, showcasing the model's ability to creatively interpret and build upon the given input.

In [26]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
import pandas as pd
import random

# Load GPT-2 model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Set a seed for reproducibility
def set_seed(seed):
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

set_seed(42)

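# Function to generate recipe text
def generate_recipe_text(prompt, max_length=100):
    """Continue `prompt` with GPT-2 and return the decoded text."""
    input_ids = tokenizer.encode(prompt, return_tensors='pt')
    # top_k only takes effect when sampling is enabled (do_sample=True);
    # with the default greedy decoding used here it is ignored.
    generated_text_samples = model.generate(input_ids, max_length=max_length,
                                            no_repeat_ngram_size=2, top_k=50,
                                            num_return_sequences=1)
    generated_text = tokenizer.decode(generated_text_samples[0], skip_special_tokens=True)
    return generated_text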

# Load the dataset
file_path = 'C:/Users/Divyesh Rajput/synthetic_culinary_dataset.csv'
df = pd.read_csv(file_path)

# Interactive loop for recipe generation
while True:
    user_input = input("Enter 'ingredient', 'cuisine', or 'exit': ").lower()
    
    if user_input == 'exit':
        break
    elif user_input == 'ingredient':
        chosen_ingredient = random.choice(df['Ingredients'].dropna().tolist())
        prompt = f"Recipe with {chosen_ingredient}:"
    elif user_input == 'cuisine':
        chosen_cuisine = random.choice(df['Cuisines'].dropna().tolist())
        prompt = f"Recipe for {chosen_cuisine} cuisine:"
    else:
        print("Please enter 'ingredient', 'cuisine', or 'exit'.")
        continue

    generated_recipe = generate_recipe_text(prompt)
    print("\nGenerated Recipe Idea:\n", generated_recipe)


Enter 'ingredient', 'cuisine', or 'exit': japanese
Please enter 'ingredient', 'cuisine', or 'exit'.
Enter 'ingredient', 'cuisine', or 'exit': cuisine





Generated Recipe Idea:
 Recipe for Russian cuisine:

Ingredients: 1 cup (1 stick) butter, softened
.
, 1/2 cup sugar, divided
: 2 cups (2 sticks) unsalted butter
 (or 1 stick unsweetened)
-1/4 cup unsmoked salmon, cut into 1-inch cubes
(optional) 1 tablespoon (3/8 teaspoon) salt
1 tablespoon ground cumin
2 tablespoons (4 tablespoons) freshly ground black pepper
Directions:
Enter 'ingredient', 'cuisine', or 'exit': ingredient





Generated Recipe Idea:
 Recipe with onion:

1 cup chopped fresh parsley
.
 (I used a small amount of parsnip)
, ( I used an extra 1/2 cup of chopped parsse) 1 cup finely chopped onion
: 1 1⁄2 cups chopped cilantro
- 1 teaspoon cumin
*1 teaspoon salt
(I use 1 tablespoon of cayenne pepper) 2 tablespoons olive oil
2 tablespoons chopped garlic
3 tablespoons minced fresh ginger
4 tablespoons finely
Enter 'ingredient', 'cuisine', or 'exit': exit
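
The prompt-building step of the loop above can be separated from user interaction and from the model, which makes it easy to test. The sketch below does this with a small stand-in table: the column names `Ingredients` and `Cuisines` come from the script, while the rows and the helper `build_recipe_prompt` are hypothetical, since the contents of the CSV file are not shown here.

```python
import random
import pandas as pd

# Hypothetical rows standing in for synthetic_culinary_dataset.csv
culinary_df = pd.DataFrame({
    "Ingredients": ["onion", "garlic", "tomato", None],
    "Cuisines": ["Russian", "Japanese", None, "Italian"],
})

def build_recipe_prompt(df, kind, rng):
    """Pick a random non-missing value and wrap it in the prompt template used above."""
    if kind == "ingredient":
        value = rng.choice(df["Ingredients"].dropna().tolist())
        return f"Recipe with {value}:"
    if kind == "cuisine":
        value = rng.choice(df["Cuisines"].dropna().tolist())
        return f"Recipe for {value} cuisine:"
    raise ValueError("kind must be 'ingredient' or 'cuisine'")

rng = random.Random(42)
print(build_recipe_prompt(culinary_df, "ingredient", rng))
print(build_recipe_prompt(culinary_df, "cuisine", rng))
```

Passing a seeded `random.Random` instance keeps the choice of ingredient or cuisine reproducible, independently of the seed used for the model.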


***

## Understanding Prompt Engineering in Generative AI

### Definition of Prompt Engineering
Prompt engineering is a crucial technique in guiding generative AI models to produce specific, high-quality outputs. These AI models, designed to emulate human-like responses, necessitate detailed instructions to yield accurate and relevant results. This practice involves the careful selection of formats, phrases, and symbols to enhance AI-user interactions. Creativity, alongside trial and error, is essential for crafting prompts that ensure the AI operates as intended.

### The Essence of Prompts in Generative AI
A prompt is a command or request in natural language that directs the AI to execute a particular task. It is addressed to a large language model (LLM): a large machine learning model built on deep neural networks and pre-trained on substantial datasets. These models are versatile, capable of tasks like document summarization, language translation, and question answering. Their design allows them to predict outputs based on their extensive training, even from minimal input like a single word.

However, the effectiveness of generative AI hinges on the context and specificity of these prompts. Systematic prompt design leads to more pertinent and practical outputs. Prompt engineering is thus an iterative process of refining these inputs to achieve optimal responses from the AI system.

### The Importance of Prompt Engineering
The surge in generative AI applications has elevated the role of prompt engineering, serving as a critical link between users and language models. Prompt engineers work on identifying and testing various input forms to build a library of prompts. This repository aids application developers in crafting user interfaces that elicit effective responses from AI models.

In application development, encapsulating open-ended user queries within structured prompts enhances the efficiency and relevance of AI responses. For instance, in AI chatbots, an engineered prompt can transform a vague user query into a detailed, actionable request for the AI, thereby generating more accurate and helpful responses.
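
As a minimal sketch of this encapsulation, the snippet below wraps a raw user query in a fixed template before it would be sent to a model. The template wording, the support-assistant scenario, and the `build_chat_prompt` helper are all assumptions chosen for illustration.

```python
# Hypothetical structured prompt for a customer-support chatbot
PROMPT_TEMPLATE = (
    "You are a customer-support assistant for an online store.\n"
    "Answer the customer's question in at most three sentences.\n"
    "If the question is unclear, ask one clarifying question instead.\n\n"
    "Customer question: {question}\n"
    "Answer:"
)

def build_chat_prompt(user_query):
    """Embed a raw user query in the fixed, structured prompt."""
    cleaned = " ".join(user_query.split())  # collapse stray whitespace
    if not cleaned:
        raise ValueError("user_query must not be empty")
    return PROMPT_TEMPLATE.format(question=cleaned)

print(build_chat_prompt("  where is   my order? "))
```

The user only types the question; the role, length limit, and fallback behavior are supplied by the application, so every request reaches the model with the same structure.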


## Optimizing LLM Responses with CO-STAR Framework

The CO-STAR framework, developed by GovTech Singapore’s Data Science & AI team, offers a systematic approach to structuring prompts for Large Language Models (LLMs), enhancing their response quality and relevance. This framework encapsulates several key aspects that significantly influence LLM output.

### Understanding CO-STAR Framework

- **(C) Context**: This element involves providing essential background information about the task, thereby grounding the LLM in the specific scenario being addressed.
  
- **(O) Objective**: Clearly articulating the task's objective enables the LLM to direct its response toward achieving the defined goal.
  
- **(S) Style**: Specifying a desired style, such as emulating the writing of a well-known author or professional, guides the LLM in its language and word choice.
  
- **(T) Tone**: Setting the tone, whether formal, humorous, or empathetic, shapes the LLM’s response to match the intended emotional context.
  
- **(A) Audience**: Identifying the target audience, such as experts or beginners, ensures the LLM’s output is tailored to their level of understanding.
  
- **(R) Response**: Defining the desired response format (e.g., list, JSON, professional report) allows for effective downstream utilization of the LLM’s output.

![image.png](attachment:image.png)

### Practical Application Example

Let's consider an example where you're a digital marketing strategist drafting an email campaign. Without CO-STAR, a basic prompt to an LLM might yield a standard response, potentially lacking specificity and appeal.

However, applying the CO-STAR framework transforms this task:

- **Context**: Drafting an email for a new eco-friendly product line.
- **Objective**: Aim to drive online sales through engaging content.
- **Style**: Emulate the compelling narratives used by successful eco-conscious brands.
- **Tone**: Inspirational and motivational.
- **Audience**: Environmentally aware consumers, familiar with green products.
- **Response**: The email should be concise, yet captivating and informative.

By adhering to CO-STAR, the response from the LLM becomes more aligned with your campaign’s objectives, resonating more effectively with your intended audience.

The CO-STAR framework thus ensures a structured and comprehensive approach to prompt design, leading to more nuanced and purpose-driven outputs from LLMs.
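
One way to apply the framework in code is to keep the six elements as separate fields and render them into a single prompt. The sketch below does this for the email campaign example; the `CoStarPrompt` class and the `# SECTION #` header format are assumptions made for illustration, not a prescribed part of the framework.

```python
from dataclasses import dataclass

@dataclass
class CoStarPrompt:
    context: str
    objective: str
    style: str
    tone: str
    audience: str
    response: str

    def render(self):
        """Join the six sections under labeled headers, in CO-STAR order."""
        sections = [
            ("CONTEXT", self.context),
            ("OBJECTIVE", self.objective),
            ("STYLE", self.style),
            ("TONE", self.tone),
            ("AUDIENCE", self.audience),
            ("RESPONSE", self.response),
        ]
        return "\n\n".join(f"# {name} #\n{text}" for name, text in sections)

email_prompt = CoStarPrompt(
    context="Drafting an email for a new eco-friendly product line.",
    objective="Drive online sales through engaging content.",
    style="Emulate the compelling narratives used by successful eco-conscious brands.",
    tone="Inspirational and motivational.",
    audience="Environmentally aware consumers, familiar with green products.",
    response="A concise, yet captivating and informative email.",
)

print(email_prompt.render())
```

Keeping the fields separate makes it easy to vary one element, such as the tone or the audience, while holding the rest of the prompt constant.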


**Example:**
![image.png](attachment:image.png)

**Using CO-STAR Framework**
![image.png](attachment:image.png)

The CO-STAR framework guides you to provide all of the crucial pieces of information about your task to the LLM in a structured manner, ensuring a tailored and optimized response to exactly what you need.

## Thank you

### References
https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf?ref=gptechblog.com

https://www.youtube.com/watch?v=2IK3DFHRFfw

https://towardsdatascience.com/what-is-generative-ai-a-comprehensive-guide-for-everyone-8614c0d5860c

https://pub.towardsai.net/prompt-engineering-best-practices-text-transforming-translation-d60858f86b85


MIT License

Copyright (c) 2024 Divyesh Singh Rajput

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.