<a href="https://colab.research.google.com/github/jjovalle99/fine-tune-llama2/blob/main/Generate_dataset_%7C_GPT_4_Completions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Setup

In [None]:
!pip install -q -U git+https://github.com/huggingface/datasets.git
!pip install -q langchain python-dotenv huggingface_hub
!pip install -q tiktoken
!pip install -q openai


In [None]:
DATASET = "jjovalle99/amazon_reviews_datathon_2023"

In [None]:
from datasets import load_dataset
from huggingface_hub import login

In [None]:
login()

#### Load the full dataset and generate a subset

In [None]:
# token=True reuses the credentials stored by login() above
dataset = load_dataset(DATASET, token=True, num_proc=100)

In [None]:
subset = dataset["train"].shuffle(seed=1399).select(range(500))
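
The `shuffle(seed=1399).select(range(500))` call above draws a reproducible random subset of 500 rows. The same idea can be sketched with only the standard library. This is an illustration of seeded sampling, not the permutation `datasets` produces, and `sample_indices` and the row count of 10,000 are hypothetical.

```python
import random
from typing import List


def sample_indices(n_rows: int, k: int, seed: int) -> List[int]:
    """Return k distinct row indices in [0, n_rows), reproducible for a given seed."""
    # A private generator keeps the global random state untouched.
    rng = random.Random(seed)
    return rng.sample(range(n_rows), k)


# Hypothetical dataset size; the real number of rows is not stated in this notebook.
indices = sample_indices(n_rows=10_000, k=500, seed=1399)
print(len(indices), len(set(indices)))
```

Fixing the seed means a rerun selects the same rows, so the prompts and the cost estimate further down stay comparable between runs.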

#### Define template

In [None]:
from langchain.output_parsers import PydanticOutputParser
from langchain.prompts import PromptTemplate
from pydantic import BaseModel, Field
from typing import List, Optional

In [None]:
class Response(BaseModel):
    sentiment: str = Field(description="Is the sentiment positive, negative or neutral?")
    topics: Optional[List[str]] = Field(description="What are the key topics associated with the sentiment? If this information is not found, output null.")
    entities: Optional[List[str]] = Field(description="Extract names of relevant entities, like companies or products. If this information is not found, output null.")

output_parser = PydanticOutputParser(pydantic_object=Response)

In [None]:
template = "Based on the following review, return the following information.\n\n\
sentiment: Is the sentiment positive, negative or neutral?\n\
topics: What are the key topics associated with the sentiment? \
If this information is not found, output null.\n\
entities: Extract names of relevant entities, like companies or products.\
If this information is not found, output null.\n\n\
The output should be in English independently of the review language. {format_instructions}\n\
review: \"{review}\""
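
The template has two placeholders: `{format_instructions}`, which is the same for every row, and `{review}`, which changes per row. A minimal sketch of that split using plain `str.format` is shown here. It is not LangChain's implementation, and `make_prompt_builder`, `demo_template` and the shortened instruction text are hypothetical.

```python
from typing import Callable


def make_prompt_builder(template: str, **fixed: str) -> Callable[..., str]:
    """Bind the values that never change and return a function that fills in the rest."""

    def build(**per_row: str) -> str:
        # Substituted values are inserted verbatim, so braces inside them are safe.
        return template.format(**fixed, **per_row)

    return build


# Shortened stand-in for the real template and format instructions.
demo_template = 'Return JSON. {format_instructions}\nreview: "{review}"'
build_prompt = make_prompt_builder(
    demo_template,
    format_instructions='Use the schema {"sentiment": "string"}.',
)
prompt = build_prompt(review="Great book")
print(prompt)
```

Binding the fixed part once mirrors what `partial_variables` is used for in the notebook: only the review text has to be supplied per row.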

In [None]:
# Example
prompt_template = PromptTemplate(
    template=template,
    input_variables=["review"],
    partial_variables={"format_instructions": output_parser.get_format_instructions()},
)
print(prompt_template.format_prompt(review=subset[0]["reviewText"]).text)

Based on the following review, return the following information.

sentiment: Is the sentiment positive, negative or neutral?
topics: What are the key topics associated with the sentiment? If this information is not found, output null.
entities: Extract names of relevant entities, like companies or products.If this information is not found, output null.

The output should be in English independently of the review language. The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
```
{"properties": {"sentiment": {"title": "Sentiment", "description": "Is the sentiment positive, negative or neutral?", "typ

#### Generate prompts for each row in the dataset

In [None]:
def create_prompts(examples: dict) -> dict:
    # Reuse the prompt_template built above instead of rebuilding it for every review.
    examples["prompt"] = [
        prompt_template.format_prompt(review=review).text
        for review in examples["reviewText"]
    ]
    return examples

In [None]:
subset = subset.map(create_prompts, batched=True)


#### Estimate the total cost with GPT-4

In [None]:
import tiktoken

In [None]:
encoding = tiktoken.encoding_for_model("gpt-4")
subset = subset.map(lambda examples: {"prompt_tokens": [len(encoding.encode(prompt)) for prompt in examples["prompt"]]}, batched=True)


We use the 8k-context GPT-4 model, i.e. `gpt-4`.

In [None]:
# Input cost: total prompt tokens at $0.03 per 1,000 tokens
total_input_tokens = sum(subset["prompt_tokens"])
total_input_cost = (total_input_tokens/1000) * 0.03
print(f"Total input cost: {total_input_cost} $")

Total input cost: 6.04326 $


In [None]:
# Upper bound on output cost: 500 requests x 100 max tokens each, at $0.06 per 1,000 tokens
total_output_max_cost = ((100*500)/1000) * 0.06
print(f"Total output max cost: {total_output_max_cost} $")

Total output max cost: 3.0 $


In [None]:
# total cost
total_cost = total_input_cost + total_output_max_cost
print(f"Total expected cost: {total_cost} $")

Total expected cost: 9.04326 $
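
The three cells above can be folded into one helper. The sketch below reuses the per-1,000-token prices from those cells. The prompt token count of 201,442 is inferred from the printed input cost (6.04326 / 0.03 * 1000), and `estimate_cost` is a hypothetical name.

```python
from typing import Tuple

INPUT_PRICE_PER_1K = 0.03   # $ per 1,000 prompt tokens, as used above
OUTPUT_PRICE_PER_1K = 0.06  # $ per 1,000 completion tokens, as used above


def estimate_cost(prompt_tokens: int, n_requests: int, max_tokens: int) -> Tuple[float, float, float]:
    """Return (input cost, worst-case output cost, total) in dollars."""
    input_cost = prompt_tokens / 1000 * INPUT_PRICE_PER_1K
    # Worst case: every request uses its full max_tokens budget.
    output_cost = n_requests * max_tokens / 1000 * OUTPUT_PRICE_PER_1K
    return input_cost, output_cost, input_cost + output_cost


# 201_442 is inferred from the printed input cost, not read from the dataset.
input_cost, output_cost, total = estimate_cost(201_442, n_requests=500, max_tokens=100)
print(f"input={input_cost:.5f} $, output<={output_cost:.2f} $, total<={total:.5f} $")
```

The output figure is an upper bound: short JSON responses will usually stop well before 100 tokens, so the real bill should come in lower.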


#### Get prompt outputs

In [None]:
from langchain.chat_models import ChatOpenAI
from langchain import LLMChain

In [None]:
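import os  # assumes the API key is available in the OPENAI_API_KEY environment variable
gpt = ChatOpenAI(model_name="gpt-4", max_tokens=100, openai_api_key=os.environ["OPENAI_API_KEY"])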

#### Test prompt

In [None]:
test_prompt = """Based on the following review, return the following information.

sentiment: Is the sentiment positive, negative or neutral?
topics: What are the key topics associated with the sentiment? If this information is not found, output null.
entities: Extract names of relevant entities, like companies or products.If this information is not found, output null.

The output should be in English independently of the review language. The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
```
{"properties": {"sentiment": {"title": "Sentiment", "description": "Is the sentiment positive, negative or neutral?", "type": "string"}, "topics": {"title": "Topics", "description": "What are the key topics associated with the sentiment? If this information is not found, output null.", "type": "array", "items": {"type": "string"}}, "entities": {"title": "Entities", "description": "Extract names of relevant entities, like companies or products. If this information is not found, output null.", "type": "array", "items": {"type": "string"}}}, "required": ["sentiment"]}
```
review: "Impresionate trabajo de esta fotografo por su tratamiento y delicadeza de la imagen que hace que este libro sea diferente y personal de Sally Mann ."""

In [None]:
llmchain = LLMChain(llm=gpt, prompt=prompt_template)

In [None]:
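# test_prompt is already fully formatted, so send it to the model directly;
# running it through llmchain would wrap the instructions a second time.
result = gpt.predict(test_prompt)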

In [None]:
result

'{"sentiment": "positive", "topics": ["work", "treatment", "delicacy", "image", "book"], "entities": ["Sally Mann"]}'
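
The model returns a raw JSON string, so it is worth checking that it parses before storing it. In the notebook this could probably be done with `output_parser.parse(result)`. The sketch below shows the same kind of check with only the standard library, and `ParsedResponse`, `parse_response` and `ALLOWED_SENTIMENTS` are hypothetical names.

```python
import json
from dataclasses import dataclass
from typing import List, Optional

ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}


@dataclass
class ParsedResponse:
    sentiment: str
    topics: Optional[List[str]] = None
    entities: Optional[List[str]] = None


def parse_response(raw: str) -> ParsedResponse:
    """Parse a JSON response and check that the sentiment is one of the expected labels."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on invalid JSON
    sentiment = str(data["sentiment"]).lower()
    if sentiment not in ALLOWED_SENTIMENTS:
        raise ValueError(f"unexpected sentiment: {sentiment!r}")
    # topics and entities are optional and may be null or missing.
    return ParsedResponse(sentiment, data.get("topics"), data.get("entities"))


raw = '{"sentiment": "positive", "topics": ["work", "treatment", "delicacy", "image", "book"], "entities": ["Sally Mann"]}'
parsed = parse_response(raw)
print(parsed.sentiment, parsed.entities)
```

Because `max_tokens=100` can cut a long response short, a truncated answer would fail in `json.loads`, which makes such rows easy to spot.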

#### Create a GPT-4 response for each prompt in the subset

In [None]:
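def create_response(examples: dict) -> dict:
    # llmchain applies prompt_template itself, so pass the raw review text;
    # passing the already-formatted prompt would wrap the instructions twice.
    examples["gpt_responses"] = [
        llmchain.run({"review": review}) for review in examples["reviewText"]
    ]
    return examples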

In [None]:
subset = subset.map(create_response, batched=True)
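
Running 500 sequential API calls inside `map` means one transient error can abort the whole run. A minimal sketch of retrying with exponential backoff follows, using only the standard library. `call_with_retries` is a hypothetical helper, and the flaky function merely simulates a failing request.

```python
import time
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")


def call_with_retries(
    call: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying on any exception with delays of base_delay * 1, 2, 4, ..."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 1
    while True:
        try:
            return call()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
            attempt += 1


# Simulated request that fails twice before succeeding.
attempts: Dict[str, int] = {"n": 0}


def flaky_call() -> str:
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("temporary failure")
    return '{"sentiment": "neutral"}'


delays: List[float] = []  # record the delays instead of actually sleeping
demo_result = call_with_retries(flaky_call, sleep=delays.append)
print(demo_result, delays)
```

Inside `create_response`, the call could be wrapped as `call_with_retries(lambda: llmchain.run({"review": review}))`. In practice one would probably catch only the client's rate-limit and timeout errors rather than every exception.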


#### Save resulting dataset

In [None]:
subset = subset.remove_columns([c for c in subset.features.keys() if c not in ["prompt", "gpt_responses"]])

In [None]:
subset.push_to_hub("gpt-4-responses-dataset-500", private=True)
