# Dataset Generation
I've decided to proceed with `Path 1` from `task_ds.md`, i.e.:

>**Given weather description in plain text, generate a valid JSON file with weather conditions.**
>
>For example, given the following text:
>
>```
>The weather is sunny and the temperature is 20 degrees. The wind is blowing at 10 km/h.
>Citizens are advised to go out and enjoy the weather. The weather is expected to be sunny tomorrow.
>```
>
>The output can look like:
>
>```
>{
>    "weather": "sunny",
>    "temperature": "20 degrees",
>    "wind": "10 km/h",
>    "vibe": "enjoyable"
>}
>```

Consequently, I need a dataset of (`WEATHER DESCRIPTION`, `WEATHER ATTRIBUTES`) pairs, with the attributes in JSON format. I looked through publicly available datasets and weather APIs, but found only numerical data (temperature, humidity, etc.) accompanied by brief descriptions such as "partially cloudy".

As a result, I've decided to generate the descriptions synthetically using the Claude Haiku model.

I will focus on the following weather attributes:
* general weather
* temperature
* wind speed
* wind direction
* humidity

## Generation Process
1. For each weather attribute, generate **20** description templates. **15** of them will be used for training and **5** for validation.
2. Repeat each template **5** times within its attribute list, so each description is used at most **5** times.
3. Shuffle the repeated descriptions, then combine one description per attribute into each example.
4. For some of the examples, remove one of the descriptions (to allow for null values).
5. Shuffle and concatenate the descriptions in each example, so that attributes are mentioned in different orders across inputs.
6. Fill the templates with random values from predefined ranges.
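
As a rough sanity check on the numbers above, the expected dataset sizes follow from the template counts and the repeat limit. This is only a sketch; the lowercase names mirror the constants defined in the next section.

```python
num_generations = 20                   # templates generated per attribute
num_val = int(0.25 * num_generations)  # templates held out for validation
num_train = num_generations - num_val  # templates used for training
num_repeats = 5                        # how many times each template is reused

# One example is built per position after repeating each template,
# so each split has (templates * repeats) examples.
train_examples = num_train * num_repeats
val_examples = num_val * num_repeats
print(train_examples, val_examples)  # 75 25
```
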

### Imports
Uncomment and run the following cell if you are running this notebook in Google Colab.

In [None]:
# !pip install -r ../requirements.txt

In [3]:
import random
import os
from collections import Counter

import anthropic # 0.21.3
from datasets import Dataset


NUM_GENERATIONS = 20
NUM_VAL = int(0.25 * NUM_GENERATIONS)
NUM_TRAIN = NUM_GENERATIONS - NUM_VAL

RANDOM_STATE = 42
NUM_REPEATS = 5
PROB_OF_NULL = 0.3

ADD_DEGREES = lambda x: str(x) + " degrees"
ADD_PERC = lambda x: str(x) + "%"
ADD_SPEED_UNIT = lambda x: str(x) + " km/h"

NAMES_AND_RANGES = [
    ("temperature", {"<high-negative>": list(map(ADD_DEGREES, range(-50, -10 + 1))),
                     "<negative>": list(map(ADD_DEGREES, range(-10, 0 + 1))),
                     "<positive>": list(map(ADD_DEGREES, range(0, 15 + 1))),
                     "<high-positive>": list(map(ADD_DEGREES, range(15, 50 + 1)))}),
    ("wind_speed", {"<wind-speed>": list(map(ADD_SPEED_UNIT, range(0, 20 + 1)))}),
    ("wind_direction", {"<wind-direction>": ["South", "West", "East", "North", "NE", "NW", "SW", "SE"]}),
    ("humidity", {"<humidity>": list(map(ADD_PERC, range(20, 90 + 1)))}),
    ("weather", {"<weather>": ["sunny", "cloudy", "rainy", "snowy"]})
]

# Model details
MODEL_ID = "claude-3-haiku-20240307"
MAX_TOKENS = 1000
MODEL_CLIENT = client = anthropic.Anthropic(
    api_key=os.environ["ANTHROPIC_API_KEY"], # API key is required to use this model
)

# Output paths
TRAIN_DATA_OUTPUT_DIR = "hf_training"
VALIDATION_DATA_OUTPUT_DIR = "hf_validation"

#### Prompts for data generation

In [4]:
SYSTEM_PROMPT = "You are a weather TV presenter with 10 years of experience."

temperature_prompt = lambda num: f"""\
Generate at least {num} templates for presenting current temperature for today.
Instead of writing temperature directly, leave <high-negative>, <negative>, <positive>, <high-positive> to be filled.
Return those templates separated with a single \n."""

wind_prompt = lambda num: f"""\
Generate at least {num} templates for presenting wind speed and direction for today.
Instead of writing wind speed and direction directly, leave <wind-speed>, <wind-direction> to be filled.
Return those templates separated with a single \n."""

humidity_prompt = lambda num: f"""\
Generate at least {num} templates for presenting current humidity for today.
Instead of writing humidity directly, leave <humidity> to be filled.
Return those templates separated with a single \n."""

general_weather_prompt = lambda num: f"""\
Generate at least {num} templates for presenting current weather (sun, clouds, rain, snow and other) for today.
Instead of writing the weather directly, leave <weather> to be filled.
Return those templates separated with a single \n."""

In [5]:
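def run_model(prompt: str) -> str:
    """Send a single user prompt to the model and return the text of its first content block."""
    response = MODEL_CLIENT.messages.create(
        model=MODEL_ID,
        max_tokens=MAX_TOKENS,
        temperature=0,  # lowest-randomness sampling, for more repeatable templates
        system=SYSTEM_PROMPT,
        messages=[
            {
                "role": "user",
                "content": [{"type": "text", "text": prompt}],
            }
        ],
    )
    return response.content[0].text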


Generate descriptions

In [9]:
example_descriptions = []
for prompt in [temperature_prompt, wind_prompt, humidity_prompt, general_weather_prompt]:
    output = [line for line in run_model(prompt(NUM_GENERATIONS)).split("\n") if "<" in line]
    example_descriptions.append(output)
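
The filter above keeps any line that contains `<`. A slightly stricter variant, sketched below, keeps only lines that contain at least one of the expected placeholders and strips list numbering the model might add. The `parse_templates` name and the numbering cleanup are assumptions for illustration, not part of the original notebook.

```python
import re


def parse_templates(raw: str, placeholders: list[str]) -> list[str]:
    """Split raw model output into template lines that use a known placeholder."""
    templates = []
    for line in raw.split("\n"):
        # Drop leading list markers such as "1. ", "2) " or "- ".
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*])\s+", "", line).strip()
        if any(p in cleaned for p in placeholders):
            templates.append(cleaned)
    return templates


# Made-up model output with an intro line, numbering and a blank line.
raw_output = (
    "Here are the templates:\n"
    "1. Humidity is at <humidity> today.\n"
    "\n"
    "- Expect <humidity> humidity."
)
print(parse_templates(raw_output, ["<humidity>"]))
# ['Humidity is at <humidity> today.', 'Expect <humidity> humidity.']
```
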

In [10]:
for desc in example_descriptions:
    print(len(desc))

20
20
20
20


In [11]:
example_descriptions

[["It's a <high-negative> degree day out there today.",
  'The temperature is sitting at a chilly <negative> degrees.',
  "We're looking at a <positive> degree temperature for the day.",
  "Brace yourselves, it's a scorching <high-positive> degree day.",
  'The mercury is reading <negative> degrees as we speak.',
  'Wrap up warm, the temperature is only <high-negative> degrees.',
  "It's a comfortable <positive> degrees outside today.",
  "You'll want to stay cool, the temperature has reached <high-positive> degrees.",
  'Bundle up, the current temperature is <high-negative> degrees.',
  "It's a mild <positive> degree day today.",
  'The temperature has soared to a sweltering <high-positive> degrees.',
  "We're seeing a <negative> degree chill in the air today.",
  'Enjoy the warmth, the temperature is a pleasant <positive> degrees.',
  "Whew, it's a sizzling <high-positive> degree day out there.",
  'The thermometer is showing <high-negative> degrees at the moment.',
  "It's a <positi

Split the data

In [12]:
training, validation = [], []
for attribute in example_descriptions:
    training.append(attribute[:NUM_TRAIN])
    validation.append(attribute[NUM_TRAIN:])
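
Because the split is a plain slice, the first `NUM_TRAIN` templates of each attribute go to training and the rest to validation, so no template is shared between the two sets (provided the model returned no duplicate lines). A toy illustration with made-up templates:

```python
num_train = 3  # stand-in for NUM_TRAIN

# Hypothetical templates for a single attribute.
templates = ["t0 <humidity>", "t1 <humidity>", "t2 <humidity>", "t3 <humidity>"]

train_part = templates[:num_train]
val_part = templates[num_train:]

print(train_part)  # ['t0 <humidity>', 't1 <humidity>', 't2 <humidity>']
print(val_part)    # ['t3 <humidity>']
# The two parts do not overlap as long as the templates are unique.
print(set(train_part) & set(val_part))  # set()
```
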

Create synthetic examples

In [13]:
def create_weather_descriptions(
    examples: list[list[str]],
    names_and_ranges: list[tuple[str, dict]] = NAMES_AND_RANGES,
    num_repeats: int = NUM_REPEATS,
    null_probability: float = PROB_OF_NULL,
    seed: int = RANDOM_STATE
) -> list[dict]:
    """Build (description, labels) records from per-attribute template lists."""
    # set the seed
    random.seed(seed)

    # repeat descriptions and shuffle them; build new lists so that
    # the caller's `examples` is not modified in place
    repeated = []
    for attribute in examples:
        repeated_examples = [desc for desc in attribute for _ in range(num_repeats)]
        random.shuffle(repeated_examples)
        repeated.append(repeated_examples)

    # join them in random order and create nulls
    joined_attributes = []
    for attributes in zip(*repeated):
        if random.random() < null_probability:
            attributes = random.sample(attributes, len(attributes) - 1) # removes one attribute

        attributes = list(attributes)
        random.shuffle(attributes)
        attributes = " ".join(attributes)

        labels = {}
        for name, ranges in names_and_ranges:
            labels[name] = None
            for range_key, range_values in ranges.items():
                if range_key in attributes:
                    sampled_value = random.choice(range_values)
                    labels[name] = sampled_value
                    attributes = attributes.replace(range_key, str(sampled_value))

        joined_attributes.append({"weather_description": attributes, "weather_conditions": labels})


    return joined_attributes
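
One way to sanity-check the generated records is to confirm that no placeholder is left unfilled and that every non-null label value appears verbatim in the description. The `check_example` helper below is a hypothetical addition, shown on a hand-written record in the same shape as the function's output.

```python
import re


def check_example(example: dict) -> list[str]:
    """Return a list of problems found in one generated record (empty if none)."""
    problems = []
    description = example["weather_description"]
    # Any remaining "<...>" token means a template placeholder was not filled.
    leftover = re.findall(r"<[a-z-]+>", description)
    if leftover:
        problems.append(f"unfilled placeholders: {leftover}")
    # Every non-null label should be recoverable from the text.
    for name, value in example["weather_conditions"].items():
        if value is not None and value not in description:
            problems.append(f"{name}={value!r} missing from description")
    return problems


# Hand-written record with one deliberately unfilled placeholder.
record = {
    "weather_description": "The humidity is 27%. It's a sunny day. Wind at <wind-speed>.",
    "weather_conditions": {"humidity": "27%", "weather": "sunny", "wind_speed": None},
}
print(check_example(record))  # ["unfilled placeholders: ['<wind-speed>']"]
```
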


In [14]:
training = create_weather_descriptions(training)
validation = create_weather_descriptions(validation)

In [15]:
len(training)

75

In [16]:
training[:5]

[{'weather_description': "It's a mild 10 degrees degree day today. The humidity metric for today is 27%. It's a sunny kind of day today.",
  'weather_conditions': {'temperature': '10 degrees',
   'wind_speed': None,
   'wind_direction': None,
   'humidity': '27%',
   'weather': 'sunny'}},
 {'weather_description': "Looks like we're in for snowy conditions today. The current humidity figure for today is 50%. Brace yourselves, it's a scorching 26 degrees degree day. Winds will be a factor today, with speeds of 2 km/h and a West direction.",
  'weather_conditions': {'temperature': '26 degrees',
   'wind_speed': '2 km/h',
   'wind_direction': 'West',
   'humidity': '50%',
   'weather': 'snowy'}},
 {'weather_description': "The humidity level today is currently sitting at 53%. Expect the winds to be quite strong, with speeds up to 18 km/h out of the NW. You'll want to stay cool, the temperature has reached 41 degrees degrees.",
  'weather_conditions': {'temperature': '41 degrees',
   'wind_sp
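
In the samples above the unit appears twice ("10 degrees degree day", "41 degrees degrees"), because the temperature templates already contain the word "degree(s)" and the filled value also carries the unit. One possible fix, sketched here as an assumption rather than what this notebook does, is to insert only the number into the text while keeping the unit in the label. The `fill_temperature` helper is hypothetical.

```python
import random


def fill_temperature(
    template: str, placeholder: str, values: range, rng: random.Random
) -> tuple[str, str]:
    """Fill `placeholder` with a bare number and return (text, label with unit)."""
    number = rng.choice(values)
    text = template.replace(placeholder, str(number))
    label = f"{number} degrees"  # the unit lives only in the label
    return text, label


rng = random.Random(42)
text, label = fill_temperature(
    "It's a mild <positive> degree day today.", "<positive>", range(0, 15 + 1), rng
)
print(text)
print(label)
```

With this approach the description reads "... N degree day ..." while the label stays "N degrees", so the number in the label can still be matched against the text.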

### Saving the datasets
I'll save them in the Hugging Face `datasets` format, since I'll be using the Hugging Face API during the model fine-tuning phase anyway.

In [17]:
hf_training = Dataset.from_list(training)
hf_validation = Dataset.from_list(validation)

hf_training.save_to_disk(TRAIN_DATA_OUTPUT_DIR)
hf_validation.save_to_disk(VALIDATION_DATA_OUTPUT_DIR)

Saving the dataset (0/1 shards):   0%|          | 0/75 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/25 [00:00<?, ? examples/s]