# Generating Synthetic Data

- There's no hard-and-fast rule for how much data you need before you start fine-tuning.
- I've seen OK results with as few as 1,000 examples, but I usually gather as much data as is reasonable to acquire.
- I ended up generating ~30k examples at the start for an initial run-through.

## My Prompt

The first part is the same, but look at what comes after the horizontal rule below (starting with the text "You are given the following three inputs"):

```
Honeycomb is an observability platform that allows you to write queries to inspect trace data.
The specification of the Honeycomb query language is as follows:

QUERY SPEC:
All top-level keys are optional.

```json
"calculations":[
    // ops: COUNT, CONCURRENCY, COUNT_DISTINCT, HEATMAP, SUM, AVG, MAX, MIN, P001, P01, P05, P10, P25, P50, P75, P90, P95, P99, P999, RATE_AVG, RATE_SUM, RATE_MAX
    {"op": "COUNT"},// COUNT and CONCURRENCY are just op
    {"op": "HEATMAP", "column": "name"}
],
"filters":[
    // ops: =, !=, >, >=, <, <=, starts-with, does-not-start-with, exists, does-not-exist, contains, does-not-contain, in, not-in
    {"column": "name", "op": "exists"}, // exists and does-not-exist ops only have column
    {"column": "name", "op": "=", "value": "something"}
],
"filter_combination": "AND", // AND or OR
"breakdowns":[
    // columns in COLUMNS
    "column1","column2"
],
"orders":[
    // HEATMAP not allowed
    // Must come from breakdowns or calculations
    {"op": "op_in_calculation", "column": "column_in_calculation", "order": "ascending"},
    {"op": "COUNT", "order": "descending"}, // COUNT and CONCURRENCY have no column
    {"column": "column1", "order": "descending"},
],
"havings":[
    // HEATMAP not allowed
    {"calculate_op": "op_in_calculation", "column": "name", "op": "OPNAME", "value": 100},
    {"calculate_op": "COUNT", "op": ">", "value": 10}, // COUNT and CONCURRENCY have no column
],
"time_range": 7200, // Relative time range in seconds.
"start_time": 1234567890, // UNIX timestamp
"end_time": 1234567890, // UNIX timestamp
```

Here are some examples of how you would translate a natural language query(NLQ) into a Honeycomb Query:

NLQ: Exception count by exception and caller
Query:
{"breakdowns":["exception.message","parent_name"],"calculations":[{"op":"COUNT"}],"filters":[{"column":"exception.message","op":"exists","join_column":""},{"column":"parent_name","op":"exists","join_column":""}],"orders":[{"op":"COUNT","order":"descending"}],"time_range":7200}

NLQ: Error count
Query:
{"breakdowns":["error"],"calculations":[{"op":"COUNT"}],"filters":[{"column":"error","op":"=","value":true,"join_column":""}],"orders":[{"op":"COUNT","order":"descending"}],"time_range":7200}

NLQ: Error rate
Query:
{"breakdowns":["error"],"calculations":[{"op":"COUNT"}],"orders":[{"op":"COUNT","order":"descending"}],"time_range":7200}

NLQ: Slow requests
Query:
{"breakdowns":["http.route"],"calculations":[{"column":"duration_ms","op":"HEATMAP"},{"column":"duration_ms","op":"MAX"}],"filters":[{"column":"trace.parent_id","op":"does-not-exist","join_column":""}],"orders":[{"column":"duration_ms","op":"MAX","order":"descending"}],"time_range":7200}

---
You are given the following three inputs: (1) NLQ, (2) A list of candidate columns that are allowed to be in the query, and (3) The query. 
Your goal is to generate correct variations of the combination of NLQ, candidate columns and query to build a synthetic dataset that is a valid representation of
the Honeycomb Query Language.  You can build synthetic data by re-wording the query and/or substituting a column name in both the query and candidate column lists.
Your response should be in json with the following three keys: "nlq", "cols", and "query".  Furthermore, the modified query should be a similar complexity as the original query, and the list of columns should be unchanged EXCEPT for the renamed column (the length of candidate columns should be the same).

NLQ: solver_svc.get_gmv_previsto_grupos
    
COLUMNS: ['service_name', 'model_name', 'pagarme_operation', 'motorista', 'service.name', 'django.view_func', 'client', 'objects_created', 'db.query', 'squad', 'torre-path', 'lazy', 'model', 'db.rows_affected', 'company', 'trace.parent_id', 'event_id', 'duration_ms', 'db.query_args', 'app_name', 'efox', 'app.exception_stacktrace', 'bulk_task.calls', 'rollup.objects_created', 'db.error', 'meta.type', 'correlation_id', 'efopops-path', 'request.query', 'db.query_short', 'environment', 'meta.beeline_version', 'db.error_detail', 'request.error_detail', 'db.last_insert_id', 'trace.trace_id', 'request.error', 'name', 'request.user_id', 'request.secure', 'request.method', 'db.total_duration', 'request.url', 'request.scheme', 'trace.span_id', 'request.user_agent', 'db.duration', 'db.call_count', 'rollup.celery.calls', 'type', 'request.path', 'response.status_code']

QUERY: {"breakdowns":["name"],"filters":[{"column":"name","op":"=","value":"solver_svc.get_gmv_previsto_grupos"}],"calculations":[{"column":"duration_ms","op":"HEATMAP"}]}

You should make small coherent changes and return the data as a json with three keys: "nlq", "cols", and "query".
```
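The prompt puts several constraints on each response: three JSON keys, a column list of unchanged length with at most one renamed column, and a query that follows the spec. Checking these mechanically before keeping an example is cheap. The following is a minimal sketch of such a check, not the pipeline used for the original run. The names `check_variation`, `query_columns`, `CALC_OPS`, and `FILTER_OPS` are hypothetical, the operator lists are copied from the QUERY SPEC above, and the check is deliberately incomplete (for example, it does not verify that `orders` come from `breakdowns` or `calculations`).

```python
import json

# Operator lists copied from the QUERY SPEC in the prompt above.
CALC_OPS = {
    "COUNT", "CONCURRENCY", "COUNT_DISTINCT", "HEATMAP", "SUM", "AVG", "MAX",
    "MIN", "P001", "P01", "P05", "P10", "P25", "P50", "P75", "P90", "P95",
    "P99", "P999", "RATE_AVG", "RATE_SUM", "RATE_MAX",
}
FILTER_OPS = {
    "=", "!=", ">", ">=", "<", "<=", "starts-with", "does-not-start-with",
    "exists", "does-not-exist", "contains", "does-not-contain", "in", "not-in",
}


def query_columns(query):
    """Collect every column name referenced anywhere in a query dict."""
    cols = set(query.get("breakdowns", []))
    for key in ("calculations", "filters", "orders", "havings"):
        for clause in query.get(key, []):
            if "column" in clause:
                cols.add(clause["column"])
    return cols


def check_variation(raw, original_cols):
    """Return a list of problems in one generated example (empty list = passed)."""
    try:
        example = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(example, dict):
        return ["response is not a JSON object"]

    missing = {"nlq", "cols", "query"} - set(example)
    if missing:
        return [f"missing keys: {sorted(missing)}"]

    cols, query = example["cols"], example["query"]
    # Assumption: the model may return the query as a JSON string or an object.
    if isinstance(query, str):
        try:
            query = json.loads(query)
        except json.JSONDecodeError:
            return ["query is not valid JSON"]
    if not isinstance(cols, list) or not isinstance(query, dict):
        return ["cols must be a list and query must be an object"]

    problems = []
    # The prompt asks for the same number of columns, with at most one renamed.
    if len(cols) != len(original_cols):
        problems.append("column list length changed")
    if len(set(cols) - set(original_cols)) > 1:
        problems.append("more than one column was renamed")

    unknown = query_columns(query) - set(cols)
    if unknown:
        problems.append(f"query uses columns not in cols: {sorted(unknown)}")

    for calc in query.get("calculations", []):
        if calc.get("op") not in CALC_OPS:
            problems.append(f"unknown calculation op: {calc.get('op')}")
    for flt in query.get("filters", []):
        if flt.get("op") not in FILTER_OPS:
            problems.append(f"unknown filter op: {flt.get('op')}")

    # The spec forbids HEATMAP in both orders and havings.
    for order in query.get("orders", []):
        if order.get("op") == "HEATMAP":
            problems.append("HEATMAP not allowed in orders")
    for having in query.get("havings", []):
        if having.get("calculate_op") == "HEATMAP":
            problems.append("HEATMAP not allowed in havings")
    return problems
```

A response would be kept only when `check_variation(response_text, original_cols)` returns an empty list. Returning every problem found, rather than failing on the first one, makes it easier to see which constraints the model breaks most often.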