# Generate data with different simulation profiles

<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->

This notebook walks through how to specify which simulation to run in order to generate different data, using either a configuration supplied in the repository or your own configuration sent as part of the POST request to the data generator.

## Prerequisites

Launch this tutorial and all prerequisites using the `all-services` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid).

## Initialization

Run the next cell to set up the connection to the data generator.

In [None]:
import requests
import json

datagenUrl = "http://datagen:9999"
datagenHeaders = {'Content-Type': 'application/json'}

## Generate data using default configurations

The Data Generator repository contains a number of ready-made configurations, and the community is invited to contribute to them.

Use the `/list` endpoint to see the available configurations.

In [None]:
requests.get(f"{datagenUrl}/list").json()

Run the following cell to set up a JSON object with a simulation configuration that uses one of the configurations above.

In [None]:
job_name="20240707-iot"

datagen_request = {
    "name": job_name,
    "target": { "type": "file", "path":f"{job_name}.json"},
    "config_file": "iot/iot_twin.json", 
    "time_type": "1969-07-20 20:17:40",
    "time": "10s",
    "concurrency":10
}

Notice that the `config_file` property has been set to the configuration you will use.

Run the following cell to submit the request to the data generator, which creates a file using this configuration.

In [None]:
response = requests.post(f"{datagenUrl}/start", json.dumps(datagen_request), headers=datagenHeaders)

Run the following cell to peek at the generated data.

The file is retrieved in raw form using the `file` endpoint of the API. The binary data is decoded as UTF-8 before it is printed.

In [None]:
rawdata = requests.get(f"{datagenUrl}/file/{job_name}.json").content
print(rawdata.decode('utf-8'))
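If you want to work with the records rather than just print them, the decoded text can be parsed. The following is a minimal sketch, assuming the generated file holds one JSON object per line. The `parse_json_lines` helper and the `sample` bytes are hypothetical and stand in for the real `rawdata`; they are not actual generator output.

```python
import json

def parse_json_lines(raw_bytes):
    """Parse newline-delimited JSON bytes into a list of dicts.

    Assumption: each non-empty line is one complete JSON object.
    """
    records = []
    for line in raw_bytes.decode("utf-8").splitlines():
        line = line.strip()
        if line:  # skip blank lines, such as a trailing newline
            records.append(json.loads(line))
    return records

# Hypothetical sample standing in for `rawdata`; real rows will have other fields.
sample = (
    b'{"time": "1969-07-20T20:17:40", "value": 1}\n'
    b'{"time": "1969-07-20T20:17:41", "value": 2}\n'
)
records = parse_json_lines(sample)
print(len(records), "records; first record:", records[0])
```

In the notebook, you would pass `rawdata` instead of `sample`.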

## Generate data using custom configurations

Instead of using `config_file`, use `config` to send a simulation profile as part of the call to the data generator.
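The difference between the two request styles can be sketched with a small helper. This is only an illustration: `build_datagen_request` is a hypothetical function, not part of the data generator, and it assumes that a request should carry either `config_file` or `config` but not both. The inline configuration passed here is a placeholder.

```python
import json

def build_datagen_request(name, config_file=None, config=None,
                          duration="10s", time_type="SIM", concurrency=10):
    """Build a request body for the /start endpoint (hypothetical helper).

    Assumption: exactly one of `config_file` or `config` should be supplied.
    """
    if (config_file is None) == (config is None):
        raise ValueError("Pass exactly one of config_file or config")

    request = {
        "name": name,
        "target": {"type": "file", "path": f"{name}.json"},
        "time_type": time_type,
        "time": duration,
        "concurrency": concurrency,
    }
    if config_file is not None:
        # Refer to a configuration that ships with the generator.
        request["config_file"] = config_file
    else:
        # Send the whole simulation profile inline.
        request["config"] = config
    return request

file_request = build_datagen_request("20240707-iot", config_file="iot/iot_twin.json")
inline_request = build_datagen_request("20240707_myownthing", config={"emitters": []})
print(json.dumps(file_request, indent=4))
```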

Run the following cell to create a JSON object to send to the simulator.

In [None]:
gen_config = {
  "emitters": [
    {
      "name": "simple_record",
      "dimensions": [
        {
          "type": "string",
          "name": "random_string_column",
          "length_distribution": {
            "type": "constant",
            "value": 13
          },
          "cardinality": 0,
          "chars": "#.abcdefghijklmnopqrstuvwxyz"
        },
        {
          "type": "int",
          "name": "distributed_number",
          "distribution": {
            "type": "uniform",
            "min": 0,
            "max": 1000
          },
          "cardinality": 10,
          "cardinality_distribution": {
            "type": "exponential",
            "mean": 5
          }
        }
      ]
    }
  ],
  "interarrival": {
    "type": "constant",
    "value": 1
  },
  "states": [
    {
      "name": "state_1",
      "emitter": "simple_record",
      "delay": {
        "type": "constant",
        "value": 1
      },
      "transitions": [
        {
          "next": "state_1",
          "probability": 1.0
        }
      ]
    }
  ]
}

The configuration includes:

* The definition of an emitter called `simple_record` that includes two [dimensions](https://github.com/implydata/druid-datagenerator#dimensions-):
   * A column called `random_string_column` that generates random text.
   * A column called `distributed_number` that contains numbers with a [uniform distribution](https://github.com/implydata/druid-datagenerator#uniform-distribution).
* The definition of the [number of rows per second](https://github.com/implydata/druid-datagenerator#interarrival) in the `interarrival` section.
* A [state](https://github.com/implydata/druid-datagenerator#states-) definition for each state in the simulator's state machine. In this case, there is just one.

For more information, see the [documentation](https://github.com/implydata/druid-datagenerator#data-generator-configuration) in the main repository.
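Before posting a custom configuration, it can help to check that its parts refer to each other consistently. The sketch below is not part of the data generator. `check_simulation_config` is a hypothetical helper that assumes every state must name a defined emitter, every transition must point at a defined state (or at `stop`, assumed here to be a special terminal target), and the transition probabilities of a state should sum to 1. `example_config` is a trimmed copy of the configuration above with the dimensions left out.

```python
import math

def check_simulation_config(config):
    """Return a list of problems found in a configuration dict (hypothetical check)."""
    problems = []
    emitter_names = {emitter["name"] for emitter in config.get("emitters", [])}
    states = config.get("states", [])
    # Assumption: "stop" is accepted as a special terminal target.
    valid_targets = {state["name"] for state in states} | {"stop"}

    for state in states:
        if state["emitter"] not in emitter_names:
            problems.append(
                f"state {state['name']!r} uses unknown emitter {state['emitter']!r}"
            )
        total = 0.0
        for transition in state.get("transitions", []):
            total += transition["probability"]
            if transition["next"] not in valid_targets:
                problems.append(
                    f"state {state['name']!r} points at unknown state {transition['next']!r}"
                )
        # Assumption: probabilities of one state's transitions sum to 1.
        if not math.isclose(total, 1.0):
            problems.append(
                f"state {state['name']!r} has probabilities summing to {total}"
            )
    return problems

example_config = {
    "emitters": [{"name": "simple_record", "dimensions": []}],
    "interarrival": {"type": "constant", "value": 1},
    "states": [
        {
            "name": "state_1",
            "emitter": "simple_record",
            "delay": {"type": "constant", "value": 1},
            "transitions": [{"next": "state_1", "probability": 1.0}],
        }
    ],
}

print(check_simulation_config(example_config))  # prints [] because nothing is wrong
```

An empty list means the checks passed. It does not guarantee that the data generator will accept the configuration.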

Run the next cell to set the job name and create a request object that contains `gen_config`. The cell then prints the full request, so you can see everything that will be posted to the API.

In [None]:
job_name="20240707_myownthing"

datagen_request = {
    "name": job_name,
    "target": { "type": "file", "path":f"{job_name}.json"},
    "config": gen_config, 
    "time": "10s",
    "time_type": "SIM",
    "concurrency":10
}

print(json.dumps(datagen_request, indent=4))

Now run the cell below to post this full request to the data generator.

In [None]:
response = requests.post(f"{datagenUrl}/start", json.dumps(datagen_request), headers=datagenHeaders)
response.json()

Run the cell below to see the data that was generated.

In [None]:
rawdata = requests.get(f"{datagenUrl}/file/{job_name}.json").content
print(rawdata.decode('utf-8'))

## Summary

* Use `config_file` when running simulations with the default configurations.
* Use `config` when sending a custom generator configuration.
* Simulation profiles can include custom columns, states, and different time gaps between rows.

## Learn more

* Read more about the simulator profiles in the [official repository](https://github.com/implydata/druid-datagenerator).