# Use the data generator to create a file of sample data

<!--
  ~ Licensed to the Apache Software Foundation (ASF) under one
  ~ or more contributor license agreements.  See the NOTICE file
  ~ distributed with this work for additional information
  ~ regarding copyright ownership.  The ASF licenses this file
  ~ to you under the Apache License, Version 2.0 (the
  ~ "License"); you may not use this file except in compliance
  ~ with the License.  You may obtain a copy of the License at
  ~
  ~   http://www.apache.org/licenses/LICENSE-2.0
  ~
  ~ Unless required by applicable law or agreed to in writing,
  ~ software distributed under the License is distributed on an
  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  ~ KIND, either express or implied.  See the License for the
  ~ specific language governing permissions and limitations
  ~ under the License.
  -->
This notebook walks through creating files of sample data with the [Imply Data Generator](https://github.com/implydata/druid-datagenerator). Once a file is generated, you can download it or have Druid read it over HTTP, which lets you ingest the generated data directly in a batch ingestion.

## Prerequisites

Launch this tutorial and all prerequisites using the `all-services` profile of the Docker Compose file for Jupyter-based Druid tutorials. For more information, see the Learn Druid repository [readme](https://github.com/implydata/learn-druid).

## Initialization

Run the next cell to set up the connection to the data generator.

In [1]:
import requests
import json

# Base URL of the data generator service and the headers sent with JSON requests.
datagenUrl = "http://datagen:9999"
datagenHeaders = {'Content-Type': 'application/json'}

## Generate simulated data

Run the following cell to create the JSON object that you pass to the `start` endpoint to start data generation.

In [2]:
job_name = "gen_clickstream1"

datagen_request = {
    "name": job_name,
    "target": {"type": "file", "path": "clicks.json"},
    "config_file": "clickstream/clickstream.json",
    "time_type": "1969-07-20 20:17:40",
    "time": "15s",
    "concurrency": 5
}

The JSON object contains:

* `name`: a job identifier.
* `target`: the type and file name of the output file.
* `config_file`: the configuration file that defines the simulation to run.
* `time_type`: the start timestamp for the data.
* `time`: the period of time the data covers.
* `concurrency`: the number of state machines running the simulation.
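
The fields listed above can be assembled by a small helper, which makes it easier to reuse the same structure for several jobs. This is a minimal sketch: `build_datagen_request` is a hypothetical name, and the two validation checks are assumptions for illustration, not rules taken from the data generator itself.

```python
import json


def build_datagen_request(name, path, config_file, start_time, duration, concurrency=1):
    """Return a request body for the `start` endpoint (hypothetical helper)."""
    # Assumed sanity checks; the generator may apply different rules.
    if not name:
        raise ValueError("name must be a non-empty job identifier")
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    return {
        "name": name,
        "target": {"type": "file", "path": path},
        "config_file": config_file,
        "time_type": start_time,
        "time": duration,
        "concurrency": concurrency,
    }


# Rebuild the same request used in this notebook.
example_request = build_datagen_request(
    name="gen_clickstream1",
    path="clicks.json",
    config_file="clickstream/clickstream.json",
    start_time="1969-07-20 20:17:40",
    duration="15s",
    concurrency=5,
)
print(json.dumps(example_request, indent=2))
```

The resulting dictionary has the same keys and values as `datagen_request`, so it can be posted to the `start` endpoint in the same way.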

Run the following cell to start a simulation by sending the `datagen_request` configuration to the `start` endpoint.

In [3]:
response = requests.post(f"{datagenUrl}/start", data=json.dumps(datagen_request), headers=datagenHeaders)
response.json()

{'message': 'Starting generator for request.',
 'request': {'name': 'gen_clickstream1',
  'target': {'type': 'file', 'path': '/files/clicks.json'},
  'config_file': 'clickstream/clickstream.json',
  'time_type': '1969-07-20 20:17:40',
  'time': '15s',
  'concurrency': 5}}

Use the `status` endpoint to retrieve the status of the simulation:

In [4]:
display(requests.get(f"{datagenUrl}/status/{job_name}").json())

{'name': 'gen_clickstream1',
 'config_file': 'clickstream/clickstream.json',
 'target': {'type': 'file', 'path': '/files/clicks.json'},
 'active_sessions': 0,
 'total_records': 11,
 'start_time': '1969-07-20 20:17:40',
 'run_time': 54.948487,
 'status': 'COMPLETE',
 'status_msg': 'Running, Sim Clock: 1969-07-20 20:18:34.949487'}

Rerun the cell above until `status` shows `COMPLETE`.
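
Instead of rerunning the cell by hand, the check can be wrapped in a polling loop. The sketch below is one possible approach, not part of the data generator: `wait_for_job` is a hypothetical helper that takes any callable returning the status dictionary, and it only looks for the `COMPLETE` value shown above. How failed jobs are reported is not covered here, so the loop relies on a timeout.

```python
import time


def wait_for_job(fetch_status, poll_seconds=1.0, timeout_seconds=120.0):
    """Call fetch_status() until it reports COMPLETE or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        if status.get("status") == "COMPLETE":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job did not complete within {timeout_seconds} seconds")
        # Pause between polls so the server is not queried in a tight loop.
        time.sleep(poll_seconds)


# In this notebook, with `requests`, `datagenUrl`, and `job_name` already defined:
# final_status = wait_for_job(
#     lambda: requests.get(f"{datagenUrl}/status/{job_name}").json()
# )
```

Passing the request as a callable keeps the loop independent of the HTTP client, so it can be exercised without a running generator.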

## Retrieve the data

Use the `files` endpoint to list the files available on the server.

In [9]:
display(requests.get(f"{datagenUrl}/files").json())

['clicks.json']

Run the following cell to retrieve the file listed above with the `file` endpoint.

In [6]:
response = requests.get(datagenUrl + "/file/clicks.json")
response.raise_for_status()
data = response.content
print(data)

b'{"time":"1969-07-20T20:17:40.001","user_id":"2636","event_type":"login","client_ip":"127.229.143.133","client_device":"mobile","client_lang":"French","client_country":"Mexico","referrer":"facebook.com/referring-group","keyword":"None","product":"None"}\n{"time":"1969-07-20T20:17:40.019","user_id":"3882","event_type":"login","client_ip":"127.101.175.214","client_device":"mobile","client_lang":"Hindi","client_country":"Vietnam","referrer":"adserve.com","keyword":"None","product":"None"}\n{"time":"1969-07-20T20:17:40.172","user_id":"1597","event_type":"login","client_ip":"127.120.191.207","client_device":"tablet","client_lang":"French","client_country":"Japan","referrer":"facebook.com/referring-group","keyword":"None","product":"None"}\n{"time":"1969-07-20T20:17:40.235","user_id":"1536","event_type":"login","client_ip":"127.135.127.77","client_device":"mobile","client_lang":"English","client_country":"Vietnam","referrer":"unknown","keyword":"None","product":"None"}\n{"time":"1969-07-20T
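
The response body is newline-delimited JSON, with one record per line. As a minimal sketch, the bytes can be decoded into Python dictionaries for inspection. `parse_json_lines` is a hypothetical helper, and the `sample` value holds two records copied from the output above in place of `response.content`.

```python
import json
from collections import Counter


def parse_json_lines(raw):
    """Decode newline-delimited JSON bytes into a list of dictionaries."""
    text = raw.decode("utf-8")
    # Skip blank lines, such as the one left by a trailing newline.
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Two records copied from the output above, standing in for `response.content`.
sample = (
    b'{"time":"1969-07-20T20:17:40.001","user_id":"2636","event_type":"login",'
    b'"client_ip":"127.229.143.133","client_device":"mobile","client_lang":"French",'
    b'"client_country":"Mexico","referrer":"facebook.com/referring-group",'
    b'"keyword":"None","product":"None"}\n'
    b'{"time":"1969-07-20T20:17:40.019","user_id":"3882","event_type":"login",'
    b'"client_ip":"127.101.175.214","client_device":"mobile","client_lang":"Hindi",'
    b'"client_country":"Vietnam","referrer":"adserve.com",'
    b'"keyword":"None","product":"None"}\n'
)

records = parse_json_lines(sample)
print(len(records), "records")
print(Counter(record["client_country"] for record in records))
```

In the notebook, calling `parse_json_lines(data)` on the downloaded bytes would return one dictionary per generated event.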

## Summary

* The `start` endpoint begins a simulation and outputs a file based on a JSON configuration.
* The `status` endpoint reports the progress of a simulation.
* The `files` endpoint lists all available files.
* Use the `file` endpoint to get generated files.

## Learn more

* Read more about the data generator in the [official repository](https://github.com/implydata/druid-datagenerator).
* Try using the [HTTP input source](https://druid.apache.org/docs/latest/ingestion/input-sources#http-input-source) with `EXTERN` to read the generated data directly from Apache Druid.
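
As a starting point for the `EXTERN` suggestion, the sketch below builds a SQL-based ingestion statement that reads the generated file through the HTTP input source. It has not been run against a cluster and rests on several assumptions: Druid can reach the generator at `http://datagen:9999`, the target table name `clicks_demo` is hypothetical, and every column is declared as a string based on the sample records shown earlier.

```python
import json

# Assumption: Druid resolves the same host name that the notebook uses.
file_url = "http://datagen:9999/file/clicks.json"

input_source = {"type": "http", "uris": [file_url]}
input_format = {"type": "json"}

# Column names taken from the sample records; the string type is an assumption.
columns = [
    "time", "user_id", "event_type", "client_ip", "client_device",
    "client_lang", "client_country", "referrer", "keyword", "product",
]
row_signature = [{"name": column, "type": "string"} for column in columns]

# Every column except "time" is selected as-is; "time" becomes the primary timestamp.
select_list = ",\n  ".join(f'"{column}"' for column in columns if column != "time")

sql = f"""INSERT INTO "clicks_demo"
SELECT
  TIME_PARSE("time") AS "__time",
  {select_list}
FROM TABLE(
  EXTERN(
    '{json.dumps(input_source)}',
    '{json.dumps(input_format)}',
    '{json.dumps(row_signature)}'
  )
)
PARTITIONED BY DAY"""

print(sql)
```

The printed statement could then be submitted through Druid's SQL-based ingestion, for example from the web console, after adjusting the table name and column types to suit.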