# Workload Generation Examples


### Python Path

This notebook requires the `mdbrtools` package to be on the Python path.
The code block below is only needed if you cloned the GitHub repository and did not install `mdbrtools` via pip.

Alternatively, you can install `mdbrtools` in editable (development) mode by running `pip install -e .` from the repository root.


In [2]:
import sys

sys.path.append("..")

## Simple Workload Generation

The Workload Generator takes as input either a `MongoCollection` object (see `./mdbrtools/mongodb.py`) or a list of Python dictionaries.

### Loading JSON Data

In this example, we load a JSON file and parse it into a list of dictionaries.


In [1]:
import json
from pprint import pprint

# load example data
with open("./example_docs.json") as f:
    docs = json.load(f)

# print some example docs
pprint(docs[:3])

[{'age': 30, 'email': 'alice@example.com', 'id': 1, 'name': 'Alice'},
 {'age': 25,
  'email': 'bob@example.com',
  'id': 2,
  'name': 'Bob',
  'phone': '555-1234'},
 {'email': 'charlie@example.com',
  'id': 3,
  'name': 'Charlie',
  'preferences': {'newsletter': True, 'notifications': ['email', 'sms']}}]


### Generating a random workload


In [5]:
from mdbrtools.workload import Workload

# generate a random workload
workload = Workload()
workload.generate(docs, num_queries=10)

# print queries as MQL syntax
print("\nWorkload:")
for query in workload:
    print(query.to_mql())

Parsing schema: 100%|██████████| 10/10 [00:00<00:00, 50533.78it/s]


Generating workload: 100%|██████████| 10/10 [00:00<00:00, 6631.31it/s]


Workload:
{'address.street': '456 Oak St', 'email': {'$exists': True}, 'address.zipcode': {'$ne': '67890'}, 'age': {'$gte': 30}, 'address.city': 'Somewhere', 'name': {'$in': ['Alice', 'Frank', 'Judy', 'Charlie', 'Eve', 'Heidi']}, 'id': {'$lte': 6}}
{'name': 'Ivan', 'hobbies': {'$in': ['hiking', 'photography', 'cycling', 'gaming']}, 'phone': {'$lte': '555-1234'}, 'email': {'$ne': 'ivan@example.com'}, 'age': {'$gt': 40}, 'id': {'$lt': 9}}
{'preferences.newsletter': {'$ne': False}, 'id': {'$gt': 3}, 'age': {'$exists': False}, 'address.city': 'Anywhere', 'hobbies': {'$size': 3}, 'address.street': {'$lte': '123 Elm St'}}
{'address.city': {'$nin': ['Anywhere', 'Elsewhere']}, 'email': 'charlie@example.com', 'hobbies': {'$gte': 'cycling'}, 'preferences.notifications': {'$lte': 'email'}, 'address.zipcode': {'$gte': '12345'}, 'id': {'$lte': 6}, 'preferences.newsletter': True, 'phone': {'$in': ['555-8765']}, 'age': {'$gte': 40}}
{'hobbies': {'$gte': 'photography'}, 'preferences.newsletter': Fals
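
The queries above address nested fields with dot notation (for example `preferences.newsletter`). As a rough illustration of how nested documents map to such dotted paths, here is a minimal sketch. The `flatten` helper is hypothetical and is not how `mdbrtools` parses the schema; it also keeps lists as leaf values for simplicity.

```python
def flatten(doc, prefix=""):
    """Map a nested document to {dotted_path: value}.

    Hypothetical helper for illustration only (not part of mdbrtools).
    Nested dicts are expanded; lists are kept as leaf values.
    """
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # descend into sub-documents and extend the dotted path
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


# third example document from the printed output above
doc = {
    "id": 3,
    "name": "Charlie",
    "email": "charlie@example.com",
    "preferences": {"newsletter": True, "notifications": ["email", "sms"]},
}

flat = flatten(doc)
print(sorted(flat))
```

The resulting keys are the field names a query can use as predicates on this document.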




### Generating a workload with restrictions

In the next example, we force exactly 2 predicates per query and limit the allowed fields to `address.street` and `preferences.newsletter`.

For more fine-grained control over the generated queries, modify the `operator_config` object. You can:

- define which query operators (such as `$in`, `$lt`, ...) are allowed for particular data types
- choose the probabilities (`weights`) with which these operators are selected
- enable or disable the special operators `$exists`, `$type` and `$size`, and set the probability (`chance`) with which each is selected.

See the `DEFAULT_OPERATOR_CONFIG` in `./mdbrtools/workload.py`.
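
To make the idea of per-type operators and `weights` concrete, here is a minimal sketch of weighted operator selection. The `example_config` layout and the `pick_operator` helper are assumptions made up for this illustration; they are not the actual structure of `DEFAULT_OPERATOR_CONFIG`, so check `./mdbrtools/workload.py` for the real keys.

```python
import random

# Hypothetical layout for illustration only; NOT the real DEFAULT_OPERATOR_CONFIG.
example_config = {
    "string": {"operators": ["$eq", "$ne", "$in"], "weights": [3, 1, 1]},
    "number": {"operators": ["$eq", "$gt", "$lt"], "weights": [1, 2, 2]},
}


def pick_operator(data_type, rng):
    """Pick one operator for a data type, proportionally to its weight."""
    entry = example_config[data_type]
    return rng.choices(entry["operators"], weights=entry["weights"], k=1)[0]


rng = random.Random(42)  # seeded so repeated runs give the same sequence

# count how often each operator is chosen for string fields
counts = {}
for _ in range(1000):
    op = pick_operator("string", rng)
    counts[op] = counts.get(op, 0) + 1

print(counts)
```

With weights of 3, 1 and 1, `$eq` should be chosen for roughly 60% of string predicates in this sketch.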


In [6]:
from mdbrtools.workload import Workload

# generate a random workload
workload = Workload()
workload.generate(
    docs,
    num_queries=10,
    min_predicates=2,
    max_predicates=2,
    allowed_fields=["address.street", "preferences.newsletter"],
)

# print queries as MQL syntax
print("\nWorkload:")
for query in workload:
    print(query.to_mql())

Parsing schema: 100%|██████████| 10/10 [00:00<00:00, 70492.50it/s]
Generating workload: 100%|██████████| 10/10 [00:00<00:00, 38657.18it/s]


Workload:
{'address.street': {'$ne': '456 Oak St'}, 'preferences.newsletter': False}
{'address.street': {'$ne': '123 Elm St'}, 'preferences.newsletter': False}
{'preferences.newsletter': False, 'address.street': {'$ne': '123 Elm St'}}
{'address.street': '123 Elm St', 'preferences.newsletter': False}
{'preferences.newsletter': False, 'address.street': '789 Pine St'}
{'address.street': {'$nin': ['789 Pine St']}, 'preferences.newsletter': True}
{'preferences.newsletter': False, 'address.street': '123 Elm St'}
{'address.street': {'$in': ['123 Elm St', '789 Pine St']}, 'preferences.newsletter': True}
{'address.street': '123 Elm St', 'preferences.newsletter': True}
{'preferences.newsletter': True, 'address.street': '456 Oak St'}
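
As a sanity check, the restrictions can be verified on the generated queries. The sketch below works on plain dictionaries copied from the output above and assumes each top-level key corresponds to one predicate, which holds for these queries. The `satisfies_restrictions` helper is hypothetical and not part of `mdbrtools`.

```python
allowed_fields = {"address.street", "preferences.newsletter"}

# a few queries copied from the output above
queries = [
    {"address.street": {"$ne": "456 Oak St"}, "preferences.newsletter": False},
    {"preferences.newsletter": False, "address.street": "789 Pine St"},
    {"address.street": {"$nin": ["789 Pine St"]}, "preferences.newsletter": True},
    {"address.street": {"$in": ["123 Elm St", "789 Pine St"]}, "preferences.newsletter": True},
]


def satisfies_restrictions(query, allowed, min_predicates, max_predicates):
    """Check predicate count and field names (hypothetical helper).

    Assumes one predicate per top-level key of the query dictionary.
    """
    has_valid_count = min_predicates <= len(query) <= max_predicates
    uses_allowed_fields = set(query) <= allowed
    return has_valid_count and uses_allowed_fields


results = [satisfies_restrictions(q, allowed_fields, 2, 2) for q in queries]
print(all(results))
```

In a live notebook, the same check could be applied to `query.to_mql()` for each query in `workload`.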





### Enforcing Selectivity of queries

Minimum and maximum selectivity constraints can be enforced, but this is currently only supported when connecting to a live MongoDB instance, not when passing in a list of dictionaries.

In addition, you need to pass an `estimator` object to the `Workload.generate()` method.

The example code below demonstrates this. It requires a MongoDB instance running locally on port `27017` with the above dataset loaded into the `test.example_docs` collection.

You can import the data with the `mongoimport` tool:

```bash
mongoimport notebooks/example_docs.json --jsonArray -d test -c example_docs --drop
```


In [15]:
from mdbrtools.estimator import SampleEstimator
from mdbrtools.mongodb import MongoCollection
from mdbrtools.workload import Workload

# create MongoCollection wrapper
collection = MongoCollection("mongodb://localhost:27017", "test", "example_docs")

# create Estimator
# here we use a sample_ratio of 1.0 because of the small dataset size
# for larger datasets, lower this number for faster estimates
estimator = SampleEstimator(collection, sample_ratio=1.0)

workload = Workload()
workload.generate(
    collection,
    estimator=estimator,
    min_selectivity=0.1,  # match at least 10% of docs (1 of 10)
    max_selectivity=0.3,  # match at most 30% of docs (3 of 10)
    max_predicates=1,  # each query contains at most a single predicate
    num_queries=10,
)

print("\nWorkload:")
for query in workload:
    print(query.to_mql(), end=" -> ")

    # we also print the actual number of documents matched
    matched_docs = collection.collection.count_documents(query.to_mql())
    print(f"query matches {matched_docs} docs")

Parsing schema: 10it [00:00, 7396.06it/s]


10 documents in collection. limit is 0


Generating workload: 100%|██████████| 10/10 [00:00<00:00, 328.20it/s]


Workload:
{'hobbies': 'photography'} -> query matches 1 docs
{'phone': {'$exists': True}} -> query matches 3 docs
{'id': {'$gte': 9}} -> query matches 2 docs
{'age': {'$lt': 28}} -> query matches 2 docs
{'address.zipcode': {'$lte': '67890'}} -> query matches 2 docs
{'hobbies': {'$lte': 'cycling'}} -> query matches 1 docs
{'email': 'charlie@example.com'} -> query matches 1 docs
{'preferences.notifications': {'$lte': 'sms'}} -> query matches 2 docs
{'id': {'$gt': 8}} -> query matches 2 docs
{'name': 'Bob'} -> query matches 1 docs
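
To relate the matched counts above to the selectivity bounds, the sketch below treats selectivity as the fraction of documents a query matches and checks each count from the output against the `0.1` to `0.3` range. The `selectivity` helper is hypothetical and independent of how `SampleEstimator` computes its estimates.

```python
def selectivity(matched, total):
    """Fraction of documents matched by a query (hypothetical helper)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return matched / total


total_docs = 10  # collection size reported in the output above
matched_counts = [1, 3, 2, 2, 2, 1, 1, 2, 2, 1]  # counts copied from the output above

min_selectivity, max_selectivity = 0.1, 0.3
in_bounds = [
    min_selectivity <= selectivity(m, total_docs) <= max_selectivity
    for m in matched_counts
]
print(all(in_bounds))
```

Every query in this run matches between 1 and 3 of the 10 documents, so all of them fall inside the requested range.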



