# Schema Parsing Examples


### Python Path

This notebook requires the `mdbrtools` package to be on the Python path.
The code block below is only needed if you cloned the GitHub repository and did not install `mdbrtools` via pip.

Alternatively, you can install `mdbrtools` in development (editable) mode with `pip install -e .`


In [2]:
import sys

sys.path.append("..")
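
If the notebook is shared between people who installed the package and people who cloned the repository, the path tweak can be made conditional. The following is a minimal sketch using only the standard library; the helper name `ensure_importable` is hypothetical and not part of `mdbrtools`.

```python
import importlib.util
import sys


def ensure_importable(package: str, fallback_path: str) -> bool:
    """Append fallback_path to sys.path only if `package` cannot be found.

    Returns True if the path was extended, False if the package was
    already importable. (Hypothetical helper, for illustration only.)
    """
    if importlib.util.find_spec(package) is not None:
        return False
    if fallback_path not in sys.path:
        sys.path.append(fallback_path)
    return True


# In this notebook the call would look like this:
ensure_importable("mdbrtools", "..")
```

With this guard, an installed copy of the package takes precedence and `sys.path` is left untouched.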

## Simple Schema Parsing Example

The code block below loads a few example documents from a JSON file and parses their schema.

The `schema` object contains comprehensive information about the fields, types and values.

Note that `dict(schema)` returns a short summary of the schema as a Python dictionary.


In [14]:
from mdbrtools.schema import parse_schema
import json
from pprint import pprint

# load example data
with open("./example_docs.json") as f:
    docs = json.load(f)

schema = parse_schema(docs)
pprint(dict(schema))

Parsing schema: 100%|██████████| 10/10 [00:00<00:00, 45294.86it/s]

{'address': [{'counter': 3, 'type': 'document'}],
 'address.city': [{'counter': 3, 'type': 'str'}],
 'address.street': [{'counter': 3, 'type': 'str'}],
 'address.zipcode': [{'counter': 2, 'type': 'str'}],
 'age': [{'counter': 7, 'type': 'int'}],
 'email': [{'counter': 8, 'type': 'str'}],
 'hobbies': [{'counter': 3, 'type': 'array'}],
 'hobbies.[]': [{'counter': 5, 'type': 'str'}],
 'id': [{'counter': 10, 'type': 'int'}],
 'name': [{'counter': 10, 'type': 'str'}],
 'phone': [{'counter': 3, 'type': 'str'}],
 'preferences': [{'counter': 3, 'type': 'document'}],
 'preferences.newsletter': [{'counter': 3, 'type': 'bool'}],
 'preferences.notifications': [{'counter': 2, 'type': 'array'}],
 'preferences.notifications.[]': [{'counter': 3, 'type': 'str'}]}
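
Because the summary is a plain dictionary, it can be post-processed with ordinary Python. The sketch below works on a hand-copied subset of the output above and does not call `mdbrtools`. It assumes that the `counter` of a top-level field is the number of documents containing that field, and that the total of 10 documents can be read from the progress bar. The helper name `field_coverage` is made up for this example.

```python
# Subset of the summary printed above, copied by hand for illustration
summary = {
    "address": [{"counter": 3, "type": "document"}],
    "age": [{"counter": 7, "type": "int"}],
    "email": [{"counter": 8, "type": "str"}],
    "id": [{"counter": 10, "type": "int"}],
    "name": [{"counter": 10, "type": "str"}],
    "phone": [{"counter": 3, "type": "str"}],
}
total_docs = 10  # assumption: taken from the progress bar output


def field_coverage(summary, total_docs):
    """Return {field: fraction of documents that contain the field}.

    Assumes each counter counts documents, which is only meaningful
    for top-level fields (array element counters count elements).
    """
    return {
        field: sum(entry["counter"] for entry in type_entries) / total_docs
        for field, type_entries in summary.items()
    }


coverage = field_coverage(summary, total_docs)
always_present = sorted(f for f, c in coverage.items() if c == 1.0)
sometimes_missing = sorted(f for f, c in coverage.items() if c < 1.0)

print("Present in every document:", always_present)
print("Missing in some documents:", sometimes_missing)
```

Paths containing `[]` are left out on purpose: their counters refer to array elements rather than documents, so dividing by the document count would be misleading.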





## Navigating the `schema` object

At a high level, a `Schema` is a nested tree structure that contains `Field` objects, and each `Field` object contains one or more `Type` objects.

Subclasses of `Type` are

- `Array`
- `Document`
- `PrimitiveType`

`PrimitiveType` objects are the leaves of this tree. They represent data types such as int, str, float, etc.


In [29]:
from mdbrtools.schema import parse_schema

docs = [
    {"things": "string_thing"},
    {"things": 123},
    {"things": False},
]

schema = parse_schema(docs)

# access types of `things`
things_types = schema["things"].types

print("\nTypes of things:")
for type_name, type_obj in things_types.items():
    print(f" - found {type_obj.count} of type {type_name}: {type_obj.values}")

Parsing schema: 100%|██████████| 3/3 [00:00<00:00, 66576.25it/s]


Types of things:
 - found 1 of type str: ['string_thing']
 - found 1 of type int: [123]
 - found 1 of type bool: [False]
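
To make the idea of one field holding several types more concrete, here is a plain-Python sketch that groups the values of a field by their Python type name. It is not how `mdbrtools` is implemented and does not use its API; the function `group_values_by_type` is hypothetical and only handles top-level fields with primitive values.

```python
from collections import defaultdict

docs = [
    {"things": "string_thing"},
    {"things": 123},
    {"things": False},
]


def group_values_by_type(docs, field):
    """Group the values of a top-level field by Python type name."""
    grouped = defaultdict(list)
    for doc in docs:
        if field in doc:
            value = doc[field]
            # type(False).__name__ is "bool", even though bool subclasses int
            grouped[type(value).__name__].append(value)
    return dict(grouped)


grouped = group_values_by_type(docs, "things")
for type_name, values in grouped.items():
    print(f" - found {len(values)} of type {type_name}: {values}")
```

For these three documents the grouping yields one entry each for `str`, `int` and `bool`, matching the counts reported by the parsed schema above.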





## Accessing Primitive Values

To retrieve the set of primitive values for a particular field path, you could navigate to the primitive type object
as shown above. For convenience, there is also a helper method: `schema.get_prim_values()`.

To align with MongoDB's query language semantics, `get_prim_values()` automatically _dives_ one level into an array structure. This can be disabled by passing `dive_into_arrays=False` as an additional argument.


In [32]:
from bson import ObjectId
from mdbrtools.schema import parse_schema
from pprint import pprint

docs = [
    {
        "_id": ObjectId("6657c93f0c261ad8866ed948"),
        "a0": [{"a1": [{"a2": [{"number": 1}, {"number": 2}]}]}],
    }
]

schema = parse_schema(docs)

# print basic schema
print("Schema:")
pprint(dict(schema))

# get values for a field path
# by default, get_prim_values() automatically dives into one level of array nesting at each level
print("\nValues for inner number:")
print(schema.get_prim_values("a0.a1.a2.number"))

# same as...
print("\nValues for inner number (dive_into_arrays=False):")
print(schema.get_prim_values("a0.[].a1.[].a2.[].number", dive_into_arrays=False))

Parsing schema: 100%|██████████| 1/1 [00:00<00:00, 12520.31it/s]

Schema:
{'_id': [{'counter': 1, 'type': 'ObjectId'}],
 'a0': [{'counter': 1, 'type': 'array'}],
 'a0.[]': [{'counter': 1, 'type': 'document'}],
 'a0.[].a1': [{'counter': 1, 'type': 'array'}],
 'a0.[].a1.[]': [{'counter': 1, 'type': 'document'}],
 'a0.[].a1.[].a2': [{'counter': 1, 'type': 'array'}],
 'a0.[].a1.[].a2.[]': [{'counter': 2, 'type': 'document'}],
 'a0.[].a1.[].a2.[].number': [{'counter': 2, 'type': 'int'}]}

Values for inner number:
{1, 2}

Values for inner number (dive_into_arrays=False):
{1, 2}

Values for inner number (navigating directly):
[1, 2]
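
The array-diving behaviour can be illustrated without `mdbrtools` by walking a raw document directly. The following is a rough sketch of the idea, not the library's implementation: `collect_values` is a hypothetical helper that unwraps at most one array level per path segment and returns the primitive values it reaches as a set.

```python
def collect_values(node, path):
    """Collect primitive values at a dotted path, diving one array level per segment."""
    parts = path.split(".") if isinstance(path, str) else path

    if not parts:
        # End of the path: unwrap one array level, keep primitives only
        if isinstance(node, list):
            return {v for v in node if not isinstance(v, (list, dict))}
        return set() if isinstance(node, dict) else {node}

    # Treat an array as a list of candidates, a single value as one candidate
    candidates = node if isinstance(node, list) else [node]
    found = set()
    for candidate in candidates:
        if isinstance(candidate, dict) and parts[0] in candidate:
            found |= collect_values(candidate[parts[0]], parts[1:])
    return found


doc = {"a0": [{"a1": [{"a2": [{"number": 1}, {"number": 2}]}]}]}
print(collect_values(doc, "a0.a1.a2.number"))
```

Note that the path contains no `[]` segments: each array on the way to `number` is entered implicitly, which mirrors how a dotted path reaches into arrays of embedded documents in a MongoDB query.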



