From 0f695d9338c9747c7eb5deaa0cea65796e2ce86f Mon Sep 17 00:00:00 2001 From: Ryan Date: Mon, 1 Jul 2024 11:15:29 -0400 Subject: [PATCH 1/5] Docs: update homepage with new README --- README.md | 2 +- docs/index.md | 99 ++++++++++++++++++++++++++++++++++++++++++--------- 2 files changed, 84 insertions(+), 17 deletions(-) diff --git a/README.md b/README.md index 0d6a691dd..4b61f7288 100644 --- a/README.md +++ b/README.md @@ -1,7 +1,7 @@ # ![Maggma](docs/logo_w_text.svg) -[![Static Badge](https://img.shields.io/badge/documentation-blue?logo=github)](https://materialsproject.github.io/maggma) [![testing](https://github.com/materialsproject/maggma/workflows/testing/badge.svg)](https://github.com/materialsproject/maggma/actions?query=workflow%3Atesting) [![codecov](https://codecov.io/gh/materialsproject/maggma/branch/main/graph/badge.svg)](https://codecov.io/gh/materialsproject/maggma) [![python](https://img.shields.io/badge/Python-3.8+-blue.svg?logo=python&logoColor=white)]() +[![Static Badge](https://img.shields.io/badge/documentation-blue?logo=github)](https://materialsproject.github.io/maggma) [![testing](https://github.com/materialsproject/maggma/workflows/testing/badge.svg)](https://github.com/materialsproject/maggma/actions?query=workflow%3Atesting) [![codecov](https://codecov.io/gh/materialsproject/maggma/branch/main/graph/badge.svg)](https://codecov.io/gh/materialsproject/maggma) [![python](https://img.shields.io/badge/Python-3.9+-blue.svg?logo=python&logoColor=white)]() ## What is Maggma diff --git a/docs/index.md b/docs/index.md index e26f48462..c259378b9 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,38 +1,105 @@ -# Maggma + +# ![Maggma](logo_w_text.svg) [![Static Badge](https://img.shields.io/badge/documentation-blue?logo=github)](https://materialsproject.github.io/maggma) [![testing](https://github.com/materialsproject/maggma/workflows/testing/badge.svg)](https://github.com/materialsproject/maggma/actions?query=workflow%3Atesting) 
[![codecov](https://codecov.io/gh/materialsproject/maggma/branch/main/graph/badge.svg)](https://codecov.io/gh/materialsproject/maggma) [![python](https://img.shields.io/badge/Python-3.9+-blue.svg?logo=python&logoColor=white)]() ## What is Maggma -Maggma is a framework to build data pipelines from files on disk all the way to a REST API in scientific environments. Maggma has been developed by the Materials Project (MP) team at Lawrence Berkeley National Laboratory. +Maggma is a framework to build scientific data processing pipelines from data stored in +a variety of formats -- databases, Azure Blobs, files on disk, etc., all the way to a +REST API. The rest of this README contains a brief, high-level overview of what `maggma` can do. +For more, please refer to [the documentation](https://materialsproject.github.io/maggma). -Maggma is written in [Python](http://docs.python-guide.org/en/latest/) and supports Python 3.9+. -## Installation from PyPI +## Installation + +### From PyPI Maggma is published on the [Python Package Index](https://pypi.org/project/maggma/). The preferred tool for installing packages from *PyPi* is **pip**. This tool is provided with all modern versions of Python. -Open your terminal and run the following command. +Open your terminal and run the following command: ``` shell pip install --upgrade maggma ``` +### Direct from `git` + +If you want to install the latest development version, but do not plan to +make any changes to it, you can install as follows: + +``` shell +pip install git+https://github.com/materialsproject/maggma +``` -## Installation from source +### Local Clone You can install Maggma directly from a clone of the [Git repository](https://github.com/materialsproject/maggma). This can be done either by cloning the repo and installing from the local clone, or simply installing directly via **git**. 
-=== "Local Clone" +``` shell +git clone https://github.com/materialsproject/maggma +cd maggma +python setup.py install +``` + +## Basic Concepts + +`maggma`'s core classes -- [`Store`](#store) and [`Builder`](#builder) -- provide building blocks for +modular data pipelines. Data resides in one or more `Store` and is processed by a +`Builder`. The results of the processing are saved in another `Store`, and so on: + +```mermaid +flowchart LR +    s1(Store 1) --Builder 1--> s2(Store 2) --Builder 2--> s3(Store 3) +s2 -- Builder 3-->s4(Store 4) +``` + +### Store + +A major challenge in building scalable data pipelines is dealing with all the different types of data sources out there. Maggma's `Store` class provides a consistent, unified interface for querying data from arbitrary data sources. It was originally built around MongoDB, so its interface closely resembles `PyMongo` syntax. However, Maggma makes it possible to use that same syntax to query other types of data sources, such as Amazon S3, GridFS, or files on disk, [and many others](https://materialsproject.github.io/maggma/getting_started/stores/#list-of-stores). Stores implement methods to `connect`, `query`, find `distinct` values, `groupby` fields, `update` documents, and `remove` documents. + +The example below demonstrates inserting 4 documents (Python `dict`s) into a `MongoStore` with `update`, then +accessing the data using `count`, `query`, and `distinct`. 
+ +```python +>>> from maggma.stores import MongoStore +>>> turtles = [{"name": "Leonardo", "color": "blue", "tool": "sword"}, + {"name": "Donatello","color": "purple", "tool": "staff"}, + {"name": "Michelangelo", "color": "orange", "tool": "nunchuks"}, + {"name":"Raphael", "color": "red", "tool": "sai"} + ] +>>> store = MongoStore(database="my_db_name", + collection_name="my_collection_name", + username="my_username", + password="my_password", + host="my_hostname", + port=27017, + key="name", + ) +>>> with store: + store.update(turtles) +>>> store.count() +4 +>>> store.query_one({}) +{'_id': ObjectId('66746d29a78e8431daa3463a'), 'name': 'Leonardo', 'color': 'blue', 'tool': 'sword'} +>>> store.distinct('color') +['purple', 'orange', 'blue', 'red'] +``` - ``` shell - git clone https://github.com//materialsproject/maggma - cd maggma - python setup.py install - ``` +### Builder -=== "Direct Git" - ``` shell - pip install git+https://github.com/materialsproject/maggma - ``` +Builders represent a data processing step, analogous to an extract-transform-load (ETL) operation in a data +warehouse model. Much like `Store` provides a consistent interface for accessing data, the `Builder` classes +provide a consistent interface for transforming it. `Builder` transformations are each broken into 3 phases: `get_items`, `process_item`, and `update_targets`: + +1. `get_items`: Retrieve items from the source Store(s) for processing by the next phase. +2. `process_item`: Manipulate the input item and create an output document that is sent to the next phase for storage. +3. `update_targets`: Add the processed items to the target Store(s). + +Both `get_items` and `update_targets` can perform IO (input/output) to the data stores. `process_item` is expected to not perform any IO so that it can be parallelized by Maggma. Builders can be chained together into an array and then saved as a JSON file to be run on a production system. 
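
The three phases can be sketched without `maggma` at all. The snippet below is a minimal illustration under stated assumptions, not `maggma`'s `Builder` class: plain Python lists stand in for the source and target `Store`s, and the `tool_length` field is a made-up transformation chosen only to show the data flow.

```python
from typing import Dict, Iterator, List

# Assumption: plain lists of dicts play the role of the source and target Stores.
source: List[Dict] = [
    {"name": "Leonardo", "tool": "sword"},
    {"name": "Donatello", "tool": "staff"},
]
target: List[Dict] = []


def get_items() -> Iterator[Dict]:
    """Phase 1: read items from the source (IO is allowed here)."""
    yield from source


def process_item(item: Dict) -> Dict:
    """Phase 2: pure transformation with no IO, so calls are independent."""
    return {"name": item["name"], "tool_length": len(item["tool"])}


def update_targets(items: List[Dict]) -> None:
    """Phase 3: write the processed items to the target (IO is allowed here)."""
    target.extend(items)


processed = [process_item(item) for item in get_items()]
update_targets(processed)
print(target)
```

Because `process_item` depends only on its input, the list comprehension could be swapped for a parallel map without touching the other two phases, which is the motivation for keeping IO out of that phase.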
+ +## Origin and Maintainers + +Maggma has been developed and is maintained by the [Materials Project](https://materialsproject.org/) team at Lawrence Berkeley National Laboratory and the [Materials Project Software Foundation](https://github.com/materialsproject/foundation). + +Maggma is written in [Python](http://docs.python-guide.org/en/latest/) and supports Python 3.9+. From f082806f4483d6bd35855539ca4af83973e7c38a Mon Sep 17 00:00:00 2001 From: Ryan Date: Mon, 1 Jul 2024 16:45:28 -0400 Subject: [PATCH 2/5] docs: update list of stores and store section --- docs/getting_started/stores.md | 132 +++++++++++++++++++++++++-------- docs/reference/stores.md | 2 + 2 files changed, 104 insertions(+), 30 deletions(-) diff --git a/docs/getting_started/stores.md b/docs/getting_started/stores.md index 308294fdd..48f23a05b 100644 --- a/docs/getting_started/stores.md +++ b/docs/getting_started/stores.md @@ -1,55 +1,127 @@ # Using `Store` -A `Store` is just a wrapper to access data from a data source. That data source is typically a MongoDB collection, but it could also be an Amazon S3 bucket, a GridFS collection, or folder of files on disk. `maggma` makes interacting with all of these data sources feel the same (see the `Store` interface, below). `Store` can also perform logic, concatenating two or more `Store` together to make them look like one data source for instance. +A `Store` is just a wrapper to access data from a data source. That data source is typically a MongoDB collection, but it could also be an Amazon S3 bucket, a GridFS collection, or folder of files on disk. `maggma` makes interacting with all of these data sources feel the same (see the [`Store` interface](#the-store-interface), below). `Store` can also perform logic, concatenating two or more `Store` together to make them look like one data source for instance. The benefit of the `Store` interface is that you only have to write a `Builder` once. 
As your data moves or evolves, you simply point it to different `Store` without having to change your processing code. -## List of Stores +## Structuring `Store` data -Current working and tested `Store` include: - -- `MongoStore`: interfaces to a MongoDB Collection using port and hostname. -- `MongoURIStore`: interfaces to a MongoDB Collection using a "mongodb+srv://" URI. -- `MemoryStore`: just a Store that exists temporarily in memory -- `JSONStore`: builds a MemoryStore and then populates it with the contents of the given JSON files -- `FileStore`: query and add metadata to files stored on disk as if they were in a database -- `GridFSStore`: interfaces to GridFS collection in MongoDB using port and hostname. -- `GridFSURIStore`: interfaces to GridFS collection in MongoDB using a "mongodb+srv://" URI. -- `S3Store`: provides an interface to an S3 Bucket either on AWS or self-hosted solutions ([additional documentation](advanced_stores.md)) -- `ConcatStore`: concatenates several Stores together so they look like one Store -- `VaultStore`: uses Vault to get credentials for a MongoDB database -- `AliasingStore`: aliases keys from the underlying store to new names -- `SandboxStore: provides permission control to documents via a `_sbxn` sandbox key -- `JointStore`: joins several MongoDB collections together, merging documents with the same `key`, so they look like one collection -- `AzureBlobStore`: provides an interface to Azure Blobs for the storage of large amount of data -- `MontyStore`: provides an interface to [montydb](https://github.com/davidlatwe/montydb) for in-memory or filesystem-based storage -- `MongograntStore`: (DEPRECATED) uses Mongogrant to get credentials for MongoDB database +Because `Store` is built around a MongoDB-like query syntax, data that goes into `Store` needs to be structured similarly to MongoDB data. 
In Python terms, +that means **the data in a `Store` must be structured as a `list` of `dict`**, +where each `dict` represents a single record (called a 'document'). + +```python +data = [{"AM": "sunrise"}, {"PM": "sunset"}, ...] +``` + +Note that this structure is very similar to the widely used [JSON](https://en.wikipedia.org/wiki/JSON) format. So structuring your data in this manner +enables highly flexible storage options -- you can easily write it to a `.json` +file, place it in a `Store`, insert it into a Mongo database, etc. `maggma` is +designed to facilitate this. + +In addition to being structured as a `list` of `dict`, **every document (`dict`) +must have a key that uniquely identifies it.** By default, this key is the `task_id`, but it can be set to any value you +like using the `key` argument when you instantiate a `Store`. + +```python +data = [{"task_id": 1, "AM": "sunrise"}, {"task_id": 2, "PM": "sunset"}, ...] +``` ## The `Store` interface All `Store` provide a number of basic methods that facilitate querying, updating, and removing data: -- `query`: Standard mongo style `find` method that lets you search the store. +- `query`: Standard mongo style `find` method that lets you search the store. See [Understanding Queries](query_101.md) for more details about the query syntax. - `query_one`: Same as above but limits returned results to just the first document that matches your query. Very useful for understanding the structure of the returned data. -- `update`: Update the documents into the collection. This will override documents if the key field matches. -- `ensure_index`: This creates an index for the underlying data-source for fast querying. -- `distinct`: Gets distinct values of a field. +- `count`: Counts documents in the `Store` +- `distinct`: Returns a list of distinct values of a field. - `groupby`: Similar to query but performs a grouping operation and returns sets of documents. +- `update`: Update (insert) documents into the `Store`. 
This will overwrite documents if the key field matches. - `remove_docs`: Removes documents from the underlying data source. -- `last_updated`: Finds the most recently updated `last_updated_field` value and returns that. Useful for knowing how old a data-source is. - `newer_in`: Finds all documents that are newer in the target collection and returns their `key`s. This is a very useful way of performing incremental processing. +- `ensure_index`: Creates an index for the underlying data-source for fast querying. +- `last_updated`: Finds the most recently updated `last_updated_field` value and returns that. Useful for knowing how old a data-source is. + +!!! Note + If you are familiar with `pymongo`, you may find the comparison table below + helpful. This table illustrates how `maggma` method and argument names map + onto `pymongo` concepts. + + + | `maggma` | `pymongo` equivalent | + | -------- | ------- | + | **methods** | + | `query_one` | `find_one` | + | `query` | `find` | + | `count` | `count_documents` | + | `distinct` | `distinct` | + | `groupby` | `group` | + | `update` | `insert` | + | **arguments** | + | `criteria={}` | `filter={}` | + | `properties=[]` | `projection=[]` | -### Initializing a Store -All `Store`s have a few basic arguments that are critical for basic usage. Every `Store` has two attributes that the user should customize based on the data contained in that store: `key` and `last_updated_field`. The `key` defines how the `Store` tells documents apart. Typically this is `_id` in MongoDB, but you could use your own field (be sure all values under the key field can be used to uniquely identify documents). `last_updated_field` tells `Store` how to order the documents by a date, which is typically in the `datetime` format, but can also be an ISO 8601-format (ex: `2009-05-28T16:15:00`) `Store`s can also take a `Validator` object to make sure the data going into it obeys some schema. 
+## Creating a Store -### Using a Store +All `Store`s have a few basic arguments that are critical for basic usage. Every `Store` has two attributes that the user should customize based on the data contained in that store: `key` and `last_updated_field`. + +The `key` defines how the `Store` tells documents apart. Typically this is `_id` in MongoDB, but you could use your own field (be sure all values under the key field can be used to uniquely identify documents). + +`last_updated_field` tells `Store` how to order the documents by a date, which is typically in the `datetime` format, but can also be an ISO 8601-format (ex: `2009-05-28T16:15:00`) `Store`s can also take a `Validator` object to make sure the data going into it obeys some schema. + +In the example below, we create a `MongoStore`, which connects to a MongoDB database. +To create this store, we have to provide `maggma` the connection details to the +database like the hostname, collection name, and authentication info. Note that +we've set `key='name'` because we want to use that `name` as our unique identifier. + +```python +>>> store = MongoStore(database="my_db_name", + collection_name="my_collection_name", + username="my_username", + password="my_password", + host="my_hostname", + port=27017, + key="name", + ) +``` + +The specific arguments required to create a `Store` depend on the underlying +format. For example, the `MemoryStore`, which just loads data into memory, +requires no arguments to instantiate. Refer to the [list of Stores](#list-of-stores) +below (and their associated documentation) for specific details. + +## Connecting to a `Store` You must connect to a store by running `store.connect()` before querying or updating the store. 
-If you are operating on the stores inside of another code it is recommended to use the built-in context manager, -which will take care of the `connect()` automatically, e.g.: +If you are operating on the stores inside of another code it is recommended to use the built-in context manager, e.g.: ```python with MongoStore(...) as store: store.query() ``` + +This will take care of the `connect()` automatically while ensuring that the +connection is closed properly after the store tasks are complete. + +## List of Stores + +Current working and tested `Store` include the following. Click the name of +each store for more detailed documentation. + +- [`MongoStore`](/maggma/reference/stores/#maggma.stores.mongolike.MongoStore): interfaces to a MongoDB Collection using port and hostname. +- [`MongoURIStore`](/maggma/reference/stores/#maggma.stores.mongolike.MongoURIStore): interfaces to a MongoDB Collection using a "mongodb+srv://" URI. +- [`MemoryStore`](/maggma/reference/stores/#maggma.stores.mongolike.MemoryStore): just a Store that exists temporarily in memory +- [`JSONStore`](/maggma/reference/stores/#maggma.stores.mongolike.JSONStore): builds a MemoryStore and then populates it with the contents of the given JSON files +- [`FileStore`](/maggma/reference/stores/#maggma.stores.file_store.FileStore): query and add metadata to files stored on disk as if they were in a database +- [`GridFSStore`](/maggma/reference/stores/#maggma.stores.gridfs.GridFSStore): interfaces to GridFS collection in MongoDB using port and hostname. +- [`GridFSURIStore`](/maggma/reference/stores/#maggma.stores.gridfs.GridFSURIStore): interfaces to GridFS collection in MongoDB using a "mongodb+srv://" URI. 
+- [`S3Store`](/maggma/reference/stores/#maggma.stores.aws.S3Store): provides an interface to an S3 Bucket either on AWS or self-hosted solutions ([additional documentation](advanced_stores.md)) +- [`ConcatStore`](/maggma/reference/stores/#maggma.stores.compound_stores.ConcatStore): concatenates several Stores together so they look like one Store +- [`VaultStore`](/maggma/reference/stores/#maggma.stores.advanced_stores.VaultStore): uses Vault to get credentials for a MongoDB database +- [`AliasingStore`](/maggma/reference/stores/#maggma.stores.advanced_stores.AliasingStore): aliases keys from the underlying store to new names +- `SandboxStore`: provides permission control to documents via a `_sbxn` sandbox key +- [`JointStore`](/maggma/reference/stores/#maggma.stores.compound_stores.JointStore): joins several MongoDB collections together, merging documents with the same `key`, so they look like one collection +- [`AzureBlobStore`](/maggma/reference/stores/#maggma.stores.azure.AzureBlobStore): provides an interface to Azure Blobs for the storage of large amounts of data +- [`MontyStore`](/maggma/reference/stores/#maggma.stores.mongolike.MontyStore): provides an interface to [montydb](https://github.com/davidlatwe/montydb) for in-memory or filesystem-based storage +- [`MongograntStore`](/maggma/reference/stores/#maggma.stores.advanced_stores.MongograntStore): (DEPRECATED) uses Mongogrant to get credentials for a MongoDB database diff --git a/docs/reference/stores.md b/docs/reference/stores.md index 4aff7b958..7a75241da 100644 --- a/docs/reference/stores.md +++ b/docs/reference/stores.md @@ -22,6 +22,8 @@ ::: maggma.stores.aws +::: maggma.stores.azure + ::: maggma.stores.advanced_stores ::: maggma.stores.compound_stores From 036ff63c0e0249e6be8c9a490918bfdb02d26a5a Mon Sep 17 00:00:00 2001 From: Ryan Date: Mon, 1 Jul 2024 16:47:49 -0400 Subject: [PATCH 3/5] docs: add quickstart; rename getting started --- docs/quickstart.md | 109 
+++++++++++++++++++++++++++++++++++++++++++++ mkdocs.yml | 4 +- 2 files changed, 112 insertions(+), 1 deletion(-) create mode 100644 docs/quickstart.md diff --git a/docs/quickstart.md b/docs/quickstart.md new file mode 100644 index 000000000..0e3314644 --- /dev/null +++ b/docs/quickstart.md @@ -0,0 +1,109 @@ +# 5-minute `maggma` quickstart + +## Install + +Open your terminal and run the following command. + +``` shell +pip install --upgrade maggma +``` + +## Format your data + +Structure your data as a `list` of `dict` objects, +where each `dict` represents a single record (called a 'document'). Below, +we've created some data to represent info about the Teenage Mutant Ninja Turtles. + +```python +>>> turtles = [{"name": "Leonardo", "color": "blue", "tool": "sword"}, + {"name": "Donatello","color": "purple", "tool": "staff"}, + {"name": "Michelangelo", "color": "orange", "tool": "nunchuks"}, + {"name":"Raphael", "color": "red", "tool": "sai"} + ] +``` + +Structuring your data in this manner +enables highly flexible storage options -- you can easily write it to a `.json` +file, place it in a `Store`, insert it into a Mongo database, etc. `maggma` is +designed to facilitate this. + +In addition to being structured as a `list` of `dict`, **every document (`dict`) +must have a key that uniquely identifies it.** By default, this key is the `task_id`, but it can be set to any value you +like using the `key` argument when you instantiate a `Store`. In the example above, +`name` can serve as a key because all documents have it, and the values are all unique. + + +See [Using Stores](getting_started/stores.md/#structuring-store-data) for more details on structuring data. + +## Create a `Store` + +`maggma` contains `Store` classes that connect to MongoDB, Azure, S3 buckets, +`.json` files, system memory, and many more data sources. Regardless of the +underlying storage platform, all `Store` classes implement the same interface +for connecting and querying. 
+ +The simplest store to use is the `MemoryStore`. It simply loads your data into +memory and makes it accessible via `Store` methods like `query`, `distinct`, etc. Note that for this particular store, your data is not saved anywhere - +once you close it, the data are lost from RAM! Note that in this example, +we've set `key='name'` when creating the `Store` because we want to use `name` as our unique identifier. + +```python +>>> from maggma.stores import MemoryStore +>>> store = MemoryStore(key="name") +``` + +See [Using Stores](getting_started/stores.md/#list-of-stores) for more details on available `Store` classes. + +## Connect to the `Store` + +Before you can interact with a store, you have to `connect()`. This is as simple +as + +```python +store.connect() +``` + +When you are finished, you can close the connection with `store.close()`. + +A cleaner (and recommended) way to make sure connections are appropriately closed +is to access `Store` through a context manager (a `with` statement), like this: + +```python +with store as s: + s.query() +``` + +## Add your data to the `Store` + +To add data to the store, use `update()`. + +```python +with store as s: + s.update(turtles) +``` + +## Query the `Store` + +Now that you have added your data to a `Store`, you can leverage `maggma`'s +powerful API to query and analyze it. Here are some examples: + +See how many documents the `Store` contains +```python +>>> store.count() +4 +``` + +Query a single document to see its structure +```python +>>> store.query_one({}) +{'_id': ObjectId('66746d29a78e8431daa3463a'), 'name': 'Leonardo', 'color': 'blue', 'tool': 'sword'} +``` + +List all the unique values of the `color` field +```python +>>> store.distinct('color') +['purple', 'orange', 'blue', 'red'] +``` + +See [Understanding Queries](getting_started/query_101.md) for more example queries and [the `Store` interface](getting_started/stores.md/#the-store-interface) for more details about available `Store` +methods. 
diff --git a/mkdocs.yml b/mkdocs.yml index 15bf7f5f8..aad3845ab 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -8,7 +8,9 @@ theme: nav: - Home: index.md - Core Concepts: concepts.md - - Getting Started: + - Quickstart: quickstart.md + - User Guide: + - Understanding Queries: getting_started/query_101.md - Using Stores: getting_started/stores.md - Working with FileStore: getting_started/using_file_store.md - Writing a Builder: getting_started/simple_builder.md From 8454489ed3177021b37553b5e6d24b1fae811398 Mon Sep 17 00:00:00 2001 From: Ryan Date: Mon, 1 Jul 2024 16:48:37 -0400 Subject: [PATCH 4/5] docs: add to store guide --- docs/getting_started/stores.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/docs/getting_started/stores.md b/docs/getting_started/stores.md index 48f23a05b..f11c0f81f 100644 --- a/docs/getting_started/stores.md +++ b/docs/getting_started/stores.md @@ -27,6 +27,10 @@ like using the `key` argument when you instantiate a `Store`. data = [{"task_id": 1, "AM": "sunrise"}, {"task_id": 2, "PM": "sunset"}, ...] ``` +Just to emphasize: **every document must have the key field (`task_id` by default), and its value must be unique for every document**. The rest of the document structure +is up to you, but `maggma` works best when every document follows a pre-defined +schema (i.e., all `dict` have the same set of keys / same structure). 
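
One way to check the uniqueness requirement before loading data is a small pre-flight helper. The sketch below is an illustration only: `find_key_problems` is a hypothetical function written for this example, not part of `maggma`, and it assumes the key values are hashable and sortable.

```python
from collections import Counter
from typing import Dict, List


def find_key_problems(docs: List[Dict], key: str = "task_id") -> Dict[str, list]:
    """Report documents lacking `key` and key values that occur more than once."""
    # Positions of documents that do not have the key field at all.
    missing = [i for i, doc in enumerate(docs) if key not in doc]
    # Count how often each key value appears among the remaining documents.
    counts = Counter(doc[key] for doc in docs if key in doc)
    duplicates = sorted(value for value, n in counts.items() if n > 1)
    return {"missing": missing, "duplicates": duplicates}


data = [
    {"task_id": 1, "AM": "sunrise"},
    {"task_id": 2, "PM": "sunset"},
    {"task_id": 2, "PM": "dusk"},  # duplicate key value
    {"AM": "dawn"},                # no key field
]
problems = find_key_problems(data)
print(problems)
```

Running a check like this first makes it clear which documents would overwrite one another, since `update` replaces documents whose key field matches.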
+ ## The `Store` interface All `Store` provide a number of basic methods that facilitate querying, updating, and removing data: From d099a1ce3b098bd67404592aa3dc1e577056fdd1 Mon Sep 17 00:00:00 2001 From: Ryan Date: Mon, 1 Jul 2024 16:54:04 -0400 Subject: [PATCH 5/5] docs: add queries tutorial --- docs/getting_started/query_101.md | 180 ++++++++++++++++++++++++++++++ 1 file changed, 180 insertions(+) create mode 100644 docs/getting_started/query_101.md diff --git a/docs/getting_started/query_101.md b/docs/getting_started/query_101.md new file mode 100644 index 000000000..503823e0e --- /dev/null +++ b/docs/getting_started/query_101.md @@ -0,0 +1,180 @@ +# Understanding Queries + +Putting your data into a `maggma` `Store` gives you powerful search, summary, +and analytical capabilities. All are based on "queries", which specify how +you want to search your data, and which parts of it you want to get in return. + +`maggma` query syntax closely follows [MongoDB Query syntax](https://www.mongodb.com/docs/manual/tutorial/query-documents/). In this tutorial, we'll cover the syntax of the most common query operations. You can refer to the +[MongoDB](https://www.mongodb.com/docs/manual/tutorial/query-documents/) or [pymongo](https://pymongo.readthedocs.io/en/stable/tutorial.html) (python interface to MongoDB) documentation for examples of more advanced use cases. + +Let's create an example dataset describing the [Teenage Mutant Ninja Turtles](https://en.wikipedia.org/wiki/Teenage_Mutant_Ninja_Turtles). 
+ +```python +>>> turtles = [{"name": "Leonardo", + "color": "blue", + "tool": "sword", + "occupation": "ninja" + }, + {"name": "Donatello", + "color": "purple", + "tool": "staff", + "occupation": "ninja" + }, + {"name": "Michelangelo", + "color": "orange", + "tool": "nunchuks", + "occupation": "ninja" + }, + {"name":"Raphael", + "color": "red", + "tool": "sai", + "occupation": "ninja" + }, + {"name":"Splinter", + "occupation": "sensei" + } + ] +``` + +Notice how this data follows the principles described in [Structuring `Store` data](stores.md/#structuring-store-data): +- every document (`dict`) has a `name` key with a unique value +- every document has a common set of keys (`name`, +`occupation`). +- Some documents also share the keys `tool` and `color`, but not all. This is OK. + +For the rest of this tutorial, we will assume that this data has already been +added to a `Store` called `tmnt_store`, which we are going to query. + +## The `query` method + +`Store.query()` is the primary method you will use to search your data. + +- `query` +always returns a generator yielding any and all documents that match the query +you provide. +- There are no mandatory arguments. If you run `query()` you will get a generator +containing all documents in the `Store`. +- The first (optional) argument is `criteria`, which is a query formatted as a `dict` as described in the next section. +- You can also specify `properties`, which is a list of fields from the documents you want to return. This is useful when working with large documents because then you only have to download the data you need rather than the entire document. +- You can also `skip` the first N documents, `limit` the number of documents returned, and `sort` the result by some field. + +Since `query` returns a generator, you will typically want to turn the results into a list, or use them in a `for` loop. 
+ +Turn into a list +```python +results = [d for d in store.query()] +``` + +Use in a `for` loop +```python +for doc in store.query(): + print(doc) +``` + +## The structure of a query + +A query is also a `dict`. Each key in the dict corresponds to a field in the +documents you want to query (such as `name`, `color`, etc.), and the value +is the value of that field that you want to match. For example, a query to +select all documents where `occupation` is `ninja` would look like + +```python +{"occupation": "ninja"} +``` + +This query will be passed as an argument to `Store` methods like `query_one`, +`query`, and `count`, as demonstrated next. + + +## Example queries + +### Match a single value + +To select all records where a field matches a single value, set the key to +the field you want to match and its value to the value you are looking for. + +Return all records where 'occupation' is 'ninja' +```python +>>> with tmnt_store as store: +... results = list(store.query({"occupation": "ninja"})) +>>> len(results) +4 +``` + +Return all records where 'name' is 'Splinter' + +```python +>>> with tmnt_store as store: +... results = list(store.query({"name": "Splinter"})) +>>> len(results) +1 +``` + +### Match any value in a list: `$in` + +To find all documents where a field matches one of several different +values, use `$in` with a list of the values you want to match. + +```python +>>> with tmnt_store as store: +... results = list(store.query({"color": {"$in": ["red", "blue"]}})) +>>> len(results) +2 +``` + +`$in` is an example of a "query operator". Others include: + +- `$nin`: a value is NOT in a list (the inverse of the above example) +- `$gt`, `$gte`: greater than, greater than or equal to a value +- `$lt`, `$lte`: less than, less than or equal to a value +- `$ne`: not equal to a value +- `$not`: inverts the effect of a query expression, returning results that + do NOT match. 
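
One detail of these operators is easy to miss: in MongoDB, `$nin` and `$ne` also match documents that lack the queried field entirely. The sketch below imitates `$in`, `$nin`, and `$ne` in plain Python to make that visible. It is an illustration only; `matches` and `names` are hypothetical helpers written for this example, not how `maggma` or MongoDB evaluate queries, and only these three operators are handled.

```python
from typing import Any, Dict, List

turtles: List[Dict[str, Any]] = [
    {"name": "Leonardo", "color": "blue", "occupation": "ninja"},
    {"name": "Donatello", "color": "purple", "occupation": "ninja"},
    {"name": "Michelangelo", "color": "orange", "occupation": "ninja"},
    {"name": "Raphael", "color": "red", "occupation": "ninja"},
    {"name": "Splinter", "occupation": "sensei"},  # has no "color" field
]

_MISSING = object()  # sentinel meaning "this document has no such field"


def matches(doc: Dict[str, Any], criteria: Dict[str, Any]) -> bool:
    """Return True if `doc` satisfies every condition in `criteria`."""
    for field, condition in criteria.items():
        value = doc.get(field, _MISSING)
        if isinstance(condition, dict):
            # Nested dict: each key is an operator, each value its operand.
            for op, operand in condition.items():
                if op == "$in":
                    ok = value in operand
                elif op == "$nin":
                    ok = value not in operand  # a missing field counts as a match
                elif op == "$ne":
                    ok = value != operand      # a missing field counts as a match
                else:
                    raise ValueError(f"operator {op!r} is not handled in this sketch")
                if not ok:
                    return False
        elif value != condition:
            # Plain value: exact match on the field.
            return False
    return True


def names(criteria: Dict[str, Any]) -> List[str]:
    """Names of all turtles matching `criteria`, in dataset order."""
    return [doc["name"] for doc in turtles if matches(doc, criteria)]


in_red_blue = names({"color": {"$in": ["red", "blue"]}})
not_red_blue = names({"color": {"$nin": ["red", "blue"]}})
not_ninja = names({"occupation": {"$ne": "ninja"}})
print(in_red_blue, not_red_blue, not_ninja)
```

Here the `$nin` query returns Splinter along with Donatello and Michelangelo, even though Splinter has no `color` at all.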
+ +See the [MongoDB docs](https://www.mongodb.com/docs/manual/reference/operator/query/#query-selectors) for a complete list. + +!!! Note + + When using query operators like `$in`, you must include a nested `dict` in + your query, where the operator is the key and the search parameters are + the value, e.g., the dictionary `{"$in": ["red", "blue"]}` is the **value** + associated with the search field (`color`) in the parent dictionary. + +### Nested fields + +Suppose that our documents had a nested structure, for example, by having +separate fields for first and last name: + +```python +>>> turtles = [{"name": + {"first": "Leonardo", + "last": "turtle" + }, + "color": "blue", + "tool": "sword", + "occupation": "ninja" + }, + ... + ] +``` + +You can query nested fields by placing a period `.` between each level in the +hierarchy. For example: + +```python +>>> with tmnt_store as store: +... results = list(store.query({"name.first": "Splinter"})) +>>> len(results) +1 +``` + +### Numerical Values + +You can query numerical values in analogous fashion to the examples given above. + +!!! Note + When querying on numerical values, be mindful of the `type` of the data. + Data stored in `json` format is often converted entirely to `str`, so if + you use a numerical query operator like `$gte`, you might not get the + results you expect unless you first verify that the numerical data + in the `Store` is a `float` or `int` .
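
The type pitfall can be shown with a plain-Python sketch. This is an illustration under stated assumptions, not `maggma` code: the `age` field is hypothetical (the turtle dataset above has no numeric fields), and `gte` is a made-up helper that imitates MongoDB's behaviour of comparing a numeric operand only against numeric values.

```python
from typing import Any, Dict, List

docs: List[Dict[str, Any]] = [
    {"name": "Leonardo", "age": 15},
    {"name": "Raphael", "age": "15"},  # numeric value stored as a string
]


def gte(value: Any, threshold: float) -> bool:
    """Imitate `$gte` with a numeric operand: non-numeric values never match."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    return value >= threshold


# Querying the raw data silently skips the document whose age is a string.
before = [d["name"] for d in docs if gte(d.get("age"), 10)]

# Converting the field to a number first makes both documents match.
cleaned = [{**d, "age": float(d["age"])} for d in docs]
after = [d["name"] for d in cleaned if gte(d.get("age"), 10)]

print(before, after)
```

Checking one document with `query_one` and inspecting `type(doc["age"])` is a quick way to confirm the stored type before relying on operators like `$gte`.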