Permalink
Branch: master
Find file Copy path
96cab96 Jan 11, 2016
3 contributors

Users who have contributed to this file

@kay-kim @tychoish @elitan
572 lines (429 sloc) 17.5 KB
===============
Product Catalog
===============
.. default-domain:: mongodb
.. contents:: On this page
:local:
:backlinks: none
:depth: 1
:class: singlecol
Overview
--------
This document describes the basic patterns and principles for
designing an E-Commerce product catalog system using MongoDB as a
storage engine.
Problem
~~~~~~~
Product catalogs must have the capacity to store many differed types
of objects with different sets of attributes. These kinds of data
collections are quite compatible with MongoDB's data model, but many
important considerations and design decisions remain.
Solution
~~~~~~~~
For relational databases, there are several solutions that address
this problem, each with a different performance profile. This section
examines several of these options and then describes the preferred
MongoDB solution.
SQL and Relational Data Models
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. _e-commerce-catalog-concrete-table-inheritance:
Concrete Table Inheritance
``````````````````````````
One approach, in a relational model, is to create a table for each
product category. Consider the following example SQL statement for
creating database tables:
.. code-block:: sql
CREATE TABLE `product_audio_album` (
`sku` char(8) NOT NULL,
...
`artist` varchar(255) DEFAULT NULL,
`genre_0` varchar(255) DEFAULT NULL,
`genre_1` varchar(255) DEFAULT NULL,
...,
PRIMARY KEY(`sku`))
...
CREATE TABLE `product_film` (
`sku` char(8) NOT NULL,
...
`title` varchar(255) DEFAULT NULL,
`rating` char(8) DEFAULT NULL,
...,
PRIMARY KEY(`sku`))
...
This approach has limited flexibility for two key reasons:
- You must create a new table for every new category of products.
- You must explicitly tailor all queries for the exact type of
product.
.. _e-commerce-catalog-single-table-inheritance:
Single Table Inheritance
````````````````````````
Another relational data model uses a single table for all product
categories and adds new columns anytime you need to store data
regarding a new type of product. Consider the following SQL statement:
.. code-block:: sql
CREATE TABLE `product` (
`sku` char(8) NOT NULL,
...
`artist` varchar(255) DEFAULT NULL,
`genre_0` varchar(255) DEFAULT NULL,
`genre_1` varchar(255) DEFAULT NULL,
...
`title` varchar(255) DEFAULT NULL,
`rating` char(8) DEFAULT NULL,
...,
PRIMARY KEY(`sku`))
This approach is more flexible than concrete table inheritance: it
allows single queries to span different product types, but at the
expense of space.
.. _e-commerce-catalog-multiple-table-inheritance:
Multiple Table Inheritance
``````````````````````````
Also in the relational model, you may use a "multiple table
inheritance" pattern to represent common attributes in a generic
"product" table, with some variations in individual category product
tables. Consider the following SQL statement:
.. code-block:: sql
CREATE TABLE `product` (
`sku` char(8) NOT NULL,
`title` varchar(255) DEFAULT NULL,
`description` varchar(255) DEFAULT NULL,
`price`, ...
PRIMARY KEY(`sku`))
CREATE TABLE `product_audio_album` (
`sku` char(8) NOT NULL,
...
`artist` varchar(255) DEFAULT NULL,
`genre_0` varchar(255) DEFAULT NULL,
`genre_1` varchar(255) DEFAULT NULL,
...,
PRIMARY KEY(`sku`),
FOREIGN KEY(`sku`) REFERENCES `product`(`sku`))
...
CREATE TABLE `product_film` (
`sku` char(8) NOT NULL,
...
`title` varchar(255) DEFAULT NULL,
`rating` char(8) DEFAULT NULL,
...,
PRIMARY KEY(`sku`),
FOREIGN KEY(`sku`) REFERENCES `product`(`sku`))
...
Multiple table inheritance is more space-efficient than :ref:`single
table inheritance <e-commerce-catalog-single-table-inheritance>` and
somewhat more flexible than :ref:`concrete table inheritance
<e-commerce-catalog-single-table-inheritance>`. However, this model
does require an expensive ``JOIN`` operation to obtain all relevant
attributes relevant to a product.
.. _e-commerce-catalog-entity-attribute-values:
Entity Attribute Values
```````````````````````
The final substantive pattern from relational modeling is the
entity-attribute-value schema where you would create a meta-model for
product data. In this approach, you maintain a table with three
columns, e.g. ``entity_id``, ``attribute_id``, ``value``, and these
triples describe each product.
Consider the description of an audio recording. You may have a series
of rows representing the following relationships:
+-----------------+-------------+------------------+
| Entity | Attribute | Value |
+=================+=============+==================+
| sku_00e8da9b | type | Audio Album |
+-----------------+-------------+------------------+
| sku_00e8da9b | title | A Love Supreme |
+-----------------+-------------+------------------+
| sku_00e8da9b | ... | ... |
+-----------------+-------------+------------------+
| sku_00e8da9b | artist | John Coltrane |
+-----------------+-------------+------------------+
| sku_00e8da9b | genre | Jazz |
+-----------------+-------------+------------------+
| sku_00e8da9b | genre | General |
+-----------------+-------------+------------------+
| ... | ... | ... |
+-----------------+-------------+------------------+
This schema is totally flexible:
- any entity can have any set of any attributes.
- New product categories do not require *any* changes to the data model in the database.
The downside for these models, is that all nontrivial queries require
large numbers of ``JOIN`` operations that results in large performance
penalties.
Avoid Modeling Product Data
```````````````````````````
Additionally some e-commerce solutions with relational database
systems avoid choosing one of the data models above, and serialize
all of this data into a ``BLOB`` column. While simple, the details
become difficult to access for search and sort.
Non-Relational Data Model
`````````````````````````
Because MongoDB is a non-relational database, the data model for your
product catalog can benefit from this additional flexibility. The best
models use a single MongoDB collection to store all the product data,
which is similar to the :ref:`single table inheritance
<e-commerce-catalog-single-table-inheritance>` relational model. MongoDB's dynamic schema
means that each :term:`document` need not conform to the same
schema. As a result, the document for each product only needs to
contain attributes relevant to that product.
Schema
~~~~~~
At the beginning of the document, the schema must contain general
product information, to facilitate searches of the entire
catalog. Then, a ``details`` sub-document that contains fields that
vary between product types. Consider the following example document
for an album product.
.. code-block:: javascript
{
sku: "00e8da9b",
type: "Audio Album",
title: "A Love Supreme",
description: "by John Coltrane",
asin: "B0000A118M",
shipping: {
weight: 6,
dimensions: {
width: 10,
height: 10,
depth: 1
},
},
pricing: {
list: 1200,
retail: 1100,
savings: 100,
pct_savings: 8
},
details: {
title: "A Love Supreme [Original Recording Reissued]",
artist: "John Coltrane",
genre: [ "Jazz", "General" ],
...
tracks: [
"A Love Supreme Part I: Acknowledgement",
"A Love Supreme Part II - Resolution",
"A Love Supreme, Part III: Pursuance",
"A Love Supreme, Part IV-Psalm"
],
},
}
A movie item would have the same fields for general product
information, shipping, and pricing, but have different details
sub-document. Consider the following:
.. code-block:: javascript
{
sku: "00e8da9d",
type: "Film",
...,
asin: "B000P0J0AQ",
shipping: { ... },
pricing: { ... },
details: {
title: "The Matrix",
director: [ "Andy Wachowski", "Larry Wachowski" ],
writer: [ "Andy Wachowski", "Larry Wachowski" ],
...,
aspect_ratio: "1.66:1"
},
}
.. note::
In MongoDB, you can have fields that hold multiple values
(i.e. arrays) without any restrictions on the number of fields or
values (as with ``genre_0`` and ``genre_1``) and also without
the need for a JOIN operation.
Operations
----------
For most deployments the primary use of the product catalog is to
perform search operations. This section provides an overview of
various types of queries that may be useful for supporting an e-commerce
site. All examples in this document use the Python programming
language and the :api:`PyMongo <python/current>` :term:`driver` for
MongoDB, but you can implement this system using any language you
choose.
Find Albums by Genre and Sort by Year Produced
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Querying
````````
This query returns the documents for the products of a specific genre,
sorted in reverse chronological order:
.. code-block:: python
query = db.products.find({'type':'Audio Album',
'details.genre': 'Jazz'})
query = query.sort([('details.issue_date', -1)])
Indexing
````````
To support this query, create a compound index on all the properties
used in the filter and in the sort:
.. code-block:: python
db.products.ensure_index([
('type', 1),
('details.genre', 1),
('details.issue_date', -1)])
.. note::
The final component of the index is the sort field. This allows
MongoDB to traverse the index in the sorted order to preclude a
slow in-memory sort.
Find Products Sorted by Percentage Discount Descending
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
While most searches will be for a particular type of product (e.g
album, movie, etc.,) in some situations you may want to return all
products in a certain price range, or discount percentage.
Querying
````````
To return this data use the pricing information that exists in all
products to find the products with the highest percentage discount:
.. code-block:: python
query = db.products.find( { 'pricing.pct_savings': {'$gt': 25 })
query = query.sort([('pricing.pct_savings', -1)])
Indexing
````````
To support this type of query, you will want to create an index on the
``pricing.pct_savings`` field:
.. code-block:: python
db.products.ensure_index('pricing.pct_savings')
Since MongoDB can read indexes in ascending or descending order, the
order of the index does not matter.
.. note::
If you want to perform range queries (e.g. "return all products
over $25") and then sort by another property like ``pricing.retail``,
MongoDB cannot use the index as effectively in this situation.
The field that you want to select a range, or perform sort
operations, must be the *last* field in a compound index in order
to avoid scanning an entire collection. Using different properties
within a single combined range query and sort operation requires
some scanning which will limit the speed of your query.
Find Movies Based on Starring Actor
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Querying
````````
Use the following query to select documents within the details of a
specified product type (i.e. ``Film``) of product (a movie) to find
products that contain a certain value (i.e. a specific actor in the
``details.actor`` field,) with the results sorted by date descending:
.. code-block:: python
query = db.products.find({'type': 'Film',
'details.actor': 'Keanu Reeves'})
query = query.sort([('details.issue_date', -1)])
Indexing
````````
To support this query, you may want to create the following index.
.. code-block:: python
db.products.ensure_index([
('type', 1),
('details.actor', 1),
('details.issue_date', -1)])
This index begins with the ``type`` field and then narrows by the
other search field, where the final component of the index is the sort
field to maximize index efficiency.
Find Movies with a Particular Word in the Title
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Regardless of database engine, in order to retrieve this information
the system will need to scan some number of documents or records to
satisfy this query.
Querying
````````
MongoDB supports regular expressions within queries. In Python, you
can use the ":py:mod:`re`" module to construct the query:
.. code-block:: python
import re
re_hacker = re.compile(r'.*hacker.*', re.IGNORECASE)
query = db.products.find({'type': 'Film', 'title': re_hacker})
query = query.sort([('details.issue_date', -1)])
MongoDB provides a special syntax for regular expression queries
without the need for the :py:mod:`re` module. Consider the following
alternative which is equivalent to the above example:
.. code-block:: python
query = db.products.find({
'type': 'Film',
'title': {'$regex': '.*hacker.*', '$options':'i'}})
query = query.sort([('details.issue_date', -1)])
The :operator:`$options <$regex>` operator specifies a case
insensitive match.
Indexing
````````
The indexing strategy for these kinds of queries is different from
previous attempts. Here, create an index on ``{ type: 1,
details.issue_date: -1, title: 1 }`` using the following command at
the Python/PyMongo console:
.. code-block:: python
db.products.ensure_index([
('type', 1),
('details.issue_date', -1),
('title', 1)])
This index makes it possible to avoid scanning whole documents by
using the index for scanning the title rather than forcing MongoDB to
scan whole documents for the title field. Additionally, to support the
sort on the ``details.issue_date`` field, by placing this field
*before* the ``title`` field, ensures that the result set is already
ordered before MongoDB filters title field.
Scaling
-------
Sharding
~~~~~~~~
Database performance for these kinds of deployments are dependent on
indexes. You may use :term:`sharding` to enhance performance by
allowing MongoDB to keep larger portions of those indexes in RAM. In
sharded configurations, select a :term:`shard key` that allows
:program:`mongos` to route queries directly to a single shard or small
group of shards.
Since most of the queries in this system include the ``type`` field,
include this in the shard key. Beyond this, the remainder of the shard
key is difficult to predict without information about your database's
actual activity and distribution. Consider that:
- ``details.issue_date`` would be a poor addition to the shard key
because, although it appears in a number of queries, no queries
were *selective* by this field.
- you should include one or more fields in the ``detail`` document
that you query frequently, and a field that has quasi-random
features, to prevent large unsplittable chunks.
In the following example, assume that the ``details.genre`` field is
the second-most queried field after ``type``. Enable sharding using
the following :dbcommand:`shardCollection` operation at the
Python/PyMongo console:
.. code-block:: pycon
>>> db.command('shardCollection', 'product', {
... key : { 'type': 1, 'details.genre' : 1, 'sku':1 } })
{ "collectionsharded" : "details.genre", "ok" : 1 }
.. note::
Even if you choose a "poor" shard key that requires
:program:`mongos` to broadcast all to all shards, you will still
see some benefits from sharding, because:
#. Sharding makes a larger amount of memory
available to store indexes, and
#. MongoDB will parallelize queries across shards, reducing
latency.
Read Preference
~~~~~~~~~~~~~~~
While :term:`sharding` is the best way to scale operations, some data
sets make it impossible to partition data so that :program:`mongos`
can route queries to specific shards. In these situations
:program:`mongos` sends the query to all shards and then combines the
results before returning to the client.
In these situations, you can add additional read performance by
allowing :program:`mongos` to read from the :term:`secondary`
instances in a :term:`replica set` by configuring :term:`read
preference` in your client. Read preference is configurable on a
per-connection or per-operation basis. In :api:`PyMongo
<python/current>`, set the :py:attr:`read_preference
<pymongo.collection.Collection.read_preference>` argument.
The
:py:attr:`SECONDARY <pymongo:pymongo.mongo_client.MongoClient.read_preference>`
property in the following example, permits reads from a
:term:`secondary` (as well as a primary) for the entire connection .
.. code-block:: python
conn = pymongo.MongoClient(read_preference=pymongo.SECONDARY)
Conversely, the
:py:attr:`SECONDARY_ONLY <pymongo:pymongo.mongo_client.MongoClient.read_preference>`
read preference means that the
client will only send read operation only to the secondary member
.. code-block:: python
conn = pymongo.MongoClient(read_preference=pymongo.SECONDARY_ONLY)
You can also specify
:py:class:`read_preference <pymongo.read_preferences.ReadPreference>`
for specific queries, as follows:
.. code-block:: python
results = db.product.find(..., read_preference=pymongo.SECONDARY)
or
.. code-block:: python
results = db.product.find(..., read_preference=pymongo.SECONDARY_ONLY)
.. seealso:: :manual:`Replica Set Read Preference </core/read-preference>`
Additional Resources
--------------------
`Retail Reference Architecture Part 1 (Blog Post) <http://www.mongodb.com/blog/post/retail-reference-architecture-part-1-building-flexible-searchable-low-latency-product?jmp=docs>`_