Usage with Solr

Shane Harvey edited this page Dec 23, 2016 · 23 revisions

NOTE: In mongo-connector versions < 2.5.0, the Solr doc manager was packaged as part of mongo-connector. In versions >= 2.5.0, the Solr doc manager is available as a plugin. For more information on how to install the Solr doc manager, please see the Solr doc manager documentation.

Solr doc manager: https://github.com/mongodb-labs/solr-doc-manager

Installation

As of mongo-connector 2.5.0, install mongo-connector together with the solr-doc-manager by running:

pip install 'mongo-connector[solr]'

Setup

Create Solr cores

Please consult the Solr documentation on how to create and manage Solr cores.

Make sure the LukeRequestHandler is Enabled

This line should be present in your solrconfig.xml file:

<requestHandler name="/admin/luke" class="org.apache.solr.handler.admin.LukeRequestHandler" />

Set up your Schema

Mongo Connector stores metadata in each document to help handle rollbacks. To support these data, you'll need to add the following to your schema.xml:

<field name="_ts" type="long" indexed="true" stored="true" />
<field name="ns" type="string" indexed="true" stored="true"/>

See also the section on "schema.xml" below. Mongo Connector does not support schemaless Solr at this time.
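As a quick sanity check before starting the connector, the following minimal sketch looks for the two metadata fields in a schema.xml document. The function name, the constant, and the trimmed sample schema are hypothetical and exist only for illustration; the sketch inspects plain field declarations only.

```python
import xml.etree.ElementTree as ET

# Metadata fields that Mongo Connector needs, per the text above.
REQUIRED_FIELDS = ("_ts", "ns")


def missing_metadata_fields(schema_xml):
    """Return the required field names that schema_xml does not declare."""
    root = ET.fromstring(schema_xml)
    declared = {field.get("name") for field in root.iter("field")}
    return sorted(name for name in REQUIRED_FIELDS if name not in declared)


# Hypothetical, heavily trimmed schema used only for illustration.
SAMPLE_SCHEMA = """
<schema name="example" version="1.5">
  <fields>
    <field name="_id" type="string" indexed="true" stored="true" />
    <field name="_ts" type="long" indexed="true" stored="true" />
  </fields>
</schema>
"""

print(missing_metadata_fields(SAMPLE_SCHEMA))  # ['ns']
```

To check a real core, read the contents of your own schema.xml and pass them to the function in place of the sample string.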

The Basics

Mongo Connector can replicate to the Solr search engine using the Solr DocManager. To start the connector, you must pass in the base URL for the Solr core to which you want to synchronize. The most basic usage is the following:

mongo-connector -m localhost:27017 -t http://localhost:8983/solr -d solr_doc_manager

Old usage (before the 2.0 release):

mongo-connector -m localhost:27017 -t http://localhost:8983/solr -d <your-doc-manager-folder>/solr_doc_manager.py

If you wanted to send all MongoDB documents to the "MyCore" core, you would do it like this:

mongo-connector -m localhost:27017 -t http://localhost:8983/solr/MyCore -d solr_doc_manager

Note that if you are running Solr 5.5, you must append the core name to the URL, as in the "MyCore" example above.

These examples assume that a MongoDB replica set is running on port 27017 and that Solr is running on port 8983, both on the local machine.

Mongo Connector and schema.xml

Configuring Solr

Please refer to the Apache documentation for configuring Solr and SolrCloud.

N.B.: Key Names and Document Flattening

Mongo Connector automatically "flattens" MongoDB documents. Fields within sub-documents can be referenced by their "dot-separated path" within the document. Likewise, array fields are unrolled, so that individual elements are accessible by the field's original name, plus a ".", plus the index within the array that the element occupied. An example:

{
    "subdoc": {
        "a": 1,
        "b": 2,
        "c": 3,
        "array": [
            {"name": "elmo"},
            {"name": "oscar"}
        ]
    }
}

will become the following in Solr:

{
    "subdoc.a": 1,
    "subdoc.b": 2,
    "subdoc.c": 3,
    "subdoc.array.0.name": "elmo",
    "subdoc.array.1.name": "oscar"
}
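The flattening rule described above can be sketched in a few lines of Python. This is an illustrative reimplementation of the rule as stated here, not Mongo Connector's actual code; the function name `flatten` is an assumption, and this version silently drops empty sub-documents and empty arrays.

```python
def flatten(document, prefix=""):
    """Flatten nested dicts and lists into one dict with dot-separated keys."""
    flat = {}
    if isinstance(document, dict):
        items = document.items()
    else:
        # Lists are unrolled: the element index becomes a path component.
        items = enumerate(document)
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, (dict, list)):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


doc = {
    "subdoc": {
        "a": 1,
        "b": 2,
        "c": 3,
        "array": [{"name": "elmo"}, {"name": "oscar"}],
    }
}

for key, value in flatten(doc).items():
    print(key, "=", value)
```

Running it on the example document yields the same five keys shown above, from `subdoc.a` through `subdoc.array.1.name`.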

Schema.xml

Mongo Connector comes with an example schema.xml file that can help you get started integrating MongoDB with Solr search. Solr reads schema.xml to find field types, the fields that documents may have, the primary key, and more. Mongo Connector tries to obtain the Solr schema through the LukeRequestHandler by appending the special URI admin/luke/?show=schema&wt=json to the base Solr URL. For the basic usage example above, with the base URL http://localhost:8983/solr, Mongo Connector sends a GET request to http://localhost:8983/solr/admin/luke/?show=schema&wt=json.
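To inspect the same endpoint yourself, the URL construction described above can be sketched as follows. The helper names are hypothetical, and `fetch_schema` is defined but not called because it needs a running Solr instance; it assumes the endpoint returns JSON, as the `wt=json` parameter requests.

```python
import json
import urllib.request

# URI from the text above, appended to the base Solr URL.
LUKE_SCHEMA_PATH = "admin/luke/?show=schema&wt=json"


def luke_schema_url(base_url):
    """Build the schema URL for a given base Solr (or core) URL."""
    return base_url.rstrip("/") + "/" + LUKE_SCHEMA_PATH


def fetch_schema(base_url):
    """Send a GET request to the Luke endpoint and decode the JSON reply."""
    with urllib.request.urlopen(luke_schema_url(base_url)) as response:
        return json.load(response)


print(luke_schema_url("http://localhost:8983/solr"))
print(luke_schema_url("http://localhost:8983/solr/MyCore"))
```

The second call shows the URL that results when the core name is part of the base URL.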

Mongo Connector drops fields from MongoDB documents that aren't declared in your Solr core's schema, so that Solr does not throw exceptions and fail to insert those documents. Until you define the fields you want in schema.xml and reload the Solr core, Mongo Connector will merrily continue stripping those undeclared fields from your MongoDB documents. You can check what Solr thinks your core's schema is by visiting the aforementioned endpoint in your browser.
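The effect of this field stripping can be illustrated with a simplified sketch. It assumes the schema is available as a plain set of declared field names and ignores other Solr schema features such as dynamic fields; the function and variable names are hypothetical, not Mongo Connector's API.

```python
def strip_undeclared_fields(flat_doc, declared_fields):
    """Split a flattened document into kept fields and dropped field names."""
    kept = {key: value for key, value in flat_doc.items() if key in declared_fields}
    dropped = sorted(set(flat_doc) - set(kept))
    return kept, dropped


# Hypothetical schema fields and document, for illustration only.
declared = {"_id", "_ts", "ns", "title"}
flat_doc = {"_id": "1", "title": "Sesame Street", "subdoc.a": 1}

kept, dropped = strip_undeclared_fields(flat_doc, declared)
print(kept)     # {'_id': '1', 'title': 'Sesame Street'}
print(dropped)  # ['subdoc.a']
```

A check like this against your own schema can reveal which fields would disappear before you notice them missing from search results.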

Unique Keys between Solr and MongoDB

MongoDB generally uses a field called _id to store unique keys in documents. Solr by default uses id for the same purpose. Both systems require their unique key field to be present in every document, so submitting a MongoDB document unchanged to Solr while Solr's unique key is still id results in an exception from Solr, and the document is not inserted. For Mongo Connector to replicate to Solr successfully, Solr needs to see the expected unique key in each document. There are two ways to do this:

  1. Mongo Connector can translate _id to id when operations are replicated to Solr if you specify the option --unique-key=id to mongo-connector. The new id field will hold a string-ified version of what was stored in the _id field.

  2. You can switch Solr's unique key to _id instead of id. If you're working from the schema.xml provided as part of Mongo Connector, this is already done for you! Otherwise, you can accomplish this by editing the schema.xml file and replacing the line:

    <uniqueKey>id</uniqueKey>
    

    with the line:

    <uniqueKey>_id</uniqueKey>
    

    You'll also need to add a field definition for this key. Inside the <fields></fields> tags, you should insert:

    <field name="_id" type="string" indexed="true" stored="true" />
    

    Finally, you'll need to reload your Solr core.
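The first option, translating _id to id, can be sketched as below. This is an illustration of the behavior described above, not Mongo Connector's code: the function name is hypothetical, and removing the original _id key is an assumption, since the text only says that the new field holds a string-ified copy of the value.

```python
def translate_unique_key(doc, unique_key="id"):
    """Return a copy of doc whose _id is stored as a string under unique_key."""
    translated = dict(doc)
    if unique_key != "_id":
        # Assumption: the original _id key is removed after translation.
        translated[unique_key] = str(translated.pop("_id"))
    return translated


print(translate_unique_key({"_id": 42, "name": "elmo"}))
# {'name': 'elmo', 'id': '42'}
```

With `unique_key="_id"`, which corresponds to the second option, the document is returned unchanged.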

Managing Commit Behavior

Mongo Connector does not force a commit on every write operation; rather, a Solr administrator should configure commit behavior in solrconfig.xml. This generally increases overall performance, since not every operation has to be flushed to disk immediately.

Mongo Connector also provides the --auto-commit-interval option, which overrides any commit settings in solrconfig.xml, though configuring commit behavior in solrconfig.xml should be preferred when possible. The option takes a number: the maximum number of seconds allowed to pass before a write must be committed. A value of 0 means that every write operation is committed immediately:

# commit every write immediately
mongo-connector --auto-commit-interval=0 -d solr_doc_manager -t http://localhost:8983/solr
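The three cases (no interval, an interval of 0, and a positive interval) could map onto Solr update parameters roughly as in the sketch below. This is an assumption about how a client might express the option, not a description of Mongo Connector's internals: `commit_kwargs` is a hypothetical helper, and the returned keys mirror the `commit` and `commitWithin` update parameters, where Solr expects `commitWithin` in milliseconds.

```python
def commit_kwargs(auto_commit_interval=None):
    """Translate an interval in seconds into Solr update parameters."""
    if auto_commit_interval is None:
        # No interval given: leave commits to solrconfig.xml.
        return {"commit": False}
    if auto_commit_interval == 0:
        # Commit every write immediately.
        return {"commit": True}
    # Ask Solr to commit within the interval (seconds -> milliseconds).
    return {"commit": False, "commitWithin": int(auto_commit_interval * 1000)}


print(commit_kwargs())    # {'commit': False}
print(commit_kwargs(0))   # {'commit': True}
print(commit_kwargs(10))  # {'commit': False, 'commitWithin': 10000}
```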

Solr-5.5

There are a few small changes between solr-4.9 and solr-5.5.

  1. The schema.xml file did not work with solr-5.5 until this commit. If you are using an older version of our schema.xml file, please update.

  2. Solr-5.5 expects the core name to be appended to the solr address. Previously we were setting SOLR_URL to 'http://localhost:8983/solr', and with 5.5 we have to set SOLR_URL to 'http://localhost:8983/solr/<mycore>'.

  3. The way GridFS files are parsed has changed. In 4.9, the data is stored in the 'content' field when copied to Solr; in 5.5, it is copied to '_text_' and includes some metadata.