docs updates; docstore, UPDATE, indextool, guides
(cherry picked from commit 73cf973)
adriannuta committed Oct 11, 2019
1 parent c5e24b7 commit f71616b
Showing 8 changed files with 100 additions and 56 deletions.
@@ -72,7 +72,7 @@ The commands are as follows:
contain spaces instead of separators (according to your
charset_table settings) and lowercased letters in words.

- ``--html_strip INDEXNAME`` filters stdin using HTML stripper settings
- ``--htmlstrip INDEXNAME`` filters stdin using HTML stripper settings
for a given index, and prints the filtering results to stdout. Note
that the settings will be taken from sphinx.conf, and not the index
header.
60 changes: 49 additions & 11 deletions docs/getting-started/indexes.rst
@@ -15,8 +15,8 @@ In addition, a special index based on RealTime type, called `percolate`, can be
In the current version, indexes use a schema like a normal database table. The schema can have three main types of columns:

* the first column is always an unsigned 64 bit non-zero number, called `id`. Unlike in a database, there is no auto-increment mechanism, so you need to make sure the document ids are unique
* fulltext fields - they contain indexed content. There can be multiple fulltext fields per index. Fulltext searches can be made on all fields or selective. Currently the original text is not stored, so if it’s required to show their content in search results, a trip to the origin source must be made using the ids (or other identifier) obtained from the search
* attributes - their values are stored and are not used in fulltext matching. Instead they can be used for regular filtering, grouping, sorting. They can be also used in expressions of score ranking.
* full-text fields - they contain indexed content. There can be multiple full-text fields per index. Full-text searches can be made on all fields or on selected ones. Starting with 3.2 it's possible to also store the original content and retrieve it in results.
* attributes - their values are stored and are not used in full-text matching. Instead they can be used for regular filtering, grouping and sorting. They can also be used in ranking expressions.

Field and attribute names must start with a letter and can contain letters, digits and underscore.

@@ -65,21 +65,59 @@ As the engine can't globally do a uniqueness on the document ids, an important t

For this, there is an option that lets the delta index define a list of document ids to be suppressed from the main index. For more details, check :ref:`sql_query_killlist`.
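
As a rough sketch (assuming the ``mysource``/``myindex`` pair shown below, and using a placeholder condition for picking up new rows), a delta source carrying a kill-list could look like:

.. code-block:: none

    source delta : mysource {
        # placeholder boundary; in practice you track where the main index stopped
        sql_query          = SELECT id, title, description, category_id FROM mytable WHERE id > 1000
        sql_query_killlist = SELECT id FROM mytable WHERE id > 1000
    }

The ids returned by the kill-list query are then used at search time to suppress matching documents coming from the main index.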

An example of a plain index configuration using a MySQL source:

.. code-block:: none
source mysource {
type = mysql
sql_host = localhost
sql_user = myuser
sql_pass = mypass
sql_db = mydb
sql_query = SELECT id, title, description, category_id FROM mytable
sql_attr_uint = category_id
sql_field_string = title
}

index myindex {
type = plain
source = mysource
path = /path/to/myindex
...
}

Real-Time indexes
~~~~~~~~~~~~~~~~~

RealTime indexes allow online updates, but updating fulltext data and non-numeric attributes require a full row replace.
Real-Time indexes allow online updates, but updating full-text data and non-numeric attributes requires a full row replace.

The RealTIme index starts empty and you can add, replace, update or delete data in the same fashion as for a database table. The updates are first held into a memory zone, defined by :ref:`rt_mem_limit`.
The Real-Time index starts empty and you can add, replace, update or delete data in the same fashion as with a database table. The updates are first held in a memory zone, defined by :ref:`rt_mem_limit`.
When this gets filled, it is dumped as a disk chunk, whose structure is similar to that of a plain index. As the number of disk chunks increases, search performance decreases, since searching is done sequentially over the chunks.
To avoid that, there is a command that can merge the disk chunks into a single one - :ref:`optimize_index_syntax`.
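
For example, merging the chunks of an index called ``realtime`` (the name is just a placeholder) can be triggered over the SQL protocol:

.. code-block:: none

    OPTIMIZE INDEX realtime;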

Populating a RealTime can be done in two ways: firing INSERTs or converting a plain index to become RealTime.
Populating a Real-Time index can be done in two ways: firing INSERTs or converting a plain index to a Real-Time one.
In case of INSERTs, using a single worker (a script or code) that inserts one record at a time can be slow. You can speed this up by batching many rows into one INSERT statement and by using multiple workers that insert in parallel.
Parallel inserts are faster but use more CPU. The size of the data buffer memory (which we call the RAM chunk) also influences the insert speed.
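
As a sketch (assuming the ``realtime`` index defined in the example below), a batched insert over the SQL protocol could look like:

.. code-block:: none

    INSERT INTO realtime (id, title, description, category_id)
    VALUES
        (1, 'first title',  'first description',  10),
        (2, 'second title', 'second description', 11),
        (3, 'third title',  'third description',  12);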

An example of Real-Time index configuration:


.. code-block:: none
index realtime {
type = rt
path = /path/to/realtime
rt_field = title
rt_field = description
rt_attr_uint = category_id
rt_attr_string = title
rt_attr_json = metadata
...
}

Local distributed indexes
~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -89,8 +127,8 @@ In our case, a distributed index would look like:

.. code-block:: none
index_dist {
type = distributed
index index_dist {
type = distributed
local = index1
local = index2
...
@@ -105,7 +143,7 @@ Remote distributed indexes and high availability
.. code-block:: none
index mydist {
type = distributed
type = distributed
agent = box1:9312:shard1
agent = box2:9312:shard2
agent = box3:9312:shard3
@@ -117,7 +155,7 @@ Here we have split the data over 4 servers, each serving one of the shards. If o
.. code-block:: none
index mydist {
type = distributed
type = distributed
agent = box1:9312|box5:9312:shard1
agent = box2:9312|box6:9312:shard2
agent = box3:9312|box7:9312:shard3
@@ -140,8 +178,8 @@ replication address and port range in the config. Define :ref:`data_dir <data_di
.. code-block:: none
searchd {
listen = 9312
listen = 192.168.1.101:9360-9370:replication
listen = 9312
listen = 192.168.1.101:9360-9370:replication
data_dir = /var/lib/manticore/
...
}
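
With the daemons configured like this, the typical flow (a sketch; ``posts`` and ``myrtindex`` are placeholder names) is to create a cluster on one node and add an index to it:

.. code-block:: none

    CREATE CLUSTER posts;
    ALTER CLUSTER posts ADD myrtindex;

and to join it from the remaining nodes:

.. code-block:: none

    JOIN CLUSTER posts AT '192.168.1.101:9312';
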
16 changes: 15 additions & 1 deletion docs/getting-started/official-packages.rst
@@ -183,6 +183,7 @@ To create a new RT index, you need to define it in the sphinx.conf. A simple def
rt_field = title
rt_attr_uint = attr1
rt_attr_uint = attr2
stored_fields = title
}
To get the index online you need to either restart the daemon or send a HUP signal to it.
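
For example (a sketch; the PID file location depends on your ``pid_file`` setting):

.. code-block:: none

    kill -HUP $(cat /var/run/manticore/searchd.pid)
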
@@ -194,7 +195,7 @@ Unlike RT, the plain index requires setting up the source and run the indexing p
For this we need to edit the sphinx.conf configuration file. The initial configuration comes with a sample plain index along with a source.
For simplicity we use a MySQL source.

First, the database credentials need to be adjusted
First, the database credentials need to be adjusted in the source configuration:

.. code-block:: none
@@ -247,6 +248,19 @@ In our example group_id and date_added are attributes:
sql_attr_uint = group_id
sql_attr_timestamp = date_added
If we also want to store the original text or enable certain features (for example wildcard searching), we have to edit the index configuration:

.. code-block:: none
index test1
{
...
stored_fields = title
min_infix_len = 3
...
}

Once we have this setup, we can run the indexing process:

.. code-block:: none
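
    # a likely invocation (sketch - the exact command in the original guide may differ)
    # --rotate tells a running searchd to pick up the new index
    indexer test1 --rotate
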
2 changes: 1 addition & 1 deletion docs/getting-started/searching.rst
@@ -124,7 +124,7 @@ The ranking score is relative to the query itself as long as it includes metrics
Data tokenization
~~~~~~~~~~~~~~~~~

Search engines don't store text as it is. Instead they extract words and create several structures that allows fast full-text searching. From the found words, a dictionary is build, which allows a quick look to discover if the word is present or not in the index. In addition, other structures records the documents and fields in which the word was found (as well as position of it inside a field). All these are used when a full-text match is performed.
Search engines don't store text as-is for performing searches on it. Instead they extract words and build several structures that allow fast full-text searching. From the extracted words, a dictionary is built, which allows a quick lookup to discover whether a word is present in the index. In addition, other structures record the documents and fields in which the word was found (as well as its position inside the field). All of these are used when a full-text match is performed.

The process of demarcating and classifying words is called tokenization. Tokenization is applied at both indexing and searching time and operates at the character and word level. At the character level, the engine allows only certain characters to pass; this is defined by the charset_table, and anything else is replaced with a whitespace (which is considered the default word separator). The charset_table also allows mappings, for example lowercasing or simply replacing one character with another. Besides this, characters can be ignored, blended or defined as phrase boundaries.
At the word level, the base setting is min_word_len, which defines the minimum word length in characters to be accepted into the index. A common request is to match singular and plural forms of words; for this, morphology processors can be used. Going further, we might want one word to be matched as another because they are synonyms. For this, the wordforms feature can be used, which allows one or more words to be mapped to another.
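
To illustrate how these settings fit together, a minimal sketch of the relevant part of an index definition (the values are only examples) might be:

.. code-block:: none

    index myindex {
        ...
        charset_table = 0..9, A..Z->a..z, _, a..z
        min_word_len  = 2
        morphology    = stem_en
        wordforms     = /path/to/wordforms.txt
        ...
    }

where the wordforms file contains mappings such as ``walks > walk``.
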
6 changes: 4 additions & 2 deletions docs/indexing/data_types.rst
@@ -14,11 +14,13 @@ The identificator of a document in the index. Document IDs must be unique signed
Text
^^^^

It is the full-text field part of the index. The content of these fields is indexed and not stored in the original form.
It is the full-text field part of the index.
The text is passed through an analyzer pipeline that converts the text to words, applies morphology transformations etc.
Full-text fields can only be used in MATCH() clause, they are not returned in the result set and cannot be used for sorting or aggregation.
Full-text fields can only be used in the MATCH() clause and cannot be used for sorting or aggregation.
Words are stored in an inverted index along with references to the fields they belong to and their positions in the field.
This allows searching for a word inside each field and using advanced operators such as proximity.
By default the original text of the fields is only indexed and not stored, so it cannot be returned in results.
Starting with version 3.2.0, it's possible to optionally store the original content and retrieve it in results.
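
As an illustration (index and field names are placeholders), a full-text field is queried through MATCH():

.. code-block:: none

    SELECT id FROM myindex WHERE MATCH('@title hello world');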

String
^^^^^^^
19 changes: 5 additions & 14 deletions docs/indexing/full-text_fields.rst
@@ -12,18 +12,9 @@ search through “title” only) or a subset of fields (eg. to “title” and
“abstract” only). Manticore index format generally supports up to 256
fields.

Note that the original contents of the fields are **not** stored in
the Manticore index. The text that you send to Manticore gets processed, and a
full-text index (a special data structure that enables quick searches
for a keyword) gets built from that text. But the original text contents
are then simply discarded. Manticore assumes that you store those contents
elsewhere anyway.

Moreover, it is impossible to *fully* reconstruct the original text,
because the specific whitespace, capitalization, punctuation, etc will
all be lost during indexing. It is theoretically possible to partially
reconstruct a given document from the Manticore full-text index, but that
would be a slow process (especially if the :ref:`CRC
dictionary <dict>` is used, which
does not even store the original keywords and works with their hashes
instead).
The text that you send to Manticore gets processed, and a
full-text index (a special data structure that enables quick searches
for a keyword) gets built from that text.
Prior to Manticore Search 3.2, the original content of the fields is discarded and it's not possible to
*fully* reconstruct it. In newer versions, the original content can optionally be stored in the index.
34 changes: 17 additions & 17 deletions docs/indexing/indexes.rst
@@ -179,25 +179,25 @@ To control what access mode will be used :ref:`access_plain_attrs`, :ref:`access

Here is a table which can help you select your desired mode:

+-------------------------+-----------------------------------+--------------------------------------+----------------------------------------------+----------------------------+
| index part | keep it on disk | keep it in memory | cached in memory on daemon start | lock it in memory |
+-------------------------+-----------------------------------+--------------------------------------+----------------------------------------------+----------------------------+
+-------------------------+-----------------------------------+-----------------------------------------+----------------------------------------------+----------------------------+
| index part | keep it on disk | keep it in memory | cached in memory on daemon start | lock it in memory |
+-------------------------+-----------------------------------+-----------------------------------------+----------------------------------------------+----------------------------+
| .spa (plain attributes) | access_plain_attrs=mmap - the file will be mapped to RAM, but your OS will | access_plain_attrs = mmap_preread (default) | access_plain_attrs = mlock |
| .spe (skip lists) | decide whether to really load it to RAM or not and can easily swap it | | |
| .spi (word lists) | out (default) | | |
| .spt (lookups) | | | |
| .spm (killed docs) | | | |
+-------------------------+-----------------------------------+--------------------------------------+----------------------------------------------+----------------------------+
| .spe (skip lists) | decide whether to really load it to RAM or not and can easily swap it | | |
| .spi (word lists) | out (default) | | |
| .spt (lookups) | | | |
| .spm (killed docs) | | | |
+-------------------------+-----------------------------------+-----------------------------------------+----------------------------------------------+----------------------------+
| .spb (blob attributes) | access_blob_attrs=mmap - the file will be mapped to RAM, but your OS will | access_blob_attrs = mmap_preread (default) | access_blob_attrs = mlock |
| (string, mva and json | decide whether to really load it to RAM or not and can easily swap it | | |
| attributes) | out (default) | | |
+-------------------------+-----------------------------------+--------------------------------------+----------------------------------------------+----------------------------+
| .spd (doc lists) | access_doclists = file (default) | access_doclists = mmap, may be still | no | access_doclists = mlock |
| | | swapped out by OS | | |
+-------------------------+-----------------------------------+--------------------------------------+----------------------------------------------+----------------------------+
| .spp (hit lists) | access_hitlists = file (default) | access_hitlists = mmap, may be still | no | access_hitlists = mlock |
| | | swapped out by OS | | |
+-------------------------+-----------------------------------+--------------------------------------+----------------------------------------------+----------------------------+
| (string, mva and json | decide whether to really load it to RAM or not and can easily swap it | | |
| attributes) | out (default) | | |
+-------------------------+-----------------------------------+-----------------------------------------+----------------------------------------------+----------------------------+
| .spd (doc lists) | access_doclists = file (default) | access_doclists = mmap, may be still | no | access_doclists = mlock |
| | | swapped out by OS | | |
+-------------------------+-----------------------------------+-----------------------------------------+----------------------------------------------+----------------------------+
| .spp (hit lists) | access_hitlists = file (default) | access_hitlists = mmap, may be still | no | access_hitlists = mlock |
| | | swapped out by OS | | |
+-------------------------+-----------------------------------+-----------------------------------------+----------------------------------------------+----------------------------+


