docs updates; docstore, UPDATE, indextool, guides
(cherry picked from commit 73cf973)
adriannuta committed Oct 11, 2019
1 parent c5e24b7 commit f71616b
Showing 8 changed files with 100 additions and 56 deletions.
@@ -72,7 +72,7 @@ The commands are as follows:
contain spaces instead of separators (according to your
charset_table settings) and lowercased letters in words.

- ``--html_strip INDEXNAME`` filters stdin using HTML stripper settings
- ``--htmlstrip INDEXNAME`` filters stdin using HTML stripper settings
for a given index, and prints the filtering results to stdout. Note
that the settings will be taken from sphinx.conf, and not the index
header.
60 changes: 49 additions & 11 deletions docs/getting-started/indexes.rst
@@ -15,8 +15,8 @@ In addition, a special index based on RealTime type, called `percolate`, can be
In the current version, indexes use a schema like a normal database table. The schema can have three main types of columns:

* the first column is always an unsigned 64 bit non-zero number, called `id`. Unlike in a database, there is no auto-increment mechanism, so you need to make sure the document ids are unique
* fulltext fields - they contain indexed content. There can be multiple fulltext fields per index. Fulltext searches can be made on all fields or selective. Currently the original text is not stored, so if it’s required to show their content in search results, a trip to the origin source must be made using the ids (or other identifier) obtained from the search
* attributes - their values are stored and are not used in fulltext matching. Instead they can be used for regular filtering, grouping, sorting. They can be also used in expressions of score ranking.
* full-text fields - they contain indexed content. There can be multiple full-text fields per index. Full-text searches can be made on all fields or on selected ones. Starting with 3.2 it's possible to also store the original content and retrieve it in results.
* attributes - their values are stored and are not used in full-text matching. Instead they can be used for regular filtering, grouping and sorting. They can also be used in ranking expressions.

Field and attribute names must start with a letter and can contain letters, digits and underscore.

@@ -65,21 +65,59 @@ As the engine can't globally do a uniqueness on the document ids, an important t

For this, there is an option that lets the delta index define a list of document ids to be suppressed from the main index. For more details, check :ref:`sql_query_killlist`.
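
As a rough sketch (assuming the ``mysource``/``myindex`` pair shown below, and using a placeholder condition for picking up new rows), a delta source carrying a kill-list could look like:

.. code-block:: none

    source delta : mysource {
        # placeholder boundary; in practice you track where the main index stopped
        sql_query          = SELECT id, title, description, category_id FROM mytable WHERE id > 1000
        sql_query_killlist = SELECT id FROM mytable WHERE id > 1000
    }

The ids returned by the kill-list query are then used at search time to suppress matching documents coming from the main index.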

An example of a plain index configuration using a MySQL source:

.. code-block:: none
source mysource {
type = mysql
sql_host = localhost
sql_user = myuser
sql_pass = mypass
sql_db = mydb
sql_query = SELECT id, title, description, category_id FROM mytable
sql_attr_uint = category_id
sql_field_string = title
}

index myindex {
type = plain
source = mysource
path = /path/to/myindex
...
}

Real-Time indexes
~~~~~~~~~~~~~~~~~

RealTime indexes allow online updates, but updating fulltext data and non-numeric attributes require a full row replace.
Real-Time indexes allow online updates, but updating full-text data and non-numeric attributes requires a full row replace.

The RealTIme index starts empty and you can add, replace, update or delete data in the same fashion as for a database table. The updates are first held into a memory zone, defined by :ref:`rt_mem_limit`.
The Real-Time index starts empty and you can add, replace, update or delete data in the same fashion as with a database table. The updates are first held in a memory zone, defined by :ref:`rt_mem_limit`.
When this gets filled, it is dumped as a disk chunk, whose structure is similar to that of a plain index. As the number of disk chunks increases, search performance decreases, since searching is done sequentially over the chunks.
To avoid that, there is a command that can merge the disk chunks into a single one - :ref:`optimize_index_syntax`.
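
For example, merging the chunks of an index called ``realtime`` (the name is just a placeholder) can be triggered over the SQL protocol:

.. code-block:: none

    OPTIMIZE INDEX realtime;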

Populating a RealTime can be done in two ways: firing INSERTs or converting a plain index to become RealTime.
Populating a Real-Time index can be done in two ways: firing INSERTs or converting a plain index to a Real-Time one.
In case of INSERTs, using a single worker (a script or code) that inserts one record at a time can be slow. You can speed this up by batching many rows into one INSERT statement and by using multiple workers that insert in parallel.
Parallel inserts are faster but use more CPU. The size of the data buffer memory (which we call the RAM chunk) also influences the insert speed.
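
As a sketch (assuming the ``realtime`` index defined in the example below), a batched insert over the SQL protocol could look like:

.. code-block:: none

    INSERT INTO realtime (id, title, description, category_id)
    VALUES
        (1, 'first title',  'first description',  10),
        (2, 'second title', 'second description', 11),
        (3, 'third title',  'third description',  12);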

An example of Real-Time index configuration:


.. code-block:: none
index realtime {
type = rt
path = /path/to/realtime
rt_field = title
rt_field = description
rt_attr_uint = category_id
rt_attr_string = title
rt_attr_json = metadata
...
}

Local distributed indexes
~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -89,8 +127,8 @@ In our case, a distributed index would look like:

.. code-block:: none
index_dist {
type = distributed
index index_dist {
type = distributed
local = index1
local = index2
...
@@ -105,7 +143,7 @@ Remote distributed indexes and high availability
.. code-block:: none
index mydist {
type = distributed
type = distributed
agent = box1:9312:shard1
agent = box2:9312:shard2
agent = box3:9312:shard3
@@ -117,7 +155,7 @@ Here we have split the data over 4 servers, each serving one of the shards. If o
.. code-block:: none
index mydist {
type = distributed
type = distributed
agent = box1:9312|box5:9312:shard1
agent = box2:9312|box6:9312:shard2
agent = box3:9312|box7:9312:shard3
@@ -140,8 +178,8 @@ replication address and port range in the config. Define :ref:`data_dir <data_di
.. code-block:: none
searchd {
listen = 9312
listen = 192.168.1.101:9360-9370:replication
listen = 9312
listen = 192.168.1.101:9360-9370:replication
data_dir = /var/lib/manticore/
...
}
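
With the daemons configured like this, the typical flow (a sketch; ``posts`` and ``myrtindex`` are placeholder names) is to create a cluster on one node and add an index to it:

.. code-block:: none

    CREATE CLUSTER posts;
    ALTER CLUSTER posts ADD myrtindex;

and to join it from the remaining nodes:

.. code-block:: none

    JOIN CLUSTER posts AT '192.168.1.101:9312';
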
16 changes: 15 additions & 1 deletion docs/getting-started/official-packages.rst
@@ -183,6 +183,7 @@ To create a new RT index, you need to define it in the sphinx.conf. A simple def
rt_field = title
rt_attr_uint = attr1
rt_attr_uint = attr2
stored_fields = title
}
To get the index online you need to either restart the daemon or send a HUP signal to it.
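
For example (a sketch; the PID file location depends on your ``pid_file`` setting):

.. code-block:: none

    kill -HUP $(cat /var/run/manticore/searchd.pid)
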
@@ -194,7 +195,7 @@ Unlike RT, the plain index requires setting up the source and run the indexing p
For this we need to edit the sphinx.conf configuration file. The initial configuration comes with a sample plain index along with a source.
For simplicity we use a MySQL source.

First, the database credentials need to be adjusted
First, the database credentials need to be adjusted in the source configuration:

.. code-block:: none
@@ -247,6 +248,19 @@ In our example group_id and date_added are attributes:
sql_attr_uint = group_id
sql_attr_timestamp = date_added
If we also want to store the original text or enable certain features (for example wildcard searching), we have to edit the index configuration:

.. code-block:: none
index test1
{
...
stored_fields = title
min_infix_len = 3
...
}

Once we have this setup, we can run the indexing process:

.. code-block:: none
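
    # a likely invocation (sketch - the exact command in the original guide may differ)
    # --rotate tells a running searchd to pick up the new index
    indexer test1 --rotate
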
2 changes: 1 addition & 1 deletion docs/getting-started/searching.rst
@@ -124,7 +124,7 @@ The ranking score is relative to the query itself as long as it includes metrics
Data tokenization
~~~~~~~~~~~~~~~~~

Search engines don't store text as it is. Instead they extract words and create several structures that allows fast full-text searching. From the found words, a dictionary is build, which allows a quick look to discover if the word is present or not in the index. In addition, other structures records the documents and fields in which the word was found (as well as position of it inside a field). All these are used when a full-text match is performed.
Search engines don't store text as-is for performing searches on it. Instead they extract words and build several structures that allow fast full-text searching. From the extracted words, a dictionary is built, which allows a quick lookup to discover whether a word is present in the index. In addition, other structures record the documents and fields in which the word was found (as well as its position inside the field). All of these are used when a full-text match is performed.

The process of demarcating and classifying words is called tokenization. Tokenization is applied at both indexing and searching time and operates at the character and word level. At the character level, the engine allows only certain characters to pass; this is defined by the charset_table, and anything else is replaced with a whitespace (which is considered the default word separator). The charset_table also allows mappings, for example lowercasing or simply replacing one character with another. Besides this, characters can be ignored, blended or defined as phrase boundaries.
At the word level, the base setting is min_word_len, which defines the minimum word length in characters to be accepted into the index. A common request is to match singular and plural forms of words; for this, morphology processors can be used. Going further, we might want one word to be matched as another because they are synonyms. For this, the wordforms feature can be used, which allows one or more words to be mapped to another.
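
To illustrate how these settings fit together, a minimal sketch of the relevant part of an index definition (the values are only examples) might be:

.. code-block:: none

    index myindex {
        ...
        charset_table = 0..9, A..Z->a..z, _, a..z
        min_word_len  = 2
        morphology    = stem_en
        wordforms     = /path/to/wordforms.txt
        ...
    }

where the wordforms file contains mappings such as ``walks > walk``.
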
6 changes: 4 additions & 2 deletions docs/indexing/data_types.rst
@@ -14,11 +14,13 @@ The identificator of a document in the index. Document IDs must be unique signed
Text
^^^^

It is the full-text field part of the index. The content of these fields is indexed and not stored in the original form.
It is the full-text field part of the index.
The text is passed through an analyzer pipeline that converts the text to words, applies morphology transformations etc.
Full-text fields can only be used in MATCH() clause, they are not returned in the result set and cannot be used for sorting or aggregation.
Full-text fields can only be used in the MATCH() clause and cannot be used for sorting or aggregation.
Words are stored in an inverted index along with references to the fields they belong to and their positions in the field.
This allows searching for a word inside each field and using advanced operators such as proximity.
By default the original text of the fields is only indexed and not stored, so it cannot be returned in results.
Starting with version 3.2.0, it's possible to optionally store the original content and retrieve it in results.
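
As an illustration (index and field names are placeholders), a full-text field is queried through MATCH():

.. code-block:: none

    SELECT id FROM myindex WHERE MATCH('@title hello world');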

String
^^^^^^^
19 changes: 5 additions & 14 deletions docs/indexing/full-text_fields.rst
@@ -12,18 +12,9 @@ search through “title” only) or a subset of fields (eg. to “title” and
“abstract” only). Manticore index format generally supports up to 256
fields.

Note that the original contents of the fields are **not** stored in
the Manticore index. The text that you send to Manticore gets processed, and a
full-text index (a special data structure that enables quick searches
for a keyword) gets built from that text. But the original text contents
are then simply discarded. Manticore assumes that you store those contents
elsewhere anyway.

Moreover, it is impossible to *fully* reconstruct the original text,
because the specific whitespace, capitalization, punctuation, etc will
all be lost during indexing. It is theoretically possible to partially
reconstruct a given document from the Manticore full-text index, but that
would be a slow process (especially if the :ref:`CRC
dictionary <dict>` is used, which
does not even store the original keywords and works with their hashes
instead).
The text that you send to Manticore gets processed, and a
full-text index (a special data structure that enables quick searches
for a keyword) gets built from that text.
Prior to Manticore Search 3.2, the original content of the fields is discarded and it's not possible to
*fully* reconstruct it. In newer versions, the original content can optionally be stored in the index.
34 changes: 17 additions & 17 deletions docs/indexing/indexes.rst
@@ -179,25 +179,25 @@ To control what access mode will be used :ref:`access_plain_attrs`, :ref:`access

Here is a table which can help you select your desired mode:

+-------------------------+-----------------------------------+--------------------------------------+----------------------------------------------+----------------------------+
| index part | keep it on disk | keep it in memory | cached in memory on daemon start | lock it in memory |
+-------------------------+-----------------------------------+--------------------------------------+----------------------------------------------+----------------------------+
+-------------------------+-----------------------------------+-----------------------------------------+----------------------------------------------+----------------------------+
| index part | keep it on disk | keep it in memory | cached in memory on daemon start | lock it in memory |
+-------------------------+-----------------------------------+-----------------------------------------+----------------------------------------------+----------------------------+
| .spa (plain attributes) | access_plain_attrs=mmap - the file will be mapped to RAM, but your OS will | access_plain_attrs = mmap_preread (default) | access_plain_attrs = mlock |
| .spe (skip lists) | decide whether to really load it to RAM or not and can easily swap it | | |
| .spi (word lists) | out (default) | | |
| .spt (lookups) | | | |
| .spm (killed docs) | | | |
+-------------------------+-----------------------------------+--------------------------------------+----------------------------------------------+----------------------------+
| .spe (skip lists) | decide whether to really load it to RAM or not and can easily swap it | | |
| .spi (word lists) | out (default) | | |
| .spt (lookups) | | | |
| .spm (killed docs) | | | |
+-------------------------+-----------------------------------+-----------------------------------------+----------------------------------------------+----------------------------+
| .spb (blob attributes) | access_blob_attrs=mmap - the file will be mapped to RAM, but your OS will | access_blob_attrs = mmap_preread (default) | access_blob_attrs = mlock |
| (string, mva and json | decide whether to really load it to RAM or not and can easily swap it | | |
| attributes) | out (default) | | |
+-------------------------+-----------------------------------+--------------------------------------+----------------------------------------------+----------------------------+
| .spd (doc lists) | access_doclists = file (default) | access_doclists = mmap, may be still | no | access_doclists = mlock |
| | | swapped out by OS | | |
+-------------------------+-----------------------------------+--------------------------------------+----------------------------------------------+----------------------------+
| .spp (hit lists) | access_hitlists = file (default) | access_hitlists = mmap, may be still | no | access_hitlists = mlock |
| | | swapped out by OS | | |
+-------------------------+-----------------------------------+--------------------------------------+----------------------------------------------+----------------------------+
| (string, mva and json | decide whether to really load it to RAM or not and can easily swap it | | |
| attributes) | out (default) | | |
+-------------------------+-----------------------------------+-----------------------------------------+----------------------------------------------+----------------------------+
| .spd (doc lists) | access_doclists = file (default) | access_doclists = mmap, may be still | no | access_doclists = mlock |
| | | swapped out by OS | | |
+-------------------------+-----------------------------------+-----------------------------------------+----------------------------------------------+----------------------------+
| .spp (hit lists) | access_hitlists = file (default) | access_hitlists = mmap, may be still | no | access_hitlists = mlock |
| | | swapped out by OS | | |
+-------------------------+-----------------------------------+-----------------------------------------+----------------------------------------------+----------------------------+


