concept: index search

Use Case considerations

What reasons for searching exist?

Why do helpdesk users or admins want to search? What are the different reasons, and which kind of search is most sensible for each?

The helpdesk wants to help right now. What do they search for?
The admin wants to get an overview. When do they do this? What is the expected output?

What content is searched for?

Do admins search directly for a specific token serial?

Do helpdesk users search for a specific user? In which cases? What do they expect to see: only the user, or also, e.g., the user's audit entries?

Why do they search for token types or other attributes?

Goal: Identify the important use cases to start with. Identify which types of searches are needed, which of them are performed in the UI, which are better performed with, e.g., the token janitor, and which are only supposed to produce statistical data.

Index Search

privacyIDEA should get a search bar on the dashboard so that admins can quickly find information within privacyIDEA. The most prominent types of information are tokens and users. For tokens, the user would like to search not only by serial, but possibly also by description, tokeninfo or rollout state. We do not know the exact needs yet, and implementing this would involve many database queries and a lot of additional logic within privacyIDEA. Users have to be resolved before they can be searched. Even with fast resolver connections, this can take unbearably long. In addition, the database/server that is responsible for authentication should not be busy processing search queries.

To overcome these issues, privacyIDEA could provide interfaces to one or two search providers such as elasticsearch or whoosh. These index the privacyIDEA data that the admin would like to be able to search. The index can even run on a different machine than privacyIDEA. However, this brings the risk of searching outdated data.

Keeping the cached index in sync with privacyIDEA can be done with logstash/elasticsearch, as explained here: https://www.elastic.co/de/blog/how-to-keep-elasticsearch-synchronized-with-a-relational-database-using-logstash

The privacyIDEA side

We need to specify the REST API and the lib interface for privacyIDEA. After this, everyone may implement their own search providers (we will of course ship at least a simple MySQL search provider).

Define the REST and the lib interface for privacyIDEA
Define the index design. How is data mapped to the index? Most likely the table level is the most suitable.
Write documentation on how to set up and sync the indices
Implement a search provider framework
Implement providers (accessible with SQLAlchemy)
- MySQL index
- elasticsearch
These providers will hold the interface to query search engines for token, user and policy information

Specification

REST specification

Request

GET https://yourprivacyideaserver/search
    query=<searchstring>, e.g. "serial:S23", "hans" or "serial:TOTP* tokeninfo:software" (required)
    searchtypes=<token,user,tokeninfo> (optional)
    deep=<true|false> (optional)
    skip=<token,user,tokeninfo> (optional)
    only_counts=<true|false> (optional)

query holds the search string but may also contain additional conditions, e.g. serial:PIS*.
searchtypes specifies which part(s) of privacyIDEA should be searched. At present this should be either user or token, but it may be extended to policies, events and all sorts of documentation. If searchtypes is not given, the query is applied everywhere. The sections are searched in the order of the elements in searchtypes.
deep: By default, the search call returns as soon as one section has results (early return). If the deep parameter is set, this early return does not happen and all searchtypes are searched before returning.
skip: The specified searchtypes are skipped. This is useful when continuing a search (see early return).
only_counts: By default, the matched objects and the counts are returned. only_counts optionally specifies that only the counts should be returned.
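
The way a query string mixes free text and field conditions can be illustrated with a small parser. This is only a sketch under assumptions: the function name parse_query and the returned structure are hypothetical, and the concept does not yet define how conditions are actually tokenized.

```python
def parse_query(query):
    """Split a search string into free-text terms and field conditions.

    Hypothetical helper: "field:value" parts become conditions,
    everything else is treated as a free-text term.
    """
    terms = []
    conditions = {}
    for part in query.split():
        field, sep, value = part.partition(":")
        if sep and field and value:
            # e.g. "serial:TOTP*" -> conditions["serial"] = ["TOTP*"]
            conditions.setdefault(field, []).append(value)
        else:
            terms.append(part)
    return terms, conditions


# The three example queries from the request description
print(parse_query("serial:S23"))
print(parse_query("hans"))
print(parse_query("serial:TOTP* tokeninfo:software"))
```

Wildcards such as TOTP* are passed through unchanged here; how they are evaluated would be up to the search provider.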

Early return

By default, the search results are returned as soon as there are matches for a searchtype. The searchtypes that were already searched are returned in the "searched" field. The search can therefore be continued by issuing another request to /search and passing the searchtypes from the "searched" field as the skip parameter; the server then simply skips them.

Alternatively, one could issue a new request to /search and pass the full response of the old request as a continue parameter. This way, privacyIDEA can append the new results to the old response. The deep parameter prevents the early return altogether: with it, all searchtypes are processed before the full result is returned.
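
The interplay of early return, skip and deep could look roughly like the following sketch. It assumes that each searchtype is served by a callable that returns a list of matches; the names run_search and backends, and the "matches" key, are hypothetical. It also assumes that "searched" only lists the searchtypes processed in the current request, so a client that continues more than once has to accumulate them.

```python
def run_search(query, backends, searchtypes=None, skip=(), deep=False):
    """Search the given searchtypes in order, returning early on the first hit.

    backends maps a searchtype name to a callable query -> list of matches
    (an assumption made for this sketch).
    """
    order = searchtypes or list(backends)
    response = {"searched": []}
    for searchtype in order:
        if searchtype in skip:
            continue  # already handled by an earlier request
        matches = backends[searchtype](query)
        response["searched"].append(searchtype)
        response[searchtype] = {
            "matches": matches,
            "counts": {"total": len(matches)},
        }
        if matches and not deep:
            break  # early return: stop at the first section with results
    return response


# Toy backends for illustration only
backends = {
    "token": lambda q: [],
    "user": lambda q: [{"username": "hans"}] if q == "hans" else [],
    "tokeninfo": lambda q: [{"serial": "TOTP0001"}] if q == "hans" else [],
}

first = run_search("hans", backends)                            # stops after "user"
second = run_search("hans", backends, skip=first["searched"])   # continues with "tokeninfo"
full = run_search("hans", backends, deep=True)                  # searches everything
```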

Are there other possibilities to realize continuous return of results?

Response

If the number of results is too large (say, more than 300 matches, or a limit set by policy), only the counts/stats of the results are returned. The potential counts to be returned are discussed in concept: stats endpoint for dashboard.

The result for each searchtype contains a counts value with the number of matches, which makes it easy to see whether this searchtype has matches. Additionally, the processed searchtypes are returned in the "searched" field of the value, e.g.

        {
          "id": 1,
          "jsonrpc": "2.0",
          "result": {
            "status": true,
            "value": [
              {
                "searched": ["user", "token"],
                "user": {
                  "realm1": [{ <user1> }, { <user2> }],
                  "realm2": [{ <user3> }, { <user4> }],
                  "counts": {
                    "total": 4,
                    "realm1": {"total": 2, ...},
                    "realm2": {"total": 2, ...}
                  }
                },
                "token": {
                  <token1>,
                  "counts": {
                    "total": 1,
                    "tokentype_totp": 1,
                    "assigned": 1,
                    "unassigned": 0,
                    "software": 0,
                    "hardware": 0
                  }
                }
              }
            ]
          }
        }
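
The rule that large result sets are reduced to their counts can be sketched as a small post-processing step per searchtype. The function name apply_result_limit and the default limit of 300 are assumptions taken from the example figure above; the real limit might come from a policy.

```python
def apply_result_limit(section, only_counts=False, max_results=300):
    """Return only the counts of a result section if it is too large.

    section is the result of one searchtype and must contain a
    "counts" dict with a "total" entry.
    """
    total = section["counts"]["total"]
    if only_counts or total > max_results:
        return {"counts": section["counts"]}
    return section


user_section = {
    "realm1": [{"username": "alice"}, {"username": "bob"}],
    "realm2": [{"username": "carol"}, {"username": "dave"}],
    "counts": {"total": 4, "realm1": {"total": 2}, "realm2": {"total": 2}},
}

print(apply_result_limit(user_section))                    # full section
print(apply_result_limit(user_section, max_results=3))     # counts only
print(apply_result_limit(user_section, only_counts=True))  # counts only
```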

lib specification

Analogous to the audit and logging providers, we will implement a SearchProvider base class with the following methods.

search(query, scope=None, deep=False)
create_index()
update_index()
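
A minimal sketch of such a base class, using the three method signatures listed above, might look as follows. The config argument and the toy InMemorySearchProvider are hypothetical and only serve to show how a concrete provider would plug in; scope is assumed to correspond to the searchtypes of the REST interface.

```python
from abc import ABC, abstractmethod


class SearchProvider(ABC):
    """Base class for search providers (sketch, not a final interface)."""

    def __init__(self, config=None):
        self.config = config or {}  # provider-specific settings (assumption)

    @abstractmethod
    def search(self, query, scope=None, deep=False):
        """Return matches per searchtype for the given query."""

    @abstractmethod
    def create_index(self):
        """Build the index from scratch."""

    @abstractmethod
    def update_index(self):
        """Bring an existing index up to date."""


class InMemorySearchProvider(SearchProvider):
    """Toy provider doing substring matching on in-memory records."""

    def __init__(self, records, config=None):
        super().__init__(config)
        self.records = records  # {searchtype: [dict, ...]}
        self.index = {}

    def create_index(self):
        # Index each record by the lower-cased concatenation of its values
        self.index = {
            searchtype: [
                (" ".join(str(v) for v in rec.values()).lower(), rec)
                for rec in recs
            ]
            for searchtype, recs in self.records.items()
        }

    def update_index(self):
        self.create_index()  # a real provider would update incrementally

    def search(self, query, scope=None, deep=False):
        needle = query.lower()
        results = {}
        for searchtype in scope or list(self.index):
            hits = [rec for text, rec in self.index.get(searchtype, [])
                    if needle in text]
            results[searchtype] = hits
            if hits and not deep:
                break  # early return as described for the REST interface
        return results


provider = InMemorySearchProvider({
    "token": [{"serial": "TOTP0001"}],
    "user": [{"username": "hans"}],
})
provider.create_index()
print(provider.search("hans"))
```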

Starting points for the implementation

One of the most active SQLAlchemy/Flask interfaces to search indexers is flask-msearch. It can use three backends: "simple", whoosh and elasticsearch.
Another, less active interface is Flask-WhooshAlchemy3 (no longer maintained since 2020).
whoosh is an indexer written in Python: https://github.com/mchaput/whoosh. It stores the index in files on disk.
elasticsearch is a powerful search engine written in Java and developed by Elastic: https://www.elastic.co/elasticsearch/

Access restrictions

Access restrictions by policies would be implemented at the endpoint level. Since the general /search endpoint involves access to different policy-restricted data, we need additional decorators for the endpoint.
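
One conceivable shape for such a decorator is sketched below: it removes the searchtypes the caller is not allowed to see before the actual search function runs. This is a plain Python illustration and does not use the existing privacyIDEA policy decorators; restrict_searchtypes and get_allowed are hypothetical names.

```python
from functools import wraps


def restrict_searchtypes(get_allowed):
    """Decorator factory: drop searchtypes that the caller may not search.

    get_allowed is a callable returning the set of permitted searchtypes,
    e.g. derived from the policies of the logged-in admin (assumption).
    """
    def decorator(func):
        @wraps(func)
        def wrapper(query, searchtypes, *args, **kwargs):
            allowed = get_allowed()
            permitted = [s for s in searchtypes if s in allowed]
            return func(query, permitted, *args, **kwargs)
        return wrapper
    return decorator


# Example: a helpdesk role that may only search tokens
@restrict_searchtypes(lambda: {"token"})
def search_endpoint(query, searchtypes):
    return {"searched": searchtypes}


print(search_endpoint("hans", ["user", "token"]))
```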

Search on Dashboard

The search will be shown as a single search bar. The results will be shown as a list per searchtype (since we need different column names). The result lists will be shown only if there are results and hidden if the search bar is empty.

Stats on Dashboard

Since different count values are returned for each searchtype, the /search endpoint can also be used to display statistics. To speed up the query, the search endpoint is queried with only_counts, which prevents the actual matched objects from being returned (see concept: stats endpoint for dashboard).
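
A client-side sketch of how the dashboard could build such a request is given below. The helper build_search_url is hypothetical; the parameter names are those of the REST specification above. It is assumed that a statistics request also sets deep, since otherwise the early return would stop after the first searchtype with matches.

```python
from urllib.parse import urlencode


def build_search_url(base_url, query, searchtypes=None, deep=False,
                     skip=None, only_counts=False):
    """Assemble a GET URL for the /search endpoint (illustrative helper)."""
    params = {"query": query}
    if searchtypes:
        params["searchtypes"] = ",".join(searchtypes)
    if deep:
        params["deep"] = "true"
    if skip:
        params["skip"] = ",".join(skip)
    if only_counts:
        params["only_counts"] = "true"
    return f"{base_url}/search?{urlencode(params)}"


# Counts for all TOTP serials, without the matched objects
stats_url = build_search_url(
    "https://yourprivacyideaserver", "serial:TOTP*",
    deep=True, only_counts=True,
)
print(stats_url)
```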