This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Use the simple dictionary in fts for the user directory #8959

Merged 2 commits on Dec 17, 2020

@babolivier (Contributor) commented Dec 17, 2020

Some user directory searches would fail because PostgreSQL would ignore common English words. For example, when looking up a user with the display name "Bela":

$ curl -H "Authorization: Bearer [...]" https://trelawney.brendanabolivier.com/_matrix/client/r0/user_directory/search -d '{"search_term": "b"}'
{"limited":false,"results":[{"user_id":"@sometestuser:labs.abolivier.bzh","display_name":"Bela","avatar_url":null}]}
$ curl -H "Authorization: Bearer [...]" https://trelawney.brendanabolivier.com/_matrix/client/r0/user_directory/search -d '{"search_term": "be"}' 
{"limited":false,"results":[]}

In this example, PostgreSQL identifies "be" as a common English word and doesn't consider it for the full-text search.

The fix is to use the "simple" dictionary, which has no stop words, i.e. common words that are ignored when running a full-text search. With that change, the example above returns:

$ curl -H "Authorization: Bearer [...]" https://trelawney.brendanabolivier.com/_matrix/client/r0/user_directory/search -d '{"search_term": "b"}'
{"limited":false,"results":[{"user_id":"@sometestuser:labs.abolivier.bzh","display_name":"Bela","avatar_url":null}]}
$ curl -H "Authorization: Bearer [...]" https://trelawney.brendanabolivier.com/_matrix/client/r0/user_directory/search -d '{"search_term": "be"}'
{"limited":false,"results":[{"user_id":"@sometestuser:labs.abolivier.bzh","display_name":"Bela","avatar_url":null}]}

Looking up "be" now correctly yields our "Bela" user in the search results.
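
To make the behaviour above concrete, here is a minimal toy model in Python. It is only a sketch under stated assumptions, not PostgreSQL's or Synapse's implementation: the stop-word set is a small illustrative subset, stemming is left out, and the helper names `to_lexemes` and `prefix_search` are hypothetical. It assumes the search matches each word of the term as a prefix, which is consistent with "b" finding "Bela" in the output above.

```python
# Toy model only: PostgreSQL's real 'english' configuration also stems words
# and ships a much longer stop-word list than this illustrative subset.
ENGLISH_STOP_WORDS = {"a", "an", "and", "be", "is", "the", "to"}


def to_lexemes(text, config):
    """Lowercase and split text, dropping stop words for the 'english' config."""
    words = text.lower().split()
    if config == "english":
        return [w for w in words if w not in ENGLISH_STOP_WORDS]
    if config == "simple":
        return words
    raise ValueError(f"unknown config: {config}")


def prefix_search(term, display_names, config):
    """Return the display names in which every word of the term is a prefix
    of some lexeme (assumed search semantics, see the note above)."""
    query = to_lexemes(term, config)
    if not query:
        # Every word of the term was a stop word, so nothing can match.
        return []
    return [
        name
        for name in display_names
        if all(
            any(lexeme.startswith(q) for lexeme in to_lexemes(name, config))
            for q in query
        )
    ]


names = ["Bela"]
print(prefix_search("b", names, "english"))   # ['Bela']
print(prefix_search("be", names, "english"))  # []
print(prefix_search("be", names, "simple"))   # ['Bela']
```

In this model the "english" lookup for "be" returns nothing because the whole search term is discarded before any matching happens, while the "simple" lookup keeps the term and finds "Bela".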

fwiw my initial intention is mostly to fix this for search_user_dir but I've also changed _update_profile_in_user_dir_txn that way for consistency.

@babolivier babolivier self-assigned this Dec 17, 2020
@babolivier babolivier requested a review from a team December 17, 2020 10:52
@babolivier babolivier force-pushed the babolivier/user_dir_lang branch 3 times, most recently from 3230f54 to b372bb2 Compare December 17, 2020 11:28
@anoadragon453 (Member) commented:

Probably a less common usecase I'll admit, but would this hamper results for displaynames that actually did have "be" as a separate word in them? Say "Be Awesome" or something.

@babolivier (Contributor, Author) commented Dec 17, 2020

@anoadragon453 No, it would still work - this PR just makes it so "be" isn't considered a word to ignore for fts.

@clokep (Contributor) left a review comment:

Based on the conversation on matrix this seems OK!

I think this fixes #2931, by the way.

@babolivier babolivier merged commit f2783fc into develop Dec 17, 2020
@babolivier babolivier deleted the babolivier/user_dir_lang branch December 17, 2020 13:42
@callahad (Contributor) commented:

> Probably a less common usecase I'll admit, but would this hamper results for displaynames that actually did have "be" as a separate word in them? Say "Be Awesome" or something.

For display names with stop words ("Be Awesome"), do existing databases have bad data? Is there anything we can do to remediate that?

If I'm reading this right, update_profile_in_user_dir() was previously inserting into the user_directory_search table using an English tsvector. Would that mean stopwords like "be" get omitted, and would this be missing in the tsvector for entries populated prior to this patch?

@callahad (Contributor) commented:

(cc @babolivier for the above)

@callahad (Contributor) commented:

For example:

postgres=# SELECT to_tsvector('english', 'be awesome');
 to_tsvector 
-------------
 'awesom':2
(1 row)

postgres=# SELECT to_tsvector('simple', 'be awesome');
    to_tsvector     
--------------------
 'awesome':2 'be':1
(1 row)

And then considering this patch changes stuff like this:

diff --git a/synapse/storage/databases/main/user_directory.py b/synapse/storage/databases/main/user_directory.py
index d87ceec6da8..b4fa8a7b613 100644
--- a/synapse/storage/databases/main/user_directory.py
+++ b/synapse/storage/databases/main/user_directory.py
@@ -393,9 +393,9 @@ def _update_profile_in_user_dir_txn(txn):
                     sql = """
                         INSERT INTO user_directory_search(user_id, vector)
                         VALUES (?,
-                            setweight(to_tsvector('english', ?), 'A')
-                            || setweight(to_tsvector('english', ?), 'D')
-                            || setweight(to_tsvector('english', COALESCE(?, '')), 'B')
+                            setweight(to_tsvector('simple', ?), 'A')
+                            || setweight(to_tsvector('simple', ?), 'D')
+                            || setweight(to_tsvector('simple', COALESCE(?, '')), 'B')
                         ) ON CONFLICT (user_id) DO UPDATE SET vector=EXCLUDED.vector
                     """
                     txn.execute(

...so what about older contents of user_directory_search?
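
The concern can be sketched with a small Python model that reuses the lexemes from the psql output above. This is a hedged illustration, not Synapse code: the `matches` helper is hypothetical, and it assumes that a search matches each word of the term as a prefix of a stored lexeme and that, after the patch, search terms are no longer stemmed.

```python
# Lexemes copied from the psql output above for the text 'be awesome'.
ROW_WRITTEN_WITH_ENGLISH = {"awesom"}        # stop word dropped, word stemmed
ROW_WRITTEN_WITH_SIMPLE = {"awesome", "be"}  # words kept as typed


def matches(stored_lexemes, search_term):
    """Assumed search semantics: every word of the term must be a prefix
    of at least one stored lexeme. Words are lowercased but not stemmed,
    as with the 'simple' configuration."""
    words = search_term.lower().split()
    return bool(words) and all(
        any(lexeme.startswith(word) for lexeme in stored_lexemes)
        for word in words
    )


# Compare a row written before the patch with one written after it.
for term in ("be", "awe", "awesome"):
    print(
        term,
        matches(ROW_WRITTEN_WITH_ENGLISH, term),
        matches(ROW_WRITTEN_WITH_SIMPLE, term),
    )
```

Under these assumptions, a row written before the patch still matches "awe" but matches neither "be" nor the full word "awesome", because the stored lexeme is the stem 'awesom'. Since the statement in the diff ends with `ON CONFLICT (user_id) DO UPDATE SET vector=EXCLUDED.vector`, such a row would presumably be rewritten with 'simple' lexemes the next time that user's profile is updated; whether existing rows are rebuilt in bulk is not stated in this thread.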

clokep added a commit that referenced this pull request Jan 6, 2021
Synapse 1.25.0rc1 (2021-01-06)
==============================

Removal warning
---------------

The old [Purge Room API](https://github.com/matrix-org/synapse/tree/master/docs/admin_api/purge_room.md)
and [Shutdown Room API](https://github.com/matrix-org/synapse/tree/master/docs/admin_api/shutdown_room.md)
are deprecated and will be removed in a future release. They will be replaced by the
[Delete Room API](https://github.com/matrix-org/synapse/tree/master/docs/admin_api/rooms.md#delete-room-api).

`POST /_synapse/admin/v1/rooms/<room_id>/delete` replaces `POST /_synapse/admin/v1/purge_room` and
`POST /_synapse/admin/v1/shutdown_room/<room_id>`.

Features
--------

- Add an admin API that lets server admins get power in rooms in which local users have power. ([\#8756](#8756))
- Add optional HTTP authentication to replication endpoints. ([\#8853](#8853))
- Improve the error messages printed as a result of configuration problems for extension modules. ([\#8874](#8874))
- Add the number of local devices to Room Details Admin API. Contributed by @dklimpel. ([\#8886](#8886))
- Add `X-Robots-Tag` header to stop web crawlers from indexing media. Contributed by Aaron Raimist. ([\#8887](#8887))
- Spam-checkers may now define their methods as `async`. ([\#8890](#8890))
- Add support for allowing users to pick their own user ID during a single-sign-on login. ([\#8897](#8897), [\#8900](#8900), [\#8911](#8911), [\#8938](#8938), [\#8941](#8941), [\#8942](#8942), [\#8951](#8951))
- Add an `email.invite_client_location` configuration option to send a web client location to the invite endpoint on the identity server which allows customisation of the email template. ([\#8930](#8930))
- The search term in the list room and list user Admin APIs is now treated as case-insensitive. ([\#8931](#8931))
- Apply an IP range blacklist to push and key revocation requests. ([\#8821](#8821), [\#8870](#8870), [\#8954](#8954))
- Add an option to allow re-use of user-interactive authentication sessions for a period of time. ([\#8970](#8970))
- Allow running the redact endpoint on workers. ([\#8994](#8994))

Bugfixes
--------

- Fix bug where we might not correctly calculate the current state for rooms with multiple extremities. ([\#8827](#8827))
- Fix a long-standing bug in the register admin endpoint (`/_synapse/admin/v1/register`) when the `mac` field was not provided. The endpoint now properly returns a 400 error. Contributed by @edwargix. ([\#8837](#8837))
- Fix a long-standing bug on Synapse instances supporting Single-Sign-On, where users would be prompted to enter their password to confirm certain actions, even though they have not set a password. ([\#8858](#8858))
- Fix a longstanding bug where a 500 error would be returned if the `Content-Length` header was not provided to the upload media resource. ([\#8862](#8862))
- Add additional validation to pusher URLs to be compliant with the specification. ([\#8865](#8865))
- Fix the error code that is returned when a user tries to register on a homeserver on which new-user registration has been disabled. ([\#8867](#8867))
- Fix a bug where `PUT /_synapse/admin/v2/users/<user_id>` failed to create a new user when `avatar_url` is specified. Bug introduced in Synapse v1.9.0. ([\#8872](#8872))
- Fix a 500 error when attempting to preview an empty HTML file. ([\#8883](#8883))
- Fix occasional deadlock when handling SIGHUP. ([\#8918](#8918))
- Fix login API to not ratelimit application services that have ratelimiting disabled. ([\#8920](#8920))
- Fix bug where we ratelimited auto joining of rooms on registration (using `auto_join_rooms` config). ([\#8921](#8921))
- Fix a bug where deactivated users appeared in the user directory when their profile information was updated. ([\#8933](#8933), [\#8964](#8964))
- Fix bug introduced in Synapse v1.24.0 which would cause an exception on startup if both `enabled` and `localdb_enabled` were set to `False` in the `password_config` setting of the configuration file. ([\#8937](#8937))
- Fix a bug where 500 errors would be returned if the `m.room_history_visibility` event had invalid content. ([\#8945](#8945))
- Fix a bug causing common English words to not be considered for a user directory search. ([\#8959](#8959))
- Fix bug where application services couldn't register new ghost users if the server had reached its MAU limit. ([\#8962](#8962))
- Fix a long-standing bug where a `m.image` event without a `url` would cause errors on push. ([\#8965](#8965))
- Fix a small bug in v2 state resolution algorithm, which could also cause performance issues for rooms with large numbers of power levels. ([\#8971](#8971))
- Add validation to the `sendToDevice` API to raise a missing parameters error instead of a 500 error. ([\#8975](#8975))
- Add validation of group IDs to raise a 400 error instead of a 500 error. ([\#8977](#8977))

Improved Documentation
----------------------

- Fix the "Event persist rate" section of the included grafana dashboard by adding missing prometheus rules. ([\#8802](#8802))
- Combine related media admin API docs. ([\#8839](#8839))
- Fix an error in the documentation for the SAML username mapping provider. ([\#8873](#8873))
- Clarify comments around template directories in `sample_config.yaml`. ([\#8891](#8891))
- Move instructions for database setup, adjust heading levels and improve syntax highlighting in [INSTALL.md](../INSTALL.md). Contributed by fossterer. ([\#8987](#8987))
- Update the example value of `group_creation_prefix` in the sample configuration. ([\#8992](#8992))
- Link the Synapse developer room to the development section in the docs. ([\#9002](#9002))

Deprecations and Removals
-------------------------

- Deprecate Shutdown Room and Purge Room Admin APIs. ([\#8829](#8829))

Internal Changes
----------------

- Properly store the mapping of external ID to Matrix ID for CAS users. ([\#8856](#8856), [\#8958](#8958))
- Remove some unnecessary stubbing from unit tests. ([\#8861](#8861))
- Remove unused `FakeResponse` class from unit tests. ([\#8864](#8864))
- Pass `room_id` to `get_auth_chain_difference`. ([\#8879](#8879))
- Add type hints to push module. ([\#8880](#8880), [\#8882](#8882), [\#8901](#8901), [\#8940](#8940), [\#8943](#8943), [\#9020](#9020))
- Simplify logic for handling user-interactive-auth via single-sign-on servers. ([\#8881](#8881))
- Skip the SAML tests if the requirements (`pysaml2` and `xmlsec1`) aren't available. ([\#8905](#8905))
- Fix multiarch docker image builds. ([\#8906](#8906))
- Don't publish `latest` docker image until all archs are built. ([\#8909](#8909))
- Various clean-ups to the structured logging and logging context code. ([\#8916](#8916), [\#8935](#8935))
- Automatically drop stale forward-extremities under some specific conditions. ([\#8929](#8929))
- Refactor test utilities for injecting HTTP requests. ([\#8946](#8946))
- Add a maximum size of 50 kilobytes to .well-known lookups. ([\#8950](#8950))
- Fix bug in `generate_log_config` script which made it write empty files. ([\#8952](#8952))
- Clean up tox.ini file; disable coverage checking for non-test runs. ([\#8963](#8963))
- Add type hints to the admin and room list handlers. ([\#8973](#8973))
- Add type hints to the receipts and user directory handlers. ([\#8976](#8976))
- Drop the unused `local_invites` table. ([\#8979](#8979))
- Add type hints to the base storage code. ([\#8980](#8980))
- Support using PyJWT v2.0.0 in the test suite. ([\#8986](#8986))
- Fix `tests.federation.transport.RoomDirectoryFederationTests` and ensure it runs in CI. ([\#8998](#8998))
- Add type hints to the crypto module. ([\#8999](#8999))
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this pull request Jan 13, 2021
Synapse 1.25.0 (2021-01-13)
===========================

Ending Support for Python 3.5 and Postgres 9.5
----------------------------------------------

With this release, the Synapse team is announcing a formal deprecation policy for our platform dependencies, like Python and PostgreSQL:

All future releases of Synapse will follow the upstream end-of-life schedules.

Which means:

* This is the last release which guarantees support for Python 3.5.
* We will end support for PostgreSQL 9.5 early next month.
* We will end support for Python 3.6 and PostgreSQL 9.6 near the end of the year.

Crucially, this means __we will not produce .deb packages for Debian 9 (Stretch) or Ubuntu 16.04 (Xenial)__ beyond the transition period described below.

The website https://endoflife.date/ has convenient summaries of the support schedules for projects like [Python](https://endoflife.date/python) and [PostgreSQL](https://endoflife.date/postgresql).

If you are unable to upgrade your environment to a supported version of Python or Postgres, we encourage you to consider using the [Synapse Docker images](./INSTALL.md#docker-images-and-ansible-playbooks) instead.

### Transition Period

We will make a good faith attempt to avoid breaking compatibility in all releases through the end of March 2021. However, critical security vulnerabilities in dependencies or other unanticipated circumstances may arise which necessitate breaking compatibility earlier.

We intend to continue producing .deb packages for Debian 9 (Stretch) and Ubuntu 16.04 (Xenial) through the transition period.

Removal warning
---------------

The old [Purge Room API](https://github.com/matrix-org/synapse/tree/master/docs/admin_api/purge_room.md)
and [Shutdown Room API](https://github.com/matrix-org/synapse/tree/master/docs/admin_api/shutdown_room.md)
are deprecated and will be removed in a future release. They will be replaced by the
[Delete Room API](https://github.com/matrix-org/synapse/tree/master/docs/admin_api/rooms.md#delete-room-api).

`POST /_synapse/admin/v1/rooms/<room_id>/delete` replaces `POST /_synapse/admin/v1/purge_room` and
`POST /_synapse/admin/v1/shutdown_room/<room_id>`.

Bugfixes
--------

- Fix HTTP proxy support when using a proxy that is on a blacklisted IP. Introduced in v1.25.0rc1. Contributed by @Bubu. ([\#9084](matrix-org/synapse#9084))

