Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change DB structure #1902

Merged
merged 3 commits into from
Nov 6, 2023
Merged

Conversation

H3rnand3zzz
Copy link
Contributor

@H3rnand3zzz H3rnand3zzz commented Oct 26, 2023

Backup your DB before performing any testing.

All tests should be done with the final version.

  • Pass all tests
    image
  • Test migration
  • Test sending + editing with /receipts on/off
  • Test empty DB

Even though it does not drop any tables, I want to be on the cautious side and avoid any potential data loss by current or future profanity developers.

It's not ready for release yet and just shows PoC. The point of the PR is to share the idea for further discussion and development. I am not even sure if it works properly or even works in some parts. In 1 day I plan to tidy it up and make it generally better.

Read commit message for further details. Discussion links are going to be attached here and in commit message later.

LMC is (not) fun

Abstract

The initial problem is that when we receive messages from old client, they can have duplicate IDs and currently we don't even write them to DB, but it goes much further than that: current DB visualization (message history) relies on heavy JOIN with redacted messages. Explanation for necessity of replaces_db_id can be found in earlier discussions (#1893, but primarily in #1899), but to simplify, it would improve performance as well as give another benefit which is going to be discussed further. I will provide explanation for message editing (chained edits vs direct edits), short explanation for duplicated IDs problem and explained new message storage approach.

Chained Edits vs Direct Edits

Chained Edit is edit that points to previous edit on second and further edits. E.g.

ID Message replaces_db_id
1 Hlo NULL
2 Hello 1
3 Hello. How are you? 2

We have to traverse all the edits until we can find the original message. This is not an ideal solution and LMC was not designed to be this way. Quoting the XEP,
If the same message is to be corrected several times, the id of the original message is used in each case (e.g. If a message of id 1 is corrected by a message of id 2, a subsequent correction should correct 1, not 2).

Though this misconception is so common that even Profanity, as well as some other popular XMPP clients (e.g. Psi) use this approach for subsequent edits.

Note: sometimes it's regarded in conversation as edit->edit->edit->original.

Direct edit is edit that always points to the original message

ID Message replaces_db_id
1 Hlo NULL
2 Hello 1
3 Hello. How are you? 1

This approach allows us to always find the original message without any need to follow chain of edits. Some clients (e.g. Gajim) follow this XEP.

But as some clients make chained edits with different stanza ids, the best way for us to handle both ways would be introducing replaces_db_id and using it for direct-edit approach internally, thus accepting both ways for editing messages.

ID Stanza Message replaces_db_id replaces_stanza_id
1 aa Hlo NULL NULL
2 xx Hello 1 aa
3 zz Hello. How are you? 1 xx

Here we would check xx's replace_db_id and set replaces_db_id as 1 (because it's the original message). Thus we don't rely on external client's approach and can always save messages in a suitable manner (performance- and design-wise).

Duplicated ID problem

Duplicated ID is not a problem within itself. We can easily put messages with duplicated IDs in the DB as even DB field for stanza_id is not set to be UNIQUE. The problem starts with message replacement. When we get message replacement (LMC) for the message with stanza_id that is present 2 or more times, with old approach it wouldn't be (or barely would be) possible to replace them correctly on visualization, not to mention trying to stay performant while doing so. replaces_db_id could be a key to solving this problem, but it still would be just as an inefficient visualization with JOIN which can be called on every scroll up.

New DB approach

Introduced fields:

  • replaces_db_id database ID for correcting message of the original message
  • replaced_by_db_id database ID for original message of the last correcting message

This way we connect all chained edits as direct edits, hence always have reference to the original message. replaced_by_db_id provides ability to replace messages quickly and correctly (JOIN only with latest change wasn't possible even with replaces_db_id).

Editable message storage approach (unacceptable, rejected by design police, left for historical ref.) - `replace_id` -> `replaces_stanza_id` stanza ID of the replaced message - `replaces_db_id` db ID of the replaced message - `original_message` original message before replacement - `message` now is used to store visual message (replaced by LMC)

So to show messages, we just need to select. But to store previous message, we can use original_message which stores a message which was in the original's message column at the moment of change. E.g. "Hlo" being replaced to "hello", and then to "Hello world!" we save "Hlo" to original_message field of the LMC and set the original message to "hello", after this on further change we will save "hello" as an original_message and set the original's message to "Hello world". In the DB it will look like that:

ID Message replaces_db_id original_message
1 Hello world! NULL NULL
2 1 hlo
3 1 hello

Despite looking unintuitive (theoretically, original_message could be renamed to previous_message), we have performant solution that keeps full edit history, performant on visualization (just a select) and robust enough to allow direct linking despite external chain link.


Other changes

Edited visualization of LMC, allowing to receive messages that do not have original message counterpart in the DB (as original might be lost, e.g. received by other client during disconnect). I couldn't find good way to allow deeper (3+ edits) chained edits, so it's now similar to gajim's approach (every second edit we display a new message and then it can be redact once ad infinitum). In future we can use DB ID as a key for buffer instead of stanza ids.

Fix #1899

@H3rnand3zzz H3rnand3zzz marked this pull request as draft October 26, 2023 23:43
@H3rnand3zzz H3rnand3zzz changed the title Change DB structure [POC, WIP, DANGEROUS (READ DESCRIPTION BEFORE TESTING)] Change DB structure [POC, WIP, READ DESCRIPTION BEFORE TESTING] Oct 27, 2023
@H3rnand3zzz H3rnand3zzz changed the title Change DB structure [POC, WIP, READ DESCRIPTION BEFORE TESTING] Change DB structure [READ DESCRIPTION BEFORE TESTING] Oct 27, 2023
@H3rnand3zzz H3rnand3zzz marked this pull request as ready for review October 28, 2023 13:40
H3rnand3zzz added a commit to H3rnand3zzz/profanity that referenced this pull request Oct 28, 2023
Please, backup your DB before performing any testing with changes in code.

Introduce new DB structure and DB migration mechanism.
The structure despite being unintuitive allows better maintainability,
**performance** and data integrity.

The new way is to replace original message, while keeping the original in
LMC messages, this way we can preserve full edit history
and avoid wasteful JOINs on every scroll up.

Change the way LMC messages are being displayed. Now we check if we
can replace a message from current buffer. If we don't have a message in
the buffer, it might've been lost, but we can still display it as a
new message.

Index `timestamp` column as well.

Performance boost is noticeable.

Further information available here:
profanity-im#1893
profanity-im#1899
profanity-im#1902
@H3rnand3zzz H3rnand3zzz changed the title Change DB structure [READ DESCRIPTION BEFORE TESTING] Change DB structure Oct 28, 2023
@jubalh jubalh added this to the next milestone Oct 28, 2023
src/database.c Outdated Show resolved Hide resolved
src/database.c Outdated Show resolved Hide resolved
H3rnand3zzz added a commit to H3rnand3zzz/profanity that referenced this pull request Oct 29, 2023
Please, backup your DB before performing any testing with changes in code.

Introduce new DB structure and DB migration mechanism.
The structure despite being unintuitive allows better maintainability,
**performance** and data integrity.

The new way is to replace original message, while keeping the original in
LMC messages, this way we can preserve full edit history
and avoid wasteful JOINs on every scroll up.

Change the way LMC messages are being displayed. Now we check if we
can replace a message from current buffer. If we don't have a message in
the buffer, it might've been lost, but we can still display it as a
new message.

Index `timestamp` column as well.

Performance boost is noticeable.

Further information available here:
profanity-im#1893
profanity-im#1899
profanity-im#1902
H3rnand3zzz added a commit to H3rnand3zzz/profanity that referenced this pull request Oct 29, 2023
Please, backup your DB before performing any testing with changes in code.

Introduce new DB structure and DB migration mechanism.
The structure despite being unintuitive allows better maintainability,
**performance** and data integrity.

The new way is to replace original message, while keeping the original in
LMC messages, this way we can preserve full edit history
and avoid wasteful JOINs on every scroll up.

Change the way LMC messages are being displayed. Now we check if we
can replace a message from current buffer. If we don't have a message in
the buffer, it might've been lost, but we can still display it as a
new message.

Index `timestamp` column as well.

Further information available here:
profanity-im#1893
profanity-im#1899
profanity-im#1902
H3rnand3zzz added a commit to H3rnand3zzz/profanity that referenced this pull request Oct 29, 2023
Please, backup your DB before performing any testing with changes in code.

Introduce new DB structure and DB migration mechanism.
The structure despite being unintuitive allows better maintainability,
**performance** and data integrity.

The new way is to replace original message, while keeping the original in
LMC messages, this way we can preserve full edit history
and avoid wasteful JOINs on every scroll up.

Change the way LMC messages are being displayed. Now we check if we
can replace a message from current buffer. If we don't have a message in
the buffer, it might've been lost, but we can still display it as a
new message.

Write messages with duplicate stanza IDs in the DB and allow LMC for them.

Index `timestamp` column as well.

Further details are available here:
profanity-im#1893
profanity-im#1899
profanity-im#1902
src/database.c Outdated Show resolved Hide resolved
src/database.c Outdated Show resolved Hide resolved
src/database.c Outdated Show resolved Hide resolved
src/database.c Outdated Show resolved Hide resolved
src/database.c Outdated Show resolved Hide resolved
src/database.c Outdated Show resolved Hide resolved
src/database.c Outdated Show resolved Hide resolved
src/database.c Outdated Show resolved Hide resolved
H3rnand3zzz added a commit to H3rnand3zzz/profanity that referenced this pull request Nov 3, 2023
**Please, backup your DB before performing any testing.**

Introduce new DB structure and DB migration mechanism.
Index `timestamp`, `to_jid`, `from_jid` columns to improve performance.
Add trigger for `replaced_by_db_id` calculation by DB on message insert.

Now LMC messages are interconnected with original messages,
this way we have fast access to last (hence correct) applicable edits,
as well as reference to the original message from the any edit (in case of chained edits).

Change the way LMC messages are being displayed. Now we check if we
can replace a message from current buffer. If we don't have a message in
the buffer, it might've been lost, but we can still display it as a
new message.

Further information available here:
profanity-im#1893
profanity-im#1899
profanity-im#1902
H3rnand3zzz added a commit to H3rnand3zzz/profanity that referenced this pull request Nov 3, 2023
**Please, backup your DB before performing any testing.**

Introduce new DB structure and DB migration mechanism.
Index `timestamp`, `to_jid`, `from_jid` columns to improve performance.
Add trigger for `replaced_by_db_id` calculation by DB on message insert.

Now LMC messages are interconnected with original messages,
this way we have fast access to last (hence correct) applicable edits,
as well as reference to the original message from any edit (in case of chained edits).

Change the way LMC messages are being displayed. Now we check if we
can replace a message from current buffer. If we don't have a message in
the buffer, it might've been lost, but we can still display it as a
new message.

Further information available here:
profanity-im#1893
profanity-im#1899
profanity-im#1902
H3rnand3zzz added a commit to H3rnand3zzz/profanity that referenced this pull request Nov 3, 2023
**Please, backup your DB before performing any testing.**

Introduce new DB structure and DB migration mechanism.
Index `timestamp`, `to_jid`, `from_jid` columns to improve performance.
Add trigger for `replaced_by_db_id` calculation by DB on message insert.

Now LMC messages are interconnected with original messages,
this way we have fast access to last (hence correct) applicable edits,
as well as reference to the original message from any edit (in case of chained edits).

Change the way LMC messages are being displayed. Now we check if we
can replace a message from current buffer. If we don't have a message in
the buffer, it might've been lost, but we can still display it as a
new message.

Further information available here:
profanity-im#1893
profanity-im#1899
profanity-im#1902
@H3rnand3zzz
Copy link
Contributor Author

Last version has been successfully retested (tests list is in the PR message).

H3rnand3zzz added a commit to H3rnand3zzz/profanity that referenced this pull request Nov 3, 2023
**Please, backup your DB before performing any testing.**

Introduce new DB structure and DB migration mechanism.
Index `timestamp`, `to_jid`, `from_jid` columns to improve performance.
Add trigger for `replaced_by_db_id` calculation by DB on message insert.

Now LMC messages are interconnected with original messages,
this way we have fast access to last (hence correct) applicable edits,
as well as reference to the original message from any edit (in case of chained edits).

Change the way LMC messages are being displayed. Now we check if we
can replace a message from current buffer. If we don't have a message in
the buffer, it might've been lost, but we can still display it as a
new message.

Further information available here:
profanity-im#1893
profanity-im#1899
profanity-im#1902
H3rnand3zzz added a commit to H3rnand3zzz/profanity that referenced this pull request Nov 3, 2023
**Please, backup your DB before performing any testing.**

Introduce new DB structure and DB migration mechanism.
Index `timestamp`, `to_jid`, `from_jid` columns to improve performance.
Add trigger for `replaced_by_db_id` calculation by DB on message insert.

Now LMC messages are interconnected with original messages,
this way we have fast access to last (hence correct) applicable edits,
as well as reference to the original message from any edit (in case of chained edits).

Change the way LMC messages are being displayed. Now we check if we
can replace a message from current buffer. If we don't have a message in
the buffer, it might've been lost, but we can still display it as a
new message.

Further information available here:
profanity-im#1893
profanity-im#1899
profanity-im#1902
H3rnand3zzz added a commit to H3rnand3zzz/profanity that referenced this pull request Nov 3, 2023
**Please, backup your DB before performing any testing.**

Introduce new DB structure and DB migration mechanism.
Index `timestamp`, `to_jid`, `from_jid` columns to improve performance.
Add trigger for `replaced_by_db_id` calculation by DB on message insert.

Now LMC messages are interconnected with original messages,
this way we have fast access to last (hence correct) applicable edits,
as well as reference to the original message from any edit (in case of chained edits).

Change the way LMC messages are being displayed. Now we check if we
can replace a message from current buffer. If we don't have a message in
the buffer, it might've been lost, but we can still display it as a
new message.

Further information available here:
profanity-im#1893
profanity-im#1899
profanity-im#1902
H3rnand3zzz added a commit to H3rnand3zzz/profanity that referenced this pull request Nov 3, 2023
**Please, backup your DB before performing any testing.**

Introduce new DB structure and DB migration mechanism.
Index `timestamp`, `to_jid`, `from_jid` columns to improve performance.
Add trigger for `replaced_by_db_id` calculation by DB on message insert.

Now LMC messages are interconnected with original messages,
this way we have fast access to last (hence correct) applicable edits,
as well as reference to the original message from any edit (in case of chained edits).

Change the way LMC messages are being displayed. Now we check if we
can replace a message from current buffer. If we don't have a message in
the buffer, it might've been lost, but we can still display it as a
new message.

Further information available here:
profanity-im#1893
profanity-im#1899
profanity-im#1902
**Please, backup your DB before performing any testing.**

Introduce new DB structure and DB migration mechanism.
Index `timestamp`, `to_jid`, `from_jid` columns to improve performance.
Add trigger for `replaced_by_db_id` calculation by DB on message insert.

Now LMC messages are interconnected with original messages,
this way we have fast access to last (hence correct) applicable edits,
as well as reference to the original message from any edit (in case of chained edits).

Change the way LMC messages are being displayed. Now we check if we
can replace a message from current buffer. If we don't have a message in
the buffer, it might've been lost, but we can still display it as a
new message.

Further information available here:
profanity-im#1893
profanity-im#1899
profanity-im#1902
Enhance data consistency by updating the database to treat empty
strings as NULL values. This change simplifies queries and improves
overall database integrity.
Disallow correcting historical MUC messages, as the XEP-308 requires.

Previous changes introduce problem with
"Illicit LMC attempt from conference@server/user for message from user"

During investigation it was revealed that XEP does not recommend support
of historical MUC messages correction.

```
When used in a Multi-User Chat (XEP-0045) context,
corrections must not be allowed (by the receiver)
for messages received before the sender joined the room -
particularly a full JID leaving the room
then rejoining and correcting a message SHOULD be disallowed,
as the entity behind the full JID in the MUC may have changed.
```
https://xmpp.org/extensions/xep-0308.html#rules

XEP details mentioned by @jubalh
Bug discovered and solution improved by  @jaeckel
Copy link
Member

@sjaeckel sjaeckel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work! Thanks

@jubalh jubalh merged commit 45fe1be into profanity-im:master Nov 6, 2023
6 checks passed
@jubalh
Copy link
Member

jubalh commented Nov 6, 2023

Thanks everybody for the detailed work and reviewing and patience.

sjaeckel added a commit that referenced this pull request Nov 7, 2023
Follow-up of 89dc7a4 resp. #1902

Signed-off-by: Steffen Jaeckel <jaeckel-floss@eyet-services.de>
@H3rnand3zzz H3rnand3zzz deleted the feature/improved-history branch November 7, 2023 16:55
H3rnand3zzz added a commit to H3rnand3zzz/profanity that referenced this pull request Nov 11, 2023
**Please, backup your DB before performing any testing.**

Introduce new DB structure and DB migration mechanism.
Index `timestamp`, `to_jid`, `from_jid` columns to improve performance.
Add trigger for `replaced_by_db_id` calculation by DB on message insert.

Now LMC messages are interconnected with original messages,
this way we have fast access to last (hence correct) applicable edits,
as well as reference to the original message from any edit (in case of chained edits).

Change the way LMC messages are being displayed. Now we check if we
can replace a message from current buffer. If we don't have a message in
the buffer, it might've been lost, but we can still display it as a
new message.

Further information available here:
profanity-im#1893
profanity-im#1899
profanity-im#1902
H3rnand3zzz pushed a commit to H3rnand3zzz/profanity that referenced this pull request Nov 11, 2023
Follow-up of 89dc7a4 resp. profanity-im#1902

Signed-off-by: Steffen Jaeckel <jaeckel-floss@eyet-services.de>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Save all received messages to DB even if duplicate ID is used
3 participants