MBS-13562: Check editing unlimited strings #3251

yvanzo · 2024-04-29T11:28:11Z

MBS-13562

See the commit messages for details.

The user interface has not been much updated for user-friendliness as this is an edge case.

Testing

Added automated:

unit tests for each validation subroutine
integration tests for each check implementation part (Role::CheckOverlongString, Role::EditArtistCredit, Medium::Util)

I didn’t add automated Selenium tests as there is no change to the user interface beyond the explanatory message.

Manually tested editing a sample database with:

creating/editing an artist with overlong name/sort name
creating/editing a recording with overlong name/artist credit join phrase
creating/editing a release with overlong name/artist credit join phrase
creating/editing a release with overlong track name/track artist credit join phrase

Documenting

Mention this edge case in the Editing_FAQ page. (See changes.)
Update the ticket according to the actual implementation.

Further action

Deploy in production
Merge to master
Release beta
Make the updated Editing_FAQ page visible on MusicBrainz website

reosarevok · 2024-04-29T12:43:48Z

lib/MusicBrainz/Server/Validation.pm

@@ -98,6 +102,12 @@ sub is_database_bigint_id {
    is_positive_integer($t) and $t <= $MAX_POSTGRES_BIGINT;
 }

+sub has_oneline_string_length {


This name is confusing to me - seeing it I wouldn't know if it means it's over the limit or under the limit. What about something boring like is_not_too_long_string?

It would be less confusing w.r.t. “over/under” but it doesn’t say for what it “is not too long”.

Would is_short_oneline_string carry both meanings?

is_short_enough_oneline_string? is_short_enough_single_line_string? Just is_short seems a bit funny because 1024 characters is not short :D But I can live with that too really.

I renamed it has_at_most_oneline_string_length in 4ee30f8 to keep it meaningful and not confusing.

lib/MusicBrainz/Server/Constants.pm

reosarevok · 2024-04-29T12:48:03Z

lib/MusicBrainz/Server/Validation.pm

@@ -102,6 +103,13 @@ sub is_database_bigint_id {
    is_positive_integer($t) and $t <= $MAX_POSTGRES_BIGINT;
 }

+sub is_database_indexable_string {


Why do we need both of this and the 1024 limit? Wouldn't this check be enough for our needs? Or is it just to have a more easily communicable limit?

See the commit messages.

I did check those, but I'm still a bit confused. It seems the real limit that triggered the bug was actually the 2704 bytes one, no? Wouldn't it be enough then to check for 2704 bytes and not worry about the amount of characters and whether they are multibyte or not?

I did check those, but I'm still a bit confused. It seems the real limit that triggered the bug was actually the 2704 bytes one, no?

The bug MBS-13555, yes. And it would also prevent MBS-13536 to happen again.

Wouldn't it be enough then to check for 2704 bytes and not worry about the amount of characters and whether they are multibyte or not?

As explained in commit message, it is more reasonable to additionally limit to 1024 characters.

IIUC, the 8192 limit from MBS-13536 is for all indexes generally, and the 2704 limit is for btree indexes. The index from MBS-13536 was also a btree index, but perhaps it hit the "general" limit first before it checked the more specifiec btree limits (but I am just guessing).

As explained in commit message, it is more reasonable to additionally limit to 1024 characters.

I agree that a fixed character limit is reasonable and also easier to explain to users -- though I'm not sure if the majority of the offenders are mistakes or intended to cause harm:

1055 characters - https://musicbrainz.org/edit/111225345
1056 characters - https://musicbrainz.org/edit/111225330
1126 characters - https://musicbrainz.org/edit/111225347
1176 characters - https://musicbrainz.org/edit/111225344
1377 characters - https://musicbrainz.org/edit/111225342

I'm not familiar with any of the source material though.

Given the additional btree-indexable byte limit check, we could perhaps be a little more generous and use a less-geeky number like 1500 which would cover most of the existing cases. Or maybe allow auto-editors to bypass the limit (as long as it's under 2704 bytes still).

Thanks for checking them; it'd be nice to replace your existing edits with the suggestions you're certain about when you have time, since it's much nicer than truncation, and it may be unlikely for some of these releases to be fixed for a long time! (Note I didn't link to any of the "creative writing" ones.)

The 24 Duos one is a wrong Discogs import where a lot of tracks got smashed together, I'll fix that one. I disagree that Transmission 7 is wrong, but I think it'd be ok either way - if you change that one you should change all though. "Korean Cosmetics" seems to be the artist name and should probably not be added to the title - I'd just find a good place to put ... at the end of that one, maybe just do Moringa Pterygosperma Seed Extract...? New Numbers seems fine to change to just that. The Musicalische Exequien, SWV 279 is also not wrong, but shortening it is probably sensible. That disc has other errors, so I'll do that part myself as well.

Edit: I wrote this comment in reply to @mwiencek before @reosarevok’s reply was visible to me.

Sorry but no, the changes should probably be made for the whole releases not just these tracks, this isn’t worth my time, and there is a lot of trash stuff around (such as the above second example with 1056 characters) that I don’t even want to read, that’s why I made a script to truncate names arbitrarily rather than editing by hand. I’m not a classical editor either, so the classical releases would be better edited by some more knowledgeable editor than me about classical releases. I’m sure that some interested editors are following these edits that are linked from the ticket’s comment and will make the most appropriate changes in time (likely after this check is deployed). As I wrote on IRC before, the data isn’t completely lost since it lies in the editing history anyway. There should be nothing wrong with truncating a few names arbitrarily for the sake of maintaining the database. If we decide to relax this constraint later on, it will just take changing a constant in code and reverting a few edits.

Thanks @reosarevok -- I guess we're all okay with the existing checks then. I didn't mean to start bikeshedding on a specific number, but was just hesitant to restrict possibly-valid data. @yvanzo, it sounds like you can just cancel your existing edits if reosarevok is looking into them.

Thanks @reosarevok -- I guess we're all okay with the existing checks then. I didn't mean to start bikeshedding on a specific number, but was just hesitant to restrict possibly-valid data.

Initially I was more leaning towards 512 bytes to be honest. I picked up 1024 bytes after your own data analysis on IRC as there is indeed no legitimate use case beyond it. Just reminding that titles are supposed to be “minimal summaries”, neither dissertations nor liner notes.

@yvanzo, it sounds like you can just cancel your existing edits if reosarevok is looking into them.

It wouldn’t hurt to approve all of these edits really, even if these titles are further trimmed later on by volunteers.

Edit: To clarify, it isn’t just about my personal time or sensibility. The truncations weren’t decreasing the data quality beyond what is reasonable since the data quality wasn’t great already. Their only purpose was to not block editors from making other changes after having deployed this patch. This is just not our job to edit the database, even though we sometimes do so as volunteers.

reosarevok · 2024-04-29T12:50:45Z

lib/MusicBrainz/Server/Edit/Role/EditArtistCredit.pm

+            if (defined($new_artist_credit_full_name) && !(
+                has_oneline_string_length($new_artist_credit_full_name) &&
+                is_database_indexable_string($new_artist_credit_full_name)
+            )) {


We do this check for three conditions often enough that it feels it should be its own function.

Would a function for all of these conditions in our Validation module do?

I added is_overlong_string in abe06ef.

reosarevok · 2024-04-29T12:53:19Z

root/layout.tt

+            <p>[% l('Some text you entered is overlong, please fix the text or
+                     shorten it and use annotation for the full text instead.') %]</p>


"overlong" is not a common word. What about "Some text you've entered is too long for us to support. Please shorten it, and if necessary enter the full text in the annotation for reference." or something like that?

"overlong" is not a common word. What about "Some text you've entered is too long for us to support.

It is not a common word but it isn’t overcomplicated to understand even the first time you read it. The fact that it is not common makes it easier to search the documentation for as we can refer to it in the Editing FAQ.

Please shorten it, and if necessary enter the full text in the annotation for reference." or something like that?

Committed through ef4fb7e.

root/layout/index.js

mwiencek · 2024-04-29T19:00:01Z

t/lib/t/MusicBrainz/Server/Edit/Medium/Edit.pm

+                Track->new(
+                    artist_credit => $c->model('ArtistCredit')->get_by_id(1),
+                    is_data_track => 0,
+                    name => ('𝄞𝄵𝅘𝅥𝅘𝅥𝅮𝅘𝅥𝅮𝅘𝅥𝅮𝄾𝅘𝅥𝄀𝅘𝅥𝅮𝅘𝅥𝅮𝅘𝅥𝅮𝄾𝆑𝅗𝅥𝄂' x 64),


A minor suggestion here would be to add no utf8; to the enclosing block in case we ever add use utf8; to the file for some reason and accidentally break this test. Edit: This should only matter for the non-bytes length check, though...

I’m not sure to follow here because the Validation tests have the same test string and use utf8; already.

Sorry, I gave wrong info. If these are utf8 strings with multi-byte characters, the files should have use utf8;. Otherwise I don't think length will work properly when they are passed elsewhere...

# P1.pm package P1; #use utf8; our $string = '𝄞𝄵𝅘𝅥𝅘𝅥𝅮𝅘𝅥𝅮𝅘𝅥𝅮𝄾𝅘𝅥𝄀𝅘𝅥𝅮𝅘𝅥𝅮𝅘𝅥𝅮𝄾𝆑𝅗𝅥𝄂' x 64;

# P2.pm package P2; use P1; print length($P1::string) . "\n"; use bytes; print length($P1::string) . "\n";

With #use utf8; commented out it will print 4096 for the first length check, but with use utf8; enabled it will print 1024.

Good catch! Fixed with the commit dfd7f9b.

Some one-line string are unlimited in the database schema, thus causing index issues in the main database (stuck edits), in obsolete mirrors (MBS-13536), and potentially in any piece of software accessing the data from the MusicBrainz database. The default “maximum size allowed for the index type” [^1] seems to be affected by the block size which is set to 8192 bytes by default [^2] when compiling PostgreSQL, at least in the amd64 builds by PGDG [^3]. Thus, 1024 characters should always fit, even if multiple-bytes [^4], preventing MBS-13536 from happening again with any other indexed column. We also noticed that there are very few titles/names with length over 1024 characters at the moment, and that most of these have been intentionally written to cause issues, while the few others are either mistakes or can be arranged otherwise. See MBS-13562 for details. References: * [^1] https://www.postgresql.org/docs/16/sql-createindex.html * [^2] https://www.postgresql.org/docs/16/limits.html * [^3] https://wiki.postgresql.org/wiki/Apt * [^4] https://www.postgresql.org/docs/16/multibyte.html

Because PostgreSQL actually fits up to “three items on every page” [^1], each string must take at most 8192 / 3 = 2704 bytes at most, in order to prevent MBS-13555 from happening again. References: * [^1] https://github.com/postgres/postgres/blob/REL_16_0/src/include/access/nbtree.h#L154-L163

mwiencek

Thanks for your time on this oversilly issue. ;)

mwiencek · 2024-04-29T22:25:54Z

lib/MusicBrainz/Server/Constants.pm

+
+=item $MAX_POSTGRES_INDEXED_STRING_BYTES
+
+Maximum number of bytes for a string to be indexable by Postgres.


In case you feel that it's worth mentioning, it's technically specific to btree indexes, but they are the default type of index and we don't use the other types extensively.

Totally. Now mentioned. See https://github.com/metabrainz/musicbrainz-server/compare/dfd7f9bca469279bb32ad50d301811a5035b92bd..b0a32248ef6a7e3ccafc18ce771d31bba3e4c5a7

This patch enforces the previous definitions of both one-line string length and indexable string size for most editable unlimited strings, no matter whether these are currently indexed or not, in order to prevent both MBS-13536 and MBS-13555 from happening again and also to prevent further related issues in third-party software using this data. The checks are made at editing time, and prevent edits to be created when overlong strings are entered for name, sort name, artist credit name and join phrase. An explanatory message is shown to the user, excepted in the release editor which just shows an ISE 500 message. The implementation comes in three parts: * `Role::CheckOverlongString` for name (and sort name if applicable) of primary entities, their aliases, and medium, * `Role::EditArtistCredit` for artist credit (name and join phrase) of primary entities, * `Medium::Util` for track name and track artist credit.

Co-authored-by: Nicolás Tamargo <reosarevok@metabrainz.org>

Without `use utf8;` the `length` call in our subroutime `has_at_most_oneline_string_length` would assume that each byte represents only one character and thus return the number of bytes instead of the number of multiple-bytes characters. For example, `length('𝄞𝄵𝅘𝅥𝅘𝅥𝅮𝅘𝅥𝅮𝅘𝅥𝅮𝄾𝅘𝅥𝄀𝅘𝅥𝅮𝅘𝅥𝅮𝅘𝅥𝅮𝄾𝆑𝅗𝅥𝄂' x 64)` would be 4096 instead of 1024. Therefore, it wouldn’t be testing our other subroutine as intended. Co-authored-by: Michael Wiencek <mwtuea@gmail.com>

yvanzo requested review from mwiencek and reosarevok April 29, 2024 11:28

yvanzo force-pushed the mbs-13562 branch 3 times, most recently from b9b0f0c to 77b95b2 Compare April 29, 2024 11:59

reosarevok reviewed Apr 29, 2024

View reviewed changes

yvanzo force-pushed the mbs-13562 branch 3 times, most recently from f4481e6 to 182f125 Compare April 29, 2024 15:15

mwiencek reviewed Apr 29, 2024

View reviewed changes

yvanzo added 2 commits April 29, 2024 21:16

yvanzo force-pushed the mbs-13562 branch from 182f125 to 50f5915 Compare April 29, 2024 20:16

mwiencek approved these changes Apr 29, 2024

View reviewed changes

yvanzo and others added 3 commits April 29, 2024 23:42

Amend e5776e7: Improve phrasing by reosarevok

b975a39

Co-authored-by: Nicolás Tamargo <reosarevok@metabrainz.org>

yvanzo force-pushed the mbs-13562 branch from dfd7f9b to b0a3224 Compare April 29, 2024 22:43

yvanzo merged commit b211476 into metabrainz:production Apr 29, 2024
0 of 3 checks passed

yvanzo deleted the mbs-13562 branch April 29, 2024 22:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MBS-13562: Check editing unlimited strings #3251

MBS-13562: Check editing unlimited strings #3251

yvanzo commented Apr 29, 2024 •

edited

reosarevok Apr 29, 2024

yvanzo Apr 29, 2024

reosarevok Apr 29, 2024

yvanzo Apr 29, 2024

reosarevok Apr 29, 2024

yvanzo Apr 29, 2024

reosarevok Apr 29, 2024

yvanzo Apr 29, 2024

mwiencek Apr 29, 2024

mwiencek Apr 29, 2024

reosarevok Apr 29, 2024

yvanzo Apr 29, 2024 •

edited

mwiencek Apr 29, 2024

yvanzo Apr 29, 2024 •

edited

reosarevok Apr 29, 2024

yvanzo Apr 29, 2024

reosarevok Apr 29, 2024

yvanzo Apr 29, 2024

reosarevok Apr 29, 2024

yvanzo Apr 29, 2024

mwiencek Apr 29, 2024 •

edited

yvanzo Apr 29, 2024

mwiencek Apr 29, 2024

yvanzo Apr 29, 2024

mwiencek left a comment

mwiencek Apr 29, 2024

yvanzo Apr 29, 2024

		<p>[% l('Some text you entered is overlong, please fix the text or
		shorten it and use annotation for the full text instead.') %]</p>


		=item $MAX_POSTGRES_INDEXED_STRING_BYTES

		Maximum number of bytes for a string to be indexable by Postgres.

MBS-13562: Check editing unlimited strings #3251

MBS-13562: Check editing unlimited strings #3251

Conversation

yvanzo commented Apr 29, 2024 • edited

Testing

Documenting

Further action

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yvanzo Apr 29, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yvanzo Apr 29, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mwiencek Apr 29, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mwiencek left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yvanzo commented Apr 29, 2024 •

edited

yvanzo Apr 29, 2024 •

edited

yvanzo Apr 29, 2024 •

edited

mwiencek Apr 29, 2024 •

edited