Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MBS-13562: Check editing unlimited strings #3251

Merged
merged 5 commits into from Apr 29, 2024

Conversation

yvanzo
Copy link
Contributor

@yvanzo yvanzo commented Apr 29, 2024

MBS-13562

See the commit messages for details.

The user interface has not been much updated for user-friendliness as this is an edge case.

Testing

Added automated:

  • unit tests for each validation subroutine
  • integration tests for each check implementation part (Role::CheckOverlongString, Role::EditArtistCredit, Medium::Util)

I didn’t add automated Selenium tests as there is no change to the user interface beyond the explanatory message.

Manually tested editing a sample database with:

  • creating/editing an artist with overlong name/sort name
  • creating/editing a recording with overlong name/artist credit join phrase
  • creating/editing a release with overlong name/artist credit join phrase
  • creating/editing a release with overlong track name/track artist credit join phrase

Documenting

  • Mention this edge case in the Editing_FAQ page. (See changes.)
  • Update the ticket according to the actual implementation.

Further action

  1. Deploy in production
  2. Merge to master
  3. Release beta
  4. Make the updated Editing_FAQ page visible on MusicBrainz website

@yvanzo yvanzo force-pushed the mbs-13562 branch 3 times, most recently from b9b0f0c to 77b95b2 Compare April 29, 2024 11:59
@@ -98,6 +102,12 @@ sub is_database_bigint_id {
is_positive_integer($t) and $t <= $MAX_POSTGRES_BIGINT;
}

sub has_oneline_string_length {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This name is confusing to me - seeing it I wouldn't know if it means it's over the limit or under the limit. What about something boring like is_not_too_long_string?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be less confusing w.r.t. “over/under” but it doesn’t say for what it “is not too long”.

Would is_short_oneline_string carry both meanings?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_short_enough_oneline_string? is_short_enough_single_line_string? Just is_short seems a bit funny because 1024 characters is not short :D But I can live with that too really.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed it has_at_most_oneline_string_length in 4ee30f8 to keep it meaningful and not confusing.

lib/MusicBrainz/Server/Constants.pm Show resolved Hide resolved
@@ -102,6 +103,13 @@ sub is_database_bigint_id {
is_positive_integer($t) and $t <= $MAX_POSTGRES_BIGINT;
}

sub is_database_indexable_string {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need both of this and the 1024 limit? Wouldn't this check be enough for our needs? Or is it just to have a more easily communicable limit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the commit messages.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did check those, but I'm still a bit confused. It seems the real limit that triggered the bug was actually the 2704 bytes one, no? Wouldn't it be enough then to check for 2704 bytes and not worry about the amount of characters and whether they are multibyte or not?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did check those, but I'm still a bit confused. It seems the real limit that triggered the bug was actually the 2704 bytes one, no?

The bug MBS-13555, yes. And it would also prevent MBS-13536 to happen again.

Wouldn't it be enough then to check for 2704 bytes and not worry about the amount of characters and whether they are multibyte or not?

As explained in commit message, it is more reasonable to additionally limit to 1024 characters.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, the 8192 limit from MBS-13536 is for all indexes generally, and the 2704 limit is for btree indexes. The index from MBS-13536 was also a btree index, but perhaps it hit the "general" limit first before it checked the more specifiec btree limits (but I am just guessing).

As explained in commit message, it is more reasonable to additionally limit to 1024 characters.

I agree that a fixed character limit is reasonable and also easier to explain to users -- though I'm not sure if the majority of the offenders are mistakes or intended to cause harm:

1055 characters - https://musicbrainz.org/edit/111225345
1056 characters - https://musicbrainz.org/edit/111225330
1126 characters - https://musicbrainz.org/edit/111225347
1176 characters - https://musicbrainz.org/edit/111225344
1377 characters - https://musicbrainz.org/edit/111225342

I'm not familiar with any of the source material though.

Given the additional btree-indexable byte limit check, we could perhaps be a little more generous and use a less-geeky number like 1500 which would cover most of the existing cases. Or maybe allow auto-editors to bypass the limit (as long as it's under 2704 bytes still).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for checking them; it'd be nice to replace your existing edits with the suggestions you're certain about when you have time, since it's much nicer than truncation, and it may be unlikely for some of these releases to be fixed for a long time! (Note I didn't link to any of the "creative writing" ones.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 24 Duos one is a wrong Discogs import where a lot of tracks got smashed together, I'll fix that one. I disagree that Transmission 7 is wrong, but I think it'd be ok either way - if you change that one you should change all though. "Korean Cosmetics" seems to be the artist name and should probably not be added to the title - I'd just find a good place to put ... at the end of that one, maybe just do Moringa Pterygosperma Seed Extract...? New Numbers seems fine to change to just that. The Musicalische Exequien, SWV 279 is also not wrong, but shortening it is probably sensible. That disc has other errors, so I'll do that part myself as well.

Copy link
Contributor Author

@yvanzo yvanzo Apr 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Edit: I wrote this comment in reply to @mwiencek before @reosarevok’s reply was visible to me.

Sorry but no, the changes should probably be made for the whole releases not just these tracks, this isn’t worth my time, and there is a lot of trash stuff around (such as the above second example with 1056 characters) that I don’t even want to read, that’s why I made a script to truncate names arbitrarily rather than editing by hand. I’m not a classical editor either, so the classical releases would be better edited by some more knowledgeable editor than me about classical releases. I’m sure that some interested editors are following these edits that are linked from the ticket’s comment and will make the most appropriate changes in time (likely after this check is deployed). As I wrote on IRC before, the data isn’t completely lost since it lies in the editing history anyway. There should be nothing wrong with truncating a few names arbitrarily for the sake of maintaining the database. If we decide to relax this constraint later on, it will just take changing a constant in code and reverting a few edits.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @reosarevok -- I guess we're all okay with the existing checks then. I didn't mean to start bikeshedding on a specific number, but was just hesitant to restrict possibly-valid data. @yvanzo, it sounds like you can just cancel your existing edits if reosarevok is looking into them.

Copy link
Contributor Author

@yvanzo yvanzo Apr 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @reosarevok -- I guess we're all okay with the existing checks then. I didn't mean to start bikeshedding on a specific number, but was just hesitant to restrict possibly-valid data.

Initially I was more leaning towards 512 bytes to be honest. I picked up 1024 bytes after your own data analysis on IRC as there is indeed no legitimate use case beyond it. Just reminding that titles are supposed to be “minimal summaries”, neither dissertations nor liner notes.

@yvanzo, it sounds like you can just cancel your existing edits if reosarevok is looking into them.

It wouldn’t hurt to approve all of these edits really, even if these titles are further trimmed later on by volunteers.

Edit: To clarify, it isn’t just about my personal time or sensibility. The truncations weren’t decreasing the data quality beyond what is reasonable since the data quality wasn’t great already. Their only purpose was to not block editors from making other changes after having deployed this patch. This is just not our job to edit the database, even though we sometimes do so as volunteers.

Comment on lines 37 to 40
if (defined($new_artist_credit_full_name) && !(
has_oneline_string_length($new_artist_credit_full_name) &&
is_database_indexable_string($new_artist_credit_full_name)
)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do this check for three conditions often enough that it feels it should be its own function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would a function for all of these conditions in our Validation module do?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added is_overlong_string in abe06ef.

root/layout.tt Outdated
Comment on lines 96 to 97
<p>[% l('Some text you entered is overlong, please fix the text or
shorten it and use annotation for the full text instead.') %]</p>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"overlong" is not a common word. What about "Some text you've entered is too long for us to support. Please shorten it, and if necessary enter the full text in the annotation for reference." or something like that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"overlong" is not a common word. What about "Some text you've entered is too long for us to support.

It is not a common word but it isn’t overcomplicated to understand even the first time you read it. The fact that it is not common makes it easier to search the documentation for as we can refer to it in the Editing FAQ.

Please shorten it, and if necessary enter the full text in the annotation for reference." or something like that?

Committed through ef4fb7e.

root/layout/index.js Outdated Show resolved Hide resolved
@yvanzo yvanzo force-pushed the mbs-13562 branch 3 times, most recently from f4481e6 to 182f125 Compare April 29, 2024 15:15
Track->new(
artist_credit => $c->model('ArtistCredit')->get_by_id(1),
is_data_track => 0,
name => ('𝄞𝄵𝅘𝅥𝅘𝅥𝅮𝅘𝅥𝅮𝅘𝅥𝅮𝄾𝅘𝅥𝄀𝅘𝅥𝅮𝅘𝅥𝅮𝅘𝅥𝅮𝄾𝆑𝅗𝅥𝄂' x 64),
Copy link
Member

@mwiencek mwiencek Apr 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A minor suggestion here would be to add no utf8; to the enclosing block in case we ever add use utf8; to the file for some reason and accidentally break this test. Edit: This should only matter for the non-bytes length check, though...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m not sure to follow here because the Validation tests have the same test string and use utf8; already.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I gave wrong info. If these are utf8 strings with multi-byte characters, the files should have use utf8;. Otherwise I don't think length will work properly when they are passed elsewhere...

# P1.pm
package P1;

#use utf8;
our $string = '𝄞𝄵𝅘𝅥𝅘𝅥𝅮𝅘𝅥𝅮𝅘𝅥𝅮𝄾𝅘𝅥𝄀𝅘𝅥𝅮𝅘𝅥𝅮𝅘𝅥𝅮𝄾𝆑𝅗𝅥𝄂' x 64;
# P2.pm
package P2;

use P1;

print length($P1::string) . "\n";
use bytes;
print length($P1::string) . "\n";

With #use utf8; commented out it will print 4096 for the first length check, but with use utf8; enabled it will print 1024.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Fixed with the commit dfd7f9b.

Some one-line string are unlimited in the database schema,
thus causing index issues in the main database (stuck edits),
in obsolete mirrors (MBS-13536), and potentially in any piece
of software accessing the data from the MusicBrainz database.

The default “maximum size allowed for the index type” [^1] seems to be
affected by the block size which is set to 8192 bytes by default [^2]
when compiling PostgreSQL, at least in the amd64 builds by PGDG [^3].
Thus, 1024 characters should always fit, even if multiple-bytes [^4],
preventing MBS-13536 from happening again with any other indexed column.

We also noticed that there are very few titles/names with length over
1024 characters at the moment, and that most of these have been
intentionally written to cause issues, while the few others are either
mistakes or can be arranged otherwise. See MBS-13562 for details.

References:

* [^1] https://www.postgresql.org/docs/16/sql-createindex.html
* [^2] https://www.postgresql.org/docs/16/limits.html
* [^3] https://wiki.postgresql.org/wiki/Apt
* [^4] https://www.postgresql.org/docs/16/multibyte.html
Because PostgreSQL actually fits up to “three items on every page” [^1],
each string must take at most 8192 / 3 = 2704 bytes at most,
in order to prevent MBS-13555 from happening again.

References:

* [^1] https://github.com/postgres/postgres/blob/REL_16_0/src/include/access/nbtree.h#L154-L163
Copy link
Member

@mwiencek mwiencek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your time on this oversilly issue. ;)


=item $MAX_POSTGRES_INDEXED_STRING_BYTES

Maximum number of bytes for a string to be indexable by Postgres.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case you feel that it's worth mentioning, it's technically specific to btree indexes, but they are the default type of index and we don't use the other types extensively.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yvanzo and others added 3 commits April 29, 2024 23:42
This patch enforces the previous definitions of both one-line string
length and indexable string size for most editable unlimited strings,
no matter whether these are currently indexed or not, in order to
prevent both MBS-13536 and MBS-13555 from happening again and also to
prevent further related issues in third-party software using this data.

The checks are made at editing time, and prevent edits to be created
when overlong strings are entered for name, sort name, artist credit
name and join phrase. An explanatory message is shown to the user,
excepted in the release editor which just shows an ISE 500 message.

The implementation comes in three parts:
* `Role::CheckOverlongString` for name (and sort name if applicable)
  of primary entities, their aliases, and medium,
* `Role::EditArtistCredit` for artist credit (name and join phrase)
  of primary entities,
* `Medium::Util` for track name and track artist credit.
Co-authored-by: Nicolás Tamargo <reosarevok@metabrainz.org>
Without `use utf8;` the `length` call in our subroutime
`has_at_most_oneline_string_length` would assume that each byte
represents only one character and thus return the number of bytes
instead of the number of multiple-bytes characters. For example,
`length('𝄞𝄵𝅘𝅥𝅘𝅥𝅮𝅘𝅥𝅮𝅘𝅥𝅮𝄾𝅘𝅥𝄀𝅘𝅥𝅮𝅘𝅥𝅮𝅘𝅥𝅮𝄾𝆑𝅗𝅥𝄂' x 64)` would be 4096 instead of 1024.
Therefore, it wouldn’t be testing our other subroutine as intended.

Co-authored-by: Michael Wiencek <mwtuea@gmail.com>
@yvanzo yvanzo merged commit b211476 into metabrainz:production Apr 29, 2024
0 of 3 checks passed
@yvanzo yvanzo deleted the mbs-13562 branch April 29, 2024 22:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants