Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duplicate, or more, rom information labels in some cases, probably because of 'serials as id' #7455

Open
i30817 opened this issue Oct 18, 2018 · 10 comments

Comments

@i30817
Copy link
Contributor

i30817 commented Oct 18, 2018

First and foremost consider this:

  • Only RetroArch bugs should be filed here. Not core bugs or game bugs
  • This is not a forum or a help section, this is strictly developer oriented

Description

Some rom information that comes from libretro-database is being duplicated on RA rom information page.

for instance this is a megadrive playlist Alien Soldier (Europe):

screenshot from 2018-10-18 06-11-21

There are more hashsums listed but not visible (about 7 of them all different as far as i can tell). The top is also obviously a result of a merge of a different style of dat since it includes the serial.

The real rom has crc 0496e06c (the last listing on the screenshot and also the 'first' in a different style, of info, probably because take from another type of dat).

Other crcs (not listing md5 to simplify) are 0A9C04C9, 9B1A1D7F, B5CB1CEB, 3F5746BA, 3DB10C48, C4EB3F76

None of these other crcs appear on the original no-intro dat (but i have a slightly older one as should be obvious on the first entry of the list bellow), but it's definitely the result of using serials to 'identify' games and ending up with multiple versions of the game on the 'info' instead of the right version.

grep on a repository of the libretro-database confirms it

grep -Eir "0A9C04C9|9B1A1D7F|B5CB1CEB|3F5746BA|3DB10C48|C4EB3F76"
metadat/no-intro/Sega - Mega Drive - Genesis.dat:	rom ( name "Alien Soldier (USA) (Virtual Console).md" size 2097152 crc 9B1A1D7F md5 54D06FE70C75BF6828A503BB8CE806B0 sha1 A7C6E7D6B3DF26A3CA381A170AD06EFAB0043C97 )
metadat/goodtools/GoodGen 3.21.dat:	rom ( name "Alien Soldier (E) [b1].bin" size 2097152 crc c4eb3f76 md5 72ada8ca42e600ed2128b054af4fc88e sha1 426cbeb947e5a9fe9aac2d3288572cf439d45c4a )
metadat/goodtools/GoodGen 3.21.dat:	rom ( name "Alien Soldier (E) [b2].bin" size 2097664 crc 3db10c48 md5 9847bf660e0effe7f3e6aa971447ed28 sha1 d95a87992705e041d89abcbe9c26194070fcc4c9 )
metadat/goodtools/GoodGen 3.21.dat:	rom ( name "Alien Soldier (E) [f1].bin" size 2097152 crc 3f5746ba md5 328592e16e1a380f6123e0b444956c85 sha1 316dbce9d6d17606d01498b5bd7a5e28dba72f7e )
metadat/goodtools/GoodGen 3.21.dat:	rom ( name "Alien Soldier (E) [f2].bin" size 2097152 crc b5cb1ceb md5 1d577d96dbf03b13babdfd91e163e9a8 sha1 cd29bb5b9a727148d9e3601507187db2dd02cd2d )
metadat/goodtools/GoodGen 3.21.dat:	rom ( name "Alien Soldier (E) [f3].bin" size 2097152 crc 9b1a1d7f md5 54d06fe70c75bf6828a503bb8ce806b0 sha1 a7c6e7d6b3df26a3ca381a170ad06efab0043c97 )
metadat/goodtools/Sega - Mega Drive - Genesis.dat:	rom ( name "Alien Soldier (E) [b1].bin" size 2097152 crc c4eb3f76 md5 72ada8ca42e600ed2128b054af4fc88e sha1 426cbeb947e5a9fe9aac2d3288572cf439d45c4a )
metadat/goodtools/Sega - Mega Drive - Genesis.dat:	rom ( name "Alien Soldier (E) [b2].bin" size 2097664 crc 3db10c48 md5 9847bf660e0effe7f3e6aa971447ed28 sha1 d95a87992705e041d89abcbe9c26194070fcc4c9 )
metadat/goodtools/Sega - Mega Drive - Genesis.dat:	rom ( name "Alien Soldier (E) [f1].bin" size 2097152 crc 3f5746ba md5 328592e16e1a380f6123e0b444956c85 sha1 316dbce9d6d17606d01498b5bd7a5e28dba72f7e )
metadat/goodtools/Sega - Mega Drive - Genesis.dat:	rom ( name "Alien Soldier (E) [f2].bin" size 2097152 crc b5cb1ceb md5 1d577d96dbf03b13babdfd91e163e9a8 sha1 cd29bb5b9a727148d9e3601507187db2dd02cd2d )
metadat/goodtools/Sega - Mega Drive - Genesis.dat:	rom ( name "Alien Soldier (E) [f3].bin" size 2097152 crc 9b1a1d7f md5 54d06fe70c75bf6828a503bb8ce806b0 sha1 a7c6e7d6b3df26a3ca381a170ad06efab0043c97 )

Anyway, it's extremely sketchy to use serials to 'identify' roms especially if RA is going to try to merge all possible dumps on the database - not that only using no-intro would help, for instance the first on that list is the no-intro virtual console version.

Expected behavior

Stop using serials to uniquely identify games. Use them as shared data foreign keys (like pictures of covers) but not for 'actually unique' information like hashcodes or other properties on the per-game entries on dats. A single rom should give a single result, end of story.

Actual behavior

Trying to accommodate every rom version and not calculate hashcodes one time 'because it's slower to scan' leads to this, among other things like hacks getting identified as the 'game' and not using their actual database entries.

Even if you completely hid the info page just to keep using the faster serial scan, a hashcode check is necessary if you want to support features which ask for exactness. What if two roms with the same serial but different hashes try to netplay? What if people want the display name to be the hack name? What if people want the ability to go to a hack page (as in my recently open request) and this opens 'all' of the pages for hacks on that game?

If you're trying to move away from hashcodes upfront calculation - features which need it would need it calculated before a feature which requires uniquness is used which is a much worse user experience imo if the hash is going to be a '4 gb wii disc' or similar.

Steps to reproduce the bug

  1. [First step]
  2. [Second step]
  3. [and so on...]

Bisect Results

[Try to bisect and tell us when this started happening]

Version/Commit

You can find this information under Information/System Information

  • RetroArch: [version/commit]

Environment information

  • OS: [The operating system you're running]
  • Compiler: [In case you are running local builds]
@inactive123
Copy link
Contributor

inactive123 commented Oct 18, 2018

This is a fair point. These duplicate entries being shown need a solution of some sorts.

@i30817
Copy link
Contributor Author

i30817 commented Oct 18, 2018

There is also the case where two or more different dump groups have the same exact rom (hashcode too obviously). In that case my recommendation is to ... let the more popular/respected group win the race (ie: no intro would always win over goodtools).

In fact i'd suggest duplicated entries like that don't even get into the database, just one canonical for the most respected group atm.

(this might actually already being done).

Serials are useful when you want the ambiguity (for instance, game configs that apply to multiple versions of the game because they're all broken or enhanced the same way), or displaying the same cover for a single 'key' (although this is sometimes a bit inaccurate for extensive hacks, which are rare). So they're still useful in the database but more as a secondary key.

@i30817
Copy link
Contributor Author

i30817 commented Oct 18, 2018

You know what, i'm going to give a suggestion on the assumption you're not going to decide to revert the serial scanning.

Give a option, either:

  1. have serial scans and disable features that need hashcodes or calculate them lazily (before caching the playlist obviously). This means from what i can tell conservatively: netplay, the proposed feature for opening the roms webpages of a hack, disable the 'information' page entries that refer to crcs and keep only things like 'publisher', 'release date', 'tags' etc.

You or a 3rd party would still be able to bundle game specific configuration bundles if the serial was treated as primary key (say for fixing letterboxing in ps1 games or frame delay or having arcade cabinet overlays) but they'd would have false positives for certain hacks configurations (which would require 'specific' overriding, ie: if the crc/md5 is known and matches, overlay these changes).

Basically don't assume that just because you have a serial match on the database that the hashcode is the same and think case by case if the operation requires a unique rom.

  1. Or the user changes a setting to make rom scan use hashcodes because they care about hack names and distributed specific configurations.

Leaving things 'lazy' on 1 might be a acceptable result, if you put in a little warning on the user trying a feature requiring exact matches on roms that are 'large enough' on size 'this feature requires a long crc calculation Y/N?'

This way, features would all be accessible, but would have a small delay the first time around or a warning that the delay is going to be much longer. Hack names and descriptions would be the main victim of this idea and the reason why I feel the 2. is still necessary, because 1. would give the wrong result right away for those things.

Alternative to 2 that would work with 1. as the only (invisible) option, force games which have known hacks to be identified by hash:

Keep the hack database separate from the 'published rom' database. After finding a serial check first if it's in the hack database (it would need to be modified to get them but it's feasible). If so require a hashcode calculation before creating the playlist entry and check if it isn't a hack with that before checking the 'published' database (problem: not all hacks are in the database, but that would just mean a coder decision, either assume the serial is a unknown variation of a original published rom for the 'lazy' info of 1. or not put a new entry on the playlist at all and if you're security paranoid even warn the user that romhack is fishy).

This is not as slow as it seems because outside of the snes and nes there aren't really hundreds of hacks (especially cd games which are the slowest offenders).

@i30817
Copy link
Contributor Author

i30817 commented Oct 18, 2018

The reddit discussion also made me realize something else. Having a serial scanner is even worse for 'security' than accepting hacks with the right crcs from romhacking.net, which was the shade being thrown at me (not as bad as listing anything with the right extension though).

Any ROM found anywhere that has the right serial could be confused with a whitelisted version of the original game, when if you kept using the hashes and avoid anything that doesn't match, they'd at least have to put a malicious rom on romhacking.net (and when caught, it would be removed and removed from libretro-database, like the hacks with game ending bugs - at least if people - ie: me - keeps using rhdndat to validate dats).

@ghost
Copy link

ghost commented Dec 22, 2018

have these of fbalpha, nes, snes and others. considering playlists are created based on hash, the other hashes should not show. (posting here to reconfirm this issue)

screenshot_2018-12-22_22-36-01

@orbea
Copy link
Contributor

orbea commented Dec 22, 2018

I noticed this too, it seems dependent on the game more than the core as some games won't cause it.

@ner00
Copy link

ner00 commented May 26, 2020

This is pretty backward and weird that it hasn't been addressed for so long. Just don't use the serial as a UID, it makes no sense because it doesn't work.

@i30817
Copy link
Contributor Author

i30817 commented May 26, 2020

It's fast and that's all that matters to (l)users. Until there is a option to opt-out of the insanity this will still happen.

My main beef with it - besides not being given the option - is how it's warping the functionality of both RA and libretro around this 'heuristic' model. The database becomes almost useless for advanced metadata, not that it has any. Even hack names can't reliably be identified, even if they're there.

Causing bugs like this is bad too ofc.

@ner00
Copy link

ner00 commented May 26, 2020

It's fast and that's all that matters to (l)users.

It depends, doesn't it? Even some users scanning folders full with gigabytes of isos may be more inclined to be patient and have their collection properly arranged, but I understand the performance benefit - especially if you're going to scan a folder multiple times.

Until there is a option to opt-out of the insanity this will still happen.

An opt-out option at the very least would be nice. At least for single file scanning it could consider calculating a checksum by default (with serial as opt-in), and for folder-wide scan being opt-out.

I'm curious to see some benchmarks on crc32 across platforms and file sizes, it will obviously be a whole lot slower than seeking a few bytes at an offset for sure.

@i30817
Copy link
Contributor Author

i30817 commented May 27, 2020

oh, it takes hours, depending if you have a 10 years+ harddisk instead of a ssd. But with the magic of fileystem extended attributes you could only do it once. And as a bonus you could even make softpatches record the 'correct' hardpatched hash, with some trickery.

xattrib memoization doesn't work for hardlinks of ROMs unfortunately, because linux was stupid here, but it may work if your filesystem does native deduplication (not sure about this myself, i only verified that changing a xattrib on a hardlink changed it too on the original).

Or you could be a bit more intelligent when scanning playlists and reusing entries crcs whose rom file last modification is > than the older playlist creation time and have the same path, that would probably be much more portable. Then for softpatches you'd make tool or option or extra property to modify/record the hardpatched CRC in the playlist (modify and not recreate so that 'rom file last modification is > than the older playlist creation time' keeps happening).

'Only 2 hard problems in computing science, naming, caching and off-by-one'.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants