RFE: Reliable data driven alternative to scanner heuristics #8672

Closed
i30817 opened this issue May 1, 2019 · 9 comments

Comments

@i30817
Contributor

i30817 commented May 1, 2019

Description

Users have very little control over how the scanner classifies their games. The design assumes that the scanner and the cores, in combination, will figure out which 'entry point' files should be handed to the core.

I'd like to challenge this assumption and propose an alternative, optional solution that requires no GUI work in RA, just a few branches in the scanner. It is more reliable and more useful, at the cost of a little more work for the user.

The idea is that the scanner recursion function gains a stack list argument.

On entering a dir:
  if more than one .detect file exists, assert
  if a file named $PLAYLIST.detect exists, where $PLAYLIST is a valid playlist name for a console:
    read the $PLAYLIST.detect file and extract the extensions (or even complete filenames) written in it
    push a struct with the playlist and the extensions onto the stack
  if there is a struct on top of the stack:
    whitelist files by suffix for that fixed playlist, using the struct
  else:
    original code goes here, using the old heuristics to find the 'right' playlist
Prior to leaving a dir (returning from the recursion):
  if the '$PLAYLIST.detect' on top of the stack exists in this dir, pop it (so the previous one applies to sibling branches of the depth-first iteration)
  if a different .detect file exists... well, assert, because that's wrong
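
The steps above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not RetroArch code: `DetectRule`, `read_detect`, `scan` and the `fallback` callback (standing in for the existing heuristics) are hypothetical names, a .detect file is assumed to hold one suffix or filename per line, and the check that $PLAYLIST is a valid playlist name is left out.

```python
import os
from dataclasses import dataclass


@dataclass
class DetectRule:
    playlist: str    # playlist name taken from "<playlist>.detect"
    suffixes: tuple  # whitelisted suffixes or complete filenames


def read_detect(path):
    """Assumed format: one suffix (or complete filename) per non-empty line."""
    with open(path, encoding="utf-8") as fh:
        return tuple(line.strip() for line in fh if line.strip())


def scan(directory, stack, results, fallback=None):
    """Depth-first scan; `stack` carries the .detect rules of ancestor dirs."""
    names = sorted(os.listdir(directory))
    detect_files = [n for n in names if n.endswith(".detect")]
    if len(detect_files) > 1:
        raise ValueError(f"more than one .detect file in {directory}")
    pushed = False
    if detect_files:
        playlist = detect_files[0][: -len(".detect")]
        suffixes = read_detect(os.path.join(directory, detect_files[0]))
        stack.append(DetectRule(playlist, suffixes))
        pushed = True
    try:
        for name in names:
            path = os.path.join(directory, name)
            if os.path.isdir(path):
                scan(path, stack, results, fallback)
            elif name.endswith(".detect"):
                continue
            elif stack:
                rule = stack[-1]  # nearest enclosing rule wins
                if name.endswith(rule.suffixes):
                    results.setdefault(rule.playlist, []).append(path)
            elif fallback is not None:
                playlist = fallback(path)  # old heuristics would go here
                if playlist:
                    results.setdefault(playlist, []).append(path)
    finally:
        if pushed:
            stack.pop()  # restore the parent's rule for sibling branches
    return results
```

Tracking `pushed` is equivalent to checking on the way out whether the rule on top of the stack came from this directory. Calling `scan(root, [], {})` returns a dict mapping playlist names to the launcher files found for them.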

The user could use this to finely control which files are allowed into which playlist, and to 'forbid' some files by simply not whitelisting them. For instance, if I want to play dosbox games but want to launch them through *.conf files (dosbox configs, which may have an autoexec section) rather than bat files, I'd place a DOS.detect containing '.conf' in the topmost folder of the DOS collection.

Similarly, if a core adds support for a new kind of entry point file, the user can just edit or add a '.detect' file in a game dir listing that new file and get a playlist with it. Then, when a game is chosen to run, RA would match that playlist to a list of cores and further filter the cores by whether the core.info supports the extension.
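
A rough sketch of that core filtering step, in Python for illustration. It assumes core .info files carry pipe-separated `supported_extensions` and `database` fields (as I understand the format; treat this as an assumption), and `parse_info`, `cores_for` and the core names used with them are hypothetical.

```python
def parse_info(text):
    """Parse 'key = "value"' lines of a core .info file into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep and not key.lstrip().startswith("#"):
            fields[key.strip()] = value.strip().strip('"')
    return fields


def cores_for(playlist, launcher, infos):
    """Return the cores whose info lists both the playlist and the extension."""
    if "." not in launcher:
        return []
    ext = launcher.rsplit(".", 1)[-1].lower()
    matches = []
    for core, info in infos.items():
        databases = info.get("database", "").split("|")
        extensions = info.get("supported_extensions", "").lower().split("|")
        if playlist in databases and ext in extensions:
            matches.append(core)
    return matches
```

Here `infos` maps a core name to its parsed info dict, so `cores_for("DOS", "dosbox.conf", infos)` keeps only the cores that both belong to the DOS playlist and accept `.conf` launchers.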

Actual behavior

The scanner is supposed to be a fire and forget operation. You choose a directory, and it iterates over the tree finding sets of N 'entry points'. Heuristics are then applied that are supposed to assign each of them to one playlist, by figuring out which 'platform' the entry point is for.

This idea is flawed in at least two ways:

  1. The scanner heuristic might work with a game in one file format, but not with the same game in a slightly different file format that the core also accepts. This is to be expected: RA would need a multi-format CD image mounter to even be able to standardize the byte reading, so it uses raw byte reading and byte array matching instead. But file formats come in many forms, and some consoles accept 'generic' formats like iso without necessarily having the right 'ID' at the start (homebrews, unlicensed games, etc).

  2. The game might have two (or even more!) valid entry points with slightly different behavior. Consider a cue/iso file set with separate music tracks. The scanner can show either the cue or the iso as the entry point; the iso ends up without digital audio, but is still a 'valid game'. This leads to complicated 'hierarchy' filtering code. To be fair, this example would still need handling in a CRC scanner, where the 'entry point' and the actual checksummed file are different (ie: the entry point would be the cue and track1.bin the checksummed file). But there are other examples.

  3. Before you mention the cases where two or more files with the correct extension exist in the same dir and only one is supposed to be chosen, or two or more are needed in a given order: the idea above has a solution for each. If a single file is the 'real' entry point, the algorithm accepts suffixes, not just extensions, so you only need to place a detect file in the dir with the complete name of the entry point file and omit the others. If 2+ files are supposed to be given in order (same extension or not), retroarch tends to support '.cmd' files in the cores that need this, which makes '.cmd' the right extension to detect in those cases.

I also have the notion that this idea could be valuable for more than just the (proposed) filename parser: it could simplify the serial and the CRC scanners too, by offering an alternative to the heuristics. The serial scanner would still need to parse, and the CRC scanner might need to parse a cue (for instance) to get to the 'correct' file / track to checksum, so they'd still need to minimally understand the accepted file formats. But the entry points would be fixed by the user and the fallible magic heuristics would not be used.

Anyway, this feature would be hidden, and the current 'fire and forget' scanner would continue to work; only the people who read the manual would figure out that they could get fewer false positives and false negatives with just a few files (usually) in the topmost dirs of a console set.

@i30817
Contributor Author

i30817 commented Jun 1, 2019

If this feature gets implemented in its simplest form (filenames with simple globbing expressions), it could be extended to dissociate the 'launcher' file from the actual metadata key in the database, so images and such keep working. Examples:

Dosbox.detect with contents dosbox.conf => game/*.exe:CRC32

For any dosbox.conf file in the detect file's dir or below it, use it as the launcher file. Get the metadata key by computing the CRC32 of the executables found under the 'game' subdir and looking it up in the dosbox database, then place the entry on the 'Dosbox' playlist (database and playlist both come from the name of the .detect file).

Sony - Playstation.detect with content *.cue => *.bin:CRC32

For any .cue file in the detect file's dir or below it, use it as the launcher file and get the metadata key from the CRC32 of the '.bin' files in the same dir as the cue.

Sony - Playstation.detect with content *.cue => *(Track 1).bin:CRC32

As above, but you 'know' that you only have redump files, therefore you can afford to only scan the game executable track to distinguish the game.

NEC - Turbografx-16.detect with content *.cue => *(Track 2).bin:CRC32

Turbografx CDs need to use the second track for identification, because that is where the game actually is.

You probably want to mandate a 'non-system' directory separator here so the files can be portable, though I don't know if that already happens in playlists.
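
To make the rule syntax in these examples concrete, here is a minimal parser sketch for one line of a .detect file. The `Rule` type and `parse_rule` are hypothetical names, and using '/' as the portable separator is an assumption the thread leaves open; the sketch only splits the line and does not touch the filesystem.

```python
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    launcher: str             # glob for the file shown in the playlist
    signature: Optional[str]  # glob for the file that identifies the game
    method: Optional[str]     # e.g. "CRC32"


def parse_rule(line):
    """Split 'launcher => signature:METHOD'; a bare launcher is also accepted."""
    launcher, arrow, rest = line.partition("=>")
    launcher = launcher.strip()
    if not arrow:
        return Rule(launcher, None, None)
    signature, colon, method = rest.strip().rpartition(":")
    if not colon or not signature.strip() or not method.strip():
        raise ValueError(f"expected 'signature:METHOD' in {line!r}")
    return Rule(launcher, signature.strip(), method.strip().upper())
```

For example, `parse_rule("*.cue => *(Track 1).bin:CRC32")` yields the launcher glob `*.cue`, the signature glob `*(Track 1).bin` and the method `CRC32`.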

Anyway, as you can see, a fully implemented version of this would:

  1. let power users avoid repeating themselves, by using the file hierarchy.
  2. allow specialized core launcher files, up to 'definitely named' launcher files.
  3. dissociate the 'launcher' file from the file that is actually scanned. Some dump formats need this to run the game, or the complete game: for example cue/bin, or dosbox.conf files holding the configuration that starts the game while the exe provides the checksum.
  4. allow custom directory structures in that dissociation. In the example above, dosbox.conf didn't need to be in the same dir as the executable it scanned; it could be in any subdir if * globs eagerly.
  5. speed up the search by only looking at relevant files.
  6. speed up the search even more by only looking up the CRC32 in the 'right' database file, the one that matches the .detect file name; this also eliminates some false positives when using :NAME.
  7. not inconvenience people who just want to 'scan' without creating files. In fact, this mini parser/language and scanner control could be used by retroarch devs themselves if they wanted a GUI menu to scan 'specific' content under a dir: after the user selects a dir and a 'scan for ps1 files' submenu, send *.cue => *.bin:CRC32 (among others, maybe) to the scanner.

I specified :CRC32 at the end because people still make noises about other scanning methods, so you could maybe put in :SERIAL or :NAME, but I personally would go for CRC32 every time if both this and #8873 get implemented.

Though I'd understand RA devs being hesitant to use this idea if regex/path globbing libraries are not exactly portable to consoles or something.
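
A small sketch of how the method suffix could select the way a metadata key is computed. `file_crc32`, `name_key`, `METHODS` and `metadata_key` are hypothetical names; using the file name without its extension as the :NAME key is an assumption; :SERIAL is left out because it needs format-specific parsing.

```python
import os
import zlib


def file_crc32(path, chunk_size=65536):
    """CRC32 of a file as 8 lowercase hex digits, read in chunks."""
    crc = 0
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return f"{crc & 0xFFFFFFFF:08x}"


def name_key(path):
    """Assumed :NAME behaviour: the file name without its extension."""
    return os.path.splitext(os.path.basename(path))[0]


# Only the methods that need no format knowledge are sketched here.
METHODS = {"CRC32": file_crc32, "NAME": name_key}


def metadata_key(path, method):
    """Compute the database lookup key for `path` with the named method."""
    handler = METHODS.get(method.upper())
    if handler is None:
        raise ValueError(f"unsupported scan method: {method}")
    return handler(path)
```

Keeping the methods in a table like this is one way to decouple them from the scanner, so a rule only has to name the method it wants.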

@hizzlekizzle
Contributor

Of the examples you mentioned, it seems to me that those would make sense to ship as the default behavior anyway. That is, I think it would make sense to put those 'detect' files alongside the databases themselves, rather than in the content dirs. Then we could track and ship sensible defaults, and users would have a central location where they could fine-tune the behavior.

@i30817
Contributor Author

i30817 commented Oct 27, 2019

I (low key) agree that the default behavior could be implemented much more consistently with a setup like this, but I suspect it would be slower (because of multiple types of scan going on if you have more than one line per .detect file, and because of the inherent cost of globbing).

Anyway, the idea is to make it usable in the GUI too, as an easy way to apply some default rules ('scan for redump ps1' for instance), or to let the user configure a certain dir and its descendants if they have a particularly weird case (like the dosbox conf launcher idea). So it needs to be programmed with care to accommodate both 'ignore the filesystem and only use this rule' and 'allow rule overrides in the filesystem'.

For instance, I have translations for some consoles that no longer follow the redump standard (they need to be converted to iso, so the default rule wouldn't work), but I also don't want to eat the performance cost or the 'hierarchy problems' of two or more applicable entry points for a game. Thus the top level would have the 'redump' rule and that particular iso translation would have an 'iso:CRC32' rule.

I hadn't thought about replacing the normal scan with this framework so much as 'configuring' scanning options. But now that I think about it, replacing it would probably be for the best, because just the effort of decoupling the 'scan methods' (crc32, serial, etc) from the scanner itself so they can be used here would probably churn the 'defaults' anyway.

@i30817
Contributor Author

i30817 commented Oct 27, 2019

I'm also not sure if there is a portable globbing library in C that can be adapted to work with RA on all the platforms it supports.

@i30817
Contributor Author

i30817 commented Oct 27, 2019

A problem this decoupling idea still has (this concerns only the 'second part' of this issue; the first part, without the 'entrypoint' files, would 'just' create normal entries associated with executables, as usual in RA dosbox):

If retroarch libretro-database has two or more entries for a single game, for example a DOS game with an installer and a game executable (you can see that the DOS.dat does this here),

dosbox.conf => *.exe:CRC32 would give a list of two possible metadata entries for the single 'dosbox.conf' entry point: one for the game executable, another for the installer. As this file is user created, it can do anything: ask whether you want to run the installer, the game, an expansion, or netplay (all of these happen in some GOG games).

This can of course be 'prevented' with dosbox.conf => gamename.exe:CRC32, but that gets annoying if it happens often (as it would with DOS games).

There is also the *.conf => *.exe:CRC32 case, for when GOG distributes multiple conf files named after the game / netplay / installer. This would be even messier. The first * is not the cause (it simply expands to one or more conf files); the second * has the same problem as above, except that there may actually be multiple entry points, so you could get 2 playlist entries with uncertain 'metadata', depending on the method used to resolve the uncertainty.

Edited out ideas that were too complicated; see below for a better idea.

As an aside, it should probably also be possible to skip the 'method' entirely and just pick an entry of the database directly with a key. For instance installer.conf => INSTALL_1.BAT:CRC32:8cae6a3d (or simply CRC32:8cae6a3d)
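
One way the right-hand side could be parsed once a literal key is allowed, sketched in Python. The `Signature` type, `parse_signature` and the fixed `KNOWN_METHODS` set are hypothetical; the set is how this sketch tells `CRC32:8cae6a3d` (method and key) apart from `*.exe:CRC32` (pattern and method), which is an assumption on my part.

```python
from typing import NamedTuple, Optional

KNOWN_METHODS = {"CRC32", "SERIAL", "NAME"}


class Signature(NamedTuple):
    pattern: Optional[str]  # file glob, or None when only a key is given
    method: str
    key: Optional[str]      # literal database key, or None to compute it


def parse_signature(text):
    """Accept 'pattern:METHOD', 'pattern:METHOD:key' or 'METHOD:key'."""
    parts = [p.strip() for p in text.split(":")]
    if len(parts) == 3:
        pattern, method, key = parts
    elif len(parts) == 2 and parts[0].upper() in KNOWN_METHODS:
        pattern, (method, key) = None, parts
    elif len(parts) == 2:
        (pattern, method), key = parts, None
    else:
        raise ValueError(f"cannot parse signature: {text!r}")
    method = method.upper()
    if method not in KNOWN_METHODS:
        raise ValueError(f"unknown method: {method}")
    if key and method == "CRC32":
        key = key.lower()  # normalise hex digits for lookup
    return Signature(pattern, method, key)
```

With a key present the scanner would not need to read the signature file at all; it could go straight to the database lookup.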

@i30817
Contributor Author

i30817 commented Oct 27, 2019

I had a better idea for the above, but it's different from normal glob.

*.conf => *.exe:CRC32 is problematic because the left-hand side has no certain association with the executable(s) on the right-hand side.

So we simply require that if there is a * on both sides, the expansion of the * has to be the same on both sides. If not, the tuple is discarded and won't be scanned. It'll still work for sets from dumping groups, because the expansion of the * in the launcher file (*.cue) is either part of the signature file name (redump, TOSEC etc), or the signature file name is fixed and this rule doesn't apply (track1.bin etc). Some presets might have to be adjusted so the trailing space is not part of the expansion, eg: *.cue => * (Track 1).bin:CRC32 instead of *.cue => *(Track 1).bin:CRC32 for redump.

*.conf => fixed.exe:CRC32 or dosbox.conf => *.exe:CRC32 still work. The first potentially gives multiple entries with the same metadata (if there are multiple conf files in the current dir), and the second potentially gives the 'wrong' metadata when several matching exes exist in the database.

The first is intended, and the second is not that horrible because it will still be the game's main metadata¹. But as soon as both sides have a *, only a conf file whose name shares the expanded part with the executable will be considered, which means that the launcher file name will absolutely 'associate to the right metadata'. I think.

It's a simple idea for users to digest: 'rename the conf file after the exec/signature file it launches to use it as the playlist entry that launches the game'; with more flexibility if you know about the format.

¹ If it is positioned last in the database among the set that is returned, because apparently that libretro dat places installers first and the game last. In the DOS.dat case this doesn't appear to matter, because the database name is the same for all signature files of the same game, but it could if it wasn't. Maybe this convention doesn't really hold for all of the affected .dats, and simply taking the first is easier.
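
The 'same expansion on both sides' rule can be sketched like this. It is an illustration only: `compile_glob` and `pair_files` are hypothetical names, each pattern is assumed to hold at most one `*`, and matching is done against the plain file names of a single directory.

```python
import re


def compile_glob(pattern):
    """Turn a pattern with at most one '*' into a regex capturing its expansion."""
    if pattern.count("*") > 1:
        raise ValueError("this sketch supports a single '*' per pattern")
    return re.compile("(.*)".join(re.escape(part) for part in pattern.split("*")))


def pair_files(launcher_pat, signature_pat, names):
    """Pair launcher and signature files, dropping mismatched '*' expansions."""
    launcher_re = compile_glob(launcher_pat)
    signature_re = compile_glob(signature_pat)
    both_star = "*" in launcher_pat and "*" in signature_pat
    pairs = []
    for launcher in names:
        lm = launcher_re.fullmatch(launcher)
        if not lm:
            continue
        for signature in names:
            sm = signature_re.fullmatch(signature)
            if not sm:
                continue
            if both_star and lm.group(1) != sm.group(1):
                continue  # expansions differ: discard the tuple
            pairs.append((launcher, signature))
    return pairs
```

With `Foo.cue` and `Foo (Track 1).bin` in a directory, the rule `*.cue => * (Track 1).bin` pairs them, while `*.cue => *(Track 1).bin` does not, because the second `*` would expand to `Foo ` with a trailing space.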

@i30817
Contributor Author

i30817 commented Oct 28, 2019

@hizzlekizzle

I was thinking about the 'decoupling' again and found an obvious problem (again) that I'm unsure how to solve without special treatment, and I'd like an opinion on whether it should be done.

The idea of the 'decoupling' is ofc to specify the 'core launcher' file that appears on a playlist as different from the 'metadata signature file' that actually identifies the game. So far so good.

But that's not the only indirection that RA needs. There is another possible indirection that 'could', and maybe should, appear on playlists: m3u files.

An m3u is a collection of 'core launcher' files (the usual example being cue files). So I'm in a bit of a pickle: *.m3u => * (Track 1).bin wouldn't match any file under the rule proposed above that makes the * equal on both sides, because the name of the m3u doesn't (or shouldn't) match one single cue, and the metadata should probably be a collection instead of a single entry.

I 'feel' an m3u can appear on a playlist with metadata anyway, without further complicating this 'mini-scanner-language': the scanner adds them unconditionally by default (*.m3u), the playlist view 'hides' the scanned cues if they also appear in the m3u, and the m3u playlist entry 'inherits' the metadata of the others for display.
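
A sketch of that hiding and inheriting step, under assumptions: scanned entries are modelled as a plain dict from launcher path to metadata, an m3u is read as one relative path per line with '#' lines skipped, and `read_m3u` and `collapse_m3u` are hypothetical names.

```python
import os


def read_m3u(path):
    """Return the member paths of an m3u, resolved relative to its directory."""
    base = os.path.dirname(path)
    members = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#"):
                members.append(os.path.normpath(os.path.join(base, line)))
    return members


def collapse_m3u(entries, m3u_members):
    """Hide scanned launchers listed in an m3u; the m3u inherits their metadata.

    entries: {launcher_path: metadata} produced by the normal scan
    m3u_members: {m3u_path: [member launcher paths]}
    """
    hidden = set()
    view = {}
    for m3u, members in m3u_members.items():
        known = [m for m in members if m in entries]
        hidden.update(known)
        view[m3u] = [entries[m] for m in known]
    for path, meta in entries.items():
        if path not in hidden:
            view[path] = [meta]
    return view
```

Each value in the resulting view is a list, so an m3u entry carries the metadata of every disc it groups while a standalone launcher carries just its own.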

Do you think this is the right approach, or should I try something else?

@i30817
Contributor Author

i30817 commented Oct 29, 2019

I think I'll open a new issue where I consolidate all of this info and these ideas into a legible format, if no one minds.

@i30817
Contributor Author

i30817 commented Oct 29, 2019

closed for 9656

@i30817 i30817 closed this as completed Oct 29, 2019