Parse Policy Doc PDFs #1

Open
lunakv opened this issue Aug 11, 2022 · 3 comments
Labels
enhancement New feature or request

Comments

@lunakv
Owner

lunakv commented Aug 11, 2022

The API currently only creates diffs of the CR, because all policy docs are only available as PDFs. To be able to diff policy docs, we must first transform them into some machine-readable representation. There are a number of PDF parsers available, all working slightly differently, so some research should be done into which one works best for this use case.

@lunakv lunakv added the enhancement New feature or request label Aug 11, 2022
@lunakv
Owner Author

lunakv commented Aug 11, 2022

@multimeric I tried to integrate the MTR grammar you wrote for Venser's Journal, but I ran into some issues regarding bullet lists. Take the current MTR as an example.

  • The newline after the last list item is erroneously removed during the cleanup process. For example, in section 1.3, the list should end with
•  Player 
•  Spectator 
The first four roles above are [...],

but instead it's parsed as

•  Player 
•  Spectator The first four roles above are [...].

which makes it impossible to detect where the last item ends and the following paragraph begins.

  • Sometimes the list itself is parsed incorrectly, inserting extra bullet points. In section 1.4, the first list should read
[...]
• Individuals currently suspended by the DCI. Individuals currently suspended from the DCI may not act as tournament officials;
• Other individuals specifically prohibited from participation by DCI or Wizards of the Coast policy (such determination is at Wizards of the Coast’s sole discretion);
[...]

Instead, presumably because each point spans multiple lines, the second one is parsed as

• Other individuals specifically prohibited from participation by DCI or Wizards of the Coast policy 
• (such determination is at Wizards of the Coast’s sole discretion);

which is just obviously wrong. Sometimes a rogue bullet point is inserted on an empty line, either out of nowhere or behind the list item instead of at the beginning of it (this happens for the first item of the second list in section 1.4).
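
Both problems could in principle be papered over with a post-processing pass over the extracted lines. The following is only a rough Python sketch of that idea, not part of the existing grammar: the names `repair_bullets`, `looks_like_continuation`, and `TERMINATORS` are hypothetical, and it assumes that a finished list item ends in `.`, `;` or `:` and that a continuation starts with a lowercase letter or an opening parenthesis.

```python
BULLET = "•"
TERMINATORS = (".", ";", ":")  # assumption: punctuation that closes a list item


def looks_like_continuation(previous, body):
    """Guess whether `body` continues the bullet item `previous`."""
    if previous.endswith(TERMINATORS):
        return False
    return body[0].islower() or body[0] == "("


def repair_bullets(lines):
    """Return cleaned lines with one bullet item per line.

    Spurious bullets in the middle of an item are merged back into it,
    bullets without any text are dropped, and a paragraph following the
    last item stays on its own line.
    """
    result = []
    in_item = False  # True while the last output line is a bullet item
    for raw in lines:
        text = raw.strip()
        # A bullet may sit at the start of the line or (misplaced) at its end.
        has_bullet = text.startswith(BULLET) or text.endswith(BULLET)
        body = text.strip(BULLET + " \t")
        if not body:
            # A blank line ends the current item; a lone bullet is just dropped.
            if not has_bullet:
                in_item = False
            continue
        if in_item and looks_like_continuation(result[-1], body):
            result[-1] += " " + body
        elif has_bullet:
            result.append(f"{BULLET} {body}")
            in_item = True
        else:
            result.append(body)
            in_item = False
    return result


if __name__ == "__main__":
    sample = [
        "•  Player ",
        "•  Spectator ",
        "The first four roles above are [...],",
    ]
    for line in repair_bullets(sample):
        print(line)
```

The capitalization check is the weak point: a wrapped line that happens to start with a capitalized word (for example "DCI") would be mistaken for a new paragraph, so layout information such as indentation would be a more reliable signal if the PDF extractor exposes it.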

Since you wrote the thing (and I don't have much experience with this kind of parsing), I was wondering if you had any insight into how to solve these issues before I go digging too deep into it.

@multimeric

Cool, thanks for looking into this. I remember there were some issues with the parser, but I never finished them off because the VJ guy wasn't actually hosting it anyway.

@KingSupernova31

It occurs to me that the MTG Judge Core app is successfully parsing the IPG and MTR. Maybe Andrew Teo would share their data/method for doing that?

@lunakv lunakv mentioned this issue Mar 5, 2023