Issues with file encoding #746

Closed
tsaglam opened this issue Oct 17, 2022 · 4 comments
Labels: enhancement, language, minor

Comments

@tsaglam
Member

tsaglam commented Oct 17, 2022

JPlag currently assumes UTF-8 encoding and may reject submissions that use a different encoding (e.g. the Java language module fails on ANSI-encoded files). Maybe we can come up with a more tolerant approach.

  • First priority is to find and test a solution for the java language.
  • After that, we could think of language-independent solutions.

We could use a heuristic detector such as https://github.com/albfernandez/juniversalchardet. For an example, see:

https://github.com/Feuermagier/autograder/blob/1a47ee7ea91d3d2f7563053528124fd18c2c037f/autograder-core/src/main/java/de/firemage/autograder/core/file/UploadedFile.java#L41-L56
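As a rough illustration of what "more tolerant" could mean, here is a minimal sketch that uses only the Java standard library rather than juniversalchardet: it tries strict UTF-8 decoding first and falls back to a single-byte charset if that fails. The class name `TolerantDecoder` and the choice of ISO-8859-1 as the fallback are assumptions for illustration only; files described as "ANSI" are often Windows-1252 instead, and a real detector would weigh more candidates.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of JPlag.
final class TolerantDecoder {
    // Assumption: non-UTF-8 submissions are in a Latin single-byte encoding.
    private static final Charset FALLBACK = StandardCharsets.ISO_8859_1;

    static String decode(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently inserting U+FFFD.
        CharsetDecoder strictUtf8 = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return strictUtf8.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8: every byte sequence decodes under ISO-8859-1,
            // so this never fails, but it may still be the wrong guess.
            return new String(bytes, FALLBACK);
        }
    }
}
```

Valid UTF-8 input is returned unchanged, so this only alters behavior for files that would otherwise be rejected.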

@tsaglam added the enhancement, minor, and help wanted labels Oct 17, 2022
@dfuchss
Member

dfuchss commented Oct 25, 2022

In general, this might not be easily possible.
Nevertheless, I found this library: https://unicode-org.github.io/icu/userguide/conversion/detection

@tsaglam
Member Author

tsaglam commented Feb 9, 2023

See also past discussions on this topic: #115.

@tsaglam pinned this issue Feb 9, 2023
@tsaglam added the language label Feb 9, 2023
@alberth
Contributor

alberth commented Mar 6, 2023

In theory, the encoding cannot be determined reliably from the file content alone. When you get a file whose bytes are all between 0 and 127, you may conclude it's ASCII, but it could be UTF-8 as well, and if it has an even length, UTF-16 (or UCS-2) would work too.

You can make a guess, but nothing prevents you from interpreting the bytes the wrong way.
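To make the ambiguity concrete, the following sketch decodes the same six bytes under three charsets. All three decodings succeed without any replacement characters, so the bytes alone give no way to tell which one was intended. The class name `AmbiguousBytes` and the sample bytes are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: one byte sequence, three equally valid readings.
final class AmbiguousBytes {
    static String[] decodeAll(byte[] bytes) {
        return new String[] {
            new String(bytes, StandardCharsets.US_ASCII),
            new String(bytes, StandardCharsets.UTF_8),
            new String(bytes, StandardCharsets.UTF_16BE),
        };
    }

    public static void main(String[] args) {
        // Six bytes, all in the range 0..127, even length.
        byte[] bytes = {0x4A, 0x50, 0x6C, 0x61, 0x67, 0x21};
        for (String text : decodeAll(bytes)) {
            // ASCII and UTF-8 yield the six characters "JPlag!";
            // UTF-16BE yields the three characters U+4A50, U+6C61, U+6721.
            System.out.println(text.length() + " chars: " + text);
        }
    }
}
```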

One way to "solve" this is to move the problem entirely to the language. The language parser has to make sense of the input anyway. Currently the tricky part is that JPlag also reads the text file itself to display it to the user. If you could move that part to the language as well, JPlag would no longer have any direct contact with the input text.
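One possible shape for this idea, sketched under heavy assumptions: each language owns the step that turns file bytes into text, and the core asks the language for that text both for parsing and for display. The interface `SourceLanguage`, its methods, and the class `LegacyLanguage` are hypothetical names invented for this sketch; they are not JPlag's actual API.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical interface, not JPlag's actual language API.
interface SourceLanguage {
    /** Charset this language expects for the given file; UTF-8 unless overridden. */
    default Charset charsetFor(Path file) {
        return StandardCharsets.UTF_8;
    }

    /** The only place where bytes become text, shared by parsing and display. */
    default String readSource(Path file) throws IOException {
        return new String(Files.readAllBytes(file), charsetFor(file));
    }
}

// Hypothetical language whose sources are assumed to be ISO-8859-1.
final class LegacyLanguage implements SourceLanguage {
    @Override
    public Charset charsetFor(Path file) {
        return StandardCharsets.ISO_8859_1;
    }
}
```

With this split, the core never picks a charset itself; whatever the language decides is also what the user sees.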

@tsaglam removed the help wanted label Mar 10, 2023
@sebinside
Member

Fixed in #1026

@tsaglam unpinned this issue May 17, 2023
5 participants