UTF-8 characters in filenames not handled correctly in normal downloads #7188
Comments
Thanks @qqmyers. It's important that we do what we can to avoid changing filenames on download, whether or not S3 is involved. I'll move this to Up Next.
I took a quick look at this and wanted to note that files with an en-dash ("–") as in the example above cannot be uploaded via the native API. You get the following error:
In server.log you see this:
(The file I tried was "READ–ME.md".) Obviously, it would be more helpful for the user to see the reason their upload failed rather than having to email support to learn what's in server.log. That said, I know this issue is about what filenames are allowed to come out of Dataverse, but it probably makes sense to also examine what filenames are allowed in. Here's the code I was testing with:
Correction. When I was attempting to upload my READ–ME.md file previously, I was doing it with REST Assured (the code in my previous comment). I just tried again using curl and the file was uploaded, but the en-dash was replaced with an â (a-circumflex) followed by unprintable characters, so the file has a name that does not copy well into GitHub (READâ��ME.md). I'll paste a screenshot:
When I download it, the popup looks like the screenshot below and the file has a space in it like this: READâ ME.md
All this is to say that files with exotic characters in their filenames can be uploaded via the native API. I'm just not sure how to prevent the characters from being munged. (On a related note, as @qqmyers mentioned in Slack, the reason I was seeing the error about disallowed characters above is that the en-dash is getting turned into a question mark, which is disallowed.)
The other thing I wanted to say is that the way to reliably upload files with fancy characters in their names via API is to use SWORD. This is working fine for me in automated testing. Also, more importantly for our users, the GUI works just fine for uploading files with fancy characters.
In terms of scope, I think I'm going to go ahead and make a pull request to fix the download side. For uploading via the native API, someone could spend some time trying to figure out how to do this with curl (and document it in the guides), or could try it in pyDataverse or other Dataverse client libraries. I'm thinking this upload stuff is out of scope, but I'll probably spend a few minutes on it.
I tried and failed to get either curl or REST Assured to upload files with UTF-8 filenames. Here are a couple of the pages I was looking at in case they help someone in the future:
Meanwhile, my pull request for fixing download is #7503.
Yeah, there doesn't appear to be a lot of good info out there, and things are confused between download, URL-encoded upload, and multipart upload, and between talking about the filename and the file body, etc. I saw a few hints that the problem may not be in how curl and other clients are sending the name but in the Apache mime4j library always using 8859_1 (ISO-8859-1) encoding. I tried a one-line fix: see https://github.com/QualitativeDataRepository/dataverse/blob/06c0985052403a27e6c4e43f39450390afbbfd86/src/main/java/edu/harvard/iq/dataverse/api/Datasets.java#L1824-L1830 for the change and comment. That seems to work (and not break ASCII chars), but I don't know if that's the right thing to do. There may be some flag we haven't found that makes the client change how the filename is sent to the API or tells the server that it is UTF-8, or there may be a fix in Jersey/mime4j that we should get rather than working around it. In any case, this is info for the future and at least maybe a workaround if this is a priority for someone (the code above does result in the correct UTF-8 version of the name being stored in the db in the filemetadata). (Also, FWIW, on Windows I was having trouble reading UTF-8 chars in the command console (and Ubuntu on Windows) until I found and turned on the beta UTF-8 support. Anyone testing on Windows may need to check into that as well.)
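To illustrate the suspected cause described above, here is a minimal sketch of what happens when the UTF-8 bytes of a filename are decoded as ISO-8859-1, and how re-decoding the same bytes as UTF-8 recovers the name. The class and method names are hypothetical and this is not the change in the linked commit; it only assumes that the server-side library decodes the multipart filename as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only (not Dataverse code).
final class FilenameMojibakeSketch {

    // Simulates a server that reads UTF-8 filename bytes as ISO-8859-1.
    static String misdecodeAsLatin1(String name) {
        return new String(name.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Undoes the mis-decoding: ISO-8859-1 maps chars 0-255 straight back to
    // the original bytes, which are then decoded as UTF-8.
    static String recoverUtf8(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "READ\u2013ME.md"; // en-dash, UTF-8 bytes E2 80 93
        String garbled = misdecodeAsLatin1(original);
        // The en-dash becomes a-circumflex (U+00E2) plus two control characters.
        System.out.println(garbled.length() - original.length()); // 2
        System.out.println(recoverUtf8(garbled).equals(original)); // true
        // Pure ASCII names pass through unchanged.
        System.out.println(recoverUtf8("README.md")); // README.md
    }
}
```

This matches the a-circumflex seen in the curl upload above. One caveat with this kind of workaround: if a client really did send ISO-8859-1 bytes for a non-ASCII name, re-decoding them as UTF-8 would produce replacement characters instead.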
@qqmyers your fix worked with curl for me but not REST Assured. I still get "message": "Failed to add file to dataset." Have you had any luck with this?
Jim and I had a quick call about this and decided that any changes to upload are out of scope. Here's the line we're talking about so we have it handy:
for the addFileToDataset method in Datasets.java.
support download of UTF-8 filenames #7188
In looking at #7184, I realized that the problem had been introduced when the S3 direct download code was updated to correctly encode UTF-8 chars (which caused an error in direct downloads). That code was never applied to the normal download, or to related downloads of metadata files, etc., where the filename is used. I confirmed that filenames such as "1976–2016.txt", with an en-dash (UTF-8 bytes 0xE2 0x80 0x93), are downloaded as "1976 2016.txt" with normal download. "1976–2016.txt" was the example from #4524.
I didn't test any other UTF-8 characters or other endpoints, but I don't see any UTF-8 encoding outside S3AccessIO. I would expect that the same changes to the Content-Disposition header that were made in S3AccessIO would work elsewhere, e.g. replacing instances of
"attachment; filename=\"" + fileName + "\""
with
"attachment; filename*=UTF-8''" + URLEncoder.encode(this.getDataFile().getDisplayName(), "UTF-8").replaceAll("\\+", "%20")
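As a self-contained sketch of the replacement proposed above, the following builds the `filename*=UTF-8''...` form of the header value for a given name. The class and method names are hypothetical, and the method takes the filename as a plain `String` rather than calling `getDataFile().getDisplayName()`.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, for illustration only (not Dataverse code).
final class ContentDispositionSketch {

    // Returns a Content-Disposition value using the extended filename* parameter.
    static String attachmentHeader(String fileName) {
        try {
            // URLEncoder does HTML form encoding, so spaces come out as "+".
            // A literal "+" in the name is already encoded as "%2B", so the
            // replace below only affects encoded spaces.
            String encoded = URLEncoder.encode(fileName, "UTF-8").replace("+", "%20");
            return "attachment; filename*=UTF-8''" + encoded;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // prints: attachment; filename*=UTF-8''1976%E2%80%932016.txt
        System.out.println(attachmentHeader("1976\u20132016.txt"));
        // prints: attachment; filename*=UTF-8''READ%20ME.md
        System.out.println(attachmentHeader("READ ME.md"));
    }
}
```

Note that `URLEncoder` leaves `*` unencoded, and as far as I can tell that character is not in the set RFC 5987 allows unescaped in `filename*` values, so names containing it may need extra handling.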