UTF-8 characters in filenames not handled correctly in normal downloads #7188

Closed
qqmyers opened this issue Aug 11, 2020 · 7 comments · Fixed by #7503

Comments

@qqmyers
Member

qqmyers commented Aug 11, 2020

While looking at #7184, I realized that the problem was introduced when the S3 direct download code was updated to correctly encode UTF-8 characters (which had been causing an error in direct downloads). That code was never applied to normal downloads, or to related downloads of metadata files, etc., where the filename is used. I confirmed that a filename such as "1976–2016.txt", with an en-dash (UTF-8 bytes xE2 x80 x93), is downloaded as "1976 2016.txt" with a normal download. "1976–2016.txt" was the example from #4524.

I didn't test any other UTF-8 characters or other endpoints, but I don't see any UTF-8 encoding outside S3AccessIO. I would expect that the same changes to the Content-Disposition header that were made in S3AccessIO would work elsewhere, e.g., changing
instances of
"attachment; filename=\"" + fileName + "\""
-->
"attachment; filename*=UTF-8''" + URLEncoder.encode(this.getDataFile().getDisplayName(), "UTF-8").replaceAll("\\+", "%20")
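
The proposed header change can be sketched as a small standalone helper. This is only an illustration: `ContentDispositionSketch` and `attachmentHeader` are hypothetical names, not Dataverse code, and the sketch assumes Java 10 or later for the `URLEncoder.encode(String, Charset)` overload.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of Dataverse): builds a Content-Disposition
// value using the filename*=UTF-8'' form so non-ASCII names survive download.
class ContentDispositionSketch {

    static String attachmentHeader(String fileName) {
        // URLEncoder produces form encoding, which writes a space as '+'.
        // A header parameter needs "%20" instead. A literal '+' in the name
        // has already become "%2B", so the replacement is unambiguous.
        String encoded = URLEncoder.encode(fileName, StandardCharsets.UTF_8)
                .replace("+", "%20");
        return "attachment; filename*=UTF-8''" + encoded;
    }

    public static void main(String[] args) {
        // \u2013 is the en-dash from the "1976–2016.txt" example.
        System.out.println(attachmentHeader("1976\u20132016.txt"));
        System.out.println(attachmentHeader("my file.txt"));
    }
}
```

With this sketch the en-dash is sent as `%E2%80%93`, its three UTF-8 bytes, and the receiving browser decodes it back to the original character. One caveat: `URLEncoder` leaves `*` unescaped, so a more general helper might need to percent-encode it separately. That should not matter here, because `*` is among the characters Dataverse rejects in filenames.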

@djbrooke
Contributor

Thanks @qqmyers. It's important that we do what we can to not change filenames on download, regardless of S3 or not. I'll move to Up Next.

@pdurbin
Member

pdurbin commented Dec 18, 2020

I took a quick look at this and wanted to note that files with an en-dash ("–") as in the example above cannot be uploaded via the native API. You get the following error:

{
    "status": "ERROR",
    "message": "Failed to add file to dataset."
}

In server.log you see this:

Constraint violation found in FileMetadata. File Name cannot contain any of the following characters: / : * ? " < > | ; # . The invalid value is "READ?ME.md".]]

(The file I tried was "READ–ME.md")

Obviously, it would be more helpful for the user to see the reason their upload failed rather than having to email support to learn what's in server.log. That said, I know this issue is about which filenames are allowed to come out of Dataverse, but it probably makes sense to also examine which filenames are allowed in.

Here's the code I was testing with:

Path pathtoReadme = Paths.get(Files.createTempDirectory(null) + File.separator + "READ–ME.md");
Files.write(pathtoReadme, "In the beginning...".getBytes());

Response uploadReadme = UtilIT.uploadFileViaNative(datasetId.toString(), pathtoReadme.toString(), apiToken);
uploadReadme.prettyPrint();
uploadReadme.then().assertThat()
    .statusCode(OK.getStatusCode())
    .body("data.files[0].label", equalTo("READ–ME.md"));

@pdurbin pdurbin self-assigned this Jan 5, 2021
@pdurbin pdurbin moved this from Up Next 🛎 to IQSS Team - In Progress 💻 in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Jan 5, 2021
@pdurbin
Member

pdurbin commented Jan 7, 2021

files with an en-dash ("–") as in the example above cannot be uploaded via the native API

Correction. When I attempted to upload my READ–ME.md file previously, I was doing it with REST Assured (the code in my previous comment). I just tried again using curl and the file was uploaded, but the en-dash was renamed to an â (a-circumflex) such that the file has a name that does not copy well into GitHub (READâ��ME.md), so I'll paste a screenshot:

Screen Shot 2021-01-07 at 3 27 20 PM

When I download it, the popup looks like the screenshot below and the file has a space in it like this: READâ ME.md

Screen Shot 2021-01-07 at 3 27 54 PM

All this is to say that files with exotic characters in their filenames can be uploaded via the native API. I'm just not sure how to prevent the characters from being munged.

(On a related note, as @qqmyers mentioned in Slack, the reason I was seeing the error about disallowed characters above is that the en-dash is getting turned into a question mark, which is disallowed.)
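
One way an en-dash can end up as a question mark is a round trip through a charset that cannot represent it. The sketch below shows that mechanism with plain JDK calls. It assumes the client encodes the filename as ISO-8859-1, which has not been confirmed for REST Assured, and `UnmappableCharSketch` is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows how a character that ISO-8859-1 cannot
// represent is silently replaced during encoding.
class UnmappableCharSketch {

    // Encode to ISO-8859-1 and decode again, as a client or library limited
    // to that charset might do (an assumption, not confirmed behavior).
    static String throughLatin1(String name) {
        // String.getBytes(Charset) substitutes the charset's replacement
        // byte, '?' for ISO-8859-1, for every unmappable character.
        byte[] bytes = name.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // \u2013 is the en-dash in "READ–ME.md".
        System.out.println(throughLatin1("READ\u2013ME.md")); // prints READ?ME.md
    }
}
```

The resulting "READ?ME.md" then fails validation because `?` is one of the disallowed filename characters, which matches the constraint violation seen in server.log.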

The other thing I wanted to say is that the way to reliably upload files with fancy characters in their filenames via API is to use SWORD. This is working fine for me in automated testing. Also, more importantly for our users, the GUI works just fine for uploading files with fancy characters.

In terms of scope, I think I'm going to go ahead and make a pull request to fix the download side. For uploading via native API, someone could spend some time trying to figure out how to do this with curl (and document it in the guides), or could try it in pyDataverse, or other Dataverse client libraries. I'm thinking this upload stuff is out of scope, but I'll probably spend a few minutes on it.

@pdurbin
Member

pdurbin commented Jan 8, 2021

I tried and failed to get either curl or REST Assured to upload files with UTF-8 filenames. Here are a couple of the pages I was looking at in case they help someone in the future:

Meanwhile, my pull request for fixing download is #7503.

@pdurbin pdurbin removed their assignment Jan 8, 2021
@qqmyers
Member Author

qqmyers commented Jan 8, 2021

Yeah, there doesn't appear to be a lot of good info out there and things are confused between download, url encoded upload, and multipart upload, and between talking about the filename and file body, etc.

I saw a few hints that the problem may not be in how curl and other clients send the filename but in the Apache mime4j library always using ISO-8859-1 (8859_1) encoding. I tried a one-line fix; see https://github.com/QualitativeDataRepository/dataverse/blob/06c0985052403a27e6c4e43f39450390afbbfd86/src/main/java/edu/harvard/iq/dataverse/api/Datasets.java#L1824-L1830 for the change and comment.

That seems to work (and doesn't break ASCII characters), but I don't know if it's the right thing to do. There may be some flag we haven't found that makes the client change how the filename is sent to the API or tells the server that it is UTF-8, or there may be a fix in Jersey/mime4j that we should pick up rather than working around it. In any case, this is info for the future, and maybe at least a workaround if this is a priority for someone (the code above does result in the correct UTF-8 version of the name being stored in the db in the filemetadata).

(Also FWIW - on Windows I was having trouble reading utf-8 chars in the command console (and Ubuntu on Windows) until I found/turned on the beta utf-8 support. Anyone testing on Windows may need to check into that as well.)

@pdurbin
Member

pdurbin commented Jan 11, 2021

@qqmyers your fix worked with curl for me but not REST Assured. I still get "message": "Failed to add file to dataset." Have you had any luck with this?

@pdurbin
Member

pdurbin commented Jan 11, 2021

Jim and I had a quick call about this and decided that any changes to upload are out of scope.

Here's the line we're talking about so we have it handy:

newFilename = new String(contentDispositionHeader.getFileName().getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);

for the addFileToDataset method in Datasets.java.
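
To show why that line recovers the name, here is a minimal standalone sketch. It assumes the multipart parser decoded the UTF-8 bytes of the filename as ISO-8859-1, which is what the thread suspects of mime4j. `FilenameRedecodeSketch`, `redecode`, and `garble` are hypothetical names used only for this demo.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: reverses a wrong ISO-8859-1 decode of UTF-8 bytes.
class FilenameRedecodeSketch {

    // Same expression as the one-line fix: recover the raw bytes by encoding
    // as ISO-8859-1, then decode those bytes as the UTF-8 they really were.
    static String redecode(String headerFileName) {
        return new String(headerFileName.getBytes(StandardCharsets.ISO_8859_1),
                StandardCharsets.UTF_8);
    }

    // Simulates the suspected bug: UTF-8 bytes decoded as ISO-8859-1.
    static String garble(String original) {
        return new String(original.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "READ\u2013ME.md";      // en-dash is U+2013
        String garbled = garble(original);        // "READ" + U+00E2 U+0080 U+0093 + "ME.md"
        System.out.println(redecode(garbled).equals(original));  // prints true
        System.out.println(redecode("README.md"));               // prints README.md
    }
}
```

The round trip is lossless because ISO-8859-1 maps every byte to exactly one character, so encoding the garbled string returns the original bytes. ASCII names pass through unchanged. The caveat is that a filename genuinely sent as ISO-8859-1 with non-ASCII characters would not be valid UTF-8 and would be damaged by this step.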

kcondon added a commit that referenced this issue Jan 11, 2021
support download of UTF-8 filenames #7188