UTF-8 characters in filenames not handled correctly in normal downloads #7188

qqmyers · 2020-08-11T22:04:37Z

In looking at #7184, I realized that the problem had been introduced when the S3 direct download code was updated to correctly encode UTF-8 chars (which caused an error in direct downloads). That code was never applied to the normal download, or to related downloads of metadata files, etc., where the filename is used. I confirmed that filenames such as "1976–2016.txt", with an en dash (UTF-8 bytes xE2 x80 x93), are downloaded as "1976 2016.txt" with normal download. "1976–2016.txt" was the example from #4524.

I didn't test any other UTF-8 characters, or other endpoints but I don't see any UTF-8 encoding outside S3AccessIO. I would expect that the same changes to the content disposition header that were made in S3AccessIO would work elsewhere, e.g.
instances of
"attachment; filename=\"" + fileName + "\""
-->
"attachment; filename*=UTF-8''" + URLEncoder.encode(this.getDataFile().getDisplayName(), "UTF-8").replaceAll("\\+", "%20")
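A minimal, self-contained sketch of the replacement above, for anyone who wants to try it outside Dataverse. The class name and the `attachmentHeader` helper are hypothetical (they are not Dataverse code); only the `URLEncoder` call and the `filename*=UTF-8''` form come from the snippet above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class ContentDispositionSketch {

    // Hypothetical helper for illustration; not a method in Dataverse.
    static String attachmentHeader(String fileName) {
        try {
            // URLEncoder does form encoding, so a space becomes "+".
            // A literal "+" in the name is already escaped as %2B at this point,
            // so every remaining "+" stands for a space and can be rewritten as %20.
            String encoded = URLEncoder.encode(fileName, "UTF-8").replace("+", "%20");
            return "attachment; filename*=UTF-8''" + encoded;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is required on every JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // \u2013 is the en dash from the example filename in this issue.
        System.out.println(attachmentHeader("1976\u20132016.txt"));
        // prints: attachment; filename*=UTF-8''1976%E2%80%932016.txt
        System.out.println(attachmentHeader("my file.txt"));
        // prints: attachment; filename*=UTF-8''my%20file.txt
    }
}
```

The sketch uses a plain `replace` rather than `replaceAll` with a regex; for a literal "+" the two give the same result.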

djbrooke · 2020-08-14T16:51:58Z

Thanks @qqmyers. It's important that we do what we can to not change filenames on download, regardless of S3 or not. I'll move to Up Next.

pdurbin · 2020-12-18T21:26:40Z

I took a quick look at this and wanted to note that files with an en-dash ("–") as in the example above cannot be uploaded via the native API. You get the following error:

{
    "status": "ERROR",
    "message": "Failed to add file to dataset."
}

In server.log you see this:

Constraint violation found in FileMetadata. File Name cannot contain any of the following characters: / : * ? " < > | ; # . The invalid value is "READ?ME.md".]]

(The file I tried was "READ–ME.md")

Obviously, it would be more helpful for the user to see the reason why their upload failed rather than having to email support to learn what's in server.log. That said, I know this issue is about what filenames are allowed to come out of Dataverse, but it probably makes sense to also examine what filenames are allowed in.

Here's the code I was testing with:

Path pathtoReadme = Paths.get(Files.createTempDirectory(null) + File.separator + "READ–ME.md");
Files.write(pathtoReadme, "In the beginning...".getBytes());

Response uploadReadme = UtilIT.uploadFileViaNative(datasetId.toString(), pathtoReadme.toString(), apiToken);
uploadReadme.prettyPrint();
uploadReadme.then().assertThat()
    .statusCode(OK.getStatusCode())
    .body("data.files[0].label", equalTo("READ–ME.md"));

pdurbin · 2021-01-07T20:47:14Z

files with an en-dash ("–") as in the example above cannot be uploaded via the native API

Correction. When I was attempting to upload my READ–ME.md file previously, I was doing it with REST Assured (the code in my previous comment). I just tried again using curl and the file was uploaded, but the en-dash was renamed to a â (a-circumflex) such that the file has a name that does not copy well into GitHub (READâ��ME.md), so I'll paste a screenshot:

When I download it, the popup looks like the screenshot below and the file has a space in it like this: READâ ME.md

All this is to say that files with exotic characters in their filenames can be uploaded via the native API. I'm just not sure how to prevent the characters from being munged.

(On a related note, as @qqmyers mentioned in Slack, the reason I was seeing the error about disallowed characters above is that the en-dash is getting turned into a question mark, which is disallowed.)
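One way such a substitution can arise in Java, shown as a sketch: encoding a string into a charset that has no en dash silently replaces it with "?". Whether REST Assured takes exactly this path is an assumption; the class and method names below are hypothetical, and ISO-8859-1 is chosen only because it comes up later in this thread.

```java
import java.nio.charset.StandardCharsets;

class QuestionMarkSketch {

    // Hypothetical helper: encode to ISO-8859-1 and decode again.
    // Characters outside ISO-8859-1 are replaced by '?' during encoding.
    static String lossyRoundTrip(String name) {
        byte[] latin1 = name.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // \u2013 is the en dash; it has no ISO-8859-1 representation.
        System.out.println(lossyRoundTrip("READ\u2013ME.md")); // prints: READ?ME.md
    }
}
```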

The other thing I wanted to say is that the way to reliably upload files with fancy characters in their filenames via API is to use SWORD. This is working fine for me in automated testing. Also, more importantly for our users, the GUI works just fine to upload files with fancy characters.

In terms of scope, I think I'm going to go ahead and make a pull request to fix the download side. For uploading via native API, someone could spend some time trying to figure out how to do this with curl (and document it in the guides), or could try it in pyDataverse, or other Dataverse client libraries. I'm thinking this upload stuff is out of scope, but I'll probably spend a few minutes on it.

pdurbin · 2021-01-08T20:53:00Z

I tried and failed to get either curl or REST Assured to upload files with UTF-8 filenames. Here are a couple of the pages I was looking at in case they help someone in the future:

Meanwhile, my pull request for fixing download is #7503.

qqmyers · 2021-01-08T23:56:02Z

Yeah, there doesn't appear to be a lot of good info out there and things are confused between download, url encoded upload, and multipart upload, and between talking about the filename and file body, etc.

I saw a few hints that the problem may not be in how curl/other clients are sending but in the apache mime4j library always using 8859_1 encoding. I tried a one line fix - see https://github.com/QualitativeDataRepository/dataverse/blob/06c0985052403a27e6c4e43f39450390afbbfd86/src/main/java/edu/harvard/iq/dataverse/api/Datasets.java#L1824-L1830 for the change and comment.

That seems to work (and not break ASCII chars), but I don't know if that's the right thing to do, versus there being some flag we haven't found to make the client change how the filename is sent to the API or to tell the server that it is UTF-8, or a fix in Jersey/mime4j that we should get rather than working around it. In any case, it's info for the future and at least maybe a workaround if this is a priority for someone (the code above does result in the correct UTF-8 version of the name being in the db in the filemetadata).

(Also FWIW - on Windows I was having trouble reading utf-8 chars in the command console (and Ubuntu on Windows) until I found/turned on the beta utf-8 support. Anyone testing on Windows may need to check into that as well.)

pdurbin · 2021-01-11T18:05:53Z

@qqmyers your fix worked with curl for me but not REST Assured. I still get "message": "Failed to add file to dataset." Have you had any luck with this?

pdurbin · 2021-01-11T19:24:49Z

Jim and I had a quick call about this and decided that any changes to upload are out of scope.

Here's the line we're talking about so we have it handy:

newFilename = new String(contentDispositionHeader.getFileName().getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);

for the addFileToDataset method in Datasets.java.
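To see what that line does in isolation, here is a small sketch. It assumes, as suggested earlier in the thread, that the multipart parser decoded the filename's UTF-8 bytes as ISO-8859-1; the class and the `redecode` helper are hypothetical stand-ins for the one-liner above.

```java
import java.nio.charset.StandardCharsets;

class FilenameRedecodeSketch {

    // Same conversion as the one-liner above, wrapped in a hypothetical helper:
    // recover the raw bytes via ISO-8859-1, then decode them as UTF-8.
    static String redecode(String misdecoded) {
        return new String(misdecoded.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "READ\u2013ME.md"; // \u2013 is the en dash

        // Simulate the assumed server-side problem: UTF-8 bytes read as ISO-8859-1.
        // The en dash (3 bytes in UTF-8) turns into 3 separate characters.
        String misdecoded = new String(original.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
        System.out.println(misdecoded.length());                   // prints: 12
        System.out.println(redecode(misdecoded).equals(original)); // prints: true

        // Pure ASCII names pass through unchanged.
        System.out.println(redecode("plain-ascii.txt"));           // prints: plain-ascii.txt
    }
}
```

The round trip is only safe when the name really was mis-decoded this way. A name that arrives already correctly decoded and contains non-ASCII characters would be damaged by the conversion, which is one reason to treat it as a workaround.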

djbrooke added this to Up Next 🛎 in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Aug 14, 2020

djbrooke self-assigned this Aug 19, 2020

djbrooke added the Small label Aug 19, 2020

djbrooke removed their assignment Sep 11, 2020

pdurbin self-assigned this Jan 5, 2021

pdurbin moved this from Up Next 🛎 to IQSS Team - In Progress 💻 in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Jan 5, 2021

mheppler mentioned this issue Jan 6, 2021

Citation - Unicode ® + Emojis 🍔 #5020

Closed

pdurbin added a commit that referenced this issue Jan 7, 2021

support download of UTF-8 filenames #7188

166b635

pdurbin mentioned this issue Jan 7, 2021

support download of UTF-8 filenames #7188 #7503

Merged

djbrooke removed this from IQSS Team - In Progress 💻 in IQSS/dataverse (TO BE RETIRED / DELETED in favor of project 34) Jan 7, 2021

pdurbin added a commit that referenced this issue Jan 8, 2021

ensure spaces aren't turned to plus signs #7188

702b505

pdurbin removed their assignment Jan 8, 2021

pdurbin added a commit that referenced this issue Jan 11, 2021

Mention UTF-8 filenames in release note #7188

074ac6a

kcondon closed this as completed in #7503 Jan 11, 2021

kcondon added a commit that referenced this issue Jan 11, 2021

Merge pull request #7503 from IQSS/7188-utf8-filenames

0bcdcfd
