There is one more unexpected space(' ') appear in the attachment name after parsing #809

kenizhou · 2022-06-27T11:44:30Z

Describe the bug
I am using Lotus Notes to extract an eml file , and found the "Content-Disposition" will be wrapped into two lines and there is a TAB('\t') appended in the second line when the attachment name is long.

Please see the sample below:

After parsing by using the mimekit, the TAB '\t' will become a Space (' ')

I am not sure if it is because Lotus Notes did not follow the standard.

Thanks in advance!

jstedfast · 2022-06-27T15:36:09Z

Yea, quoted strings are not technically supposed to be folded.

I'll try to look into this to see if adding a work-around is possible without breaking other things.

kenizhou · 2022-06-28T00:06:57Z

thanks very much!

Fixes issue #809

jstedfast · 2022-06-29T14:15:43Z

Realistically, the original filename probably didn't have a tab -or- a space, the tab was just used when folding the parameter value because each line of a folded header MUST begin with a whitespace character and Lotus probably inserted the tab there to conform to that part of the spec.

I've modified the parameter list parser to not convert tabs to spaces when it unquotes parameter values which is probably the best we can do because it's impossible to know if the sending client inserted a space for folding purposes or if it folded at a space (which is what a client should do in cases when it folds a header).

For example, consider this improved folding of the filename:

Content-Disposition: attachment;
 filename="CR_A-EXCG-2020-0008 - Addition of UAT Email Domain in CMMP-GCN
 Connector.docx"

Unfolding that would produce: CR_A-EXCG-2020-0008 - Addition of UAT Email Domain in CMMP-GCN Connector.docx

...which is most likely the correct original filename.

You might be able to assume that tabs are not realistically part of a filename and so replacing tabs with string.Empty would be the best way to reproduce the original filename, but that logic isn't guaranteed.

kenizhou · 2022-06-30T20:15:15Z

Thanks very much!
Just pull the source down and have a testing, the workaround works fine for me.
but seems the enhancement caused something other issues. there are many unreadable characters appear in the mailbody:

jstedfast · 2022-06-30T21:00:33Z

Unfortunately your screenshot doesn't give me much context. What am I looking at? Are those header values? Or is that the content of a TextPart? Or?

kenizhou · 2022-06-30T21:41:01Z

The content of the textpart.

jstedfast · 2022-06-30T22:01:49Z

Are you using a MimeKit.Text.HtmlToHtml text converter?

kenizhou · 2022-06-30T22:27:01Z

Yes, I am using the "HtmlPreviewVisitor" of your sample source.

jstedfast · 2022-06-30T22:47:09Z

Thanks. I'll look into this bug which I think I probably introduced a few days ago in the converter.

kenizhou · 2022-06-30T22:51:06Z

Noted, thanks!

jstedfast · 2022-07-01T01:24:00Z

In the original message, what are the <meta> tags? And what is the charset specified in the Content-Type header? And how are you saving the file before rendering it with IE11?

jstedfast · 2022-07-01T14:28:04Z

Here's what I think the problem is:

The old HtmlToHtml converter used to ask the HtmlTokenizer to decode character references and then it would encode any character that needed to be encoded (whether it was originally encoded or not) when it wrote it back out. This is part of what caused issue HTML entity for carriage return is being encoded #808 (the other part is that is invalid even though you'd think it would decode to \r, the HTML5 specification says that should throw a parse error).
The new HtmlToHtml logic tells the HtmlTokenizer not to decode character references and passes the raw data through to the output stream unchanged (other than the conversion to unicode) which is arguably a better strategy as it is less likely to alter the meaning of the original HTML if MimeKit has any bugs in the HTML writer.

Why is that a problem? Well, it means that the output of the conversion is no longer guaranteed to be ASCII which means that the charset declared in the <meta> tags is going to be needed by the HTML rendering engine.

Depending on how you pass the output along to the IE11 browser window, this can make a big difference.

In my MessageReader sample, the code gets the converted string and sets it on the browser control via the DocumentText property which means that the browser control knows that the text it is getting is already a string (or, if it's a native control, it'd be converted to UTF-8 or UTF-16) and so no charset conversion would be needed.

If the browser control/window loads the converted HTML via a Stream (e.g. by reading it from a saved output file), then it will end up using the <meta> tags to determine what charset to use to convert it into UTF-8 or UTF-16 (whichever the browser control uses internally), so it is very important that you save the file using the correct System.Text.Encoding and not blindly save the content as UTF-8 (which is the default for File.WriteAllText() and for any TextWriter implementations that you might use).

I'm thinking that the best way to fix this is to continue saving the content to a file using UTF-8 but change the code in the HtmlTagCallback to modify the <meta> tags in order to override any charset values specified in them.

void HtmlTagCallback (HtmlTagContext ctx, HtmlWriter htmlWriter)
{
	if (ctx.TagId == HtmlTagId.Meta && !ctx.IsEndTag) {
		bool isContentType = false;

		ctx.WriteTag (htmlWriter, false);

		// replace charsets with "utf-8" since our output will be in utf-8 (and not whatever the original charset was)
		foreach (var attribute in ctx.Attributes) {
			if (attribute.Id == HtmlAttributeId.Charset) {
				htmlWriter.WriteAttributeName (attribute.Name);
				htmlWriter.WriteAttributeValue ("utf-8");
			} else if (isContentType && attribute.Id == HtmlAttributeId.Content) {
				htmlWriter.WriteAttributeName (attribute.Name);
				htmlWriter.WriteAttributeValue ("text/html; charset=utf-8");
			} else {
				if (attribute.Id == HtmlAttributeId.HttpEquiv && attribute.Value != null
					&& attribute.Value.Equals ("Content-Type", StringComparison.OrdinalIgnoreCase))
					isContentType = true;

				htmlWriter.WriteAttribute (attribute);
			}
		}
	} else if (ctx.TagId == HtmlTagId.Image && !ctx.IsEndTag && stack.Count > 0) {
		ctx.WriteTag (htmlWriter, false);

		// replace the src attribute with a "data:" URL
		foreach (var attribute in ctx.Attributes) {
			if (attribute.Id == HtmlAttributeId.Src) {
				if (!TryGetImage (attribute.Value, out var image)) {
					htmlWriter.WriteAttribute (attribute);
					continue;
				}

				var dataUri = GetDataUri (image);

				htmlWriter.WriteAttributeName (attribute.Name);
				htmlWriter.WriteAttributeValue (dataUri);
			} else {
				htmlWriter.WriteAttribute (attribute);
			}
		}
	} else if (ctx.TagId == HtmlTagId.Body && !ctx.IsEndTag) {
		ctx.WriteTag (htmlWriter, false);

		// add and/or replace oncontextmenu="return false;"
		foreach (var attribute in ctx.Attributes) {
			if (attribute.Name.Equals ("oncontextmenu", StringComparison.OrdinalIgnoreCase))
				continue;

			htmlWriter.WriteAttribute (attribute);
		}

		htmlWriter.WriteAttribute ("oncontextmenu", "return false;");
	} else {
		// pass the tag through to the output
		ctx.WriteTag (htmlWriter, true);
	}
}

jstedfast · 2022-08-20T12:15:14Z

MimeKit v3.4.0 has been released with this fix.

jstedfast added the compatibility Compatibility with existing software label Jun 27, 2022

jstedfast added a commit that referenced this issue Jun 29, 2022

When unquoting parameter values, don't convert tabs to spaces

e3a7102

Fixes issue #809

jstedfast closed this as completed Jun 29, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

There is one more unexpected space(' ') appear in the attachment name after parsing #809

There is one more unexpected space(' ') appear in the attachment name after parsing #809

kenizhou commented Jun 27, 2022

jstedfast commented Jun 27, 2022

kenizhou commented Jun 28, 2022

jstedfast commented Jun 29, 2022

kenizhou commented Jun 30, 2022

jstedfast commented Jun 30, 2022

kenizhou commented Jun 30, 2022

jstedfast commented Jun 30, 2022

kenizhou commented Jun 30, 2022

jstedfast commented Jun 30, 2022

kenizhou commented Jun 30, 2022

jstedfast commented Jul 1, 2022 •

edited

Loading

jstedfast commented Jul 1, 2022

jstedfast commented Aug 20, 2022

There is one more unexpected space(' ') appear in the attachment name after parsing #809

There is one more unexpected space(' ') appear in the attachment name after parsing #809

Comments

kenizhou commented Jun 27, 2022

jstedfast commented Jun 27, 2022

kenizhou commented Jun 28, 2022

jstedfast commented Jun 29, 2022

kenizhou commented Jun 30, 2022

jstedfast commented Jun 30, 2022

kenizhou commented Jun 30, 2022

jstedfast commented Jun 30, 2022

kenizhou commented Jun 30, 2022

jstedfast commented Jun 30, 2022

kenizhou commented Jun 30, 2022

jstedfast commented Jul 1, 2022 • edited Loading

jstedfast commented Jul 1, 2022

jstedfast commented Aug 20, 2022

jstedfast commented Jul 1, 2022 •

edited

Loading