Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

There is one more unexpected space(' ') appear in the attachment name after parsing #809

Closed
kenizhou opened this issue Jun 27, 2022 · 13 comments
Labels
compatibility Compatibility with existing software

Comments

@kenizhou
Copy link

Describe the bug
I am using Lotus Notes to extract an eml file , and found the "Content-Disposition" will be wrapped into two lines and there is a TAB('\t') appended in the second line when the attachment name is long.

Please see the sample below:
image

After parsing by using the mimekit, the TAB '\t' will become a Space (' ')

image

I am not sure if it is because Lotus Notes did not follow the standard.

Thanks in advance!

@jstedfast
Copy link
Owner

Yea, quoted strings are not technically supposed to be folded.

I'll try to look into this to see if adding a work-around is possible without breaking other things.

@jstedfast jstedfast added the compatibility Compatibility with existing software label Jun 27, 2022
@kenizhou
Copy link
Author

thanks very much!

@jstedfast
Copy link
Owner

Realistically, the original filename probably didn't have a tab -or- a space, the tab was just used when folding the parameter value because each line of a folded header MUST begin with a whitespace character and Lotus probably inserted the tab there to conform to that part of the spec.

I've modified the parameter list parser to not convert tabs to spaces when it unquotes parameter values which is probably the best we can do because it's impossible to know if the sending client inserted a space for folding purposes or if it folded at a space (which is what a client should do in cases when it folds a header).

For example, consider this improved folding of the filename:

Content-Disposition: attachment;
 filename="CR_A-EXCG-2020-0008 - Addition of UAT Email Domain in CMMP-GCN
 Connector.docx"

Unfolding that would produce: CR_A-EXCG-2020-0008 - Addition of UAT Email Domain in CMMP-GCN Connector.docx

...which is most likely the correct original filename.

You might be able to assume that tabs are not realistically part of a filename and so replacing tabs with string.Empty would be the best way to reproduce the original filename, but that logic isn't guaranteed.

@kenizhou
Copy link
Author

Thanks very much!
Just pull the source down and have a testing, the workaround works fine for me.
but seems the enhancement caused something other issues. there are many unreadable characters appear in the mailbody:

image

@jstedfast
Copy link
Owner

Unfortunately your screenshot doesn't give me much context. What am I looking at? Are those header values? Or is that the content of a TextPart? Or?

@kenizhou
Copy link
Author

The content of the textpart.
image
image
image

@jstedfast
Copy link
Owner

Are you using a MimeKit.Text.HtmlToHtml text converter?

@kenizhou
Copy link
Author

Yes, I am using the "HtmlPreviewVisitor" of your sample source.

@jstedfast
Copy link
Owner

Thanks. I'll look into this bug which I think I probably introduced a few days ago in the converter.

@kenizhou
Copy link
Author

Noted, thanks!

@jstedfast
Copy link
Owner

jstedfast commented Jul 1, 2022

In the original message, what are the <meta> tags? And what is the charset specified in the Content-Type header? And how are you saving the file before rendering it with IE11?

@jstedfast
Copy link
Owner

Here's what I think the problem is:

  • The old HtmlToHtml converter used to ask the HtmlTokenizer to decode character references and then it would encode any character that needed to be encoded (whether it was originally encoded or not) when it wrote it back out. This is part of what caused issue HTML entity for carriage return is being encoded #808 (the other part is that is invalid even though you'd think it would decode to \r, the HTML5 specification says that should throw a parse error).
  • The new HtmlToHtml logic tells the HtmlTokenizer not to decode character references and passes the raw data through to the output stream unchanged (other than the conversion to unicode) which is arguably a better strategy as it is less likely to alter the meaning of the original HTML if MimeKit has any bugs in the HTML writer.

Why is that a problem? Well, it means that the output of the conversion is no longer guaranteed to be ASCII which means that the charset declared in the <meta> tags is going to be needed by the HTML rendering engine.

Depending on how you pass the output along to the IE11 browser window, this can make a big difference.

In my MessageReader sample, the code gets the converted string and sets it on the browser control via the DocumentText property which means that the browser control knows that the text it is getting is already a string (or, if it's a native control, it'd be converted to UTF-8 or UTF-16) and so no charset conversion would be needed.

If the browser control/window loads the converted HTML via a Stream (e.g. by reading it from a saved output file), then it will end up using the <meta> tags to determine what charset to use to convert it into UTF-8 or UTF-16 (whichever the browser control uses internally), so it is very important that you save the file using the correct System.Text.Encoding and not blindly save the content as UTF-8 (which is the default for File.WriteAllText() and for any TextWriter implementations that you might use).

I'm thinking that the best way to fix this is to continue saving the content to a file using UTF-8 but change the code in the HtmlTagCallback to modify the <meta> tags in order to override any charset values specified in them.

void HtmlTagCallback (HtmlTagContext ctx, HtmlWriter htmlWriter)
{
	if (ctx.TagId == HtmlTagId.Meta && !ctx.IsEndTag) {
		bool isContentType = false;

		ctx.WriteTag (htmlWriter, false);

		// replace charsets with "utf-8" since our output will be in utf-8 (and not whatever the original charset was)
		foreach (var attribute in ctx.Attributes) {
			if (attribute.Id == HtmlAttributeId.Charset) {
				htmlWriter.WriteAttributeName (attribute.Name);
				htmlWriter.WriteAttributeValue ("utf-8");
			} else if (isContentType && attribute.Id == HtmlAttributeId.Content) {
				htmlWriter.WriteAttributeName (attribute.Name);
				htmlWriter.WriteAttributeValue ("text/html; charset=utf-8");
			} else {
				if (attribute.Id == HtmlAttributeId.HttpEquiv && attribute.Value != null
					&& attribute.Value.Equals ("Content-Type", StringComparison.OrdinalIgnoreCase))
					isContentType = true;

				htmlWriter.WriteAttribute (attribute);
			}
		}
	} else if (ctx.TagId == HtmlTagId.Image && !ctx.IsEndTag && stack.Count > 0) {
		ctx.WriteTag (htmlWriter, false);

		// replace the src attribute with a "data:" URL
		foreach (var attribute in ctx.Attributes) {
			if (attribute.Id == HtmlAttributeId.Src) {
				if (!TryGetImage (attribute.Value, out var image)) {
					htmlWriter.WriteAttribute (attribute);
					continue;
				}

				var dataUri = GetDataUri (image);

				htmlWriter.WriteAttributeName (attribute.Name);
				htmlWriter.WriteAttributeValue (dataUri);
			} else {
				htmlWriter.WriteAttribute (attribute);
			}
		}
	} else if (ctx.TagId == HtmlTagId.Body && !ctx.IsEndTag) {
		ctx.WriteTag (htmlWriter, false);

		// add and/or replace oncontextmenu="return false;"
		foreach (var attribute in ctx.Attributes) {
			if (attribute.Name.Equals ("oncontextmenu", StringComparison.OrdinalIgnoreCase))
				continue;

			htmlWriter.WriteAttribute (attribute);
		}

		htmlWriter.WriteAttribute ("oncontextmenu", "return false;");
	} else {
		// pass the tag through to the output
		ctx.WriteTag (htmlWriter, true);
	}
}

@jstedfast
Copy link
Owner

MimeKit v3.4.0 has been released with this fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compatibility Compatibility with existing software
Projects
None yet
Development

No branches or pull requests

2 participants