Bad content-types #3
Comments
Actually now I've run across another unparseable field in some emails, namely:
(It's invalid because it's not in quotes and contains a colon, which isn't allowed in a 'token' under RFC 2045.) What should the general philosophy be for handling unparseable fields? At least with content-type (in the original post) there's a standard to fall back on (text/plain), but not so obviously with locations. Should the library discard crap, or try to guess where the quotes should go?
Ok... so the overriding philosophy is "don't lose any information", and the other one is "don't nuke user info with something generated". In this case, we could implement a "quote_if_needed" method and call it on the content-location value in the initialization method of field/content_location.rb. Maybe put the method inside lib/utilities.rb; you can look at the existing ActionMailer for how to implement this. Then call it on the passed-in value of content-location, as content-location is only ever going to be a single value. And if we have that in utilities, we could then possibly pre-process any other param fields in content-disposition or content-type the same way, again before treetop gets to them.
Mikel
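A rough sketch of what such a "quote_if_needed" helper might look like, assuming it only needs to wrap a single value in double quotes when that value contains characters a RFC 2045 token can't hold. The constant name and the exact escaping rules here are assumptions for illustration, not the library's actual implementation:

```ruby
# Characters that may not appear in an unquoted RFC 2045 token:
# control characters, space, DEL, and the "tspecials".
# (Hypothetical constant name, for illustration only.)
TOKEN_UNSAFE = /[\x00-\x20\x7f()<>@,;:\\"\/\[\]?=]/

# Returns the value unchanged when it is already a valid token or is
# already wrapped in double quotes; otherwise wraps it in double quotes,
# escaping any embedded backslashes and quotes so nothing is lost.
def quote_if_needed(value)
  str = value.to_s.strip
  return str if str.empty?
  return str if str.length >= 2 && str.start_with?('"') && str.end_with?('"')
  return str unless str =~ TOKEN_UNSAFE

  '"' + str.gsub(/[\\"]/) { |c| "\\#{c}" } + '"'
end

puts quote_if_needed('http://example.com/image.png')
puts quote_if_needed('plain-token')
```

The original value survives intact inside the quotes, which fits the "don't lose any information" rule, and a value the user already quoted is left alone.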
What is the proper way to handle bad content types? I've read about as many RFCs today as I can stand, and I missed the content-type fallback recommendations if they're in there.
Of the 400 MB or so of email I've attempted to parse today (largely the Enron mail corpus), the vast majority of parse errors were on the content-type header.
Mostly things like
(Instead of text/plain)
Or like:
(Missing the ';' delimiter before the value hash)
I committed a fix to my fork[1] that sets content-type to 'text/plain' on parser errors, but that doesn't feel quite right. Should it just ignore that field in the header altogether?
Thanks-
Jim
[1] http://github.com/jlindley/mail/commit/2fd51a8d757bbec2a7ef553b6bc52486b45539ab
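For reference, a minimal standalone sketch of the fallback idea described above: treat an unparseable content-type as text/plain, but keep the raw header value around so nothing is thrown away. As far as I can tell, RFC 2045 section 5.2 recommends assuming the text/plain default when a Content-Type field is syntactically invalid. The method name, the returned hash layout, and the sample inputs are all hypothetical and are not taken from the mail library or from the commit linked above:

```ruby
# An RFC 2045 token: one or more characters that are not controls,
# space, DEL, or tspecials. (Hypothetical constant names.)
TOKEN = /[^\x00-\x20\x7f()<>@,;:\\"\/\[\]?=]+/
CONTENT_TYPE = /\A\s*(#{TOKEN})\/(#{TOKEN})\s*(?:;(.*))?\z/m

# Parses "type/subtype; params". On failure it falls back to text/plain
# and flags the result, while always preserving the raw value.
def lenient_content_type(raw)
  if (m = CONTENT_TYPE.match(raw.to_s))
    { type: m[1].downcase, subtype: m[2].downcase,
      params: m[3].to_s.strip, raw: raw, fallback: false }
  else
    { type: 'text', subtype: 'plain', params: '', raw: raw, fallback: true }
  end
end

# Made-up inputs: one valid, one with no subtype, one missing the ';'.
['text/html; charset=utf-8', 'text', 'text/plain charset=us-ascii'].each do |value|
  p lenient_content_type(value)
end
```

Keeping the raw string next to the fallback means a caller can still see what the sender actually wrote, rather than having the field silently replaced or dropped.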