Properly parse FROM email headers with colon, comma characters #13

Closed
mccolin opened this issue Dec 31, 2014 · 3 comments · May be fixed by #64

Comments

@mccolin

mccolin commented Dec 31, 2014

We've encountered a few email address headers that are not properly parsed, likely because they contain comma and colon characters. These return null when parsed via parseOneAddress.

Test code:

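var Addrs = require('email-addresses');
var inspect = require('util').inspect; // needed for inspect() below

// Parse a single From header and print the input next to the parsed result.
function testParse(fromHeader) {
  console.log("Input: " + fromHeader);
  var parsed = Addrs.parseOneAddress(fromHeader);
  console.log("Output: %s", inspect(parsed));
}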

Here are some sample inputs, each of which fails with null output:

Input: iZotope, Inc. <izotope@izotope.com>
Output: null
Input: Goodhertz, Inc. <support@goodhertz.com>
Output: null
Input: HowlRound: A Center for the Theater Commons <webmaster@howlround.com>
Output: null

Removing the colon and comma characters generates successful output:

Input: iZotope, Inc. <izotope@izotope.com>
Output: null
Input: iZotope Inc. <izotope@izotope.com>
Output: { parts: 
   { name: 
      { name: 'display-name',
        tokens: 'iZotope Inc. ',
        semantic: 'iZotope Inc.',
        children: [Object] },
     address: 
      { name: 'addr-spec',
        tokens: 'izotope@izotope.com',
        semantic: 'izotope@izotope.com',
        children: [Object] },
     local: 
      { name: 'local-part',
        tokens: 'izotope',
        semantic: 'izotope',
        children: [Object] },
     domain: 
      { name: 'domain',
        tokens: 'izotope.com',
        semantic: 'izotope.com',
        children: [Object] } },
  name: 'iZotope Inc.',
  address: 'izotope@izotope.com',
  local: 'izotope',
  domain: 'izotope.com' }
Input: Goodhertz, Inc. <support@goodhertz.com>
Output: null
Input: Goodhertz Inc <support@goodhertz.com>
Output: { parts: 
   { name: 
      { name: 'display-name',
        tokens: 'Goodhertz Inc ',
        semantic: 'Goodhertz Inc',
        children: [Object] },
     address: 
      { name: 'addr-spec',
        tokens: 'support@goodhertz.com',
        semantic: 'support@goodhertz.com',
        children: [Object] },
     local: 
      { name: 'local-part',
        tokens: 'support',
        semantic: 'support',
        children: [Object] },
     domain: 
      { name: 'domain',
        tokens: 'goodhertz.com',
        semantic: 'goodhertz.com',
        children: [Object] } },
  name: 'Goodhertz Inc',
  address: 'support@goodhertz.com',
  local: 'support',
  domain: 'goodhertz.com' }
Input: HowlRound: A Center for the Theater Commons <webmaster@howlround.com>
Output: null
Input: HowlRound A Center for the Theater Commons <webmaster@howlround.com>
Output: { parts: 
   { name: 
      { name: 'display-name',
        tokens: 'HowlRound A Center for the Theater Commons ',
        semantic: 'HowlRound A Center for the Theater Commons',
        children: [Object] },
     address: 
      { name: 'addr-spec',
        tokens: 'webmaster@howlround.com',
        semantic: 'webmaster@howlround.com',
        children: [Object] },
     local: 
      { name: 'local-part',
        tokens: 'webmaster',
        semantic: 'webmaster',
        children: [Object] },
     domain: 
      { name: 'domain',
        tokens: 'howlround.com',
        semantic: 'howlround.com',
        children: [Object] } },
  name: 'HowlRound A Center for the Theater Commons',
  address: 'webmaster@howlround.com',
  local: 'webmaster',
  domain: 'howlround.com' }

Are these not parsing properly because they don't meet the RFC 5322 standard?

It'd be great to support these and similar headers, particularly in calls to parseOneAddress, where delimiter checking shouldn't be necessary for a single address. Headers like these show up often enough in the email we see in practice that extending support may make sense.

@hlian

hlian commented May 25, 2015

> Are these not parsing properly because they don't meet the RFC 5322 standard?

I believe this is correct. The name before angle brackets is produced by an atom rule, and the comma cannot appear in an atom.

   atom            =   [CFWS] 1*atext [CFWS]
   specials        =   "(" / ")" /        ; Special characters that do
                       "<" / ">" /        ;  not appear in atext
                       "[" / "]" /
                       ":" / ";" /
                       "@" / "\" /
                       "," / "." /
                       DQUOTE
      Note: The "specials" token does not appear anywhere else in this
      specification.  It is simply the visible (i.e., non-control, non-
      white space) characters that do not appear in atext.  It is
      provided only because it is useful for implementers who use tools
      that lexically analyze messages.  Each of the characters in
      specials can be used to indicate a tokenization point in lexical
      analysis.

That said, if you knew for certain that you were parsing a single address (where a comma should not appear as a tokenization point), you could escape the comma with a unique string and then unescape it in the resulting output without too much difficulty.
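A minimal sketch of that escape/unescape idea, assuming the header holds exactly one address and that everything before the last '<' is the display name. The function names and the `__COMMA__` / `__COLON__` placeholders are made up for illustration; they use only letters and underscores, which are atext characters, so they should survive the atom rule.

```javascript
// Hypothetical placeholders; they must not occur naturally in the input.
var PLACEHOLDERS = { ',': '__COMMA__', ':': '__COLON__' };

// Replace commas and colons in the display-name portion (the text before
// the last '<') with placeholders. Headers without '<' are left untouched.
function escapeDisplayName(header) {
  var lt = header.lastIndexOf('<');
  if (lt === -1) return header;
  var name = header.slice(0, lt).replace(/[,:]/g, function (ch) {
    return PLACEHOLDERS[ch];
  });
  return name + header.slice(lt);
}

// Restore the original characters in a parsed display name.
function unescapeDisplayName(text) {
  return text.split('__COMMA__').join(',').split('__COLON__').join(':');
}

// Intended use with the library (not executed here):
//   var parsed = Addrs.parseOneAddress(escapeDisplayName(header));
//   var name = parsed && unescapeDisplayName(parsed.name);
```

The obvious weakness is that a display name which already contains one of the placeholder strings would be altered on the way back, so the placeholders need to be chosen with the data in mind.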

@jackbearheart
Owner

hlian is right, but I'd also say "or they have to be quoted". The "display name" portion of an email address can be made up of words, which can only contain commas if they are quoted strings. So iZotope, Inc. <izotope@izotope.com> doesn't parse but "iZotope, Inc." <izotope@izotope.com> does.

It may be the assumption of the data (against the RFC) that since this is a single email address, it can drop the quotes. There are many imaginable hacks to get around that. For instance, if the header doesn't contain a quote but does contain an angle bracket, prepend a quote and replace '<' with '"<'. Unfortunate.
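A rough sketch of that quoting hack, with the assumptions spelled out: the header holds a single address, it contains no double quote yet, and everything before the last '<' is the display name. `quoteDisplayName` is a hypothetical helper, not part of the library. Unlike the literal '"<' replacement, it trims the name so the trailing space ends up outside the quotes.

```javascript
// Wrap an unquoted display name in double quotes so that commas and colons
// fall inside a quoted-string. Headers that already contain a quote, or that
// have no display name before '<', are returned unchanged.
function quoteDisplayName(header) {
  var lt = header.lastIndexOf('<');
  if (lt <= 0 || header.indexOf('"') !== -1) return header;
  var name = header.slice(0, lt).trim();
  if (name === '') return header;
  // Backslashes must be escaped inside a quoted-string.
  return '"' + name.replace(/\\/g, '\\\\') + '" ' + header.slice(lt);
}

// Intended use with the library (not executed here):
//   var parsed = Addrs.parseOneAddress(quoteDisplayName(header));
```

For the first failing input this yields "iZotope, Inc." <izotope@izotope.com>, which is the quoted form mentioned above as parsing correctly.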

@mccolin
Author

mccolin commented May 26, 2015

Thanks for the responses. An app using the library can definitely take some of the scrubbing/cleanup actions you're describing before parsing, but it'd be great if the lib had something like a standardize(address) function that could either be called before attempting to parse or run automatically as part of the parse.

There's a certain elegance to this library in that I can pass reasonably sensible headers to it and it just works, but having to precondition all of the data makes that feel a little kludgy.
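One way the "automatically included" variant could look from the application side, as a sketch only: try the strict parse first and retry with a cleaned-up header only when it returns null. `makeLenientParser` is a hypothetical wrapper, and both the parser and the standardize step are passed in, so nothing here is the library's own API.

```javascript
// Build a parser that falls back to a standardized header when the strict
// parse fails. parseStrict is expected to return null on failure, as
// parseOneAddress does in the examples above.
function makeLenientParser(parseStrict, standardize) {
  return function parseLenient(header) {
    var parsed = parseStrict(header);
    if (parsed !== null) return parsed;
    // Only touch the input when the RFC-conformant parse has failed.
    return parseStrict(standardize(header));
  };
}

// Intended use with the library (not executed here; myStandardize is
// whatever cleanup function the application supplies):
//   var parseOne = makeLenientParser(Addrs.parseOneAddress, myStandardize);
```

Running the strict parse first keeps well-formed headers on the unmodified path, so the cleanup can only affect inputs that would otherwise have returned null.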

This is all nit-picking, though... thanks for the responses! :-)
