Properly parse FROM email headers with colon, comma characters #13

mccolin · 2014-12-31T17:38:53Z

We've encountered a few email address headers that are not properly parsed, likely because they contain comma and colon characters. These return null when parsed via parseOneAddress.

Test code:

var Addrs = require('email-addresses');
var inspect = require('util').inspect;

// Parse a single From header value and print the result.
function testParse(fromHeader) {
  console.log("Input: " + fromHeader);
  var parsed = Addrs.parseOneAddress(fromHeader);
  console.log("Output: %s", inspect(parsed));
}

Here are some sample inputs and the null output they produce:

Input: iZotope, Inc. <izotope@izotope.com>
Output: null
Input: Goodhertz, Inc. <support@goodhertz.com>
Output: null
Input: HowlRound: A Center for the Theater Commons <webmaster@howlround.com>
Output: null

Removing the colon and comma characters generates successful output:

Input: iZotope, Inc. <izotope@izotope.com>
Output: null
Input: iZotope Inc. <izotope@izotope.com>
Output: { parts: 
   { name: 
      { name: 'display-name',
        tokens: 'iZotope Inc. ',
        semantic: 'iZotope Inc.',
        children: [Object] },
     address: 
      { name: 'addr-spec',
        tokens: 'izotope@izotope.com',
        semantic: 'izotope@izotope.com',
        children: [Object] },
     local: 
      { name: 'local-part',
        tokens: 'izotope',
        semantic: 'izotope',
        children: [Object] },
     domain: 
      { name: 'domain',
        tokens: 'izotope.com',
        semantic: 'izotope.com',
        children: [Object] } },
  name: 'iZotope Inc.',
  address: 'izotope@izotope.com',
  local: 'izotope',
  domain: 'izotope.com' }
Input: Goodhertz, Inc. <support@goodhertz.com>
Output: null
Input: Goodhertz Inc <support@goodhertz.com>
Output: { parts: 
   { name: 
      { name: 'display-name',
        tokens: 'Goodhertz Inc ',
        semantic: 'Goodhertz Inc',
        children: [Object] },
     address: 
      { name: 'addr-spec',
        tokens: 'support@goodhertz.com',
        semantic: 'support@goodhertz.com',
        children: [Object] },
     local: 
      { name: 'local-part',
        tokens: 'support',
        semantic: 'support',
        children: [Object] },
     domain: 
      { name: 'domain',
        tokens: 'goodhertz.com',
        semantic: 'goodhertz.com',
        children: [Object] } },
  name: 'Goodhertz Inc',
  address: 'support@goodhertz.com',
  local: 'support',
  domain: 'goodhertz.com' }
Input: HowlRound: A Center for the Theater Commons <webmaster@howlround.com>
Output: null
Input: HowlRound A Center for the Theater Commons <webmaster@howlround.com>
Output: { parts: 
   { name: 
      { name: 'display-name',
        tokens: 'HowlRound A Center for the Theater Commons ',
        semantic: 'HowlRound A Center for the Theater Commons',
        children: [Object] },
     address: 
      { name: 'addr-spec',
        tokens: 'webmaster@howlround.com',
        semantic: 'webmaster@howlround.com',
        children: [Object] },
     local: 
      { name: 'local-part',
        tokens: 'webmaster',
        semantic: 'webmaster',
        children: [Object] },
     domain: 
      { name: 'domain',
        tokens: 'howlround.com',
        semantic: 'howlround.com',
        children: [Object] } },
  name: 'HowlRound A Center for the Theater Commons',
  address: 'webmaster@howlround.com',
  local: 'webmaster',
  domain: 'howlround.com' }

Are these not parsing properly because they don't meet the RFC 5322 standard?

It'd be great to support these and similar headers, particularly in calls to parseOneAddress, since delimiter checking shouldn't be necessary for a single address. Headers like these appear often enough in the email we see in practice that extending support may make sense.


hlian · 2015-05-25T04:37:10Z

> Are these not parsing properly because they don't meet the RFC 5322 standard?

I believe this is correct. The name before angle brackets is produced by an atom rule, and the comma cannot appear in an atom.

   atom            =   [CFWS] 1*atext [CFWS]
   specials        =   "(" / ")" /        ; Special characters that do
                       "<" / ">" /        ;  not appear in atext
                       "[" / "]" /
                       ":" / ";" /
                       "@" / "\" /
                       "," / "." /
                       DQUOTE

      Note: The "specials" token does not appear anywhere else in this
      specification.  It is simply the visible (i.e., non-control, non-
      white space) characters that do not appear in atext.  It is
      provided only because it is useful for implementers who use tools
      that lexically analyze messages.  Each of the characters in
      specials can be used to indicate a tokenization point in lexical
      analysis.
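
To make the rule concrete, here is a minimal sketch that lists which characters from the "specials" set show up in the unquoted display name of a header. The helper name findSpecialsInDisplayName is hypothetical and not part of email-addresses. It simply looks at everything before the last "<" and does not account for quoted strings or comments.

```javascript
// Characters listed under "specials" in the RFC 5322 grammar quoted above.
var SPECIALS = '()<>[]:;@\\,."';

// Hypothetical helper: return the distinct specials found in the text that
// precedes the last "<" of a header value.
function findSpecialsInDisplayName(header) {
  var open = header.lastIndexOf('<');
  if (open <= 0) {
    return []; // no angle-bracket address, or no display name before it
  }
  var displayName = header.slice(0, open);
  var found = [];
  for (var i = 0; i < displayName.length; i++) {
    var ch = displayName.charAt(i);
    if (SPECIALS.indexOf(ch) !== -1 && found.indexOf(ch) === -1) {
      found.push(ch);
    }
  }
  return found;
}

console.log(findSpecialsInDisplayName('iZotope, Inc. <izotope@izotope.com>'));
// [ ',', '.' ]
console.log(findSpecialsInDisplayName('HowlRound: A Center for the Theater Commons <webmaster@howlround.com>'));
// [ ':' ]
```

Note that the period is also a special, yet the output earlier in this thread shows "iZotope Inc." parsing successfully. That is presumably because the RFC's obsolete phrase syntax tolerates periods, so the comma and colon are the characters that matter for the failing samples.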

That said, if you knew for certain that you were parsing a single address (where a comma should not appear as a tokenization point), you could escape the comma with a unique string and then unescape it in the resulting output without too much difficulty.
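
A rough sketch of that escape-and-unescape idea might look like the following. Everything here is an assumption for illustration: the placeholder strings, the helper names, and the choice to treat everything before the last "<" as the display name. The parser is passed in as an argument (for example require('email-addresses').parseOneAddress) so the sketch does not depend on the library directly.

```javascript
// Placeholders built only from letters and underscores, which are valid
// atom characters. The exact strings are arbitrary and must not occur in
// real input.
var PLACEHOLDERS = [
  { ch: ',', token: '__COMMA__' },
  { ch: ':', token: '__COLON__' }
];

function replaceAll(text, from, to) {
  return text.split(from).join(to);
}

// Replace commas and colons in the display name (the text before the
// last "<") with their placeholders.
function escapeDisplayName(header) {
  var open = header.lastIndexOf('<');
  if (open <= 0) {
    return header; // nothing to escape
  }
  var name = header.slice(0, open);
  PLACEHOLDERS.forEach(function (p) {
    name = replaceAll(name, p.ch, p.token);
  });
  return name + header.slice(open);
}

// Turn placeholders back into the original characters.
function restoreSpecials(text) {
  PLACEHOLDERS.forEach(function (p) {
    text = replaceAll(text, p.token, p.ch);
  });
  return text;
}

// parseOne is any function that takes a header string and returns an
// object with a string "name" property, or null.
function parseSingleAddress(header, parseOne) {
  var parsed = parseOne(escapeDisplayName(header));
  if (parsed && typeof parsed.name === 'string') {
    parsed.name = restoreSpecials(parsed.name);
  }
  return parsed;
}
```

This only restores the top-level name field. The nested parts.name entries shown in the output above would still carry the placeholders unless they are restored as well.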

jackbearheart · 2015-05-26T13:22:08Z

hlian is right, but I'd also say "or they have to be quoted". The "display name" portion of an email address can be made up of words, which can only contain commas if they are quoted strings. So iZotope, Inc. <izotope@izotope.com> doesn't parse but "iZotope, Inc." <izotope@izotope.com> does.

The data may be assuming (against the RFC) that a single email address can drop the quotes. There are many imaginable hacks to get around that. For instance, if the header doesn't have a quote in it but does have an angle bracket, start it with a quote and replace '<' with '"<'. Unfortunate.
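
One way to sketch that quoting hack is shown below. The function name quoteDisplayName is hypothetical, and this variant differs slightly from the literal recipe: it trims the display name before quoting so the trailing space does not end up inside the quotes, and it escapes backslashes, which need escaping inside a quoted string.

```javascript
// Hypothetical pre-processing step: wrap an unquoted display name in
// double quotes so commas and colons become legal.
function quoteDisplayName(header) {
  var open = header.indexOf('<');
  // Leave the header alone if it already contains a quote, has no angle
  // bracket, or has nothing before the angle bracket.
  if (header.indexOf('"') !== -1 || open <= 0) {
    return header;
  }
  var name = header.slice(0, open).trim();
  if (name === '') {
    return header;
  }
  // Double any backslashes so they stay literal inside the quoted string.
  var escaped = name.replace(/\\/g, '\\\\');
  return '"' + escaped + '" ' + header.slice(open);
}

console.log(quoteDisplayName('iZotope, Inc. <izotope@izotope.com>'));
// "iZotope, Inc." <izotope@izotope.com>
```

The result could then be handed to parseOneAddress. Headers that already contain a quote are returned untouched, so partially quoted names would still need separate handling.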

mccolin · 2015-05-26T16:27:13Z

Thanks for the responses. An app using the library can definitely take some of the scrubbing/cleanup actions you're describing before parsing, but it'd be great if the lib had something like a standardize(address) function that could either be called before attempting to parse or be applied automatically as part of the parse.

There's a certain elegance to this library that I can pass reasonably sensible headers to it and it just works, but having to precondition all of the data makes that feel a little kludgy.

This is all nit-picking, though... thanks for the responses! :-)

jackbearheart closed this as completed Feb 17, 2017

paviad mentioned this issue Aug 16, 2022

Allow square brackets and/or colon in display name with option #64

Open
