Invalid XML in generated reports #190

Open
wolfgangkarall opened this issue Mar 17, 2021 · 3 comments

@wolfgangkarall

Describe the bug
The user-configured org_name (at least) is inserted as-is into both the XML and the mail message body, but people tend to enter characters that are not suitable as-is in either.

Examples:

Message Body:

Submitted by Sueño Fueguino
Generated with Mail::DMARC 1.20141206

Corresponding XML:

<org_name>Sue�o Fueguino</org_name>

The same happens with a more recent version (this time the message body already shows signs of breakage, too):

Submitted by Gwt7 IIA - Ingeniería e Informática Asociada
Generated with Mail::DMARC 1.20180125

and XML:

<org_name>Gwt7 IIA - Ingenier�a e Inform�tica Asociada</org_name>

When trying to view this report in Firefox it complains:

XML Parsing Error: not well-formed
Location: file:///home/user/.cache/.fr-wGu1gl/report.xml
Line Number 5, Column 32:
		<org_name>Gwt7 IIA - Ingenier�
---------------------------------------------^

Other XML parsers complain or fail as well.
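
The failure can be reproduced with a minimal sketch. It assumes (this is inferred from the replacement characters above, not confirmed from the Mail::DMARC code) that the org_name bytes are written in a single-byte encoding such as Latin-1 while the parser treats the document as UTF-8. `build_report` and `is_well_formed` are hypothetical helpers, and the report fragment is heavily simplified.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

ORG_NAME = "Sueño Fueguino"


def build_report(org_name: str, byte_encoding: str) -> bytes:
    """Build a tiny report fragment (hypothetical, not the real report layout).

    The XML declaration always claims UTF-8; byte_encoding controls which
    bytes are actually written, which is where the mismatch can creep in.
    """
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        "<feedback><report_metadata>"
        f"<org_name>{escape(org_name)}</org_name>"  # escape &, <, > as well
        "</report_metadata></feedback>"
    )
    return xml.encode(byte_encoding)


def is_well_formed(data: bytes) -> bool:
    """Return True if the bytes parse as XML, False on a parse error."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


broken = build_report(ORG_NAME, "latin-1")  # bytes do not match the declaration
fixed = build_report(ORG_NAME, "utf-8")     # bytes match the declaration
```

Here `is_well_formed(broken)` is false because the lone byte for "ñ" is not valid UTF-8, while `fixed` parses and yields the original name.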

Note: I'm not an active user, but I'm on the receiving end of the XML sent by Mail::DMARC, which XML parsers refuse to process because of this. I haven't received a report showing this issue from the latest version, but by the looks of the current code it is still the case.

@marcbradshaw marcbradshaw self-assigned this Mar 24, 2021
wolfgangkarall added a commit to wolfgangkarall/dmarcts-report-parser that referenced this issue Mar 26, 2021
There are reports containing broken XML, e.g. the ones created by
Mail::DMARC, see msimerson/mail-dmarc#190
@marcbradshaw
Collaborator

Note: The database schema (for MySQL at least) specifies 'CHARACTER SET ascii', so it will need to be updated to handle the storage of UTF-8 in reports.
RFC 7489 specifies that domains must be converted to A-label form, but is ambiguous regarding the remaining data in the report.
A quick fix may be to convert everything to ascii before saving the report, but this is likely to break (or at least not fix, because they are likely already broken) EAI addresses.
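
As a rough illustration of that quick fix and its downside, here is a sketch in Python (not the project's Perl code). `to_ascii` is a hypothetical helper, and stripping accents via Unicode decomposition is only one possible way to "convert everything to ascii".

```python
import unicodedata


def to_ascii(text: str) -> str:
    """Fold text to plain ASCII (lossy).

    Accented letters are decomposed into base letter + combining mark,
    then every remaining non-ASCII code point is dropped.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


org_name = to_ascii("Gwt7 IIA - Ingeniería e Informática Asociada")

# An internationalized (EAI) address has no useful ASCII fallback:
# the non-Latin characters are simply removed.
eai_address = to_ascii("用户@例子.example")
```

Names in Latin script survive in a readable form, but the EAI address is reduced to an unusable string, which is the breakage mentioned above.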

@msimerson
Owner

> A quick fix may be to convert everything to ascii before saving the report

Sounds like the right choice, based on my read of RFC 8616.

> but this is likely to break (or at least not fix, because they are likely already broken) EAI addresses.

True, but will it matter? New reports will be saved with the new converted a-label form, which should fix all future reports, and solve this issue, right?

RFC 8616, Section 6

DMARC and Internationalized Mail

   DMARC RFC7489 defines a policy language that domain owners can
   specify for the domain of the address in an RFC5322.From header
   field.

   Section 6.6.1 of RFC7489 specifies, somewhat imprecisely, how IDNs
   in the RFC5322.From address domain are to be handled.  That section
   is updated to say that all U-labels in the domain are converted to
   A-labels before further processing.  Section 7.1 of RFC7489 is
   similarly updated to say that all U-labels in domains being handled
   are converted to A-labels before further processing.

   DMARC policy records, described in Sections 6.3 and 7.1 of RFC7489,
   can contain email addresses in the "rua" and "ruf" tags.  Since a
   policy record can be used for both internationalized and conventional
   mail, those addresses still have to be conventional addresses, not
   internationalized addresses.
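
For illustration only, the U-label to A-label conversion described in the quoted section might look like this in Python. `domain_to_a_label` is a hypothetical helper. Note that Python's built-in `idna` codec implements the older IDNA 2003 rules, so treat this as an approximation of what a real implementation would need rather than a compliant one.

```python
def domain_to_a_label(domain: str) -> str:
    """Convert each U-label in a domain to its A-label (xn--...) form.

    Uses Python's built-in "idna" codec (IDNA 2003); labels that are
    already ASCII pass through unchanged.
    """
    return domain.encode("idna").decode("ascii")


converted = domain_to_a_label("bücher.example")
unchanged = domain_to_a_label("example.com")
```

This covers domains only. Free-text fields such as org_name are not domain names and need separate handling.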

@msimerson
Owner

Because the data columns that store domains and author info were explicitly declared as ASCII, I think (limited testing) that MySQL would have converted any Unicode characters to a ? character. Near as I can tell, the original character data is lost. If I'm wrong and MySQL stored the character codes correctly, then the changes below will automatically do The Right Thing.
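
To show why such a substitution would be unrecoverable, here is a small Python analogy. It mimics the suspected "replace with ?" behaviour using Python's own ASCII encoder and makes no claim about what MySQL actually does. `store_in_ascii_column` is a hypothetical name.

```python
def store_in_ascii_column(value: str) -> str:
    """Simulate a lossy ASCII store: each non-ASCII character becomes '?'."""
    return value.encode("ascii", errors="replace").decode("ascii")


stored = store_in_ascii_column("Ingeniería e Informática")

# A different input collapses to the same stored value, so the original
# characters cannot be reconstructed from what was saved.
also_stored = store_in_ascii_column("Ingenierîa e Informâtica")
```

Since two different inputs produce the same stored string, only newly saved reports can benefit from a schema change.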

Now that MySQL 8 is the minimum supported version, changing the schema to enable UTF-8 chars is no longer messy and fraught with pitfalls.

The SQL code shown on the mysql wiki page should do the needful.
