New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correct rdf data export (#811) and prevent broken CSV (#1585) #1659

Merged
merged 10 commits into from Feb 19, 2019

Conversation

Projects
None yet
3 participants
@CharlesNepote
Copy link
Contributor

CharlesNepote commented Feb 8, 2019

Description:

  1. Encode URIs for RDF export (see #811).
  2. Filter bad chars which break CSV and log it in a file (prevent but doesn't resolve #1585).

It seems to work with a test dataset of 177 products, but you might try it before deploying in production. I don't know how to test on the whole database.

You should review commit by commit if you want to see clearly what I've done. Commit 9fb9f6f is a mistake and I'm not very confident on commit deletion.

CharlesNepote added some commits Feb 3, 2019

Merge branch 'issue/811-rdf-data-export-seems-broken' of github.com:o…
…penfoodfacts/openfoodfacts-server into issue/811-rdf-data-export-seems-broken

@CharlesNepote CharlesNepote requested review from hangy and stephanegigandet Feb 8, 2019

@stephanegigandet
Copy link
Contributor

stephanegigandet left a comment

Looks good to me.

@hangy

hangy approved these changes Feb 9, 2019

Copy link
Member

hangy left a comment

I don't have an environment to test this right now, but the code changes look fine.

@CharlesNepote

This comment has been minimized.

Copy link
Contributor Author

CharlesNepote commented Feb 10, 2019

i made some tests with the whole database. There is still database content issues which break the CSV export. Strange database issue: there is a carriage return (\n) at the end of the GPS coordinates related to villecomtal-sur-arros-gers-france: 43.400279,0.199525.

So we have to sanitize the content of more fields than I expected.

I suggest to create a sanitize_field_value() function, probably in ProductOpener::Data, with two arguments:

  • $field_value: the content to be sanitized;
  • $log: optionally the output where to log each problem (example: a filehandler, STDOUT, etc.)

The function would return the sanitized content. Non visible ASCII chars would be replaced by a space, except trailing whitespace or "\n", which would be trimed. Example:

sanitize_field_value("something\nanother\n", $LOG) => "something another"

The function would be used both in input (web app, imports) and in ouput (CSV exports). There might be an issue related to \n in some fields: do we have to delete it all the time?

@hangy @stephanegigandet ?

@hangy

This comment has been minimized.

Copy link
Member

hangy commented Feb 17, 2019

i made some tests with the whole database. There is still database content issues which break the CSV export. Strange database issue: there is a carriage return (\n) at the end of the GPS coordinates related to villecomtal-sur-arros-gers-france: 43.400279,0.199525.

If there's just one case of bad source data, cleaning the source file might be easier?

The function would return the sanitized content. Non visible ASCII chars would be replaced by a space, except trailing whitespace or "\n", which would be trimed.

Sounds good! If you want to replace all control characters, you could also use \p{Cc}.

There might be an issue related to \n in some fields: do we have to delete it all the time?

I don't think so. For CSV, RFC 4180 clearly states that fields containing CRLF should be possible (but be quoted with double quotes). In XML, line breaks should be fine in attributes and values, even though they might need to be used with <![CDATA[ to avoid incorrect with some XML parsers. Therefore, multi-line text should be valid field/attribute/column content.

@CharlesNepote

This comment has been minimized.

Copy link
Contributor Author

CharlesNepote commented Feb 17, 2019

Ok I made more testings and enhance the exporter:

  • prevent perl warnings
  • avoid control chars in rdf file

All seems to be fine. CSV file is ok. RDF file is valid (tested with xmllint --stream en.openfoodfacts.localhost.products.rdf doesn't provide any more issues).

For the moment I keep sanitize_field_content() in the ./scripts/export_database.pl. I'll see later if it's interesting to move it to a perl OFF lib.

If there's just one case of bad source data, cleaning the source file might be easier?

Yes we will have to clean to the source, but my experience is that the source might be corrupted again by the time, so I prefer to also clean the output.

Sounds good! If you want to replace all control characters, you could also use \p{Cc}.

I never tested \p{Cc} so I keep char codes for the moment. Maybe I'll try later if it's quicker or if it consume less memory.

There might be an issue related to \n in some fields: do we have to delete it all the time?

I don't think so. For CSV, RFC 4180 clearly states that fields containing CRLF should be possible (but be quoted with double quotes). In XML, line breaks should be fine in attributes and values, even though they might need to be used with <![CDATA[ to avoid incorrect with some XML parsers. Therefore, multi-line text should be valid field/attribute/column content.

Yes RFC states CRLF is possible if fields are quoted. I think it's ok for small files, with few columns. But I think it's dangerous for huge files with dozens of columns as Open Food Facts dataset. In theory CSV tools should deal with CRLF in quoted fields but in fact some of them doesn't play well with that. Also you can't use regular unix text tools when your CSV contain CRLF inside fields: sort, wc -l, head, grep, etc., will not work well. It makes CSV manipulation more complex.

So I think we should avoid CRLF inside fields in the CSV export. RDF file might be another discussion.

My question was more related to the usages. Is there some OFF usages where CRLF are important inside a field? Maybe we could replace CRLF inside some field by a string: "---" or "\\" for example.

By the way we can see later and deploy this pull request which is a first good step.

@hangy

This comment has been minimized.

Copy link
Member

hangy commented Feb 17, 2019

Sounds good! If you want to replace all control characters, you could also use \p{Cc}.

I never tested \p{Cc} so I keep char codes for the moment. Maybe I'll try later if it's quicker or if it consume less memory.

It's just the Unicode category Other, Control. It includes the range you already mentioned, plus different control characters outside of that range.

My question was more related to the usages. Is there some OFF usages where CRLF are important inside a field? Maybe we could replace CRLF inside some field by a string: "---" or "\" for example.

Fair point. I don't know about that. 🙂

@CharlesNepote

This comment has been minimized.

Copy link
Contributor Author

CharlesNepote commented Feb 18, 2019

Sounds good! If you want to replace all control characters, you could also use \p{Cc}.

I never tested \p{Cc} so I keep char codes for the moment. Maybe I'll try later if it's quicker or if it consume less memory.

It's just the Unicode category Other, Control. It includes the range you already mentioned, plus different control characters outside of that range.

Thanks for the link, I'll make some tests to see if there are issues with control chars outside of the range I focused on.

At first, I think we can merge and deploy my pull request as it solves a real issue. @stephanegigandet ?

@stephanegigandet stephanegigandet merged commit 96bb5d1 into master Feb 19, 2019

1 of 2 checks passed

continuous-integration/travis-ci/push The Travis CI build failed
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment