Correct rdf data export (#811) and prevent broken CSV (#1585) #1659
It seems to work with a test dataset of 177 products, but you may want to try it before deploying to production. I don't know how to test on the whole database.
You should review commit by commit if you want to see clearly what I've done. Commit 9fb9f6f is a mistake, and I'm not very confident about deleting commits.
I ran some tests with the whole database. There are still database content issues that break the CSV export. One strange issue: there is a newline (\n) at the end of the GPS coordinates related to villecomtal-sur-arros-gers-france: 43.400279,0.199525.
So we have to sanitize the content of more fields than I expected.
I suggest to create a
The function would return the sanitized content. Non-visible ASCII characters would be replaced by a space, except trailing whitespace or "\n", which would be trimmed. Example:
The function would be used both on input (web app, imports) and on output (CSV exports). There might be an issue related to \n in some fields: do we have to delete it all the time?
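For illustration only, here is a minimal Python sketch of the behaviour described above. The name `sanitize_field` and the exact character range are assumptions, not the function proposed in this pull request (the actual exporter is not shown in this thread).

```python
import re

# Hypothetical helper: name and rules are assumptions based on the
# description above, not the project's actual code.
# ASCII control characters: 0x00-0x1F and 0x7F (DEL).
_ASCII_CONTROL = re.compile(r"[\x00-\x1f\x7f]")


def sanitize_field(value: str) -> str:
    """Trim trailing whitespace, then replace control chars with a space."""
    trimmed = value.rstrip()  # drops trailing spaces, tabs, "\n", "\r"
    return _ASCII_CONTROL.sub(" ", trimmed)


# The GPS value mentioned above, with its stray trailing newline.
print(repr(sanitize_field("43.400279,0.199525\n")))
# An embedded newline becomes a space; trailing whitespace is removed.
print(repr(sanitize_field("line one\nline two\t \n")))
```

Trimming first and replacing afterwards keeps a trailing "\n" from turning into a trailing space.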
If there's just one case of bad source data, cleaning the source file might be easier?
Sounds good! If you want to replace all control characters, you could also use
I don't think so. For CSV, RFC 4180 clearly states that fields containing CRLF should be possible (but be quoted with double quotes). In XML, line breaks should be fine in attributes and values, even though they might need to be used with
OK, I did more testing and enhanced the exporter:
All seems to be fine. CSV file is ok. RDF file is valid (tested with
For the moment I keep
Yes, we will have to clean the source, but in my experience the source may become corrupted again over time, so I prefer to also clean the output.
I never tested
Yes, the RFC states that CRLF is possible if fields are quoted. I think that's OK for small files with few columns, but dangerous for huge files with dozens of columns, such as the Open Food Facts dataset. In theory CSV tools should deal with CRLF in quoted fields, but in practice some of them don't handle it well. Also, you can't use regular Unix text tools when your CSV contains CRLF inside fields:
So I think we should avoid CRLF inside fields in the CSV export. RDF file might be another discussion.
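To make the line-count concern concrete, here is a small Python sketch using the standard `csv` module. The sample rows are made up, and replacing CRLF with a space is only one assumed option. A quoted field containing CRLF is valid CSV, but the file then has more physical lines than records, which is what confuses line-oriented tools.

```python
import csv
import io

# Made-up sample data; the second row has a CRLF inside a field.
rows = [
    ["code", "ingredients"],
    ["0001", "sugar\r\nsalt"],
    ["0002", "water"],
]

# Write RFC 4180 style: the writer quotes the field containing CRLF.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\r\n").writerows(rows)
text = buffer.getvalue()

# A line-oriented tool sees one line per CRLF...
physical_lines = text.count("\r\n")
# ...while a CSV-aware parser sees the real number of records.
records = list(csv.reader(io.StringIO(text, newline="")))

# Flattening CRLF inside fields (here to a space, as an assumption)
# restores one physical line per record.
flat = io.StringIO()
csv.writer(flat, lineterminator="\r\n").writerows(
    [[field.replace("\r\n", " ") for field in row] for row in rows]
)
flat_lines = flat.getvalue().count("\r\n")

print(physical_lines, len(records), flat_lines)
```

With the flattened output, counting or filtering lines gives the same answer as counting records.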
My question was more about usage. Are there OFF use cases where CRLF is important inside a field? Maybe we could replace CRLF inside some fields with a string: "---" or "\\" for example.
In any case, we can look at that later and deploy this pull request, which is a good first step.
It's just the Unicode category Other, Control. It includes the range you already mentioned, plus different control characters outside of that range.
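As a rough illustration, in Python the same category can be checked with `unicodedata.category`, which returns `"Cc"` for Other, Control. The function name `replace_control_chars` is hypothetical, and this is a sketch rather than the exporter's code.

```python
import unicodedata


def replace_control_chars(value: str, replacement: str = " ") -> str:
    """Replace every character in Unicode category Cc (Other, Control)."""
    return "".join(
        replacement if unicodedata.category(ch) == "Cc" else ch
        for ch in value
    )


# U+0007 (BEL) is in the ASCII control range; U+0085 (NEL) is a control
# character outside it. Accented letters such as "é" are left alone.
print(repr(replace_control_chars("caf\u00e9\x07au lait\u0085")))
```

Note that "\n" and "\t" are also in category Cc, so a filter like this replaces them too unless they are handled separately.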
Fair point. I don't know about that.
Thanks for the link, I'll make some tests to see if there are issues with control chars outside of the range I focused on.
For now, I think we can merge and deploy my pull request, as it solves a real issue. @stephanegigandet ?