Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Querying set of CSVs which a lot of entries does not output any json #101

Closed
sbeaupre opened this issue Nov 23, 2019 · 6 comments · Fixed by #103
Closed

Querying set of CSVs which a lot of entries does not output any json #101

sbeaupre opened this issue Nov 23, 2019 · 6 comments · Fixed by #103
Labels

Comments

@sbeaupre
Copy link

I am working on a medium-large open data set of Belgian companies, which is freely downloadable at https://kbopub.economie.fgov.be/kbo-open-data/login

I am trying to execute following query to denormalize data into 1 file and it works fine when the output target is CSV, but nothing happens when outputting JSON:

docker run --rm -it -v $(pwd):/tmp noborus/trdsql -ojson -ih " \
SELECT * FROM /tmp/enterprise.csv as e \
LEFT JOIN /tmp/address.csv as a ON (e.EnterpriseNumber=a.EntityNumber) \
LEFT JOIN /tmp/contact.csv as c ON (e.EnterpriseNumber=c.EntityNumber) \
LEFT JOIN /tmp/denomination.csv as d ON (e.EnterpriseNumber=d.EntityNumber) \
";

Any ideas?

@noborus
Copy link
Owner

noborus commented Nov 23, 2019

I want more information.

Does trdsql exit without outputting anything (errors, etc.) in JSON?
Can you output CSV and JSON with SQL LIMIT?

@sbeaupre
Copy link
Author

CSV output works fine, with or without LIMIT.

JSON output works with a LIMIT (eg. 10), but not without LIMIT

JSON without LIMIT gives no output at all, not even error messages. Tried also with debug flag, but no extra info is returned.

@noborus
Copy link
Owner

noborus commented Nov 25, 2019

Hmm ...
I actually downloaded and tried it.
The result seems to have been output successfully.

docker run --rm -it -v $(pwd):/tmp noborus/trdsql -ojson -ih " \ 
SELECT * FROM /tmp/enterprise.csv as e \
LEFT JOIN /tmp/address.csv as a ON (e.EnterpriseNumber=a.EntityNumber) \
LEFT JOIN /tmp/contact.csv as c ON (e.EnterpriseNumber=c.EntityNumber) \
LEFT JOIN /tmp/denomination.csv as d ON (e.EnterpriseNumber=d.EntityNumber) \
" >tmp.json
ls -alFh tmp.json
-rw-r--r-- 1 noborus noborus 1.4G Nov 25 13:48 tmp.json

@sbeaupre
Copy link
Author

Saw that you are running on linux and realized that it is a memory issue.

Pure from an install on Mac, it works and more than 12GB is consumed to produce the json. Running inside docker with more resources works also fine. Strange that docker gave no error message.

Looking at the code (not a go dev myself), I suppose you are building everything in memory first and then write it to disk, and not streaming the data?

@noborus
Copy link
Owner

noborus commented Nov 25, 2019

Saw that you are running on linux and realized that it is a memory issue.

Pure from an install on Mac, it works and more than 12GB is consumed to produce the json. Running inside docker with more resources works also fine. Strange that docker gave no error message.

I see. understood.

Looking at the code (not a go dev myself), I suppose you are building everything in memory first and then write it to disk, and not streaming the data?

Yes.
Since it is currently an array, it is temporarily stored in memory.

For example

[{"id":"1","name":"Orange"},
 {"id":"2","name":"Melon"},
 {"id":"3","name":"Apple"}]

For large outputs, you will need an output that is not an array (I don't know what to call this).

{"id":"1","name":"Orange"}
{"id":"2","name":"Melon"}
{"id":"3","name":"Apple"}

Not yet. I will consider it from now on.

noborus added a commit that referenced this issue Nov 26, 2019
Add JSONL output format separately from JSON.
JSONL (http://jsonlines.org/) is a JSON format separated by line breaks.
Output can be output line by line, so output starts quickly.
Also, JSONL output requires less memory than JSON output.
resolve #101.
@noborus
Copy link
Owner

noborus commented Nov 27, 2019

I added a new JSONL output option ( #103 ).
This format can reduce memory consumption.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants