Translate dataset #7
Conversation
Great PR, @Irio!
I wasn't able to convert the larger file. Here are the file size differences between the government-provided and the generated CSVs:
I'm looking at this later this week. I can try to optimize the Python script so it can run on “normal” computers too. Is that a priority?
Some quick translation suggestions: a table mapping each original field (Nome do dado, “name of the data field”) to a suggested English column name, e.g. subquota_description.
👍 for the gender-neutral pronoun. Changed.
Talking about optimizing the script, I wouldn't consider it a high priority for now. I've already generated the files and have them on S3; I will be adding the links before merging this branch. If you can't optimize it, we would update the files weekly or monthly (adding the most recent receipts received). If you want to work on it, that's surely welcome, giving more people the chance of running the whole stack by themselves. I can say in advance that I've tried not using
Thanks for the explanation about that. About the EIN, you're right: it's really for businesses only. Individuals have an Individual Taxpayer Identification Number (ITIN), which is different from the SSN (let's say CNPJ ~ EIN, CPF ~ ITIN, and PIS/PASEP ~ SSN). I'm fine with
Yay! That's great. I might try to optimize it anyway, just to have a strategy if we stumble on similar issues in the future. My plan is to write a slower script that depends mostly on the file system, not memory: slower, but it will work without requiring 60GB of RAM (and some dollars). My first attempt will be to write bits of the large file into temp files (each bit in a different file), then remove the large file from memory, and then write the bits (one by one) into the CSV. Gonna create a branch based on this one and send you a PR later today, ok?
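For reference, here's a minimal sketch of that temp-file strategy, assuming the large file is a CSV and that pandas does the splitting; the file names and chunk size are made up for illustration, not taken from the actual script:

```python
import os
import tempfile

import pandas as pd

CHUNK_SIZE = 10000  # rows held in memory at a time (hypothetical value)
pieces = []

# 1. Write each bit of the large file to its own temp file, so only one
#    chunk lives in memory at any moment.
for chunk in pd.read_csv("large_input.csv", chunksize=CHUNK_SIZE):
    handle, path = tempfile.mkstemp(suffix=".csv")
    os.close(handle)
    chunk.to_csv(path, index=False)
    pieces.append(path)

# 2. Write the bits, one by one, into the final CSV, keeping the header
#    only for the first piece.
with open("output.csv", "w", encoding="utf-8", newline="") as output:
    for index, path in enumerate(pieces):
        piece = pd.read_csv(path)
        piece.to_csv(output, index=False, header=(index == 0))
        os.remove(path)
```

This trades speed for memory: disk I/O dominates, but peak RAM stays around one chunk instead of the whole file.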
@Irio What kind of issue have you had with
@cuducos That's a known issue of Python 3 when opening files too large for the available memory. The issue I have is an exception being raised after twice the available memory is used, stopping the script.
Done in PR #8: a longer and slower script, but it worked on my 8GB RAM computer for
What do subquota (is it the task?), document value, and remark value mean? What is the main quantity we are looking at, i.e. the money spent for each task? Also, what is the granularity of the data? I assume each individual task that gets reimbursed. Am I correct?
Following @hyena22's last comment, we might add to this PR an en-US translation of this table (the one saved for us on
And answering @hyena22's questions:
Ref. what the main quantity we're looking at is: I think it depends. The main value itself is
PT and EN versions of a Markdown file, with a definition list or table with the contents of the

On your question about [1]:
Many thanks, @Irio! Awesome. In our version of this table we could link to or include Lúcio's remark on it. I can work on or help you with this table. Let me know if you want me to take over, or how to give you a hand ; )
@cuducos It would be a great help if you could create a script for generating these Markdown files from the HTML already in
Sure thing. I just added Issue #9 for that ; )
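A minimal sketch of what such a script could look like, assuming the HTML holds a single table; the file names are hypothetical, and pandas' to_markdown needs the tabulate package installed:

```python
import pandas as pd

# read_html returns one DataFrame per <table> found in the file
table = pd.read_html("datasets_format.html")[0]

# to_markdown renders a GitHub-style pipe table
with open("datasets_format.md", "w", encoding="utf-8") as output:
    output.write(table.to_markdown(index=False))
```

Generating the PT and EN versions would then be a matter of running this twice, with a column-translating step in between.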
Do you plan to merge your branch into this one? Otherwise, I'm already satisfied with the current status and would ask for a final review before merging. @cuducos
My branch can be merged later, don't worry about it. I can't promise a code
Force-pushed from ab17451 to cf801f3 (compare)
Closes #6.