Translate dataset #7

Irio · 2016-08-08T00:55:27Z

Closes #6.

cuducos · 2016-08-08T09:53:23Z

Great PR, @Irio!

Irio · 2016-08-08T16:04:13Z

Wasn't able to convert the largest file (AnosAnteriores.xml, roughly 3GB) to CSV on my machine, which has 16GB of RAM. Started an Amazon instance with 60GB of RAM and did it in a few minutes. 👯

Here are the file sizes of the government-provided XMLs and the generated CSVs:

-rw-r--r--@ 1 irio  staff    74M Aug  8 12:57 2016-08-08-AnoAnterior.csv
-rw-r--r--@ 1 irio  staff    37M Aug  8 12:57 2016-08-08-AnoAtual.csv
-rw-r--r--@ 1 irio  staff   291M Aug  8 13:01 2016-08-08-AnosAnteriores.csv
-rw-r--r--  1 irio  staff   674M Jul 25 21:30 AnoAnterior.xml
-rw-r--r--  1 irio  staff   317M Jul 25 21:00 AnoAtual.xml
-rw-r--r--  1 irio  staff   2.6G Jul 25 21:33 AnosAnteriores.xml

Irio · 2016-08-08T21:45:35Z

@cuducos Would love to have as much feedback from you as possible on this last commit, 2460341. Both in general (e.g. I didn't know the word "glosa" from "vlrGlosa") and specifically on the translation of congress(man) and parliamentar(y), which I used interchangeably.

cuducos · 2016-08-08T23:42:23Z

I'll look at this later this week. I can try to optimize the Python script so it can run on “normal” computers too. Is that a priority?

cuducos · 2016-08-09T01:10:21Z

Some quick translation suggestions:

Nome do dado

txNomeParlamentar, ideCadastro and nuCarteiraParlamentar should be congressperson_name, congressperson_id and congressperson_document (as congressmen is gendered)
sgUF could be state (usual in American English), federative_unit (more formal/literal) or fu (uf makes no sense in English)
txtCNPJCPF could be ein (Employer Identification Number, unique ID for tax payers in American English)
If I got it right txtNumero could then be ein_number
txtTrecho could be leg_of_the_trip (more used than stretch)

subquota_description

Publication subscriptions
Consultancy, research and technical work (typo, missing h in technical)
Publicity of parliamentary activity
Flight ticket issue
Congressperson meal
Lodging, except for congressperson from Distrito Federal
Aircraft renting or charter
Watercraft renting or charter
Automotive vehicle renting or charter
Telecommunication

Irio · 2016-08-09T13:30:17Z

👍 for the gender neutral pronoun.

Changed txNomeParlamentar, ideCadastro, nuCarteiraParlamentar, sgUF (to state), txtTrecho and all the subquota descriptions according to your suggestions, @cuducos. Have not changed two of them:

txtNumero: seems to refer to the attribute indTipoDocumento, "0 for Nota Fiscal (invoice); 1 for Recibo (receipt); and 2 for Despesa no Exterior (expense abroad)". Maintained as document_number and document_type. Would you prefer another word for "documento" here?

The content of this field represents the face number of the fiscal document issued, or the number of the document that gave rise to the expense debited from the deputy's quota.

txtCNPJCPF: I prefer not to localize the ID names because this could cause confusion for both parties: those who know the Brazilian terms, and those trying to find out more about Brazilian EINs. Also, if the payment was made to a person (e.g. real estate), the document used is the CPF (SSN?).
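For reference, the renames agreed so far could be applied to a CSV header roughly like this. This is a minimal sketch, not the script in this branch: `TRANSLATIONS`, `translate_header` and `translate_csv` are hypothetical names, and the `txtNumero`/`indTipoDocumento` pairing is my reading of the comment above. Columns without an agreed translation (such as txtCNPJCPF) pass through unchanged.

```python
import csv
import io

# Original field name -> translated name, as discussed in this thread.
# Assumption: txtNumero maps to document_number and indTipoDocumento
# to document_type.
TRANSLATIONS = {
    "txNomeParlamentar": "congressperson_name",
    "ideCadastro": "congressperson_id",
    "nuCarteiraParlamentar": "congressperson_document",
    "sgUF": "state",
    "txtTrecho": "leg_of_the_trip",
    "txtNumero": "document_number",
    "indTipoDocumento": "document_type",
}


def translate_header(fieldnames):
    """Rename known columns; leave unknown ones (e.g. txtCNPJCPF) untouched."""
    return [TRANSLATIONS.get(name, name) for name in fieldnames]


def translate_csv(source, target):
    """Copy CSV rows from source to target, translating only the header row."""
    reader = csv.reader(source)
    writer = csv.writer(target)
    writer.writerow(translate_header(next(reader)))
    writer.writerows(reader)


if __name__ == "__main__":
    source = io.StringIO("txNomeParlamentar,sgUF,txtCNPJCPF\nFulano,DF,123\n")
    target = io.StringIO()
    translate_csv(source, target)
    print(target.getvalue())
```

Only the header row is rewritten, so the data rows stream through without being held in memory.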

As for optimizing the script, I wouldn't consider it a high priority for now. Already generated the files and have them on S3; will be adding the links before merging this branch. If you can't optimize it, we would update the files weekly or monthly (adding the most recent receipts received).

If you want to work on it, that's surely welcome, since it would give more people the chance to run the whole stack by themselves. I can say in advance that I've tried replacing xml_soup.find_all (which allocates memory for all the XML nodes) with xml_soup.select('DESPESA:nth-of-type(%s)' % index) on every iteration; it also failed, running out of memory after thousands of records.
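One alternative that avoids building the whole tree is to stream the file with the standard library's `xml.etree.ElementTree.iterparse`. The sketch below is not what this branch does (it uses BeautifulSoup); `xml_to_csv` is a hypothetical helper, and it assumes every record is a `DESPESA` element whose children are flat fields, with the first record listing all columns.

```python
import csv
import xml.etree.ElementTree as ET


def xml_to_csv(xml_path, csv_path, record_tag="DESPESA"):
    """Stream records from XML to CSV one at a time; return the row count."""
    writer = None
    count = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as csv_file:
        for _event, element in ET.iterparse(xml_path, events=("end",)):
            if element.tag != record_tag:
                continue
            row = {child.tag: (child.text or "") for child in element}
            if writer is None:
                # Assumption: the first record lists every column; fields
                # that only appear in later records are ignored.
                writer = csv.DictWriter(
                    csv_file, fieldnames=list(row), extrasaction="ignore"
                )
                writer.writeheader()
            writer.writerow(row)
            count += 1
            # Free the record's children and text. The emptied element
            # stays attached to its parent, so memory still grows, slowly.
            element.clear()
    return count
```

Memory use should then depend mostly on the size of a single record rather than on the size of the file, although I have not measured it against AnosAnteriores.xml.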

cuducos · 2016-08-09T13:46:03Z

Thanks for the explanation about txtNumero, makes sense!

About the EIN you're right, it's really for businesses only. Individuals have an Individual Taxpayer Identification Number (ITIN), which is different from the SSN (let's say it's CNPJ ~ EIN, CPF ~ ITIN and PIS/PASEP ~ SSN). I'm fine with cnpjcpf or something like that, and I'd be fine with ein_or_itin or something like that too. It's up to you ; ) In this big mess it might make sense to go with the Brazilian terms… I'd just be more literal and do it as cnpj_or_cpf (I imagine it'd be easier for a gringo to google for “cnpj or cpf” than “cnpjcpf” altogether). But that's merely a tiny tiny detail.

will be adding the links before merging this branch

Yay! That's great.

I might try to optimize anyway, just to have a strategy if we stumble on similar issues in the future. My strategy is to write a slower script that depends mostly on the file system, not on memory. Slower, but it will work without requiring 60GB of RAM (and some dollars). My first attempt will be to write bits of the large file into temp files (each bit in a different file), then remove the large file from memory, and then write the bits (one by one) into the CSV. Gonna create a branch based on this one and send you a PR later today, ok?
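To make the first step of that strategy concrete, here is a rough sketch of splitting the stream into temp files that each hold a bounded number of whole records. It is an illustration only, not the code that ends up in the branch: `split_records` and its parameters are hypothetical, and it assumes records are delimited by plain `<DESPESA>` and `</DESPESA>` tags without attributes.

```python
import os
import tempfile


def split_records(source, directory, opening_tag="<DESPESA>",
                  closing_tag="</DESPESA>", records_per_file=1000,
                  block_size=1 << 20):
    """Split a text stream into temp files of whole records; return paths."""
    paths, batch, buffer = [], [], ""

    def flush():
        # Write the pending records to their own temp file.
        if batch:
            handle, path = tempfile.mkstemp(suffix=".xml", dir=directory)
            with os.fdopen(handle, "w", encoding="utf-8") as chunk:
                chunk.write("".join(batch))
            paths.append(path)
            batch.clear()

    while True:
        block = source.read(block_size)
        if not block:
            break
        buffer += block
        # Everything before the last closing tag is complete; the rest
        # stays in the buffer until the next block arrives.
        *complete, buffer = buffer.split(closing_tag)
        for piece in complete:
            start = piece.find(opening_tag)
            if start == -1:
                continue  # no record in this piece
            batch.append(piece[start:] + closing_tag)
            if len(batch) >= records_per_file:
                flush()
    flush()
    return paths
```

Each resulting file is small enough to parse on its own and append to the CSV, so only one block plus one batch of records is held in memory at a time.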

cuducos · 2016-08-09T22:45:34Z

@Irio What kind of issue did you have with AnosAnteriores.xml? Was it an OSError: [Errno 22] Invalid argument on read? That might be a Mac OS X bug, but I have to investigate further…

Irio · 2016-08-09T22:54:37Z

@cuducos That's a known issue of Python 3 when opening files too large for the available amount of RAM. Documented solutions are switching to Python 2 or increasing the RAM available.

The issue I have is an exception being raised after twice the available memory is used, stopping the script.
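If the OSError does come from a single very large read() call, which is an assumption and not something confirmed in this thread, one possible workaround is to read the file in bounded chunks. A minimal sketch, with `read_in_chunks` and the 64 MB default being hypothetical choices:

```python
def read_in_chunks(path, chunk_size=64 * 1024 * 1024):
    """Yield the file's bytes in pieces no larger than chunk_size."""
    with open(path, "rb") as source:
        while True:
            chunk = source.read(chunk_size)
            if not chunk:
                return
            yield chunk
```

This only avoids the oversized read; joining all the chunks back together would still need as much memory as the whole file.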

cuducos · 2016-08-10T11:47:29Z

Done in PR #8: longer and slower script, but it worked on my computer with 8GB of RAM for AnosAnteriores.xml.

hyena22 · 2016-08-10T14:14:42Z

What do subquota (is it the task?), document value and remark value mean? Also, what is the main quantity we are looking at, i.e. the money spent on each task?

Also, what is the granularity of the data? I assume it is each individual task that gets reimbursed. Am I correct?

cuducos · 2016-08-10T14:38:46Z

Following @hyena22's last post/comment, we might add to this PR a translation (en-US) of this table (the one saved for us at data/datasets_format.html). This translated version could clarify the translation of each variable and give an English description for each of them. I'm just not sure about where to add this file… any ideas?

cuducos · 2016-08-10T15:11:45Z

And answering @hyena22's questions:

subquota is the kind (type? category?) of expense a record refers to (transport, lodging etc.); take a look at subquota_description in develop/2016-08-08-im-translate-dataset.ipynb;
document_value is the main value of the expense a record refers to;
remark_value: I double-checked the translation because I didn't even know the term in Portuguese (i.e. glosa); it looks like there can sometimes be a difference between the estimate of an expense and its actual cost, so an extra payment might be needed. We should confirm this with an expert (any ideas who'd know that? cc @Irio @cabral).

Regarding the main quantity we are looking at, I think it depends. The main value itself is document_value, but I think we might triangulate that with more contextual data to have something meaningful.

Irio · 2016-08-10T15:29:56Z

@cuducos

PT and EN versions of a Markdown file, containing a definition list or table with the contents of data/datasets_format.html (which we would then no longer keep), would do the job of explaining. I will work on them and include them in this branch.

On your question about remark_value and the cost attributes, vlrLiquido/net_value seems to be the value we want to spend more time analyzing, since it is the value the politician gets reimbursed[1]. Just contacted Lúcio Big (from http://ops.net.br/) asking him about these specific attributes.

[1]:

Its content represents the net value of the fiscal document, or of the document that gave rise to the expense, and is calculated as the difference between the Document Value and the Glosa Value. This is the value that will be debited from the deputy's quota. If the debit is of the Telephony type and the value is equal to zero, it means the expense was "franqueada" (exempted).
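In terms of the translated column names, that definition amounts to net_value = document_value - remark_value. A small sketch of the arithmetic, assuming the values arrive as decimal strings (the `net_value` helper itself is hypothetical):

```python
from decimal import Decimal


def net_value(document_value, remark_value):
    """Net value debited from the quota: document value minus glosa."""
    # Decimal avoids the rounding surprises of binary floats on currency.
    return Decimal(document_value) - Decimal(remark_value)


if __name__ == "__main__":
    print(net_value("100.00", "20.50"))
```

Comparing this computed value with the net value published in the dataset could be a cheap consistency check on the translation.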

cuducos · 2016-08-10T15:35:21Z

Many thanks, @Irio! Awesome.

In our version of this table we could link/include Lúcio's remark on remark (yes, pun intended).

I can work on this table or help you with it. Let me know if you want me to take over, or how I can give you a hand ; )

Irio · 2016-08-10T15:42:30Z

@cuducos It would be a great help if you could create a script for generating these Markdown files from the HTML already in data/. This would give me time to wrap up the pull request with scripts for setting up contributors' environments with the needed datasets, already translated and updated.
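As a starting point, such a script could be built on the standard library's `html.parser`. The sketch below is only an outline under assumptions: `TableRows` and `html_table_to_markdown` are hypothetical names, and it assumes the file holds a single plain table whose first row is the header, which has not been checked against data/datasets_format.html.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every td/th cell, grouped by tr row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, if any
        self._cell = None  # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse whitespace and escape pipes so the cell stays intact.
            text = " ".join("".join(self._cell).split())
            self._row.append(text.replace("|", "\\|"))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row, self._cell = None, None


def html_table_to_markdown(html):
    """Render the table rows found in html as a Markdown table."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return ""
    header, *body = parser.rows
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)
```

The same output could feed both the PT and EN files, with the EN one passing each cell through a translation table first.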

cuducos · 2016-08-10T15:57:58Z

Sure thing. I just added Issue #9 for that ; )

Irio · 2016-08-11T22:49:07Z

Do you plan to merge your branch into this one? Otherwise I'm already satisfied with the current status and would ask for final review before merge. @cuducos

cuducos · 2016-08-12T08:50:22Z

My branch can be merged later, don't worry about it. I can't promise a code review before Monday though…

Fix #7

Irio added the work in progress label Aug 8, 2016

Irio self-assigned this Aug 8, 2016

cuducos mentioned this pull request Aug 10, 2016

Translate dataset variables/description table #9

Closed

Irio removed the work in progress label Aug 11, 2016

Irio and others added 7 commits August 17, 2016 18:36

Unzip fetched datasets after downloading

fb4dc1d

Retrieve definition of variables together with dataset

bed9b3f

Add script for converting XML datasets to CSV

2c17542

Gitignore .ipynb_checkpoints folders

3ffc590

Add analysis proposing translation of datasets

49d2d77

Update translation of terms after suggestions

9a17f3c

Refactor to run with less RAM

e6edab1

Irio and others added 7 commits August 17, 2016 18:37

Install dependencies required by new scripts with setup bin

01f7ff4

Download backed up datasets on setup script

2892180

Fix link for dataset

6e881f5

Attempt to free memory after closing file (not before)

bf91447

Add script based on analysis for translating the datasets

c5fe92f

cython not used

e2ec91e

removed lxml

cf801f3

Irio force-pushed the im-translate-dataset branch from ab17451 to cf801f3 Compare August 17, 2016 21:37

Fetch translated and compressed files in fetch_datasets.py

ddf1763

hyena22 merged commit 6263f1f into master Aug 17, 2016

hyena22 deleted the im-translate-dataset branch August 17, 2016 21:39

Irio pushed a commit that referenced this pull request Feb 27, 2018

Lazy back-end receipt URL check

d5fd5f3

Fix #7
