Translate dataset #7

Merged
hyena22 merged 15 commits into master from im-translate-dataset on Aug 17, 2016

Conversation

Irio
Collaborator

@Irio Irio commented Aug 8, 2016

Closes #6.

@Irio Irio self-assigned this Aug 8, 2016
@cuducos
Collaborator

cuducos commented Aug 8, 2016

Great PR, @Irio!

@Irio
Collaborator Author

Irio commented Aug 8, 2016

I wasn't able to convert the largest file (AnosAnteriores.xml, about 3GB) to CSV on my machine, which has 16GB of RAM. I started an Amazon instance with 60GB of RAM and did it in a few minutes. 👯

Here are the file sizes of the government-provided XMLs and the generated CSVs:

-rw-r--r--@ 1 irio  staff    74M Aug  8 12:57 2016-08-08-AnoAnterior.csv
-rw-r--r--@ 1 irio  staff    37M Aug  8 12:57 2016-08-08-AnoAtual.csv
-rw-r--r--@ 1 irio  staff   291M Aug  8 13:01 2016-08-08-AnosAnteriores.csv
-rw-r--r--  1 irio  staff   674M Jul 25 21:30 AnoAnterior.xml
-rw-r--r--  1 irio  staff   317M Jul 25 21:00 AnoAtual.xml
-rw-r--r--  1 irio  staff   2.6G Jul 25 21:33 AnosAnteriores.xml

@Irio
Collaborator Author

Irio commented Aug 8, 2016

@cuducos I would love as much feedback from you as possible on this last commit, 2460341: both in general (e.g. I didn't know the word "glosa" from "vlrGlosa") and specifically on the translation of congress(man) and parliamentar(y), which I used interchangeably.

@cuducos
Collaborator

cuducos commented Aug 8, 2016

I'll look at this later this week. I can try to optimize the Python script so it can run on “normal” computers too. Is that a priority?

@cuducos
Collaborator

cuducos commented Aug 9, 2016

Some quick translation suggestions:

Nome do dado

  • txNomeParlamentar, ideCadastro and nuCarteiraParlamentar should be congressperson_name, congressperson_id and congressperson_document (as congressmen is gendered)
  • sgUF could be state (usual in American English), federative_unit (more formal/literal) or fu (uf makes no sense in English)
  • txtCNPJCPF could be ein (Employer Identification Number, the unique ID for taxpayers in American English)
  • If I got it right txtNumero could then be ein_number
  • txtTrecho could be leg_of_the_trip (more used than stretch)

subquota_description

  • Publication subscriptions
  • Consultancy, research and technical work (typo, missing h in technical)
  • Publicity of parliamentary activity
  • Flight ticket issue
  • Congressperson meal
  • Lodging, except for congressperson from Distrito Federal
  • Aircraft renting or charter of aircraft
  • Watercraft renting or charter
  • Automotive vehicle renting or charter
  • Telecommunication

@Irio
Collaborator Author

Irio commented Aug 9, 2016

👍 for the gender-neutral term.

Changed txNomeParlamentar, ideCadastro, nuCarteiraParlamentar, sgUF (to state), txtTrecho and all the subquota descriptions according to your suggestions, @cuducos. Have not changed two of them:

  • txtNumero: seems to refer to the attribute indTipoDocumento ("0 for Nota Fiscal [invoice]; 1 for Recibo [receipt]; 2 for Despesa no Exterior [expense abroad]"). I kept them as document_number and document_type. Would you prefer another word for "documento" here?

"The content of this field is the face number of the fiscal document issued, or the number of the document that gave rise to the expense debited from the deputy's quota."

  • txtCNPJCPF: I prefer not to localize the ID names because this could cause confusion for both parties: those who know the Brazilian terms, and those trying to find out more about Brazilian EINs. Also, if the payment was made to a person (e.g. real estate), the document used is the CPF (SSN?).
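
For illustration only, the renames settled so far could be expressed as a header mapping. This is a sketch, not the code in this branch: `COLUMN_TRANSLATIONS`, `translate_header` and `translate_csv` are hypothetical names, and txtCNPJCPF is left untranslated because its final name is still open at this point.

```python
import csv
import io

# Original column name -> English name agreed on in this thread.
# Only the columns discussed here are listed; this is not the full schema.
COLUMN_TRANSLATIONS = {
    "txNomeParlamentar": "congressperson_name",
    "ideCadastro": "congressperson_id",
    "nuCarteiraParlamentar": "congressperson_document",
    "sgUF": "state",
    "txtNumero": "document_number",
    "indTipoDocumento": "document_type",
    "txtTrecho": "leg_of_the_trip",
}


def translate_header(header):
    """Return the header with known columns renamed; unknown ones are kept."""
    return [COLUMN_TRANSLATIONS.get(name, name) for name in header]


def translate_csv(source, target):
    """Copy CSV rows from source to target, translating only the header row."""
    reader = csv.reader(source)
    writer = csv.writer(target)
    writer.writerow(translate_header(next(reader)))
    writer.writerows(reader)
```

Columns missing from the mapping pass through unchanged, so an undecided name does not block the conversion.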

As for optimizing the script, I wouldn't consider it a high priority for now. I have already generated the files and have them on S3; I will be adding the links before merging this branch. If you can't optimize it, we would update the files weekly or monthly (adding the most recent receipts received).

If you want to work on it, that's surely welcome, since it gives more people the chance to run the whole stack by themselves. I can say in advance that I've tried replacing xml_soup.find_all (which allocates memory for all the XML nodes) with xml_soup.select('DESPESA:nth-of-type(%s)' % index) on every iteration; that also failed, running out of memory after thousands of records.
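
One alternative that avoids holding every node in memory is incremental parsing. The sketch below swaps BeautifulSoup for the standard library's `xml.etree.ElementTree.iterparse`; it is not what this branch does. It assumes each record is a `DESPESA` element whose direct children are the fields, and `xml_to_csv` and its parameters are hypothetical names.

```python
import csv
import io
import xml.etree.ElementTree as ET


def xml_to_csv(xml_source, csv_target, fieldnames, record_tag="DESPESA"):
    """Stream record_tag elements from xml_source into csv_target.

    xml_source is a path or a binary file object; csv_target is a text file
    object. Returns the number of records written.
    """
    writer = csv.DictWriter(csv_target, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    count = 0
    # "end" events fire once an element (and all of its children) is complete.
    for _, element in ET.iterparse(xml_source, events=("end",)):
        if element.tag == record_tag:
            writer.writerow({child.tag: (child.text or "") for child in element})
            count += 1
            # Drop the record's children and text once written. An empty
            # element shell stays attached to its parent, which is far
            # smaller than the full record.
            element.clear()
    return count
```

With a real file this would be called as `xml_to_csv("AnosAnteriores.xml", csv_file, fieldnames)`, where `fieldnames` lists the columns to keep.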

@cuducos
Collaborator

cuducos commented Aug 9, 2016

Thanks for the explanation about txtNumero, makes sense!


About the EIN, you're right: it's really for businesses only. Individuals have an Individual Taxpayer Identification Number (ITIN), which is different from the SSN (let's say it's CNPJ ~ EIN, CPF ~ ITIN and PIS/PASEP ~ SSN). I'm fine with cnpjcpf or something like that, and I'd be fine with ein_or_itin or something like that too. It's up to you ; ) In this big mess it might make sense to go with the Brazilian terms… I'd just be more literal and do it as cnpj_or_cpf (I imagine it'd be easier for a gringo to google “cnpj or cpf” than “cnpjcpf” as one word). But that's merely a tiny, tiny detail.


will be adding the links before merging this branch

Yay! That's great.

I might try to optimize anyway, just to have a strategy in case we stumble on similar issues in the future. My plan is to write a slower script that depends mostly on the file system, not memory. It will be slower, but it will work without requiring 60GB of RAM (and some dollars). My first attempt will be to write pieces of the large file to temp files (each piece in a different file), then release the large file from memory, and then write the pieces (one by one) into the CSV. I'm going to create a branch based on this one and send you a PR later today, ok?
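
A rough sketch of that strategy follows. It is not the script later sent as a PR: `split_records` and `pieces_to_csv` are hypothetical names, and it assumes records are delimited by `<DESPESA>` and `</DESPESA>` tags with no attributes.

```python
import csv
import io
import os
import tempfile
import xml.etree.ElementTree as ET

OPENING_TAG = "<DESPESA>"   # assumed record delimiters
CLOSING_TAG = "</DESPESA>"


def split_records(source, directory, records_per_file=1000, buffer_size=64 * 1024):
    """Read source in small chunks and write groups of records to temp files."""
    paths, pending, buffer = [], [], ""

    def flush():
        # Wrap the pending records in a dummy root so each piece is valid XML.
        if pending:
            handle, path = tempfile.mkstemp(suffix=".xml", dir=directory, text=True)
            with os.fdopen(handle, "w", encoding="utf-8") as piece:
                piece.write("<piece>" + "".join(pending) + "</piece>")
            paths.append(path)
            pending.clear()

    while True:
        chunk = source.read(buffer_size)
        if not chunk:
            break
        buffer += chunk
        while CLOSING_TAG in buffer:
            record, buffer = buffer.split(CLOSING_TAG, 1)
            start = record.find(OPENING_TAG)
            if start >= 0:  # skip anything before the record itself
                pending.append(record[start:] + CLOSING_TAG)
            if len(pending) >= records_per_file:
                flush()
    flush()
    return paths


def pieces_to_csv(paths, csv_target, fieldnames):
    """Parse the temp files one at a time, append rows, then delete each file."""
    writer = csv.DictWriter(csv_target, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    for path in paths:
        for record in ET.parse(path).getroot():
            writer.writerow({field.tag: (field.text or "") for field in record})
        os.remove(path)
```

Only one piece is parsed at a time, so peak memory depends on `records_per_file` rather than on the size of the original XML.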

@cuducos
Collaborator

cuducos commented Aug 9, 2016

@Irio What kind of issue did you have with AnosAnteriores.xml? Was it an OSError: [Errno 22] Invalid argument on read? That might be a Mac OS X bug, but I have to investigate further…

@Irio
Collaborator Author

Irio commented Aug 9, 2016

@cuducos That's a known issue of Python 3 when opening files too large for the available amount of RAM. Documented solutions are switching to Python 2 or increasing the RAM available.

The issue I have is an exception being raised after twice the available memory is used, stopping the script.

@cuducos
Collaborator

cuducos commented Aug 10, 2016

Done in PR #8: a longer and slower script, but it worked for AnosAnteriores.xml on my computer with 8GB of RAM.

@hyena22
Contributor

hyena22 commented Aug 10, 2016

What do subquota (is it the task?), document value and remark value mean? Also, what is the main quantity we are looking at, i.e. the money spent on each task?

Also, what is the granularity of the data? I assume each record is an individual task that gets reimbursed. Am I correct?

@cuducos
Collaborator

cuducos commented Aug 10, 2016

Following @hyena22's last comment, we might add to this PR a translation (en-US) of this table (the one saved for us at data/datasets_format.html). The translated version could document the translated name of each variable together with an English description. I'm just not sure where to add this file… any ideas?

@cuducos
Collaborator

cuducos commented Aug 10, 2016

And answering @hyena22's questions:

  • subquota is the kind (type? category?) of expense a record refers to (transport, lodging etc.); take a look at subquota_description in develop/2016-08-08-im-translate-dataset.ipynb;
  • document_value is the main value of the expense a record refers to;
  • remark_value: I double checked the translation because I didn't even know the term in Portuguese (i.e. glosa); it looks like sometimes there can be a difference between the estimate of an expense and the actual cost, so an extra payment might be needed. Actually, we should confirm this with an expert (any ideas who'd know that? cc @Irio @cabral).

Regarding the main quantity we're looking at, I think it depends. The main value itself is document_value, but I think we might triangulate that with more contextual data to have something meaningful.

@Irio
Collaborator Author

Irio commented Aug 10, 2016

@cuducos

PT and EN versions of a Markdown file, containing a definition list or table with the contents of data/datasets_format.html (which we would then no longer keep), would do the job of explaining. I will work on them and include them in this branch.

On your question about remark_value and the cost attributes, vlrLiquido/net_value seems to be the value we want to spend more time analyzing, since it is the value the politician gets reimbursed[1]. I just contacted Lúcio Big (from http://ops.net.br/) asking him about these specific attributes.

[1]:

"Its content represents the net value of the fiscal document, or of the document that gave rise to the expense, and is calculated as the difference between the Document Value (Valor do Documento) and the Remark Value (Valor da Glosa). This is the value that will be debited from the deputy's quota. If the debit is of the Telephony type and the value is zero, it means the expense was waived ("franqueada" in the original)."
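
As a worked example of the rule quoted above (net value = document value minus remark value), here is a minimal sketch. The `net_value` function and the amounts are made up for illustration and are not taken from the dataset.

```python
from decimal import Decimal


def net_value(document_value, remark_value):
    """Amount debited from the quota: document value minus remark (glosa) value."""
    return Decimal(document_value) - Decimal(remark_value)


# Hypothetical receipt of 100.00 with 12.50 remarked.
print(net_value("100.00", "12.50"))  # 87.50
```

`Decimal` is used instead of `float` so that currency amounts are not affected by binary rounding.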

@cuducos
Collaborator

cuducos commented Aug 10, 2016

Many thanks, @Irio! Awesome.

In our version of this table we could link/include Lúcio's remark on remark (yes, pun intended).

I can work on this table or help you with it. Let me know if you want me to take over, or how I can give you a hand ; )

@Irio
Collaborator Author

Irio commented Aug 10, 2016

@cuducos It would be a great help if you could create a script for generating these Markdown files from the HTML already in data/. This would give me time to wrap up the pull request with scripts for setting up contributors' environments with the needed datasets, already translated and updated.

@cuducos
Collaborator

cuducos commented Aug 10, 2016

Sure thing. I just opened issue #9 for that ; )

@Irio
Collaborator Author

Irio commented Aug 11, 2016

Do you plan to merge your branch into this one? Otherwise I'm already satisfied with the current status and would ask for final review before merge. @cuducos

@cuducos
Collaborator

cuducos commented Aug 12, 2016

My branch can be merged later, don't worry about it. I can't promise a code review before Monday though…

@hyena22 hyena22 merged commit 6263f1f into master Aug 17, 2016
@hyena22 hyena22 deleted the im-translate-dataset branch August 17, 2016 21:39
Irio pushed a commit that referenced this pull request Feb 27, 2018