Looking for corruption on the Federal Budget #67

franklinbaldo · 2016-09-15T11:49:10Z

The Brazilian Constitution allows each congressperson to allocate a portion of the federal budget to a specific purpose. The problem is that the law also allows the congressperson to name the institution (NGO, association, foundation, public agency) that will receive the money. This creates a major risk of embezzlement if the money goes to entities controlled by that same congressperson.

The federal government publishes the list of entities that received funds in this way. The list shows which entity received the money, what it was supposed to do with it, and which congressperson authored the amendment. Address: http://portal.convenios.gov.br/images/docs/CGSIS/csv/siconv_emenda.csv.zip

We could build a tool to check the reputation of these entities. That information would point to amendments with a high risk of corruption.

We can get reputation information from various sources: protests over civil debts (searching by CNPJ on sites such as http://www.ieptb.com.br/), lawsuits against the entity (court websites, JusBrasil), civil and criminal actions against its directors (court websites), donations from its directors to the campaign of the congressperson or of the party (TSE), and others.

cuducos · 2016-09-15T12:51:32Z

Awesome, @franklinbaldo!

IMHO the steps to bring this into our project would be something like these:

- Write a script that downloads the CSV, translates the headers to English and saves it to data/ in the .xz compressed format (e.g. src/fetch_datasets.py)
- If there is documentation for these variables/headers clarifying the meaning of each one, maybe write a script that generates a translated version of it and saves it into data/ (e.g. src/translate_datasets.py)
- Edit src/fetch_cnpj_info.py to also fetch the CNPJs from this new dataset
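
The first step above could be sketched roughly like this. It is only a minimal sketch using the Python standard library: the header names in HEADERS, the ";" delimiter and the UTF-8 encoding are assumptions that would need to be checked against the real file, and the function names are hypothetical.

```python
import csv
import io
import lzma
import urllib.request
import zipfile

# Hypothetical Portuguese-to-English mapping; the real column names must be
# taken from the downloaded file.
HEADERS = {"NR_EMENDA": "amendment_number", "NOME_PARLAMENTAR": "congressperson_name"}


def download(url):
    """Download the zipped CSV and return its raw bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def translate_headers(csv_text, mapping, delimiter=";"):
    """Return the CSV text with its header row renamed according to mapping."""
    rows = list(csv.reader(io.StringIO(csv_text), delimiter=delimiter))
    if rows:
        # Unknown headers are kept unchanged.
        rows[0] = [mapping.get(name, name) for name in rows[0]]
    output = io.StringIO()
    csv.writer(output, delimiter=delimiter, lineterminator="\n").writerows(rows)
    return output.getvalue()


def zipped_csv_to_xz(zip_bytes, xz_path, mapping, encoding="utf-8"):
    """Read the first file of a zip archive and save it, translated, as .xz."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        csv_text = archive.read(archive.namelist()[0]).decode(encoding)
    with lzma.open(xz_path, "wt", encoding="utf-8") as output:
        output.write(translate_headers(csv_text, mapping))
```

A script such as src/fetch_datasets.py could then call zipped_csv_to_xz(download(url), "data/amendments.xz", HEADERS) with the address from the first comment.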

I think this is a great idea and should be put forward ; ) Do you feel like you need any help?

franklinbaldo · 2016-09-15T14:04:22Z

I'm a lawyer and I know nothing about code, so I can't develop the tool myself.

cuducos · 2016-09-15T14:44:31Z

Worry not, my friend — we know how to handle that. Thanks for the awesome idea : )

augusto-herrmann · 2016-09-23T12:02:01Z

Here's another source for the same data, but in API form (documentation). Unfortunately, information on the politician who made the amendment to the budget is not listed, as it was not available at the time the API was built (i.e. data is updated daily, but the fields made available have not been expanded upon for a long time).

paralelo14 · 2016-10-19T13:04:06Z

Can I use any lib/framework in Python to scrape the data from the suggested websites:

("search by CNPJ on sites such as http://www.ieptb.com.br/), lawsuits against the entity (court websites, JusBrasil), civil and criminal actions against the directors (court websites), donations from directors to the campaign of the congressperson or of the party (TSE), and others.")?

Or is it mandatory to use python-requests plus an HTML parser?

I took a look at (http://portal.convenios.gov.br/images/docs/CGSIS/csv/siconv_emenda.csv.zip) and the file has a number that I think is the CNPJ of the NGO. So I think the first step is to get this number and "translate" it into the name of the NGO, to allow better searches on different search engines.
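
One way to test the guess that the number is a CNPJ is to check its two verification digits, which follow the standard modulo 11 rule. This is a minimal sketch with hypothetical function names; it only validates the format and does not look up the NGO's name, which would still require an external source.

```python
import re


def normalize_cnpj(value):
    """Strip punctuation and return the 14 digits of a CNPJ, or None."""
    digits = re.sub(r"\D", "", str(value))
    return digits if len(digits) == 14 else None


def check_digit(digits, weights):
    """Compute one CNPJ check digit using the modulo 11 rule."""
    remainder = sum(int(d) * w for d, w in zip(digits, weights)) % 11
    return 0 if remainder < 2 else 11 - remainder


def is_valid_cnpj(value):
    """Return True when both check digits match the first 12 digits."""
    digits = normalize_cnpj(value)
    # Sequences made of a single repeated digit are rejected as placeholders.
    if digits is None or len(set(digits)) == 1:
        return False
    first_weights = [5, 4, 3, 2, 9, 8, 7, 6, 5, 4, 3, 2]
    second_weights = [6] + first_weights
    first = check_digit(digits[:12], first_weights)
    second = check_digit(digits[:12] + str(first), second_weights)
    return digits[12:] == f"{first}{second}"
```

Values that fail the check may be CPFs (11 digits) or numbers that lost their leading zeros when the CSV was read as numeric data.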

cuducos · 2016-10-19T13:18:21Z

@anarcoder no need to restrict yourself to a specific lib or framework unless your code ends up being unreproducible by other people. As we don't commit to data/ the scripts from src/ should be able to fetch the data locally. Following that principle, I see no restriction.

We tend to prefer Python as it's already our stack, but I see no problem in using other general purpose and widely available languages that don't require any setup (Ruby, shell script, etc…).

franklinbaldo · 2016-10-25T04:24:38Z

Hi, I am listing other datasets where we can find information about the reputation of NGOs and individuals. They should not appear in any of those lists:

cuducos · 2016-10-25T11:00:31Z

Great list, @franklinbaldo — thanks for that!

And just in case: collecting data on campaign donors is already a topic in #76

…brasil#67 - fetching emendas.csv and saving as data/amendments.xz - translating columns names to english - TODO: download/create columns documentation and fetch beneficiaries info in src/fetch_cnpj_info.py

marcusrehm · 2016-12-13T03:44:06Z

Hi guys, first of all congratulations for the awesome work you're doing here!

I created a script to fetch the emendas.csv file from SICONV. I forked the project and created a branch here. Basically it downloads the dataset and translates the variables. There is also a simple notebook that shows some records.

> If there is documentation for these variables/headers clarifying the meaning of each one, maybe write a script that generates a translated version of it and saves it into data/ (e.g. src/translate_datasets.py)

For this we have a PDF document in Portuguese; I think I can create an HTML document with the translation.

> Edit src/fetch_cnpj_info.py to also fetch the CNPJs from this new dataset

For this one I will need a little help, because it may require some refactoring: the read_csv(name) method has the date part of the reimbursements filename in it. I don't know what impact changing it would have on other scripts.

I will try to work on the APIs from Banco Nacional de Mandados de Prisão do CNJ and Cadastro Nacional de Condenados por Improbidade Administrativa do CNJ, which @franklinbaldo listed earlier. I think we could also use this one to validate the CNPJs in the reimbursements datasets.

cuducos · 2016-12-13T13:02:51Z

Sorry… is @marcusrehm's notebook link working for you? I can't read it:

curl -I https://github.com/datasciencebr/serenata-de-amor/issues/serenata-de-amor/develop/2016-12-12-marcusrehm-amendments.ipynb
HTTP/1.1 406 Not Acceptable

BTW your contribution seems very good, looking forward to reading the notebook ; )

baldoequeiroz · 2016-12-14T01:42:03Z

@cuducos I think the link for this file is this https://github.com/marcusrehm/serenata-de-amor/blob/issue-67/develop/2016-12-12-marcusrehm-amendments.ipynb

Rank of congresspersons and beneficiaries (cnpjs) with highest amounts of amendments and their values.

marcusrehm · 2016-12-14T02:43:44Z

Thanks @baldoequeiroz! @cuducos sorry, I posted the wrong link. The correct one is the one pointed out above.

I did some refactoring in the fetch script and in the notebook as well. It was (and still is) pretty simple; for now it just shows the data I got. I will try to work on the items I listed in my previous comment.

cuducos · 2016-12-14T09:58:04Z

Great notebook, great data collection @marcusrehm! Many thanks for that. I do believe a lot could be done with this data.

Regarding editing src/fetch_cnpj_info.py to also fetch the CNPJ from this new dataset, actually it's quite simple: you can use a function like that to load the newest file independent of the date prefix. So you load all data from *reimbursements.xz, all data from *amendments.xz (gonna have to rename amendment_beneficiary to cnpj maybe) and you're good to go!
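
A helper of that kind might look like the sketch below. It assumes files in data/ are named YYYY-MM-DD-name.xz, so that the ISO date prefix sorts chronologically; the function name newest_dataset is hypothetical and is not the function linked above.

```python
from pathlib import Path

# Glob pattern for an ISO date prefix such as 2016-12-06-
DATE_PREFIX = "[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]-"


def newest_dataset(name, data_dir="data"):
    """Return the most recent <data_dir>/YYYY-MM-DD-<name>.xz, or None."""
    candidates = sorted(Path(data_dir).glob(f"{DATE_PREFIX}{name}.xz"))
    # ISO dates sort lexicographically, so the last path is the newest one.
    return candidates[-1] if candidates else None
```

With it, newest_dataset("reimbursements") and newest_dataset("amendments") would each return the latest file regardless of its date prefix.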

marcusrehm · 2016-12-15T00:31:44Z

Hey @cuducos , I'm already working on that! :)

I referenced issue #167 because I think it would be exactly that, wouldn't it?

cuducos · 2016-12-15T21:00:18Z

> I referenced issue #167 because I think it would be exactly that, wouldn't it?

Considering your comment on #167, I think I was misunderstood; I'm sorry about that. Combining my suggestion in this topic with @Irio's suggestion in that one, the proper usage would be (the dates here are mere placeholders, not real dates in our data/):

$ python src/fetch_cnpj_info.py data/2016-12-06-reimbursements.xz data/2016-12-11-amendments.xz

This would tell fetch_cnpj_info.py to look for CNPJs in both of these files (the reimbursements dataset and the amendments dataset).

This way we can query the full CNPJ data of all these companies, ending up with a more complete YYYY-MM-DD-companies.xz.

Does that make sense?

marcusrehm · 2016-12-16T03:02:06Z

Yes @cuducos, it does make sense! My only concern was that, doing it this way, we need to force the columns holding the CNPJs in all files to be named "cnpj". In doing so we could change (or lose) the meaning of a column in a file, leaving it out of context in the data model. But it's not a big problem; we could address it in the dataset's documentation.

I'm going to make the changes so that fetch_cnpj_info.py works with the fixed cnpj column and receives a list of file names as arguments, ok?

Do you think it would be possible to store the YYYY-MM-DD-companies.xz file in S3 or on GitHub? I'm asking because I've been banned by ReceitaWS for making too many requests. :)

cuducos · 2016-12-16T08:40:44Z

> My only concern was that, doing it this way, we need to force the columns holding the CNPJs in all files to be named "cnpj". In doing so we could change (or lose) the meaning of a column in a file

Good point, but that could be addressed in the code:

# illustrative example: map each dataset to the column holding its CNPJs
cols = {'amendments': 'beneficiary', 'other_dataset': 'something_else'}
base_file_name = 'amendments'  # e.g. taken from 2016-12-11-amendments.xz
cnpj_col = cols.get(base_file_name, 'cnpj')  # falls back to 'cnpj'

cuducos · 2016-12-16T08:42:04Z

> Do you think it would be possible to store the YYYY-MM-DD-companies.xz file in S3 or on GitHub?

It is already. Scripts in the src/ folder or in the serenata-toolbox fetch it from S3 by default.

marcusrehm · 2016-12-18T16:21:46Z

@cuducos I pushed the files with the modifications. Now fetch_cnpj_info works this way:
$ python src/fetch_cnpj_info.py data/2016-12-06-reimbursements.xz data/2016-12-11-amendments.xz

The only thing to consider is the question of which column holds the CNPJs. As we discussed earlier, it uses a dictionary mapping datasets to columns:
datasets_cols = {'reimbursements': 'cnpj_cpf', 'current-year': 'cnpj_cpf', 'last-year': 'cnpj_cpf', 'previous-years': 'cnpj_cpf', 'amendments': 'amendment_beneficiary'}

So when a new dataset is added, one should add an entry to datasets_cols in order to fetch its CNPJs. What do you guys think about this approach?
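
To make the approach concrete, here is a rough sketch of how file names passed on the command line could be mapped to their CNPJ columns. It is not the code in the branch: the function names are hypothetical, and it assumes each .xz file is a comma-separated CSV with a header row.

```python
import argparse
import csv
import lzma
import re
from pathlib import Path

DATASETS_COLS = {
    "reimbursements": "cnpj_cpf",
    "amendments": "amendment_beneficiary",
}


def dataset_name(path):
    """Turn data/2016-12-11-amendments.xz into 'amendments'."""
    return re.sub(r"^\d{4}-\d{2}-\d{2}-", "", Path(path).stem)


def collect_cnpjs(paths, columns=DATASETS_COLS, default="cnpj"):
    """Return the set of 14-digit values found in each file's CNPJ column."""
    found = set()
    for path in paths:
        column = columns.get(dataset_name(path), default)
        with lzma.open(path, "rt", encoding="utf-8", newline="") as handle:
            for row in csv.DictReader(handle):
                digits = re.sub(r"\D", "", row.get(column) or "")
                if len(digits) == 14:  # skips CPFs (11 digits) and blanks
                    found.add(digits)
    return found


def main(argv=None):
    """Print every CNPJ found in the dataset files given as arguments."""
    parser = argparse.ArgumentParser(description="List CNPJs found in datasets")
    parser.add_argument("datasets", nargs="+", help="paths to .xz CSV files")
    args = parser.parse_args(argv)
    for cnpj in sorted(collect_cnpjs(args.datasets)):
        print(cnpj)
```

Keeping the mapping in one dictionary means each dataset keeps its own meaningful column name, at the cost of one extra entry per new dataset.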

marcusrehm · 2016-12-18T16:28:30Z

About the dataset of this issue, I renamed the script to fetch_federal_budget_datasets.py because I think other files would be very interesting too. I added 2 more files: besides amendments (emendas), it now fetches suppliers payments (pagamentos de fornecedores) and agreements (convênios).

I think that with these datasets we can try to correlate the congresspersons who authored the amendments (and their relatives) with beneficiaries and suppliers.

Do you think it would be better to put these files in a specific folder like data/federal_budget?

marcusrehm · 2016-12-19T12:16:11Z

> It is already. Scripts in src/ folder or in the serenata-tool box fetches it from S3 as the default.

Sorry @cuducos ! I was talking about cnpj-info.xz. :)

cuducos · 2016-12-19T12:57:21Z

> I was talking about cnpj-info.xz. :)

No need for that, I guess. The companies dataset has all this info plus geolocation ; ) So go for the companies dataset; cnpj-info is an intermediate step.

marcusrehm · 2016-12-22T12:29:06Z

Hi Guys,

I've made available the scripts and a simple analysis (notebook) of non-profit entities that have agreements in execution which started after the date the entities became impeded, but I would like some help in understanding whether this reasoning is correct.
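
In code, the reasoning amounts to a join on CNPJ followed by a date comparison. The sketch below is not the notebook's code: the field names cnpj, start_date and impeded_on are hypothetical, and rows are plain dictionaries holding datetime.date values.

```python
from datetime import date


def agreements_after_impediment(agreements, impeded):
    """Return agreements that started after their beneficiary became impeded."""
    impeded_since = {}
    for entity in impeded:
        # Keep the earliest impediment date registered for each CNPJ.
        current = impeded_since.get(entity["cnpj"])
        if current is None or entity["impeded_on"] < current:
            impeded_since[entity["cnpj"]] = entity["impeded_on"]
    return [
        agreement
        for agreement in agreements
        if agreement["cnpj"] in impeded_since
        and agreement["start_date"] > impeded_since[agreement["cnpj"]]
    ]
```

Note that this only compares two dates; whether an agreement signed after the impediment date is actually irregular is the legal question raised above.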

@cuducos, @franklinbaldo, when you have time, could you please review it? The notebook is this one. Please note it is a WIP, so the analysis I'm talking about goes up to the notebook section Impeded Non-Profit Entities.

Besides that, I also made available the scripts to fetch datasets related to federal agreements and amendments, and the registers of companies/persons that received federal sanctions and cannot enter into any kind of contract with the government.

Basically it is:

Documentation:
- companies-with-federal-sanctions-datasets.md. This file has a brief explanation of the datasets containing the lists of suspended companies and entities.
- federal-budget-agreements-datasets.md. This one has a brief explanation of the agreements and amendments datasets. It still needs some work; it's also a WIP.
Scripts:
- fetch_cnpj_info.py: Changed so that the files to search for CNPJs are passed as arguments: $ python src/fetch_cnpj_info.py data/2016-12-06-reimbursements.xz data/2016-12-11-amendments.xz . I think it helps with the issue Allow /src scripts to receive data files as command line arguments #167 that @Irio opened.
- fetch_federal_budget_datasets.py: Script to fetch files from the SICONV site. For now it gets only the agreements and amendments files, but it can easily be changed to fetch other files.
- fetch_federal_sanctions.py: This one fetches the sanctions-related files from Portal da Transparência. It creates the datasets:
  - inident-and-suspended-companies: Cadastro de Empresas Inidôneas e Suspensas (CEIS)
  - national-register-punished-companies: Cadastro Nacional de Empresas Punidas (CNEP)
  - impeded-non-profit-entities: Cadastro de Entidades sem Fins Lucrativos Impedidas (CEPIM)

cuducos · 2016-12-30T15:29:55Z

Hi @marcusrehm,

Many thanks for the notebook. It took me a while to go through it because right now we're focused on the CEAP thing. This is the most feasible way to deliver to our Catarse supporters in the following weeks, so this is my priority these days.

However, tackling SICONV is very promising for the next steps of the project, so I reinforce my thank you: you're taking our first step in that direction and that's awesome.

My utterly douchebag comment would be to try to make your code a bit more readable. I'm not a PEP8 radical, but sometimes your code is very difficult for human brains to parse IMHO.

But please… don't let that douchebag part of the feedback get in our way. Your contribution is really good. It looks like interesting material to draw the attention of the press and also to support official reports denouncing these cases.

Your comments in the notebook make it easy for newbies to understand what's going on and to make sense of the data. I skipped the changes in the .py files because that closer look works better in the PR environment, with diff and inline comments.

My only concern at this point is how to organize documentation. Serenata de Amor kind of grew up around CEAP and I'm not sure what's the best way to include documentation of SICONV etc. Maybe we need a more robust way to document what we're doing. For now, maybe a quick fix would be to add .md files and link them in CONTRIBUTING.md.

What do you think about it?

marcusrehm · 2017-01-03T21:32:56Z

Hi @cuducos !

Glad to know that it will help Serenata de Amor with its next steps!

Your comments about the code are indeed relevant; I'll make the adjustments to make it more readable. I think the problem comes from how the notebook is displayed when I run it locally versus when it is viewed on GitHub.

About your concern regarding how to organize documentation and subjects (CEAP and Federal Budget), I had the same feeling while I was working on this issue. I think we could do as you said: create a section in CONTRIBUTING.md for Federal Budget and put a link to federal-budget-agreements-datasets.md. In this section we could have a brief explanation and a disclaimer that it is parallel work (a secondary goal), as CEAP is the main goal right now for Serenata de Amor.

But regardless of the questions above, the 3 datasets with registers of companies that have some kind of issue with the federal government could be used to flag, or add weight to suspicions about, companies that appear in CEAP reimbursements and/or in any other future analysis. Maybe we can create another notebook and cross these datasets with the reimbursements.
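
As a rough illustration of that crossing, the sketch below flags reimbursements whose supplier appears in any of the sanction registers. It assumes each register has already been reduced to a set of digits-only CNPJ strings and that reimbursement rows are dictionaries with a cnpj_cpf key; flag_sanctioned is a hypothetical name.

```python
def flag_sanctioned(reimbursements, registers):
    """Yield (reimbursement, register_names) for suppliers found in a register.

    registers maps a register name (e.g. 'CEIS') to a set of CNPJ strings.
    """
    for row in reimbursements:
        hits = sorted(
            name for name, cnpjs in registers.items() if row["cnpj_cpf"] in cnpjs
        )
        if hits:
            yield row, hits
```

Comparing the reimbursement date with the sanction period would be a natural next filter, since a match alone does not say when the sanction applied.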

Happy New Year for you guys! :)

marcusrehm · 2017-01-22T19:48:15Z

Hi @cuducos! I made the adjustments to the layout of the notebooks and created a small section in CONTRIBUTING.md about Federal Budget. In addition, I created a notebook crossing the data on suspended companies with CEAP's 2016 reimbursements, and I found some interesting cases.

These notebooks are available here and here.

The code is available at issue-67 branch.

cuducos · 2017-01-25T16:41:55Z

It looks really good, many thanks for that.

My last suggestions:

- What about moving your .md files, together with CEAP.md, to a new docs/ directory (and maybe mentioning it in the table in CONTRIBUTING.md)? Does that make sense?
- Can you list your scripts and their descriptions in the Scripts part of CONTRIBUTING.md?
- Can you provide us with the datasets you generated so we can upload them to S3 and add them to the toolbox downloads?
- And finally… send us a PR ; )

marcusrehm · 2017-01-27T12:18:32Z

> What about moving your .md files, together with CEAP.md, to a new docs/ directory (and maybe mentioning it in the table in CONTRIBUTING.md)? Does that make sense?

Yeah, it really makes sense.

> Can you list your scripts and their descriptions in the Scripts part of CONTRIBUTING.md?

Yes! :)

> Can you provide us with the datasets you generated so we can upload them to S3 and add them to the toolbox downloads?

@cuducos I can upload them to the issue-67 branch. Is that ok?

jtemporal · 2017-01-27T12:25:18Z

> @cuducos I can upload them to the issue-67 branch. Is that ok?

@marcusrehm We don't normally commit data. You can upload it to a file transfer service like WeTransfer and we will upload it to AWS so it is available =)

marcusrehm · 2017-01-27T12:38:55Z

Many thanks @jtemporal ! ;)

marcusrehm · 2017-01-27T19:28:11Z

@cuducos , @jtemporal PR #185 created. I'll send the datasets later ok?

marcusrehm · 2017-01-27T22:32:29Z

@cuducos , @jtemporal The datasets are available at WeTransfer. The link to download is https://we.tl/G9I2WV4DGV.

cuducos · 2017-01-30T14:54:44Z

Many thanks, @marcusrehm!

I'm gonna upload the datasets to our S3 and merge your PR soon ; )

All: I'm gonna close this loooong Issue as we have the datasets and an automated way to get updated versions of the data. But this is only the beginning. New ideas on how to use this data in analyses are still welcome. Feel free to open new Issues about these hypotheses and solutions.

Issue #67 - Looking for corruption on the Federal Budget

Fix #67

cuducos added data collection medium labels Oct 20, 2016

marcusrehm added a commit to marcusrehm/serenata-de-amor that referenced this issue Dec 14, 2016

Amendments notebook - issue okfn-brasil#67

73cdf26

Rank of congresspersons and beneficiaries (cnpjs) with highest amounts of amendments and their values.

marcusrehm mentioned this issue Dec 15, 2016

Allow /src scripts to receive data files as command line arguments #167

Closed

marcusrehm mentioned this issue Jan 27, 2017

Issue #67 - Looking for corruption on the Federal Budget #185

Merged

cuducos closed this as completed Jan 30, 2017

cuducos added a commit that referenced this issue Jan 30, 2017

Merge pull request #185 from marcusrehm/issue-67

19f7b60

Issue #67 - Looking for corruption on the Federal Budget

Irio pushed a commit that referenced this issue Feb 27, 2018

Format prices in the UI

b063c88

Fix #67
