Looking for corruption on the Federal Budget #67

Closed
franklinbaldo opened this issue Sep 15, 2016 · 33 comments

Comments

@franklinbaldo

franklinbaldo commented Sep 15, 2016

The Brazilian Constitution allows each member of Congress to allocate a portion of the federal budget to a specific purpose. The problem is that the law also allows the member to name the institution (NGO, association, foundation, public agency) that will receive the money. This creates a major risk of embezzlement if the money goes to entities controlled by that same member of Congress.

The federal government publishes the list of entities that received funds this way. The list shows which entity received the money, what it was supposed to do with it, and which congressperson authored the amendment. Address: http://portal.convenios.gov.br/images/docs/CGSIS/csv/siconv_emenda.csv.zip

We could build a tool to check the reputation of these entities. That information would flag amendments with a higher risk of corruption.

We can get information about reputation from various sources: protests over unpaid debts filed against the CNPJ, lawsuits against the entity (court websites, JusBrasil), criminal actions against its directors (court websites), donations from its directors to the congressperson's campaign (TSE), and others.

Portuguese (translated)

The Brazilian Constitution allows each congressperson to direct part of the federal budget to a specific goal. But there is a problem, because the law also allows the congressperson to name the institution (NGO, association, foundation, public agency) that will receive this money. This creates a great risk of misappropriation if the money is directed to entities controlled by the congressperson himself.

The federal government publishes the list of entities that received funds this way. The list indicates which entity received the money, what it was supposed to do, and which congressperson authored the amendment. Address: http://portal.convenios.gov.br/images/docs/CGSIS/csv/siconv_emenda.csv.zip

We could build a tool that checks the reputation of these entities. This information would point to amendments with a high risk of corruption.

We can obtain the reputation information from several sources: protests over civil debts (search by CNPJ on sites such as http://www.ieptb.com.br/), lawsuits against the entity (court websites, JusBrasil), civil and criminal actions against the directors (court websites), donations from directors to the campaign of the congressperson or the party (TSE), and others.

@cuducos
Collaborator

cuducos commented Sep 15, 2016

Awesome, @franklinbaldo!

IMHO the steps to bring this into our project would be something like these:

  1. Write a script that downloads the CSV, translates the headers to English and saves it to data/ in the .xz compressed format (e.g. src/fetch_datasets.py)
  2. If there is documentation of these variables/headers clarifying the meaning of each variable, maybe write a script that generates a translated version of it and saves it into data/ (e.g. src/translate_datasets.py)
  3. Edit src/fetch_cnpj_info.py to also fetch the CNPJs from this new dataset
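
A minimal sketch of step 1, using only the Python standard library. The header mapping is hypothetical (the real column names in siconv_emenda.csv may differ), and the semicolon delimiter and UTF-8 encoding are assumptions too; the function names are made up for illustration and are not the project's actual script.

```python
import csv
import io
import lzma
import urllib.request
import zipfile

# URL quoted in the issue description.
AMENDMENTS_URL = ('http://portal.convenios.gov.br/images/docs/CGSIS/csv/'
                  'siconv_emenda.csv.zip')

# Hypothetical Portuguese-to-English mapping; the real headers may differ.
HEADER_TRANSLATION = {
    'NR_EMENDA': 'amendment_number',
    'NOME_PARLAMENTAR': 'congressperson_name',
    'BENEFICIARIO_EMENDA': 'amendment_beneficiary',
    'VALOR_REPASSE_EMENDA': 'amendment_value',
}


def download(url):
    """Return the raw bytes served at url."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def read_first_csv_from_zip(zip_bytes, encoding='utf-8'):
    """Decode the first file of an in-memory ZIP archive as text."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        first_name = archive.namelist()[0]
        return archive.read(first_name).decode(encoding)


def translate_headers(csv_text, mapping, delimiter=';'):
    """Return comma-separated CSV text with the header row renamed."""
    rows = list(csv.reader(io.StringIO(csv_text), delimiter=delimiter))
    if not rows:
        return ''
    # Headers missing from the mapping are kept unchanged.
    rows[0] = [mapping.get(name, name) for name in rows[0]]
    output = io.StringIO()
    csv.writer(output, lineterminator='\n').writerows(rows)
    return output.getvalue()


def save_as_xz(csv_text, path):
    """Write CSV text to an LZMA-compressed (.xz) file."""
    with lzma.open(path, 'wt', encoding='utf-8') as handle:
        handle.write(csv_text)
```

Under these assumptions, the whole step would be `save_as_xz(translate_headers(read_first_csv_from_zip(download(AMENDMENTS_URL)), HEADER_TRANSLATION), 'data/amendments.xz')`.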

I think this is a great idea and should be put forward ; ) Do you feel like you need any help?

@franklinbaldo
Author

I'm a lawyer. I know nothing about code, so I can't develop the tool.

@cuducos
Collaborator

cuducos commented Sep 15, 2016

Worry not, my friend — we know how to handle that. Thanks for the awesome idea : )

@augusto-herrmann

Here's another source for the same data, but in API form (documentation). Unfortunately, information on the politician who made the amendment to the budget is not listed, as it was not available at the time the API was built (i.e. data is updated daily, but the fields made available have not been expanded upon for a long time).

@paralelo14

Can I use any lib/framework in Python to scrape the data from the suggested websites:

("search by CNPJ on sites such as http://www.ieptb.com.br/), lawsuits against the entity (court websites, JusBrasil), civil and criminal actions against the directors (court websites), donations from directors to the campaign of the congressperson or the party (TSE), and others.")?

Or is it mandatory to use python-requests plus an HTML parser?

I took a look at http://portal.convenios.gov.br/images/docs/CGSIS/csv/siconv_emenda.csv.zip and the file has a number that I think is the CNPJ of the NGO. So I think the first step is to take this number and "translate" it into the name of the NGO, to get better results on different search engines.
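
One way to check whether a number in the file really is a CNPJ is to verify its two check digits before searching for it. The sketch below assumes the classic 14-digit numeric CNPJ format and the standard modulo-11 check-digit rule; the function names are made up for illustration.

```python
import re


def only_digits(value):
    """Strip punctuation such as dots, slashes and dashes."""
    return re.sub(r'\D', '', str(value))


def _check_digit(digits, weights):
    """Modulo-11 check digit for a string of digits and matching weights."""
    total = sum(int(digit) * weight for digit, weight in zip(digits, weights))
    remainder = total % 11
    return 0 if remainder < 2 else 11 - remainder


def is_valid_cnpj(value):
    """Return True if value has 14 digits and consistent check digits."""
    digits = only_digits(value)
    # Sequences of one repeated digit are rejected by convention.
    if len(digits) != 14 or len(set(digits)) == 1:
        return False
    first_weights = [5, 4, 3, 2, 9, 8, 7, 6, 5, 4, 3, 2]
    second_weights = [6] + first_weights
    first = _check_digit(digits[:12], first_weights)
    second = _check_digit(digits[:13], second_weights)
    return digits[12] == str(first) and digits[13] == str(second)
```

For example, `is_valid_cnpj('11.222.333/0001-81')` passes the check, while changing the last digit makes it fail. Passing the check only means the number is well formed, not that the entity exists.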

@cuducos
Collaborator

cuducos commented Oct 19, 2016

@anarcoder no need to restrict yourself to a specific lib or framework unless your code ends up being unreproducible by other people. As we don't commit to data/ the scripts from src/ should be able to fetch the data locally. Following that principle, I see no restriction.

We tend to prefer Python as it's already our stack, but I see no problem in using other general purpose and widely available languages that don't require any setup (Ruby, shell script, etc…).

@franklinbaldo
Author

Hi, I am listing other datasets where we can find information about the reputation of NGOs and individuals. They should not appear in any of those lists:

@cuducos
Collaborator

cuducos commented Oct 25, 2016

Great list, @franklinbaldo — thanks for that!

And just in case: collecting data on campaign donors is already a topic in #76

marcusrehm added a commit to marcusrehm/serenata-de-amor that referenced this issue Dec 13, 2016
…brasil#67

- fetching emendas.csv and saving as data/amendments.xz
- translating columns names to english
- TODO: download/create columns documentation and fetch beneficiaries info in src/fetch_cnpj_info.py
@marcusrehm
Contributor

Hi guys, first of all congratulations for the awesome work you're doing here!

I created a script to fetch the emendas.csv file from SICONV. I forked the project and created a branch here. Basically, it downloads the dataset and translates the variables. There is also a simple notebook showing some records.

If there is documentation of these variables/headers clarifying the meaning of each variable, maybe write a script that generates a translated version of it and saves it into data/ (e.g. src/translate_datasets.py)

For this one we have a PDF document in Portuguese. I think I can create an HTML document with the translation.

Edit src/fetch_cnpj_info.py to also fetch the CNPJs from this new dataset

For this one I will need a little help, because it may require some refactoring: the read_csv(name) method has the date part of the reimbursements filename in it, and I don't know what impact changing it would have on other scripts.

I will try to work on the APIs from Banco Nacional de Mandados de Prisão do CNJ and Cadastro Nacional de Condenados por Improbidade Administrativa do CNJ, as @franklinbaldo listed earlier. I think we could also use this one to validate the CNPJs in the reimbursements datasets.

@cuducos
Collaborator

cuducos commented Dec 13, 2016

Sorry… is @marcusrehm's notebook link working for you? I can't read it:

curl -I https://github.com/datasciencebr/serenata-de-amor/issues/serenata-de-amor/develop/2016-12-12-marcusrehm-amendments.ipynb
HTTP/1.1 406 Not Acceptable

BTW your contribution seems very good, looking forward to reading the notebook ; )

@baldoequeiroz

marcusrehm added a commit to marcusrehm/serenata-de-amor that referenced this issue Dec 14, 2016
 Rank of  congresspersons and beneficiaries (cnpjs) with highest amounts of amendments and their values.
@marcusrehm
Contributor

marcusrehm commented Dec 14, 2016

Thanks @baldoequeiroz! @cuducos sorry, I posted the wrong link. The correct one is the one you pointed to.

I did some refactoring in the fetch script and in the notebook as well. It was (and still is) pretty simple; for now it just shows the data I got. I will try to work on the items I listed in my previous comment.

@cuducos
Collaborator

cuducos commented Dec 14, 2016

Great notebook, great data collection @marcusrehm! Many thanks for that. I do believe a lot could be done with this data.

Regarding editing src/fetch_cnpj_info.py to also fetch the CNPJs from this new dataset, it's actually quite simple: you can use a function like that to load the newest file regardless of the date prefix. So you load all data from *reimbursements.xz and all data from *amendments.xz (you may have to rename amendment_beneficiary to cnpj) and you're good to go!
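
The function linked in the comment above is not reproduced in this thread. As a rough sketch of the idea, assuming files are named YYYY-MM-DD-name.xz inside data/, something like the following would pick the newest file regardless of the date prefix. The name `find_newest_file` is hypothetical.

```python
import glob
import os


def find_newest_file(name, data_dir='data'):
    """Return the path of the newest YYYY-MM-DD-<name>.xz file, or None."""
    pattern = os.path.join(data_dir, '*-{}.xz'.format(name))
    # ISO dates sort chronologically when sorted as plain strings.
    matches = sorted(glob.glob(pattern), key=os.path.basename)
    return matches[-1] if matches else None
```

With this, `find_newest_file('reimbursements')` and `find_newest_file('amendments')` would give the two files to read, without hard-coding any date.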

@marcusrehm
Contributor

Hey @cuducos , I'm already working on that! :)

I referenced issue #167 because I think it would be exactly that, wouldn't it?

@cuducos
Collaborator

cuducos commented Dec 15, 2016

I referenced issue #167 because I think it would be exactly that, wouldn't it?

Considering your comment on #167, I think I got misunderstood; I'm sorry about that. Combining my suggestion in this topic with @Irio's suggestion in that one, the proper usage would be (the dates here are merely placeholders, not real dates in our data/):

$ python src/fetch_cnpj_info.py data/2016-12-06-reimbursements.xz data/2016-12-11-amendments.xz

This would tell fetch_cnpj_info.py to look for CNPJs in both files (the reimbursements dataset and the amendments dataset).

This way we can query for the full CNPJ data of all these companies ending up with a more complete YYYY-MM-DD-companies.xz.

Does that make sense?
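
A minimal sketch of how a script could accept one or more dataset paths on the command line, as in the usage proposed above. It uses the standard argparse module; the function name and help texts are assumptions, not the actual code of fetch_cnpj_info.py.

```python
import argparse


def parse_args(argv=None):
    """Parse one or more dataset paths from the command line."""
    parser = argparse.ArgumentParser(
        description='Collect CNPJs from one or more .xz datasets.'
    )
    parser.add_argument(
        'datasets',
        nargs='+',  # at least one path is required
        metavar='DATASET',
        help='path to a dataset, e.g. data/2016-12-06-reimbursements.xz',
    )
    return parser.parse_args(argv)
```

Calling `parse_args()` with no arguments reads `sys.argv`, so the script would then loop over `parse_args().datasets` and read each file in turn.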

@marcusrehm
Contributor

Yes @cuducos, it does make sense! My only concern was that, doing it this way, we need to force the columns holding the CNPJs to be named "cnpj" in all files. In doing so we could change (or lose) the meaning of a column in a file, leaving it out of context in the data model. But it's not a big problem; we could address it in the dataset's documentation.

I'm going to change fetch_cnpj_info.py so that it works with a fixed cnpj column and receives a list of file names as arguments, ok?

Do you think it would be possible to store the YYYY-MM-DD-companies.xz file in S3 or on GitHub? I'm asking because I've been banned by ReceitaWS due to the large number of requests. :)

@cuducos
Collaborator

cuducos commented Dec 16, 2016

My concern was just that, doing it this way, we need to force the columns holding the CNPJs in all files to be named "cnpj". In doing so we could change (or lose) the meaning of a certain column in a file

Good point, but that could be addressed in the code:

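# illustrative example: map each dataset to the column holding its CNPJs
cols = {'amendments': 'beneficiary', 'other_dataset': 'something_else'}
base_file_name = 'amendments'  # e.g. derived from the dataset's file name
cnpj_col = cols.get(base_file_name, 'cnpj')  # falls back to 'cnpj'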

@cuducos
Collaborator

cuducos commented Dec 16, 2016

Do you think it would be possible to store the YYYY-MM-DD-companies.xz file in the S3 or in Github?

It already is. Scripts in the src/ folder or in the serenata-toolbox fetch it from S3 by default.

@marcusrehm
Contributor

@cuducos I pushed the files with the modifications. Now fetch_cnpj_info works this way:
$ python src/fetch_cnpj_info.py data/2016-12-06-reimbursements.xz data/2016-12-11-amendments.xz

The only thing left to consider is the question of the columns holding CNPJs. As we discussed earlier, it uses a dictionary mapping datasets to columns:
datasets_cols = {'reimbursements': 'cnpj_cpf', 'current-year': 'cnpj_cpf', 'last-year': 'cnpj_cpf', 'previous-years': 'cnpj_cpf', 'amendments': 'amendment_beneficiary'}

So when a new dataset is added, one should add an entry to datasets_cols in order to fetch its CNPJs. What do you guys think about this approach?
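
To illustrate how such a dictionary could be used with date-prefixed file names, here is a small sketch. The mapping is the one quoted above; the helper names `dataset_name` and `cnpj_column`, and the fallback to a column called `cnpj`, are assumptions based on the earlier suggestion in this thread rather than the script's real code.

```python
import os
import re

# Mapping quoted in the comment above.
DATASETS_COLS = {
    'reimbursements': 'cnpj_cpf',
    'current-year': 'cnpj_cpf',
    'last-year': 'cnpj_cpf',
    'previous-years': 'cnpj_cpf',
    'amendments': 'amendment_beneficiary',
}


def dataset_name(path):
    """Turn 'data/2016-12-11-amendments.xz' into 'amendments'."""
    base = os.path.basename(path)
    base = re.sub(r'^\d{4}-\d{2}-\d{2}-', '', base)  # drop the date prefix
    return os.path.splitext(base)[0]  # drop the extension


def cnpj_column(path, default='cnpj'):
    """Return the name of the column holding CNPJs for a dataset file."""
    return DATASETS_COLS.get(dataset_name(path), default)
```

The trade-off is the one raised in the discussion: columns keep their original, meaningful names in each dataset, at the cost of having to register every new dataset in the mapping.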

@marcusrehm
Contributor

About the dataset of this issue: I renamed the script to fetch_federal_budget_datasets.py because I think other files would be very interesting too. I added 2 more files: besides amendments (emendas), it now fetches supplier payments (pagamentos de fornecedores) and agreements (convênios).

I think that with these datasets we can try to correlate the congresspeople behind the amendments (and their relatives) with beneficiaries and suppliers.

Do you think it would be better to put these files in a specific folder like data/federal_budget?

@marcusrehm
Contributor

It already is. Scripts in the src/ folder or in the serenata-toolbox fetch it from S3 by default.

Sorry @cuducos ! I was talking about cnpj-info.xz. :)

@cuducos
Collaborator

cuducos commented Dec 19, 2016

I was talking about cnpj-info.xz. :)

No need for that, I guess. The companies dataset has all this info plus geolocation ; ) So go for the companies dataset; cnpj-info is an intermediary step.

@marcusrehm
Contributor

Hi Guys,

I've made available the scripts and a simple analysis (notebook) of non-profit entities with agreements in execution that started after the date the entities became impeded, but I would like some help in checking whether this reasoning is correct.

@cuducos, @franklinbaldo when you have time, could you please review it? The notebook is this one. Please note it is a WIP, so the analysis I'm talking about goes up to the notebook section Impeded Non-Profit Entities.

Besides that, I also made available the scripts that fetch datasets related to federal agreements and amendments, and the registers of companies/persons that received federal sanctions and can't sign any kind of contract with the government.
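
The reasoning described above (agreements that started after the entity became impeded) can be sketched in plain Python. This is not the notebook's code; the field names `cnpj`, `start_date` and `impediment_date` are hypothetical, and taking the earliest impediment date per entity is an assumption.

```python
from datetime import date


def agreements_started_after_impediment(agreements, impeded_entities):
    """Return agreements whose start date is later than the date on which
    the same CNPJ became impeded."""
    impeded_since = {}
    for entity in impeded_entities:
        known = impeded_since.get(entity['cnpj'])
        # Assumption: if an entity is listed twice, use the earliest date.
        if known is None or entity['impediment_date'] < known:
            impeded_since[entity['cnpj']] = entity['impediment_date']
    return [
        agreement for agreement in agreements
        if agreement['cnpj'] in impeded_since
        and agreement['start_date'] > impeded_since[agreement['cnpj']]
    ]
```

A stricter version would also check that the impediment had not ended before the agreement started, if the register provides an end date.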

Basically it is:

@cuducos
Collaborator

cuducos commented Dec 30, 2016

Hi @marcusrehm,

Many thanks for the notebook. It took me a while to go through it because right now we're focused on the CEAP work. That is the most feasible way to deliver to our Catarse supporters in the coming weeks, so it is my priority these days.

However, taking on SICONV is very promising for the next steps of the project, so I reinforce my thank you: you're taking our first step in that direction and that's awesome.

My utterly douchebag comment would be to try to make your code a bit more readable. I'm not a PEP8 radical, but sometimes your code is very difficult for human brains to parse, IMHO.

But please… don't let that douchebag part of the feedback get in our way. Your contribution is really good. It looks like interesting material to draw the attention of the press and also to support official reports denouncing these cases.

Your comments in the notebook make it easy for newbies to understand what's going on and to make sense of the data. I skipped the edits in the .py files because that closer look works better in the PR environment, with diff and inline comments.

My only concern at this point is how to organize documentation. Serenata de Amor kind of grew up around CEAP and I'm not sure what's the best way to include documentation of SICONV etc. Maybe we need a more robust way to document what we're doing. For now, maybe a quick fix would be to add .md files and link them in CONTRIBUTING.md.

What do you think about it?

@marcusrehm
Contributor

marcusrehm commented Jan 3, 2017

Hi @cuducos !

Glad to know that it will help Serenata de Amor with its next steps!

Your comments about the code are indeed relevant; I'll make adjustments to make it more readable. I think part of it is the difference between how the notebook displays when I run it locally and how it looks on GitHub.

About your concern regarding how to organize documentation and subjects (CEAP and Federal Budget), I had the same feeling while I was working on this issue. I think we could do as you said: create a section in CONTRIBUTING.md for Federal Budget and link to federal-budget-agreements-datasets.md. In this section we could have a brief explanation and a disclaimer that it is a parallel job (a secondary goal), as CEAP is the main goal of Serenata de Amor right now.

But regardless of the questions above, the 3 datasets with registers of companies that have some kind of issue with the federal government could be used to flag, or strengthen suspicion about, companies that appear in CEAP reimbursements and/or any other future analysis. Maybe we can create another notebook and cross these datasets with the reimbursements.

Happy New Year to you guys! :)

@marcusrehm
Contributor

Hi @cuducos! I made the adjustments related to the layout of the notebooks and created a small section in CONTRIBUTING.md about the Federal Budget. In addition, I created a notebook crossing the data on suspended companies with CEAP's 2016 reimbursements and found some interesting cases.

These notebooks are available here and here.

The code is available at issue-67 branch.
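
The cross-check mentioned in this comment (suspended companies against CEAP reimbursements) boils down to matching identifiers between two datasets. A rough sketch, not the notebook's code: `cnpj_cpf` is the reimbursements column named earlier in this thread, while the `cnpj` key of the suspended-companies rows is an assumption.

```python
import re


def only_digits(value):
    """Normalize an identifier by removing dots, slashes and dashes."""
    return re.sub(r'\D', '', str(value))


def reimbursements_to_suspended(reimbursements, suspended, column='cnpj_cpf'):
    """Return the reimbursements paid to companies in the suspended list."""
    suspended_ids = {only_digits(row['cnpj']) for row in suspended}
    return [
        row for row in reimbursements
        if only_digits(row[column]) in suspended_ids
    ]
```

Normalizing both sides to digits only avoids missing matches when one dataset stores formatted CNPJs and the other stores bare numbers.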

@cuducos
Collaborator

cuducos commented Jan 25, 2017

It looks really good, many thanks for that.

My last suggestions:

  • What about moving your .md files, together with CEAP.md, to a new docs/ directory (and maybe mentioning it in the table in CONTRIBUTING.md)? Does that make sense?
  • Can you list your script and its description in the Scripts part of CONTRIBUTING.md?
  • Can you provide us with datasets you generated so we can upload to S3 and add them to the toolbox downloads?
  • And finally… send us a PR ; )

@marcusrehm
Contributor

What about moving your .md files, together with CEAP.md, to a new docs/ directory (and maybe mentioning it in the table in CONTRIBUTING.md)? Does that make sense?

Yeah, it really makes sense.

Can you list your script and its description in the Scripts part of CONTRIBUTING.md?

Yes! :)

Can you provide us with datasets you generated so we can upload to S3 and add them to the toolbox downloads?

@cuducos I can upload them to the issue-67 branch. Is that ok?

@jtemporal
Collaborator

jtemporal commented Jan 27, 2017

@cuducos I can upload them to the issue-67 branch. Is that ok?

@marcusrehm We don't normally commit data. You can upload it to a file transfer service like WeTransfer and we will upload it to AWS so it is available =)

@marcusrehm
Contributor

Many thanks @jtemporal ! ;)

@marcusrehm
Contributor

@cuducos , @jtemporal PR #185 created. I'll send the datasets later ok?

@marcusrehm
Contributor

@cuducos , @jtemporal The datasets are available at WeTransfer. The link to download is https://we.tl/G9I2WV4DGV.

@cuducos
Collaborator

cuducos commented Jan 30, 2017

Many thanks, @marcusrehm!

I'm gonna upload the datasets to our S3 and merge your PR soon ; )

All: I'm gonna close this loooong Issue as we have the datasets and an automated way to get updated versions of the data. But this is only the beginning. New ideas on how to use this data in analyses are still welcome. Feel free to open new Issues about these hypotheses and solutions.

@cuducos cuducos closed this as completed Jan 30, 2017
cuducos added a commit that referenced this issue Jan 30, 2017
Issue #67 - Looking for corruption on the Federal Budget
Irio pushed a commit that referenced this issue Feb 27, 2018