
Data collection: campaign donors #290

Merged
merged 28 commits into from Dec 1, 2017

Conversation

lacerdamarcelo
Contributor

I would like to ask you to take a look at this code and tell me whether there is anything I should change to follow a given coding standard.

This is a work in progress, but it is almost done. What is left:

  • Delete folders from the extractions of the zip files.
  • Save in an appropriate data folder.

@silviodc

silviodc commented Oct 21, 2017

Hi @lacerdamarcelo

I'm not a Python expert, but I think you could try using a dictionary or a list to improve the readability and reusability of your code.

Let me give you an example: in 2018–2019 we will have another dataset. To include it with the current code, we would have to copy many parts of it; with a dictionary, we would add just one new line.
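A minimal sketch of that idea (`BASE_URL` and `ZIP_BY_YEAR` are hypothetical names; the archive names are the TSE files handled later in this thread):

```python
# Sketch of the dictionary idea: adding a new election year becomes a
# single new entry instead of another copied block of code.
BASE_URL = ('http://agencia.tse.jus.br/estatistica/sead/odsele/'
            'prestacao_contas/')

ZIP_BY_YEAR = {
    2010: 'prestacao_contas_2010.zip',
    2012: 'prestacao_final_2012.zip',
    2014: 'prestacao_final_2014.zip',
    2016: 'prestacao_contas_final_2016.zip',
}

# One place builds every download URL.
urls = {year: BASE_URL + name for year, name in ZIP_BY_YEAR.items()}
```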

@lacerdamarcelo
Contributor Author

Thanks for the suggestion, @silviodc, I will make the changes and commit again.

@cuducos cuducos left a comment

Hi @lacerdamarcelo — many thanks for this PR. I pointed out some improvements I'd like to see in this code before merging it. In general the logic is good, but you can improve the reusability and maintainability of your script, mainly via DRY (don't repeat yourself).

Beyond that, there are several code style issues. I strongly recommend installing prospector (pip install prospector) and using it to check reports on code health… I like it because its reports are usually useful and meaningful to read ; )

import os
import sys
import zipfile
import time
Collaborator

Would you mind following PEP 8 style for imports? First come standard-library modules, then third-party packages, and then local modules, each group separated by a blank line.
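For reference, the grouping PEP 8 describes would look like this (the third-party line is illustrative and commented out so the sketch stays standard-library only):

```python
# Standard-library imports come first…
import os
import sys
import zipfile

# …then third-party packages would follow after a blank line, e.g.:
# import pandas as pd

# …and finally local modules, each group separated by a blank line.
```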

Contributor Author

I am not sure whether I am following PEP 8 in the imports now... could you take another look?

else:  # total size is unknown
    sys.stderr.write("read %d\n" % (readsofar,))

url = 'http://agencia.tse.jus.br/estatistica/sead/odsele/prestacao_contas/prestacao_contas_2010.zip'
Collaborator

Would you mind keeping lines shorter than 80 characters? A clean way to have long strings in Python is:

url = (
    'http://agencia.tse.jus.br/estatistica/sead/odsele/prestacao_contas'
    '/prestacao_contas_2010.zip'
)

Contributor Author

Hi @cuducos, this was the most difficult part. I tried to keep it that way throughout the code, but in many parts readability would be compromised if I strictly followed this rule, which made me use # noqa in some places. This is my opinion, of course, and if you still think I should stick to the 80-character rule, I will do that. Could you take a look at the new code?

Collaborator

That's completely OK ; )

sys.stderr.write("read %d\n" % (readsofar,))

url = 'http://agencia.tse.jus.br/estatistica/sead/odsele/prestacao_contas/prestacao_contas_2010.zip'
file_name = 'prestacao_contas_2010.zip'
Collaborator

Actually you can extract it from the URL with the help of urllib.parse.urlparse and DRY (do not repeat yourself).
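A sketch of that suggestion (`filename_from` is a hypothetical helper name):

```python
import os
from urllib.parse import urlparse


def filename_from(url):
    """Derive the local file name from the download URL itself."""
    return os.path.basename(urlparse(url).path)


url = ('http://agencia.tse.jus.br/estatistica/sead/odsele/prestacao_contas'
       '/prestacao_contas_2010.zip')
file_name = filename_from(url)  # 'prestacao_contas_2010.zip'
```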

print("Uncompressing downloaded data.")
with zipfile.ZipFile('prestacao_contas_2010.zip',"r") as zip_ref:
    zip_ref.extractall("prestacao_contas_2010")
os.remove("prestacao_contas_2010.zip")
Collaborator

You basically repeat this block of code for every year. When you have things like that it's better to isolate this logic as a function and pass only what really changes from year to year.
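A minimal sketch of such a function (`extract_and_clean` is a hypothetical name wrapping exactly the repeated block above):

```python
import os
import zipfile


def extract_and_clean(archive, target_dir):
    """Uncompress one year's archive and remove the zip afterwards.

    Only `archive` and `target_dir` change from year to year, so the
    repeated per-year blocks collapse into calls to this function.
    """
    print("Uncompressing downloaded data.")
    with zipfile.ZipFile(archive, 'r') as zip_ref:
        zip_ref.extractall(target_dir)
    os.remove(archive)
```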

donations_data_candidates_2010 = []
donations_data_parties_2010 = []
donations_data_committees_2010 = []
for root, dirs, files in os.walk("prestacao_contas_2010/", topdown=False):
Collaborator

It's not recommended to hardcode slashes (/) in file paths because the separator changes from OS to OS (Windows, for example, uses \ instead of /). Opt for os.path.join, or os.sep if you really need the separator itself.
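For example (the file name here is illustrative):

```python
import os

# os.path.join inserts the separator appropriate to the current OS,
# keeping the script portable between Linux, macOS and Windows.
path = os.path.join('prestacao_contas_2010', 'receitas_candidatos.txt')
```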

Collaborator

BTW, you should review the whole script to address this issue wherever it appears, ok?


donations_data_candidates_2012_chunks = pd.read_csv('prestacao_final_2012/receitas_candidatos_2012_brasil.txt',
                                                    low_memory=True, encoding = "ISO-8859-1", sep = ';',
                                                    chunksize = 10000)
Collaborator

For this and for following imports, try to keep it readable under 80 chars:

donations_data_candidates_2012_chunks = pd.read_csv(
    'prestacao_final_2012/receitas_candidatos_2012_brasil.txt',
    low_memory=True,
    encoding="ISO-8859-1",
    sep=';',
    chunksize=10000
)

for root, dirs, files in os.walk("prestacao_contas_2010/", topdown=False):
    for name in files:
        if 'Receitas' in name:
            data = pd.read_csv(os.path.join(root, name), low_memory=False, encoding = "ISO-8859-1", sep = ';')
Collaborator

When passing named arguments to a function there is no space around the = (e.g. encoding="ISO-8859-1", sep=';', not encoding = "ISO-8859-1", sep = ';'). Also, try to choose between ' and "; mixing both is not a best practice. Can you recheck the whole script for that?

donations_committees_concatenated.rename(columns={'Tipo doador originário': 'original_donor_type'}, inplace=True)
donations_committees_concatenated.rename(columns={'Tipo receita': 'revenue_type'}, inplace=True)
donations_committees_concatenated.rename(columns={'UF': 'state'}, inplace=True)
donations_committees_concatenated.rename(columns={'Valor receita': 'revenue_value'}, inplace=True)
Collaborator

You could use a dictionary to handle multiple renames at once ; )
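With a translation dictionary, the four calls above collapse into one. A sketch using a minimal DataFrame (the column names are the ones from the snippet above):

```python
import pandas as pd

# One rename call with a translation dictionary replaces the repeated
# single-column renames.
TRANSLATIONS = {
    'Tipo doador originário': 'original_donor_type',
    'Tipo receita': 'revenue_type',
    'UF': 'state',
    'Valor receita': 'revenue_value',
}

donations = pd.DataFrame(columns=list(TRANSLATIONS))
donations.rename(columns=TRANSLATIONS, inplace=True)
```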

Contributor Author

jesus, that was a terrible mistake hahaha fixed!

@cuducos cuducos changed the title Issue 275 - Work in progress Data collection: campaign donors Oct 25, 2017
@lacerdamarcelo
Contributor Author

I made lots of changes to the code trying to make it easier to read. The problem with reusing code in many places is that each year needs slightly different treatment of the data. I tried to stick to PEP 8 as much as I could. Waiting for new feedback.

@anaschwendler anaschwendler left a comment

Hi @lacerdamarcelo, like @cuducos I've left some suggestions to refactor your code.
Most of them are about not repeating code and doing more with fewer lines :)

sys.stderr.write("read %d\n" % (readsofar,))


def folder_walk_2010():
Collaborator

Hi @lacerdamarcelo, I have some refactoring suggestions for your code. I was talking to @cuducos this morning, and one thing we can refactor is here: we have four methods named folder_walk_YYYY(). What do you think about creating only one method and using the year as a parameter? They each differ a bit, but they look almost the same, so we can refactor that :)
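A sketch of that merge (the loop body is elided; each per-year version would fill in its own classification logic):

```python
import os


def folder_walk(year):
    """Hypothetical merge of the four folder_walk_YYYY functions: only
    the directory being walked depends on the year."""
    directory = 'prestacao_contas_{}'.format(year)
    donations = {'candidates': [], 'parties': [], 'committees': []}
    for root, dirs, files in os.walk(directory, topdown=False):
        for name in files:
            # Here each per-year version would read the CSV and append it
            # to the right list, as the original functions do.
            pass
    return donations
```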


donations_data = folder_walk_function()

if 'candidates' in donations_data.keys():
Collaborator

Here we have three if statements that change only the keyword ('candidates', 'parties' and 'committees') but otherwise have almost the same code. Can we do something similar here and use the key as a parameter?
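A possible shape for that (`process_donations` and `handle` are hypothetical names standing in for the repeated per-key logic):

```python
# A single loop over the keys replaces the three near-identical
# if statements; `handle` receives the key and its data.
def process_donations(donations_data, handle):
    for key in ('candidates', 'parties', 'committees'):
        if key in donations_data:
            handle(key, donations_data[key])
```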

base_url = ('http://agencia.tse.jus.br/estatistica/sead/odsele/'
            'prestacao_contas/')
url = base_url + 'prestacao_contas_2010.zip'
donations_data_2010 = download_base(url, folder_walk_2010)
Collaborator

Here too, we don't need all those lines where only the year changes; we can use a single one and change the year as needed :)
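One way to sketch that loop (`fetch_all` is a hypothetical driver; `download_base` and `folder_walk` come from the script and are passed in so the sketch stays self-contained):

```python
# Hypothetical driver: one loop over a year-to-archive mapping replaces
# the per-year copy/paste. `zips_by_year` maps each year to its zip name.
def fetch_all(download_base, folder_walk, zips_by_year, base_url):
    return {year: download_base(base_url + name, folder_walk)
            for year, name in zips_by_year.items()}
```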

@lacerdamarcelo
Contributor Author

Hi @anaschwendler, thanks a lot for your feedback! I've just improved the code with the proposed changes. Waiting for new feedback.

@lacerdamarcelo
Contributor Author

Hey guys, is there anything else I should improve to finish this task?

@anaschwendler anaschwendler left a comment

Sorry for the lateness, @lacerdamarcelo. I left some other tips here; they are mostly about avoiding repetition of code that we can write only once.

                                'receitas_partidos_2012_brasil.txt')
    path_committ = os.path.join('prestacao_final_2012',
                                'receitas_comites_2012_brasil.txt')
elif year == '2014':
Collaborator

Hi @lacerdamarcelo, what do you think about merging the 2012 and 2014 branches? Then you could do something like:

path_candid = os.path.join('prestacao_final_{}'.format(year),
                           'receitas_candidatos_{}_brasil.txt'.format(year))

if __name__ == "__main__":
    base_url = ('http://agencia.tse.jus.br/estatistica/sead/odsele/'
                'prestacao_contas/')
    url = base_url + 'prestacao_contas_2010.zip'
Collaborator

Can we do something like I said before: keep the same structure but format in the years, iterating over a dictionary of years?

@cuducos can explain better here

@cuducos
Collaborator

cuducos commented Nov 16, 2017

Hi @lacerdamarcelo. Many thanks for the improvements. I still think we have room for DRY and minor tweaks that would enhance readability, maintainability and performance. Would you mind if I open a PR against your branch? I think that would be better than async atomic feedback this time.

@lacerdamarcelo
Contributor Author

Hi @cuducos, yes, sure you can!

  • Use 3rd party libs (requests and tqdm) to simplify code and to avoid reinventing the wheel
  • DRY by using a flexible function that always returns a DataFrame, using or not the `chunksize` argument
  • DRY using a base class with the intention to re-use it later
  • DRY using a temporary context manager (later the idea is to re-use the class built for the 2010 directory)
  • Gather the 2010 and 2012–2016 context managers in a common context manager
  • Gather all these operations in a single point of the script
@cuducos
Collaborator

cuducos commented Nov 21, 2017

FYI @anaschwendler, I suggested an architecture to make this script more aligned with the DRY idea and more maintainable: lacerdamarcelo#1

@anaschwendler
Collaborator

Thanks @cuducos and @lacerdamarcelo I'll take a look later :)

@cuducos
Collaborator

cuducos commented Nov 27, 2017

> Thanks @cuducos and @lacerdamarcelo I'll take a look later :)

I think @lacerdamarcelo should first take my suggestions into account in his branch before we're able to move on with this PR. In other words, we depend on Marcelo reviewing my PR, merging it into the branch in his fork, and then the changes will be available here.

@cuducos
Collaborator

cuducos commented Nov 27, 2017

Ok, from now on I cannot be a code reviewer anymore, as my commits intoxicated this PR hahaha… @anaschwendler, now we count on you to scrutinize @lacerdamarcelo's and my code ; )

@lacerdamarcelo
Contributor Author

Hey guys,

I am really sorry for the late reply. I've had a crazy end of semester, which means LOTS of work!

I reviewed all the changes made by @cuducos and I really thank you for all of them. Some of them were things I could have done myself but somehow just didn't. Others involved stuff that was new to me, and I'd like to thank you for these new things I've learned.

Well, I'll still be quite busy this week, but next week I'll be able to give more attention to this PR again =)

Waiting for new requests!

@cuducos
Collaborator

cuducos commented Nov 27, 2017

Just one more comment: we need to document this dataset in the CONTRIBUTING.md section ; )

@anaschwendler
Collaborator

anaschwendler commented Nov 29, 2017

Hi @lacerdamarcelo and @cuducos: as cuducos has already added code to this PR, I will review it, and if it is all good I'll merge it into our project!

  1. Clone the project:
     $ git clone git@github.com:datasciencebr/serenata-de-amor.git
  2. Change to serenata's folder:
     $ cd serenata-de-amor
  3. Change to @lacerdamarcelo's branch:
     $ git checkout -b lacerdamarcelo-issue_275 master
     $ git pull https://github.com/lacerdamarcelo/serenata-de-amor.git issue_275
  4. The steps to run the project:
     $ cp config.ini.example config.ini
     $ docker-compose build
     $ docker-compose up -d
  5. Access the console:
     $ docker-compose run --rm research bash
  6. Run the script:
     $ python work/src/fetch_campaign_donations.py

The result:

jovyan@e58003a214cb:~$ python work/src/fetch_campaign_donations.py
Downloading http://agencia.tse.jus.br/estatistica/sead/odsele/prestacao_contas/prestacao_contas_2010.zip…
110MB [05:12, 353KB/s]                                                                                                                     
Uncompressing prestacao_contas_2010.zip…
Cleaning up source files from 2010…
Downloading http://agencia.tse.jus.br/estatistica/sead/odsele/prestacao_contas/prestacao_final_2012.zip…
863MB [30:06, 616KB/s]                                                                                                                     
Uncompressing prestacao_final_2012.zip…
Killed

The process was killed by my computer, but it seems to be working. @lacerdamarcelo, does it work on your computer?
If it does, just post your output here along with the generated dataset (on WeTransfer or Google Drive), and it looks good to merge :) 🎉

@anaschwendler
Collaborator

anaschwendler commented Nov 29, 2017

I'll try again with the local installation :)

  1. Clone the project:
     $ git clone git@github.com:datasciencebr/serenata-de-amor.git
  2. Change to serenata's folder:
     $ cd serenata-de-amor
  3. Change to @lacerdamarcelo's branch:
     $ git checkout -b lacerdamarcelo-issue_275 master
     $ git pull https://github.com/lacerdamarcelo/serenata-de-amor.git issue_275
  4. Run the project locally:
     $ conda update conda
     $ conda create --name serenata_de_amor python=3
     $ source activate serenata_de_amor
     $ ./setup
  5. Run the script:
     $ python research/src/fetch_campaign_donations.py

The result:

Downloading http://agencia.tse.jus.br/estatistica/sead/odsele/prestacao_contas/prestacao_contas_2010.zip…
110MB [04:06, 447KB/s]
Uncompressing prestacao_contas_2010.zip…
Cleaning up source files from 2010…
Downloading http://agencia.tse.jus.br/estatistica/sead/odsele/prestacao_contas/prestacao_final_2012.zip…
863MB [41:19, 297KB/s]
Uncompressing prestacao_final_2012.zip…
Cleaning up source files from 2012…
Downloading http://agencia.tse.jus.br/estatistica/sead/odsele/prestacao_contas/prestacao_final_2014.zip…
284MB [10:05, 469KB/s]
Uncompressing prestacao_final_2014.zip…
Cleaning up source files from 2014…
Downloading http://agencia.tse.jus.br/estatistica/sead/odsele/prestacao_contas/prestacao_contas_final_2016.zip…
1.08GB [35:33, 508KB/s]
Uncompressing prestacao_contas_final_2016.zip…
Cleaning up source files from 2016…
Traceback (most recent call last):
  File "research/src/fetch_campaign_donations.py", line 226, in <module>
    by_year = tuple(fetch_data_from(year) for year in YEARS)
  File "research/src/fetch_campaign_donations.py", line 226, in <genexpr>
    by_year = tuple(fetch_data_from(year) for year in YEARS)
  File "research/src/fetch_campaign_donations.py", line 222, in fetch_data_from
    return donation.data
  File "research/src/fetch_campaign_donations.py", line 185, in data
    data = self._data()
  File "research/src/fetch_campaign_donations.py", line 178, in _data
    for key, path in zip(KEYS, paths)
  File "research/src/fetch_campaign_donations.py", line 176, in <dictcomp>
    return {
  File "research/src/fetch_campaign_donations.py", line 175, in <genexpr>
    paths = (os.path.join(self.directory, filename) for filename in files)
  File "/Users/anaschwendler/anaconda3/envs/serenata_de_amor/lib/python3.6/posixpath.py", line 92, in join
    genericpath._check_arg_types('join', a, *p)
  File "/Users/anaschwendler/anaconda3/envs/serenata_de_amor/lib/python3.6/genericpath.py", line 149, in _check_arg_types
    (funcname, s.__class__.__name__)) from None
TypeError: join() argument must be str or bytes, not 'NoneType'

It seems that we have an error with NoneTypes.

Besides that, the next steps are:

  • Document the dataset in `CONTRIBUTING.md`

And after merged:

  • Upload the datasets to Amazon
  • Add the datasets to the serenata-toolbox

@cuducos
Collaborator

cuducos commented Nov 29, 2017

Hi @anaschwendler,

I just sent a fix to @lacerdamarcelo's branch, as soon as he accepts it this will be reflected here.

However I disagree that these are steps for this PR:

> Upload the datasets in amazon
> Add the datasets to the serenata-toolbox

These are tasks to be done after this PR is merged.

Instead I'd add (as mentioned) that @lacerdamarcelo needs to document this new dataset in CONTRIBUTING.md.

@lacerdamarcelo
Contributor Author

Hey guys,

I'll test with docker until this weekend and post here the results and update CONTRIBUTING.md with the new dataset.

Thanks @anaschwendler and @cuducos for testing and fixing the code!

@anaschwendler
Collaborator

> These are tasks to be done after this PR is merged.

You're right @cuducos, I'll update my comment.

> I'll test with docker until this weekend and post here the results and update CONTRIBUTING.md with the new dataset.

Ok! I'll be waiting for your update ;)

@lacerdamarcelo
Contributor Author

@cuducos thanks for updating the CONTRIBUTING.md file =)

Again, I'll run everything on Docker this weekend to see if it works fine.

@anaschwendler
Collaborator

I'll test one last time, but it looks good to go.

About the Docker problem: I tend to think it was my memory limit when using Docker, so I need to raise the limit when running the script in Docker ;)

If it is working well now, I'll add it to the project. I've just opened a serenata-toolbox issue to add the datasets to Amazon.

@lacerdamarcelo
Contributor Author

@anaschwendler you're right, I forgot to mention that. The datasets take A LOT of memory, even though I tried to load them with a "low memory profile". I have 8 GB and it uses almost everything (I need to close all other programs to avoid going into swap). This is probably the problem.

However, isn't that an issue? Couldn't this be a problem for other people using it?

If it's okay to go, great! You guys rock @cuducos and @anaschwendler =D

@anaschwendler
Collaborator

anaschwendler commented Dec 1, 2017

> I have 8 GB and it uses almost everything (I need to close all other programs to avoid going into the swap memory).

I did that: I left my computer running and got back about one hour later.

> However, isn't that an issue? Couldn't this be a problem for others to use?

In my opinion, we can close this part with the script as it is now. It is very useful as it stands, and we will only need to run it once every two years. We can open an issue later to solve that, but it is good enough for now 🎉

Many thanks for your patience with us @lacerdamarcelo \o\

@anaschwendler
Collaborator

Last test works 🎉

If @cuducos gives a 👍 I'll merge the script into the project, and we can see later whether there is a need to refactor more :)

➜  serenata-de-amor git:(lacerdamarcelo-issue_275) python research/src/fetch_campaign_donations.py    
Downloading http://agencia.tse.jus.br/estatistica/sead/odsele/prestacao_contas/prestacao_contas_2010.zip…
110MB [05:21, 343KB/s]                                                                                                                     
Uncompressing prestacao_contas_2010.zip…
Cleaning up source files from 2010…
Downloading http://agencia.tse.jus.br/estatistica/sead/odsele/prestacao_contas/prestacao_final_2012.zip…
863MB [31:10, 372KB/s]
Uncompressing prestacao_final_2012.zip…
Cleaning up source files from 2012…
Downloading http://agencia.tse.jus.br/estatistica/sead/odsele/prestacao_contas/prestacao_final_2014.zip…
284MB [19:17, 245KB/s]                                                                                                                     
Uncompressing prestacao_final_2014.zip…
Cleaning up source files from 2014…
Uncompressing prestacao_contas_final_2016.zip…
Cleaning up source files from 2016…
Saving 2017-12-01-donations-candidates.xz…
Saving 2017-12-01-donations-parties.xz…
Saving 2017-12-01-donations-committees.xz…

@anaschwendler anaschwendler merged commit 51fad8c into okfn-brasil:master Dec 1, 2017
Irio pushed a commit that referenced this pull request Feb 27, 2018