Data collection: campaign donors #290
Conversation
I'm not a Python expert, but I think you could use Python dictionaries or a list to improve the readability and reusability of your code. Let me give you an example: in 2018-2019 we will have another dataset. To include it in the code as it stands, we would have to copy many parts of it; with a dictionary, we would only add a new entry.
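For illustration, a minimal sketch of that idea — all names here are hypothetical, not the script's actual identifiers; the zip file names are the ones the TSE publishes:

```python
# Illustrative names only — not the script's actual identifiers.
BASE_URL = ('http://agencia.tse.jus.br/estatistica/sead/odsele/'
            'prestacao_contas/')

# One entry per election year; a new dataset becomes a single new line.
DATASETS = {
    2010: 'prestacao_contas_2010.zip',
    2012: 'prestacao_final_2012.zip',
    2014: 'prestacao_final_2014.zip',
    2016: 'prestacao_contas_final_2016.zip',
}


def urls():
    """Build the download URL for every year from the mapping above."""
    return {year: BASE_URL + filename for year, filename in DATASETS.items()}
```

Adding 2018 would then be one more line in `DATASETS` instead of a copied block.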
Thanks for the suggestion, @silviodc, I will make the changes and commit again.
Hi @lacerdamarcelo, many thanks for this PR. I pointed out some improvements I'd like to see in this code before merging it. In general the logic is good, but you can improve the reusability and maintainability of your script, mainly through DRY (don't repeat yourself).
Beyond that, there are several code-style issues. I strongly recommend installing prospector (`pip install prospector`) and using it to check code-health reports… I like it because its reports are usually useful and meaningful to read ; )
```python
import os
import sys
import zipfile
import time
```
Would you mind following PEP 8 style for imports? First come standard-library modules, then third-party packages, then local modules, each group separated by a blank line.
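For instance, a minimal illustration of the grouping PEP 8 recommends (the third-party and local groups are shown as comments, since this snippet only uses the standard library):

```python
# 1) Standard-library modules come first, usually alphabetized.
import os
import sys
import zipfile

# 2) Third-party packages follow after a blank line, e.g.:
# import pandas as pd

# 3) Local modules come last, after another blank line.
```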
I am not so sure whether I am now following PEP 8 in the imports… could you take a look again?
```python
else:  # total size is unknown
    sys.stderr.write("read %d\n" % (readsofar,))

url = 'http://agencia.tse.jus.br/estatistica/sead/odsele/prestacao_contas/prestacao_contas_2010.zip'
```
Would you mind keeping lines shorter than 80 chars? A clean way to write a long string in Python is:

```python
url = (
    'http://agencia.tse.jus.br/estatistica/sead/odsele/prestacao_contas'
    '/prestacao_contas_2010.zip'
)
```
Hi @cuducos, this is the most difficult part. I tried to keep the whole code this way, but in many parts the readability would be compromised if I strictly followed this rule, which made me use `# noqa` in some parts of the code. This is my opinion, of course, and if you still think I should stick to the 80-char rule, I will do that. Could you take a look at the new code?
That's completely OK ; )
```python
sys.stderr.write("read %d\n" % (readsofar,))

url = 'http://agencia.tse.jus.br/estatistica/sead/odsele/prestacao_contas/prestacao_contas_2010.zip'
file_name = 'prestacao_contas_2010.zip'
```
Actually, you can extract it from the URL with the help of `urllib.parse.urlparse` and DRY (don't repeat yourself).
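A small sketch of that suggestion (the helper name is illustrative): the standard library can pull the file name out of the URL, so it never has to be typed twice:

```python
import os
from urllib.parse import urlparse


def filename_from(url):
    """Return the last path segment of a URL, e.g. the zip file's name."""
    return os.path.basename(urlparse(url).path)


url = ('http://agencia.tse.jus.br/estatistica/sead/odsele/prestacao_contas'
       '/prestacao_contas_2010.zip')
```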
```python
print("Uncompressing downloaded data.")
with zipfile.ZipFile('prestacao_contas_2010.zip',"r") as zip_ref:
    zip_ref.extractall("prestacao_contas_2010")
os.remove("prestacao_contas_2010.zip")
```
You basically repeat this block of code for every year. When that happens, it's better to isolate the logic in a function and pass only what really changes from year to year.
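A hedged sketch of such a function — the name and the use of the standard library's `urllib.request.urlretrieve` for the download are assumptions for illustration; the real script may fetch the files differently:

```python
import os
import zipfile
from urllib.request import urlretrieve


def download_and_extract(url, keep_zip=False):
    """Download a zip archive and extract it to a folder named after it.

    Hypothetical helper: only the URL changes from year to year.
    """
    file_name = os.path.basename(url)
    directory, _ = os.path.splitext(file_name)
    urlretrieve(url, file_name)  # also accepts file:// URLs
    with zipfile.ZipFile(file_name, 'r') as zip_ref:
        zip_ref.extractall(directory)
    if not keep_zip:
        os.remove(file_name)
    return directory
```

Each per-year block then collapses to one call, e.g. `download_and_extract(url_2010)`.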
```python
donations_data_candidates_2010 = []
donations_data_parties_2010 = []
donations_data_committees_2010 = []
for root, dirs, files in os.walk("prestacao_contas_2010/", topdown=False):
```
It's not recommended to hardcode slashes (`/`) in file paths because the separator changes from OS to OS (Windows, for example, uses `\` instead of `/`). Opt for `os.path.join`, or, if you really need the separator itself, `os.sep`.
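A one-line illustration of the portable form:

```python
import os

# os.path.join picks the right separator for the running OS,
# so the same code works on Linux, macOS and Windows:
path = os.path.join('prestacao_contas_2010', 'receitas_candidatos.txt')
```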
BTW, you should review the whole script to address this issue in its different parts, ok?
```python
donations_data_candidates_2012_chunks = pd.read_csv('prestacao_final_2012/receitas_candidatos_2012_brasil.txt',
                                                    low_memory=True, encoding = "ISO-8859-1", sep = ';',
                                                    chunksize = 10000)
```
For this and the following imports, try to keep it readable and under 80 chars:

```python
donations_data_candidates_2012_chunks = pd.read_csv(
    'prestacao_final_2012/receitas_candidatos_2012_brasil.txt',
    low_memory=True,
    encoding="ISO-8859-1",
    sep=';',
    chunksize=10000
)
```
```python
for root, dirs, files in os.walk("prestacao_contas_2010/", topdown=False):
    for name in files:
        if 'Receitas' in name:
            data = pd.read_csv(os.path.join(root, name), low_memory=False, encoding = "ISO-8859-1", sep = ';')
```
When passing keyword arguments to a function, there is no space around the `=` (e.g. `encoding="ISO-8859-1", sep=';'`, not `encoding = "ISO-8859-1", sep = ';'`). Also, try to choose between `'` and `"`; mixing both is not a best practice. Can you recheck the whole script for that?
```python
donations_committees_concatenated.rename(columns={'Tipo doador originário': 'original_donor_type'}, inplace=True)
donations_committees_concatenated.rename(columns={'Tipo receita': 'revenue_type'}, inplace=True)
donations_committees_concatenated.rename(columns={'UF': 'state'}, inplace=True)
donations_committees_concatenated.rename(columns={'Valor receita': 'revenue_value'}, inplace=True)
```
You could use a dictionary to handle multiple renames at once ; )
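A sketch of that suggestion, assuming pandas: `DataFrame.rename` accepts a dict, so the four calls collapse into one (the empty DataFrame here is just for illustration):

```python
import pandas as pd

# English translations for the original column names, all in one dict:
TRANSLATIONS = {
    'Tipo doador originário': 'original_donor_type',
    'Tipo receita': 'revenue_type',
    'UF': 'state',
    'Valor receita': 'revenue_value',
}

# A toy DataFrame just to show the single-call rename:
donations = pd.DataFrame(columns=list(TRANSLATIONS))
donations = donations.rename(columns=TRANSLATIONS)
```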
jesus, that was a terrible mistake hahaha fixed!
…d data and deleting temp data folders.
I made lots of changes to the code trying to make it easier to read. The problem with reusing code in many parts is that each year needs slightly different treatment of the data. I tried to follow PEP 8 as much as I could. Waiting for new feedback.
Hi @lacerdamarcelo, like @cuducos I've left some suggestions to refactor your code. Most of them are about refactoring to avoid repeating code and do more with fewer lines :)
```python
sys.stderr.write("read %d\n" % (readsofar,))


def folder_walk_2010():
```
Hi @lacerdamarcelo, I have some refactoring suggestions for your code. I was talking to @cuducos this morning, and one thing we can refactor is here: we have four methods `def folder_walk_YYYY()`. What do you think about creating only one method and using the year as a parameter? They may differ a bit, but they look almost the same, so we can refactor that :)
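A hypothetical sketch of that refactoring — the directory naming is simplified here (in the real data the directory prefix varies by year, so it could come from a small year-to-prefix dict):

```python
import os


def folder_walk(year):
    """One walker instead of four folder_walk_YYYY functions.

    Yield every receipts file found under the year's extracted directory
    (assuming a uniform 'prestacao_contas_<year>' naming for illustration).
    """
    directory = 'prestacao_contas_{}'.format(year)
    for root, _, files in os.walk(directory, topdown=False):
        for name in files:
            if 'Receitas' in name:
                yield os.path.join(root, name)
```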
```python
donations_data = folder_walk_function()

if 'candidates' in donations_data.keys():
```
Here we have three `if` statements that change only the keyword (`'candidates'`, `'parties'` and `'committees'`) but share almost the same code. Can we do something similar and use the key as a parameter?
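A minimal sketch of that idea (names are illustrative): iterate over the keys and apply one shared handler instead of three copied `if` blocks:

```python
KEYS = ('candidates', 'parties', 'committees')


def process_each(donations_data, handler):
    """Apply one shared handler to whichever of the three keys is present,
    replacing three near-identical `if` blocks."""
    return {key: handler(donations_data[key])
            for key in KEYS if key in donations_data}
```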
```python
base_url = ('http://agencia.tse.jus.br/estatistica/sead/odsele/'
            'prestacao_contas/')
url = base_url + 'prestacao_contas_2010.zip'
donations_data_2010 = download_base(url, folder_walk_2010)
```
Here too: we don't need all those lines that differ only in the year; we can write one and change the year as needed :)
Hi @anaschwendler, thanks a lot for your feedback! I just improved the code with the proposed changes. Waiting for new feedback.
Hey guys, is there anything else I should improve to finish this task?
Sorry for the lateness, @lacerdamarcelo. I left some other tips here, but they are all about avoiding code repetition so that we write things only once.
```python
                                'receitas_partidos_2012_brasil.txt')
    path_committ = os.path.join('prestacao_final_2012',
                                'receitas_comites_2012_brasil.txt')
elif year == '2014':
```
Hi @lacerdamarcelo, what do you think about handling `2012` and `2014` together? Then you can do something like:

```python
path_candid = os.path.join('prestacao_final_{}'.format(year),
                           'receitas_candidatos_{}_brasil.txt'.format(year))
```
```python
if __name__ == "__main__":
    base_url = ('http://agencia.tse.jus.br/estatistica/sead/odsele/'
                'prestacao_contas/')
    url = base_url + 'prestacao_contas_2010.zip'
```
Can we do something like I said before: use the same structure but `format` in the years, iterating over a dictionary keyed by year? @cuducos can explain it better here.
Hi @lacerdamarcelo. Many thanks for the improvements. I still think we have room for DRY and minor tweaks that would enhance readability, maintainability and performance. Would you mind if I open a PR to your branch? I think that would be better than async atomic feedback this time.
Hi @cuducos, yes, sure you can!
…a-de-amor into lacerdamarcelo-issue_275
- Use 3rd-party libs (requests and tqdm) to simplify the code and avoid reinventing the wheel
- DRY with a flexible function that always returns a DataFrame, with or without the `chunksize` argument
- DRY with a base class intended for later reuse
- DRY with a temporary context manager (the idea is to later reuse the class built for the 2010 directory)
- Gather the 2010 and 2012-2016 context managers in a common context manager
- Gather all these operations at a single point of the script
FYI @anaschwendler, I suggested an architecture to make this script more aligned with the DRY idea and more maintainable: lacerdamarcelo#1
Thanks @cuducos and @lacerdamarcelo, I'll take a look later :)
I think @lacerdamarcelo should first take my suggestions into account in his branch before we're able to move on with this PR. In other words, we depend on Marcelo reviewing my PR and merging it into his branch in his fork; then the changes will be available here.
WIP: Refactor campaign donation script
Ok, from now on I cannot be a code reviewer anymore, as my commits intoxicated this PR hahaha… @anaschwendler, now we count on you to scrutinize @lacerdamarcelo's and my code ; )
Hey guys, I am really sorry for the late reply. I've had a crazy end of semester, which means LOTS of work! I reviewed all the changes made by @cuducos and I really thank you for them. Some were things I could've done myself but somehow just didn't; others involved new stuff to me, and I'd like to thank you for the new things I've learned. This week I'll still be quite busy, but next week I'll be able to give this PR more attention again =) Waiting for new requests!
Just one more comment: we need to document this dataset in …
Hi @lacerdamarcelo and @cuducos, as @cuducos has already added code to this PR, I will review it, and if it is all good I'll merge it into our project!
```shell
$ git clone git@github.com:datasciencebr/serenata-de-amor.git
$ cd serenata-de-amor
$ git checkout -b lacerdamarcelo-issue_275 master
$ git pull https://github.com/lacerdamarcelo/serenata-de-amor.git issue_275
$ cp config.ini.example config.ini
$ docker-compose build
$ docker-compose up -d
$ docker-compose run --rm research bash
$ python work/src/fetch_campaign_donations.py
```

The result:

```
jovyan@e58003a214cb:~$ python work/src/fetch_campaign_donations.py
Downloading http://agencia.tse.jus.br/estatistica/sead/odsele/prestacao_contas/prestacao_contas_2010.zip…
110MB [05:12, 353KB/s]
Uncompressing prestacao_contas_2010.zip…
Cleaning up source files from 2010…
Downloading http://agencia.tse.jus.br/estatistica/sead/odsele/prestacao_contas/prestacao_final_2012.zip…
863MB [30:06, 616KB/s]
Uncompressing prestacao_final_2012.zip…
Killed
```

The process was killed by my computer, but it seems to be working. @lacerdamarcelo, does it work on your computer?
I'll try again with the local installation :)

```shell
$ git clone git@github.com:datasciencebr/serenata-de-amor.git
$ cd serenata-de-amor
$ git checkout -b lacerdamarcelo-issue_275 master
$ git pull https://github.com/lacerdamarcelo/serenata-de-amor.git issue_275
$ conda update conda
$ conda create --name serenata_de_amor python=3
$ source activate serenata_de_amor
$ ./setup
$ python research/src/fetch_campaign_donations.py
```

The result:

```
Downloading http://agencia.tse.jus.br/estatistica/sead/odsele/prestacao_contas/prestacao_contas_2010.zip…
110MB [04:06, 447KB/s]
Uncompressing prestacao_contas_2010.zip…
Cleaning up source files from 2010…
Downloading http://agencia.tse.jus.br/estatistica/sead/odsele/prestacao_contas/prestacao_final_2012.zip…
863MB [41:19, 297KB/s]
Uncompressing prestacao_final_2012.zip…
Cleaning up source files from 2012…
Downloading http://agencia.tse.jus.br/estatistica/sead/odsele/prestacao_contas/prestacao_final_2014.zip…
284MB [10:05, 469KB/s]
Uncompressing prestacao_final_2014.zip…
Cleaning up source files from 2014…
Downloading http://agencia.tse.jus.br/estatistica/sead/odsele/prestacao_contas/prestacao_contas_final_2016.zip…
1.08GB [35:33, 508KB/s]
Uncompressing prestacao_contas_final_2016.zip…
Cleaning up source files from 2016…
Traceback (most recent call last):
  File "research/src/fetch_campaign_donations.py", line 226, in <module>
    by_year = tuple(fetch_data_from(year) for year in YEARS)
  File "research/src/fetch_campaign_donations.py", line 226, in <genexpr>
    by_year = tuple(fetch_data_from(year) for year in YEARS)
  File "research/src/fetch_campaign_donations.py", line 222, in fetch_data_from
    return donation.data
  File "research/src/fetch_campaign_donations.py", line 185, in data
    data = self._data()
  File "research/src/fetch_campaign_donations.py", line 178, in _data
    for key, path in zip(KEYS, paths)
  File "research/src/fetch_campaign_donations.py", line 176, in <dictcomp>
    return {
  File "research/src/fetch_campaign_donations.py", line 175, in <genexpr>
    paths = (os.path.join(self.directory, filename) for filename in files)
  File "/Users/anaschwendler/anaconda3/envs/serenata_de_amor/lib/python3.6/posixpath.py", line 92, in join
    genericpath._check_arg_types('join', a, *p)
  File "/Users/anaschwendler/anaconda3/envs/serenata_de_amor/lib/python3.6/genericpath.py", line 149, in _check_arg_types
    (funcname, s.__class__.__name__)) from None
TypeError: join() argument must be str or bytes, not 'NoneType'
```

It seems that we have an error with NoneTypes. Besides that, the next steps are:

And after merged:
Hi @anaschwendler, I just sent a fix to @lacerdamarcelo's branch; as soon as he accepts it, it will be reflected here. However, I disagree that these are steps for this PR: these are tasks to be done after this PR is merged. Instead I'd add (as mentioned) that @lacerdamarcelo needs to document this new dataset in …
Fix generator for non-existent files
Hey guys, I'll test with Docker by this weekend, post the results here, and update CONTRIBUTING.md with the new dataset. Thanks @anaschwendler and @cuducos for testing and fixing the code!
You're right @cuducos, I'll update my comment.
Ok! I'll be waiting for your update ;)
Fix: no bool value for pd.DaraFrame
@cuducos, thanks for updating the CONTRIBUTING.md file =) Again, I'll run everything on Docker this weekend to see if it works fine.
I'll test one last time, but it looks good to go. About the Docker problem: I tend to think it was the memory limit I set when using Docker, so I need to raise it when running the script there ;) If it is working well now, I'll add it to the project. I just opened a …
@anaschwendler, you're right, I forgot to mention it. The datasets take A LOT of memory, even though I tried to load them with a "low memory profile". I have 8 GB and it uses almost all of it (I need to close all other programs to avoid going into swap). This is probably the problem. However, isn't that an issue? Couldn't this be a problem for other users? If it's okay to go, great! You guys rock @cuducos and @anaschwendler =D
I did that; I left my computer running and came back about one hour later.
In my opinion, we can close this part with the script the way it is now. The script is very useful as it is, and we will only need to run it again in two years. We can open an issue later to refine it, but it is good enough for now 🎉 Many thanks for your patience with us @lacerdamarcelo \o\
Last test works 🎉 If @cuducos gives a 👍 I'll merge the script into the project, and we can see later if there is a need to refactor more :)
I would like to ask you to take a look at this code to see if there are things I should do to follow any given coding standard.
This is a work in progress, but it is almost done. What is left: