created fetch_tse_data #208
Hi @rafonseca awesome contribution! I would recommend a few things here in order to start reviewing this PR =) The Also I would like to point out that inside the So tl;dr for this PR. You'll need to do the following:
Ok, I've got it.
No! you can do your changes here ;)
Awesome contribution @rafonseca, awesome feedback @jtemporal! No need to close the PR. You can work on changes on this branch and new commits are automagically added to this PR when pushed to GitHub ; )
It is indeed magical
Just a quick English review
@@ -8456,8 +8456,8 @@
}
/* Flexible box model classes */
/* Taken from Alex Russell http://infrequently.org/2009/08/css-3-progress/ */
/* This file is a compatibility layer. It allows the usage of flexible box
model layouts across multiple browsers, including older browsers. The newest,
/* This file is a compatability layer. It allows the usage of flexible box
compatibility
It was correct.
/* This file is a compatibility layer. It allows the usage of flexible box
model layouts across multiple browsers, including older browsers. The newest,
/* This file is a compatability layer. It allows the usage of flexible box
model layouts accross multiple browsers, including older browsers. The newest,
across
It was correct.
@@ -11357,7 +11357,7 @@
* Author: Jupyter Development Team
*/
/** WARNING IF YOU ARE EDITTING THIS FILE, if this is a .css file, It has a lot
* of chance of being generated from the ../less/[samename].less file, you can
* of chance of beeing generated from the ../less/[samename].less file, you can
being
It was correct.
Hi again, thank you for the contribution @rafonseca and thank you @guizero for the review. Actually, Irio's notebook shouldn't change.
@rafonseca would you mind changing it back? That would also address @guizero's comments ;)
Hi @jtemporal and @guizero, |
Quick and dirty way: find Irio's notebook here on GitHub, click Raw, download and replace your local copy with it. Add a commit saying something like Reverting Irio's notebook original version. Classy way: play with |
Thanks for the tip @cuducos . Finally, I did a checkout on that file using the SHA. Not so dirty, not so classy. Inevitably, I will make further errors, so I will have the opportunity to try the classy way :) |
Ok, looks way better now ; ) yay!
I added a few more comments inline. Besides that, I wouldn't add develop/2017-03-31-rafonseca-fetch_tse_data.ipynb because it actually changes files in the data/ directory, which is definitely not a best practice. I do understand it was useful for you to study and get to src/fetch_tse_data.py, but I'm not convinced it is relevant to have this code as a notebook.
Also I encourage you to share with us (a link here, a file via Telegram, whatever) a version of the dataset so we can upload it to our servers. People would download it as they download all other datasets ; )
src/fetch_tse_data.py
Outdated
FILENAME_PREFIX='consulta_cand_'
TEMP_PATH = '../data/tse_temp'
This might fail on Windows (there the path separator is \ rather than /). We recommend using os.path.join to build file and directory paths that work in both worlds.
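A minimal sketch of what the os.path.join suggestion could look like for the constants in this diff. The DATA_PATH name is an assumption introduced only for illustration; the file and directory names are the ones already in the script.

```python
import os

# DATA_PATH is a hypothetical helper constant, not part of the reviewed script.
# os.pardir is '..' and os.path.join picks the separator of the running OS.
DATA_PATH = os.path.join(os.pardir, 'data')

TEMP_PATH = os.path.join(DATA_PATH, 'tse_temp')
OUTPUT_DATASET_PATH = os.path.join(DATA_PATH, '2017-03-31-tse-candidates.xz')
```

Building every path from one base constant also means the data directory only has to be changed in a single place.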
src/fetch_tse_data.py
Outdated
FILENAME_PREFIX='consulta_cand_'
TEMP_PATH = '../data/tse_temp'
TSE_CANDIDATES_URL='http://agencia.tse.jus.br/estatistica/sead/odsele/consulta_cand/'
OUTPUT_DATASET_PATH = '../data/2017-03-31-tse-candidates.xz'
Same as line 12.
src/fetch_tse_data.py
Outdated
TEMP_PATH = '../data/tse_temp'
TSE_CANDIDATES_URL='http://agencia.tse.jus.br/estatistica/sead/odsele/consulta_cand/'
OUTPUT_DATASET_PATH = '../data/2017-03-31-tse-candidates.xz'
os.makedirs(TEMP_PATH)
We could use tempfile to manage temporary directories, couldn't we?
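A rough sketch of the tempfile idea, assuming the standard library's TemporaryDirectory context manager is acceptable here. The stage_files function is hypothetical and writes empty placeholder files where the real script would download the zips.

```python
import os
from tempfile import TemporaryDirectory

FILENAME_PREFIX = 'consulta_cand_'


def stage_files(years):
    """Write one placeholder file per year into a temporary directory.

    Returns the temporary path and the sorted file names found in it.
    The directory is already deleted when the function returns.
    """
    with TemporaryDirectory() as temp_path:
        for year in years:
            filename = '{}{}.zip'.format(FILENAME_PREFIX, year)
            # The real script would download the zip here instead
            with open(os.path.join(temp_path, filename), 'wb') as fobj:
                fobj.write(b'')
        staged = sorted(os.listdir(temp_path))
    return temp_path, staged
```

With this approach there is no need for os.makedirs(TEMP_PATH) or for manual cleanup, and nothing temporary ends up inside data/.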
src/fetch_tse_data.py
Outdated
OUTPUT_DATASET_PATH = '../data/2017-03-31-tse-candidates.xz'
os.makedirs(TEMP_PATH)

# setting year range from 2004 to 2016. this will be modified further to 'from 1994 to 2016'
Why do we generate years from 2004 to 2016 and, later, transform them? Wouldn't it be worth starting from 1994 right now?
Well, TSE says that the data from 1994 to 2002 is not consistent, and they are working on this. I can try to get that data, but ...
Wow… no, no… now I got it. Just leave code as it is, but maybe clarify it in comments…
src/fetch_tse_data.py
Outdated
os.makedirs(TEMP_PATH)

# setting year range from 2004 to 2016. this will be modified further to 'from 1994 to 2016'
year_list=[str(year) for year in (range(2004,2017,2))]
Can we use date.today().year instead of a hardcoded 2017?
Yes, we can. It is more elegant, but the headers are hardcoded and there is a high probability that the next election dataset will have different headers from all of these...
Damn… that's not good news. That's something that might be helpful to have documented (in comments) to future-me and future-you ; )
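One possible way to combine both points from this thread, sketched under assumptions: the year range uses date.today().year but is capped at the last year whose headers are hardcoded, and the reasons are documented in comments. The names FIRST_YEAR, LAST_YEAR_WITH_KNOWN_HEADERS and election_years are hypothetical.

```python
from datetime import date

# TSE says data from 1994 to 2002 is not consistent yet, so we start at 2004.
FIRST_YEAR = 2004

# The column headers are hardcoded in the script, and newer datasets will
# probably use different headers, so we do not go beyond this year.
LAST_YEAR_WITH_KNOWN_HEADERS = 2016


def election_years(today=None):
    """Return election years (every two years) as strings.

    The range goes from FIRST_YEAR up to the current year, capped at the
    last year whose headers are known.
    """
    today = today or date.today()
    last_year = min(today.year, LAST_YEAR_WITH_KNOWN_HEADERS)
    return [str(year) for year in range(FIRST_YEAR, last_year + 1, 2)]
```

When headers for a newer election are added, only LAST_YEAR_WITH_KNOWN_HEADERS would need to change.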
src/fetch_tse_data.py
Outdated
# Download files
for year in year_list:
    filename=FILENAME_PREFIX+year+'.zip'
Using format is recommended when concatenating more than 2 strings: '{}{}.zip'.format(FILENAME_PREFIX, year).
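A small sketch of the format suggestion applied to both the file name and the download URL. The zip_name and zip_url helpers are hypothetical names used only for illustration; the constants are the ones from the script.

```python
FILENAME_PREFIX = 'consulta_cand_'
TSE_CANDIDATES_URL = 'http://agencia.tse.jus.br/estatistica/sead/odsele/consulta_cand/'


def zip_name(year):
    """File name of the TSE archive for a given year (int or str)."""
    return '{}{}.zip'.format(FILENAME_PREFIX, year)


def zip_url(year):
    """Full download URL of the TSE archive for a given year."""
    return '{}{}'.format(TSE_CANDIDATES_URL, zip_name(year))
```

A side benefit of format is that year no longer has to be converted to str beforehand.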
src/fetch_tse_data.py
Outdated
for year in year_list:
    filename=FILENAME_PREFIX+year+'.zip'
    file_url=TSE_CANDIDATES_URL+filename
    output_file=os.path.join(TEMP_PATH,filename)
Some Python best practices are to use spaces between the object being defined and its value, and also after commas (e.g. output_file = os.path.join(TEMP_PATH, filename) instead of output_file=os.path.join(TEMP_PATH,filename)). Check PEP8 or prospector if you are interested in these hints on code quality ; )
Very reasonable!
src/fetch_tse_data.py
Outdated
# ### Adding the headers
# The following headers were extracted from LEIAME.pdf in consulta_cand_2016.zip.
header_consulta_cand_till2010=[
Considering this headers section:
- We don't need the # with no comments at the end of the lines
- Also I don't think we need the headers you skip with the #
- And we need to translate the headers to English (example)
src/fetch_tse_data.py
Outdated
cand_df.index=cand_df.reset_index().index # this index contains no useful information

# Exporting data
cand_df.to_csv(OUTPUT_DATASET_PATH,encoding='iso-8859-1',compression='xz',header=True,index=False)
Can we use utf-8? (same for lines 240-244).
Probably yes for exporting and no for importing. I will check again
Yep, sure thing. My bad. The idea was to use UTF-8 for exporting, sorry ; )
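To make the agreement concrete (ISO-8859-1 when reading the TSE files, UTF-8 when exporting), here is a minimal sketch that uses only the standard library instead of pandas. The reencode_to_utf8_xz function and its arguments are hypothetical.

```python
import lzma


def reencode_to_utf8_xz(source_path, output_path):
    """Read an ISO-8859-1 text file and write it as UTF-8, xz-compressed."""
    # newline='' keeps the original line endings untouched on both sides
    with open(source_path, encoding='iso-8859-1', newline='') as source:
        with lzma.open(output_path, 'wt', encoding='utf-8', newline='') as output:
            for line in source:
                output.write(line)
```

In the script itself the equivalent change would presumably be to keep encoding='iso-8859-1' in the read_csv calls and pass encoding='utf-8' to to_csv.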
src/fetch_tse_data.py
Outdated
for file_i in files_of_the_year:
    # the following cases do not take into account next elections. hopefully, TSE will add headers to the files
    if ('2014' in file_i) or ('2016' in file_i):
        cand_df_i=(pd.read_csv('./'+file_i,sep=';',header=None,dtype=str,names=header_consulta_cand_from2014,encoding='iso-8859-1'))
Is it recommended to use np.str instead of str, @jtemporal?
Yes, indeed it is. In a few words: <numpy str objects> are different from <python str objects>. To keep the pattern used in the whole project we should stick to using np.str =)
Actually, I used "unicode" in Python 2.7. In Python 3, the standard str replaced unicode. We can use np.str as long as it deals with unicode characters. Does it?
Hi @cuducos. Thanks for the feedback. |
Hum… now I got it. I tend to agree that creating only the first dataset, and the notebook that demonstrated how to get from it to a list of elected politicians is good enough. @jtemporal and @Irio — you're more familiar than I am with data science — what do you think about it? |
I personally would go with having the first dataset on S3 and keeping the notebook that shows "how to get from it to a list of elected politicians".
@cuducos There's a precedent for merging a notebook directly and only related to a |
@Irio AFAIK the notebook you mention does not write to |
@rafonseca Do you need help for making these requested changes? |
Hi @Irio. |
Hello there, |
src/fetch_tse_data.py
Outdated
FILENAME_PREFIX= 'consulta_cand_'
TSE_CANDIDATES_URL= 'http://agencia.tse.jus.br/estatistica/sead/odsele/consulta_cand/'
OUTPUT_DATASET_PATH= os.path.join(os.pardir,'data','2017-03-31-tse-candidates.xz')
It's not a good idea to have this date hardcoded here. Something like this might be useful:
>>> from datetime import date
>>> today = date.today()
>>> today.strftime('%Y-%m-%d')
'2017-05-10'
If the last link got lost, this might be helpful ; )
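Putting the strftime hint together with the path constant, a possible sketch (the output_dataset_path function is a hypothetical name, and the optional today argument exists only to make the behavior easy to check):

```python
import os
from datetime import date


def output_dataset_path(today=None):
    """Build the dataset path with the current date as file name prefix."""
    today = today or date.today()
    filename = '{}-tse-candidates.xz'.format(today.strftime('%Y-%m-%d'))
    return os.path.join(os.pardir, 'data', filename)
```

Calling output_dataset_path() with no arguments uses the day the script runs, so the file name follows the date prefix convention without a hardcoded value.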
No worries — worst case scenario if we need this data ASAP we merge your PR to a secondary branch and work on some tweaks before bringing your code to the master branch…
No problem at all! This is valuable data <3 |
🎉 Many thanks @rafonseca ; ) |
yuhuu!! |
Add more engines to Code Climate
Hello guys.
This is the script that fetches data from the TSE website in order to create a list of Brazilian politicians. I also have a small example notebook that uses this data.
As the script will be code reviewed, I thought the notebook version would be appreciated. So this is a direct export from a notebook. Is it ok?