
created fetch_tse_data #208

Merged
merged 13 commits into from May 10, 2017

Conversation

rafonseca
Contributor

Hello guys.
This is the script that fetches data from the TSE website in order to create a list of Brazilian politicians. I also have a small example notebook that uses this data.
Since the script will be code reviewed, I thought the notebook version would be appreciated, so this is a direct export from a notebook. Is that ok?

@jtemporal
Collaborator

jtemporal commented Mar 31, 2017

Hi @rafonseca, awesome contribution! I would recommend a few things here in order to start reviewing this PR =)

The .py generated from the notebook is awesome, but it should be in the develop/ directory alongside the .ipynb file and the .html file. You can check out here how to configure your Jupyter Notebook to generate all of these files on save (if you haven't done so yet).

Also, I would like to point out that only clean code goes inside src/; you can check out the other Python scripts there if you want ;)

So, tl;dr for this PR, you'll need to do the following:

  • move the .py to the develop/ subdirectory
  • commit both the .html and .ipynb files in the develop/ subdirectory
  • create the clean .py script inside the src/ subdirectory

@rafonseca
Contributor Author

Ok, I've got it.
So, should I close this PR and create another one?

@jtemporal
Collaborator

No! You can make your changes here ;)
That's the workflow: you send things, we suggest some changes, you change them, and the PR goes under review again. You are on the right track, great job!

@cuducos
Collaborator

cuducos commented Mar 31, 2017

Awesome contribution @rafonseca, awesome feedback @jtemporal!

No need to close the PR. You can work on changes on this branch and new commits are automagically added to this PR when pushed to GitHub ; )

@rafonseca
Contributor Author

It is indeed magical

@guizero guizero left a comment

Just a quick English review

@@ -8456,8 +8456,8 @@
}
/* Flexible box model classes */
/* Taken from Alex Russell http://infrequently.org/2009/08/css-3-progress/ */
/* This file is a compatibility layer. It allows the usage of flexible box
model layouts across multiple browsers, including older browsers. The newest,
/* This file is a compatability layer. It allows the usage of flexible box

compatibility was correct.

/* This file is a compatibility layer. It allows the usage of flexible box
model layouts across multiple browsers, including older browsers. The newest,
/* This file is a compatability layer. It allows the usage of flexible box
model layouts accross multiple browsers, including older browsers. The newest,

across was correct.

@@ -11357,7 +11357,7 @@
* Author: Jupyter Development Team
*/
/** WARNING IF YOU ARE EDITTING THIS FILE, if this is a .css file, It has a lot
* of chance of being generated from the ../less/[samename].less file, you can
* of chance of beeing generated from the ../less/[samename].less file, you can

being was correct.

@jtemporal
Collaborator

Hi again, thank you for the contribution @rafonseca, and thank you @guizero for the review.

Actually, Irio's notebook shouldn't change:

  • first, because these weren't changes due to incorrect logic or analysis, or any improvement on that front;
  • and second, because we try not to change notebooks, since ideally they have historical reasons for being the way they are. By historical reasons I mean: at some point in the future we should be able to get the same results as the ones in the notebook if we replicate the same conditions, such as Python version, package versions, and data version.

@rafonseca would you mind changing it back? That would also address @guizero's comments ;)

@rafonseca
Contributor Author

Hi @jtemporal and @guizero,
sorry for the mess. I probably opened an older version of this notebook and accidentally committed it. I will change it back as soon as possible. Actually, I need to figure out how to do this. I think I need help there, hehe..

@cuducos
Collaborator

cuducos commented Apr 4, 2017

I think I need help there, hehe...

Quick and dirty way: find Irio's notebook here on GitHub, click Raw, download it, and replace your local copy with it. Add a commit saying something like Reverting Irio's notebook to its original version.

Classy way: play with git revert on the commits that changed these files, skip these files, git commit, and git rebase to organize the new commits if needed ; )

@rafonseca
Contributor Author

Thanks for the tip @cuducos. In the end, I did a checkout on that file using the SHA. Not so dirty, not so classy. Inevitably, I will make further errors, so I will have the opportunity to try the classy way :)
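For reference, the checkout-by-SHA approach described above can be sketched like this (the SHA and notebook path below are hypothetical placeholders, not taken from this PR):

```shell
# Restore a single file to the version it had at a known-good commit.
# 'abc1234' and the notebook path are placeholders for illustration.
git checkout abc1234 -- develop/irio-notebook.ipynb

# The restored version is staged automatically; commit it:
git commit -m "Restore Irio's notebook to its original version"
```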

Collaborator

@cuducos cuducos left a comment

Ok, looks way better now ; ) yay!

I added a few more comments inline. Besides that, I wouldn't add develop/2017-03-31-rafonseca-fetch_tse_data.ipynb because it actually changes files in the data/ directory, and that is definitely not a best practice. I do understand it was useful for you to study and get to src/fetch_tse_data.py, but I'm not convinced it is relevant to have this code as a notebook.

Also, I encourage you to share with us (a link here, the file via Telegram, whatever) a version of the dataset and we will upload it to our servers. People would download it as they download all the other datasets ; )



FILENAME_PREFIX='consulta_cand_'
TEMP_PATH = '../data/tse_temp'
Collaborator

This might fail on Windows (there it's \ not /). We recommend using os.path.join to build file and directory paths that work in both worlds.
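A minimal sketch of that suggestion (the constant name mirrors the one in this PR):

```python
import os

# os.path.join picks the right separator for the current OS
# ('/' on Linux/macOS, '\\' on Windows), so the hardcoded
# '../data/tse_temp' becomes portable:
TEMP_PATH = os.path.join('..', 'data', 'tse_temp')
```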

FILENAME_PREFIX='consulta_cand_'
TEMP_PATH = '../data/tse_temp'
TSE_CANDIDATES_URL='http://agencia.tse.jus.br/estatistica/sead/odsele/consulta_cand/'
OUTPUT_DATASET_PATH = '../data/2017-03-31-tse-candidates.xz'
Collaborator

Same as line 12.

TEMP_PATH = '../data/tse_temp'
TSE_CANDIDATES_URL='http://agencia.tse.jus.br/estatistica/sead/odsele/consulta_cand/'
OUTPUT_DATASET_PATH = '../data/2017-03-31-tse-candidates.xz'
os.makedirs(TEMP_PATH)
Collaborator

We could use tempfile to manage temporary directories, couldn't we?
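A hedged sketch of the tempfile idea (not the code in this PR; the download step is elided):

```python
import os
import tempfile

# tempfile.TemporaryDirectory creates a unique directory and removes it
# (with its contents) when the block exits, replacing the hardcoded
# TEMP_PATH plus the manual os.makedirs call and cleanup.
with tempfile.TemporaryDirectory() as temp_path:
    output_file = os.path.join(temp_path, 'consulta_cand_2016.zip')
    # ... download and extract the TSE files into temp_path here ...
```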

OUTPUT_DATASET_PATH = '../data/2017-03-31-tse-candidates.xz'
os.makedirs(TEMP_PATH)

# setting year range from 2004 to 2016. this will be modified further to 'from 1994 to 2016'
Collaborator

Why do we generate years from 2004 to 2016 and transform it later? Wouldn't it be worth it to start from 1994 right now?

Contributor Author

Well, the TSE informs that the data from 1994 to 2002 is not consistent, and they are working on it. I can try to get that data, but ...

Collaborator

Wow… no, no… now I got it. Just leave the code as it is, but maybe clarify it in the comments…

os.makedirs(TEMP_PATH)

# setting year range from 2004 to 2016. this will be modified further to 'from 1994 to 2016'
year_list=[str(year) for year in (range(2004,2017,2))]
Collaborator

Can we use date.today().year instead of a hardcoded 2017?
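A sketch of that suggestion (assuming elections every two years starting in 2004, as in the line above):

```python
from datetime import date

# Even election years from 2004 up to the current year, with no
# hardcoded 2017 upper bound.
current_year = date.today().year
year_list = [str(year) for year in range(2004, current_year + 1, 2)]
```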

Contributor Author

Yes, we can. It is more elegant, but the headers are hardcoded and there is a high probability that the next election dataset will have different headers from all of these...

Collaborator

Damn… that's not good news. That's something that might be helpful to have documented (in comments) for future-me and future-you ; )


# Download files
for year in year_list:
filename=FILENAME_PREFIX+year+'.zip'
Collaborator

Using format is recommended when concatenating more than 2 strings: '{}{}.zip'.format(FILENAME_PREFIX, year).
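For instance (a quick sketch using the names from this script):

```python
FILENAME_PREFIX = 'consulta_cand_'
year = '2016'

# str.format reads better than chained '+' once more than two
# pieces are being joined:
filename = '{}{}.zip'.format(FILENAME_PREFIX, year)
# filename is now 'consulta_cand_2016.zip'
```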

for year in year_list:
filename=FILENAME_PREFIX+year+'.zip'
file_url=TSE_CANDIDATES_URL+filename
output_file=os.path.join(TEMP_PATH,filename)
Collaborator

Some Python best practices are to put spaces around the assignment operator and after commas (e.g. output_file = os.path.join(TEMP_PATH, filename) instead of output_file=os.path.join(TEMP_PATH,filename)). Check PEP 8 or prospector if you are interested in these code quality hints ; )

Contributor Author

Very reasonable!


# ### Adding the headers
# The following headers were extracted from LEIAME.pdf in consulta_cand_2016.zip.
header_consulta_cand_till2010=[
Collaborator

Considering this headers section:

  • We don't need the # with no comment at the end of the lines
  • Also, I don't think we need the headers you skip with the #
  • And we need to translate the headers to English (example)

cand_df.index=cand_df.reset_index().index # this index contains no useful information

# Exporting data
cand_df.to_csv(OUTPUT_DATASET_PATH,encoding='iso-8859-1',compression='xz',header=True,index=False)
Collaborator

Can we use utf-8? (same for lines 240-244).
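A hedged sketch of that change (hypothetical data; the TSE input files can still be read as iso-8859-1, only the output encoding changes):

```python
import os
import tempfile

import pandas as pd

# Export the merged dataframe as UTF-8 instead of iso-8859-1:
cand_df = pd.DataFrame({'nome': ['José', 'Conceição']})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'candidates.xz')
    cand_df.to_csv(path, encoding='utf-8', compression='xz', index=False)
    # Reading it back with UTF-8 round-trips the accented characters
    restored = pd.read_csv(path, encoding='utf-8', compression='xz')
    assert list(restored['nome']) == ['José', 'Conceição']
```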

Contributor Author

Probably yes for exporting and no for importing. I will check again.

Collaborator

Yep, sure thing. My bad. The idea was to use UTF-8 for exporting, sorry ; )

for file_i in files_of_the_year:
# the following cases do not take into account next elections. hopefully, TSE will add headers to the files
if ('2014' in file_i) or ('2016' in file_i):
cand_df_i=(pd.read_csv('./'+file_i,sep=';',header=None,dtype=str,names=header_consulta_cand_from2014,encoding='iso-8859-1'))
Collaborator

Is it recommended to use np.str instead of str, @jtemporal?

Collaborator

Yes, indeed it is. In a few words: numpy str objects are different from Python str objects. To keep the pattern used in the whole project we should stick to using np.str =)

Contributor Author

Actually, I used "unicode" in Python 2.7. In Python 3, the standard str replaced unicode. We can use np.str as long as it deals with unicode characters. Does it?

@rafonseca
Contributor Author

rafonseca commented Apr 13, 2017

Hi @cuducos. Thanks for the feedback.
I completely agree that the fetch notebook should not be included. I thought it could be useful for the reviewing process. But as for the data, it is still not clear to me which dataset we want to keep:

  • This script merges some data from TSE, selecting some columns. This is dataset number 1 and we have it.
  • The notebook 'use-tse-data' uses dataset number 1 and produces dataset number 2, which is not saved in data/ (hell, it is a notebook!).

Dataset 2 is a nice readable list of all elected politicians as requested in issue #196, but dataset 1 indeed has more information...
Again, thanks for the review. I will work on the code soon.

@cuducos
Collaborator

cuducos commented Apr 13, 2017

Hum… now I got it.

I tend to agree that creating only the first dataset, plus the notebook that demonstrates how to get from it to a list of elected politicians, is good enough.

@jtemporal and @Irio — you're more familiar than I am with data science — what do you think about it?

@jtemporal
Collaborator

I personally would go with having the first dataset on S3 and keeping the notebook that shows "how to get from it to a list of elected politicians".
We usually have this: a large, structured, translated, ready-to-work dataset from which we can retrieve information and/or intelligence by filtering or applying some code on top of it.
Maybe @Irio has some other point of view to share with us.

@Irio
Collaborator

Irio commented Apr 14, 2017

@cuducos There's a precedent for merging a notebook directly and only related to a src script, so I'm ok with continuing to do so.

@cuducos
Collaborator

cuducos commented Apr 14, 2017

@Irio AFAIK the notebook you mention does not write to data/ — it's a study about how to implement a script that keeps the data directory intact. If @rafonseca's notebook doesn't write new data I'm fine.

@Irio
Collaborator

Irio commented May 5, 2017

@rafonseca Do you need help for making these requested changes?

@rafonseca
Contributor Author

Hi @Irio.
Indeed, I have been quite busy these last weeks and I will probably be busy for the next two weeks. Shame on me. I will try to find some gap to work on this.
I will certainly need help with the header translation.
There is one last thing that is not clear to me. As you have noticed, I created the script from the notebook, and there is important information in the notebook that will help maintain this script in the future, like the info @cuducos asked to comment somewhere. So, are we going to keep the original notebook as well or not? Of course, we would remove the file writing in '../data'.

@rafonseca
Contributor Author

Hello there,
I have made almost all the modifications @cuducos suggested. The translation is still missing. Finally, I removed the scratch notebook and moved the important information into the script comments. I will try to answer more quickly in the next reviews. This PR is becoming long, hehe.
Thanks for the suggestions again.


FILENAME_PREFIX= 'consulta_cand_'
TSE_CANDIDATES_URL= 'http://agencia.tse.jus.br/estatistica/sead/odsele/consulta_cand/'
OUTPUT_DATASET_PATH= os.path.join(os.pardir,'data','2017-03-31-tse-candidates.xz')
Collaborator

It's not a good idea to have this date hardcoded here. Something like this might be useful:

>>> from datetime import date
>>> today = date.today()
>>> today.strftime('%Y-%m-%d')
'2017-05-10'
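Combining this with the os.path.join suggestion from earlier in the review, a hedged sketch could look like:

```python
import os
from datetime import date

# Build the dataset filename from today's date instead of the
# hardcoded '2017-03-31' prefix.
today = date.today().strftime('%Y-%m-%d')
OUTPUT_DATASET_PATH = os.path.join(
    os.pardir, 'data', '{}-tse-candidates.xz'.format(today))
```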

@cuducos
Collaborator

cuducos commented May 10, 2017

It is still missing the translation.

If the last link got lost, this might be helpful ; )

I will try to answer more quickly for the next reviews.

No worries — worst case scenario if we need this data ASAP we merge your PR to a secondary branch and work on some tweaks before bringing your code to the master branch…

This PR is becoming long

No problem at all! This is valuable data <3

@cuducos cuducos merged commit 3e2ad7d into okfn-brasil:master May 10, 2017
@cuducos
Collaborator

cuducos commented May 10, 2017

🎉 Many thanks @rafonseca ; )

@rafonseca
Contributor Author

yuhuu!!
Thank you all. Hope it is the first PR of many!

@rafonseca rafonseca deleted the rafonseca-tse-data branch May 12, 2017 14:19
Irio pushed a commit that referenced this pull request Feb 27, 2018