
created fetch_tse_data #208

Merged
merged 13 commits into from May 10, 2017

Conversation

rafonseca
Contributor

Hello guys.
This is the script that fetches data from the TSE website in order to create a list of Brazilian politicians. I also have a small example notebook that uses this data.
Since the script will be code reviewed, I thought the notebook version would be appreciated, so this is a direct export from a notebook. Is that ok?

@jtemporal
Collaborator

jtemporal commented Mar 31, 2017

Hi @rafonseca, awesome contribution! I would recommend a few things here in order to start reviewing this PR =)

The .py generated from the notebook is awesome, but it should be in the develop/ directory alongside the .ipynb file and the .html file. You can check out here how to configure your Jupyter Notebook to generate all of these files on save (if you haven't done so yet).

Also, I would like to point out that only clean code goes inside src/; you can check out the other Python scripts there if you want ;)

So, tl;dr for this PR, you'll need to do the following:

  • move the .py to the develop/ subdirectory
  • commit both the .html and .ipynb files in the develop/ subdirectory
  • create the clean .py script inside the src/ subdirectory

@rafonseca
Contributor Author

Ok, I've got it.
So, should I close this PR and create another one?

@jtemporal
Collaborator

No! You can make your changes here ;)
That's the workflow: you send things, we suggest some changes, you change them, and the PR goes under review again. You are on the right track, great job!

@cuducos
Collaborator

cuducos commented Mar 31, 2017

Awesome contribution @rafonseca, awesome feedback @jtemporal!

No need to close the PR. You can work on changes on this branch and new commits are automagically added to this PR when pushed to GitHub ; )

@rafonseca
Contributor Author

It is indeed magical

@guizero guizero left a comment

Just a quick English review

@@ -8456,8 +8456,8 @@
}
/* Flexible box model classes */
/* Taken from Alex Russell http://infrequently.org/2009/08/css-3-progress/ */
/* This file is a compatibility layer. It allows the usage of flexible box
model layouts across multiple browsers, including older browsers. The newest,
/* This file is a compatability layer. It allows the usage of flexible box

compatibility was correct.

/* This file is a compatibility layer. It allows the usage of flexible box
model layouts across multiple browsers, including older browsers. The newest,
/* This file is a compatability layer. It allows the usage of flexible box
model layouts accross multiple browsers, including older browsers. The newest,

across was correct.

@@ -11357,7 +11357,7 @@
* Author: Jupyter Development Team
*/
/** WARNING IF YOU ARE EDITTING THIS FILE, if this is a .css file, It has a lot
* of chance of being generated from the ../less/[samename].less file, you can
* of chance of beeing generated from the ../less/[samename].less file, you can

being was correct.

@jtemporal
Collaborator

Hi again, thank you for the contribution @rafonseca, and thank you @guizero for the review.

Actually, Irio's notebook shouldn't change:

  • first, because these weren't changes due to incorrect logic or analysis, or any improvement on that front;
  • and second, because we try not to change notebooks, since ideally they have historical reasons for being the way they are. By historical reasons I mean: at some point in the future we should be able to get the same results as the ones in the notebook if we replicate the same conditions, such as Python version, package versions, and data version.

@rafonseca would you mind changing it back? That would also address @guizero's comments ;)

@rafonseca
Contributor Author

Hi @jtemporal and @guizero,
sorry for the mess. I probably opened an older version of this notebook and accidentally committed it. I will change it back as soon as possible. Actually, I need to figure out how to do this. I think I need help there, hehe..

@cuducos
Collaborator

cuducos commented Apr 4, 2017

I think I need help there, hehe...

Quick and dirty way: find Irio's notebook here on GitHub, click Raw, download it, and replace your local copy with it. Add a commit saying something like Reverting Irio's notebook to its original version.

Classy way: play with git revert on the commits that changed these files, skip these files, git commit, and git rebase to organize the new commits if needed ; )

@rafonseca
Contributor Author

Thanks for the tip @cuducos. In the end, I did a checkout on that file using the SHA. Not so dirty, not so classy. Inevitably, I will make further errors, so I will have the opportunity to try the classy way :)
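For reference, the checkout-by-SHA approach described above can be sketched like this (the SHA and notebook path below are hypothetical placeholders, not taken from this PR):

```shell
# Restore a single file to the version it had at a known-good commit.
# 'abc1234' and the notebook path are placeholders for illustration.
git checkout abc1234 -- develop/irio-notebook.ipynb

# The restored version is staged automatically; commit it:
git commit -m "Restore Irio's notebook to its original version"
```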

Collaborator

@cuducos cuducos left a comment

Ok, looks way better now ; ) yay!

I added a few more comments inline. Besides that, I wouldn't add develop/2017-03-31-rafonseca-fetch_tse_data.ipynb because it actually changes files in the data/ directory, and that is definitely not a best practice. I do understand it was useful for you to study and get to src/fetch_tse_data.py, but I'm not convinced it is relevant to have this code as a notebook.

Also, I encourage you to share with us (a link here, the file via Telegram, whatever) a version of the dataset and we will upload it to our servers. People would download it as they download all the other datasets ; )



FILENAME_PREFIX='consulta_cand_'
TEMP_PATH = '../data/tse_temp'
Collaborator

This might fail on Windows (there it's \ not /). We recommend using os.path.join to build file and directory paths that work in both worlds.
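A minimal sketch of that suggestion (the constant name mirrors the one in this PR):

```python
import os

# os.path.join picks the right separator for the current OS
# ('/' on Linux/macOS, '\\' on Windows), so the hardcoded
# '../data/tse_temp' becomes portable:
TEMP_PATH = os.path.join('..', 'data', 'tse_temp')
```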

FILENAME_PREFIX='consulta_cand_'
TEMP_PATH = '../data/tse_temp'
TSE_CANDIDATES_URL='http://agencia.tse.jus.br/estatistica/sead/odsele/consulta_cand/'
OUTPUT_DATASET_PATH = '../data/2017-03-31-tse-candidates.xz'
Collaborator

Same as line 12.

TEMP_PATH = '../data/tse_temp'
TSE_CANDIDATES_URL='http://agencia.tse.jus.br/estatistica/sead/odsele/consulta_cand/'
OUTPUT_DATASET_PATH = '../data/2017-03-31-tse-candidates.xz'
os.makedirs(TEMP_PATH)
Collaborator

We could use tempfile to manage temporary directories, couldn't we?
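A hedged sketch of the tempfile idea (not the code in this PR; the download step is elided):

```python
import os
import tempfile

# tempfile.TemporaryDirectory creates a unique directory and removes it
# (with its contents) when the block exits, replacing the hardcoded
# TEMP_PATH plus the manual os.makedirs call and cleanup.
with tempfile.TemporaryDirectory() as temp_path:
    output_file = os.path.join(temp_path, 'consulta_cand_2016.zip')
    # ... download and extract the TSE files into temp_path here ...
```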

OUTPUT_DATASET_PATH = '../data/2017-03-31-tse-candidates.xz'
os.makedirs(TEMP_PATH)

# setting year range from 2004 to 2016. this will be modified further to 'from 1994 to 2016'
Collaborator

Why do we generate years from 2004 to 2016 and transform it later? Wouldn't it be worth it to start from 1994 right now?

Contributor Author

Well, the TSE informs that the data from 1994 to 2002 is not consistent, and they are working on it. I can try to get that data, but ...

Collaborator

Wow… no, no… now I got it. Just leave the code as it is, but maybe clarify it in the comments…

os.makedirs(TEMP_PATH)

# setting year range from 2004 to 2016. this will be modified further to 'from 1994 to 2016'
year_list=[str(year) for year in (range(2004,2017,2))]
Collaborator

Can we use date.today().year instead of a hardcoded 2017?
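A sketch of that suggestion (assuming elections every two years starting in 2004, as in the line above):

```python
from datetime import date

# Even election years from 2004 up to the current year, with no
# hardcoded 2017 upper bound.
current_year = date.today().year
year_list = [str(year) for year in range(2004, current_year + 1, 2)]
```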

Contributor Author

Yes, we can. It is more elegant, but the headers are hardcoded and there is a high probability that the next election dataset will have different headers from all of these...

Collaborator

Damn… that's not good news. That's something that might be helpful to have documented (in comments) for future-me and future-you ; )


# Download files
for year in year_list:
filename=FILENAME_PREFIX+year+'.zip'
Collaborator

Using format is recommended when concatenating more than 2 strings: '{}{}.zip'.format(FILENAME_PREFIX, year).
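For instance (a quick sketch using the names from this script):

```python
FILENAME_PREFIX = 'consulta_cand_'
year = '2016'

# str.format reads better than chained '+' once more than two
# pieces are being joined:
filename = '{}{}.zip'.format(FILENAME_PREFIX, year)
# filename is now 'consulta_cand_2016.zip'
```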

for year in year_list:
filename=FILENAME_PREFIX+year+'.zip'
file_url=TSE_CANDIDATES_URL+filename
output_file=os.path.join(TEMP_PATH,filename)
Collaborator

Some Python best practices are to put spaces around the assignment operator and after commas (e.g. output_file = os.path.join(TEMP_PATH, filename) instead of output_file=os.path.join(TEMP_PATH,filename)). Check PEP 8 or prospector if you are interested in these code quality hints ; )

Contributor Author

Very reasonable!


# ### Adding the headers
# The following headers were extracted from LEIAME.pdf in consulta_cand_2016.zip.
header_consulta_cand_till2010=[
Collaborator

Considering this headers section:

  • We don't need the # with no comment at the end of the lines
  • Also, I don't think we need the headers you skip with the #
  • And we need to translate the headers to English (example)

cand_df.index=cand_df.reset_index().index # this index contains no useful information

# Exporting data
cand_df.to_csv(OUTPUT_DATASET_PATH,encoding='iso-8859-1',compression='xz',header=True,index=False)
Collaborator

Can we use utf-8? (same for lines 240-244).
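A hedged sketch of that change (hypothetical data; the TSE input files can still be read as iso-8859-1, only the output encoding changes):

```python
import os
import tempfile

import pandas as pd

# Export the merged dataframe as UTF-8 instead of iso-8859-1:
cand_df = pd.DataFrame({'nome': ['José', 'Conceição']})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'candidates.xz')
    cand_df.to_csv(path, encoding='utf-8', compression='xz', index=False)
    # Reading it back with UTF-8 round-trips the accented characters
    restored = pd.read_csv(path, encoding='utf-8', compression='xz')
    assert list(restored['nome']) == ['José', 'Conceição']
```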

Contributor Author

Probably yes for exporting and no for importing. I will check again.

Collaborator

Yep, sure thing. My bad. The idea was to use UTF-8 for exporting, sorry ; )

for file_i in files_of_the_year:
# the following cases do not take into account next elections. hopefully, TSE will add headers to the files
if ('2014' in file_i) or ('2016' in file_i):
cand_df_i=(pd.read_csv('./'+file_i,sep=';',header=None,dtype=str,names=header_consulta_cand_from2014,encoding='iso-8859-1'))
Collaborator

Is it recommended to use np.str instead of str, @jtemporal?

Collaborator

Yes, indeed it is. In a few words: numpy str objects are different from Python str objects. To keep the pattern used in the whole project we should stick to using np.str =)

Contributor Author

Actually, I used "unicode" in Python 2.7. In Python 3, the standard str replaced unicode. We can use np.str as long as it deals with unicode characters. Does it?

@rafonseca
Contributor Author

rafonseca commented Apr 13, 2017

Hi @cuducos. Thanks for the feedback.
I completely agree that the fetch notebook should not be included. I thought it could be useful for the reviewing process. But as for the data, it is still not clear to me which dataset we want to keep:

  • This script merges some data from TSE, selecting some columns. This is dataset number 1 and we have it.
  • The notebook 'use-tse-data' uses dataset number 1 and produces dataset number 2, which is not saved in data/ (hell, it is a notebook!).

Dataset 2 is a nice readable list of all elected politicians as requested in issue #196, but dataset 1 indeed has more information...
Again, thanks for the review. I will work on the code soon.

@cuducos
Collaborator

cuducos commented Apr 13, 2017

Hum… now I got it.

I tend to agree that creating only the first dataset, plus the notebook that demonstrates how to get from it to a list of elected politicians, is good enough.

@jtemporal and @Irio — you're more familiar than I am with data science — what do you think about it?

@jtemporal
Collaborator

I personally would go with having the first dataset on S3 and keeping the notebook that shows "how to get from it to a list of elected politicians".
We usually have this: a large, structured, translated, ready-to-work dataset from which we can retrieve information and/or intelligence by filtering or applying some code on top of it.
Maybe @Irio has some other point of view to share with us.

@Irio
Collaborator

Irio commented Apr 14, 2017

@cuducos There's a precedent for merging a notebook directly and only related to a src script, so I'm ok with continuing to do so.

@cuducos
Collaborator

cuducos commented Apr 14, 2017

@Irio AFAIK the notebook you mention does not write to data/ — it's a study about how to implement a script that keeps the data directory intact. If @rafonseca's notebook doesn't write new data I'm fine.

@Irio
Collaborator

Irio commented May 5, 2017

@rafonseca Do you need help for making these requested changes?

@rafonseca
Contributor Author

Hi @Irio.
Indeed, I have been quite busy these last weeks and I will probably be busy for the next two weeks. Shame on me. I will try to find some gap to work on this.
I will certainly need help with the header translation.
There is one last thing that is not clear to me. As you have noticed, I created the script from the notebook, and there is important information in the notebook that will help maintain this script in the future, like the info @cuducos asked to comment somewhere. So, are we going to keep the original notebook as well or not? Of course, we would remove the file writing in '../data'.

@rafonseca
Contributor Author

Hello there,
I have made almost all the modifications @cuducos suggested. The translation is still missing. Finally, I removed the scratch notebook and moved the important information into the script comments. I will try to answer more quickly in the next reviews. This PR is becoming long, hehe.
Thanks for the suggestions again.


FILENAME_PREFIX= 'consulta_cand_'
TSE_CANDIDATES_URL= 'http://agencia.tse.jus.br/estatistica/sead/odsele/consulta_cand/'
OUTPUT_DATASET_PATH= os.path.join(os.pardir,'data','2017-03-31-tse-candidates.xz')
Collaborator

It's not a good idea to have this date hardcoded here. Something like this might be useful:

>>> from datetime import date
>>> today = date.today()
>>> today.strftime('%Y-%m-%d')
'2017-05-10'
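Combining this with the os.path.join suggestion from earlier in the review, a hedged sketch could look like:

```python
import os
from datetime import date

# Build the dataset filename from today's date instead of the
# hardcoded '2017-03-31' prefix.
today = date.today().strftime('%Y-%m-%d')
OUTPUT_DATASET_PATH = os.path.join(
    os.pardir, 'data', '{}-tse-candidates.xz'.format(today))
```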

@cuducos
Collaborator

cuducos commented May 10, 2017

It is still missing the translation.

If the last link got lost, this might be helpful ; )

I will try to answer more quickly for the next reviews.

No worries — worst case scenario if we need this data ASAP we merge your PR to a secondary branch and work on some tweaks before bringing your code to the master branch…

This PR is becoming long

No problem at all! This is valuable data <3

@cuducos cuducos merged commit 3e2ad7d into okfn-brasil:master May 10, 2017
@cuducos
Collaborator

cuducos commented May 10, 2017

🎉 Many thanks @rafonseca ; )

@rafonseca
Contributor Author

yuhuu!!
Thank you all. Hope it is the first PR of many!

@rafonseca rafonseca deleted the rafonseca-tse-data branch May 12, 2017 14:19
Irio pushed a commit that referenced this pull request Feb 27, 2018