This repository has been archived by the owner on Mar 1, 2018. It is now read-only.

First steps towards generalization in reimbursements description #66

Open · wants to merge 36 commits into datasciencebr:master from silviodc:master

Conversation

silviodc

This PR is the inclusion of okfn-brasil/serenata-de-amor#238 in Rosie.
Since the trained classifier has 91% accuracy, maybe it can be included in the set of classifiers for the Chamber of Deputies.

PS:

  1. I uploaded the ML model (40 MB) together with the code. I guess this is not the best practice; maybe it should be downloaded via the serenata-toolbox, but I don't have access to Amazon to upload the model :/ (We probably have to change this to get a better architecture.)

  2. The code works fine in the tests (screenshot from 2017-07-20 14-01-53).
     However, I couldn't run it on the real reimbursements: Rosie consumes all my RAM (5 GB) while the reimbursements are downloaded, so maybe someone could test it for me 👍

  3. Sorry for my Python code. I only started building things in Python about 4 months ago. I would appreciate some tips to improve the code.

  4. As you can see in the screenshot, the TensorFlow build I included is the basic one. I kept it since I don't know whether the machine where Rosie runs allows other configurations. (We can also change this.)

@luzfcb (Contributor) commented Jul 20, 2017

A comment about the 40 MB binary file (the ML model):
Committing a binary file to a git repository is a bad idea. I have had problems with this kind of decision: in https://github.com/pythonclub/pythonclub.github.io we committed too many image files, about 3 MB in total. The result was that, over time, the size of the repository kept growing, even though I did not add any other new binary files. (So far, the pythonclub repository is close to 216.33 MB.) Removing these files would force me to rewrite the whole git history, and that would break all branches and forks, so I cannot do that.

In the last year, GitHub has added support for Git LFS, which was apparently created to solve problems like this.

https://git-lfs.github.com/
https://help.github.com/articles/versioning-large-files/

@cuducos requested a review from Irio on July 20, 2017 16:43
@jtemporal (Collaborator) left a comment

First comments on 💅 only. PEP-8 for the win ;)

Dockerfile Outdated
COPY requirements.txt ./
COPY setup ./
COPY rosie.py ./
COPY rosie ./rosie
COPY config.ini.example ./
COPY config.ini ./
Collaborator

This doesn't need to be here since Rosie's setup file creates config.ini from the example file copied in the previous line 😉

You can go ahead and delete this line.

Author

Done, I removed it. Strangely, the first time I tried to build the image I got an error without it... Anyway, now it's OK.

from keras.layers import Activation, Dropout, Flatten, Dense
from keras import backend as K
from keras.callbacks import ModelCheckpoint
from keras.models import load_model
Collaborator

You can follow the PEP-8 imports section to guide you when organizing your imports. For example, when doing from ... import ... you can separate the imported names with commas, like: from keras.models import Sequential, load_model.

Note that you should also group your imports:

1. standard library imports
2. related third party imports
3. local application/library specific imports

You should put a blank line between each group of imports.

Also on that note, you can make use of tools like isort to help you automatically organize your imports.
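
For illustration, a minimal sketch of what the grouped imports could look like in this classifier, based only on the modules that appear in this diff (the exact list in the file may differ):

# Standard library imports
import os

# Related third party imports
from keras import backend as K
from keras.callbacks import ModelCheckpoint
from keras.layers import Activation, Conv2D, Dense, Dropout, Flatten, MaxPooling2D
from keras.models import Sequential, load_model

# Local application/library specific imports would form a third group,
# separated from the others by a blank line.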

Author

Done! Now they look beautiful :D

The year the expense was generated.
"""

COLS = ['applicant_id',
Collaborator

Don't be afraid to use descriptive names like COLUMNS for your variables. It will make your code clearer to read down the road.

Author

In fact, I tried to follow the code of the other classifiers: Ctrl+C > Ctrl+V.
I changed it to COLUMNS ;)

nb_train_samples = sum([len(files) for r, d, files in os.walk(train_data_dir)])
nb_validation_samples = sum([len(files) for r, d, files in os.walk(validation_data_dir)])

print('no. of trained samples = ', nb_train_samples, ' no. of validation samples= ',nb_validation_samples)
Collaborator

Long lines aren't a good thing. This print could be written like the following in order to avoid the extra long line:

print('no. of trained samples = ', nb_train_samples,
      ' no. of validation samples= ', nb_validation_samples)

A space was also missing after that last comma.


img_width, img_height = 300, 300

def train(self,train_data_dir,validation_data_dir,save_dir):
Collaborator

Pay extra attention to spaces after commas, they help make your code easier on the eyes 😉

This method would be nicer like this:

def train(self, train_data_dir, validation_data_dir, save_dir):

Author

I changed the code using this tool: http://pep8online.com/checkresult


model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
Collaborator

I noted that you repeat the same process here 3 times with very little difference between the parameters used. Maybe you can explain a little why that is necessary?

Author

I included a brief description and a link to explain it: http://deeplearning.net/tutorial/lenet.html
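
For readers of this thread, a minimal sketch of that repeated pattern in context (the Sequential model, the 300x300 input and the 64-filter layer come from this file; the 32-filter counts in the earlier stages are illustrative). Each Conv2D + ReLU + MaxPooling2D stage halves the spatial resolution and lets the next stage learn features over a larger region of the receipt, which is why the block is stacked three times:

from keras.models import Sequential
from keras.layers import Activation, Conv2D, MaxPooling2D

model = Sequential()

# Stage 1: low-level strokes and edges on the 300x300 RGB image
model.add(Conv2D(32, (3, 3), input_shape=(300, 300, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

# Stages 2 and 3 repeat the pattern; pooling shrinks the feature maps,
# so later filters cover progressively larger parts of the document
model.add(Conv2D(32, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))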

optimizer='rmsprop',
metrics=['accuracy'])

#this is the augmentation configuration we will use for training
Collaborator

PEP-8 inline comments state that:

They (inline comments) should start with a # and a single space.

# This is the augmentation configuration we will use for training looks way better, don't you think?

Author

Done, using the tool

rescale=1. / 255,
shear_range=0.2,
zoom_range=0.2,
horizontal_flip=False)#As you can see i put it as FALSE and on link example it is TRUE
Collaborator

PEP-8 inline comments state that:

An inline comment is a comment on the same line as a statement. Inline comments should be separated by at least two spaces from the statement. They should start with a # and a single space.

And be careful with the line length here too.

Collaborator

Also, I didn't quite get what you meant here.

Author

It was a copy and paste from Serenata. Now I have included the reason:

I set horizontal_flip to False because we do not handwrite from right to left in Portuguese.
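
For the record, a sketch of how that augmentation configuration could read with a PEP-8 compliant comment (parameter values taken from the snippet above; the variable name train_datagen is only illustrative):

from keras.preprocessing.image import ImageDataGenerator

# Augmentation configuration used for training.
# horizontal_flip stays False: handwritten Portuguese is never mirrored,
# so flipped receipts would not be realistic training samples.
train_datagen = ImageDataGenerator(
    rescale=1. / 255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=False)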

horizontal_flip=False)#As you can see i put it as FALSE and on link example it is TRUE
#Explanation, there no possibility to write in a reverse way :P

#this is the augmentation configuration we will use for testing:
Collaborator

Same thing as inline comments, these should start with # followed by a single space.

Collaborator

This is applicable to all other comments you made in this file ;)

Author

PEP-8 requirements done, using this: http://pep8online.com/checkresult

class_mode='binary')

#It allow us to save only the best model between the iterations
checkpointer = ModelCheckpoint(filepath=save_dir+"weights.hdf5", verbose=1, save_best_only=True)
Collaborator

File paths should be built using os.path.join to avoid breaking the scripts when running Rosie on different systems.

Contributor

@jtemporal Another good option, with a nice API, is pathlib.
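
Both suggestions applied to the checkpoint line, as a sketch (save_dir is the argument of the train method shown above; ModelCheckpoint takes a string path, so the pathlib variant is wrapped in str()):

import os
from pathlib import Path

from keras.callbacks import ModelCheckpoint

# os.path.join builds the path portably instead of concatenating strings
checkpointer = ModelCheckpoint(filepath=os.path.join(save_dir, 'weights.hdf5'),
                               verbose=1, save_best_only=True)

# Equivalent using pathlib
weights_path = Path(save_dir) / 'weights.hdf5'
checkpointer = ModelCheckpoint(filepath=str(weights_path),
                               verbose=1, save_best_only=True)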

Author

Done, I included the os.path.join.

@coveralls

Coverage Status

Coverage decreased (-4.7%) to 93.293% when pulling 5b9952d on silviodc:master into 97a0f00 on datasciencebr:master.

@coveralls

Coverage Status

Coverage decreased (-0.7%) to 97.329% when pulling b036f1d on silviodc:master into 97a0f00 on datasciencebr:master.

@coveralls

Coverage Status

Coverage decreased (-0.5%) to 97.508% when pulling 4060063 on silviodc:master into 97a0f00 on datasciencebr:master.

@silviodc (Author) commented Aug 12, 2017

Main points after the latest changes to this pull request:

  1. I deleted the 40 MB machine learning model 👍 Now the code is configured to load the model from a given path or to download it from an external source (the one I'm using); see the sketch after this list.
  2. I included tests that predict on downloaded reimbursements using the proper ML model.
  3. I included a test that creates and trains a fake model.
  4. I included a test for this classifier in the TestCore class.
  5. I fixed the location of the supervised machine learning models in rosie.chamber_of_deputies.settings and rosie.federal_senate.settings. Therefore, new supervised models will be able to specify the files they have to load.
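
A minimal sketch of the load-or-download behavior described in item 1 (the function name, its arguments, and the use of urllib here are illustrative; the actual code in this PR may differ):

import os
import urllib.request

from keras.models import load_model


def load_or_download_model(model_path, model_url):
    """Load the trained Keras model from model_path,
    downloading it from model_url first if it is missing."""
    if not os.path.exists(model_path):
        urllib.request.urlretrieve(model_url, model_path)
    return load_model(model_path)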
