This repository has been archived by the owner on Mar 1, 2018. It is now read-only.

First steps towards generalization in reimbursements description #66

Open · wants to merge 36 commits into datasciencebr:master from silviodc:master

Conversation

silviodc

This PR is the inclusion of okfn-brasil/serenata-de-amor#238 in Rosie.
Since the trained classifier has 91% accuracy, maybe it can be included in the set of classifiers for the Chamber of Deputies.

PS:

  1. I uploaded the ML model (40 MB) together with the code. I guess this is not the best practice; maybe it should be downloaded via the serenata-toolbox, but I don't have access to Amazon to upload the model :/ (We probably have to change this to get a better architecture.)

  2. The code works fine in the tests (screenshot from 2017-07-20 14-01-53).
     However, I couldn't run it on the real reimbursements: Rosie consumes all my RAM (5 GB) while the reimbursements are downloaded, so maybe someone could test it for me 👍

  3. Sorry for my Python code. I only started building things in Python about 4 months ago. I would appreciate some tips to improve the code.

  4. As you can see in the screenshot, the TensorFlow build I included is the basic one. I kept it since I don't know whether the machine where Rosie runs allows other configurations. (We can also change this.)

@luzfcb (Contributor) commented Jul 20, 2017

A comment about the 40 MB binary file (the ML model):
Committing a binary file to a git repository is a bad idea. I have had problems with this kind of decision: in https://github.com/pythonclub/pythonclub.github.io we committed too many image files, about 3 MB in total. The result was that, over time, the size of the repository kept growing, even though I did not add any other new binary files. (So far, the pythonclub repository is close to 216.33 MB.) Removing these files would force me to rewrite the whole git history, and that would break all branches and forks, so I cannot do that.

In the last year, GitHub has added support for Git LFS, which was apparently created to solve problems like this.

https://git-lfs.github.com/
https://help.github.com/articles/versioning-large-files/

@cuducos requested a review from Irio on July 20, 2017 16:43
@jtemporal (Collaborator) left a comment

First comments on 💅 only. PEP-8 for the win ;)

Dockerfile Outdated
COPY requirements.txt ./
COPY setup ./
COPY rosie.py ./
COPY rosie ./rosie
COPY config.ini.example ./
COPY config.ini ./
Collaborator

This doesn't need to be here since Rosie's setup file creates config.ini from the example file copied in the previous line 😉

You can go ahead and delete this line.

Author

Done, I removed it. Strangely, the first time I tried to build the image I got an error without it... Anyway, now it's OK.

from keras.layers import Activation, Dropout, Flatten, Dense
from keras import backend as K
from keras.callbacks import ModelCheckpoint
from keras.models import load_model
Collaborator

You can follow the PEP-8 imports section to guide you when organizing your imports. For example, when doing from ... import ... you can separate the imported names with commas, like: from keras.models import Sequential, load_model.

Note that you should also group your imports:

1. standard library imports
2. related third party imports
3. local application/library specific imports

You should put a blank line between each group of imports.

Also on that note, you can make use of tools like isort to help you automatically organize your imports.
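
For illustration, a minimal sketch of what the grouped imports could look like in this classifier, based only on the modules that appear in this diff (the exact list in the file may differ):

# Standard library imports
import os

# Related third party imports
from keras import backend as K
from keras.callbacks import ModelCheckpoint
from keras.layers import Activation, Conv2D, Dense, Dropout, Flatten, MaxPooling2D
from keras.models import Sequential, load_model

# Local application/library specific imports would form a third group,
# separated from the others by a blank line.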

Author

Done! Now they look beautiful :D

The year the expense was generated.
"""

COLS = ['applicant_id',
Collaborator

Don't be afraid to use descriptive names like COLUMNS for your variables. It will make your code clearer to read down the road.

Author

In fact, I tried to follow the code of the other classifiers: Ctrl+C > Ctrl+V.
I changed it to COLUMNS ;)

nb_train_samples = sum([len(files) for r, d, files in os.walk(train_data_dir)])
nb_validation_samples = sum([len(files) for r, d, files in os.walk(validation_data_dir)])

print('no. of trained samples = ', nb_train_samples, ' no. of validation samples= ',nb_validation_samples)
Collaborator

Long lines aren't a good thing. This print could be written like the following in order to avoid the extra long line:

print('no. of trained samples = ', nb_train_samples,
      ' no. of validation samples= ', nb_validation_samples)

A space was also missing after that last comma.


img_width, img_height = 300, 300

def train(self,train_data_dir,validation_data_dir,save_dir):
Collaborator

Pay extra attention to spaces after commas, they help make your code easier on the eyes 😉

This method would be nicer like this:

def train(self, train_data_dir, validation_data_dir, save_dir):

Author

I changed the code using this tool: http://pep8online.com/checkresult


model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
Collaborator

I noted that you repeat the same process here 3 times with very little difference between the parameters used. Maybe you can explain a little why that is necessary?

Author

I included a brief description and a link to explain it: http://deeplearning.net/tutorial/lenet.html
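
For readers of this thread, a minimal sketch of that repeated pattern in context (the Sequential model, the 300x300 input and the 64-filter layer come from this file; the 32-filter counts in the earlier stages are illustrative). Each Conv2D + ReLU + MaxPooling2D stage halves the spatial resolution and lets the next stage learn features over a larger region of the receipt, which is why the block is stacked three times:

from keras.models import Sequential
from keras.layers import Activation, Conv2D, MaxPooling2D

model = Sequential()

# Stage 1: low-level strokes and edges on the 300x300 RGB image
model.add(Conv2D(32, (3, 3), input_shape=(300, 300, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

# Stages 2 and 3 repeat the pattern; pooling shrinks the feature maps,
# so later filters cover progressively larger parts of the document
model.add(Conv2D(32, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))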

optimizer='rmsprop',
metrics=['accuracy'])

#this is the augmentation configuration we will use for training
Collaborator

PEP-8 inline comments state that:

They (inline comments) should start with a # and a single space.

# This is the augmentation configuration we will use for training looks way better, don't you think?

Author

Done, using the tool

rescale=1. / 255,
shear_range=0.2,
zoom_range=0.2,
horizontal_flip=False)#As you can see i put it as FALSE and on link example it is TRUE
Collaborator

PEP-8 inline comments state that:

An inline comment is a comment on the same line as a statement. Inline comments should be separated by at least two spaces from the statement. They should start with a # and a single space.

And be careful with the line length here too.

Collaborator

Also, I didn't quite get what you meant here.

Author

It was a copy and paste from Serenata. Now I have included the reason:

I set horizontal_flip to False because we do not handwrite from right to left in Portuguese.
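
For the record, a sketch of how that augmentation configuration could read with a PEP-8 compliant comment (parameter values taken from the snippet above; the variable name train_datagen is only illustrative):

from keras.preprocessing.image import ImageDataGenerator

# Augmentation configuration used for training.
# horizontal_flip stays False: handwritten Portuguese is never mirrored,
# so flipped receipts would not be realistic training samples.
train_datagen = ImageDataGenerator(
    rescale=1. / 255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=False)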

horizontal_flip=False)#As you can see i put it as FALSE and on link example it is TRUE
#Explanation, there no possibility to write in a reverse way :P

#this is the augmentation configuration we will use for testing:
Collaborator

Same thing as inline comments, these should start with # followed by a single space.

Collaborator

This is applicable to all other comments you made in this file ;)

Author

PEP-8 requirements done, using this: http://pep8online.com/checkresult

class_mode='binary')

#It allow us to save only the best model between the iterations
checkpointer = ModelCheckpoint(filepath=save_dir+"weights.hdf5", verbose=1, save_best_only=True)
Collaborator

File paths should be built using os.path.join to avoid breaking the scripts when running Rosie on different systems.

Contributor

@jtemporal Another good option, with a nice API, is pathlib.
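
Both suggestions applied to the checkpoint line, as a sketch (save_dir is the argument of the train method shown above; ModelCheckpoint takes a string path, so the pathlib variant is wrapped in str()):

import os
from pathlib import Path

from keras.callbacks import ModelCheckpoint

# os.path.join builds the path portably instead of concatenating strings
checkpointer = ModelCheckpoint(filepath=os.path.join(save_dir, 'weights.hdf5'),
                               verbose=1, save_best_only=True)

# Equivalent using pathlib
weights_path = Path(save_dir) / 'weights.hdf5'
checkpointer = ModelCheckpoint(filepath=str(weights_path),
                               verbose=1, save_best_only=True)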

Author

Done, I included the os.path.join.

@coveralls

Coverage Status

Coverage decreased (-4.7%) to 93.293% when pulling 5b9952d on silviodc:master into 97a0f00 on datasciencebr:master.

@coveralls

Coverage Status

Coverage decreased (-0.7%) to 97.329% when pulling b036f1d on silviodc:master into 97a0f00 on datasciencebr:master.

@coveralls

Coverage Status

Coverage decreased (-0.5%) to 97.508% when pulling 4060063 on silviodc:master into 97a0f00 on datasciencebr:master.

@silviodc (Author) commented Aug 12, 2017

Main points after the latest changes to this pull request:

  1. I deleted the 40 MB machine learning model 👍 Now the code is configured to load the model from a given path or to download it from an external source (the one I'm using); see the sketch after this list.
  2. I included tests that predict on downloaded reimbursements using the proper ML model.
  3. I included a test that creates and trains a fake model.
  4. I included a test for this classifier in the TestCore class.
  5. I fixed the location of the supervised machine learning models in rosie.chamber_of_deputies.settings and rosie.federal_senate.settings. Therefore, new supervised models will be able to specify the files they have to load.
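
A minimal sketch of the load-or-download behavior described in item 1 (the function name, its arguments, and the use of urllib here are illustrative; the actual code in this PR may differ):

import os
import urllib.request

from keras.models import load_model


def load_or_download_model(model_path, model_url):
    """Load the trained Keras model from model_path,
    downloading it from model_url first if it is missing."""
    if not os.path.exists(model_path):
        urllib.request.urlretrieve(model_url, model_path)
    return load_model(model_path)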
