How Much

A machine-learning-backed project that predicts an income class. It distinguishes between “<= 50K” and “> 50K” in US dollars. The model is trained on the Adult income dataset and integrated into a very simple Rails application using the sklearn-porter project, which generates native Ruby code from the trained model.

The demo application is hosted on Heroku and is available here (please be patient if the first page takes a while to load, as it is hosted on a free tier).

Please note that the model’s accuracy is around 83%, but it is based on data gathered in 1994, so it will not be very accurate for today’s circumstances. Nevertheless, it was a great experience and a fun project to build!

Train the Model

The full process of researching, engineering and choosing features, and then searching for the best model and parameters, is described here. A summary of the findings follows.

The features most correlated with the target variable were:

df.corr()["income_cat"].sort_values(ascending=False)
# education            0.324409
# hours-per-week       0.226346
# capital-gain         0.219655
# male                 0.205186

These features were selected for training the models, whose accuracies were:

Model          Best accuracy
Decision Tree  82.0%
Random Forest  82.5%
KNN            82.1%

The best-performing model was the one based on the Random Forest algorithm, so that is the one being deployed.

Deploy the Model

Sklearn-Porter is able to generate native Ruby code, which will be used to deploy the trained model:

from sklearn_porter import Porter

porter = Porter(grid_for_forest.best_estimator_, language='ruby')
output = porter.export(embed_data=True, class_name='Ml::IncomeClassifierModel')

with open('../app/lib/ml/income_classifier_model.rb', 'w') as f:
    f.write(output)

This would generate a class with the following interface:

class Ml::IncomeClassifierModel
  # ...

  def self.predict(features)
    # ...
  end

  # ...
end

That could be used as follows:

Ml::IncomeClassifierModel.predict([
  10, # value associated with education
  40, # value associated with hours_per_week
   0, # value associated with capital_gain
   1  # 1 for male, 0 for female
])
# => 0 or 1
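
Because the array is positional, mixing up the order would silently produce a wrong prediction. One way to guard against that is sketched below. It assumes a hypothetical FEATURE_ORDER constant and feature_vector helper, neither of which is part of the generated class or of sklearn-porter:

```ruby
# Hypothetical helper: the constant and method names are illustrative only,
# they are not part of the generated model or of sklearn-porter.
FEATURE_ORDER = %i[education hours_per_week capital_gain male].freeze

# Builds the positional array from a hash of named values,
# failing loudly if any expected feature is missing.
def feature_vector(named)
  missing = FEATURE_ORDER - named.keys
  raise ArgumentError, "missing features: #{missing.join(', ')}" unless missing.empty?

  FEATURE_ORDER.map { |name| named.fetch(name) }
end

feature_vector(male: 1, education: 10, capital_gain: 0, hours_per_week: 40)
# => [10, 40, 0, 1]
```

The resulting array could then be passed to Ml::IncomeClassifierModel.predict unchanged.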

Integrate with the rest of the application

The features required for the prediction must be passed in a specific order, and the predicted value will be either 0 or 1 (for “<= 50K” and “> 50K” respectively), so, for convenience, both details can be wrapped in another method.

Given the prediction will be performed based on values obtained from the form submitted by the user, the helper method can look like the following:

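class Ml
  # Stores the predicted class label on the submitted form record.
  def self.classify(submission)
    # The order of the features must match the order used in training.
    submission.classified_as = predict([
      submission.education,
      submission.hours_per_week,
      submission.capital_gain,
      submission.male ? 1 : 0
    ])
  end

  # Translates the model's 0/1 output into a human-readable label.
  def self.predict(features)
    classes = ["<= 50K", "> 50K"]
    predicted = Ml::IncomeClassifierModel.predict(features)
    classes[predicted]
  end
end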
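
To show the wrapper end to end without the generated file, here is a minimal runnable sketch. Everything in it is a stand-in: the stubbed Ml::IncomeClassifierModel uses a made-up rule instead of the exported Random Forest, and Submission is a plain Struct rather than whatever record the real application uses.

```ruby
class Ml
  # Stub with an invented rule, used only so the example runs on its own.
  # The real class is generated by sklearn-porter.
  class IncomeClassifierModel
    def self.predict(features)
      education, _hours_per_week, capital_gain, _male = features
      (education >= 13 || capital_gain > 5000) ? 1 : 0
    end
  end

  CLASSES = ["<= 50K", "> 50K"].freeze

  # Same shape as the helper above: build the ordered feature array,
  # predict, and store the label on the submission.
  def self.classify(submission)
    submission.classified_as = predict([
      submission.education,
      submission.hours_per_week,
      submission.capital_gain,
      submission.male ? 1 : 0
    ])
  end

  def self.predict(features)
    CLASSES[Ml::IncomeClassifierModel.predict(features)]
  end
end

# Hypothetical stand-in for the form submission record.
Submission = Struct.new(:education, :hours_per_week, :capital_gain, :male, :classified_as)

submission = Submission.new(10, 40, 0, true)
Ml.classify(submission)
submission.classified_as # => "<= 50K" (with the stub rule above)
```

With the real generated class in place of the stub, the label depends on the trained model rather than on this invented rule.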

Summary

Even though the machine learning model was trained on really old data (1994 was 25 years ago at the time of writing, in 2019) and will most likely not be accurate for submissions reflecting today’s circumstances, this was still a great exercise and an amazing experience!

Looking back at where I spent most of my time while developing this simple app, I had to put more effort into building the frontend of the application than into the machine learning engine powering it…
