
Class for Scaling of features #1876

Merged
merged 38 commits into mlpack:master from jeffin143:scalerclass on Jul 17, 2019

Conversation

@jeffin143 (Contributor) commented Apr 25, 2019

I have implemented the following classes, which are helpful for scaling of features:

  • MinMax Scaler class

  • MaxAbs Scaler class

  • Standard Scaler class

  • Mean Normalization class

  • ZCA Whitening

  • PCA Whitening

Scaling to unit length is already available in Armadillo, so we need not implement it here; we can simply call arma::normalise() to scale to unit length.
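(For context, here is a minimal C++ usage sketch of the intended interface; the header path and the Fit()/Transform() member names follow the present-day mlpack source and are assumptions as far as this PR's intermediate revisions go.)

#include <mlpack/core.hpp>
#include <mlpack/core/data/scaler_methods/min_max_scaler.hpp>

int main()
{
  // Column-major data: each column is a point, each row a feature.
  arma::mat dataset(3, 100, arma::fill::randu);

  // Fit the scaler on the data, then transform it.
  mlpack::data::MinMaxScaler scaler;
  scaler.Fit(dataset);
  arma::mat scaled;
  scaler.Transform(dataset, scaled);

  // Scaling to unit length needs no new class: Armadillo's normalise()
  // scales each column (i.e. each point) to unit norm.
  arma::mat unitLength = arma::normalise(dataset);

  return 0;
}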

jeffin143 added 2 commits Apr 25, 2019
jeffin143 added 2 commits May 11, 2019
…apt new interface and removing typos
Member left a comment

Hey @jeffin143, looking good. I think this is nice support to add to mlpack. But I have a bunch of comments. Sorry if some of them are kind of nitpicky. :)

Review comments (since resolved) on:
src/mlpack/core/data/scaler_methods/CMakeLists.txt
src/mlpack/core/data/scaler_methods/maxabsscaler.hpp (4 comments)
src/mlpack/core/data/scaler_methods/minmaxscaler.hpp
src/mlpack/tests/load_save_test.cpp (3 comments)
jeffin143 added 3 commits May 15, 2019
Made separate file for scaling test
Added more documentation
Adapted Style
Made required changes
Reduce Lines
@jeffin143 (Contributor Author) commented May 22, 2019

@zoq, could you please re-run the Travis build once? Probably the probabilistic test failed; I didn't find any other issue after going through the failed tests.

@rcurtin removed the s: unanswered label May 29, 2019
Member left a comment

Looks good, I like the addition of the PCA and ZCA scalers. I think these will be useful. I also like the binding. I left some comments there.

We should definitely add something in main_tests/ for the binding, but the key is just that the tests there need to make sure that the options are getting passed correctly---so we don't need to check that the results are exactly right (the tests you already wrote do that) but instead that the options make something happen. So, e.g., tests in main_tests/ should try passing different scaling strategies to ensure that the output changes, and different parameters for those to make sure the output also changes, and should make sure that unrelated options don't change anything (like if I specify epsilon for a scaler that doesn't need it, then the result shouldn't change).

For the CLI output format, the code that outputs the matrix in data::Save() in turn calls out to Armadillo, but last I checked we were not able to change the format. However, perhaps that has been changed. If you like, you can open a separate issue and we can handle it there. Another option might be submitting a patch upstream to Armadillo so that it doesn't output full precision when it isn't needed.

Also, we should document this in HISTORY.md so the changelog shows we added it. :)
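(For illustration only, here is a rough sketch of the kind of main_tests/ case being described, assuming the conventions used by the other binding tests there, i.e. a fixture that restores and clears the CLI settings plus the SetInputParam() and mlpackMain() helpers; the test name, parameter names, and surrounding boilerplate are assumptions, not the code that was eventually added.)

// Assumes the usual main_tests/ boilerplate: BINDING_TYPE_TEST, a testName
// string, inclusion of preprocess_scale_main.cpp and test_helper.hpp, and a
// Boost.Test fixture that calls CLI::RestoreSettings(testName).
BOOST_AUTO_TEST_CASE(ScalerMethodChangesOutputTest)
{
  arma::mat dataset(5, 25, arma::fill::randu);

  // Run the binding with one scaling strategy.
  SetInputParam("input", dataset);
  SetInputParam("scaler_method", std::string("min_max_scaler"));
  mlpackMain();
  const arma::mat minMaxOutput = CLI::GetParam<arma::mat>("output");

  // Reset the settings and run again with a different strategy.
  CLI::ClearSettings();
  CLI::RestoreSettings(testName);
  SetInputParam("input", dataset);
  SetInputParam("scaler_method", std::string("standard_scaler"));
  mlpackMain();
  const arma::mat standardOutput = CLI::GetParam<arma::mat>("output");

  // Passing a different strategy should actually change the output.
  BOOST_REQUIRE(!arma::approx_equal(minMaxOutput, standardOutput,
      "absdiff", 1e-8));
}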

@jeffin143 (Contributor Author) commented May 30, 2019

> We should definitely add something in main_tests/ for the binding, but the key is just that the tests there need to make sure that the options are getting passed correctly---so we don't need to check that the results are exactly right (the tests you already wrote do that) but instead that the options make something happen. So, e.g., tests in main_tests/ should try passing different scaling strategies to ensure that the output changes, and different parameters for those to make sure the output also changes, and should make sure that unrelated options don't change anything (like if I specify epsilon for a scaler that doesn't need it, then the result shouldn't change).

Added all three tests according to your suggestion.

> For the CLI output format, the code that outputs the matrix in data::Save() in turn calls out to Armadillo, but last I checked we were not able to change the format. However, perhaps that has been changed. If you like, you can open a separate issue and we can handle it there. Another option might be submitting a patch upstream to Armadillo so that it doesn't output full precision when it isn't needed.

Ok, as of now we will leave it, but we surely need to look into this issue; I will open an issue. Also, about submitting a patch upstream: I guess I am not so well versed with the Armadillo library :), so I will leave that to you.

> Also, we should document this in HISTORY.md so the changelog shows we added it. :)

I will do it.

jeffin143 added 2 commits May 30, 2019
Syncing
@jeffin143 (Contributor Author) commented May 31, 2019

@rcurtin Just for fun's sake: I tried to benchmark the functions.
Armadillo is really great; we can see its fast implementation even on a relatively small dataset.

c++ > python :)

Scaler method    mlpack (s)   sklearn (s)
MinMaxScaler     0.001542     0.002096
MaxAbsScaler     0.001190     0.001541
StandardScaler   0.001892     0.037366

(Screenshot of the benchmark output: Screenshot from 2019-05-31 09-31-25)

The script I used was:

from sklearn.datasets import load_digits
from mlpack import preprocess_scale
import time
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import MaxAbsScaler
from sklearn.preprocessing import StandardScaler

x = load_digits().data
y = load_digits().data

start_time = time.time()
output = MinMaxScaler().fit(x).transform(x)
end_time = time.time() - start_time
print("sklearn time for minmax scaler is ",end_time)


start_time = time.time()
output = preprocess_scale(input = y, scaler_method = 'min_max_scaler')
end_time = time.time() - start_time
print("mlpack time for minmax scaler is",end_time)

start_time = time.time()
output = MaxAbsScaler().fit(x).transform(x)
end_time = time.time() - start_time
print("sklearn time for MaxAbsScaler scaler is ",end_time)

start_time = time.time()
output = preprocess_scale(input = y, scaler_method = 'max_abs_scaler')
end_time = time.time() - start_time
print("mlpack time for MaxAbsScaler scaler is",end_time)

start_time = time.time()
output = StandardScaler().fit(x).transform(x)
end_time = time.time() - start_time
print("sklearn time for StandardScaler scaler is ",end_time)

start_time = time.time()
output = preprocess_scale(input = y, scaler_method = 'standard_scaler')
end_time = time.time() - start_time
print("mlpack time for StandardScaler is",end_time)

PCA whitening and ZCA whitening were not available in sklearn, and mean_normalization wasn't there either.
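(As a side note, mean normalization is easy to express directly in Armadillo; a minimal sketch, assuming mlpack's column-major convention where each row is a feature and each column a point.)

#include <armadillo>

int main()
{
  arma::mat input(3, 100, arma::fill::randu);

  // Mean normalization per feature: (x - mean) / (max - min), computed
  // row-wise because each row is a feature.
  arma::vec means = arma::mean(input, 1);
  arma::vec ranges = arma::max(input, 1) - arma::min(input, 1);

  // A constant feature (zero range) would need guarding in real code.
  arma::mat output = input;
  output.each_col() -= means;
  output.each_col() /= ranges;

  return 0;
}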

jeffin143 added 3 commits Jun 8, 2019
… change in documentation
@jeffin143 force-pushed the jeffin143:scalerclass branch from ec6e7c2 to 5fd0763 on Jun 8, 2019
@jeffin143 (Contributor Author) commented Jun 8, 2019

Not sure how well I have written the binding; if you find that it isn't appropriate to introduce this API and we should stick to the old one, let me know and I will revert back to it.

Member left a comment

I know I always have a lot of comments but this is really exciting and I'm looking forward to seeing it merged. Especially the speedups you are seeing are great---this is something nice we can advertise, and with the Python bindings it's easy to just drop mlpack into your workflow for feature scaling and get a big speedup, especially for standard scaling. I bet those speedups get even bigger as the data matrices get bigger too.

All the comments I left are pretty minor ideas for cleanups and clarifications. The only big comment is for the tests for the binding---we should be careful to test everything, so if you don't mind adding the tests I suggested it would really help us know that the code is robust. :)

The documentation you wrote for the binding is great; it's clear and easy to use. If you eventually wanted to write a blog post or something detailing this binding and how you can drop it in a workflow to get speedup, we could do that and then try and publicize it. Up to you.

From my side I don't see any other changes than just the stuff I pointed out, so if you can handle those I think it's ready for merge.

> Ok, as of now we will leave it, but we surely need to look into this issue; I will open an issue. Also, about submitting a patch upstream: I guess I am not so well versed with the Armadillo library :), so I will leave that to you.

Sounds good, if you want to chat more about it just ping me sometime. I can explain a bit more about what's going on under the hood in Armadillo (it's very templated code so it can be hard to trace it all down sometimes).

Review comments (since resolved) on:
src/mlpack/core/data/scaler_methods/pcawhitening.hpp (3 comments)
src/mlpack/core/data/scaler_methods/scaling_model.hpp
src/mlpack/tests/main_tests/preprocess_scale_test.cpp (3 comments)
jeffin143 added 3 commits Jun 10, 2019
Member left a comment

From my end everything looks good here. Thanks so much for the hard work and I'm excited to integrate this. 👍

I only had one other little comment, but it's easy for me to handle that during merge (or if there is a reason I overlooked to name it zca, let me know). For those last tests, I can also implement those if you like; it should be really easy. They might seem like really pedantic tests to add now, but they become useful later when someone modifies the bindings or changes the set of allowed algorithms or something like this. (Think of the test as a "defensive measure" to keep the code from getting unintentionally broken by later changes. :))

jeffin143 added 2 commits Jun 19, 2019
@rcurtin (Member) commented Jul 12, 2019

@mlpack-bot why aren't you automatically approving this? :)

Member left a comment

(This is an attempt to try to get mlpack-bot to approve the PR...)

@mlpack-bot (bot) approved these changes Jul 17, 2019 and left a comment

Second approval provided automatically after 24 hours. 👍

@rcurtin (Member) commented Jul 17, 2019

It worked!!! Took me like a week to debug that.

@rcurtin merged commit 7237d17 into mlpack:master on Jul 17, 2019
6 checks passed:
LaTeX Documentation Checks: Build finished.
Memory Checks: Build finished.
Static Code Analysis Checks: Build finished.
Style Checks: Build finished.
continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed
@rcurtin (Member) commented Jul 17, 2019

Thanks again for the contribution; I'm really happy to have this merged. I made some minor style changes in 2d8e1f8, including changing PcaWhitening and ZcaWhitening to PCAWhitening and ZCAWhitening since PCA and ZCA are acronyms. :) I know it is a little pedantic but I am all about consistency :)
