Add CLI and Python bindings for linear_svm #1935

Merged
31 commits merged into mlpack:master from Yashwants19:cli-lsvm on Oct 9, 2019

Conversation


Yashwants19 commented Jun 21, 2019

Addresses #1633.

Yashwants19 added 6 commits Jun 21, 2019
Yashwants19 added 3 commits Jun 25, 2019

Yashwants19 commented Jun 26, 2019

Hi @rcurtin, @zoq: this is ready for review.


zoq left a comment

Thanks, this looks great.

3 review comments on src/mlpack/methods/linear_svm/linear_svm_main.cpp (outdated, resolved)
Yashwants19 mentioned this pull request Jul 3, 2019
rcurtin removed the s: unanswered label Jul 6, 2019

rcurtin left a comment

Hey @Yashwants19, nice work and thank you for writing this code. I've taken a pass over it and left a bunch of comments. I think we can definitely merge this in---my comments are mostly minor in nature and should be relatively easy to resolve. 👍

* @param delta Margin of difference between correct class and other classes.
* @param fitIntercept add intercept term or not.
*/
LinearSVM(const size_t numClasses = 0,

rcurtin commented Jul 8, 2019
Instead of saying be sure to use Train() before calling Classify() or ComputeAccuracy(), it seems to me like we could just check in those functions whether or not the model has 0 classes, and issue an error if so.
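For illustration, the kind of guard being suggested could be as simple as this at the top of Classify() and ComputeAccuracy() (a sketch only; the member name numClasses and the exact error wording are assumptions, not taken from the actual patch):

// If the model was never trained, it has zero classes, so classification
// cannot produce meaningful results.
if (numClasses == 0)
{
  throw std::runtime_error("LinearSVM::Classify(): no model trained; call "
      "Train() before Classify() or ComputeAccuracy().");
}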

Yashwants19 commented Jul 11, 2019

I have added this part. Please take a look.
Sorry for the delay.

3 review comments on src/mlpack/methods/linear_svm/linear_svm_main.cpp (outdated, resolved)
"be specified with the" + PRINT_PARAM_STRING("delta") + "option."
"The optimizer used to train the model can be specified with the " +
PRINT_PARAM_STRING("optimizer") + " parameter. Available options are "
"'psgd' (stochastic gradient descent) and 'lbfgs' (the L-BFGS optimizer). "

rcurtin commented Jul 8, 2019

It might be worth writing "parallel stochastic gradient descent", not just "stochastic gradient descent", here, since there are lots of variations of SGD.

3 review comments on src/mlpack/tests/main_tests/linear_svm_test.cpp (outdated, resolved)
Yashwants19 added 3 commits Jul 11, 2019
Remove Line.

Yashwants19 commented Jul 13, 2019

Hi @rcurtin, this is ready for review. :)


mlpack-bot bot commented Aug 12, 2019

This issue has been automatically marked as stale because it has not had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! 👍

rcurtin added this to the mlpack 3.2.0 (maybe) milestone Aug 31, 2019

Yashwants19 commented Sep 3, 2019

Hi @rcurtin, this is ready for review.

rcurtin modified the milestones: mlpack 3.2.1, mlpack 3.2.2 Sep 27, 2019

Yashwants19 commented Sep 30, 2019

Hi @rcurtin, I think we should finish this.
If you have some time, please review it.
Thank you.


rcurtin commented Sep 30, 2019

Hey @Yashwants19, I totally agree; I have just been busy with other things and haven't really been able to look into this. Basically, we need to figure out what is going on with parallel SGD and why it causes problems with the intercepts. Once we resolve that, I think it's ready to merge. I hope I'll have some more time to look into this more comprehensively later this week.


rcurtin commented Oct 4, 2019

@Yashwants19 I'm not able to reproduce the error that you keep mentioning. Can you give me details on how to reproduce it so I can debug it? Thanks!

Yashwants19 added 2 commits Oct 4, 2019

Yashwants19 commented Oct 4, 2019

Hi @rcurtin, as you can see in the Travis build, psgd hits a matrix-incompatibility error.


rcurtin commented Oct 5, 2019

I was able to debug this pretty easily with gdb. Here's the diff to fix the issue:

diff --git a/src/mlpack/methods/linear_svm/linear_svm_function_impl.hpp b/src/mlpack/methods/linear_svm/linear_svm_function_impl.hpp
index 4e8a751ce..0656631c2 100644
--- a/src/mlpack/methods/linear_svm/linear_svm_function_impl.hpp
+++ b/src/mlpack/methods/linear_svm/linear_svm_function_impl.hpp
@@ -320,7 +320,7 @@ void LinearSVMFunction<MatType>::Gradient(
   {
     scores = parameters.rows(0, dataset.n_rows - 1).t()
         * dataset.cols(firstId, lastId)
-        + arma::repmat(parameters.row(dataset.n_rows).t(), 1, dataset.n_cols);
+        + arma::repmat(parameters.row(dataset.n_rows).t(), 1, batchSize);
   }
 
   arma::mat margin = scores - (arma::repmat(arma::ones(numClasses).t()

You can see that the repmat call incorrectly had dataset.n_cols as a parameter instead of batchSize.


Yashwants19 commented Oct 5, 2019

You are great, @rcurtin. Thanks for the help; this is now ready for a complete review. Thank you :)


rcurtin left a comment

I thought maybe debugging the psgd issue would be really tricky, so I was going to make sure to set aside an entire evening for it, but that turned out to not be necessary---it was pretty straightforward.

I left some comments; I think that some option names have to change, and I found one bug with the accuracy calculation.

I also did some timing and was a little bit surprised by how slow parallel SGD can sometimes be:

$ bin/mlpack_linear_svm -t ~/datasets/covertype-5k.train.csv -l ~/datasets/covertype-5k.train.labels.csv -T ~/datasets/covertype-5k.test.csv -A ~/datasets/covertype-5k.test.labels.csv -v --max_iterations 50 -O psgd --epochs 1
[WARN ] Should pass one of '--output_model_file (-M)', '--predictions_file (-P)', or '--probabilities_file (-p)'; no output will be saved!
[WARN ] '--max_iterations (-n)' ignored because optimizer type is not 'lbfgs'.
[INFO ] Loading '/home/ryan/datasets/covertype-5k.train.csv' as CSV data.  Size is 54 x 5000.
[INFO ] Loading '/home/ryan/datasets/covertype-5k.train.labels.csv' as raw ASCII formatted data.  Size is 5000 x 1.
[INFO ] Training model with ParallelSGD optimizer.
[INFO ] LinearSVM::LinearSVM(): final objective of trained model is 114918.
[INFO ] Loading '/home/ryan/datasets/covertype-5k.test.csv' as CSV data.  Size is 54 x 5000.
[INFO ] Loading '/home/ryan/datasets/covertype-5k.test.labels.csv' as raw ASCII formatted data.  Size is 5000 x 1.
[INFO ] Accuracy for points with label 0 is -nan (0 of 0).
[INFO ] Accuracy for points with label 1 is 0.912241 (1632 of 1789).
[INFO ] Accuracy for points with label 2 is 0.287571 (708 of 2462).
[INFO ] Accuracy for points with label 3 is 0.899371 (286 of 318).
[INFO ] Accuracy for points with label 4 is 0 (0 of 26).
[INFO ] Accuracy for points with label 5 is 0.075 (6 of 80).
[INFO ] Accuracy for points with label 6 is 0 (0 of 149).
[INFO ] Total accuracy for all points is 0.5264 (2632 of 5000).
[INFO ] 
[INFO ] Execution parameters:
[INFO ]   delta: 1
[INFO ]   epochs: 1
[INFO ]   help: 0
[INFO ]   info:
[INFO ]   input_model_file:
[INFO ]   labels_file: /home/ryan/datasets/covertype-5k.train.labels.csv
[INFO ]   lambda: 0.0001
[INFO ]   max_iterations: 50
[INFO ]   no_intercept: 0
[INFO ]   num_classes: 0
[INFO ]   optimizer: psgd
[INFO ]   output_model_file:
[INFO ]   predictions_file:
[INFO ]   probabilities_file:
[INFO ]   seed: 0
[INFO ]   shuffle: 0
[INFO ]   step_size: 0.01
[INFO ]   test_file: /home/ryan/datasets/covertype-5k.test.csv
[INFO ]   test_labels_file: /home/ryan/datasets/covertype-5k.test.labels.csv
[INFO ]   tolerance: 1e-10
[INFO ]   training_file: /home/ryan/datasets/covertype-5k.train.csv
[INFO ]   verbose: 1
[INFO ]   version: 0
[INFO ] Program timers:
[INFO ]   linear_svm_optimization: 174.025922s (2 mins, 54.0 secs)
[INFO ]   loading_data: 0.065555s
[INFO ]   total_time: 174.093572s (2 mins, 54.0 secs)

as compared to L-BFGS with the same effective number of iterations (5000):

$ bin/mlpack_linear_svm -t ~/datasets/covertype-5k.train.csv -l ~/datasets/covertype-5k.train.labels.csv -T ~/datasets/covertype-5k.test.csv -A ~/datasets/covertype-5k.test.labels.csv -v --max_iterations 5000
[WARN ] Should pass one of '--output_model_file (-M)', '--predictions_file (-P)', or '--probabilities_file (-p)'; no output will be saved!
[INFO ] Loading '/home/ryan/datasets/covertype-5k.train.csv' as CSV data.  Size is 54 x 5000.
[INFO ] Loading '/home/ryan/datasets/covertype-5k.train.labels.csv' as raw ASCII formatted data.  Size is 5000 x 1.
[INFO ] Training model with L-BFGS optimizer.
[INFO ] LinearSVM::LinearSVM(): final objective of trained model is 0.908343.
[INFO ] Loading '/home/ryan/datasets/covertype-5k.test.csv' as CSV data.  Size is 54 x 5000.
[INFO ] Loading '/home/ryan/datasets/covertype-5k.test.labels.csv' as raw ASCII formatted data.  Size is 5000 x 1.
[INFO ] Accuracy for points with label 0 is -nan (0 of 0).
[INFO ] Accuracy for points with label 1 is 0.713248 (1276 of 1789).
[INFO ] Accuracy for points with label 2 is 0.787571 (1939 of 2462).
[INFO ] Accuracy for points with label 3 is 0.716981 (228 of 318).
[INFO ] Accuracy for points with label 4 is 0.230769 (6 of 26).
[INFO ] Accuracy for points with label 5 is 0 (0 of 80).
[INFO ] Accuracy for points with label 6 is 0.120805 (18 of 149).
[INFO ] Total accuracy for all points is 0.6934 (3467 of 5000).
[INFO ] 
[INFO ] Execution parameters:
[INFO ]   delta: 1
[INFO ]   epochs: 50
[INFO ]   help: 0
[INFO ]   info: 
[INFO ]   input_model_file: 
[INFO ]   labels_file: /home/ryan/datasets/covertype-5k.train.labels.csv
[INFO ]   lambda: 0.0001
[INFO ]   max_iterations: 5000
[INFO ]   no_intercept: 0
[INFO ]   num_classes: 0
[INFO ]   optimizer: lbfgs
[INFO ]   output_model_file: 
[INFO ]   predictions_file: 
[INFO ]   probabilities_file: 
[INFO ]   seed: 0
[INFO ]   shuffle: 0
[INFO ]   step_size: 0.01
[INFO ]   test_file: /home/ryan/datasets/covertype-5k.test.csv
[INFO ]   test_labels_file: /home/ryan/datasets/covertype-5k.test.labels.csv
[INFO ]   tolerance: 1e-10
[INFO ]   training_file: /home/ryan/datasets/covertype-5k.train.csv
[INFO ]   verbose: 1
[INFO ]   version: 0
[INFO ] Program timers:
[INFO ]   linear_svm_optimization: 19.017438s
[INFO ]   loading_data: 0.077176s
[INFO ]   total_time: 19.109493s

It may be worth investigating further here---such a huge slowdown for parallel SGD seems pretty unreasonable; it shouldn't take an order of magnitude longer to get to a model that's much worse than L-BFGS gives in far less time. However, there are probably two issues here:

  1. The LinearSVMFunction implementation could be accelerated. This is supported by some basic instrumenting I did, where I used Timer::Start() and Timer::Stop() inside Evaluate() and Gradient() to see how long was spent in each function (a rough sketch of that instrumentation is shown after this list). For the L-BFGS run above, I saw:

[INFO ]   lsvm_evaluate: 4.592845s
[INFO ]   lsvm_gradient: 15.203862s

  2. The ensmallen ParallelSGD optimizer is for some reason not doing well on this dataset. That could be for a lot of reasons; I am not sure which it might be. I played with various settings of OMP_NUM_THREADS and found that 2 threads did give a speedup over 1, but 8 did not, so it's probably some kind of locking issue---which actually there shouldn't be in Hogwild.
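
For reference, the instrumentation can be as simple as wrapping the function bodies in named timers; a minimal sketch, assuming mlpack's Timer::Start()/Timer::Stop() API (the timer names below are just the ones reported above, not part of any committed code):

// Inside LinearSVMFunction<MatType>::Evaluate() (and likewise in Gradient(),
// using the "lsvm_gradient" timer name):
Timer::Start("lsvm_evaluate");
// ... existing objective computation ...
Timer::Stop("lsvm_evaluate");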

Lastly, I ran a quick comparison with the most similar thing I could find that was easy to run (scikit's SVC classifier). I used the following code:

import pandas as pd
import numpy as np

train = pd.read_csv("/home/ryan/datasets/covertype-5k.train.csv", header=None)
tr_l = pd.read_csv("/home/ryan/datasets/covertype-5k.train.labels.csv",
    header=None)
test = pd.read_csv("/home/ryan/datasets/covertype-5k.test.csv", header=None)
t_l = pd.read_csv("/home/ryan/datasets/covertype-5k.test.labels.csv",
    header=None)

from sklearn.svm import SVC, LinearSVC

import time

svc = SVC(kernel='linear')

start_time = time.time()
svc.fit(train[1:1000], tr_l[1:1000])
end_time = time.time()

print("SVC: %s seconds" % (end_time - start_time))

# Compute accuracy.
preds = svc.predict(test)
correct = np.sum(preds == np.reshape(t_l.values, (len(preds),)))
print("SVC correct: %d out of %d (%f)" % (correct, len(preds), 100 *
    float(correct) / float(len(preds))))
                
lsvc = LinearSVC()
# tuned version below
#lsvc = LinearSVC(max_iter=1000000, verbose=1000, loss='hinge')
                      
start_time = time.time()   
lsvc.fit(train, tr_l)
end_time = time.time()

print("LinearSVC: %s seconds" % (end_time - start_time))

# Compute accuracy.
preds = lsvc.predict(test)
correct = np.sum(preds == np.reshape(t_l.values, (len(preds),)))
print("LinearSVC correct: %d out of %d (%f)" % (correct, len(preds), 100 *
    float(correct) / float(len(preds))))

Note that this only runs SVC (which uses libsvm) on 1000 points because it just took so long; I got impatient waiting and commented that part out. Training LinearSVC on the "full" dataset of 5k points also took a very long time, and I had to do some amount of tuning. I'm not sure whether the tuning I did gave the fastest configuration, but anyway, here were the results:

LinearSVC: 1033.84167314 seconds
LinearSVC correct: 3259 out of 5000 (65.180000)

So, those results are promising in that we've outperformed the scikit implementation pretty resoundingly, but I still think there's a lot of acceleration possible and it's not behaving quite how I would have expected---I definitely would have expected psgd to be much quicker than L-BFGS.

I know this comment got really long---anyway, most of it is my observations, but if you can handle the couple of comments I left and update HISTORY.md, then I think this is ready to merge. 👍 If you also want to take the time to play around with some benchmarking, it would be interesting to see what you find.

"visited for parallel SGD.", "S");
PARAM_INT_IN("epochs", "Maximum number of full epochs over dataset for "
"psgd", "E", 50);
PARAM_INT_IN("seed", "Random seed. If 0, 'std::time(NULL)' is used.", "r", 0);

rcurtin commented Oct 5, 2019

This parameter is generally -s in other programs. Maybe step size should be -p ("steP") or something, or maybe -a because the step size is often also known as alpha.
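For illustration only, the alias change being suggested might look something like this (the step_size definition is not shown in the quoted snippet, so its description and default value here are assumptions):

// Give the seed its usual '-s' alias, and move the step size to '-a' (alpha).
PARAM_INT_IN("seed", "Random seed. If 0, 'std::time(NULL)' is used.", "s", 0);
PARAM_DOUBLE_IN("step_size", "Step size for parallel SGD.", "a", 0.01);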

Yashwants19 commented Oct 6, 2019

Updated.

if (CLI::HasParam("test_labels"))
{
arma::Row<size_t> testLabels =
std::move(CLI::GetParam<arma::Row<size_t>>("test_labels"));

rcurtin commented Oct 5, 2019

Don't forget to apply the mapping to the test labels too---this caused some problems in some simulations I ran.
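A rough sketch of what that remapping might look like, assuming the model stores its training-label mapping in model->mappings (as referenced elsewhere in this review); the exact types and error handling may differ:

// Map each raw test label onto the index it was assigned during training,
// so accuracy is computed in the same label space the model uses.
arma::Row<size_t> mappedTestLabels(testLabels.n_elem);
for (size_t i = 0; i < testLabels.n_elem; ++i)
{
  const arma::uvec match = arma::find(model->mappings == testLabels[i], 1);
  if (match.n_elem == 0)
  {
    Log::Fatal << "Test label " << testLabels[i] << " was not present in the "
        << "training labels!" << std::endl;
  }
  mappedTestLabels[i] = match[0];
}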

Yashwants19 commented Oct 6, 2019

I have added this part.

numClasses = CLI::GetParam<int>("num_classes") == 0 ?
model->mappings.n_elem : CLI::GetParam<int>("num_classes");
vector<size_t> correctClassCounts(numClasses, 0);
vector<size_t> labelSize(numClasses, 0);

rcurtin commented Oct 5, 2019

I'd suggest using arma::Col<size_t> here for consistency.
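Something like this, for example (a sketch only; the zero-initialization mirrors the std::vector<size_t> construction it replaces, using the variables from the quoted code):

// Zero-initialized Armadillo columns instead of std::vector<size_t>.
arma::Col<size_t> correctClassCounts(numClasses, arma::fill::zeros);
arma::Col<size_t> labelSize(numClasses, arma::fill::zeros);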

Yashwants19 commented Oct 6, 2019

Updated


Yashwants19 commented Oct 6, 2019

I will definitely open a PR for benchmarking afterward.
Thank you.

Yashwants19 added 3 commits Oct 6, 2019
rcurtin approved these changes Oct 6, 2019

rcurtin left a comment

@Yashwants19 nice work on this! If you do end up benchmarking it later it would be really interesting to see how it works vs. comparable implementations. :)

}

numClasses = CLI::GetParam<int>("num_classes") == 0 ?
model->mappings.n_elem : CLI::GetParam<int>("num_classes");

rcurtin commented Oct 6, 2019

Oops, this line should be doubly indented. I can handle it during merge unless you get to it before me. :)

Yashwants19 commented Oct 7, 2019

Job done :)

mlpack-bot bot approved these changes Oct 7, 2019

mlpack-bot bot left a comment

Second approval provided automatically after 24 hours. 👍

rcurtin removed the s: needs review label Oct 9, 2019
rcurtin merged commit 873360d into mlpack:master Oct 9, 2019
6 checks passed:
LaTeX Documentation Checks: Build finished.
Memory Checks: Build finished.
Static Code Analysis Checks: Build finished.
Style Checks: Build finished.
continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed

rcurtin commented Oct 9, 2019

Thanks again, awesome contribution. 👍

Yashwants19 deleted the Yashwants19:cli-lsvm branch Oct 9, 2019

Yashwants19 commented Oct 9, 2019

Hi @rcurtin, I'm now looking forward to completing the Go bindings PR. After that, I would like to work on Extra Trees.


rcurtin commented Oct 9, 2019

Sounds good, I'll focus my review efforts there as I am able to. 👍


Yashwants19 commented Oct 9, 2019

No problem and thank you for the help.
