
[Feature] GridSearchCV for Surprise algorithms #6

Closed · mahermalaeb opened this issue Dec 30, 2016 · 9 comments

@mahermalaeb (Contributor)

Add the ability for a user to easily test an algorithm with different parameters, similar to sklearn's GridSearchCV.

The user gives an algorithm and a dictionary of the parameter values to try, and it finds the best combination of parameters based on some error measure. Since the data model of Surprise does not directly plug into GridSearchCV, an analogous functionality should be added.
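
To make the request concrete, here is a minimal sketch of what such a feature would automate, written against the Surprise API available at the time of this issue (evaluate() and data.split()); the grid-search loop itself is only an illustration, not a proposed implementation:

    # Illustrative sketch: brute-force grid search over SVD parameters using
    # the evaluate() / data.split() API Surprise had at the time of this issue.
    from itertools import product

    import numpy as np
    from surprise import SVD, Dataset, evaluate

    param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005]}

    data = Dataset.load_builtin('ml-100k')
    data.split(n_folds=3)

    best_rmse, best_params = float('inf'), None
    for values in product(*param_grid.values()):
        params = dict(zip(param_grid, values))
        perf = evaluate(SVD(**params), data, measures=['RMSE'])
        rmse = np.mean(perf['RMSE'])
        if rmse < best_rmse:
            best_rmse, best_params = rmse, params

    print(best_rmse, best_params)

The requested GridSearchCV-like helper would wrap exactly this kind of loop and expose the results through attributes such as best_params and cv_results.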

@mahermalaeb (Contributor, Author)

@NicolasHug,
I want to add documentation for the new feature. Before opening the pull request I want to test it on Read the Docs to see if it looks good. I read a little, and it seems that I have to create an account and publish the docs from my forked repository to do so.

Can you confirm that this is the way to go, or is there an alternative I am not aware of?

Thanks,
Maher

@NicolasHug (Owner)

Hey Maher,

No, you don't need to use RTD; you can simply build the docs locally. If you haven't already, install all the packages needed for development with

pip install -r requirements_dev.txt

This will install Sphinx (the documentation generator), the RTD theme, and some Sphinx extensions needed in the doc-building process.

Then you should be able to build the docs locally by running

cd doc
make html

You can check the results in doc/build/html

I'll add all this in the CONTRIBUTING.md file!

@mahermalaeb (Contributor, Author)

Hello @NicolasHug,

I have finished the implementation and basic documentation of the feature, and pushed the changes to my repo here. When you have time, please take a look at the docs to confirm that the functionality is what you expected.

In addition, I want to write a user-guide page, but I'd like your opinion on where to add it.
Should it go under the "Advanced Usage" section, should I create a separate section, or should I add a Jupyter notebook?

Happy new year btw :)

Maher

@NicolasHug (Owner)

This is awesome!

For the doc, you can put it as the first subsection of "Advanced Usage", right before "Manually iterate over folds". As this is a major feature, you can also add a quick example in the README under the "example" section (if you don't, no problem, I'll do it ^^).

I haven't thoroughly tested the code yet (I'll come back to you, probably on Monday), so these are just a few remarks after taking a quick glance at it:

  • In the test file, could we set the number of epochs and number of factors to lower values so that the tests run faster?
  • We can make a direct reference to the GridSearchCV of scikit-learn in the doc, so that people know what this is about.
  • Try to be as pythonic as possible. For example, using enumerate in a for loop would be better than counting the number of iterations (combination_counter += 1). These two talks (talk1, talk2) by Raymond Hettinger have been of great help to me.
  • I'd rather not use u'strings' here, is it possible to avoid them?
  • Could cv_results_ be a defaultdict(list)? I think this would avoid the if key in self.cv_results_.keys(): ... checks (see the sketch after this list).
  • I've never used attributes that end (or start) with underscores, so maybe rename best_params_ and the like to best_params.
  • Is this true: "Depending on the nature of the data parameter, it may or may not cross validation."?
  • As best_params_ and co are defaultdict, I don't think the tests in test_measure_is_not_case_sensitive do what you think they do (no exception will ever be raised).
  • Maybe add a little bit of (meaningful) comments to the code?
  • Don't forget to be PEP 8 compliant! I see many lines exceeding 80 characters; flake8 is your friend.
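
To make the enumerate and defaultdict points concrete, here is a tiny, self-contained sketch of both idioms; the names are made up for illustration and are not taken from the actual GridSearch code:

    # Illustrative only: the enumerate and defaultdict(list) idioms suggested
    # above, with made-up names (not the actual GridSearch implementation).
    from collections import defaultdict
    from itertools import product

    param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005]}
    combinations = [dict(zip(param_grid, values))
                    for values in product(*param_grid.values())]

    cv_results = defaultdict(list)
    # enumerate replaces a manual combination_counter += 1
    for counter, params in enumerate(combinations, start=1):
        print('Evaluating combination {}/{}'.format(counter, len(combinations)))
        # defaultdict(list) removes the need for "if key in cv_results" checks
        for name, value in params.items():
            cv_results[name].append(value)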

Also, please create a pull request so that we can review the changes there. Your upcoming commits will be automatically added to the pull request discussion.

Thank you so much for putting time into it, this will be a very attractive feature!

Happy new year :)!

Nicolas

@NicolasHug (Owner)

Hey, coming back to you on this :)

It all seems to work very well; I only have a few more remarks:

  • maybe the verbose message could be changed to something like "Computing MAE and RMSE for parameters combination 1/8"?
  • Could we use an OrderedDict for the parameters? Or something that would keep the parameters in the same order as they were given?
  • Could we use the evaluate() function of the evaluate module instead of computing the accuracies manually? To avoid a name clash you would need something like performances = globals()['evaluate'](algo_instance, data) (see the sketch after this list).
  • I think something like measures = [measure.upper() for measure in measures] at the beginning of the function would save a lot of other measure.upper() calls later on.
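
A toy illustration of the globals() lookup mentioned above (the class and function below are stand-ins, not the actual Surprise code):

    # Toy demonstration of calling a module-level function whose name clashes
    # with a method name; stand-in code, not the actual Surprise implementation.
    def evaluate(algo_instance, data):
        """Stand-in for the module-level evaluate() function."""
        return {'RMSE': 0.0}

    class GridSearch:
        def __init__(self, algo_instance):
            self.algo_instance = algo_instance

        def evaluate(self, data):
            # The method shares its name with the module-level function,
            # so looking it up through globals() makes the intent explicit.
            performances = globals()['evaluate'](self.algo_instance, data)
            return performances

    print(GridSearch(algo_instance=None).evaluate(data=None))  # {'RMSE': 0.0}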

And I think this is pretty much it :)

Nicolas

@mahermalaeb (Contributor, Author)

Hello Nicolas,

I have created pull request #7. I have also added a user guide at the beginning of the "Advanced Usage" section.

Please find my comments on each of the points above:

  • In the test file, could we set the number of epochs and number of factors to lower values so that the tests run faster? --> I have reduced the number of iterations in most functions. They could be reduced further, but that requires changing the assertions.
  • We can make a direct reference to the GridSearchCV of scikit-learn in the doc, so that people know what this is about. --> Done
  • Try to be as pythonic as possible. For example, using enumerate in a for loop would be better than counting the number of iterations (combination_counter += 1). These two talks (talk1, talk2) by Raymond Hettinger have been of great help to me. --> Done, thanks to the talks :)
  • I'd rather not use u'strings' here, is it possible to avoid them? --> I didn't get what you meant by this point. Can you please clarify it for me?
  • Could cv_results_ be a defaultdict(list)? I think this would avoid the if key in self.cv_results_.keys(): ... checks. --> Done, in a Pythonic way. Awesome talks again.
  • I've never used attributes that end (or start) with underscores, so maybe rename best_params_ and the like to best_params. --> I don't know why either; I kept the GridSearchCV convention. Let me know if you confirm renaming them.
  • Is this true: "Depending on the nature of the data parameter, it may or may not cross validation."? --> Since I am using the evaluate method you implemented and the data is used as an input, whether it is cross-validated depends on whether the data was split into folds. Am I missing something?
  • As best_params_ and co are defaultdicts, I don't think the tests in test_measure_is_not_case_sensitive do what you think they do (no exception will ever be raised). --> Corrected
  • Maybe add a little bit of (meaningful) comments to the code? --> Done. Let me know if it's enough.
  • Don't forget to be PEP 8 compliant! I see many lines exceeding 80 characters; flake8 is your friend. --> All lines should be less than 80 chars now.
  • Maybe the verbose message could be changed to something like "Computing MAE and RMSE for parameters combination 1/8"? --> I changed the verbose message a little. However, I didn't add the accuracy names such as MAE and RMSE since I am using the evaluate method, which adds them by default. Note that the docs of evaluate say that with verbosity 0 it shouldn't print anything, but it actually prints the accuracy measures.
  • Could we use an OrderedDict for the parameters? Or something that would keep the parameters in the same order as they were given? --> The moment a dict is defined, its key order is lost, so to my knowledge there is no way to preserve the order of the dict keys. The solution would maybe be to change the input of the function to an ordered list, but I think it is easier and more convenient to keep it a normal dict (see the sketch after this list).
  • Could we use the evaluate() function of the evaluate module instead of computing the accuracies manually? To avoid a name clash you would need something like performances = globals()['evaluate'](algo_instance, data) --> This is my bad; I should have used it from the start instead of doing it manually. Now that it's done, the code is much cleaner.
  • I think something like measures = [measure.upper() for measure in measures] at the beginning of the function would save a lot of other measure.upper() calls later on. --> Done
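
On the ordering point, here is a small sketch of the alternative under discussion: letting the caller pass an ordered structure (an OrderedDict built from a list of pairs) on Python versions before 3.7, where plain dict literals do not preserve key order. This is only an illustration of the option, not what was implemented:

    # Sketch only: preserving parameter order on Python < 3.7 by having the
    # caller supply an OrderedDict instead of a plain dict literal.
    from collections import OrderedDict

    param_grid = OrderedDict([('n_epochs', [5, 10]),
                              ('lr_all', [0.002, 0.005]),
                              ('reg_all', [0.4, 0.6])])

    print(list(param_grid))  # ['n_epochs', 'lr_all', 'reg_all'] -- order kept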

Notes:

  • My background is in Java, where good practice is to write many small functions and unit tests; that is not as common in Python. Any enhancements you add to the code will be of great benefit to me.
  • I have created a duplicate case-insensitive dict because the one you already had overrides the __str__ method. This is not good for best_params_ and co, because we are then no longer able to print them. Nevertheless, I think it is better to use inheritance than to duplicate the code, but I didn't want to mess with the code you've written (see the sketch after this list).
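
For what it's worth, here is a toy sketch of the inheritance route mentioned in the last note: a case-insensitive defaultdict that keeps the standard dict printing, so best_params_ and co stay printable. This is only an illustration, not the actual Surprise class:

    # Toy sketch: a case-insensitive defaultdict that does not override
    # __str__, so instances print like ordinary dicts. Illustrative only.
    from collections import defaultdict

    class CaseInsensitiveDefaultDict(defaultdict):
        """defaultdict whose string keys are compared case-insensitively."""

        def __setitem__(self, key, value):
            super(CaseInsensitiveDefaultDict, self).__setitem__(key.lower(), value)

        def __getitem__(self, key):
            return super(CaseInsensitiveDefaultDict, self).__getitem__(key.lower())

    scores = CaseInsensitiveDefaultDict(list)
    scores['RMSE'].append(0.95)
    print(scores['rmse'])  # [0.95]
    print(scores)          # defaultdict(<class 'list'>, {'rmse': [0.95]})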

Looking forward to the merge ;)

Maher

@NicolasHug (Owner)

Awesome!

I put my remarks on the pull request.

@NicolasHug (Owner)

Closing this as it has been solved by 714be0b

@gopal-gcek

param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}

# Cross-validation using grid search
grid_search = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3, n_jobs=1)

# Find the best parameters on the data set
data = Dataset.load_builtin('ml-100k')
After running the above code I got the error below. Can anyone help? I have also checked that there is a measures keyword in Surprise's grid search.

TypeError Traceback (most recent call last)
in ()
6 'reg_all': [0.4, 0.6]}
7 # Cross-validation using grid search
----> 8 grid_search = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3, n_jobs=1)
9 # Find the best parameters on the data set
10 data = Dataset.load_builtin('ml-100k')

TypeError: __init__() got an unexpected keyword argument 'measures'
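
For anyone hitting the same error: this TypeError usually means the GridSearchCV being instantiated is not Surprise's (for example, sklearn's GridSearchCV was imported instead, or an older Surprise version without the surprise.model_selection module is installed). A sketch of the documented usage, assuming Surprise 1.0.5 or later:

    # Assumes Surprise >= 1.0.5, where GridSearchCV lives in surprise.model_selection.
    from surprise import SVD, Dataset
    from surprise.model_selection import GridSearchCV

    param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
                  'reg_all': [0.4, 0.6]}

    data = Dataset.load_builtin('ml-100k')

    grid_search = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'],
                               cv=3, n_jobs=1)
    grid_search.fit(data)

    print(grid_search.best_score['rmse'])
    print(grid_search.best_params['rmse'])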
