
[Feature] GridSearchCV for Surprise algorithms #6

Closed · mahermalaeb opened this issue Dec 30, 2016 · 9 comments

@mahermalaeb (Contributor)

Add the ability for a user to easily test an algorithm with different parameters, similar to sklearn's GridSearchCV.

The user gives an algorithm and a dictionary of the parameter values to try, and it finds the best combination of parameters based on some error measure. Since the data model of Surprise does not directly plug into GridSearchCV, an analogous functionality should be added.
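
To make the request concrete, here is a minimal sketch of what such a feature would automate, written against the Surprise API available at the time of this issue (evaluate() and data.split()); the grid-search loop itself is only an illustration, not a proposed implementation:

    # Illustrative sketch: brute-force grid search over SVD parameters using
    # the evaluate() / data.split() API Surprise had at the time of this issue.
    from itertools import product

    import numpy as np
    from surprise import SVD, Dataset, evaluate

    param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005]}

    data = Dataset.load_builtin('ml-100k')
    data.split(n_folds=3)

    best_rmse, best_params = float('inf'), None
    for values in product(*param_grid.values()):
        params = dict(zip(param_grid, values))
        perf = evaluate(SVD(**params), data, measures=['RMSE'])
        rmse = np.mean(perf['RMSE'])
        if rmse < best_rmse:
            best_rmse, best_params = rmse, params

    print(best_rmse, best_params)

The requested GridSearchCV-like helper would wrap exactly this kind of loop and expose the results through attributes such as best_params and cv_results.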

@mahermalaeb (Contributor, Author)

@NicolasHug,
I want to add documentation for the new feature. Before opening the pull request I want to test it on Read the Docs to see if it looks good. I read a little, and it seems that I have to create an account and publish the docs from my forked repository to do so.

Can you confirm that this is the way to go, or is there an alternative I am not aware of?

Thanks,
Maher

@NicolasHug (Owner)

Hey Maher,

No, you don't need to use RTD; you can simply build the docs locally. If you haven't already, install all the packages needed for development with

pip install -r requirements_dev.txt

This will install Sphinx (the documentation generator), the RTD theme, and some Sphinx extensions needed in the doc-building process.

Then you should be able to build the docs locally by running

cd doc
make html

You can check the results in doc/build/html

I'll add all this in the CONTRIBUTING.md file!

@mahermalaeb (Contributor, Author)

Hello @NicolasHug,

I have finished the implementation and basic documentation of the feature, and pushed the changes to my repo here. When you have time, please take a look at the docs to confirm that the functionality is what you expected.

In addition, I want to write a user-guide page, but I'd like your opinion on where to add it.
Should it go under the "Advanced Usage" section, should I create a separate section, or should I add a Jupyter notebook?

Happy new year btw :)

Maher

@NicolasHug (Owner)

This is awesome!

For the doc, you can put it as the first subsection of "Advanced Usage", right before "Manually iterate over folds". As this is a major feature, you can also add a quick example in the README under the "example" section (if you don't, no problem, I'll do it ^^).

I haven't thoroughly tested the code yet (I'll come back to you, probably on Monday), so these are just a few remarks after taking a quick glance at it:

  • In the test file, could we set the number of epochs and number of factors to lower values so that the tests run faster?
  • We can make a direct reference to the GridSearchCV of scikit-learn in the doc, so that people know what this is about.
  • Try to be as pythonic as possible. For example, using enumerate in a for loop would be better than counting the number of iterations (combination_counter += 1). These two talks (talk1, talk2) by Raymond Hettinger have been of great help to me.
  • I'd rather not use u'strings' here, is it possible to avoid them?
  • Could cv_results_ be a defaultdict(list)? I think this would avoid the if key in self.cv_results_.keys(): ... checks (see the sketch after this list).
  • I've never used attributes that end (or start) with underscores, so maybe rename best_params_ and the like to best_params.
  • Is this true: "Depending on the nature of the data parameter, it may or may not cross validation."?
  • As best_params_ and co are defaultdict, I don't think the tests in test_measure_is_not_case_sensitive do what you think they do (no exception will ever be raised).
  • Maybe add a little bit of (meaningful) comments to the code?
  • Don't forget to be PEP 8 compliant! I see many lines exceeding 80 characters; flake8 is your friend.
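
To make the enumerate and defaultdict points concrete, here is a tiny, self-contained sketch of both idioms; the names are made up for illustration and are not taken from the actual GridSearch code:

    # Illustrative only: the enumerate and defaultdict(list) idioms suggested
    # above, with made-up names (not the actual GridSearch implementation).
    from collections import defaultdict
    from itertools import product

    param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005]}
    combinations = [dict(zip(param_grid, values))
                    for values in product(*param_grid.values())]

    cv_results = defaultdict(list)
    # enumerate replaces a manual combination_counter += 1
    for counter, params in enumerate(combinations, start=1):
        print('Evaluating combination {}/{}'.format(counter, len(combinations)))
        # defaultdict(list) removes the need for "if key in cv_results" checks
        for name, value in params.items():
            cv_results[name].append(value)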

Also, please create a pull request so that we can review the changes there. Your upcoming commits will be automatically added to the pull request discussion.

Thank you so much for putting time into it, this will be a very attractive feature!

Happy new year :)!

Nicolas

@NicolasHug (Owner)

Hey, coming back to you on this :)

It all seems to work very well; I only have a few more remarks:

  • maybe the verbose message could be changed to something like "Computing MAE and RMSE for parameters combination 1/8"?
  • Could we use an OrderedDict for the parameters? Or something that would keep the parameters in the same order as they were given?
  • Could we use the evaluate() function of the evaluate module instead of computing the accuracies manually? To avoid a name clash you would need something like performances = globals()['evaluate'](algo_instance, data) (see the sketch after this list).
  • I think something like measures = [measure.upper() for measure in measures] at the beginning of the function would save a lot of other measure.upper() calls later on.
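
A toy illustration of the globals() lookup mentioned above (the class and function below are stand-ins, not the actual Surprise code):

    # Toy demonstration of calling a module-level function whose name clashes
    # with a method name; stand-in code, not the actual Surprise implementation.
    def evaluate(algo_instance, data):
        """Stand-in for the module-level evaluate() function."""
        return {'RMSE': 0.0}

    class GridSearch:
        def __init__(self, algo_instance):
            self.algo_instance = algo_instance

        def evaluate(self, data):
            # The method shares its name with the module-level function,
            # so looking it up through globals() makes the intent explicit.
            performances = globals()['evaluate'](self.algo_instance, data)
            return performances

    print(GridSearch(algo_instance=None).evaluate(data=None))  # {'RMSE': 0.0}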

And I think this is pretty much it :)

Nicolas

@mahermalaeb (Contributor, Author)

Hello Nicolas,

I have created pull request #7. I have also added a user guide at the beginning of the "Advanced Usage" section.

Please find my comments on each of the points above:

  • In the test file, could we set the number of epochs and number of factors to lower values so that the tests run faster? --> I have reduced the number of iterations in most functions. They could be reduced further, but that requires changing the assertions.
  • We can make a direct reference to the GridSearchCV of scikit-learn in the doc, so that people know what this is about. --> Done
  • Try to be as pythonic as possible. For example, using enumerate in a for loop would be better than counting the number of iterations (combination_counter += 1). These two talks (talk1, talk2) by Raymond Hettinger have been of great help to me. --> Done, thanks to the talks :)
  • I'd rather not use u'strings' here, is it possible to avoid them? --> I didn't get what you meant by this point. Can you please clarify it for me?
  • Could cv_results_ be a defaultdict(list)? I think this would avoid the if key in self.cv_results_.keys(): ... checks. --> Done, in a Pythonic way. Awesome talks again.
  • I've never used attributes that end (or start) with underscores, so maybe rename best_params_ and the like to best_params. --> I don't know why either; I kept the GridSearchCV convention. Let me know if you confirm renaming them.
  • Is this true: "Depending on the nature of the data parameter, it may or may not cross validation."? --> Since I am using the evaluate method you implemented and the data is used as an input, whether it is cross-validated depends on whether the data was split into folds. Am I missing something?
  • As best_params_ and co are defaultdicts, I don't think the tests in test_measure_is_not_case_sensitive do what you think they do (no exception will ever be raised). --> Corrected
  • Maybe add a little bit of (meaningful) comments to the code? --> Done. Let me know if it's enough.
  • Don't forget to be PEP 8 compliant! I see many lines exceeding 80 characters; flake8 is your friend. --> All lines should be less than 80 chars now.
  • Maybe the verbose message could be changed to something like "Computing MAE and RMSE for parameters combination 1/8"? --> I changed the verbose message a little. However, I didn't add the accuracy names such as MAE and RMSE since I am using the evaluate method, which adds them by default. Note that the docs of evaluate say that with verbosity 0 it shouldn't print anything, but it actually prints the accuracy measures.
  • Could we use an OrderedDict for the parameters? Or something that would keep the parameters in the same order as they were given? --> The moment a dict is defined, its key order is lost, so to my knowledge there is no way to preserve the order of the dict keys. The solution would maybe be to change the input of the function to an ordered list, but I think it is easier and more convenient to keep it a normal dict (see the sketch after this list).
  • Could we use the evaluate() function of the evaluate module instead of computing the accuracies manually? To avoid a name clash you would need something like performances = globals()['evaluate'](algo_instance, data) --> This is my bad; I should have used it from the start instead of doing it manually. Now that it's done, the code is much cleaner.
  • I think something like measures = [measure.upper() for measure in measures] at the beginning of the function would save a lot of other measure.upper() calls later on. --> Done
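
On the ordering point, here is a small sketch of the alternative under discussion: letting the caller pass an ordered structure (an OrderedDict built from a list of pairs) on Python versions before 3.7, where plain dict literals do not preserve key order. This is only an illustration of the option, not what was implemented:

    # Sketch only: preserving parameter order on Python < 3.7 by having the
    # caller supply an OrderedDict instead of a plain dict literal.
    from collections import OrderedDict

    param_grid = OrderedDict([('n_epochs', [5, 10]),
                              ('lr_all', [0.002, 0.005]),
                              ('reg_all', [0.4, 0.6])])

    print(list(param_grid))  # ['n_epochs', 'lr_all', 'reg_all'] -- order kept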

Notes:

  • My background is in Java, where good practice is to write many small functions and unit tests; that is not as common in Python. Any enhancements you add to the code will be of great benefit to me.
  • I have created a duplicate case-insensitive dict because the one you already had overrides the __str__ method. This is not good for best_params_ and co, because we are then no longer able to print them. Nevertheless, I think it is better to use inheritance than to duplicate the code, but I didn't want to mess with the code you've written (see the sketch after this list).
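
For what it's worth, here is a toy sketch of the inheritance route mentioned in the last note: a case-insensitive defaultdict that keeps the standard dict printing, so best_params_ and co stay printable. This is only an illustration, not the actual Surprise class:

    # Toy sketch: a case-insensitive defaultdict that does not override
    # __str__, so instances print like ordinary dicts. Illustrative only.
    from collections import defaultdict

    class CaseInsensitiveDefaultDict(defaultdict):
        """defaultdict whose string keys are compared case-insensitively."""

        def __setitem__(self, key, value):
            super(CaseInsensitiveDefaultDict, self).__setitem__(key.lower(), value)

        def __getitem__(self, key):
            return super(CaseInsensitiveDefaultDict, self).__getitem__(key.lower())

    scores = CaseInsensitiveDefaultDict(list)
    scores['RMSE'].append(0.95)
    print(scores['rmse'])  # [0.95]
    print(scores)          # defaultdict(<class 'list'>, {'rmse': [0.95]})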

Looking forward to the merge ;)

Maher

@NicolasHug (Owner)

Awesome!

I put my remarks on the pull request.

@NicolasHug (Owner)

Closing this as it has been solved by 714be0b

@gopal-gcek

param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}

# Cross-validation using grid search
grid_search = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3, n_jobs=1)

# Find the best parameters on the data set
data = Dataset.load_builtin('ml-100k')
After running the above code I got the error below. Can anyone help? I have also checked that there is a measures keyword in Surprise's grid search.

TypeError Traceback (most recent call last)
in ()
6 'reg_all': [0.4, 0.6]}
7 # Cross-validation using grid search
----> 8 grid_search = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3, n_jobs=1)
9 # Find the best parameters on the data set
10 data = Dataset.load_builtin('ml-100k')

TypeError: __init__() got an unexpected keyword argument 'measures'
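
For anyone hitting the same error: this TypeError usually means the GridSearchCV being instantiated is not Surprise's (for example, sklearn's GridSearchCV was imported instead, or an older Surprise version without the surprise.model_selection module is installed). A sketch of the documented usage, assuming Surprise 1.0.5 or later:

    # Assumes Surprise >= 1.0.5, where GridSearchCV lives in surprise.model_selection.
    from surprise import SVD, Dataset
    from surprise.model_selection import GridSearchCV

    param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
                  'reg_all': [0.4, 0.6]}

    data = Dataset.load_builtin('ml-100k')

    grid_search = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'],
                               cv=3, n_jobs=1)
    grid_search.fit(data)

    print(grid_search.best_score['rmse'])
    print(grid_search.best_params['rmse'])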
