Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AB-98: Add a feature to select SVM parameters #257

Open
wants to merge 8 commits into
base: master
from

Conversation

Projects
None yet
3 participants
@rsh7
Copy link
Contributor

rsh7 commented Feb 16, 2018

This is a...

  • Bug Fix
  • Feature addition
  • Refactoring
  • Minor / simple change (like a typo)
  • Other

Present Case:

When a dataset has been created, the user can submit it for evaluation.
before
This screen has 2 preferences (filtering and normalization) for now. We don't have any options to select which SVM parameters the user want to use so that the search space can be reduced.

Addition

I have added more additional preferences of C , gamma and Preprocessing values as well so that the user can be more specific on his/her requirements of the SVM parameters. And if the user doesn't provide any input, then the default values will be taken.
Now, the evaluation screen on acousticbrainz server will look like:
after

The additional preferences taken by the user are getting saved into database and then retrieved while model training stage. I am writing these values into the custom project file which is being generated during the evaluation stage before running the tests.
Thus, this feature will help the users to easily select which parameters they want to use.

@paramsingh

This comment has been minimized.

Copy link
Member

paramsingh commented Feb 16, 2018

Hey @rsh7, in general whenever we do a schema change, we add a update script too. See examples here: https://github.com/metabrainz/acousticbrainz-server/blob/master/admin/updates/20160812-user-feedback.sql

@paramsingh

This comment has been minimized.

Copy link
Member

paramsingh commented Feb 16, 2018

Also, tests seem to be failing, please run test.sh and fix them.

@rsh7

This comment has been minimized.

Copy link
Contributor Author

rsh7 commented Feb 17, 2018

@paramsingh, Fixed the failing tests and added an update script as well.

@paramsingh
Copy link
Member

paramsingh left a comment

Nice work. Looks solid to me but needs some changes.

I'd like the entry of the parameters to be an optional feature hidded under an advanced shade, if possible.

updated TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW(),
result JSONB,
eval_location eval_location_type NOT NULL DEFAULT 'local',
c_value VARCHAR,

This comment has been minimized.

@paramsingh

paramsingh Feb 20, 2018

Member

Should we consider having default values for these new columns?

This comment has been minimized.

@alastair

alastair Feb 20, 2018

Contributor

we already have the options field which stores some user-provided settings for an evaluation. I think that we should store these items in there instead of having new parameters.

This comment has been minimized.

@rsh7

rsh7 Feb 21, 2018

Author Contributor

As we also have a parameter of evaluation_location here: https://github.com/metabrainz/acousticbrainz-server/blob/master/webserver/forms.py#L31, which is enabled to display on the acousticbrainz site when https://github.com/metabrainz/acousticbrainz-server/blob/master/default_config.py#L62 is turned True.
Similarly, the c, gamma and preprocessing parameters are added separately.
Should I change it? :)

This comment has been minimized.

@paramsingh

paramsingh Feb 22, 2018

Member

Storing in the options field would remove the need for a schema change. I'd prefer it but let us wait for @alastair to confirm.

This comment has been minimized.

@paramsingh

paramsingh Mar 27, 2018

Member

@rsh7 I think we should go with moving the values in the new columns to options.

This comment has been minimized.

@rsh7

rsh7 Mar 29, 2018

Author Contributor

Alright, working on it.

@@ -59,12 +59,17 @@ def evaluate_dataset(eval_job, dataset_dir, storage_dir):
groundtruth_path = os.path.join(eval_location, "groundtruth.yaml")
with open(groundtruth_path, "w") as f:
yaml.dump(create_groundtruth_dict(snapshot["data"]["name"], train), f)

"""

This comment has been minimized.

@paramsingh

paramsingh Feb 20, 2018

Member

In general, """ is used for docstrings and # for comments that explain what the code is doing. You should change this comment to # Passing more user preferences to train the model.

logging.info("Training model...")
results = gaia_wrapper.train_model(
project_dir=eval_location,
groundtruth_file=groundtruth_path,
filelist_file=filelist_path,
c_value = eval_job["c_value"],

This comment has been minimized.

@paramsingh

paramsingh Feb 20, 2018

Member

just a small nitpick, but no space before and after operators in function calls. see the line above this.

@@ -44,9 +46,29 @@ def train_model(project_dir, groundtruth_file, filelist_file):
datasets_dir=os.path.join(project_dir, "datasets"),
results_dir=os.path.join(project_dir, "results"),
)
"""

This comment has been minimized.

@paramsingh

paramsingh Feb 20, 2018

Member

should be a # comment

This comment has been minimized.

@alastair

alastair Feb 20, 2018

Contributor

In addition to @paramsingh's note about the syntax of the comment, consider the purpose of adding them. In this case I can see that you're converting a string to a list, because I can read the code. A good rule of thumb is to say why you are doing something, not what you are doing. In this case, I'd consider saying

# skip invalid non-numerical c & gamma values

(with the updated code to actually skip these values)

c_value = [int(value) for value in c_value.split(',')]
gamma_value = [int(s) for s in gamma_value.split(',')]
preprocessing_values = [str(value) for value in preprocessing_values.split()]
"""

This comment has been minimized.

@paramsingh

paramsingh Feb 20, 2018

Member

should be a # comment

@@ -273,7 +283,9 @@ def add_dataset_eval_set(connection, data):
return snapshot_id


def _create_job(connection, dataset_id, normalize, eval_location, filter_type=None):
def _create_job(connection, dataset_id, normalize, eval_location, c_value, gamma_value,

This comment has been minimized.

@paramsingh

paramsingh Feb 20, 2018

Member

maybe we should consider using default values for the arguments c_value, gamma_value and preprocessing_values, that would mean that we don't have to change all the tests.

@@ -283,8 +295,12 @@ def _create_job(connection, dataset_id, normalize, eval_location, filter_type=No
raise ValueError("Incorrect 'eval_location'. Must be one of %s" % VALID_EVAL_LOCATION)
snapshot_id = db.dataset.create_snapshot(dataset_id)
query = sqlalchemy.text("""
INSERT INTO dataset_eval_jobs (id, snapshot_id, status, options, eval_location)
VALUES (uuid_generate_v4(), :snapshot_id, :status, :options, :eval_location)
INSERT INTO dataset_eval_jobs (id, snapshot_id, status, options, eval_location,

This comment has been minimized.

@paramsingh

paramsingh Feb 20, 2018

Member

This query could be formatted better: https://github.com/metabrainz/guidelines/blob/master/Python.md#sql

I don't think we mind lines of ~130 characters so this could work like

INSERT INTO dataset_eval_jobs (id, snapshot_id, status, options, eval_location, _value, gamma_value, preprocessing_values)
     VALUES (uuid_generate_v4(), :snapshot_id, :status, :options, :eval_location, :c_value, :gamma_value, :preprocessing_values)
  RETURNING id

Example: https://github.com/metabrainz/critiquebrainz/blob/master/critiquebrainz/db/review.py#L295

@@ -14,6 +15,16 @@
DATASET_DONE = "done"
DATASET_ALL = "all"

DATASET_C_VALUE = ' -5, -3, -1, 1, 3, 5, 7, 9, 11 '

This comment has been minimized.

@paramsingh

paramsingh Feb 20, 2018

Member

why the trailing and leading whitespace?

DATASET_C_VALUE = ' -5, -3, -1, 1, 3, 5, 7, 9, 11 '
DATASET_GAMMA_VALUE = '3, 1, -1, -3, -5, -7, -9, -11'
PREPROCESSING_VALUES = [ 'basic', 'lowlevel', 'nobands', 'normalized', 'gaussianized' ]
PREPROCESSING_VALUES = [(value, value) for value in PREPROCESSING_VALUES]

This comment has been minimized.

@paramsingh

paramsingh Feb 20, 2018

Member

This could be done in one line with a comment to clarify why we need a list of pairs here.

PREPROCESSING_VALUES = [('basic', 'basic'), ...]

@@ -34,4 +45,15 @@ class DatasetEvaluationForm(Form):
default = DATASET_EVAL_LOCAL
)
normalize = BooleanField("Normalize classes")

"""

This comment has been minimized.

@paramsingh

paramsingh Feb 20, 2018

Member

should be a # comment

"""
Converting the preferences from string type to list
"""
c_value = [int(value) for value in c_value.split(',')]

This comment has been minimized.

@alastair

alastair Feb 20, 2018

Contributor

We need error checking here - if a person enters a string then the int() cast will fail. I think it's OK to silently skip non-numerical values here

with open(project_file, "r") as pfile:
project = yaml.load(pfile)
for pref in project['classifiers']['svm']:
pref['gamma'], pref['C'], pref['preprocessing'] = gamma_value, c_value, \

This comment has been minimized.

@alastair

alastair Feb 20, 2018

Contributor

I don't think there's a reason that this has to be a multi-assignment statement. Three separate statements is a bit easier to follow

gamma_value = [int(s) for s in gamma_value.split(',')]
preprocessing_values = [str(value) for value in preprocessing_values.split()]
"""
Updating C, Gamma, and preprocessing values in the project file which were

This comment has been minimized.

@alastair

alastair Feb 20, 2018

Contributor

A very small thing here - the comment is good, but we typically write comments in the imperative mood (git commit messages too). In this case, I would say

Update the project file with user-provided C, gamma, and preprocessing values

(note the use of 'update', not 'updating')

@rsh7

This comment has been minimized.

Copy link
Contributor Author

rsh7 commented Feb 21, 2018

I have added a feature flag like

FEATURE_EVAL_LOCATION = False
to only show these preferences option on the acousticbrainz site when it is enabled.
So, if the value is set to False, it will display:

new

def update_parameters(project_file, c_value, gamma_value, preprocessing_values):
"""Update the project file.
An optional, more detailed description of this function if needed.

This comment has been minimized.

@paramsingh

paramsingh Feb 21, 2018

Member

Should replace this line with a description :)

@rsh7 rsh7 changed the title AB-98: Added a feature to select SVM parameters AB-98: Add a feature to select SVM parameters Mar 4, 2018

@rsh7 rsh7 force-pushed the rsh7:preferences branch from 017aa3a to 355a19b Apr 2, 2018

@paramsingh paramsingh self-assigned this May 4, 2018

@paramsingh
Copy link
Member

paramsingh left a comment

from my testing looks good, some minor things to fix though.

@@ -60,11 +60,15 @@ def evaluate_dataset(eval_job, dataset_dir, storage_dir):
with open(groundtruth_path, "w") as f:
yaml.dump(create_groundtruth_dict(snapshot["data"]["name"], train), f)

# Passing more user preferences to train the model.

This comment has been minimized.

@paramsingh

paramsingh May 4, 2018

Member

I don't think this comment is needed here really, but if it is, Pass user preferences to train the model would read better. Note alastair's comment about tense in comments.

"""Update the project file with user-provided preferences
Args:
project_file: The file to be updated.

This comment has been minimized.

@paramsingh

paramsingh May 4, 2018

Member

Please mention the type of these args in the docstrings too.

@@ -24,8 +24,13 @@
# Filter types are defined in `eval_filter_type` type. See schema definition.
FILTER_ARTIST = "artist"

# Default values for parameters
preprocessing = ['basic', 'lowlevel', 'nobands', 'normalized', 'gaussianized']

This comment has been minimized.

@paramsingh

paramsingh May 4, 2018

Member

these default values are repeated three times in this PR, once here, once in forms.py and once in views/datasets.py. In general, it is a good idea to avoid duplication of code and constants as much as possible. Please define these in dataset_eval/__init__.py and import them, reducing duplication.

https://en.wikipedia.org/wiki/Don%27t_repeat_yourself

db.dataset_eval.evaluate_dataset(
dataset_id=ds["id"],
normalize=form.normalize.data,
eval_location=form.evaluation_location.data,
c_value=c_value,
gamma_value=gamma_value,
preprocessing_values=form.preprocessing_values.data,
filter_type=form.filter_type.data,
)
flash.info("Dataset %s has been added into evaluation queue." % ds["id"])

This comment has been minimized.

@paramsingh

paramsingh May 4, 2018

Member

I think flashing a message here if the input values were bad would be a good idea too.

This comment has been minimized.

@rsh7

rsh7 May 10, 2018

Author Contributor

@paramsingh What if we consider default values in case where input values are bad? And flashing a message saying - "Bad input value. We are taking default values for now"

@@ -39,6 +39,22 @@ <h2 class="page-title">Evaluate dataset "{{ dataset['name'] }}"</h2>
</div>
{% endif %}
</fieldset>
<fieldset>
{% if config.get('FEATURE_MODEL_SELECTION') %}
<div class="form-group">

This comment has been minimized.

@paramsingh

paramsingh May 4, 2018

Member

This should be indented two spaces.

# if invalid non-numerical c and gamma
try:
c_value = [int(value) for value in form.c_value.data.split(',')]
except:

This comment has been minimized.

@paramsingh

paramsingh May 4, 2018

Member

Instead of catching all exceptions, mentioning which error you're looking for here (ValueError) would be a good idea.

@rsh7 rsh7 force-pushed the rsh7:preferences branch from 701a0bb to 925a5f9 Aug 19, 2018

@rsh7 rsh7 force-pushed the rsh7:preferences branch from 925a5f9 to 0bff3fb Feb 15, 2019

@rsh7

This comment has been minimized.

Copy link
Contributor Author

rsh7 commented Feb 15, 2019

Rebased to master!

@rsh7 rsh7 force-pushed the rsh7:preferences branch from 0bff3fb to b779bb9 Mar 4, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.