Fix #7: implement main worker process, algorithm_registry and logging #15

prasanna08 · 2017-06-14T14:54:23Z

This PR fixes #7, which is a combination of milestone 1.3 and milestone 2.1 of my GSoC project.

This PR implements the following:

Main worker process, which is implemented using polling, training and storing functions.
algorithm_registry, which is a mapping from algorithm_id <=> classifier class instance.
Logging of important events, using the logging module.
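
To make the algorithm_registry item concrete, here is a minimal sketch of a registry that maps an algorithm_id to a classifier class and hands back a fresh instance. It is not the code from this PR: the class name `Registry` and the methods `register` and `get_classifier_by_algorithm_id` are assumptions chosen for illustration.

```python
class Registry(object):
    """Hypothetical registry mapping algorithm_id -> classifier class."""

    # Shared mapping; keys are algorithm IDs (str), values are classes.
    _classifier_classes = {}

    @classmethod
    def register(cls, algorithm_id, classifier_class):
        """Registers a classifier class under the given algorithm ID."""
        if algorithm_id in cls._classifier_classes:
            raise ValueError('Duplicate algorithm_id: %s' % algorithm_id)
        cls._classifier_classes[algorithm_id] = classifier_class

    @classmethod
    def get_classifier_by_algorithm_id(cls, algorithm_id):
        """Returns a new instance of the classifier for the given ID."""
        if algorithm_id not in cls._classifier_classes:
            raise KeyError('Unknown algorithm_id: %s' % algorithm_id)
        return cls._classifier_classes[algorithm_id]()
```

Returning a new instance on every lookup keeps training state from leaking between jobs; whether the real registry does this is not stated in the thread.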

anmolshkl · 2017-06-16T07:16:48Z

core/classifiers/BaseClassifier/BaseClassifier.py

+
+"""Base class for classification algorithms"""
+
+import abc


@seanlip can we use this module now? I remember that we decided not to use this in Oppia because it would deviate from the existing approach.

seanlip · 2017-06-16T07:46:38Z

I'm not too bothered either way, I think; your call!
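
For readers unfamiliar with the module under discussion, the following is a small sketch of what `abc` buys a base classifier: a subclass that forgets to implement `train` cannot be instantiated. It assumes Python 3 (`abc.ABC`); the subclass `MajorityLabelClassifier` and the `'label'` key are hypothetical and do not come from this PR.

```python
import abc
import collections


class BaseClassifier(abc.ABC):
    """Sketch of an abstract base class for classifiers."""

    @abc.abstractmethod
    def train(self, training_data):
        """Trains the classifier.

        Args:
            training_data: list(dict). The examples to train on.
        """
        raise NotImplementedError


class MajorityLabelClassifier(BaseClassifier):
    """Hypothetical classifier that remembers the most common label."""

    def __init__(self):
        self.most_common_label = None

    def train(self, training_data):
        if not training_data:
            raise ValueError('training_data must not be empty.')
        # The 'label' key is an assumption made for this example.
        labels = [example['label'] for example in training_data]
        self.most_common_label = (
            collections.Counter(labels).most_common(1)[0][0])
```

Calling `BaseClassifier()` directly raises `TypeError`, which is the enforcement that a plain base class with `NotImplementedError` stubs would only provide at call time.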


anmolshkl

@prasanna08 I have taken a first pass. PTAL at the comments. Thanks!

anmolshkl · 2017-06-16T07:18:43Z

core/classifiers/BaseClassifier/BaseClassifier.py

+
+    Below are some concepts used in this class.
+    training_data: list(dict). The training data that is used for training
+        the classifier. This field is populated lazily when the job request


@prasanna08 I don't think "This field is populated lazily when the job request ..." is relevant here. It is only true in the case of training jobs.

Yup. Actually, I didn't verify the docstrings in much detail before opening the PR. I copied the relevant code and took a brief look at the docstrings for consistency. I guess I'll have to go through them once again.

anmolshkl · 2017-06-16T07:41:07Z

core/classifiers/BaseClassifier/BaseClassifier.py

+
+    @abc.abstractmethod
+    def train(self, training_data):
+        """Loads examples for training.


I don't think that is the right description for this method.

anmolshkl · 2017-06-16T07:42:17Z

core/classifiers/BaseClassifier/BaseClassifier.py

+        """Loads examples for training.
+
+        Args:
+            training_data: list(dict). The training data that is used for


This description is correct on the Oppia side, but on the VM side this field will always be populated. In fact, the VM doesn't need to know anything about lazy population.

anmolshkl · 2017-06-16T09:14:43Z

core/services/job_services.py

+
+# pylint: disable=too-many-branches
+def _validate_job_data(job_data):
+    if not isinstance(job_data):


instance of what?

Oh, my bad.
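
The point of the comment above is that `isinstance` takes two arguments, the object and the expected type, so the quoted call raises `TypeError` as written. A minimal sketch of the corrected check follows. The expected type (`dict`) and the key names are assumptions for illustration; the thread does not show the final version of `_validate_job_data`.

```python
# Hypothetical keys; the real job payload is not shown in this thread.
REQUIRED_JOB_KEYS = ('job_id', 'algorithm_id', 'training_data')


def validate_job_data(job_data):
    """Raises an exception if job_data is not a well-formed job dict."""
    if not isinstance(job_data, dict):
        raise TypeError(
            'Expected job_data to be a dict, received %s'
            % type(job_data).__name__)

    for key in REQUIRED_JOB_KEYS:
        if key not in job_data:
            raise ValueError('job_data is missing the key %s' % key)

    # Training data is a list of dicts, one dict per example.
    if not isinstance(job_data['training_data'], list):
        raise TypeError('Expected training_data to be a list.')
```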

anmolshkl · 2017-06-16T09:41:15Z

main.py

+    try:
+        job_data = job_services.get_next_job()
+        if job_data is None:
+            logging.info('No pending job requests.')


you might want to add additional info like time, vm_id etc. I guess this can be done by configuring the logger to append these details.

Oh sorry, I saw the log config after writing this comment :D

Actually, I have kept the logging format the same as GAE's log format, which includes all the necessary details, I guess.
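
As an illustration of the reviewer's suggestion, the sketch below configures a logger so that every message carries the level, a timestamp, a VM identifier and the call site. The format string is an assumption and is not claimed to match GAE's format or the configuration in this PR; `build_logger` and its parameters are hypothetical names.

```python
import logging

# Assumed format; %(vm_id)s is supplied by the LoggerAdapter below.
LOG_FORMAT = (
    '%(levelname)s %(asctime)s %(vm_id)s '
    '%(filename)s:%(lineno)d] %(message)s')


def build_logger(vm_id, name='vm_worker', stream=None):
    """Returns a logger that prefixes each message with the details above.

    Args:
        vm_id: str. Identifier of this VM (hypothetical field).
        name: str. Name of the underlying logger.
        stream: file-like or None. Output stream; None means sys.stderr.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:
        handler = logging.StreamHandler(stream)
        handler.setFormatter(logging.Formatter(LOG_FORMAT))
        logger.addHandler(handler)
    # The adapter injects vm_id into every record as an extra attribute.
    return logging.LoggerAdapter(logger, {'vm_id': vm_id})
```

With this in place, a call such as `build_logger('vm-1').info('No pending job requests.')` needs no per-call bookkeeping for time or VM details.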

anmolshkl · 2017-06-16T09:47:16Z

vmconf.py

+FIXED_TIME_WAITING = 'fixed_time_wait'
+
+# Seconds to wait in case of fixed time waiting approach.
+FIXED_TIME_WAITING_SECS = 60


a better name perhaps? (something like FIXED_TIME_WAITING_PERIOD?)

Sounds good.

Actually, the idea is to use an exponential backoff algorithm for waiting in PROD and fixed time waiting in DEV. On local machines there will be at most a few jobs, which can be processed quickly, and we don't want the VM to sleep for a long duration when there are no pending jobs; that's why we use fixed time waiting in DEV. That's not the case in PROD. There will be many jobs there, and we also have to be wary of the resources the VM is using, so we can use exponential backoff; fixed time waiting would waste resources. However, exponential backoff is still a "future idea" which will be implemented later on.
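
The two waiting strategies described above can be sketched as a single function that returns how long to sleep before the next poll. Only `FIXED_TIME_WAITING` and the 60 second period come from this PR (the period under its renamed constant, `FIXED_TIME_WAITING_PERIOD`); `EXPONENTIAL_BACKOFF`, `MAX_WAITING_PERIOD` and `get_waiting_period` are assumptions, since exponential backoff is explicitly left as future work.

```python
FIXED_TIME_WAITING = 'fixed_time_wait'
# Hypothetical name for the not-yet-implemented PROD strategy.
EXPONENTIAL_BACKOFF = 'exponential_backoff'

# Seconds to wait in case of fixed time waiting approach.
FIXED_TIME_WAITING_PERIOD = 60
# Hypothetical cap so the VM never sleeps longer than one hour.
MAX_WAITING_PERIOD = 3600


def get_waiting_period(method, consecutive_empty_polls):
    """Returns the number of seconds to wait before polling again.

    Args:
        method: str. One of the waiting method constants above.
        consecutive_empty_polls: int. Number of polls in a row that
            returned no pending job.
    """
    if method == FIXED_TIME_WAITING:
        return FIXED_TIME_WAITING_PERIOD
    if method == EXPONENTIAL_BACKOFF:
        # Double the wait after every empty poll, up to the cap.
        return min(
            FIXED_TIME_WAITING_PERIOD * 2 ** consecutive_empty_polls,
            MAX_WAITING_PERIOD)
    raise ValueError('Unknown waiting method: %s' % method)
```

Under these assumptions a worker would reset `consecutive_empty_polls` to zero whenever a job is found, so a busy queue is polled at the base period and an idle one is polled progressively less often.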

anmolshkl

@prasanna08 done! Sorry for the delay, I have taken another pass.

anmolshkl · 2017-06-20T14:13:56Z

core/services/job_services.py

+"""This module contains functions used for polling, training and saving jobs."""
+
+from core.services import remote_access_services
+from core.classifiers import algorithm_registry


NIT: import order

anmolshkl · 2017-06-20T14:19:19Z

core/services/job_services.py

+
+    Args:
+        algorithm_id: str. ID of classifier algorithm.
+        training_data: dict. A dictionary containing training data.


Wouldn't training data be a list of dictionaries?

Yup. My bad.

anmolshkl · 2017-06-20T14:22:26Z

core/services/job_services.py

+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""This module contains functions used for polling, training and saving jobs."""


Don't we need a job_services_test.py?

I don't think so. These functions just use the remote_access_services functions, so as long as those are working, these should work fine too.

(I wasn't going to add this layer initially, but I added it later on because higher modularity is always good for future maintenance.)

anmolshkl · 2017-06-20T14:24:40Z

core/classifiers/algorithm_registry.py

+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Registry for classification algorithms/classifiers."""


Point for the future: I know there are no classifiers to test for now, but this should have unit tests.

Yep. I will add a TODO comment.

anmolshkl

@prasanna08 LGTM!

prasanna08 · 2017-06-22T04:03:51Z

@AllanYangZhou you might want to review this?

AllanYangZhou

Hi Prasanna,

I've read through it, but didn't have any comments to make--it looks good to me!

prasanna08 added 2 commits June 14, 2017 10:51

WIP: implementation of algorithm_registry and base classifier.

6bb33ea

Implement main worker process of VM.

7431ea6

prasanna08 requested review from anmolshkl and AllanYangZhou and removed request for anmolshkl and AllanYangZhou June 16, 2017 03:24

anmolshkl reviewed Jun 16, 2017

anmolshkl suggested changes Jun 16, 2017

prasanna08 added 3 commits June 16, 2017 16:00

Addressed review comments

fb420a6

change FIXED_TIME_WAITING_SECS to FIXED_TIME_WAITING_PERIOD in vmconf

6638eb3

Fix main.py code.

e427df5

anmolshkl suggested changes Jun 20, 2017

Addressed review comments

bf5914d

anmolshkl approved these changes Jun 21, 2017

AllanYangZhou approved these changes Jun 22, 2017

prasanna08 merged commit 543df4e into oppia:develop Jun 23, 2017

prasanna08 deleted the main-process branch June 23, 2017 05:15
