get_model takes too long to initialise so the endpoint times out #78

In the first `get_model()` call, the weights need to be loaded and the model initialised, which takes 10 minutes or so. The problem is that the SageMaker endpoint times out after 60 seconds, so the model never has time to initialise. Would you have any suggestions on ways to address this?
Thanks

Comments
Hey @m4nuC, great question! Please find below an example solution.
Essentially, it spawns a thread to load a long-loading model. @ilazakis Do you think we can make the ModelLoader a Sagify utility? Pls @m4nuC let me know if it solves your issue. Thanks
Adding some model-loading specific utilities sounds sensible @pm3310, yes. Loading the model on a different thread will keep the sync call to the API from blocking, but if the model still needs 10 minutes to load, the end user will still not get anything back; the client, or a proxy or similar in between, will time out anyway. We could return a specific "loading model, please try again in X minutes" response to mitigate the bad experience. One thing we could do to solve the actual problem is tie the loading of the model to the deploy command. Training and deploying take time anyway, so if we add it right after the deploy command, the model will already be loaded for whoever calls the predict endpoint first. Open to any suggestions.
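For illustration, a minimal sketch of that "please try again" response, assuming a Flask-style serving container (SageMaker custom containers commonly expose POST /invocations); the handler and loader names here are hypothetical, not part of Sagify:

```python
# Minimal sketch of the "loading model, please try again" mitigation.
# All names are illustrative; the slow load is simulated with a sleep.
import time
import threading

from flask import Flask, jsonify

app = Flask(__name__)
_model = None


def _load_model():
    """Stand-in for a slow model load running on a background thread."""
    global _model
    time.sleep(600)    # pretend the real load takes ~10 minutes
    _model = object()  # replace with the real model object


threading.Thread(target=_load_model, daemon=True).start()


@app.route('/invocations', methods=['POST'])
def invocations():
    if _model is None:
        # Model still warming up: answer quickly instead of letting the
        # client or an intermediate proxy time out.
        resp = jsonify({'message': 'loading model, please try again in a few minutes'})
        resp.status_code = 503               # Service Unavailable
        resp.headers['Retry-After'] = '600'  # hint: retry in ~10 minutes
        return resp
    return jsonify({'prediction': 'ok'})     # real prediction goes here
```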
Thanks for the input guys, I've ended up using threading as well, but with a less refined solution than @pm3310 suggests. I will try this out and confirm.
@ilazakis I like the idea of returning a specific "loading model, please try again in X minutes" response to mitigate the bad experience.
Works great indeed. Thanks @pm3310

```python
from __future__ import absolute_import
import os
# Do not remove the following line
import sys; sys.path.append("..")  # NOQA
import logging
import concurrent.futures

_MODEL_PATH = os.path.join('/opt/ml/', 'model')  # Path where all your model(s) live

log = logging.getLogger(__name__)


class ModelNotYetLoadedException(Exception):
    def __init__(self, message):
        super().__init__(message)


class ModelLoader:
    def __init__(self, load_method):
        log.info("setting up ModelLoader")
        # Invoke the load method asynchronously so the process does not block.
        self.model = None
        self.executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
        self.future = self.executor.submit(load_method)
        self.future.add_done_callback(self.done)

    def get_model(self):
        if self.future.done():
            return self.model
        log.error("get_model called before ready")
        raise ModelNotYetLoadedException("model not loaded")

    def done(self, future):
        """Callback invoked when the model load completes. Sets us to ready status."""
        log.info("model load done")
        self.model = future.result()
        log.info("shutting down executor")
        self.executor.shutdown(wait=False)

    def get_ready(self):
        return self.future.done()


class ModelService(object):
    model = None

    @staticmethod
    def load_model():
        # Load your model, the one that takes time to load, here.
        # (On scikit-learn >= 0.23, use `import joblib` instead.)
        from sklearn.externals import joblib
        return joblib.load(os.path.join(_MODEL_PATH, 'model.pkl'))

    @classmethod
    def init_model(cls):
        """Get the model object for this instance, loading it if it's not already loaded."""
        if cls.model is None:
            cls.model = ModelLoader(cls.load_model)
        return cls.model

    @classmethod
    def predict(cls, input):
        """For the input, do the predictions and return them.

        Raises ModelNotYetLoadedException if the background load has not
        finished; the caller decides how to report that.
        """
        return cls.model.get_model().predict(input)


ModelService.init_model()


def predict(json_input):
    """
    Prediction given the request input
    :param json_input: [dict], request input
    :return: [dict], prediction
    """
    # TODO Transform json_input and assign the transformed value to model_input
    try:
        model_input = json_input['features']
        prediction = ModelService.predict(model_input)
        return {'prediction': prediction.item()}
    except ModelNotYetLoadedException:
        # The background load has not finished yet.
        return {'error': 'model not yet loaded, please try again shortly'}
    except Exception as e:
        return {"error": str(e)}
```
@m4nuC Perfect solution ;-)