Recursively large parameters #254

Closed
vaaaaanquish opened this issue Oct 17, 2021 · 6 comments

@vaaaaanquish
Contributor

The parameters are serialized recursively.

https://github.com/m3dev/gokart/blob/master/gokart/task.py#L285-L303

self.to_str_params(only_significant=True) appends the JSON-serialized form of each parameter.
Because each upstream task's parameters are JSON-serialized again at every level, dependencies ends up containing the following.

dependencies.append(self.to_str_params(only_significant=True))

\"params\": {\"target\": \"{\\\"type\\\": \\\"task.Aggregation\\\", \\\"params\\\": {\\\"train\\\": \\\"{\\\\\\\"type\\\\\\\": \\\\\\\"task.Sample\\\\\\\", \\\\\\\"params\\\\\\\": {\\\\\\\"target\\\\\\\": \\\\\\\"{\\\\\\\\\\\\\\\"type\\\\\\\\\\\\\\\": \\\\\\\\\\\\\\\"task.Query\\\\\\\\\\\\\\\", \\\\\\\\\\\\\\\"params\\\\\\\\\\\\\\\": {\\\\\\\\\\\\\\\"target\\\\\\\\\\\\\\\": \\\\\\\\\\\\\\\"{\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"type\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\": \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"task.Add\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\", \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"params\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\": {\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"target\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\": \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"{\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"type\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\": \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\"task.Drop\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\",

Gokart uses a lot of memory when the pipeline is long.
The job also starts very slowly.
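
The growth in the escaped output above can be reproduced without gokart. The following is a minimal sketch using only the standard json module; the helper names nest and max_backslash_run are made up for illustration and are not part of gokart or luigi. Each time an already-serialized string is embedded as a value and serialized again, every quote and backslash in it is escaped once more, so the longest run of backslashes at nesting depth d is 2^d - 1 (1, 3, 7, 15, 31 in the output above).

```python
import json
import re


def nest(depth):
    """Serialize a toy task chain the way the issue describes:
    the upstream task is embedded as an already-serialized JSON string."""
    s = json.dumps({"type": "Zero", "params": {}})
    for i in range(depth):
        # The upstream value is a string, so json.dumps escapes it again.
        s = json.dumps({"type": "Add", "params": {"x": s, "y": str(i)}})
    return s


def max_backslash_run(s):
    """Length of the longest run of consecutive backslashes in s."""
    return max((len(run) for run in re.findall(r"\\+", s)), default=0)


if __name__ == "__main__":
    for depth in range(0, 9, 2):
        s = nest(depth)
        print(depth, len(s), max_backslash_run(s))
```

The printed length roughly doubles with each extra level, which is consistent with the memory use and slow startup reported for long pipelines.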

@vaaaaanquish
Contributor Author

Sample that takes a long time to start.

import luigi

import gokart


class Zero(gokart.TaskOnKart):
    def run(self):
        self.dump(0)


class Add(gokart.TaskOnKart):
    x = gokart.TaskInstanceParameter()
    y = luigi.IntParameter()

    def run(self):
        self.dump(self.load() + self.y)


x = Zero()
for i in range(100):
    x = Add(x=x, y=i)

gokart.build(x)

@vaaaaanquish
Contributor Author

DictParameter recursively json serializes the parameters.
https://github.com/spotify/luigi/blob/master/luigi/parameter.py#L1003

The same goes for TaskInstanceParameter, which uses DictParameter internally.
https://github.com/m3dev/gokart/blob/master/gokart/parameter.py

These are what's causing this hell.

@vaaaaanquish
Contributor Author

There are two possible solutions.

  1. Override luigi.Task.to_str_params in gokart
  2. Give TaskInstanceParameter its own serialize implementation
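
As a rough illustration of what a custom serializer could aim for (this is an assumption about the general direction, not the change that was eventually merged): if nested tasks are kept as nested dicts and JSON-encoded only once at the top level, nothing is escaped a second time and the output grows linearly with the pipeline length. The names build_nested and serialize_once are hypothetical and do not exist in gokart or luigi.

```python
import json


def build_nested(depth):
    """Build a toy task chain as nested dicts (no pre-serialized strings)."""
    node = {"type": "Zero", "params": {}}
    for i in range(depth):
        node = {"type": "Add", "params": {"x": node, "y": str(i)}}
    return node


def serialize_once(node):
    """Encode the whole structure in a single json.dumps call."""
    return json.dumps(node)


if __name__ == "__main__":
    for depth in (1, 10, 50):
        s = serialize_once(build_nested(depth))
        # No backslashes appear because no string is encoded twice.
        print(depth, len(s), "\\" in s)
```

The trade-off is that the serialized form changes, so anything that depends on the old string format (for example hashes derived from it) would need to be checked.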

@vaaaaanquish
Contributor Author

TaskInstanceParameter.serialize is executed 25249 times in the above sample code.

The values passed to serialize are as follows:

{'type': 'Zero', 'params': {}}
{'type': 'Add', 'params': {'x': '{"type": "Zero", "params": {}}', 'y': '0'}}
{'type': 'Zero', 'params': {}}
{'type': 'Zero', 'params': {}}
{'type': 'Add', 'params': {'x': '{"type": "Zero", "params": {}}', 'y': '0'}}
{'type': 'Add', 'params': {'x': '{"type": "Add", "params": {"x": "{\\"type\\": \\"Zero\\", \\"params\\": {}}", "y": "0"}}', 'y': '1'}}
{'type': 'Add', 'params': {'x': '{"type": "Add", "params": {"x": "{\\"type\\": \\"Add\\", \\"params\\": {\\"x\\": \\"{\\\\\\"type\\\\\\": \\\\\\"Zero\\\\\\", \\\\\\"params\\\\\\": {}}\\", \\"y\\": \\"0\\"}}", "y": "1"}}', 'y': '2'}}
...

Imagine this being repeated 25249 times :)

@vaaaaanquish
Contributor Author

#257 will solve the problem of bloated memory.

@vaaaaanquish
Contributor Author

[future] Caching TaskInstanceParameter.serialize input can speed up the process.
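
A minimal sketch of the caching idea, under loud assumptions: ToyTask is a hypothetical stand-in for a task, the serializer below only imitates the nested-string behaviour described in this issue, and the call pattern (serializing every task in the chain once) is a simplification, so the counts will not match the 25249 measured with gokart. It only shows that memoizing on the input turns repeated work on the same upstream task into a cache hit.

```python
import json
from functools import lru_cache


class ToyTask:
    """Hypothetical stand-in for a task; not gokart's TaskOnKart."""

    def __init__(self, name, upstream=None, y=None):
        self.name = name
        self.upstream = upstream
        self.y = y


def make_serializer(use_cache):
    """Return (serialize, counter); counter["n"] counts real executions."""
    counter = {"n": 0}

    def serialize(task):
        counter["n"] += 1
        params = {}
        if task.upstream is not None:
            # The recursive call goes through the cached wrapper when enabled.
            params["x"] = serialize(task.upstream)
            params["y"] = str(task.y)
        return json.dumps({"type": task.name, "params": params})

    if use_cache:
        # ToyTask instances hash by identity, so each object is computed once.
        serialize = lru_cache(maxsize=None)(serialize)
    return serialize, counter


def count_calls(n, use_cache):
    """Serialize every task in a chain of n Add tasks on top of one Zero."""
    tasks = [ToyTask("Zero")]
    for i in range(n):
        tasks.append(ToyTask("Add", upstream=tasks[-1], y=i))
    serialize, counter = make_serializer(use_cache)
    for task in tasks:
        serialize(task)
    return counter["n"]


if __name__ == "__main__":
    print(count_calls(10, use_cache=False), count_calls(10, use_cache=True))
```

In this toy model the uncached count grows quadratically with the chain length while the cached count equals the number of tasks. Caching reduces the number of calls but not the size of each serialized string.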

Hi-king pushed a commit that referenced this issue Nov 5, 2021
* fix issue

* yapf

* isort

* try/except

* upper DictParameter

* recursive decompress

* staticmethod

* fix ut
@Hi-king Hi-king closed this as completed May 10, 2022