Investigate performance of very large deployment #7262

Closed
wouterdb opened this issue Feb 27, 2024 · 1 comment
Labels: performance (Address a performance concern), support (Customer support ticket)

Comments

@wouterdb
Contributor

https://code.inmanta.com/solutions/athonet/athonet_mpn/-/merge_requests/237

@sanderr added the support and performance labels on Feb 27, 2024
@arnaudsjs
Contributor

My observations:

  • The inmanta server reaches 100% CPU. Because the server runs on a single thread, this is the most CPU it can consume.
  • When the slow query log is enabled on PostgreSQL, it becomes clear that the queries on the resource_persistent_state table are particularly slow. This is strange because the query performs an update using the primary key fields of the table. Inspecting the lock state using these queries shows that the query is not waiting for a specific lock.
2024-02-23 14:27:56.536 UTC [210] LOG:  duration: 17526.568 ms  execute __asyncpg_stmt_481c0__: UPDATE public.resource_persistent_state SET last_deploy=$3,last_non_deploying_status=$4,last_success=$5,last_deployed_attribute_hash=$6,last_deployed_version=$7 WHERE environment=$1 and resource_id=$2
  • The setup was running with the PostgreSQL JIT enabled. Disabling the JIT didn't have any significant impact on scalability.
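
To rank statements from a slow query log like the line quoted above, a small parser can pull out the duration and the SQL text. This is a minimal sketch, assuming log lines in the `duration: <ms> ms  execute <name>: <statement>` shape shown above; `parse_slow_query` and `slowest` are hypothetical helper names, not part of inmanta or PostgreSQL.

```python
import re
from typing import Iterable, List, Optional, Tuple

# Matches the "duration: <ms> ms  execute <name>: <sql>" part of a log line
# (also accepts the "statement:" form used for simple queries).
SLOW_QUERY = re.compile(
    r"duration: (?P<ms>\d+(?:\.\d+)?) ms\s+(?:statement|execute [^:]+): (?P<sql>.*)"
)


def parse_slow_query(line: str) -> Optional[Tuple[float, str]]:
    """Return (duration in ms, SQL text), or None if the line is not a duration entry."""
    match = SLOW_QUERY.search(line)
    if match is None:
        return None
    return float(match.group("ms")), match.group("sql")


def slowest(lines: Iterable[str], top: int = 5) -> List[Tuple[float, str]]:
    """Return the `top` slowest statements, slowest first."""
    parsed = [entry for entry in map(parse_slow_query, lines) if entry is not None]
    return sorted(parsed, key=lambda entry: entry[0], reverse=True)[:top]
```

Feeding the whole log file through `slowest` gives a quick view of which statements dominate, before looking at locks or query plans.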

@wouterdb wouterdb mentioned this issue Feb 29, 2024
14 tasks
inmantaci pushed a commit that referenced this issue Mar 4, 2024
# For Reviewers

I changed a few things and would like reviewer input on what is acceptable:
1. More fine-grained triggering of agents on `put_partial`: only notify agents that are in the increment
2. `put_version` and `put_partial` no longer wait for auto deploy to complete **this may break wait conditions in tests everywhere**
3. The increment cache is pre-sorted per agent (slower on release, which is done once; faster for every agent, which is done often)
4. The increment cache now refuses to move back to older versions
5. Micro-optimizations to use the DB more efficiently
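
As an illustration of points 1 and 3, the increment can be grouped per agent once at release time, so that each agent pull becomes a dictionary lookup and only agents that appear in the grouping need a notification. This is a minimal sketch, not the code in this change; `group_increment_by_agent`, `agents_to_notify`, and the `agent_by_resource` mapping are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_increment_by_agent(
    increment: Iterable[str], agent_by_resource: Dict[str, str]
) -> Dict[str, List[str]]:
    """Group the resource ids in the increment by the agent that owns them.

    Done once per release, so every later agent pull is a cheap lookup.
    """
    per_agent: Dict[str, List[str]] = defaultdict(list)
    for resource_id in sorted(increment):
        per_agent[agent_by_resource[resource_id]].append(resource_id)
    return dict(per_agent)


def agents_to_notify(per_agent: Dict[str, List[str]]) -> List[str]:
    """Only agents with at least one resource in the increment are notified."""
    return sorted(per_agent)
```

With this shape, an agent that has nothing in the increment never shows up in the grouping and is therefore never triggered.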

**Do I need more tests anywhere?**

close #7262

# Self Check:

Strike through any lines that are not applicable (`~~line~~`) then check the box

- [x] Attached issue to pull request
- [x] Changelog entry
- [x] Type annotations are present
- [x] Code is clear and sufficiently documented
- [ ] No (preventable) type errors (check using make mypy or make mypy-diff)
```

src/inmanta/config.py:219: error: Argument 1 to "CronTab" has incompatible type "str | int"; expected "str"  [arg-type]
src/inmanta/config.py:311: error: Incompatible default for argument "validator" (default has type "Callable[[str], str]", argument has type "Callable[[str | T], T]")  [assignment]
src/inmanta/data/__init__.py:4982: error: Argument 1 to "loads" has incompatible type "object"; expected "str | bytes | bytearray"  [arg-type]
src/inmanta/data/__init__.py:5441: error: Signature of "get_list" incompatible with supertype "BaseDocument"  [override]
src/inmanta/data/__init__.py:5441: note:      Superclass:
src/inmanta/data/__init__.py:5441: note:          @classmethod
src/inmanta/data/__init__.py:5441: note:          def get_list(cls, *, order_by_column: str | None = ..., order: str | None = ..., limit: int | None = ..., offset: int | None = ..., no_obj: bool | None = ..., lock: RowLockMode | None = ..., connection: Connection | None = ..., **query: object) -> Coroutine[Any, Any, list[ConfigurationModel]]
src/inmanta/data/__init__.py:5441: note:      Subclass:
src/inmanta/data/__init__.py:5441: note:          @classmethod
src/inmanta/data/__init__.py:5441: note:          def get_list(cls, *, order_by_column: str | None = ..., order: str | None = ..., limit: int | None = ..., offset: int | None = ..., no_obj: bool | None = ..., lock: RowLockMode | None = ..., connection: Connection | None = ..., no_status: bool = ..., **query: object) -> Coroutine[Any, Any, list[ConfigurationModel]]
src/inmanta/data/__init__.py:5509: error: Argument 2 to "_get_status_field" of "ConfigurationModel" has incompatible type "object"; expected "str"  [arg-type]
src/inmanta/data/__init__.py:5513: error: Argument 1 to "append" of "list" has incompatible type "ConfigurationModel"; expected "dict[str, object]"  [arg-type]
src/inmanta/data/__init__.py:5514: error: Incompatible return value type (got "list[dict[str, object]]", expected "list[ConfigurationModel]")  [return-value]
src/inmanta/data/__init__.py:5796: error: "object" has no attribute "__iter__"; maybe "__dir__" or "__str__"? (not iterable)  [attr-defined]
src/inmanta/data/__init__.py:5886: error: "object" has no attribute "__iter__"; maybe "__dir__" or "__str__"? (not iterable)  [attr-defined]
src/inmanta/data/__init__.py:5962: error: Incompatible types in assignment (expression has type "object", variable has type "int")  [assignment]
src/inmanta/server/services/orchestrationservice.py:849: error: Argument 1 to "add_background_task" of "TaskHandler" has incompatible type "Coroutine[Any, Any, tuple[int, dict[str, Any] | None]]"; expected "Coroutine[object, None, Result | None]"  [arg-type]
src/inmanta/server/services/compilerservice.py:795: error: Incompatible types in assignment (expression has type "bool | int | float | str | dict[str, str | int | bool]", variable has type "int")  [assignment]
Generated HTML report (via XSLT): /home/wouter/projects/inmanta/mypy/index.html
```
- [x] Sufficient test cases (reproduces the bug/tests the requested feature)
- [x] Correct, in line with design
- [ ] End user documentation is included or an issue is created for end-user documentation (add ref to issue here: )
- [ ] If this PR fixes a race condition in the test suite, also push the fix to the relevant stable branch(es) (see [test-fixes](https://internal.inmanta.com/development/core/tasks/build-master.html#test-fixes) for more info)

# Preliminary results
(on 15k resources)

1. We cause a storm of agent pulls (each put_partial makes every agent pull, and we recompile fast)
   - [x] make it smarter
2. Increment calculation is on both the agent pull path and the release version path
   - very performance sensitive
   - [x] it pulled in all attributes, so a large config means a slow increment
   - [ ] it interferes with itself somehow or mucks up the cache (to be investigated)
       - [x] make cache invalidation monotonic (only ever allow newer versions)
   - [x] auto deploy triggering is done in-line: the compile has to wait for it; we could change that
3. pyinstrument works somewhat as an async profiler, but not to the point where the numbers add up
4. The slow query log, lock timing log, and all-query log on PostgreSQL help somewhat
5. We still have some slow queries, basically everything related to the whole version or the part that remains the same
 - ` UPDATE resource SET status='deployed' WHERE environment=$1 AND model=$2 AND resource_id =ANY($3) `
 -
 ```
 INSERT INTO resource(
	                environment,
	                model,
	                resource_id,
	                resource_type,
	                resource_id_value,
	                agent,
	                status,
	                attributes,
	                attribute_hash,
	                resource_set,
	                provides
	            )(
	                SELECT
	                    r.environment,
	                    $3,
	                    r.resource_id,
	                    r.resource_type,
	                    r.resource_id_value,
	                    r.agent,
	                    (
	                        CASE WHEN r.status='undefined'::resourcestate
	                        THEN 'undefined'::resourcestate
	                        ELSE 'available'::resourcestate
	                        END
	                    ) AS status,
	                    r.attributes AS attributes,
	                    r.attribute_hash,
	                    r.resource_set,
	                    r.provides
	                FROM resource AS r
	                WHERE r.environment=$1 AND r.model=$2 AND r.resource_set IS NOT NULL AND NOT r.resource_set=ANY($4)
	            )
	            RETURNING resource_id, resource_set
```
```
  SUM(CASE WHEN r.status NOT IN($1,$2) THEN 1 ELSE 0 END) AS done,
	                           to_json(array(SELECT jsonb_build_object('status', r2.status, 'id', r2.resource_id)
	                                         FROM resource AS r2
	                                         WHERE c.environment=r2.environment AND c.version=r2.model
	                                        )
	                           ) AS status
	                    FROM configurationmodel AS c LEFT OUTER JOIN resource AS r
	                    ON c.environment = r.environment AND c.version = r.model
	                    WHERE c.environment=$3
	                    GROUP BY c.environment, c.version

```
```
INSERT INTO public.resourceaction_resource (resource_id, resource_version, environment, resource_action_id) SELECT unnest($1::text[]), unnest($2::int[]), $3, $4
```
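
The monotonic cache invalidation mentioned in the list above (only ever allow newer versions) can be pictured as a cache that ignores any attempt to store an older version. This is a minimal sketch under that assumption; the class `MonotonicIncrementCache` and its methods are hypothetical and do not mirror the actual inmanta implementation.

```python
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


class MonotonicIncrementCache(Generic[T]):
    """Holds the increment for a single version and never moves backwards."""

    def __init__(self) -> None:
        self._version: Optional[int] = None
        self._increment: Optional[T] = None

    def store(self, version: int, increment: T) -> bool:
        """Store the increment unless a newer version is already cached."""
        if self._version is not None and version < self._version:
            return False  # stale writer: keep the newer entry
        self._version = version
        self._increment = increment
        return True

    def get(self, version: int) -> Optional[T]:
        """Return the cached increment only if it belongs to `version`."""
        if version == self._version:
            return self._increment
        return None
```

A slow computation for an old version that finishes late then cannot overwrite the entry for a version released in the meantime. Re-storing the same version is allowed in this sketch; rejecting it as well would be an equally valid choice.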

Current time taken by each stage of put_partial (durations in seconds):

```
2024-02-28 17:41:57,308 performance              WARNING STARTING PUT PARTIAL
2024-02-28 17:41:57,312 performance              WARNING INPUT VALIDATION: 0.0035941700043622404
2024-02-28 17:41:57,441 performance              WARNING LOAD STAGE: 0.1291558850207366
2024-02-28 17:41:57,802 performance              WARNING MERGE STAGE: 0.3613146049901843
2024-02-28 17:41:59,651 performance              WARNING PUT STAGE: 1.849367157992674
2024-02-28 17:42:01,870 performance              WARNING AUTO DEPLOY STAGE: 2.218535807012813
```
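
Per-stage timings like the ones above, which add up to roughly 4.56 s with the PUT and AUTO DEPLOY stages dominating, can be collected with a small context manager around each stage. This is a minimal sketch, not the instrumentation used for this measurement; the logger name `performance` is taken from the log output, while `timed_stage` and `timings` are hypothetical names.

```python
import logging
import time
from contextlib import contextmanager
from typing import Dict, Iterator, Optional

LOGGER = logging.getLogger("performance")


@contextmanager
def timed_stage(name: str, timings: Optional[Dict[str, float]] = None) -> Iterator[None]:
    """Log the wall-clock duration (in seconds) of the wrapped block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if timings is not None:
            timings[name] = elapsed  # keep the value for later aggregation
        LOGGER.warning("%s: %s", name, elapsed)
```

Usage is one `with timed_stage("LOAD STAGE"):` block per stage. In async code this measures wall-clock time including time spent awaiting, which is what the compile actually waits for.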
inmantaci pushed a commit that referenced this issue Mar 4, 2024
Pull request opened by the merge tool on behalf of #7278

3 participants