
Improved pre-migration memory consumption #236

Merged
merged 1 commit into pulp:master from the issue7490 branch on Oct 6, 2020

Conversation

@goosemania (Member) commented on Sep 16, 2020:

Pre-migrate compressed repodata for RPMs to save some memory.

Added batch_size to all the bulk_create calls.
Decreased the size of the content batch as well.

closes #7490
https://pulp.plan.io/issues/7490
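
For readers following along, here is a minimal sketch (illustrative wiring, not the PR's exact code; Pulp2Content stands in for the migration models touched here) of what passing batch_size to Django's bulk_create changes:

BULK_CREATE_BATCH_SIZE = 1000  # value matches the diff below

def save_content_batch(objs):
    # Without batch_size, Django builds a single INSERT for the whole
    # list, which can produce an enormous query string in memory.
    # With batch_size, the list is split into INSERTs of at most
    # 1000 rows each.
    Pulp2Content.objects.bulk_create(objs, batch_size=BULK_CREATE_BATCH_SIZE)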

@goosemania (Member Author) commented:

This might not be enough; however, the failure seems to point to bulk_create.

@dralley (Contributor) left a comment:

👍

@goosemania (Member Author) commented:

Katello has a reproducer and is testing it, so I'll leave this as a draft until it's confirmed that the issue is resolved.

@jlsherrill (Contributor) commented:

This didn't seem to prevent an OOM, but this time PostgreSQL was killed (which may or may not mean anything). I may need to adjust my PostgreSQL settings a bit: I had to increase them to get the customer DB to load, so it's possible I need a bit more memory, or I might need to reduce how much memory PostgreSQL is taking.

I'll re-test and monitor how much memory the rq workers are actually taking!

@@ -8,3 +8,5 @@
}

NOT_USED = 'Not Used'

BULK_CREATE_BATCH_SIZE = 1000
Member commented:

I know that this change is not strictly related to the PR, but what do you think about renaming BULK_CREATE_BATCH_SIZE to just BATCH_SIZE and also updating the `bulk_update` calls to use this constant?

Member Author commented:

Will do.
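
A sketch of what the suggested rename could look like (hypothetical call sites; the field list passed to bulk_update is illustrative):

BATCH_SIZE = 1000  # formerly BULK_CREATE_BATCH_SIZE

# One shared constant for both kinds of bulk writes:
Pulp2Content.objects.bulk_create(new_objs, batch_size=BATCH_SIZE)
Pulp2Content.objects.bulk_update(
    existing_objs,
    ['pulp2_last_updated'],  # illustrative field list
    batch_size=BATCH_SIZE,
)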

@jlsherrill (Contributor) commented on Sep 17, 2020:

With some additional runs and different findings, here is the memory usage of the rq worker doing the migration, as a percentage of overall memory, sampled every 10 seconds:

1.0  1.1  1.0  1.0  1.1  1.0  1.0  1.0  1.0  1.0  1.0  1.0
1.0  1.0  1.0  0.9  1.9  1.5  2.0  1.9  2.0  3.0  1.5  3.1
2.0  2.0  1.8  4.1  3.5  3.5  2.1  3.4  1.9  2.3  2.9  1.8
2.4  1.8  2.7  1.7  1.8  2.2  6.6  2.2  11.2  3.3  8.9  16.7

That last reading was 2,324,444 kB of memory (RSS). Checking the OOM killer log for this PID, I see:

Killed process 18341 (rq) total-vm:2981060kB, anon-rss:2322684kB, file-rss:0kB, shmem-rss:0kB

So it seems like the process spiked heavily in a very short time, within 10-20 seconds.
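
For reference, a standalone sketch (Linux-only, stdlib, not part of this PR) of how such RSS-percentage samples can be collected every 10 seconds:

import time

def rss_percent(pid):
    # VmRSS from /proc/<pid>/status and MemTotal from /proc/meminfo,
    # both reported in kB.
    with open(f'/proc/{pid}/status') as f:
        rss_kb = next(int(line.split()[1]) for line in f
                      if line.startswith('VmRSS:'))
    with open('/proc/meminfo') as f:
        total_kb = next(int(line.split()[1]) for line in f
                        if line.startswith('MemTotal:'))
    return 100.0 * rss_kb / total_kb

worker_pid = 18341  # the rq worker PID from the OOM log above
while True:
    print(round(rss_percent(worker_pid), 1))
    time.sleep(10)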

@jlsherrill (Contributor) commented:

Also, I examined the PostgreSQL logs right before the OOM and saw some very large queries. Here is the last one, trimmed to the first 400 characters.

2020-09-17 17:38:02 UTC LOG: duration: 5625.576 ms statement: INSERT INTO "pulp_2to3_migration_pulp2rpm" ("pulp_id", "pulp_created", "pulp_last_updated", "pulp2content_id", "name", "epoch", "version", "release", "arch", "checksum", "checksumtype", "repodata", "is_modular", "size", "filename") VALUES ('185b2e88-6123-4ac3-ac43-fa0cc4ca45d4'::uuid, '2020-09-17T17:37:55.487155+00:00'::timestamptz, '

The length of that one line in the log is 267,818,406 characters.

I did notice that 'repodata' is in there; I assume that is what is taking up all the data. Does Pulp 3 still use this information? I don't see it in the rpm_package table.

@jlsherrill (Contributor) commented:

Another note: I did see PostgreSQL jump up quite a bit on one run:

postgres 27415 11.5 17.2 2690156 2385480 ? Ss 19:02 6:28 postgres: pulp pulpcore ::1(54378) INSERT

(so this one process is taking 17% of memory)

@goosemania (Member Author) commented:

> I did notice that 'repodata' is in there; I assume that is what is taking up all the data. Does Pulp 3 still use this information? I don't see it in the rpm_package table.

The migration uses this information to construct Pulp 3 objects; not everything is available in parsed form in Pulp 2.

@goosemania (Member Author) commented on Sep 18, 2020:

I'm looking into where the memory consumption can spike, and I think we might want to save pulp_2to3_migration_pulp2rpm specifically in smaller batches than the rest.
Thanks for all the testing and pointers!

@goosemania changed the title from "Added batch_size to all the bulk_create calls." to "Improved pre-migration memory consumption" on Sep 21, 2020.
@goosemania (Member Author) commented:

I pushed a second commit which should decrease memory consumption at pre-migration time.
Now repodata is pre-migrated in compressed form and decompressed right before it needs to be used.
I believe it was not done this way before because I struggled to find a way to store a dictionary with binary values in PostgreSQL.
My workaround is to store it in 3 separate fields rather than as a dictionary.
cc @ipanova @dralley
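
For reference, a sketch of the scheme described here; the helper names are modeled on the diff below, but the actual signatures in the PR may differ:

import gzip  # the diff uses gzip.zlib, i.e. the zlib module as re-exported by gzip

def compress_repodata(repodata):
    # Split the uncompressed dict into three compressed binary blobs,
    # one per metadata type, sidestepping the need for a dictionary
    # with binary values in PostgreSQL.
    return {
        'primary_template_gz': gzip.zlib.compress(repodata['primary'].encode()),
        'other_template_gz': gzip.zlib.compress(repodata['other'].encode()),
        'filelists_template_gz': gzip.zlib.compress(repodata['filelists'].encode()),
    }

def decompress_repodata(fields):
    # Rebuild the dict right before it is needed at content migration time.
    return {
        'primary': gzip.zlib.decompress(fields['primary_template_gz']).decode(),
        'other': gzip.zlib.decompress(fields['other_template_gz']).decode(),
        'filelists': gzip.zlib.decompress(fields['filelists_template_gz']).decode(),
    }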

),
]

# TODO: data migration - for consistency and in case some content hasn't been migrated before the upgrade.
Member Author commented:

This migration doesn't contain a data migration; it is suitable for tests/experiments only.

@@ -124,8 +124,6 @@ def pre_migrate_content_detail(cls, content_batch):
*cls.unit_fields)
pulp2rpm_to_save = []
for rpm in pulp2_content_batch:
rpm['repodata'] = decompress_repodata(rpm['repodata'])

pulp2rpm_to_save.append(
Contributor commented:

This makes a lot of sense; it also has the benefit that decompression will happen at publish time, so the parallelism can help with it.

@dralley (Contributor) commented on Sep 21, 2020:

@goosemania LGTM pending data migration / any final changes

Member Author commented:

@dralley, I believe it happens earlier, at content migration time. We use repodata from Pulp 2 to create a Package instance.

Contributor commented:

Derp, yeah, I think I realized that at some point and then promptly forgot it.

@jlsherrill (Contributor) commented:

This helped quite a bit, and I was able to get past the pre-migration step!

@dralley (Contributor) commented on Sep 21, 2020:

@jlsherrill From your wording I assume you hit another issue later on?

@jlsherrill (Contributor) commented:

@dralley Yeah, I ended up hitting this: https://pulp.plan.io/issues/7538, then manually hacked around it, and then hit this: https://pulp.plan.io/issues/7540

Neither is related to this change specifically.

pkg.other_template_gz = gzip.zlib.compress(pkg.repodata.get('other').encode())
pkg.filelists_template_gz = gzip.zlib.compress(pkg.repodata.get('filelists').encode())
pkg.repodata = ''
pkg.save()
Member Author commented:

Testing; we likely need to switch to bulk_update.
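
A hedged sketch of the bulk_update variant mentioned here (the batching logic is illustrative, not the final code):

import gzip

BATCH_SIZE = 1000
FIELDS = ['primary_template_gz', 'other_template_gz',
          'filelists_template_gz', 'repodata']

def compress_in_batches(queryset):
    batch = []
    for pkg in queryset.iterator():
        pkg.primary_template_gz = gzip.zlib.compress(pkg.repodata.get('primary').encode())
        pkg.other_template_gz = gzip.zlib.compress(pkg.repodata.get('other').encode())
        pkg.filelists_template_gz = gzip.zlib.compress(pkg.repodata.get('filelists').encode())
        pkg.repodata = ''
        batch.append(pkg)
        if len(batch) == BATCH_SIZE:
            # One UPDATE per batch instead of one save() per row.
            queryset.model.objects.bulk_update(batch, FIELDS)
            batch = []
    if batch:
        queryset.model.objects.bulk_update(batch, FIELDS)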

@goosemania marked this pull request as ready for review on October 2, 2020 14:27.
@goosemania force-pushed the issue7490 branch 2 times, most recently from 8b32b1a to fbd3787 on October 2, 2020 14:31.
@goosemania (Member Author) commented:

@ipanova, @dralley, ready for review (added the data migration, see my comments there).

Comment on lines +16 to +18
pkg.primary_template_gz = gzip.zlib.compress(pkg.repodata.get('primary').encode())
pkg.other_template_gz = gzip.zlib.compress(pkg.repodata.get('other').encode())
pkg.filelists_template_gz = gzip.zlib.compress(pkg.repodata.get('filelists').encode())
@goosemania (Member Author) commented on Oct 2, 2020:

TL;DR: This is expensive and takes a lot of time; 20K RPMs take 2-3 minutes to migrate. The majority of them will never be used. Is it worth migrating only those which will be used?

Pulp2Rpm and Pulp2Srpm currently have repodata in uncompressed form in the repodata field. This migration compresses it and splits it into 3 fields, one for each type.
The repodata field is only used to create a Package object in Pulp 3. Once a Pulp2Rpm object is migrated and the created package is linked to it, I believe Pulp2Rpm.repodata will never be used again. Should we just set all *_template_gz fields to null in that case and be done? Am I forgetting any corner case in which a Package can disappear from the Pulp2Rpm relation?

Why is this migration needed? For the case when a user started an RPM migration and it didn't fully finish (e.g. an OOM happened), then upgrades (because they need this memory fix) and triggers the RPM migration again. Items which were already pre-migrated but not yet migrated need this data migration.
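
To make the trade-off concrete, a sketch of the "migrate only what will be used" variant; the app label and wiring are assumptions, while the filter is the one quoted later in this thread:

import gzip

def compress_unmigrated_repodata(apps, schema_editor):
    # App label is an assumption for illustration.
    Pulp2Rpm = apps.get_model('pulp_2to3_migration', 'Pulp2Rpm')
    # Packages already linked to a Pulp 3 Package never read repodata
    # again, so only unmigrated rows would need the conversion.
    for pkg in Pulp2Rpm.objects.filter(pulp2content__pulp3_content=None).iterator():
        pkg.primary_template_gz = gzip.zlib.compress(pkg.repodata.get('primary').encode())
        pkg.other_template_gz = gzip.zlib.compress(pkg.repodata.get('other').encode())
        pkg.filelists_template_gz = gzip.zlib.compress(pkg.repodata.get('filelists').encode())
        pkg.repodata = ''
        pkg.save()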

Member Author commented:

To avoid the need for such migrations, I think we might need a reset or force option, so that when something goes south, the user is able to start again. Thoughts?

Member commented:

How will you identify which packages will be used?
+1 on unsetting the repodata; git grep did not show any other usage of it.

Member commented:

What will that reset or force option do?

Member Author commented:

> How will you identify which packages will be used?

Pulp2Rpm.objects.filter(pulp2content__pulp3_content=None)

@ipanova (Member) commented on Oct 5, 2020:

> To avoid the need for such migrations, I think we might need a reset or force option, so that when something goes south, the user is able to start again. Thoughts?

I am in favour of adding such an option; as you mentioned, we should meet and brainstorm.

> How will you identify which packages will be used?
>
> Pulp2Rpm.objects.filter(pulp2content__pulp3_content=None)

I think it makes sense to filter only the needed content; this will improve the speed of the migration.
Speaking of whether to remove the 'repodata' field: I would rather let it be, in case we might need to get to that info in the future for some reason. Yes, some of the content will have empty repodata and some will not, but I do not see this as a big issue. Thoughts?

Member Author commented:

@ipanova, after a discussion with @dralley, I think I'm convinced that we can leave this migration as is. It doesn't affect Katello users, since it's only for RPM content and RPM migration is not a part of Katello yet. As for pure Pulp users, very few of them will be affected (an upgrade in the middle of a broken migration), and only if their setup is very large. So we go with consistency and migrate all the data.

Contributor commented:

@goosemania I'm not sure if it warrants much more thought, but I'm going to test your PR, and I'm going to try a thread pool. I've seen some blog posts suggesting that the gzip module releases the global interpreter lock, so compressing/decompressing multiple things in true parallel is possible. But I approve of the current approach regardless.
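
A quick sketch of the thread-pool experiment described here (zlib releases the GIL while compressing sizable buffers, so threads can overlap the CPU work; the function and field names are illustrative):

import gzip
from concurrent.futures import ThreadPoolExecutor

def compress_pkg(pkg):
    pkg.primary_template_gz = gzip.zlib.compress(pkg.repodata.get('primary').encode())
    pkg.other_template_gz = gzip.zlib.compress(pkg.repodata.get('other').encode())
    pkg.filelists_template_gz = gzip.zlib.compress(pkg.repodata.get('filelists').encode())
    pkg.repodata = ''
    return pkg

def compress_parallel(packages, workers=4):
    # map preserves input order; each worker compresses one package at a time.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(compress_pkg, packages))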

Member Author commented:

Thanks @dralley; to be honest, for the number of cases we discussed, I think it's likely not worth the effort.

Contributor commented:

@goosemania On my dev laptop it only took about 30 seconds to run the Django migration with 20k RPMs, though I imagine that's a fair bit faster than the majority of production boxes. I consider it a good sign though. Approved.

pkg.primary_template_gz = gzip.zlib.compress(pkg.repodata.get('primary').encode())
pkg.other_template_gz = gzip.zlib.compress(pkg.repodata.get('other').encode())
pkg.filelists_template_gz = gzip.zlib.compress(pkg.repodata.get('filelists').encode())
pkg.repodata = ''
Member commented:

👍

@goosemania merged commit bfbf6cc into pulp:master on Oct 6, 2020.