Taught export to ensure de-duplicated Artifact.json. #4161

Merged
merged 1 commit into from
Jul 29, 2023

Conversation

ggainey
Contributor

@ggainey ggainey commented Jul 26, 2023

fixes #4159.

CHANGES/4159.bugfix (outdated, resolved)
@jpasqualetto
Contributor

What's the impact of this change on fs-exports?

@ggainey
Contributor Author

ggainey commented Jul 28, 2023

What's the impact of this change on fs-exports?

That is a good question. fs-export is a different code-path aimed at a different end-output, but might benefit from what we've learned here - if not as part of this PR/issue. I'll review and open a new issue if it looks like we can follow a similar pattern there.

@dralley
Contributor

dralley commented Jul 28, 2023

+1 to new issue / separate PR

@ggainey ggainey marked this pull request as ready for review July 28, 2023 01:22
@ggainey
Contributor Author

ggainey commented Jul 28, 2023

What's the impact of this change on fs-exports?

Taking a quick look at _export_to_filesystem, it looks like it relies on hydrated Artifacts where it could get away with knowing just Artifact.file. This is "mildly" suboptimal, but not a dealbreaker: going to the database at all takes most of the time, and Artifacts aren't so large that fetching "all fields" is horribly worse than fetching just "a few fields".

I don't see a path that suffers the same duplication-effort - it doesn't do multiple-repos-at-once, the source of our Pain in this PR.

My first thought is that I wouldn't spend time trying to squeeze more performance out of that codepath, unless/until we actually see a problem.
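
For readers new to the thread, the duplication mentioned above can be pictured with a toy model. This is only a sketch under stated assumptions, not the code from this PR: the repository names, the `sha256` and `file` fields, and both helper functions are hypothetical, and plain dicts stand in for Artifact rows.

```python
# Toy data: two repositories exported at once, sharing one artifact ("ccc").
# All names and fields here are hypothetical stand-ins.
repo_artifacts = {
    "repo_a": [
        {"sha256": "aaa", "file": "artifact/aa/a"},
        {"sha256": "ccc", "file": "artifact/cc/c"},
    ],
    "repo_b": [
        {"sha256": "bbb", "file": "artifact/bb/b"},
        {"sha256": "ccc", "file": "artifact/cc/c"},
    ],
}


def naive_rows(repo_artifacts):
    """Concatenate per-repository rows; shared artifacts appear once per repo."""
    return [row for rows in repo_artifacts.values() for row in rows]


def deduplicated_rows(repo_artifacts):
    """Keep the first row seen for each digest, preserving insertion order."""
    seen = {}
    for rows in repo_artifacts.values():
        for row in rows:
            seen.setdefault(row["sha256"], row)
    return list(seen.values())


naive = naive_rows(repo_artifacts)
unique = deduplicated_rows(repo_artifacts)
```

In this toy case `naive` holds four rows while `unique` holds three, because the shared artifact is listed by both repositories. A single-repository export has no such overlap to collapse.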

Contributor

@dralley dralley left a comment

Member

@mdellweg mdellweg left a comment

I'd even be fine with merging this as is and promise to start on further refactoring right away with both ideas:

  • Using a db iterator
  • using qs.none()

Another thing (to think about later) that came to my mind: in the incremental case, we still include artifacts that are new to a repository version but guaranteed to be in another (exported) repository's previous version. Could we safely skip them too?
Say the logic switching from:

$\displaystyle \bigcup_{\textrm{version}} \left( \textrm{version}.\textrm{artifacts} \setminus \textrm{prev}(\textrm{version}).\textrm{artifacts} \right)$

to

$\left( \displaystyle \bigcup_{\textrm{version}} \textrm{version}.\textrm{artifacts} \right) \setminus \left( \displaystyle \bigcup_{\textrm{version}} \textrm{prev}(\textrm{version}).\textrm{artifacts} \right)$
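
A small sketch with plain Python sets may make the difference between the two expressions concrete. Everything here is an assumption for illustration: the repository names, the artifact labels, and the two helper functions are hypothetical, and each repository is reduced to a (previous version, current version) pair of sets.

```python
# Each repository maps to (previous_version_artifacts, current_version_artifacts).
# "shared" is new to repo_a but already present in repo_b's previous version.
repos = {
    "repo_a": ({"a1"}, {"a1", "a2", "shared"}),
    "repo_b": ({"b1", "shared"}, {"b1", "shared", "b2"}),
}


def union_of_differences(repos):
    """First expression: take the per-version difference, then the union."""
    result = set()
    for prev, curr in repos.values():
        result |= curr - prev
    return result


def difference_of_unions(repos):
    """Second expression: union all current sets, then subtract all previous sets."""
    all_curr = set().union(*(curr for _, curr in repos.values()))
    all_prev = set().union(*(prev for prev, _ in repos.values()))
    return all_curr - all_prev


first = union_of_differences(repos)
second = difference_of_unions(repos)
```

With this data `first` still contains `shared`, while `second` drops it. The second result is always a subset of the first, since an artifact absent from every previous version is in particular absent from the previous version of whichever repository contributes it. Whether skipping those artifacts is safe for the importer is the open question raised above.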

pulpcore/app/tasks/export.py (resolved)
@ggainey
Contributor Author

ggainey commented Jul 28, 2023

I'd even be fine with merging this as is and promise to start on further refactoring right away with both ideas:

I am absolutely open to ways to make this better in the face of "many repos at once", as a future improvement. Open an issue, if you would, so we have at least a placeholder pointing back to this discussion.

@ggainey
Contributor Author

ggainey commented Jul 28, 2023

LGTM, but please leave a TODO about maybe using .iterator() here

https://github.com/ggainey/pulpcore/blob/134610ff1eb48ce47fb06c0e5bae2f74dd815701/pulpcore/app/importexport.py#L104

Clarify please? Do you mean just change to ...pb.iter(artifacts.iterator())? I feel like I'm missing something.

@ggainey
Contributor Author

ggainey commented Jul 28, 2023

hold off on merging while I investigate some test weirdness, please

@dralley
Contributor

dralley commented Jul 28, 2023

Clarify please? Do you mean just change to ...pb.iter(artifacts.iterator())? I feel like I'm missing something.

iterate over artifacts.iterator(), increment the progressbar manually (and probably not one-at-a-time)
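
A rough sketch of that pattern, under stated assumptions: `FakeProgress` and `export_in_batches` are hypothetical names, `FakeProgress` is a stand-in and not pulpcore's progress-report API, and `range(250)` stands in for what would be `artifacts.iterator()` on a Django QuerySet.

```python
from itertools import islice


class FakeProgress:
    """Hypothetical stand-in for a progress report; counts work and updates."""

    def __init__(self):
        self.done = 0
        self.updates = 0

    def increase_by(self, count):
        self.done += count
        self.updates += 1


def export_in_batches(rows, handle, progress, batch_size=100):
    """Stream rows lazily and bump the progress counter once per batch."""
    rows = iter(rows)
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            break
        for row in batch:
            handle(row)
        # One progress update per batch instead of one per row.
        progress.increase_by(len(batch))


exported = []
progress = FakeProgress()
export_in_batches(range(250), exported.append, progress, batch_size=100)
```

Here 250 rows are handled with only three progress updates. If the progress counter is persisted on each update, batching keeps that overhead small, and streaming avoids holding every row in memory at once.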

Along the way taught export to operate on a QuerySet of Artifacts
instead of (prematurely) hydrating all affected Artifacts into
a list.

fixes pulp#4159.
@ggainey
Contributor Author

ggainey commented Jul 29, 2023

Clarify please? Do you mean just change to ...pb.iter(artifacts.iterator())? I feel like I'm missing something.

iterate over artifacts.iterator(), increment the progressbar manually (and probably not one-at-a-time)

Gotcha - good call, results of doing so improve the time even more:

"started_at": "2023-07-29T14:06:46.511803Z",
"finished_at": "2023-07-29T14:16:42.599406Z", : 9m56s

@ggainey ggainey marked this pull request as ready for review July 29, 2023 14:29
@ggainey ggainey merged commit 6178887 into pulp:main Jul 29, 2023
14 checks passed
@patchback

patchback bot commented Jul 29, 2023

Backport to 3.18: 💚 backport PR created

✅ Backport PR branch: patchback/backports/3.18/617888748a85fa994be9258742d83ae3a7d4b034/pr-4161

Backported as #4197

🤖 @patchback
I'm built with octomachinery and
my source is open — https://github.com/sanitizers/patchback-github-app.

@patchback

patchback bot commented Jul 29, 2023

Backport to 3.21: 💚 backport PR created

✅ Backport PR branch: patchback/backports/3.21/617888748a85fa994be9258742d83ae3a7d4b034/pr-4161

Backported as #4198

@patchback

patchback bot commented Jul 29, 2023

Backport to 3.22: 💚 backport PR created

✅ Backport PR branch: patchback/backports/3.22/617888748a85fa994be9258742d83ae3a7d4b034/pr-4161

Backported as #4199

@patchback

patchback bot commented Jul 29, 2023

Backport to 3.23: 💚 backport PR created

✅ Backport PR branch: patchback/backports/3.23/617888748a85fa994be9258742d83ae3a7d4b034/pr-4161

Backported as #4200

@patchback

patchback bot commented Jul 29, 2023

Backport to 3.28: 💚 backport PR created

✅ Backport PR branch: patchback/backports/3.28/617888748a85fa994be9258742d83ae3a7d4b034/pr-4161

Backported as #4201

@patchback

patchback bot commented Jul 29, 2023

Backport to 3.29: 💚 backport PR created

✅ Backport PR branch: patchback/backports/3.29/617888748a85fa994be9258742d83ae3a7d4b034/pr-4161

Backported as #4202

Development

Successfully merging this pull request may close these issues.

PulpExport ArtifactResource.json has duplicate entries for overlapping content.
5 participants