Bug 1470942 - Support maven on S3 #163
Bug 1470942 - Support maven on S3 #163
Conversation
a73d142
to
3a061d5
Compare
Pull Request Test Coverage Report for Build 681
💛 - Coveralls |
a43377f
to
d8c5a46
Compare
68868a4
to
685fa14
Compare
d7f7a1d
to
53f13a4
Compare
This was tested end to end in https://tools.taskcluster.net/groups/TJW4b-CNSwGsJukgQCeoUA/tasks/PojKrc74Q5KFqghgPlosCg/runs/1/. The testing bucket lives at https://s3.console.aws.amazon.com/s3/buckets/geckoviewtest/?region=us-east-1&tab=overview . Here's how things work:
At the first suspicious behavior detected, we bail out. To be honest, now that things are implemented I'm not too sure whether beetmover is the right place for such behavior. I feel Please let me know if you need more information about what this PR does. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This overall looks good to me. There are some things I don't understand yet but none of them are blocking I suppose.
Overall comment:
a) once declarative artifacts is in place, we won't be having any manifests templates at all so we could maybe add indeed all these processing in a separate "extractors" or w\e new kind that beetmover depends on.
b) the term "manifest" was used for mapping. Once declarative artifacts is done, this concept will no longer apply, at least not in beetmoverscript terminology. All the mappings will be defined in a separate artifactMap
dict in task payload. If we don't end up adding extractors
like suggested in a), maybe we can override the term "manifest" to contain validation data or something else?
c) to be honest, I kind of like this new glorious future. I'd much prefer ditching "extractors" as new kinds in-tree as suggested at a) and instead have this logic in beetmoverscript.
Once declarative artifacts is in place, we can see beetmoverscript as something like Geckoview structure.
Beetmoverscript default behavior to wrap the boto calls and talk to S3. It will keep having all these API calls/endpoints (like move_beets
or alike) but then we can clean a bit those and make them more generic so that we could cope other behaviors (such as maven one or any other upcoming) to perform various operations (such as zip.py or maybe in the future something else too).
This sounds like we could have: script.py, utils.py, etc in default folder and then "specifics" like "maven" in dedicated subfolders.
... and integration tests to cover them all.
Hm, I can sketch this elsewhere in a separate bug to not derail from your PR which looks good for now, when we still have the "old-way" of doing things.
@@ -163,3 +173,9 @@ | |||
'target.dmg', | |||
'target.apk', | |||
) | |||
|
|||
# Zip archive can theoretically have a better compression ratio, like when there's a big amount |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++ on the comment. Any reason why we're not moving these two ZIP_* related-constants within the script_config ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh good call. I had forgotten about script_config. I added the max file size in mozilla-releng/build-puppet@bcef5fc.
I don't foresee a need to change the compression ratio as much as the max file size. So, I haven't added it. Moreover, I think increasing the ratio can be a bigger risk than just increasing the file size. If one day we realize we need to deal with zip files that are compressed more than 10 times, we probably want to revisit the ratio check within beetmover. That's why, I think we shouldn't expose that value in script_config.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense, thanks for clarifying.
beetmoverscript/constants.py
Outdated
# Zip archive can theoretically have a better compression ratio, like when there's a big amount | ||
# of redundancy (e.g.: files full of zeros). Let beetmover only deal with regular cases. Edge cases | ||
# are considered too suspicious, so we bail out on them. | ||
ZIP_MAX_FILE_SIZE_IN_MB = 100 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this usage from here where we try to read this from script_config
and if it fails, default to some value defined in beetmoverscript. And it makes sense if we ever need to amend that + puppet fix, instead of beetmoverscript roll-out.
However, I'm thinking whether this can be slightly confusing for anyone else reading this later on. I'm wondering if we shouldn't simply just read it from puppet and that's it. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood. I see the confusion. How about I rename the constant ZIP_MAX_FILE_SIZE_IN_MB into DEFAULT_ZIP_MAX_FILE_SIZE_IN_MB. This should make the intent of this constant clearer. I made this change in 42ade2a.
Please let me know what you think of it!
@@ -0,0 +1,206 @@ | |||
import logging |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not blocking but it'd be nice to add docstrings for all these functions. Doesn't have to be too verbose, but at least some sum-up 1-2 lines to define what the function is doing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay. I added the doctstring of the only public function in 3c02cc5. I usually prefer to not document private functions in favor of having (long) but explicit names. The reason is: this is an implementation detail to the consumers of the module. Ergo, consumers should not rely on these functions but documenting them can send mixed signals.
I sometimes briefly document private functions (like https://github.com/mozilla-releng/mozilla-version/blob/master/mozilla_version/firefox.py#L245), if the function name cannot capture the intent and the function is too obscure to guess the output.
Please let me know if you still want me to document these private functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Public is perfect; your argument for private ones makes a lot of sense. Now I understand retroactively why you were always suggesting the best of the names when reviewing 😃 👍
beetmoverscript/zip.py
Outdated
|
||
# We make this check at this stage (and not when all files from all tasks got extracted) | ||
# because files from different tasks are stored in different folders by scriptworker. Moreover | ||
# we tested no relative paths like ".." are not used within the archive. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tiny nit: either s/tested/test or s/are/were I think? And I think the second negation is not needed. Something like "we tested no relative paths [...] were used within the archive"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Done in: b747099
beetmoverscript/zip.py
Outdated
# We make this check at this stage (and not when all files from all tasks got extracted) | ||
# because files from different tasks are stored in different folders by scriptworker. Moreover | ||
# we tested no relative paths like ".." are not used within the archive. | ||
_ensure_no_file_got_overwritten(task_id, extracted_files) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to have the check here 👍
beetmoverscript/script.py
Outdated
raise NotImplementedError('More than 1 archive extracted. Only 1 is supported at once') | ||
extracted_paths_per_relative_path = list(extracted_paths_per_archive.values())[0] | ||
|
||
context.artifacts_to_beetmove = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering if it's worth moving this context.artifacts_to_beetmove
logic + logic from 169-179 in a separate function. Again, no strong feelings, we can leave them as they are, I'm more leaning towards consistency with the other push_to_*
operations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I was wondering the same. I can't find a name for this function. I get the feeling too many concepts are intertwined. That's why I left it in this function. How would you describe this function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, good question. I wonder, how bad it would be if we defined that function within push_to_maven()? Something like:
async def push_to_maven(context):
def _process_and_validate_artifacts(...)
....
return artifacts
......
context.artifacts_to_beetmove = _process_and_validate_artifacts(...)
I know most of the other similar functions are cleaner but if we don't find any better solution, this could be a good compromise. Alternatively, we can leave things be, I don't want to block on this, nor I have strong feelings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 7f5682e. Unit tests are failing because of #176.
Tested e2e in https://tools.taskcluster.net/groups/FN8cz6cTSyiKeVxa8Y0KAg
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, thanks for making this amendment!
beetmoverscript/maven.py
Outdated
@@ -0,0 +1,54 @@ | |||
import os |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm tempted to see this file renamed to maven_utils.py
. Not sure if it's worth it though. What say you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fine by me. Done in 5cab293
} | ||
}, | ||
"required": ["taskId", "taskType", "paths", "locale"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why trim locale
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
geckoview tasks don't specify a locale to beetmove. That's why I removed it from the required fields. I'm fine putting it back and creating a different schema. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is my fault from the very beginnings of beetmoverscript. In shipitscript
and bouncerscript
we have task schemas for each of the possible tasks and full integration tests as we should have! For beetmover unfortunately it is what it is. So to answer your question, ideally, we add a new schema so that we don't touch existing ones (like trimming the locale
or hashType
or platform
to fit new maven tasks). Old beetmover jobs still have those, hence be tested against those correctly.
Again, this is not blocking so we can definitely follow-up in a separate PR. I have a backlog task to take over (hopefully) within declarative artifacts, to rewrite all the tasks to have integration tests for all possible beetmover jobs so this work will be undertaken anyway at some point this quarter so not sure it's worth time investing now. If it's easy for you to tweak tasks to fit to a new schema and not touch existing, that's be ideal, but again, we're going to do that sooner or later anyway for all tasks. Up to you, am fine either way 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got! I'd prefer to make it in a follow up. I think we're safe for now because beetmoverscript relies on locale
being defined. So if a task def misses one, a KeyError
will be raised somewhere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -46,9 +46,12 @@ | |||
"items": { | |||
"type": "string" | |||
} | |||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to self: in the near future, in my backlog, I have a task to add integration tests for beetmover in such a way that we have beetmover schema for all possible beetmover jobs + actual tests against them. Like we currently have in bouncerscript or shipitscript. Right now we are reusing these two possible tasks but it's not enough.
} | ||
}, | ||
"required": ["taskId", "taskType", "paths", "locale"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I presume you remove 'locale' to fit the maven jobs right? The old beetmover jobs are still having the locale
. Maybe it'd be worth adding it as optional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, that's what this change is about. locale
is now optional
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
da06b4a
to
f609c78
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -163,3 +173,9 @@ | |||
'target.dmg', | |||
'target.apk', | |||
) | |||
|
|||
# Zip archive can theoretically have a better compression ratio, like when there's a big amount |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense, thanks for clarifying.
} | ||
}, | ||
"required": ["taskId", "taskType", "paths", "locale"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
} | ||
}, | ||
"required": ["taskId", "taskType", "paths", "locale"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is my fault from the very beginnings of beetmoverscript. In shipitscript
and bouncerscript
we have task schemas for each of the possible tasks and full integration tests as we should have! For beetmover unfortunately it is what it is. So to answer your question, ideally, we add a new schema so that we don't touch existing ones (like trimming the locale
or hashType
or platform
to fit new maven tasks). Old beetmover jobs still have those, hence be tested against those correctly.
Again, this is not blocking so we can definitely follow-up in a separate PR. I have a backlog task to take over (hopefully) within declarative artifacts, to rewrite all the tasks to have integration tests for all possible beetmover jobs so this work will be undertaken anyway at some point this quarter so not sure it's worth time investing now. If it's easy for you to tweak tasks to fit to a new schema and not touch existing, that's be ideal, but again, we're going to do that sooner or later anyway for all tasks. Up to you, am fine either way 👍
beetmoverscript/script.py
Outdated
raise NotImplementedError('More than 1 archive extracted. Only 1 is supported at once') | ||
extracted_paths_per_relative_path = list(extracted_paths_per_archive.values())[0] | ||
|
||
context.artifacts_to_beetmove = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, good question. I wonder, how bad it would be if we defined that function within push_to_maven()? Something like:
async def push_to_maven(context):
def _process_and_validate_artifacts(...)
....
return artifacts
......
context.artifacts_to_beetmove = _process_and_validate_artifacts(...)
I know most of the other similar functions are cleaner but if we don't find any better solution, this could be a good compromise. Alternatively, we can leave things be, I don't want to block on this, nor I have strong feelings.
@@ -109,29 +110,45 @@ def add_balrog_manifest_to_artifacts(context): | |||
utils.write_json(abs_file_path, context.balrog_manifest) | |||
|
|||
|
|||
def get_upstream_artifact(context, taskid, path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, absolutely no worries, this is super nice! 👍
@@ -0,0 +1,206 @@ | |||
import logging |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Public is perfect; your argument for private ones makes a lot of sense. Now I understand retroactively why you were always suggesting the best of the names when reviewing 😃 👍
def check_and_extract_zip_archives(artifacts_per_task_id, expected_files_per_archive_per_task_id, zip_max_size_in_mb): | ||
"""Verify zip archives and extract them. | ||
|
||
This function enhances the checks done in python's zipfile. Each of the zip file passed is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Holy smokes, this is docs for realz 😮
|
||
def _check_archive_itself(zip_path, zip_max_size_in_mb): | ||
zip_size = os.path.getsize(zip_path) | ||
zip_size_in_mb = zip_size // (1024 * 1024) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
log.info('Archive "{}" contains files with legitimate sizes.'.format(zip_path)) | ||
|
||
|
||
def _ensure_all_expected_files_are_present_in_archive(zip_path, files_in_archive, expected_files): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
\o/
files_in_archive = set(files_in_archive) | ||
|
||
unique_expected_files = set(expected_files) | ||
if len(expected_files) != len(unique_expected_files): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me, thanks for adding all these checks. Makes me feel so much safer having all of them. GG again for all the validations!
bda126c
to
f98517f
Compare
Do not merge yet. This has been untested from an end to another. This is just to let people know this kind of changes are incoming. In my opinion, they aren't too heavy and shouldn't (hopefully) block any ongoing work. Please let me know if that's not the case.All done!