This repository has been archived by the owner on Dec 31, 2019. It is now read-only.

Bug 1470942 - Support maven on S3 #163

Merged
merged 34 commits into mozilla-releng:master on Aug 13, 2018

Conversation

@JohanLorenzo (Contributor) commented Jun 25, 2018

Do not merge yet. This has not been tested end to end. This is just to let people know this kind of change is coming. In my opinion, the changes aren't too heavy and shouldn't (hopefully) block any ongoing work. Please let me know if that's not the case. All done!

@coveralls commented Jun 25, 2018

Pull Request Test Coverage Report for Build 681

  • 168 of 168 (100.0%) changed or added relevant lines in 6 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.3%) to 99.071%

Totals Coverage Status:
  • Change from base Build 679: 0.3%
  • Covered Lines: 640
  • Relevant Lines: 646

💛 - Coveralls

@JohanLorenzo (Contributor Author) commented Jul 25, 2018

This was tested end to end in https://tools.taskcluster.net/groups/TJW4b-CNSwGsJukgQCeoUA/tasks/PojKrc74Q5KFqghgPlosCg/runs/1/. The testing bucket lives at https://s3.console.aws.amazon.com/s3/buckets/geckoviewtest/?region=us-east-1&tab=overview . Here's how things work:

  1. We pass only one target.maven.zip in upstreamArtifacts. This file is flagged with zipExtract.
  2. We get the manifest (maven_geckoview.yml). This manifest tells us exactly what files are present in the archive (no more, no less).
  3. We pass target.maven.zip to the new zip.py module. This module verifies the sanity of the archive (file size, file paths, no duplicates) and checks that all the expected files (from the manifest) are present. The module supports several zip archives, even though we currently expect to pass target.maven.zip alone.
  4. The zip module returns the extracted files via a dict. We massage the dict to make it understandable by move_beets().
  5. We also massage the context to make move_beets() happy.

At the first suspicious behavior detected, we bail out.
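For illustration, here is a minimal, self-contained sketch of the verification and extraction steps above (the function name, the exact checks, and the manifest handling are assumptions for illustration; this is not the PR's actual code):

    import os
    import zipfile

    # Sketch only: verify a single zip archive against the list of files expected by
    # the manifest, bail out on anything suspicious, then extract it and return a
    # dict that a move_beets()-like consumer could digest.
    def check_and_extract_sketch(zip_path, expected_files, extract_to, max_size_in_mb=100):
        # Bail out on suspiciously large archives.
        if os.path.getsize(zip_path) // (1024 * 1024) > max_size_in_mb:
            raise ValueError('Archive "{}" is too big'.format(zip_path))

        with zipfile.ZipFile(zip_path) as archive:
            names = archive.namelist()
            # No absolute paths, no path traversal, no duplicate entries.
            if any(name.startswith('/') or '..' in name for name in names):
                raise ValueError('Suspicious file path in "{}"'.format(zip_path))
            if len(names) != len(set(names)):
                raise ValueError('Duplicate entries in "{}"'.format(zip_path))
            # The archive must contain exactly the files listed in the manifest.
            if set(names) != set(expected_files):
                raise ValueError('Archive "{}" does not match the manifest'.format(zip_path))
            archive.extractall(extract_to)

        return {name: os.path.join(extract_to, name) for name in names}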

To be honest, now that things are implemented, I'm not too sure whether beetmover is the right place for such behavior. I feel push_to_maven() cheats a lot to make move_beets() happy. We could create a different kind of worker to deal with zip extraction, but we would still need the boto wrappers defined in beetmover. I also enjoy the manifest model, which helps define what files are supposed to be in the archive. What do you think @MihaiTabara? I'm fine with keeping the behavior in beetmover or moving it elsewhere.

Please let me know if you need more information about what this PR does.

@MihaiTabara (Contributor) left a comment

This overall looks good to me. There are some things I don't understand yet, but none of them are blocking, I suppose.

Overall comment:
a) once declarative artifacts is in place, we won't have any manifest templates at all, so maybe we could indeed move all this processing into a separate "extractors" kind (or whatever new kind) that beetmover depends on.

b) the term "manifest" was used for the mapping. Once declarative artifacts is done, this concept will no longer apply, at least not in beetmoverscript terminology. All the mappings will be defined in a separate artifactMap dict in the task payload. If we don't end up adding extractors as suggested in a), maybe we can repurpose the term "manifest" to hold validation data or something else?

c) to be honest, I kind of like this new glorious future. I'd much prefer ditching the "extractors" as new in-tree kinds suggested in a) and instead keeping this logic in beetmoverscript.

Once declarative artifacts is in place, we can see beetmoverscript structured something like this:

Beetmoverscript's default behavior would be to wrap the boto calls and talk to S3. It would keep having all these API calls/endpoints (like move_beets or the like), but we could clean those up a bit and make them more generic, so that we can accommodate other behaviors (such as the maven one, or any other upcoming ones) performing various operations (such as zip.py, or maybe something else in the future).

This sounds like we could have script.py, utils.py, etc. in a default folder, and then "specifics" like "maven" in dedicated subfolders.

... and integration tests to cover them all.
Hm, I can sketch this elsewhere in a separate bug so as not to derail your PR, which looks good for now, while we still have the "old way" of doing things.

@@ -163,3 +173,9 @@
'target.dmg',
'target.apk',
)

# Zip archive can theoretically have a better compression ratio, like when there's a big amount
Contributor

++ on the comment. Any reason why we're not moving these two ZIP_*-related constants into script_config?

Contributor Author

Oh good call. I had forgotten about script_config. I added the max file size in mozilla-releng/build-puppet@bcef5fc.

I don't foresee a need to change the compression ratio as often as the max file size, so I haven't added it. Moreover, I think increasing the ratio is a bigger risk than just increasing the file size. If one day we realize we need to deal with zip files that are compressed more than 10 times, we probably want to revisit the ratio check within beetmover. That's why I think we shouldn't expose that value in script_config.
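For context, a compression-ratio guard of the kind mentioned above could look roughly like this (the 10x threshold comes from the sentence above; the function name and error handling are illustrative assumptions, not the PR's code):

    import zipfile

    MAX_COMPRESSION_RATIO = 10

    # Sketch only: flag any archive member whose expansion ratio looks suspicious.
    def ensure_compression_ratio_is_sane(zip_path):
        with zipfile.ZipFile(zip_path) as archive:
            for entry in archive.infolist():
                if entry.compress_size == 0:
                    continue  # skip empty entries
                if entry.file_size / entry.compress_size > MAX_COMPRESSION_RATIO:
                    raise ValueError(
                        'Entry "{}" in "{}" has a suspicious compression ratio'.format(
                            entry.filename, zip_path
                        )
                    )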

Contributor

That makes sense, thanks for clarifying.

# Zip archive can theoretically have a better compression ratio, like when there's a big amount
# of redundancy (e.g.: files full of zeros). Let beetmover only deal with regular cases. Edge cases
# are considered too suspicious, so we bail out on them.
ZIP_MAX_FILE_SIZE_IN_MB = 100
Contributor

I like this approach, where we try to read the value from script_config and, if that fails, default to a value defined in beetmoverscript. It also makes sense if we ever need to amend it: that's a puppet fix instead of a beetmoverscript roll-out.

However, I'm wondering whether this could be slightly confusing for anyone reading it later on. Maybe we should simply read it from puppet and that's it. What do you think?

Contributor Author

Understood, I see the confusion. How about I rename the constant ZIP_MAX_FILE_SIZE_IN_MB to DEFAULT_ZIP_MAX_FILE_SIZE_IN_MB? That should make the intent of the constant clearer. I made this change in 42ade2a.

Please let me know what you think of it!
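For reference, a minimal sketch of the fallback pattern being discussed (the script_config key name below is an assumption):

    # Sketch only: prefer the value deployed via script_config (puppet) and fall back
    # to the in-repo default if the key is absent.
    DEFAULT_ZIP_MAX_FILE_SIZE_IN_MB = 100

    def get_zip_max_file_size_in_mb(script_config):
        return script_config.get('zip_max_file_size_in_mb', DEFAULT_ZIP_MAX_FILE_SIZE_IN_MB)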

@@ -0,0 +1,206 @@
import logging
Contributor

Not blocking, but it'd be nice to add docstrings for all these functions. They don't have to be too verbose; at least a 1-2 line summary of what each function does.

Contributor Author

Okay. I added the docstring of the only public function in 3c02cc5. I usually prefer not to document private functions in favor of giving them long but explicit names. The reason: these functions are an implementation detail from the module consumers' point of view. Consumers should not rely on them, and documenting them can send mixed signals.

I sometimes briefly document private functions (like https://github.com/mozilla-releng/mozilla-version/blob/master/mozilla_version/firefox.py#L245), if the function name cannot capture the intent and the output is too obscure to guess.

Please let me know if you still want me to document these private functions.

Contributor

Public is perfect; your argument for private ones makes a lot of sense. Now I understand retroactively why you were always suggesting the best of the names when reviewing 😃 👍


# We make this check at this stage (and not when all files from all tasks got extracted)
# because files from different tasks are stored in different folders by scriptworker. Moreover
# we tested no relative paths like ".." are not used within the archive.
Contributor

tiny nit: either s/tested/test or s/are/were I think? And I think the second negation is not needed. Something like "we tested no relative paths [...] were used within the archive"

Contributor Author

Thanks! Done in: b747099

# We make this check at this stage (and not when all files from all tasks got extracted)
# because files from different tasks are stored in different folders by scriptworker. Moreover
# we tested no relative paths like ".." are not used within the archive.
_ensure_no_file_got_overwritten(task_id, extracted_files)
Contributor

Makes sense to have the check here 👍
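For illustration only, such a per-task check might boil down to something like this (a sketch assuming extracted_files lists every destination path produced for the task; the real implementation may differ):

    # Sketch only: ensure no two archives from the same task wrote to the same path.
    def ensure_no_file_got_overwritten(task_id, extracted_files):
        if len(extracted_files) != len(set(extracted_files)):
            raise ValueError(
                'Archives from task "{}" overwrote some of each other\'s files'.format(task_id)
            )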

raise NotImplementedError('More than 1 archive extracted. Only 1 is supported at once')
extracted_paths_per_relative_path = list(extracted_paths_per_archive.values())[0]

context.artifacts_to_beetmove = {
Contributor

I was wondering if it's worth moving this context.artifacts_to_beetmove logic plus the logic from lines 169-179 into a separate function. Again, no strong feelings, we can leave them as they are; I'm just leaning towards consistency with the other push_to_* operations.

Contributor Author

Yeah, I was wondering the same. I can't find a good name for such a function; I get the feeling too many concepts are intertwined. That's why I left it in this function. How would you describe this function?

Contributor

Hm, good question. I wonder how bad it would be if we defined that function within push_to_maven(). Something like:

async def push_to_maven(context):
    def _process_and_validate_artifacts(...):
        ...
        return artifacts

    ...
    context.artifacts_to_beetmove = _process_and_validate_artifacts(...)

I know most of the other similar functions are cleaner, but if we don't find any better solution, this could be a good compromise. Alternatively, we can leave things as they are; I don't want to block on this, nor do I have strong feelings.

Contributor Author

Done in 7f5682e. Unit tests are failing because of #176.

Tested e2e in https://tools.taskcluster.net/groups/FN8cz6cTSyiKeVxa8Y0KAg

Contributor

Looks great, thanks for making this amendment!

@@ -0,0 +1,54 @@
import os
Contributor

I'm tempted to see this file renamed to maven_utils.py. Not sure if it's worth it though. What say you?

Contributor Author

Fine by me. Done in 5cab293

}
},
"required": ["taskId", "taskType", "paths", "locale"]
Contributor

Why trim locale here?

Contributor Author

geckoview tasks don't specify a locale to beetmove. That's why I removed it from the required fields. I'm fine putting it back and creating a different schema. What do you think?

Contributor

This is my fault from the very beginnings of beetmoverscript. In shipitscript and bouncerscript we have task schemas for each of the possible tasks and full integration tests, as we should have! For beetmover, unfortunately, it is what it is. So to answer your question: ideally we add a new schema so that we don't touch the existing ones (like trimming the locale or hashType or platform to fit the new maven tasks). Old beetmover jobs still have those and are hence tested against them correctly.

Again, this is not blocking, so we can definitely follow up in a separate PR. I have a backlog task (hopefully within declarative artifacts) to rewrite all the tasks and add integration tests for all possible beetmover jobs, so this work will be undertaken at some point this quarter anyway; I'm not sure it's worth investing time now. If it's easy for you to tweak the tasks to fit a new schema without touching the existing ones, that'd be ideal, but again, we're going to do that sooner or later for all tasks anyway. Up to you, I'm fine either way 👍

Contributor Author

Got it! I'd prefer to do it in a follow-up. I think we're safe for now because beetmoverscript relies on locale being defined, so if a task definition misses it, a KeyError will be raised somewhere.

Contributor

👍

@@ -46,9 +46,12 @@
"items": {
"type": "string"
}
},
Contributor

Note to self: in the near future I have a backlog task to add integration tests for beetmover, so that we have beetmover schemas for all possible beetmover jobs plus actual tests against them, like we currently have in bouncerscript or shipitscript. Right now we are reusing these two possible task schemas, but it's not enough.

}
},
"required": ["taskId", "taskType", "paths", "locale"]
Contributor

I presume you removed 'locale' to fit the maven jobs, right? The old beetmover jobs still have the locale. Maybe it'd be worth adding it as optional.

Contributor Author

Yup, that's what this change is about. locale is now optional

Contributor

👍

@MihaiTabara (Contributor) left a comment

:shipit:


@@ -109,29 +110,45 @@ def add_balrog_manifest_to_artifacts(context):
utils.write_json(abs_file_path, context.balrog_manifest)


def get_upstream_artifact(context, taskid, path):
Contributor

No, absolutely no worries, this is super nice! 👍


def check_and_extract_zip_archives(artifacts_per_task_id, expected_files_per_archive_per_task_id, zip_max_size_in_mb):
"""Verify zip archives and extract them.

This function enhances the checks done in python's zipfile. Each of the zip file passed is
Contributor

Holy smokes, this is docs for realz 😮
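To give a rough idea of how the quoted function is called, here is an illustrative invocation (the argument shapes, task id, and paths are guesses based on the parameter names above, not actual task data):

    # Illustrative call only; all values and shapes below are assumptions.
    extracted_files = check_and_extract_zip_archives(
        artifacts_per_task_id={
            'aTaskId': ['work/cot/aTaskId/public/build/target.maven.zip'],
        },
        expected_files_per_archive_per_task_id={
            'aTaskId': {
                'work/cot/aTaskId/public/build/target.maven.zip': ['org/mozilla/geckoview/geckoview.pom'],
            },
        },
        zip_max_size_in_mb=100,
    )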


def _check_archive_itself(zip_path, zip_max_size_in_mb):
zip_size = os.path.getsize(zip_path)
zip_size_in_mb = zip_size // (1024 * 1024)
Contributor

👍

log.info('Archive "{}" contains files with legitimate sizes.'.format(zip_path))


def _ensure_all_expected_files_are_present_in_archive(zip_path, files_in_archive, expected_files):
Contributor

\o/

files_in_archive = set(files_in_archive)

unique_expected_files = set(expected_files)
if len(expected_files) != len(unique_expected_files):
Contributor

Sounds good to me, thanks for adding all these checks. Makes me feel so much safer having all of them. GG again for all the validations!
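For readers skimming the thread, the validation being praised here boils down to roughly the following (reconstructed from the quoted lines; the error reporting is an assumption):

    # Sketch based on the quoted diff; the real function may report errors differently.
    def ensure_all_expected_files_are_present_in_archive(zip_path, files_in_archive, expected_files):
        files_in_archive = set(files_in_archive)
        unique_expected_files = set(expected_files)
        if len(expected_files) != len(unique_expected_files):
            raise ValueError('Expected files for "{}" contain duplicates'.format(zip_path))
        if files_in_archive != unique_expected_files:
            raise ValueError(
                'Archive "{}" does not contain exactly the expected files. '
                'Missing: {}. Extra: {}'.format(
                    zip_path,
                    sorted(unique_expected_files - files_in_archive),
                    sorted(files_in_archive - unique_expected_files),
                )
            )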

@JohanLorenzo JohanLorenzo merged commit aee370a into mozilla-releng:master Aug 13, 2018