This repository has been archived by the owner on Dec 31, 2019. It is now read-only.

Bug 1470942 - Support maven on S3 #163

Merged
merged 34 commits into mozilla-releng:master on Aug 13, 2018

Conversation

@JohanLorenzo (Contributor) commented Jun 25, 2018

Do not merge yet. This has not been tested end to end. This is just to let people know this kind of change is coming. In my opinion, the changes aren't too heavy and shouldn't (hopefully) block any ongoing work. Please let me know if that's not the case. All done!

@coveralls commented Jun 25, 2018

Pull Request Test Coverage Report for Build 681

  • 168 of 168 (100.0%) changed or added relevant lines in 6 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.3%) to 99.071%

Totals Coverage Status:
  • Change from base Build 679: 0.3%
  • Covered Lines: 640
  • Relevant Lines: 646

💛 - Coveralls

@JohanLorenzo (Contributor Author) commented Jul 25, 2018

This was tested end to end in https://tools.taskcluster.net/groups/TJW4b-CNSwGsJukgQCeoUA/tasks/PojKrc74Q5KFqghgPlosCg/runs/1/. The testing bucket lives at https://s3.console.aws.amazon.com/s3/buckets/geckoviewtest/?region=us-east-1&tab=overview . Here's how things work:

  1. We pass only one target.maven.zip in upstreamArtifacts. This file is flagged with zipExtract.
  2. We get the manifest (maven_geckoview.yml). This manifest tells us exactly what files are present in the archive (no more, no less).
  3. We pass target.maven.zip to the new zip.py module. This module verifies the sanity of the archive (file size, file paths, no duplicates) and checks that all the expected files (from the manifest) are present. The module supports several zip archives, even though we currently expect to pass target.maven.zip alone.
  4. The zip module returns the extracted files via a dict. We massage the dict to make it understandable by move_beets().
  5. We also massage the context to make move_beets() happy.

At the first suspicious behavior detected, we bail out.
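For illustration, here is a minimal, self-contained sketch of the verification and extraction steps above (the function name, the exact checks, and the manifest handling are assumptions for illustration; this is not the PR's actual code):

    import os
    import zipfile

    # Sketch only: verify a single zip archive against the list of files expected by
    # the manifest, bail out on anything suspicious, then extract it and return a
    # dict that a move_beets()-like consumer could digest.
    def check_and_extract_sketch(zip_path, expected_files, extract_to, max_size_in_mb=100):
        # Bail out on suspiciously large archives.
        if os.path.getsize(zip_path) // (1024 * 1024) > max_size_in_mb:
            raise ValueError('Archive "{}" is too big'.format(zip_path))

        with zipfile.ZipFile(zip_path) as archive:
            names = archive.namelist()
            # No absolute paths, no path traversal, no duplicate entries.
            if any(name.startswith('/') or '..' in name for name in names):
                raise ValueError('Suspicious file path in "{}"'.format(zip_path))
            if len(names) != len(set(names)):
                raise ValueError('Duplicate entries in "{}"'.format(zip_path))
            # The archive must contain exactly the files listed in the manifest.
            if set(names) != set(expected_files):
                raise ValueError('Archive "{}" does not match the manifest'.format(zip_path))
            archive.extractall(extract_to)

        return {name: os.path.join(extract_to, name) for name in names}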

To be honest, now that things are implemented, I'm not too sure whether beetmover is the right place for such behavior. I feel push_to_maven() cheats a lot to make move_beets() happy. We could create a different kind of worker to deal with zip extraction, but we would still need the boto wrappers defined in beetmover. I also enjoy the manifest model, which helps define what files are supposed to be in the archive. What do you think @MihaiTabara? I'm fine with keeping the behavior in beetmover or moving it elsewhere.

Please let me know if you need more information about what this PR does.

@MihaiTabara (Contributor) left a comment

This overall looks good to me. There are some things I don't understand yet, but none of them are blocking, I suppose.

Overall comment:
a) once declarative artifacts is in place, we won't have any manifest templates at all, so maybe we could indeed move all this processing into a separate "extractors" kind (or whatever new kind) that beetmover depends on.

b) the term "manifest" was used for the mapping. Once declarative artifacts is done, this concept will no longer apply, at least not in beetmoverscript terminology. All the mappings will be defined in a separate artifactMap dict in the task payload. If we don't end up adding extractors as suggested in a), maybe we can repurpose the term "manifest" to hold validation data or something else?

c) to be honest, I kind of like this new glorious future. I'd much prefer ditching the "extractors" as new in-tree kinds suggested in a) and instead keeping this logic in beetmoverscript.

Once declarative artifacts is in place, we can see beetmoverscript structured something like this:

Beetmoverscript's default behavior would be to wrap the boto calls and talk to S3. It would keep having all these API calls/endpoints (like move_beets or the like), but we could clean those up a bit and make them more generic, so that we can accommodate other behaviors (such as the maven one, or any other upcoming ones) performing various operations (such as zip.py, or maybe something else in the future).

This sounds like we could have script.py, utils.py, etc. in a default folder, and then "specifics" like "maven" in dedicated subfolders.

... and integration tests to cover them all.
Hm, I can sketch this elsewhere in a separate bug so as not to derail your PR, which looks good for now, while we still have the "old way" of doing things.

@@ -163,3 +173,9 @@
'target.dmg',
'target.apk',
)

# Zip archive can theoretically have a better compression ratio, like when there's a big amount
Contributor

++ on the comment. Any reason why we're not moving these two ZIP_*-related constants into script_config?

Contributor Author

Oh good call. I had forgotten about script_config. I added the max file size in mozilla-releng/build-puppet@bcef5fc.

I don't foresee a need to change the compression ratio as often as the max file size, so I haven't added it. Moreover, I think increasing the ratio is a bigger risk than just increasing the file size. If one day we realize we need to deal with zip files that are compressed more than 10 times, we probably want to revisit the ratio check within beetmover. That's why I think we shouldn't expose that value in script_config.
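For context, a compression-ratio guard of the kind mentioned above could look roughly like this (the 10x threshold comes from the sentence above; the function name and error handling are illustrative assumptions, not the PR's code):

    import zipfile

    MAX_COMPRESSION_RATIO = 10

    # Sketch only: flag any archive member whose expansion ratio looks suspicious.
    def ensure_compression_ratio_is_sane(zip_path):
        with zipfile.ZipFile(zip_path) as archive:
            for entry in archive.infolist():
                if entry.compress_size == 0:
                    continue  # skip empty entries
                if entry.file_size / entry.compress_size > MAX_COMPRESSION_RATIO:
                    raise ValueError(
                        'Entry "{}" in "{}" has a suspicious compression ratio'.format(
                            entry.filename, zip_path
                        )
                    )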

Contributor

That makes sense, thanks for clarifying.

# Zip archive can theoretically have a better compression ratio, like when there's a big amount
# of redundancy (e.g.: files full of zeros). Let beetmover only deal with regular cases. Edge cases
# are considered too suspicious, so we bail out on them.
ZIP_MAX_FILE_SIZE_IN_MB = 100
Contributor

I like this approach, where we try to read the value from script_config and, if that fails, default to a value defined in beetmoverscript. It also makes sense if we ever need to amend it: that's a puppet fix instead of a beetmoverscript roll-out.

However, I'm wondering whether this could be slightly confusing for anyone reading it later on. Maybe we should simply read it from puppet and that's it. What do you think?

Contributor Author

Understood, I see the confusion. How about I rename the constant ZIP_MAX_FILE_SIZE_IN_MB to DEFAULT_ZIP_MAX_FILE_SIZE_IN_MB? That should make the intent of the constant clearer. I made this change in 42ade2a.

Please let me know what you think of it!
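For reference, a minimal sketch of the fallback pattern being discussed (the script_config key name below is an assumption):

    # Sketch only: prefer the value deployed via script_config (puppet) and fall back
    # to the in-repo default if the key is absent.
    DEFAULT_ZIP_MAX_FILE_SIZE_IN_MB = 100

    def get_zip_max_file_size_in_mb(script_config):
        return script_config.get('zip_max_file_size_in_mb', DEFAULT_ZIP_MAX_FILE_SIZE_IN_MB)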

@@ -0,0 +1,206 @@
import logging
Contributor

Not blocking, but it'd be nice to add docstrings for all these functions. They don't have to be too verbose; at least a 1-2 line summary of what each function does.

Contributor Author

Okay. I added the docstring of the only public function in 3c02cc5. I usually prefer not to document private functions in favor of giving them long but explicit names. The reason: these functions are an implementation detail from the module consumers' point of view. Consumers should not rely on them, and documenting them can send mixed signals.

I sometimes briefly document private functions (like https://github.com/mozilla-releng/mozilla-version/blob/master/mozilla_version/firefox.py#L245), if the function name cannot capture the intent and the output is too obscure to guess.

Please let me know if you still want me to document these private functions.

Contributor

Public is perfect; your argument for private ones makes a lot of sense. Now I understand retroactively why you were always suggesting the best of the names when reviewing 😃 👍


# We make this check at this stage (and not when all files from all tasks got extracted)
# because files from different tasks are stored in different folders by scriptworker. Moreover
# we tested no relative paths like ".." are not used within the archive.
Contributor

tiny nit: either s/tested/test or s/are/were I think? And I think the second negation is not needed. Something like "we tested no relative paths [...] were used within the archive"

Contributor Author

Thanks! Done in: b747099

# We make this check at this stage (and not when all files from all tasks got extracted)
# because files from different tasks are stored in different folders by scriptworker. Moreover
# we tested no relative paths like ".." are not used within the archive.
_ensure_no_file_got_overwritten(task_id, extracted_files)
Contributor

Makes sense to have the check here 👍
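For illustration only, such a per-task check might boil down to something like this (a sketch assuming extracted_files lists every destination path produced for the task; the real implementation may differ):

    # Sketch only: ensure no two archives from the same task wrote to the same path.
    def ensure_no_file_got_overwritten(task_id, extracted_files):
        if len(extracted_files) != len(set(extracted_files)):
            raise ValueError(
                'Archives from task "{}" overwrote some of each other\'s files'.format(task_id)
            )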

raise NotImplementedError('More than 1 archive extracted. Only 1 is supported at once')
extracted_paths_per_relative_path = list(extracted_paths_per_archive.values())[0]

context.artifacts_to_beetmove = {
Contributor

I was wondering if it's worth moving this context.artifacts_to_beetmove logic plus the logic from lines 169-179 into a separate function. Again, no strong feelings, we can leave them as they are; I'm just leaning towards consistency with the other push_to_* operations.

Contributor Author

Yeah, I was wondering the same. I can't find a good name for such a function; I get the feeling too many concepts are intertwined. That's why I left it in this function. How would you describe this function?

Contributor

Hm, good question. I wonder how bad it would be if we defined that function within push_to_maven(). Something like:

async def push_to_maven(context):
    def _process_and_validate_artifacts(...):
        ...
        return artifacts

    ...
    context.artifacts_to_beetmove = _process_and_validate_artifacts(...)

I know most of the other similar functions are cleaner, but if we don't find any better solution, this could be a good compromise. Alternatively, we can leave things as they are; I don't want to block on this, nor do I have strong feelings.

Contributor Author

Done in 7f5682e. Unit tests are failing because of #176.

Tested e2e in https://tools.taskcluster.net/groups/FN8cz6cTSyiKeVxa8Y0KAg

Contributor

Looks great, thanks for making this amendment!

@@ -0,0 +1,54 @@
import os
Contributor

I'm tempted to see this file renamed to maven_utils.py. Not sure if it's worth it though. What say you?

Contributor Author

Fine by me. Done in 5cab293

}
},
"required": ["taskId", "taskType", "paths", "locale"]
Contributor

Why trim locale here?

Contributor Author

geckoview tasks don't specify a locale to beetmove. That's why I removed it from the required fields. I'm fine putting it back and creating a different schema. What do you think?

Contributor

This is my fault from the very beginnings of beetmoverscript. In shipitscript and bouncerscript we have task schemas for each of the possible tasks and full integration tests, as we should have! For beetmover, unfortunately, it is what it is. So to answer your question: ideally we add a new schema so that we don't touch the existing ones (like trimming the locale or hashType or platform to fit the new maven tasks). Old beetmover jobs still have those and are hence tested against them correctly.

Again, this is not blocking, so we can definitely follow up in a separate PR. I have a backlog task (hopefully within declarative artifacts) to rewrite all the tasks and add integration tests for all possible beetmover jobs, so this work will be undertaken at some point this quarter anyway; I'm not sure it's worth investing time now. If it's easy for you to tweak the tasks to fit a new schema without touching the existing ones, that'd be ideal, but again, we're going to do that sooner or later for all tasks anyway. Up to you, I'm fine either way 👍

Contributor Author

Got it! I'd prefer to do it in a follow-up. I think we're safe for now because beetmoverscript relies on locale being defined, so if a task definition misses it, a KeyError will be raised somewhere.

Contributor

👍

@@ -46,9 +46,12 @@
"items": {
"type": "string"
}
},
Contributor

Note to self: in the near future I have a backlog task to add integration tests for beetmover, so that we have beetmover schemas for all possible beetmover jobs plus actual tests against them, like we currently have in bouncerscript or shipitscript. Right now we are reusing these two possible task schemas, but it's not enough.

}
},
"required": ["taskId", "taskType", "paths", "locale"]
Contributor

I presume you removed 'locale' to fit the maven jobs, right? The old beetmover jobs still have the locale. Maybe it'd be worth adding it as optional.

Contributor Author

Yup, that's what this change is about. locale is now optional

Contributor

👍

@MihaiTabara (Contributor) left a comment

:shipit:


@@ -109,29 +110,45 @@ def add_balrog_manifest_to_artifacts(context):
utils.write_json(abs_file_path, context.balrog_manifest)


def get_upstream_artifact(context, taskid, path):
Contributor

No, absolutely no worries, this is super nice! 👍


def check_and_extract_zip_archives(artifacts_per_task_id, expected_files_per_archive_per_task_id, zip_max_size_in_mb):
"""Verify zip archives and extract them.

This function enhances the checks done in python's zipfile. Each of the zip file passed is
Contributor

Holy smokes, this is docs for realz 😮
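To give a rough idea of how the quoted function is called, here is an illustrative invocation (the argument shapes, task id, and paths are guesses based on the parameter names above, not actual task data):

    # Illustrative call only; all values and shapes below are assumptions.
    extracted_files = check_and_extract_zip_archives(
        artifacts_per_task_id={
            'aTaskId': ['work/cot/aTaskId/public/build/target.maven.zip'],
        },
        expected_files_per_archive_per_task_id={
            'aTaskId': {
                'work/cot/aTaskId/public/build/target.maven.zip': ['org/mozilla/geckoview/geckoview.pom'],
            },
        },
        zip_max_size_in_mb=100,
    )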


def _check_archive_itself(zip_path, zip_max_size_in_mb):
zip_size = os.path.getsize(zip_path)
zip_size_in_mb = zip_size // (1024 * 1024)
Contributor

👍

log.info('Archive "{}" contains files with legitimate sizes.'.format(zip_path))


def _ensure_all_expected_files_are_present_in_archive(zip_path, files_in_archive, expected_files):
Contributor

\o/

files_in_archive = set(files_in_archive)

unique_expected_files = set(expected_files)
if len(expected_files) != len(unique_expected_files):
Contributor

Sounds good to me, thanks for adding all these checks. Makes me feel so much safer having all of them. GG again for all the validations!
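For readers skimming the thread, the validation being praised here boils down to roughly the following (reconstructed from the quoted lines; the error reporting is an assumption):

    # Sketch based on the quoted diff; the real function may report errors differently.
    def ensure_all_expected_files_are_present_in_archive(zip_path, files_in_archive, expected_files):
        files_in_archive = set(files_in_archive)
        unique_expected_files = set(expected_files)
        if len(expected_files) != len(unique_expected_files):
            raise ValueError('Expected files for "{}" contain duplicates'.format(zip_path))
        if files_in_archive != unique_expected_files:
            raise ValueError(
                'Archive "{}" does not contain exactly the expected files. '
                'Missing: {}. Extra: {}'.format(
                    zip_path,
                    sorted(unique_expected_files - files_in_archive),
                    sorted(files_in_archive - unique_expected_files),
                )
            )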

@JohanLorenzo JohanLorenzo merged commit aee370a into mozilla-releng:master Aug 13, 2018