
Reuse Build Workspace in PR based Pipeline #66

Open
shirangj opened this issue Nov 1, 2022 · 12 comments
Labels
rfc-feature Request for Comments for a Feature

@shirangj
Contributor

shirangj commented Nov 1, 2022

Reuse Build Workspace in PR based Pipeline

Summary:

Each pull request has to pass PR based build pipeline in order to be merged. This proposal is to reduce PR based build time by reusing more relevant build workspace from previous builds, and thus increase O3DE contributors' work efficiency.

Problem to resolve

The O3DE Jenkins build workspace is located on an EBS volume created from a daily EBS snapshot. Currently, more than half of O3DE PR based builds take over 3 hours to finish, mainly because:

  1. The PR is based on a commit that is too old or too new. There is a high chance that the PR's base commit differs significantly from the commit captured in the EBS snapshot, in which case the build artifacts from the snapshot cannot be reused.
  2. There is no EBS snapshot for the PR's target branch. For example, when a PR is created off the stabilization branch and there is no EBS snapshot for that branch, the build pipeline has no artifacts to reuse and rebuilds everything.
  3. It's not feasible to create EBS snapshots for each commit id due to high cost and long EBS snapshot creation time.

What is the relevance of this feature?

This is important because it reduces the time O3DE contributors wait to merge pull requests, and it enables the following:

  1. Store the build workspace on S3 for each commit id. Developers can download a build workspace to their local machine for debugging purposes.
  2. Share build workspace between different build pipelines and reduce build time.

Feature design description:

S3SIS (S3 Single Instance Storage) will be used to share the build workspace between different build pipelines. It reduces file transfer time and cost by transferring only the delta files. In addition, files with the same content hash are deduplicated by S3SIS, which significantly reduces S3 storage cost.

After each non-PR based AR build finishes, the build workspace will be uploaded to S3 using S3SIS. PR based builds will download and reuse the build workspace.

Use cases:

  1. Upload the build workspace from a non-PR based AR build: After a successful build, use S3SIS to upload the build workspace to S3 and tag all files with the commit id.
  2. Download the build workspace to a PR based AR build: The first build creates an EBS volume from the latest EBS snapshot, finds the parent commit id, and then downloads the build workspace tagged with that commit id. All files' timestamps and attributes are preserved during the download.
  3. Download the build workspace to a local machine: O3DE contributors can install the S3SIS CLI and download a build workspace to their local machines for debugging purposes.

Technical design description:

Install S3SIS CLI on Build Image

Because a CLI tool installed with O3DE's bundled Python cannot be executed, the S3SIS CLI needs to be installed with the system Python on build nodes.

Steps to install S3SIS CLI:

  1. Run `git clone https://github.com/aws-lumberyard/s3sis.git` or download the source code.
  2. Run `python setup.py install` to install the S3SIS CLI.
  3. Run `s3siscli configure` to configure the S3SIS CLI.

S3SIS CLI should be part of the build image, so it doesn't need to be installed and configured in every build.

Build Workspace Label

S3SIS requires a label to upload and download; the label is used to group a set of file objects on S3.

To enable a Jenkins build to find the correct workspace to sync, the label needs to include the pipeline name, platform, build configuration, and commit id.

Label format used to upload/download workspace using S3SIS:
{pipeline_name}_{platform}_{build_configuration}_{commit_id}
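As a sketch, the label could be assembled with a small helper (`build_label` is illustrative, not part of the S3SIS CLI):

```python
def build_label(pipeline_name: str, platform: str,
                build_configuration: str, commit_id: str) -> str:
    """Join pipeline name, platform, build configuration, and commit id
    with underscores to form an S3SIS label."""
    return "_".join([pipeline_name, platform, build_configuration, commit_id])

# Example matching the manifest label shown in this proposal.
label = build_label("O3DE", "Windows", "profile",
                    "e0e455742f0109e7dfdf6ea44d9ec875cd0848a9")
```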

S3SIS Manifest

The S3SIS manifest file stores the necessary file information; it is the main file to look up when doing an upload or download.

```json
{
    "label":"O3DE_Windows_profile_e0e455742f0109e7dfdf6ea44d9ec875cd0848a9",
    "filelist":{
        "engine.json":{
            "isfile":true,
            "md5":"cb87aae4199e0f920fd4f275a38fbfba",
            "size":"2920",
            "atimestamp":"1667166054970941766",
            "mtimestamp":"1667166040290092687",
            "attribute":"33206"
        },
        "Code/Framework":{
            "isfile":false
        },
        "Code/Framework/CMakeLists.txt":{
            "isfile":true,
            "md5":"25af1ce1496b0b0fa9a0fcbec3c383bf",
            "size":"576",
            "atimestamp":"1667166810450666423",
            "mtimestamp":"1667166810438665729",
            "attribute":"33206"
        }
    }
}
```
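To illustrate how S3SIS can transfer only delta files, the manifest's per-file md5 hashes are enough to decide what changed between two labels. The sketch below is a hypothetical illustration of that idea, not S3SIS's actual implementation:

```python
def files_to_transfer(old_manifest: dict, new_manifest: dict) -> list:
    """Return paths in new_manifest whose content is new or changed
    relative to old_manifest, based on md5 hashes."""
    old_files = old_manifest.get("filelist", {})
    changed = []
    for path, info in new_manifest.get("filelist", {}).items():
        if not info.get("isfile"):
            continue  # directory entries carry no content to transfer
        old = old_files.get(path)
        if old is None or old.get("md5") != info.get("md5"):
            changed.append(path)
    return changed
```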

Upload Workspace

To reduce the number of workspaces uploaded, the workspace is uploaded on success only by non-PR based AR builds, like https://jenkins.build.o3de.org/job/O3DE/job/development/. The upload won't increase any PR build time.

If the build has multiple commits, `s3siscli upload` should run for each commit id. After the first upload all files are already on S3, so subsequent upload commands will upload nothing but a manifest file.

Upload workflow:

  1. Use `currentBuild.changeSets` to find the commit ids built in the current build.
  2. Run `s3siscli upload --label {pipeline_name}_{platform}_{build_configuration}_{commit_id}` for each commit id.
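As an illustration, the per-commit upload commands from the workflow above could be generated like this (the label format and the `s3siscli upload --label` flag come from this proposal; the helper function itself is hypothetical):

```python
def upload_commands(pipeline_name: str, platform: str,
                    build_configuration: str, commit_ids: list) -> list:
    """One `s3siscli upload` invocation per commit id built in this run."""
    return [
        f"s3siscli upload --label "
        f"{pipeline_name}_{platform}_{build_configuration}_{commit_id}"
        for commit_id in commit_ids
    ]
```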

Download Workspace

Most build systems use timestamps to decide when to rebuild. To reuse the build workspace, all files' timestamps and attributes should be preserved during the download process.

Ninja is used for the Linux build, and it tracks file timestamps at nanosecond granularity: Ninja will rebuild a file whenever its timestamp changes, even at the nanosecond level. Therefore all files' timestamps should be preserved with nanosecond precision.
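For reference, preserving nanosecond timestamps is possible with standard OS APIs; in Python, `os.utime` accepts an `ns` tuple. A minimal sketch, reusing the timestamps from the manifest example above:

```python
import os
import tempfile

def restore_timestamps(path: str, atimestamp: str, mtimestamp: str) -> None:
    """Apply access and modification times at nanosecond precision,
    as required for Ninja's up-to-date checks."""
    os.utime(path, ns=(int(atimestamp), int(mtimestamp)))

# Demonstration on a throwaway file, using times from the manifest example.
with tempfile.NamedTemporaryFile(delete=False) as f:
    tmp_path = f.name
restore_timestamps(tmp_path, "1667166054970941766", "1667166040290092687")
restored_mtime_ns = os.stat(tmp_path).st_mtime_ns
os.remove(tmp_path)
```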

Only the first PR based build will sync the workspace from the closest commit id; all following builds will reuse the workspace from previous builds.

Download workflow:

  1. Find the first parent commit id that exists as an S3SIS label on S3.
  2. Build the label name {pipeline_name}_{platform}_{build_configuration}_{parent_commit_id}.
  3. Run `s3siscli download --label {label} --preserve-timestamp --preserve-attributes --preserve-empty-folders --cleanup` to sync the workspace.
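Step 1's selection logic can be sketched as a pure function, assuming the PR's ancestor commit ids (newest first, e.g. from `git rev-list`) and the set of labels available on S3 have already been fetched (the function name is illustrative):

```python
def find_reusable_label(parent_commit_ids, available_labels,
                        pipeline_name, platform, build_configuration):
    """Walk the PR's ancestor commits (newest first) and return the first
    label that exists on S3, or None if no ancestor has an uploaded workspace."""
    for commit_id in parent_commit_ids:
        label = f"{pipeline_name}_{platform}_{build_configuration}_{commit_id}"
        if label in available_labels:
            return label
    return None
```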

File Cleanup

Stale files should be cleaned up regularly to avoid high S3 storage cost.

Each file on S3 is linked to one or more S3SIS manifest files, and each manifest file is linked to a commit id. A manifest is considered stale if its commit is over 1 month old.

Use a DynamoDB table to keep track of the number of manifests that each file is linked to. Use the object hash as the primary key to reduce lookup time. For example:

| object_md5_hash (Primary Key) | ref_count |
| --- | --- |
| cb87aae4199e0f920fd4f275a38fbfba | 3 |
| 25af1ce1496b0b0fa9a0fcbec3c383bf | 3 |

Run a daily Lambda function to look up manifest files, update the table, and delete objects with 0 ref_count. Run a weekly Lambda function to make sure the S3 objects, manifest files, and DynamoDB table stay in sync.

Daily Cleanup Lambda Workflow:

  1. Read the manifest files uploaded in the past 24 hours, count each object's occurrences in these manifests, and increase ref_count in the DynamoDB table.
  2. Read the stale manifests, count each object's occurrences in these manifests, delete the stale manifests, and decrease ref_count in the DynamoDB table.
  3. Delete S3 objects that have 0 ref_count.
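The ref-count bookkeeping in the three steps above can be modeled with a plain dict standing in for the DynamoDB table. This is a simulation of the logic, not actual DynamoDB/boto3 code:

```python
def apply_cleanup(ref_counts: dict, new_manifests: list,
                  stale_manifests: list) -> list:
    """Increase counts for objects referenced by newly uploaded manifests,
    decrease counts for objects in stale manifests, and return the object
    hashes that are now safe to delete (ref_count == 0)."""
    for manifest in new_manifests:
        for info in manifest["filelist"].values():
            if info.get("isfile"):
                ref_counts[info["md5"]] = ref_counts.get(info["md5"], 0) + 1
    for manifest in stale_manifests:
        for info in manifest["filelist"].values():
            if info.get("isfile"):
                ref_counts[info["md5"]] -= 1
    return sorted(h for h, count in ref_counts.items() if count == 0)
```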

Cost

The increased cost comes mainly from S3 storage and S3 transfer. However, the feature reduces EC2 cost because builds take less time.

To limit cost up front, only enable this feature for the bottleneck build (Linux), since it directly impacts the overall pipeline time. The feature can be rolled out gradually if it produces good results.

S3 Cost Increased:

  • Storage: $0.023/GB * TRANSFERRED_FILE_SIZE * AR_BUILD_COUNT
  • Transfer to S3: 0
  • Transfer out S3: $0.02/GB * TRANSFERRED_FILE_SIZE * PR_COUNT

The Linux build workspace is 180GB. Assume there are 150 AR builds and 200 pull requests created per month.

In the worst-case scenario, where all 180GB of files are transferred for every build, the S3 cost would be about $1341 per month.

Because S3SIS only transfers delta files, the amount of data transferred will be significantly reduced. Assume 30GB of files are transferred per build on average (in practice, the size could be less; a no-op build transfers zero files). The cost would then be about $223 per month.

EC2 Cost Reduced:

The Linux node type is c4.4xlarge, which costs $0.796 per hour.

Assuming 1 hour is saved on average per PR based Linux build, 200 PR based builds would save $159.20 per month.
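The cost arithmetic in this section can be reproduced with a short script (rates and volumes are taken directly from the estimates above):

```python
STORAGE_RATE = 0.023      # $/GB-month for S3 storage
TRANSFER_OUT_RATE = 0.02  # $/GB for transfer out of S3
AR_BUILDS = 150           # non-PR AR builds per month
PR_BUILDS = 200           # PR builds per month

def monthly_s3_cost(gb_per_build: float) -> float:
    """Storage cost for uploaded workspaces plus transfer-out cost
    for PR build downloads."""
    storage = STORAGE_RATE * gb_per_build * AR_BUILDS
    transfer_out = TRANSFER_OUT_RATE * gb_per_build * PR_BUILDS
    return storage + transfer_out

worst_case = monthly_s3_cost(180)  # every build transfers the full 180 GB
typical = monthly_s3_cost(30)      # ~30 GB of delta files per build

# EC2 savings: 1 hour saved per PR based Linux build on a $0.796/hr c4.4xlarge
ec2_saved = 0.796 * PR_BUILDS
```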

What are the advantages of the feature?

  • Each commit has a build workspace linked to it.
  • It provides a solution to reuse more relevant build workspace to reduce PR based build time.
  • S3SIS eliminates duplicate files on S3 and saves S3 storage cost.
  • S3SIS reduces file transfer time and cost when syncing workspaces.

What are the disadvantages of the feature?

  • This solution specifically reduces PR based build time; it may slightly increase non-PR based AR build time, since the workspace upload is done by non-PR based AR builds.
  • Reusing workspace for a core change build won't provide much benefit, and the build time may be increased due to the extra workspace sync time.

How will this be implemented or integrated into the O3DE environment?

First integrate this with the O3DE bottleneck build, like the Linux profile_nounity build, because it directly impacts the overall PR based build time. If it produces good results, gradually roll it out to other builds.

Are there any alternatives to this feature?

Yes, distributed builds or a build cache can also reduce overall build time.

How will users learn this feature?

  • Users can see the upload/download stage from Jenkins BlueOcean build page.

Are there any open questions?

  • Will this reduce the local build time?
    • No, this is only intended to reduce PR based build time on O3DE Jenkins.
  • Will this reduce the build time if I make a core change?
    • No, a core change build won't benefit from reusing a previous workspace, since everything needs to be rebuilt anyway.
  • Can I download the build workspace to my local machine for debugging purpose?
    • Yes, you can download the workspace to local machine if you install the S3SIS CLI on the machine.
@shirangj shirangj added the rfc-feature Request for Comments for a Feature label Nov 1, 2022
@brianherrera
Contributor

> Yes, you can download the workspace to local machine if you install the S3SIS CLI on the machine.

I'm assuming this will be public read access for contributors. We should also include this access as part of the cost. We may also need to set up monitoring to avoid unexpected costs from high S3 transfer volumes.

@shirangj
Contributor Author

shirangj commented Nov 2, 2022

> > Yes, you can download the workspace to local machine if you install the S3SIS CLI on the machine.
>
> I'm assuming this will be public read access for contributors. We should also include this access as part of the cost. We may also need to setup monitoring to avoid unexpected costs with high S3 transfers.

Agreed, the transfer cost to the internet is much higher than within AWS, so we should limit the size of artifacts that can be downloaded publicly. We should probably also require approval before the workspace can be downloaded publicly.

@brianherrera
Contributor

brianherrera commented Nov 2, 2022

> Download build workspace to PR based AR build:

What's the additional build time added to the AR run by using S3SIS to download artifacts? Or what's the average time to download the entire workspace?

> Only first PR based build will sync workspace from the closest commit id, all following builds will reuse workspace from previous builds.

Just curious, does this mean that only the first build will use the S3SIS tool? Will other builds skip this step?

@brianherrera
Contributor

> Reusing workspace for a core change build won't provide much benefit, and the build time may be increased due to the extra workspace sync time.

You mentioned in the meeting that we can disable this with a Jenkins parameter since S3SIS won't help much. I think this is fine for the initial version. Later on, I think we can add a step in the pipeline to determine which files changed and enable/disable S3SIS based on that.

@shirangj
Contributor Author

shirangj commented Nov 2, 2022

> > Download build workspace to PR based AR build:
>
> What's the additional build time added to the AR run using S3SIS to download artifacts. Or the avg download time to download the entire workspace.
>
> > Only first PR based build will sync workspace from the closest commit id, all following builds will reuse workspace from previous builds.
>
> Just curious, does this mean that only the first build will use the S3SIS tool? Will other builds skip this step?

Yeah, only the first build will use S3SIS to sync the workspace; the following builds will reuse the artifacts that were already built in the first build.

@shirangj
Contributor Author

shirangj commented Nov 2, 2022

> > Reusing workspace for a core change build won't provide much benefit, and the build time may be increased due to the extra workspace sync time.
>
> You mentioned in the meeting we can disable this with a jenkins parameter since S3SIS won't help much. I think this is fine for the initial version. Later on I think add a step in the pipeline to determine the files that changed and disable/enable s3sis based on that.

Agreed, we should investigate automating this.

@brianherrera
Contributor

Adding a note for a discussion point raised regarding developer access to the artifacts uploaded by S3SIS. This will be a feature implemented later on, so we have time to consider and review the impact of making read access public for O3DE developers.

@amzn-changml
Contributor

We should probably add some data comparing this solution to other build caching solutions (ccache, sccache, etc.).

@Kadino

Kadino commented Nov 8, 2022

May want to avoid enabling this for non-Profile builds, due to the significant increase in size from artifacts such as PDBs. This seems adequately covered by the staged rollout plan.

Recommend the SIS upload is performed before any tests execute, as tests (like any other tool) can corrupt the workspace. The workspace should definitely persist after the build; perhaps processed assets should also be included?

Is this going to persist everything on the entire drive, just the GitHub repo, or only specific artifact folders within the repo? The latter two would reduce size but may also require more steps to synchronize. We may also need to make sure that nothing containing cached AWS instance or Jenkins secrets gets picked up.

What is the proposed security on the S3 bucket(s) which hold the SIS resources?
It sounds like use case 3 (Download build workspace to local machine) may only be accessible to account administrators. Exposing it could also inflate S3 costs, unless "requester pays" is enabled.

@shirangj
Contributor Author

shirangj commented Nov 9, 2022

We should probably add some data on this solution vs other build caching solutions (ccache, sccache, etc.)

Sure, will add it

@shirangj
Contributor Author

shirangj commented Nov 9, 2022

We can specify which folders to include/exclude for download/upload. In my testing, I included the entire workspace because we need the .git folder to check out the commit, and we also need to preserve the timestamps of all source files and build outputs.

There will be no credentials stored on build nodes, since S3SIS will use the build node's IAM role.

In the first phase, we will restrict S3 access to build nodes only. "Download build workspace to local machine" will be a future improvement; we will set up a CloudFront endpoint and come up with an approval process for it.

@brianherrera
Contributor

Related to test files left in the workspace: when enabling TIAF in the AR pipeline, they discovered an issue where tests would leave behind files that caused the next build to fail. One of those tests was fixed here: o3de/o3de#13049

This was mainly due to temp files being generated in the wrong locations, so they were not cleaned up.

One solution brought up to prevent this was to check for temp files in the workspace (files not committed to the repo and not build artifacts) prior to the run. This may be worth investigating as a pre-upload check in a later version.
