Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add S3 Cache to pants #4589

Closed
wants to merge 3 commits into from
Closed

Conversation

toddgardner
Copy link

Problem

This adds an s3 implementation of the pants cache. S3 is used as it is simpler for smaller companies to maintain than running and securing another server and potential caching layer, and configuring development laptops, etc.

Solution

This is an implementation written for my previous startup by @JoeEnnever that we're pushing upstream. It's got several practical tweaks like abandoning on errors, enforcing timeouts, etc.

The advantage of not having this as a plugin is it avoids having to monkey patch cache_setup, or build a new plugin system just for cache backends. Additionally there are some problems installing boto in a plugin, because it isn't zipsafe (#4428).

The downsides is that it adds two dependencies to pants core, in terms of boto3 and pyjavaproperties.

It has several ways to pass credentials, including standard boto chain, with an optionally specified profile.

It also includes a way to have the properties come from a java-properties file. This is useful if your company uses s3 as an ivy artifact storage as well; you can share the credentials between pants and ivy xml. It's unclear if this is a wide enough use-case to actually push upstream.

I also haven't done any performance testing on this versus actually just rebuilding targets. It doesn't seem to feel slow, at least.

@stuhood
Copy link
Sponsor Member

stuhood commented May 15, 2017

Thanks Todd! Not sure why Travis isn't kicking off... perhaps because this is your first PR. Will review tomorrow!

# Delegate storage and extraction to local cache
try:
return self._localcache.store_and_use_artifact(
cache_key, iter_content(get_result['Body']), results_dir)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to close get_result['Body'], otherwise python holds onto the file contents in memory.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ha, that's your fabled memory leak. Done

Copy link
Sponsor Member

@stuhood stuhood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you clarify the help string for the --resolver option in the context of this new capability? For example, indicating that the resolver is only relevant to http(s) URLs.

Copy link
Sponsor Member

@stuhood stuhood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

I'd like a second opinion on the risk of adding boto as a dependency, but... seems low?

@toddgardner
Copy link
Author

Changed resolver flag, waiting on a response from travis support.

Copy link
Member

@mateor mateor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am super excited for this. I spoke with AWS about Pants support last week and we discussed how we would probably want to write this interface someday.

Code looks good too. I have some suggestions and some nits. The biggest effort ask is the switch from pytest to unittest. If Stu doesn't mind pytest creeping back in, I will not block on it.

Thanks Todd!

@@ -1,5 +1,6 @@
ansicolors==1.0.2
beautifulsoup4>=4.3.2,<4.4
boto3==1.4.4
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

boto is for sure the way to go - not a concern for me.

Copy link
Sponsor Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no problem with this dep.

@@ -84,6 +85,12 @@ def register_options(cls, register):
help='number of times pinger tries a cache')
register('--write-permissions', advanced=True, type=str, default=None,
help='Permissions to use when writing artifacts to a local cache, in octal.')
register('--s3-credential-file', advanced=True, type=str,
default=os.path.expanduser('~/.pants/.s3credentials'),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a pants_configdir option that this should use instead.

(Pdb) self.get_options().pants_configdir
'/Users/mateo/.config/pants'

@@ -0,0 +1,178 @@
# coding=utf-8
# Copyright 2014 Pants project contributors (see CONTRIBUTORS.md).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: copyright year

exceptions.EndpointConnectionError, exceptions.ChecksumError
]

def _connect_to_s3(config_file, profile_name):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote something similar, android_config_util.py, as an intern - at some point there should be a general interface for non-Pants config reads for secrets like these.

Not a problem for this review, though

s3_object = self._get_object(cache_key)
try:
get_result = s3_object.get()
except Exception as e:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Be nice to catch a tighter exception here, but I think staying in sync with what the local_artifact_cache does was probably the right choice.

def _get_object(self, cache_key):
return self._s3.Object(self._bucket, self._path_for_key(cache_key))

def _path_for_key(self, cache_key):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you move _cache_file_for_key up to the base class instead please?

@@ -0,0 +1,198 @@
# coding=utf-8
# Copyright 2014 Pants project contributors (see CONTRIBUTORS.md).
Copy link
Member

@mateor mateor May 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: year again

from contextlib import contextmanager

import boto3
import pytest
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be best to not import pytest here and use unittest like the rest of the codebase.

There was some effort put into ripping pytest out and I don't see any remaining usage in the Pants tests tree

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, didn't realize this was an intentional choice in pants. I'm fine with rewriting it (and doing the rest of these changes); will take a stab after work.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am excited about this contribution, glad you (re)joined the Pants community :)

Feel free to ping me if you need a committer/reviewer.

Copy link
Sponsor Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was an intentional choice, although not necessarily the right one. Pytest certainly encourages you not to, and there are some features unavailable to class-based tests. Nonetheless, fixtures are confusing, and xtest-style tests are well understood by most, so...

Copy link
Sponsor Contributor

@benjyw benjyw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an awesome change, that will increase Pants's appeal (assuming latency to S3 is acceptable in a build environment). Thanks for this! Just a few comments inline.

@@ -1,5 +1,6 @@
ansicolors==1.0.2
beautifulsoup4>=4.3.2,<4.4
boto3==1.4.4
Copy link
Sponsor Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no problem with this dep.

logger.debug('Reading access key from {0}'.format(config_file))
boto_kwargs['aws_secret_access_key'] = secret_key
except IOError:
logger.debug('Could not load {0}, using ENV vars'.format(config_file))
Copy link
Sponsor Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add the IOError's message to the log message, so the user can figure out what happened.

Also, should this be a warn? Or is this expected behavior in some cases? If the latter, it seems like asking for trouble to be unable to distinguish between expected behavior and an error... Perhaps we should expect that the config_file, if not None, must be readable?

logger.debug('Could not load {0}, using ENV vars'.format(config_file))

session = boto3.Session(**boto_kwargs)
config = Config(connect_timeout=4, read_timeout=4)
Copy link
Sponsor Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these be options?

return session.resource('s3', config=config)


_READ_SIZE_BYTES = 4 * 1024 * 1024
Copy link
Sponsor Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be an option? Or is it just super-standard to read 4MB chunks from S3?


def try_insert(self, cache_key, paths):
logger.debug('Insert {0}'.format(cache_key))
# Delegate creation of artifacts to the local cache
Copy link
Sponsor Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: End sentence with a period.

from contextlib import contextmanager

import boto3
import pytest
Copy link
Sponsor Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was an intentional choice, although not necessarily the right one. Pytest certainly encourages you not to, and there are some features unavailable to class-based tests. Nonetheless, fixtures are confusing, and xtest-style tests are well understood by most, so...

_TEST_BUCKET = 'verst-test-bucket'


@pytest.yield_fixture(scope="function")
Copy link
Sponsor Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: single quotes, here and all other fixtures below. Although I guess it doesn't matter since you're going to rewrite this anyway.

@stuhood
Copy link
Sponsor Member

stuhood commented Jul 19, 2017

@toddgardner : Do you have time to wrap this one up? It would be really appreciated!

@mateor
Copy link
Member

mateor commented Sep 4, 2017

I am going to experiment with this. If the latencies are workable, I can possibly carry the work forward if Todd doesn't have the bandwidth

@mateor
Copy link
Member

mateor commented Sep 5, 2017

experiment was positive, will move forward unless Todd objects

Copy link
Author

@toddgardner toddgardner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I just came back from vacation. This looks great, thanks for picking it up.

@mateor
Copy link
Member

mateor commented Sep 7, 2017

Okay, will do. We just converted a cache to s3 and will be experimenting. Once that looks reasonable, I will push a PR.

@Eric-Arellano
Copy link
Contributor

Closing because caching has changed dramatically with the V2 implementation of Pants to now require gRPC.

Thank you for originally putting up this change, though!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants