
Added Google Cloud Pubsub support for data records #209

Merged (26 commits) on Feb 6, 2019

Conversation

@vic3lord (Contributor) commented Jan 30, 2019

Description

Adds Google Cloud Pub/Sub support for data records; currently just a minimal implementation.

Motivation and Context

We are using Google Cloud Pub/Sub and needed something native for data records.

How Has This Been Tested?

Ran make with all tasks; all tests passed and the flagr server started.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@codecov-io commented Jan 30, 2019

Codecov Report

Merging #209 into master will decrease coverage by 1.12%.
The diff coverage is 50%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #209      +/-   ##
==========================================
- Coverage   86.73%   85.62%   -1.12%     
==========================================
  Files          23       24       +1     
  Lines        1342     1384      +42     
==========================================
+ Hits         1164     1185      +21     
- Misses        127      144      +17     
- Partials       51       55       +4
Impacted Files Coverage Δ
pkg/handler/data_recorder.go 100% <100%> (ø) ⬆️
pkg/handler/data_recorder_pubsub.go 47.5% <47.5%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4cc964b...e9328ef.

@crberube (Contributor):

Awesome. We are also looking to use Pub/Sub as our data pipeline.

client, err := pubsub.NewClient(
context.Background(),
config.Config.RecorderPubsubProjectID,
option.WithServiceAccountFile(config.Config.RecorderPubsubKeyFile),
Contributor:

This method is deprecated; it is recommended to use WithCredentialsFile instead. I'm just starting to look into this to check their equivalency, but they seem to operate on the same idea.

As this is an option, does setting it to an empty string affect the use of Application Default Credentials? Would want to make sure that is not the case.

Contributor Author:

I'm sorry, you are right! This option is deprecated; I just reused a chunk of code I have been using for the past two years in many Go services here. I am pushing a fix now (thanks, I'm also fixing my own services).

Regarding the empty string: yes, it will work with default credentials. As a matter of fact, this is exactly how we develop on our machines, using our own accounts with proper IAM.

}
}

type pubsubEvalResult struct {
Collaborator:

Related to #203, I think it would be great if pubsub also followed the same pattern. Even better if we can fix #203 and have a single frame struct for data records.

Contributor Author:

I totally agree. I looked at the code and tried to follow your style just to make sure everything will pass and be OK on your end, but sure, I think there's a lot of room to improve some of these things.

I will take a look at those relevant issues and will update ASAP (it's nighttime over here).

Collaborator:

thanks! I think you can use

type pubsubMessageFrame struct {
	Payload   string `json:"payload"`
	Encrypted bool   `json:"encrypted"`
}

Consolidating 2 structs is the same amount of work as 3. We can fix #203 later.

Also, make sure the final payload is a JSON struct that looks like:

{
  "payload": "<json marshal of EvalResult>",
  "encrypted": false
}

Collaborator:

and test coverage of this file :)

Contributor Author:

@zhouzhuojie regarding the pubsubMessageFrame, I added it on my end, but it looks weird to marshal into a JSON string because pubsub accepts []byte anyway, so the conversion happens for no reason. I'm pushing the code to share this with you; LMK what you think.

Collaborator:

I knew the JSON string payload would look weird at first glance. The reasons I want to standardize the log format are:

  • Extensibility. If people want end-to-end encryption, compression, or other meta features related to the message frame, they can do it regardless of the choice of recorder type, by adding more fields to the message frame struct. Payload as a string fits this decision.
  • Portability. The data analytics pipeline requires no changes if one wants to switch between providers.

@crberube (Contributor):

Need to check further to see if this is a larger-scale thing, but it seems that if the pubsub connection fails then an EvaluationResult fails to be returned. @zhouzhuojie would this be expected behavior? It seems to me that we'd want to return a result anyway, given that the data recording should be an async process.

@zhouzhuojie (Collaborator):

Need to check further to see if this is a larger-scale thing, but it seems that if the pubsub connection fails then an EvaluationResult fails to be returned. @zhouzhuojie would this be expected behavior? It seems to me that we'd want to return a result anyway, given that the data recording should be an async process.

Why do you think "if the pubsub connection fails then an EvaluationResult fails to be returned"? The connection won't affect online evaluation. The AsyncRecord function should just log errors if there's any failure, and it shouldn't panic or exit either.

@crberube (Contributor):

Here is what I'm seeing:

  1. Start up Flagr (default SQLite, with data recording enabled; in this case pubsub). PubSub is not running, to simulate a connection issue.
  2. Issue request:
curl --header "Content-Type: application/json" --request POST --data '{"flagId":1}' localhost:18000/api/v1/evaluation
curl: (52) Empty reply from server

Server logs:

INFO[0248] started handling request                      method=POST remote="172.17.0.1:34236" request=/api/v1/evaluation
{"FlagEvalResult":{"evalContext":{"entityID":"randomly_generated_544474078","flagID":1},"evalDebugLog":{"segmentDebugLogs":[]},"flagID":1,"flagKey":"kmmcd1nsd6ze56chh","flagSnapshotID":9,"segmentID":1,"timestamp":"2019-01-31T19:57:38Z","variantAttachment":null,"variantID":null,"variantKey":null}}
ERRO[0308] error pushing to pubsub                       id= pubsub_error="context deadline exceeded"
INFO[0308] completed handling request                    measure#flagr.latency=60002253200 method=POST remote="172.17.0.1:34236" request=/api/v1/evaluation status=200 text_status=OK took=1m0.0022532s

I'd expect the result to be returned immediately, but it's never returned at all. In the OpenAPI client for Go, this ends up setting the error value of the PostEvaluationResult method.

It looks like it's hanging on the pubsub.NewClient part of the code, but again, still digging.

@zhouzhuojie (Collaborator):

@crberube @vic3lord

I see. NewPubsubRecorder should do logrus.Fatal instead of just logrus.Error when we get an error from pubsub.NewClient.

option.WithCredentialsFile(config.Config.RecorderPubsubKeyFile),
)
if err != nil {
logrus.WithField("pubsub_error", err).Error("error getting pubsub client")
Collaborator:

let's Fatal here instead of Error


I think the issue is caused by the Get method from pubsub, which is a blocking operation (see the pubsub docs). We should pass a context with a deadline or timeout to the Get method instead of just the background context.

Contributor:

Can confirm this is the issue. When I disable verbose logging, everything works as expected.

res := p.topic.Publish(ctx, &pubsub.Message{Data: payload})
if config.Config.RecorderPubsubVerbose {
ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
@crberube (Contributor) commented Feb 1, 2019:

What do you think about combining this with running the Get within a new goroutine?

Tried this and it seems to work well:

if config.Config.RecorderPubsubVerbose {
		go func() {
			ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
			defer cancel()
			id, err := res.Get(ctx)
			if err != nil {
				logrus.WithFields(logrus.Fields{"pubsub_error": err, "id": id}).Error("error pushing to pubsub")
			}
		}()
	}

Collaborator:

+1 for another goroutine. Also, 5s can be moved into env config.

Contributor Author:

Agreed, I added both

@vic3lord (Contributor Author) commented Feb 3, 2019

I need some help regarding the tests. I added test coverage for pubsub, but now data_recorder_test.go fails. I don't really understand how it passed before, because it used the same NewPubsubRecorder() function when calling GetDataRecorder().

I can see that we use all data recorders with their production functions, without mocking, meaning the connection should fail on all of them, unless I am missing something that runs as a mock server on CI.

@crberube (Contributor) commented Feb 4, 2019

Still figuring out the best way to handle this, but I know what is wrong currently:

What is failing is data_recorder_pubsub_test. The data_recorder_test.go file still passes because we are stubbing out the NewPubsubRecorder function. For the former...

When the following code is called:

	client, err := pubsub.NewClient(
		context.Background(),
		config.Config.RecorderPubsubProjectID,
		option.WithCredentialsFile(config.Config.RecorderPubsubKeyFile),
	)

the library uses the following rules to create a new client:

GCP client libraries use a strategy called Application Default Credentials (ADC) to find your application's credentials. When your code uses a client library, the strategy checks for your credentials in the following order:

  1. First, ADC checks whether the environment variable GOOGLE_APPLICATION_CREDENTIALS is set. If it is, ADC uses the service account file that the variable points to.
  2. If the environment variable isn't set, ADC uses the default service account that Compute Engine, Kubernetes Engine, App Engine, and Cloud Functions provide, for applications that run on those services.
  3. If ADC can't use either of the above credentials, an error occurs.

On our machines, step 2 is happening, likely because we have run a gcloud auth command in the past. On the CircleCI containers, this has never been run and so we go to step 3 (error).

NewClient signature is as follows:

func NewClient(ctx context.Context, projectID string, opts ...option.ClientOption) (c *Client, err error) {

and the return value in case of an error is this:

return nil, fmt.Errorf("pubsub: %v", err)

so when the rest of the code executes:

	if err != nil {
		// TODO: use Fatal again after fixing the test expecting to not panic.
		// logrus.WithField("pubsub_error", err).Fatal("error getting pubsub client")
		logrus.WithField("pubsub_error", err).Error("error getting pubsub client")
	}

	return &pubsubRecorder{
		producer: client,
		topic:    client.Topic(config.Config.RecorderPubsubTopicName),
		enabled:  config.Config.RecorderEnabled,
	}

client is a nil pointer, and you get a nil pointer dereference.

So that's the issue.

Having the Fatal call will prevent us from dereferencing a nil pointer, which is good. It is also the same way we handle error cases in the kafka client, so it would follow the standards there.

So to fix the issue in the test, we need to either:

  a) allow the client to retrieve credentials,
  b) mock the client somehow, or
  c) run the pubsub emulator.

(a) is not great because you would need to provide some sort of real credentials.
(c) is pretty heavy-handed for our current use case, since all we are testing is that we can create a client.
So it sounds like (b) is our best bet. It appears that the library has a way to mock a client fairly easily; I'm playing around with it now to see if I can get something figured out.

RecorderPubsubTopicName string `env:"FLAGR_RECORDER_PUBSUB_TOPIC_NAME" envDefault:"flagr-records"`
RecorderPubsubKeyFile string `env:"FLAGR_RECORDER_PUBSUB_KEYFILE" envDefault:""`
RecorderPubsubVerbose bool `env:"FLAGR_RECORDER_PUBSUB_VERBOSE" envDefault:"false"`
RecorderPubsubVerboseCancel time.Duration `env:"FLAGR_RECORDER_PUBSUB_VERBOSE_CANCEL" envDefault:"5s"`
Collaborator:

How about RecorderPubsubVerboseCancelTimeout? Otherwise, it's not clear to me that cancel can be a time.Duration.

@crberube (Contributor) commented Feb 4, 2019

@vic3lord I've created a PR on your fork which fixes the above issues.

Christopher Berube and others added 2 commits February 4, 2019 21:26
@crberube (Contributor) commented Feb 5, 2019

This looks good to me pending my question about handling potential marshal errors.

@crberube (Contributor) commented Feb 5, 2019

Cool. @zhouzhuojie how are you feeling about everything at this point?

"github.com/checkr/flagr/pkg/config"
"github.com/checkr/flagr/swagger_gen/models"
"google.golang.org/api/option"

Collaborator:

nit: no need for a new line here

)

type pubsubRecorder struct {
enabled bool
Collaborator:

is this enabled necessary here? eval.go checks it here https://github.com/checkr/flagr/blob/master/pkg/handler/eval.go#L178

Contributor:

oh yeah... this could probably be removed from the kafka and kinesis versions as they do the check as well

Collaborator:

didn't realize that it was also used in kafka and kinesis, we can probably clean them up in other PRs

@zhouzhuojie (Collaborator) left a comment:

LGTM, I think I left 1 or 2 nit comments, nothing major.

Also, I think you can document how to authenticate with pubsub, for example setting GOOGLE_APPLICATION_CREDENTIALS.

@vic3lord (Contributor Author) commented Feb 6, 2019
vic3lord commented Feb 6, 2019

Thank you so much for helping, I fixed the little nits.

Where would it be best to document the pubsub addition and its auth docs?

@zhouzhuojie (Collaborator):

Thank you so much for helping, I fixed the little nits.

Where would it be best to document the pubsub addition and its auth docs?

I think you can put it after the Kinesis section in https://github.com/checkr/flagr/blob/master/docs/flagr_env.md

@zhouzhuojie zhouzhuojie merged commit 9610385 into openflagr:master Feb 6, 2019
@vic3lord vic3lord deleted the pubsub branch February 7, 2019 07:28