-
Notifications
You must be signed in to change notification settings - Fork 59
Adds counters to processed_variant and creates a wrapper for Beam counters. #125
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Pull Request Test Coverage Report for Build 403
💛 - Coveralls |
|
Folks, this PR does not really need two reviews. I have added you both to take a look at how I have implemented counter utils in case that you also need to use it for other features. In my experience, counters and other metrics are a great way of reporting/monitoring issues/progress of a pipeline. Note that these counters are exposed in the Dataflow dashboard too while the job is running (see attached snapshot), so it is not just in the final logs. |
| """ | ||
| counter_filter = MetricsFilter().with_name(self._counter_name) | ||
| query_result = pipeline_result.metrics().query(counter_filter) | ||
| if query_result['counters']: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make 'counters' a constant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| if query_result['counters']: | ||
| counter = query_result['counters'][0] | ||
| return counter.committed | ||
| return 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When does this happen? Does it make sense to log error too or not? Any risk on silently returnin zero?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main case in which this can happen without a bug, is when we have a counter, for example for counting error/warning cases (like the annotation_alt_mismatch_counter I have added in this PR) and there are none of those particular errors. So this counter is never registered in the Beam Metrics for legitimate reasons.
| # don't own (i.e., they are in our upstream dependency) but filling a counter | ||
| # and connecting it to PipelineResults properly seems excessive for this test. | ||
| class MockMetrics(object): | ||
| def __init__(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: please add a blank link before every def (here and everywhere).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Are you aware of a way to tell pylint to catch this? I prefer style related issues being caught by tools as much as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for pylint to catch this, but I don't know how to do it. Can you please look into adding it?
| """Extracts and returns the value of this counter. | ||
| Note that this goes through all counters every time it is called to find the | ||
| matching counter (see `query` implementation). So if the goal is to just |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What use case do you envision for get_counter()? Is it used in PR#126 (it's not used in this PR)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right that it is not used in this PR (and not even PR #126) except in unit-tests. But I originally started by logging the counters one by one and that is useful in those cases. Since it is also unit-tested and the logic is not trivial (it took me some time to find what is the right way of getting a single counter value), I decided to keep it but if you feel strongly about it, I can remove it either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think it's a good practice to remove unused code, otherwise we will end up with unused code, no one remember about it. Can we put what you found valuable in the comment instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
For now I just removed it.
| _METRICS_NAME_SPACE = 'VT_metrics_name_space' | ||
|
|
||
|
|
||
| class BaseCounter(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what value this base class adds here (esp. in python world)? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, you already know I am not a big fan developing software "the Python way" :-)
But on a more serious note, this is basically the interface for all Counter implementations. It serves two needs:
-
Our library methods do not need to depend on anything Beam specific. So for example ProcessedVariant constructor takes a BaseCounterFactory by default. I could set the default to None there but then I should have had "if not None ..." anywhere inside that code that I needed to use the counters (this is the design pattern).
-
It makes it easy for downstream code to be unit-tested. For example in processed_variant_test.py a CounterMock implements this interface and then is fed easily into the ProcessedVariant test objects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need inheritance for the benefits you mentioned, do you? Define the child and test (and if you really need a no-op) class as before just without inheritance and it works. I'll leave it to you if you prefer this way. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So as discussed offline, the benefit of BaseCounter is that (i) as an interface it documents the intended API of Counter implementations (we can also impose it to some extent in Python with "types" we are adding) and (ii) Having a base implementation let us implement the Null object pattern easily.
gcp_variant_transforms/vcf_to_bq.py
Outdated
| result = pipeline.run() | ||
| result.wait_until_finish() | ||
|
|
||
| beam_util.log_all_counters(result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For my information: since it was the first use of beam counter in this code base (right?), I am wondering what were the motivations behind it? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right this is the first use and I have tried to make subsequent uses simple by adding the beam_util library.
Part of the motivation is what I mentioned in my first comments after sending the PR (I am not sure if GitHub sent you emails for that, if not please take a look at them in the UI, in particular the snapshot I have attached from Dataflow dashboard). There are at least two benefits:
-
In error/warning cases, (e.g.,
annotation_alt_mismatch_counter) it is hard to find/monitor those through logs since each worker has its own logs (I am not even sure if the Dataflow/Stackdriver dashboards easily expose those logs but certainly you could get them in the Borg world). One higher level way is to count those cases and log them at the end. So at least one can easily/quickly check if their job ran error/warning free. -
Non-error counters can be used for monitoring the progress of the job. For example, I vaguely know that there are N variants in my input and through Dataflow dashboard pipeline I can see live how many of those records have been processed. It is needless to say that monitoring error counters is also very useful: If I get too many errors while the pipeline is running I can stop it early and investigate the issue instead of waiting until it finished and potentially produce garbage results.
|
|
||
|
|
||
| class CounterFactory(BaseCounterFactory): | ||
| def create_counter(self, counter_name): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just calling ctor in the client code? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the client class (e.g., ProcessedVariantFactory) will become attached to a particular implementation. That will reduce modularity and will make unit-testing more difficult (I know, I know, in Python you can overwrite everything :-) ). This is basically the factory pattern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to the above, I am not sure if the base class and factory is needed. Another option is to pass fn obj that calls ctor in client code. Leave it to you if you prefer this design. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So in this case, again as discussed offline, I am making this a pure interface and removing the default implementation. I think the "fn obj" method you mentioned works too but I prefer to use interfaces to make it more similar to usual object oriented design patterns (the main advantage of the interface in this case is its documentation and pseudo-type enforcement).
nmousavi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice PR, Bashir! Just few comments on the design choices in beam_util.
bashir2
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review. PTAL.
| _METRICS_NAME_SPACE = 'VT_metrics_name_space' | ||
|
|
||
|
|
||
| class BaseCounter(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, you already know I am not a big fan developing software "the Python way" :-)
But on a more serious note, this is basically the interface for all Counter implementations. It serves two needs:
-
Our library methods do not need to depend on anything Beam specific. So for example ProcessedVariant constructor takes a BaseCounterFactory by default. I could set the default to None there but then I should have had "if not None ..." anywhere inside that code that I needed to use the counters (this is the design pattern).
-
It makes it easy for downstream code to be unit-tested. For example in processed_variant_test.py a CounterMock implements this interface and then is fed easily into the ProcessedVariant test objects.
| """Extracts and returns the value of this counter. | ||
| Note that this goes through all counters every time it is called to find the | ||
| matching counter (see `query` implementation). So if the goal is to just |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right that it is not used in this PR (and not even PR #126) except in unit-tests. But I originally started by logging the counters one by one and that is useful in those cases. Since it is also unit-tested and the logic is not trivial (it took me some time to find what is the right way of getting a single counter value), I decided to keep it but if you feel strongly about it, I can remove it either.
| """ | ||
| counter_filter = MetricsFilter().with_name(self._counter_name) | ||
| query_result = pipeline_result.metrics().query(counter_filter) | ||
| if query_result['counters']: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| if query_result['counters']: | ||
| counter = query_result['counters'][0] | ||
| return counter.committed | ||
| return 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main case in which this can happen without a bug, is when we have a counter, for example for counting error/warning cases (like the annotation_alt_mismatch_counter I have added in this PR) and there are none of those particular errors. So this counter is never registered in the Beam Metrics for legitimate reasons.
|
|
||
|
|
||
| class CounterFactory(BaseCounterFactory): | ||
| def create_counter(self, counter_name): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the client class (e.g., ProcessedVariantFactory) will become attached to a particular implementation. That will reduce modularity and will make unit-testing more difficult (I know, I know, in Python you can overwrite everything :-) ). This is basically the factory pattern.
| # don't own (i.e., they are in our upstream dependency) but filling a counter | ||
| # and connecting it to PipelineResults properly seems excessive for this test. | ||
| class MockMetrics(object): | ||
| def __init__(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
Are you aware of a way to tell pylint to catch this? I prefer style related issues being caught by tools as much as possible.
gcp_variant_transforms/vcf_to_bq.py
Outdated
| result = pipeline.run() | ||
| result.wait_until_finish() | ||
|
|
||
| beam_util.log_all_counters(result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right this is the first use and I have tried to make subsequent uses simple by adding the beam_util library.
Part of the motivation is what I mentioned in my first comments after sending the PR (I am not sure if GitHub sent you emails for that, if not please take a look at them in the UI, in particular the snapshot I have attached from Dataflow dashboard). There are at least two benefits:
-
In error/warning cases, (e.g.,
annotation_alt_mismatch_counter) it is hard to find/monitor those through logs since each worker has its own logs (I am not even sure if the Dataflow/Stackdriver dashboards easily expose those logs but certainly you could get them in the Borg world). One higher level way is to count those cases and log them at the end. So at least one can easily/quickly check if their job ran error/warning free. -
Non-error counters can be used for monitoring the progress of the job. For example, I vaguely know that there are N variants in my input and through Dataflow dashboard pipeline I can see live how many of those records have been processed. It is needless to say that monitoring error counters is also very useful: If I get too many errors while the pipeline is running I can stop it early and investigate the issue instead of waiting until it finished and potentially produce garbage results.
|
Oops, sorry that commits from my development branch leaked into this review branch :-) |
arostamianfar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments about the design of the file/classes. I like the factory pattern :)
| from apache_beam.metrics.metric import MetricsFilter | ||
|
|
||
| _METRICS_NAME_SPACE = 'VT_metrics_name_space' | ||
| _COUNTERS = 'counters' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please provide comments about the context of these constants.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| @@ -0,0 +1,97 @@ | |||
| # Copyright 2018 Google Inc. All Rights Reserved. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this file be named counters_util or counter_factory(preferred) instead? In general, we should aim to limit the scope of 'util' classes as they can become a bag of unrelated stuff over time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we will have more little Beam related things that do not warrant a separate file, but I also see your concern about "util bags". For now, I changed this to metrics_util (as a minimum this can include other metrics too, not just counters) and we can rename again depending on future needs (it is probably not a great idea to create a separate file+test-file for each little util thing as well).
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
| """Wrapper for Beam classes, e.g., Metrics. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned above, I think we should limit the scope of this file. It's ok to have multiple files that specialize for each function.
Please also add an example of how classes in this file should be used as there are several classes and their connection is not trivial.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| class BaseCounter(object): | ||
| """The interface of counter objects which does nothing. | ||
| This also acts as a nop implementation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any value of providing a no-op implementation? I think it's better to raise NotImplementError in inc instead so that it forces subclasses to override (i.e. actually become an 'interface'). Can also rename BaseCounter to CounterInterface if we expect all methods to be abstract.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there is value in this no-op implementation and it is to avoid checking for None (null) everywhere in the code where counters are used. This is the Null object pattern.
| if query_result[_COUNTERS]: | ||
| counter = query_result[_COUNTERS][0] | ||
| return counter.committed | ||
| return 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it's better to return None instead to distinguish between a case there the counter is not there vs. it's actually 0 (if that matters)? Also, please add a brief comment about when this would happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep it simple for now and return 0, there is very subtle difference between None and 0 in this context and let's leave this until there is a real use for this. Also this will happen, when for example counter.inc() has never been called on that counter, it does not mean that it is not referenced anywhere in the code for example (that's why I said the difference is very subtle).
| return 0 | ||
|
|
||
|
|
||
| class BaseCounterFactory(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this become an interface with abstract methods as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same reason as before: Null object pattern.
| header_fields, # type: vcf_header_parser.HeaderFields | ||
| split_alternate_allele_info_fields=True, # type: bool | ||
| annotation_fields=None, # type: List[str] | ||
| counter_factory=beam_util.BaseCounterFactory() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please do not use mutable objects as default values as it can have unexpected characteristics: http://docs.python-guide.org/en/latest/writing/gotchas/
You can set this to None and then initialize to counter_factory or beam_util.BaseCounterFactory() in the method. Or just make it a required field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is mutable in BaseCounterFactory? As I said this is exactly to avoid checking for "None" everywhere these objects are accessed.
BTW, that gotchas link was mentioned in another review comment between us recently, remember? :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking a little more about this, while BaseCounterFactory is semantically immutable but I suppose in Python a rogue function can mutate anything passed to it (e.g., by accessing __dict__) and so cause confusing bugs (for that very function/class). If that is the concern, I can make counter_factory's default value None and check for None only in constructor. Still I will keep BaseCounter as a no-op implementation and if no counter_factory is passed to the class, use instances of BaseCounter to initialize member counters. Please let me know which way you prefer (so in that case BaseCounterFactory will become CounterFactoryInterface with no implementation but BaseCounter will be kept the way it is).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the record, I went with the second option that I suggested to make sure this is safe. The code is not as clean as it used to be but I think that is fine. If you have more comments, please let me know and I will fix in future PRs.
|
|
||
| _FIELD_COUNT_ALTERNATE_ALLELE = 'A' | ||
| # Counter names | ||
| _VARIANT_COUNTER = 'variant_counter' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider creating a class for this to group them together. You can also remove the _COUNTER suffix in that case (we use this pattern in other files like BigQuery constants).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
bashir2
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comments, PTAL.
| @@ -0,0 +1,97 @@ | |||
| # Copyright 2018 Google Inc. All Rights Reserved. | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought we will have more little Beam related things that do not warrant a separate file, but I also see your concern about "util bags". For now, I changed this to metrics_util (as a minimum this can include other metrics too, not just counters) and we can rename again depending on future needs (it is probably not a great idea to create a separate file+test-file for each little util thing as well).
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
| """Wrapper for Beam classes, e.g., Metrics. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| from apache_beam.metrics.metric import MetricsFilter | ||
|
|
||
| _METRICS_NAME_SPACE = 'VT_metrics_name_space' | ||
| _COUNTERS = 'counters' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| class BaseCounter(object): | ||
| """The interface of counter objects which does nothing. | ||
| This also acts as a nop implementation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there is value in this no-op implementation and it is to avoid checking for None (null) everywhere in the code where counters are used. This is the Null object pattern.
| if query_result[_COUNTERS]: | ||
| counter = query_result[_COUNTERS][0] | ||
| return counter.committed | ||
| return 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep it simple for now and return 0, there is very subtle difference between None and 0 in this context and let's leave this until there is a real use for this. Also this will happen, when for example counter.inc() has never been called on that counter, it does not mean that it is not referenced anywhere in the code for example (that's why I said the difference is very subtle).
| return 0 | ||
|
|
||
|
|
||
| class BaseCounterFactory(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same reason as before: Null object pattern.
|
|
||
| _FIELD_COUNT_ALTERNATE_ALLELE = 'A' | ||
| # Counter names | ||
| _VARIANT_COUNTER = 'variant_counter' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| header_fields, # type: vcf_header_parser.HeaderFields | ||
| split_alternate_allele_info_fields=True, # type: bool | ||
| annotation_fields=None, # type: List[str] | ||
| counter_factory=beam_util.BaseCounterFactory() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is mutable in BaseCounterFactory? As I said this is exactly to avoid checking for "None" everywhere these objects are accessed.
BTW, that gotchas link was mentioned in another review comment between us recently, remember? :-)
|
BTW, main changes since last review can be seen in this commit (barring the beam_util->metrics_util rename). |
nmousavi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM and few minor comments which I leave them to you to decide. Nice Work!
96282ca to
4cc953f
Compare
bashir2
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comments. PTAL.
| _METRICS_NAME_SPACE = 'VT_metrics_name_space' | ||
|
|
||
|
|
||
| class BaseCounter(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So as discussed offline, the benefit of BaseCounter is that (i) as an interface it documents the intended API of Counter implementations (we can also impose it to some extent in Python with "types" we are adding) and (ii) Having a base implementation let us implement the Null object pattern easily.
| """Extracts and returns the value of this counter. | ||
| Note that this goes through all counters every time it is called to find the | ||
| matching counter (see `query` implementation). So if the goal is to just |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
For now I just removed it.
|
|
||
|
|
||
| class CounterFactory(BaseCounterFactory): | ||
| def create_counter(self, counter_name): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So in this case, again as discussed offline, I am making this a pure interface and removing the default implementation. I think the "fn obj" method you mentioned works too but I prefer to use interfaces to make it more similar to usual object oriented design patterns (the main advantage of the interface in this case is its documentation and pseudo-type enforcement).
| header_fields, # type: vcf_header_parser.HeaderFields | ||
| split_alternate_allele_info_fields=True, # type: bool | ||
| annotation_fields=None, # type: List[str] | ||
| counter_factory=beam_util.BaseCounterFactory() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the record, I went with the second option that I suggested to make sure this is safe. The code is not as clean as it used to be but I think that is fine. If you have more comments, please let me know and I will fix in future PRs.
|
Nima, to see the changed since your last review, you can check this commit. |
nmousavi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Second LGTM :-)
4cc953f to
6e8f6a2
Compare
|
@nmousavi please ignore my last comment about get_counter(). It is not used in production code and I removed it as you suggested. |
arostamianfar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the changes, Bashir! Added a few more comments on the design (as this PR will form the basis of our counters, I want to ensure we have the best possible design).
|
|
||
|
|
||
| class CounterFactory(CounterFactoryInterface): | ||
| def create_counter(self, counter_name): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Please add a blank line before the def here as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| class BaseCounter(object): | ||
| """The interface of counter objects which does nothing. | ||
| This also acts as a no-op implementation such that classes that need to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a better design is to make this an interface (i.e. raise NotImplementedError) and instead make a NoOp/NullCounter class that has all methods as pass. That way, you initialize the counters as NoOpCounter classes, which makes their functions more clear (usually "Base" implies an abstract class that shouldn't be used directly).
Having said that, I think ideally there shouldn't be any need for the null counter since you should be able to instantiate classes appropriately, but perhaps I'm missing something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| class CounterFactoryInterface(object): | ||
| """The interface for counter factories.""" | ||
|
|
||
| def create_counter(self, _): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use the full variable name here instead of _ as the signature of the method should be clear in abstract classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| from apache_beam.metrics.metric import MetricsFilter | ||
|
|
||
| # The name space in which all metrics created by this class are. | ||
| _METRICS_NAME_SPACE = 'VT_metrics_name_space' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: s/NAME_SPACE/NAMESPACE and s/name_space/namespace (I noticed that the Beam api has with_namespace, so we should be consistent).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
gcp_variant_transforms/vcf_to_bq.py
Outdated
| known_args.split_alternate_allele_info_fields, | ||
| known_args.annotation_fields) | ||
| known_args.annotation_fields, | ||
| metrics_util.CounterFactory()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify (as I don't know about the internals of Beam counters): if another module wants to use counters, does it need to use the same factory? If so, please instantiate this outside of this method so that it can be reused easily. If not, why even pass that in as an argument as it can be instantiated within the class? If it is solely used for testing, you can use python's patch feature instead of passing the argument in. This can also avoid the null counter case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CounterFactory is completely stateless and can be instantiated many times.
The reason it is passed is exactly to avoid direct instantiation inside the ProcessedVariantFactory class to avoid tying it to a particular implementation. Instead, ProcessedVariantFactory only knows/expects the interface of the factory method with no information of what implementation is actually used.
I think this is the root of our difference on this design. Direct instantiation of classes in a constructor, specially those that we like to hide their implementation (e.g., Counters and its Factory in this case) is not a good idea. I shared this writing before in a review thread as well and I appreciate if you take a look (at least glance through it and check the last section before codes/comments which is "Fixing the flaw"). As I said before, I know that not all of it applies to Python but in general, I have found injecting factories a good design pattern to have modular code.
Now, you may disagree to the general pattern or the particular usage here. In either case, please let me know and I will reduce the whole beam_util module to a single CounterWrapper class, removing all other factory/interface/no no-op details (if I want to instantiate a particular implementation why even bother with factories?). I don't think that is a good design but let's move on. I don't believe there is a "best design" and I don't want this PR to be blocked more. I am fine either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for sharing the doc. Sorry I missed it originally. I like the factory design pattern and now see your point about keeping the constructor light. I'd still prefer to create one instance of the factory and passing that in to any module that wants to use counters (instead of each module creating their own instance in the method call), but feel free to leave as is for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| import logging | ||
|
|
||
| from apache_beam.runners import runner # pylint: disable=unused-import | ||
| from apache_beam.metrics import Metrics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: please import the metrics module instead of actual classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| of classes in this module should be injected into library classes/methods | ||
| instead of direct instantiation. For example: | ||
| class CounterUser(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this example is confusing as the main way of using this module is like this?
factory = CounterFactory()
my_counter = factory.create_counter('my_counter_name')
my_counter.inc(4)
...or I'm not sure what your example is trying to demonstrate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pattern is "injecting the dependency" and that is what the comment said. I changed the example as I originally thought you want an example of the recommended pattern not how to use the counter. But if we don't want to inject the dependency (which you have suggested in other comments), I am not sure what the value of the factory class is and we should probably remove that too.
| from gcp_variant_transforms.libs import processed_variant | ||
| from gcp_variant_transforms.libs import vcf_header_parser | ||
|
|
||
| class CounterMock(metrics_util.BaseCounter): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you considered using python's mock.patch feature instead? We use them here (I think we had more use cases that got deleted as part of some refactorings). It's pretty powerful and it's the 'pythonic' way of doing mocks.
If using patch is not feasible, then consider defining these under the testing/ dir so that they can be used by any module (and please rename them as MockCounter and MockCounterFactory).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No I did not consider that, thanks for pointing that out. From a quick glance, I think that can be used here too but I have not changed this here yet as I want to first resolve the bigger question on dependency injection and then I will make the modifications here accordingly.
| processed_variant.AlternateBaseData(a) for a in ['A', 'TT']] | ||
| self.assertEqual([proc_var_synthetic], [proc_var]) | ||
| self.assertEqual(counter_factory.counter_map[ | ||
| processed_variant._CounterEnum.VARIANT.value].get_value(), 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider creating an alias for _CounterEnum so that it becomes shorter and you don't need as much line breaks. We have adopted the style of importing _CounterEnum as CounterEnum to make it look 'public' for tests (that's the only exception for directly importing classes in our code).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is already done in the third PR (#131), so I prefer to leave it for there to get less merge conflicts.
6e8f6a2 to
73ad0e3
Compare
bashir2
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comments. I have not addressed all because I want to make sure we are on the same page re. the design first. Once that is done I will make other changes accordingly.
PTAL.
To see the changes since last review, check this commit.
| import logging | ||
|
|
||
| from apache_beam.runners import runner # pylint: disable=unused-import | ||
| from apache_beam.metrics import Metrics |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| of classes in this module should be injected into library classes/methods | ||
| instead of direct instantiation. For example: | ||
| class CounterUser(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The pattern is "injecting the dependency" and that is what the comment said. I changed the example as I originally thought you want an example of the recommended pattern not how to use the counter. But if we don't want to inject the dependency (which you have suggested in other comments), I am not sure what the value of the factory class is and we should probably remove that too.
| from apache_beam.metrics.metric import MetricsFilter | ||
|
|
||
| # The name space in which all metrics created by this class are. | ||
| _METRICS_NAME_SPACE = 'VT_metrics_name_space' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| class BaseCounter(object): | ||
| """The interface of counter objects which does nothing. | ||
| This also acts as a no-op implementation such that classes that need to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| class CounterFactoryInterface(object): | ||
| """The interface for counter factories.""" | ||
|
|
||
| def create_counter(self, _): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
|
||
|
|
||
| class CounterFactory(CounterFactoryInterface): | ||
| def create_counter(self, counter_name): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| processed_variant.AlternateBaseData(a) for a in ['A', 'TT']] | ||
| self.assertEqual([proc_var_synthetic], [proc_var]) | ||
| self.assertEqual(counter_factory.counter_map[ | ||
| processed_variant._CounterEnum.VARIANT.value].get_value(), 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is already done in the third PR (#131), so I prefer to leave it for there to get less merge conflicts.
gcp_variant_transforms/vcf_to_bq.py
Outdated
| known_args.split_alternate_allele_info_fields, | ||
| known_args.annotation_fields) | ||
| known_args.annotation_fields, | ||
| metrics_util.CounterFactory()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CounterFactory is completely stateless and can be instantiated many times.
The reason it is passed is exactly to avoid direct instantiation inside the ProcessedVariantFactory class to avoid tying it to a particular implementation. Instead, ProcessedVariantFactory only knows/expects the interface of the factory method with no information of what implementation is actually used.
I think this is the root of our difference on this design. Direct instantiation of classes in a constructor, specially those that we like to hide their implementation (e.g., Counters and its Factory in this case) is not a good idea. I shared this writing before in a review thread as well and I appreciate if you take a look (at least glance through it and check the last section before codes/comments which is "Fixing the flaw"). As I said before, I know that not all of it applies to Python but in general, I have found injecting factories a good design pattern to have modular code.
Now, you may disagree to the general pattern or the particular usage here. In either case, please let me know and I will reduce the whole beam_util module to a single CounterWrapper class, removing all other factory/interface/no no-op details (if I want to instantiate a particular implementation why even bother with factories?). I don't think that is a good design but let's move on. I don't believe there is a "best design" and I don't want this PR to be blocked more. I am fine either way.
| from gcp_variant_transforms.libs import processed_variant | ||
| from gcp_variant_transforms.libs import vcf_header_parser | ||
|
|
||
| class CounterMock(metrics_util.BaseCounter): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No I did not consider that, thanks for pointing that out. From a quick glance, I think that can be used here too but I have not changed this here yet as I want to first resolve the bigger question on dependency injection and then I will make the modifications here accordingly.
arostamianfar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the clarifications, Bashir. I think this looks good. Just a few nits and one comment about making it easier for other modules to use the counters.
gcp_variant_transforms/vcf_to_bq.py
Outdated
| known_args.split_alternate_allele_info_fields, | ||
| known_args.annotation_fields) | ||
| known_args.annotation_fields, | ||
| metrics_util.CounterFactory()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for sharing the doc. Sorry I missed it originally. I like the factory design pattern and now see your point about keeping the constructor light. I'd still prefer to create one instance of the factory and passing that in to any module that wants to use counters (instead of each module creating their own instance in the method call), but feel free to leave as is for now.
| """Subclass implementations should do increment by `n`.""" | ||
| raise NotImplementedError | ||
|
|
||
| class NoOpCounter(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: please use two blank lines before class defs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| This is an example of how to use the factory and counter objects: | ||
| factory = CounterFactory() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: please put this block in triple backquotes (I believe they will get translated to actual code blocks once we generate pydocs).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| """Subclass implementations should do increment by `n`.""" | ||
| raise NotImplementedError | ||
|
|
||
| class NoOpCounter(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should inherit from CounterInterface?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed! And of course Python is totally okay with it either way :-)
| self._split_alternate_allele_info_fields = ( | ||
| split_alternate_allele_info_fields) | ||
| self._annotation_field_set = set(annotation_fields or []) | ||
| if counter_factory: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Considering that we do want to keep the NoOp counters, what do you think of creating a NoOpCounterFactory as well that returns the NoOpCounter? You can then just have a line counter_factory = counter_factory or NoOpCounterFactory() and you don't need two cases anymore. This pattern can be applied to any other module that wants to use counters as well. P.S. "by induction", you can make NoOpCounter private as well in this pattern :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have already done this in the third PR because dealing with null objects was getting too ugly there. But did not want to add it back here since you asked for its removal :-) Now that we are on the same page even here, I added it back.
73ad0e3 to
f78f9c0
Compare
bashir2
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the comments. PTAL.
To view only the changes since last review, please check this commit.
| This is an example of how to use the factory and counter objects: | ||
| factory = CounterFactory() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| """Subclass implementations should do increment by `n`.""" | ||
| raise NotImplementedError | ||
|
|
||
| class NoOpCounter(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| """Subclass implementations should do increment by `n`.""" | ||
| raise NotImplementedError | ||
|
|
||
| class NoOpCounter(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed! And of course Python is totally okay with it either way :-)
| self._split_alternate_allele_info_fields = ( | ||
| split_alternate_allele_info_fields) | ||
| self._annotation_field_set = set(annotation_fields or []) | ||
| if counter_factory: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have already done this in the third PR because dealing with null objects was getting too ugly there. But did not want to add it back here since you asked for its removal :-) Now that we are on the same page even here, I added it back.
gcp_variant_transforms/vcf_to_bq.py
Outdated
| known_args.split_alternate_allele_info_fields, | ||
| known_args.annotation_fields) | ||
| known_args.annotation_fields, | ||
| metrics_util.CounterFactory()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
| from gcp_variant_transforms.libs import processed_variant | ||
| from gcp_variant_transforms.libs import vcf_header_parser | ||
|
|
||
| class CounterMock(metrics_util.CounterInterface): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following up on a comment that I left to be resolved after design agreement: Since we agreed to keep injecting the factory, I think this solution is superior over patch. The name Mock is really poorly chosen for these, so I renamed them to Spies following this nomenclature. Also I chose to keep them here as module privates for now (as the only user is this module). Once we have a second usage for them we should move them to the testing directory.
arostamianfar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks, Bashir!
f78f9c0 to
b8a6c8c
Compare
|
Please ignore the "Closed" message, it was accidental. There was a recent commit on master which I had to merge, so now it needs another approval. |

Tested:
New unit-tests added; ran integration test; also ran manually on the platinum set (for performance).
Issue #81