Skip to content

Conversation

@bashir2
Copy link
Member

@bashir2 bashir2 commented Feb 28, 2018

Tested:
New unit-tests added; ran integration test; also ran manually on the platinum set (for performance).

Issue #81

@bashir2 bashir2 changed the title Adds counters processed_variant and creates a wrapper for Beam counters. Adds counters to processed_variant and creates a wrapper for Beam counters. Feb 28, 2018
@coveralls
Copy link

coveralls commented Feb 28, 2018

Pull Request Test Coverage Report for Build 403

  • 97 of 116 (83.62%) changed or added relevant lines in 5 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-0.02%) to 90.184%

Changes Missing Coverage Covered Lines Changed/Added Lines %
gcp_variant_transforms/libs/metrics_util.py 27 35 77.14%
gcp_variant_transforms/vcf_to_bq.py 1 12 8.33%
Files with Coverage Reduction New Missed Lines %
gcp_variant_transforms/vcf_to_bq.py 1 51.24%
Totals Coverage Status
Change from base Build 401: -0.02%
Covered Lines: 3179
Relevant Lines: 3525

💛 - Coveralls

@bashir2
Copy link
Member Author

bashir2 commented Feb 28, 2018

Folks, this PR does not really need two reviews. I have added you both to take a look at how I have implemented counter utils in case that you also need to use it for other features. In my experience, counters and other metrics are a great way of reporting/monitoring issues/progress of a pipeline. Note that these counters are exposed in the Dataflow dashboard too while the job is running (see attached snapshot), so it is not just in the final logs.

@bashir2
Copy link
Member Author

bashir2 commented Feb 28, 2018

dataflow_run_platinum_with_counter

@bashir2 bashir2 removed the request for review from arostamianfar March 2, 2018 17:49
"""
counter_filter = MetricsFilter().with_name(self._counter_name)
query_result = pipeline_result.metrics().query(counter_filter)
if query_result['counters']:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make 'counters' a constant?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

if query_result['counters']:
counter = query_result['counters'][0]
return counter.committed
return 0
Copy link
Contributor

@nmousavi nmousavi Mar 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When does this happen? Does it make sense to log error too or not? Any risk on silently returnin zero?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main case in which this can happen without a bug, is when we have a counter, for example for counting error/warning cases (like the annotation_alt_mismatch_counter I have added in this PR) and there are none of those particular errors. So this counter is never registered in the Beam Metrics for legitimate reasons.

# don't own (i.e., they are in our upstream dependency) but filling a counter
# and connecting it to PipelineResults properly seems excessive for this test.
class MockMetrics(object):
def __init__(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: please add a blank link before every def (here and everywhere).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.
Are you aware of a way to tell pylint to catch this? I prefer style related issues being caught by tools as much as possible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for pylint to catch this, but I don't know how to do it. Can you please look into adding it?

"""Extracts and returns the value of this counter.
Note that this goes through all counters every time it is called to find the
matching counter (see `query` implementation). So if the goal is to just
Copy link
Contributor

@nmousavi nmousavi Mar 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What use case do you envision for get_counter()? Is it used in PR#126 (it's not used in this PR)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right that it is not used in this PR (and not even PR #126) except in unit-tests. But I originally started by logging the counters one by one and that is useful in those cases. Since it is also unit-tested and the logic is not trivial (it took me some time to find what is the right way of getting a single counter value), I decided to keep it but if you feel strongly about it, I can remove it either.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think it's a good practice to remove unused code, otherwise we will end up with unused code, no one remember about it. Can we put what you found valuable in the comment instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.
For now I just removed it.

_METRICS_NAME_SPACE = 'VT_metrics_name_space'


class BaseCounter(object):
Copy link
Contributor

@nmousavi nmousavi Mar 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what value this base class adds here (esp. in python world)? :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, you already know I am not a big fan developing software "the Python way" :-)

But on a more serious note, this is basically the interface for all Counter implementations. It serves two needs:

  1. Our library methods do not need to depend on anything Beam specific. So for example ProcessedVariant constructor takes a BaseCounterFactory by default. I could set the default to None there but then I should have had "if not None ..." anywhere inside that code that I needed to use the counters (this is the design pattern).

  2. It makes it easy for downstream code to be unit-tested. For example in processed_variant_test.py a CounterMock implements this interface and then is fed easily into the ProcessedVariant test objects.

Copy link
Contributor

@nmousavi nmousavi Mar 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need inheritance for the benefits you mentioned, do you? Define the child and test (and if you really need a no-op) class as before just without inheritance and it works. I'll leave it to you if you prefer this way. :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So as discussed offline, the benefit of BaseCounter is that (i) as an interface it documents the intended API of Counter implementations (we can also impose it to some extent in Python with "types" we are adding) and (ii) Having a base implementation let us implement the Null object pattern easily.

result = pipeline.run()
result.wait_until_finish()

beam_util.log_all_counters(result)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my information: since it was the first use of beam counter in this code base (right?), I am wondering what were the motivations behind it? :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right this is the first use and I have tried to make subsequent uses simple by adding the beam_util library.

Part of the motivation is what I mentioned in my first comments after sending the PR (I am not sure if GitHub sent you emails for that, if not please take a look at them in the UI, in particular the snapshot I have attached from Dataflow dashboard). There are at least two benefits:

  1. In error/warning cases, (e.g., annotation_alt_mismatch_counter) it is hard to find/monitor those through logs since each worker has its own logs (I am not even sure if the Dataflow/Stackdriver dashboards easily expose those logs but certainly you could get them in the Borg world). One higher level way is to count those cases and log them at the end. So at least one can easily/quickly check if their job ran error/warning free.

  2. Non-error counters can be used for monitoring the progress of the job. For example, I vaguely know that there are N variants in my input and through Dataflow dashboard pipeline I can see live how many of those records have been processed. It is needless to say that monitoring error counters is also very useful: If I get too many errors while the pipeline is running I can stop it early and investigate the issue instead of waiting until it finished and potentially produce garbage results.



class CounterFactory(BaseCounterFactory):
def create_counter(self, counter_name):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just calling ctor in the client code? :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the client class (e.g., ProcessedVariantFactory) will become attached to a particular implementation. That will reduce modularity and will make unit-testing more difficult (I know, I know, in Python you can overwrite everything :-) ). This is basically the factory pattern.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the above, I am not sure if the base class and factory is needed. Another option is to pass fn obj that calls ctor in client code. Leave it to you if you prefer this design. :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So in this case, again as discussed offline, I am making this a pure interface and removing the default implementation. I think the "fn obj" method you mentioned works too but I prefer to use interfaces to make it more similar to usual object oriented design patterns (the main advantage of the interface in this case is its documentation and pseudo-type enforcement).

Copy link
Contributor

@nmousavi nmousavi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice PR, Bashir! Just few comments on the design choices in beam_util.

Copy link
Member Author

@bashir2 bashir2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review. PTAL.

_METRICS_NAME_SPACE = 'VT_metrics_name_space'


class BaseCounter(object):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, you already know I am not a big fan developing software "the Python way" :-)

But on a more serious note, this is basically the interface for all Counter implementations. It serves two needs:

  1. Our library methods do not need to depend on anything Beam specific. So for example ProcessedVariant constructor takes a BaseCounterFactory by default. I could set the default to None there but then I should have had "if not None ..." anywhere inside that code that I needed to use the counters (this is the design pattern).

  2. It makes it easy for downstream code to be unit-tested. For example in processed_variant_test.py a CounterMock implements this interface and then is fed easily into the ProcessedVariant test objects.

"""Extracts and returns the value of this counter.
Note that this goes through all counters every time it is called to find the
matching counter (see `query` implementation). So if the goal is to just
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right that it is not used in this PR (and not even PR #126) except in unit-tests. But I originally started by logging the counters one by one and that is useful in those cases. Since it is also unit-tested and the logic is not trivial (it took me some time to find what is the right way of getting a single counter value), I decided to keep it but if you feel strongly about it, I can remove it either.

"""
counter_filter = MetricsFilter().with_name(self._counter_name)
query_result = pipeline_result.metrics().query(counter_filter)
if query_result['counters']:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

if query_result['counters']:
counter = query_result['counters'][0]
return counter.committed
return 0
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main case in which this can happen without a bug, is when we have a counter, for example for counting error/warning cases (like the annotation_alt_mismatch_counter I have added in this PR) and there are none of those particular errors. So this counter is never registered in the Beam Metrics for legitimate reasons.



class CounterFactory(BaseCounterFactory):
def create_counter(self, counter_name):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because the client class (e.g., ProcessedVariantFactory) will become attached to a particular implementation. That will reduce modularity and will make unit-testing more difficult (I know, I know, in Python you can overwrite everything :-) ). This is basically the factory pattern.

# don't own (i.e., they are in our upstream dependency) but filling a counter
# and connecting it to PipelineResults properly seems excessive for this test.
class MockMetrics(object):
def __init__(self):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.
Are you aware of a way to tell pylint to catch this? I prefer style related issues being caught by tools as much as possible.

result = pipeline.run()
result.wait_until_finish()

beam_util.log_all_counters(result)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right this is the first use and I have tried to make subsequent uses simple by adding the beam_util library.

Part of the motivation is what I mentioned in my first comments after sending the PR (I am not sure if GitHub sent you emails for that, if not please take a look at them in the UI, in particular the snapshot I have attached from Dataflow dashboard). There are at least two benefits:

  1. In error/warning cases, (e.g., annotation_alt_mismatch_counter) it is hard to find/monitor those through logs since each worker has its own logs (I am not even sure if the Dataflow/Stackdriver dashboards easily expose those logs but certainly you could get them in the Borg world). One higher level way is to count those cases and log them at the end. So at least one can easily/quickly check if their job ran error/warning free.

  2. Non-error counters can be used for monitoring the progress of the job. For example, I vaguely know that there are N variants in my input and through Dataflow dashboard pipeline I can see live how many of those records have been processed. It is needless to say that monitoring error counters is also very useful: If I get too many errors while the pipeline is running I can stop it early and investigate the issue instead of waiting until it finished and potentially produce garbage results.

@bashir2
Copy link
Member Author

bashir2 commented Mar 2, 2018

Oops, sorry that commits from my development branch leaked into this review branch :-)
To only view the changes since your last review, check this commit.

Copy link
Contributor

@arostamianfar arostamianfar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments about the design of the file/classes. I like the factory pattern :)

from apache_beam.metrics.metric import MetricsFilter

_METRICS_NAME_SPACE = 'VT_metrics_name_space'
_COUNTERS = 'counters'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please provide comments about the context of these constants.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -0,0 +1,97 @@
# Copyright 2018 Google Inc. All Rights Reserved.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this file be named counters_util or counter_factory(preferred) instead? In general, we should aim to limit the scope of 'util' classes as they can become a bag of unrelated stuff over time.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we will have more little Beam related things that do not warrant a separate file, but I also see your concern about "util bags". For now, I changed this to metrics_util (as a minimum this can include other metrics too, not just counters) and we can rename again depending on future needs (it is probably not a great idea to create a separate file+test-file for each little util thing as well).

# See the License for the specific language governing permissions and
# limitations under the License.

"""Wrapper for Beam classes, e.g., Metrics.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned above, I think we should limit the scope of this file. It's ok to have multiple files that specialize for each function.
Please also add an example of how classes in this file should be used as there are several classes and their connection is not trivial.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

class BaseCounter(object):
"""The interface of counter objects which does nothing.
This also acts as a nop implementation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any value of providing a no-op implementation? I think it's better to raise NotImplementError in inc instead so that it forces subclasses to override (i.e. actually become an 'interface'). Can also rename BaseCounter to CounterInterface if we expect all methods to be abstract.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there is value in this no-op implementation and it is to avoid checking for None (null) everywhere in the code where counters are used. This is the Null object pattern.

if query_result[_COUNTERS]:
counter = query_result[_COUNTERS][0]
return counter.committed
return 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it's better to return None instead to distinguish between a case there the counter is not there vs. it's actually 0 (if that matters)? Also, please add a brief comment about when this would happen.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep it simple for now and return 0, there is very subtle difference between None and 0 in this context and let's leave this until there is a real use for this. Also this will happen, when for example counter.inc() has never been called on that counter, it does not mean that it is not referenced anywhere in the code for example (that's why I said the difference is very subtle).

return 0


class BaseCounterFactory(object):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this become an interface with abstract methods as well?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same reason as before: Null object pattern.

header_fields, # type: vcf_header_parser.HeaderFields
split_alternate_allele_info_fields=True, # type: bool
annotation_fields=None, # type: List[str]
counter_factory=beam_util.BaseCounterFactory()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do not use mutable objects as default values as it can have unexpected characteristics: http://docs.python-guide.org/en/latest/writing/gotchas/
You can set this to None and then initialize to counter_factory or beam_util.BaseCounterFactory() in the method. Or just make it a required field?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is mutable in BaseCounterFactory? As I said this is exactly to avoid checking for "None" everywhere these objects are accessed.
BTW, that gotchas link was mentioned in another review comment between us recently, remember? :-)

Copy link
Member Author

@bashir2 bashir2 Mar 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking a little more about this, while BaseCounterFactory is semantically immutable but I suppose in Python a rogue function can mutate anything passed to it (e.g., by accessing __dict__) and so cause confusing bugs (for that very function/class). If that is the concern, I can make counter_factory's default value None and check for None only in constructor. Still I will keep BaseCounter as a no-op implementation and if no counter_factory is passed to the class, use instances of BaseCounter to initialize member counters. Please let me know which way you prefer (so in that case BaseCounterFactory will become CounterFactoryInterface with no implementation but BaseCounter will be kept the way it is).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the record, I went with the second option that I suggested to make sure this is safe. The code is not as clean as it used to be but I think that is fine. If you have more comments, please let me know and I will fix in future PRs.


_FIELD_COUNT_ALTERNATE_ALLELE = 'A'
# Counter names
_VARIANT_COUNTER = 'variant_counter'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider creating a class for this to group them together. You can also remove the _COUNTER suffix in that case (we use this pattern in other files like BigQuery constants).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Member Author

@bashir2 bashir2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments, PTAL.

@@ -0,0 +1,97 @@
# Copyright 2018 Google Inc. All Rights Reserved.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we will have more little Beam related things that do not warrant a separate file, but I also see your concern about "util bags". For now, I changed this to metrics_util (as a minimum this can include other metrics too, not just counters) and we can rename again depending on future needs (it is probably not a great idea to create a separate file+test-file for each little util thing as well).

# See the License for the specific language governing permissions and
# limitations under the License.

"""Wrapper for Beam classes, e.g., Metrics.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

from apache_beam.metrics.metric import MetricsFilter

_METRICS_NAME_SPACE = 'VT_metrics_name_space'
_COUNTERS = 'counters'
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

class BaseCounter(object):
"""The interface of counter objects which does nothing.
This also acts as a nop implementation.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there is value in this no-op implementation and it is to avoid checking for None (null) everywhere in the code where counters are used. This is the Null object pattern.

if query_result[_COUNTERS]:
counter = query_result[_COUNTERS][0]
return counter.committed
return 0
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep it simple for now and return 0, there is very subtle difference between None and 0 in this context and let's leave this until there is a real use for this. Also this will happen, when for example counter.inc() has never been called on that counter, it does not mean that it is not referenced anywhere in the code for example (that's why I said the difference is very subtle).

return 0


class BaseCounterFactory(object):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same reason as before: Null object pattern.


_FIELD_COUNT_ALTERNATE_ALLELE = 'A'
# Counter names
_VARIANT_COUNTER = 'variant_counter'
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

header_fields, # type: vcf_header_parser.HeaderFields
split_alternate_allele_info_fields=True, # type: bool
annotation_fields=None, # type: List[str]
counter_factory=beam_util.BaseCounterFactory()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is mutable in BaseCounterFactory? As I said this is exactly to avoid checking for "None" everywhere these objects are accessed.
BTW, that gotchas link was mentioned in another review comment between us recently, remember? :-)

@bashir2
Copy link
Member Author

bashir2 commented Mar 6, 2018

BTW, main changes since last review can be seen in this commit (barring the beam_util->metrics_util rename).

nmousavi
nmousavi previously approved these changes Mar 6, 2018
Copy link
Contributor

@nmousavi nmousavi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM and few minor comments which I leave them to you to decide. Nice Work!

Copy link
Member Author

@bashir2 bashir2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments. PTAL.

_METRICS_NAME_SPACE = 'VT_metrics_name_space'


class BaseCounter(object):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So as discussed offline, the benefit of BaseCounter is that (i) as an interface it documents the intended API of Counter implementations (we can also impose it to some extent in Python with "types" we are adding) and (ii) Having a base implementation let us implement the Null object pattern easily.

"""Extracts and returns the value of this counter.
Note that this goes through all counters every time it is called to find the
matching counter (see `query` implementation). So if the goal is to just
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.
For now I just removed it.



class CounterFactory(BaseCounterFactory):
def create_counter(self, counter_name):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So in this case, again as discussed offline, I am making this a pure interface and removing the default implementation. I think the "fn obj" method you mentioned works too but I prefer to use interfaces to make it more similar to usual object oriented design patterns (the main advantage of the interface in this case is its documentation and pseudo-type enforcement).

header_fields, # type: vcf_header_parser.HeaderFields
split_alternate_allele_info_fields=True, # type: bool
annotation_fields=None, # type: List[str]
counter_factory=beam_util.BaseCounterFactory()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the record, I went with the second option that I suggested to make sure this is safe. The code is not as clean as it used to be but I think that is fine. If you have more comments, please let me know and I will fix in future PRs.

@bashir2
Copy link
Member Author

bashir2 commented Mar 6, 2018

Nima, to see the changed since your last review, you can check this commit.

nmousavi
nmousavi previously approved these changes Mar 6, 2018
Copy link
Contributor

@nmousavi nmousavi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second LGTM :-)

@bashir2
Copy link
Member Author

bashir2 commented Mar 7, 2018

@nmousavi re. get_counter() that I said I removed it: FYI, after merging this into PR #126 I realized it is used for tests and they are valuable, so I kept the function in this PR. Sorry when you asked if it is used anywhere I forgot about that use case.

@bashir2
Copy link
Member Author

bashir2 commented Mar 7, 2018

@nmousavi please ignore my last comment about get_counter(). It is not used in production code and I removed it as you suggested.

Copy link
Contributor

@arostamianfar arostamianfar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes, Bashir! Added a few more comments on the design (as this PR will form the basis of our counters, I want to ensure we have the best possible design).



class CounterFactory(CounterFactoryInterface):
def create_counter(self, counter_name):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Please add a blank line before the def here as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

class BaseCounter(object):
"""The interface of counter objects which does nothing.
This also acts as a no-op implementation such that classes that need to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a better design is to make this an interface (i.e. raise NotImplementedError) and instead make a NoOp/NullCounter class that has all methods as pass. That way, you initialize the counters as NoOpCounter classes, which makes their functions more clear (usually "Base" implies an abstract class that shouldn't be used directly).
Having said that, I think ideally there shouldn't be any need for the null counter since you should be able to instantiate classes appropriately, but perhaps I'm missing something.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

class CounterFactoryInterface(object):
"""The interface for counter factories."""

def create_counter(self, _):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use the full variable name here instead of _ as the signature of the method should be clear in abstract classes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

from apache_beam.metrics.metric import MetricsFilter

# The name space in which all metrics created by this class are.
_METRICS_NAME_SPACE = 'VT_metrics_name_space'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: s/NAME_SPACE/NAMESPACE and s/name_space/namespace (I noticed that the Beam api has with_namespace, so we should be consistent).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

known_args.split_alternate_allele_info_fields,
known_args.annotation_fields)
known_args.annotation_fields,
metrics_util.CounterFactory())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify (as I don't know about the internals of Beam counters): if another module wants to use counters, does it need to use the same factory? If so, please instantiate this outside of this method so that it can be reused easily. If not, why even pass that in as an argument as it can be instantiated within the class? If it is solely used for testing, you can use python's patch feature instead of passing the argument in. This can also avoid the null counter case.

Copy link
Member Author

@bashir2 bashir2 Mar 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CounterFactory is completely stateless and can be instantiated many times.
The reason it is passed is exactly to avoid direct instantiation inside the ProcessedVariantFactory class to avoid tying it to a particular implementation. Instead, ProcessedVariantFactory only knows/expects the interface of the factory method with no information of what implementation is actually used.

I think this is the root of our difference on this design. Direct instantiation of classes in a constructor, specially those that we like to hide their implementation (e.g., Counters and its Factory in this case) is not a good idea. I shared this writing before in a review thread as well and I appreciate if you take a look (at least glance through it and check the last section before codes/comments which is "Fixing the flaw"). As I said before, I know that not all of it applies to Python but in general, I have found injecting factories a good design pattern to have modular code.

Now, you may disagree to the general pattern or the particular usage here. In either case, please let me know and I will reduce the whole beam_util module to a single CounterWrapper class, removing all other factory/interface/no no-op details (if I want to instantiate a particular implementation why even bother with factories?). I don't think that is a good design but let's move on. I don't believe there is a "best design" and I don't want this PR to be blocked more. I am fine either way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for sharing the doc. Sorry I missed it originally. I like the factory design pattern and now see your point about keeping the constructor light. I'd still prefer to create one instance of the factory and passing that in to any module that wants to use counters (instead of each module creating their own instance in the method call), but feel free to leave as is for now.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

import logging

from apache_beam.runners import runner # pylint: disable=unused-import
from apache_beam.metrics import Metrics
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: please import the metrics module instead of actual classes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

of classes in this module should be injected into library classes/methods
instead of direct instantiation. For example:
class CounterUser(object):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this example is confusing as the main way of using this module is like this?

factory = CounterFactory()
my_counter = factory.create_counter('my_counter_name')
my_counter.inc(4)

...or I'm not sure what your example is trying to demonstrate.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pattern is "injecting the dependency" and that is what the comment said. I changed the example as I originally thought you want an example of the recommended pattern not how to use the counter. But if we don't want to inject the dependency (which you have suggested in other comments), I am not sure what the value of the factory class is and we should probably remove that too.

from gcp_variant_transforms.libs import processed_variant
from gcp_variant_transforms.libs import vcf_header_parser

class CounterMock(metrics_util.BaseCounter):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you considered using python's mock.patch feature instead? We use them here (I think we had more use cases that got deleted as part of some refactorings). It's pretty powerful and it's the 'pythonic' way of doing mocks.
If using patch is not feasible, then consider defining these under the testing/ dir so that they can be used by any module (and please rename them as MockCounter and MockCounterFactory).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No I did not consider that, thanks for pointing that out. From a quick glance, I think that can be used here too but I have not changed this here yet as I want to first resolve the bigger question on dependency injection and then I will make the modifications here accordingly.

processed_variant.AlternateBaseData(a) for a in ['A', 'TT']]
self.assertEqual([proc_var_synthetic], [proc_var])
self.assertEqual(counter_factory.counter_map[
processed_variant._CounterEnum.VARIANT.value].get_value(), 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider creating an alias for _CounterEnum so that it becomes shorter and you don't need as much line breaks. We have adopted the style of importing _CounterEnum as CounterEnum to make it look 'public' for tests (that's the only exception for directly importing classes in our code).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already done in the third PR (#131), so I prefer to leave it for there to get less merge conflicts.

@bashir2 bashir2 force-pushed the alt_annotation_review branch from 6e8f6a2 to 73ad0e3 Compare March 7, 2018 19:49
Copy link
Member Author

@bashir2 bashir2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments. I have not addressed all because I want to make sure we are on the same page re. the design first. Once that is done I will make other changes accordingly.

PTAL.

To see the changes since last review, check this commit.

import logging

from apache_beam.runners import runner # pylint: disable=unused-import
from apache_beam.metrics import Metrics
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

of classes in this module should be injected into library classes/methods
instead of direct instantiation. For example:
class CounterUser(object):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pattern is "injecting the dependency" and that is what the comment said. I changed the example as I originally thought you want an example of the recommended pattern not how to use the counter. But if we don't want to inject the dependency (which you have suggested in other comments), I am not sure what the value of the factory class is and we should probably remove that too.

from apache_beam.metrics.metric import MetricsFilter

# The name space in which all metrics created by this class are.
_METRICS_NAME_SPACE = 'VT_metrics_name_space'
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

class BaseCounter(object):
"""The interface of counter objects which does nothing.
This also acts as a no-op implementation such that classes that need to
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

class CounterFactoryInterface(object):
"""The interface for counter factories."""

def create_counter(self, _):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.



class CounterFactory(CounterFactoryInterface):
def create_counter(self, counter_name):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

processed_variant.AlternateBaseData(a) for a in ['A', 'TT']]
self.assertEqual([proc_var_synthetic], [proc_var])
self.assertEqual(counter_factory.counter_map[
processed_variant._CounterEnum.VARIANT.value].get_value(), 1)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is already done in the third PR (#131), so I prefer to leave it for there to get less merge conflicts.

known_args.split_alternate_allele_info_fields,
known_args.annotation_fields)
known_args.annotation_fields,
metrics_util.CounterFactory())
Copy link
Member Author

@bashir2 bashir2 Mar 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CounterFactory is completely stateless and can be instantiated many times.
The reason it is passed is exactly to avoid direct instantiation inside the ProcessedVariantFactory class to avoid tying it to a particular implementation. Instead, ProcessedVariantFactory only knows/expects the interface of the factory method with no information of what implementation is actually used.

I think this is the root of our difference on this design. Direct instantiation of classes in a constructor, specially those that we like to hide their implementation (e.g., Counters and its Factory in this case) is not a good idea. I shared this writing before in a review thread as well and I appreciate if you take a look (at least glance through it and check the last section before codes/comments which is "Fixing the flaw"). As I said before, I know that not all of it applies to Python but in general, I have found injecting factories a good design pattern to have modular code.

Now, you may disagree to the general pattern or the particular usage here. In either case, please let me know and I will reduce the whole beam_util module to a single CounterWrapper class, removing all other factory/interface/no no-op details (if I want to instantiate a particular implementation why even bother with factories?). I don't think that is a good design but let's move on. I don't believe there is a "best design" and I don't want this PR to be blocked more. I am fine either way.

from gcp_variant_transforms.libs import processed_variant
from gcp_variant_transforms.libs import vcf_header_parser

class CounterMock(metrics_util.BaseCounter):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No I did not consider that, thanks for pointing that out. From a quick glance, I think that can be used here too but I have not changed this here yet as I want to first resolve the bigger question on dependency injection and then I will make the modifications here accordingly.

Copy link
Contributor

@arostamianfar arostamianfar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarifications, Bashir. I think this looks good. Just a few nits and one comment about making it easier for other modules to use the counters.

known_args.split_alternate_allele_info_fields,
known_args.annotation_fields)
known_args.annotation_fields,
metrics_util.CounterFactory())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for sharing the doc. Sorry I missed it originally. I like the factory design pattern and now see your point about keeping the constructor light. I'd still prefer to create one instance of the factory and passing that in to any module that wants to use counters (instead of each module creating their own instance in the method call), but feel free to leave as is for now.

"""Subclass implementations should do increment by `n`."""
raise NotImplementedError

class NoOpCounter(object):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: please use two blank lines before class defs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

This is an example of how to use the factory and counter objects:
factory = CounterFactory()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: please put this block in triple backquotes (I believe they will get translated to actual code blocks once we generate pydocs).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

"""Subclass implementations should do increment by `n`."""
raise NotImplementedError

class NoOpCounter(object):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should inherit from CounterInterface?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed! And of course Python is totally okay with it either way :-)

self._split_alternate_allele_info_fields = (
split_alternate_allele_info_fields)
self._annotation_field_set = set(annotation_fields or [])
if counter_factory:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering that we do want to keep the NoOp counters, what do you think of creating a NoOpCounterFactory as well that returns the NoOpCounter? You can then just have a line counter_factory = counter_factory or NoOpCounterFactory() and you don't need two cases anymore. This pattern can be applied to any other module that wants to use counters as well. P.S. "by induction", you can make NoOpCounter private as well in this pattern :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have already done this in the third PR because dealing with null objects was getting too ugly there. But did not want to add it back here since you asked for its removal :-) Now that we are on the same page even here, I added it back.

@bashir2 bashir2 force-pushed the alt_annotation_review branch from 73ad0e3 to f78f9c0 Compare March 8, 2018 20:40
Copy link
Member Author

@bashir2 bashir2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments. PTAL.

To view only the changes since last review, please check this commit.

This is an example of how to use the factory and counter objects:
factory = CounterFactory()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

"""Subclass implementations should do increment by `n`."""
raise NotImplementedError

class NoOpCounter(object):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

"""Subclass implementations should do increment by `n`."""
raise NotImplementedError

class NoOpCounter(object):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed! And of course Python is totally okay with it either way :-)

self._split_alternate_allele_info_fields = (
split_alternate_allele_info_fields)
self._annotation_field_set = set(annotation_fields or [])
if counter_factory:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have already done this in the third PR because dealing with null objects was getting too ugly there. But did not want to add it back here since you asked for its removal :-) Now that we are on the same page even here, I added it back.

known_args.split_alternate_allele_info_fields,
known_args.annotation_fields)
known_args.annotation_fields,
metrics_util.CounterFactory())
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

from gcp_variant_transforms.libs import processed_variant
from gcp_variant_transforms.libs import vcf_header_parser

class CounterMock(metrics_util.CounterInterface):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following up on a comment that I left to be resolved after design agreement: Since we agreed to keep injecting the factory, I think this solution is superior over patch. The name Mock is really poorly chosen for these, so I renamed them to Spies following this nomenclature. Also I chose to keep them here as module privates for now (as the only user is this module). Once we have a second usage for them we should move them to the testing directory.

arostamianfar
arostamianfar previously approved these changes Mar 8, 2018
Copy link
Contributor

@arostamianfar arostamianfar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks, Bashir!

@bashir2 bashir2 closed this Mar 8, 2018
@bashir2 bashir2 force-pushed the alt_annotation_review branch from f78f9c0 to b8a6c8c Compare March 8, 2018 20:53
@bashir2
Copy link
Member Author

bashir2 commented Mar 8, 2018

Please ignore the "Closed" message, it was accidental. There was a recent commit on master which I had to merge, so now it needs another approval.

@bashir2 bashir2 merged commit 32781f6 into googlegenomics:master Mar 8, 2018
@bashir2 bashir2 deleted the alt_annotation_review branch March 20, 2018 18:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants