Adding Support for Complex Watermark Types #121
import org.codehaus.jackson.annotate.JsonTypeInfo;

@JsonTypeInfo(use=JsonTypeInfo.Id.CLASS, include=JsonTypeInfo.As.PROPERTY, property="@class")
public interface Watermark extends Comparable<Watermark>, Copyable<Watermark> {
There's already a Watermark interface in gobblin.source.extractor.watermark. Better to name this one ComplexWatermark or something like that.
It would be better to make this interface generic, with a type parameter.
The increment method is gone now, so I'm not sure if this still makes sense. A Watermark is no longer tied to a specific record type.
As per conversation with Chavdar. I have removed the So now the expected e2e flow is as follows: 1: The 2: Each 3: Each map task reads a series of 4: The The following questions are still open: 1: I have added a new method to 2: In order to avoid another state-store migration, I have added a hack to serialize the WatermarkInterval class. Should we discuss migrating away from SequenceFile formats?
 */
public interface Watermark extends Comparable<Watermark>, Copyable<Watermark> {

  public void initFromJson(String json);
fromJson is sufficient.
I want to share some of my thoughts on the
If It seems the
I don't quite understand your point. Regardless of whether
It's good to have a common method like
@liyinan926 for points 2 and 3. I like the idea of having a Perhaps, the
@sahilTakiar, yeah, I think that's a good solution. So basically the source/extractor gives the
Updated. The I removed the new method from the I also changed
@liyinan926 I am not entirely sure I understand why / how to make the
@sahilTakiar. What I thought is
With this interface, you can have something like:
Like what I said above, adding a type parameter allows the compiler to do type checking and helps reduce possible abuse of
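The code snippets from this comment were not preserved. As a rough illustration of the type-parameter argument only, a generic watermark could look like the sketch below. Every name here (TypedWatermark, TimestampWatermark, TypedInterval) is hypothetical and is not taken from the pull request.

```java
// Hypothetical generic watermark: T is the concrete watermark type, so
// compareTo and copy are checked against that type at compile time.
interface TypedWatermark<T extends TypedWatermark<T>> extends Comparable<T> {
    T copy();
}

// Hypothetical implementation backed by an epoch timestamp in milliseconds.
final class TimestampWatermark implements TypedWatermark<TimestampWatermark> {
    private final long epochMillis;

    TimestampWatermark(long epochMillis) {
        this.epochMillis = epochMillis;
    }

    long getEpochMillis() {
        return epochMillis;
    }

    @Override
    public int compareTo(TimestampWatermark other) {
        // No cast needed: the compiler guarantees the argument type.
        return Long.compare(epochMillis, other.epochMillis);
    }

    @Override
    public TimestampWatermark copy() {
        return new TimestampWatermark(epochMillis);
    }
}

// Hypothetical interval whose two ends must share the same watermark type.
final class TypedInterval<T extends TypedWatermark<T>> {
    private final T low;
    private final T high;

    TypedInterval(T low, T high) {
        this.low = low;
        this.high = high;
    }

    T getLow() {
        return low;
    }

    T getHigh() {
        return high;
    }
}
```

With this shape, building a TypedInterval from two different watermark types, or comparing a TimestampWatermark against some other watermark type, fails at compile time rather than with a ClassCastException at run time.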
Hey, so I think the
For all the above use cases I think have a However, if we take that approach then we add a restriction any class that implements the Also, I think the
Let's discuss this in person. I don't understand why we need getValue(). The Watermark should be the watermark implementation. We already have a container for watermarks -- the workunit.
@chavdar I still have some questions, so maybe we can sync up. Primarily, I realized that I will also have to change the Which seems a little odd from a usability perspective, ideally the user should not need to worry about the fact that
Updated based on discussion with Chavdar. @liyinan926 any more comments?
 */
public class WatermarkInterval {

  private Watermark lowWatermark;
Both can be final.
Fixed.
Comments have been addressed, @liyinan926.
LGTM.
@chavdar any other comments?
Adding Support for Complex Watermark Types
Merged
This is the first pull request for adding complex watermark types to Gobblin. This will replace the legacy system of tracking watermarks. The old system was decentralized and depended on passing custom configuration parameters between executions via a WorkUnitState.

The new implementation contains a new interface called Watermark, which extends the Comparable and Copyable interfaces. It contains only one method, increment(Object record), which defines how to increment the watermark for a given record.

Another class, WatermarkInterval, contains the logic for maintaining low and high watermarks. The corresponding changes to WorkUnit and WorkUnitState have been made. Since this pull request mainly focuses on defining the interfaces, no migration code has been added to the framework yet.