-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[HYDRATOR-552] Adding Sampling plugin #1
Conversation
docs/Sampling-batchaggregator.md
Outdated
i.e, Systematic Sampling and Reservoir Sampling. | ||
|
||
|
||
Use Case |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove this section
docs/Sampling-batchaggregator.md
Outdated
**sampleSize:** The number of records that needs to be sampled from the input records. | ||
|
||
**samplePercentage:** The percentage of records that needs to be sampled from the input records. Either of | ||
'samplePercentage' or 'sampleSize' needs to be mentioned. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
needs to be mentioned -> should be specified for this plugin.
docs/Sampling-batchaggregator.md
Outdated
'samplePercentage' or 'sampleSize' needs to be mentioned. | ||
|
||
**samplingType:** Type of the Sampling algorithm that needs to be used to sample the data. | ||
For example: Systematic or Reservoir |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Type of the Sampling algorithm that should to be used to sample the data. This can be either Systematic or Reservoir.
docs/Sampling-batchaggregator.md
Outdated
For example: Systematic or Reservoir | ||
|
||
**overSamplingPercentage:** The percentage of additional records that needs to be included in addition to the input | ||
sample size to account for oversampling to be used in Systematic Sampling. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sample size to account for oversampling. Required for Systematic Sampling.
docs/Sampling-batchaggregator.md
Outdated
**samplingType:** Type of the Sampling algorithm that needs to be used to sample the data. | ||
For example: Systematic or Reservoir | ||
|
||
**overSamplingPercentage:** The percentage of additional records that needs to be included in addition to the input |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that needs to be included -> that should be included
docs/Sampling-batchaggregator.md
Outdated
**random:** Random float value between 0 and 1 to be used in Systematic Sampling. If not provided, plugin will | ||
internally generate random value. | ||
|
||
**totalRecords:** Total number of input records to be used in Systematic Sampling. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Total number of input records for Systematic Sampling.
*/ | ||
@Plugin(type = BatchAggregator.PLUGIN_TYPE) | ||
@Name("Sampling") | ||
@Description("Sampling a large dataset flowing through this plugin to pull random records.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sample a large dataset allowing only a subset of records through to the next stage.
|
||
@Override | ||
public void prepareRun(BatchAggregatorContext context) throws Exception { | ||
context.setNumPartitions(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this needed?
} | ||
|
||
@Override | ||
public void groupBy(StructuredRecord record, Emitter<String> emitter) throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To group the data for sampling algorithms defined in aggregate(), this function is needed. Please suggest, if i am missing anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all good
|
||
@Nullable | ||
@Description("The number of records that needs to be sampled from the input records.") | ||
private Integer sampleSize; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this be macro enabled?
|
||
@Nullable | ||
@Description("The percenatage of records that needs to be sampled from the input records.") | ||
private Float samplePercentage; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this be macro enabled?
@Nullable | ||
@Description("The percenatage of additional records that needs to be included in addition to the input " + | ||
"sample size to account for oversampling to be used in Systematic Sampling.") | ||
private Float overSamplingPercentage; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this be macro enabled?
|
||
@Nullable | ||
@Description("Random float value between 0 and 1 to be used in Systematic Sampling.") | ||
private Float random; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this be macro enabled?
|
||
@Nullable | ||
@Description("Total number od input records.") | ||
private Integer totalRecords; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this be macro enabled?
if (samplingType.equalsIgnoreCase(TYPE.SYSTEMATIC.toString()) && totalRecords == null) { | ||
throw new IllegalArgumentException("Please provide value for 'Total Records' when selecting sampling " + | ||
"type as 'Systematic'."); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you want to check for things like ranges here? like, if i enter a random number of 47?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ranges added
pom.xml
Outdated
<parent> | ||
<artifactId>hydrator-plugins</artifactId> | ||
<groupId>co.cask.hydrator</groupId> | ||
<version>1.6.0-SNAPSHOT</version> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1.6.0 isn't out yet. this should be 1.5.1
fails checkstyle: this means you didn't actually build the project before checking in your latest changes. |
docs/Sampling-batchaggregator.md
Outdated
|
||
Properties | ||
---------- | ||
**sampleSize:** The number of records that needs to be sampled from the input records. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add Either of 'samplePercentage' or 'sampleSize' should be specified for this plugin.
public static class SamplingConfig extends PluginConfig { | ||
|
||
@Nullable | ||
@Description("The number of records that needs to be sampled from the input records.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add: Either of 'samplePercentage' or 'sampleSize' should be specified for this plugin.
private Integer sampleSize; | ||
|
||
@Nullable | ||
@Description("The percenatage of records that needs to be sampled from the input records.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add: Either of 'samplePercentage' or 'sampleSize' should be specified for this plugin.
private Float overSamplingPercentage; | ||
|
||
@Nullable | ||
@Description("Random float value between 0 and 1 to be used in Systematic Sampling.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add: If not provided, plugin will internally generate random value.
@russorat I have updated the code and fixed checkstyle issues. Please review. |
LGTM |
No description provided.