Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HYDRATOR-552] Adding Sampling plugin #1

Merged
merged 7 commits into from
Feb 13, 2017
Merged

[HYDRATOR-552] Adding Sampling plugin #1

merged 7 commits into from
Feb 13, 2017

Conversation

romy-khetan
Copy link

No description provided.

i.e, Systematic Sampling and Reservoir Sampling.


Use Case
Copy link

@russorat russorat Feb 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this section

**sampleSize:** The number of records that needs to be sampled from the input records.

**samplePercentage:** The percentage of records that needs to be sampled from the input records. Either of
'samplePercentage' or 'sampleSize' needs to be mentioned.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs to be mentioned -> should be specified for this plugin.

'samplePercentage' or 'sampleSize' needs to be mentioned.

**samplingType:** Type of the Sampling algorithm that needs to be used to sample the data.
For example: Systematic or Reservoir
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Type of the Sampling algorithm that should to be used to sample the data. This can be either Systematic or Reservoir.

For example: Systematic or Reservoir

**overSamplingPercentage:** The percentage of additional records that needs to be included in addition to the input
sample size to account for oversampling to be used in Systematic Sampling.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sample size to account for oversampling. Required for Systematic Sampling.

**samplingType:** Type of the Sampling algorithm that needs to be used to sample the data.
For example: Systematic or Reservoir

**overSamplingPercentage:** The percentage of additional records that needs to be included in addition to the input
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that needs to be included -> that should be included

**random:** Random float value between 0 and 1 to be used in Systematic Sampling. If not provided, plugin will
internally generate random value.

**totalRecords:** Total number of input records to be used in Systematic Sampling.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Total number of input records for Systematic Sampling.

*/
@Plugin(type = BatchAggregator.PLUGIN_TYPE)
@Name("Sampling")
@Description("Sampling a large dataset flowing through this plugin to pull random records.")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sample a large dataset allowing only a subset of records through to the next stage.


@Override
public void prepareRun(BatchAggregatorContext context) throws Exception {
context.setNumPartitions(1);
Copy link

@russorat russorat Feb 2, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed?

}

@Override
public void groupBy(StructuredRecord record, Emitter<String> emitter) throws Exception {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this needed?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To group the data for sampling algorithms defined in aggregate(), this function is needed. Please suggest, if i am missing anything.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all good


@Nullable
@Description("The number of records that needs to be sampled from the input records.")
private Integer sampleSize;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be macro enabled?


@Nullable
@Description("The percenatage of records that needs to be sampled from the input records.")
private Float samplePercentage;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be macro enabled?

@Nullable
@Description("The percenatage of additional records that needs to be included in addition to the input " +
"sample size to account for oversampling to be used in Systematic Sampling.")
private Float overSamplingPercentage;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be macro enabled?


@Nullable
@Description("Random float value between 0 and 1 to be used in Systematic Sampling.")
private Float random;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be macro enabled?


@Nullable
@Description("Total number od input records.")
private Integer totalRecords;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this be macro enabled?

if (samplingType.equalsIgnoreCase(TYPE.SYSTEMATIC.toString()) && totalRecords == null) {
throw new IllegalArgumentException("Please provide value for 'Total Records' when selecting sampling " +
"type as 'Systematic'.");
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you want to check for things like ranges here? like, if i enter a random number of 47?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ranges added

pom.xml Outdated
<parent>
<artifactId>hydrator-plugins</artifactId>
<groupId>co.cask.hydrator</groupId>
<version>1.6.0-SNAPSHOT</version>
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1.6.0 isn't out yet. this should be 1.5.1

@russorat
Copy link

russorat commented Feb 7, 2017

fails checkstyle:
{code}
[ERROR] /Users/russellsavage/workspace/sampling-aggregator/src/main/java/co/cask/hydrator/plugin/batch/aggreagtor/aggregator/Sampling.java:239:15: 'if' is not followed by whitespace. [WhitespaceAround]
[ERROR] /Users/russellsavage/workspace/sampling-aggregator/src/main/java/co/cask/hydrator/plugin/batch/aggreagtor/aggregator/Sampling.java:244:15: 'if' is not followed by whitespace. [WhitespaceAround]
{code}

this means you didn't actually build the project before checking in your latest changes.


Properties
----------
**sampleSize:** The number of records that needs to be sampled from the input records.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add Either of 'samplePercentage' or 'sampleSize' should be specified for this plugin.

public static class SamplingConfig extends PluginConfig {

@Nullable
@Description("The number of records that needs to be sampled from the input records.")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add: Either of 'samplePercentage' or 'sampleSize' should be specified for this plugin.

private Integer sampleSize;

@Nullable
@Description("The percenatage of records that needs to be sampled from the input records.")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add: Either of 'samplePercentage' or 'sampleSize' should be specified for this plugin.

private Float overSamplingPercentage;

@Nullable
@Description("Random float value between 0 and 1 to be used in Systematic Sampling.")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add: If not provided, plugin will internally generate random value.

@romy-khetan
Copy link
Author

@russorat I have updated the code and fixed checkstyle issues. Please review.

@russorat
Copy link

LGTM

@russorat russorat merged commit f3783e2 into master Feb 13, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants