# [SPARK-24275][SQL] Revise doc comments in InputPartition

## What changes were proposed in this pull request?

In apache#21145, DataReaderFactory was renamed to InputPartition.

This PR revises the wording in the comments to make them clearer.

## How was this patch tested?

None

Author: Gengliang Wang <gengliang.wang@databricks.com>

Closes apache#21326 from gengliangwang/revise_reader_comments.

(cherry picked from commit 6fb7d6c)
gengliangwang authored and jzhuge committed Aug 27, 2019
1 parent b780587 commit f997b87
Showing 7 changed files with 24 additions and 23 deletions.

ReadSupport.java

@@ -30,7 +30,7 @@ public interface ReadSupport extends DataSourceV2 {
 /**
  * Creates a {@link DataSourceReader} to scan the data from this data source.
  *
- * If this method fails (by throwing an exception), the action would fail and no Spark job was
+ * If this method fails (by throwing an exception), the action will fail and no Spark job will be
  * submitted.
  *
  * @param options the options for the returned data source reader, which is an immutable
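
For context, here is a minimal sketch of how a source plugs into `ReadSupport`. The class name and the `path` option are illustrative, not part of this patch, and `SimpleDataSourceReader` is sketched after the `DataSourceReader` diff further down:

```java
import org.apache.spark.sql.sources.v2.DataSourceOptions;
import org.apache.spark.sql.sources.v2.DataSourceV2;
import org.apache.spark.sql.sources.v2.ReadSupport;
import org.apache.spark.sql.sources.v2.reader.DataSourceReader;

// Hypothetical data source: Spark calls createReader() on the driver when a scan
// over this source is planned. Throwing here fails the action before any Spark
// job is submitted.
public class SimpleDataSource implements DataSourceV2, ReadSupport {
  @Override
  public DataSourceReader createReader(DataSourceOptions options) {
    // "path" is an illustrative option key; DataSourceOptions keys are case-insensitive.
    String path = options.get("path").orElseThrow(
        () -> new IllegalArgumentException("path option is required"));
    return new SimpleDataSourceReader(path);  // reader sketched in a later block
  }
}
```
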

ReadSupportWithSchema.java

@@ -35,7 +35,7 @@ public interface ReadSupportWithSchema extends DataSourceV2 {
 /**
  * Create a {@link DataSourceReader} to scan the data from this data source.
  *
- * If this method fails (by throwing an exception), the action would fail and no Spark job was
+ * If this method fails (by throwing an exception), the action will fail and no Spark job will be
  * submitted.
  *
  * @param schema the full schema of this data source reader. Full schema usually maps to the
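
`ReadSupportWithSchema` differs only in that Spark hands the user-specified schema to `createReader`. A hedged sketch; the class name and the empty-scan reader exist only to keep the example self-contained:

```java
import java.util.Collections;
import java.util.List;

import org.apache.spark.sql.Row;
import org.apache.spark.sql.sources.v2.DataSourceOptions;
import org.apache.spark.sql.sources.v2.DataSourceV2;
import org.apache.spark.sql.sources.v2.ReadSupportWithSchema;
import org.apache.spark.sql.sources.v2.reader.DataSourceReader;
import org.apache.spark.sql.sources.v2.reader.InputPartition;
import org.apache.spark.sql.types.StructType;

// Hypothetical source that cannot infer its schema (e.g. schemaless key-value
// storage) and therefore requires the caller to supply one.
public class SchemaRequiredDataSource implements DataSourceV2, ReadSupportWithSchema {
  @Override
  public DataSourceReader createReader(StructType schema, DataSourceOptions options) {
    if (schema.fields().length == 0) {
      throw new IllegalArgumentException("A non-empty user-specified schema is required");
    }
    // The reader must report the supplied schema back from readSchema(),
    // unless it later prunes columns through an optimization interface.
    return new DataSourceReader() {
      @Override
      public StructType readSchema() {
        return schema;
      }

      @Override
      public List<InputPartition<Row>> planInputPartitions() {
        return Collections.emptyList();  // an empty scan, just for the sketch
      }
    };
  }
}
```
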

WriteSupport.java

@@ -35,7 +35,7 @@ public interface WriteSupport extends DataSourceV2 {
  * Creates an optional {@link DataSourceWriter} to save the data to this data source. Data
  * sources can return None if there is no writing needed to be done according to the save mode.
  *
- * If this method fails (by throwing an exception), the action would fail and no Spark job was
+ * If this method fails (by throwing an exception), the action will fail and no Spark job will be
  * submitted.
  *
  * @param jobId A unique string for the writing job. It's possible that there are many writing
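
A hedged sketch of the write-side entry point, assuming the `createWriter(jobId, schema, mode, options)` parameter order of this API revision. The class names and the existence check are illustrative, and `SimpleDataSourceWriter` is sketched after the `DataSourceWriter` diff below:

```java
import java.util.Optional;

import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.sources.v2.DataSourceOptions;
import org.apache.spark.sql.sources.v2.DataSourceV2;
import org.apache.spark.sql.sources.v2.WriteSupport;
import org.apache.spark.sql.sources.v2.writer.DataSourceWriter;
import org.apache.spark.sql.types.StructType;

// Hypothetical sink. Returning Optional.empty() tells Spark no write is needed
// for the given save mode; throwing fails the action before any job runs.
public class SimpleSink implements DataSourceV2, WriteSupport {
  @Override
  public Optional<DataSourceWriter> createWriter(
      String jobId, StructType schema, SaveMode mode, DataSourceOptions options) {
    if (mode == SaveMode.ErrorIfExists && dataAlreadyExists(options)) {
      throw new IllegalStateException("Data already exists and mode is ErrorIfExists");
    }
    if (mode == SaveMode.Ignore && dataAlreadyExists(options)) {
      return Optional.empty();  // nothing to do; Spark skips the write
    }
    return Optional.of(new SimpleDataSourceWriter(jobId, schema));
  }

  private boolean dataAlreadyExists(DataSourceOptions options) {
    return false;  // placeholder existence check for the sketch
  }
}
```
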

DataSourceReader.java

@@ -31,7 +31,7 @@
  * {@link ReadSupport#createReader(DataSourceOptions)} or
  * {@link ReadSupportWithSchema#createReader(StructType, DataSourceOptions)}.
  * It can mix in various query optimization interfaces to speed up the data scan. The actual scan
- * logic is delegated to {@link InputPartition}s that are returned by
+ * logic is delegated to {@link InputPartition}s, which are returned by
  * {@link #planInputPartitions()}.
  *
  * There are mainly 3 kinds of query optimizations:
@@ -45,8 +45,8 @@
  * only one of them would be respected, according to the priority list from high to low:
  * {@link SupportsScanColumnarBatch}, {@link SupportsScanUnsafeRow}.
  *
- * If an exception was throw when applying any of these query optimizations, the action would fail
- * and no Spark job was submitted.
+ * If an exception was throw when applying any of these query optimizations, the action will fail
+ * and no Spark job will be submitted.
  *
  * Spark first applies all operator push-down optimizations that this data source supports. Then
  * Spark collects information this data source reported for further optimizations. Finally Spark
@@ -59,21 +59,21 @@ public interface DataSourceReader {
  * Returns the actual schema of this data source reader, which may be different from the physical
  * schema of the underlying storage, as column pruning or other optimizations may happen.
  *
- * If this method fails (by throwing an exception), the action would fail and no Spark job was
+ * If this method fails (by throwing an exception), the action will fail and no Spark job will be
  * submitted.
  */
 StructType readSchema();

 /**
- * Returns a list of read tasks. Each task is responsible for creating a data reader to
- * output data for one RDD partition. That means the number of tasks returned here is same as
- * the number of RDD partitions this scan outputs.
+ * Returns a list of {@link InputPartition}s. Each {@link InputPartition} is responsible for
+ * creating a data reader to output data of one RDD partition. The number of input partitions
+ * returned here is the same as the number of RDD partitions this scan outputs.
  *
  * Note that, this may not be a full scan if the data source reader mixes in other optimization
  * interfaces like column pruning, filter push-down, etc. These optimizations are applied before
  * Spark issues the scan request.
  *
- * If this method fails (by throwing an exception), the action would fail and no Spark job was
+ * If this method fails (by throwing an exception), the action will fail and no Spark job will be
  * submitted.
  */
 List<InputPartition<Row>> planInputPartitions();
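
To tie `readSchema()` and `planInputPartitions()` together, here is a hedged sketch of a driver-side reader that splits the scan into two range partitions. The schema, the split, and `RangeInputPartition` (sketched after the `InputPartition` diff below) are illustrative:

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Row;
import org.apache.spark.sql.sources.v2.reader.DataSourceReader;
import org.apache.spark.sql.sources.v2.reader.InputPartition;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Hypothetical reader: runs on the driver. readSchema() declares the output
// schema; planInputPartitions() decides how many RDD partitions the scan has.
public class SimpleDataSourceReader implements DataSourceReader {
  private final String path;

  public SimpleDataSourceReader(String path) {
    this.path = path;
  }

  @Override
  public StructType readSchema() {
    // One INT column named "i"; a real reader would derive this from metadata.
    return new StructType().add("i", DataTypes.IntegerType);
  }

  @Override
  public List<InputPartition<Row>> planInputPartitions() {
    // One InputPartition per output RDD partition: here, two range slices.
    return Arrays.asList(
        new RangeInputPartition(0, 5),    // rows 0..4
        new RangeInputPartition(5, 10));  // rows 5..9
  }
}
```

Spark schedules one task per element of the returned list, so the two `RangeInputPartition`s above become a two-partition RDD.
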

InputPartition.java

@@ -23,13 +23,14 @@

 /**
  * An input partition returned by {@link DataSourceReader#planInputPartitions()} and is
- * responsible for creating the actual data reader. The relationship between
- * {@link InputPartition} and {@link InputPartitionReader}
+ * responsible for creating the actual data reader of one RDD partition.
+ * The relationship between {@link InputPartition} and {@link InputPartitionReader}
  * is similar to the relationship between {@link Iterable} and {@link java.util.Iterator}.
  *
- * Note that input partitions will be serialized and sent to executors, then the partition reader
- * will be created on executors and do the actual reading. So {@link InputPartition} must be
- * serializable and {@link InputPartitionReader} doesn't need to be.
+ * Note that {@link InputPartition}s will be serialized and sent to executors, then
+ * {@link InputPartitionReader}s will be created on executors to do the actual reading. So
+ * {@link InputPartition} must be serializable while {@link InputPartitionReader} doesn't need to
+ * be.
  */
 @InterfaceStability.Evolving
 public interface InputPartition<T> extends Serializable {
@@ -41,10 +42,10 @@ public interface InputPartition<T> extends Serializable {
  * The location is a string representing the host name.
  *
  * Note that if a host name cannot be recognized by Spark, it will be ignored as it was not in
- * the returned locations. By default this method returns empty string array, which means this
- * task has no location preference.
+ * the returned locations. The default return value is empty string array, which means this
+ * input partition's reader has no location preference.
  *
- * If this method fails (by throwing an exception), the action would fail and no Spark job was
+ * If this method fails (by throwing an exception), the action will fail and no Spark job will be
  * submitted.
  */
 default String[] preferredLocations() {
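
A hedged sketch of the `Iterable`/`Iterator`-style pair described above. The `createPartitionReader()` method name and the `next()`/`get()`/`close()` contract of `InputPartitionReader` are assumed from this API revision, and all class names are illustrative:

```java
import java.io.IOException;

import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.sources.v2.reader.InputPartition;
import org.apache.spark.sql.sources.v2.reader.InputPartitionReader;

// Hypothetical partition: only the start/end bounds are serialized and shipped
// to an executor; the reader below is created there and never leaves it.
public class RangeInputPartition implements InputPartition<Row> {
  private final int start;
  private final int end;

  public RangeInputPartition(int start, int end) {
    this.start = start;
    this.end = end;
  }

  // preferredLocations() is not overridden, so this partition keeps the
  // default behavior: no location preference.

  @Override
  public InputPartitionReader<Row> createPartitionReader() {
    return new RangeReader(start, end);
  }

  // Iterator-style reader: next() advances, get() returns the current row.
  private static class RangeReader implements InputPartitionReader<Row> {
    private final int end;
    private int current;

    RangeReader(int start, int end) {
      this.current = start - 1;
      this.end = end;
    }

    @Override
    public boolean next() throws IOException {
      current += 1;
      return current < end;
    }

    @Override
    public Row get() {
      return RowFactory.create(current);
    }

    @Override
    public void close() throws IOException {
      // nothing to release in this in-memory sketch
    }
  }
}
```
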

DataSourceWriter.java

@@ -34,8 +34,8 @@
  * It can mix in various writing optimization interfaces to speed up the data saving. The actual
  * writing logic is delegated to {@link DataWriter}.
  *
- * If an exception was throw when applying any of these writing optimizations, the action would fail
- * and no Spark job was submitted.
+ * If an exception was throw when applying any of these writing optimizations, the action will fail
+ * and no Spark job will be submitted.
  *
  * The writing procedure is:
  * 1. Create a writer factory by {@link #createWriterFactory()}, serialize and send it to all the
@@ -58,7 +58,7 @@ public interface DataSourceWriter {
 /**
  * Creates a writer factory which will be serialized and sent to executors.
  *
- * If this method fails (by throwing an exception), the action would fail and no Spark job was
+ * If this method fails (by throwing an exception), the action will fail and no Spark job will be
  * submitted.
  */
 DataWriterFactory<Row> createWriterFactory();
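
A hedged sketch of a driver-side writer. The `commit`/`abort` signatures taking `WriterCommitMessage[]` are assumed from this API revision, and `SimpleWriterFactory` is sketched after the `DataWriterFactory` diff below:

```java
import org.apache.spark.sql.Row;
import org.apache.spark.sql.sources.v2.writer.DataSourceWriter;
import org.apache.spark.sql.sources.v2.writer.DataWriterFactory;
import org.apache.spark.sql.sources.v2.writer.WriterCommitMessage;
import org.apache.spark.sql.types.StructType;

// Hypothetical writer: lives on the driver. The factory it creates is
// serialized to executors; commit()/abort() run back on the driver once all
// (or any) tasks have finished.
public class SimpleDataSourceWriter implements DataSourceWriter {
  private final String jobId;
  private final StructType schema;

  public SimpleDataSourceWriter(String jobId, StructType schema) {
    this.jobId = jobId;
    this.schema = schema;
  }

  @Override
  public DataWriterFactory<Row> createWriterFactory() {
    // Throwing here fails the action before any Spark job is submitted.
    return new SimpleWriterFactory(schema);
  }

  @Override
  public void commit(WriterCommitMessage[] messages) {
    // e.g. atomically publish the outputs reported by each task's message
  }

  @Override
  public void abort(WriterCommitMessage[] messages) {
    // e.g. delete whatever the tasks managed to write before the failure
  }
}
```
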

DataWriterFactory.java

@@ -35,7 +35,7 @@ public interface DataWriterFactory<T> extends Serializable {
 /**
  * Returns a data writer to do the actual writing work.
  *
- * If this method fails (by throwing an exception), the action would fail and no Spark job was
+ * If this method fails (by throwing an exception), the action will fail and no Spark job will be
  * submitted.
  *
  * @param partitionId A unique id of the RDD partition that the returned writer will process.
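
A hedged sketch of the factory that ships to executors. The two-argument `createDataWriter(partitionId, attemptNumber)` form is an assumption for this API revision (the parameter list changed across Spark versions), and the no-op writer exists only to keep the sketch complete:

```java
import java.io.IOException;

import org.apache.spark.sql.Row;
import org.apache.spark.sql.sources.v2.writer.DataWriter;
import org.apache.spark.sql.sources.v2.writer.DataWriterFactory;
import org.apache.spark.sql.sources.v2.writer.WriterCommitMessage;
import org.apache.spark.sql.types.StructType;

// Hypothetical factory: it is serialized and sent to executors, so every field
// it carries must be serializable. One DataWriter is created per partition attempt.
public class SimpleWriterFactory implements DataWriterFactory<Row> {
  private final StructType schema;

  public SimpleWriterFactory(StructType schema) {
    this.schema = schema;
  }

  @Override
  public DataWriter<Row> createDataWriter(int partitionId, int attemptNumber) {
    // A real factory would open a per-partition output (file, connection, ...)
    // keyed by partitionId/attemptNumber so that retries don't collide.
    return new NoOpDataWriter();
  }

  // Minimal DataWriter that discards rows; shown only to make the sketch complete.
  private static class NoOpDataWriter implements DataWriter<Row> {
    @Override
    public void write(Row record) throws IOException {
      // drop the record in this sketch
    }

    @Override
    public WriterCommitMessage commit() throws IOException {
      return new NoOpCommitMessage();
    }

    @Override
    public void abort() throws IOException {
      // nothing to clean up
    }
  }

  // Marker message sent back to the driver; a real writer would report what it wrote.
  private static class NoOpCommitMessage implements WriterCommitMessage {}
}
```

Because the factory is created on the driver but used on executors, anything it references (here just the schema) is serialized with it, which is why the interface extends `Serializable`.
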
