[SPARK-8437] [DOCS] Using directory path without wildcard for filename slow for large number of files with wholeTextFiles and binaryFiles

Note that 'dir/*' can be more efficient in some Hadoop FS implementations than 'dir/'

Author: Sean Owen <sowen@cloudera.com>

Closes apache#7036 from srowen/SPARK-8437 and squashes the following commits:

0e813ae [Sean Owen] Note that 'dir/*' can be more efficient in some Hadoop FS implementations than 'dir/'
srowen authored and Andrew Or committed Jun 30, 2015
1 parent fbf7573 commit 5d30eae
Showing 1 changed file with 6 additions and 2 deletions.
8 changes: 6 additions & 2 deletions core/src/main/scala/org/apache/spark/SparkContext.scala
@@ -831,6 +831,8 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
    * }}}
    *
    * @note Small files are preferred, large file is also allowable, but may cause bad performance.
+   * @note On some filesystems, `.../path/*` can be a more efficient way to read all files in a directory
+   * rather than `.../path/` or `.../path`
    *
    * @param minPartitions A suggestion value of the minimal splitting number for input data.
    */
@@ -878,9 +880,11 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
    * (a-hdfs-path/part-nnnnn, its content)
    * }}}
    *
-   * @param minPartitions A suggestion value of the minimal splitting number for input data.
-   *
    * @note Small files are preferred; very large files may cause bad performance.
+   * @note On some filesystems, `.../path/*` can be a more efficient way to read all files in a directory
+   * rather than `.../path/` or `.../path`
+   *
+   * @param minPartitions A suggestion value of the minimal splitting number for input data.
    */
   @Experimental
   def binaryFiles(
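
The note added to both Scaladoc comments recommends passing `.../path/*` instead of `.../path/` or `.../path`. A minimal sketch of how a caller might apply that advice follows. The `DirGlob` object and its `asGlob` method are hypothetical names introduced here for illustration and are not part of Spark; the sketch also assumes the input path names a directory, not a single file.

```scala
// Hypothetical helper, not part of Spark: rewrite a bare directory path as
// `dir/*` before handing it to wholeTextFiles or binaryFiles.
object DirGlob {
  // Characters treated here as a sign that the path is already a glob pattern.
  private val GlobChars = Set('*', '?', '[', '{')

  def asGlob(path: String): String =
    if (path.exists(c => GlobChars.contains(c))) path // already a pattern, leave unchanged
    else if (path.endsWith("/")) path + "*"           // "dir/" becomes "dir/*"
    else path + "/*"                                  // "dir"  becomes "dir/*"
}

// Usage, assuming an existing SparkContext named `sc`:
//   val texts    = sc.wholeTextFiles(DirGlob.asGlob("hdfs://a-hdfs-path"))
//   val binaries = sc.binaryFiles(DirGlob.asGlob("hdfs://a-hdfs-path/"))
```

Whether the glob form is actually faster depends on the Hadoop FS implementation in use, as the commit message says, so it is worth measuring on the target filesystem before adopting it everywhere.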
