Skip to content

Commit

Permalink
apacheGH-34213: [C++] Use recursive calls without a delimiter if the …
Browse files Browse the repository at this point in the history
…user is doing a recursive GetFileInfo (apache#35440)

### Rationale for this change

The old model of "walk"ing the directory could lead to a large number of calls.  If someone is fully listing a bucket they will need to make one S3 API call for every single directory in the bucket.  With this approach there is only 1 call made for every 1000 files, regardless of how they are spread across directories.

The only potential regression would be if max_recursion was set to something > 1.  For example, if a user had:

```
bucket/foo/bar/<10000 files here>
```

Then if they make a request for `bucket` with `max_recursion=2` the new approach will list all 10,000 files and then eliminate the files that don't match.

However, I believe these cases (using max_recursion) to be rarer and less common than the typical case of listing all files (which dataset discovery does).

### What changes are included in this PR?

The algorithm behind GetFileInfo and DeleteDirContents in S3FileSystem has changed.

### Are these changes tested?

Yes, there should be no behavior change.  All of the existing filesystem tests will test this change.

### Are there any user-facing changes?

No, other than (hopefully) better performance.
* Closes: apache#34213

Lead-authored-by: Weston Pace <weston.pace@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
  • Loading branch information
2 people authored and loicalleyne committed Nov 13, 2023
1 parent 5f80de4 commit fc02509
Show file tree
Hide file tree
Showing 7 changed files with 638 additions and 403 deletions.
41 changes: 41 additions & 0 deletions cpp/src/arrow/filesystem/filesystem_test.cc
Original file line number Diff line number Diff line change
Expand Up @@ -87,6 +87,34 @@ TEST(PathUtil, SplitAbstractPath) {
AssertPartsEqual(parts, {"abc", "def.ghi"});
}

TEST(PathUtil, SliceAbstractPath) {
std::string path = "abc";
ASSERT_EQ("abc", SliceAbstractPath(path, 0, 1));
ASSERT_EQ("abc", SliceAbstractPath(path, 0, 2));
ASSERT_EQ("", SliceAbstractPath(path, 0, 0));
ASSERT_EQ("", SliceAbstractPath(path, 1, 0));

path = "abc/def\\x/y.ext";
ASSERT_EQ("abc/def\\x/y.ext", SliceAbstractPath(path, 0, 4));
ASSERT_EQ("abc/def\\x/y.ext", SliceAbstractPath(path, 0, 3));
ASSERT_EQ("abc/def\\x", SliceAbstractPath(path, 0, 2));
ASSERT_EQ("abc", SliceAbstractPath(path, 0, 1));
ASSERT_EQ("def\\x/y.ext", SliceAbstractPath(path, 1, 2));
ASSERT_EQ("def\\x/y.ext", SliceAbstractPath(path, 1, 3));
ASSERT_EQ("def\\x", SliceAbstractPath(path, 1, 1));
ASSERT_EQ("y.ext", SliceAbstractPath(path, 2, 1));
ASSERT_EQ("", SliceAbstractPath(path, 3, 1));

path = "x/y\\z";
ASSERT_EQ("x", SliceAbstractPath(path, 0, 1));
ASSERT_EQ("x/y", SliceAbstractPath(path, 0, 1, /*sep=*/'\\'));

// Invalid cases but we shouldn't crash
ASSERT_EQ("", SliceAbstractPath(path, -1, 1));
ASSERT_EQ("", SliceAbstractPath(path, 0, -1));
ASSERT_EQ("", SliceAbstractPath(path, -1, -1));
}

TEST(PathUtil, GetAbstractPathExtension) {
ASSERT_EQ(GetAbstractPathExtension("abc.txt"), "txt");
ASSERT_EQ(GetAbstractPathExtension("dir/abc.txt"), "txt");
Expand All @@ -98,6 +126,19 @@ TEST(PathUtil, GetAbstractPathExtension) {
ASSERT_EQ(GetAbstractPathExtension("/run.d/abc"), "");
}

TEST(PathUtil, GetAbstractPathDepth) {
ASSERT_EQ(0, GetAbstractPathDepth(""));
ASSERT_EQ(0, GetAbstractPathDepth("/"));
ASSERT_EQ(1, GetAbstractPathDepth("foo"));
ASSERT_EQ(1, GetAbstractPathDepth("foo/"));
ASSERT_EQ(1, GetAbstractPathDepth("/foo"));
ASSERT_EQ(1, GetAbstractPathDepth("/foo/"));
ASSERT_EQ(2, GetAbstractPathDepth("/foo/bar"));
ASSERT_EQ(2, GetAbstractPathDepth("/foo/bar/"));
ASSERT_EQ(2, GetAbstractPathDepth("foo/bar"));
ASSERT_EQ(2, GetAbstractPathDepth("foo/bar/"));
}

TEST(PathUtil, GetAbstractPathParent) {
std::pair<std::string, std::string> pair;

Expand Down
37 changes: 37 additions & 0 deletions cpp/src/arrow/filesystem/path_util.cc
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@

#include <algorithm>
#include <regex>
#include <sstream>

#include "arrow/filesystem/path_util.h"
#include "arrow/filesystem/util_internal.h"
Expand Down Expand Up @@ -66,6 +67,42 @@ std::vector<std::string> SplitAbstractPath(const std::string& path, char sep) {
return parts;
}

std::string SliceAbstractPath(const std::string& s, int offset, int length, char sep) {
if (offset < 0 || length < 0) {
return "";
}
std::vector<std::string> components = SplitAbstractPath(s, sep);
std::stringstream combined;
if (offset >= static_cast<int>(components.size())) {
return "";
}
int end = offset + length;
if (end > static_cast<int>(components.size())) {
end = static_cast<int>(components.size());
}
for (int i = offset; i < end; i++) {
combined << components[i];
if (i < end - 1) {
combined << sep;
}
}
return combined.str();
}

int GetAbstractPathDepth(std::string_view path) {
if (path.empty()) {
return 0;
}
int depth = static_cast<int>(std::count(path.begin(), path.end(), kSep)) + 1;
if (path.back() == kSep) {
depth -= 1;
}
if (path.front() == kSep) {
depth -= 1;
}
return depth;
}

std::pair<std::string, std::string> GetAbstractPathParent(const std::string& s) {
// XXX should strip trailing slash?

Expand Down
20 changes: 18 additions & 2 deletions cpp/src/arrow/filesystem/path_util.h
Original file line number Diff line number Diff line change
Expand Up @@ -38,9 +38,25 @@ constexpr char kSep = '/';
ARROW_EXPORT
std::vector<std::string> SplitAbstractPath(const std::string& path, char sep = kSep);

// Return the extension of the file
// Slice the individual components of an abstract path and combine them
//
// If offset or length are negative then an empty string is returned
// If offset is >= the number of components then an empty string is returned
// If offset + length is >= the number of components then length is truncated
ARROW_EXPORT
std::string GetAbstractPathExtension(const std::string& s);
std::string SliceAbstractPath(const std::string& path, int offset, int length,
char sep = kSep);

// Return the extension of the file
ARROW_EXPORT std::string GetAbstractPathExtension(const std::string& s);

// Return the depth (number of components) of an abstract path
//
// Trailing slashes do not count towards depth
// Leading slashes do not count towards depth
//
// The root path ("/") has depth 0
ARROW_EXPORT int GetAbstractPathDepth(std::string_view path);

// Return the parent directory and basename of an abstract path. Both values may be
// empty.
Expand Down

0 comments on commit fc02509

Please sign in to comment.