Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide a mechanism to save bucket pagination ranges #5742

Closed
coryan opened this issue Jan 25, 2021 · 17 comments
Closed

Provide a mechanism to save bucket pagination ranges #5742

coryan opened this issue Jan 25, 2021 · 17 comments
Labels
api: storage Issues related to the Cloud Storage API. cpp: backlog While desirable, we do not have time to work on this for the foreseeable future. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@coryan
Copy link
Member

coryan commented Jan 25, 2021

This is motivated by #5703. Applications may need to persist the state of pagination iterators, for example, because they are implementing a web service that returns pages of results.

I suggested this API:

struct Page {
  std::vector<std::string> bucket_names;
  std::string reader_state;
};

Page BucketsPage(gcs::Client client, std::string reader_state) {
  auto reader = client.ListBuckets(gcs::RestoreBucketReader(
      reader_state /* an empty string starts a new reader */));
  std::vector<std::string> names;
  for (auto& b : reader) {
    if (!b || names.size() >= 100) break;
    names.push_back(b->name());
  }
  return Page{std::move(names), std::move(reader).Checkpoint()};
}

I think that this problem applies to any pagination -> iterator mapping.

@coryan coryan added api: storage Issues related to the Cloud Storage API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. labels Jan 25, 2021
@coryan coryan added the good first issue This issue is a good place to started contributing to this repository. label Feb 22, 2021
@coryan coryan added this to Needs triage in Customer Issues via automation Feb 22, 2021
@coryan coryan moved this from Needs triage to High priority in Customer Issues Mar 2, 2021
@Ayrtat
Copy link

Ayrtat commented Jun 28, 2021

@coryan Hello! I would like to take this as first issue to get involved.
Does anybody solve this problem? Can I take this?

@coryan
Copy link
Member Author

coryan commented Jun 28, 2021

@Ayrtat nobody is working on this at the moment. If you are interested, I recommend you make sure you can run unit tests and integration tests locally:

https://github.com/googleapis/google-cloud-cpp/blob/main/doc/contributor/howto-guide-running-ci-builds-locally.md

https://github.com/googleapis/google-cloud-cpp/blob/main/doc/contributor/howto-guide-setup-development-workstation.md

We can then discuss how the Checkpoint() and RestoreBucketRead() would look like.

@Ayrtat
Copy link

Ayrtat commented Jul 1, 2021

@coryan
Hello, sorry for late response!
I followed the instructions and have run up CI tests:

100% tests passed, 0 tests failed out of 336

@coryan
Copy link
Member Author

coryan commented Jul 2, 2021

Very cool. I guess at this point we need to start figuring out how these functions would look like. Let me start by saying that my proposal is going to seem very complicated. The additional convolutions are to keep the public API as narrow as possible. Every function and type in the public API is a thing we need to support more-or-less forever, so we try to create as few as possible, sometimes jumping through some hoops to make it so.

Also, if you can think of a better design, please do speak up! I have no monopoly on good ideas. Just keep in mind the costs of public APIs.

I imagine we would start with something like:

#include "google/cloud/storage/internal/complex_options.h"

namespace google::cloud::storage { // C++17 for exposition-only
class PaginationToken;
namespace internal {
std::string GetTokenValue(PaginationToken const&);
PaginationToken MakePaginationToken(std::string value);
}

class PaginationToken {
 public:
  // This class should be opaque, application developers should not be able to create them, or access the internal state.
  // the only way to get one is calling 
 private:
  friend GetTokenValue;
  friend MakePaginationToken; 

  explicit PaginationToken(std::string value)

  std::string value_;
};

struct PaginationState : public ComplexOption<PaginationState, std::string> {
  // look at any of the other complex options for details
} ;
}

Then we would need to add this as a valid option to BucketsRequest:

class ListBucketsRequest
: public GenericRequest<ListBucketsRequest, MaxResults, Prefix, Projection,
UserProject> {

And use that option to initialize token_ (if the option is present).

We also need to create the CheckPoint() function. Maybe here:

using ListBucketsReader =
google::cloud::internal::PaginationRange<BucketMetadata>;

defined as:

PaginationToken Checkpoint(ListBucketsReader&& reader); // will require move, this is a good thing.

Which may require adding another function (in the internal namespace) to extract the token value from:

template <typename T, typename Request, typename Response>
class PagedStreamReader {

Let me know if you have any questions.

@Ayrtat
Copy link

Ayrtat commented Jul 2, 2021

I studied the code and already got few points (I hope so...)
As I understood -- we can get pagination from reader, like ListBucketReader but we also need to set ListBucketsRequest to get a response and a token. Also we need to set ListBucketsResponse for page extractor (get items from response). Fix me, please, if I am wrong

So, are you going to incapsulate token in PaginationToken?
Could you also say about checkpoint method? Does it snap a reader?

P.S. Please, tell me, if there is some doc that I can read and get info and not ply you with many question :)

@coryan
Copy link
Member Author

coryan commented Jul 3, 2021

I studied the code and already got few points (I hope so...)

Great.

As I understood -- we can get pagination from reader, like ListBucketReader but we also need to set ListBucketsRequest to get a response and a token. Also we need to set ListBucketsResponse for page extractor (get items from response). Fix me, please, if I am wrong

I think your reading is correct.

So, are you going to incapsulate token in PaginationToken?

That is the plan.

Could you also say about checkpoint method? Does it snap a reader?

I think it will be something like this:

PaginationToken Checkpoint(ListBucketsReader&& reader) {
  return PaginationToken(std::move(reader).CurrentToken());
}

But as I write this I realize there are some boundary conditions we need to deal with:

  1. What happens to the data already pre-loaded into the reader?
  2. What happens when the reader has not fetched all the data? What is the correct token value.
  3. Do we need this extra level of indirection? Why not make Checkpoint a member function of PagedStreamReader?

Some of these questions may seem unimportant to you, or not help you. I want to record them here for future reference anyway.

Let's tackle each in turn, because I think the are in order of difficulty:

What happens to the data already pre-loaded into the reader?

That suggests we need to change the return value to something like:

std::pair<PaginationToken, std::vector<BucketMetadata>> Checkpoint(ListBucketsReader&& reader);

this makes it impossible for the application to "forget" about any data already in the reader.

What happens when the reader has not fetched all the data? What is the correct token value.

That suggests we need to change PaginationToken to:

class PaginationToken {
 public:
  // This class should be opaque, application developers should not be able to create them, or access the internal state.
  // the only way to get one is calling 
 private:
  friend GetTokenValue;
  friend MakePaginationToken; 

  explicit PaginationToken(std::string value);
  explicit PaginationToken(abls::nullopt_t)

  abls::optional<std::string> value_;
};

That allows us to use an empty string as the marker for "no page loaded yet" (it always is in the google services), and to use an empty optional as the marker for "all pages loaded, there is no more data".

Do we need this extra level of indirection? Why not make Checkpoint a member function of PagedStreamReader?

I think the answer is "yes", but the rationale is a bit complicated:

  • We do not want to force users of the library to read the internal::PageStreamReader<> documentation, it is an internal class.
  • Maybe we do not want to expose checkpointing for all readers.

P.S. Please, tell me, if there is some doc that I can read and get info and not ply you with many question :)

This is a new feature. There is no design for a feature that does not exist, you and I are creating the design in this conversation.

PS: I apologize in advance if you know most, or some of the stuff I am explaining. I just don't know your level of expertise so I am assuming nothing. If you find that (a) I skipped some steps, please do ask for more details, or (b) if I am telling you stuff you know, please tell me to stop.

@Ayrtat
Copy link

Ayrtat commented Jul 3, 2021

@coryan
The more you tell me the better it is. To be honest, I have no any expertise in this project but I would like to study this project and make some contribution. I am interested in it! I need some time to study it and soon I will have questions! :)

@Ayrtat
Copy link

Ayrtat commented Jul 4, 2021

What would I like to propose:

  1. Allow to StreamRange keeps polymorphic functional object, i.e. either std::function or object of class/sturcture.
    So, we need to introduce optional template argument:
template <typename T>
struct IdleFunctionalObject {
 // This shuts up the compiler
 auto operator()() { return T{}; }
};

// Defined below.
template <typename T, typename FunctorT = IdleFunctionalObject<T>>
class StreamRange;
  1. We can create a range in two ways:
template <typename U>
friend StreamRange<U> internal::MakeStreamRange(internal::StreamReader<U>);

template <typename U, typename FunctorT>
friend StreamRange<U, FunctorT> internal::MakeStreamRangeState(FunctorT reader);
  1. Make reader_ within StreamRange polymorphic
  explicit StreamRange(internal::StreamReader<T> reader)
      : var_reader_(reader) {
    Next();
  }
  explicit StreamRange(FunctorT state_reader_)
      : var_reader_(state_reader_){
    Next();
  }
  // experemintal. var_reader_ can be std::function-packed functor or object of a class/structure
  absl::variant<internal::StreamReader<T>, FunctorT> var_reader_;
  1. Invoke operator()() from the polymorphic reader:
auto v = (absl::holds_alternative<internal::StreamReader<T>>(var_reader_)
                  ? absl::get<internal::StreamReader<T>>(var_reader_)
                  : absl::get<FunctorT>(var_reader_))();

    // auto v = reader_ ? reader_() : state_reader_();
    absl::visit(UnpackVariant{*this}, std::move(v));
  1. The example of stative functor:
template <typename ReaderT>
struct StativeReader {
  StativeReader(ReaderT& reader) : reader_(reader) {}

  ReaderT& reader_;
  auto operator()() { return reader_.GetNext(); }
};

instead

[reader]() mutable { return reader->GetNext(); }
  1. This allows to us:
return MakeStreamRangeState<ValueType>(StativeReader<ReaderType>(*reader));
  1. How can we retrieve reader's state from range?
// experimental 
absl::optional<typename FunctorT::reader_type> tryGetReaderState() {
    return absl::holds_alternative<internal::StreamReader<T>>(var_reader_)
               ? std::nullopt
               : absl::get<FunctorT>(var_reader_).reader_;
  }
  1. So, here we are
template <typename ReaderT>
PaginationToken checkpoint(ReaderT&& reader) {
  /*static_assert(does_type_support_get_current_token_method_v<ReaderT>,
                "Reader does not support getCurrentToken method");*/

  return PaginationToken::create(
      std::move(reader).tryGetReaderState().getCurrentToken());
}

I have already carried some experiments and got some results but I am concerned about design. @coryan, please, can you review this idea and blame/fix me. I hope this idea can be useful!

@coryan
Copy link
Member Author

coryan commented Jul 7, 2021

It seems to be that it would be easier to just not use StreamReader<>, and have separate implementations for paginated vs. non-paginated ranges. Then we can add the additional accessors for the pagination token to the paginated readers.

That is, instead of this:

template <typename T>
using PaginationRange = StreamRange<T>;

We would write something like:

template <typename T>
class PaginationRange { // no longer related to `StreamRange<T>` !!
 public:
   iterator begin() { ... }
   iterator end() { ... }

   std::pair<PaginationToken, std::vector<T>> Checkpoint() && { ... }
};

I think this would be a breaking change, but I would like to hear @devjgm's opinion.

@ghost
Copy link

ghost commented Aug 29, 2021

@Ayrtat Are you still working on this?

@Ayrtat
Copy link

Ayrtat commented Aug 29, 2021

@Ayrtat Are you still working on this?

Hi! No, I am not.
You can take this take if you would like to :)

@AditiThirdEye
Copy link

Hello @coryan ,
Is the PR done for this issue?
Otherwise, I would like to work on this issue.

@coryan
Copy link
Member Author

coryan commented Sep 14, 2021

There is no PR for this, though @remyabel expressed an interest.

@ghost
Copy link

ghost commented Sep 14, 2021

I don't mind if someone else was working on it. Just keeping it on the table incase I get around to it and didn't want to duplicate effort.

@AditiThirdEye
Copy link

Yes @remyabel ,
I would start working on this, and update here with progress.

@coryan coryan removed the good first issue This issue is a good place to started contributing to this repository. label Mar 17, 2022
@coryan
Copy link
Member Author

coryan commented Aug 18, 2022

This may be easier once I introduce a StorageConnection to implement mocks.

@coryan
Copy link
Member Author

coryan commented Jan 25, 2023

We do not have time to work on this, closing.

@coryan coryan closed this as completed Jan 25, 2023
Customer Issues automation moved this from High priority to Closed Jan 25, 2023
@coryan coryan added the cpp: backlog While desirable, we do not have time to work on this for the foreseeable future. label Jan 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: storage Issues related to the Cloud Storage API. cpp: backlog While desirable, we do not have time to work on this for the foreseeable future. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.
Projects
No open projects
Customer Issues
  
Closed
Development

No branches or pull requests

3 participants