Document streams in more detail. (#773)
This is meant to help with questions like in #644.
jtv committed Dec 27, 2023
1 parent 92bdf08 commit 0b9c8dd
Showing 2 changed files with 111 additions and 43 deletions.
138 changes: 102 additions & 36 deletions include/pqxx/doc/streams.md
@@ -3,20 +3,23 @@ Streams {#streams}

Most of the time it's fine to retrieve data from the database using `SELECT`
queries, and store data using `INSERT`. But for those cases where efficiency
matters, there are two _data streaming_ mechanisms to help you do this more
efficiently: "streaming queries," for reading query results from the
database; and the @ref pqxx::stream_to class, for writing data from the client
into a table.

These are less flexible than SQL queries. Also, depending on your needs, it
may be a problem to lose your connection while you're in mid-stream, not
knowing that the query may not complete. But, you get some scalability and
memory efficiencies in return.

Just like regular querying, these streaming mechanisms do data conversion for
you. You deal with the C++ data types, and the database deals with the SQL
data types.


Interlude: null values
----------------------

So how do you deal with nulls? It depends on the C++ type you're using. Some
types may have a built-in null value. For instance, if you have a
@@ -41,48 +44,111 @@ wrappers and smart pointers by copying the implementation patterns from the
existing smart-pointer support.
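
For instance, here's a minimal sketch of reading a nullable column into
`std::optional` (the `score` table, and the `process` and `process_unscored`
handlers, are hypothetical stand-ins for your own code; `tx` is an open
transaction):

```cxx
// Sketch: the "points" column may be null, so read it as std::optional<int>.
for (auto [name, points] :
    tx.stream<std::string_view, std::optional<int>>(
      "SELECT name, points FROM score"))
{
  if (points.has_value())
    process(name, *points);        // Got an actual value.
  else
    process_unscored(name);        // Field was null.
}
```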


Streaming data _from a query_
-----------------------------

Use @ref transaction_base::stream to read large amounts of data directly from
the database. In terms of API it works just like @ref transaction_base::query,
but it's faster than the `exec` and `query` functions for larger data sets.
Also, you won't need to keep your full result set in memory. That can really
matter with larger data sets.

Another performance advantage is that with a streaming query, you can start
processing your data right after the first row of data comes in from the
server. With `exec()` or `query()` you need to wait to receive all data, and
only then can you begin processing. With streaming queries you can be
processing data on the client side while the server is still sending you the
rest.

Not all kinds of queries will work in a stream. Internally the streams make
use of PostgreSQL's `COPY` command, so see the PostgreSQL documentation for
`COPY` for the exact limitations. Basic `SELECT` and `UPDATE ... RETURNING`
queries will just work, but fancier constructs may not.

For example:

```cxx
for (auto [name, score] :
    tx.stream<std::string_view, int>("SELECT name, points FROM score")
)
  process(name, score);
```

On each iteration, the stream gives you a `std::tuple` of the column types you
specify. It converts the row's fields (which internally arrive at the client
in text format) to your chosen types.

The `auto [name, score]` in the example is a _structured binding_ which unpacks
the tuple's fields into separate variables. If you prefer, you can choose to
receive the tuple instead: `for (std::tuple<std::string_view, int> row : ...)`.

### Is streaming right for my query?

Here are the things you need to be aware of when deciding whether to stream a
query, or just execute it normally.

First, when you stream a query, there is no metadata describing how many rows
it returned, what the columns are called, and so on. With a regular query you
get a @ref result object which contains this metadata as well as the data
itself. If you absolutely need this metadata for a particular query, then
that means you can't stream the query.
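
For example, here's a sketch of the kind of metadata a regular query gives you
and a stream does not (the `score` table is a hypothetical example; `tx` is an
open transaction):

```cxx
// A regular query returns a pqxx::result, which carries metadata.
pqxx::result r{tx.exec("SELECT name, points FROM score")};
std::cout << r.size() << " rows, " << r.columns() << " columns; "
          << "the first column is called " << r.column_name(0) << ".\n";
// A streaming query gives you no such object, just the rows themselves.
```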

Second, under the bonnet, streaming from a query uses a PostgreSQL-specific SQL
command `COPY (...) TO STDOUT`. There are some limitations on what kinds of
queries this command can handle. These limitations may change over time, so I
won't describe them here. Instead, see PostgreSQL's
[COPY documentation](https://www.postgresql.org/docs/current/sql-copy.html)
for the details. (Look for the `TO` variant, with a query as the data source.)

Third: when you stream a query, you start receiving and processing data before
you even know whether you will receive all of the data. If you lose your
connection to the database halfway through, you will have processed half your
data, unaware that the query may never execute to completion. If this is a
problem for your application, don't stream that query!
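
If you do need to guard against this, one option is to buffer the rows and
only process them once the stream completes. A sketch (note that buffering
gives up some of the memory savings; `process` and the `score` table are
hypothetical):

```cxx
std::vector<std::tuple<std::string, int>> rows;
try
{
  for (auto [name, score] :
      tx.stream<std::string, int>("SELECT name, points FROM score"))
    rows.emplace_back(name, score);
}
catch (pqxx::broken_connection const &)
{
  // The connection broke off mid-stream; don't trust the partial data.
  rows.clear();
  throw;
}
// Only process once we know the full result set arrived.
for (auto const &[name, score] : rows)
  process(name, score);
```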

The fourth and final factor is performance. If you're interested in streaming,
obviously you care about this one.

I can't tell you _a priori_ whether streaming will make your query faster. It
depends on how many rows you're retrieving, how much data there is in those
rows, the speed of your network connection to the database, your client
encoding, how much processing you do per row, and the details of the
client-side system: hardware speed, CPU load, and available memory.

Ultimately, no amount of theory beats real-world measurement for your specific
situation so... if it really matters, measure. (And as per Knuth's Law: if
it doesn't really matter, don't optimise.)
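
A minimal measurement sketch might look like this (the `score` table and the
query are placeholders; the connection reads its settings from the
environment):

```cxx
#include <chrono>
#include <iostream>
#include <pqxx/pqxx>

int main()
{
  pqxx::connection cx;   // Connection parameters come from the environment.
  pqxx::work tx{cx};

  auto const start{std::chrono::steady_clock::now()};
  long long total{0};
  // Time the streaming path.  Swap this loop for an exec()-based one to
  // compare against a regular query.
  for (auto [points] : tx.stream<int>("SELECT points FROM score"))
    total += points;
  auto const ms{std::chrono::duration_cast<std::chrono::milliseconds>(
    std::chrono::steady_clock::now() - start)};
  std::cout << "sum " << total << " took " << ms.count() << " ms\n";
}
```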

That said, here are a few data points from some toy benchmarks:

If your query returns e.g. a hundred small rows, it's not likely to make up a
significant portion of your application's run time. Streaming is likely to be
_slower_ than regular querying, but most likely the difference just won't
matter.

If your query returns _a thousand_ small rows, streaming is probably still
going to be a bit slower than regular querying, though "your mileage may vary."

If you're querying _ten thousand_ small rows, however, it becomes more likely
that streaming will speed it up. The advantage increases as the number of rows
increases.

That's for small rows, based on a test where each row consisted of just one
integer number. If your query returns larger rows, with more columns,
I find that streaming seems to become more attractive. In a simple test with 4
columns (two integers and two strings), streaming even just a thousand rows was
considerably faster than a regular query.

If your network connection to the database is slow, however, that may make
streaming a bit _less_ efficient. There is a bit more communication back and
forth between the client and the database to set up a stream. This overhead
takes a more or less constant amount of time, so for larger data sets it will
tend to become insignificant compared to the other performance costs.

Streaming data _into a table_
-----------------------------

Use `stream_to` to write data directly to a database table. This saves you
having to perform an `INSERT` for every row, and so it can be significantly
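
As a sketch of the API (assuming a hypothetical `score` table with `name` and
`points` columns, and an open transaction `tx`), writing rows might look like:

```cxx
auto stream{pqxx::stream_to::table(tx, {"score"}, {"name", "points"})};
stream.write_values("alice", 10);   // Each call writes one row.
stream.write_values("bob", 7);
stream.complete();                  // Finish the stream before committing.
tx.commit();
```
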
16 changes: 9 additions & 7 deletions include/pqxx/transaction_base.hxx
@@ -263,12 +263,12 @@ public:
* * The "stream" functions execute your query in a completely different way.
* Called _streaming queries,_ these don't support quite the full range of
* SQL queries, and they're a bit slower to start. But they are
* significantly _faster_ for queries that return larger numbers of rows.
* They don't load the entire result set, so you can start processing data
* as soon as the first row of data comes in from the database. This can
* save you a lot of time. Processing itself may also be faster. And of
* course, it also means you don't need enough memory to hold the entire
* result set, just the row you're working on.
* * The "exec" functions are a more low-level interface. Most of them
* return a pqxx::result object. This is an object that contains all
* information about the query's result: the data itself, but also the
@@ -457,14 +457,16 @@ public:
}
}

/// Execute a query, in streaming fashion; loop over the results row by row.
/** Converts the rows to `std::tuple`, of the column types you specify.
*
* Use this with a range-based "for" loop. It executes the query, and
* directly maps the resulting rows onto a `std::tuple` of the types you
* specify. Unlike with the "exec" functions, processing can start before
* all the data from the server is in.
*
* Streaming is also documented in @ref streams.
*
* The column types must all be types that have conversions from PostgreSQL's
* text format defined. Many built-in types such as `int` or `std::string`
* have pre-defined conversions; if you want to define your own conversions
