Investigate pagination support #50
We need to determine whether pagination support is required for TAXII. We had item-based pagination support in TAXII 2.0 and took it out for TAXII 2.1. The problem we ran into was that data sets that change rapidly make item-based pagination impossible. Further, item-based pagination proved to be very computationally expensive for large data sets.
There is a use case where a system may add millions of records to the TAXII server in a single database transaction. In situations like this, every record in that transaction may be recorded with the same "date added" value, which means that filtering or pagination based on date added would not be possible.
There needs to be some sort of solution that lets a client tell the server, based on some monotonically increasing counter, where to start, and have the server return records either before or after that point.
The success criteria for this feature are the ability to handle rapidly changing datasets and very large datasets, and to provide a performant solution.
The endpoints that need pagination are:
The Object by ID resource can contain a significant number of object versions, which become unwieldy to manage in a single request/response pair. Without a mechanism to manage highly-versioned objects, effective transport is significantly limited.
Date added and IDs pose a particular issue for STIX Objects. There can be multiple versions of an object per ID/date added. There has to be a way to iterate over objects when the number of versions associated with a particular object ID exceeds the item limit of a client or server. For STIX objects, a concatenation of the object
Iterating Across a Collection of Response Items
The goal of this implementation is to:
The server can only indicate that more results may be available. The server cannot guarantee how many, if any, additional items will be available on the next request. In the case where:
The server will respond that more Collections may be available for the client. The client retrieves the first 10 Collections from the server. However, the 11th Collection could be deleted before the client makes an additional request. The response to the second request could return 1 object (no change), 10 more objects or none.
The maximum number of items returned in a response is the lesser of the client-specified and server-specified limits. If the client indicates it can receive up to 20 items and the server allows 15, a maximum of 15 items is returned.
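The limit rule above can be sketched in a few lines. This is only an illustration: the function name `effective_limit` is hypothetical, and falling back to the server limit when the client sends no limit is an assumption, not something stated in this thread.

```python
from typing import Optional


def effective_limit(client_limit: Optional[int], server_limit: int) -> int:
    """Return the maximum number of items a single response may contain."""
    # Assumption: a client that sends no limit accepts the server's limit.
    if client_limit is None:
        return server_limit
    # Otherwise the lesser of the two limits wins.
    return min(client_limit, server_limit)


# The example from the text: client accepts 20, server allows 15.
limit = effective_limit(20, 15)
```

With the values from the text, `limit` comes out as 15.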
Manage requests for additional endpoint items using Hypermedia As The Engine Of Application State (HATEOAS). Rather than specify how the client requests the additional data, the server provides a hyperlink (URL) to retrieve the data. For example:
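A minimal client-side sketch of following server-provided links, under stated assumptions: the response shape (an `items` list plus an optional `next` URL), the URL strings, and the `fetch` callable are all hypothetical and stand in for a real HTTP call. The point is that the client never builds the follow-on URL itself.

```python
from typing import Callable, Dict, Iterator, Optional


def iterate_items(first_url: str, fetch: Callable[[str], Dict]) -> Iterator[Dict]:
    """Yield every item by following the server's opaque `next` links."""
    url: Optional[str] = first_url
    while url:
        page = fetch(url)
        yield from page.get("items", [])
        # The client treats `next` as opaque; a missing link ends iteration.
        url = page.get("next")


# A stand-in for a server: maps each (hypothetical) URL to its response.
FAKE_PAGES = {
    "/collections/?limit=2": {
        "items": [{"id": "a"}, {"id": "b"}],
        "next": "/collections/?limit=2&after=b",
    },
    "/collections/?limit=2&after=b": {"items": [{"id": "c"}]},
}

ids = [item["id"] for item in iterate_items("/collections/?limit=2", FAKE_PAGES.__getitem__)]
```

Because the link is opaque, the server is free to encode whatever key it needs (a Collection ID, a version key, an integer row ID) without the client changing.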
Why not specify parameters?
Collections, Manifests and Objects, the paginated endpoints, identify unique items differently. Collections are simple, where a unique item can be identified by the Collection ID. In TAXII servers that support object versioning, there is no single property that identifies a unique item on the Objects endpoint, because items are versions of objects. With versioning support, the Objects endpoint can uniquely identify items by concatenating the object
By allowing the server to provide an opaque URL for (possible) additional items:
If we were to standardize an
A 74-character string is a verbose way to identify an item.
Problems with this Solution
For server implementors with SQL databases, the initial Collections query with a 10 item limit would look like:
SELECT * FROM collections ORDER BY id LIMIT 10
The follow-on query derived from:
SELECT * FROM collections WHERE id > '52892447-4d7e-4f70-b94d-d7f22742ff63' ORDER BY id LIMIT 10
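The two Collections queries above can be exercised end to end with an in-memory SQLite database. This is a sketch, not a server implementation: the table layout, the `title` column, and the `fetch_page` helper are assumptions made for illustration.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE collections (id TEXT PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO collections VALUES (?, ?)",
    [(str(uuid.uuid4()), f"Collection {n}") for n in range(25)],
)


def fetch_page(conn, after_id=None, limit=10):
    """Return up to `limit` rows ordered by id, starting after `after_id`."""
    if after_id is None:
        # Initial query: no starting point yet.
        sql = "SELECT id, title FROM collections ORDER BY id LIMIT ?"
        return conn.execute(sql, (limit,)).fetchall()
    # Follow-on query: resume strictly after the last id already returned.
    sql = "SELECT id, title FROM collections WHERE id > ? ORDER BY id LIMIT ?"
    return conn.execute(sql, (after_id, limit)).fetchall()


seen = []
after = None
while True:
    rows = fetch_page(conn, after)
    if not rows:
        break
    seen.extend(row[0] for row in rows)
    after = rows[-1][0]  # the last id on this page becomes the next starting point
```

Each follow-on query seeks directly to the last ID seen rather than counting past skipped rows, which is what keeps this approach cheap on large tables with an index on `id`.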
If you store versions inside an
SELECT * FROM objects ORDER BY date_added, id LIMIT 10
SELECT * FROM objects WHERE CONCAT(CONCAT(id, '--'),modified) > 'indicator--29aba82c-5393-42a8-9edb-6a2cb1df070b--2016-11-01T03:04:05.634Z' ORDER BY date_added, CONCAT(CONCAT(id, '--'),modified) LIMIT 10
The date added has to be in the query above because the 2.0 spec requires objects come back in "date added" order. This is kind of weird, because the versions themselves may have been added later.
If you store object versions in their own table and they have their own database-maintained integer primary key (
SELECT * FROM versions WHERE id > 8675309 ORDER BY id LIMIT 10
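A sketch of the integer-key approach, again using in-memory SQLite. The `versions` table layout, the placeholder object IDs, and the `next_page` helper are assumptions for illustration. The rows are inserted in one transaction with an identical date added value to show that the integer key still gives every row a distinct position.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE versions ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "object_id TEXT, modified TEXT, date_added TEXT)"
)

# One bulk transaction; every row shares the same date_added value.
same_time = "2016-11-01T03:04:05.634Z"
with conn:
    conn.executemany(
        "INSERT INTO versions (object_id, modified, date_added) VALUES (?, ?, ?)",
        # Placeholder IDs, not valid STIX identifiers.
        [(f"indicator--{n:04d}", same_time, same_time) for n in range(35)],
    )


def next_page(conn, after_id=0, limit=10):
    """Return up to `limit` versions whose integer id is greater than `after_id`."""
    sql = "SELECT id, object_id FROM versions WHERE id > ? ORDER BY id LIMIT ?"
    return conn.execute(sql, (after_id, limit)).fetchall()


page_sizes = []
last_id = 0
while True:
    rows = next_page(conn, last_id)
    if not rows:
        break
    page_sizes.append(len(rows))
    last_id = rows[-1][0]
```

The 35 rows come back as pages of 10, 10, 10, and 5, even though a date added filter could not tell any of them apart.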
If your TAXII server doesn't support versions, you can just use Object ids without having to accept a
SELECT * FROM objects WHERE id > 'indicator--29aba82c-5393-42a8-9edb-6a2cb1df070b' ORDER BY date_added, id LIMIT 10
For NoSQL databases, you may add a custom
NineFX will implement this on the server side if folks agree it's worth testing.
We have added support for pagination in TAXII 2.1. This does not solve the problem of a system adding a million records in a single database transaction where that transaction uses the same date added value for each entry (and the TAXII server limits the number of results per page to something much less than 1 million records). But this may just be an implementation-specific problem that an individual vendor would need to solve.