Objectify Insight

Insight into your Google App Engine datastore costs.

This library provides insight into your high-volume GAE datastore activity. It records read and write activity broken down by time, namespace, module, version, kind, operation, and query and aggregates this data into Google BigQuery. By aggregating at multiple levels, Insight scales to thousands of requests per second.

Insight works well with Google App Engine applications that use Objectify, but (with some limitations) it can work with any application that uses the low level datastore API.

Insight is a metrics collection system. It flows aggregated data into BigQuery in a format that should be useful to developers and system administrators. It does not provide a query interface to BigQuery.



Insight has several moving parts:

  • A facade of the low-level AsyncDatastoreService which aggregates metrics in instance memory and periodically flushes them to a pull queue.
  • A task, called via cron (every minute), which aggregates pull queue tasks and pushes the aggregations into BigQuery.
  • A task, called via cron (infrequently), which ensures that the appropriate BigQuery tables exist.

The resulting BigQuery table data will look something like this:

| uploaded                | codepoint                        | namespace  | module   | version | kind   | op     | query                          | time                    | reads | writes |
| ----------------------- | -------------------------------- | ---------- | -------- | ------- | ------ | ------ | ------------------------------ | ----------------------- | ----- | ------ |
| 2014-09-15 04:58:40 UTC | d41d8cd98f00b204e9800998ecf8427e | namespace2 | default  | v1      | Thing1 | QUERY  | SELECT * FROM Thing1 WHERE ... | 2014-09-15 04:58:40 UTC | 4     | 0      |	 
| 2014-09-15 04:58:40 UTC | 9e107d9d372bb6826bd81d3542a419d6 | namespace1 | deferred | v1      | Thing2 | DELETE |                                | 2014-09-15 04:58:40 UTC | 0     | 1      |	 
| 2014-09-15 04:58:40 UTC | e4d909c290d0fb1ca068ffaddf22cbd0 | namespace1 | default  | v2      | Thing1 | SAVE   |                                | 2014-09-15 04:58:40 UTC | 0     | 1      |

If you've ever seen a ROLAP database, this should look familiar. codepoint, namespace, module, version, kind, op, query, and time are dimensions; reads and writes are the aggregated statistics.

uploaded is the date that the batch was uploaded to BigQuery. time is the actual date of the operation, rounded to a configurable boundary (default 1 minute) to allow for reasonable aggregation.

reads and writes are entity counts, not operation counts.

codepoint is the md5 hash of a stacktrace to the unique point in your code where the datastore operation took place. To look up the actual stacktrace, grep your App Engine logs for the hash value. Each instance will log the definition of each codepoint exactly once. Enable INFO logging at com.googlecode.objectify.insight.
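Conceptually, the codepoint works like the following sketch: hash a canonicalized stacktrace of the call site into a stable 32-character hex id. This is an illustration only; the exact canonicalization Insight performs may differ.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class CodepointExample {
	// Illustration: derive a stable "codepoint" id by md5-hashing the
	// stacktrace of the call site. Two calls from the same line of code
	// produce the same hash; a different call site produces a different one.
	static String codepoint(StackTraceElement[] stack) throws Exception {
		StringBuilder sb = new StringBuilder();
		for (StackTraceElement frame : stack)
			sb.append(frame.toString()).append('\n');
		byte[] digest = MessageDigest.getInstance("MD5")
				.digest(sb.toString().getBytes(StandardCharsets.UTF_8));
		StringBuilder hex = new StringBuilder();
		for (byte b : digest)
			hex.append(String.format("%02x", b));
		return hex.toString();
	}

	public static void main(String[] args) throws Exception {
		System.out.println(codepoint(new Throwable().getStackTrace()));
	}
}
```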


Guice is not required to use Insight, but it helps. This documentation assumes you will use Guice.

Set up Queue

Add a pull queue named "insight" to your queue.xml:
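Using the standard App Engine pull-queue syntax:

```xml
<queue-entries>
	<queue>
		<name>insight</name>
		<mode>pull</mode>
	</queue>
</queue-entries>
```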


Set up Cron

Add two entries to your cron.xml (the <url> paths here are examples; use whichever paths you map to the servlets below):

```xml
<cronentries>
	<cron>
		<url>/private/makeTables</url>
		<description>Make sure we have enough tables for a week</description>
		<schedule>every 8 hours</schedule>
	</cron>
	<cron>
		<url>/private/pull</url>
		<description>Move all data to BQ</description>
		<schedule>every 1 minutes</schedule>
	</cron>
</cronentries>
```

Enable the servlets

In your Guice ServletModule, serve the paths you specified in cron.xml above with the relevant servlets:
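A minimal sketch; the servlet class names and URL paths below are assumptions (check the actual class names in the insight package, and match the paths to the URLs in your cron.xml):

```java
// Sketch only: TableMakerServlet and PullerServlet are assumed names,
// and /private/* paths are examples matching your cron.xml entries.
public class InsightServletModule extends ServletModule {
	@Override
	protected void configureServlets() {
		serve("/private/makeTables").with(TableMakerServlet.class);
		serve("/private/pull").with(PullerServlet.class);
	}
}
```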


You will likely want to secure these servlets by using the standard security features in GAE:
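For example, a web.xml security-constraint restricting the (assumed) /private/* paths to administrators; App Engine cron requests are allowed through admin restrictions:

```xml
<security-constraint>
	<web-resource-collection>
		<web-resource-name>insight</web-resource-name>
		<url-pattern>/private/*</url-pattern>
	</web-resource-collection>
	<auth-constraint>
		<role-name>admin</role-name>
	</auth-constraint>
</security-constraint>
```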

It is not dangerous to expose these endpoints to the public, but they are not for human consumption.

Servlets without Guice

If you are not using Guice (or another JSR-330 compatible DI framework), extend the AbstractTableMakerServlet and AbstractPullerServlet classes. They offer a poor man's DI system.

Get an AsyncDatastoreService

Insight is implemented as a wrapper to the GAE low-level API AsyncDatastoreService class. The InsightAsyncDatastoreService itself is constructed by passing in the raw AsyncDatastoreService you get from Google, plus the Insight Recorder. The Recorder requires the Collector and BucketFactory... etc. Guice (or any other JSR-330 compatible DI framework) makes this much more convenient and all pretty much automatic.

Here is the minimum Guice configuration you would need to be able to get the Recorder out of the injector. That is, we want this to work:

```java
AsyncDatastoreService raw = DatastoreServiceFactory.getAsyncDatastoreService();
Recorder recorder = injector.getInstance(Recorder.class);
AsyncDatastoreService tracksMetrics = new InsightAsyncDatastoreService(raw, recorder);
```

These are the bindings you will need to create in your Guice module (the module structure and annotations here are a reconstruction and may differ slightly from the original example code):

```java
public class InsightModule extends AbstractModule {
	@Override
	protected void configure() {}

	@Provides
	@Singleton
	Bigquery bigquery() {
		// your complicated code to generate an authenticated connection here
	}

	/** The bigquery project and dataset ids where you will write data */
	@Provides
	@Singleton
	InsightDataset insightDataset() {
		return new InsightDataset() {
			@Override
			public String projectId() {
				return "objectify-insight-test";
			}

			@Override
			public String datasetId() {
				return "insight_example";
			}
		};
	}

	/** There must be a Queue bound with the name "insight" */
	@Provides
	public Queue queue() {
		return QueueFactory.getQueue(Flusher.DEFAULT_QUEUE);
	}
}
```

Creating an authenticated instance of Bigquery is outside the scope of this document. If you make it injectable, Guice will inject it into Insight. Insight also needs to know the project and dataset ids for BigQuery, and the pull queue that will be used for aggregation.

Decide what to record

By default, Insight ignores everything. You can tell the Recorder to record specific kinds or to record everything. Recorder is a singleton; this configuration only needs to happen once:

```java
Recorder recorder = injector.getInstance(Recorder.class);

// You can specify kinds individually
// (method name assumed here; check the Recorder javadocs)
recorder.record("Thing1");

// If true, all kinds will be recorded
recorder.setRecordAll(true);
```
You can also disable the recording of codepoint hashes entirely; see the Recorder javadocs for the relevant setter.


Use Insight with Objectify

Assuming you have injected the Recorder into your ObjectifyFactory, override the two methods described below.


The ObjectifyFactory uses an overridable method to obtain the low-level AsyncDatastoreService interface. Override this method and return your wrapper InsightAsyncDatastoreService:

```java
@Override
protected AsyncDatastoreService createRawAsyncDatastoreService(DatastoreServiceConfig cfg) {
	AsyncDatastoreService raw = super.createRawAsyncDatastoreService(cfg);
	return new InsightAsyncDatastoreService(raw, recorder);
}
```


Overriding register() allows you to use the @Collect annotation on POJO entity classes to enable recording, as an alternative to registering kinds one at a time by hand.

```java
@Override
public <T> void register(Class<T> clazz) {
	super.register(clazz);

	if (clazz.isAnnotationPresent(Collect.class))
		recorder.record(Key.getKind(clazz));	// recorder method name assumed; check the javadocs
}
```
This override can be skipped if you use Recorder.setRecordAll(true).


Insight has tunable parameters spread across several singleton objects in the object graph. You can inject/fetch them from Guice and reset values, or (if you aren't using Guice) set them as you construct the object graph manually.

Broken down by object:


```java
Collector collector = injector.getInstance(Collector.class);
collector.setAgeThresholdMillis(1000 * 30);
```

The Collector is responsible for aggregating metrics and periodically flushing aggregations to the Flusher. Flushing occurs when the number of separate aggregations exceeds a threshold, or the oldest bucket hits an age threshold.

Note that age-threshold flushing occurs within the context of the next collection request; Insight does not create extra threads in your application.


```java
Clock clock = injector.getInstance(Clock.class);
clock.setGranularityMillis(1000 * 600);
```

Most requests come in at fairly unique millisecond clock values. In order to get meaningful aggregation, we must 'round' clock values to something more granular. Coarser (higher) numbers provide better aggregation at the cost of less precisely knowing when activities happen.
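The rounding itself is simple truncation; a sketch of the idea (not Insight's actual code):

```java
public class GranularityExample {
	// Truncate a millisecond timestamp down to the nearest granularity
	// boundary, so nearby operations aggregate into the same bucket.
	static long round(long timestampMillis, long granularityMillis) {
		return (timestampMillis / granularityMillis) * granularityMillis;
	}

	public static void main(String[] args) {
		long oneMinute = 60_000L;
		// 2m 5.999s after the epoch rounds down to the 2-minute boundary
		System.out.println(round(125_999L, oneMinute)); // prints 120000
	}
}
```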


```java
TablePicker picker = injector.getInstance(TablePicker.class);
picker.setFormat(new SimpleDateFormat("'myprefix_'yyMMdd"));
```

You can change the format of table names; be sure to include any prefix as a constant in the DateFormat.
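As a reminder of how SimpleDateFormat handles literal text: anything inside single quotes passes through unchanged, so the prefix stays constant while the date part varies. For example:

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.TimeZone;

public class TableNameExample {
	public static void main(String[] args) {
		// Quoted text in the pattern is a literal; yyMMdd is the date part.
		SimpleDateFormat fmt = new SimpleDateFormat("'myprefix_'yyMMdd");
		fmt.setTimeZone(TimeZone.getTimeZone("UTC"));

		Calendar cal = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
		cal.set(2014, Calendar.SEPTEMBER, 15);
		System.out.println(fmt.format(cal.getTime())); // prints "myprefix_140915"
	}
}
```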


```java
Puller puller = injector.getInstance(Puller.class);
puller.setBatchSize(10);	// setter name assumed; check the Puller javadocs
```

The Puller pulls batches of data off of the pull queue and pushes them to BigQuery. Since BigQuery limits how large a single request can be, you might need to adjust the batch size. The default is 20; if you get "request too large" errors, adjust it down.



As shown in the previous section, codepoint generation can be disabled completely.

Even if you decide to keep it enabled, you can tweak it further by replacing the StackProducer. For example:

```java
Codepointer codepointer = injector.getInstance(Codepointer.class);
codepointer.setStackProducer(new AdvancedStackProducer());
```

This lets you control what is included in the stacktrace used to generate the codepoint: irrelevant platform classes, filters, and servlets can be stripped; package names can be abbreviated; and so on.

This affects the stacktrace dump logged once per instance, and it can also ensure that a refactor (or the addition of a new filter) doesn't change every codepoint you have.


FilteringStackProducer

Uses the pre-1.0.5 behaviour, which keeps the stacktrace intact and only removes the mutable parts from generated classes' names.


AdvancedStackProducer

Removes every stacktrace element that is irrelevant. The resulting stacktrace should contain only the lines that matter to your business logic.

For example, it removes platform-specific servlets/filters, proxy/reflection-related classes, Guice injection classes, endpoints-java classes, gwt-rpc classes, objectify/objectify-insight classes, etc.

You can subclass FilteringStackProducer (the superclass of AdvancedStackProducer) and override filterStack(Iterable<StackTraceElement> stack) yourself, but you will most likely get the best results by using AdvancedStackProducer directly, or by extending it if you need custom filtering.


Limitations

Insight tracks all of the datastore operations used by Objectify, but does not track every operation you can possibly perform in the low-level API. In particular, it is possible to trigger read operations on List objects in such a way that Insight cannot determine statistics without potentially impacting the performance of your application. For example:

```java
PreparedQuery pq = ds.prepare(query);
List<Entity> entities = pq.asList(fetchOpts);
int size = entities.size();
```

Insight doesn't know what to do with this without explicitly iterating through the List.

As long as you iterate through results in the low-level API at least once, Insight will track statistics. Note that if you use Objectify, this limitation does not apply; Objectify always iterates the original List.


If you have questions, ask on the Objectify Google Group.


Released under the MIT License.


Huge thanks to BetterCloud for funding this project!