Database Schema

The time series index database schema is designed to serve as a general index and summary of time series data to address multiple use cases.

One of the main uses is to efficiently search for and access data by minimizing the need to read data unnecessarily. To this end, a row in the indexing schema represents a contiguous section, aka byte range, of a file containing data from the same time series. For miniSEED this means a section of a file containing data records from the same network, station, location, channel and quality. The data section may contain out of order records (although it is better if they are in time order) or time tears (gaps or overlaps) but it will be the same time series.

Other important use cases are to build various summaries of data holdings to report data availability, synchronize holdings, etc.

PostgreSQL schema version 1.1
SQLite schema version 1.1
PostgreSQL CREATE statement
Suggested PostgreSQL indexes
Suggested time series summary table

PostgreSQL schema version 1.1

Field	Postgres Type	Description
`id`	bigserial (bigint)	A row identifier, auto-incrementing
`network`	text	Network identifier, e.g. SEED network code
`station`	text	Station identifier, e.g. SEED station code
`location`	text	Location identifier, e.g. SEED location ID
`channel`	text	Channel identifier, e.g. SEED channel codes
`quality`	text	Quality identifier, e.g. SEED quality indicator. If NULL quality is unknown.
`version`	smallint	Data publication version. If NULL version is unknown.
`starttime`	timestamp w/TZ	Earliest time for data section with microsecond resolution
`endtime`	timestamp w/TZ	Latest time for data section with microsecond resolution
`samplerate`¹	numeric	Reference sample rate in samples/second
`filename`	text	Name of file containing data
`byteoffset`	bigint	Byte offset in file to beginning of data section
`bytes`	bigint	Total number of bytes in the data section
`hash`	text	MD5 hash of data section
`timeindex`²	hstore	A set of key-value pairs, where keys are epoch times and values are byte offsets
`timespans`³	numrange[]	Array of epoch time value ranges, the start and stop of each segment of data in section
`timerates`	numeric[]	Array of sample rates corresponding to the `timespans`. If NULL the reference samplerate value is appropriate for all spans.
`format`⁴	text	Identifier for data format. If NULL the format is miniSEED.
`filemodtime`	timestamp w/TZ	Modification time of the file
`updated`	timestamp w/TZ	Time of last update of the data section, changed when MD5 changes. If no existing time present, will be the same as the modification time.
`scanned`	timestamp w/TZ	Time of last scan and update to this entry.

1. About `samplerate`

This value is a reference, nominal sample rate for the time spans in the entry, which matches one of the spans. When this value is not appropriate the for all time spans the actual sample rates are in the timerates field.

2. About `timeindex`

The time index is a hash (aka unsorted list) of time values and byte offsets. The time values are stored as Unix epoch times with microsecond precision.

The time values are guaranteed to indicate the earliest byte offset at which data from that time is present in the data section. This allows software to potentially identify a smaller range of the file to read for a given data selection. An example hash index might look like:

(time) 1425531713.000000 => 4096 (byte offset)
(time) 1425536784.000999 => 8192 (byte offset)
(time) 1425542367.405000 => 24576 (byte offset)
...
                 latest => [0|1]

If a data section contains 24 hours of data (typical for an archive) but the request is only for one hour, this information can be used to narrow the byte range in a file to read for a given request.

A special member of the hash with key of latest, if present, will be set to either 0 or 1 indicating whether the index also identifies byte offsets of the latest timed data.

The guarantee that hash members indicate the earliest byte offset at which data for a specified time resides allows software to use the index to identify the best offset to start reading for a selected start time. If the index also represents latest times, the index may be used to identify the best offset to stop reading for a selected on end time.

3. About `timespans`

The timespans field is an array of ranges, where each range is a time span stored as epoch time values (with microsecond resolution). These spans represent the contiguous time series segments contained in the data section.

Only data records that contain time coverage (one or more samples and a non-zero sample rate) contribute to the time spans in the timespans field. For entries with no time coverage, e.g. log records, SOH records, etc., timespans will be NULL.

In some cases the starttime and endtime fields for a section can be different than the start and end times of spans in the timespans array. This can occur when there are record times with no coverage (0 samples or 0 sample rate) that are earlier or later than all spans with coverage.

4. About `dataformat`

The data format field contains a short identifier indicating the data format contained in the file (in the byte range for this entry). Defined data format identifiers are as follows:

NULL: same as ms
ms: miniSEED format, unspecified version
ms2: miniSEED format, as defined by SEED 2.x
sac: SAC, Seismic Analysis Code, format
gse2: GSE 2.x or IMS 1.x format

SQLite schema version 1.1

The fields in the SQLite flavor of the schema have the same meaning as the PostgreSQL schema, but due to limitations have different data types and no row 'id' is included. The typing and field structure for SQLite are described below.

Field	SQlite Type	Structure
`network`	TEXT	String
`station`	TEXT	String
`location`	TEXT	String
`channel`	TEXT	String
`quality`	TEXT	String (single character)
`version`	INTEGER	Integer
`starttime`	TEXT	String in format: 'YYYY-MM-DDTHH:MM:SS.ssssss', always UTC
`endtime`	TEXT	String in format: 'YYYY-MM-DDTHH:MM:SS.ssssss', always UTC
`samplerate`	REAL	Floating point number
`filename`	TEXT	String
`byteoffset`	INTEGER	Integer
`bytes`	INTEGER	Integer
`hash`	TEXT	String of MD5 hash
`timeindex`	TEXT	String in format: 'time1=>offset1,time2=>offset2,time3=>offset3,...,latest=>[0
`timespans`	TEXT	String in format: '[start1:end1],[start2:end2],[start3:end3],...'
`timerates`	TEXT	String in format: 'rate1,rate2,rate3,...'
`format`	TEXT	String
`filemodtime`	TEXT	String in format: 'YYYY-MM-DDTHH:MM:SS.ssssss', always UTC
`updated`	TEXT	String in format: 'YYYY-MM-DDTHH:MM:SS.ssssss', always UTC
`scanned`	TEXT	String in format: 'YYYY-MM-DDTHH:MM:SS.ssssss', always UTC

For timeindex and timespans values the time, start and end elements are in epoch time format (seconds since the Unix/POSIX epoch) with up to microsecond resolution.

For timeindex, the offset elements are offsets in bytes.

For timerates, the rate elements are floating point values.

PostgreSQL CREATE statement

A create statement and suggested indexes for PostgreSQL are below. For increased access efficiency with very large data sets, this table may also be partitioned on start time, or some other field that makes sense for the set.

CREATE TABLE tsindex
(
  id bigint PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
  network text NOT NULL,
  station text NOT NULL,
  location text NOT NULL,
  channel text NOT NULL,
  quality text,
  version smallint,
  starttime timestamp with time zone NOT NULL,
  endtime timestamp with time zone NOT NULL,
  samplerate numeric NOT NULL,
  filename text NOT NULL,
  byteoffset bigint NOT NULL,
  bytes bigint NOT NULL,
  hash text,
  timeindex hstore,
  timespans numrange[],
  timerates numeric[],
  format text,
  filemodtime timestamp with time zone NOT NULL,
  updated timestamp with time zone NOT NULL,
  scanned timestamp with time zone NOT NULL
);

(updated statement 2021-01-01)

Suggested time series summary table

Some software packages that use a time series index database benefit greatly from a summary table. The suggested summary table can be (re-)created with these statements:

DROP TABLE IF EXISTS tsindex_summary;
CREATE TABLE tsindex_summary AS
  SELECT network,station,location,channel,
  min(starttime) AS earliest, max(endtime) AS latest, datetime('now') as updt
  FROM tsindex
  GROUP BY 1,2,3,4;

Alternatively, a materialized view with the same definition is a simple way to maintain the summary for repositories that change over time.

Schema version changes

Version 1.0 to 1.1

Change id from serial(int) to bigserial(bigint) for scalability.
Change fields network, station, location, channel, quality, filename, format and hash from varchar() to text, effectively removing length limit.
Change fields samplerate and timerates from numeric(10,6) to numeric, effectively removing lengths.
Change fields byteoffset and bytes from numeric(15,0) to bigint for scalability, efficiency and reduced size.
Add version (smallint) field.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Database Schema

PostgreSQL schema version 1.1

1. About `samplerate`

2. About `timeindex`

3. About `timespans`

4. About `dataformat`

SQLite schema version 1.1

PostgreSQL CREATE statement

Suggested PostgreSQL indexes

Suggested time series summary table

Schema version changes

Version 1.0 to 1.1

Clone this wiki locally

Database Schema

PostgreSQL schema version 1.1

1. About samplerate

2. About timeindex

3. About timespans

4. About dataformat

SQLite schema version 1.1

PostgreSQL CREATE statement

Suggested PostgreSQL indexes

Suggested time series summary table

Schema version changes

Version 1.0 to 1.1

Clone this wiki locally

1. About `samplerate`

2. About `timeindex`

3. About `timespans`

4. About `dataformat`