Skip to content

Database Schema

Chad Trabant edited this page Oct 11, 2023 · 36 revisions

The time series index database schema is designed to serve as a general index and summary of time series data to address multiple use cases.

One of the main uses is to efficiently search for and access data by minimizing the need to read data unnecessarily. To this end, a row in the indexing schema represents a contiguous section, aka byte range, of a file containing data from the same time series. For miniSEED this means a section of a file containing data records from the same network, station, location, channel and quality. The data section may contain out of order records (although it is better if they are in time order) or time tears (gaps or overlaps) but it will be the same time series.

Other important use cases are to build various summaries of data holdings to report data availability, synchronize holdings, etc.

PostgreSQL schema version 1.1

Field Postgres Type Description
id bigserial (bigint) A row identifier, auto-incrementing
network text Network identifier, e.g. SEED network code
station text Station identifier, e.g. SEED station code
location text Location identifier, e.g. SEED location ID
channel text Channel identifier, e.g. SEED channel codes
quality text Quality identifier, e.g. SEED quality indicator. If NULL quality is unknown.
version smallint Data publication version. If NULL version is unknown.
starttime timestamp w/TZ Earliest time for data section with microsecond resolution
endtime timestamp w/TZ Latest time for data section with microsecond resolution
samplerate1 numeric Reference sample rate in samples/second
filename text Name of file containing data
byteoffset bigint Byte offset in file to beginning of data section
bytes bigint Total number of bytes in the data section
hash text MD5 hash of data section
timeindex2 hstore A set of key-value pairs, where keys are epoch times and values are byte offsets
timespans3 numrange[] Array of epoch time value ranges, the start and stop of each segment of data in section
timerates numeric[] Array of sample rates corresponding to the timespans. If NULL the reference samplerate value is appropriate for all spans.
format4 text Identifier for data format. If NULL the format is miniSEED.
filemodtime timestamp w/TZ Modification time of the file
updated timestamp w/TZ Time of last update of the data section, changed when MD5 changes. If no existing time present, will be the same as the modification time.
scanned timestamp w/TZ Time of last scan and update to this entry.

1. About samplerate

This value is a reference, nominal sample rate for the time spans in the entry, which matches one of the spans. When this value is not appropriate the for all time spans the actual sample rates are in the timerates field.

2. About timeindex

The time index is a hash (aka unsorted list) of time values and byte offsets. The time values are stored as Unix epoch times with microsecond precision.

The time values are guaranteed to indicate the earliest byte offset at which data from that time is present in the data section. This allows software to potentially identify a smaller range of the file to read for a given data selection. An example hash index might look like:

(time) 1425531713.000000 => 4096 (byte offset)
(time) 1425536784.000999 => 8192 (byte offset)
(time) 1425542367.405000 => 24576 (byte offset)
...
                 latest => [0|1]

If a data section contains 24 hours of data (typical for an archive) but the request is only for one hour, this information can be used to narrow the byte range in a file to read for a given request.

A special member of the hash with key of latest, if present, will be set to either 0 or 1 indicating whether the index also identifies byte offsets of the latest timed data.

The guarantee that hash members indicate the earliest byte offset at which data for a specified time resides allows software to use the index to identify the best offset to start reading for a selected start time. If the index also represents latest times, the index may be used to identify the best offset to stop reading for a selected on end time.

3. About timespans

The timespans field is an array of ranges, where each range is a time span stored as epoch time values (with microsecond resolution). These spans represent the contiguous time series segments contained in the data section.

Only data records that contain time coverage (one or more samples and a non-zero sample rate) contribute to the time spans in the timespans field. For entries with no time coverage, e.g. log records, SOH records, etc., timespans will be NULL.

In some cases the starttime and endtime fields for a section can be different than the start and end times of spans in the timespans array. This can occur when there are record times with no coverage (0 samples or 0 sample rate) that are earlier or later than all spans with coverage.

4. About dataformat

The data format field contains a short identifier indicating the data format contained in the file (in the byte range for this entry). Defined data format identifiers are as follows:

  • NULL: same as ms
  • ms: miniSEED format, unspecified version
  • ms2: miniSEED format, as defined by SEED 2.x
  • sac: SAC, Seismic Analysis Code, format
  • gse2: GSE 2.x or IMS 1.x format

SQLite schema version 1.1

The fields in the SQLite flavor of the schema have the same meaning as the PostgreSQL schema, but due to limitations have different data types and no row 'id' is included. The typing and field structure for SQLite are described below.

Field SQlite Type Structure
network TEXT String
station TEXT String
location TEXT String
channel TEXT String
quality TEXT String (single character)
version INTEGER Integer
starttime TEXT String in format: 'YYYY-MM-DDTHH:MM:SS.ssssss', always UTC
endtime TEXT String in format: 'YYYY-MM-DDTHH:MM:SS.ssssss', always UTC
samplerate REAL Floating point number
filename TEXT String
byteoffset INTEGER Integer
bytes INTEGER Integer
hash TEXT String of MD5 hash
timeindex TEXT String in format: 'time1=>offset1,time2=>offset2,time3=>offset3,...,latest=>[0
timespans TEXT String in format: '[start1:end1],[start2:end2],[start3:end3],...'
timerates TEXT String in format: 'rate1,rate2,rate3,...'
format TEXT String
filemodtime TEXT String in format: 'YYYY-MM-DDTHH:MM:SS.ssssss', always UTC
updated TEXT String in format: 'YYYY-MM-DDTHH:MM:SS.ssssss', always UTC
scanned TEXT String in format: 'YYYY-MM-DDTHH:MM:SS.ssssss', always UTC

For timeindex and timespans values the time, start and end elements are in epoch time format (seconds since the Unix/POSIX epoch) with up to microsecond resolution.

For timeindex, the offset elements are offsets in bytes.

For timerates, the rate elements are floating point values.

PostgreSQL CREATE statement

A create statement and suggested indexes for PostgreSQL are below. For increased access efficiency with very large data sets, this table may also be partitioned on start time, or some other field that makes sense for the set.

CREATE TABLE tsindex
(
  id bigint PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
  network text NOT NULL,
  station text NOT NULL,
  location text NOT NULL,
  channel text NOT NULL,
  quality text,
  version smallint,
  starttime timestamp with time zone NOT NULL,
  endtime timestamp with time zone NOT NULL,
  samplerate numeric NOT NULL,
  filename text NOT NULL,
  byteoffset bigint NOT NULL,
  bytes bigint NOT NULL,
  hash text,
  timeindex hstore,
  timespans numrange[],
  timerates numeric[],
  format text,
  filemodtime timestamp with time zone NOT NULL,
  updated timestamp with time zone NOT NULL,
  scanned timestamp with time zone NOT NULL
);

(updated statement 2021-01-01)

Suggested PostgreSQL indexes

CREATE INDEX tsindex_filename_idx
  ON tsindex
  USING btree
  (filename);

CREATE INDEX tsindex_nslcq_idx
  ON tsindex
  USING btree
  (network, station, location, channel, quality);

CREATE INDEX tsindex_starttime_idx
  ON tsindex
  USING btree
  (starttime);

CREATE INDEX tsindex_endtime_idx
  ON tsindex
  USING btree
  (endtime);

CREATE INDEX tsindex_updated_idx
  ON tsindex
  USING btree
  (updated);

Suggested time series summary table

Some software packages that use a time series index database benefit greatly from a summary table. The suggested summary table can be (re-)created with these statements:

DROP TABLE IF EXISTS tsindex_summary;
CREATE TABLE tsindex_summary AS
  SELECT network,station,location,channel,
  min(starttime) AS earliest, max(endtime) AS latest, datetime('now') as updt
  FROM tsindex
  GROUP BY 1,2,3,4;

Alternatively, a materialized view with the same definition is a simple way to maintain the summary for repositories that change over time.

Schema version changes

Version 1.0 to 1.1

  • Change id from serial(int) to bigserial(bigint) for scalability.
  • Change fields network, station, location, channel, quality, filename, format and hash from varchar() to text, effectively removing length limit.
  • Change fields samplerate and timerates from numeric(10,6) to numeric, effectively removing lengths.
  • Change fields byteoffset and bytes from numeric(15,0) to bigint for scalability, efficiency and reduced size.
  • Add version (smallint) field.