# Storing and retrieving time series data

## Time Series Storage

Time series data is a collection of observations made over time. It is used in applications such as financial analysis, environmental monitoring, and industrial control systems. The data is typically collected at regular intervals, and each observation carries a timestamp that records when it was made. How this data is stored and managed is a critical design decision for any application that relies on it.

## Storage Formats

Time series data is usually stored in a tabular format, where each row corresponds to an observation, and each column represents a variable or attribute of that observation. The first column of the table is typically a timestamp column, which stores the time at which each observation was recorded.

### CSV
One common file format for storing time series data is CSV (Comma-Separated Values). A CSV file is a plain text file that contains a table of data, with rows separated by newline characters and columns separated by commas. Closely related variants use a different delimiter, such as a tab (TSV) or a space. CSV files are simple and widely supported, but every value is stored as text, so the files tend to be larger and slower to parse than binary formats. This makes CSV a poor choice for large amounts of time series data.

Example of a CSV file storing a time series:
```csv
timestamp,value
2020-01-01 00:00:00,0.0
2020-01-01 00:01:00,0.1
2020-01-01 00:02:00,0.2
2020-01-01 00:03:00,0.3
2020-01-01 00:04:00,0.4
```
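
A file like this can be read with Python's standard `csv` module. The sketch below is only illustrative: it embeds a few sample rows as a string instead of reading a real file, it assumes the `YYYY-MM-DD HH:MM:SS` timestamp layout shown above, and the function name `read_series` is made up for this example.

```python
import csv
import io
from datetime import datetime

# Sample rows embedded as text; a real program would open a file instead.
CSV_TEXT = """timestamp,value
2020-01-01 00:00:00,0.0
2020-01-01 00:01:00,0.1
2020-01-01 00:02:00,0.2
"""

def read_series(text):
    """Parse CSV text into a list of (datetime, float) pairs."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        (datetime.strptime(row["timestamp"], "%Y-%m-%d %H:%M:%S"), float(row["value"]))
        for row in reader
    ]

series = read_series(CSV_TEXT)
for ts, value in series:
    print(ts.isoformat(), value)
```

Converting the timestamp and value columns while reading matters because CSV stores everything as text; without the conversion, comparisons and arithmetic would operate on strings.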


### XML
Another file format used for storing time series data is XML (Extensible Markup Language). XML is a text-based format for structured data. A well-formed XML document has exactly one root element, which wraps all the observations. XML files are human-readable and widely supported, but the opening and closing tags repeated around every value make them verbose, so XML is rarely the most efficient format for large amounts of time series data.

Example of an XML file storing a time series:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<timeseries>
    <observation>
        <timestamp>2020-01-01 00:00:00</timestamp>
        <value>0.0</value>
    </observation>
    <observation>
        <timestamp>2020-01-01 00:01:00</timestamp>
        <value>0.1</value>
    </observation>
    <observation>
        <timestamp>2020-01-01 00:02:00</timestamp>
        <value>0.2</value>
    </observation>
    <observation>
        <timestamp>2020-01-01 00:03:00</timestamp>
        <value>0.3</value>
    </observation>
</timeseries>
```
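
As a rough sketch, such a document could be parsed with Python's standard `xml.etree.ElementTree` module. The example assumes that the observations are wrapped in a single root element, which a well-formed XML document requires. The element names `timeseries` and `observation` are illustrative choices, not a standard schema.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Assumed layout: one <observation> per sample under a single <timeseries> root.
XML_TEXT = """<timeseries>
    <observation>
        <timestamp>2020-01-01 00:00:00</timestamp>
        <value>0.0</value>
    </observation>
    <observation>
        <timestamp>2020-01-01 00:01:00</timestamp>
        <value>0.1</value>
    </observation>
</timeseries>"""

root = ET.fromstring(XML_TEXT)

series = []
for obs in root.findall("observation"):
    # findtext returns the text content of the named child element
    ts = datetime.strptime(obs.findtext("timestamp"), "%Y-%m-%d %H:%M:%S")
    value = float(obs.findtext("value"))
    series.append((ts, value))

print(series)
```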


### Excel
Time series data is also often kept in Excel workbooks. The current `.xlsx` format is a ZIP archive of XML files (the older `.xls` format is binary), so the files cannot be read in a plain text editor. Excel files are widely supported and convenient for small datasets, but a worksheet is limited to 1,048,576 rows, which makes the format a poor fit for large amounts of time series data.


### JSON
Another file format used for storing time series data is JSON (JavaScript Object Notation). JSON is a text-based format for structured data. JSON files are human-readable and widely supported, but the key names are repeated in every record, so JSON is rarely the most efficient format for large amounts of time series data.

Example of a JSON file storing a time series:
```json
[
    {
        "timestamp": "2020-01-01 00:00:00",
        "value": 0.0
    },
    {
        "timestamp": "2020-01-01 00:01:00",
        "value": 0.1
    },
    {
        "timestamp": "2020-01-01 00:02:00",
        "value": 0.2
    },
    {
        "timestamp": "2020-01-01 00:03:00",
        "value": 0.3
    }
]
```
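
A minimal sketch of reading and writing this layout with Python's standard `json` module follows. JSON has no native date type, so the timestamps travel as strings and are converted explicitly. The sample records and variable names are illustrative.

```python
import json
from datetime import datetime

JSON_TEXT = """[
    {"timestamp": "2020-01-01 00:00:00", "value": 0.0},
    {"timestamp": "2020-01-01 00:01:00", "value": 0.1}
]"""

# Decode: JSON numbers become floats, timestamps must be parsed by hand.
records = json.loads(JSON_TEXT)
series = [(datetime.fromisoformat(r["timestamp"]), r["value"]) for r in records]

# Encode: convert the datetime objects back to strings before dumping.
encoded = json.dumps(
    [{"timestamp": ts.strftime("%Y-%m-%d %H:%M:%S"), "value": v} for ts, v in series],
    indent=4,
)
print(encoded)
```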

### Parquet
Parquet is a columnar storage format that is well suited to time series data. Parquet files are binary files in which the values of each column are stored together, which allows effective compression and lets a reader load only the columns it needs. Parquet is efficient for storing large amounts of time series data, but the files are not human-readable and a dedicated library is needed to read and write them. See the [Apache Parquet website](https://parquet.apache.org/) for details.


### HDF5
Another file format used for storing time series data is the HDF5 (Hierarchical Data Format version 5) file format. HDF5 is a binary file format that supports the storage of large and complex data sets, including time series data. HDF5 files can store metadata, data attributes, and data types, and are designed for efficient storage and retrieval of large datasets.

The structure of an HDF5 file is hierarchical, built from groups and datasets. Groups organize datasets into a hierarchy, and datasets store the actual data. Datasets can be multidimensional arrays, and they can be compressed to reduce the size of the file. Metadata can be attached as attributes to store additional information about the data. HDF5 files are efficient for storing large amounts of time series data, but they are not human-readable and can only be read through an HDF5 library. Such libraries are available for a variety of programming languages, including Python, R, and MATLAB.

Further information about the HDF5 file format can be found [here](https://portal.hdfgroup.org/display/HDF5/HDF5).

Let us store a time series in an HDF5 file. We will use the house consumption data set, available in CSV format at `data/house_consumption_TS/house_consumption.csv`. The data set contains measurements of electric power consumption in one household, sampled every 15 minutes. The first column is the date and time, and the second column is the power consumption in kilowatts.

```python
import pandas as pd
import h5py

# read the data from the CSV file, using the 'date' column as a DatetimeIndex
df = pd.read_csv('./data/house_consumption_TS/house_consumption.csv', header=0, parse_dates=['date'], index_col=['date'])

# resample the data to hourly intervals (mean of the 15-minute readings)
# 'h' is the hourly alias; the uppercase 'H' is deprecated since pandas 2.2
df = df.resample('h').mean()

# convert the data to a NumPy array (the timestamps in the index are not included)
data = df.values

# create a new HDF5 file; the `with` block closes it even if an error occurs
with h5py.File('./data/house_consumption_TS/house_consumption.hdf5', 'w') as f:
    # create a dataset to store the time series data
    dset = f.create_dataset('house_consumption', data=data)

    # define the attributes of the dataset; START, COUNT and FREQUENCY
    # together describe the timestamps that were dropped above
    dset.attrs['CLASS'] = 'TIMESERIES'
    dset.attrs['NAME'] = 'HOUSE_CONSUMPTION'
    dset.attrs['START'] = str(df.index.min())
    dset.attrs['COUNT'] = len(data)
    dset.attrs['FREQUENCY (hours)'] = 1
```

To use the data in the HDF5 file, we can open the file and read the data from the dataset.

```python
import h5py

# open the HDF5 file in read-only mode; the `with` block closes it afterwards
with h5py.File('./data/house_consumption_TS/house_consumption.hdf5', 'r') as f:
    # read the whole dataset into a NumPy array
    dset = f['house_consumption']
    data = dset[:]

print(data)
```


The metadata can be accessed as follows:

```python
import h5py

# open the HDF5 file in read-only mode
with h5py.File('./data/house_consumption_TS/house_consumption.hdf5', 'r') as f:
    # read the dataset
    dset = f['house_consumption']

    # get the attributes as a plain dictionary
    print('** attributes:', dict(dset.attrs))

    # `dset.attrs` behaves like a dictionary, so a single attribute can be read by key
    print('** NAME attribute:', dset.attrs['NAME'])

    # get the list of attribute names
    print('** attributes keys:', list(dset.attrs.keys()))

    # get the attribute values
    print('** attributes values:', list(dset.attrs.values()))

    # or use the items() method to iterate over the attributes
    print('** attributes items:')
    for key, value in dset.attrs.items():
        print(key, ":", value)
```
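
Because only the values were written to the dataset, the timestamps have to be rebuilt from the `START`, `COUNT`, and `FREQUENCY (hours)` attributes. The following sketch shows one way to do that with the standard library. The attribute values here are hypothetical stand-ins for what would be read from `dset.attrs`, and `rebuild_timestamps` is a name invented for this example.

```python
from datetime import datetime, timedelta

# Hypothetical attribute values, shaped like the ones written to the HDF5 file.
attrs = {
    "START": "2020-01-01 00:00:00",
    "COUNT": 4,
    "FREQUENCY (hours)": 1,
}

def rebuild_timestamps(attrs):
    """Return the list of timestamps implied by START, COUNT and FREQUENCY."""
    start = datetime.fromisoformat(attrs["START"])
    step = timedelta(hours=int(attrs["FREQUENCY (hours)"]))
    return [start + i * step for i in range(int(attrs["COUNT"]))]

timestamps = rebuild_timestamps(attrs)
for ts in timestamps:
    print(ts)
```

This only works for regularly sampled series. For irregular sampling, the timestamps would need to be stored explicitly, for example as a second dataset.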

## Storage and Indexing Strategies

The storage and indexing of time series data are critical considerations for any application that relies on this data. The size of the dataset, the frequency of updates, and the types of queries that need to be supported can all impact the storage and indexing strategies that are used. Here are some common storage and indexing strategies for time series data:

- __Columnar Storage__ - a storage format that stores data by column instead of by row. In columnar storage, each column of data is stored separately, which can improve the efficiency of certain types of queries, such as aggregation and filtering. Columnar storage is particularly well-suited for time series data because queries are often performed on a subset of the columns, rather than on the entire dataset. Columnar storage is used in several popular time series databases, such as InfluxDB and TimescaleDB.

- __Compression__ - a technique used to reduce the size of data by encoding it in a more compact form. Compression can be particularly useful for time series data, which is often large and repetitive. There are several compression techniques that can be used for time series data, including delta encoding, run-length encoding, and gzip compression. Compression can reduce the amount of storage required for time series data, as well as improve the performance of queries by reducing the amount of data that needs to be read from disk.

- __Partitioning__ - a technique used to split a dataset into smaller, more manageable pieces. Partitioning can be particularly useful for time series data, which is often too large to fit in memory or to be queried efficiently as a single unit. Partitioning can be done by time, where the dataset is split into chunks based on time intervals (e.g., hourly, daily, weekly), or by other criteria, such as location or sensor. Partitioning can improve query performance by allowing queries to be executed on a subset of the data, rather than the entire dataset. Partitioning is used in several popular time series databases, such as OpenTSDB and KairosDB.

- __Indexing__ - a technique used to speed up queries by creating a data structure that maps query criteria to the location of the corresponding data in the dataset. Indexing is particularly useful for time series data, where queries often involve filtering on a specific time range or a subset of the data. There are several indexing techniques that can be used for time series data, including B-trees, bitmap indexes, and inverted indexes. Indexing can improve query performance by reducing the amount of data that needs to be scanned to retrieve the desired results. Indexing is used in several popular time series databases, such as OpenTSDB.
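
To make the compression item more concrete, here is a minimal pure-Python sketch of delta encoding followed by run-length encoding, applied to regularly sampled timestamps. The function names and the sample data are illustrative. Real time series databases use more elaborate, bit-level variants of these ideas.

```python
def delta_encode(values):
    """Keep the first value, then store the difference to the previous value."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    """Invert delta_encode by accumulating the differences."""
    out = []
    for i, d in enumerate(deltas):
        out.append(d if i == 0 else out[-1] + d)
    return out

def run_length_encode(values):
    """Collapse runs of equal values into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(r) for r in runs]

def run_length_decode(runs):
    """Invert run_length_encode by repeating each value count times."""
    return [v for v, count in runs for _ in range(count)]

# Invented sample: six timestamps in epoch seconds, one per minute.
timestamps = [1577836800 + 60 * i for i in range(6)]

deltas = delta_encode(timestamps)   # first value, then a constant step of 60
runs = run_length_encode(deltas)    # the constant steps collapse into one run
print(runs)

restored = delta_decode(run_length_decode(runs))
```

With a fixed sampling interval, every delta after the first is identical, so the six timestamps shrink to two pairs. This is why the two techniques are often combined for timestamp columns.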

## Time Series Databases

Time series databases are specialized databases that are designed to store and query time series data efficiently. Time series databases use a combination of the storage and indexing strategies discussed above to provide fast and efficient storage and retrieval of time series data. Here are some popular time series databases:

- __InfluxDB__ - a popular open-source time series database that is designed for high write and query performance. InfluxDB uses a columnar storage format, compression, partitioning, and indexing to provide fast and efficient storage and retrieval of time series data. InfluxDB also supports a SQL-like query language called InfluxQL, which makes it easy to query and analyze time series data.

- __TimescaleDB__ - an open-source time series database that is built on top of PostgreSQL. TimescaleDB partitions tables into time-based chunks, can compress older chunks into a columnar layout, and relies on PostgreSQL indexes to provide fast and efficient storage and retrieval of time series data. Because it is a PostgreSQL extension, it supports full SQL, and it provides a number of advanced features, such as automatic data retention policies and continuous aggregates.

- __OpenTSDB__ - a distributed time series database that is designed to handle large amounts of time series data. OpenTSDB uses a row-based storage format, partitioning, and indexing to provide fast and efficient storage and retrieval of time series data. OpenTSDB also provides a powerful query language that allows users to perform complex queries on time series data.

- __KairosDB__ - a time series database that is built on top of Apache Cassandra. KairosDB stores its data in Cassandra's wide-column model and relies on partitioning and indexing to provide fast and efficient storage and retrieval of time series data. KairosDB also provides a REST API that makes it easy to integrate with other applications.

Of course, relational databases (e.g., MySQL) or document databases (e.g., MongoDB) can also be used, at least as an initial solution.
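
As a rough illustration of the relational option, the sketch below uses SQLite through Python's standard `sqlite3` module. SQLite is used here only because it needs no server. The table name `readings`, its columns, and the sample rows are assumptions made for this example. Declaring the timestamp as the primary key gives the table an index on it, so a time-range query does not have to scan every row.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")

# Timestamps are stored as text in 'YYYY-MM-DD HH:MM:SS' form, so that
# lexicographic order matches chronological order.
conn.execute("CREATE TABLE readings (ts TEXT PRIMARY KEY, value REAL)")

# Invented sample data: five one-minute readings.
rows = [(f"2020-01-01 00:0{i}:00", i / 10) for i in range(5)]
conn.executemany("INSERT INTO readings (ts, value) VALUES (?, ?)", rows)
conn.commit()

# Time-range query: readings from 00:01 (inclusive) to 00:03 (exclusive).
selected = conn.execute(
    "SELECT ts, value FROM readings WHERE ts >= ? AND ts < ? ORDER BY ts",
    ("2020-01-01 00:01:00", "2020-01-01 00:03:00"),
).fetchall()
print(selected)

conn.close()
```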

## Conclusion

The storage and management of time series data are critical considerations for any application that relies on this data. Time series data is typically stored in a tabular format with a timestamp column and one or more columns containing the corresponding data values. It can be stored in a variety of file formats, including CSV, XML, Excel, JSON, Parquet, and HDF5. The choice of storage and indexing strategy affects both query performance and storage efficiency. Columnar storage, compression, partitioning, and indexing are all techniques that can be used to optimize the storage and retrieval of time series data. Time series databases, such as InfluxDB, TimescaleDB, OpenTSDB, and KairosDB, provide specialized support for storing and querying time series data efficiently.

In addition to the techniques and tools discussed above, stream processing technologies are often used to manage and analyze time series data. For example, Apache Kafka is a distributed streaming platform that can be used to ingest, process, and store high-volume real-time data streams, including time series data. Apache Flink is a distributed stream processing framework that can be used to perform real-time analytics on data streams, including time series data. These technologies can be used in conjunction with time series databases and other tools to build end-to-end time series data pipelines that can handle large-scale data processing and analysis tasks.

In conclusion, time series data is an essential component of many applications, and storing and managing it efficiently is critical to their success. The right combination of file format, storage and indexing techniques, time series database, and streaming tools depends on the data volume, update frequency, and query patterns of each application.