# Row vs Column Storage

In InterSystems IRIS®, a relational table, such as the one shown here, is a logical abstraction. It does not reflect the underlying physical storage layout of the data.

![image info](https://github.com/grongierisc/iris-devslam/blob/master/misc/img/table_abstraction.png?raw=true)

## How data can actually be stored

The underlying physical storage layout of the data can be either row-oriented or column-oriented. In row-oriented storage, the data for each row is stored together. In column-oriented storage, the data for each column is stored together.

### Row storage

In row storage, the data for each row is stored together. This is the default storage layout in InterSystems IRIS.

![image info](https://github.com/grongierisc/iris-devslam/blob/master/misc/img/table_row_storage.png?raw=true)

### Column storage

In column storage, the data for each column is stored together.

![image info](https://github.com/grongierisc/iris-devslam/blob/master/misc/img/table_col_storage.png?raw=true)
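
To make the two layouts concrete, here is a minimal sketch in plain Python. It is only an analogy: the structures and sample values are invented for illustration and do not reflect how InterSystems IRIS physically stores data.

```python
# Illustrative only: plain Python structures, not the IRIS on-disk format.
rows = [
    {"AccountNumber": 1, "Amount": 10.0, "Type": "credit"},
    {"AccountNumber": 2, "Amount": -4.5, "Type": "debit"},
    {"AccountNumber": 3, "Amount": 7.25, "Type": "credit"},
]

# Row-oriented: one entry per record, all fields of a row sit together.
row_store = [tuple(r.values()) for r in rows]

# Column-oriented: one list per column, all values of a column sit together.
column_store = {name: [r[name] for r in rows] for name in rows[0]}

# Fetching one full record is a single lookup in the row layout...
second_record = row_store[1]

# ...while an aggregate over one column only touches that column's list.
total_amount = sum(column_store["Amount"])

print(second_record)
print(total_amount)
```

The row layout suits reading or writing whole records, while the column layout suits scanning a single column across many records.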

## Demo

In this demo, we will show the difference between row storage and column storage.
To do that, we will create 4 tables with the same data but different storage layouts.

* Demo.BankTransactionRow
  * A table that stores data by row
* Demo.BankTransactionColumn
  * A table that stores data by column
* Demo.BankTransactionIndex
  * A table that stores data by row, with a columnar index on one column
* Demo.BankTransactionMix
  * A table that stores some columns by row and one column by column

## Let's start

First we will import utility functions that help us generate data and measure execution time.

In [1]:
from utilsrowcolumn import * 

Then we will list the names of the 4 tables to create.

In [2]:
list_tables = ["Demo.BankTransactionRow", "Demo.BankTransactionColumn","Demo.BankTransactionIndex","Demo.BankTransactionMix" ]

Clean up the database.

In [3]:
# init drop table if exists
print("init drop table if exists")
for table in list_tables:
    run_sql_query("DROP TABLE IF EXISTS %s" % table)

# drop table description
print("drop table description")
run_sql_query("DROP TABLE IF EXISTS Demo.BankTransactionDescription")

# create table description
print("create table description")
create_join_table()

init drop table if exists
drop table description
create table description
create a description table of debit and credit
insert data in description table


## Create the row storage

Not much to see here: this is a basic DDL statement, and row storage is the default layout.

![image info](https://github.com/grongierisc/iris-devslam/blob/master/misc/img/table_row_storage.png?raw=true)

In [4]:
sql_row = """
CREATE TABLE Demo.BankTransactionRow (
  AccountNumber INTEGER,
  TransactionDate DATE,
  Description VARCHAR(100),
  Amount NUMERIC(10,2),
  Type VARCHAR(10)
)
"""
run_sql_query(sql_row)

## Create the indexed row table

Same SQL statement as above, plus a new columnar index on the Amount column.

![image info](https://github.com/grongierisc/iris-devslam/blob/master/misc/img/table_bitmap_columnar_index_row.png?raw=true)

In [5]:
# index column storage
sql_index = """
CREATE TABLE Demo.BankTransactionIndex (
  AccountNumber INTEGER,
  TransactionDate DATE,
  Description VARCHAR(100),
  Amount NUMERIC(10,2),
  Type VARCHAR(10)
)
"""
run_sql_query(sql_index)
# Create the index
run_sql_query("""CREATE COLUMNAR INDEX AmountIndex
ON Demo.BankTransactionIndex(Amount)""")

## Create the column storage

Pay attention to the clause at the end of the statement: WITH STORAGETYPE = COLUMNAR

![image info](https://github.com/grongierisc/iris-devslam/blob/master/misc/img/table_col_storage.png?raw=true)

In [6]:
# column storage
sql_column = """
CREATE TABLE Demo.BankTransactionColumn (
  AccountNumber INTEGER,
  TransactionDate DATE,
  Description VARCHAR(100),
  Amount NUMERIC(10,2),
  Type VARCHAR(10)
)
WITH STORAGETYPE = COLUMNAR
"""
run_sql_query(sql_column)

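## Finally, the mixed storage

Pay attention to the Amount column: it is the only one declared WITH STORAGETYPE = COLUMNAR, so the other columns keep the default row storage.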

![image info](https://github.com/grongierisc/iris-devslam/blob/master/misc/img/table_mixed_query.png?raw=true)

In [7]:
# mix storage
sql_mix = """
CREATE TABLE Demo.BankTransactionMix (
  AccountNumber INTEGER,
  TransactionDate DATE,
  Description VARCHAR(100),
  Amount NUMERIC(10,2) WITH STORAGETYPE = COLUMNAR,
  Type VARCHAR(10)
)
"""
run_sql_query(sql_mix)

## Now we have to insert data into those tables

It will be done in 2 steps:
* First we will generate data
* Second we will duplicate the data in each table

The first parameter is the number of rows to generate per table.

The second parameter is the number of times the generated rows are duplicated.

The third is the list of tables into which the data will be inserted.

In [8]:
print("create data")
data = create_n_fake_data(64000,100,list_tables)


create data
99.9%
created 256,000 fake data in 33.35 seconds, number of rows per second : 7,677.31
100.0%build index
tune table

duplicated a total of 25,600,000 rows in 43.59 row per second : 587,329.33


## Summary

In less than 3 minutes we have built a data set of almost **26 million rows** (256,000 generated plus 25,600,000 duplicated, spread over 4 tables) :)

# Now start the demo

## Let's count the rows in the tables

In [9]:
# query count data
print("query count data")
for table in list_tables:
    print_sql_query(f"SELECT COUNT(*) FROM {table}")


query count data
SELECT COUNT(*) FROM Demo.BankTransactionRow :
[6464000]
SELECT COUNT(*) FROM Demo.BankTransactionColumn :
[6464000]
SELECT COUNT(*) FROM Demo.BankTransactionIndex :
[6464000]
SELECT COUNT(*) FROM Demo.BankTransactionMix :
[6464000]
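
The counts above can be checked with a little arithmetic. This sketch assumes that each duplication pass copies the generated rows of a table once more, which is consistent with the log output but is not confirmed by the helper's source code (not shown here).

```python
# Values taken from the create_n_fake_data(64000, 100, list_tables) call.
generated_per_table = 64_000
duplications = 100
table_count = 4

# Assumption: each duplication pass re-inserts the generated rows once.
rows_per_table = generated_per_table * (1 + duplications)
total_rows = rows_per_table * table_count

print(f"{rows_per_table:,} rows per table")
print(f"{total_rows:,} rows in total")
```

Under that assumption, the per-table figure matches the 6464000 returned by each COUNT(*) query.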


## Query the top 100,000 rows of each table

In [10]:
# query data
print("query data")
for table in list_tables:
    benchmark_sql_query("SELECT TOP 100000 * FROM %s " % table)


query data
number of rows : 100000
SELECT TOP 100000 * FROM Demo.BankTransactionRow  in 1.97 seconds, row per second : 50,668.98
number of rows : 100000
SELECT TOP 100000 * FROM Demo.BankTransactionColumn  in 1.73 seconds, row per second : 57,934.63
number of rows : 100000
SELECT TOP 100000 * FROM Demo.BankTransactionIndex  in 1.74 seconds, row per second : 57,413.62
number of rows : 100000
SELECT TOP 100000 * FROM Demo.BankTransactionMix  in 1.75 seconds, row per second : 57,091.37
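
The benchmark_sql_query helper comes from utilsrowcolumn and its source is not shown here. As a rough idea of what such a helper might do, here is a hypothetical sketch that times a query and derives rows per second. The function name benchmark_query, the BankTransaction table, and the use of an in-memory SQLite database are all assumptions made so the example runs anywhere; it is not the demo's actual implementation and does not connect to IRIS.

```python
import sqlite3
import time


def benchmark_query(connection, sql):
    """Run sql, fetch every row, and report elapsed time and rows per second.

    Hypothetical stand-in for benchmark_sql_query, not the demo's real code.
    """
    start = time.perf_counter()
    rows = connection.execute(sql).fetchall()
    elapsed = time.perf_counter() - start
    # Guard against a zero elapsed time on very fast queries.
    rate = len(rows) / elapsed if elapsed > 0 else float("inf")
    print(f"number of rows : {len(rows)}")
    print(f"{sql} in {elapsed:.2f} seconds, row per second : {rate:,.2f}")
    return len(rows), elapsed


# SQLite in-memory database, used only to keep the sketch self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE BankTransaction (AccountNumber INTEGER, Amount REAL)")
conn.executemany(
    "INSERT INTO BankTransaction VALUES (?, ?)",
    [(i, i * 0.5) for i in range(1000)],
)

# SQLite uses LIMIT where IRIS SQL uses TOP.
row_count, seconds = benchmark_query(conn, "SELECT * FROM BankTransaction LIMIT 100")
```

Fetching every row inside the timed section matters: without it, only query preparation would be measured, not the data transfer.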


## Let's try aggregation

In [8]:
# benchmark aggregation
print("benchmark aggregation")
for table in list_tables:
    benchmark_sql_query("SELECT AVG(ABS(Amount)) FROM %s " % table)


benchmark aggregation
number of rows : 1
SELECT AVG(ABS(Amount)) FROM Demo.BankTransactionRow  in 0.58 seconds, row per second : 1.73
number of rows : 1
SELECT AVG(ABS(Amount)) FROM Demo.BankTransactionColumn  in 0.24 seconds, row per second : 4.20
number of rows : 1
SELECT AVG(ABS(Amount)) FROM Demo.BankTransactionIndex  in 0.23 seconds, row per second : 4.34
number of rows : 1
SELECT AVG(ABS(Amount)) FROM Demo.BankTransactionMix  in 0.24 seconds, row per second : 4.08
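
To read these timings more easily, the short sketch below turns them into speedup factors relative to the row table. The numbers are copied from the output above; they come from a single run, so the factors are indicative only.

```python
# Elapsed seconds copied from the aggregation output above.
timings = {
    "Demo.BankTransactionRow": 0.58,
    "Demo.BankTransactionColumn": 0.24,
    "Demo.BankTransactionIndex": 0.23,
    "Demo.BankTransactionMix": 0.24,
}

# Row storage is the baseline every other layout is compared against.
baseline = timings["Demo.BankTransactionRow"]
speedups = {
    table: round(baseline / seconds, 2) for table, seconds in timings.items()
}

for table, factor in speedups.items():
    print(f"{table}: {factor}x compared with row storage")
```

In this run, every layout that stores or indexes Amount by column answers the aggregate roughly 2.4 to 2.5 times faster than plain row storage.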


## Showcase an SQL join

In [12]:
# benchmark join
print("benchmark join")
for table in list_tables:
    benchmark_sql_query("""SELECT TOP 100000 * FROM %s t1 
        JOIN Demo.BankTransactionDescription t2 ON t1.Type = t2.Type""" % table)


benchmark join
number of rows : 100000
SELECT TOP 100000 * FROM Demo.BankTransactionRow t1 
        JOIN Demo.BankTransactionDescription t2 ON t1.Type = t2.Type in 2.33 seconds, row per second : 42,880.57
number of rows : 100000
SELECT TOP 100000 * FROM Demo.BankTransactionColumn t1 
        JOIN Demo.BankTransactionDescription t2 ON t1.Type = t2.Type in 1.88 seconds, row per second : 53,288.67
number of rows : 100000
SELECT TOP 100000 * FROM Demo.BankTransactionIndex t1 
        JOIN Demo.BankTransactionDescription t2 ON t1.Type = t2.Type in 1.88 seconds, row per second : 53,137.88
number of rows : 100000
SELECT TOP 100000 * FROM Demo.BankTransactionMix t1 
        JOIN Demo.BankTransactionDescription t2 ON t1.Type = t2.Type in 1.89 seconds, row per second : 52,900.02


## Benchmark inserts

In [3]:
# benchmark insert
# explicit import, so the cell does not rely on the star import above
import time
print("benchmark insert")
for table in list_tables:
    start = time.time()
    print(f"for table {table}")
    create_n_fake_data(25000,0,[table])
    end = time.time()

benchmark insert
for table Demo.BankTransactionRow
99.9%
created 25,000 fake data in 3.16 seconds, number of rows per second : 7,918.27
for table Demo.BankTransactionColumn
99.9%
created 25,000 fake data in 4.97 seconds, number of rows per second : 5,030.16
for table Demo.BankTransactionIndex
99.9%
created 25,000 fake data in 3.57 seconds, number of rows per second : 7,004.54
for table Demo.BankTransactionMix
99.9%
created 25,000 fake data in 3.63 seconds, number of rows per second : 6,882.18


## Check table size

In [4]:
# table size
print("table size")
for table in list_tables:
    print_sql_query("SELECT * FROM bdb_sql.TableSize('%s')" % table)


table size
SELECT * FROM bdb_sql.TableSize('Demo.BankTransactionRow') :
['total', '264.07', '236.061']
SELECT * FROM bdb_sql.TableSize('Demo.BankTransactionColumn') :
['total', '151.67', '142.461']
SELECT * FROM bdb_sql.TableSize('Demo.BankTransactionIndex') :
['total', '291.07', '262.061']
SELECT * FROM bdb_sql.TableSize('Demo.BankTransactionMix') :
['total', '246.07', '222.061']
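
The sizes can be compared relative to the row table with the sketch below. It assumes that the first numeric value in each result is the total size of the table; the output does not state the meaning or unit of the columns, so treat the percentages as a rough comparison only.

```python
# First numeric value of each bdb_sql.TableSize result above.
# Assumption: this value is the total table size (unit not stated).
sizes = {
    "Demo.BankTransactionRow": 264.07,
    "Demo.BankTransactionColumn": 151.67,
    "Demo.BankTransactionIndex": 291.07,
    "Demo.BankTransactionMix": 246.07,
}

# Express each size as a percentage of the row-storage table.
baseline = sizes["Demo.BankTransactionRow"]
relative = {
    table: round(100 * size / baseline, 1) for table, size in sizes.items()
}

for table, percent in relative.items():
    print(f"{table}: {percent}% of the row table size")
```

Under that assumption, the columnar table is the smallest, and the indexed table is the largest because it keeps the row data and adds the columnar index on top.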
