# Exercise 03 - Columnar Vs Row Storage

- The columnar storage extension referenced here:
    - cstore_fdw by Citus Data [https://github.com/citusdata/cstore_fdw](https://github.com/citusdata/cstore_fdw)
- The cells in Step 2 create the columnar table through the `citus` extension (`USING columnar`).
- The data tables are the ones Citus Data uses to demonstrate the storage extension.


In [2]:
%load_ext sql
import os
import psycopg2
import pandas

## STEP 0 : Connect to the local `reviews` database

### Create the database

In [None]:
#!sudo -u postgres psql -c 'CREATE DATABASE reviews;'

#!wget http://examples.citusdata.com/customer_reviews_1998.csv.gz
#!wget http://examples.citusdata.com/customer_reviews_1999.csv.gz

#!gzip -d customer_reviews_1998.csv.gz 
#!gzip -d customer_reviews_1999.csv.gz 

#!mv customer_reviews_1998.csv /tmp/customer_reviews_1998.csv
#!mv customer_reviews_1999.csv /tmp/customer_reviews_1999.csv

### Connect to the database

In [4]:
DB_ENDPOINT = "127.0.0.1"
DB = 'reviews'
DB_USER = 'student'
DB_PASSWORD = 'student'
DB_PORT = '5432'

# postgresql://username:password@host:port/database
conn_string = "postgresql://{}:{}@{}:{}/{}" \
                        .format(DB_USER, DB_PASSWORD, DB_ENDPOINT, DB_PORT, DB)

%sql $conn_string

print(conn_string)

postgresql://student:student@127.0.0.1:5432/reviews
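The `.format` call above places the credentials into the URL verbatim. That works for `student`/`student`, but a password containing characters such as `@`, `:` or `/` would produce a malformed URL. A minimal sketch of a safer variant, assuming a hypothetical helper named `build_conn_string` (not part of the original notebook), percent-encodes the user and password with the standard library first:

```python
from urllib.parse import quote_plus


def build_conn_string(user, password, host, port, database):
    """Return a postgresql:// URL with the user and password percent-encoded.

    Hypothetical helper for illustration; the notebook itself uses str.format.
    """
    return "postgresql://{}:{}@{}:{}/{}".format(
        quote_plus(user), quote_plus(password), host, port, database
    )


# Same values as the cell above
conn_string = build_conn_string("student", "student", "127.0.0.1", "5432", "reviews")
print(conn_string)
```

For the credentials used in this exercise the result is identical to the string printed above, so the `%sql $conn_string` line can stay as it is.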


## STEP 1 : Create a table with normal (row) storage & load data

**TODO:** Create a table called customer_reviews_row with the column names contained in the `customer_reviews_1998.csv` and `customer_reviews_1999.csv` files.

In [None]:
%%sql
DROP TABLE IF EXISTS customer_reviews_row;
CREATE TABLE customer_reviews_row 
(
    customer_id TEXT,
    review_date DATE,
    review_rating INTEGER,
    review_votes INTEGER,
    review_helpful_votes INTEGER,
    product_id TEXT,
    product_title TEXT,
    product_sales_rank BIGINT,
    product_group TEXT,
    product_category TEXT,
    product_subcategory TEXT,
    similar_product_ids CHAR(10)[]
)

**TODO:** Use the [COPY statement](https://www.postgresql.org/docs/9.2/sql-copy.html) to populate `customer_reviews_row` with the data in the `customer_reviews_1998.csv` and `customer_reviews_1999.csv` files. You can access the files in the `/tmp/` folder.

In [None]:
#%%sql
#COPY customer_reviews_row FROM '/tmp/customer_reviews_1998.csv' WITH CSV;
#COPY customer_reviews_row FROM '/tmp/customer_reviews_1999.csv' WITH CSV;

## STEP 2 :  Create a table with columnar storage & load data

First, load the extension to use columnar storage in Postgres.

In [None]:
#%%sql

#-- load extension first time after install
#CREATE EXTENSION IF NOT EXISTS citus;

#-- create server object
#CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;

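**TODO:** Create a columnar table called `customer_reviews_col` with the column names contained in the `customer_reviews_1998.csv` and `customer_reviews_1999.csv` files. With cstore_fdw this is a `FOREIGN TABLE`; the cell below instead uses the `citus` extension and declares a regular table `USING columnar`.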

In [None]:
%%sql
-- drop any previous copy, then create the columnar table
DROP TABLE IF EXISTS customer_reviews_col;

-------------

CREATE EXTENSION IF NOT EXISTS citus;

CREATE TABLE customer_reviews_col
(
    customer_id TEXT,
    review_date DATE,
    review_rating INTEGER,
    review_votes INTEGER,
    review_helpful_votes INTEGER,
    product_id TEXT,
    product_title TEXT,
    product_sales_rank BIGINT,
    product_group TEXT,
    product_category TEXT,
    product_subcategory TEXT,
    similar_product_ids CHAR(10)[]
)
USING columnar;

**TODO:** Use the [COPY statement](https://www.postgresql.org/docs/9.2/sql-copy.html) to populate `customer_reviews_col` with the data in the `customer_reviews_1998.csv` and `customer_reviews_1999.csv` files. You can access the files in the `/tmp/` folder.

In [None]:
%%sql 
COPY customer_reviews_col FROM '/tmp/customer_reviews_1998.csv' WITH CSV;
COPY customer_reviews_col FROM '/tmp/customer_reviews_1999.csv' WITH CSV;

## Step 3: Compare performance

Now run the same query on the two tables and compare the run time. Which form of storage is more performant?

**TODO**: Write a query that calculates the average `review_rating` by `product_title` for all reviews in 1995. Sort the data by `review_rating` in descending order. Limit the results to 20.
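One possible reading of this TODO, sketched as a small Python helper that builds the SQL text for either table. The helper name `avg_rating_query`, the `avg_rating` alias, and the choice to sort by the computed average (a grouped query cannot sort by the raw `review_rating`) are assumptions for illustration, not part of the original exercise:

```python
ALLOWED_TABLES = {"customer_reviews_row", "customer_reviews_col"}


def avg_rating_query(table_name, year=1995, limit=20):
    """Build the average-rating query for one of the two exercise tables.

    Hypothetical helper: the table name is checked against a fixed set
    because identifiers cannot be passed as bound parameters.
    """
    if table_name not in ALLOWED_TABLES:
        raise ValueError("unexpected table name: {!r}".format(table_name))
    year = int(year)
    return (
        "SELECT product_title, AVG(review_rating) AS avg_rating\n"
        "FROM {table}\n"
        "WHERE review_date >= '{start}-01-01' AND review_date < '{end}-01-01'\n"
        "GROUP BY product_title\n"
        "ORDER BY avg_rating DESC\n"
        "LIMIT {limit};"
    ).format(table=table_name, start=year, end=year + 1, limit=int(limit))


# Print the statement for the row-storage table
print(avg_rating_query("customer_reviews_row"))
```

The printed statement can be pasted into a `%%sql` cell under `%%time`, once for `customer_reviews_row` and once for `customer_reviews_col`, so that both tables are measured with the same aggregate query.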

First run the query on `customer_reviews_row`:

In [5]:
%%time
%%sql

SELECT customer_id, review_date, review_rating, product_id, product_title
FROM customer_reviews_row
ORDER BY review_rating DESC
LIMIT 1000;

 * postgresql://student:***@127.0.0.1:5432/reviews
1000 rows affected.
Wall time: 672 ms


customer_id,review_date,review_rating,product_id,product_title
ATVPDKIKX0DER,1996-05-15,5,0312857063,"Stone of Tears (Sword of Truth, Book 2)"
ATVPDKIKX0DER,1996-05-16,5,0399236295,"Redwall (Redwall, Book 1)"
ATVPDKIKX0DER,1996-05-13,5,0764104012,Speed Reading for Business (Barron's Business Success Guides)
ATVPDKIKX0DER,1996-05-15,5,0140254501,Bombardiers
ATVPDKIKX0DER,1996-05-16,5,0833552600,Redwall
ATVPDKIKX0DER,1996-05-16,5,0399214240,"Redwall (Redwall, Book 1)"
A1BLGX0V45IGHD,1996-05-13,5,0553283685,Hyperion
A3BYX62MZGD423,1996-05-13,5,0694520004,CELEBRATION OF DISCIPLINE
ATVPDKIKX0DER,1996-05-14,5,0486267652,Origami Sea Life (Origami)
ATVPDKIKX0DER,1996-05-14,5,0815323034,The New Arthurian Encyclopedia (Updated Paperback Edition)


Then on `customer_reviews_col`:

In [6]:
%%time
%%sql

SELECT customer_id, review_date, review_rating, product_id, product_title
FROM customer_reviews_col
ORDER BY review_rating DESC
LIMIT 1000;

 * postgresql://student:***@127.0.0.1:5432/reviews
1000 rows affected.
Wall time: 804 ms


customer_id,review_date,review_rating,product_id,product_title
ATVPDKIKX0DER,1996-05-15,5,0312850093,The Eye of the World
ATVPDKIKX0DER,1996-05-16,5,0679861564,Peter and the Wolf
ATVPDKIKX0DER,1996-05-14,5,0449911616,"The Crystal Cave (Stewart, Mary, Arthurian Saga, Bk. 1,)"
A3HQM4EEJ6TLOK,1996-05-15,5,0691070466,Where to Watch Birds in South America
ATVPDKIKX0DER,1996-05-16,5,0671885286,The Great Reckoning
ATVPDKIKX0DER,1996-05-16,5,0670808490,Peter and the Wolf
AMOCE7OQDZEE3,1996-05-13,5,0671877224,DEATHKILLER
ATVPDKIKX0DER,1996-05-14,5,0440220467,The Commandos
ABY1XDBGGYYPI,1996-05-14,5,0785799524,Breathing Lessons
A3HQM4EEJ6TLOK,1996-05-15,5,0292770634,The Birds of South America


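## Conclusion: In this run, columnar storage was not faster

The recorded wall times are 672 ms for `customer_reviews_row` and 804 ms for `customer_reviews_col`, so row storage came out slightly ahead here. The query that was actually run selects five columns and sorts every row without aggregating, which is not the average-rating query described in the TODO. Columnar storage tends to pay off when a query scans many rows but reads only a few columns.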
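A single `%%time` reading per table is a noisy basis for comparison, since caching and background load can easily shift a result by more than the gap between 672 ms and 804 ms. A minimal sketch of a repeat-and-summarize helper follows; the names `time_runs` and `summarize` are hypothetical, and the workload shown is a stand-in for a function that would execute the query against one table:

```python
import statistics
import time


def time_runs(fn, repeats=5):
    """Call fn() `repeats` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Reduce a list of durations to min, median and max."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


if __name__ == "__main__":
    # Stand-in workload; replace with a callable that runs the query.
    print(summarize(time_runs(lambda: sum(range(100_000)))))
```

Comparing the medians for the row and columnar tables, rather than one reading each, gives a steadier picture of which layout suits a given query.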