# Data Warehouse
A data warehouse is an OLAP database used specifically for reporting and data analytics. It usually consists of:
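* Metadata: data that describes the warehouse contents, such as schemas and data sources.
* Raw data: the detailed records loaded from the source systems.
* Summary data: pre-aggregated data derived from the raw data.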

Data from many different sources (operational systems, other databases, APIs, flat files) is first loaded into a staging area and then into the data warehouse. The data warehouse can then serve the data either to **data marts** (smaller, more specialized data warehouses, for example for specific company departments) or directly to end users (analysts, reports, or machine learning models).

* OLAP vs. OLTP
* BigQuery
    - Cost
    - Partitions and Clustering
    - Best Practices
    - Internals
    - Machine Learning in BigQuery

## OLAP vs. OLTP
There are generally two types of database systems that serve different business purposes:
* **Online Transaction Processing (OLTP)** type databases control and run business operations in real time. They offer a view of day-to-day business transactions. They are designed for fast data updates and are usually *normalized* for efficiency. Regular backups are required to ensure business continuity and to satisfy legal and governance requirements. 

* **Online Analytical Processing (OLAP)** type databases support business planning, decision making, and analytics. They present a multi-dimensional view of business data. Data is periodically refreshed with scheduled, long-running batch jobs, and tables are usually *denormalized* to support more efficient analytical queries. Lost data can be reloaded from OLTP databases and other sources in lieu of regular backups.
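
To make the normalized vs. denormalized distinction concrete, here is a minimal sketch in plain Python. The `customers` and `orders` data and the `revenue_by_city` helper are hypothetical and exist only for illustration; a real warehouse would hold these as tables.

```python
# Normalized (OLTP-style): each customer is stored once and referenced by key.
customers = {1: {"name": "Ada", "city": "London"}}
orders = [
    {"order_id": 100, "customer_id": 1, "amount": 25.0},
    {"order_id": 101, "customer_id": 1, "amount": 40.0},
]

# Denormalized (OLAP-style): customer attributes are copied onto every order
# row, so an analytical query can aggregate without performing a join.
orders_wide = [{**order, **customers[order["customer_id"]]} for order in orders]


def revenue_by_city(rows):
    """Sum order amounts per city using only the wide (denormalized) rows."""
    totals = {}
    for row in rows:
        totals[row["city"]] = totals.get(row["city"], 0.0) + row["amount"]
    return totals


print(revenue_by_city(orders_wide))
```

The wide rows repeat the customer's name and city, which costs storage and makes updates harder, but the aggregation reads a single collection.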

## Google BigQuery
BigQuery is a serverless data warehouse; from the perspective of the data engineer, there are no servers to manage or database software to install. Its main advantages are scalability and high availability. In addition:
* It has built-in features for machine learning, geospatial analysis, and business intelligence queries.
* It maximizes flexibility by separating the compute engine that runs queries from how it stores the data.

A few notes on BigQuery:
* BigQuery generally caches query results for improved performance. To turn this off, uncheck `⚙ More` > `Query Settings` > `☑ Use cached results`.
* BigQuery also provides several open source public datasets for exploration.

### BigQuery Cost
Understanding service costs, especially when using cloud services, is hugely important to data engineering. BigQuery pricing has two main components:
* **Compute pricing**: the cost to process queries. Compute pricing offers two models:
    - On-demand pricing (per TiB): the first 1 TiB of query data processed per month is free, then $6.25 for each additional TiB (up-to-date pricing is available on BigQuery's [pricing page](https://cloud.google.com/bigquery/pricing)).
    - Capacity pricing (per slot-hour): compute is charged by capacity, measured in "slots" (virtual CPUs). You can use the BigQuery autoscaler or pre-purchase slot commitments, which are dedicated capacity that is always available for your workloads, at a lower price (up-to-date pricing is available on BigQuery's [pricing page](https://cloud.google.com/bigquery/pricing)).
* **Storage pricing**: the cost to store data that is loaded into BigQuery. You pay for *active* storage and *long-term* storage.
    - Active storage includes any table or table partition that has been modified in the last 90 days.
    - Long-term storage includes any table or table partition that has not been modified in the last 90 days.
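
The 90-day rule for storage can be sketched as a small helper. This is only an illustration of the rule as stated above; the function name and the exact boundary handling (treating exactly 90 days as long-term) are assumptions, not BigQuery's billing logic.

```python
from datetime import date, timedelta

# Threshold taken from the text above: 90 days without modification.
LONG_TERM_AFTER = timedelta(days=90)


def storage_class(last_modified, today):
    """Classify a table or partition as 'active' or 'long-term' storage."""
    unmodified_for = today - last_modified
    return "long-term" if unmodified_for >= LONG_TERM_AFTER else "active"


print(storage_class(date(2024, 1, 1), date(2024, 1, 31)))  # modified 30 days ago
print(storage_class(date(2024, 1, 1), date(2024, 6, 1)))   # untouched for months
```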

BigQuery pricing changes from time to time, so check the pricing page for the current rates before estimating costs.
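
As a rough worked example of the on-demand model, the sketch below estimates a monthly compute bill from the number of bytes processed. It hard-codes the free tier and the rate quoted above, which may be out of date, and it is a simplification that ignores any per-query minimums or rounding. The function name is hypothetical.

```python
TIB = 2**40  # bytes in one tebibyte


def on_demand_cost_usd(bytes_processed, free_tib=1.0, usd_per_tib=6.25):
    """Estimate the monthly on-demand cost for the given bytes processed.

    The defaults mirror the figures quoted in the text; pass current
    values from the pricing page for a real estimate.
    """
    billable_tib = max(0.0, bytes_processed / TIB - free_tib)
    return billable_tib * usd_per_tib


# 3 TiB processed in a month leaves 2 billable TiB after the free tier.
print(on_demand_cost_usd(3 * TIB))  # 12.5
```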

### External Tables
BigQuery allows you to create **external tables** over data that is not loaded into BigQuery itself, such as files in Google Cloud Storage buckets, Google Cloud SQL databases, or other cloud services. Note, however, that BigQuery cannot determine the size or number of rows of an external table because the data itself is not stored in BigQuery.

```SQL
-- Creating external table referring to gcs path
CREATE OR REPLACE EXTERNAL TABLE `taxi-rides-ny.nytaxi.external_yellow_tripdata`
OPTIONS (
  format = 'CSV',
  uris = ['gs://nyc-tl-data/trip data/yellow_tripdata_2019-*.csv', 'gs://nyc-tl-data/trip data/yellow_tripdata_2020-*.csv']
);
```

### Partitioning in BigQuery
Partitioning tables can improve query performance and compute cost depending on the characteristics of the queries and the partitioned dimension. You can partition tables based on:
* Time-unit column or time period
* Ingestion time (`_PARTITIONTIME`)
* Integer range

The maximum number of partitions per table is 4,000 at the time of writing; check the [docs](https://cloud.google.com/bigquery/docs/partitioned-tables) for the current limit.
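
When choosing a partitioning granularity, it helps to check how many partitions a date range would need. The sketch below counts daily partitions and compares the count with the limit quoted above. The constant and helper names are hypothetical, and the limit should be verified against the docs.

```python
from datetime import date

# Limit quoted in the text above; verify against the current docs.
MAX_PARTITIONS = 4000


def daily_partitions_needed(start, end):
    """Number of daily partitions covering start..end, both inclusive."""
    return (end - start).days + 1


def fits_partition_limit(start, end, limit=MAX_PARTITIONS):
    """True if daily partitioning over the range stays within the limit."""
    return daily_partitions_needed(start, end) <= limit


# Two years of trip data (2019 and 2020, a leap year) need 731 daily partitions.
print(daily_partitions_needed(date(2019, 1, 1), date(2020, 12, 31)))  # 731
print(fits_partition_limit(date(2019, 1, 1), date(2020, 12, 31)))     # True
```

If a range does not fit, a coarser granularity (monthly or yearly) reduces the partition count.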

To create a partitioned table from an external table:
```SQL
-- Create a partitioned table from external table
CREATE OR REPLACE TABLE taxi-rides-ny.nytaxi.yellow_tripdata_partitoned
PARTITION BY
  DATE(tpep_pickup_datetime) AS
SELECT * FROM taxi-rides-ny.nytaxi.external_yellow_tripdata;
```

This might take some time because the data must first be loaded from the external source into BigQuery before it can be partitioned.

We can also inspect a table's partitions and the number of rows in each:
```SQL
-- Let's look into the partitions
SELECT table_name, partition_id, total_rows
FROM `nytaxi.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'yellow_tripdata_partitoned'
ORDER BY total_rows DESC;
```

### Clustering in BigQuery
Clustering data can also improve performance and cost. Clustering sorts the data within a table (or within each partition) by the values of the clustered columns. It is best done on columns that are frequently used in filters or aggregations, such as product type, market region, or pickup station, and unlike partitioning it also works well on high-cardinality columns (columns with many distinct values). The order of the clustered columns is important because it determines the sort order of the data. Clustering can improve filter queries and aggregate queries. The maximum number of clustered columns in BigQuery is 4. BigQuery automatically reclusters tables in the background to restore the sort order at no extra cost.
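
The reason sort order matters can be shown with a toy model: if rows are sorted by the cluster key and stored in blocks that record their minimum and maximum key, a filter can skip every block whose range cannot contain the value. This is a simplified illustration, not BigQuery's actual storage implementation; the block size, helper names, and sample rows are assumptions.

```python
def build_blocks(rows, cluster_key, block_size):
    """Sort rows by the cluster key and split them into fixed-size blocks,
    recording the minimum and maximum key value of each block."""
    ordered = sorted(rows, key=lambda row: row[cluster_key])
    blocks = []
    for start in range(0, len(ordered), block_size):
        chunk = ordered[start:start + block_size]
        blocks.append({
            "min": chunk[0][cluster_key],
            "max": chunk[-1][cluster_key],
            "rows": chunk,
        })
    return blocks


def blocks_to_scan(blocks, value):
    """Return only the blocks whose key range could contain `value`."""
    return [block for block in blocks if block["min"] <= value <= block["max"]]


# Hypothetical trips with a VendorID column, stored two rows per block.
rows = [{"VendorID": v, "trip": i} for i, v in enumerate([1, 2, 1, 4, 2, 1, 4, 2])]
blocks = build_blocks(rows, "VendorID", block_size=2)

print(len(blocks))                     # 4 blocks in total
print(len(blocks_to_scan(blocks, 4)))  # 1 block needs to be read for VendorID = 4
```

Without the sort, rows for a given `VendorID` would be scattered across all blocks and every block would have to be read.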

To create a partitioned and clustered table:
```SQL
-- Create a partitioned and clustered table
CREATE OR REPLACE TABLE taxi-rides-ny.nytaxi.yellow_tripdata_partitoned_clustered
PARTITION BY DATE(tpep_pickup_datetime)
CLUSTER BY VendorID AS
SELECT * FROM taxi-rides-ny.nytaxi.external_yellow_tripdata;
```

Compare the performance and cost of querying a non-partitioned table vs. a partitioned table vs. a partitioned and clustered table:
```SQL
-- Create a non partitioned table from external table
CREATE OR REPLACE TABLE taxi-rides-ny.nytaxi.yellow_tripdata_non_partitoned AS
SELECT * FROM taxi-rides-ny.nytaxi.external_yellow_tripdata;

-- Impact of partition
-- Scanning 1.6GB of data
SELECT DISTINCT(VendorID)
FROM taxi-rides-ny.nytaxi.yellow_tripdata_non_partitoned
WHERE DATE(tpep_pickup_datetime) BETWEEN '2019-06-01' AND '2019-06-30';

-- Scanning ~106 MB of data
SELECT DISTINCT(VendorID)
FROM taxi-rides-ny.nytaxi.yellow_tripdata_partitoned
WHERE DATE(tpep_pickup_datetime) BETWEEN '2019-06-01' AND '2019-06-30';

-- Impact of partition and cluster
-- Query scans 1.1 GB
SELECT count(*) as trips
FROM taxi-rides-ny.nytaxi.yellow_tripdata_partitoned
WHERE DATE(tpep_pickup_datetime) BETWEEN '2019-06-01' AND '2020-12-31'
  AND VendorID=1;

-- Query scans 864.5 MB
SELECT count(*) as trips
FROM taxi-rides-ny.nytaxi.yellow_tripdata_partitoned_clustered
WHERE DATE(tpep_pickup_datetime) BETWEEN '2019-06-01' AND '2020-12-31'
  AND VendorID=1;
```

Note: Tables with less than 1 GiB of data don't show significant improvement from partitioning and clustering.
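
The scan sizes recorded in the query comments above can be turned into relative savings with a little arithmetic. The sketch below assumes 1 GB = 1,000 MB for simplicity; the helper name is hypothetical.

```python
def scan_reduction(before_mb, after_mb):
    """Fraction of scanned data saved when going from before_mb to after_mb."""
    return 1 - after_mb / before_mb


# Figures taken from the query comments above.
partition_saving = scan_reduction(1600, 106)   # 1.6 GB vs. ~106 MB
cluster_saving = scan_reduction(1100, 864.5)   # 1.1 GB vs. 864.5 MB

print(f"partitioning: {partition_saving:.1%} less data scanned")
print(f"clustering:   {cluster_saving:.1%} less data scanned")
```

For these queries, partitioning cuts the scan by roughly 93% and adding clustering saves roughly a further 21%. Under on-demand pricing, the compute cost falls in the same proportion because it is billed by bytes processed.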

### BigQuery Best Practices
For cost reduction:
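* Avoid `SELECT *`. BigQuery stores data by column, so selecting only the columns you need reduces the amount of data scanned.
* Price queries before running them (the console's query validator shows the estimated bytes to be processed).
* Use clustered or partitioned tables.
* Be careful with streaming inserts, which can incur additional charges.
* Materialize query results in stages.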

For query performance:
* Filter on partitioned columns.
* Denormalize the data.
* Use nested or repeated columns (when denormalizing data).
* Use external data sources sparingly.
* Reduce data before using a `JOIN` operation.
* Do not treat `WITH` clauses as prepared statements.
* Avoid oversharding tables (don't split them up too much).
* Avoid JavaScript user-defined functions.
* Use approximate aggregation functions.
* Use `ORDER BY` last.
* Optimize join patterns.
    - Place the table with the largest number of rows first, followed by the table with the fewest rows, and then the remaining tables in order of decreasing size.
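
The join-ordering rule in the last item can be expressed as a small helper. This is a sketch of the rule as stated, using row counts as the measure of size; the function name and the table names in the example are hypothetical.

```python
def join_order(table_rows):
    """Order tables for a join: largest first, then the smallest,
    then the remaining tables by decreasing row count.

    table_rows maps a table name to its (approximate) number of rows.
    """
    by_size = sorted(table_rows, key=table_rows.get, reverse=True)
    if len(by_size) <= 2:
        return by_size
    largest, smallest, middle = by_size[0], by_size[-1], by_size[1:-1]
    return [largest, smallest] + middle


# Hypothetical row counts for four tables.
tables = {"trips": 1_000_000, "zones": 265, "vendors": 3, "payments": 50_000}
print(join_order(tables))  # ['trips', 'vendors', 'payments', 'zones']
```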