# Introduction to Data Warehousing

The concept of Data Warehousing originated at IBM in the 1980s. The goal of the initial research was to provide a framework for transferring data from operational systems to business intelligence departments while avoiding the cost and technical challenges of high redundancy.

## What will you learn in this course? 🧐🧐
This lecture introduces the concept of data warehousing and explains why we need it. Here's the outline:
* Why Analysts cannot work directly on business databases
* Data Warehouse VS Data Lake
* Data Warehouse VS traditional databases
    * Key differences
* Cloud vendors
* Amazon Redshift
    * Reading from Redshift onto a PySpark DataFrame
    * Writing to Redshift from PySpark DataFrame


## Why Analysts cannot work directly on business databases 🤔🤔

Business databases must stay clean at all costs: giving Data Analysts or Data Scientists direct access introduces a risk that operational data gets altered.

Moreover, most of the time, unstructured data (i.e., data not stored in any kind of database) is also required for a thorough analysis.

A Warehousing solution allows the company to aggregate and store the data it needs for analysis, without altering the databases used for operations.

## Data Warehouse VS Data Lake 🗄️🆚🌊

You often hear both terms when discussing Big Data, but they are very different.

A Data Lake is a large pool of raw data with no defined purpose yet: this unstructured data is stored in anticipation of future use.

A Data Warehouse holds **processed** and **structured** data, ready to be used for advanced analytics.

Most of the time, data that ends up in the Warehouse was previously stored in the Lake. 

- Step 1: Data is collected and stored in its raw form in a Data Lake
- Step 2: Data is extracted from the Lake, cleaned and processed
- Step 3: Data is loaded in the warehouse, ready to be queried.
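
The three steps above can be sketched in a few lines of Python. This is only a toy illustration: the raw events, the `sales` table and the in-memory SQLite database (standing in for the warehouse) are all hypothetical, chosen so that the example runs without any cloud account.

```python
import json
import sqlite3

# Step 1: hypothetical raw events, as they might sit in a Data Lake (one JSON document per line)
RAW_EVENTS = [
    '{"customer": " Alice ", "amount": "12.50", "country": "fr"}',
    '{"customer": "bob", "amount": "n/a", "country": "US"}',  # unusable amount
    '{"customer": "Carol", "amount": "7", "country": "us"}',
]

def extract(lines):
    """Step 2a: read the raw records out of the lake."""
    return [json.loads(line) for line in lines]

def transform(records):
    """Step 2b: clean the records and drop the ones that cannot be repaired."""
    cleaned = []
    for record in records:
        try:
            amount = float(record["amount"])
        except ValueError:
            continue  # skip rows whose amount is not a number
        cleaned.append((record["customer"].strip().lower(), amount, record["country"].upper()))
    return cleaned

def load(rows, connection):
    """Step 3: load the structured rows into a warehouse table."""
    connection.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL, country TEXT)")
    connection.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    connection.commit()

warehouse = sqlite3.connect(":memory:")  # stand-in for the real warehouse
load(transform(extract(RAW_EVENTS)), warehouse)

# The data is now structured and ready to be queried
totals = warehouse.execute(
    "SELECT country, SUM(amount) FROM sales GROUP BY country ORDER BY country"
).fetchall()
print(totals)
```

In a real pipeline each step would typically be handled by dedicated tooling (object storage for the lake, Spark for the transformation, Redshift or BigQuery for the warehouse), but the shape of the flow stays the same.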

## Data Warehouse VS traditional databases 🗄️🗄️🗄️🆚🗄️

Roughly, a Data Warehouse **is** a relational database. It's just a little more than that.

### Key differences 🔑

1. The Warehouse can hold data from many databases
2. Any data stored in the Warehouse is stored for **analytics purposes only**
3. Data within a warehouse has already been processed to simplify the analysis and to avoid the need for SQL queries that span 300 lines
4. Whereas databases are optimized for extracting rows (or observations), data warehouses are optimized for operations on columns (or fields), such as aggregations.
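
To get an intuition for the last point, here is a toy sketch of the two layouts in plain Python. The `rows` and `columns` structures and the two helper functions are hypothetical; real engines are far more sophisticated, but the idea is the same: a row layout makes it cheap to fetch one full observation, while a column layout makes it cheap to scan a single field.

```python
# Row-oriented layout: each observation is stored together
rows = [
    {"order_id": 1, "customer": "alice", "amount": 12.5},
    {"order_id": 2, "customer": "bob", "amount": 7.0},
    {"order_id": 3, "customer": "alice", "amount": 30.0},
]

# Column-oriented layout: each field is stored together
columns = {
    "order_id": [row["order_id"] for row in rows],
    "customer": [row["customer"] for row in rows],
    "amount": [row["amount"] for row in rows],
}

def fetch_order(order_id):
    """Typical operational query: get every field of one observation."""
    return next(row for row in rows if row["order_id"] == order_id)

def total_amount():
    """Typical analytical query: aggregate one field, touching only that column."""
    return sum(columns["amount"])

print(fetch_order(2))
print(total_amount())
```

Notice that `total_amount` never reads the `customer` or `order_id` values: on large tables, skipping the irrelevant fields is where the performance boost comes from.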

In a nutshell: warehouses are optimized for performant analysis.

**A warehouse is the perfect candidate for the `LOAD` destination in ETL pipelines.**

## Cloud vendors ☁️☁️

- BigQuery, owned by Google, and part of the Google Cloud Platform
- Redshift, owned by Amazon and part of the AWS platform
- Snowflake
- ...

As always when choosing between different vendors, the cost structure is one of the most important aspects to check. For instance, BigQuery storage is **much** cheaper than Redshift, but Redshift does **not** bill you per query (you pay for the cluster itself), whereas querying costs about 5 dollars/TB on BigQuery. Depending on your needs, one solution might be more suitable than the other.
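
A quick back-of-the-envelope calculation helps when comparing the two pricing models. The sketch below is only an assumption-laden illustration: the 5 dollars/TB figure is the one quoted above, while the hourly cluster rate and the 730 hours per month are made-up placeholders, and storage costs are ignored entirely. Always check the vendors' current price lists before deciding.

```python
def per_query_monthly_cost(tb_scanned_per_month, price_per_tb=5.0):
    """Monthly bill when each query is charged by the volume of data it scans."""
    return tb_scanned_per_month * price_per_tb

def cluster_monthly_cost(hourly_rate, hours_per_month=730):
    """Monthly bill when you pay for a cluster that is always on (hypothetical rate)."""
    return hourly_rate * hours_per_month

def break_even_tb(hourly_rate, price_per_tb=5.0, hours_per_month=730):
    """Volume scanned per month above which the always-on cluster becomes cheaper."""
    return cluster_monthly_cost(hourly_rate, hours_per_month) / price_per_tb

HYPOTHETICAL_HOURLY_RATE = 1.0  # placeholder, not an actual vendor price

print(per_query_monthly_cost(20))               # cost of scanning 20 TB in a month
print(cluster_monthly_cost(HYPOTHETICAL_HOURLY_RATE))
print(break_even_tb(HYPOTHETICAL_HOURLY_RATE))  # TB per month where both bills match
```

With these placeholder numbers, a team scanning only a few TB per month would pay less under per-query pricing, while a team scanning hundreds of TB would be better off with the cluster.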

## Amazon Redshift 🔴🔴

Redshift is the Data Warehousing solution from Amazon Web Services. Like every service in the AWS family, Redshift is **Cloud-based**: you only pay for compute and storage, and you don't have to worry about maintenance costs or about scaling the hardware to support an increasing load.

### Reading from Redshift onto a PySpark DataFrame 🔴➡✨

```
REDSHIFT_USER = 'YOUR_REDSHIFT_USERNAME'
REDSHIFT_PASSWORD = 'YOUR_REDSHIFT_PASSWORD'

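# JDBC URL of the cluster: don't forget to replace "redshift" with "postgresql"
redshift_path_full = "JDBC_LINK"
REDSHIFT_TABLE = 'NAME_OF_THE_TABLE'

# Connection properties for the PostgreSQL JDBC driver
properties = {"user": REDSHIFT_USER, "password": REDSHIFT_PASSWORD, "driver": "org.postgresql.Driver"}

# Read the whole table into a Spark DataFrame, using the URL defined above
table = sqlContext.read.jdbc(url=redshift_path_full, table=REDSHIFT_TABLE, properties=properties)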
```
        
Although this can be useful, it is also possible to query your database directly from the Redshift query editor, which is most likely what data analysts and business analysts would do in a real-life context.

### Writing to Redshift from PySpark DataFrame ✨➡🔴
```
REDSHIFT_USER = 'YOUR_REDSHIFT_USERNAME'
REDSHIFT_PASSWORD = 'YOUR_REDSHIFT_PASSWORD'

redshift_path_full = "JDBC_LINK" # don't forget to replace "redshift" with "postgresql"
REDSHIFT_TABLE = 'NAME_OF_THE_TABLE'
```
As written in this [tutorial](https://github.com/databricks/spark-redshift/tree/master/tutorial), there are several modes you can choose from when loading data from Spark to Redshift. 

The four available `mode` values are:
  - `overwrite`: drop the table if it exists, then load the data into a new one
  - `append`: create the table if it does not exist, else append the data to the existing table
  - `error` (default): create the table, or raise an error if it already exists
  - `ignore`: create the table if it does not exist, but do nothing if it already exists
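
If the difference between the modes is not obvious yet, the toy model below may help. It is not Spark code: the `save` function and the `tables` dictionary are hypothetical stand-ins that only mimic how each mode treats an existing table.

```python
def save(tables, name, rows, mode="error"):
    """Mimic the save modes on a dict that maps table names to lists of rows."""
    exists = name in tables
    if mode == "overwrite":
        tables[name] = list(rows)                 # replace whatever was there
    elif mode == "append":
        tables.setdefault(name, []).extend(rows)  # create if needed, then add rows
    elif mode == "error":
        if exists:
            raise ValueError(f"table {name!r} already exists")
        tables[name] = list(rows)
    elif mode == "ignore":
        if not exists:
            tables[name] = list(rows)             # existing table is left untouched
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return tables

tables = {}
save(tables, "sales", [1, 2])                    # default mode creates the table
save(tables, "sales", [3], mode="append")        # rows are added
save(tables, "sales", [9], mode="ignore")        # nothing happens, the table exists
print(tables["sales"])
save(tables, "sales", [4, 5], mode="overwrite")  # previous rows are dropped
print(tables["sales"])
```

Calling `save(tables, "sales", [6])` at this point would raise an error, because the default mode refuses to touch an existing table.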

We ask you to use the `overwrite` mode here. Also, when writing through the spark-redshift connector, you need to set the `tempformat` option to `csv`, because the default Avro format restricts which characters are allowed in column names.
```
mode = "overwrite"
properties = {"user": REDSHIFT_USER, "password": REDSHIFT_PASSWORD, "driver": "org.postgresql.Driver"}

df_clean.write.jdbc(url=redshift_path_full, table=REDSHIFT_TABLE, mode=mode, properties=properties)
```
Because the spark-redshift connector stores its intermediary files in an S3 bucket, both Spark and Redshift need to have access to that bucket.

→ Ensure that the Redshift cluster has assumed an IAM role that gives it access to the `tempdir` S3 bucket (or use the `forward_spark_s3_credentials` option).

→ By default, Spark uses the Avro format for intermediary storage in S3. Using CSV can significantly improve loading performance, and also allows column names to contain characters other than ASCII letters.

→ By default, every `string` column is loaded into Redshift as a 256-byte `VARCHAR`. To gain performance or flexibility, you can change this default behavior by attaching a `redshift_type` metadata entry to the DataFrame's column. See the docs below for implementations in Scala and Python.
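
The notes above can be combined into a small helper. This is a sketch under several assumptions: the spark-redshift connector is installed on the cluster and registered under the format name `com.databricks.spark.redshift` (other distributions of the connector use a different name), and the JDBC URL, bucket path and IAM role ARN are placeholders to replace with your own. The helper functions `build_redshift_options` and `write_to_redshift` are hypothetical names, not part of any library.

```python
def build_redshift_options(jdbc_url, table, tempdir, iam_role=None):
    """Collect the connector options discussed above into one dictionary."""
    options = {
        "url": jdbc_url,       # JDBC URL of the Redshift cluster
        "dbtable": table,      # destination table
        "tempdir": tempdir,    # S3 location used for the intermediary files
        "tempformat": "CSV",   # use CSV instead of the default Avro
    }
    if iam_role:
        # Redshift reads the S3 bucket through this IAM role
        options["aws_iam_role"] = iam_role
    else:
        # Otherwise, forward the credentials Spark itself uses for S3
        options["forward_spark_s3_credentials"] = "true"
    return options

def write_to_redshift(df, options, mode="overwrite"):
    """Write a Spark DataFrame to Redshift through the connector (assumed format name)."""
    (df.write
       .format("com.databricks.spark.redshift")
       .options(**options)
       .mode(mode)
       .save())

# Placeholder values only: replace them with your own cluster, bucket and role
options = build_redshift_options(
    jdbc_url="JDBC_LINK",
    table="NAME_OF_THE_TABLE",
    tempdir="s3a://YOUR_BUCKET/tmp/",
    iam_role="YOUR_IAM_ROLE_ARN",
)
print(options)
# write_to_redshift(df_clean, options)  # requires a live Spark session and cluster
```

Keeping the options in a plain dictionary makes it easy to reuse the same settings for several tables and to switch between the IAM role and the forwarded credentials.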

## Resources 📚📚

- [A nice article on Alooma's blog](https://www.alooma.com/blog/database-vs-data-warehouse)
- [Amazon Redshift](https://docs.databricks.com/data/data-sources/aws/amazon-redshift.html#setting-a-custom-column-type)