# Introduction to Data Warehousing

The concept of Data Warehousing originated at IBM in the 1980s. The goal of the initial research was to provide a framework for transferring data from operational systems to business intelligence departments, while avoiding the cost and technical challenges of high redundancy.

## What will you learn in this course? 🧐🧐

This lecture introduces the concept of data warehousing and explains why we need it. Here's the outline:

* Why can't analysts work directly on business databases?
* Data Warehouse VS Data Lake
* Data Warehouse VS traditional databases
    * Key differences
* Cloud vendors
* Amazon Redshift
    * Set up your own Redshift cluster
    * Tear down your Redshift cluster when you are done
* Using Redshift in PySpark
    * Writing to Redshift from PySpark DataFrame
    * Reading from Redshift onto a PySpark DataFrame

## Why can't analysts work directly on business databases? 🤔🤔

Business databases must stay clean at all costs: giving Data Analysts or Data Scientists direct access to them puts the operational data at risk.

Moreover, most of the time, unstructured data (i.e., data not stored in any kind of database) is required to carry out a meaningful analysis.

A Warehousing solution allows the company to aggregate and store the data it needs for analysis, without altering the databases used for operations.

## Data Warehouse VS Data Lake 🗄️🆚🌊

You often hear both terms when discussing Big Data; however, they are very different.

A Data Lake is a big pool of raw data with no defined purpose: we store this unstructured data in anticipation of future use.

A Data Warehouse holds **processed** and **structured** data, ready to be used for advanced analytics.

Most of the time, data that ends up in the Warehouse was previously stored in the Lake.

- Step 1: Data is collected and stored in its raw form in a Data Lake,
- Step 2: Data is extracted from the Lake, cleaned and processed,
- Step 3: Data is loaded into the Warehouse, ready to be queried.
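
The three steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the sample records are made up, and an in-memory SQLite database stands in for the warehouse (a real pipeline would target Redshift or a similar service).

```python
import json
import sqlite3

# Step 1: raw events as they might sit in a Data Lake (hypothetical sample data)
raw_lake_records = [
    '{"customer": "alice", "amount": "12.50", "country": "fr"}',
    '{"customer": "bob", "amount": null, "country": "FR"}',
    '{"customer": "carol", "amount": "7.00", "country": "us"}',
]


def clean(record_json):
    """Step 2: parse one raw record; return a clean row, or None if unusable."""
    record = json.loads(record_json)
    if record["amount"] is None:
        return None  # drop records with a missing amount
    return (record["customer"], float(record["amount"]), record["country"].upper())


clean_rows = [row for row in map(clean, raw_lake_records) if row is not None]

# Step 3: load into the "warehouse" (in-memory SQLite used as a stand-in)
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (customer TEXT, amount REAL, country TEXT)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?)", clean_rows)

# The data is now structured and ready to be queried
total_by_country = warehouse.execute(
    "SELECT country, SUM(amount) FROM sales GROUP BY country ORDER BY country"
).fetchall()
print(total_by_country)
```

Note that the cleaning step both drops the unusable record and normalizes the country codes, so the analyst querying the warehouse never has to deal with either issue.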

## Data Warehouse VS traditional databases 🗄️🗄️🗄️🆚🗄️

Roughly, a Data Warehouse **is** a relational database. It's just a little more than that.

### Key differences 🔑

1. The Warehouse can hold data from many databases.
2. Any data stored in the Warehouse is stored for **analytics purposes only**.
3. Data within a warehouse has been processed to simplify analysis and avoid the need for SQL queries that span 300 lines.
4. Whereas traditional databases are optimized for reading and writing whole rows (or observations), data warehouses are optimized for operations on columns (or fields), such as aggregations.

In a nutshell: warehouses are optimized for performant analysis.
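
To make the row versus column distinction concrete, here is a toy sketch in plain Python. The table content and the variable names `row_store` and `column_store` are made up for illustration; real engines implement these layouts on disk, not with Python lists.

```python
# The same three-record table, stored two ways (hypothetical data)
row_store = [
    {"id": 1, "product": "apple", "price": 1.5},
    {"id": 2, "product": "banana", "price": 0.5},
    {"id": 3, "product": "orange", "price": 2.0},
]
column_store = {
    "id": [1, 2, 3],
    "product": ["apple", "banana", "orange"],
    "price": [1.5, 0.5, 2.0],
}

# Operational access pattern: fetch one whole record.
# In the row layout, all the fields of that record sit together.
record = next(row for row in row_store if row["id"] == 2)

# Analytical access pattern: aggregate a single field over all records.
# In the column layout, only the "price" values have to be read.
prices = column_store["price"]
average_price = sum(prices) / len(prices)

# The row layout gives the same answer, but has to walk through every full record.
average_price_from_rows = sum(row["price"] for row in row_store) / len(row_store)

print(record)
print(average_price)
```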

**A warehouse is the perfect candidate for `LOAD` destination in ETL pipelines.**

## Cloud vendors ☁️☁️

- BigQuery, owned by Google, and part of the Google Cloud Platform,
- Redshift, owned by Amazon and part of the AWS platform,
- Snowflake,
- ...

As always when choosing between vendors, the cost structure is one of the most important aspects to check. For instance, BigQuery storage is **much** cheaper than Redshift storage, but Redshift does not bill you **per query** (you pay for the cluster while it is running), whereas on-demand queries on BigQuery cost about 5 dollars per TB scanned. Depending on your needs, one solution might be more suitable than the other.
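
A quick back-of-the-envelope calculation can help compare the two billing models. The sketch below is only an illustration: the 5 dollars/TB figure is the one quoted above, while `HOURLY_RATE` is a hypothetical placeholder, not an actual Redshift price. Always check the vendors' current pricing pages.

```python
def per_query_cost(tb_scanned, price_per_tb):
    """Cost of a pay-per-query model: you pay for the data your queries scan."""
    return tb_scanned * price_per_tb


def cluster_cost(hours_running, hourly_rate):
    """Cost of a cluster model: you pay while the cluster is up, queries included."""
    return hours_running * hourly_rate


def break_even_tb(hours_running, hourly_rate, price_per_tb):
    """Volume scanned (in TB) above which the cluster model becomes cheaper."""
    return cluster_cost(hours_running, hourly_rate) / price_per_tb


PRICE_PER_TB = 5.0     # dollars per TB scanned, figure quoted in the text above
HOURLY_RATE = 0.25     # dollars per hour, HYPOTHETICAL placeholder value
HOURS_PER_MONTH = 730  # a cluster left running for a whole month

print(break_even_tb(HOURS_PER_MONTH, HOURLY_RATE, PRICE_PER_TB))
```

Under these assumed numbers, a team scanning less than the break-even volume each month would pay less with the per-query model, and more beyond it. Storage costs are deliberately left out of this sketch.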

## Amazon Redshift 🔴🔴

Redshift is the Data Warehousing solution from Amazon Web Services. Like every service in the AWS family, Redshift is **Cloud-based**: you only pay for compute and storage, and you don't have to take care of maintenance or of scaling the hardware to support an increasing load.

### Creating your Redshift

Go to your AWS Console and look for Redshift. Check that you are in the right region (here, `Paris`). Click on _Create a cluster_!

![](https://full-stack-assets.s3.eu-west-3.amazonaws.com/M05-D03-Redshift/Redshift001.png)

Be sure to **select the ⚠️ free tier ⚠️!**

![](https://full-stack-assets.s3.eu-west-3.amazonaws.com/M05-D03-Redshift/Redshift002.png)

Enter a user (or leave the default one) and a password:

![](https://full-stack-assets.s3.eu-west-3.amazonaws.com/M05-D03-Redshift/Redshift003.png)

Click on **Create a cluster**!

Wait ⏳ for your cluster to become available:

![](https://full-stack-assets.s3.eu-west-3.amazonaws.com/M05-D03-Redshift/Redshift004.png)

When the status is green 🟢, your cluster is ready! Well, almost. We are going to make our Redshift cluster publicly accessible, so that we can easily push data from anywhere.

To do so, click on your cluster, then choose _Actions_ and _Modify public access_ in the panel.

![](https://full-stack-assets.s3.eu-west-3.amazonaws.com/M05-D03-Redshift/Redshift005.png)

Activate the access:

![](https://full-stack-assets.s3.eu-west-3.amazonaws.com/M05-D03-Redshift/Redshift006.png)

And that's it! 🎉

The only information we are going to need later on is the 👉 **JDBC URL**, located here:

![](https://full-stack-assets.s3.eu-west-3.amazonaws.com/M05-D03-Redshift/Redshift007.png)
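
If you prefer code over the console, the same information can be rebuilt from what `boto3`'s `describe_clusters` call returns. The sketch below is a minimal illustration: `jdbc_url_from_cluster` is a hypothetical helper, and `example_cluster` is a hand-written dictionary (with a made-up endpoint) that mimics the shape of one entry of the response, trimmed to the fields used here.

```python
def jdbc_url_from_cluster(cluster):
    """Build the JDBC URL from one entry of describe_clusters()["Clusters"]."""
    endpoint = cluster["Endpoint"]
    return f"jdbc:redshift://{endpoint['Address']}:{endpoint['Port']}/{cluster['DBName']}"


# Hand-written stand-in for an API response (hypothetical values)
example_cluster = {
    "ClusterIdentifier": "redshift-cluster-1",
    "ClusterStatus": "available",
    "DBName": "dev",
    "Endpoint": {
        "Address": "redshift-cluster-1.example.eu-west-3.redshift.amazonaws.com",
        "Port": 5439,
    },
}

# Only read the endpoint once the cluster is up
if example_cluster["ClusterStatus"] == "available":
    print(jdbc_url_from_cluster(example_cluster))

# With boto3 installed and AWS credentials configured, the real call would be:
# import boto3
# client = boto3.client("redshift", region_name="eu-west-3")
# cluster = client.describe_clusters(ClusterIdentifier="redshift-cluster-1")["Clusters"][0]
# print(jdbc_url_from_cluster(cluster))
```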

#### Error

If you run into this error:
 
![](https://full-stack-assets.s3.eu-west-3.amazonaws.com/M05-D03-Redshift/Redshift003b-error-cluster-subnet.png)
 
Go to Config, then Subnet group:

![](https://full-stack-assets.s3.eu-west-3.amazonaws.com/M05-D03-Redshift/Redshift003b1-error-cluster-subnet.png)

Then create a Subnet group:

![](https://full-stack-assets.s3.eu-west-3.amazonaws.com/M05-D03-Redshift/Redshift003b2-error-cluster-subnet.png)

Simply enter a description, select a VPC (the default one is okay), an availability zone and a subnet, click on Add a subnet, and finally create the group.

![](https://full-stack-assets.s3.eu-west-3.amazonaws.com/M05-D03-Redshift/Redshift003b3-error-cluster-subnet.png)

You are done! 🎉

### How to tear down your Redshift?

When you have finished working with your Redshift cluster, we advise you to ⚠️ **tear down your Redshift cluster to avoid unnecessary costs.** ⚠️

It is easy: just follow the steps below.

Go to your Redshift clusters panel and select your cluster. Then go to _Actions_ and click _Delete_:

![](https://full-stack-assets.s3.eu-west-3.amazonaws.com/M05-D03-Redshift/RedshiftDown001.png)

👉 A prompt will ask you to confirm. ⚠️ **You have to deselect the _Create a instant snapshot_ option.** ⚠️ Otherwise, you will be charged for storing the snapshot.

![](https://full-stack-assets.s3.eu-west-3.amazonaws.com/M05-D03-Redshift/RedshiftDown002.png)

In the following screenshot, we are good to go:

![](https://full-stack-assets.s3.eu-west-3.amazonaws.com/M05-D03-Redshift/RedshiftDown003.png)

Finally, you can also delete the subnet group.
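
The same teardown can also be scripted with `boto3`. Here is a minimal sketch, assuming a cluster named `redshift-cluster-1`: `tear_down_cluster` is a hypothetical helper, and `FakeRedshiftClient` is a stand-in so that the example runs without AWS credentials. Passing `SkipFinalClusterSnapshot=True` is the scripted equivalent of deselecting the snapshot option in the console.

```python
def tear_down_cluster(client, cluster_identifier):
    """Delete the cluster WITHOUT keeping a final snapshot (no leftover storage cost)."""
    return client.delete_cluster(
        ClusterIdentifier=cluster_identifier,
        SkipFinalClusterSnapshot=True,
    )


class FakeRedshiftClient:
    """Stand-in for boto3.client("redshift"): it only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def delete_cluster(self, **kwargs):
        self.calls.append(kwargs)
        return {
            "Cluster": {
                "ClusterIdentifier": kwargs["ClusterIdentifier"],
                "ClusterStatus": "deleting",
            }
        }


fake_client = FakeRedshiftClient()
response = tear_down_cluster(fake_client, "redshift-cluster-1")
print(response["Cluster"]["ClusterStatus"])

# With boto3 installed and AWS credentials configured, the real call would be:
# import boto3
# tear_down_cluster(boto3.client("redshift", region_name="eu-west-3"), "redshift-cluster-1")
```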

## Using Redshift in PySpark

### Writing to Redshift from PySpark DataFrame ✨➡🔴

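Let's see how to use Redshift with PySpark. First, we create a simple DataFrame:

```python
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

# Reuse the notebook's Spark session if there is one, otherwise create it
spark = SparkSession.builder.getOrCreate()

data_dict = {
    "a": [1, 2, 3],
    "b": [2, 3, 4],
    "c": [3, 4, 5],
    "d": [np.nan, 0, 1],  # np.nan: the np.NaN alias was removed in NumPy 2.0
    "e": ["apple", "banana", "orange"],
}

pandas_df = pd.DataFrame.from_dict(data_dict)

# Convert the pandas DataFrame into a Spark DataFrame
df = spark.createDataFrame(pandas_df)

df.show()
```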

Then you need to fill in some information:

> `REDSHIFT_FULL_PATH` is the JDBC URL from the cluster panel we mentioned above 👆. Remember? 🙂

```python
REDSHIFT_USER = "YOUR_REDSHIFT_USERNAME"
REDSHIFT_PASSWORD = "YOUR_REDSHIFT_PASSWORD"

# Paste your JDBC URL here, and don't forget to replace "redshift" by "postgresql".
# For example, it will look like:
# "jdbc:postgresql://redshift-cluster-1.csssws1edn9m.eu-west-3.redshift.amazonaws.com:5439/dev"
REDSHIFT_FULL_PATH = "URL_JDBC"
REDSHIFT_TABLE = "NAME_OF_THE_TABLE"
```
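
To avoid editing the URL by hand, the "redshift" to "postgresql" replacement can be done with a small helper. This is a sketch: `to_postgresql_jdbc_url` is a hypothetical function name, and the example URL is the one shown above with the console's `jdbc:redshift://` prefix.

```python
def to_postgresql_jdbc_url(jdbc_url):
    """Swap the "jdbc:redshift://" prefix for "jdbc:postgresql://"."""
    redshift_prefix = "jdbc:redshift://"
    postgresql_prefix = "jdbc:postgresql://"
    if jdbc_url.startswith(redshift_prefix):
        return postgresql_prefix + jdbc_url[len(redshift_prefix):]
    if jdbc_url.startswith(postgresql_prefix):
        return jdbc_url  # already converted, nothing to do
    raise ValueError(f"Not a Redshift or PostgreSQL JDBC URL: {jdbc_url!r}")


console_url = "jdbc:redshift://redshift-cluster-1.csssws1edn9m.eu-west-3.redshift.amazonaws.com:5439/dev"
print(to_postgresql_jdbc_url(console_url))
```

Only the prefix is touched on purpose: the host name itself contains the word "redshift" and must stay as it is.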

We can then write to our Redshift:

```python
mode = "overwrite"

properties = {
    "user": REDSHIFT_USER,
    "password": REDSHIFT_PASSWORD,
    "driver": "org.postgresql.Driver",  # PostgreSQL JDBC driver, matching the "jdbc:postgresql://" URL
}

df.write.jdbc(url=REDSHIFT_FULL_PATH, table=REDSHIFT_TABLE, mode=mode, properties=properties)
```

The 4 values of `mode` to choose from are:

- `overwrite`: drop the table if it exists, then load the data into a new one,
- `append`: create the table if it does not exist, else append the data to the existing table,
- `error` (default, also spelled `errorifexists`): create the table, or raise an error if it already exists,
- `ignore`: create the table if it does not exist, but silently do nothing if it already exists.
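
The behaviour of the four modes can be summarized with a toy model in plain Python. This is not Spark code: `write_table` is a hypothetical function, and a dictionary of lists stands in for the warehouse, just to make the semantics easy to test.

```python
def write_table(warehouse, table, rows, mode="error"):
    """Mimic the semantics of the four save modes on a dict of lists."""
    exists = table in warehouse
    if mode == "overwrite":
        warehouse[table] = list(rows)  # replace whatever was there
    elif mode == "append":
        warehouse.setdefault(table, []).extend(rows)  # create if needed, then add
    elif mode in ("error", "errorifexists"):
        if exists:
            raise ValueError(f"Table {table!r} already exists")
        warehouse[table] = list(rows)
    elif mode == "ignore":
        if not exists:
            warehouse[table] = list(rows)  # otherwise: silently do nothing
    else:
        raise ValueError(f"Unknown mode: {mode!r}")


warehouse = {}
write_table(warehouse, "fruits", ["apple"])                    # default mode: table created
write_table(warehouse, "fruits", ["banana"], mode="append")    # rows added
after_append = list(warehouse["fruits"])
write_table(warehouse, "fruits", ["kiwi"], mode="ignore")      # table exists: no change
after_ignore = list(warehouse["fruits"])
write_table(warehouse, "fruits", ["orange"], mode="overwrite")  # table replaced
print(warehouse)
```

Calling `write_table(warehouse, "fruits", ["pear"])` again at this point would raise, since the default mode refuses to touch an existing table.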

### Reading from Redshift onto a PySpark DataFrame 🔴➡✨

We can read from our Redshift in a few lines:

```python
properties = {
    "user": REDSHIFT_USER,
    "password": REDSHIFT_PASSWORD,
    "driver": "org.postgresql.Driver",
}

# spark.read replaces the deprecated sqlContext.read entry point
table = spark.read.jdbc(url=REDSHIFT_FULL_PATH, table=REDSHIFT_TABLE, properties=properties)

table.show()
```

Although this can be useful, it is also possible to query your database using the Redshift query editor directly, which is most likely what data analysts and business analysts would be doing in a real-life context.
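
When you do read from PySpark, you may not need the whole table. Spark's JDBC reader accepts a parenthesized subquery with an alias in place of a table name, so the filtering happens on the warehouse side. The sketch below only builds that string: `as_jdbc_subquery` is a hypothetical helper, and the query, alias and column names are made up to match the example DataFrame above.

```python
def as_jdbc_subquery(sql, alias="subquery"):
    """Wrap a SELECT statement so it can be used where a table name is expected."""
    stripped = sql.strip().rstrip(";").strip()  # a trailing ";" would break the wrapping
    return f"({stripped}) AS {alias}"


query = as_jdbc_subquery(
    "SELECT e, a + b AS total FROM NAME_OF_THE_TABLE WHERE a > 1;",
    alias="fruit_totals",
)
print(query)

# Then, with the Spark session and connection settings defined earlier:
# filtered = spark.read.jdbc(url=REDSHIFT_FULL_PATH, table=query, properties=properties)
# filtered.show()
```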

Congrats! 👏 You just created your first data warehouse using Redshift! Do not forget to 👉 **[tear down your Redshift cluster](#how-to-tear-down-your-redshift)** 👈 or you run the risk of being charged.

## Resources 📚📚

- [A nice article on Alooma's blog](https://www.alooma.com/blog/database-vs-data-warehouse)
- [Amazon Redshift data source (Databricks documentation)](https://docs.databricks.com/data/data-sources/aws/amazon-redshift.html#setting-a-custom-column-type)