## Why Data Lakes: Evolution of the Data Warehouse

### Evolution of the Data Warehouse
Q: Is there anything wrong with the data warehouse that we need something different?

No. Data warehousing is a rather **mature field** with years of accumulated experience and **tried-and-true technologies**. **Dimensional modeling is still extremely relevant** to this day.

**For many organizations, a data warehouse is still the best way to go**. Perhaps the biggest change would be moving from an on-premises deployment to a cloud deployment.

Q: So, why do we need a data lake?

In recent years, several factors have driven the evolution of the data warehouse. To name a few:
* The abundance of unstructured data (text, XML, JSON, logs, sensor data, images, voice, etc.)
* Unprecedented data volumes (social, IoT, machine-generated, etc.)
* The rise of Big Data technologies like HDFS and Spark
* New types of data analysis gaining momentum, e.g. predictive analytics, recommender systems, and graph analytics
* The emergence of new roles like the data scientist

## Why Data Lakes: Unstructured & Big Data

### Abundance of Unstructured Data
Q: Can we have unstructured data in the data warehouse?

* It might be possible in the ETL process. For instance, we might be able to distill some elements from JSON data and put them in a tabular format.
* But later we might decide we want to transform the data differently, so deciding on a particular form of transformation is a **strong commitment without enough knowledge**. E.g. we start by recording the number of replies on Facebook comments, and later we become interested in the frequency of angry words.
* Some data is hard to put in a tabular format, like **deep JSON structures**.
* Some data, like text/PDF documents, could be stored as "**blobs**" in a relational database but is totally **useless unless processed to extract metrics**.
* The Hadoop Distributed File System (HDFS) made it possible to store petabytes of data on commodity hardware, at a **much lower cost per TB** compared to MPP (massively parallel processing) databases.
* Associated processing tools, starting from MapReduce, Pig, Hive, Impala, and Spark, to name a few, made it possible to **process this data at scale on the same hardware used for storage**.
* It is possible to analyze data without first inserting it into a predefined schema. One can load a CSV file and query it without creating a table and inserting the data into it. Similarly, one can process unstructured text. This approach is known as "**Schema-on-Read**".
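
To make the "strong commitment" point concrete, here is a minimal sketch in plain Python (not Spark). The records, field names, and the `ANGRY_WORDS` list are hypothetical and chosen only for illustration. Because the raw nested JSON is kept as-is, a second metric can be derived later without re-ingesting anything:

```python
import json

# Hypothetical raw records, stored exactly as they arrived (one JSON document per line).
raw_lines = [
    '{"post": "p1", "comments": [{"text": "I hate this", "replies": [{"text": "so angry"}, {"text": "calm down"}]}]}',
    '{"post": "p2", "comments": [{"text": "nice", "replies": []}, {"text": "awful and terrible", "replies": [{"text": "agreed"}]}]}',
]

# Hypothetical word list, for illustration only.
ANGRY_WORDS = {"hate", "angry", "awful", "terrible"}


def count_replies(record):
    """First question asked of the data: how many replies per post?"""
    return sum(len(comment["replies"]) for comment in record["comments"])


def iter_texts(record):
    """Walk the nested structure and yield every comment and reply text."""
    for comment in record["comments"]:
        yield comment["text"]
        for reply in comment["replies"]:
            yield reply["text"]


def count_angry_words(record):
    """A later question, answerable only because the raw text was kept."""
    return sum(
        word in ANGRY_WORDS
        for text in iter_texts(record)
        for word in text.lower().split()
    )


records = [json.loads(line) for line in raw_lines]
reply_table = [(r["post"], count_replies(r)) for r in records]
angry_table = [(r["post"], count_angry_words(r)) for r in records]
```

Had the ETL step stored only the reply counts in a table, the second metric could not have been computed from the warehouse alone.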

## Why Data Lakes: New Roles & Advanced Analytics
* The data warehouse by design follows a **very well-architected** path to make a **clean, consistent, and performant model** that business users can easily use to gain insights and make decisions.
* As data became an asset of the highest value (**Data is the new oil**), the role of the **data scientist** started to emerge, seeking value from data.
* The data scientist's job is almost impossible when confined to a **single rigid representation of data**. They need the freedom to represent data, join data sets together, retrieve new external data sources, and more.
* Types of analytics such as **machine learning and natural language processing** need to access the raw data in forms totally different from a star schema.

### The Data Lake is the new Data Warehouse
* The data lake shares the data warehouse's goal of supporting business insights beyond day-to-day transactional data handling.
* The data lake is a new form of data warehouse that evolved to cope with:
    * The **variety of data formats** and structuring
    * The agile and ad-hoc nature of **data exploration** activities needed by new roles like the **data scientist**
    * The wide spectrum of data transformations needed by **advanced analytics** like machine learning, graph analytics, and recommender systems

## Big Data Effects: Low Costs, ETL Offloading
<img src="images/big_data_effects.png">

## Big Data Effects: Schema-on-Read
* Traditionally, data in a database has been much easier to process than data in plain files
* Big Data tools in the Hadoop ecosystem, e.g. Hive & Spark, made it as easy to work with a file as with a database, without:
    * ~Creating a database~
    * ~Inserting the data into a database~
* **Schema-on-read**: as for the schema of a table (a simple file on disk):
    * It is either inferred
    * Or specified, in which case the data is not inserted into it; instead, the data is checked against the specified schema upon read
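
The two schema-on-read options above can be sketched in plain Python. This is an illustration of the idea, not how Hive or Spark implement it. `CSV_TEXT`, `infer_type`, and `read_with_schema` are hypothetical names, and the file contents are made up:

```python
import csv
import io

# Hypothetical file contents; in practice this would be a plain file on disk.
CSV_TEXT = "id,city,temp\n1,Cairo,35.5\n2,Oslo,12\n"


def infer_type(values):
    """Pick the narrowest type (int, then float, else str) that parses every value."""
    for caster in (int, float):
        try:
            for value in values:
                caster(value)
            return caster
        except ValueError:
            continue
    return str


def read_with_schema(text, schema=None):
    """Apply a schema while reading; nothing is created or inserted beforehand."""
    rows = list(csv.DictReader(io.StringIO(text)))
    if schema is None:
        # Option 1: infer the schema from the data itself.
        columns = rows[0].keys() if rows else []
        schema = {col: infer_type([row[col] for row in rows]) for col in columns}
    # Option 2 (schema given): each value is checked by casting it on read.
    typed = [{col: cast(row[col]) for col, cast in schema.items()} for row in rows]
    return schema, typed


inferred_schema, typed_rows = read_with_schema(CSV_TEXT)
```

Passing an explicit schema that does not fit the data, for example `read_with_schema(CSV_TEXT, {"temp": int})`, raises a `ValueError` at read time, which is the moment a schema-on-read system discovers a mismatch.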
    
<img src="images/schema_on_read_1.png">
<img src="images/schema_on_read_2.png">
<img src="images/schema_on_read_3.png">
<img src="images/schema_on_read_4.png">

## Big Data Effects: (Un-/Semi-)Structured support
* Spark can read and write files in:
    * Many text-based formats: CSV, JSON, plain text
    * Binary formats such as Avro (saves space) and Parquet (columnar)
    * Compressed formats, e.g. gzip & Snappy
    
       
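    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()

    # Spark decompresses .gz input transparently when reading
    dfLog = spark.read.text("data/NASA_access_log_Jul95.gz")
    dfRaw = spark.read.csv("data/news_worldnews.csv")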
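
As a rough stand-in for the compressed text case, the sketch below uses only the Python standard library rather than Spark. It shows that a gzip-compressed file of JSON lines can be read line by line just like plain text. The records are hypothetical and held in memory instead of on disk:

```python
import gzip
import io
import json

# Hypothetical log records, for illustration only.
records = [
    {"host": "a.example", "status": 200},
    {"host": "b.example", "status": 404},
]

# Serialize as JSON lines and gzip-compress the bytes in memory.
plain = "\n".join(json.dumps(record) for record in records).encode("utf-8")
compressed = gzip.compress(plain)

# Open the compressed bytes in text mode and parse one JSON document per line.
with gzip.open(io.BytesIO(compressed), mode="rt", encoding="utf-8") as handle:
    decoded = [json.loads(line) for line in handle]
```

The reading code never handles decompression explicitly, which mirrors how the `spark.read` calls above take a `.gz` path without extra options.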