# Data Warehousing - Introduction and Concepts


A **data warehouse** is a centralized repository designed to store large volumes of data collected from across an organization. 
Unlike an operational database, which primarily supports day-to-day transactions, a data warehouse is built for **analysis and decision-making**.

- **Database vs. Data Warehouse**  
  - *Database*: The platform where operational data is created and stored.  
  - *Data Warehouse*: A store of data prepared for analysis, typically built on top of a database platform.  

Data warehouses do not generate data through transactions (e.g., online purchases, course registrations, employee hiring). 
Instead, they **collect and consolidate data** from various operational systems and external sources.



## Key Characteristics of a Data Warehouse (Bill Inmon’s Rules)

In 1990, **Bill Inmon**, known as the *Father of Data Warehousing*, defined four essential characteristics:

1. **Integrated**  
   - Data from multiple operational and external systems is combined into a unified format.  

2. **Subject-Oriented**  
   - Data is organized by subject areas (e.g., customers, products, sales), not by application.  

3. **Time-Variant**  
   - Contains **historical data** (not just current state), enabling trend analysis over time.  

4. **Non-Volatile**  
   - Data is **stable between refresh cycles**.  
   - New or updated data is loaded in periodic batches (e.g., nightly).  
   - Once loaded, data is not modified by transactional updates.  
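
The time-variant and non-volatile characteristics can be illustrated with a small sketch. It uses Python's built-in `sqlite3` module as a stand-in for a warehouse; the table `customer_snapshot`, its columns, and the helper `load_batch` are hypothetical names chosen only for this example. Each batch is appended with its load date, and earlier rows are never updated in place.

```python
import sqlite3

# Hypothetical table and column names, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customer_snapshot (
        customer_id   INTEGER,
        city          TEXT,
        snapshot_date TEXT,   -- load date in ISO format
        PRIMARY KEY (customer_id, snapshot_date)
    )
    """
)


def load_batch(conn, snapshot_date, rows):
    """Append one batch; earlier snapshots are never updated or deleted."""
    conn.executemany(
        "INSERT INTO customer_snapshot VALUES (?, ?, ?)",
        [(cid, city, snapshot_date) for cid, city in rows],
    )
    conn.commit()


# Two nightly loads; customer 1 changed city between them.
load_batch(conn, "2024-01-01", [(1, "Boston"), (2, "Denver")])
load_batch(conn, "2024-01-02", [(1, "Chicago"), (2, "Denver")])

# Time-variant: the full history of customer 1 is still queryable.
history = conn.execute(
    "SELECT snapshot_date, city FROM customer_snapshot "
    "WHERE customer_id = 1 ORDER BY snapshot_date"
).fetchall()
print(history)
```

An operational system would typically overwrite the customer's city; here both values survive, which is what makes trend analysis possible.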



## How Data Flows into a Data Warehouse

- Data is **copied, not moved**, from operational systems.  
- Source systems continue to operate independently.  
- The warehouse is periodically refreshed (commonly once per day).  
- Data is often **restructured and reorganized** to optimize analytical queries.
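
The flow described above can be sketched as a minimal refresh job. This is an illustration under stated assumptions, not a production pipeline: two in-memory SQLite databases stand in for an operational system and a warehouse, and the names `orders`, `daily_sales`, and `refresh` are hypothetical.

```python
import sqlite3

# Stand-in for an operational system (hypothetical schema).
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE orders (order_id INTEGER, product TEXT, "
    "amount REAL, ordered_at TEXT)"
)
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "widget", 10.0, "2024-01-01 09:15:00"),
        (2, "widget", 15.0, "2024-01-01 17:40:00"),
        (3, "gadget", 20.0, "2024-01-01 11:05:00"),
    ],
)

# Stand-in for the warehouse: data is reorganized by day and product.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE daily_sales (sale_date TEXT, product TEXT, "
    "total_amount REAL, order_count INTEGER)"
)


def refresh(source, warehouse, day):
    """Copy one day's orders into the warehouse in aggregated form."""
    rows = source.execute(
        "SELECT product, SUM(amount), COUNT(*) FROM orders "
        "WHERE date(ordered_at) = ? GROUP BY product ORDER BY product",
        (day,),
    ).fetchall()
    warehouse.executemany(
        "INSERT INTO daily_sales VALUES (?, ?, ?, ?)",
        [(day, product, total, count) for product, total, count in rows],
    )
    warehouse.commit()
    return len(rows)


loaded = refresh(source, warehouse, "2024-01-01")

# The source only had rows read from it, so it still holds every order.
remaining = source.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(loaded, remaining)
```

In practice a scheduler would call something like `refresh` once per cycle (for example nightly), and the restructuring step would be considerably richer than a single `GROUP BY`.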



# Why Build a Data Warehouse?

Organizations invest time, resources, and money into building a data warehouse for two main reasons:

1. **Data-Driven Decision Making**  
   - Decisions are based on reliable data, not just intuition or experience.  
   - Supports analysis of the past and present, as well as predictive analysis of the future.  
   - Supports exploratory analysis that surfaces patterns nobody knew to look for.  

2. **One-Stop Shopping**  
   - All data is consolidated into a single repository.  
   - Eliminates the need to gather scattered data from multiple operational systems.  
   - Analysts can focus on *analysis*, not on repeatedly collecting and integrating data.  



## Data Warehousing and Business Intelligence (BI)

- **Business Intelligence (BI)** and **Data Warehousing** emerged around the same time (~1990).  
- They reinforced each other:  
  - BI popularized data warehousing by providing tools to extract value from the stored data.  
  - Data warehouses fueled BI with integrated, historical data.  
- Together, they form the backbone of modern data-driven organizations.



# Data Warehouse vs. Data Lake

The terms **data warehouse** and **data lake** are often used interchangeably, but they serve different purposes.

### Data Warehouse
- Typically built on **relational databases** (e.g., Oracle, SQL Server, IBM Db2).  
- Stores **structured data** (numbers, dates, strings).  
- Data is loaded in periodic batches.  
- Sometimes uses **multidimensional databases (cubes)** for OLAP analysis.  
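
To give a feel for what a cube offers, the toy sketch below precomputes a total for every combination of dimensions over a handful of structured rows. It is only an in-memory illustration of the idea, not how an OLAP engine is implemented; the sample data and the function `build_cube` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical structured fact rows: two dimensions and one measure.
sales = [
    {"region": "East", "product": "widget", "amount": 10},
    {"region": "East", "product": "gadget", "amount": 20},
    {"region": "West", "product": "widget", "amount": 5},
]
DIMENSIONS = ("region", "product")


def build_cube(facts, dimensions, measure):
    """Sum the measure for every subset of dimensions.

    Keys are tuples of (dimension, value) pairs; the empty tuple
    holds the grand total.
    """
    cube = defaultdict(int)
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            for fact in facts:
                key = tuple((d, fact[d]) for d in dims)
                cube[key] += fact[measure]
    return dict(cube)


cube = build_cube(sales, DIMENSIONS, "amount")

print(cube[()])                       # grand total
print(cube[(("region", "East"),)])    # total for one region
print(cube[(("region", "East"), ("product", "widget"))])
```

Because every aggregate is computed ahead of time, a lookup such as "total for the East region" becomes a dictionary access rather than a scan of the fact rows.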

### Data Lake
- Built on **big data environments** (e.g., Hadoop, cloud-native storage), often with processing engines such as Spark.  
- Designed to handle the **3Vs of Big Data**:
  - **Volume** – Massive amounts of data.  
  - **Velocity** – Rapid ingestion of new and updated data.  
  - **Variety** – Structured, semi-structured (JSON, XML), and unstructured data (free text, audio, video).  
- More flexible than traditional warehouses for handling modern data types.
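
As a small illustration of the variety point, the sketch below keeps semi-structured JSON records in their raw form and applies structure only when they are read. The sample events and the function `read_purchases` are hypothetical, and a real lake would read such records from files or object storage rather than from an in-memory list.

```python
import json

# Raw events stored as-is; records do not share a fixed set of fields.
raw_events = [
    '{"user": "a", "type": "click", "page": "/home"}',
    '{"user": "b", "type": "purchase", "amount": 19.5, "items": ["widget"]}',
    '{"user": "a", "type": "purchase", "amount": 5.0}',
]


def read_purchases(lines):
    """Parse raw lines and yield only purchases in a uniform shape."""
    for line in lines:
        event = json.loads(line)
        if event.get("type") == "purchase":
            yield {
                "user": event["user"],
                "amount": float(event.get("amount", 0.0)),
            }


purchases = list(read_purchases(raw_events))
total = sum(p["amount"] for p in purchases)
print(purchases, total)
```

A relational warehouse would usually require these records to fit a table definition before loading; here the extra `page` and `items` fields are simply ignored by this particular reader and remain available to others.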



## Complementary Use

- Data lakes are sometimes viewed as the *next generation* of data warehousing.  
- However, most organizations still maintain **robust warehouses** alongside **data lakes**.  
- SQL remains the backbone:  
  - Supports BI on both warehouses and lakes.  
  - Allows unified analysis across environments.  



## Unified Goal

Both **data warehouses** and **data lakes** ultimately exist to:  
- Enable **business intelligence**  
- Expand analytical capabilities  
- Drive **data-driven decision making**
