# **Outline**

- [**1. Data Warehouse**](#1.-Data-Warehouse)
- [**Homework**](#-Homework)

# **1. Data Warehouse**

Before introducing the data warehouse, let's first look at OLAP (Online Analytical Processing) and OLTP (Online Transaction Processing). These are two distinct types of data processing systems used in data warehousing and database management, each designed to support different workloads and user requirements. The descriptions and the table below summarize the differences between them:

**OLTP**

OLTP systems are designed to manage transaction-oriented applications. They are optimized for handling a large number of short, atomic (indivisible) transactions. These systems are commonly used in day-to-day operations of businesses, such as sales transactions, banking, etc.

**OLAP**

OLAP systems are designed for complex queries and analyses, rather than transaction processing. They support decision-making and strategic planning by facilitating the manipulation and analysis of large volumes of data from different perspectives.


|                    | OLTP                                            | OLAP                                              |
|--------------------|-------------------------------------------------|---------------------------------------------------|
| **Purpose**        | Control and run essential business operations in real time | Plan, solve problems, support decisions, discover hidden insights |
| **Data updates**   | Short, fast updates initiated by user           | Data periodically refreshed with scheduled, long-running batch jobs |
| **Database design**| Normalized databases for efficiency             | Denormalized databases for analysis               |
| **Space requirements** | Generally small if historical data is archived | Generally large due to aggregating large datasets |
| **Backup and recovery** | Regular backups required to ensure business continuity and meet legal and governance requirements | Lost data can be reloaded from OLTP database as needed in lieu of regular backups |
| **Productivity**   | Increases productivity of end users                        | Increases productivity of business managers, data analysts, and executives |
| **Data view**      | Lists day-to-day business transactions                     | Multi-dimensional view of enterprise data                  |
| **User examples**  | Customer-facing personnel, clerks, online shoppers         | Knowledge workers such as data analysts, business analysts, and executives |
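
To make the contrast concrete, here is a minimal BigQuery SQL sketch. The `orders` data and its column names are hypothetical and defined inline, so the query does not depend on any existing table. It shows an OLAP-style aggregation over many rows, with the OLTP-style single-row lookup noted in a comment for comparison.

```sql
-- Hypothetical sample data: the `orders` rows and columns are illustrative only.
WITH orders AS (
  SELECT * FROM UNNEST([
    STRUCT(1 AS order_id, 'alice' AS customer, DATE '2024-01-05' AS order_date, 20.0 AS amount),
    STRUCT(2, 'bob',   DATE '2024-01-17', 35.5),
    STRUCT(3, 'alice', DATE '2024-02-02', 12.0),
    STRUCT(4, 'carol', DATE '2024-02-20', 50.0)
  ])
)
-- OLAP-style access: scan many rows and aggregate them by a dimension (month).
-- An OLTP-style access would instead touch a single row, for example:
--   SELECT * FROM orders WHERE order_id = 3;
SELECT
  DATE_TRUNC(order_date, MONTH) AS order_month,
  COUNT(*) AS order_count,
  SUM(amount) AS total_amount
FROM orders
GROUP BY order_month
ORDER BY order_month;
```

The aggregation returns one row per month rather than one row per order, which is the typical shape of an analytical result.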

A data warehouse is a system used for reporting and data analysis, and is considered a core component of business intelligence. Data warehouses are central repositories of integrated data from one or more disparate sources. They store current and historical data and are used to build trend reports for senior management, such as annual and quarterly comparisons.


A data warehouse is closely associated with OLAP: it is designed to support OLAP workloads, which rely on complex queries and analyses over integrated data from multiple sources. Data within a data warehouse is structured specifically for query and analysis, often using denormalized schemas that optimize for read operations and analytical queries rather than transactional speed.

Consider the following diagram to understand the concept of Data Warehouse:

<center>
<img src="figures/wherehouse-diagram.png" alt="drawing"/>
</center>


- **Data Source**: Data sources are the systems that provide data to the data warehouse. These can be operational (transactional) systems, relational databases, flat files (text files, CSVs, etc.), or other sources of data.

- **Staging Area**: A temporary storage space where data from different sources is collected before being processed and loaded into the data warehouse.

- **Data Warehouse**: The central component of the diagram is the data warehouse itself, which is typically a relational database designed for query and analysis rather than transaction processing. Inside the warehouse, the diagram shows three types of data:

    - Metadata: This is data about data. It contains information about the structure, formatting, and relationships within the warehouse.

    - Summary Data: Aggregated or calculated data, often precomputed to speed up common queries.

    - Raw Data: The detailed data transferred from the staging area into the warehouse without any aggregation or summarization.

- **Data Marts**: Subsets of the data warehouse, usually oriented to a specific business line or team. They contain only the warehouse data relevant to a particular group or business function. The diagram lists three examples: purchasing, sales, and inventory.

- **Users**: Finally, the diagram shows the end users of the data warehouse and data marts.
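
The relationship between raw data, summary data, and a data mart can be sketched in BigQuery SQL as follows. This is only an illustration: the `raw_data` rows, the column names, and the choice of a "sales" mart are assumptions made for the example, and a real warehouse would persist these layers as tables or views rather than as CTEs in one query.

```sql
-- Hypothetical raw data as it might arrive from the staging area.
WITH raw_data AS (
  SELECT * FROM UNNEST([
    STRUCT('sales' AS business_line, DATE '2024-03-01' AS event_date, 120.0 AS amount),
    STRUCT('sales',      DATE '2024-03-01',  80.0),
    STRUCT('purchasing', DATE '2024-03-01', 300.0),
    STRUCT('inventory',  DATE '2024-03-02',  45.0),
    STRUCT('sales',      DATE '2024-03-02',  60.0)
  ])
),
-- Summary data: daily totals per business line, precomputed from the raw rows.
summary_data AS (
  SELECT
    business_line,
    event_date,
    SUM(amount) AS total_amount,
    COUNT(*) AS row_count
  FROM raw_data
  GROUP BY business_line, event_date
)
-- Data mart: only the slice of the summary that the sales team needs.
SELECT event_date, total_amount, row_count
FROM summary_data
WHERE business_line = 'sales'
ORDER BY event_date;
```

Each layer narrows or condenses the one before it: the summary reduces detail, and the mart restricts the data to one business function.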

# **Homework**

<b><u>Important Note:</u></b> <p> For this homework we will be using the 2022 Green Taxi Trip Record Parquet Files from the New York
City Taxi Data found here: <br> https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page <br>
If you are using an orchestrator such as Mage, Airflow, or Prefect, do not load the data into BigQuery using the orchestrator.<br>
Stop after loading the files into a bucket. <br><br>
<u>NOTE:</u> You will need to use the PARQUET option when creating an External Table.<br>

<b>SETUP:</b><br>
Create an external table using the Green Taxi Trip Records Data for 2022. <br>
Create a table in BQ using the Green Taxi Trip Records for 2022 (do not partition or cluster this table). <br>
</p>
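
The external table used in the queries below can be created along these lines. This is a sketch under assumptions: the bucket name and folder in the `uris` option are placeholders that must be replaced with the location where you uploaded the files, and the wildcard assumes the files keep the TLC naming pattern `green_tripdata_2022-MM.parquet`.

```sql
-- Assumption: the Parquet files were uploaded to gs://<your-bucket>/green_taxi_2022/.
-- Replace the placeholder bucket and path with your own before running.
CREATE OR REPLACE EXTERNAL TABLE `de-bootcamp-414215.taxi_data.green_taxi_external_2022`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://<your-bucket>/green_taxi_2022/green_tripdata_2022-*.parquet']
);
```

Because the files are Parquet, BigQuery reads the schema from the files themselves, so no column list is needed in the statement.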


## Question 1:
What is the count of records for the 2022 Green Taxi Data?
- 65,623,481
- **840,402**
- 1,936,423
- 253,647

To get the count of records for the 2022 Green Taxi Data, we can use the following query:


```sql
    SELECT COUNT(*) AS total_records
    FROM `de-bootcamp-414215.taxi_data.green_taxi_external_2022`;
```
The result is the following:


<center>
<img src="figures/question-1.png" alt="drawing"/>
</center>



## Question 2:
Write a query to count the distinct number of PULocationIDs for the entire dataset on both tables.<br>
What is the estimated amount of data that will be read when this query is executed on the External Table and the Table?

- **0 MB for the External Table and 6.41MB for the Materialized Table**
- 18.82 MB for the External Table and 47.60 MB for the Materialized Table
- 0 MB for the External Table and 0MB for the Materialized Table
- 2.14 MB for the External Table and 0MB for the Materialized Table


Create a materialized table from the external one using:

```sql
    CREATE OR REPLACE TABLE `de-bootcamp-414215.taxi_data.materialized_green_taxi_2022`
    AS
    SELECT *
    FROM `de-bootcamp-414215.taxi_data.green_taxi_external_2022`;
```

Then, we can count the distinct PULocationIDs over the entire dataset, first on the external table:

```sql
    SELECT COUNT(DISTINCT PULocationID) AS distinct_PULocationIDs
    FROM `de-bootcamp-414215.taxi_data.green_taxi_external_2022`;
```

Then, on the materialized table:

```sql
    SELECT COUNT(DISTINCT PULocationID) AS distinct_PULocationIDs
    FROM `de-bootcamp-414215.taxi_data.materialized_green_taxi_2022`;
```
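The result should be the following. The estimate for the external table is 0 MB because BigQuery cannot determine how much data it will read from external storage until the query actually runs: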

<center>
<img src="figures/question-2.png" alt="drawing"/>
</center>


## Question 3:
How many records have a fare_amount of 0?
- 12,488
- 128,219
- 112
- **1,622**

To get the number of records that have a fare_amount of 0, we can use the following query:

```sql
    SELECT COUNT(*) AS records_with_zero_fare
    FROM `de-bootcamp-414215.taxi_data.green_taxi_external_2022`
    WHERE fare_amount = 0;
```

The result should be the following:

<center>
<img src="figures/question-3.png" alt="drawing"/>
</center>

## Question 4:
What is the best strategy to make an optimized table in BigQuery if your query will always order the results by PUlocationID and filter based on lpep_pickup_datetime? (Create a new table with this strategy)
- Cluster on lpep_pickup_datetime Partition by PUlocationID
- **Partition by lpep_pickup_datetime Cluster on PUlocationID**
- Partition by lpep_pickup_datetime and Partition by PUlocationID
- Cluster on lpep_pickup_datetime and Cluster on PUlocationID

To create a partitioned table and then cluster it, we can use the following query:

```sql
    CREATE TABLE `de-bootcamp-414215.taxi_data.green_taxi_partitioned_2022`
    PARTITION BY DATE(lpep_pickup_datetime)
    CLUSTER BY PUlocationID
    AS
    SELECT *
    FROM `de-bootcamp-414215.taxi_data.green_taxi_external_2022`;
```

The result should be the following:


<center>
<img src="figures/question-4.png" alt="drawing"/>
</center>
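
To check that the table was actually split into daily partitions, one option is to query the dataset's `INFORMATION_SCHEMA.PARTITIONS` view. The sketch below assumes the partitioned table was created in the `taxi_data` dataset under the name used above.

```sql
-- List the partitions of the new table with their row counts.
-- For daily partitioning, partition_id has the form YYYYMMDD.
SELECT partition_id, total_rows
FROM `de-bootcamp-414215.taxi_data.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'green_taxi_partitioned_2022'
ORDER BY partition_id;
```

Each row of the output corresponds to one pickup date, which is the unit BigQuery can skip when a query filters on `lpep_pickup_datetime`.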


## Question 5:
Write a query to retrieve the distinct PULocationID between lpep_pickup_datetime
06/01/2022 and 06/30/2022 (inclusive)<br>

Use the materialized table you created earlier in your from clause and note the estimated bytes. Now change the table in the from clause to the partitioned table you created for question 4 and note the estimated bytes processed. What are these values? <br>

Choose the answer which most closely matches.<br>

- 22.82 MB for non-partitioned table and 647.87 MB for the partitioned table
- **12.82 MB for non-partitioned table and 1.12 MB for the partitioned table**
- 5.63 MB for non-partitioned table and 0 MB for the partitioned table
- 10.31 MB for non-partitioned table and 10.31 MB for the partitioned table

Partition and cluster the materialized table as in Question 4:

```sql
CREATE TABLE `de-bootcamp-414215.taxi_data.materialized_green_taxi_partitioned_2022`
PARTITION BY DATE(lpep_pickup_datetime)
CLUSTER BY PUlocationID
AS
SELECT *
FROM `de-bootcamp-414215.taxi_data.materialized_green_taxi_2022`;
```

Then, we can count the distinct PULocationIDs with lpep_pickup_datetime between 06/01/2022 and 06/30/2022 (inclusive), first on the non-partitioned materialized table:

```sql
    -- Half-open range: BETWEEN ... AND '2022-06-30' would stop at midnight
    -- and drop the trips picked up during June 30.
    SELECT COUNT(DISTINCT PULocationID) AS distinct_PULocationIDs
    FROM `de-bootcamp-414215.taxi_data.materialized_green_taxi_2022`
    WHERE lpep_pickup_datetime >= '2022-06-01'
      AND lpep_pickup_datetime < '2022-07-01';
```
and then on the partitioned table:

```sql
    -- Same filter; here it lets BigQuery scan only the June partitions.
    SELECT COUNT(DISTINCT PULocationID) AS distinct_PULocationIDs
    FROM `de-bootcamp-414215.taxi_data.materialized_green_taxi_partitioned_2022`
    WHERE lpep_pickup_datetime >= '2022-06-01'
      AND lpep_pickup_datetime < '2022-07-01';
```
The result should be the following:

<center>
<img src="figures/question-5.png" alt="drawing"/>
</center>


## Question 6: 
Where is the data stored in the External Table you created?

- BigQuery
- **GCP Bucket**
- Big Table
- Container Registry


## Question 7:
It is best practice in BigQuery to always cluster your data:
- True
- **False**