GitHub - mohaas33/data_warehouse

Question 1. Counting records

SELECT COUNT(*) as total_records
FROM
  `de-course-485418.taxi_data.yellow_tripdata_2024_non_partitioned`;

What is count of records for the 2024 Yellow Taxi Data?

65,623
840,402
20,332,093
85,431,289

Question 2. Data read estimation

Write a query to count the distinct number of PULocationIDs for the entire dataset on both the tables.

-- Count the distinct number of PULocationIDs for the entire dataset
SELECT COUNT(DISTINCT PULocationID) AS distinct_pu_locations FROM de-course-485418.taxi_data.yellow_tripdata_2024_non_partitioned

SELECT COUNT(DISTINCT PULocationID) AS distinct_pu_locations FROM de-course-485418.taxi_data.yellow_tripdata_2024

What is the estimated amount of data that will be read when this query is executed on the External Table and the Table?

18.82 MB for the External Table and 47.60 MB for the Materialized Table
0 MB for the External Table and 155.12 MB for the Materialized Table
2.14 GB for the External Table and 0MB for the Materialized Table
0 MB for the External Table and 0MB for the Materialized Table

Question 3. Understanding columnar storage

Write a query to retrieve the PULocationID from the table (not the external table) in BigQuery. Now write a query to retrieve the PULocationID and DOLocationID on the same table.

Why are the estimated number of Bytes different?

BigQuery is a columnar database, and it only scans the specific columns requested in the query. Querying two columns (PULocationID, DOLocationID) requires reading more data than querying one column (PULocationID), leading to a higher estimated number of bytes processed.
BigQuery duplicates data across multiple storage partitions, so selecting two columns instead of one requires scanning the table twice, doubling the estimated bytes processed.
BigQuery automatically caches the first queried column, so adding a second column increases processing time but does not affect the estimated bytes scanned.
When selecting multiple columns, BigQuery performs an implicit join operation between them, increasing the estimated bytes processed

Question 4. Counting zero fare trips

How many records have a fare_amount of 0?

128,210
546,578
20,188,016
8,333

Question 5. Partitioning and clustering

What is the best strategy to make an optimized table in Big Query if your query will always filter based on tpep_dropoff_datetime and order the results by VendorID (Create a new table with this strategy)

-- Creating a partition and cluster table
CREATE OR REPLACE TABLE de-course-485418.taxi_data.yellow_tripdata_2024_partitioned_clustered
PARTITION BY DATE(tpep_pickup_datetime)
CLUSTER BY VendorID AS
SELECT * FROM de-course-485418.taxi_data.yellow_tripdata_2024;

Partition by tpep_dropoff_datetime and Cluster on VendorID
Cluster on by tpep_dropoff_datetime and Cluster on VendorID
Cluster on tpep_dropoff_datetime Partition by VendorID
Partition by tpep_dropoff_datetime and Partition by VendorID

Question 6. Partition benefits

Write a query to retrieve the distinct VendorIDs between tpep_dropoff_datetime 2024-03-01 and 2024-03-15 (inclusive)

Use the materialized table you created earlier in your from clause and note the estimated bytes. Now change the table in the from clause to the partitioned table you created for question 5 and note the estimated bytes processed. What are these values?

Choose the answer which most closely matches.

12.47 MB for non-partitioned table and 326.42 MB for the partitioned table
310.24 MB for non-partitioned table and 26.84 MB for the partitioned table (28.83 MB for me)
5.87 MB for non-partitioned table and 0 MB for the partitioned table
310.31 MB for non-partitioned table and 285.64 MB for the partitioned table

Question 7. External table storage

Where is the data stored in the External Table you created?

Big Query
Container Registry
GCP Bucket
Big Table

Question 8. Clustering best practices

It is best practice in Big Query to always cluster your data:

True
False

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
README.md		README.md
load_yellow_taxi_data.py		load_yellow_taxi_data.py
yellow_taxi_sql_test.sql		yellow_taxi_sql_test.sql

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Question 1. Counting records

Question 2. Data read estimation

Question 3. Understanding columnar storage

Question 4. Counting zero fare trips

Question 5. Partitioning and clustering

Question 6. Partition benefits

Question 7. External table storage

Question 8. Clustering best practices

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Question 1. Counting records

Question 2. Data read estimation

Question 3. Understanding columnar storage

Question 4. Counting zero fare trips

Question 5. Partitioning and clustering

Question 6. Partition benefits

Question 7. External table storage

Question 8. Clustering best practices

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages