BigQuery data warehouse for yellow taxi trip data analysis.
CREATE OR REPLACE EXTERNAL TABLE `curious-helix-484819-a6.taxi_dataset.yellow_taxi_trips_external`
OPTIONS (
format = 'PARQUET',
uris = ['gs://taxi-data-bucket-484819-a6/*.parquet']
);CREATE OR REPLACE TABLE `curious-helix-484819-a6.taxi_dataset.yellow_taxi_trips`
AS
SELECT *
FROM `curious-helix-484819-a6.taxi_dataset.yellow_taxi_trips_external`;20,332,093 total records in yellow taxi trips dataset.
Data read estimate for distinct pickup locations query:
- External table: 0 MB (stored in GCP bucket)
- Materialized table: 155.12 MB
External table query:
SELECT COUNT(DISTINCT PULocationID) as distinct_pickup_locations
FROM `curious-helix-484819-a6.taxi_dataset.yellow_taxi_trips_external`;Materialized table query:
SELECT COUNT(DISTINCT PULocationID) as distinct_pickup_locations
FROM `curious-helix-484819-a6.taxi_dataset.yellow_taxi_trips`;BigQuery is a columnar database that only scans the specific columns requested in the query. Querying two columns requires reading more data than querying one column.
One column query (155 MB):
SELECT PULocationID
FROM `curious-helix-484819-a6.taxi_dataset.yellow_taxi_trips`;Two column query (310 MB):
SELECT PULocationID, DOLocationID
FROM `curious-helix-484819-a6.taxi_dataset.yellow_taxi_trips`;The WHERE clause forces BigQuery to read specific columns regardless of the SELECT clause. This query returned 8,333 records where fare_amount equals zero.
SELECT count(*)
FROM `curious-helix-484819-a6.taxi_dataset.yellow_taxi_trips`
WHERE fare_amount = 0;For queries that filter by tpep_dropoff_datetime and order by VendorID, the optimal approach is to partition by date and cluster by VendorID:
CREATE OR REPLACE TABLE `curious-helix-484819-a6.taxi_dataset.yellow_taxi_trips_optimized`
PARTITION BY DATE(tpep_dropoff_datetime)
CLUSTER BY VendorID
AS
SELECT *
FROM `curious-helix-484819-a6.taxi_dataset.yellow_taxi_trips`;Without partitioning/clustering (310.24 MB):
SELECT DISTINCT VendorID
FROM `curious-helix-484819-a6.taxi_dataset.yellow_taxi_trips`
WHERE tpep_dropoff_datetime >= '2024-03-01'
AND tpep_dropoff_datetime < '2024-03-16'
ORDER BY VendorID;With partitioning/clustering (26.84 MB):
SELECT DISTINCT VendorID
FROM `curious-helix-484819-a6.taxi_dataset.yellow_taxi_trips_optimized`
WHERE tpep_dropoff_datetime >= '2024-03-01'
AND tpep_dropoff_datetime < '2024-03-16'
ORDER BY VendorID;Result: ~91% reduction in data scanned with proper optimization.
External table data is stored in the GCP bucket, which is why BigQuery shows zero estimate bytes. The storage costs are incurred at the bucket level, not BigQuery.
Clustering is NOT always best practice. It helps when:
- You frequently filter or aggregate on specific columns
- Queries use specific column patterns repeatedly
- Working with high-cardinality columns
- Table size exceeds 1 GB
SELECT COUNT(*) often shows 0 MB processed because BigQuery can answer it using table metadata, without scanning the actual data.
SELECT count(*) from `curious-helix-484819-a6.taxi_dataset.yellow_taxi_trips`;