data-warehouse

BigQuery data warehouse for yellow taxi trip data analysis.

Setup and Data Structures

Create external table from yellow taxi data

CREATE OR REPLACE EXTERNAL TABLE `curious-helix-484819-a6.taxi_dataset.yellow_taxi_trips_external`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://taxi-data-bucket-484819-a6/*.parquet']
);

Create regular table in BQ

CREATE OR REPLACE TABLE `curious-helix-484819-a6.taxi_dataset.yellow_taxi_trips`
AS
SELECT *
FROM `curious-helix-484819-a6.taxi_dataset.yellow_taxi_trips_external`;

Key Findings and Analysis

1. Total Record Count

20,332,093 total records in yellow taxi trips dataset.

2. External Table vs Materialized Table - Query Performance

Data read estimate for distinct pickup locations query:

External table: 0 MB (stored in GCP bucket)
Materialized table: 155.12 MB

External table query:

SELECT COUNT(DISTINCT PULocationID) as distinct_pickup_locations
FROM `curious-helix-484819-a6.taxi_dataset.yellow_taxi_trips_external`;

Materialized table query:

SELECT COUNT(DISTINCT PULocationID) as distinct_pickup_locations
FROM `curious-helix-484819-a6.taxi_dataset.yellow_taxi_trips`;

3. Columnar Database - Column Selection Impact

BigQuery is a columnar database that only scans the specific columns requested in the query. Querying two columns requires reading more data than querying one column.

One column query (155 MB):

SELECT PULocationID
FROM `curious-helix-484819-a6.taxi_dataset.yellow_taxi_trips`;

Two column query (310 MB):

SELECT PULocationID, DOLocationID
FROM `curious-helix-484819-a6.taxi_dataset.yellow_taxi_trips`;

4. WHERE Clause Column Scanning

The WHERE clause forces BigQuery to read specific columns regardless of the SELECT clause. This query returned 8,333 records where fare_amount equals zero.

SELECT count(*)
FROM `curious-helix-484819-a6.taxi_dataset.yellow_taxi_trips`
WHERE fare_amount = 0;

5. Partitioning and Clustering Strategy

For queries that filter by tpep_dropoff_datetime and order by VendorID, the optimal approach is to partition by date and cluster by VendorID:

CREATE OR REPLACE TABLE `curious-helix-484819-a6.taxi_dataset.yellow_taxi_trips_optimized`
PARTITION BY DATE(tpep_dropoff_datetime)
CLUSTER BY VendorID
AS
SELECT * 
FROM `curious-helix-484819-a6.taxi_dataset.yellow_taxi_trips`;

6. Performance Comparison - Optimized vs Non-Optimized

Without partitioning/clustering (310.24 MB):

SELECT DISTINCT VendorID
FROM `curious-helix-484819-a6.taxi_dataset.yellow_taxi_trips`
WHERE tpep_dropoff_datetime >= '2024-03-01' 
  AND tpep_dropoff_datetime < '2024-03-16'
ORDER BY VendorID;

With partitioning/clustering (26.84 MB):

SELECT DISTINCT VendorID
FROM `curious-helix-484819-a6.taxi_dataset.yellow_taxi_trips_optimized`
WHERE tpep_dropoff_datetime >= '2024-03-01' 
  AND tpep_dropoff_datetime < '2024-03-16'
ORDER BY VendorID;

Result: ~91% reduction in data scanned with proper optimization.

7. External Table Storage

External table data is stored in the GCP bucket, which is why BigQuery shows zero estimate bytes. The storage costs are incurred at the bucket level, not BigQuery.

8. When to Use Clustering

Clustering is NOT always best practice. It helps when:

You frequently filter or aggregate on specific columns
Queries use specific column patterns repeatedly
Working with high-cardinality columns
Table size exceeds 1 GB

9. COUNT(*) Query Optimization

SELECT COUNT(*) often shows 0 MB processed because BigQuery can answer it using table metadata, without scanning the actual data.

SELECT count(*) from `curious-helix-484819-a6.taxi_dataset.yellow_taxi_trips`;

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
.gitignore		.gitignore
.python-version		.python-version
LICENSE		LICENSE
README.md		README.md
load_yellow_taxi_data.py		load_yellow_taxi_data.py
main.py		main.py
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

data-warehouse

Setup and Data Structures

Create external table from yellow taxi data

Create regular table in BQ

Key Findings and Analysis

1. Total Record Count

2. External Table vs Materialized Table - Query Performance

3. Columnar Database - Column Selection Impact

4. WHERE Clause Column Scanning

5. Partitioning and Clustering Strategy

6. Performance Comparison - Optimized vs Non-Optimized

7. External Table Storage

8. When to Use Clustering

9. COUNT(*) Query Optimization

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

data-warehouse

Setup and Data Structures

Create external table from yellow taxi data

Create regular table in BQ

Key Findings and Analysis

1. Total Record Count

2. External Table vs Materialized Table - Query Performance

3. Columnar Database - Column Selection Impact

4. WHERE Clause Column Scanning

5. Partitioning and Clustering Strategy

6. Performance Comparison - Optimized vs Non-Optimized

7. External Table Storage

8. When to Use Clustering

9. COUNT(*) Query Optimization

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages