A comprehensive Claude Code skill for working with Google BigQuery from the command line using the bq tool.
This skill provides complete guidance for using the BigQuery CLI (bq) effectively, including:
- Query Operations - Execute SQL, estimate costs with dry-run, parameterized queries (see the example after this list)
- Data Loading - Load from CSV/JSON/Avro/Parquet with schema management
- Data Export - Export tables to GCS with compression and sharding
- Resource Management - Create/list/delete datasets, tables, views
- Cost Optimization - Always dry-run first, partitioning strategies, clustering
- Schema Design - Inline and JSON formats, types, nested records
- Best Practices - Partitioning, clustering, cost controls, performance tips
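The query operations above include parameterized queries, which are not shown in the examples further down; here is a minimal sketch (dataset, column, and parameter names are illustrative):

# Parameterized query (requires standard SQL, hence --use_legacy_sql=false)
bq query --use_legacy_sql=false \
  --parameter='min_date:DATE:2024-01-01' \
  'SELECT user_id, COUNT(*) AS events
   FROM dataset.events
   WHERE event_date >= @min_date
   GROUP BY user_id'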
Copy the skill to your Claude skills directory:
mkdir -p ~/.claude/skills/bigquery-cli
cp SKILL.md ~/.claude/skills/bigquery-cli/

The skill will be automatically available in your next Claude Code session.
You need the BigQuery CLI (bq) tool installed as part of Google Cloud SDK:
# Install Google Cloud SDK
# https://cloud.google.com/sdk/docs/install
# Verify installation
bq version
# Authenticate
gcloud auth login
# Set default project
gcloud config set project YOUR_PROJECT_ID

Complete command reference with flags for:
- Query operations (query, dry-run, parameters)
- Data loading (CSV, JSON, Avro, Parquet)
- Data export (formats, compression, sharding)
- Resource management (list, show, create, delete)
- Job management (cancel, show, filter)
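The resource and job management entries are not demonstrated elsewhere in this README; a few representative commands (dataset, table, and job IDs are placeholders):

# Resource management
bq ls                                        # list datasets in the default project
bq ls dataset                                # list tables in a dataset
bq show --format=prettyjson dataset.table    # table schema and metadata
bq mk --dataset dataset                      # create a dataset
bq rm -t dataset.table                       # delete a table

# Job management
bq ls -j --max_results=10                    # recent jobs
bq show -j JOB_ID                            # details for one job
bq cancel JOB_ID                             # cancel a running job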
# ALWAYS estimate cost first
bq query --dry_run 'SELECT ...'
# Review bytes to be processed
# Calculate: (bytes / 1TB) × $6.25 (see worked example below)
# Run query if acceptable
bq query 'SELECT ...'
# OR add safety limit
bq query --maximum_bytes_billed=10000000000000 'SELECT ...'
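As a worked example of the formula above, the byte count reported by the dry run can be turned into a dollar estimate. This is only a sketch: it assumes the default "will process N bytes of data" message format and mirrors the per-TB formula quoted above.

# Hypothetical helper: parse the dry-run message and estimate cost
SQL='SELECT user_id, COUNT(*) FROM dataset.events GROUP BY user_id'
BYTES=$(bq query --dry_run --use_legacy_sql=false "$SQL" 2>&1 | grep -oE '[0-9]+ bytes' | grep -oE '[0-9]+')
# (bytes / 1TB) × $6.25, e.g. 2,500,000,000,000 bytes ≈ 2.5 TB ≈ $15.6
echo "Estimated cost: \$$(echo "scale=2; $BYTES / 1000000000000 * 6.25" | bc)"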
# Create partitioned table for scale
bq mk --table \
--time_partitioning_type=DAY \
--time_partitioning_field=event_date \
--clustering_fields=user_id,region \
--require_partition_filter=true \
ds.events \
schema.json

- Cost estimation → Query execution
- Data loading pipeline with validation
- Large table export with sharding
- Partitioned table creation for billions of rows
- Always use --dry_run for queries scanning >1TB
- Partition large tables by date/timestamp
- Cluster by common filter columns
- Use wildcards for exports larger than the 1GB per-file limit
- Compress exports with GZIP or SNAPPY
- Check authentication before running commands
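A quick pre-flight check for the last point, with no side effects beyond listing:

gcloud auth list                  # confirm the active account
gcloud config get-value project   # confirm the default project
bq ls --max_results=5             # smoke test: list a few datasets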
| Mistake | Why It's Wrong | Correct Approach |
|---|---|---|
| No dry-run for large queries | Unexpected costs | Always run bq query --dry_run first |
| SELECT * on huge tables | Scans all columns | Select only needed columns (dry-run comparison below) |
| Single file for large exports | 1GB limit per file | Use wildcard: gs://bucket/file_*.json.gz |
| No partitioning on large tables | Expensive full scans | Use --time_partitioning_type=DAY |
| Loading without --skip_leading_rows | Header becomes data | Use --skip_leading_rows=1 for CSVs |
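To see the SELECT * penalty from the table above, dry-run both forms and compare the reported bytes (table name is illustrative):

bq query --dry_run --use_legacy_sql=false 'SELECT user_id, event_type FROM dataset.events'
bq query --dry_run --use_legacy_sql=false 'SELECT * FROM dataset.events'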
# Estimate cost before running
bq query --dry_run 'SELECT user_id, COUNT(*) FROM dataset.events GROUP BY user_id'
# Run with safety limit
bq query --maximum_bytes_billed=5000000000000 'SELECT ...'
# Autodetect schema (fast)
bq load --autodetect --skip_leading_rows=1 \
dataset.table \
gs://bucket/data.csv
# Explicit schema (production)
bq load --skip_leading_rows=1 \
dataset.table \
gs://bucket/data.csv \
user_id:STRING,event_type:STRING,timestamp:TIMESTAMP,data:STRING
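Avro and Parquet loads are mentioned above but only CSV is shown; a minimal sketch (paths are illustrative, and these formats carry their own schema):

# Load columnar/self-describing formats
bq load --source_format=PARQUET dataset.table gs://bucket/data.parquet
bq load --source_format=AVRO dataset.table gs://bucket/data.avro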
# Export with compression and sharding
bq extract \
--compression=GZIP \
--destination_format=NEWLINE_DELIMITED_JSON \
dataset.large_table \
'gs://bucket/export_*.json.gz'
# Partition + cluster for scale
bq mk --table \
--time_partitioning_type=DAY \
--time_partitioning_field=event_date \
--clustering_fields=user_id \
dataset.events \
event_id:STRING,user_id:STRING,event_date:DATE,data:JSON
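To benefit from the partitioning and clustering above, filter on event_date (and ideally user_id) so only the relevant partitions are scanned; dry-run first to confirm the reduced byte count:

bq query --dry_run --use_legacy_sql=false \
  'SELECT event_id, data
   FROM dataset.events
   WHERE event_date = DATE "2024-01-01"
     AND user_id = "user_123"'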
This skill was created following Test-Driven Development (TDD) principles for documentation:
- Tested agents WITHOUT the skill
- Identified baseline knowledge: strong command syntax
- Documented gaps: completeness, best practices, common mistakes
- Wrote comprehensive reference skill
- Added quick reference tables
- Included best practices and workflows
- Created common mistakes section
- Verified skill improves completeness
- No loopholes found (reference skill)
- Ready for deployment
See the /docs directory for detailed testing documentation.
Check authentication and project setup before running commands.
Always dry-run queries first to estimate costs before execution.
Comprehensive command reference with all flags and options.
Strategies for optimizing large tables (billions of rows).
Complete guidance for CSV, JSON, Avro, Parquet with compression.
Red flags table to catch issues before they happen.
Cost optimization, performance tuning, schema design tips.
Cost savings:
- Dry-run prevents accidental multi-thousand dollar queries
- Partitioning reduces scan costs by 10-100x
- Clustering adds 20-40% additional savings
Performance:
- Partitioning + clustering: queries 10-100x faster
- Proper schema: faster loads and queries
- Columnar formats: 5-10x faster loads than CSV
Contributions are welcome! If you find issues or have suggestions:
- Test the change following TDD principles
- Document any new baseline failures
- Update the skill to address them
- Verify improvements
- Submit a pull request
MIT License - feel free to use and modify as needed.
Created by Jakob He
- Initial release
- Complete command reference
- Cost estimation workflow
- Partitioning and clustering guide
- Best practices and common mistakes
- Schema format documentation
- Real-world examples