# Query bucket logs using BigQuery

This notebook walks you through how to use BigQuery to analyze our bucket usage logs.

We are using BigQuery for several reasons:
1. **Saves us time:** Loading logs from a bucket into BigQuery is _fast_, way faster than trying to `gcloud storage rsync` the files to a VM or to your laptop.
2. **Existing tooling:** Google has existing tooling and data analysis products built around BigQuery that we can use if we want to.

In [None]:
%%bash
tablename=data_cache_logs.natcap-data-cache-analysis-$(date +'%Y-%m-%d')
# If the table does not exist yet, `bq show` fails and jq prints nothing,
# so treat an empty or missing row count as 0.
rowcount=$(bq show --format=prettyjson "$tablename" 2>/dev/null | jq '(.numRows // "0") | tonumber' 2>/dev/null)
rowcount=${rowcount:-0}
if [[ "$rowcount" != "0" ]]
then
    echo "This table already has $rowcount rows in it; not loading anything further."
    echo "If you would like to reload this table, please delete it and re-run this cell."
else
    bq load --skip_leading_rows=1 "$tablename" \
          "gs://natcap-data-access-logs/natcap-data-cache/natcap-data-cache_usage_*" \
          ./cloud_storage_usage_schema_v0.json
fi
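
The row-count guard in the cell above can also be expressed in Python, which may be handy if you want to reuse it outside a `%%bash` cell. This is a minimal sketch, not part of the notebook itself: `should_load` is a hypothetical helper, and it assumes that `bq show --format=prettyjson` reports `numRows` as a string (as the `jq` filter above implies) and either omits it or prints non-JSON text when nothing has been loaded yet.

```python
import json


def should_load(bq_show_output: str) -> bool:
    """Return True if the table looks empty or missing, so loading is safe.

    `bq_show_output` is the text printed by `bq show --format=prettyjson`.
    """
    try:
        table_info = json.loads(bq_show_output)
    except json.JSONDecodeError:
        # Not JSON, for example an error message about a missing table.
        return True
    if not isinstance(table_info, dict):
        return True
    # Assumption: numRows is a string such as "1234"; a missing field means 0.
    return int(table_info.get("numRows", "0")) == 0


print(should_load('{"numRows": "0"}'))      # True
print(should_load('{"numRows": "1234"}'))   # False
print(should_load("BigQuery error in show operation: Not found"))  # True
```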

In [None]:
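import os
from datetime import date

from google.cloud import bigquery

# A %%bash cell cannot export variables to this kernel, so fall back to the
# same date-based table name that the load cell builds.
tablename = os.environ.get(
    'tablename',
    f"data_cache_logs.natcap-data-cache-analysis-{date.today():%Y-%m-%d}")
print(tablename)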

def _query_to_dataframe(query):
    client = bigquery.Client()
    return client.query_and_wait(query).to_dataframe()


In [None]:
# Top 25 downloads by data volume
query = f"""
SELECT cs_object, SUM(sc_bytes)/1e9 AS gigabytes, COUNT(cs_object) AS http_requests
  FROM `sdss-natcap-gef-ckan.{tablename}`
  WHERE sc_status >= 200 AND sc_status < 300
  GROUP BY cs_object
  ORDER BY gigabytes DESC
  LIMIT 25
"""
_query_to_dataframe(query)
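
If you find yourself changing the `LIMIT` or reusing this query for another table, the query text can be built by a small function. The sketch below is illustrative only: `top_downloads_query` and its parameters are hypothetical names, and the default project ID is simply copied from the query above.

```python
def top_downloads_query(tablename: str, limit: int = 25,
                        project: str = "sdss-natcap-gef-ckan") -> str:
    """Build the top-downloads-by-volume query for `project.tablename`."""
    # Coerce to int so that only a number can end up in the LIMIT clause.
    limit = int(limit)
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    return f"""
SELECT cs_object, SUM(sc_bytes)/1e9 AS gigabytes, COUNT(cs_object) AS http_requests
  FROM `{project}.{tablename}`
  WHERE sc_status >= 200 AND sc_status < 300
  GROUP BY cs_object
  ORDER BY gigabytes DESC
  LIMIT {limit}
"""


# Example table name; change it to your own table.
query = top_downloads_query(
    "data_cache_logs.natcap-data-cache-analysis-2025-06-04", limit=10)
print(query)
```

The returned string can be passed to `_query_to_dataframe` in the same way as the hand-written query.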

In [None]:
# Total egress for all data in the table
query = f"""
SELECT SUM(sc_bytes)/1e12 AS terabytes
  FROM `sdss-natcap-gef-ckan.{tablename}`
  WHERE sc_status >= 200 AND sc_status < 300
"""
_query_to_dataframe(query)

# After you're done

Once you've finished working with the BigQuery data, be sure to delete the table you created!
BigQuery doesn't have an easy way to delete rows, and tables are generally append-only, so deleting the whole table is the simplest way to clean up.

Example (but change the table name to your table name):

```shell
bq rm data_cache_logs.natcap-data-cache-analysis-2025-06-04
```
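
The same cleanup can be done from Python. This is a minimal sketch, assuming the `google-cloud-bigquery` client used earlier in this notebook: `delete_analysis_table` is a hypothetical helper, and the default project ID is the one used in the queries above. It relies on `Client.delete_table` with `not_found_ok=True`, so calling it on a table that is already gone should not raise an error.

```python
def delete_analysis_table(client, tablename: str,
                          project: str = "sdss-natcap-gef-ckan") -> str:
    """Delete `project.tablename` and return the full table ID that was used.

    `client` is expected to be a `google.cloud.bigquery.Client`.
    """
    table_id = f"{project}.{tablename}"
    # not_found_ok=True makes this a no-op if the table is already gone.
    client.delete_table(table_id, not_found_ok=True)
    return table_id
```

For example, `delete_analysis_table(bigquery.Client(), "data_cache_logs.natcap-data-cache-analysis-2025-06-04")`, with the table name changed to your own.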
