# Query bucket logs using BigQuery

This notebook walks you through using BigQuery to analyze our bucket usage logs.

We are using BigQuery for a couple of reasons:
1. **Saves us time:** Loading logs from a bucket into BigQuery is _fast_, way faster than trying to `gcloud storage rsync` the files to a VM or to your laptop.
2. **Existing tooling:** Google has existing tooling and data analysis products built around BigQuery that we can use if we want to.

NOTE: If you have not yet logged in to Google Cloud using the `gcloud` command, be sure to run the following before running the other cells:

```bash
gcloud auth login --project=sdss-natcap-gef-ckan
```

In [None]:
import os
import json

from datetime import date
from google.cloud import bigquery
from google.cloud.exceptions import NotFound

client = bigquery.Client()
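# isoformat() yields YYYY-MM-DD and avoids nesting double quotes inside the
# f-string, which is a syntax error before Python 3.12.
tablename = f"sdss-natcap-gef-ckan.data_cache_logs.natcap-data-cache-analysis-{date.today().isoformat()}"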
print(f"Considering the table {tablename}")
# Load the schema up front: it is needed both to create the table and to
# configure the load job below, even when the table already exists.
with open('cloud_storage_usage_schema_v0.json') as schema_file:
    schema = json.load(schema_file)

try:
    table = client.get_table(tablename)
except (NotFound, ValueError):
    table = bigquery.Table(tablename, schema=schema)
    table = client.create_table(table)  # this actually makes the API request
    print(f"Created table {tablename}")

if table.num_rows > 0:
    print("Table already has some stuff in it.")
    print(f"If this is incorrect, please delete the table {tablename} and re-run this cell.")
else:
    print("Loading table from GCS")
    job_config = bigquery.LoadJobConfig(
        schema=schema,
        skip_leading_rows=1,
        source_format=bigquery.SourceFormat.CSV
    )

    load_job = client.load_table_from_uri(
        "gs://natcap-data-access-logs/natcap-data-cache/natcap-data-cache_usage_*",
        tablename, job_config=job_config
    )
    load_job.result()  # block until the load job finishes; raises if it failed
    print("Finished loading from usage CSVs, reloading table")
    table = client.get_table(tablename)  # refresh metadata such as num_rows
    print(f"Table finished loading with {table.num_rows} rows")

In [None]:
# A convenience function to query BigQuery and convert the output to a pandas dataframe.

def _query_to_dataframe(query):
    return client.query_and_wait(query).to_dataframe()


In [None]:
# Top 25 downloads by data volume.
# `tablename` already includes the project ID, so it is used as-is.
query = f"""
SELECT cs_object, SUM(sc_bytes)/1e9 AS gigabytes, COUNT(cs_object) AS http_requests
  FROM `{tablename}`
  WHERE sc_status >= 200 AND sc_status < 300
  GROUP BY cs_object
  ORDER BY gigabytes DESC
  LIMIT 25
"""
_query_to_dataframe(query)

In [None]:
# Total egress for all data in the table
query = f"""
SELECT SUM(sc_bytes)/1e12 AS terabytes
  FROM `{tablename}`
  WHERE sc_status >= 200 AND sc_status < 300
"""
_query_to_dataframe(query)
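
The same pattern extends to other aggregations. As a minimal sketch, the helper below builds a query for successful egress per day. It assumes the table has a `time_micros` column holding the request time in microseconds since the epoch, as in Google's published usage log schema; check your schema file before relying on it. The name `daily_egress_query` and the example table ID are hypothetical.

```python
def daily_egress_query(table_id: str) -> str:
    """Build a query that sums successful egress per calendar day (UTC)."""
    # ASSUMPTION: `time_micros` exists and is an integer count of
    # microseconds since the Unix epoch.
    return f"""
SELECT DATE(TIMESTAMP_MICROS(time_micros)) AS day,
       SUM(sc_bytes)/1e9 AS gigabytes
  FROM `{table_id}`
  WHERE sc_status >= 200 AND sc_status < 300
  GROUP BY day
  ORDER BY day
"""


# Made-up table ID for illustration only.
example_query = daily_egress_query("my-project.my_dataset.my-table")
print(example_query)
```

In the notebook you would pass `tablename` instead of the made-up ID and hand the result to `_query_to_dataframe()`.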

# After you're done

Once you've finished working with the BigQuery data, be sure to delete the table you created!
BigQuery tables are generally treated as append-only, and removing rows is more awkward than dropping the table, so deleting the table is the simplest way to clean up.

Example (substitute your own table name):

```shell
bq rm data_cache_logs.natcap-data-cache-analysis-2025-06-04
```
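
If you would rather clean up from within the notebook, the Python client can delete the table as well. The following is a minimal sketch, assuming the `client` and `tablename` variables from the cells above; `delete_analysis_table` is a hypothetical helper name, not part of the BigQuery library.

```python
def delete_analysis_table(client, table_id: str) -> None:
    """Delete `table_id`, doing nothing if the table is already gone."""
    # not_found_ok=True suppresses the NotFound error for a missing table.
    client.delete_table(table_id, not_found_ok=True)
    print(f"Deleted table {table_id} (if it existed)")
```

Calling `delete_analysis_table(client, tablename)` in a new cell should remove the table created earlier in this notebook.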
