# Image Metadata (EXIF) Extraction with BigFrames

This notebook demonstrates how to extract EXIF metadata from images stored in Google Cloud Storage using BigQuery DataFrames (BigFrames) user-defined functions (UDFs).

## Setup

Set your project ID and location in the cell below. Unless you change them, the notebook uses the default BigFrames connection and a sample dataset name.

In [1]:
import bigframes.pandas as bpd
import bigframes.bigquery as bbq

# @title Configuration
PROJECT_ID = "bigframes-dev" # @param {type:"string"}
LOCATION = "us" # @param {type:"string"}

# Dataset where the UDF will be created.
DATASET_ID = "bigframes_samples" # @param {type:"string"}

# A BigQuery connection is required for the UDF to access Google Cloud Storage.
# "bigframes-default-connection" is the default connection created by BigFrames.
CONNECTION_ID = "bigframes-default-connection" # @param {type:"string"}

# Construct the canonical connection ID
FULL_CONNECTION_ID = f"{PROJECT_ID}.{LOCATION}.{CONNECTION_ID}"

# Initialize BigFrames
bpd.options.bigquery.project = PROJECT_ID
bpd.options.bigquery.location = LOCATION

## Define the EXIF Extraction UDF

We will define a BigQuery Python UDF that takes a BigQuery `ObjectRef` runtime JSON string, downloads the image with `requests`, and extracts EXIF data using the `Pillow` library.
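Before the full UDF, the first step (pulling the signed read URL out of the runtime JSON string) can be sketched with the standard library alone. This is a minimal illustration, not the actual payload format: only the `access_urls` and `read_url` keys come from the UDF in this notebook, while the sample URL and the helper name `read_url_from_runtime_json` are hypothetical.

```python
import json

# Hypothetical payload: only the "access_urls" and "read_url" keys are taken
# from the UDF in this notebook; the URL itself is a placeholder.
sample_runtime_json = json.dumps(
    {"access_urls": {"read_url": "https://example.com/signed/image.jpg"}}
)


def read_url_from_runtime_json(runtime_json: str) -> str:
    """Return the signed read URL, raising ValueError if it is missing."""
    payload = json.loads(runtime_json)
    try:
        return payload["access_urls"]["read_url"]
    except (KeyError, TypeError) as exc:
        raise ValueError("runtime JSON has no access_urls.read_url") from exc


url = read_url_from_runtime_json(sample_runtime_json)
print(url)
```

Raising a clear `ValueError` here is one possible design choice; it makes a malformed input row easier to diagnose than a bare `KeyError` from inside the UDF.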


In [2]:
@bpd.udf(
    input_types=[str],
    output_type=str,
    dataset=DATASET_ID,
    name="extract_exif",
    bigquery_connection=FULL_CONNECTION_ID,
    packages=["pillow", "requests"],
    max_batching_rows=8192,
    container_cpu=0.33,
    container_memory="512Mi"
)
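def extract_exif(src_obj_ref_rt: str) -> str:
    # Imports live inside the function so they resolve in the UDF runtime.
    import io
    import json
    from PIL import ExifTags, Image
    import requests
    from requests import adapters
    # Retry transient HTTPS failures up to three times.
    session = requests.Session()
    session.mount("https://", adapters.HTTPAdapter(max_retries=3))
    # The runtime JSON carries a signed URL for reading the object.
    src_obj_ref_rt_json = json.loads(src_obj_ref_rt)
    src_url = src_obj_ref_rt_json["access_urls"]["read_url"]
    response = session.get(src_url, timeout=30)
    # Fail early on an HTTP error instead of handing an error page to Pillow.
    response.raise_for_status()
    bts = response.content
    image = Image.open(io.BytesIO(bts))
    exif_data = image.getexif()
    exif_dict = {}
    if exif_data:
        for tag, value in exif_data.items():
            # Map numeric tag IDs to readable names; unknown IDs stay numeric.
            tag_name = ExifTags.TAGS.get(tag, tag)
            exif_dict[tag_name] = value
    # Images without EXIF data produce an empty JSON object.
    return json.dumps(exif_dict)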
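One caveat about the final `json.dumps` call: Pillow can return EXIF values such as raw bytes or rational numbers, which the `json` module does not encode natively. The sketch below shows one possible fallback, `default=str`, using only the standard library. `KNOWN_TAGS` is a hypothetical stand-in for `ExifTags.TAGS`, `Fraction` stands in for Pillow's rational type, and tag ID 50000 is an arbitrary example of an ID missing from the lookup table.

```python
import json
from fractions import Fraction

# Hypothetical stand-in for a few entries of PIL.ExifTags.TAGS
# (271 and 282 are the standard EXIF tag IDs for Make and XResolution).
KNOWN_TAGS = {271: "Make", 282: "XResolution"}


def exif_to_json(raw_exif: dict) -> str:
    """Name the tags, then serialize with a string fallback for odd values."""
    named = {KNOWN_TAGS.get(tag, tag): value for tag, value in raw_exif.items()}
    # default=str is called only for values json cannot encode natively.
    return json.dumps(named, default=str)


# Sample input: a string, a rational, and a bytes value under an unknown tag.
raw = {271: "ExampleCam", 282: Fraction(72, 1), 50000: b"\x01\x02"}
encoded = exif_to_json(raw)
print(encoded)
```

Note that tag IDs with no readable name remain integers in the dictionary, and `json.dumps` writes them as string keys (here `"50000"`), so downstream code should look them up as strings.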


## Extract EXIF from Images

Now we can apply this function to a BigFrames Series of runtime JSON strings generated from the image URIs.

In [None]:
# Create a Multimodal DataFrame from the sample image URIs
exif_image_df = bpd.from_glob_path(
    "gs://bigframes_blob_test/images_exif/*",
    name="blob_col",
)

# Generate a JSON string containing the runtime information (including signed read URLs)
# This allows the UDF to download the images from Google Cloud Storage
access_urls = exif_image_df["blob_col"].blob.get_runtime_json_str(mode="R")

# Apply the BigQuery Python UDF to the runtime JSON strings
# We cast to string to ensure the input matches the UDF's signature
exif_json = access_urls.astype(str).apply(extract_exif)

# Parse the resulting JSON strings back into a structured JSON type for easier access
actual = bbq.parse_json(exif_json)

actual
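Once results are materialized locally, each row is a JSON object keyed by tag name. The sketch below shows one way to pick out a few tags from such a string with the standard library. The sample row and the helper `pick_tags` are hypothetical; real tag names and values depend on what each image actually contains.

```python
import json

# Hypothetical UDF output for a single image.
sample_row = '{"Make": "ExampleCam", "ExifOffset": 47, "50000": "vendor-specific"}'


def pick_tags(exif_json: str, wanted: list[str]) -> dict:
    """Return only the requested tags, with None for tags the image lacks."""
    exif = json.loads(exif_json)
    return {name: exif.get(name) for name in wanted}


selected = pick_tags(sample_row, ["Make", "Model"])
print(selected)
```

Because the UDF returns `{}` for images without EXIF data, the same helper yields `None` for every requested tag in that case rather than raising.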

