In [1]:
# Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# A Guide to Array and Struct Data Types in BigQuery DataFrames

# Set up your environment

To get started, follow the instructions in the notebooks in the `getting_started` folder to set up your environment. Once your environment is ready, import the necessary packages by running the following code:

In [2]:
import bigframes.pandas as bpd
import bigframes.bigquery as bbq
import pyarrow as pa

In [3]:
REGION = "US"  # @param {type: "string"}

bpd.options.display.progress_bar = None
bpd.options.bigquery.location = REGION

# Array Data Types

In BigQuery, an [array](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#array_type) (also called a repeated column) is an ordered list of zero or more elements of the same data type. Arrays cannot contain other arrays or `NULL` elements.

BigQuery DataFrames maps BigQuery array types to `pandas.ArrowDtype(pa.list_())`. The following code examples illustrate how to work with array columns in BigQuery DataFrames.

## Create DataFrames with array columns

Create a DataFrame in BigQuery DataFrames from local sample data. Use a list of lists to create a column with the `list<int64>[pyarrow]` dtype, which corresponds to the `ARRAY<INT64>` type in BigQuery.

In [4]:
df = bpd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Scores': [[95, 88, 92], [78, 81], [82, 89, 94, 100]],
})
df

Unnamed: 0,Name,Scores
0,Alice,[95 88 92]
1,Bob,[78 81]
2,Charlie,[ 82 89 94 100]


In [5]:
df.dtypes

Name                 string[pyarrow]
Scores    list<item: int64>[pyarrow]
dtype: object

## Operate on array data

While pandas offers vectorized operations and lambda expressions for array manipulation, BigQuery DataFrames runs array operations in BigQuery itself. You can access a variety of native BigQuery array operations, such as [`array_agg`](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.bigquery#bigframes_bigquery_array_agg) and [`array_length`](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.bigquery#bigframes_bigquery_array_length), through the [`bigframes.bigquery`](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.bigquery) package (abbreviated as `bbq` in the following code samples).
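For comparison, the local pandas approach with lambda expressions might look like the sketch below. It uses an in-memory, object-dtype Series of Python lists, so it is not BigQuery DataFrames code, and the variable names are illustrative.

```python
import pandas as pd

# Local, in-memory data; nothing here runs in BigQuery.
local_scores = pd.Series(
    [[95, 88, 92], [78, 81], [82, 89, 94, 100]], name="Scores"
)

# A Python-level function is applied to each row, one row at a time.
lengths = local_scores.apply(len)
second = local_scores.apply(lambda arr: arr[1])

print(lengths.tolist())
print(second.tolist())
```

The cells that follow compute the same two results with `bbq.array_length` and the list accessor.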

In [6]:
# Find the length of each array.
bbq.array_length(df['Scores'])

0    3
1    2
2    4
Name: Scores, dtype: Int64

In [7]:
# Find the length of each array with the list accessor.
df['Scores'].list.len()

0    3
1    2
2    4
Name: Scores, dtype: Int64

In [8]:
# Find the second element in each array with the list accessor.
df['Scores'].list[1]

0    88
1    81
2    89
Name: Scores, dtype: Int64

In [9]:
# Transform array elements into individual rows. In ordering mode, the original
# row order is preserved, and rows that come from the same array are ordered by
# each element's index within that array.
scores = df['Scores'].explode()
scores

0     95
0     88
0     92
1     78
1     81
2     82
2     89
2     94
2    100
Name: Scores, dtype: Int64

In [10]:
# Adjust the scores.
adj_scores = scores + 5.0
adj_scores

0    100.0
0     93.0
0     97.0
1     83.0
1     86.0
2     87.0
2     94.0
2     99.0
2    105.0
Name: Scores, dtype: Float64

In [11]:
# Aggregate adjusted scores back into arrays.
adj_scores_arr = bbq.array_agg(adj_scores.groupby(level=0))
adj_scores_arr

0         [100.  93.  97.]
1                [83. 86.]
2    [ 87.  94.  99. 105.]
Name: Scores, dtype: list<item: double>[pyarrow]

In [12]:
# Add the adjusted scores to the DataFrame. This operation requires an implicit
# join between the two tables, so the DataFrame must have a unique index
# (guaranteed in the default ordering and index mode).
df['NewScores'] = adj_scores_arr
df

Unnamed: 0,Name,Scores,NewScores
0,Alice,[95 88 92],[100. 93. 97.]
1,Bob,[78 81],[83. 86.]
2,Charlie,[ 82 89 94 100],[ 87. 94. 99. 105.]
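The explode, adjust, and re-aggregate round trip above can also be sketched in plain pandas, which may help when checking results on a small local sample. This is an illustration under the assumption that the data fits in memory; `groupby(level=0).agg(list)` stands in for `bbq.array_agg`, and it is not a BigQuery DataFrames API.

```python
import pandas as pd

local_scores = pd.Series(
    [[95, 88, 92], [78, 81], [82, 89, 94, 100]], name="Scores"
)

# One row per element; the index label of the source row is repeated.
exploded = local_scores.explode().astype("int64")

# Adding a float promotes the values to float64.
adjusted = exploded + 5.0

# Group on the repeated index labels and collect each group into a list.
regrouped = adjusted.groupby(level=0).agg(list)

print(regrouped.tolist())
```

Because the repeated index labels identify the source row, grouping on `level=0` restores one list per original row.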


# Struct Data Types

In BigQuery, a [struct](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#struct_type) (also known as a `record`) is a collection of ordered fields, each with a defined data type (required) and an optional field name. BigQuery DataFrames maps BigQuery struct types to the pandas equivalent, `pandas.ArrowDtype(pa.struct())`. This section provides practical code examples illustrating how to use struct columns with BigQuery DataFrames.

## Create DataFrames with struct columns

Create a DataFrame with an `Address` struct column by using dictionaries for the data and setting the dtype to `struct<City: string, State: string>[pyarrow]`.

In [13]:
names = bpd.Series(['Alice', 'Bob', 'Charlie'])
address = bpd.Series(
    [
        {'City': 'New York', 'State': 'NY'},
        {'City': 'San Francisco', 'State': 'CA'},
        {'City': 'Seattle', 'State': 'WA'}
    ],
    dtype=bpd.ArrowDtype(pa.struct(
         [('City', pa.string()), ('State', pa.string())]
    )))

df = bpd.DataFrame({'Name': names, 'Address': address})
df



Unnamed: 0,Name,Address
0,Alice,"{'City': 'New York', 'State': 'NY'}"
1,Bob,"{'City': 'San Francisco', 'State': 'CA'}"
2,Charlie,"{'City': 'Seattle', 'State': 'WA'}"


In [14]:
df.dtypes

Name                                    string[pyarrow]
Address    struct<City: string, State: string>[pyarrow]
dtype: object

## Operate on struct data

Similar to pandas, BigQuery DataFrames provides a [`StructAccessor`](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.operations.structs.StructAccessor). Use the methods provided in this accessor to manipulate struct data.

In [15]:
# Return the dtype object of each child field of the struct.
df['Address'].struct.dtypes()

City     string[pyarrow]
State    string[pyarrow]
dtype: object

In [16]:
# Extract a child field as a Series.
city = df['Address'].struct.field("City")
city

0         New York
1    San Francisco
2          Seattle
Name: City, dtype: string

In [17]:
# Extract all child fields of a struct as a DataFrame.
address_df = df['Address'].struct.explode()
address_df

Unnamed: 0,City,State
0,New York,NY
1,San Francisco,CA
2,Seattle,WA
