In [1]:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# A Guide to Array and Struct Data Types in BigQuery DataFrames

# Set up your environment

Refer to the notebooks in the `getting_started` folder for instructions on setting up your environment. Once your environment is ready, run the following code to import the packages used throughout this guide:

In [17]:
import bigframes.pandas as bpd
import bigframes.bigquery as bbq
import pyarrow as pa

In [18]:
REGION = "US"  # @param {type: "string"}
bpd.options.display.progress_bar = None
bpd.options.bigquery.location = REGION


# Array Data Types

In BigQuery, an [array](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#array_type), also referred to as a `repeated` column, is an ordered list of zero or more non-array elements. These elements must be of the same data type, and arrays cannot contain other arrays. Furthermore, query results cannot include arrays with `NULL` elements.

BigFrames DataFrames inherit these properties and map BigQuery array types to `pandas.ArrowDtype(pa.list_())`. This section provides code examples that show how to work with array columns in BigFrames DataFrames.

## Create DataFrames with array columns 

Let's create a sample BigFrames DataFrame where the `Scores` column holds array data of type `list<item: int64>[pyarrow]`:

In [4]:
df = bpd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Scores': [[95, 88, 92], [78, 81], [82, 89, 94, 100]],
})
df

Unnamed: 0,Name,Scores
0,Alice,[95 88 92]
1,Bob,[78 81]
2,Charlie,[ 82 89 94 100]


In [5]:
df.dtypes

Name                 string[pyarrow]
Scores    list<item: int64>[pyarrow]
dtype: object

## CRUD operations for array data

While pandas manipulates array data with vectorized operations and lambda expressions, BigFrames pushes the work down to BigQuery. The [`bigframes.bigquery`](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.bigquery) package exposes native BigQuery array operations such as [array_agg](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.bigquery#bigframes_bigquery_array_agg) and [array_length](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.bigquery#bigframes_bigquery_array_length), which you can use to create, read, update, and delete (CRUD) array data in your BigFrames DataFrames.

The following cells show how to use these functions to manipulate array data in BigFrames.

In [6]:
# Find the length in each array
bbq.array_length(df['Scores'])

0    3
1    2
2    4
Name: Scores, dtype: Int64

In [7]:
# Explode array elements into rows
scores = df['Scores'].explode()
scores

0     95
0     88
0     92
1     78
1     81
2     82
2     89
2     94
2    100
Name: Scores, dtype: Int64

In [8]:
# Adjust the scores: add 5 points, then rescale so the new maximum (105) maps to 100
adj_scores = (scores + 5) / 105.0 * 100.0
adj_scores

0    95.238095
0    88.571429
0    92.380952
1    79.047619
1    81.904762
2    82.857143
2     89.52381
2    94.285714
2        100.0
Name: Scores, dtype: Float64

In [9]:
# Aggregate adjusted scores back into arrays
adj_scores_arr = bbq.array_agg(adj_scores.groupby(level=0))
adj_scores_arr

0                [95.23809524 88.57142857 92.38095238]
1                            [79.04761905 81.9047619 ]
2    [ 82.85714286  89.52380952  94.28571429 100.  ...
Name: Scores, dtype: list<item: double>[pyarrow]

In [10]:
# Incorporate adjusted scores into the DataFrame
df['NewScores'] = adj_scores_arr
df

Unnamed: 0,Name,Scores,NewScores
0,Alice,[95 88 92],[95.23809524 88.57142857 92.38095238]
1,Bob,[78 81],[79.04761905 81.9047619 ]
2,Charlie,[ 82 89 94 100],[ 82.85714286 89.52380952 94.28571429 100. ...
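
For comparison, the same explode, adjust, and re-aggregate round trip can be sketched in plain pandas, which runs locally rather than in BigQuery. This is only an illustrative analogue, not how BigFrames executes the cells above: the `local_*` names are hypothetical, and `groupby(level=0).agg(list)` stands in for `bbq.array_agg`.

```python
import pandas as pd

local_df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Scores": [[95, 88, 92], [78, 81], [82, 89, 94, 100]],
})

# Explode: one row per array element; the original row label is repeated.
local_scores = local_df["Scores"].explode().astype("int64")

# Apply the same element-wise adjustment to the flattened values.
local_adj = (local_scores + 5) / 105.0 * 100.0

# Regroup by the repeated index label to rebuild one list per original row.
local_df["NewScores"] = local_adj.groupby(level=0).agg(list)
print(local_df)
```

The repeated index labels produced by `explode` are what make the regrouping possible, which is why the BigFrames version also groups by `level=0` before aggregating.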


# Struct Data Types

In BigQuery, a [struct](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#struct_type) (also known as a `record`) is a collection of ordered fields, each with a defined data type (required) and an optional field name. BigFrames maps BigQuery struct types to the pandas equivalent, `pandas.ArrowDtype(pa.struct())`. In this section, we'll explore practical code examples illustrating how to work with struct columns within your BigFrames DataFrames.

## Create DataFrames with struct columns 

Let's create a sample BigFrames DataFrame where the `Address` column holds struct data of type `struct<City: string, State: string>[pyarrow]`:

In [11]:
names = bpd.Series(['Alice', 'Bob', 'Charlie'])
address = bpd.Series(
    [
        {'City': 'New York', 'State': 'NY'},
        {'City': 'San Francisco', 'State': 'CA'},
        {'City': 'Seattle', 'State': 'WA'}
    ],
    dtype=bpd.ArrowDtype(pa.struct(
         [('City', pa.string()), ('State', pa.string())]
    )))

df = bpd.DataFrame({'Name': names, 'Address': address})
df



Unnamed: 0,Name,Address
0,Alice,"{'City': 'New York', 'State': 'NY'}"
1,Bob,"{'City': 'San Francisco', 'State': 'CA'}"
2,Charlie,"{'City': 'Seattle', 'State': 'WA'}"


In [12]:
df.dtypes

Name                                    string[pyarrow]
Address    struct<City: string, State: string>[pyarrow]
dtype: object

## CRUD operations for struct data

Similar to pandas, BigFrames provides a [`StructAccessor`](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.operations.structs.StructAccessor) to streamline the manipulation of struct data. The following cells use it to inspect a nested struct column and extract its fields.

In [13]:
# Return the dtype object of each child field of the struct.
df['Address'].struct.dtypes()

City     string[pyarrow]
State    string[pyarrow]
dtype: object

In [14]:
# Extract a child field as a Series
city = df['Address'].struct.field("City")
city

0         New York
1    San Francisco
2          Seattle
Name: City, dtype: string

In [15]:
# Extract all child fields of a struct as a DataFrame.
address_df = df['Address'].struct.explode()
address_df

Unnamed: 0,City,State
0,New York,NY
1,San Francisco,CA
2,Seattle,WA
