In [1]:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Get started with BigQuery DataFrames

<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/googleapis/python-bigquery-dataframes/blob/main/notebooks/getting_started/getting_started_bq_dataframes.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/googleapis/python-bigquery-dataframes/blob/main/notebooks/getting_started/getting_started_bq_dataframes.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/googleapis/python-bigquery-dataframes/blob/main/notebooks/getting_started/getting_started_bq_dataframes.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/bigquery/import?url=https://github.com/googleapis/python-bigquery-dataframes/blob/main/notebooks/getting_started/getting_started_bq_dataframes.ipynb">
      <img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTW1gvOovVlbZAIZylUtf5Iu8-693qS1w5NJw&s" alt="BQ logo" width="35">
      Open in BQ Studio
    </a>
  </td>
</table>

**_NOTE_**: This notebook has been tested in the following environment:

* Python version = 3.10

## Overview

Use this notebook to get started with BigQuery DataFrames, including setup, installation, and basic tutorials.

BigQuery DataFrames provides a Pythonic DataFrame and machine learning (ML) API powered by the BigQuery engine.

* `bigframes.pandas` provides a pandas-like API for analytics.
* `bigframes.ml` provides a scikit-learn-like API for ML.

Learn more about [BigQuery DataFrames](https://cloud.google.com/python/docs/reference/bigframes/latest).

### Objective

In this tutorial, you learn how to install BigQuery DataFrames, load data into a BigQuery DataFrames DataFrame, and inspect and manipulate the data using pandas and a custom Python function, running at BigQuery scale.

The steps include:

- Creating a BigQuery DataFrames DataFrame: Access data from a local CSV to create a BigQuery DataFrames DataFrame.
- Inspecting and manipulating data: Use pandas to perform data cleaning and preparation on the DataFrame.
- Deploying a custom function: Deploy a [remote function](https://cloud.google.com/bigquery/docs/remote-functions) that runs a scalar Python function at BigQuery scale.

### Dataset

This tutorial uses the [`penguins` table](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=ml_datasets&t=penguins) (a BigQuery public dataset), which contains data on a set of penguins including species, island of residence, weight, culmen length and depth, flipper length, and sex.

The same dataset is also stored in a public Cloud Storage bucket as a CSV file so that you can use it to try ingesting data from a local environment.

### Costs

This tutorial uses billable components of Google Cloud:

* BigQuery (storage and compute)
* Cloud Functions

Learn about [BigQuery storage pricing](https://cloud.google.com/bigquery/pricing#storage),
[BigQuery compute pricing](https://cloud.google.com/bigquery/pricing#analysis_pricing_models),
and [Cloud Functions pricing](https://cloud.google.com/functions/pricing),
and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

## Installation

Install the following package, which is required to run this notebook:

In [2]:
!pip install bigframes



### Colab only

Uncomment and run the following cell to restart the kernel:

In [3]:
# Automatically restart kernel after installs so that your environment can access the new packages
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

## Before you begin

Complete the tasks in this section to set up your environment.

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

3. [Click here](https://console.cloud.google.com/flows/enableapi?apiid=bigquery.googleapis.com,bigqueryconnection.googleapis.com,cloudfunctions.googleapis.com,run.googleapis.com,artifactregistry.googleapis.com,cloudbuild.googleapis.com,cloudresourcemanager.googleapis.com) to enable the following APIs:

  * BigQuery API
  * BigQuery Connection API
  * Cloud Functions API
  * Cloud Run API
  * Artifact Registry API
  * Cloud Build API
  * Cloud Resource Manager API

4. If you are running this notebook locally, install the [Cloud SDK](https://cloud.google.com/sdk).

#### Set your project ID

If you don't know your project ID, try the following:
* Run `gcloud config list`.
* Run `gcloud projects list`.
* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113).

In [4]:
PROJECT_ID = ""  # @param {type:"string"}

# Set the project id
! gcloud config set project {PROJECT_ID}

Updated property [core/project].





#### Set the region

You can also change the `REGION` variable used by BigQuery. Learn more about [BigQuery regions](https://cloud.google.com/bigquery/docs/locations#supported_locations).

In [5]:
REGION = "US"  # @param {type: "string"}
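
Because `PROJECT_ID` defaults to an empty string, later cells can fail in confusing ways if it is never filled in. The following is a minimal sketch of a guard, not part of BigQuery DataFrames itself. The helper name `resolve_setting` is hypothetical, and the fallback to a `GOOGLE_CLOUD_PROJECT` environment variable is an assumption about your environment.

```python
import os


def resolve_setting(name, value, env_var=None, environ=None):
    """Return a non-empty setting, optionally falling back to an environment variable."""
    environ = os.environ if environ is None else environ
    candidate = (value or "").strip()
    if not candidate and env_var:
        # Assumption: the environment may expose the value under env_var.
        candidate = environ.get(env_var, "").strip()
    if not candidate:
        raise ValueError(f"{name} is empty; set it before running the remaining cells.")
    return candidate


# Example with explicit values (hypothetical project ID, for illustration only).
project = resolve_setting(
    "PROJECT_ID",
    "",
    env_var="GOOGLE_CLOUD_PROJECT",
    environ={"GOOGLE_CLOUD_PROJECT": "my-example-project"},
)
region = resolve_setting("REGION", "US")
print(project, region)
```

In the notebook you would pass the real `PROJECT_ID` and `REGION` values and omit the `environ` argument so that the actual process environment is used.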

### Authenticate your Google Cloud account

Depending on your Jupyter environment, you might have to manually authenticate. Follow the relevant instructions below.

**Vertex AI Workbench**

No action is needed; you are already authenticated.

**Local JupyterLab instance**

Uncomment and run the following cell:

In [6]:
# ! gcloud auth login

**Colab**

Uncomment and run the following cell:

In [7]:
# from google.colab import auth
# auth.authenticate_user()
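
If you are unsure which of the cases above applies, a small check can help. The sketch below uses a common heuristic, testing whether the `google.colab` module is already loaded. It is an illustration under that assumption, not an official detection API, and the helper name `running_in_colab` is hypothetical.

```python
import sys


def running_in_colab(modules=None):
    """Return True when the `google.colab` module is already loaded."""
    modules = sys.modules if modules is None else modules
    return "google.colab" in modules


if running_in_colab():
    print("Colab detected: run the authenticate_user() cell above.")
else:
    print("Not Colab: use the Workbench or local instructions instead.")
```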

### Import libraries

In [8]:
import bigframes.pandas as bpd


### Set BigQuery DataFrames options

In [9]:
# Note: The project option is not required in all environments.
# On BigQuery Studio, the project ID is automatically detected.
bpd.options.bigquery.project = PROJECT_ID

# Note: The location option is not required.
# It defaults to the location of the first table or query
# passed to read_gbq(). For APIs where a location can't be
# auto-detected, the location defaults to the "US" location.
bpd.options.bigquery.location = REGION

If you want to reset the location of the created DataFrame or Series objects, reset the session by executing `bpd.close_session()`. After that, you can reuse `bpd.options.bigquery.location` to specify another location.
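
The reset described above can be sketched as a small helper. This is an illustrative wrapper around the two calls named in the previous paragraph; the function name `switch_location` and the `"EU"` value in the usage comment are assumptions chosen for the example.

```python
def switch_location(new_location):
    """Close the current BigQuery DataFrames session and select a new location."""
    # Imported lazily so this sketch can be defined without bigframes installed.
    import bigframes.pandas as bpd

    # End the current session so that the location option can be changed.
    bpd.close_session()
    # The next BigQuery DataFrames operation starts a session in this location.
    bpd.options.bigquery.location = new_location


# Usage (hypothetical target location):
# switch_location("EU")
```

DataFrame and Series objects created before the reset belong to the old session, so plan to re-create them after switching.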

## See the power of BigQuery DataFrames first-hand

BigQuery DataFrames enables you to interact with datasets of any size, so that you can explore, transform, and understand even your biggest datasets using familiar tools like pandas and scikit-learn.

For example, take the BigQuery sample table `bigquery-samples.wikipedia_pageviews.200809h`, which is ~60 GB in size. This is not a dataset you'd likely be able to process in pandas without extra infrastructure.

With BigQuery DataFrames, however, computation is handled by BigQuery's highly scalable compute engine, meaning you can focus on doing data science without hitting size limitations.

If you'd like to try creating a BigQuery DataFrames DataFrame from this table, uncomment and run the next cell to load the table using the `read_gbq` method.

> Note: Keep in mind that running these operations will count against your monthly [free tier allowance in BigQuery](https://cloud.google.com/bigquery/pricing#free-tier).

In [10]:
# bq_df_sample = bpd.read_gbq("bigquery-samples.wikipedia_pageviews.200809h")

No problem! BigQuery DataFrames creates a DataFrame, `bq_df_sample`, containing the entire source table.

Uncomment and run the following cell to see pandas in action over your new BigQuery DataFrames DataFrame.

This code uses regex to filter the DataFrame to include only rows with Wikipedia page titles containing the word "Google", sums the total views by page title, and then returns the top 100 results.

In [11]:
# bq_df_sample[bq_df_sample.title.str.contains(r"[Gg]oogle")]\
# .groupby(['title'], as_index=False)['views'].sum(numeric_only=True)\
# .sort_values('views', ascending=False)\
# .head(100)
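
To make the three steps in that chain concrete (filter by regex, sum views per title, keep the top results), here is a plain-Python sketch over a handful of rows. The rows are invented for illustration and the code runs locally, so it only mirrors the logic; it says nothing about how BigQuery executes the real query.

```python
import re
from collections import defaultdict

# Toy rows invented for illustration; the real table is far larger.
rows = [
    {"title": "Google", "views": 120},
    {"title": "Google_Maps", "views": 45},
    {"title": "Penguin", "views": 300},
    {"title": "Google", "views": 30},
    {"title": "History_of_google", "views": 5},
]

pattern = re.compile(r"[Gg]oogle")

# 1. Filter: keep titles that contain "Google" or "google" anywhere.
matching = [row for row in rows if pattern.search(row["title"])]

# 2. Group by title and sum the views.
totals = defaultdict(int)
for row in matching:
    totals[row["title"]] += row["views"]

# 3. Sort by total views, descending, and keep at most 100 results.
top = sorted(totals.items(), key=lambda item: item[1], reverse=True)[:100]
print(top)
```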

In addition to giving you access to pandas, BigQuery DataFrames also enables you to build ML models, run inference, and deploy and run your own Python functions at scale. You'll see examples throughout this and other notebooks in this GitHub repo.

Now you'll move to the smaller `penguins` dataset for the remainder of this getting started guide.

## Create a BigQuery DataFrames DataFrame

You can create a BigQuery DataFrames DataFrame by reading data from any of the following locations:

* A local data file
* Data stored in a BigQuery table
* A data file stored in Cloud Storage
* An in-memory pandas DataFrame

The following sections show how to use the first two options.
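
For completeness, the last option in the list (an in-memory pandas DataFrame) might look like the sketch below. It assumes that both `pandas` and `bigframes` are installed and that the session options are already set; the helper name `from_in_memory` and the two sample rows are illustrative only.

```python
def from_in_memory():
    """Build a BigQuery DataFrames DataFrame from a small pandas DataFrame."""
    # Imported lazily so the sketch can be defined without these packages installed.
    import pandas as pd
    import bigframes.pandas as bpd

    # Two illustrative rows using penguins-style columns.
    local_df = pd.DataFrame(
        {
            "species": [
                "Gentoo penguin (Pygoscelis papua)",
                "Adelie Penguin (Pygoscelis adeliae)",
            ],
            "body_mass_g": [5400, 3875],
        }
    )
    # read_pandas turns the local data into a BigQuery DataFrames DataFrame.
    return bpd.read_pandas(local_df)


# Usage, once the session options are configured:
# small_df = from_in_memory()
```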

### Create a DataFrame from a local file

Use the instructions in the following sections to create a BigQuery DataFrames DataFrame from a local file.


#### Get the CSV file

First, copy and paste the following link into a new browser window to download the CSV file of the penguin data to your local machine:

> http://storage.googleapis.com/cloud-samples-data/vertex-ai/bigframe/penguins.csv

Next, upload the local CSV file to your notebook environment, using the relevant instructions for your environment:

**Vertex AI Workbench or a local JupyterLab instance**

1. Follow these [directions](https://jupyterlab.readthedocs.io/en/latest/user/files.html#uploading-and-downloading) to upload the file from your machine to your notebook environment by using the UI.
2. In the next cell, set the variable `fn` to the path of your uploaded file (by default, the cell reads the same CSV directly from Cloud Storage), and then run the cell.

In [12]:
# BigQuery DataFrames can read directly from GCS.
fn = 'gs://cloud-samples-data/vertex-ai/bigframe/penguins.csv'

# Or from a local file.
# fn = 'penguins.csv'

**Colab**

Uncomment and run the following cell:

In [13]:
# from google.colab import files
# uploaded = files.upload()
# for fn in uploaded.keys():
#  print('User uploaded file "{name}" with length {length} bytes'.format(
#      name=fn, length=len(uploaded[fn])))

#### Create a DataFrame

Create a BigQuery DataFrames DataFrame from the uploaded CSV file:

In [14]:
# If order is not important, use the "bigquery" engine to
# allow BigQuery DataFrames to read directly from GCS.
df_from_local = bpd.read_csv(fn, engine="bigquery")

Take a look at the first few rows of the DataFrame:

In [15]:
df_from_local.head()

Unnamed: 0,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,Gentoo penguin (Pygoscelis papua),Biscoe,50.5,15.9,225,5400,MALE
1,Gentoo penguin (Pygoscelis papua),Biscoe,45.1,14.5,215,5000,FEMALE
2,Adelie Penguin (Pygoscelis adeliae),Torgersen,41.4,18.5,202,3875,MALE
3,Adelie Penguin (Pygoscelis adeliae),Torgersen,38.6,17.0,188,2900,FEMALE
4,Gentoo penguin (Pygoscelis papua),Biscoe,46.5,14.8,217,5200,FEMALE


### Ingest data from a DataFrame to a BigQuery table

BigQuery DataFrames lets you create a BigQuery table from a BigQuery DataFrames DataFrame on the fly.

First, create a BigQuery dataset to house the table. Choose a name for your dataset, or keep the suggestion of `birds`.

In [16]:
DATASET_ID = "birds"  # @param {type:"string"}

from google.cloud import bigquery
client = bigquery.Client(project=PROJECT_ID)
dataset = bigquery.Dataset(PROJECT_ID + "." + DATASET_ID)
dataset.location = REGION
dataset = client.create_dataset(dataset, exists_ok=True)
print(f"Dataset {dataset.dataset_id} created.")

Dataset birds created.


Next, use the `to_gbq` method to create a BigQuery table from the DataFrame:

In [17]:
df_from_local.to_gbq(
    PROJECT_ID + "." + DATASET_ID + ".penguins",
    if_exists="replace",
)

'swast-scratch.birds.penguins'

### Create a DataFrame from BigQuery data

You can create a BigQuery DataFrames DataFrame from a BigQuery table by using the `read_gbq` method and referencing either an entire table or a SQL query.

Create a BigQuery DataFrames DataFrame from the BigQuery table you created in the previous section, and view a few rows:

In [18]:
query_or_table = f"""{PROJECT_ID}.{DATASET_ID}.penguins"""
bq_df = bpd.read_gbq(query_or_table)
bq_df.head()

Unnamed: 0,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,Gentoo penguin (Pygoscelis papua),Biscoe,50.5,15.9,225,5400,MALE
1,Gentoo penguin (Pygoscelis papua),Biscoe,45.1,14.5,215,5000,FEMALE
2,Adelie Penguin (Pygoscelis adeliae),Torgersen,41.4,18.5,202,3875,MALE
3,Adelie Penguin (Pygoscelis adeliae),Torgersen,38.6,17.0,188,2900,FEMALE
4,Gentoo penguin (Pygoscelis papua),Biscoe,46.5,14.8,217,5200,FEMALE
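
The cell above passes a table ID to `read_gbq`. The other form mentioned earlier, a SQL query, could be sketched as follows. The helper names, the selected columns, and the 4000 g threshold are illustrative choices, and the project ID in the example is hypothetical.

```python
def penguins_query(project_id, dataset_id, min_body_mass_g):
    """Return a SQL query selecting heavier penguins from the penguins table."""
    table = f"`{project_id}.{dataset_id}.penguins`"
    return (
        "SELECT species, island, body_mass_g "
        f"FROM {table} "
        f"WHERE body_mass_g >= {int(min_body_mass_g)}"
    )


def read_heavy_penguins(project_id, dataset_id, min_body_mass_g=4000):
    """Run the query through read_gbq and return a BigQuery DataFrames DataFrame."""
    # Imported lazily so the query builder can be tried without bigframes installed.
    import bigframes.pandas as bpd

    return bpd.read_gbq(penguins_query(project_id, dataset_id, min_body_mass_g))


# Hypothetical identifiers, for illustration only.
print(penguins_query("my-example-project", "birds", 4000))
```

In the notebook you would call `read_heavy_penguins(PROJECT_ID, DATASET_ID)` after the `penguins` table has been created.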


## Inspect and manipulate data in BigQuery DataFrames

### Using pandas

You can use pandas as you normally would on the BigQuery DataFrames DataFrame, but calculations happen in the BigQuery query engine instead of your local environment. There are 150+ pandas functions supported in BigQuery DataFrames. You can view the list in [the documentation](https://cloud.google.com/python/docs/reference/bigframes/latest).

To see this in action, inspect one of the columns (or series) of the BigQuery DataFrames DataFrame:

In [19]:
bq_df["body_mass_g"].head(10)

0    5400
1    5000
2    3875
3    2900
4    5200
5    3725
6    2975
7    4150
8    5300
9    4150
Name: body_mass_g, dtype: Int64

Compute the mean of this series:

In [20]:
average_body_mass = bq_df["body_mass_g"].mean()
print(f"average_body_mass: {average_body_mass}")

average_body_mass: 4201.754385964917


Calculate the mean `body_mass_g` by `species` using the `groupby` operation:

In [21]:
bq_df[["species", "body_mass_g"]].groupby(by=bq_df["species"]).mean(numeric_only=True).head()

species,body_mass_g
Adelie Penguin (Pygoscelis adeliae),3700.662252
Chinstrap penguin (Pygoscelis antarctica),3733.088235
Gentoo penguin (Pygoscelis papua),5076.01626


You can confirm that the calculations were run in BigQuery by clicking "Open job" from the previous cells' output. This takes you to the BigQuery console to view the SQL statement and job details.
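
For intuition about what the grouped mean computes, the sketch below repeats the calculation in plain Python on only the five rows that `head()` displayed earlier. Because it covers five rows rather than the full table, its numbers differ from the BigQuery result above; it illustrates the logic only.

```python
from collections import defaultdict
from statistics import mean

# The five (species, body_mass_g) pairs shown by head() earlier in this notebook.
rows = [
    ("Gentoo penguin (Pygoscelis papua)", 5400),
    ("Gentoo penguin (Pygoscelis papua)", 5000),
    ("Adelie Penguin (Pygoscelis adeliae)", 3875),
    ("Adelie Penguin (Pygoscelis adeliae)", 2900),
    ("Gentoo penguin (Pygoscelis papua)", 5200),
]

# Group the body masses by species.
groups = defaultdict(list)
for species, body_mass_g in rows:
    groups[species].append(body_mass_g)

# Average each group.
mean_by_species = {species: mean(values) for species, values in groups.items()}
print(mean_by_species)
```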

### Using custom functions

Running your own Python functions (or being able to bring your packages) and using them at scale is a challenge many data scientists face. BigQuery DataFrames makes it easy to deploy [remote functions](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.pandas#bigframes_pandas_remote_function) that run scalar Python functions at BigQuery scale. These functions are persisted as [BigQuery remote functions](https://cloud.google.com/bigquery/docs/remote-functions) that you can then re-use.

Running the cell below creates a custom function using the `remote_function` method. This function categorizes a value into one of two buckets, >= 4000 or < 4000, and returns "NA" when the value is missing.

> Note: Creating a function requires a [BigQuery connection](https://cloud.google.com/bigquery/docs/remote-functions#create_a_remote_function). This code assumes a pre-created connection named `bigframes-default-connection`. If
the connection is not already created, BigQuery DataFrames attempts to create one assuming the [necessary APIs
and IAM permissions](https://cloud.google.com/python/docs/reference/bigframes/latest/bigframes.pandas#bigframes_pandas_remote_function) are set up in the project.

This cell takes a few minutes to run because it creates the BigQuery connection (if applicable) and deploys the Cloud Function.

In [22]:
@bpd.remote_function([float], str)
def get_bucket(num):
  # Falsy inputs (missing values, and also 0) are reported as "NA".
  if not num:
    return "NA"
  boundary = 4000
  return "at_or_above_4000" if num >= boundary else "below_4000"

The custom function is deployed as a Cloud Function, and is then integrated with BigQuery as a remote function.

Save both of the function names so that you can clean them up at the end of this notebook.

In [23]:
CLOUD_FUNCTION_NAME = format(get_bucket.bigframes_cloud_function)
print("Cloud Function Name " + CLOUD_FUNCTION_NAME)
REMOTE_FUNCTION_NAME = format(get_bucket.bigframes_remote_function)
print("Remote Function Name " + REMOTE_FUNCTION_NAME)

Cloud Function Name projects/swast-scratch/locations/us-central1/functions/bigframes-71a76285da23f28be467ed16826f7276
Remote Function Name swast-scratch._63cfa399614a54153cc386c27d6c0c6fdb249f9e.bigframes_71a76285da23f28be467ed16826f7276
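
Both names follow a recognizable pattern in the sample output above: the Cloud Function name looks like `projects/<project>/locations/<location>/functions/<function>`, and the remote function name looks like `<project>.<dataset>.<routine>`. The sketch below splits them into parts, which can be handy when cleaning up. The helper names are hypothetical, and the parsing assumes the formats seen in that output.

```python
def parse_cloud_function_name(resource_name):
    """Split 'projects/<p>/locations/<l>/functions/<f>' into its components."""
    parts = resource_name.split("/")
    if len(parts) != 6 or parts[0::2] != ["projects", "locations", "functions"]:
        raise ValueError(f"Unexpected Cloud Function name: {resource_name!r}")
    return {"project": parts[1], "location": parts[3], "function": parts[5]}


def parse_remote_function_name(routine_name):
    """Split '<project>.<dataset>.<routine>' into its components."""
    # Split from the right so that only the last two dots act as separators.
    project, dataset, routine = routine_name.rsplit(".", 2)
    return {"project": project, "dataset": dataset, "routine": routine}


# Names copied from the sample output above.
cf = parse_cloud_function_name(
    "projects/swast-scratch/locations/us-central1/functions/"
    "bigframes-71a76285da23f28be467ed16826f7276"
)
rf = parse_remote_function_name(
    "swast-scratch._63cfa399614a54153cc386c27d6c0c6fdb249f9e."
    "bigframes_71a76285da23f28be467ed16826f7276"
)
print(cf["location"], rf["dataset"])
```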


Apply the custom function to the BigQuery DataFrames DataFrame to bucketize the `body_mass_g` value of the penguins:

In [24]:
bq_df = bq_df.assign(body_mass_bucket=bq_df['body_mass_g'].apply(get_bucket))
bq_df[['body_mass_g', 'body_mass_bucket']].head(10)

Unnamed: 0,body_mass_g,body_mass_bucket
0,5400,at_or_above_4000
1,5000,at_or_above_4000
2,3875,below_4000
3,2900,below_4000
4,5200,at_or_above_4000
5,3725,below_4000
6,2975,below_4000
7,4150,at_or_above_4000
8,5300,at_or_above_4000
9,4150,at_or_above_4000
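
Because each deployment takes a few minutes, it can be worth checking the bucketing logic locally before decorating it. The sketch below is a plain-Python copy of the same logic without the `remote_function` decorator; the name `get_bucket_local` and the sample values are illustrative.

```python
def get_bucket_local(num):
    """Plain-Python copy of the bucketing logic, without the remote_function decorator."""
    # Mirrors the deployed function: None and 0 are both falsy and map to "NA".
    if not num:
        return "NA"
    boundary = 4000
    return "at_or_above_4000" if num >= boundary else "below_4000"


samples = [5400.0, 3875.0, 4000.0, None]
print([get_bucket_local(value) for value in samples])
```

Note that the boundary value itself (4000) lands in the `at_or_above_4000` bucket, and that a body mass of exactly 0 would be reported as "NA" rather than `below_4000`.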


## Summary and next steps

You've created BigQuery DataFrames DataFrames, and inspected and manipulated data with pandas and custom remote functions at BigQuery scale and speed.

Learn more about BigQuery DataFrames in the [documentation](https://cloud.google.com/python/docs/reference/bigframes/latest) and find more sample notebooks in the [GitHub repo](https://github.com/googleapis/python-bigquery-dataframes/tree/main/notebooks), including an introductory notebook for `bigframes.ml`.

### Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can uncomment the remaining cells and run them to delete the individual resources you created in this tutorial:

In [None]:
# Delete the temporary cloud artifacts created during the bigframes session 
bpd.close_session()

In [25]:
# # Delete the BigQuery dataset
# from google.cloud import bigquery
# client = bigquery.Client(project=PROJECT_ID)
# client.delete_dataset(
#  DATASET_ID, delete_contents=True, not_found_ok=True
# )
# print("Deleted dataset '{}'.".format(DATASET_ID))

In [26]:
# # Delete the BigQuery Connection
# from google.cloud import bigquery_connection_v1 as bq_connection
# client = bq_connection.ConnectionServiceClient()
# CONNECTION_ID = f"projects/{PROJECT_ID}/locations/{REGION}/connections/bigframes-default-connection"
# client.delete_connection(name=CONNECTION_ID)
# print("Deleted connection '{}'.".format(CONNECTION_ID))

In [27]:
# # Delete the Cloud Function
# ! gcloud functions delete {CLOUD_FUNCTION_NAME} --quiet

In [28]:
# # Delete the Remote Function
# REMOTE_FUNCTION_NAME = REMOTE_FUNCTION_NAME.replace(PROJECT_ID + ".", "")
# ! bq rm --routine --force=true {REMOTE_FUNCTION_NAME}