In [1]:
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# BigQuery DataFrames ML: Drug Name Generation

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/googleapis/python-bigquery-dataframes/blob/main/notebooks/generative_ai/bq_dataframes_ml_drug_name_generation.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/googleapis/python-bigquery-dataframes/blob/main/notebooks/generative_ai/bq_dataframes_ml_drug_name_generation.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/googleapis/python-bigquery-dataframes/blob/main/notebooks/generative_ai/bq_dataframes_ml_drug_name_generation.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>
  <td>
    <a href="https://pantheon.corp.google.com/bigquery?project=bigframes-dev&ws=!1m7!1m6!12m5!1m3!1sbigframes-dev!2sus-central1!3s4da57cb0-e53d-4bcb-bbe4-d0ad3982648e!2e2">
      <img src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTW1gvOovVlbZAIZylUtf5Iu8-693qS1w5NJw&s" alt="BQ logo" width="35">
      Open in BQ Studio
    </a>
  </td> 
</table>

**_NOTE_**: This notebook has been tested in the following environment:

* Python version = 3.9

## Overview

The goal of this notebook is to demonstrate an enterprise generative AI use case. A marketing user can provide information about a new pharmaceutical drug and its generic name, and receive ideas on marketing-oriented brand names for that drug.

Learn more about [BigQuery DataFrames](https://cloud.google.com/bigquery/docs/dataframes-quickstart).

### Objective

In this tutorial, you learn about generative AI concepts such as prompting and few-shot learning, and how to use BigFrames ML to perform these tasks with an intuitive DataFrame API.

The steps performed include:

1. Ask the user for the generic name and usage for the drug.
1. Use `bigframes` to query the FDA dataset of over 100,000 drugs, filtered on the brand name, generic name, and indications & usage columns.
1. Filter this dataset to find prototypical brand names that can be used as examples in a few-shot prompt.
1. Create a prompt with the user input, general instructions, examples and counter-examples for the desired brand name.
1. Use the `bigframes.ml.llm.GeminiTextGenerator` to generate choices of brand names.

### Dataset

This notebook uses the [FDA dataset](https://cloud.google.com/blog/topics/healthcare-life-sciences/fda-mystudies-comes-to-google-cloud) available at [`bigquery-public-data.fda_drug`](https://console.cloud.google.com/bigquery?ws=!1m4!1m3!3m2!1sbigquery-public-data!2sfda_drug).

### Costs

This tutorial uses billable components of Google Cloud:

* BigQuery (compute)
* BigQuery ML

Learn about [BigQuery compute pricing](https://cloud.google.com/bigquery/pricing#analysis_pricing_models),
and [BigQuery ML pricing](https://cloud.google.com/bigquery/pricing#bqml),
and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

## Installation

Install the following packages required to execute this notebook.

In [2]:
# !pip install -U --quiet bigframes

### Colab only: Uncomment the following cell to restart the kernel.

In [3]:
# # Automatically restart kernel after installs so that your environment can access the new packages
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

### Import libraries

In [4]:
import bigframes.pandas as bpd
from bigframes.ml.llm import GeminiTextGenerator
from IPython.display import Markdown

### Authenticate your Google Cloud account

Depending on your Jupyter environment, you may have to manually authenticate. Follow the relevant instructions below.

**1. Vertex AI Workbench**
* Do nothing as you are already authenticated.

**2. Local JupyterLab instance, uncomment and run:**

In [5]:
# ! gcloud auth login

**3. Colab, uncomment and run:**

In [6]:
# from google.colab import auth

# auth.authenticate_user()

## Before you begin

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

3. [Enable the BigQuery API](https://console.cloud.google.com/flows/enableapi?apiid=bigquery.googleapis.com).

4. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk).

#### Set your project ID

**If you don't know your project ID**, try the following:
* Run `gcloud config list`.
* Run `gcloud projects list`.
* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)

In [7]:
# Please fill in these values.
PROJECT_ID = ""  # @param {type:"string"}

# Set the project id
! gcloud config set project {PROJECT_ID}

ERROR: (gcloud.config.set) argument VALUE: Must be specified.
Usage: gcloud config set SECTION/PROPERTY VALUE [optional flags]
  optional flags may be  --help | --installation

For detailed information on this command and its flags, run:
  gcloud config set --help


#### BigFrames configuration

Next, we will specify a [BigQuery connection](https://cloud.google.com/bigquery/docs/working-with-connections). If you already have a connection, you can simply provide the name and skip the following creation steps.



In [8]:
# Please fill in these values.
LOCATION = "us"  # @param {type:"string"}

We will now try to use the provided connection, and if it doesn't exist, create a new one. We will also print the service account used.

### Initialize BigFrames client

Here, we set the project configuration based on the provided parameters.

In [9]:
# Note: The project option is not required in all environments.
# On BigQuery Studio, the project ID is automatically detected.
bpd.options.bigquery.project = PROJECT_ID

# Note: The location option is not required.
# It defaults to the location of the first table or query
# passed to read_gbq(). For APIs where a location can't be
# auto-detected, the location defaults to the "US" location.
bpd.options.bigquery.location = LOCATION

## Generate a name

Let's start by entering a generic name and description of the drug.

In [10]:
GENERIC_NAME = "Entropofloxacin"  # @param {type:"string"}
USAGE = "Entropofloxacin is a fluoroquinolone antibiotic that is used to treat a variety of bacterial infections, including: pneumonia, streptococcus infections, salmonella infections, escherichia coli infections, and pseudomonas aeruginosa infections It is taken by mouth or by injection. The dosage and frequency of administration will vary depending on the type of infection being treated. It should be taken for the full course of treatment, even if symptoms improve after a few days. Stopping the medication early may increase the risk of the infection coming back."  # @param {type:"string"}
NUM_NAMES = 10  # @param {type:"integer"}
TEMPERATURE = 0.5  # @param {type: "number"}

We can now create a prompt string, and populate it with the name and description.

In [11]:
zero_shot_prompt = f"""Provide {NUM_NAMES} unique and modern brand names in Markdown bullet point format. Do not provide any additional explanation.

Be creative with the brand names. Don't use English words directly; use variants or invented words.

The generic name is: {GENERIC_NAME}

The indications and usage are: {USAGE}"""

print(zero_shot_prompt)

Provide 10 unique and modern brand names in Markdown bullet point format. Do not provide any additional explanation.

Be creative with the brand names. Don't use English words directly; use variants or invented words.

The generic name is: Entropofloxacin

The indications and usage are: Entropofloxacin is a fluoroquinolone antibiotic that is used to treat a variety of bacterial infections, including: pneumonia, streptococcus infections, salmonella infections, escherichia coli infections, and pseudomonas aeruginosa infections It is taken by mouth or by injection. The dosage and frequency of administration will vary depending on the type of infection being treated. It should be taken for the full course of treatment, even if symptoms improve after a few days. Stopping the medication early may increase the risk of the infection coming back.


Next, let's create a helper function to predict with our model. It will take a string input, and add it to a temporary BigFrames DataFrame. It will also return the string extracted from the response DataFrame.

In [12]:
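def predict(prompt: str, temperature: float = TEMPERATURE) -> str:
    """Send a single prompt to the model and return the generated text."""
    # Wrap the prompt in a one-row BigFrames DataFrame
    input_df = bpd.DataFrame(
        {
            "prompt": [prompt],
        }
    )

    # `model` is defined in the next cell; return the text of the only response row
    return model.predict(input_df, temperature=temperature).ml_generate_text_llm_result.iloc[0]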

We can now initialize the model, and get a response to our prompt!

In [22]:
# Define the model
model = GeminiTextGenerator()

# Invoke LLM with prompt
response = predict(zero_shot_prompt, temperature=TEMPERATURE)

# Print results as Markdown
Markdown(response)

- Etherealox
- Zenithrox
- Aureox
- Lucentrox
- Aethrox
- Luminex
- Elysirox
- Quasarox
- Novaflux
- Arcanox

We're off to a great start! Let's see if we can refine our response.
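
Because the prompt asks for Markdown bullets, the response can be turned into a Python list for later filtering or comparison. The following is a minimal sketch that assumes each name sits on its own `-` or `*` bullet line; `parse_bullets` is a hypothetical helper, and the sample string reuses three names from the output above.

```python
import re


def parse_bullets(markdown_text: str) -> list[str]:
    """Extract the text of each '-' or '*' bullet from a Markdown list."""
    names = []
    for line in markdown_text.splitlines():
        # A bullet marker, at least one space, then the item text.
        match = re.match(r"^\s*[-*]\s+(.+?)\s*$", line)
        if match:
            names.append(match.group(1))
    return names


sample_response = "- Etherealox\n- Zenithrox\n- Aureox"
print(parse_bullets(sample_response))  # ['Etherealox', 'Zenithrox', 'Aureox']
```

Lines that are not bullets are skipped, so any stray preamble the model adds would simply be ignored.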

## Few-shot learning

Let's try using [few-shot learning](https://paperswithcode.com/task/few-shot-learning). We will provide a few examples of what we're looking for along with our prompt.

Our prompt will consist of 3 parts:
* General instructions (e.g. generate $n$ brand names)
* Multiple examples
* Information about the drug we'd like to generate a name for

Let's walk through how to construct this prompt.

Our first step will be to define how many examples we want to provide in the prompt.

In [23]:
# Specify number of examples to include

NUM_EXAMPLES = 3  # @param {type:"integer"}

Next, let's define a prefix that will set the overall context.

In [24]:
prefix_prompt = f"""Provide {NUM_NAMES} unique and modern brand names in Markdown bullet point format, related to the drug at the bottom of this prompt.

Be creative with the brand names. Don't use English words directly; use variants or invented words.

First, we will provide {NUM_EXAMPLES} examples to help with your thought process.

Then, we will provide the generic name and usage for the drug we'd like you to generate brand names for.
"""

print(prefix_prompt)

Provide 10 unique and modern brand names in Markdown bullet point format, related to the drug at the bottom of this prompt.

Be creative with the brand names. Don't use English words directly; use variants or invented words.

First, we will provide 3 examples to help with your thought process.

Then, we will provide the generic name and usage for the drug we'd like you to generate brand names for.



Our next step will be to include examples in the prompt.

We will start out by retrieving the raw data for the examples, by querying the BigQuery public dataset.

In [25]:
# Query 3 columns of interest from drug label dataset
df = bpd.read_gbq("bigquery-public-data.fda_drug.drug_label",
                  columns=["openfda_generic_name", "openfda_brand_name", "indications_and_usage"])

# Exclude any rows with missing data
df = df.dropna()

# Drop duplicate rows
df = df.drop_duplicates()

# Print values
df.head()

Unnamed: 0,openfda_generic_name,openfda_brand_name,indications_and_usage
0,BENZALKONIUM CHLORIDE,meijer kids,Use - hand washing to decrease bacteria on skin
3,"OCTINOXATE, TITANIUM DIOXIDE",CD DIORSKIN STAR Studio Makeup Spectacular Bri...,Uses Helps prevent sunburn. If used as directe...
4,TRIAMCINOLONE ACETONIDE,Triamcinolone Acetonide,INDICATIONS AND USAGE Triamcinolone Acetonide ...
5,"BACITRACIN ZINC, NEOMYCIN SULFATE, POLYMYXIN B...",Triple Antibiotic,First aid to help prevent infection in minor c...
6,RISPERIDONE,Risperidone,1. INDICATIONS AND USAGE Risperidone is an aty...


Let's now filter the results to remove atypical names.

In [26]:
# Remove names with spaces
df = df[df["openfda_brand_name"].str.find(" ") == -1]

# Remove names with 5 or fewer characters
df = df[df["openfda_brand_name"].str.len() > 5]

# Remove names where the generic and brand name match (case-insensitive)
df = df[df["openfda_generic_name"].str.lower() != df["openfda_brand_name"].str.lower()]
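
To see what these three filters keep and drop, the same rules can be sketched on plain Python strings. `is_prototypical` is a hypothetical helper written for illustration; the sample rows are taken from the outputs shown in this notebook.

```python
def is_prototypical(brand_name: str, generic_name: str) -> bool:
    """Apply the three filters above to a single (brand, generic) pair."""
    if " " in brand_name:  # multi-word brand names
        return False
    if len(brand_name) <= 5:  # 5 or fewer characters
        return False
    if brand_name.lower() == generic_name.lower():  # brand is just the generic name
        return False
    return True


rows = [
    ("meijer kids", "BENZALKONIUM CHLORIDE"),
    ("Risperidone", "RISPERIDONE"),
    ("Cayston", "AZTREONAM"),
]
kept = [brand for brand, generic in rows if is_prototypical(brand, generic)]
print(kept)  # ['Cayston']
```

Note that a brand such as `Ampicillin` still passes when the generic name is `AMPICILLIN SODIUM`, because the comparison only rejects exact (case-insensitive) matches.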

Let's take `NUM_EXAMPLES` samples to include in the prompt.

In [27]:
# Take a sample and convert to a Pandas dataframe for local usage.
df_examples = df.sample(NUM_EXAMPLES, random_state=3).to_pandas()

df_examples

Unnamed: 0,openfda_generic_name,openfda_brand_name,indications_and_usage
81748,AMPICILLIN SODIUM,Ampicillin,INDICATIONS AND USAGE Ampicillin for Injection...
730,AZTREONAM,Cayston,1 INDICATIONS AND USAGE CAYSTON® is indicated ...
71763,TERAZOSIN HYDROCHLORIDE,Terazosin,INDICATIONS AND USAGE Terazosin capsules are i...


Let's now convert the data to a list of dictionaries (a JSON-like structure), to enable embedding into a prompt. For consistency, we'll capitalize each example brand name.

In [28]:
examples = [
    {
        "brand_name": brand_name.capitalize(),
        "generic_name": generic_name,
        "usage": usage,
    }
    for brand_name, generic_name, usage in zip(
        df_examples["openfda_brand_name"],
        df_examples["openfda_generic_name"],
        df_examples["indications_and_usage"],
    )
]

print(examples)

[{'brand_name': 'Ampicillin', 'generic_name': 'AMPICILLIN SODIUM', 'usage': 'INDICATIONS AND USAGE Ampicillin for Injection, USP is indicated in the treatment of infections caused by susceptible strains of the designated organisms in the following conditions: Respiratory Tract Infections caused by Streptococcus pneumoniae. Staphylococcus aureus (penicillinase and nonpenicillinase-producing), H. influenzae, and Group A beta-hemolytic streptococci. Bacterial Meningitis caused by E. coli, Group B streptococci, and other Gram-negative bacteria (Listeria monocytogenes, N. meningitidis). The addition of an aminoglycoside with ampicillin may increase its effectiveness against Gram-negative bacteria. Septicemia and Endocarditis caused by susceptible Gram-positive organisms including Streptococcus spp., penicillin G-susceptible staphylococci, and enterococci. Gram-negative sepsis caused by E. coli, Proteus mirabilis and Salmonella spp. responds to ampicillin. Endocarditis due to enterococcal st

We'll format each example with a prompt template, concatenate the results, and view the combined string.

In [29]:
example_prompt = ""
for example in examples:
    example_prompt += f"Generic name: {example['generic_name']}\nUsage: {example['usage']}\nBrand name: {example['brand_name']}\n\n"

example_prompt

'Generic name: AMPICILLIN SODIUM\nUsage: INDICATIONS AND USAGE Ampicillin for Injection, USP is indicated in the treatment of infections caused by susceptible strains of the designated organisms in the following conditions: Respiratory Tract Infections caused by Streptococcus pneumoniae. Staphylococcus aureus (penicillinase and nonpenicillinase-producing), H. influenzae, and Group A beta-hemolytic streptococci. Bacterial Meningitis caused by E. coli, Group B streptococci, and other Gram-negative bacteria (Listeria monocytogenes, N. meningitidis). The addition of an aminoglycoside with ampicillin may increase its effectiveness against Gram-negative bacteria. Septicemia and Endocarditis caused by susceptible Gram-positive organisms including Streptococcus spp., penicillin G-susceptible staphylococci, and enterococci. Gram-negative sepsis caused by E. coli, Proteus mirabilis and Salmonella spp. responds to ampicillin. Endocarditis due to enterococcal strains usually respond to intravenous

Finally, we can create a suffix to our prompt. This will contain the generic name of the drug and its usage, and end with a request for brand names.

In [30]:
suffix_prompt = f"""Generic name: {GENERIC_NAME}
Usage: {USAGE}
Brand names:"""

print(suffix_prompt)

Generic name: Entropofloxacin
Usage: Entropofloxacin is a fluoroquinolone antibiotic that is used to treat a variety of bacterial infections, including: pneumonia, streptococcus infections, salmonella infections, escherichia coli infections, and pseudomonas aeruginosa infections It is taken by mouth or by injection. The dosage and frequency of administration will vary depending on the type of infection being treated. It should be taken for the full course of treatment, even if symptoms improve after a few days. Stopping the medication early may increase the risk of the infection coming back.
Brand names:


Let's pull it all together into a few-shot prompt.

In [31]:
# Define the prompt
few_shot_prompt = prefix_prompt + example_prompt + suffix_prompt

# Print the prompt
print(few_shot_prompt)

Provide 10 unique and modern brand names in Markdown bullet point format, related to the drug at the bottom of this prompt.

Be creative with the brand names. Don't use English words directly; use variants or invented words.

First, we will provide 3 examples to help with your thought process.

Then, we will provide the generic name and usage for the drug we'd like you to generate brand names for.
Generic name: AMPICILLIN SODIUM
Usage: INDICATIONS AND USAGE Ampicillin for Injection, USP is indicated in the treatment of infections caused by susceptible strains of the designated organisms in the following conditions: Respiratory Tract Infections caused by Streptococcus pneumoniae. Staphylococcus aureus (penicillinase and nonpenicillinase-producing), H. influenzae, and Group A beta-hemolytic streptococci. Bacterial Meningitis caused by E. coli, Group B streptococci, and other Gram-negative bacteria (Listeria monocytogenes, N. meningitidis). The addition of an aminoglycoside with ampicilli
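
The prefix, examples, and suffix assembled above can also be wrapped in one reusable function. This is a minimal sketch under the assumption that each example is a dictionary with `generic_name`, `usage`, and `brand_name` keys, as in the `examples` list; `build_few_shot_prompt` and the placeholder strings are hypothetical.

```python
def build_few_shot_prompt(prefix: str, examples: list[dict], suffix: str) -> str:
    """Concatenate the prefix, the formatted examples, and the suffix."""
    example_block = "".join(
        f"Generic name: {ex['generic_name']}\n"
        f"Usage: {ex['usage']}\n"
        f"Brand name: {ex['brand_name']}\n\n"
        for ex in examples
    )
    return prefix + example_block + suffix


# Placeholder inputs (assumptions) standing in for the real prefix, examples, and suffix.
demo_examples = [
    {"generic_name": "AZTREONAM", "usage": "(usage text)", "brand_name": "Cayston"},
]
demo_prompt = build_few_shot_prompt(
    "Instructions go here.\n",
    demo_examples,
    "Generic name: Entropofloxacin\nUsage: (usage text)\nBrand names:",
)
print(demo_prompt)
```

Keeping the assembly in one function makes it easier to vary the number of examples and compare the resulting responses.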

Now, let's pass our prompt to the LLM, and get a response!

In [42]:
response = predict(few_shot_prompt)

Markdown(response)

- **Aerion:** (Derived from "aer" meaning air)
- **Aquazone:** (Combining "aqua" for water and "zone" for area)
- **Biosphere:** (Inspired by the concept of a self-contained ecosystem)
- **Celestial:** (Evoking the vastness and healing power of the universe)
- **Ethereal:** (Conveying a sense of lightness and transcendence)
- **Luminary:** (From "lumen" meaning light, symbolizing hope and healing)
- **Quasar:** (Inspired by the powerful and distant cosmic objects)
- **Sanctuary:** (Creating a sense of safety and refuge)
- **Zenith:** (Reaching the highest point or peak)
- **Zephyr:** (Named after the gentle west wind, representing a calming and soothing effect)
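
Unlike the zero-shot prompt, the prefix used here does not ask the model to omit explanations, so each bullet carries a bolded name followed by a parenthetical note. A minimal sketch for pulling out just the names follows, assuming the `- **Name:** (explanation)` layout seen above; `extract_bold_names` is a hypothetical helper.

```python
import re

# A bullet marker, then a bolded name with an optional trailing colon.
BOLD_BULLET = re.compile(r"^\s*[-*]\s+\*\*(.+?):?\*\*")


def extract_bold_names(markdown_text: str) -> list[str]:
    """Return the bolded name at the start of each Markdown bullet."""
    names = []
    for line in markdown_text.splitlines():
        match = BOLD_BULLET.match(line)
        if match:
            names.append(match.group(1).strip())
    return names


sample = (
    '- **Aerion:** (Derived from "aer" meaning air)\n'
    "- **Zenith:** (Reaching the highest point or peak)\n"
)
print(extract_bold_names(sample))  # ['Aerion', 'Zenith']
```

Adding "Do not provide any additional explanation." to the prefix, as in the zero-shot prompt, is another way to get names only.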

## Bulk generation

Let's take these experiments to the next level by generating many names in bulk. We'll see how to leverage BigFrames at scale!

We can start by finding drugs that are missing brand names, which we identify as rows where the brand name is identical to the generic name. There are approximately 4,000 drugs that meet this criterion. We'll put a limit of 100 in this notebook.

In [43]:
# Query 3 columns of interest from drug label dataset
df_missing = bpd.read_gbq("bigquery-public-data.fda_drug.drug_label",
                          columns=["openfda_generic_name", "openfda_brand_name", "indications_and_usage"])

# Exclude any rows with missing data
df_missing = df_missing.dropna()

# Include rows in which openfda_brand_name equals openfda_generic_name
df_missing = df_missing[df_missing["openfda_generic_name"] == df_missing["openfda_brand_name"]]

# Limit the number of rows for demonstration purposes
df_missing = df_missing.head(100)

# Print values
df_missing.head()

Unnamed: 0,openfda_generic_name,openfda_brand_name,indications_and_usage
89,MEPHITIS MEPHITICA,MEPHITIS MEPHITICA,INDICATIONS Condition listed above or as direc...
105,ONDANSETRON,ONDANSETRON,"1 INDICATIONS AND USAGE Ondansetron Injection,..."
124,CLOFARABINE,CLOFARABINE,1 INDICATIONS AND USAGE Clofarabine injection ...
273,ACETAMINOPHEN AND DIPHENHYDRAMINE HYDROCHLORIDE,ACETAMINOPHEN AND DIPHENHYDRAMINE HYDROCHLORIDE,Uses Temporary relief of occasional headaches ...
284,OFLOXACIN,OFLOXACIN,INDICATIONS AND USAGE To reduce the developmen...
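
The filter above uses an exact string comparison, so a pair such as `RISPERIDONE` / `Risperidone` (seen in the earlier sample output) does not count as missing a brand name. A minimal sketch of the difference on plain strings follows; `has_placeholder_brand` is a hypothetical helper, and whether such rows should be included is a judgment call rather than something this notebook prescribes.

```python
def has_placeholder_brand(
    generic_name: str, brand_name: str, ignore_case: bool = False
) -> bool:
    """Return True when the brand name merely repeats the generic name."""
    if ignore_case:
        return generic_name.casefold() == brand_name.casefold()
    return generic_name == brand_name


print(has_placeholder_brand("ONDANSETRON", "ONDANSETRON"))  # True
print(has_placeholder_brand("RISPERIDONE", "Risperidone"))  # False
print(has_placeholder_brand("RISPERIDONE", "Risperidone", ignore_case=True))  # True
```

On the BigFrames side, the case-insensitive variant would follow the same `.str.lower()` comparison used in the earlier filtering cell.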


We will create a column `prompt` with a customized prompt for each row.

In [44]:
df_missing["prompt"] = (
    "Provide a unique and modern brand name related to this pharmaceutical drug. "
    + "Don't use English words directly; use variants or invented words. The generic name is: "
    + df_missing["openfda_generic_name"]
    + ". The indications and usage are: "
    + df_missing["indications_and_usage"]
    + "."
)
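
To preview what a single row's prompt looks like without running a query, the same concatenation can be sketched on plain strings. `build_row_prompt` is a hypothetical helper; it follows the column expression above and keeps a space between the two instruction sentences. The usage text is a placeholder.

```python
def build_row_prompt(generic_name: str, usage: str) -> str:
    """Build the per-drug prompt from a generic name and its usage text."""
    return (
        "Provide a unique and modern brand name related to this pharmaceutical drug. "
        "Don't use English words directly; use variants or invented words. "
        f"The generic name is: {generic_name}. "
        f"The indications and usage are: {usage}."
    )


# Placeholder usage text (assumption), with a generic name from the table above.
print(build_row_prompt("OFLOXACIN", "(usage text)"))
```

Checking one prompt locally is a cheap way to catch spacing or punctuation slips before sending 100 rows to the model.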

We'll create a new helper function, `batch_predict()`, and query the LLM. The job may take a couple of minutes to execute.

In [46]:
def batch_predict(
    prompts: bpd.Series, temperature: float = TEMPERATURE
) -> bpd.Series:
    """Send a Series of prompts to the model and return the generated text for each row."""
    return model.predict(prompts, temperature=temperature).ml_generate_text_llm_result


response = batch_predict(df_missing["prompt"])

Let's check the results for one of our responses!

In [50]:
# Pick a sample
k = 0

# Gather the prompt and response details
prompt_generic = df_missing["openfda_generic_name"].iloc[k]
prompt_usage = df_missing["indications_and_usage"].iloc[k]
response_str = response.iloc[k]

# Print details
print(f"Generic name: {prompt_generic}")
print(f"Usage: {prompt_usage}")
print(f"Response: {response_str}")

Generic name: MEPHITIS MEPHITICA
Usage: INDICATIONS Condition listed above or as directed by the physician
Response: **Ephemeral** (Latin root: "ephemerus," meaning "lasting for a day")

**Aetheria** (Greek root: "aither," meaning "upper air, sky")

**Zenithar** (Combination of "zenith" and "pharma")

**Celestian** (Latin root: "celestial," meaning "heavenly")

**Astralux** (Combination of "astral" and "lux," meaning "light")


Congratulations! You have learned how to use generative AI to jumpstart the creative process.

You've also seen how BigFrames can manage each step of the process, including gathering data, data manipulation, and querying the LLM.