<a href="https://colab.research.google.com/github/leohpark/Files/blob/main/Gemini_Document_Search_and_Semantic_Retrieval_with_Attributed_Question_Answering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Document search with embeddings
(Google's original notebook that this is based on)

<table class="tfo-notebook-buttons" align="left">
      <td>
    <a target="_blank" href="https://ai.google.dev/examples/doc_search_emb"><img src="https://ai.google.dev/static/site-assets/images/docs/notebook-site-button.png" height="32" width="32" />View on Generative AI</a>
  </td>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/google/generative-ai-docs/blob/main/site/en/examples/doc_search_emb.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/google/generative-ai-docs/blob/main/site/en/examples/doc_search_emb.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
</table>

## Overview

This example demonstrates how to use the Gemini API to create embeddings so that you can perform document search. You will use the Python client library to build a word embedding that allows you to compare search strings, or questions, to document contents.

In this tutorial, you'll use embeddings to perform document search over a set of documents to ask questions related to the Google Car.

## Prerequisites

You can run this quickstart in Google Colab.

To complete this quickstart in your own development environment, ensure that your environment meets the following requirements:

-  Python 3.9+
-  An installation of `jupyter` to run the notebook.

## Setup

First, download and install the Gemini API Python library.

## Install Packages and imports

Leo: I've added langchain, tiktoken, and some pdf processing dependencies so that this notebook can process any PDF, from a file or URL.

In [None]:
!pip install -U -q google.generativeai

In [None]:
import textwrap
import numpy as np
import pandas as pd

import google.generativeai as genai
import google.ai.generativelanguage as glm

# Used to securely store your API key
from google.colab import userdata

from IPython.display import Markdown
import os

In [None]:
!pip install -U -q langchain tiktoken unstructured==0.11.2 pdf2image pdfminer.six pikepdf pypdf unstructured_pytesseract unstructured_inference


In [None]:
from langchain.document_loaders import UnstructuredPDFLoader, OnlinePDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
import tiktoken
import pdfminer
import pikepdf
import pypdf
import unstructured_pytesseract
import unstructured_inference
from unstructured.partition.pdf import partition_pdf

Over the weekend while writing this, I encountered a problem where the package manager was installing Unstructured 0.11.4, which did not contain the `unstructured/partition/pdf.py` file. I later verified that the version of unstructured available from PyPI is missing this particular file from the Dec 14 release. As a result, installing from source is necessary to get the full package with PDF support. I'm sure this will get fixed soon. Alternatively, you can downgrade to the previous release of unstructured, which is 0.11.2. My packages above reflect this downgrade. The latest unstructured can be installed from git using the code below.

In [None]:
pip install git+https://github.com/Unstructured-IO/unstructured.git

## Get your Google AI Studio API Key

Before you can use the Gemini API, you must first obtain an API key. If you don't already have one, create a key with one click in Google AI Studio.

<a class="button button-primary" href="https://makersuite.google.com/app/apikey" target="_blank" rel="noopener noreferrer">Get an API key</a>

In Colab, add the key to the secrets manager under the "🔑" in the left panel. Give it the name `API_KEY`.

Once you have the API key, pass it to the SDK. You can do this in two ways:

* Put the key in the `GOOGLE_API_KEY` environment variable (the SDK will automatically pick it up from there).
* Pass the key to `genai.configure(api_key=...)`
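
Outside Colab, the two options above can be combined. Here is a minimal sketch that prefers an explicitly passed key and falls back to the environment variable. The helper name `resolve_api_key` is hypothetical and not part of the SDK.

```python
import os

def resolve_api_key(explicit_key=None, env_var="GOOGLE_API_KEY"):
    """Return an explicitly supplied key, else the value of env_var (hypothetical helper)."""
    key = explicit_key or os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"No API key found; set {env_var} or pass one explicitly.")
    return key

# Usage (left commented so the sketch runs without the SDK installed):
# genai.configure(api_key=resolve_api_key())
```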

In [None]:
from google.colab import userdata
API_KEY = userdata.get('API_KEY')

genai.configure(api_key=API_KEY)

## Embedding generation

In this section, you will see how to generate embeddings for a piece of text using the embeddings from the Gemini API.


### API changes to Embeddings with model embedding-001

For the new embeddings model, embedding-001, there is a new task type parameter and the optional title (only valid with task_type=`RETRIEVAL_DOCUMENT`).

These new parameters apply only to the newest embeddings models. The task types are:

Task Type | Description
---       | ---
RETRIEVAL_QUERY	| Specifies the given text is a query in a search/retrieval setting.
RETRIEVAL_DOCUMENT | Specifies the given text is a document in a search/retrieval setting.
SEMANTIC_SIMILARITY	| Specifies the given text will be used for Semantic Textual Similarity (STS).
CLASSIFICATION	| Specifies that the embeddings will be used for classification.
CLUSTERING	| Specifies that the embeddings will be used for clustering.

This is just a sample embedding created from a portion of the recent Warhol SCOTUS decision. Run this cell to make sure that your API key is working and correctly being passed from the notebook userdata. Note the `title` parameter. This is a new optional parameter that is only used when the `task_type` is `retrieval_document`. It's not yet clear whether this is incorporated into the embedding, or how. Since we are only retrieving from a knowledge base containing a single document, it likely doesn't matter much here.

In [None]:
title = "ANDY WARHOL FOUNDATION FOR VISUAL ARTS, INC. v. GOLDSMITH"
sample_text = ("""In 2016, petitioner Andy Warhol Foundation for the Visual Arts, Inc.
(AWF) licensed to Condé Nast for $10,000 an image of “Orange
Prince”—an orange silkscreen portrait of the musician Prince created
by pop artist Andy Warhol—to appear on the cover of a magazine com-
memorating Prince. Orange Prince is one of 16 works now known as
the Prince Series that Warhol derived from a copyrighted photograph
taken in 1981 by respondent Lynn Goldsmith, a professional photog-
rapher.""")

model = 'models/embedding-001'
embedding = genai.embed_content(model=model,
                                content=sample_text,
                                task_type="retrieval_document",
                                title=title)

print(embedding)

{'embedding': [-0.004976866, -0.04624615, -0.047661625, -0.04056859, 0.10324767, -0.020868372, 0.023740215, 0.018633293, 0.06901656, 0.0022520986, 0.047237497, 0.05694859, -0.00045273575, -0.0042082793, 0.0348532, -0.06980623, 0.027102608, 0.045155425, -0.012675372, -0.09266169, -0.01417843, 0.03684274, 0.02121998, -0.019039992, -0.028201852, -0.014396963, 0.026175383, -0.08828838, -0.011369659, 0.048632525, -0.042363264, 0.02496338, -0.029759683, 0.061684623, 0.02667723, -0.06739606, -0.017340023, 0.037661646, -0.018567342, 0.020785995, -0.01977513, -0.01946769, -0.013868994, -0.052813265, 0.029303534, 0.016044145, -0.008170931, 0.0526105, -0.010348585, -0.044506326, 0.0566009, -0.05120147, 0.04071312, -0.05216053, -0.01364342, -0.01434613, 0.037146643, -0.040675074, -0.06251755, 0.0070602167, -0.028874312, 0.040811654, -0.024771893, -0.009799687, -0.01386678, -0.029561162, -0.023884322, 0.036670666, 0.003180297, 0.008733685, -0.04203412, -0.030888088, 0.05821382, -0.044809062, -0.007

## OAuth Shenanigans

In order to send a semantic retrieval call to the `GenerativeServiceClient`, you need an OAuth token set up. I don't really know what this means, but I used my Google Cloud Platform account to create one using the instructions in this Colab notebook. https://colab.research.google.com/github/google/generative-ai-docs/blob/main/site/en/docs/semantic_retriever.ipynb#scrollTo=P719DMtK8t-p

I'm not super familiar with this, but I used GCP to create the OAuth Service Account, then from that Service Account generated a Key, which is downloaded as a JSON file, then uploaded to the Colab notebook and used to authenticate the Semantic Retriever request.

We will require this OAuth setup to use the semantic retrieval and complete the AQA "Attributed Question Answering" API call. If you don't want to mess with that model, you can skip these steps.

In [None]:
!pip install -U google-auth-oauthlib

In [None]:
# Rename the uploaded file to `service_account_secret.json` OR
# Change the variable `service_account_file_name` in the code below.
service_account_file_name = '/content/gen-lang-client-0086316029-a94fc89b076e.json'

from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(service_account_file_name)

scoped_credentials = credentials.with_scopes(
    ['https://www.googleapis.com/auth/cloud-platform', 'https://www.googleapis.com/auth/generative-language.retriever'])

In [None]:
generative_service_client = glm.GenerativeServiceClient(credentials=scoped_credentials)
retriever_service_client = glm.RetrieverServiceClient(credentials=scoped_credentials)
permission_service_client = glm.PermissionServiceClient(credentials=scoped_credentials)

## Helper functions for processing PDFs
The following functions are used to load and partition PDFs. There are separate loaders depending on whether you are using a local file or a URL to a PDF online.

In [None]:
#utility functions. yes, this is OpenAI's tokenizer. No, Google doesn't seem to provide one.
def tiktoken_len(text, base='cl100k_base'):
  tokenizer = tiktoken.get_encoding(base)
  tokens = tokenizer.encode(
      text,
      disallowed_special=()
  )
  return len(tokens)

def doc_loader(file_path):
    loader = UnstructuredPDFLoader(file_path)
    loader_doc = loader.load()
    doc_content = loader_doc[0].page_content[:]
    doc_tokens = tiktoken_len(doc_content)
    doc_name = file_path.split('/')[-1]  # Get the name of the document

    return loader_doc, doc_content, doc_tokens, doc_name

def online_pdf_loader(url):
    loader = OnlinePDFLoader(url)
    loader_doc = loader.load()
    doc_content = loader_doc[0].page_content[:]
    doc_tokens = tiktoken_len(doc_content)
    doc_name = url.split('/')[-1]
    return loader_doc, doc_content, doc_tokens, doc_name


def text_splitter(doc, max_tokens=600, overlap_tokens=50):
  # Local name differs from the function name to avoid shadowing it.
  splitter = RecursiveCharacterTextSplitter(
      chunk_size = int(max_tokens), # number of units per chunk
      chunk_overlap = int(overlap_tokens), # number of units of overlap
      length_function = tiktoken_len, # use tokens as chunking unit instead of characters
      separators=['\n\n', '\n', ' '] # our chosen separators
      )
  texts = splitter.split_text(doc)
  return texts

def chunks_from_file(pdf, max_tokens, overlap_tokens):
  loader_doc, doc_content, doc_tokens, doc_name = doc_loader(pdf)
  chunks = text_splitter(doc_content, max_tokens, overlap_tokens)
  return chunks, doc_name

def chunks_from_url(url, max_tokens, overlap_tokens):
  loader_doc, doc_content, doc_tokens, doc_name = online_pdf_loader(url)
  chunks = text_splitter(doc_content, max_tokens, overlap_tokens)
  return chunks, doc_name

## Add user-defined info here.

❗️ If you are uploading a PDF file, then update the path string for `my_pdf`. Click on the Folder icon on the left-hand edge of the screen, and drag your PDF
into the side-panel. Then, right-click on the name of the doc and choose "copy path" to get the location of your doc. Paste the result below.

❗️ If you are using a URL to a PDF, then paste your URL in the quotes after `my_url` and the OnlinePDFLoader will download the PDF and process it for chunking.

You can configure chunk size here for the embeddings as well. I am recommending 600 token chunks because the current beta state of Google Semantic Retriever recommends token lengths around 300. I think that's too short for legal RAG, so I'm trying to be reasonable by limiting our chunks to 2x that value.

However, if you are not using Semantic Retriever and/or want to test something specific, you can modify your chunk parameters here.
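
To make the effect of `max_tokens` and `overlap_tokens` concrete, here is a simplified sketch of fixed-size windows with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on the separators), and the function name `chunk_tokens` and the use of whitespace-separated words as stand-in tokens are assumptions for illustration only.

```python
def chunk_tokens(tokens, max_tokens=600, overlap_tokens=50):
    """Split a token list into windows of at most max_tokens,
    each sharing overlap_tokens with the previous window."""
    if overlap_tokens >= max_tokens:
        raise ValueError("overlap_tokens must be smaller than max_tokens")
    step = max_tokens - overlap_tokens
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reaches the end of the text
    return windows

# Stand-in "tokens": 1300 fake words instead of real tokenizer output.
demo_words = [f"w{i}" for i in range(1300)]
demo_windows = chunk_tokens(demo_words)
print([len(w) for w in demo_windows])  # [600, 600, 200]
```

With the defaults, each new window starts 550 tokens after the previous one, so neighbouring chunks share 50 tokens of context.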

In [None]:
# Enter arguments for your Document embedding
# This notebook is written assuming you will either upload a file OR provide a URL, but not both.

my_pdf = "/content/Caniglia v. Strom 2021 Opinion.pdf"
my_url = "https://www.supremecourt.gov/opinions/22pdf/21-869_87ad.pdf"

max_tokens = 600
overlap_tokens = 50

In [None]:
# @title If you uploaded a file, run this cell
chunks, doc_name = chunks_from_file(my_pdf, max_tokens, overlap_tokens)

In [None]:
# @title If you are using a pdf URL, run this cell
chunks, doc_name = chunks_from_url(my_url, max_tokens, overlap_tokens)

## Building an embeddings database

Let's take a look at a few of these Chunks. You will use the Gemini API to create embeddings of each of the documents. Turn them into a dataframe for better visualization.

The chunking operation also extracts a value for `doc_name`, which is just the name of the document from the file or URL. Since this value is used as the `title` parameter for the embedding-001 model, we might want to take a moment to add a more descriptive title now.

If the name of your file wasn't terribly descriptive, add a more descriptive one below. Again, the documentation doesn't indicate what this title does, so I'm not sure what to expect: whether it is used as metadata, or whether some portion of the embedding space is reserved for the document title.

In [None]:
# Look at a chunk or two
chunks[0:1]

['(Slip Opinion)\n\nOCTOBER TERM, 2022\n\nSyllabus\n\nNOTE: Where it is feasible, a syllabus (headnote) will be released, as is being done in connection with this case, at the time the opinion is issued. The syllabus constitutes no part of the opinion of the Court but has been prepared by the Reporter of Decisions for the convenience of the reader. See United States v. Detroit Timber & Lumber Co., 200 U. S. 321, 337.\n\nSUPREME COURT OF THE UNITED STATES\n\nSyllabus\n\nANDY WARHOL FOUNDATION FOR THE VISUAL ARTS, INC. v. GOLDSMITH ET AL.\n\nCERTIORARI TO THE UNITED STATES COURT OF APPEALS FOR THE SECOND CIRCUIT\n\nNo. 21–869. Argued October 12, 2022—Decided May 18, 2023']

In [None]:
# Provide a better document title for embeddings, if you want.
doc_name = "(better title)"

Organize the chunks into a dataframe for better visualization.

## These functions build our local "vector database"
The first function is the Google AI sample function for getting an embedding vector using the Gemini family "embedding-001" model. I'm being a bit redundant here in requiring a `model` argument, given that there's only one embedding model currently available on Google AI Studio.

Below, we are creating a dataframe consisting of our original chunks, then adding a column for embedding vectors, and aqa_ids to use later. Our code is processing the dataframe row-by-row, meaning the embeddings are being processed one at a time. If your PDF was extremely long, and you have many hundreds, or thousands, of chunks then this step may take a few minutes. Ordinarily, you can batch embedding requests to send 30-100 at once, subject to the rate limits of the provider.
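
As a rough sketch of the batching idea mentioned above, the helper below groups chunks and hands each group to an embedding callable. The names `batched`, `embed_in_batches`, and `embed_batch` are hypothetical, and the stand-in embedder exists only so the sketch runs without an API key. Whether and how the Gemini SDK accepts a list of texts in one call should be checked against its documentation before swapping in a real embedder.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items, each at most batch_size long."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(chunks, embed_batch, batch_size=50):
    """Call embed_batch once per group of chunks and flatten the results.
    embed_batch is any callable mapping a list of texts to a list of vectors."""
    embeddings = []
    for batch in batched(chunks, batch_size):
        embeddings.extend(embed_batch(batch))
    return embeddings

# Stand-in embedder (NOT a real model): one-dimensional "vector" = text length.
def fake_embed_batch(texts):
    return [[float(len(t))] for t in texts]

demo_embeddings = embed_in_batches(["a", "bb", "ccc"], fake_embed_batch, batch_size=2)
print(demo_embeddings)  # [[1.0], [2.0], [3.0]]
```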

In [None]:
# Get the embeddings of each text and add to an embeddings column in the dataframe
def get_embedding(chunk, doc_name, model="models/embedding-001"):
  # title field is optional.
  return genai.embed_content(model=model,
                             content=chunk,
                             task_type="retrieval_document",
                             title=doc_name)["embedding"]

def embed_to_dataframe(chunks, doc_name, model):
    # Create the dataframe
    df = pd.DataFrame(chunks, columns=['Docs'])

    # Apply get_embedding to each row in the 'Docs' column
    df['Embeddings'] = df['Docs'].apply(lambda x: get_embedding(x, doc_name, model))

    # Add aqa_id column
    df['aqa_id'] = df.index.map(lambda x: '{0:03d}'.format(x))

    return df

In [None]:
#this step creates the embeddings
model = 'models/embedding-001'
df = embed_to_dataframe(chunks, doc_name, model)

Now we can take a peek at our completed knowledge base, which is a pandas dataframe with a "Docs" column for text chunks, an "Embeddings" column from the Gemini `embedding-001` model (these appear to be 768-dimensional vectors), and `aqa_id` values, which we will use for inline grounding passages for our Semantic Retrieval with Attributed Question Answering.

In [None]:
df

| | Docs | Embeddings | aqa_id |
| --- | --- | --- | --- |
| 0 | (Slip Opinion)\n\nOCTOBER TERM, 2022\n\nSyllab... | [0.026983079, -0.02162463, -0.03576854, -0.029... | 000 |
| 1 | CERTIORARI TO THE UNITED STATES COURT OF APPEA... | [-0.004097693, -0.041530993, -0.051666588, -0.... | 001 |
| 2 | 1\n\n2\n\nANDY WARHOL FOUNDATION FOR VISUAL AR... | [-0.00965148, -0.081810914, -0.06743688, -0.04... | 002 |
| 3 | (a) AWF contends that the Prince Series works ... | [-0.025916703, -0.034734536, -0.053024877, -0.... | 003 |
| 4 | Cite as: 598 U. S. ____ (2023)\n\nSyllabus\n\n... | [-0.002050875, -0.051198956, -0.052557763, -0.... | 004 |
| ... | ... | ... | ... |
| 81 | Consider as one example the reclining nude. Pr... | [-0.025912184, 0.0056529893, -0.062337235, -0.... | 081 |
| 82 | Take a look at one last example, from a modern... | [0.008020435, -0.044162907, -0.07065879, -0.01... | 082 |
| 83 | 33\n\n34 ANDY WARHOL FOUNDATION FOR VISUAL ART... | [-0.0037841713, -0.04654758, -0.07466252, -0.0... | 083 |
| 84 | Cite as: 598 U. S. ____ (2023)\n\nKAGAN, J., d... | [-0.017232843, -0.030867845, -0.061245237, -0.... | 084 |


Use the `retrieve_docs` function to calculate the dot products, and then sort the dataframe from the largest to smallest dot product value to retrieve the relevant passages out of the database.

I modified this function to accept a top_k parameter, then, at the bottom of each returned Doc text, I'm appending a "Source Ref" string consisting of the Doc name and the dataframe index, mostly as an experiment to see how well the LLM utilizes in-context source data (spoiler: in general it doesn't work that well). Remember that this is being injected into the returned doc text, and not part of the vectors themselves.
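
To see the ranking step in isolation, here is a small sketch of dot-product retrieval on toy two-dimensional vectors. The function name `top_k_by_dot_product` and the toy data are made up for illustration; real embeddings from the API are much longer, but the sorting logic is the same idea.

```python
import numpy as np

def top_k_by_dot_product(doc_embeddings, query_embedding, top_k=3):
    """Return (indices, scores) of the top_k documents, highest dot product first."""
    scores = np.dot(np.stack(doc_embeddings), query_embedding)
    order = np.argsort(scores)[::-1][:top_k]  # argsort is ascending, so reverse it
    return order.tolist(), scores[order].tolist()

# Toy data: three "documents" and one "query" in two dimensions.
toy_docs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
toy_query = [1.0, 0.25]

toy_indices, toy_scores = top_k_by_dot_product(toy_docs, toy_query, top_k=2)
print(toy_indices, toy_scores)  # [0, 2] [1.0, 0.625]
```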

In [None]:
#modified function to take a top_k parameter, perform a sort function on the dataframe, and return the top_k results as a list
#modified to include a "Source" attribution based on the dataframe indices pre-argsort
def retrieve_docs(query, dataframe, top_k):
    """
    Compute the dot product similarities between the query and each document in the dataframe,
    and return the top_k best matches with original DataFrame indices.
    """
    # Generate the embedding for the query
    query_embedding = genai.embed_content(model='models/embedding-001',
                                          content=query,
                                          task_type="retrieval_query")

    # Compute dot products
    dot_products = np.dot(np.stack(dataframe['Embeddings']), query_embedding["embedding"])

    # Get original indices and sort by dot product values
    sorted_indices = np.argsort(dot_products)[::-1]
    original_indices = dataframe.index[sorted_indices]

    return original_indices[:top_k]

# Format and return the corresponding texts with original indices
def docs_with_sources(query, dataframe, top_k=3):
    rag_docs = retrieve_docs(query, dataframe, top_k)
    top_k_docs = []
    for idx in rag_docs:
        document = dataframe.loc[idx, 'Docs']
        source_ref = f"\n\n(Source: {doc_name} #{idx})"
        top_k_docs.append(document + source_ref)

    return top_k_docs

In [None]:
#Now let's ask a question for our retrieval function
query = "How does transformative use factor into the court's fair use analysis?"

In [None]:
#Try a few different questions and see how the retrieved docs change.
rag_docs = docs_with_sources(query, df, 3)
for doc in rag_docs:
  print(doc, "\n\n")

## Question and Answering Application

Let's try to use the text generation API to create a Q & A system. Input your own custom data below to create a simple question and answering example. You will still use the dot product as a metric of similarity.

I'm not in love with this prompt, but this is the suggested format from the Google AI folks. I've adjusted it slightly for tone suitable for a legal audience, but modify this as you see fit, and see how it changes Gemini's answer style.

This function formats and composes the retrieved rag_docs into a prompt suitable for Gemini Pro. This giant string blob that ends with a "\n\nDelimiter" seems awfully familiar. Now what other model uses this format? Oh, right, Anthropic Claude. *Very interesting* to see that they've configured the Google AI studio endpoint to behave this way...

#🤔

In [None]:
def make_prompt(query, rag_docs):
    """
    Create a prompt using the query and a list of relevant passages.
    """
    # Escape and format each passage, then join them into a single string
    formatted_passages = " ".join([passage.replace("'", "").replace('"', "").replace("\n", " ") for passage in rag_docs])

    prompt = textwrap.dedent("""\
    You are a helpful and informative bot that answers questions using text from the reference passages included below. \
    Provide a detailed report, being comprehensive, including all relevant background information. \
    Your audience is legal professionals and analysts who need in-depth, details legal analysis and answers \
    So provide competent, accurate answers with "Source" attribution whenever possible \
    If some information in the passages are irrelevant to the answer, you may ignore them. \
    Begin your answer by restating the question, then providing your answer after. \
    QUESTION: '{query}'
    PASSAGES: '{formatted_passages}'

    ANSWER:
    """).format(query=query, formatted_passages=formatted_passages)

    return prompt

In [None]:
#take a look at the prompt results here
prompt = make_prompt(query, rag_docs)
prompt

'You are a helpful and informative bot that answers questions using text from the reference passages included below.     Provide a detailed report, being comprehensive, including all relevant background information.     Your audience is legal professionals and analysts who need in-depth, details legal analysis and answers     So provide competent, accurate answers with "Source" attribution whenever possible     If some information in the passages are irrelevant to the answer, you may ignore them.     Begin your answer by restating the question, then providing your answer after.     QUESTION: \'How does transformative use factor into the court\'s fair use analysis?\'\nPASSAGES: \'15  16 ANDY WARHOL FOUNDATION FOR VISUAL ARTS, INC.  v. GOLDSMITH Opinion of the Court  in the sense that copying is socially useful ex post. Many secondary works add something new. That alone does not render such uses fair. Rather, the first factor (which is just one factor in a larger analysis) asks “whether 

## Let's do some class assignments to set up the LLM calls

For more information regarding Google's LLM parameters, look at their documentation here: https://ai.google.dev/api/python/google/generativeai/GenerationConfig

The first cell is just showing us the Gemini Pro information from the SDK. Next, we are defining `config` and `model` to set the values we will use for our API call.

The line including `answer = model.generate_content()` is sending off the API request to Gemini Pro.

In [None]:
#first, let's just take a look at the specs and default config values for Gemini Pro
my_model = genai.get_model('models/gemini-pro')
print(my_model)

Model(name='models/gemini-pro',
      base_model_id='',
      version='001',
      display_name='Gemini Pro',
      description='The best model for scaling across a wide range of tasks',
      input_token_limit=30720,
      output_token_limit=2048,
      supported_generation_methods=['generateContent', 'countTokens'],
      temperature=0.9,
      top_p=1.0,
      top_k=1)


In [None]:
config = genai.GenerationConfig(
    candidate_count = 1,
    stop_sequences = None,
    max_output_tokens = 1200,
    temperature = 0,
    top_p = 1.0,
    top_k = 1
    )
model = genai.GenerativeModel(
    model_name = 'gemini-pro',
    generation_config = config)

In [None]:
answer = model.generate_content(prompt)
Markdown(answer.text)

How does transformative use factor into the court's fair use analysis?

The court's fair use analysis considers whether the use of a copyrighted work is transformative, meaning it adds something new, with a further purpose or different character, altering the copyrighted work with new expression, meaning, or message. The more transformative the use, the more likely it is to be considered fair.

The court has held that transformative uses are favored because they stimulate creativity and fulfill the objective of copyright law. In Campbell v. Acuff-Rose Music, Inc., the court found that a rap song that used a sample of a copyrighted song was a transformative use because it added new expression and meaning to the original work. In Google LLC v. Oracle America, Inc., the court found that Google's use of Sun Microsystems' Java API in its Android operating system was a transformative use because it created a new and innovative software platform.

The court's transformative use analysis is a flexible one that takes into account the specific facts of each case. However, the court has made it clear that transformative uses are favored and that they can play an important role in promoting creativity and innovation.

Source: Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569 (1994); Google LLC v. Oracle America, Inc., 593 U.S. ___ (2021).

Let's wrap the RAG retrieval, Prompt formatting, and LLM Call into a single function so that it's a bit easier to work with. I've also just exploded the entire config parameter set according to the documentation in case you want to tweak a value, or change which arguments the function takes.

In [None]:
def gemini_rag_answer(query, top_k, model='gemini-pro', temperature=0, max_tokens=1200):
    rag_docs = docs_with_sources(query, df, top_k)
    prompt = make_prompt(query,rag_docs)
    config = genai.GenerationConfig(
        candidate_count = 1,
        stop_sequences = None,
        max_output_tokens = max_tokens,
        temperature = temperature,
        top_p = 1.0,
        top_k = 1
        )
    model = genai.GenerativeModel(
        model_name = model,
        generation_config = config)

    return model.generate_content(prompt)

Now, test out the rag pipeline with a few different questions/answers. I've configured the function to take a query argument and top_k. The remainder will autofill with reasonable defaults. The function also returns the entire answer object in case you want to poke at it further.

Try modifying top_k to see when answers get better, or worse. The function above doesn't have any guardrails around context length, so even though Gemini Pro has a ~30,000 input token limit (about 50 docs if you used my recommended chunk size of 600), at some point, the calls will error out due to context length limits. You will probably see that answer quality stops improving long before then.
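
One possible guardrail, sketched under assumptions: keep retrieved docs in rank order until a token budget would be exceeded. The name `fit_to_token_budget` is hypothetical, and the default whitespace counter is only a stand-in so the sketch is self-contained. In this notebook you could pass `tiktoken_len` as `count_tokens` instead, keeping in mind that it is OpenAI's tokenizer and only approximates Gemini's counts.

```python
def fit_to_token_budget(docs, token_budget, count_tokens=lambda text: len(text.split())):
    """Keep docs in order until adding the next one would exceed token_budget.
    Returns (kept_docs, tokens_used)."""
    kept, used = [], 0
    for doc in docs:
        cost = count_tokens(doc)
        if used + cost > token_budget:
            break  # stop at the first doc that does not fit
        kept.append(doc)
        used += cost
    return kept, used

# Toy docs costing 3, 2, and 4 "tokens" under the whitespace counter.
demo_docs = ["one two three", "four five", "six seven eight nine"]
demo_kept, demo_used = fit_to_token_budget(demo_docs, token_budget=6)
print(demo_kept, demo_used)  # ['one two three', 'four five'] 5
```

The budget should be set below the model's input limit, since the instructions and the question in the prompt also consume tokens.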

In [None]:
my_question = "From the analysis of Campbell v. Acuff Music, how did the opinion distinguish parody from satire?"

In [None]:
the_answer = gemini_rag_answer(my_question, 2)
Markdown(the_answer.text)

How did the opinion in Campbell v. Acuff Music distinguish parody from satire?

The Campbell v. Acuff Music opinion distinguished parody from satire by explaining that parody requires mimicking an original work to make its point and thus has some claim to use the creation of its victim's imagination. In contrast, satire can stand on its own and requires justification for borrowing. (Source: Campbell v. Acuff-Rose Music, Inc., 510 U.S. 569 (1994))

## We can go further! Semantic Retrieval and Attributed Question Answering!

Google released this with Gemini, too! Details are sparse on Semantic Retriever, and it doesn't appear that the API documentation is available yet (at the time of this writing). It appears to be a hybrid Embeddings+Retrieval platform whose pieces work together to produce "Attributed Question Answering".

My understanding is something like this (I might be wrong): You can set up several document stores consisting of 10,000 documents each, with 20 fields of custom metadata. They are presumably vectorized and indexed. Then, you can send a Query, and get grounding passages returned. The grounding passages are given to a specialized LLM API endpoint called `aqa`, which returns a special response object that contains:
- A generated answer
- The most relevant document
- An `answerable_probability` value corresponding to the estimated confidence that the passage actually answers the question.

I am not setting up a doc store here today, but Semantic Query and Attributed Question Answering can also be used with "inline passages", which are a sequence of docs being sent to the API for evaluation. The model will similarly produce an answer based on the most relevant doc, and return it, as well as the confidence score.
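
One way to use the confidence score, as a sketch: only surface the generated answer when `answerable_probability` clears a threshold. The helper `answer_or_fallback`, the 0.5 threshold, and the fallback text are all assumptions, and the stub object below only mimics the response fields visible in this notebook (`answer.content.parts[0].text` and `answerable_probability`) so the sketch runs without calling the API.

```python
from types import SimpleNamespace

def answer_or_fallback(aqa_response, min_probability=0.5,
                       fallback="The passages do not appear to answer this question."):
    """Return the answer text only if the model thinks the passages answer the query.
    The 0.5 default is an arbitrary illustration, not a recommended value."""
    if aqa_response.answerable_probability < min_probability:
        return fallback
    return aqa_response.answer.content.parts[0].text

# Stub that imitates the shape of an AQA response (not a real API object).
def fake_response(text, probability):
    content = SimpleNamespace(parts=[SimpleNamespace(text=text)])
    return SimpleNamespace(answer=SimpleNamespace(content=content),
                           answerable_probability=probability)

print(answer_or_fallback(fake_response("Warhol was a pop artist.", 0.9)))
print(answer_or_fallback(fake_response("Unrelated guess.", 0.1)))
```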

In [None]:
#reuses our Pandas dataframe to construct "inline grounding passages" for the AQA model
def get_inline_passages(query, dataframe, top_k=1):
    """
    Retrieve the top_k best matches for the query and package them as
    inline grounding passages (chunk text plus aqa_id) for the AQA model.
    """
    top_k_original_indices = retrieve_docs(query, dataframe, top_k)

    grounding_passages = glm.GroundingPassages()
    for idx in top_k_original_indices:
      passage_bit = glm.Content(parts=[glm.Part(text=dataframe.loc[idx, 'Docs'])])
      id_bit = dataframe.loc[idx, 'aqa_id']
      grounding_passages.passages.append(glm.GroundingPassage(content=passage_bit, id=id_bit))

    return grounding_passages

def aqa_model_call(query, inline_passages):
  query_content = glm.Content(parts=[glm.Part(text=query)])
  req = glm.GenerateAnswerRequest(model='models/aqa',
                                  contents=[query_content],
                                  inline_passages=inline_passages,
                                  #answer_styles to try are 'ABSTRACTIVE', 'VERBOSE', 'EXTRACTIVE'
                                  answer_style='VERBOSE')
  aqa_response = generative_service_client.generate_answer(req)
  return aqa_response

def aqa_query(query, dataframe, top_k=3):
  grounding_passages = get_inline_passages(query, dataframe, top_k)
  get_answer = aqa_model_call(query, grounding_passages)
  return get_answer

The AQA Model is included in the Gemini family, but it's a special purpose model that's not intended for text generation or chat. More significant for us is that it has an input token limit of around 7000, meaning that the maximum top_k value before you exceed the token limit should be around 11.
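
The arithmetic behind that estimate is just integer division of the input limit by the chunk size. Here is a tiny sketch; the function name `max_chunks_for_limit` and the optional `reserved_tokens` allowance for the query are assumptions for illustration.

```python
def max_chunks_for_limit(input_token_limit, chunk_tokens, reserved_tokens=0):
    """How many full chunks of chunk_tokens fit under input_token_limit,
    after setting aside reserved_tokens for the query or instructions."""
    return max((input_token_limit - reserved_tokens) // chunk_tokens, 0)

print(max_chunks_for_limit(7000, 600))   # 11 (the AQA estimate)
print(max_chunks_for_limit(30720, 600))  # 51 (Gemini Pro, roughly the "about 50" above)
```

These are upper bounds that assume every chunk is exactly 600 tokens as counted by the model, so treat them as rough guides.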

In [None]:
first_query = "Who is Andy Warhol and what is he famous for?"
second_query = "Why did the court discuss Campbell's Soup cans?"
third_query = "Did the court answer whether or not trademark fair use factored into the court's conclusion?"

In [None]:
first_aqa_answer = aqa_query(first_query, df, 6)
first_aqa_answer

answer {
  content {
    parts {
      text: "Andy Warhol was a famous American artist who is known for his work in pop art. He used silkscreens to create his paintings, which often depicted celebrities and everyday objects. Warhol was also a pioneer in the use of appropriation, which is the act of taking an existing image and using it as the basis for a new work of art. Some of Warhol\'s most famous works include the Campbell\'s Soup Cans series, the Marilyn Monroe series, and the Elvis Presley series."
    }
  }
  finish_reason: STOP
  grounding_attributions {
    content {
      parts {
        text: "3\n\n4\n\nANDY WARHOL FOUNDATION FOR VISUAL ARTS, INC. v. GOLDSMITH KAGAN, J., dissenting\n\nundermines creative freedom. I respectfully dissent.2\n\nI A\n\nAndy Warhol is the avatar of transformative copying. Cf. Google, 593 U. S., at ___\342\200\223___ (slip op., at 24\342\200\22325) (selecting Warhol, from the universe of creators, to illustrate what transformative copying is). In h

In [None]:
#parameters are (query, dataframe, top_k)
second_aqa_answer = aqa_query(second_query, df, 5)
second_aqa_answer

answer {
  content {
    parts {
      text: "The court discussed Campbell\342\200\231s Soup Cans to illustrate the difference between transformative and non-transformative uses. Campbell\342\200\231s use of the Campbell\342\200\231s Soup can was transformative because it used the image of the soup can to create a new work of art that commented on consumerism. In contrast, the Warhol Foundation\342\200\231s use of Goldsmith\342\200\231s photograph of Prince was not transformative because it did not create a new work of art. The Warhol Foundation simply copied Goldsmith\342\200\231s photograph and added some new colors and shapes."
    }
  }
  finish_reason: STOP
  grounding_attributions {
    content {
      parts {
        text: "The Court\342\200\231s decision in Campbell is instructive. In holding that par- ody may be fair use, the Court explained that \342\200\234parody has an obvious claim to transformative value\342\200\235 because \342\200\234it can provide social benefit, by sh

In [None]:
third_answer = aqa_query(third_query, df, 4)
third_answer

answer {
  content {
    parts {
      text: "**No, the court did not specifically address trademark fair use in its opinion.** However, the court did discuss the concept of fair use in general, and its decision in this case appears to be consistent with the principles of trademark fair use.\n\nUnder trademark law, fair use allows for the use of trademarks in certain circumstances, such as for parody, criticism, or news reporting. In order for a use to be considered fair, it must be transformative, meaning that it must add something new to the original work and not simply be a copy of it. The use must also be non-commercial or only minimally commercial.\n\nIn the case of Andy Warhol Foundation for the Visual Arts, Inc. v. Goldsmith, the court found that the Andy Warhol Foundation\'s use of Goldsmith\'s photograph was not fair use. The court reasoned that the Foundation\'s use of the photograph was commercial and that it did not add anything new to the original work."
    }
  }
  finish