# RAG with BigQuery 
## 参考にした記事
https://cloud.google.com/blog/ja/products/ai-machine-learning/rag-with-bigquery-and-langchain-in-cloud
## 参考にしたNotebook
https://github.com/GoogleCloudPlatform/generative-ai/blob/b5c2d85557d877bc99bf18fdf549423dc54bb108/gemini/use-cases/retrieval-augmented-generation/rag_qna_langchain_bigquery_vector_search.ipynb

# 準備

In [None]:
# Install LangChain and Google Cloud BigQuery
!pip install --upgrade --quiet tiktoken langchain langchain_google_vertexai google-cloud-bigquery pypdf langchain_community langchain_google_community 

# For testing part
!pip install --upgrade db-dtypes pandas


# Installing gcloud command if needed 
# !brew install google-cloud-sdk

In [None]:
# Automatically restart kernel after installs so that your environment can access the new packages
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

In [1]:
!gcloud auth application-default login

Your browser has been opened to visit:

    https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=764086051850-6qr4p6gpi6hn506pt8ejuq83di341hur.apps.googleusercontent.com&redirect_uri=http%3A%2F%2Flocalhost%3A8085%2F&scope=openid+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fsqlservice.login&state=sxNfwzQvadWys8stjetY6QNmcPGecK&access_type=offline&code_challenge=XylFauBTAC4DCm3_VEZ12fqmohRNSYfG5F-6OMfc6Bo&code_challenge_method=S256


Credentials saved to file: [/Users/kotaro/.config/gcloud/application_default_credentials.json]

These credentials will be used by any library that requests Application Default Credentials (ADC).

Quota project "ml-session" was added to ADC which can be used by Google client libraries for billing and quota. Note that some services may still bill the project owning the resource.


In [2]:
from google.cloud import bigquery
import pandas as pd

In [3]:
PROJECT_ID = "ml-session" 
# PROJECT_ID = "[your-project-id]"

# Set the project id
!gcloud config set project {PROJECT_ID}

Updated property [core/project].


In [4]:
REGION = "US"

In [5]:
!gcloud auth application-default set-quota-project {PROJECT_ID}


Credentials saved to file: [/Users/kotaro/.config/gcloud/application_default_credentials.json]

These credentials will be used by any library that requests Application Default Credentials (ADC).

Quota project "ml-session" was added to ADC which can be used by Google client libraries for billing and quota. Note that some services may still bill the project owning the resource.


In [6]:
client = bigquery.Client(location=REGION, project=PROJECT_ID)

In [None]:
client

# 動作確認のためのテストクエリ

In [None]:

query = """
SELECT
  vendor_id,
  passenger_count,
  trip_distance,
  rate_code,
  payment_type,
  total_amount,
  tip_amount
FROM
  `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018`
WHERE tip_amount >= 0
LIMIT 100
"""
query_job = client.query(
    query,
     location="US",
)

df = query_job.to_dataframe()
df.head(5)

# まずはデータセットを作成

In [7]:
DATASET_ID = "session37"
dataset = bigquery.Dataset(f'{PROJECT_ID}.{DATASET_ID}')
dataset.location = "US"

# dataset = client.create_dataset(dataset)  # Make an API request.
print("Created dataset {}.{}".format(client.project, dataset.dataset_id))

Created dataset ml-session.session37


In [51]:
from langchain_google_vertexai import VertexAIEmbeddings
# https://api.python.langchain.com/en/latest/embeddings/langchain_google_vertexai.embeddings.VertexAIEmbeddings.html
# モデル
# https://cloud.google.com/vertex-ai/generative-ai/docs/learn/model-versions?hl=ja#embeddings_stable_model_versions
embedding_model = VertexAIEmbeddings(
    # model_name="textembedding-gecko@latest", project=PROJECT_ID
    model_name="textembedding-gecko-multilingual@latest", project=PROJECT_ID # 多言語対応
)

I0000 00:00:1721908423.454232 3738897 check_gcp_environment_no_op.cc:29] ALTS: Platforms other than Linux and Windows are not supported


# Vector Storeの作成
# Store フェーズ

In [52]:
from langchain_google_community import BigQueryVectorStore
# https://api.python.langchain.com/en/latest/bq_storage_vectorstores/langchain_google_community.bq_storage_vectorstores.bigquery.BigQueryVectorStore.html
# A vector store implementation that utilizes BigQuery and BigQuery Vector Search.
# This class provides efficient storage and retrieval of documents with vector embeddings within BigQuery. It is particularly indicated for prototyping, due the serverless nature of BigQuery, and batch retrieval. It supports similarity search, filtering, and batch operations through batch_search method. Optionally, this class can leverage a Vertex AI Feature Store for online serving through the to_vertex_fs_vector_store method.


TABLE = "internal_info_new2"

bq_vector_store = BigQueryVectorStore(
    project_id=PROJECT_ID,
    dataset_name=DATASET_ID,
    table_name=TABLE,
    location=REGION,
    embedding=embedding_model,
)

all_texts = [
    "6月23日は創立記念日",
    "開発部の内線番号は57",
    "法務部の内線番号は55",
    "有給休暇は年間20日",
    "大阪支社の住所は...",
    "東京本社の住所は...",
]

metadatas = [{"len": len(t)} for t in all_texts]
# Run more texts through the embeddings and add to the vectorstore.
bq_vector_store.add_texts(all_texts, metadatas=metadatas)

I0000 00:00:1721908430.675501 3738897 work_stealing_thread_pool.cc:320] WorkStealingThreadPoolImpl::PrepareFork
I0000 00:00:1721908431.057870 3779092 check_gcp_environment_no_op.cc:29] ALTS: Platforms other than Linux and Windows are not supported
I0000 00:00:1721908431.058829 3779092 check_gcp_environment_no_op.cc:29] ALTS: Platforms other than Linux and Windows are not supported


BigQuery table ml-session.session37.internal_info_new2 initialized/validated as persistent storage. Access via BigQuery console:
 https://console.cloud.google.com/bigquery?project=ml-session&ws=!1m5!1m4!4m3!1sml-session!2ssession37!3sinternal_info_new2


In [21]:
# Search for top k docs most similar to input query.
bq_vector_store.similarity_search(
    "有給休暇は何日ですか？", k=1
)

[Document(metadata={'doc_id': '4c3824152b954269b5666671e23732fd', 'len': 10, 'score': 0.2807292149237709}, page_content='有給休暇は年間20日')]

# Retrieval フェーズ

In [42]:
# Return VectorStoreRetriever initialized from this VectorStore.
retriever = bq_vector_store.as_retriever(search_kwargs={'k': 1})

In [27]:
retriever

VectorStoreRetriever(tags=['BigQueryVectorStore', 'VertexAIEmbeddings'], vectorstore=BigQueryVectorStore(embedding=VertexAIEmbeddings(client=<vertexai.language_models.TextEmbeddingModel object at 0x1212fee00>, async_client=None, project='ml-session', location='us-central1', request_parallelism=5, max_retries=6, stop=None, model_name='textembedding-gecko-multilingual@latest', model_family=None, full_model_name=None, client_options=ClientOptions: {'api_endpoint': 'us-central1-aiplatform.googleapis.com', 'client_cert_source': None, 'client_encrypted_cert_source': None, 'quota_project_id': None, 'credentials_file': None, 'scopes': None, 'api_key': None, 'api_audience': None, 'universe_domain': None}, api_endpoint=None, api_transport=None, default_metadata=(), additional_headers=None, client_cert_source=None, credentials=None, client_preview=None, temperature=None, max_output_tokens=None, top_p=None, top_k=None, n=1, streaming=False, safety_settings=None, tuned_model_name=None, instance={'m

## 回答
`chain.invoke({"input": question})` を実行すると、

1. 検索クエリがretrieverに渡される
1. vector store で検索が実行される。
1. 関連するドキュメントのチャンクが返される。
1. 得られたチャンクはLLMが使用するプロンプトにコンテキストとして使用される。
1. LLMが回答を出力する 


In [62]:
from langchain.chains import create_retrieval_chain
from langchain_google_vertexai import VertexAI


from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate


llm = VertexAI(model_name="gemini-pro")


system_prompt = (
    "与えられた参考情報をもとに回答してください. "
    "わからなければ「わからない」と答えてください"
    "参考情報: {context}"
)
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

question = "有給休暇は年間何日ですか？"

question_answer_chain = create_stuff_documents_chain(llm, prompt)
chain = create_retrieval_chain(retriever, question_answer_chain)

result = chain.invoke({"input": question})

print(f'質問:{result["input"]}')
print(f'回答:{result["answer"]}')
print(f'参考情報:{result["context"][0].page_content}')


I0000 00:00:1721908897.495893 3784711 check_gcp_environment_no_op.cc:29] ALTS: Platforms other than Linux and Windows are not supported


質問:有給休暇は年間何日ですか？
回答:年20日です。
参考情報:有給休暇は年間20日


In [60]:
result["context"][0].page_content

'有給休暇は年間20日'