# Vectara

> [Vectara](https://vectara.com/) is the trusted GenAI platform that provides an easy-to-use API for document indexing and querying.

Vectara provides an end-to-end managed service for Retrieval Augmented Generation, or [RAG](https://vectara.com/grounded-generation/), which includes:

1. A way to extract text from document files and chunk it into sentences.

2. The state-of-the-art [Boomerang](https://vectara.com/how-boomerang-takes-retrieval-augmented-generation-to-the-next-level-via-grounded-generation/) embeddings model. Each text chunk is encoded into a vector embedding using Boomerang and stored in Vectara's internal knowledge (vector+text) store.

3. A query service that automatically encodes the query into an embedding and retrieves the most relevant text segments (including support for [Hybrid Search](https://docs.vectara.com/docs/api-reference/search-apis/lexical-matching) and [MMR](https://vectara.com/get-diverse-results-and-comprehensive-summaries-with-vectaras-mmr-reranker/)).

4. An option to create a [generative summary](https://docs.vectara.com/docs/learn/grounded-generation/grounded-generation-overview), based on the retrieved documents, including citations.


See the [Vectara API documentation](https://docs.vectara.com/docs/) for more information on how to use the API.

This notebook shows how to use the basic retrieval functionality when using Vectara purely as a vector store (without summarization), including `similarity_search` and `similarity_search_with_score`, as well as the LangChain `as_retriever` functionality.




# Setup

You will need a Vectara account to use Vectara with LangChain. To get started, use the following steps:

1. [Sign up](https://www.vectara.com/integrations/langchain) for a Vectara account if you don't already have one. Once you have completed your sign-up you will have a Vectara customer ID. You can find your customer ID by clicking on your name at the top right of the Vectara console window.

2. Within your account you can create one or more corpora. Each corpus represents an area that stores text data ingested from input documents. To create a corpus, use the **"Create Corpus"** button, then give your corpus a name and a description. Optionally you can define filtering attributes and apply some advanced options. If you click on the corpus you created, you can see its name and corpus ID at the top.

3. Next you'll need to create API keys to access the corpus. Click on the **"Authorization"** tab in the corpus view and then the **"Create API Key"** button. Give your key a name, and choose whether you want query only or query+index permissions for your key. Click "Create" and you now have an active API key. Keep this key confidential.


To use LangChain with Vectara, you need three values: customer ID, corpus ID and API key. You can provide them to LangChain in two ways:

1. Include in your environment these three variables: `VECTARA_CUSTOMER_ID`, `VECTARA_CORPUS_ID` and `VECTARA_API_KEY`.

   For example, you can set these variables using `os.environ` and `getpass` as follows:

```python
import os
import getpass

os.environ["VECTARA_CUSTOMER_ID"] = getpass.getpass("Vectara Customer ID:")
os.environ["VECTARA_CORPUS_ID"] = getpass.getpass("Vectara Corpus ID:")
os.environ["VECTARA_API_KEY"] = getpass.getpass("Vectara API Key:")
```

2. Add them to the Vectara vectorstore constructor:


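```python
from langchain.vectorstores import Vectara

# vectara_customer_id, vectara_corpus_id and vectara_api_key are assumed
# to already hold the values from your Vectara account.
vectorstore = Vectara(
    vectara_customer_id=vectara_customer_id,
    vectara_corpus_id=vectara_corpus_id,
    vectara_api_key=vectara_api_key,
)
```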
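Whichever option you choose, it can help to check that all three values are present before constructing the vector store. The following is a minimal sketch of such a check; `load_vectara_settings` and `REQUIRED_VARS` are hypothetical names introduced here for illustration and are not part of LangChain or Vectara.

```python
import os

# The three variable names come from the setup steps above.
REQUIRED_VARS = ("VECTARA_CUSTOMER_ID", "VECTARA_CORPUS_ID", "VECTARA_API_KEY")


def load_vectara_settings(env=None):
    """Return the three Vectara settings, or raise if any is missing or empty.

    Hypothetical helper: `env` defaults to os.environ but can be any mapping,
    which makes the check easy to exercise without touching the real environment.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ValueError(f"Missing Vectara settings: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_VARS}
```

Calling `load_vectara_settings()` with no arguments reads from the process environment, so a missing variable fails early with a readable message instead of surfacing later as an authentication error.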




## Connecting to Vectara from LangChain

To get started, let's ingest the documents using the `from_documents()` method. We assume here that you've added your `VECTARA_CUSTOMER_ID`, `VECTARA_CORPUS_ID` and query+indexing `VECTARA_API_KEY` as environment variables.




```python
from langchain.document_loaders import TextLoader
from langchain.embeddings.fake import FakeEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Vectara
```

```python
# Load the speech and split it into chunks of roughly 1000 characters.
loader = TextLoader("state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
```

```python
# Index the chunks and tag each document with a "speech" metadata field.
vectara = Vectara.from_documents(
    docs,
    embedding=FakeEmbeddings(size=768),
    doc_metadata={"speech": "state-of-the-union"},
)
```

Vectara's indexing API provides a file upload API where the file is handled directly by Vectara: it is pre-processed, chunked optimally and added to the Vectara vector store. To use this, we added the `add_files()` method (as well as `from_files()`).

Let's see this in action. We pick two PDF documents to upload:

1. The "I have a dream" speech by Dr. King

2. Churchill's "We Shall Fight on the Beaches" speech





```python
import tempfile
import urllib.request

# Each entry is [source URL, value stored in the "speech" metadata field].
urls = [
    [
        "https://www.gilderlehrman.org/sites/default/files/inline-pdfs/king.dreamspeech.excerpts.pdf",
        "I-have-a-dream",
    ],
    [
        "https://www.parkwayschools.net/cms/lib/MO01931486/Centricity/Domain/1578/Churchill_Beaches_Speech.pdf",
        "we shall fight on the beaches",
    ],
]

# Download each PDF to a temporary path.
files_list = []
for url, _ in urls:
    name = tempfile.NamedTemporaryFile().name
    urllib.request.urlretrieve(url, name)
    files_list.append(name)

# Upload the files; Vectara handles pre-processing and chunking.
docsearch: Vectara = Vectara.from_files(
    files=files_list,
    embedding=FakeEmbeddings(size=768),
    metadatas=[{"url": url, "speech": title} for url, title in urls],
)
```

## Similarity search

The simplest scenario for using Vectara is to perform a similarity search.




```python
query = "What did the president say about Ketanji Brown Jackson"
found_docs = vectara.similarity_search(
    query, n_sentence_context=0, filter="doc.speech = 'state-of-the-union'"
)
```

```python
found_docs
```

```
[Document(page_content='And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson.', metadata={'source': 'langchain', 'lang': 'eng', 'offset': '596', 'len': '97', 'speech': 'state-of-the-union'}),
 Document(page_content='In this struggle as President Zelenskyy said in his speech to the European Parliament “Light will win over darkness.”', metadata={'source': 'langchain', 'lang': 'eng', 'offset': '141', 'len': '117', 'speech': 'state-of-the-union'}),
 Document(page_content='As Ohio Senator Sherrod Brown says, “It’s time to bury the label “Rust Belt.”', metadata={'source': 'langchain', 'lang': 'eng', 'offset': '0', 'len': '77', 'speech': 'state-of-the-union'}),
 Document(page_content='Last month, I announced our plan to supercharge  \nthe Cancer Moonshot that President Obama asked me to lead six years ago.', metadata={'source': 'langchain', 'lang': 'eng', 'offset': '0', 'len': '122', 'speech': 'state-of-the-union'}),
 Document(page_content='He thoug
```

```python
print(found_docs[0].page_content)
```

```
And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson.
```


## Similarity search with score

Sometimes we want to perform the search but also obtain a relevancy score, to know how good a particular result is.




```python
query = "What did the president say about Ketanji Brown Jackson"
found_docs = vectara.similarity_search_with_score(
    query,
    filter="doc.speech = 'state-of-the-union'",
    score_threshold=0.2,
)
```

```python
# Each result is a (document, score) pair.
document, score = found_docs[0]
print(document.page_content)
print(f"\nScore: {score}")
```

```
Justice Breyer, thank you for your service. One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. A former top litigator in private practice.

Score: 0.74179757
```


Now let's do a similar search for content in the files we uploaded:




```python
query = "We must forever conduct our struggle"
min_score = 1.2
found_docs = vectara.similarity_search_with_score(
    query,
    filter="doc.speech = 'I-have-a-dream'",
    score_threshold=min_score,
)
print(f"With this threshold of {min_score} we have {len(found_docs)} documents")
```

```
With this threshold of 1.2 we have 0 documents
```

```python
query = "We must forever conduct our struggle"
min_score = 0.2
found_docs = vectara.similarity_search_with_score(
    query,
    filter="doc.speech = 'I-have-a-dream'",
    score_threshold=min_score,
)
print(f"With this threshold of {min_score} we have {len(found_docs)} documents")
```

```
With this threshold of 0.2 we have 10 documents
```
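The two runs above differ only in `score_threshold`. Conceptually, the threshold acts as a filter over scored results. The sketch below illustrates that idea in plain Python; it is not Vectara's or LangChain's implementation. `filter_by_score` is a hypothetical helper, the scores are made up, and whether the real comparison is inclusive at the boundary is an assumption.

```python
def filter_by_score(results, score_threshold):
    """Keep (text, score) pairs whose score is at least the threshold.

    Hypothetical helper for illustration; the inclusive comparison (>=)
    is an assumption, not confirmed behaviour of the Vectara integration.
    """
    return [(text, score) for text, score in results if score >= score_threshold]


# Made-up results for illustration only; these are not real Vectara scores.
results = [("passage A", 0.74), ("passage B", 0.41), ("passage C", 0.15)]

for min_score in (1.2, 0.2):
    kept = filter_by_score(results, min_score)
    print(f"With this threshold of {min_score} we have {len(kept)} documents")
```

A threshold above every score returns an empty list, which mirrors the zero-document result seen with `min_score = 1.2`.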


Maximal marginal relevance (MMR) is an important retrieval capability for many applications: the search results feeding your GenAI application are reranked to improve the diversity of results.

Let's see how that works with Vectara:




```python
query = "state of the economy"
found_docs = vectara.similarity_search(
    query,
    n_sentence_context=0,
    filter="doc.speech = 'state-of-the-union'",
    k=5,
    mmr_config={"is_enabled": True, "mmr_k": 50, "diversity_bias": 0.0},
)
print("\n\n".join([x.page_content for x in found_docs]))
```

```
Economic assistance.

Grow the workforce. Build the economy from the bottom up  
and the middle out, not from the top down.

When we invest in our workers, when we build the economy from the bottom up and the middle out together, we can do something we haven’t done in a long time: build a better America.

Our economy grew at a rate of 5.7% last year, the strongest growth in nearly 40 years, the first step in bringing fundamental change to an economy that hasn’t worked for the working people of this nation for too long.

Economists call it “increasing the productive capacity of our economy.”
```

```python
query = "state of the economy"
found_docs = vectara.similarity_search(
    query,
    n_sentence_context=0,
    filter="doc.speech = 'state-of-the-union'",
    k=5,
    mmr_config={"is_enabled": True, "mmr_k": 50, "diversity_bias": 1.0},
)
print("\n\n".join([x.page_content for x in found_docs]))
```

```
Economic assistance.

The Russian stock market has lost 40% of its value and trading remains suspended.

But that trickle-down theory led to weaker economic growth, lower wages, bigger deficits, and the widest gap between those at the top and everyone else in nearly a century.

In state after state, new laws have been passed, not only to suppress the vote, but to subvert entire elections.

The federal government spends about $600 Billion a year to keep the country safe and secure.
```


As you can see, in the first example `diversity_bias` was set to 0.0 (equivalent to diversity reranking disabled), which returned the top 5 most relevant documents. With `diversity_bias=1.0` we maximize diversity, and the resulting top documents are much more diverse in their semantic meanings.
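To make the effect of `diversity_bias` more concrete, here is a minimal sketch of the textbook greedy MMR procedure in plain Python. It is not Vectara's reranker: the scoring formula, the `cosine` and `mmr_rerank` helpers, and the toy two-dimensional vectors are all assumptions made for illustration, and how Vectara applies `mmr_k` internally is not covered here.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_rerank(query_vec, candidates, k, diversity_bias):
    """Greedy MMR over (label, vector) candidates; returns the chosen labels.

    Assumed scoring: (1 - bias) * relevance - bias * redundancy, where
    redundancy is the highest similarity to anything already selected.
    """
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < k:

        def mmr_score(item):
            relevance = cosine(query_vec, item[1])
            redundancy = max((cosine(item[1], s[1]) for s in selected), default=0.0)
            return (1.0 - diversity_bias) * relevance - diversity_bias * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return [label for label, _ in selected]


# Toy data: "a" and "b" are near-duplicates close to the query; "c" is different.
query_vec = [1.0, 0.0]
candidates = [("a", [1.0, 0.05]), ("b", [1.0, 0.1]), ("c", [0.6, 0.8])]

print(mmr_rerank(query_vec, candidates, k=2, diversity_bias=0.0))
print(mmr_rerank(query_vec, candidates, k=2, diversity_bias=0.9))
```

With a bias of 0.0 the ranking depends on relevance alone, so near-duplicates can fill the top slots. As the bias approaches 1.0, each new pick is increasingly penalized for resembling what has already been selected.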




## Vectara as a Retriever

Finally, let's see how to use Vectara with the `as_retriever()` interface:




```python
retriever = vectara.as_retriever()
retriever
```

```
VectorStoreRetriever(tags=['Vectara'], vectorstore=<langchain.vectorstores.vectara.Vectara object at 0x109a3c760>)
```

```python
query = "What did the president say about Ketanji Brown Jackson"
retriever.get_relevant_documents(query)[0]
```

```
Document(page_content='Justice Breyer, thank you for your service. One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court. And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence. A former top litigator in private practice.', metadata={'source': 'langchain', 'lang': 'eng', 'offset': '596', 'len': '97', 'speech': 'state-of-the-union'})
```