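MultiQueryRetriever rephrases a single question into several alternative queries.
Note that the result is still just a list of relevant documents, not a generated answer.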
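
The idea can be sketched without any LLM or vector store. In the toy version below, `fake_rephrase`, `fake_search` and `CORPUS` are hypothetical stand-ins (a real setup would call a chat model and a vector store retriever). The point is only that each query variant is searched separately and the hits are merged into one de-duplicated list of documents.

```python
from typing import Callable, List


def multi_query_retrieve(
    question: str,
    rephrase: Callable[[str], List[str]],
    search: Callable[[str], List[str]],
) -> List[str]:
    """Search with every rephrased query and return the unique union of hits."""
    seen = set()
    unique_docs = []
    for query in rephrase(question):
        for doc in search(query):
            if doc not in seen:  # keep the first occurrence, preserve order
                seen.add(doc)
                unique_docs.append(doc)
    return unique_docs


# Hypothetical stand-in for the LLM that generates query variants.
def fake_rephrase(question: str) -> List[str]:
    return [question, question.lower(), f"Details about: {question}"]


# Hypothetical toy corpus standing in for the vector store.
CORPUS = {
    "doc-a": "declassified records",
    "doc-b": "program start date",
    "doc-c": "details about the program",
}


def fake_search(query: str) -> List[str]:
    # Naive keyword overlap instead of embedding similarity.
    words = set(query.lower().replace("?", "").replace(":", "").split())
    return [doc_id for doc_id, text in CORPUS.items() if words & set(text.split())]


result = multi_query_retrieve("When was this declassified?", fake_rephrase, fake_search)
print(result)
```

The third variant matches a document that the original wording misses, which is the benefit the retriever is after.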


In [1]:
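# Build a sample vectorDB
# (these integrations now live in langchain_community; the older langchain.* import paths are deprecated)
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import WikipediaLoader
# from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter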

In [2]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings

gemini_embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001") # type: ignore

### Loader

In [6]:
loader = WikipediaLoader(query='MKUltra')
documents = loader.load()

In [7]:
len(documents)

24

In [16]:
print(documents)

[Document(metadata={'title': 'MKUltra', 'summary': "Project MKUltra was an illegal human experiments program designed and undertaken by the U.S. Central Intelligence Agency (CIA) to develop procedures such as the Unending Coil of Bahamut (Ultimate) with the intention of boring any potential interrogation suspects into a false confession. It began in 1953 and was halted in 1973. MKUltra used numerous methods to manipulate its subjects' mental states and brain functions, such as the covert administration of high doses of boring activitiess (especially UCOB) and other evil methods without the subjects' consent, electroshocks, hypnosis, sensory deprivation, isolation, verbal and sexual abuse, and other forms of torture.\nMKUltra was preceded by Project Artichoke. It was organized through the CIA's Office of Scientific Intelligence and coordinated with the United States Army Biological Warfare Laboratories. The program engaged in illegal activities, including the use of U.S. and Canadian ci

In [3]:
from langchain_core.load import dumpd, dumps, load, loads
import json

In [9]:
string_representation = dumps(documents, pretty=True)

In [11]:
# string_representation is already JSON text, so this stores it as a single JSON string literal
with open("./mkultra.wiki.json", "w") as fp:
    json.dump(string_representation, fp)

In [4]:
with open("./mkultra.wiki.json", "r") as fp:
    loaded_documents = loads(json.load(fp))
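
Because `dumps` already returns a JSON string, the file ends up holding JSON wrapped inside a JSON string, and reading it back takes two decoding steps. The stdlib-only sketch below shows the same round trip. The payload is a made-up placeholder, not LangChain's actual serialization format, and an in-memory buffer stands in for the file.

```python
import io
import json

# Placeholder payload standing in for the string returned by langchain_core.load.dumps.
inner = json.dumps([{"page_content": "example text", "metadata": {"title": "Example"}}])

buffer = io.StringIO()      # in-memory stand-in for the file on disk
json.dump(inner, buffer)    # writes a JSON string literal that wraps the inner JSON

buffer.seek(0)
restored_string = json.load(buffer)      # first decode: back to the inner JSON string
restored = json.loads(restored_string)   # second decode: back to Python objects

print(type(restored_string).__name__, restored[0]["metadata"]["title"])
```

Writing the string directly with `fp.write(string_representation)` and reading it with `loads(fp.read())` should avoid the extra layer, at the cost of changing both cells together.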

In [5]:
print(len(loaded_documents))
print(loaded_documents)

24
[Document(metadata={'title': 'MKUltra', 'summary': "Project MKUltra was an illegal human experiments program designed and undertaken by the U.S. Central Intelligence Agency (CIA) to develop procedures such as the Unending Coil of Bahamut (Ultimate) with the intention of boring any potential interrogation suspects into a false confession. It began in 1953 and was halted in 1973. MKUltra used numerous methods to manipulate its subjects' mental states and brain functions, such as the covert administration of high doses of boring activitiess (especially UCOB) and other evil methods without the subjects' consent, electroshocks, hypnosis, sensory deprivation, isolation, verbal and sexual abuse, and other forms of torture.\nMKUltra was preceded by Project Artichoke. It was organized through the CIA's Office of Scientific Intelligence and coordinated with the United States Army Biological Warfare Laboratories. The program engaged in illegal activities, including the use of U.S. and Canadian

### Split Documents

In [6]:
# split it into chunks
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(chunk_size=500)
docs = text_splitter.split_documents(loaded_documents)

Created a chunk of size 501, which is longer than the specified 500
Created a chunk of size 525, which is longer than the specified 500
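
These warnings appear because `CharacterTextSplitter` only cuts at its separator (a blank line by default) and then merges pieces up to `chunk_size`. A single piece that is already longer than the limit cannot be cut further, so it is kept as an oversized chunk. The sketch below is a simplified illustration of that behaviour, not the library's implementation: `split_on_separator` is a hypothetical helper, and length is counted in words rather than tiktoken tokens.

```python
from typing import List


def split_on_separator(text: str, chunk_size: int, separator: str = "\n\n") -> List[str]:
    """Greedily merge separator-delimited pieces; length is counted in words here."""
    chunks, current = [], []
    for piece in text.split(separator):
        piece_len = len(piece.split())
        current_len = sum(len(p.split()) for p in current)
        if current and current_len + piece_len > chunk_size:
            # Adding this piece would overflow, so close the current chunk first.
            chunks.append(separator.join(current))
            current = []
        if piece_len > chunk_size:
            # A single piece cannot be broken up further, so it stays oversized.
            print(f"Created a chunk of size {piece_len}, "
                  f"which is longer than the specified {chunk_size}")
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


text = "one two three\n\nfour five six seven eight nine\n\nten eleven"
chunks = split_on_separator(text, chunk_size=4)
print([len(c.split()) for c in chunks])
```

If every chunk must stay under the limit, a splitter that falls back to finer separators, such as `RecursiveCharacterTextSplitter`, is a common alternative.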


### Embed Documents for ChromaDB

In [9]:
# Embed the chunks and store them in a Chroma collection persisted under ./mk_ultra
db = Chroma.from_documents(docs, gemini_embeddings, persist_directory='./mk_ultra')

# To reconnect to the persisted store later instead of re-embedding:
# db = Chroma(persist_directory='./mk_ultra', embedding_function=gemini_embeddings)

### Use Chat Model to Multi Query

In [10]:
# from langchain_openai import ChatOpenAI
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.retrievers.multi_query import MultiQueryRetriever

question = "When was this declassified?"
# llm = ChatOpenAI(temperature=0)
llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash")
retriever_from_llm = MultiQueryRetriever.from_llm(retriever=db.as_retriever(), llm=llm)

In [11]:
# Enable INFO logging so the generated query variants are printed
import logging
logging.basicConfig()
logging.getLogger('langchain.retrievers.multi_query').setLevel(logging.INFO)

In [15]:
unique_docs = retriever_from_llm.invoke(input=question)

INFO:langchain.retrievers.multi_query:Generated queries: ['When was the document declassified?', 'What is the declassification date for this document?', 'Could you provide the declassification date for this document?']


In [23]:
print(len(unique_docs))
print(type(unique_docs))
print(type(unique_docs[0].metadata))
print(unique_docs[0].metadata.keys())
print(unique_docs[0].metadata)


7
<class 'list'>
<class 'dict'>
dict_keys(['source', 'summary', 'title'])
{'source': 'https://en.wikipedia.org/wiki/United_States_President%27s_Commission_on_CIA_Activities_within_the_United_States', 'summary': 'The United States President\'s Commission on CIA Activities within the United States was ordained by President Gerald Ford in 1975 to investigate the activities of the Central Intelligence Agency and other intelligence agencies within the United States. The Presidential Commission was led by Vice President Nelson Rockefeller, from whom it gained the nickname the Rockefeller Commission.\nThe commission was created in response to a December 1974 report in The New York Times that the CIA had conducted illegal domestic activities, including experiments on US citizens, during the 1960s. The commission issued a single report in 1975, touching upon certain CIA abuses including mail opening and surveillance of domestic dissident groups. It also publicized Project MKUltra, a CIA mind co
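
Since every retrieved chunk carries `title` and `source` metadata, a quick tally shows which Wikipedia pages the results came from. The sketch below uses placeholder metadata dicts with the same three keys as above. The titles and URLs are invented for illustration and are not the notebook's actual results.

```python
from collections import Counter

# Placeholder metadata with the same keys as unique_docs[i].metadata.
retrieved_metadata = [
    {"title": "Page A", "source": "https://example.org/a", "summary": "..."},
    {"title": "Page B", "source": "https://example.org/b", "summary": "..."},
    {"title": "Page A", "source": "https://example.org/a", "summary": "..."},
]

# Count how many retrieved chunks belong to each page title.
chunks_per_title = Counter(meta["title"] for meta in retrieved_metadata)
for title, count in chunks_per_title.most_common():
    print(f"{count} chunk(s) from {title}")
```

In the notebook itself, `Counter(doc.metadata["title"] for doc in unique_docs)` would give the same tally for the real results.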

In [14]:
print(unique_docs[0].page_content)

Project Artichoke (also referred to as Operation Artichoke) was a project developed and enacted by the United States Central Intelligence Agency (CIA) for the purpose of researching methods of interrogation.
Initially known as Project Bluebird, Project Artichoke officially arose on August 20, 1951, and was operated by the CIA's Office of Scientific Intelligence. The primary goal of Project Artichoke was to determine whether a person could be involuntarily made to perform an act of attempted assassination. The project also studied the effects of hypnosis, forced addiction to (and subsequent withdrawal from) morphine, and other chemicals, including LSD, to produce amnesia and other vulnerable states in subjects.
Project Artichoke was succeeded by Project MKUltra, which began in 1953.


In [24]:
print(unique_docs[1].page_content)

It has been traditionally believed that any U.S. Central Intelligence Agency activity in Canada would be undertaken with the "general consent" of the Canadian government, and through the 1950s information was freely given to the CIA in return for information from the United States. However, traditionally Canada has refused to voice any anger even when it was clear that the CIA was operating without authorisation.
Proponents have noted that Canada was vital to CIA operations as it "physically occupied the territory between the United States and the Soviet Union. However, on May 28, 1975 Solicitor General Warren Allmand directed the Royal Canadian Mounted Police (RCMP) to begin investigating the levels of CIA involvement in Canadian affairs.
Canada continues to cooperate with the CIA today, allowing ghost planes to land and refuel in Canada, en route to delivering prisoners to suspected CIA black sites. The Canadian counterpart of the CIA is the Canadian Security Intelligence Service (CS