# Multi Field Retriever

The `MultiFieldRetriever` class is a search tool designed to extend the capabilities of a standard document retriever, such as a `TFIDFRetriever`. It allows users to perform searches not only on the content of documents but also on various metadata fields associated with those documents. This functionality is particularly useful when the relevance of a document to a search query depends on more than just the content of the document itself.

## Features

- **Multi-Field Search**: The class can search across multiple fields, including both the document content and specified metadata fields.
- **Flexible Metadata Inclusion**: Users can specify which metadata fields should be included in the search index, allowing for customized search experiences.
- **Compatibility with Any Retriever**: `MultiFieldRetriever` works with any retriever that extends `BaseRetriever`, making it a versatile tool for various search scenarios.

## How to Use It


To use `MultiFieldRetriever`, pass the base retriever class to it. The remaining parameters can be the same as those of the base retriever.

In [None]:
from langchain.retrievers import MultiFieldRetriever, TFIDFRetriever
from langchain.schema.document import Document

# create some documents
documents = [
    Document(
        page_content="This Japanese Sake is good",
        metadata={"production year": 1991, "brand": "Yamazaki", "doc_id": 1},
    ),
    Document(
        page_content="This is a well-known brand of sake, which is a Japanese rice wine.",
        metadata={"production year": 1981, "brand": "Dassai", "doc_id": 2},
    ),
    Document(
        page_content="This wine is good",
        metadata={"production year": 1920, "doc_id": 3},
    ),
    Document(
        page_content="This is a very old brand wine",
        metadata={"production year": 1930, "doc_id": 4},
    ),
]

# let's create a vanilla tfidf retriever
tfidf = TFIDFRetriever.from_documents(documents)

# to use the MultiFieldRetriever, pass the base retriever class (not an instance) to it
tfidf_multi_field = MultiFieldRetriever.from_documents(
    retriever=TFIDFRetriever,
    documents=documents,
)

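The multi-field retriever takes the metadata fields into account when ranking results. For the same query, the vanilla retriever returns the `Yamazaki` document, while the multi-field retriever returns the document whose `brand` is `Dassai`.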

In [2]:
result = tfidf.get_relevant_documents("Japanese Sake, Dassai")[0]
print(result.page_content)
print(result.metadata)

This Japanese Sake is good
{'production year': 1991, 'brand': 'Yamazaki', 'doc_id': 1}


In [3]:
result = tfidf_multi_field.get_relevant_documents("Japanese Sake, Dassai")[0]
print(result.page_content)
print(result.metadata)

This is a well-known brand of sake, which is a Japanese rice wine.
{'production year': 1981, 'brand': 'Dassai', 'doc_id': 2}
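The difference between the two results can be illustrated with a small standard-library sketch. It uses plain token overlap rather than TF-IDF, and the way metadata is appended to the content here is an assumption made only for illustration, not the retriever's actual implementation. The point is simply that `Dassai` becomes matchable only once the metadata is part of the searched text.

```python
import re


def tokens(text):
    """Lowercase a string and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(query, text):
    """Return the sorted query tokens that also occur in the text."""
    return sorted(tokens(query) & tokens(text))


# Content and metadata of the two sake documents from the example above.
docs = [
    (
        "This Japanese Sake is good",
        {"production year": 1991, "brand": "Yamazaki", "doc_id": 1},
    ),
    (
        "This is a well-known brand of sake, which is a Japanese rice wine.",
        {"production year": 1981, "brand": "Dassai", "doc_id": 2},
    ),
]

query = "Japanese Sake, Dassai"

for content, metadata in docs:
    # Hypothetical combined text: metadata values appended to the content.
    combined = content + " " + " ".join(str(v) for v in metadata.values())
    print(metadata["doc_id"], overlap(query, content), overlap(query, combined))
```

On content alone, both documents match only `japanese` and `sake`. With metadata included, the second document also matches `dassai`.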


You can choose which metadata fields are included in the search index. If you do not specify any, all metadata fields are indexed.
In the following example, only `production year` is indexed, so the other metadata fields are not searchable: the first query does not find the `Dassai` document, but the second query finds the wine produced in 1930.

In [7]:
tfidf_with_year = MultiFieldRetriever.from_documents(
    TFIDFRetriever, documents, meta_data_keys=["production year"]
)
result_1 = tfidf_with_year.get_relevant_documents("Japanese Sake, Dassai")[0]
print(result_1.page_content)
print(result_1.metadata)

This Japanese Sake is good
{'production year': 1991, 'brand': 'Yamazaki', 'doc_id': 1}


In [8]:
result_2 = tfidf_with_year.get_relevant_documents("Wine that was produced in 1930 ")[0]
print(result_2.page_content)
print(result_2.metadata)

This is a very old brand wine
{'production year': 1930, 'doc_id': 4}


The `MultiFieldRetriever` constructs an internal retriever that searches the text containing both the document content and the specified metadata fields. 

In [12]:
internal_result = tfidf_with_year.retriever.get_relevant_documents(
    "Wine that was produced in 1930 "
)[0]
print(internal_result.page_content)

===
production year: 1930
===
content: This is a very old brand wine
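A formatter that produces the `===` layout printed above might look like the following sketch. The name `concat_with_metadata` and the `Doc` stand-in class are hypothetical and exist only to keep the example self-contained. The library's own default function may differ in name and detail.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document (hypothetical helper)."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def concat_with_metadata(doc, meta_data_keys=()):
    """Hypothetical formatter that reproduces the `===` layout shown above."""
    if len(meta_data_keys) == 0:
        # No keys given: include every metadata field.
        selected = doc.metadata
    else:
        selected = {k: v for k, v in doc.metadata.items() if k in meta_data_keys}
    header = "\n".join(f"===\n{k}: {v}" for k, v in selected.items())
    return f"{header}\n===\ncontent: {doc.page_content}"


doc = Doc("This is a very old brand wine", {"production year": 1930, "doc_id": 4})
print(concat_with_metadata(doc, ["production year"]))
```

Because `doc_id` is not listed in `meta_data_keys`, it is left out of the text that gets indexed.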


By default, it uses the `_get_relevant_documents` function to build this text, but you can pass your own function to the `from_documents` method through the `str_concat_func` argument when creating the `MultiFieldRetriever`. The function should take two arguments, `doc` and `meta_data_keys`, and return the formatted text as a string.
Here is an example:

In [15]:
def format_string(doc, meta_data_keys):
    if len(meta_data_keys) == 0:
        meta_data = doc.metadata
    else:
        meta_data = {k: v for k, v in doc.metadata.items() if k in meta_data_keys}
    meta_data_str = "\n".join([f"---\n{k}: {v}" for k, v in meta_data.items()])
    return f"{meta_data_str}\n---\ncontent: {doc.page_content}"


tfidf_year_custom = MultiFieldRetriever.from_documents(
    TFIDFRetriever,
    documents,
    meta_data_keys=["production year"],
    str_concat_func=format_string,
)
internal_result = tfidf_year_custom.retriever.get_relevant_documents(
    "Wine that was produced in 1930 "
)[0]
print(internal_result.page_content)  # we can see the `===` part changed to `---`

---
production year: 1930
---
content: This is a very old brand wine


To use a vector store and set search parameters on the retriever, pass `search_kwargs` to the `from_documents` method.
The following example sets the `k` parameter to 1, so the search returns a single document:

In [23]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

embedding = OpenAIEmbeddings()
faiss_multi_field = MultiFieldRetriever.from_documents(
    retriever=FAISS,
    documents=documents,
    embedding=embedding,
    search_kwargs={"k": 1},
)

In [24]:
faiss_multi_field.get_relevant_documents("Wine that was produced in 1930 ")

[Document(page_content='This is a very old brand wine', metadata={'production year': 1930, 'doc_id': 4})]
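Note that the outer retriever returns the original documents, while the internal retriever searches the formatted text. One way such a wrapper could work is sketched below. The class `MultiFieldSketch`, the toy overlap ranking, and the dict-based documents are all assumptions made for illustration, not the library's implementation.

```python
import re


def tokens(text):
    """Lowercase a string and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_by_overlap(query, texts):
    """Toy ranking: positions of texts ordered by shared-token count, best first."""
    q = tokens(query)
    return sorted(
        range(len(texts)), key=lambda i: len(q & tokens(texts[i])), reverse=True
    )


def format_doc(doc):
    """Combine all metadata fields and the content into one searchable string."""
    meta = "\n".join(f"===\n{k}: {v}" for k, v in doc["metadata"].items())
    return f"{meta}\n===\ncontent: {doc['page_content']}"


class MultiFieldSketch:
    """Hypothetical wrapper: searches formatted text, returns original documents."""

    def __init__(self, documents, format_func, search_func):
        self.documents = documents
        # The text the inner search actually sees, one entry per document.
        self.indexed_texts = [format_func(d) for d in documents]
        self.search_func = search_func

    def get_relevant_documents(self, query, k=1):
        # Map the best-ranked positions back to the untouched originals.
        positions = self.search_func(query, self.indexed_texts)[:k]
        return [self.documents[i] for i in positions]


docs = [
    {"page_content": "This wine is good", "metadata": {"production year": 1920, "doc_id": 3}},
    {"page_content": "This is a very old brand wine", "metadata": {"production year": 1930, "doc_id": 4}},
]
retriever = MultiFieldSketch(docs, format_doc, rank_by_overlap)
top = retriever.get_relevant_documents("Wine that was produced in 1930", k=1)
print(top[0]["page_content"])
```

Keeping the originals alongside the indexed text means callers never see the `===` formatting, which exists only to make the metadata searchable.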