retriever.as_retriever() function cannot retrieve data based on filter from azure cognitive search but #19885

Farid-Ullah · 2024-04-01T20:24:40Z

Checked other resources

I added a very descriptive title to this issue.
I searched the LangChain documentation with the integrated search.
I used the GitHub search to find a similar question and didn't find it.
I am sure that this is a bug in LangChain rather than my code.
The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

Here is the code I use.
I have indexed data in Azure Cognitive Search, and each chunk has a searchable metadata field named location.
When I call acs.as_retriever() with a filter, it also retrieves documents from other locations. You can see this in the output of the code below, where I print the location metadata of each retrieved document.

When I call acs.similarity_search() and pass the filter to it directly, it retrieves only documents for that location, with no mixed-location results.

acs = acs_search("testindex")
retriever = acs.as_retriever(search_kwargs={'filter': {'location':'US'},
                                            'k': 5})

def format_docs(docs):
    for i in docs:
        print(i.metadata["location"])
    return "\n\n".join(doc.page_content for doc in docs)
    
   
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

rag_chain.invoke("What is hr policy about leave")

OUTPUT

US
PK
MY
US
MY
'The HR policy about leave at xyz includes standard paid leave for full-time employees after 90 days of continuous employment. This includes Annual Leave (AL) of 14 workdays....

Using acs.similarity_search()

res = acs.similarity_search(
    query="what is the hr policy for anual leave", k=4, search_type="hybrid", filters="location eq 'US'"
)
res

OUTPUT:

[Document(page_content='Leave taken under this policy does, metadata={'source': '2023-us.pdf', 'location': 'US'}),
 Document(page_content='You may use available vacation, pers metadata={'source': '2023-us.pdf', 'location': 'US'}),
 Document(page_content="Failure to Return to Work If you fa", metadata={'source': '2023-us.pdf', 'location': 'US'}),
 Document(page_content='To request leave under this policy, , metadata={'source': '2023-us.pdf', 'location': 'US'})]

As you can see, this function returns exactly the filtered data, with no mixed-location results.

What would be the solution? We use the first function (as_retriever()) inside the chain, and we are unable to get filtered data from it.
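One possible workaround, sketched under assumptions: the working similarity_search() call above passes the filter as an OData string through `filters`, whereas the retriever was given a dict under `filter`. The helper below (hypothetical name `dict_to_odata_filter`, not part of LangChain) turns a simple dict of string equality conditions into such a string. Whether as_retriever() then applies a `filters` entry in search_kwargs depends on the installed langchain-community version, so treat that part as something to verify rather than a confirmed fix.

```python
def dict_to_odata_filter(conditions: dict) -> str:
    """Build an OData equality filter from string-valued fields (illustrative helper)."""
    clauses = []
    for field, value in conditions.items():
        # OData escapes a single quote inside a string literal by doubling it.
        escaped = str(value).replace("'", "''")
        clauses.append(f"{field} eq '{escaped}'")
    return " and ".join(clauses)


odata = dict_to_odata_filter({"location": "US"})
print(odata)  # location eq 'US'

# Assumption: the key "filters" mirrors the keyword used in the
# similarity_search() call shown in this issue.
search_kwargs = {"filters": odata, "k": 5}
```

The resulting `search_kwargs` could then be tried with `acs.as_retriever(search_kwargs=search_kwargs)` in place of the dict-style filter.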

Error Message and Stack Trace (if applicable)

I placed this print statement inside langchain_core > vectorstores.py. The filter reaches search_kwargs, but it still does not take effect:

def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        if self.search_type == "similarity":
            print("===filter=======>\n",self.search_kwargs,"\n=============")
            docs = self.vectorstore.similarity_search(query, **self.search_kwargs)

OUTPUT:

===filter=======>
 {'filter': {'location': 'US'}, 'k': 5} 
=============

We are unable to get filtered data when using as_retriever() inside the chain. The documents it returns are shown in the first code output above.
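A possible explanation, offered as a hypothesis rather than a confirmed diagnosis: the retriever forwards search_kwargs unchanged, so a key that the vector store does not read is silently dropped. The sketch below uses a stand-in `fake_similarity_search` (hypothetical, not the real AzureSearch method) that only looks at a `filters` keyword, mirroring the keyword used in the working similarity_search() call. For brevity the stand-in compares a plain value instead of parsing an OData expression.

```python
DOCS = [
    {"text": "US leave policy", "location": "US"},
    {"text": "PK leave policy", "location": "PK"},
    {"text": "MY leave policy", "location": "MY"},
]


def fake_similarity_search(query, k=4, **kwargs):
    """Stand-in store: reads only `filters`; any other keyword is ignored."""
    wanted = kwargs.get("filters")  # a `filter` key never reaches this lookup
    hits = [d for d in DOCS if wanted is None or d["location"] == wanted]
    return hits[:k]


def retrieve(query, search_kwargs):
    # Same forwarding pattern as the _get_relevant_documents snippet above.
    return fake_similarity_search(query, **search_kwargs)


ignored = retrieve("leave", {"filter": {"location": "US"}, "k": 5})
applied = retrieve("leave", {"filters": "US", "k": 5})
print([d["location"] for d in ignored])  # ['US', 'PK', 'MY']
print([d["location"] for d in applied])  # ['US']
```

If the real store behaves like this stand-in, the mixed US/PK/MY output in the first example is what one would expect from an unrecognized `filter` key.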

Description

I use the following versions:

langchain==0.1.8
langchain-community==0.0.21
langchain-core==0.1.25
langchain-openai==0.0.6

System Info

aiohttp==3.9.3
aiosignal==1.3.1
annotated-types==0.6.0
antlr4-python3-runtime==4.9.3
anyio==4.3.0
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
asgiref==3.7.2
asttokens==2.4.1
async-lru==2.0.4
async-timeout==4.0.3
asyncio-redis==0.16.0
attrs==23.2.0
azure-common==1.1.28
azure-core==1.30.0
azure-identity==1.15.0
azure-monitor-opentelemetry-exporter==1.0.0b22
azure-search-documents==11.4.0
azure-storage-blob==12.19.1
Babel==2.14.0
backoff==2.2.1
beautifulsoup4==4.12.3
bleach==6.1.0
cachetools==5.3.3
certifi==2024.2.2
cffi==1.16.0
chardet==5.2.0
charset-normalizer==3.3.2
click==8.1.7
cohere==4.56
coloredlogs==15.0.1
comm==0.2.1
contourpy==1.2.0
cryptography==42.0.4
cycler==0.12.1
dataclasses-json==0.6.4
debugpy==1.8.1
decorator==5.1.1
deepdiff==6.7.1
defusedxml==0.7.1
Deprecated==1.2.14
distro==1.9.0
effdet==0.4.1
emoji==2.10.1
et-xmlfile==1.1.0
exceptiongroup==1.2.0
executing==2.0.1
fastapi==0.109.2
fastavro==1.9.4
fastjsonschema==2.19.1
filelock==3.13.1
filetype==1.2.0
fixedint==0.1.6
flatbuffers==24.3.6
fonttools==4.49.0
fqdn==1.5.1
frozenlist==1.4.1
fsspec==2024.2.0
greenlet==3.0.3
h11==0.14.0
httpcore==1.0.4
httpx==0.27.0
huggingface-hub==0.21.4
humanfriendly==10.0
idna==3.6
importlib-metadata==6.11.0
iopath==0.1.10
ipykernel==6.29.2
ipython==8.22.1
ipywidgets==8.1.2
isodate==0.6.1
isoduration==20.11.0
jedi==0.19.1
Jinja2==3.1.3
joblib==1.3.2
json5==0.9.24
jsonpatch==1.33
jsonpath-python==1.0.6
jsonpointer==2.4
jsonschema==4.21.1
jsonschema-specifications==2023.12.1
jupyter==1.0.0
jupyter-console==6.6.3
jupyter-events==0.10.0
jupyter-lsp==2.2.4
jupyter_client==8.6.0
jupyter_core==5.7.1
jupyter_server==2.13.0
jupyter_server_terminals==0.5.3
jupyterlab==4.1.5
jupyterlab_pygments==0.3.0
jupyterlab_server==2.25.4
jupyterlab_widgets==3.0.10
kiwisolver==1.4.5
langchain==0.1.8
langchain-community==0.0.21
langchain-core==0.1.25
langchain-openai==0.0.6
langchainhub==0.1.15
langdetect==1.0.9
langsmith==0.1.5
layoutparser==0.3.4
lxml==5.1.0
MarkupSafe==2.1.5
marshmallow==3.20.2
matplotlib==3.8.3
matplotlib-inline==0.1.6
mistune==3.0.2
mpmath==1.3.0
msal==1.26.0
msal-extensions==1.1.0
msrest==0.7.1
multidict==6.0.5
mypy-extensions==1.0.0
nbclient==0.10.0
nbconvert==7.16.3
nbformat==5.10.3
nest-asyncio==1.6.0
networkx==3.2.1
nltk==3.8.1
notebook==7.1.2
notebook_shim==0.2.4
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.19.3
nvidia-nvjitlink-cu12==12.4.99
nvidia-nvtx-cu12==12.1.105
oauthlib==3.2.2
omegaconf==2.3.0
onnx==1.15.0
onnxruntime==1.15.1
openai==1.12.0
opencv-python==4.9.0.80
openpyxl==3.1.2
opentelemetry-api==1.22.0
opentelemetry-instrumentation==0.43b0
opentelemetry-instrumentation-asgi==0.43b0
opentelemetry-instrumentation-fastapi==0.43b0
opentelemetry-sdk==1.22.0
opentelemetry-semantic-conventions==0.43b0
opentelemetry-util-http==0.43b0
ordered-set==4.1.0
overrides==7.7.0
packaging==23.2
pandas==2.2.1
pandocfilters==1.5.1
parso==0.8.3
pdf2image==1.17.0
pdfminer.six==20221105
pdfplumber==0.10.4
pexpect==4.9.0
pikepdf==8.13.0
pillow==10.2.0
pillow_heif==0.15.0
platformdirs==4.2.0
portalocker==2.8.2
prometheus_client==0.20.0
prompt-toolkit==3.0.43
protobuf==4.25.3
psutil==5.9.8
ptyprocess==0.7.0
pure-eval==0.2.2
pycocotools==2.0.7
pycparser==2.21
pydantic==2.6.1
pydantic-settings==2.2.0
pydantic_core==2.16.2
Pygments==2.17.2
PyJWT==2.8.0
pymssql==2.2.11
pyparsing==3.1.2
pypdf==4.1.0
pypdfium2==4.27.0
pytesseract==0.3.10
python-dateutil==2.8.2
python-docx==1.1.0
python-dotenv==1.0.1
python-iso639==2024.2.7
python-json-logger==2.0.7
python-magic==0.4.27
python-multipart==0.0.9
pytz==2024.1
PyYAML==6.0.1
pyzmq==25.1.2
qtconsole==5.5.1
QtPy==2.4.1
rapidfuzz==3.6.2
redis==5.0.1
referencing==0.34.0
regex==2023.12.25
requests==2.31.0
requests-oauthlib==1.3.1
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rpds-py==0.18.0
safetensors==0.4.2
scipy==1.12.0
Send2Trash==1.8.2
six==1.16.0
sniffio==1.3.0
soupsieve==2.5
SQLAlchemy==2.0.27
stack-data==0.6.3
starlette==0.36.3
sympy==1.12
tabulate==0.9.0
tenacity==8.2.3
terminado==0.18.1
tiktoken==0.6.0
timm==0.9.16
tinycss2==1.2.1
tokenizers==0.15.2
tomli==2.0.1
torch==2.2.1
torchvision==0.17.1
tornado==6.4
tqdm==4.66.2
traitlets==5.14.1
transformers==4.38.2
triton==2.2.0
types-python-dateutil==2.9.0.20240316
types-requests==2.31.0.20240311
typing-inspect==0.9.0
typing_extensions==4.9.0
tzdata==2024.1
unstructured==0.12.4
unstructured-client==0.21.1
unstructured-inference==0.7.23
unstructured.pytesseract==0.3.12
uri-template==1.3.0
urllib3==2.2.1
uvicorn==0.27.1
wcwidth==0.2.13
webcolors==1.13
webencodings==0.5.1
websocket-client==1.7.0
widgetsnbextension==4.0.10
wrapt==1.16.0
xlrd==2.0.1
yarl==1.9.4
zipp==3.17.0


liugddx · 2024-04-02T15:21:19Z

Let me see

Farid-Ullah · 2024-04-03T17:05:34Z

Hi @liugddx, have you checked the issue?
Thanks

Farid-Ullah · 2024-04-04T18:31:17Z

Hi @jarib @zeke, hope you are all doing well.

Could you help me find a solution to this problem? If it does not work in a chain, I will implement it manually, step by step, to achieve this functionality.

Your help would be appreciated.
Thank you

sbusso · 2024-04-04T20:31:18Z

@Farid-Ullah, no random tagging, please.

**Description**: The AzureAISearchRetriever does not support the "$filter" argument offered in the AISearch API: https://learn.microsoft.com/en-us/rest/api/searchservice/documents/search-get?view=rest-searchservice-2023-11-01&tabs=HTTP The $filter allows filtering of indexes based on values in metadata.
**Issue**: #19885
**Dependencies**: No
**Twitter handle**: @Jeffreym9M
- [ ] **Add tests and docs**: Not relevant
- [x] **Lint and test**: Run `make format`, `make lint` and `make test` from the root of the package(s) you've modified. See contribution guidelines for more: https://python.langchain.com/docs/contributing/

my23701 · 2024-09-02T13:33:08Z

Hi @Farid-Ullah, did you find a solution to this problem?
I am facing a similar problem in my RAG as well.

dosubot bot added the Ɑ: retriever, 🔌: aws, and 🤖:bug labels Apr 1, 2024

jeffreyrubi mentioned this issue May 30, 2024

community[patch]:Support filter for AzureAISearchRetriever #22303

Merged


dosubot bot added the stale label Jul 4, 2024

dosubot bot closed this as not planned Jul 11, 2024

dosubot bot removed the stale label Jul 11, 2024
