retriever.as_retriever() function cannot retrieve data based on filter from azure cognitive search but #19885

Closed
5 tasks done
Farid-Ullah opened this issue Apr 1, 2024 · 5 comments
Labels
🔌: aws Primarily related to Amazon Web Services (AWS) integrations 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature Ɑ: retriever Related to retriever module

Comments

@Farid-Ullah

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

python...
Here is the code that I use.
I have indexed data in Azure Cognitive Search, and each chunk has a searchable metadata field, location.
When I call acs.as_retriever() with a filter, it also retrieves data for other locations. In the code below I print the location metadata of each retrieved document, and the output shows the mix.

But when I call acs.similarity_search() and pass the filter directly, it retrieves only that location's data and no mixed-location data.

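from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# acs_search, prompt and llm are defined elsewhere in my project
acs = acs_search("testindex")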
retriever = acs.as_retriever(search_kwargs={'filter': {'location':'US'},
                                            'k': 5})

def format_docs(docs):
    for i in docs:
        print(i.metadata["location"])
    return "\n\n".join(doc.page_content for doc in docs)
    
   
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

rag_chain.invoke("What is hr policy about leave")
  

OUTPUT

US
PK
MY
US
MY
'The HR policy about leave at xyz includes standard paid leave for full-time employees after 90 days of continuous employment. This includes Annual Leave (AL) of 14 workdays....

Using acs.similarity_search()

res = acs.similarity_search(
    query="what is the hr policy for anual leave", k=4, search_type="hybrid", filters="location eq 'US'"
)
res

OUTPUT:

[Document(page_content='Leave taken under this policy does, metadata={'source': '2023-us.pdf', 'location': 'US'}),
 Document(page_content='You may use available vacation, pers metadata={'source': '2023-us.pdf', 'location': 'US'}),
 Document(page_content="Failure to Return to Work If you fa", metadata={'source': '2023-us.pdf', 'location': 'US'}),
 Document(page_content='To request leave under this policy, , metadata={'source': '2023-us.pdf', 'location': 'US'})]

As you can see, this function returns exactly the filtered data and no mixed data.

What would be the solution? We use the first function inside a chain, and there we are unable to get filtered data.
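
One difference between the two calls above may explain the result: the working similarity_search() call passes `filters` (plural) as an OData string, while the retriever is given `filter` (singular) as a dict. Below is a minimal sketch of aligning the two. It assumes the retriever forwards search_kwargs unchanged to similarity_search() and that the store only reads `filters`. `to_odata_filter` is a hypothetical helper, not part of LangChain, and it only handles string equality.

```python
def to_odata_filter(conditions: dict) -> str:
    """Join {"field": "value"} pairs into an OData equality filter string."""
    clauses = []
    for field, value in conditions.items():
        # OData string literals escape a single quote by doubling it.
        escaped = str(value).replace("'", "''")
        clauses.append(f"{field} eq '{escaped}'")
    return " and ".join(clauses)


# Same keyword name and value format as the working similarity_search() call.
search_kwargs = {"filters": to_odata_filter({"location": "US"}), "k": 5}
print(search_kwargs)

# Assumed usage, with acs as defined in the issue:
# retriever = acs.as_retriever(search_kwargs=search_kwargs)
```

If the retriever still returns mixed locations with this dict, the keyword is probably being dropped somewhere else, and calling similarity_search() directly inside the chain remains a fallback.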

Error Message and Stack Trace (if applicable)

Inside langchain_core > vectorstores.py I placed this print statement. The filter shows up in search_kwargs, but it is still not applied:

def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    ) -> List[Document]:
        if self.search_type == "similarity":
            print("===filter=======>\n",self.search_kwargs,"\n=============")
            docs = self.vectorstore.similarity_search(query, **self.search_kwargs)
            

OUTPUT:

===filter=======>
 {'filter': {'location': 'US'}, 'k': 5} 
=============

We are unable to get filtered data when using the as_retriever() function inside a chain. The documents it returns are shown in the first code output.
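
The print output shows that the dict reaches similarity_search() under the key `filter`. A method that accepts `**kwargs` but only looks up a different key drops the argument without raising an error. The self-contained sketch below illustrates that mechanism with a hypothetical `FakeStore`. It is not the real Azure vector store, and its `filters` value is a plain location string rather than an OData expression.

```python
from typing import Any, List, Optional


class FakeStore:
    """Hypothetical stand-in for a vector store; not the real AzureSearch class."""

    def __init__(self, docs: List[dict]) -> None:
        self.docs = docs

    def similarity_search(self, query: str, k: int = 4, **kwargs: Any) -> List[dict]:
        # Only the key "filters" is looked up; any other keyword is ignored silently.
        wanted: Optional[str] = kwargs.get("filters")
        hits = [d for d in self.docs if wanted is None or d["location"] == wanted]
        return hits[:k]


store = FakeStore([{"location": loc} for loc in ["US", "PK", "MY", "US", "MY"]])

# "filter" (singular) is accepted by **kwargs but never read, so nothing is filtered.
ignored = store.similarity_search("leave policy", **{"filter": {"location": "US"}, "k": 5})
# "filters" (plural) is the key the method actually reads.
applied = store.similarity_search("leave policy", **{"filters": "US", "k": 5})

print([d["location"] for d in ignored])  # ['US', 'PK', 'MY', 'US', 'MY']
print([d["location"] for d in applied])  # ['US', 'US']
```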

Description

I use the versions below:

langchain==0.1.8
langchain-community==0.0.21
langchain-core==0.1.25
langchain-openai==0.0.6

System Info

aiohttp==3.9.3
aiosignal==1.3.1
annotated-types==0.6.0
antlr4-python3-runtime==4.9.3
anyio==4.3.0
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
asgiref==3.7.2
asttokens==2.4.1
async-lru==2.0.4
async-timeout==4.0.3
asyncio-redis==0.16.0
attrs==23.2.0
azure-common==1.1.28
azure-core==1.30.0
azure-identity==1.15.0
azure-monitor-opentelemetry-exporter==1.0.0b22
azure-search-documents==11.4.0
azure-storage-blob==12.19.1
Babel==2.14.0
backoff==2.2.1
beautifulsoup4==4.12.3
bleach==6.1.0
cachetools==5.3.3
certifi==2024.2.2
cffi==1.16.0
chardet==5.2.0
charset-normalizer==3.3.2
click==8.1.7
cohere==4.56
coloredlogs==15.0.1
comm==0.2.1
contourpy==1.2.0
cryptography==42.0.4
cycler==0.12.1
dataclasses-json==0.6.4
debugpy==1.8.1
decorator==5.1.1
deepdiff==6.7.1
defusedxml==0.7.1
Deprecated==1.2.14
distro==1.9.0
effdet==0.4.1
emoji==2.10.1
et-xmlfile==1.1.0
exceptiongroup==1.2.0
executing==2.0.1
fastapi==0.109.2
fastavro==1.9.4
fastjsonschema==2.19.1
filelock==3.13.1
filetype==1.2.0
fixedint==0.1.6
flatbuffers==24.3.6
fonttools==4.49.0
fqdn==1.5.1
frozenlist==1.4.1
fsspec==2024.2.0
greenlet==3.0.3
h11==0.14.0
httpcore==1.0.4
httpx==0.27.0
huggingface-hub==0.21.4
humanfriendly==10.0
idna==3.6
importlib-metadata==6.11.0
iopath==0.1.10
ipykernel==6.29.2
ipython==8.22.1
ipywidgets==8.1.2
isodate==0.6.1
isoduration==20.11.0
jedi==0.19.1
Jinja2==3.1.3
joblib==1.3.2
json5==0.9.24
jsonpatch==1.33
jsonpath-python==1.0.6
jsonpointer==2.4
jsonschema==4.21.1
jsonschema-specifications==2023.12.1
jupyter==1.0.0
jupyter-console==6.6.3
jupyter-events==0.10.0
jupyter-lsp==2.2.4
jupyter_client==8.6.0
jupyter_core==5.7.1
jupyter_server==2.13.0
jupyter_server_terminals==0.5.3
jupyterlab==4.1.5
jupyterlab_pygments==0.3.0
jupyterlab_server==2.25.4
jupyterlab_widgets==3.0.10
kiwisolver==1.4.5
langchain==0.1.8
langchain-community==0.0.21
langchain-core==0.1.25
langchain-openai==0.0.6
langchainhub==0.1.15
langdetect==1.0.9
langsmith==0.1.5
layoutparser==0.3.4
lxml==5.1.0
MarkupSafe==2.1.5
marshmallow==3.20.2
matplotlib==3.8.3
matplotlib-inline==0.1.6
mistune==3.0.2
mpmath==1.3.0
msal==1.26.0
msal-extensions==1.1.0
msrest==0.7.1
multidict==6.0.5
mypy-extensions==1.0.0
nbclient==0.10.0
nbconvert==7.16.3
nbformat==5.10.3
nest-asyncio==1.6.0
networkx==3.2.1
nltk==3.8.1
notebook==7.1.2
notebook_shim==0.2.4
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.19.3
nvidia-nvjitlink-cu12==12.4.99
nvidia-nvtx-cu12==12.1.105
oauthlib==3.2.2
omegaconf==2.3.0
onnx==1.15.0
onnxruntime==1.15.1
openai==1.12.0
opencv-python==4.9.0.80
openpyxl==3.1.2
opentelemetry-api==1.22.0
opentelemetry-instrumentation==0.43b0
opentelemetry-instrumentation-asgi==0.43b0
opentelemetry-instrumentation-fastapi==0.43b0
opentelemetry-sdk==1.22.0
opentelemetry-semantic-conventions==0.43b0
opentelemetry-util-http==0.43b0
ordered-set==4.1.0
overrides==7.7.0
packaging==23.2
pandas==2.2.1
pandocfilters==1.5.1
parso==0.8.3
pdf2image==1.17.0
pdfminer.six==20221105
pdfplumber==0.10.4
pexpect==4.9.0
pikepdf==8.13.0
pillow==10.2.0
pillow_heif==0.15.0
platformdirs==4.2.0
portalocker==2.8.2
prometheus_client==0.20.0
prompt-toolkit==3.0.43
protobuf==4.25.3
psutil==5.9.8
ptyprocess==0.7.0
pure-eval==0.2.2
pycocotools==2.0.7
pycparser==2.21
pydantic==2.6.1
pydantic-settings==2.2.0
pydantic_core==2.16.2
Pygments==2.17.2
PyJWT==2.8.0
pymssql==2.2.11
pyparsing==3.1.2
pypdf==4.1.0
pypdfium2==4.27.0
pytesseract==0.3.10
python-dateutil==2.8.2
python-docx==1.1.0
python-dotenv==1.0.1
python-iso639==2024.2.7
python-json-logger==2.0.7
python-magic==0.4.27
python-multipart==0.0.9
pytz==2024.1
PyYAML==6.0.1
pyzmq==25.1.2
qtconsole==5.5.1
QtPy==2.4.1
rapidfuzz==3.6.2
redis==5.0.1
referencing==0.34.0
regex==2023.12.25
requests==2.31.0
requests-oauthlib==1.3.1
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rpds-py==0.18.0
safetensors==0.4.2
scipy==1.12.0
Send2Trash==1.8.2
six==1.16.0
sniffio==1.3.0
soupsieve==2.5
SQLAlchemy==2.0.27
stack-data==0.6.3
starlette==0.36.3
sympy==1.12
tabulate==0.9.0
tenacity==8.2.3
terminado==0.18.1
tiktoken==0.6.0
timm==0.9.16
tinycss2==1.2.1
tokenizers==0.15.2
tomli==2.0.1
torch==2.2.1
torchvision==0.17.1
tornado==6.4
tqdm==4.66.2
traitlets==5.14.1
transformers==4.38.2
triton==2.2.0
types-python-dateutil==2.9.0.20240316
types-requests==2.31.0.20240311
typing-inspect==0.9.0
typing_extensions==4.9.0
tzdata==2024.1
unstructured==0.12.4
unstructured-client==0.21.1
unstructured-inference==0.7.23
unstructured.pytesseract==0.3.12
uri-template==1.3.0
urllib3==2.2.1
uvicorn==0.27.1
wcwidth==0.2.13
webcolors==1.13
webencodings==0.5.1
websocket-client==1.7.0
widgetsnbextension==4.0.10
wrapt==1.16.0
xlrd==2.0.1
yarl==1.9.4
zipp==3.17.0
@dosubot dosubot bot added Ɑ: retriever Related to retriever module 🔌: aws Primarily related to Amazon Web Services (AWS) integrations 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature labels Apr 1, 2024
@liugddx
Contributor

liugddx commented Apr 2, 2024

Let me see

@Farid-Ullah
Author

Hi @liugddx, have you checked the issue?
Thanks

@Farid-Ullah
Author

Farid-Ullah commented Apr 4, 2024

Hi @jarib @zeke, hope you are all doing well.

Could you help me find a solution to this problem? If it does not work in a chain, I will implement it manually, step by step, to achieve this functionality.

Your help would be appreciated.
Thank you

@sbusso
Contributor

sbusso commented Apr 4, 2024

@Farid-Ullah, no random tagging, please.

baskaryan pushed a commit that referenced this issue Jun 5, 2024
**Description**: 
The AzureAISearchRetriever does not support the "$filter" argument
offered in the AISearch API:
https://learn.microsoft.com/en-us/rest/api/searchservice/documents/search-get?view=rest-searchservice-2023-11-01&tabs=HTTP
The $filter allows filtering of indexes based on values in metadata.
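
For reference, here is a rough sketch of how `$filter` appears in a Search Documents GET request, assuming the URL shape described in the linked REST documentation. `build_search_url`, the service name, and the index name are placeholders for illustration and are not part of the retriever.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(
    service_name: str,
    index_name: str,
    search_text: str,
    odata_filter: str,
    api_version: str = "2023-11-01",
) -> str:
    """Build a GET URL that carries an OData expression in the $filter parameter."""
    params = {
        "search": search_text,
        "$filter": odata_filter,  # e.g. "location eq 'US'"
        "api-version": api_version,
    }
    # urlencode percent-encodes "$", quotes and spaces in the query string.
    query = urlencode(params)
    return f"https://{service_name}.search.windows.net/indexes/{index_name}/docs?{query}"


# Placeholder service name; the index name is the one used in the issue.
url = build_search_url("my-service", "testindex", "leave policy", "location eq 'US'")
print(url)

# Decoding the query string recovers the original filter expression.
print(parse_qs(urlsplit(url).query)["$filter"])
```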

**Issue**: 
#19885

**Dependencies**: 
No

**Twitter handle**: 
@Jeffreym9M
 

- [ ] **Add tests and docs**: Not relevant


- [x] **Lint and test**: Run `make format`, `make lint` and `make test`
from the root of the package(s) you've modified. See contribution
guidelines for more: https://python.langchain.com/docs/contributing/
hinthornw pushed a commit that referenced this issue Jun 20, 2024
@dosubot dosubot bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Jul 4, 2024
@dosubot dosubot bot closed this as not planned Won't fix, can't repro, duplicate, stale Jul 11, 2024
@dosubot dosubot bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Jul 11, 2024
@my23701

my23701 commented Sep 2, 2024

Hi @Farid-Ullah, did you find a solution to this problem?
I am facing a similar problem in my RAG as well.
