feat: add support for arxiv identifier in ArxivAPIWrapper() #9318

LMC117 · 2023-08-16T14:08:23Z

Description: this PR adds support for arXiv identifiers to ArxivAPIWrapper. I modified the run() and load() functions in arxiv.py, using a regex to recognize whether the query is in the form of an arXiv identifier (see https://info.arxiv.org/help/find/index.html). If so, it searches directly for the paper corresponding to that identifier. I also modified and added tests in test_arxiv.py.
Issue: ArxivLoader support searching by arxiv id_list #9047
Dependencies: N/A
Tag maintainer: N/A
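
For readers skimming the description, here is a minimal sketch of the idea: check whether every token of the query looks like an arXiv identifier, then choose between an ID lookup and a free-text search. The pattern and the names `looks_like_arxiv_ids` and `build_search_kwargs` are assumptions made for illustration and are not the code merged in this PR; only the `id_list` keyword is taken from the diff further down.

```python
import re

# Hypothetical pattern: new-style IDs (YYMM.NNNN or YYMM.NNNNN) with an
# optional version suffix such as "v2". Old-style IDs are not covered.
ARXIV_ID_RE = re.compile(r"\d{2}(0[1-9]|1[0-2])\.\d{4,5}(v\d+)?")


def looks_like_arxiv_ids(query: str) -> bool:
    """Return True if every whitespace-separated token is a new-style arXiv ID."""
    tokens = query.split()
    return bool(tokens) and all(ARXIV_ID_RE.fullmatch(t) for t in tokens)


def build_search_kwargs(query: str) -> dict:
    """Choose between an ID lookup and a free-text search (illustrative only)."""
    if looks_like_arxiv_ids(query):
        return {"id_list": query.split()}
    return {"query": query}
```

Using `fullmatch` per token means a query such as "2212.00794v2 survey" falls back to free-text search instead of being treated as a list of IDs.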

baskaryan · 2023-08-16T18:54:18Z

lgtm, cc @leo-gan

leo-gan · 2023-08-16T19:26:45Z

I tried
loader = ArxivLoader(query="1605.08386 2308.07912 2308.07910", load_max_docs=3)
and it returned all 3 papers.
Caveat: if the IDs are separated by commas, it doesn't work. Only the space separator works.

I don't think we need special treatment for multiple paper IDs because it works right now.
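
If comma-separated input ever needs to be accepted, one option is to normalize the separators before the query is passed on. This is only a sketch of that idea; the helper `split_ids` is hypothetical and is not part of this PR or of ArxivLoader.

```python
import re
from typing import List


def split_ids(raw: str) -> List[str]:
    """Split a string of IDs on commas and/or whitespace, dropping empty tokens."""
    return [tok for tok in re.split(r"[,\s]+", raw.strip()) if tok]


# Re-join with single spaces, the separator reported to work above.
normalized_query = " ".join(split_ids("1605.08386, 2308.07912,2308.07910"))
```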

LMC117 · 2023-08-17T00:19:39Z

Hi,

If the version number is specified in the query (e.g. 2212.00794v2), no results will be returned. So I think there is a need to handle the arxiv identifier separately.
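
To make the versioned form concrete, the sketch below splits an identifier such as 2212.00794v2 into its base ID and version number. The pattern and the helper `split_version` are assumptions for illustration; the sketch says nothing about how the arXiv API itself treats such a string in a free-text query.

```python
import re
from typing import Optional, Tuple

# Hypothetical pattern for new-style IDs with an optional "vN" suffix.
VERSIONED_ID_RE = re.compile(r"(?P<base>\d{4}\.\d{4,5})(?:v(?P<version>\d+))?")


def split_version(identifier: str) -> Tuple[str, Optional[int]]:
    """Return (base_id, version); version is None when no suffix is present."""
    match = VERSIONED_ID_RE.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not a new-style arXiv identifier: {identifier!r}")
    version = match.group("version")
    return match.group("base"), (int(version) if version is not None else None)
```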

leo-gan · 2023-08-17T00:57:05Z

> If the version number is specified in the query (e.g. 2212.00794v2), no results will be returned. So I think there is a need to handle the arxiv identifier separately.

OK. Then, please add unit tests to work with Ids.

LMC117 · 2023-08-18T01:39:15Z

Hi @leo-gan, I've committed new unit tests. You can check them.

leo-gan

Thank you!
LGTM

baskaryan · 2023-08-24T06:30:24Z

libs/langchain/langchain/utilities/arxiv.py

@@ -54,6 +58,14 @@ class ArxivAPIWrapper(BaseModel):
    load_all_available_meta: bool = False
    doc_content_chars_max: Optional[int] = 4000

+    def is_arxiv_identifier(self, query: str) -> bool:


baskaryan: Could we add a few simple unit tests for this method?

LMC117: Sure, I will add them soon.

leo-gan: @LMC117 Looks great! LGTM

baskaryan · 2023-08-24T18:58:53Z

thanks @LMC117!

leo-gan · 2023-09-21T21:54:29Z

@LMC117 Hi, could you please resolve the merge issues? After that, ping me and I'll push this PR for review. Thanks!

LMC117 · 2023-09-22T05:08:36Z

@leo-gan Hi, I've resolved the merge issue.

hwchase17 · 2023-09-27T20:56:52Z

libs/langchain/langchain/utilities/arxiv.py

-            ).results()
+            if self.is_arxiv_identifier(query):
+                results = self.arxiv_search(
+                    id_list=query[: self.ARXIV_MAX_QUERY_LENGTH].split(),


I think we don't want to apply self.ARXIV_MAX_QUERY_LENGTH here, right?
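
One possible concern with slicing an ID list by character count is that the cut can land in the middle of an identifier. The sketch below illustrates that hazard with a deliberately small, hypothetical limit (`MAX_LEN = 20`); it does not use the real value of `ARXIV_MAX_QUERY_LENGTH` and is not necessarily the reasoning behind the comment above. Both helper functions are assumptions for illustration.

```python
from typing import List

MAX_LEN = 20  # hypothetical limit, chosen small so the effect is visible

ids = "1605.08386 2308.07912 2308.07910"


def truncate_then_split(query: str, max_len: int) -> List[str]:
    """Slice by characters first, then split; the last ID may be cut short."""
    return query[:max_len].split()


def split_then_limit(query: str, max_len: int) -> List[str]:
    """Keep only whole IDs whose space-joined length fits within max_len."""
    kept: List[str] = []
    used = 0
    for token in query.split():
        extra = len(token) + (1 if kept else 0)  # +1 for the joining space
        if used + extra > max_len:
            break
        kept.append(token)
        used += extra
    return kept


print(truncate_then_split(ids, MAX_LEN))  # ['1605.08386', '2308.0791']
print(split_then_limit(ids, MAX_LEN))     # ['1605.08386']
```

With the character slice, the second entry is no longer a valid identifier, whereas limiting by whole tokens drops it cleanly.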

…n-ai#9318)

- Description: this PR adds the support for arxiv identifier of the ArxivAPIWrapper. I modified the `run()` and `load()` functions in `arxiv.py`, using regex to recognize if the query is in the form of arxiv identifier (see [https://info.arxiv.org/help/find/index.html](https://info.arxiv.org/help/find/index.html)). If so, it will directly search the paper corresponding to the arxiv identifier. I also modified and added tests in `test_arxiv.py`.
- Issue: langchain-ai#9047
- Dependencies: N/A
- Tag maintainer: N/A

---------

Co-authored-by: Bagatur <baskaryan@gmail.com>
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>

feat: add support for arxiv identifier of ArxivAPIWrapper

55e894d

dosubot bot added the 🤖:improvement Medium size change to existing code to handle new use-cases label Aug 16, 2023

LMC117 added 3 commits August 17, 2023 20:00

Merge branch 'langchain-ai:master' into support_arxiv_id

889874c

fix: missing parameter in is_arxiv_identifier()

c33b4f0

feat: add support for multiple arxiv IDs

bbb519d

LMC117 added 2 commits August 17, 2023 12:16

test: add test for multiple arxiv IDs

0957abe

Merge branch 'master' into support_arxiv_id

2b6ceee

Merge branch 'master' into support_arxiv_id

039abcc

leo-gan approved these changes Aug 18, 2023

baskaryan reviewed Aug 24, 2023

Merge branch 'master' into support_arxiv_id

eddeb9b

LMC117 and others added 4 commits August 24, 2023 13:28

feat: modified the regex in is_arxiv_identifier()

383fc44

test: add unit test for is_arxiv_identifier()

2fd4e0b

fmt

c9984f8

add

f054336

baskaryan added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label Aug 24, 2023

LMC117 added 4 commits August 24, 2023 23:51

lint: reformat files

7285440

Merge commit '0f48e6c36eb23a5c5fdcd3c15ec31fa0c4dfd5f5' into support_arxiv_id

f02082f

Merge remote branch into local support_arxiv_id

4f5c56e

Merge branch 'master' into support_arxiv_id

a79e574

LMC117 added 2 commits August 26, 2023 06:11

Merge branch 'master' into support_arxiv_id

bfb9ec4

Merge branch 'master' into support_arxiv_id

2868648

Merge branch 'master' into support_arxiv_id

6cbf5d9

LMC117 and others added 4 commits August 28, 2023 23:24

update poetry.lock

082cae8

cr

9545522

cr

ebcc169

add assertion to avoid mypy error

e002ca0

leo-gan approved these changes Sep 21, 2023

Merge branch 'master' into support_arxiv_id

0b02374

hwchase17 reviewed Sep 27, 2023

hwchase17 added 2 commits September 27, 2023 17:03

cr

7048015

cr

b5d9f1d

hwchase17 merged commit 05b75f3 into langchain-ai:master Sep 28, 2023
32 checks passed
