Amazon Textract as document loader #8661

3coins · 2023-08-03T03:54:43Z

Description: Adding support for Amazon Textract as a PDF document loader

eyurtsev · 2023-08-03T17:49:58Z

@3coins tag me whenever you're ready for re-review

3coins · 2023-08-04T02:52:16Z

@eyurtsev
Took care of your suggestions:

1. Moved the imports inside the constructor.
2. Updated the typing hint to `Sequence[int]`; int was easier to convert to the Enum for the downstream lib.
3. Added docs for the constructor arguments.
4. Specifying an S3 path doesn't download the file now, thanks for suggesting this correction 🚀
5. Updated the logic in the parser to accommodate 4.

About your suggestion to work on an S3 blob generator, that's a great idea, and I can work on it if you create the interface. Let me know if there is anything else I should update here.
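
For readers following along, item 1 (imports inside the constructor) can be sketched roughly as below. This is only an illustration, not the code from this PR: the class name `LazyTextractLoader` and its `client` parameter are hypothetical, and the only real library call assumed is `boto3.client("textract")`. The point is that `boto3` is imported only when no client is supplied, so importing the module never requires the optional dependency.

```python
from typing import Any, Optional


class LazyTextractLoader:
    """Hypothetical loader showing a deferred optional import."""

    def __init__(self, client: Optional[Any] = None) -> None:
        if client is None:
            # Import here, not at module level, so that users without
            # boto3 installed can still import this module.
            try:
                import boto3
            except ImportError as exc:
                raise ImportError(
                    "boto3 is required when no client is passed in. "
                    "Install it with `pip install boto3`."
                ) from exc
            client = boto3.client("textract")
        self.client = client
```

Passing a pre-built client skips the import entirely, which also makes the class easy to exercise in unit tests with a stub.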

schadem · 2023-08-04T15:58:23Z

@3coins: regarding

2. Updated the typing hint to `Sequence[int]`, int was easier to convert to the Enum for downstream lib.

It would be easier for users to pass in a str instead of an int to identify the features (e.g. "FORMS", "TABLES", "SIGNATURES").

schadem · 2023-08-04T16:22:55Z

chatted with @3coins and we will change the int to str
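
A minimal sketch of what the int-to-str change could look like, assuming the feature names are mapped onto an Enum and unknown names fail validation. The `TextractFeature` enum and `parse_features` helper are hypothetical names made up for illustration; the actual PR converts to an Enum in the downstream lib, which is not reproduced here.

```python
from enum import Enum
from typing import List, Sequence


class TextractFeature(Enum):
    """Hypothetical stand-in for the downstream library's feature Enum."""

    FORMS = "FORMS"
    TABLES = "TABLES"
    SIGNATURES = "SIGNATURES"


def parse_features(names: Sequence[str]) -> List[TextractFeature]:
    """Convert feature names to Enum members, rejecting unknown names."""
    try:
        # Enum lookup by member name raises KeyError for unknown names.
        return [TextractFeature[name] for name in names]
    except KeyError as exc:
        valid = ", ".join(feature.name for feature in TextractFeature)
        raise ValueError(
            f"Unknown feature {exc.args[0]!r}; expected one of: {valid}"
        ) from exc
```

With this shape, users write `parse_features(["FORMS", "TABLES"])` and a typo such as `"TABLE"` surfaces as a `ValueError` up front rather than as a failure inside the service call.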

3coins · 2023-08-04T16:59:06Z

@schadem
Thanks for updating the PR. Once you post these changes, plz tag @eyurtsev, so he can do the final review and merge.

Adding a skip marker for the integration test.

3coins · 2023-08-04T19:42:57Z

@eyurtsev
Done with all updates. Plz review and let us know if you have any other feedback.

eyurtsev · 2023-08-04T19:50:27Z

libs/langchain/langchain/document_loaders/pdf.py

+        # raises ValueError when multi-page and not on S3"""
+
+        if self.web_path and self._is_s3_url(self.web_path):
+            blob = Blob(path=self.web_path)


Curious, any reason why the Blob is initialized differently here from the line below?

Do you have any advice about Langchain's blob object design? Should we clarify whether it is meant to handle remote paths like s3?

The main reason to handle it differently was that for S3 and HTTP paths the path reference is in web_path, unlike for local files. I am not sure why that distinction is needed in the base loader.
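
For context, a check like the `_is_s3_url` call in the diff above can be sketched with the standard library. This is an assumption about what such a helper might do, not the implementation merged in this PR; the standalone function name `is_s3_url` is hypothetical.

```python
from urllib.parse import urlparse


def is_s3_url(url: str) -> bool:
    """Return True if the URL uses the s3:// scheme and names a bucket."""
    try:
        parsed = urlparse(url)
    except ValueError:
        # urlparse can raise on malformed input; treat that as "not S3".
        return False
    # For s3://bucket/key, the bucket name lands in the netloc component.
    return parsed.scheme == "s3" and bool(parsed.netloc)
```

A local path or an `https://` URL fails the scheme check, so only `s3://bucket/key` style references would take the no-download branch.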

eyurtsev · 2023-08-04T19:52:16Z

libs/langchain/langchain/document_loaders/parsers/pdf.py

+                boto3_textract_client=self.boto3_textract_client,
+            )
+
+        current_text = ""


(not required for merging PR) -- for devs not familiar with Textract, it's unclear what type of information is returned and whether this logic can be assumed to be the "correct" way to project the raw response onto documents.

@schadem
Can you provide an explanation for this part? We should add some comments around this in a follow-up PR.

@3coins: Will do. I'll create a follow-up PR and also add it to the comments, like @baskaryan mentions below.
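
Until that follow-up lands, here is one rough way to picture projecting a raw response onto per-page text. It assumes the response is a dict with a `Blocks` list whose entries carry `BlockType`, `Text`, and `Page` keys, as in the Textract text-detection responses. The helper name `lines_by_page` is hypothetical, the sample response is hand-written rather than real service output, and this is not the logic used in the merged parser.

```python
from collections import defaultdict
from typing import Any, Dict, List


def lines_by_page(response: Dict[str, Any]) -> Dict[int, str]:
    """Group LINE blocks by page number and join them into page text."""
    pages: Dict[int, List[str]] = defaultdict(list)
    for block in response.get("Blocks", []):
        # Only LINE blocks are used; PAGE and WORD blocks are skipped
        # so that each line of text is counted once.
        if block.get("BlockType") == "LINE":
            # Assumption: a missing Page key means a single-page document.
            pages[block.get("Page", 1)].append(block.get("Text", ""))
    return {page: "\n".join(lines) for page, lines in sorted(pages.items())}


# Hand-written sample in the assumed shape, for illustration only.
sample = {
    "Blocks": [
        {"BlockType": "PAGE", "Page": 1},
        {"BlockType": "LINE", "Page": 1, "Text": "Hello"},
        {"BlockType": "WORD", "Page": 1, "Text": "Hello"},
        {"BlockType": "LINE", "Page": 1, "Text": "World"},
        {"BlockType": "LINE", "Page": 2, "Text": "Second page"},
    ]
}
page_text = lines_by_page(sample)
```

Each resulting entry could then back one document per page, with the page number kept as metadata. Whether line order in the response matches reading order is a separate question this sketch does not address.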

eyurtsev · 2023-08-04T19:54:10Z

@3coins looks good to me, going to merge.

baskaryan · 2023-08-05T18:17:28Z

this is super cool! we should add to the docs so folks can find it

Description: Updating documentation to add AmazonTextractPDFLoader, according to the comment on #8661 from [baskaryan](https://github.com/baskaryan). Adding one notebook and instructions to modules/data_connection/document_loaders/pdf.mdx.

Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>

dosubot bot added Ɑ: doc loader Related to document loader module (not documentation) 🤖:improvement Medium size change to existing code to handle new use-cases labels Aug 3, 2023

3coins marked this pull request as ready for review August 3, 2023 04:27

3coins mentioned this pull request Aug 3, 2023

Amazon Textract as document loader added #8645

Closed

eyurtsev self-requested a review August 3, 2023 17:49

schadem and others added 6 commits August 3, 2023 19:27

Amazon Textract as document loader added

1f57452

lint

61d46e7

Fixed imports, parametrized tests, ci errors

111dc20

Fixing lint and test errors

02c9d3a

Updated as per PR feedback

1f44d19

Fixed lock after rebase

5c3315f

3coins force-pushed the pijain-textract-support branch from f2545d9 to 5c3315f Compare August 4, 2023 02:33

Fixed logic for multi-page pdf

06e83d3

schadem and others added 3 commits August 4, 2023 19:01

change Textract_features int to str and add fail validation test

e61ad47

Update test_pdf.py

7b07918

Adding a skip marker for the integration test.

Fixed lint errors

052ddd7

eyurtsev approved these changes Aug 4, 2023

eyurtsev merged commit 8374367 into langchain-ai:master Aug 4, 2023
23 checks passed

schadem mentioned this pull request Aug 17, 2023

AmazonTextractPDFLoader documentation updates #9415

Merged
