Amazon Textract as document loader #8661

Merged: 10 commits, Aug 4, 2023

3coins
Contributor

@3coins 3coins commented Aug 3, 2023

Description: Adding support for Amazon Textract as a PDF document loader

@dosubot dosubot bot added Ɑ: doc loader Related to document loader module (not documentation) 🤖:improvement Medium size change to existing code to handle new use-cases labels Aug 3, 2023
@3coins 3coins marked this pull request as ready for review August 3, 2023 04:27
@eyurtsev eyurtsev self-requested a review August 3, 2023 17:49
@eyurtsev
Collaborator

eyurtsev commented Aug 3, 2023

@3coins tag me whenever you're ready for re-review

@3coins
Contributor Author

3coins commented Aug 4, 2023

@eyurtsev
Took care of your suggestions:

  1. Moved the imports inside the constructor.
  2. Updated the typing hint to Sequence[int], int was easier to convert to the Enum for downstream lib.
  3. Added docs for the constructor arguments
  4. Specifying an S3 path doesn't download the file now, thanks for suggesting this correction 🚀
  5. Updated the logic in parser for accommodating 4.

About your suggestion to work on an S3 blob generator: that's a great idea, and I can work on it if you create the interface. Let me know if there is anything else I should update here.

@schadem
Contributor

schadem commented Aug 4, 2023

@3coins: regarding

2. Updated the typing hint to `Sequence[int]`, int was easier to convert to the Enum for downstream lib.

It would be easier for users to pass in a str instead of an int to identify the features (e.g. "FORMS", "TABLES", "SIGNATURES").

@schadem
Contributor

schadem commented Aug 4, 2023

chatted with @3coins and we will change the int to str
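
To illustrate the int-to-str change discussed above, here is a minimal sketch of how string feature names could be converted to an enum before being handed to a downstream library. The `Feature` enum, its values, and `parse_features` are hypothetical names made up for this example; they are not the loader's or the downstream library's actual API.

```python
from enum import Enum
from typing import List, Optional, Sequence


class Feature(Enum):
    """Hypothetical stand-in for the downstream library's feature enum."""

    FORMS = 1
    TABLES = 2
    SIGNATURES = 3


def parse_features(names: Optional[Sequence[str]]) -> List[Feature]:
    """Convert user-supplied feature names into enum members.

    Lookup is case-insensitive; unknown names raise ValueError and
    list the accepted values.
    """
    if names is None:
        return []
    features: List[Feature] = []
    for name in names:
        try:
            features.append(Feature[name.upper()])
        except KeyError:
            valid = ", ".join(member.name for member in Feature)
            raise ValueError(
                f"Unknown feature {name!r}; expected one of: {valid}"
            ) from None
    return features
```

Accepting names rather than numbers keeps call sites readable (`["FORMS", "TABLES"]`) and lets the error message tell the user which values are valid.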

@3coins
Contributor Author

3coins commented Aug 4, 2023

@schadem
Thanks for updating the PR. Once you post these changes, plz tag @eyurtsev, so he can do the final review and merge.

@3coins
Contributor Author

3coins commented Aug 4, 2023

@eyurtsev
Done with all updates. Plz review and let us know if you have any other feedback.

# raises ValueError when multi-page and not on S3"""

if self.web_path and self._is_s3_url(self.web_path):
    blob = Blob(path=self.web_path)
Collaborator

Curious: is there any reason why the Blob is initialized differently here than on the line below?

Do you have any advice about LangChain's Blob object design? Should we clarify whether it is meant to handle remote paths like S3?

Contributor Author

The main reason for handling it differently is that for S3 and HTTP paths the path reference is stored in web_path, unlike for local files. I am not sure why the base loader needs that distinction.
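
As a rough illustration of the branch under discussion, the sketch below shows one way to tell an S3 URL from other paths and to enforce the rule from the code comment above (multi-page documents must be on S3). The function names `is_s3_url` and `validate_source` and the error message are assumptions for this example; the PR's own `_is_s3_url` may be implemented differently.

```python
from urllib.parse import urlparse


def is_s3_url(path: str) -> bool:
    """Return True for paths of the form s3://bucket/key."""
    try:
        parsed = urlparse(path)
    except ValueError:
        return False
    # An S3 URL needs the "s3" scheme and a bucket name in the netloc.
    return parsed.scheme == "s3" and bool(parsed.netloc)


def validate_source(path: str, num_pages: int) -> None:
    """Raise ValueError when a multi-page document is not stored on S3."""
    if num_pages > 1 and not is_s3_url(path):
        raise ValueError(
            "Multi-page documents must be referenced by an S3 URL; "
            f"got {path!r} with {num_pages} pages."
        )
```

With a check like this, an S3 path can be passed through by reference without downloading the file, while local and HTTP sources are limited to a single page.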

boto3_textract_client=self.boto3_textract_client,
)

current_text = ""
Collaborator

(not required for merging the PR) -- for devs not familiar with this, it's unclear what type of information is returned and whether this logic can be assumed to be the "correct" way to project the raw response onto documents

Contributor Author

@schadem
Can you provide explanation for this part? We should add some comments around this in a follow up PR.

Contributor

@3coins : Will do. I'll create a follow up PR and also add it to the comments, like @baskaryan mentions below.
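
For readers unfamiliar with the response shape in question, here is a hedged sketch of one possible projection from a raw Textract-style response onto per-page documents. It assumes the response is a dict with a "Blocks" list whose LINE blocks carry "Text" and "Page" fields, as in the Textract text-detection output. `PageDocument` and `lines_to_documents` are hypothetical names, and this is not necessarily the logic the PR uses.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List


@dataclass
class PageDocument:
    """Hypothetical minimal document: text plus metadata."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def lines_to_documents(
    response: Dict[str, Any], source: str
) -> Iterator[PageDocument]:
    """Group LINE blocks by page and yield one document per page."""
    pages: Dict[int, List[str]] = {}
    for block in response.get("Blocks", []):
        # Only LINE blocks are used; PAGE and WORD blocks are skipped so
        # that text is not duplicated.
        if block.get("BlockType") == "LINE":
            page_number = block.get("Page", 1)
            pages.setdefault(page_number, []).append(block.get("Text", ""))
    for page_number in sorted(pages):
        yield PageDocument(
            page_content="\n".join(pages[page_number]),
            metadata={"source": source, "page": page_number},
        )
```

Joining lines with newlines is only one choice; it keeps reading order as returned but ignores layout such as columns, tables, and form fields.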

@eyurtsev
Collaborator

eyurtsev commented Aug 4, 2023

@3coins looks good to me, going to merge.

@eyurtsev eyurtsev merged commit 8374367 into langchain-ai:master Aug 4, 2023
23 checks passed
@baskaryan
Collaborator

this is super cool! we should add to the docs so folks can find it

hwchase17 added a commit that referenced this pull request Aug 20, 2023
Description: Updating documentation to add AmazonTextractPDFLoader
according to
[comment](#8661 (comment))
from [baskaryan](https://github.com/baskaryan)

Adding one notebook and instructions to the
modules/data_connection/document_loaders/pdf.mdx

---------

Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com>
Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>