-
Notifications
You must be signed in to change notification settings - Fork 13.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Amazon Textract as document loader #8661
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎ 1 Ignored Deployment
|
@3coins tag me whenever you're ready for re-review |
f2545d9
to
5c3315f
Compare
@eyurtsev
About your suggestion to work on a S3 blob generator, that's a great idea, and I can work on it if you create the interface. Let me know if there is anything else I should update here. |
@3coins: rgdg
it would be easier for users to pass in a str instead of an int identifying the features (e. g. "FORMS", "TABLES", "SIGNATURES" |
chatted with @3coins and we will change the int to str |
Adding a skip marker for the integration test.
@eyurtsev |
# raises ValueError when multi-page and not on S3""" | ||
|
||
if self.web_path and self._is_s3_url(self.web_path): | ||
blob = Blob(path=self.web_path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious any reason why the Blob is initalized differently here from the line below?
Do you have any advice about Langchain's blob object design? Should we clarify whether it is meant to handle remote paths like s3?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main reason to handle it differently was that for s3 and http paths, the path reference is in web_path, vs for local files. I am not sure why that distinction is needed in the base loader.
boto3_textract_client=self.boto3_textract_client, | ||
) | ||
|
||
current_text = "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(not required for merging pr) -- for devs not familiar with the it's unclear what type of information is returned and whether this logic can be assumed to be the "correct" way to project the raw response onto documents
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@schadem
Can you provide explanation for this part? We should add some comments around this in a follow up PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@3coins : Will do. I'll create a follow up PR and also add it to the comments, like @baskaryan mentions below.
@3coins looks good to me, going to merge. |
this is super cool! we should add to the docs so folks can find it |
Description: Updating documentation to add AmazonTextractPDFLoader according to [comment](#8661 (comment)) from [baskaryan](https://github.com/baskaryan) Adding one notebook and instructions to the modules/data_connection/document_loaders/pdf.mdx --------- Co-authored-by: Eugene Yurtsev <eyurtsev@gmail.com> Co-authored-by: Harrison Chase <hw.chase.17@gmail.com>
Description: Adding support for Amazon Textract as a PDF document loader