URL/Website content extraction #373

Open
nkidambi opened this issue Nov 30, 2023 · 3 comments
Labels
enhancement New feature or request

Comments

@nkidambi
Collaborator

Feature request
Many federal customers have public documents (PDFs) and websites (including FAQs) that they would like to search using Info-Assistant.

Additional Details
Support crawling and extracting content from a URL or website, recursing up to a configurable depth. Also support filtering out certain URLs, such as forms and pages that call APIs (for example, an office locator), and/or certain domains.
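
To make the request concrete, here is a minimal sketch of a depth-limited crawler with URL and domain filtering, using only the Python standard library. It is not Info-Assistant code: the names `crawl`, `is_allowed`, `fetch_html`, `max_depth`, `exclude_patterns`, and `blocked_domains` are hypothetical, and the sketch assumes plain HTML pages and does not handle robots.txt, rate limiting, or JavaScript-rendered content.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse
from urllib.request import urlopen
import re


class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_html(url):
    """Default fetcher (assumption): plain HTTP GET, decoded as UTF-8."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def is_allowed(url, exclude_patterns, blocked_domains):
    """Reject non-HTTP(S) links, blocked hosts, and URLs matching any pattern."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    if parsed.hostname in blocked_domains:
        return False
    return not any(re.search(p, url) for p in exclude_patterns)


def crawl(start_url, max_depth=2, exclude_patterns=(), blocked_domains=(),
          fetch=fetch_html):
    """Breadth-first crawl; returns {url: html} for pages within max_depth."""
    pages = {}
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that cannot be fetched
        pages[url] = html
        if depth >= max_depth:
            continue  # keep the page, but do not follow its links
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before deduplicating.
            link, _ = urldefrag(urljoin(url, href))
            if link not in seen and is_allowed(link, exclude_patterns, blocked_domains):
                seen.add(link)
                queue.append((link, depth + 1))
    return pages
```

A call such as `crawl("https://example.gov/", max_depth=2, exclude_patterns=[r"/forms/"], blocked_domains={"other.example.com"})` would fetch the start page and anything reachable within two link hops, skipping form URLs and the blocked host. The `fetch` parameter is injectable so the traversal can be exercised without network access.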

@dayland dayland added the enhancement New feature or request label Nov 30, 2023
@dmitri012
Collaborator

  • vote for website content extraction.

I can handle document loading myself by uploading files to the appropriate blob container. Website content extraction needs a clearer specification of the supported formats (HTML, TXT, JSON, etc.). Ideally, a ready-made scraper script or function would be provided.
In my case, I have access to the database and can extract content via SQL queries, but the supported output format is not clear yet.

@dayland
Contributor

dayland commented Dec 12, 2023

@jdnuckolls

I'd really like to see this take priority, as it is a requirement for just about every customer my team works with. Usually, when we tell them this feature is not available in this repo, they use a different repo that already has it, and they miss out on all the great work that has gone into this one.


4 participants