Support indexing a webpage as source #415
Conversation
Note that this crawler only supports the Edge browser for now. This part of the code is borrowed from Auto-GPT.
1. The `/config/data/content_type/{content_type}` API can now process URLs.
2. The `/search` API now accepts URLs.
Thanks for raising this PR, @Ellen7ions. I'm starting a discussion here so we can get more of an idea for what the feature requirements here would be: #423.
Thanks for starting this work! I think this might be a little bit complicated, so let's discuss some of the details a bit further.
Let's continue the discussion over here: #423.
# Input Validation
if is_none_or_empty(urls):
    print("At least one of pdf-files or pdf-file-filter is required to be specified")
Please use `logger` for any outputs going to the console. Here, you should use `logger.error`.
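A minimal sketch of the suggested change, using the standard `logging` module (the `validate_input` helper name is hypothetical, not from the PR):

```python
import logging

logger = logging.getLogger(__name__)

def validate_input(urls) -> bool:
    # Hypothetical helper: report the problem via logger.error instead of print()
    if not urls:
        logger.error("At least one of pdf-files or pdf-file-filter is required to be specified")
        return False
    return True
```

With this change the message is routed through the application's logging configuration instead of going straight to stdout.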
# The following code is heavily inspired by the implementation found at: [Auto-GPT](
# https://github.com/Significant-Gravitas/Auto-GPT/blob/master/autogpt/commands/web_selenium.py)

from bs4 import BeautifulSoup
Please remove any commented-out code, unless it's used to explain the code.
def get_webdriver() -> WebDriver:
    options: BrowserOptions = EdgeOptions()
Any reason in particular for using Edge? I think it makes sense to use Chrome for a feature like this, as that's the most common web browser.
if content_type == "url":
    default_config = PageContentConfig(
        compressed_jsonl=default_content_type["compressed-jsonl"],
        embeddings_file=default_content_type["embeddings-file"],
    )
If the endpoint uses a custom config, it's best to use a separate API for updating/deleting its configuration (see github_config_page for example).
@@ -41,6 +41,10 @@ def input_filter_or_files_required(cls, input_filter, values, **kwargs):
        return input_filter


class PageContentConfig(TextConfigBase):
    input_files: Optional[List[str]]
For this data type, the input is not optional.
- input_files: Optional[List[str]]
+ input_pages: List[str]
@@ -28,6 +28,12 @@
        "compressed-jsonl": "~/.khoj/content/pdf/pdf.jsonl.gz",
        "embeddings-file": "~/.khoj/content/pdf/pdf_embeddings.pt",
    },
    "url": {
        "input-files": None,
        "input-filter": None,
`"input-filter"` is not a valid property in the configuration.
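A sketch of what the corrected default entry might look like with the invalid key dropped. The `url` paths here are illustrative only, mirroring the pdf entry's pattern from the diff above:

```python
# Hypothetical corrected defaults; paths mirror the pdf entry's pattern
default_config = {
    "url": {
        "input-files": None,
        "compressed-jsonl": "~/.khoj/content/url/url.jsonl.gz",
        "embeddings-file": "~/.khoj/content/url/url_embeddings.pt",
    },
}
```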
@@ -79,6 +83,7 @@ class ContentConfig(ConfigBase):
    image: Optional[ImageContentConfig]
    markdown: Optional[TextContentConfig]
    pdf: Optional[TextContentConfig]
    url: Optional[PageContentConfig]
I think it makes more sense to call this `website`.
@@ -146,7 +154,7 @@ def content_config_page(request: Request, content_type: str):
        current_config = json.loads(current_config.json())

    return templates.TemplateResponse(
-        "content_type_input.html",
+        f"content_type_{content_type}_input.html",
Please revert this and create a separate page specific to the new configuration (e.g., content_type_website_input.html) along with relevant new APIs for CRUD operations on the config.
@@ -15,7 +15,7 @@
web_client = APIRouter()
templates = Jinja2Templates(directory=constants.web_directory)

- VALID_TEXT_CONTENT_TYPES = ["org", "markdown", "pdf"]
+ VALID_TEXT_CONTENT_TYPES = ["org", "markdown", "pdf", "url"]
This should be reverted and new configuration used (as mentioned above).
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, "body")))

# Get the HTML content directly from the browser's DOM
page_source = driver.execute_script("return document.body.outerHTML;")
It would be better to use a library like langchain to hold some of this scraping logic and data extraction. See here: https://python.langchain.com/docs/integrations/document_loaders/web_base.
Web scrapers are notoriously brittle and hard to maintain over time. If we can use a dependency that handles this more reliably, we can spend less time managing this integration.
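As a rough stdlib-only sketch of the text-extraction step such a loader would own (not the PR's code, and much simpler than what a dedicated document loader provides):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text from HTML, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # >0 while inside script/style tags
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self) -> str:
        # Collapse runs of whitespace into single spaces
        return " ".join(" ".join(self._chunks).split())

def extract_page_text(page_source: str) -> str:
    parser = TextExtractor()
    parser.feed(page_source)
    return parser.text()
```

A dedicated loader additionally handles fetching, encoding detection, retries, and metadata, which is the maintenance burden the comment above is pointing at.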
Hi @sabaimran, I noticed that you are adding support for plaintext. Does it include HTML? Are you planning to encapsulate webpage indexing in plaintext? If so, maybe this PR is not necessary. 🫠
Hey @Ellen7ions, I don't think that's a problem. Saba's plaintext indexing PR is a generic indexer for any plain text files, including markdown, html, org files etc. Your PR helps configure the website(s) to index, pull them in for indexing, and render them as a separate content type. Having said that, we should discuss the use-case and requirements for a website indexer in #423 before continuing with this PR. This will inform whether we need this as a separate content-type and what requirements this change should satisfy.
4a3a800 to c93dcc9
Hey @Ellen7ions! I'm going to close this for the time being, but please get in touch on the Discord if you'd like deeper involvement when we start building out our web crawling infrastructure.
Give Khoj the ability to chat with web pages
Now we can crawl the content from webpages with Selenium and chat with it. But there are some shortcomings:
Examples
1. User input interface
2. Set a URL like this
3. Chat with Khoj