
Support indexing a webpage as source #415

Closed

Ellen7ions wants to merge 11 commits

Conversation

@Ellen7ions (Contributor) commented Aug 4, 2023

Give Khoj the ability to chat with web pages

Now we can crawl the content of web pages with Selenium and chat with it (a rough sketch of the flow follows the list below). But there are some shortcomings:

  1. Only the Edge browser driver is supported for now.
  2. Pages cannot be crawled recursively.
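
Pieced together from the hunks reviewed below, the crawl-and-extract flow looks roughly like this; exact function and variable names in the PR may differ:

```python
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.edge.options import Options as EdgeOptions
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


def extract_page_text(url: str) -> str:
    # Spin up an Edge driver (the PR's current limitation), load the page,
    # and wait until the body element is present before reading the DOM
    driver = webdriver.Edge(options=EdgeOptions())
    try:
        driver.get(url)
        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, "body")))
        page_source = driver.execute_script("return document.body.outerHTML;")
        # Strip markup and keep only the readable text for indexing
        soup = BeautifulSoup(page_source, "html.parser")
        return soup.get_text(separator="\n", strip=True)
    finally:
        driver.quit()
```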

Examples

1. User input interface (screenshots: input, input2)

2. Set a URL (screenshot)

3. Chat with Khoj (screenshot: result)

@sabaimran (Collaborator)

Thanks for raising this PR, @Ellen7ions. I've started a discussion so we can get a clearer idea of the feature requirements: #423.

@sabaimran (Collaborator) left a comment

Thanks for starting this work! I think this might be a little bit complicated, so let's discuss some of the details a bit further.

Let's continue the discussion over here: #423.


# Input Validation
if is_none_or_empty(urls):
    print("At least one of pdf-files or pdf-file-filter is required to be specified")
@sabaimran (Collaborator):

Please use logger for any outputs going to the console. Here, you should use logger.error.
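
A minimal sketch of the suggested fix; the helper is a stand-in and the corrected message wording is an assumption, since the quoted message looks copied from the pdf processor:

```python
import logging
from typing import List, Optional

logger = logging.getLogger(__name__)


def is_none_or_empty(items: Optional[List[str]]) -> bool:
    # Stand-in for the helper the PR imports from its utils
    return items is None or len(items) == 0


def validate_urls(urls: Optional[List[str]]) -> None:
    # Input Validation
    if is_none_or_empty(urls):
        # Route the message through the module logger instead of print
        logger.error("At least one URL is required to be specified")
```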

# The following code is heavily inspired by the implementation found at: [Auto-GPT](
# https://github.com/Significant-Gravitas/Auto-GPT/blob/master/autogpt/commands/web_selenium.py)

from bs4 import BeautifulSoup
@sabaimran (Collaborator):

Please remove any commented-out code, unless it's used to explain the code.



def get_webdriver() -> WebDriver:
    options: BrowserOptions = EdgeOptions()
@sabaimran (Collaborator):

Any reason in particular for using Edge? I think it makes sense to use Chrome for a feature like this, as that's the most common web browser.
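
If the driver is switched to Chrome, a minimal sketch could look like the following; headless mode is an assumption so indexing can run without a display, and Selenium 4.6+ resolves the driver binary automatically:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options as ChromeOptions
from selenium.webdriver.remote.webdriver import WebDriver


def get_webdriver() -> WebDriver:
    options = ChromeOptions()
    options.add_argument("--headless=new")  # run without opening a browser window
    return webdriver.Chrome(options=options)
```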

Comment on lines +138 to +142
if content_type == "url":
    default_config = PageContentConfig(
        compressed_jsonl=default_content_type["compressed-jsonl"],
        embeddings_file=default_content_type["embeddings-file"],
    )
@sabaimran (Collaborator):

If the endpoint uses a custom config, it's best to use a separate API for updating/deleting its configuration (see github_config_page for example).
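
A rough sketch of dedicated endpoints, loosely modeled on the github config pattern mentioned above; the route paths, model fields, and in-memory store are illustrative assumptions, not code from the PR:

```python
from typing import List, Optional

from fastapi import APIRouter
from pydantic import BaseModel

api = APIRouter()


class WebsiteContentConfig(BaseModel):
    # Minimal stand-in for the PR's PageContentConfig; field names are assumptions
    input_pages: List[str]
    compressed_jsonl: str
    embeddings_file: str


_website_config: Optional[WebsiteContentConfig] = None  # placeholder store


@api.post("/config/data/content_type/website")
async def set_website_config(updated_config: WebsiteContentConfig):
    # Update only the website content type, leaving other content configs untouched
    global _website_config
    _website_config = updated_config
    return {"status": "ok"}


@api.delete("/config/data/content_type/website")
async def delete_website_config():
    # Delete the website config without touching other content types
    global _website_config
    _website_config = None
    return {"status": "ok"}
```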

@@ -41,6 +41,10 @@ def input_filter_or_files_required(cls, input_filter, values, **kwargs):
        return input_filter


class PageContentConfig(TextConfigBase):
    input_files: Optional[List[str]]
@sabaimran (Collaborator):

For this data type, the input is not optional.

Suggested change
-    input_files: Optional[List[str]]
+    input_pages: List[str]
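
For context, a small illustration of the distinction, assuming pydantic v1 semantics (an Optional field implicitly defaults to None, while a bare annotation makes the field required):

```python
from typing import List, Optional

from pydantic import BaseModel, ValidationError


class Example(BaseModel):
    input_pages: List[str]            # required: validation fails if missing
    input_files: Optional[List[str]]  # optional: implicitly defaults to None


try:
    Example()
except ValidationError as e:
    print(e)  # reports that input_pages is a required field
```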

@@ -28,6 +28,12 @@
"compressed-jsonl": "~/.khoj/content/pdf/pdf.jsonl.gz",
"embeddings-file": "~/.khoj/content/pdf/pdf_embeddings.pt",
},
"url": {
"input-files": None,
"input-filter": None,
@sabaimran (Collaborator):

"input-filter" is not a valid property in the configuration.

@@ -79,6 +83,7 @@ class ContentConfig(ConfigBase):
    image: Optional[ImageContentConfig]
    markdown: Optional[TextContentConfig]
    pdf: Optional[TextContentConfig]
    url: Optional[PageContentConfig]
@sabaimran (Collaborator):

I think it makes more sense to call this website.

@@ -146,7 +154,7 @@ def content_config_page(request: Request, content_type: str):
    current_config = json.loads(current_config.json())

    return templates.TemplateResponse(
-        "content_type_input.html",
+        f"content_type_{content_type}_input.html",
@sabaimran (Collaborator):

Please revert this and create a separate page specific to the new configuration (e.g., content_type_website_input.html) along with relevant new APIs for CRUD operations on the config.
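
A minimal sketch of such a dedicated page route; the template name follows the suggestion above, while the router, route path, and directory are placeholders rather than the project's actual values:

```python
from fastapi import APIRouter, Request
from fastapi.responses import HTMLResponse
from fastapi.templating import Jinja2Templates

web_client = APIRouter()
templates = Jinja2Templates(directory="web")  # placeholder directory


@web_client.get("/config/content_type/website", response_class=HTMLResponse)
def website_config_page(request: Request):
    # Serve a page specific to the website content type instead of reusing
    # the generic content_type_input.html template
    return templates.TemplateResponse(
        "content_type_website_input.html", context={"request": request}
    )
```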

@@ -15,7 +15,7 @@
web_client = APIRouter()
templates = Jinja2Templates(directory=constants.web_directory)

-VALID_TEXT_CONTENT_TYPES = ["org", "markdown", "pdf"]
+VALID_TEXT_CONTENT_TYPES = ["org", "markdown", "pdf", "url"]
@sabaimran (Collaborator):

This should be reverted and new configuration used (as mentioned above).

WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, "body")))

# Get the HTML content directly from the browser's DOM
page_source = driver.execute_script("return document.body.outerHTML;")
@sabaimran (Collaborator):

It would be better to use a library like langchain to hold some of this scraping logic and data extraction. See here: https://python.langchain.com/docs/integrations/document_loaders/web_base.

Web scrapers are notoriously brittle and hard to maintain over time. If we can use a dependency that handles this more reliably, we can spend less time managing this integration.
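
For illustration, a minimal sketch with langchain's WebBaseLoader (import path as of mid-2023; the URL is a placeholder):

```python
from langchain.document_loaders import WebBaseLoader

# WebBaseLoader fetches the page with requests and parses it with
# BeautifulSoup, so there is no browser driver to manage for static pages
loader = WebBaseLoader("https://docs.khoj.dev")  # placeholder URL
documents = loader.load()

for doc in documents:
    print(doc.metadata.get("source"), len(doc.page_content))
```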

@Ellen7ions (Contributor, Author)

Hi @sabaimran, I noticed that you are adding support for plaintext. Does it include HTML? Do you plan to handle webpage indexing under plaintext? If so, maybe this PR is not necessary. 🫠

@debanjum (Member)

> Hi @sabaimran, I noticed that you are adding support for plaintext. Does it include HTML? Do you plan to handle webpage indexing under plaintext? If so, maybe this PR is not necessary. 🫠

Hey @Ellen7ions, I don't think that's a problem. Saba's plaintext indexing PR is a generic indexer for any plain text files, including markdown, HTML, org files, etc. Your PR helps configure the website(s) to index, pulls them in for indexing, and renders them as a separate content type.

Having said that, we should discuss the use case and requirements for a website indexer in #423 before continuing with this PR. That will inform whether we need a separate content type and what requirements this change should satisfy.

@debanjum force-pushed the master branch 2 times, most recently from 4a3a800 to c93dcc9, on August 28, 2023 18:01
@sabaimran (Collaborator)

Hey @Ellen7ions! I'm going to close this for the time being, but please get in touch on the Discord if you'd like deeper involvement when we start building out our web crawling infrastructure.

@sabaimran closed this on Sep 8, 2023.