
Support indexing a webpage as source #415

Closed

Ellen7ions wants to merge 11 commits

Conversation

@Ellen7ions (Contributor) commented Aug 4, 2023

Give Khoj the ability to chat with web pages

Now we can crawl the content of web pages with Selenium and chat with it (a rough sketch of the flow follows the list below). But there are some shortcomings:

  1. Only the Edge browser driver is supported for now.
  2. Pages cannot be crawled recursively.
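
Pieced together from the hunks reviewed below, the crawl-and-extract flow looks roughly like this; exact function and variable names in the PR may differ:

```python
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.edge.options import Options as EdgeOptions
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


def extract_page_text(url: str) -> str:
    # Spin up an Edge driver (the PR's current limitation), load the page,
    # and wait until the body element is present before reading the DOM
    driver = webdriver.Edge(options=EdgeOptions())
    try:
        driver.get(url)
        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, "body")))
        page_source = driver.execute_script("return document.body.outerHTML;")
        # Strip markup and keep only the readable text for indexing
        soup = BeautifulSoup(page_source, "html.parser")
        return soup.get_text(separator="\n", strip=True)
    finally:
        driver.quit()
```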

Examples

1. User input interface (screenshots: input, input2)

2. Set a URL (screenshot)

3. Chat with Khoj (screenshot: result)

@sabaimran (Collaborator)

Thanks for raising this PR, @Ellen7ions. I've started a discussion so we can get a clearer idea of the feature requirements: #423.

@sabaimran (Collaborator) left a comment

Thanks for starting this work! I think this might be a little bit complicated, so let's discuss some of the details a bit further.

Let's continue the discussion over here: #423.


# Input Validation
if is_none_or_empty(urls):
    print("At least one of pdf-files or pdf-file-filter is required to be specified")
@sabaimran (Collaborator):

Please use logger for any outputs going to the console. Here, you should use logger.error.
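
A minimal sketch of the suggested fix; the helper is a stand-in and the corrected message wording is an assumption, since the quoted message looks copied from the pdf processor:

```python
import logging
from typing import List, Optional

logger = logging.getLogger(__name__)


def is_none_or_empty(items: Optional[List[str]]) -> bool:
    # Stand-in for the helper the PR imports from its utils
    return items is None or len(items) == 0


def validate_urls(urls: Optional[List[str]]) -> None:
    # Input Validation
    if is_none_or_empty(urls):
        # Route the message through the module logger instead of print
        logger.error("At least one URL is required to be specified")
```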

# The following code is heavily inspired by the implementation found at: [Auto-GPT](
# https://github.com/Significant-Gravitas/Auto-GPT/blob/master/autogpt/commands/web_selenium.py)

from bs4 import BeautifulSoup
@sabaimran (Collaborator):

Please remove any commented-out code, unless it's used to explain the code.



def get_webdriver() -> WebDriver:
    options: BrowserOptions = EdgeOptions()
@sabaimran (Collaborator):

Any reason in particular for using Edge? I think it makes sense to use Chrome for a feature like this, as that's the most common web browser.
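
If the driver is switched to Chrome, a minimal sketch could look like the following; headless mode is an assumption so indexing can run without a display, and Selenium 4.6+ resolves the driver binary automatically:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options as ChromeOptions
from selenium.webdriver.remote.webdriver import WebDriver


def get_webdriver() -> WebDriver:
    options = ChromeOptions()
    options.add_argument("--headless=new")  # run without opening a browser window
    return webdriver.Chrome(options=options)
```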

Comment on lines +138 to +142
if content_type == "url":
    default_config = PageContentConfig(
        compressed_jsonl=default_content_type["compressed-jsonl"],
        embeddings_file=default_content_type["embeddings-file"],
    )
@sabaimran (Collaborator):

If the endpoint uses a custom config, it's best to use a separate API for updating/deleting its configuration (see github_config_page for example).
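
A rough sketch of dedicated endpoints, loosely modeled on the github config pattern mentioned above; the route paths, model fields, and in-memory store are illustrative assumptions, not code from the PR:

```python
from typing import List, Optional

from fastapi import APIRouter
from pydantic import BaseModel

api = APIRouter()


class WebsiteContentConfig(BaseModel):
    # Minimal stand-in for the PR's PageContentConfig; field names are assumptions
    input_pages: List[str]
    compressed_jsonl: str
    embeddings_file: str


_website_config: Optional[WebsiteContentConfig] = None  # placeholder store


@api.post("/config/data/content_type/website")
async def set_website_config(updated_config: WebsiteContentConfig):
    # Update only the website content type, leaving other content configs untouched
    global _website_config
    _website_config = updated_config
    return {"status": "ok"}


@api.delete("/config/data/content_type/website")
async def delete_website_config():
    # Delete the website config without touching other content types
    global _website_config
    _website_config = None
    return {"status": "ok"}
```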

@@ -41,6 +41,10 @@ def input_filter_or_files_required(cls, input_filter, values, **kwargs):
        return input_filter


class PageContentConfig(TextConfigBase):
    input_files: Optional[List[str]]
@sabaimran (Collaborator):

For this data type, the input is not optional.

Suggested change
-    input_files: Optional[List[str]]
+    input_pages: List[str]
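
For context, a small illustration of the distinction, assuming pydantic v1 semantics (an Optional field implicitly defaults to None, while a bare annotation makes the field required):

```python
from typing import List, Optional

from pydantic import BaseModel, ValidationError


class Example(BaseModel):
    input_pages: List[str]            # required: validation fails if missing
    input_files: Optional[List[str]]  # optional: implicitly defaults to None


try:
    Example()
except ValidationError as e:
    print(e)  # reports that input_pages is a required field
```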

@@ -28,6 +28,12 @@
"compressed-jsonl": "~/.khoj/content/pdf/pdf.jsonl.gz",
"embeddings-file": "~/.khoj/content/pdf/pdf_embeddings.pt",
},
"url": {
"input-files": None,
"input-filter": None,
@sabaimran (Collaborator):

"input-filter" is not a valid property in the configuration.

@@ -79,6 +83,7 @@ class ContentConfig(ConfigBase):
    image: Optional[ImageContentConfig]
    markdown: Optional[TextContentConfig]
    pdf: Optional[TextContentConfig]
    url: Optional[PageContentConfig]
@sabaimran (Collaborator):

I think it makes more sense to call this website.

@@ -146,7 +154,7 @@ def content_config_page(request: Request, content_type: str):
    current_config = json.loads(current_config.json())

    return templates.TemplateResponse(
-        "content_type_input.html",
+        f"content_type_{content_type}_input.html",
@sabaimran (Collaborator):

Please revert this and create a separate page specific to the new configuration (e.g., content_type_website_input.html) along with relevant new APIs for CRUD operations on the config.
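
A minimal sketch of such a dedicated page route; the template name follows the suggestion above, while the router, route path, and directory are placeholders rather than the project's actual values:

```python
from fastapi import APIRouter, Request
from fastapi.responses import HTMLResponse
from fastapi.templating import Jinja2Templates

web_client = APIRouter()
templates = Jinja2Templates(directory="web")  # placeholder directory


@web_client.get("/config/content_type/website", response_class=HTMLResponse)
def website_config_page(request: Request):
    # Serve a page specific to the website content type instead of reusing
    # the generic content_type_input.html template
    return templates.TemplateResponse(
        "content_type_website_input.html", context={"request": request}
    )
```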

@@ -15,7 +15,7 @@
web_client = APIRouter()
templates = Jinja2Templates(directory=constants.web_directory)

-VALID_TEXT_CONTENT_TYPES = ["org", "markdown", "pdf"]
+VALID_TEXT_CONTENT_TYPES = ["org", "markdown", "pdf", "url"]
@sabaimran (Collaborator):

This should be reverted and new configuration used (as mentioned above).

WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, "body")))

# Get the HTML content directly from the browser's DOM
page_source = driver.execute_script("return document.body.outerHTML;")
@sabaimran (Collaborator):

It would be better to use a library like langchain to hold some of this scraping logic and data extraction. See here: https://python.langchain.com/docs/integrations/document_loaders/web_base.

Web scrapers are notoriously brittle and hard to maintain over time. If we can use a dependency that handles this more reliably, we can spend less time managing this integration.
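
For illustration, a minimal sketch with langchain's WebBaseLoader (import path as of mid-2023; the URL is a placeholder):

```python
from langchain.document_loaders import WebBaseLoader

# WebBaseLoader fetches the page with requests and parses it with
# BeautifulSoup, so there is no browser driver to manage for static pages
loader = WebBaseLoader("https://docs.khoj.dev")  # placeholder URL
documents = loader.load()

for doc in documents:
    print(doc.metadata.get("source"), len(doc.page_content))
```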

@Ellen7ions (Contributor, Author)

Hi @sabaimran, I noticed that you are adding support for plaintext. Does it include HTML? Do you plan to handle webpage indexing under plaintext? If so, maybe this PR is not necessary. 🫠

@debanjum (Member)

> Hi @sabaimran, I noticed that you are adding support for plaintext. Does it include HTML? Do you plan to handle webpage indexing under plaintext? If so, maybe this PR is not necessary. 🫠

Hey @Ellen7ions, I don't think that's a problem. Saba's plaintext indexing PR is a generic indexer for any plain text files, including markdown, HTML, org files, etc. Your PR helps configure the website(s) to index, pulls them in for indexing, and renders them as a separate content type.

Having said that, we should discuss the use case and requirements for a website indexer in #423 before continuing with this PR. That will inform whether we need a separate content type and what requirements this change should satisfy.

@debanjum force-pushed the master branch 2 times, most recently from 4a3a800 to c93dcc9, on August 28, 2023 18:01
@sabaimran (Collaborator)

Hey @Ellen7ions! I'm going to close this for the time being, but please get in touch on the Discord if you'd like deeper involvement when we start building out our web crawling infrastructure.

@sabaimran closed this on Sep 8, 2023.