Merge pull request #9190 from mindsdb/verify-web
Verify web
ZoranPandovski committed May 15, 2024
2 parents 037f87c + f999bf5 commit 1dc5130
Showing 12 changed files with 419 additions and 293 deletions.
2 changes: 0 additions & 2 deletions .flake8
@@ -89,11 +89,9 @@ exclude =
mindsdb/integrations/handlers/quickbooks_handler/*
mindsdb/integrations/handlers/strava_handler/*
mindsdb/integrations/handlers/strava_handler/*
mindsdb/integrations/handlers/web_handler/*
mindsdb/integrations/handlers/strava_handler/*
mindsdb/integrations/handlers/github_handler/*
mindsdb/integrations/handlers/vitess_handler/*
mindsdb/integrations/handlers/web_handler/*
mindsdb/integrations/handlers/impala_handler/*
mindsdb/integrations/handlers/tdengine_handler/*
mindsdb/integrations/handlers/huggingface_api_handler/*
3 changes: 2 additions & 1 deletion .github/workflows/test_on_push.yml
@@ -119,6 +119,7 @@ jobs:
pip install mindsdb[mssql]
pip install mindsdb[clickhouse]
pip install mindsdb[snowflake]
pip install mindsdb[web]
pip freeze
- name: Run unit tests
run: |
@@ -134,7 +135,7 @@
fi
- name: Run Handlers tests and submit Coverage to coveralls
run: |
handlers=("mysql" "postgres" "mssql" "clickhouse" "snowflake")
handlers=("mysql" "postgres" "mssql" "clickhouse" "snowflake" "web")
for handler in "${handlers[@]}"
do
pytest --cov=mindsdb/integrations/handlers/${handler}_handler tests/unit/handlers/test_${handler}.py
41 changes: 32 additions & 9 deletions docs/integrations/app-integrations/web-crawler.mdx
@@ -5,16 +5,14 @@ sidebarTitle: Web Crawler

In this section, we present how to use a web crawler within MindsDB.

A web crawler is a computer program or automated script that browses the internet and navigates through websites, web pages, and web content to gather data. Within the realm of MindsDB, a web crawler can be employed to harvest data, which can be used to train models,
domain specific chatbots or fine-tune LLMs.
A web crawler is an automated script designed to systematically browse and index content on the internet. Within MindsDB, you can utilize a web crawler to efficiently collect data from various websites.

## Prerequisites

Before proceeding, ensure the following prerequisites are met:

1. Install MindsDB locally via [Docker](/setup/self-hosted/docker) or [Docker Desktop](/setup/self-hosted/docker-desktop).
2. To connect Web Crawler to MindsDB, install the required dependencies following [this instruction](/setup/self-hosted/docker#install-dependencies).
3. Install or ensure access to Web Crawler.
2. To use Web Crawler with MindsDB, install the required dependencies following [this instruction](/setup/self-hosted/docker#install-dependencies).

## Connection

@@ -26,12 +24,19 @@ Here is how to initialize a web crawler:
CREATE DATABASE my_web
WITH ENGINE = 'web';
```
<Tip>
The above query creates a database called `my_web`. By default, this database has a table called `crawler` that can be used to crawl data from one or more given URLs.
</Tip>

## Usage

<Note>
Specifying a `LIMIT` clause is required. To crawl all pages on a site, consider setting the limit to a high value, such as 10,000, which exceeds the expected number of pages. Be aware that setting a higher limit may result in longer response times.
</Note>

### Get Websites Content

Here is how to get the content of `docs.mindsdb.com`:
The following usage examples demonstrate how to retrieve content from `docs.mindsdb.com`:

```sql
SELECT *
@@ -40,7 +45,7 @@ WHERE url = 'docs.mindsdb.com'
LIMIT 1;
```

You can also get the content of internal pages. Here is how to fetch the content from 10 internal pages:
You can also retrieve content from internal pages. The following query fetches the content from 10 internal pages:

```sql
SELECT *
@@ -49,7 +54,7 @@ WHERE url = 'docs.mindsdb.com'
LIMIT 10;
```

Another option is to get the content from multiple websites.
Another option is to get the content from multiple websites by using the `IN ()` operator:

```sql
SELECT *
@@ -60,7 +65,7 @@ LIMIT 1;

### Get PDF Content

MindsDB accepts [file uploads](/sql/create/file) of `csv`, `xlsx`, `xls`, `sheet`, `json`, and `parquet`. However, you can utilize the web crawler to fetch data from `pdf` files.
MindsDB accepts [file uploads](/sql/create/file) of `csv`, `xlsx`, `xls`, `sheet`, `json`, and `parquet`. However, you can also configure the web crawler to fetch data from PDF files accessible via URLs.

```sql
SELECT *
@@ -69,4 +74,22 @@ WHERE url = '<link-to-pdf-file>'
LIMIT 1;
```

For example, you can provide a link to a `pdf` file stored in Amazon S3.
## Troubleshooting

<Warning>
`Web crawler encounters character encoding issues`

* **Symptoms**: Extracted text appears garbled or contains strange characters instead of the expected text.
* **Checklist**:
1. Open a GitHub Issue: If you encounter a bug or a repeatable error with encoding,
report it on the [MindsDB GitHub](https://github.com/mindsdb/mindsdb/issues) repository by opening an issue.
</Warning>
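A common root cause of garbled text is a mismatch between a page's declared and actual character encoding. Before filing an issue, it can help to confirm whether a fallback decode recovers the text; the sketch below is illustrative only and is not the handler's actual decoding logic:

```python
def decode_html(raw: bytes, declared=None) -> str:
    """Try the declared encoding, then common fallbacks, then replace errors."""
    for enc in (declared, "utf-8", "latin-1"):
        if not enc:
            continue
        try:
            return raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue
    # last resort: never crash, mark undecodable bytes with U+FFFD
    return raw.decode("utf-8", errors="replace")
```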


<Warning>
`Web crawler times out while trying to fetch content`

* **Symptoms**: The crawler fails to retrieve data from a website, resulting in timeout errors.
* **Checklist**:
1. Check the network connection to ensure the target site is reachable.
</Warning>
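Beyond checking the network, transient timeouts are commonly handled with retries and exponential backoff. The generic sketch below is illustrative and not the handler's real behavior; in practice `fetch` might be something like `lambda u: requests.get(u, timeout=10)`:

```python
import time

def fetch_with_retries(fetch, url, retries=3, backoff=0.5):
    """Call fetch(url), retrying failed attempts with exponential backoff."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the original error
            time.sleep(backoff * (2 ** attempt))
```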
91 changes: 66 additions & 25 deletions mindsdb/integrations/handlers/web_handler/README.md
@@ -1,54 +1,95 @@
# Build your Web crawler
---
title: Web Crawler
sidebarTitle: Web Crawler
---

This integration allows you to query the results of a crawler in SQL:
In this section, we present how to use a web crawler within MindsDB.

- This can be particularly useful for building Q&A systems from data on a website.
A web crawler is an automated script designed to systematically browse and index content on the internet. Within MindsDB, you can utilize a web crawler to efficiently collect data from various websites.

Note that this crawler can crawl every single sub-site from the original.
## Prerequisites

Let's see in action
Before proceeding, ensure the following prerequisites are met:

1. Install MindsDB locally via [Docker](/setup/self-hosted/docker) or [Docker Desktop](/setup/self-hosted/docker-desktop).
2. To use Web Crawler with MindsDB, install the required dependencies following [this instruction](/setup/self-hosted/docker#install-dependencies).

## Connection

This handler does not require any connection parameters.

Here is how to initialize a web crawler:

```sql
-- Should be able to create a web crawler database
CREATE DATABASE my_web
With
ENGINE = 'web';
WITH ENGINE = 'web';
```
<Tip>
The above query creates a database called `my_web`. By default, this database has a table called `crawler` that can be used to crawl data from one or more given URLs.
</Tip>

This creates a database called my_web. This database ships with a table called crawler that we can use to crawl data given some url/urls.
## Usage

<Note>
Specifying a `LIMIT` clause is required. To crawl all pages on a site, consider setting the limit to a high value, such as 10,000, which exceeds the expected number of pages. Be aware that setting a higher limit may result in longer response times.
</Note>

## Searching for web content in SQL
### Get Websites Content

Let's get the content of a docs.mindsdb.com website:
The following usage examples demonstrate how to retrieve content from `docs.mindsdb.com`:

```sql
SELECT
*
SELECT *
FROM my_web.crawler
WHERE
url = 'docs.mindsdb.com'
WHERE url = 'docs.mindsdb.com'
LIMIT 1;
```

You can also retrieve content from internal pages. The following query fetches the content from 10 internal pages:

```sql
SELECT *
FROM my_web.crawler
WHERE url = 'docs.mindsdb.com'
LIMIT 10;
```

Another option is to get the content from multiple websites by using the `IN ()` operator:

This should return the contents of docs.mindsdb.com.
```sql
SELECT *
FROM my_web.crawler
WHERE url IN ('docs.mindsdb.com', 'docs.python.org')
LIMIT 1;
```

### Get PDF Content

Now, let's assume we want to search for the content on multiple websites.
MindsDB accepts [file uploads](/sql/create/file) of `csv`, `xlsx`, `xls`, `sheet`, `json`, and `parquet`. However, you can also configure the web crawler to fetch data from PDF files accessible via URLs.

```sql
SELECT
*
SELECT *
FROM my_web.crawler
WHERE
url IN ('docs.mindsdb.com', 'docs.python.org')
LIMIT 30;
WHERE url = '<link-to-pdf-file>'
LIMIT 1;
```

This command will crawl two sites and stop when the results count hits 30. The total count of rows in the result will be 30.
## Troubleshooting

<Warning>
`Web crawler encounters character encoding issues`

* **Symptoms**: Extracted text appears garbled or contains strange characters instead of the expected text.
* **Checklist**:
1. Open a GitHub Issue: If you encounter a bug or a repeatable error with encoding,
report it on the [MindsDB GitHub](https://github.com/mindsdb/mindsdb/issues) repository by opening an issue.
</Warning>

NOTE: limit is mandatory. If you want to crawl all pages on the site, you can pass a big number in the limit (for example, 10000), more than the expected count of pages on the site.
However, a big limit also increases the time waiting for a response.

<Warning>
`Web crawler times out while trying to fetch content`

* **Symptoms**: The crawler fails to retrieve data from a website, resulting in timeout errors.
* **Checklist**:
1. Check the network connection to ensure the target site is reachable.
</Warning>
2 changes: 1 addition & 1 deletion mindsdb/integrations/handlers/web_handler/__about__.py
@@ -6,4 +6,4 @@
__github__ = 'https://github.com/mindsdb/mindsdb'
__pypi__ = 'https://pypi.org/project/mindsdb/'
__license__ = 'MIT'
__copyright__ = 'Copyright 2022- mindsdb'
__copyright__ = 'Copyright 2022 - MindsDB'
3 changes: 2 additions & 1 deletion mindsdb/integrations/handlers/web_handler/requirements.txt
@@ -1 +1,2 @@
pymupdf
pymupdf
html2text
Empty file.
18 changes: 0 additions & 18 deletions mindsdb/integrations/handlers/web_handler/tests/example_data.py

This file was deleted.

49 changes: 0 additions & 49 deletions mindsdb/integrations/handlers/web_handler/tests/test_helpers.py

This file was deleted.
