Merge pull request #9190 from mindsdb/verify-web
Verify web
Showing 12 changed files with 419 additions and 293 deletions.
---
title: Web Crawler
sidebarTitle: Web Crawler
---

In this section, we present how to use a web crawler within MindsDB.

A web crawler is an automated script designed to systematically browse and index content on the internet. Within MindsDB, you can utilize a web crawler to efficiently collect data from various websites.

## Prerequisites

Before proceeding, ensure the following prerequisites are met:

1. Install MindsDB locally via [Docker](/setup/self-hosted/docker) or [Docker Desktop](/setup/self-hosted/docker-desktop).
2. To use the Web Crawler with MindsDB, install the required dependencies following [these instructions](/setup/self-hosted/docker#install-dependencies).

## Connection

This handler does not require any connection parameters.

Here is how to initialize a web crawler:

```sql
CREATE DATABASE my_web
WITH ENGINE = 'web';
```

<Tip>
The above query creates a database called `my_web`. By default, this database has a table called `crawler` that can be used to crawl data from a given URL or URLs.
</Tip>
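
To confirm the connection exists, or to remove it later, standard MindsDB commands apply. A minimal sketch, reusing the `my_web` database created above:

```sql
-- List connected data sources; `my_web` should appear among them
SHOW DATABASES;

-- Remove the connection when it is no longer needed
DROP DATABASE my_web;
```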

## Usage

<Note>
Specifying a `LIMIT` clause is required. To crawl all pages on a site, consider setting the limit to a high value, such as 10,000, which exceeds the expected number of pages. Be aware that a higher limit may result in longer response times.
</Note>
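
Following the note above, a crawl intended to cover an entire site can simply use a generously high limit, for example:

```sql
SELECT *
FROM my_web.crawler
WHERE url = 'docs.mindsdb.com'
LIMIT 10000;
```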

### Get Websites Content

The following usage examples demonstrate how to retrieve content from `docs.mindsdb.com`:

```sql
SELECT *
FROM my_web.crawler
WHERE url = 'docs.mindsdb.com'
LIMIT 1;
```

You can also retrieve content from internal pages. The following query fetches the content from 10 internal pages:

```sql
SELECT *
FROM my_web.crawler
WHERE url = 'docs.mindsdb.com'
LIMIT 10;
```
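
If you plan to reuse crawled results in later queries, the crawl can be wrapped in a view so downstream queries do not repeat the crawl parameters. A sketch, where `docs_pages` is a hypothetical view name:

```sql
-- Save the crawl query as a reusable view
CREATE VIEW docs_pages AS (
    SELECT *
    FROM my_web.crawler
    WHERE url = 'docs.mindsdb.com'
    LIMIT 10
);
```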

Another option is to get the content from multiple websites by using the `IN ()` operator:

```sql
SELECT *
FROM my_web.crawler
WHERE url IN ('docs.mindsdb.com', 'docs.python.org')
LIMIT 1;
```
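
With multiple sites, the limit applies to the combined result: the crawler stops once the total row count reaches the limit. For example, the following crawls both sites and returns 30 rows in total:

```sql
SELECT *
FROM my_web.crawler
WHERE url IN ('docs.mindsdb.com', 'docs.python.org')
LIMIT 30;
```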

### Get PDF Content

MindsDB accepts [file uploads](/sql/create/file) of `csv`, `xlsx`, `xls`, `sheet`, `json`, and `parquet`. However, you can also configure the web crawler to fetch data from PDF files accessible via URLs.

```sql
SELECT *
FROM my_web.crawler
WHERE url = '<link-to-pdf-file>'
LIMIT 1;
```

## Troubleshooting

<Warning>
`Web crawler encounters character encoding issues`

* **Symptoms**: Extracted text appears garbled or contains strange characters instead of the expected text.
* **Checklist**:
    1. If you encounter a bug or a repeatable encoding error, report it by opening an issue on the [MindsDB GitHub](https://github.com/mindsdb/mindsdb/issues) repository.
</Warning>

<Warning>
`Web crawler times out while trying to fetch content`

* **Symptoms**: The crawler fails to retrieve data from a website, resulting in timeout errors.
* **Checklist**:
    1. Check the network connection to ensure the target site is reachable.
</Warning>
Requirements change (@@ -1 +1,2 @@): `html2text` is added alongside the existing `pymupdf`.
Empty file.
Deleted: mindsdb/integrations/handlers/web_handler/tests/example_data.py (18 deletions)
Deleted: mindsdb/integrations/handlers/web_handler/tests/test_helpers.py (49 deletions)