Merge pull request #9190 from mindsdb/verify-web
Verify web
ZoranPandovski committed May 15, 2024
2 parents 037f87c + f999bf5 commit 1dc5130
Showing 12 changed files with 419 additions and 293 deletions.
2 changes: 0 additions & 2 deletions .flake8
@@ -89,11 +89,9 @@ exclude =
mindsdb/integrations/handlers/quickbooks_handler/*
mindsdb/integrations/handlers/strava_handler/*
mindsdb/integrations/handlers/strava_handler/*
mindsdb/integrations/handlers/web_handler/*
mindsdb/integrations/handlers/strava_handler/*
mindsdb/integrations/handlers/github_handler/*
mindsdb/integrations/handlers/vitess_handler/*
mindsdb/integrations/handlers/web_handler/*
mindsdb/integrations/handlers/impala_handler/*
mindsdb/integrations/handlers/tdengine_handler/*
mindsdb/integrations/handlers/huggingface_api_handler/*
3 changes: 2 additions & 1 deletion .github/workflows/test_on_push.yml
@@ -119,6 +119,7 @@ jobs:
pip install mindsdb[mssql]
pip install mindsdb[clickhouse]
pip install mindsdb[snowflake]
pip install mindsdb[web]
pip freeze
- name: Run unit tests
run: |
@@ -134,7 +135,7 @@
fi
- name: Run Handlers tests and submit Coverage to coveralls
run: |
handlers=("mysql" "postgres" "mssql" "clickhouse" "snowflake")
handlers=("mysql" "postgres" "mssql" "clickhouse" "snowflake" "web")
for handler in "${handlers[@]}"
do
pytest --cov=mindsdb/integrations/handlers/${handler}_handler tests/unit/handlers/test_${handler}.py
41 changes: 32 additions & 9 deletions docs/integrations/app-integrations/web-crawler.mdx
@@ -5,16 +5,14 @@ sidebarTitle: Web Crawler

In this section, we present how to use a web crawler within MindsDB.

A web crawler is a computer program or automated script that browses the internet and navigates through websites, web pages, and web content to gather data. Within the realm of MindsDB, a web crawler can be employed to harvest data, which can be used to train models,
domain specific chatbots or fine-tune LLMs.
A web crawler is an automated script designed to systematically browse and index content on the internet. Within MindsDB, you can utilize a web crawler to efficiently collect data from various websites.

## Prerequisites

Before proceeding, ensure the following prerequisites are met:

1. Install MindsDB locally via [Docker](/setup/self-hosted/docker) or [Docker Desktop](/setup/self-hosted/docker-desktop).
2. To connect Web Crawler to MindsDB, install the required dependencies following [this instruction](/setup/self-hosted/docker#install-dependencies).
3. Install or ensure access to Web Crawler.
2. To use Web Crawler with MindsDB, install the required dependencies following [this instruction](/setup/self-hosted/docker#install-dependencies).

## Connection

@@ -26,12 +24,19 @@ Here is how to initialize a web crawler:
CREATE DATABASE my_web
WITH ENGINE = 'web';
```
<Tip>
The above query creates a database called `my_web`. By default, this database has a table called `crawler` that can be used to crawl data from one or more given URLs.
</Tip>

## Usage

<Note>
Specifying a `LIMIT` clause is required. To crawl all pages on a site, consider setting the limit to a high value, such as 10,000, which exceeds the expected number of pages. Be aware that setting a higher limit may result in longer response times.
</Note>

### Get Websites Content

Here is how to get the content of `docs.mindsdb.com`:
The following usage examples demonstrate how to retrieve content from `docs.mindsdb.com`:

```sql
SELECT *
@@ -40,7 +45,7 @@ WHERE url = 'docs.mindsdb.com'
LIMIT 1;
```

You can also get the content of internal pages. Here is how to fetch the content from 10 internal pages:
You can also retrieve content from internal pages. The following query fetches the content from 10 internal pages:

```sql
SELECT *
@@ -49,7 +54,7 @@ WHERE url = 'docs.mindsdb.com'
LIMIT 10;
```

Another option is to get the content from multiple websites.
Another option is to get the content from multiple websites by using the `IN ()` operator:

```sql
SELECT *
@@ -60,7 +65,7 @@ LIMIT 1;

### Get PDF Content

MindsDB accepts [file uploads](/sql/create/file) of `csv`, `xlsx`, `xls`, `sheet`, `json`, and `parquet`. However, you can utilize the web crawler to fetch data from `pdf` files.
MindsDB accepts [file uploads](/sql/create/file) of `csv`, `xlsx`, `xls`, `sheet`, `json`, and `parquet`. However, you can also configure the web crawler to fetch data from PDF files accessible via URLs.

```sql
SELECT *
@@ -69,4 +74,22 @@ WHERE url = '<link-to-pdf-file>'
LIMIT 1;
```

For example, you can provide a link to a `pdf` file stored in Amazon S3.
## Troubleshooting

<Warning>
`Web crawler encounters character encoding issues`

* **Symptoms**: Extracted text appears garbled or contains strange characters instead of the expected text.
* **Checklist**:
1. Open a GitHub Issue: If you encounter a bug or a repeatable error with encoding,
report it on the [MindsDB GitHub](https://github.com/mindsdb/mindsdb/issues) repository by opening an issue.
</Warning>
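A common root cause of garbled text is a mismatch between a page's declared and actual character encoding. Before filing an issue, it can help to confirm whether a fallback decode recovers the text; the sketch below is illustrative only and is not the handler's actual decoding logic:

```python
def decode_html(raw: bytes, declared=None) -> str:
    """Try the declared encoding, then common fallbacks, then replace errors."""
    for enc in (declared, "utf-8", "latin-1"):
        if not enc:
            continue
        try:
            return raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue
    # last resort: never crash, mark undecodable bytes with U+FFFD
    return raw.decode("utf-8", errors="replace")
```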


<Warning>
`Web crawler times out while trying to fetch content`

* **Symptoms**: The crawler fails to retrieve data from a website, resulting in timeout errors.
* **Checklist**:
1. Check the network connection to ensure the target site is reachable.
</Warning>
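Beyond checking the network, transient timeouts are commonly handled with retries and exponential backoff. The generic sketch below is illustrative and not the handler's real behavior; in practice `fetch` might be something like `lambda u: requests.get(u, timeout=10)`:

```python
import time

def fetch_with_retries(fetch, url, retries=3, backoff=0.5):
    """Call fetch(url), retrying failed attempts with exponential backoff."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the original error
            time.sleep(backoff * (2 ** attempt))
```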
91 changes: 66 additions & 25 deletions mindsdb/integrations/handlers/web_handler/README.md
@@ -1,54 +1,95 @@
# Build your Web crawler
---
title: Web Crawler
sidebarTitle: Web Crawler
---

This integration allows you to query the results of a crawler in SQL:
In this section, we present how to use a web crawler within MindsDB.

- This can be particularly useful for building Q&A systems from data on a website.
A web crawler is an automated script designed to systematically browse and index content on the internet. Within MindsDB, you can utilize a web crawler to efficiently collect data from various websites.

Note that this crawler can crawl every single sub-site from the original.
## Prerequisites

Let's see in action
Before proceeding, ensure the following prerequisites are met:

1. Install MindsDB locally via [Docker](/setup/self-hosted/docker) or [Docker Desktop](/setup/self-hosted/docker-desktop).
2. To use Web Crawler with MindsDB, install the required dependencies following [this instruction](/setup/self-hosted/docker#install-dependencies).

## Connection

This handler does not require any connection parameters.

Here is how to initialize a web crawler:

```sql
-- Should be able to create a web crawler database
CREATE DATABASE my_web
With
ENGINE = 'web';
WITH ENGINE = 'web';
```
<Tip>
The above query creates a database called `my_web`. By default, this database has a table called `crawler` that can be used to crawl data from one or more given URLs.
</Tip>

This creates a database called my_web. This database ships with a table called crawler that we can use to crawl data given some url/urls.
## Usage

<Note>
Specifying a `LIMIT` clause is required. To crawl all pages on a site, consider setting the limit to a high value, such as 10,000, which exceeds the expected number of pages. Be aware that setting a higher limit may result in longer response times.
</Note>

## Searching for web content in SQL
### Get Websites Content

Let's get the content of a docs.mindsdb.com website:
The following usage examples demonstrate how to retrieve content from `docs.mindsdb.com`:

```sql
SELECT
*
SELECT *
FROM my_web.crawler
WHERE
url = 'docs.mindsdb.com'
WHERE url = 'docs.mindsdb.com'
LIMIT 1;
```

You can also retrieve content from internal pages. The following query fetches the content from 10 internal pages:

```sql
SELECT *
FROM my_web.crawler
WHERE url = 'docs.mindsdb.com'
LIMIT 10;
```

Another option is to get the content from multiple websites by using the `IN ()` operator:

This should return the contents of docs.mindsdb.com.
```sql
SELECT *
FROM my_web.crawler
WHERE url IN ('docs.mindsdb.com', 'docs.python.org')
LIMIT 1;
```

### Get PDF Content

Now, let's assume we want to search for the content on multiple websites.
MindsDB accepts [file uploads](/sql/create/file) of `csv`, `xlsx`, `xls`, `sheet`, `json`, and `parquet`. However, you can also configure the web crawler to fetch data from PDF files accessible via URLs.

```sql
SELECT
*
SELECT *
FROM my_web.crawler
WHERE
url IN ('docs.mindsdb.com', 'docs.python.org')
LIMIT 30;
WHERE url = '<link-to-pdf-file>'
LIMIT 1;
```

This command will crawl two sites and stop when the results count hits 30. The total count of rows in the result will be 30.
## Troubleshooting

<Warning>
`Web crawler encounters character encoding issues`

* **Symptoms**: Extracted text appears garbled or contains strange characters instead of the expected text.
* **Checklist**:
1. Open a GitHub Issue: If you encounter a bug or a repeatable error with encoding,
report it on the [MindsDB GitHub](https://github.com/mindsdb/mindsdb/issues) repository by opening an issue.
</Warning>

NOTE: limit is mandatory. If you want to crawl all pages on the site, you can pass a big number in the limit (for example, 10000), more than the expected count of pages on the site.
However, a big limit also increases the time waiting for a response.

<Warning>
`Web crawler times out while trying to fetch content`

* **Symptoms**: The crawler fails to retrieve data from a website, resulting in timeout errors.
* **Checklist**:
1. Check the network connection to ensure the target site is reachable.
</Warning>
2 changes: 1 addition & 1 deletion mindsdb/integrations/handlers/web_handler/__about__.py
@@ -6,4 +6,4 @@
__github__ = 'https://github.com/mindsdb/mindsdb'
__pypi__ = 'https://pypi.org/project/mindsdb/'
__license__ = 'MIT'
__copyright__ = 'Copyright 2022- mindsdb'
__copyright__ = 'Copyright 2022 - MindsDB'
3 changes: 2 additions & 1 deletion mindsdb/integrations/handlers/web_handler/requirements.txt
@@ -1 +1,2 @@
pymupdf
pymupdf
html2text
Empty file.
18 changes: 0 additions & 18 deletions mindsdb/integrations/handlers/web_handler/tests/example_data.py

This file was deleted.

49 changes: 0 additions & 49 deletions mindsdb/integrations/handlers/web_handler/tests/test_helpers.py

This file was deleted.
