
Verify web #9190

Merged 12 commits on May 15, 2024
2 changes: 0 additions & 2 deletions .flake8
@@ -89,11 +89,9 @@ exclude =
mindsdb/integrations/handlers/quickbooks_handler/*
mindsdb/integrations/handlers/strava_handler/*
mindsdb/integrations/handlers/strava_handler/*
mindsdb/integrations/handlers/web_handler/*
mindsdb/integrations/handlers/strava_handler/*
mindsdb/integrations/handlers/github_handler/*
mindsdb/integrations/handlers/vitess_handler/*
mindsdb/integrations/handlers/web_handler/*
mindsdb/integrations/handlers/impala_handler/*
mindsdb/integrations/handlers/tdengine_handler/*
mindsdb/integrations/handlers/huggingface_api_handler/*
3 changes: 2 additions & 1 deletion .github/workflows/test_on_push.yml
@@ -119,6 +119,7 @@ jobs:
pip install mindsdb[mssql]
pip install mindsdb[clickhouse]
pip install mindsdb[snowflake]
pip install mindsdb[web]
pip freeze
- name: Run unit tests
run: |
@@ -133,7 +134,7 @@
fi
- name: Run Handlers tests and submit Coverage to coveralls
run: |
handlers=("mysql" "postgres" "mssql" "clickhouse" "snowflake")
handlers=("mysql" "postgres" "mssql" "clickhouse" "snowflake" "web")
for handler in "${handlers[@]}"
do
pytest --cov=mindsdb/integrations/handlers/${handler}_handler tests/unit/handlers/test_${handler}.py
41 changes: 32 additions & 9 deletions docs/integrations/app-integrations/web-crawler.mdx
@@ -5,16 +5,14 @@ sidebarTitle: Web Crawler

In this section, we present how to use a web crawler within MindsDB.

A web crawler is a computer program or automated script that browses the internet and navigates through websites, web pages, and web content to gather data. Within the realm of MindsDB, a web crawler can be employed to harvest data, which can be used to train models,
domain specific chatbots or fine-tune LLMs.
A web crawler is an automated script designed to systematically browse and index content on the internet. Within MindsDB, you can utilize a web crawler to efficiently collect data from various websites.

## Prerequisites

Before proceeding, ensure the following prerequisites are met:

1. Install MindsDB locally via [Docker](/setup/self-hosted/docker) or [Docker Desktop](/setup/self-hosted/docker-desktop).
2. To connect Web Crawler to MindsDB, install the required dependencies following [this instruction](/setup/self-hosted/docker#install-dependencies).
3. Install or ensure access to Web Crawler.
2. To use Web Crawler with MindsDB, install the required dependencies following [this instruction](/setup/self-hosted/docker#install-dependencies).

## Connection

@@ -26,12 +24,19 @@

Here is how to initialize a web crawler:
CREATE DATABASE my_web
WITH ENGINE = 'web';
```
<Tip>
The above query creates a database called `my_web`. By default, this database contains a table called `crawler` that can be used to crawl data from a given URL or list of URLs.
</Tip>
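
As a quick sanity check, you can list the registered data sources to confirm that `my_web` was created; this sketch assumes MindsDB's standard `SHOW DATABASES` statement:

```sql
-- The `my_web` data source created above should appear in this list.
SHOW DATABASES;
```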

## Usage

<Note>
Specifying a `LIMIT` clause is required. To crawl all pages on a site, consider setting the limit to a high value, such as 10,000, which exceeds the expected number of pages. Be aware that setting a higher limit may result in longer response times.
</Note>
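
As an illustration, a whole-site crawl might look like the sketch below; the limit of 10,000 is an arbitrary ceiling meant to exceed the site's page count, not a tuned value:

```sql
-- Crawl up to 10,000 pages; larger limits mean longer response times.
SELECT *
FROM my_web.crawler
WHERE url = 'docs.mindsdb.com'
LIMIT 10000;
```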

### Get Websites Content

Here is how to get the content of `docs.mindsdb.com`:
The following usage examples demonstrate how to retrieve content from `docs.mindsdb.com`:

```sql
SELECT *
FROM my_web.crawler
WHERE url = 'docs.mindsdb.com'
LIMIT 1;
```
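
If you only need the page text, you can select individual columns instead of `*`. The column names below, `url` and `text_content`, are assumptions based on this handler's typical output rather than a documented contract:

```sql
-- Fetch only the URL and the extracted page text (assumed column names).
SELECT url, text_content
FROM my_web.crawler
WHERE url = 'docs.mindsdb.com'
LIMIT 1;
```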

You can also get the content of internal pages. Here is how to fetch the content from 10 internal pages:
You can also retrieve content from internal pages. The following query fetches the content from 10 internal pages:

```sql
SELECT *
FROM my_web.crawler
WHERE url = 'docs.mindsdb.com'
LIMIT 10;
```

Another option is to get the content from multiple websites.
Another option is to get the content from multiple websites by using the `IN ()` operator:

```sql
SELECT *
FROM my_web.crawler
WHERE url IN ('docs.mindsdb.com', 'docs.python.org')
LIMIT 1;
```

### Get PDF Content

MindsDB accepts [file uploads](/sql/create/file) of `csv`, `xlsx`, `xls`, `sheet`, `json`, and `parquet`. However, you can utilize the web crawler to fetch data from `pdf` files.
MindsDB accepts [file uploads](/sql/create/file) of `csv`, `xlsx`, `xls`, `sheet`, `json`, and `parquet`. However, you can also configure the web crawler to fetch data from PDF files accessible via URLs.

```sql
SELECT *
FROM my_web.crawler
WHERE url = '<link-to-pdf-file>'
LIMIT 1;
```

For example, you can provide a link to a `pdf` file stored in Amazon S3.
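
A sketch of such a query follows; the bucket and file names are placeholders, not a real object:

```sql
-- Hypothetical S3-hosted PDF; any publicly reachable PDF URL works the same way.
SELECT *
FROM my_web.crawler
WHERE url = 'https://my-bucket.s3.amazonaws.com/annual-report.pdf'
LIMIT 1;
```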
## Troubleshooting

<Warning>
`Web crawler encounters character encoding issues`

* **Symptoms**: Extracted text appears garbled or contains strange characters instead of the expected text.
* **Checklist**:
  1. If you encounter a bug or a repeatable encoding error, report it by opening an issue on the [MindsDB GitHub](https://github.com/mindsdb/mindsdb/issues) repository.
</Warning>


<Warning>
`Web crawler times out while trying to fetch content`

* **Symptoms**: The crawler fails to retrieve data from a website, resulting in timeout errors.
* **Checklist**:
1. Check the network connection to ensure the target site is reachable.
</Warning>
91 changes: 66 additions & 25 deletions mindsdb/integrations/handlers/web_handler/README.md
@@ -1,54 +1,95 @@
# Build your Web crawler
---
title: Web Crawler
sidebarTitle: Web Crawler
---

This integration allows you to query the results of a crawler in SQL:
In this section, we present how to use a web crawler within MindsDB.

- This can be particularly useful for building Q&A systems from data on a website.
A web crawler is an automated script designed to systematically browse and index content on the internet. Within MindsDB, you can utilize a web crawler to efficiently collect data from various websites.

Note that this crawler can crawl every single sub-site from the original.
## Prerequisites

Let's see it in action.
Before proceeding, ensure the following prerequisites are met:

1. Install MindsDB locally via [Docker](/setup/self-hosted/docker) or [Docker Desktop](/setup/self-hosted/docker-desktop).
2. To use Web Crawler with MindsDB, install the required dependencies following [this instruction](/setup/self-hosted/docker#install-dependencies).

## Connection

This handler does not require any connection parameters.

Here is how to initialize a web crawler:

```sql
-- Should be able to create a web crawler database
CREATE DATABASE my_web
With
ENGINE = 'web';
WITH ENGINE = 'web';
```
<Tip>
The above query creates a database called `my_web`. By default, this database contains a table called `crawler` that can be used to crawl data from a given URL or list of URLs.
</Tip>

This creates a database called my_web. This database ships with a table called crawler that we can use to crawl data given some url/urls.
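
To verify that the `my_web` data source is registered, you can list the available databases (assuming MindsDB's standard statement):

```sql
-- `my_web` should show up among the data sources.
SHOW DATABASES;
```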
## Usage

<Note>
Specifying a `LIMIT` clause is required. To crawl all pages on a site, consider setting the limit to a high value, such as 10,000, which exceeds the expected number of pages. Be aware that setting a higher limit may result in longer response times.
</Note>
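
For instance, a whole-site crawl might look like the sketch below; 10,000 is an arbitrary ceiling chosen to exceed the expected page count:

```sql
-- Higher limits crawl more pages but take longer to return.
SELECT *
FROM my_web.crawler
WHERE url = 'docs.mindsdb.com'
LIMIT 10000;
```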

## Searching for web content in SQL
### Get Websites Content

Let's get the content of the docs.mindsdb.com website:
The following usage examples demonstrate how to retrieve content from `docs.mindsdb.com`:

```sql
SELECT
*
SELECT *
FROM my_web.crawler
WHERE
url = 'docs.mindsdb.com'
WHERE url = 'docs.mindsdb.com'
LIMIT 1;
```
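
To retrieve just the extracted text, you can name columns explicitly; `url` and `text_content` below are assumed column names based on this handler's typical output:

```sql
-- Select only the URL and page text (assumed column names).
SELECT url, text_content
FROM my_web.crawler
WHERE url = 'docs.mindsdb.com'
LIMIT 1;
```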

You can also retrieve content from internal pages. The following query fetches the content from 10 internal pages:

```sql
SELECT *
FROM my_web.crawler
WHERE url = 'docs.mindsdb.com'
LIMIT 10;
```

Another option is to get the content from multiple websites by using the `IN ()` operator:

This should return the contents of docs.mindsdb.com.
```sql
SELECT *
FROM my_web.crawler
WHERE url IN ('docs.mindsdb.com', 'docs.python.org')
LIMIT 1;
```

### Get PDF Content

Now, let's assume we want to search for the content on multiple websites.
MindsDB accepts [file uploads](/sql/create/file) of `csv`, `xlsx`, `xls`, `sheet`, `json`, and `parquet`. However, you can also configure the web crawler to fetch data from PDF files accessible via URLs.

```sql
SELECT
*
SELECT *
FROM my_web.crawler
WHERE
url IN ('docs.mindsdb.com', 'docs.python.org')
LIMIT 30;
WHERE url = '<link-to-pdf-file>'
LIMIT 1;
```
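
For example, a PDF stored in Amazon S3 can be crawled the same way; the URL below is a placeholder:

```sql
-- Hypothetical S3-hosted PDF file.
SELECT *
FROM my_web.crawler
WHERE url = 'https://my-bucket.s3.amazonaws.com/annual-report.pdf'
LIMIT 1;
```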

This command will crawl two sites and stop when the results count hits 30. The total count of rows in the result will be 30.
## Troubleshooting

<Warning>
`Web crawler encounters character encoding issues`

* **Symptoms**: Extracted text appears garbled or contains strange characters instead of the expected text.
* **Checklist**:
  1. If you encounter a bug or a repeatable encoding error, report it by opening an issue on the [MindsDB GitHub](https://github.com/mindsdb/mindsdb/issues) repository.
</Warning>

NOTE: limit is mandatory. If you want to crawl all pages on the site, you can pass a big number in the limit (for example, 10000), more than the expected count of pages on the site.
However, a big limit also increases the time waiting for a response.

<Warning>
`Web crawler times out while trying to fetch content`

* **Symptoms**: The crawler fails to retrieve data from a website, resulting in timeout errors.
* **Checklist**:
1. Check the network connection to ensure the target site is reachable.
</Warning>
2 changes: 1 addition & 1 deletion mindsdb/integrations/handlers/web_handler/__about__.py
@@ -6,4 +6,4 @@
__github__ = 'https://github.com/mindsdb/mindsdb'
__pypi__ = 'https://pypi.org/project/mindsdb/'
__license__ = 'MIT'
__copyright__ = 'Copyright 2022- mindsdb'
__copyright__ = 'Copyright 2022 - MindsDB'
3 changes: 2 additions & 1 deletion mindsdb/integrations/handlers/web_handler/requirements.txt
@@ -1 +1,2 @@
pymupdf
pymupdf
html2text
Empty file.
18 changes: 0 additions & 18 deletions mindsdb/integrations/handlers/web_handler/tests/example_data.py

This file was deleted.

49 changes: 0 additions & 49 deletions mindsdb/integrations/handlers/web_handler/tests/test_helpers.py

This file was deleted.