This is a simple web scraper that searches for information on a given topic and scrapes the text of the top results from a specified domain.
- Searches for information using the DuckDuckGo search engine
- Scrapes text from the top results from a specified domain
- Uses NLTK's sentence tokenizer to split the scraped text into sentences
- Cleans the scraped text by removing non-alphanumeric characters
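
The cleaning step above can be sketched roughly as follows. This is a minimal illustration, not the code from scraper.py: the function name clean_sentence is hypothetical, and it assumes that whitespace is kept so words stay separated and that the input has already been split into sentences (the project uses NLTK's sentence tokenizer for that step).

```python
import re


def clean_sentence(sentence: str) -> str:
    """Drop non-alphanumeric characters from one sentence.

    Hypothetical helper: the real script may clean text differently.
    """
    # Replace anything that is not a letter, digit, or whitespace with a space
    # (assumption: whitespace is preserved so adjacent words do not merge).
    stripped = re.sub(r"[^A-Za-z0-9\s]", " ", sentence)
    # Collapse the runs of whitespace left behind and trim the ends.
    return re.sub(r"\s+", " ", stripped).strip()


# Example input standing in for sentences produced by the tokenizer.
sentences = ["Hello, world!", "Python 3.x is   required."]
cleaned = [clean_sentence(s) for s in sentences]
print(cleaned)
```

Replacing punctuation with a space rather than deleting it avoids gluing neighbouring words together, at the cost of splitting tokens such as "3.x" into "3 x".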
- Python 3.x
- DuckDuckGo search API (duckduckgo_search package)
- Requests library (requests package)
- BeautifulSoup 4 (beautifulsoup4 package)
- NLTK (nltk package)
- Clone the repository:
git clone https://github.com/scrape-any-sites/web-scraper.git
- Install the required packages:
pip install duckduckgo_search requests beautifulsoup4 nltk
- Download the NLTK sentence tokenizer data:
python
>>> import nltk
>>> nltk.download("punkt")
- Run the scraper.py script:
python scraper.py
- Enter your search query:
Enter Your query: example query
- Enter the domain you want to scrape from:
Enter the domain you want to scrape from (e.g., wikipedia.com, reddit.com): example.com
- The script will print the URLs of the top results and the scraped text from each URL.
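
One way to restrict results to the entered domain is sketched below. It is an assumption about how such filtering could work, not a description of scraper.py: the helpers build_query and is_from_domain are hypothetical, and the sketch relies only on DuckDuckGo's site: search operator and the standard library.

```python
from urllib.parse import urlparse


def build_query(query: str, domain: str) -> str:
    """Append a site: operator so the search engine limits results to the domain."""
    return f"{query} site:{domain}"


def is_from_domain(url: str, domain: str) -> bool:
    """Return True if the URL's host is the domain itself or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    domain = domain.lower()
    return host == domain or host.endswith("." + domain)


# Hypothetical result URLs, used only to show the filter.
urls = [
    "https://example.com/article",
    "https://docs.example.com/guide",
    "https://notexample.com/page",
]
kept = [u for u in urls if is_from_domain(u, "example.com")]
print(build_query("example query", "example.com"))
print(kept)
```

Checking the hostname after the search guards against results that merely mention the domain in their path or query string.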
The duckduckgo_search package is not officially supported by DuckDuckGo and may violate their terms of service. Use at your own risk.
This project is licensed under the MIT License - see the LICENSE file for details.