Scrape any website with a single script. All you need is the domain name!

himisir/Scrape-Any-Sites

Web Scraper

This is a simple web scraper that searches for information on a given topic and scrapes the text of the top results from a specified domain.

Features

  • Searches for information using the DuckDuckGo search engine
  • Scrapes text from the top results from a specified domain
  • Uses NLTK's sentence tokenizer to split the scraped text into sentences
  • Cleans the scraped text by removing non-alphanumeric characters
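
The cleaning step in the last bullet can be illustrated with a minimal sketch. It assumes the text has already been split into sentences (for example by NLTK) and that "non-alphanumeric" means anything other than ASCII letters, digits, and whitespace. The function names `clean_sentence` and `clean_sentences` are hypothetical and may not match what scraper.py does.

```python
import re


def clean_sentence(sentence):
    """Drop every character that is not an ASCII letter, digit, or whitespace."""
    stripped = re.sub(r"[^A-Za-z0-9\s]", "", sentence)
    # Collapse the runs of whitespace left behind by removed characters.
    return " ".join(stripped.split())


def clean_sentences(sentences):
    """Clean each sentence and discard any that end up empty."""
    cleaned = (clean_sentence(s) for s in sentences)
    return [s for s in cleaned if s]


print(clean_sentences(["Hello, world!", "It's 2024.", "..."]))
# → ['Hello world', 'Its 2024']
```

Cleaning after sentence splitting, rather than before, keeps the punctuation a sentence tokenizer relies on. Note that this ASCII-only pattern would also drop accented letters, which may or may not be what the script intends.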

Requirements

  • Python 3.x
  • DuckDuckGo search API (duckduckgo_search package)
  • Requests library (requests package)
  • BeautifulSoup 4 (beautifulsoup4 package)
  • NLTK (nltk package)

Installation

  1. Clone the repository:
git clone https://github.com/himisir/Scrape-Any-Sites.git
  2. Install the required packages:
pip install duckduckgo_search requests beautifulsoup4 nltk
  3. Download the NLTK sentence tokenizer data:
python
>>> import nltk
>>> nltk.download("punkt")
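
To confirm that the installation steps above worked, a small check like the following can help. It is only a sketch: `REQUIRED_MODULES` and `missing_modules` are hypothetical names, not part of the repository. The one non-obvious detail is that the beautifulsoup4 package is imported as `bs4`.

```python
from importlib.util import find_spec

# Import names of the packages listed under Requirements.
# beautifulsoup4 is installed under that name but imported as "bs4".
REQUIRED_MODULES = ["duckduckgo_search", "requests", "bs4", "nltk"]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in modules if find_spec(name) is None]


missing = missing_modules()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

This only checks that the modules can be located; it does not verify that the NLTK tokenizer data has been downloaded.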

Usage

  1. Run the scraper.py script:
python scraper.py
  2. Enter your search query:
Enter Your query: example query
  3. Enter the domain you want to scrape from:
Enter the domain you want to scrape from (e.g., wikipedia.com, reddit.com): example.com
  4. The script will print the URLs of the top results and the scraped text from each URL.
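
For readers who want to see how such a flow might fit together, here is a rough sketch of the search, filter, fetch, and tokenize steps. It is not the contents of scraper.py. The helper names (`build_query`, `is_on_domain`, `scrape`) are hypothetical, restricting the search with the `site:` operator is an assumption, and the `DDGS().text(...)` call reflects recent versions of the duckduckgo_search package and may differ in the version you have installed.

```python
from urllib.parse import urlparse


def build_query(query, domain):
    """Restrict a search to one domain using the `site:` operator (assumed approach)."""
    return f"{query} site:{domain}"


def is_on_domain(url, domain):
    """True if the URL's host is the domain itself or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    domain = domain.lower()
    return host == domain or host.endswith("." + domain)


def scrape(query, domain, max_results=5):
    """Search, keep results on `domain`, and return {url: [sentences]}."""
    # Third-party imports are kept local so the helpers above work without them.
    import requests
    from bs4 import BeautifulSoup
    from duckduckgo_search import DDGS
    from nltk.tokenize import sent_tokenize

    results = DDGS().text(build_query(query, domain), max_results=max_results)
    pages = {}
    for result in results:
        url = result["href"]
        if not is_on_domain(url, domain):
            continue
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
        except requests.RequestException as error:
            # Skip pages that fail to load instead of aborting the whole run.
            print(f"Skipping {url}: {error}")
            continue
        text = BeautifulSoup(response.text, "html.parser").get_text(" ", strip=True)
        pages[url] = sent_tokenize(text)
    return pages
```

Calling `scrape("example query", "example.com")` would return a dictionary mapping each result URL to its list of sentences, which can then be printed in a loop. The function needs network access and the NLTK tokenizer data from the Installation section.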

Note

The duckduckgo_search package is not officially supported by DuckDuckGo and may violate their terms of service. Use at your own risk.

License

This project is licensed under the MIT License - see the LICENSE file for details.
