
This repository contains Python code for web crawling. It is built on the BeautifulSoup library and lets you extract text from web pages and store it in text files. The crawler can also extract hyperlinks from web pages and crawl them recursively. The code can serve as a starting point for your own web scraping projects.

ksn-developer/webcrawler


WebCrawler

This is a Python script that crawls a website and saves the text content of each page in a text file. It also extracts all the hyperlinks from each page and follows the links that are within the same domain to continue the crawling process.
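The link-following rule described above can be sketched as follows. This is a minimal illustration, not the repository's code: it swaps in the standard library's html.parser for BeautifulSoup so that it runs without dependencies, and the names LinkCollector and same_domain_links are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_domain_links(html, page_url, domain):
    """Return absolute http(s) links in html whose host equals domain."""
    collector = LinkCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        # Resolve relative links against the page URL and drop any #fragment.
        absolute = urljoin(page_url, href).split("#", 1)[0]
        parsed = urlparse(absolute)
        if (
            parsed.scheme in ("http", "https")
            and parsed.netloc == domain
            and absolute not in links
        ):
            links.append(absolute)
    return links


sample = (
    '<a href="/about">About</a> '
    '<a href="https://other.example/x">Elsewhere</a> '
    '<a href="mailto:someone@example.com">Mail</a>'
)
print(same_domain_links(sample, "https://example.com/", "example.com"))
```

Comparing the host exactly means that subdomains such as www.example.com count as a different domain; whether the script treats them that way is not stated here.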

Requirements

Python 3.x
Works on Linux, Windows, macOS, BSD

Install

Install dependencies:

pip install -r requirements.txt

Usage

To use this script, replace the domain and full_url variables with the domain and full URL of the website you want to crawl. Then, simply run the script in your Python environment.
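As a rough picture of what happens once the two variables are set, the sketch below walks a site starting from full_url. The values shown are placeholders, and the crawl function, its get_links parameter, and the max_pages limit are assumptions made for illustration. It uses a queue rather than recursion, which visits the same pages while avoiding deep call stacks; the script itself may be organized differently.

```python
from collections import deque

domain = "example.com"             # placeholder: host the crawl should stay within
full_url = "https://example.com/"  # placeholder: page where the crawl starts


def crawl(start_url, get_links, max_pages=100):
    """Visit pages breadth-first, never queuing the same URL twice."""
    queue = deque([start_url])
    seen = {start_url}
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Stand-in link graph for illustration; a real run would download each page
# and extract its same-domain links instead.
fake_site = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/", "https://example.com/b"],
    "https://example.com/b": [],
}
print(crawl(full_url, lambda url: fake_site.get(url, [])))
```

The seen set is what keeps pages that link back to each other from being crawled in an endless loop.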

The script creates a text directory in the same directory as the script. Inside it, a directory named after the crawled domain holds one text file for each page crawled.
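One way to picture the output layout is the sketch below. The file-naming scheme (URL path with slashes replaced by underscores, index for the root page) is an assumption, as are the function names page_path and save_page; the script may name its files differently.

```python
from pathlib import Path
from urllib.parse import urlparse


def page_path(url, root="text"):
    """Map a URL to root/<domain>/<name>.txt (naming scheme is an assumption)."""
    parsed = urlparse(url)
    name = parsed.path.strip("/").replace("/", "_") or "index"
    return Path(root) / parsed.netloc / f"{name}.txt"


def save_page(url, text, root="text"):
    """Write the page text to its file, creating directories as needed."""
    path = page_path(url, root)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return path


print(page_path("https://example.com/docs/intro").as_posix())
```

Calling save_page("https://example.com/docs/intro", "page text") would then write the text to text/example.com/docs_intro.txt, relative to the current working directory.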

Note: It is recommended to crawl only websites whose owners have given you permission.
