Installation

A fast, efficient and asynchronous crawler to scrape all url's on a page.

Installation • Commands • Running • Running with Docker • License

Simple asynchronous crawler written in python that scraps url's from a specified website. What is important is that pages from a different domain than the one specified are not collected and crawled (subdomains are crawled). Note that it is impossible to get url's from dynamically loaded elements using this tool. You would have to use a UI testing framework to achieve this.

Installation

Download the repository locally

git clone https://github.com/gpiechnik2/byship.git

Go to the downloaded repository

cd byship

Install the tool locally

pip install --editable .

The tool can be run through Docker. Go to the Docker session for more information.

Commands

byship --help

Here are all the flags it supports.

Flag	Description
-t, --threads	Maximum number of threads at once (default 5)
-h, --headers	Custom headers in JSON file
-o, --output	Results file name (default results.txt)
-j, --json / -nj, --no-json	Write output in JSON format (false by default)
-wt, --wait-timeout	Waiting time in seconds between threads_value requests (default 2s)
-ct, --connect-timeout	The maximum amount of time to wait until a socket connection to the requested host is established (default 10s)
-rt, --read-timeout	The maximum duration to wait for a chunk of data to be received (for example, a chunk of the response body) (default 10s)
-f, --force / -nf, --no-force	Force running

Running

To use the tool, use the following command:

byship https://example.com

 __                 __     __        
|  |--.--.--.-----.|  |--.|__|.-----.
|  _  |  |  |__ --||     ||  ||  _  |
|_____|___  |_____||__|__||__||   __| v1.0 by @gpiechnik2
      |_____|                 |__|   
 
◉ [16:20:38] url: https://example.com; domain: example.com; headers: None; output: results.txt; json: False; wait timeout: 2 seconds; connect timeout: 10.0 seconds; read timeout: 10.0 seconds;

◉ [16:20:38] The file with the scraped urls has been created. Its name is: results.txt. You can open it at any time and see the list of url's which is updating all the time. If you stop the program, it will not delete the file;
◉ [16:24:38] Total number of scraped urls: 4302;

In the meantime the url file will be generated. You can stop the program at any time and the scraped url will not be lost. They are continuously added to the results file.

Running with Docker

Build a docker image:

docker build -t byship -f Dockerfile .

Then get the IMAGE_ID with the docker images command and run the tool using it.

docker run -it IMAGE_ID https://example.com

License

Not included yet

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
byship		byship
static		static
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
byshipcli.py		byshipcli.py
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

A fast, efficient and asynchronous crawler to scrape all url's on a page.

Installation

Commands

Running

Running with Docker

License

About

Releases

Packages

Languages

gpiechnik2/byship

Folders and files

Latest commit

History

Repository files navigation

A fast, efficient and asynchronous crawler to scrape all url's on a page.

Installation

Commands

Running

Running with Docker

License

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages