#### Simple web crawler built using Node.js
- Clone the repo or download the zip file, and `cd` into the project folder.
- Install dependencies and run the crawler on your site!

```bash
npm install
node index.js --domain https://vikaslalwani.com --batch 5
```
### Arguments:
- `domain`: the sub-domain on which you want to run the crawler. Please note that this crawler will not cross sub-domains.
- `batch`: instead of sending plenty of requests at once, the crawler limits concurrent requests to the batch size you enter. Only after finishing a batch does it pick the next one.

Note: a batch gets only five tries before the next one begins, so you should not set a very high batch size. 5 is the default.

PS: 5 is a good number. Just saying!
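For a rough idea of how these flags could be read inside `index.js`, here is a minimal sketch using plain `process.argv`; the actual argument handling in this repo may differ:

```js
// Illustrative only: one way to read --domain and --batch from the CLI.
// index.js may parse its arguments differently (e.g. via a library).
function parseArgs(argv) {
  const args = { batch: 5 }; // batch defaults to 5
  for (let i = 2; i < argv.length; i++) {
    if (argv[i] === '--domain') args.domain = argv[++i];
    else if (argv[i] === '--batch') args.batch = parseInt(argv[++i], 10);
  }
  return args;
}

const { domain, batch } = parseArgs(process.argv);
console.log(`Crawling ${domain} in batches of ${batch}`);
```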
This crawler is designed for gathering a subdomain's assets:

- Crawls all pages of a subdomain. Excludes other subdomains.
- Generates JSON output listing the URLs of every static asset (images, JavaScript, stylesheets), grouped by page (see the example below).
- Respects a site's `robots.txt`.
- Batches the requests to avoid sending too many at once. Ideally, it will run at 1 page per second.
- Keeps saving the output to an `output.json` file locally, so even if you stop it midway, you will get your results.
- Only crawls `html` pages. Ignores images and PDFs.
- Gracefully ignores `404` pages.
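The exact structure of `output.json` is not spelled out here, but since assets are grouped by page, consuming it might look roughly like this (a sketch based on the description above, not a guaranteed format):

```js
// Sketch: reading the crawler's output. The precise JSON shape is an
// assumption based on the README's "assets grouped by page" description.
const fs = require('fs');

const output = JSON.parse(fs.readFileSync('output.json', 'utf8'));

// Print every crawled page along with its static assets.
for (const [pageUrl, assets] of Object.entries(output)) {
  console.log(pageUrl, assets);
}
```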
```
.
├── README.md          // you are here
├── crawlSinglePage.js // crawls one page at a time
├── index.js           // manages the crawler; the script starts here
├── node_modules       // dependencies
├── output.json        // final output
├── package.json       // metadata about the app
├── parseUrl.js        // parses the input URL
├── robots.js          // parses robots.txt
└── robots.txt         // stores a site's robots.txt if one exists
```
### Overview:
- The input URL is parsed.
- Crawling starts at the given page. Each crawl:
  - returns that page's assets
  - returns more hyperlinks from that page
- Newly discovered URLs are added to the pool.
- The crawler visits more URLs, extracting assets and even more URLs.
- Once all URLs are crawled, the program ends. (A rough sketch of this cycle follows.)
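A minimal, self-contained sketch of that cycle is below. The names are illustrative, not the ones used in this repo; the real implementation is split across the modules described in the next section:

```js
// Conceptual sketch of the crawl cycle described above (illustrative names).
// crawlPage is any function that fetches a page and returns { assets, links }.
async function crawlSite(startUrl, crawlPage) {
  const pool = new Set([startUrl]); // every URL discovered so far
  const done = new Set();           // URLs already crawled
  const results = {};               // assets grouped by page

  while (done.size < pool.size) {
    for (const url of [...pool]) {
      if (done.has(url)) continue;
      const { assets, links } = await crawlPage(url);
      results[url] = assets;
      links.forEach((link) => pool.add(link)); // grow the pool with new URLs
      done.add(url);
    }
  }
  return results; // the program ends once every pooled URL has been crawled
}
```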
### Inside the code:
- Execution starts in `index.js`, where the `init` function:
  - parses the input URL using the `parseUrl.js` module and adds it to the URL pool.
  - fetches `robots.txt` using `robots.js` and stores banned URLs in the `bannedUrls` array.
  - calls the `checkCrawlingStatus` function.
- Next, `checkCrawlingStatus` keeps an eye on the URL pool. If all URLs in the pool are processed, it ends the program; if not, it calls the `processUrlPool` function. It keeps calling `processUrlPool` every 5 seconds.
- `processUrlPool` makes batches of the chosen size and fetches data from those pages using the `crawlSinglePage.js` module. Only after the current batch finishes does the next one get formed.
- `crawlSinglePage.js` sends `GET` requests to individual pages and returns an object containing the input URL, the page's assets, and newly discovered links (a sketch follows this list).
- The newly discovered URLs are added to the pool, and `checkCrawlingStatus` notices them in the next cycle.
- The output file is appended after each batch is processed.
- Once all the URLs in the pool are done, the program stops.
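For intuition, a single-page crawl could look roughly like the sketch below. It uses Node's built-in `fetch` (Node 18+) and naive regex extraction purely for illustration; the actual `crawlSinglePage.js` may use a different HTTP client and a proper HTML parser:

```js
// Illustrative sketch of crawling one page: GET the HTML, collect static
// asset URLs and hyperlinks, and return them to the caller.
async function crawlSinglePage(pageUrl) {
  const response = await fetch(pageUrl);
  if (!response.ok) return { url: pageUrl, assets: {}, links: [] }; // e.g. skip 404s

  const html = await response.text();
  // Naive attribute extraction; resolves relative URLs against the page URL.
  const extract = (regex) =>
    [...html.matchAll(regex)].map((m) => new URL(m[1], pageUrl).href);

  return {
    url: pageUrl,
    assets: {
      images: extract(/<img[^>]+src=["']([^"']+)["']/gi),
      javascript: extract(/<script[^>]+src=["']([^"']+)["']/gi),
      stylesheets: extract(/<link[^>]+href=["']([^"']+)["']/gi),
    },
    links: extract(/<a[^>]+href=["']([^"']+)["']/gi), // newly discovered URLs
  };
}

// Usage: crawlSinglePage('https://vikaslalwani.com').then(console.log);
```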
### Assumptions and limitations:
- If no protocol is given, `https` is assumed.
- `about` and `about.html` pages will be treated differently.
- Both `http` and `https` pages will be crawled.
- Does not respect canonical tags.
- Does not respect individual pages' meta-robots tags.
- Does not handle redirects.
- Although it works, there are no unit tests yet :(