# Web Crawler

#### Simple web crawler built using Node.js

## Getting Started

- Clone the repo or download the zip file, and `cd` into the project folder.
- Install the dependencies and run the crawler on your site:

```
npm install
node index.js --domain https://vikaslalwani.com --batch 5
```

### Arguments

- `domain`: the sub-domain on which you want to run the crawler. Please note that this crawler will not cross sub-domains.
- `batch`: instead of sending plenty of requests at once, the crawler restricts the number of simultaneous requests to the batch size you enter. Only after finishing a batch will it pick up the next one (see the argument-parsing sketch below).

Note: a batch gets only five tries before the next one begins. That's why you should not use a very large batch size. 5 is the default.

PS: 5 is a good number. Just saying!
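
For reference, here is a minimal sketch of how the two flags could be read from the command line. It is hypothetical; the actual `index.js` may parse its arguments differently (for example, with a helper library).

```js
// Hypothetical sketch: read --domain and --batch from process.argv
function parseArgs(argv) {
  const args = { domain: null, batch: 5 }; // 5 is the documented default batch size
  for (let i = 2; i < argv.length; i++) {
    if (argv[i] === '--domain') args.domain = argv[++i];
    else if (argv[i] === '--batch') args.batch = parseInt(argv[++i], 10) || 5;
  }
  return args;
}

const { domain, batch } = parseArgs(process.argv);
console.log(`Crawling ${domain} in batches of ${batch}`);
```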

## Features

This crawler is designed for gathering a subdomain's assets:

- Crawls all pages of a subdomain. Excludes other subdomains.
- Generates JSON output listing the URLs of every static asset (images, JavaScript, stylesheets), grouped by page (see the sample below).
- Respects a site's robots.txt.
- Batches the requests to avoid sending too many at once. Ideally, it will run at 1 page per second.
- Keeps saving the output to a local output.json file, so even if you stop it midway you will still get your results.
- Only crawls HTML pages. Ignores images and PDFs.
- Gracefully ignores 404 pages.
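
The exact schema of output.json is defined by the code, but "asset URLs grouped by page" suggests something along these lines (a hypothetical sample, not actual output):

```json
{
  "https://vikaslalwani.com/": {
    "images": ["https://vikaslalwani.com/img/photo.png"],
    "scripts": ["https://vikaslalwani.com/js/app.js"],
    "stylesheets": ["https://vikaslalwani.com/css/main.css"]
  },
  "https://vikaslalwani.com/about": {
    "images": [],
    "scripts": ["https://vikaslalwani.com/js/app.js"],
    "stylesheets": ["https://vikaslalwani.com/css/main.css"]
  }
}
```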

## Components

```
.
├── README.md          // you are here
├── crawlSinglePage.js // crawls one page at a time
├── index.js           // manages the crawler; the script starts here
├── node_modules       // dependencies
├── output.json        // final output
├── package.json       // metadata about the app
├── parseUrl.js        // parses the input URL
├── robots.js          // parses robots.txt
└── robots.txt         // stores a site's robots.txt if one exists
```

## How it works

### Overview

- The input URL is parsed.
- Crawling starts at the given page. Each crawl returns:
  - that page's assets
  - more hyperlinks found on that page
- Newly discovered URLs are added to the pool.
- The crawler visits those URLs, extracting their assets and even more URLs.
- Once all URLs are crawled, the program ends.
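
Conceptually, the loop looks something like the sketch below. This is a simplified, hypothetical version, not the actual index.js code; `crawlSinglePage` is passed in to stand for the real module's interface.

```js
// Simplified, hypothetical sketch of the crawl loop
async function crawlAll(startUrl, batchSize, crawlSinglePage) {
  const pool = new Set([startUrl]); // URLs discovered so far
  const done = new Set();           // URLs already crawled
  const assetsByPage = {};          // final output, grouped by page

  while (done.size < pool.size) {
    // take the next batch of unprocessed URLs
    const batch = [...pool].filter((url) => !done.has(url)).slice(0, batchSize);
    // crawl the whole batch before forming the next one
    const results = await Promise.all(batch.map((url) => crawlSinglePage(url)));
    for (const { url, assets, links } of results) {
      done.add(url);
      assetsByPage[url] = assets;
      links.forEach((link) => pool.add(link)); // newly discovered URLs join the pool
    }
    // the real crawler rewrites output.json here, after every batch
  }
  return assetsByPage;
}
```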

### Inside the code

- The script starts in index.js, where the init function:
  - parses the input URL using the parseUrl.js module and adds it to the URL pool.
  - fetches robots.txt using the robots.js module and stores banned URLs in the bannedUrls array.
  - calls the checkCrawlingStatus function.
- Next, checkCrawlingStatus keeps an eye on the URL pool. If all URLs in the pool have been processed it ends the program; if not, it calls the processUrlPool function. It keeps calling processUrlPool every 5 seconds.
- processUrlPool forms batches of the given size and fetches data from those pages using the crawlSinglePage.js module. The next batch is formed only after the current one finishes.
- crawlSinglePage.js sends a GET request to an individual page and returns an object containing the input URL, the page's assets, and the newly discovered links.
- The newly discovered URLs are added to the pool, and checkCrawlingStatus notices them in the next cycle.
- The output file is appended after each batch is processed.
- Once all the URLs in the pool are done, the program stops.
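
For illustration, a stripped-down stand-in for crawlSinglePage.js might look like the sketch below. It uses only Node built-ins and a naive regex for extracting assets and links; the real module's dependencies, parsing, and sub-domain/robots filtering are not shown here.

```js
// Hypothetical, simplified stand-in for crawlSinglePage.js (Node built-ins only)
const https = require('https');

function crawlSinglePage(url) {
  return new Promise((resolve) => {
    https.get(url, (res) => {
      if (res.statusCode === 404) {
        res.resume(); // gracefully ignore 404 pages
        return resolve({ url, assets: [], links: [] });
      }
      let html = '';
      res.on('data', (chunk) => (html += chunk));
      res.on('end', () => {
        // naive extraction of URLs from src/href attributes
        const refs = [...html.matchAll(/(?:src|href)="([^"]+)"/g)].map((m) => m[1]);
        const assets = refs.filter((r) => /\.(png|jpe?g|gif|svg|js|css)(\?|$)/i.test(r));
        const links = refs.filter((r) => !assets.includes(r));
        resolve({ url, assets, links });
      });
    }).on('error', () => resolve({ url, assets: [], links: [] }));
  });
}
```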

## Known Issues

- If no protocol is given, it assumes https.
- `about` and `about.html` will be treated as different pages.
- Both http and https pages will be crawled.
- Does not respect canonical tags.
- Does not respect individual pages' meta robots tags.
- Does not handle redirects.
- Although it works, there are no unit tests yet :(
