ididntrealize/url-crawler

Node web crawler: input a JSON list of links and output JSON with data scraped from each link.

Table of Contents
  1. About The Project
  2. Built With
  3. Getting Started
  4. Config
  5. Usage Example

About The Project

This project takes a JSON list of links as input and outputs JSON with data scraped from each of those links. You can scrape any data you want by using site-specific targeting (a method similar to element targeting with jQuery).

Why use this project:

  • Start your scraping project right away without having to worry about laying the foundations
  • Create targeting methods to execute on every page from your JSON link list
  • Use premade data-modification methods to perform common transformations before saving to your output JSON
  • Use config options to decide how to print results or save them to a timestamped file
  • Generate a report on errors and on pages where your targeting functions fail

Of course, there are always further optimizations and useful tools to add. You can suggest changes by forking this repo and creating a pull request, or by opening an issue. This project was made during contract work for Digital Yalo, which expressly gave permission to use and share it.

(back to top)

Built With

Thanks to othneildrew for the README template

(back to top)

Getting Started

To get a local copy up and running, follow these steps (they are consolidated into a single shell run after the list).

  1. Clone the repo
    git clone https://github.com/ididntrealize/url-crawler.git
  2. Install NPM packages
    npm install
  3. Create empty folders in the root directory:
    exports/
    logs/
    
  4. Start the example scrape
    node index.js
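
Taken together, the steps amount to this shell session (a sketch; mkdir is just one way to create the two folders):

    git clone https://github.com/ididntrealize/url-crawler.git
    cd url-crawler
    npm install
    mkdir exports logs
    node index.js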

(back to top)

Config

Scrape config options are found at the top of the index.js file:

//Default values
debug = true                               //verbose console output for each input .json item
printResultsToFile = true                  //after the scrape completes, create a .json output file in the exports/ dir
hideBrowser = true                         //hide the browser that opens for each link in your input .json
limitPagesToScrape = false                 //set to an integer (e.g. limitPagesToScrape = 15) to limit the number of links scraped from your input link list
currentScrapePrefix = "wikipediaArticles"  //give your scrape a unique title so multiple projects can run simultaneously
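
For example, a run that shows the browser and stops after the first 10 links of a hypothetical myArticles project could be configured like this (illustrative values, not defaults):

//Example custom values ("myArticles" is a placeholder name)
debug = false                       //quieter console output
printResultsToFile = true           //still write the .json output to exports/
hideBrowser = false                 //watch the browser step through each link
limitPagesToScrape = 10             //stop after the first 10 links
currentScrapePrefix = "myArticles"  //must match your import and targets filenames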

Usage Example

This project includes sample data with links to Wikipedia. It is intended only as an example, not for extended use; Wikipedia has an API that would be far more efficient than this scraper.

To use this application, you must create a JSON file with links to the pages you want to scrape; see /imports/wikipediaArticles.json for an example of the required format. Once you have created an import file, create a file with the same base name (but a .js extension) in /site-specific-targets/, e.g. /site-specific-targets/wikipediaArticles.js.
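
The authoritative example of the import format is the shipped /imports/wikipediaArticles.json; as a rough sketch, assuming the file boils down to a list of URL strings, an import could look like this (the URLs are placeholders):

    [
      "https://en.wikipedia.org/wiki/Web_scraping",
      "https://en.wikipedia.org/wiki/Web_crawler"
    ]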

The file you create in /site-specific-targets/ controls what happens once the scraper lands on one of the pages listed in your JSON import file. You must (see the sketch after this list):

  • Change the class name to match your import's name
  • Replace the JS selector assigned to the target variable
  • Replace .text() methods with .html() as needed
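
As a rough sketch of what such a file can look like (the scrape($) signature and the cheerio/jQuery-style $ handle are assumptions here; mirror the shipped /site-specific-targets/wikipediaArticles.js rather than this example):

    //site-specific-targets/myArticles.js (a sketch; "myArticles" and the
    //scrape($) signature are placeholders, not the project's actual API)
    class myArticles {
        scrape($) {
            //$ is assumed to be a jQuery/cheerio-style handle to the loaded page
            let target = "#firstHeading";  //replace with a selector for your site
            return {
                title: $(target).text(),   //swap .text() for .html() to keep markup
            };
        }
    }

    module.exports = myArticles;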

Finally, edit the index.js file (the layout sketch below shows how the pieces line up):

  • Change currentScrapePrefix to match your import's name.
  • Create /imports/yourScrapePrefix.json
  • Create /site-specific-targets/yourScrapePrefix.js
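
With a hypothetical import named myArticles, the three pieces line up like this:

    /imports/myArticles.json               <- your link list
    /site-specific-targets/myArticles.js   <- your targeting class
    index.js                               <- set currentScrapePrefix = "myArticles"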

(back to top)
