ididntrealize/url-crawler

Node web crawler: input a JSON list of links and output JSON with data scraped from each link.

Table of Contents
  1. About The Project
  2. Built With
  3. Getting Started
  4. Config
  5. Usage Example

About The Project

This project takes a JSON list of links as input and outputs JSON with data scraped from each of those links. You can scrape any data you want by using site-specific targeting (a method similar to element targeting with jQuery).

Why use this project:

  • Start your scraping project right away without having to worry about laying the foundations
  • Create targeting methods to execute on every page from your JSON link list
  • Use premade data-modification methods to perform common transformations before saving to your output JSON
  • Use config options to decide how to print results or save them to a timestamped file
  • Generate a report on errors and on pages where your targeting functions fail

Of course, there are always further optimizations and useful tools to add. You can suggest changes by forking this repo and creating a pull request, or by opening an issue. This project was made during contract work for Digital Yalo, which expressly gave permission to use and share it.

(back to top)

Built With

Thanks to othneildrew for the README template

(back to top)

Getting Started

To get a local copy up and running, follow these steps (they are consolidated into a single shell run after the list).

  1. Clone the repo
    git clone https://github.com/ididntrealize/url-crawler.git
  2. Install NPM packages
    npm install
  3. Create empty folders in the root directory:
    exports/
    logs/
    
  4. Start the example scrape
    node index.js
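
Taken together, the steps amount to this shell session (a sketch; mkdir is just one way to create the two folders):

    git clone https://github.com/ididntrealize/url-crawler.git
    cd url-crawler
    npm install
    mkdir exports logs
    node index.js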

(back to top)

Config

Scrape config options are found at the top of the index.js file:

//Default values
debug = true                               //verbose console output for each input .json item
printResultsToFile = true                  //after the scrape completes, create a .json output file in the exports/ dir
hideBrowser = true                         //hide the browser that opens for each link in your input .json
limitPagesToScrape = false                 //set to an integer (e.g. limitPagesToScrape = 15) to limit the number of links scraped from your input link list
currentScrapePrefix = "wikipediaArticles"  //give your scrape a unique title so multiple projects can run simultaneously
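
For example, a run that shows the browser and stops after the first 10 links of a hypothetical myArticles project could be configured like this (illustrative values, not defaults):

//Example custom values ("myArticles" is a placeholder name)
debug = false                       //quieter console output
printResultsToFile = true           //still write the .json output to exports/
hideBrowser = false                 //watch the browser step through each link
limitPagesToScrape = 10             //stop after the first 10 links
currentScrapePrefix = "myArticles"  //must match your import and targets filenames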

Usage Example

This project includes sample data with links to Wikipedia. It is intended only as an example, not for extended use; Wikipedia has an API that would be far more efficient than this scraper.

To use this application, you must create a JSON file with links to the pages you want to scrape; see /imports/wikipediaArticles.json for an example of the required format. Once you have created an import file, create a file with the same base name (but a .js extension) in /site-specific-targets/, e.g. /site-specific-targets/wikipediaArticles.js.
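
The authoritative example of the import format is the shipped /imports/wikipediaArticles.json; as a rough sketch, assuming the file boils down to a list of URL strings, an import could look like this (the URLs are placeholders):

    [
      "https://en.wikipedia.org/wiki/Web_scraping",
      "https://en.wikipedia.org/wiki/Web_crawler"
    ]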

The file you create in /site-specific-targets/ controls what happens once the scraper lands on one of the pages listed in your JSON import file. You must (see the sketch after this list):

  • Change the class name to match your import's name
  • Replace the JS selector assigned to the target variable
  • Replace .text() methods with .html() as needed
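
As a rough sketch of what such a file can look like (the scrape($) signature and the cheerio/jQuery-style $ handle are assumptions here; mirror the shipped /site-specific-targets/wikipediaArticles.js rather than this example):

    //site-specific-targets/myArticles.js (a sketch; "myArticles" and the
    //scrape($) signature are placeholders, not the project's actual API)
    class myArticles {
        scrape($) {
            //$ is assumed to be a jQuery/cheerio-style handle to the loaded page
            let target = "#firstHeading";  //replace with a selector for your site
            return {
                title: $(target).text(),   //swap .text() for .html() to keep markup
            };
        }
    }

    module.exports = myArticles;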

Finally, edit the index.js file (the layout sketch below shows how the pieces line up):

  • Change currentScrapePrefix to match your import's name.
  • Create /imports/yourScrapePrefix.json
  • Create /site-specific-targets/yourScrapePrefix.js
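
With a hypothetical import named myArticles, the three pieces line up like this:

    /imports/myArticles.json               <- your link list
    /site-specific-targets/myArticles.js   <- your targeting class
    index.js                               <- set currentScrapePrefix = "myArticles"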

(back to top)
