blog-scraper

A generic blog scraper.

Saves scraped data to MongoDB.

install

npm install blog-scraper

usage

from the command line

./scrape <url> [db-name]

url should be the URL of the first post. That page must contain a link to the next post, which the scraper follows to continue.
db-name is optional and defaults to blog-scraper.
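
The "follow the link to the next post" behaviour can be pictured with a small sketch. This is not blog-scraper's actual code: the `pages` map stands in for fetched blog pages, and `crawl` and `lookup` are hypothetical names used only for illustration.

```javascript
// Hypothetical stand-in for fetched pages: each URL maps to a post
// and the URL of the post that follows it (null on the last post).
var pages = {
	'http://example.com/post-1': { title: 'First', next: 'http://example.com/post-2' },
	'http://example.com/post-2': { title: 'Second', next: 'http://example.com/post-3' },
	'http://example.com/post-3': { title: 'Third', next: null }
};

// Start at the first post and keep following "next" until there is no
// link left, the page is missing, or a URL repeats (guards against loops).
function crawl(startUrl, lookup) {
	var titles = [];
	var seen = {};
	var url = startUrl;
	while (url && !seen[url]) {
		seen[url] = true;
		var page = lookup(url);
		if (!page) break;
		titles.push(page.title);
		url = page.next;
	}
	return titles;
}

var titles = crawl('http://example.com/post-1', function (url) {
	return pages[url];
});
console.log(titles.join(', '));
```

A real run would fetch each URL over HTTP and store each post in the database instead of collecting titles in an array.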

as a Node.js module

var scraper = require('blog-scraper');
scraper.init('db-name', options);

options has these defaults, any of which can be changed by passing an object to scraper.init():

options = {
	title: 'div.contentContainer section.wide-article > article header h1',
	body: 'div.contentContainer section.wide-article > article div.article-body',
	comments: 'section.comment-list > section',
	time: 'div.contentContainer section.wide-article > article time',
	next: '.article-actions .continue-reading .next a'
}
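
To illustrate what overriding only some defaults looks like, here is a minimal sketch of merging a partial options object over the defaults listed above. `DEFAULTS` and `mergeOptions` are hypothetical names, and the module may combine options differently internally.

```javascript
// The defaults from this README, copied into a hypothetical constant.
var DEFAULTS = {
	title: 'div.contentContainer section.wide-article > article header h1',
	body: 'div.contentContainer section.wide-article > article div.article-body',
	comments: 'section.comment-list > section',
	time: 'div.contentContainer section.wide-article > article time',
	next: '.article-actions .continue-reading .next a'
};

// Object.assign copies properties left to right, so keys present in
// overrides replace the defaults and all other defaults are kept.
function mergeOptions(overrides) {
	return Object.assign({}, DEFAULTS, overrides || {});
}

// Override two selectors for a blog with different markup (example values).
var merged = mergeOptions({
	title: 'article h1.post-title',
	next: 'a.next-post'
});
console.log(merged.title);
console.log(merged.body);
```

With this approach only the selectors that differ from the defaults need to be passed, for example `scraper.init('db-name', { title: 'article h1.post-title' })`.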

Note: the time extraction expects a datetime attribute on the element matched by the time selector.
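
As a rough illustration of the datetime requirement, the sketch below turns a datetime attribute value into a Date and rejects missing or unparsable values. `parseDatetimeAttr` is a hypothetical helper, not part of the module's API, and the markup in the comment is an assumed example.

```javascript
// Assumed markup that the time selector would match:
//   <time datetime="2014-03-01T12:30:00Z">March 1, 2014</time>

// Convert the attribute value to a Date, or return null when the
// attribute is missing or does not hold a parsable date string.
function parseDatetimeAttr(value) {
	var date = new Date(value);
	// An unparsable input yields an Invalid Date whose getTime() is NaN.
	return isNaN(date.getTime()) ? null : date;
}

var published = parseDatetimeAttr('2014-03-01T12:30:00Z');
console.log(published.toISOString()); // 2014-03-01T12:30:00.000Z

// An element without a datetime attribute gives undefined.
console.log(parseDatetimeAttr(undefined)); // null
```

If the target blog renders dates only as visible text, the time selector should point at an element that also carries a machine-readable datetime attribute.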
