Extract structured data from the web using GraphQL.

graphql-scraper

GraphQL lets us query all sorts of graph-shaped data - so why not use it to query the world's most useful graph, the web?

graphql-scraper is a command-line tool and a reusable GraphQL schema that let you extract data from HTML.

Check out a live demo here. You can easily spin up your own by using graphql-scraper-server.

The command-line tool

npx graphql-scraper <query-file>

or

npm install -g graphql-scraper
graphql-scraper <query-file>

Reads a GraphQL query from the file at query-file, executes it, and prints the result.

If query-file is not given, reads the query from stdin.

Command-line options

  • --json Returns the result in JSON format, for use in other tools.
  • --help Prints a help string.

Variables

Any other named option you pass to the CLI is used as a query variable with the same name.

For example, if you want to reuse the same query on several pages, you could write the following query file (query.graphql):

query ExampleQueryWithVariable($page: String) {
  page(url: $page) {
    items: queryAll(selector: "tr.athing") {
      rank: text(selector: "td span.rank")
      title: text(selector: "td.title a")
      sitebit: text(selector: "span.comhead a")
      url: attr(selector: "td.title a", name: "href")
      attrs: next {
        score: text(selector: "span.score")
        user: text(selector: "a:first-of-type")
        comments: text(selector: "a:nth-of-type(3)")
      }
    }
  }
}

...and execute the query like this:

graphql-scraper query.graphql --page="https://news.ycombinator.com/"
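If you want to drive the CLI from a Node.js script instead of a shell, the flags described above can be assembled programmatically. The following is a minimal sketch, not part of graphql-scraper: buildCliArgs and runScraper are hypothetical helper names, and the sketch assumes that npx is on the PATH and that the --json output is a single JSON document on stdout.

```javascript
const { execFile } = require('child_process')

// Turn a plain object of query variables into CLI flags, e.g.
// { page: 'https://example.com' } becomes ['--page=https://example.com'].
function buildCliArgs(queryFile, variables = {}) {
  const flags = Object.entries(variables).map(
    ([name, value]) => `--${name}=${value}`
  )
  // --json asks the CLI for machine-readable output.
  return [queryFile, '--json', ...flags]
}

// Hypothetical helper: runs the CLI through npx and parses its stdout.
// Assumption: the --json output is one JSON document.
function runScraper(queryFile, variables = {}) {
  return new Promise((resolve, reject) => {
    execFile(
      'npx',
      ['graphql-scraper', ...buildCliArgs(queryFile, variables)],
      (error, stdout) => {
        if (error) return reject(error)
        try {
          resolve(JSON.parse(stdout))
        } catch (parseError) {
          reject(parseError)
        }
      }
    )
  })
}
```

A call such as runScraper('query.graphql', { page: 'https://news.ycombinator.com/' }) would then resolve with the parsed result. Passing the arguments as an array avoids shell quoting issues with URLs.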

The schema

You can check out an auto-generated schema description here, but I recommend trying out the graphql-scraper-server example and exploring the types interactively. You can also play around with the schema in the live demo.

Re-using the schema in your own projects

The npm package exports the GraphQL schema which is used by the command-line tool. This is an instance of graphql-js GraphQLSchema, which you can use anywhere that expects a schema, for example apollo-server or graphql-yoga.

Use npm install graphql-scraper or yarn add graphql-scraper to add the schema to your project.

Basic example with graphql

import { graphql } from 'graphql'
import schema from 'graphql-scraper'
// You can also import it as follows:
// const schema = require('graphql-scraper')


const query = `
{
  page(url: "https://news.ycombinator.com") {
    items: queryAll(selector: "tr.athing") {
      rank: text(selector: "td span.rank")
      title: text(selector: "td.title a")
      sitebit: text(selector: "span.comhead a")
      url: attr(selector: "td.title a", name: "href")
      attrs: next {
        score: text(selector: "span.score")
        user: text(selector: "a:first-of-type")
        comments: text(selector: "a:nth-of-type(3)")
      }
    }
  }
}
`

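// graphql-js v16 and later dropped positional arguments;
// there, call graphql({ schema, source: query }) instead.
graphql(schema, query).then(response => {
  console.log(response)
})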
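The response logged above is a graphql-js execution result, an object with a data property and, if something went wrong, an errors array. A small helper can turn it into the list of scraped rows. This is only a sketch: extractItems is a hypothetical name, and it relies on the page field and the items alias used in the example query.

```javascript
// Hypothetical helper for the example query above.
// Throws if graphql-js reported errors, otherwise returns the rows
// selected under the `items` alias (or an empty list if none exist).
function extractItems(response) {
  if (response.errors && response.errors.length > 0) {
    const messages = response.errors.map(error => error.message).join('; ')
    throw new Error(`Query failed: ${messages}`)
  }
  const page = response.data && response.data.page
  // `page` or `items` may be missing or null in the result.
  return (page && page.items) || []
}
```

It can be chained onto the call from the example, for instance graphql(schema, query).then(extractItems).then(items => console.log(items.length)).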

Background

This project was inspired by gdom, which is written in Python and uses the Graphene GraphQL library.

If you want to switch over from gdom, please note some schema changes:

  • query(selector: String!) now returns a single Element rather than a list (like document.querySelector). A new queryAll(selector: String!): [Element] field has been added, which behaves like document.querySelectorAll.
  • is(selector: String!) is renamed to has(selector: String!).
  • children, parent, siblings, next etc. no longer have a selector argument. If you need to select children with a specific selector, use child selectors (.foo > .bar).
  • parents is removed.
  • prev[All] is renamed to previous[All].
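The pure renames in this list can be applied mechanically to existing gdom queries. The sketch below is one possible approach, not something shipped with graphql-scraper: migrateGdomQuery is a hypothetical name, and the sketch assumes that the old query relied on query returning a list, so it is rewritten to queryAll.

```javascript
// Hypothetical helper: applies the pure renames from the list above to
// the text of a gdom query. It is a plain text rewrite, not a GraphQL
// parser, so it would also touch matching text inside string literals
// (for example a CSS selector containing ":is(").
function migrateGdomQuery(source) {
  return source
    // gdom's query() returned a list; queryAll() keeps that behaviour.
    // The lookahead skips an anonymous operation like `query ($url: String)`.
    .replace(/\bquery\s*\((?!\s*\$)/g, 'queryAll(')
    // is(selector) is now has(selector).
    .replace(/\bis\s*\(/g, 'has(')
    // prev and prevAll are now previous and previousAll.
    .replace(/\bprev(All)?\b/g, 'previous$1')
}
```

Selector arguments on children, parent, siblings and next, as well as any use of parents, still need to be rewritten by hand.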

Maintainers

@lachenmayer

Contribute

PRs accepted.

License

MIT © 2018 harry lachenmayer