Add support for plugins #41

Merged
merged 11 commits into from
Dec 14, 2017

Kikobeats (Member) commented Dec 13, 2017

Introduction

metascraper is a set of rules, each built around a specific value that we want to extract.

These rules are enough for the most common cases, but sometimes we need to go a little further and define specific rules to reach one of these goals:

  • Extend the current rule set to support a specific web target.
  • Get the content by fetching it from an external resource, such as a third-party API.
  • Add your own rules for non-standard values.

All of these cases are outside the metascraper scope because they are too closely tied to one specific target.

I want to use an example to illustrate: think about the amazon.com site.

Our goal is to extract product information from the site.

If you check the HTML content of the website, the markup is too specific: a lot of auto-generated classes around a lot of div wrappers.

Also, there is probably a better way to extract the content, for example, getting specific product information using the Amazon API.

On the other hand, we are interested in product content, like price or availability, and this information is definitely outside the meta information that metascraper extracts.

metascraper plugins are exactly for these tasks: safely extending the metascraper core rules to support custom and specific cases.

How to define a plugin

Now when you load metascraper, you need to initialize it by passing an array of plugins that extend the metascraper rule set:

const metascraper = require('metascraper')({
  plugins: [] // your custom plugins here
})

These plugins need to follow a tiny interface. For example, let's build a plugin that uses the Clearbit Logo API:

'use strict'

const { URL } = require('url')

const DEFAULTS = {
  size: '128',
  format: 'png'
}

const ENDPOINT = 'https://logo.clearbit.com'

module.exports = opts => {
  opts = Object.assign({}, DEFAULTS, opts)
  const { size, format } = opts

  function fn ({ url, meta }) {
    const { hostname } = new URL(url)
    return { logo: `${ENDPOINT}/${hostname}?size=${size}&format=${format}` }
  }

  fn.test = ({ url, meta }) =>
    meta.logo === `${new URL(url).origin}/favicon.ico`

  return fn
}

There are two points to note here:

  • Plugins need to define a test function. It is used to determine whether the plugin should apply to the current link. In my case, I want to use the plugin only when the detected logo is the last fallback, which is based on the site favicon.
  • A plugin has a main function that needs to return an object. This information is merged with the basic information detected by metascraper.

That's all!

The only thing you need to do is initialize metascraper with metascraper-clearbit-logo:

const metascraper = require('metascraper')({ 
  plugins: [
    require('metascraper-clearbit-logo')()
  ] 
})
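
To make the interface concrete, here is a minimal sketch of how a core could consume it: call `test`, and if it passes, merge the object returned by the main function into the detected metadata. The `applyPlugins` helper and the inline `clearbitLogo` stand-in are hypothetical names used only for illustration; they are not the actual metascraper internals, and the sketch is synchronous for simplicity.

```javascript
'use strict'

const { URL } = require('url')

// Stand-in for metascraper-clearbit-logo, following the plugin interface above.
const clearbitLogo = ({ size = '128', format = 'png' } = {}) => {
  const fn = ({ url }) => {
    const { hostname } = new URL(url)
    return { logo: `https://logo.clearbit.com/${hostname}?size=${size}&format=${format}` }
  }
  // Apply only when the detected logo is the favicon fallback.
  fn.test = ({ url, meta }) => meta.logo === `${new URL(url).origin}/favicon.ico`
  return fn
}

// Hypothetical helper (assumption): runs each plugin whose test passes and
// merges its result into a copy of the metadata, in plugin order.
function applyPlugins ({ url, meta, plugins }) {
  const result = Object.assign({}, meta)
  for (const plugin of plugins) {
    if (!plugin.test({ url, meta: result })) continue
    Object.assign(result, plugin({ url, meta: result }))
  }
  return result
}
```

With this sketch, metadata whose `logo` is the site favicon gets the Clearbit URL, while metadata with any other logo passes through unchanged.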

On this Pull Request

  • A little metascraper core refactoring for supporting plugins out of the box.
  • Add a plugin test using metascraper-clearbit-logo.
  • Update tests interface.

Discussing Points

Core rules as plugins

Following this approach, we would move the current rules into plugins organized around values, for example:

  • metascraper-author
  • metascraper-description
  • metascraper-image
  • etc

and then just load these plugins by default as part of the bootstrapping:

// simple usage
// it implicitly loads a set of `metascraper-*` plugins by default
const metascraper = require('metascraper')() 

This would be the same as doing it explicitly:

// advanced usage
// need to define rules to be used
const metascraper = require('metascraper')({
  rules: [
   require('metascraper-author'),
   require('metascraper-description')
   // etc
  ]
})

The advantage of doing this is the possibility of excluding specific rule sets around props. It also removes the need for a separate plugins concept to differentiate core and external rule sets. They are all just rules.
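
A rough sketch of what excluding a rule set could look like from the caller's side, assuming the default bundles are available as a plain list that can be filtered. `DEFAULT_RULES`, the `propName`/`extract` shape, and `createScraper` are hypothetical stand-ins, not the real metascraper API.

```javascript
'use strict'

// Stubbed rule bundles (assumption): each one produces a single output prop.
const DEFAULT_RULES = [
  { propName: 'author', extract: () => 'Jane Doe' },
  { propName: 'description', extract: () => 'Some description' },
  { propName: 'image', extract: () => 'https://example.com/cover.png' }
]

// Hypothetical factory: falls back to the default bundles when none are given.
function createScraper ({ rules = DEFAULT_RULES } = {}) {
  return input => {
    const meta = {}
    for (const { propName, extract } of rules) meta[propName] = extract(input)
    return meta
  }
}

// Simple usage: every default bundle is loaded implicitly.
const simple = createScraper()

// Advanced usage: the same defaults, minus the image rules.
const withoutImage = createScraper({
  rules: DEFAULT_RULES.filter(rule => rule.propName !== 'image')
})
```

Here `simple({})` yields `author`, `description`, and `image`, while `withoutImage({})` omits `image`.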

API interface

I implemented the plugins definition as a breaking change to the interface, and I'm not sure it's the best way. I'm assuming that you want to define the plugins to use just once:

const metascraper = require('metascraper')({
  plugins: [ 
     require('metascraper-clearbit-logo')()
  ]
})

const meta = await metascraper({html, url}) 

instead of something that you need to provide every time:

const metascraper = require('metascraper')
const plugins = [ require('metascraper-clearbit-logo')() ]
const meta = await metascraper({html, url, plugins}) 

The old way (1.x) is adding rules exposed at metascraper.rules. I don't like this way because you are mutating things and it's less explicit.

coveralls (Collaborator) commented Dec 13, 2017

Coverage Status

Coverage increased (+0.3%) to 98.02% when pulling f065096 on v3 into 3a81306 on master.

Kikobeats (Member, Author) commented Dec 14, 2017

Update #1

Monorepo

I started this approach just as an experiment, but now I'm seeing the value of using a monorepo when you need to ship breaking API changes.

I moved the property rules into independent packages. I also created a scoped package called @metascraper/helpers for utils.

Config file

To reduce breaking API changes, I tried an interesting approach: make it possible to load the rule set from a config file and use the default rule set as a fallback.

This approach is followed by projects like babel or prettier (actually they were an inspiration for doing that!):

{
  "rules": [
    "metascraper-author",
    "metascraper-date",
    "metascraper-description",
    "metascraper-image",
    "metascraper-logo",
    {"metascraper-clearbit-logo": { // specific config!
      "format": "jpg"
    }},
    "metascraper-publisher",
    "metascraper-title",
    "metascraper-url"
  ]
}

This is loaded at bootstrap time and never again. Note that the order is important.
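
As a sketch of how such a config could be consumed, the mixed entries (plain package names and `{name: options}` objects) can be normalized into ordered pairs before the packages are loaded. `normalizeRules` is a hypothetical helper, not part of metascraper, and the config below is a shortened copy of the example.

```javascript
'use strict'

// Hypothetical helper (assumption): turns the mixed "rules" entries from the
// config into ordered [packageName, options] pairs.
function normalizeRules (config) {
  return config.rules.map(entry => {
    if (typeof entry === 'string') return [entry, {}]
    const [name] = Object.keys(entry)
    return [name, entry[name]]
  })
}

const config = {
  rules: [
    'metascraper-logo',
    { 'metascraper-clearbit-logo': { format: 'jpg' } },
    'metascraper-url'
  ]
}

// The array order from the config is preserved.
const normalized = normalizeRules(config)
```

Each pair could then be resolved with something like `require(name)(options)`. Keeping the array order matters because, in the example, metascraper-clearbit-logo presumably has to run after metascraper-logo so that it can see the favicon fallback.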

Rules API

The rules interface changes a bit. A package now needs to export an array of rules and the propName to be used in the output. An example updating metascraper-clearbit-logo:

module.exports = opts => {
  opts = Object.assign({}, DEFAULTS, opts)
  const { size, format } = opts

  const rules = [({htmlDom, meta, url: baseUrl}) => {
    const {origin, hostname} = new URL(baseUrl)
    if (meta.logo !== `${origin}/favicon.ico`) return
    return `${ENDPOINT}/${hostname}?size=${size}&format=${format}`
  }]

  rules.propName = 'logo'

  return rules
}
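
A minimal sketch of how rule arrays in this shape might be evaluated, assuming the first rule in a bundle that returns a value wins and that later bundles can read what earlier ones produced through `meta`. `applyRuleSets` and the two stub bundles are hypothetical and only illustrate why order matters; the real core may work differently.

```javascript
'use strict'

const { URL } = require('url')

// Hypothetical evaluator (assumption): for each bundle, the first rule that
// returns a value sets meta[propName]; later bundles see earlier results.
function applyRuleSets ({ htmlDom, url, ruleSets }) {
  const meta = {}
  for (const rules of ruleSets) {
    for (const rule of rules) {
      const value = rule({ htmlDom, meta, url })
      if (value !== undefined) {
        meta[rules.propName] = value
        break
      }
    }
  }
  return meta
}

// Stub bundle standing in for a core logo rule that falls back to the favicon.
const faviconLogo = [({ url }) => `${new URL(url).origin}/favicon.ico`]
faviconLogo.propName = 'logo'

// Stub bundle in the new shape: only overrides the favicon fallback.
const betterLogo = [({ meta, url }) => {
  const { origin, hostname } = new URL(url)
  if (meta.logo !== `${origin}/favicon.ico`) return
  return `https://logo.clearbit.com/${hostname}`
}]
betterLogo.propName = 'logo'
```

With `ruleSets: [faviconLogo, betterLogo]` the favicon is replaced by the Clearbit URL; with the order reversed, `betterLogo` runs before any logo exists and the favicon fallback stays.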
