Add support for plugins #41
Use default rules as fallback
Update #1
Monorepo
I started this approach just as an experiment, but now I see the value of using a monorepo when you need to ship breaking API changes. I moved the property rules into independent packages and am also creating a domain package.
Config file
To reduce breaking API changes, I tried an interesting approach: make it possible to load the rule set from the config file and use the default rule set as a fallback. This approach is followed by projects like babel or prettier (they were actually the inspiration for it!):
{
  "rules": [
    "metascraper-author",
    "metascraper-date",
    "metascraper-description",
    "metascraper-image",
    "metascraper-logo",
    {"metascraper-clearbit-logo": { // specific config!
      "format": "jpg"
    }},
    "metascraper-publisher",
    "metascraper-title",
    "metascraper-url"
  ]
}
This is loaded once, at bootstrap time, and never again. Note that the order is important.
Rules API
The rules interface changes a bit. It needs to export an array of rule functions with a propName property, here built by a function that receives the options:
// DEFAULTS and ENDPOINT are constants defined elsewhere in the package.
const { URL } = require('url')

module.exports = opts => {
  // Merge the user options over the defaults without mutating either.
  opts = Object.assign({}, DEFAULTS, opts)
  const { size, format } = opts
  const rules = [({ htmlDom, meta, url: baseUrl }) => {
    const { origin, hostname } = new URL(baseUrl)
    // Only apply when the detected logo is the favicon fallback.
    if (meta.logo !== `${origin}/favicon.ico`) return
    return `${ENDPOINT}/${hostname}?size=${size}&format=${format}`
  }]
  // Name of the metadata property these rules produce.
  rules.propName = 'logo'
  return rules
}
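To make the role of ordering concrete, here is a minimal sketch of how a config like the one above could be resolved and applied. Everything in it is an assumption for illustration: the `registry` object stands in for `require`, `loadRules` and `applyRules` are hypothetical helpers, `https://logo.example` is a placeholder endpoint, and the "rule sets run in config order, the first value returned inside a set is written to `meta[propName]`" semantics are a guess, not metascraper's confirmed behavior.

```javascript
// Hypothetical registry standing in for require(): maps a package name to
// a factory `opts => rules`, where `rules` is an array with a `propName`.
const registry = {
  'metascraper-logo': () => {
    const rules = [({ url }) => `${new URL(url).origin}/favicon.ico`]
    rules.propName = 'logo'
    return rules
  },
  'metascraper-clearbit-logo': ({ format = 'png' } = {}) => {
    const rules = [({ meta, url }) => {
      const { origin, hostname } = new URL(url)
      if (meta.logo !== `${origin}/favicon.ico`) return
      // Placeholder endpoint, not the real service URL.
      return `https://logo.example/${hostname}?format=${format}`
    }]
    rules.propName = 'logo'
    return rules
  }
}

// Turn config entries (a string, or { name: options }) into rule sets,
// preserving the order in which they were declared.
const loadRules = entries => entries.map(entry => {
  const [name, opts] = typeof entry === 'string'
    ? [entry, {}]
    : Object.entries(entry)[0]
  return registry[name](opts)
})

// Assumed semantics: rule sets run in order; within a set, the first rule
// that returns a value wins and is written to meta[propName].
const applyRules = (ruleSets, url) => ruleSets.reduce((meta, rules) => {
  for (const rule of rules) {
    const value = rule({ meta, url })
    if (value !== undefined) return { ...meta, [rules.propName]: value }
  }
  return meta
}, {})

const config = {
  rules: ['metascraper-logo', { 'metascraper-clearbit-logo': { format: 'jpg' } }]
}
const meta = applyRules(loadRules(config.rules), 'https://example.com/post')
```

Under these assumptions, swapping the two entries changes the result: the Clearbit rule would run before any logo has been detected, find nothing to replace, and the favicon fallback would win.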
Introduction
metascraper is a set of rules, each built around a specific value that we want to extract.
They are enough for the most common cases, but sometimes we need to go a little further and we need to define specific rules to reach one of these goals:
All of these cases are outside the scope of metascraper because they are too closely tied to one specific target.
To illustrate, think about the amazon.com site.
Our goal is to extract product information from the site.
If you check the HTML content of the website, the markup is too specific: a lot of auto-generated classes around a lot of div wrappers. Also, there is probably a better way to extract the content, for example getting specific product information through the Amazon API.
On the other hand, we are interested in product content, like price or availability, and this information is definitely outside the meta information that metascraper extracts.
metascraper plugins are exactly for these tasks: safely extending the metascraper core rules to support custom and specific cases.
How to define a plugin
Now, when you load metascraper, you need to initialize the constructor by passing an array of plugins that extend the metascraper rule set:
These plugins need to follow a tiny interface. For example, let's build a plugin that uses the Clearbit Logo API:
We can find two points here:
- test function. It is used to determine whether the plugin applies to the current link. In my case, I want to use the plugin only when the detected logo is the last fallback, which is based on the site favicon.
- object. This information is merged with the basic information detected by metascraper.
That's all!
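The two points above can be sketched roughly as follows. This is only an illustration under assumptions: the exact plugin shape (a function with a `test` property), the `applyPlugins` helper, and the `https://logo.example` placeholder endpoint are all hypothetical and not taken from the metascraper source.

```javascript
// Hypothetical plugin shape: a function returning extra fields, plus a
// `test` predicate deciding whether it applies to the current link.
const clearbitLogo = ({ url }) => ({
  // Placeholder endpoint, not the real service URL.
  logo: `https://logo.example/${new URL(url).hostname}`
})
clearbitLogo.test = ({ meta, url }) =>
  meta.logo === `${new URL(url).origin}/favicon.ico`

// Merge the output of every applicable plugin over the base metadata.
const applyPlugins = (plugins, base, url) => plugins
  .filter(plugin => plugin.test({ meta: base, url }))
  .reduce((meta, plugin) => ({ ...meta, ...plugin({ meta, url }) }), base)

const url = 'https://example.com/post'
const withFavicon = applyPlugins(
  [clearbitLogo],
  { title: 'Post', logo: 'https://example.com/favicon.ico' },
  url
)
const withRealLogo = applyPlugins(
  [clearbitLogo],
  { title: 'Post', logo: 'https://example.com/logo.png' },
  url
)
```

In this sketch the plugin only replaces the logo when the detected one is the favicon fallback, and every other detected field is left untouched.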
The only thing you need to do is initialize metascraper with metascraper-clearbit-logo:
On this Pull Request
metascraper-clearbit-logo.
Discussion Points
Core rules as plugins
Following this approach, we would move the current rules into plugins, one per value, for example:
metascraper-author
metascraper-description
metascraper-image
and then just load these plugins by default as part of the bootstrapping:
This would be the same as doing it explicitly:
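A minimal sketch of that equivalence, assuming a constructor that accepts a `rules` option. The names `createMetascraper`, `DEFAULT_RULES` and `makeRules` are hypothetical, and the rule sets are trivial stand-ins for the real packages. It also shows that dropping one property's rules amounts to filtering the list.

```javascript
// Hypothetical stand-ins for the per-property rule packages:
// an array of rule functions tagged with the property they produce.
const makeRules = propName =>
  Object.assign([() => `${propName} value`], { propName })
const DEFAULT_RULES = ['author', 'description', 'image'].map(makeRules)

// Sketch of a constructor: fall back to the defaults when no rules are given.
const createMetascraper = ({ rules = DEFAULT_RULES } = {}) => () =>
  rules.reduce((meta, set) => ({ ...meta, [set.propName]: set[0]() }), {})

// Implicit defaults and explicit defaults behave the same.
const implicit = createMetascraper()()
const explicit = createMetascraper({ rules: DEFAULT_RULES })()

// Excluding a property is just a matter of filtering the list.
const noImage = createMetascraper({
  rules: DEFAULT_RULES.filter(set => set.propName !== 'image')
})()
```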
The advantage of doing this is the possibility of excluding specific rule sets around props. It also removes the need for plugins as a way to differentiate core and external rule sets: they are all just rules.
API interface
I implemented the plugins definition as a breaking interface change, and I'm not sure it's the best way. I'm assuming that you want to define the plugins to use just once:
instead of something that you need to provide every time:
The old way (1.x) is adding rules to the ones exposed at metascraper.rules. I don't like this way because you are mutating things and it's less explicit.
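The trade-off can be sketched as follows, with hypothetical names throughout (`shared`, `createScraper`, `run`, and the toy rule sets are illustrations, not the metascraper API). The first style mutates a shared list that every consumer sees; the second fixes the rules once per instance.

```javascript
// Hypothetical rule sets: arrays of functions tagged with a propName.
const titleRules = Object.assign([() => 'A title'], { propName: 'title' })
const priceRules = Object.assign([() => '9.99'], { propName: 'price' })

// Run every rule set and collect one value per property.
const run = rules => rules.reduce(
  (meta, set) => ({ ...meta, [set.propName]: set[0]() }), {})

// Mutation style: a shared, mutable rules list.
const shared = { rules: [titleRules] }
shared.rules.push(priceRules) // every consumer of `shared` now sees `price`

// Factory style: rules are fixed once per instance and nothing is mutated.
const createScraper = rules => {
  const frozen = Object.freeze([...rules])
  return () => run(frozen)
}
const basic = createScraper([titleRules])
const shop = createScraper([titleRules, priceRules])
```

With the factory, `basic` and `shop` stay independent of each other, whereas any code holding `shared` is affected by the `push`.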