feat: HTTP request tool #9228
);
}
const returnData: string[] = [];
const html = cheerio.load(response);
We could use @mozilla/readability and jsdom here to cleanly extract the content that's likely relevant to an end-user.
Something like this perhaps:
import { JSDOM } from 'jsdom'
import { Readability } from '@mozilla/readability'
const dom = await JSDOM.fromURL(url)
const article = new Readability(dom.window.document, {
  keepClasses: true,
}).parse()
and then use article.content.
We could also consider using turndown to convert the HTML into Markdown, which LLMs tend to handle better than HTML, IMO.
import Turndown from 'turndown'
// Create a converter instance before calling turndown()
const turndown = new Turndown()
const markdown = turndown.turndown(article.content)
@netroy what would be the advantages over html-to-text + cheerio? We are already using that setup for the HTML node.
cheerio is great for extracting text with CSS selectors, but it leaves the burden of determining the semantics in the markup to the end user.
First, download an article:
curl https://www.bbc.com/news/articles/cldd6x6gglxo > news.html
Then try this with cheerio:
const fs = require('fs');
const cheerio = require('cheerio');
const html = fs.readFileSync('news.html', 'utf8');
const $ = cheerio.load(html);
console.log($('body').text());
With just readability:
(async () => {
  const { JSDOM } = require('jsdom');
  const { Readability } = require('@mozilla/readability');
  // Load the saved page into a DOM that Readability can walk
  const dom = await JSDOM.fromFile('news.html');
  const article = new Readability(dom.window.document, {
    keepClasses: true,
  }).parse();
  // textContent is the extracted article with all markup stripped
  console.log(article.textContent);
})();
With readability + turndown:
(async () => {
  const { JSDOM } = require('jsdom');
  const { Readability } = require('@mozilla/readability');
  const Turndown = require('turndown');
  const dom = await JSDOM.fromFile('news.html');
  const article = new Readability(dom.window.document, {
    keepClasses: true,
  }).parse();
  const turndown = new Turndown({
    headingStyle: 'atx',
    hr: '---',
    bulletListMarker: '-',
    codeBlockStyle: 'fenced',
  });
  const markdown = turndown.turndown(article.content);
  console.log(markdown);
})();
Perhaps we should add an "Extract as Markdown" option to the node, so the user can decide whether to keep markup that reduces semantic noise in the extracted text?
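To make the suggestion concrete, here is a minimal sketch of how such an option could pick the output format. It is only an illustration, not code from this PR: `formatExtracted`, the `extractAsMarkdown` flag, and the injected `toMarkdown` converter are hypothetical names. In practice `toMarkdown` would presumably wrap a turndown instance, e.g. `(html) => turndown.turndown(html)`. The null check is there because Readability's `parse()` can return null when it cannot extract an article.

```javascript
// Hypothetical helper (not part of the PR): choose the tool's output format.
// `article` is whatever Readability's parse() returned, which may be null.
// `toMarkdown` is an injected converter; the default just passes HTML through.
function formatExtracted(article, { extractAsMarkdown = false } = {}, toMarkdown = (html) => html) {
  if (!article) {
    // Nothing could be extracted, so return an empty result instead of throwing
    return '';
  }
  // Markdown keeps headings, lists and links; textContent drops all markup
  return extractAsMarkdown ? toMarkdown(article.content) : article.textContent;
}

// Usage with stand-in values; a real call would pass Readability output
// and a converter backed by a turndown instance.
const sample = { content: '<p>Hello</p>', textContent: 'Hello' };
const fakeToMarkdown = (html) => `md:${html}`;
console.log(formatExtracted(sample, { extractAsMarkdown: true }, fakeToMarkdown));
console.log(formatExtracted(sample));
```

Injecting the converter keeps the branching logic independent of turndown, so the option can be unit tested without loading jsdom or fetching a page.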
Summary
Tool to visit a website
Related tickets and issues
https://linear.app/n8n/issue/AI-162/tool-to-visit-a-website