feat: HTTP request tool #9228

Open
wants to merge 58 commits into
base: master

michael-radency
Contributor

Summary

Tool to visit a website

Related tickets and issues

https://linear.app/n8n/issue/AI-162/tool-to-visit-a-website

michael-radency added the node/new (Creation of an entirely new node) and n8n team (Authored by the n8n team) labels on Apr 26, 2024
);
}
const returnData: string[] = [];
const html = cheerio.load(response);
Member

We could use @mozilla/readability and jsdom here to cleanly extract the content that's likely relevant to an end-user.

Something like this perhaps:

import { JSDOM } from 'jsdom'
import { Readability } from '@mozilla/readability'

const dom = await JSDOM.fromURL(url)
const article = new Readability(dom.window.document, {
    keepClasses: true,
}).parse()

and then use article.content.

We could also consider using turndown to convert the HTML into Markdown, which LLMs tend to handle better than HTML, IMO.

import Turndown from 'turndown'

// Create a converter instance before calling turndown() on it
const turndown = new Turndown()
const markdown = turndown.turndown(article.content)

Contributor Author

@netroy
What would be the advantages over html-to-text + Cheerio? We are already using that setup for the HTML node.

Member

@netroy netroy May 23, 2024

cheerio is great for extracting text with CSS selectors, but it leaves the burden of determining the semantics in the markup to the end user.

First download an article via curl https://www.bbc.com/news/articles/cldd6x6gglxo > news.html.

Then try this with cheerio:

const fs = require('fs');
const cheerio = require('cheerio');
const html = fs.readFileSync('news.html', 'utf8');
const $ = cheerio.load(html);
console.log($('body').text());

I got this
image

With just readability:

(async () => {
	const { JSDOM } = require('jsdom');
	const { Readability } = require('@mozilla/readability');

	const dom = await JSDOM.fromFile('news.html');
	const article = new Readability(dom.window.document, {
		keepClasses: true,
	}).parse();
	console.log(article.textContent);
})();

I got this
image

With readability + turndown:

(async () => {
	const { JSDOM } = require('jsdom');
	const { Readability } = require('@mozilla/readability');
	const Turndown = require('turndown');

	const dom = await JSDOM.fromFile('news.html');
	const article = new Readability(dom.window.document, {
		keepClasses: true,
	}).parse();
	const turndown = new Turndown({
		headingStyle: 'atx',
		hr: '---',
		bulletListMarker: '-',
		codeBlockStyle: 'fenced',
	});
	const markdown = turndown.turndown(article.content);
	console.log(markdown);
})();

I got this
image

Perhaps we should add an "Extract as Markdown" option to the node, to determine whether we convert the extracted content to Markdown and so reduce semantic noise in the extracted text?
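
To make the suggestion concrete, here is a rough sketch of how such an option could switch between plain text and Markdown output. The names createExtractor, extractAsMarkdown, and defaultDeps are hypothetical and not part of this PR; the converters are passed in so the switch can be exercised without the libraries, and the actual node parameter could look quite different.

```javascript
// Sketch only: `createExtractor`, `extractAsMarkdown`, and `defaultDeps` are
// hypothetical names, not part of this PR or of any library.
function createExtractor({ parseArticle, htmlToMarkdown }) {
	return function extract(html, { extractAsMarkdown = false } = {}) {
		const article = parseArticle(html);
		// Readability's parse() returns null when it cannot find an article.
		if (article === null) return '';
		return extractAsMarkdown
			? htmlToMarkdown(article.content)
			: article.textContent.trim();
	};
}

// Wiring with the libraries discussed above. They are required lazily, so
// nothing is loaded until defaultDeps() is actually called.
function defaultDeps() {
	const { JSDOM } = require('jsdom');
	const { Readability } = require('@mozilla/readability');
	const Turndown = require('turndown');
	const turndown = new Turndown({ headingStyle: 'atx', codeBlockStyle: 'fenced' });
	return {
		parseArticle: (html) => new Readability(new JSDOM(html).window.document).parse(),
		htmlToMarkdown: (content) => turndown.turndown(content),
	};
}
```

With the three packages installed, createExtractor(defaultDeps()) gives a function that can be called as extract(html, { extractAsMarkdown: true }) for Markdown, or extract(html) for plain text.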
