Links are now extracted after applying excludeTags #828
txrp0x9 wants to merge 1 commit into firecrawl:main
  if (pageOptions.includeLinks) {
-   linksOnPage = extractLinks(rawHtml, urlToScrap);
+   linksOnPage = extractLinks(html, urlToScrap);
I think the problem with this is that we rely on this function for our /crawl, which will end up failing to grab all the links if we don't pass the raw version.
A global code search did not reveal any usage of the linksOnPage field anywhere other than in the API response.
I believe the crawler uses a separate extractLinksFromHTML function
https://github.com/mendableai/firecrawl/blob/8a4f4cb9d98884bc70f4cf188a2c4dc87f656462/apps/api/src/services/queue-worker.ts#L374 and
https://github.com/mendableai/firecrawl/blob/8a4f4cb9d98884bc70f4cf188a2c4dc87f656462/apps/api/src/scraper/WebScraper/crawler.ts#L382
defined as
https://github.com/mendableai/firecrawl/blob/8a4f4cb9d98884bc70f4cf188a2c4dc87f656462/apps/api/src/scraper/WebScraper/crawler.ts#L322-L337
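To make the trade-off in this thread concrete, here is a minimal TypeScript sketch. It is not the code from this PR: removeExcludedTags and extractLinksSketch are hypothetical helpers, the regex-based parsing is a simplification (a real implementation would likely use an HTML parser), and it assumes excludeTags holds plain tag names.

```typescript
// Hypothetical helpers for illustration only; not the project's implementation.

// Drop every element whose tag name appears in excludeTags.
// Naive regex approach: assumes plain tag names and no nesting of the same tag.
function removeExcludedTags(rawHtml: string, excludeTags: string[]): string {
  let html = rawHtml;
  for (const tag of excludeTags) {
    // Skip anything that is not a plain tag name (e.g. CSS selectors).
    if (!/^[a-z][a-z0-9]*$/i.test(tag)) continue;
    const element = new RegExp(`<${tag}\\b[^>]*>[\\s\\S]*?</${tag}>`, "gi");
    html = html.replace(element, "");
  }
  return html;
}

// Collect the distinct href values of <a> tags, in document order.
// Relative URLs are returned as-is; no resolution against a base URL is done.
function extractLinksSketch(html: string): string[] {
  const anchor = /<a\b[^>]*?\bhref\s*=\s*["']([^"']+)["']/gi;
  const links: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = anchor.exec(html)) !== null) {
    const href = match[1];
    if (links.indexOf(href) === -1) links.push(href);
  }
  return links;
}

const rawHtml =
  '<nav><a href="https://example.com/about">About</a></nav>' +
  '<main><a href="https://example.com/post">Post</a></main>';

const filteredHtml = removeExcludedTags(rawHtml, ["nav"]);

// Links taken from the filtered HTML, as this PR does for linksOnPage.
const linksOnPage = extractLinksSketch(filteredHtml);
// Links taken from the raw HTML, which is what a crawler would want to follow.
const crawlLinks = extractLinksSketch(rawHtml);

console.log(linksOnPage);
console.log(crawlLinks);
```

In this sketch linksOnPage contains only the post link, while crawlLinks also keeps the link inside nav. That difference only matters for /crawl if the crawler reads the same filtered output, which is the point under discussion above.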
fixes #701