Links are now extracted after applying excludeTags by txrp0x9 · Pull Request #828 · firecrawl/firecrawl

txrp0x9 · 2024-10-27T23:23:51Z

fixes #701

nickscamara · 2024-10-28T02:19:21Z

apps/api/src/scraper/WebScraper/single_url.ts


    if (pageOptions.includeLinks) {
-      linksOnPage = extractLinks(rawHtml, urlToScrap);
+      linksOnPage = extractLinks(html, urlToScrap);


I think the problem with this is that we rely on this function for our /crawl, which will end up failing to grab all the links if we don't pass the raw version.

@mogery lmk if I'm wrong

A global code search did not reveal any usage of the linksOnPage field anywhere other than the API response.
I believe the crawler uses a separate extractLinksFromHTML function, called at
https://github.com/mendableai/firecrawl/blob/8a4f4cb9d98884bc70f4cf188a2c4dc87f656462/apps/api/src/services/queue-worker.ts#L374 and
https://github.com/mendableai/firecrawl/blob/8a4f4cb9d98884bc70f4cf188a2c4dc87f656462/apps/api/src/scraper/WebScraper/crawler.ts#L382
and defined at
https://github.com/mendableai/firecrawl/blob/8a4f4cb9d98884bc70f4cf188a2c4dc87f656462/apps/api/src/scraper/WebScraper/crawler.ts#L322-L337
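
To make the behavior change concrete, here is a minimal sketch of the difference between extracting links from the raw page and from the page after excludeTags has been applied. The helpers `removeExcludedTags` and `extractLinksSketch` are hypothetical stand-ins written for this illustration, not the project's `extractLinks` or `extractLinksFromHTML`. The sketch also assumes excludeTags contains plain tag names only and uses regexes for brevity where a real implementation would use an HTML parser.

```typescript
// Hypothetical helpers for illustration only; not Firecrawl's implementation.

// Remove every element whose tag name is listed in `excludeTags`, contents included.
// Assumes plain tag names and no nesting of the same tag (regex-based for brevity).
function removeExcludedTags(rawHtml: string, excludeTags: string[]): string {
  let html = rawHtml;
  for (const tag of excludeTags) {
    const pattern = new RegExp(`<${tag}\\b[^>]*>[\\s\\S]*?</${tag}>`, "gi");
    html = html.replace(pattern, "");
  }
  return html;
}

// Collect de-duplicated absolute URLs from every <a href="..."> in the given HTML.
function extractLinksSketch(html: string, baseUrl: string): string[] {
  const links = new Set<string>();
  const hrefPattern = /<a\b[^>]*\bhref\s*=\s*["']([^"']+)["']/gi;
  let match: RegExpExecArray | null;
  while ((match = hrefPattern.exec(html)) !== null) {
    try {
      links.add(new URL(match[1], baseUrl).href);
    } catch {
      // Skip hrefs that cannot be resolved against the base URL.
    }
  }
  return Array.from(links);
}

const rawHtml = `
  <nav><a href="/pricing">Pricing</a></nav>
  <main><a href="/docs/intro">Intro</a></main>
  <footer><a href="https://example.com/legal">Legal</a></footer>`;

// The page content after excludeTags has been applied.
const html = removeExcludedTags(rawHtml, ["nav", "footer"]);

// Before this PR: links for the API response come from the unfiltered page.
const linksFromRaw = extractLinksSketch(rawHtml, "https://example.com");

// After this PR: links for the API response come from the filtered page.
const linksFromFiltered = extractLinksSketch(html, "https://example.com");

console.log(linksFromRaw.length, linksFromFiltered);
```

With this input, the raw page yields three links while the filtered page yields only the `/docs/intro` link. A crawler that discovers URLs through its own extraction over the raw HTML would be unaffected by the filtering.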

Right, makes sense.

links are now extracted after applying excludeTags

3bec009

nickscamara reviewed Oct 28, 2024

txrp0x9 wants to merge 1 commit into firecrawl:main from txrp0x9:links_with_excludetags

nickscamara Oct 31, 2024
