@grantdfoster (Collaborator) commented Sep 5, 2025

What

This PR fixes https://github.com/masa-finance/tee-indexer/issues/357 and adds support for web scraping via Apify. It also implements an LLM actor client, which the web job uses alongside the web scraping client to produce an LLM summary of each web page's content.

Why

We want to pivot to supporting web scraping exclusively via the worker, including a summary of the content for indexing purposes.

Copilot AI (Contributor) left a comment

Pull Request Overview

This PR introduces web scraping capabilities using Apify and LLM processing for content summarization. The implementation adds support for web scraping jobs that collect web page content and generate AI-powered summaries using the Gemini LLM.

  • Adds web scraping via Apify's website content crawler actor
  • Implements LLM processing client for content summarization using Gemini
  • Creates new web job type that combines scraping and LLM processing
  • Updates configuration and capabilities detection to support the new workflow
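As a rough illustration of the workflow summarized above (scrape a page, then pass its content to an LLM for a summary), here is a minimal Go sketch. The `Scraper` and `Summarizer` interfaces, the `WebResult` type, and `RunWebJob` are hypothetical names chosen for this example; the actual clients in internal/jobs/webapify and internal/jobs/llmapify may be shaped quite differently.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Scraper and Summarizer are hypothetical interfaces used only for this
// sketch; they are not the PR's actual client types.
type Scraper interface {
	Scrape(ctx context.Context, url string) (string, error)
}

type Summarizer interface {
	Summarize(ctx context.Context, content string) (string, error)
}

// WebResult pairs the scraped content with its LLM summary (assumed shape).
type WebResult struct {
	URL     string
	Content string
	Summary string
}

// RunWebJob scrapes the page first, then summarizes whatever was scraped.
// An error from either step aborts the job and is wrapped with context.
func RunWebJob(ctx context.Context, s Scraper, l Summarizer, url string) (WebResult, error) {
	if url == "" {
		return WebResult{}, errors.New("url is required")
	}
	content, err := s.Scrape(ctx, url)
	if err != nil {
		return WebResult{}, fmt.Errorf("scrape %s: %w", url, err)
	}
	summary, err := l.Summarize(ctx, content)
	if err != nil {
		return WebResult{}, fmt.Errorf("summarize %s: %w", url, err)
	}
	return WebResult{URL: url, Content: content, Summary: summary}, nil
}

// fakeScraper and fakeSummarizer are in-memory stand-ins so the sketch
// runs without any API tokens.
type fakeScraper struct{}

func (fakeScraper) Scrape(_ context.Context, url string) (string, error) {
	return "page body for " + url, nil
}

type fakeSummarizer struct{}

func (fakeSummarizer) Summarize(_ context.Context, content string) (string, error) {
	return fmt.Sprintf("summary of %d bytes", len(content)), nil
}

// runDemo wires the fakes into RunWebJob for a single example URL.
func runDemo(url string) (WebResult, error) {
	return RunWebJob(context.Background(), fakeScraper{}, fakeSummarizer{}, url)
}

func main() {
	res, err := runDemo("https://example.com")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(res.URL, "->", res.Summary)
}
```

Keeping the two steps behind small interfaces is one possible design choice: it lets the job logic be exercised with in-memory fakes, independent of the real Apify and Gemini clients.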

Reviewed Changes

Copilot reviewed 21 out of 22 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| tee/masa-tee-worker.json | Adds GEMINI_API_KEY environment variable |
| pkg/client/apify_client.go | Adds DatasetId field to DatasetResponse and propagates it |
| internal/jobs/webapify/client.go | New web scraping client using Apify's website content crawler |
| internal/jobs/llmapify/client.go | New LLM processing client for content summarization |
| internal/jobs/web.go | Main web scraper job implementation combining scraping and LLM |
| internal/config/config.go | Adds Gemini API key configuration and WebConfig struct |
| internal/capabilities/detector.go | Updates capability detection for web jobs requiring both keys |
| Various test files | Comprehensive test coverage for new functionality |
| go.mod | Updates tee-types dependency and removes unused web scraper dependencies |
| Makefile | Updates test commands for new module structure |
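The detector change listed above (web jobs requiring both keys) could look roughly like the following sketch. It assumes the two keys are the Apify and Gemini API keys; the `Keys` struct, its field names, the `"web"` capability label, and `DetectCapabilities` are all hypothetical and are not taken from internal/capabilities/detector.go.

```go
package main

import "fmt"

// Keys is a hypothetical holder for the two credentials; the field names
// are assumptions made for this sketch.
type Keys struct {
	ApifyAPIKey  string
	GeminiAPIKey string
}

// DetectCapabilities reports the "web" capability only when both keys are
// set, since the web job needs scraping and LLM summarization together.
func DetectCapabilities(k Keys) []string {
	caps := []string{}
	if k.ApifyAPIKey != "" && k.GeminiAPIKey != "" {
		caps = append(caps, "web")
	}
	return caps
}

func main() {
	fmt.Println(DetectCapabilities(Keys{ApifyAPIKey: "a", GeminiAPIKey: "g"}))
	fmt.Println(DetectCapabilities(Keys{ApifyAPIKey: "a"}))
}
```

Under this assumption, a worker configured with only one of the two keys would not advertise the web capability at all, rather than accepting web jobs and failing at the summarization step.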
Comments suppressed due to low confidence (1)

internal/jobs/stats/stats.go:72

  • The struct field jobConfiguration is declared but not defined in the visible diff. This appears to be an incomplete change where the field declaration line shows an incorrect diff marker.
	jobConfiguration config.JobConfiguration


@mcamou (Contributor) left a comment

Still reviewing, but I see one issue: the old web job is what we're using in the E2E tests, since it requires no API tokens. Can we somehow keep it for internal use, or do you have an idea for how to get around this? Otherwise we'll have to either add tokens to GH Actions or disable most E2E tests in CI, which defeats their purpose.

@grantdfoster requested a review from mcamou on September 12, 2025
Copilot AI (Contributor) left a comment

Pull Request Overview

Copilot reviewed 22 out of 23 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

internal/config/config.go:1

  • The TODO comment indicates incomplete implementation of Gemini API key validation. This should either be implemented or the comment should be more specific about what validation is needed and when it will be implemented.
package config


@grantdfoster merged commit 4e69de9 into main on Sep 17, 2025 (7 of 9 checks passed)