feat: web scraper and llm processor #168
Conversation
Pull Request Overview
This PR introduces web scraping capabilities using Apify and LLM processing for content summarization. The implementation adds support for web scraping jobs that collect web page content and generate AI-powered summaries using the Gemini LLM.
- Adds web scraping via Apify's website content crawler actor
- Implements LLM processing client for content summarization using Gemini
- Creates new web job type that combines scraping and LLM processing
- Updates configuration and capabilities detection to support the new workflow
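To make the flow described above concrete, here is a minimal sketch of how the two clients could be composed in the web job. Every identifier in it (the interfaces, `RunWebJob`, the method signatures) is an assumption for illustration, not the PR's actual API:

```go
// Sketch of the combined web job flow: scrape a page via the Apify-backed
// client, then summarize the result with the Gemini-backed LLM client.
// All identifiers below are illustrative assumptions, not the PR's real API.
package jobs

import (
	"context"
	"fmt"
)

// Scraper abstracts the webapify client.
type Scraper interface {
	Scrape(ctx context.Context, url string) (string, error)
}

// Summarizer abstracts the llmapify client.
type Summarizer interface {
	Summarize(ctx context.Context, content string) (string, error)
}

// RunWebJob scrapes a URL and returns an LLM-generated summary of its content.
func RunWebJob(ctx context.Context, s Scraper, l Summarizer, url string) (string, error) {
	content, err := s.Scrape(ctx, url)
	if err != nil {
		return "", fmt.Errorf("scrape %s: %w", url, err)
	}
	summary, err := l.Summarize(ctx, content)
	if err != nil {
		return "", fmt.Errorf("summarize %s: %w", url, err)
	}
	return summary, nil
}
```

Keeping the two clients behind small interfaces like this would also let the job be tested without live Apify or Gemini credentials.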
Reviewed Changes
Copilot reviewed 21 out of 22 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tee/masa-tee-worker.json | Adds GEMINI_API_KEY environment variable |
| pkg/client/apify_client.go | Adds DatasetId field to DatasetResponse and propagates it |
| internal/jobs/webapify/client.go | New web scraping client using Apify's website content crawler |
| internal/jobs/llmapify/client.go | New LLM processing client for content summarization |
| internal/jobs/web.go | Main web scraper job implementation combining scraping and LLM |
| internal/config/config.go | Adds Gemini API key configuration and WebConfig struct |
| internal/capabilities/detector.go | Updates capability detection for web jobs requiring both keys |
| Various test files | Comprehensive test coverage for new functionality |
| go.mod | Updates tee-types dependency and removes unused web scraper dependencies |
| Makefile | Updates test commands for new module structure |
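As a rough sketch of the configuration changes listed in the table, the new settings might be loaded like this. Only the `WebConfig` name and the `GEMINI_API_KEY` variable come from the file summary; the `ApifyApiKey` field and `APIFY_API_KEY` variable are assumptions:

```go
// Illustrative sketch of the configuration additions; struct layout and the
// Apify variable name are assumptions based on the file summary, not the diff.
package config

import "os"

// WebConfig holds settings for the new web scraping job.
type WebConfig struct {
	GeminiApiKey string // read from the GEMINI_API_KEY environment variable
	ApifyApiKey  string // assumed: token for the Apify website content crawler
}

// LoadWebConfig reads the web job settings from the environment.
func LoadWebConfig() WebConfig {
	return WebConfig{
		GeminiApiKey: os.Getenv("GEMINI_API_KEY"),
		ApifyApiKey:  os.Getenv("APIFY_API_KEY"), // variable name is an assumption
	}
}
```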
Comments suppressed due to low confidence (1)
internal/jobs/stats/stats.go:72
- The struct field `jobConfiguration` is declared but not defined in the visible diff. This appears to be an incomplete change where the field declaration line shows an incorrect diff marker.
`jobConfiguration config.JobConfiguration`
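For reference, the complete declaration would sit inside its enclosing struct, along these lines; the struct name and import path below are guesses for illustration only:

```go
// Hypothetical sketch of how the flagged field might appear in context;
// only the jobConfiguration line itself is in the diff, the rest is assumed.
package stats

import "github.com/masa-finance/tee-worker/internal/config" // assumed module path

// Stats carries per-job bookkeeping (illustrative name).
type Stats struct {
	jobConfiguration config.JobConfiguration
}
```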
Still reviewing, but I see one issue: the old web job is what we're using in the E2E tests, since it requires no API tokens. Can we somehow keep it for internal use, or do you have some idea as to how to get around it? Otherwise we'll have to either add tokens to GH Actions or disable most E2E tests in CI, which defeats their purpose.
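For concreteness, the gating at issue is presumably along these lines, with the web capability only advertised when both tokens are configured (all names here are illustrative, not the actual detector code):

```go
// Sketch of the gating the comment is concerned about: a CI environment
// without secrets would fail this check and lose the web job entirely.
// The struct and function names are assumptions for illustration.
package capabilities

type webKeys struct {
	ApifyApiKey  string
	GeminiApiKey string
}

// webJobAvailable reports whether the new web job can run.
func webJobAvailable(k webKeys) bool {
	return k.ApifyApiKey != "" && k.GeminiApiKey != ""
}
```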
Pull Request Overview
Copilot reviewed 22 out of 23 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
internal/config/config.go:1
- The TODO comment indicates incomplete implementation of Gemini API key validation. This should either be implemented or the comment should be more specific about what validation is needed and when it will be implemented.
`package config`
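One way the flagged validation could eventually look, under the assumption that it amounts to a presence check when web jobs are enabled (the function name, parameters, and error wording are all assumptions, not the PR's code):

```go
// Sketch of the Gemini API key validation the TODO likely refers to:
// fail fast when web jobs are enabled but no key is configured.
package config

import "errors"

// validateGeminiKey returns an error if web jobs are enabled without a key.
func validateGeminiKey(key string, webJobsEnabled bool) error {
	if webJobsEnabled && key == "" {
		return errors.New("GEMINI_API_KEY must be set when web jobs are enabled")
	}
	return nil
}
```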
What
This PR fixes https://github.com/masa-finance/tee-indexer/issues/357 and adds support for web scraping via Apify. It also implements an LLM actor client, used alongside the web scraping client in the web job to produce an LLM summary of the web page's content.
Why
We want to pivot to supporting web scraping exclusively through the worker, including a summary of the content for indexing purposes.