feat: web scraper and llm processor #168
Conversation
Pull Request Overview
This PR introduces web scraping capabilities using Apify and LLM processing for content summarization. The implementation adds support for web scraping jobs that collect web page content and generate AI-powered summaries using the Gemini LLM.
- Adds web scraping via Apify's website content crawler actor
- Implements LLM processing client for content summarization using Gemini
- Creates new web job type that combines scraping and LLM processing
- Updates configuration and capabilities detection to support the new workflow
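To make the flow described above concrete, here is a minimal sketch of how the two clients could be composed in the web job. Every identifier in it (the interfaces, `RunWebJob`, the method signatures) is an assumption for illustration, not the PR's actual API:

```go
// Sketch of the combined web job flow: scrape a page via the Apify-backed
// client, then summarize the result with the Gemini-backed LLM client.
// All identifiers below are illustrative assumptions, not the PR's real API.
package jobs

import (
	"context"
	"fmt"
)

// Scraper abstracts the webapify client.
type Scraper interface {
	Scrape(ctx context.Context, url string) (string, error)
}

// Summarizer abstracts the llmapify client.
type Summarizer interface {
	Summarize(ctx context.Context, content string) (string, error)
}

// RunWebJob scrapes a URL and returns an LLM-generated summary of its content.
func RunWebJob(ctx context.Context, s Scraper, l Summarizer, url string) (string, error) {
	content, err := s.Scrape(ctx, url)
	if err != nil {
		return "", fmt.Errorf("scrape %s: %w", url, err)
	}
	summary, err := l.Summarize(ctx, content)
	if err != nil {
		return "", fmt.Errorf("summarize %s: %w", url, err)
	}
	return summary, nil
}
```

Keeping the two clients behind small interfaces like this would also let the job be tested without live Apify or Gemini credentials.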
Reviewed Changes
Copilot reviewed 21 out of 22 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tee/masa-tee-worker.json | Adds GEMINI_API_KEY environment variable |
| pkg/client/apify_client.go | Adds DatasetId field to DatasetResponse and propagates it |
| internal/jobs/webapify/client.go | New web scraping client using Apify's website content crawler |
| internal/jobs/llmapify/client.go | New LLM processing client for content summarization |
| internal/jobs/web.go | Main web scraper job implementation combining scraping and LLM |
| internal/config/config.go | Adds Gemini API key configuration and WebConfig struct |
| internal/capabilities/detector.go | Updates capability detection for web jobs requiring both keys |
| Various test files | Comprehensive test coverage for new functionality |
| go.mod | Updates tee-types dependency and removes unused web scraper dependencies |
| Makefile | Updates test commands for new module structure |
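As a rough sketch of the configuration changes listed in the table, the new settings might be loaded like this. Only the `WebConfig` name and the `GEMINI_API_KEY` variable come from the file summary; the `ApifyApiKey` field and `APIFY_API_KEY` variable are assumptions:

```go
// Illustrative sketch of the configuration additions; struct layout and the
// Apify variable name are assumptions based on the file summary, not the diff.
package config

import "os"

// WebConfig holds settings for the new web scraping job.
type WebConfig struct {
	GeminiApiKey string // read from the GEMINI_API_KEY environment variable
	ApifyApiKey  string // assumed: token for the Apify website content crawler
}

// LoadWebConfig reads the web job settings from the environment.
func LoadWebConfig() WebConfig {
	return WebConfig{
		GeminiApiKey: os.Getenv("GEMINI_API_KEY"),
		ApifyApiKey:  os.Getenv("APIFY_API_KEY"), // variable name is an assumption
	}
}
```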
Comments suppressed due to low confidence (1)
internal/jobs/stats/stats.go:72
- The struct field `jobConfiguration` is declared but not defined in the visible diff. This appears to be an incomplete change where the field declaration line shows an incorrect diff marker.
`jobConfiguration config.JobConfiguration`
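For reference, the complete declaration would sit inside its enclosing struct, along these lines; the struct name and import path below are guesses for illustration only:

```go
// Hypothetical sketch of how the flagged field might appear in context;
// only the jobConfiguration line itself is in the diff, the rest is assumed.
package stats

import "github.com/masa-finance/tee-worker/internal/config" // assumed module path

// Stats carries per-job bookkeeping (illustrative name).
type Stats struct {
	jobConfiguration config.JobConfiguration
}
```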
Still reviewing, but I see one issue: the old web job is what we're using in the E2E tests, since it requires no API tokens. Can we somehow keep it for internal use, or do you have some idea as to how to get around it? Otherwise we'll have to either add tokens to GH Actions or disable most E2E tests in CI, which defeats their purpose.
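For concreteness, the gating at issue is presumably along these lines, with the web capability only advertised when both tokens are configured (all names here are illustrative, not the actual detector code):

```go
// Sketch of the gating the comment is concerned about: a CI environment
// without secrets would fail this check and lose the web job entirely.
// The struct and function names are assumptions for illustration.
package capabilities

type webKeys struct {
	ApifyApiKey  string
	GeminiApiKey string
}

// webJobAvailable reports whether the new web job can run.
func webJobAvailable(k webKeys) bool {
	return k.ApifyApiKey != "" && k.GeminiApiKey != ""
}
```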
Pull Request Overview
Copilot reviewed 22 out of 23 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
internal/config/config.go:1
- The TODO comment indicates incomplete implementation of Gemini API key validation. This should either be implemented or the comment should be more specific about what validation is needed and when it will be implemented.
`package config`
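One way the flagged validation could eventually look, under the assumption that it amounts to a presence check when web jobs are enabled (the function name, parameters, and error wording are all assumptions, not the PR's code):

```go
// Sketch of the Gemini API key validation the TODO likely refers to:
// fail fast when web jobs are enabled but no key is configured.
package config

import "errors"

// validateGeminiKey returns an error if web jobs are enabled without a key.
func validateGeminiKey(key string, webJobsEnabled bool) error {
	if webJobsEnabled && key == "" {
		return errors.New("GEMINI_API_KEY must be set when web jobs are enabled")
	}
	return nil
}
```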
What
This PR fixes https://github.com/masa-finance/tee-indexer/issues/357 and adds support for web scraping via Apify. It also implements an LLM actor client, used alongside the web scraping client in the web job to produce an LLM summary of the web page's content.
Why
We want to pivot to supporting web scraping exclusively through the worker, including a summary of the content for indexing purposes.