Skip to content

Add embed information to url-metadata#121

Open
fennifith wants to merge 1 commit intomainfrom
oembed-url-metadata
Open

Add embed information to url-metadata#121
fennifith wants to merge 1 commit intomainfrom
oembed-url-metadata

Conversation

@fennifith
Copy link
Member

@fennifith fennifith commented Mar 3, 2026

Fixes #82 - copies the platform-detectors implementation from the frontend to run as part of the url-metadata task to store/upload github gists, posts, and video embed info.

Summary by CodeRabbit

  • New Features
    • Added metadata extraction support for GitHub Gists, X posts, and video platforms (YouTube, Vimeo, Twitch)
    • Enhanced webpage icon and banner image retrieval
    • Improved content enrichment with platform-specific metadata storage capabilities
    • Extended URL format support across multiple content platforms

@coderabbitai
Copy link

coderabbitai bot commented Mar 3, 2026

📝 Walkthrough

Walkthrough

This PR introduces a comprehensive embed data extraction system for URLs and migrates the TypeBox dependency from @sinclair/typebox to typebox across the codebase. The changes add support for extracting and storing metadata from GitHub Gists, X/Twitter posts, and video platforms (YouTube, Vimeo, Twitch) via new worker tasks, database tables, and data providers, while extending the API's URL metadata endpoint to return enriched embed information.

Changes

Cohort / File(s) Summary
Dependency Migration
apps/api/package.json, apps/worker/package.json, packages/bullmq/package.json, packages/common/package.json, packages/post-images/package.json, pnpm-workspace.yaml
Updated dependency declarations from @sinclair/typebox to typebox with version bump to ^1.1.5 in catalog.
Import Path Updates
apps/api/src/routes/tasks/post-images.ts, apps/worker/src/tasks/sync-*/..., packages/bullmq/src/tasks/..., packages/post-images/src/fetch-post-data.ts, packages/common/src/env.ts
Updated import statements to reference typebox instead of @sinclair/typebox and typebox/value; no logic changes.
URL Metadata Route Enhancement
apps/api/src/routes/tasks/url-metadata.ts
Expanded response schema with gist, post, embedType, and embed fields; added mapping helpers (mapObjectKey, mapImageData, mapGist, mapPost, mapEmbed) to construct enriched response objects from database results; adjusted URL normalization to include query strings.
Database Schema Extensions
packages/db/src/schema/url-metadata.ts
Added three new tables: url_metadata_gist (gistId, username, description), url_metadata_gist_file (junction table with contentKey, language), and url_metadata_post (postId with author/content/media metadata); extended url_metadata table with embedSrc, embedWidth, embedHeight, embedType fields and foreign key references.
Gist Embed Extraction
apps/worker/src/tasks/url-metadata/getEmbedDataFromGist.ts, .test.ts
New module fetching GitHub Gist metadata, uploading file contents to S3 with hash-based keys, upserting metadata to database, and cleaning up deleted files; includes unit tests with mocked GitHub and database interactions.
Post/Tweet Embed Extraction
apps/worker/src/tasks/url-metadata/getEmbedDataFromPost.ts, .test.ts
New module extracting X/Twitter post data via FXTwitter API, processing author avatar and post images, upserting to url_metadata_post table; includes unit tests with mocked endpoints.
Video Embed Extraction
apps/worker/src/tasks/url-metadata/getEmbedDataFromVideo.ts
New module delegating to platform-specific video OEmbed providers (YouTube, Vimeo, Twitch), extracting iframe attributes and thumbnail images, returning embedSrc and dimensions.
HTML Embed Extraction
apps/worker/src/tasks/url-metadata/getEmbedDataFromHtml.ts
New module fetching page HTML, extracting title and icons/banners via parallel asset processing; includes error handling for timeout and robot-denial scenarios.
Video Data Providers
apps/worker/src/tasks/url-metadata/data-providers/youtube.ts, vimeo.ts, twitch.ts, video.ts
New modules implementing OEmbed fetching for YouTube (via noembed), Vimeo (via bot-enabled fetch), and Twitch (clip URL normalization); video.ts dispatcher delegates to appropriate provider based on hostname.
X/Twitter Data Providers
apps/worker/src/tasks/url-metadata/data-providers/x.ts, data-providers/fx-embed/twitter.ts, data-providers/fx-embed/types.ts
New modules providing FXTwitter API client (getXPostData) and comprehensive TypeScript type definitions for Twitter API responses, media structures, polls, community notes, and articles.
Embed Data Common
apps/worker/src/tasks/url-metadata/common.ts
New module exporting EmbedData type (with title, icon, banner, embedSrc/Width/Height, gistId, postId, error fields) and BUCKET constant initialized via S3 ensureBucket.
Worker Processor Refactor
apps/worker/src/tasks/url-metadata/processor.ts
Replaced single HTML-driven flow with dispatch-based approach selecting embedType-specific fetcher (gist/post/video/html) based on hostname; consolidated error handling; stores embedType and related metadata in url_metadata record.
Test Infrastructure Updates
apps/worker/test-utils/setup.ts
Extended mock implementations: s3.remove, stateful db.insert/delete with mocking maps, new db properties (urlMetadata, urlMetadataPost, urlMetadataGist, urlMetadataGistFile), github-api.getGistById mock function.
GitHub API Extension
packages/github-api/src/getGistById.ts, src/index.ts
New module implementing getGistById(params: {gistId, signal?}) to fetch gist data via GitHub API with error handling; re-exported from index.ts.
Image Processing
apps/worker/src/tasks/url-metadata/utils/processImage.ts
Exported ProcessImageResult interface (previously internal).
S3 Utilities
packages/s3/src/utils.ts
Added remove(bucket: string, key: string) function to delete objects from S3 using DeleteObjectCommand.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant API
    participant Worker
    participant GitHub
    participant DB
    participant S3
    
    Client->>API: POST /tasks/url-metadata (gist URL)
    API->>Worker: Queue gist embed task
    Worker->>GitHub: GET /gists/{gistId}
    GitHub-->>Worker: Gist metadata + files
    Worker->>S3: Upload file content (MD5-keyed)
    S3-->>Worker: Content key stored
    Worker->>DB: Upsert gist metadata
    Worker->>DB: Upsert gist files
    Worker->>DB: Delete stale S3 files
    Worker->>S3: Remove deleted files
    S3-->>Worker: Deletion complete
    Worker->>DB: Store url_metadata with gistId
    Worker-->>API: Embed data complete
    API->>DB: Query url_metadata with mappings
    API-->>Client: Enriched response (gist + files)
Loading
sequenceDiagram
    participant Client
    participant API
    participant Worker
    participant FXTwitter
    participant Images
    participant DB
    participant S3
    
    Client->>API: POST /tasks/url-metadata (Twitter URL)
    API->>Worker: Queue post embed task
    Worker->>FXTwitter: GET /userId/status/postId
    FXTwitter-->>Worker: Tweet data with media URLs
    Worker->>Images: Process avatar image
    Images->>S3: Upload avatar
    S3-->>Images: Avatar key
    Worker->>Images: Process tweet image
    Images->>S3: Upload tweet image
    S3-->>Images: Image key
    Worker->>DB: Upsert url_metadata_post
    Worker->>DB: Store url_metadata with postId
    Worker-->>API: Post embed complete
    API->>DB: Query url_metadata with post mapping
    API-->>Client: Enriched response (author + content)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

  • hoof#111: Modifies database schema and url_metadata/post_images-related artifacts (migrations, snapshots) with related code changes for reading/writing url_metadata records.
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 15.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Add embed information to url-metadata' directly describes the primary change—introducing embed data handling to the url-metadata task.
Linked Issues check ✅ Passed The PR implements all three coding objectives from issue #82: detects platform-specific embeds (gist, post, video, HTML) via dispatchers, stores embed metadata in new database tables, uploads image URLs to S3 with keys, and extends the url-metadata schema and route response.
Out of Scope Changes check ✅ Passed All changes are scoped to implementing embed detection and storage for url-metadata, including dependency migration from @sinclair/typebox to typebox and supporting infrastructure—no unrelated functionality added.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch oembed-url-metadata

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 15

🧹 Nitpick comments (4)
pnpm-workspace.yaml (1)

6-6: TypeBox migration is complete; drift check would be a good preventive measure.

All source files are using the new typebox package correctly, and no legacy @sinclair/typebox references remain. A CI check to prevent reintroduction of the legacy package name is recommended as a future safeguard, but not required now.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pnpm-workspace.yaml` at line 6, Add a preventive CI drift check to ensure the
legacy package name '@sinclair/typebox' is never reintroduced: create a CI job
(or an npm script referenced by CI) named something like "check-typebox" that
searches the repository for the string '@sinclair/typebox' (excluding
node_modules/build artifacts) and fails the job if any matches are found; this
should be triggered on PRs and main branch updates and references the package
names "typebox" and "@sinclair/typebox" so reviewers can quickly locate the
check and its intent.
apps/worker/src/tasks/url-metadata/processor.ts (1)

10-10: Consider adding an "html" embed type for completeness.

The UrlMetadataEmbedType defines "gist" | "video" | "post", but the HTML fallback path leaves embedType as undefined. For clarity and future filtering/querying, you might want to include "html" in the union:

♻️ Suggested addition
-type UrlMetadataEmbedType = "gist" | "video" | "post";
+type UrlMetadataEmbedType = "gist" | "video" | "post" | "html";

And then set it explicitly in the else branch (line 28-29):

 } else {
+	embedType = "html";
 	embedPromise = getEmbedDataFromHtml(jobId, inputUrl, signal);
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/worker/src/tasks/url-metadata/processor.ts` at line 10, The
UrlMetadataEmbedType union should include an "html" variant and the fallback
assignment should set embedType = "html" instead of leaving it undefined; update
the type alias UrlMetadataEmbedType to include "html" and modify the fallback
branch in the URL metadata parsing logic (the code that currently leaves
embedType undefined) to explicitly assign "html" so downstream filtering/queries
can rely on a concrete embedType.
apps/worker/src/tasks/url-metadata/data-providers/vimeo.ts (1)

3-25: Consider relaxing literal types for external API fields.

The provider_url: "https://vimeo.com/" is typed as a literal string. If Vimeo's API ever returns a slightly different value (e.g., without trailing slash), it could cause type mismatches at runtime with stricter type checking. Consider using string for fields that aren't critical to the application logic.

♻️ Suggested type relaxation
 export interface VimeoOEmbedResponse {
 	type: "video";
 	version: "1.0";
 	provider_name: "Vimeo";
-	provider_url: "https://vimeo.com/";
+	provider_url: string;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/worker/src/tasks/url-metadata/data-providers/vimeo.ts` around lines 3 -
25, The VimeoOEmbedResponse interface uses literal string types for external API
fields (e.g., provider_url: "https://vimeo.com/") which is brittle; update the
interface by changing those literal types to more permissive types (for example
provider_url: string and similarly relax provider_name, version, and type if
they are not critical) so the shape accepts minor API variations while keeping
strongly typed fields like title, width, height, duration, thumbnail_url,
video_id, and uri unchanged.
apps/worker/src/tasks/url-metadata/common.ts (1)

17-17: Consider lazy initialization for the S3 bucket.

The top-level await runs at module import time, which means:

  1. Any import failure from s3.ensureBucket will prevent the module from loading.
  2. The bucket is created even if the importing code never uses it.
  3. Test mocking becomes trickier since this runs before test setup.

For a worker context this may be acceptable, but lazy initialization would be more resilient:

♻️ Suggested lazy initialization pattern
-export const BUCKET = await s3.ensureBucket(env.S3_BUCKET);
+let _bucket: Awaited<ReturnType<typeof s3.ensureBucket>> | undefined;
+
+export async function getBucket() {
+  if (!_bucket) {
+    _bucket = await s3.ensureBucket(env.S3_BUCKET);
+  }
+  return _bucket;
+}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/worker/src/tasks/url-metadata/common.ts` at line 17, Replace the
top-level awaited export with a lazy initializer: remove the top-level "export
const BUCKET = await s3.ensureBucket(env.S3_BUCKET)" and add an async getter
(e.g., getBucket or ensureBucketInitialized) that caches the bucket after
calling s3.ensureBucket(env.S3_BUCKET) on first use; update callers to call this
getter instead of referencing BUCKET directly so the bucket creation runs lazily
and is easier to mock/test and to avoid module-import side effects.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@apps/api/src/routes/tasks/url-metadata.ts`:
- Around line 107-109: Replace the use of logical OR with nullish coalescing for
optional numeric fields so valid 0 values are preserved: change occurrences of
"width: width || undefined" and "height: height || undefined" (and the analogous
occurrences at the later lines) to use "width: width ?? undefined" and "height:
height ?? undefined"; keep altText as "altText || undefined" only if empty
string should be treated as missing, otherwise change altText to "altText ??
undefined". Update the assignments where these symbols (width, height, altText)
are used in the URL metadata construction and at the later lines mentioned to
use ?? accordingly.
- Around line 113-115: The mapGist function currently throws when the side-table
row is missing; instead, change the behavior in mapGist (and the analogous
functions that throw for missing side rows—e.g., the mapPost path and the other
URL metadata mappers around the later throw sites) to return undefined when
db.query.urlMetadataGist.findFirst(...) (or the equivalent db query) returns no
row, and call the existing requeue/refresh helper (use the module's
enqueue/refresh function used elsewhere for metadata refresh) to schedule an
async refresh rather than throwing; ensure callers of mapGist handle undefined
as a cache-hit fallback so the request path does not 500.
- Line 237: The code currently force-casts result.embedType using "embedType:
result.embedType as 'post' | 'gist' | 'video' | undefined" which is unsafe;
replace this with a guarded mapper: add a small type‑guard (e.g.,
isValidEmbedType(value): value is 'post'|'gist'|'video') and set embedType to
isValidEmbedType(result.embedType) ? result.embedType : undefined (or handle
invalid values explicitly), updating the assignment site where embedType is set
from result.embedType so the response contract is enforced without unsafe
casting.

In `@apps/worker/src/tasks/url-metadata/data-providers/twitch.ts`:
- Line 42: twitchHosts currently lists ["clips.twitch.tv", "twitch.tv"] but
misses "www.twitch.tv", causing valid Twitch URLs to be ignored; update the
twitchHosts export (the twitchHosts constant in
apps/worker/src/tasks/url-metadata/data-providers/twitch.ts) to include
"www.twitch.tv" in the array alongside the existing entries so URL
matching/embedding recognizes both forms.
- Around line 29-38: The current fallback uses the page URL (embedUrl = url)
which produces invalid Twitch iframe src; instead detect non-clip Twitch URLs
and build a proper player endpoint (use "https://player.twitch.tv/" with either
?channel=<channelName> for channel pages or ?video=v<videoId> for video pages /
collections), preserve/add the parent query param on that player URL (use
embedUrl.searchParams.set("parent", parent)), and update the returned html to
use width="400" (minimum) and height="300"; change references to embedUrl and
url where the iframe src is constructed and handle the TODO for video/collection
paths by mapping path segments to the correct channel or video query parameter
before returning html.

In `@apps/worker/src/tasks/url-metadata/data-providers/video.ts`:
- Line 37: Exported videoHosts currently combines youtubeHosts and vimeoHosts
but omits twitchHosts, preventing Twitch URLs from being routed into the video
embed path despite getVideoDataFromUrl supporting Twitch; update the exported
videoHosts (symbol: videoHosts) to include twitchHosts alongside youtubeHosts
and vimeoHosts so Twitch URLs are recognized, and verify getVideoDataFromUrl
still handles Twitch-specific logic (symbols: getVideoDataFromUrl, twitchHosts).

In `@apps/worker/src/tasks/url-metadata/data-providers/youtube.ts`:
- Around line 27-47: The code builds youtubeUrl using the variable videoId
without ensuring it was set; update the parsing logic around videoId in
apps/worker/src/tasks/url-metadata/data-providers/youtube.ts (the videoId
variable and the youtubeUrl construction) to guard against missing/empty IDs:
after the if/else blocks that set videoId, check that videoId is a non-empty
string and if not, return null (or otherwise handle the invalid case) instead of
proceeding to construct `https://www.youtube.com/watch?v=${videoId}`.

In `@apps/worker/src/tasks/url-metadata/getEmbedDataFromGist.test.ts`:
- Around line 33-35: The test uses a simplified mock chain
((db.delete(urlMetadataGistFile).where(undefined).returning as Mock)) that hides
the real where clause used by the implementation (and(eq(...),
not(inArray(...))); update the mock in getEmbedDataFromGist.test.ts to reflect
or assert the actual where predicate: either mock
db.delete(urlMetadataGistFile).where with a matcher that checks the passed
predicate (e.g., capture the where arg and verify it composes and(eq(...),
not(inArray(...)))) or replace the .where(undefined) with a more explicit
mockImplementation that validates the incoming where function/criteria before
returning Promise.resolve([]); alternatively, if simplifying is intentional add
a brief comment explaining why the simplified mock covers the case and cannot
hide regressions, referencing db.delete and urlMetadataGistFile so reviewers can
locate the change.

In `@apps/worker/src/tasks/url-metadata/getEmbedDataFromGist.ts`:
- Around line 25-33: The function currently logs an invalid gist URL but
continues to call github.getGistById; change the invalid-path branch (when
!githubUsername || !gistId) to return immediately (e.g., return undefined) after
logging to avoid the network call and further processing instead of only setting
error = true; ensure this prevents execution of github.getGistById and
downstream code that expects valid gistId/inputUrl.

In `@apps/worker/src/tasks/url-metadata/getEmbedDataFromPost.test.ts`:
- Around line 97-101: The test incorrectly passes a URL string as the jobId to
getEmbedDataFromPost; update the test to pass a distinct mock job ID (e.g.,
"test-job-id" or similar) as the first argument while keeping the existing URL
(new URL(fakeFxTwitterUrl)) and AbortSignal, so the call mirrors production
usage and avoids conflating jobId with the input URL when generating S3 keys.

In `@apps/worker/src/tasks/url-metadata/getEmbedDataFromPost.ts`:
- Around line 33-37: The code in getEmbedDataFromPost uses new URL(avatarUrl)
and new URL(image.url) directly (variables avatarUrl and image.url) which can
throw synchronously; wrap the URL construction in a try/catch or pre-validate
the string before calling processImages so malformed URLs are skipped or
converted to undefined instead of throwing, and ensure any skipped images are
filtered out before calling processImages([ ... ]) and before mapping into new
URL for images so the downstream .catch handlers are not bypassed.

In `@apps/worker/src/tasks/url-metadata/getEmbedDataFromVideo.ts`:
- Around line 37-45: In getEmbedDataFromVideo, protect parsing of
videoData.thumbnail_url before calling processImages: the expression new
URL(videoData.thumbnail_url) can throw synchronously and bypass the existing
.catch on the processImages promise; wrap the URL construction/validation in a
try/catch (or guard with a safe URL-parse helper) and only call processImages
when the URL was successfully created, and ensure any thrown error is caught and
logged/handled using the existing jobId/signal flow so banner assignment doesn't
crash the task.

In `@apps/worker/test-utils/setup.ts`:
- Around line 38-41: The mock deleteMockResponse currently sets returning =
vi.fn() which yields undefined when awaited; change returning to be a function
that returns an empty array by default (e.g., make returning an async/mock
function that returns []), so callers that await the .returning(...) can safely
iterate; update the returned object in deleteMockResponse (the where mock that
returns { returning }) accordingly so .where(...).returning(...) resolves to an
empty array by default.

In `@packages/db/src/schema/url-metadata.ts`:
- Line 24: The url_metadata_gist_file schema currently forces language as
notNull (varchar("language", { length: 16 }).notNull()), which fails when GitHub
returns null/omits files[...].language; change the column definition in
packages/db/src/schema/url-metadata.ts for the url_metadata_gist_file entry to
allow nulls (remove .notNull() or use .nullable() depending on the ORM) so
inserts succeed when language is absent, and run migrations/regen types if
applicable.

In `@packages/github-api/src/getGistById.ts`:
- Around line 26-29: The error message in getGistById incorrectly references
getCommits; update the throw in the block that checks if data is undefined or
response.error (in function getGistById) to mention getGistById instead of
getCommits and include the same response.response.status and response.error
details so the runtime message correctly identifies the failing operation.

---

Nitpick comments:
In `@apps/worker/src/tasks/url-metadata/common.ts`:
- Line 17: Replace the top-level awaited export with a lazy initializer: remove
the top-level "export const BUCKET = await s3.ensureBucket(env.S3_BUCKET)" and
add an async getter (e.g., getBucket or ensureBucketInitialized) that caches the
bucket after calling s3.ensureBucket(env.S3_BUCKET) on first use; update callers
to call this getter instead of referencing BUCKET directly so the bucket
creation runs lazily and is easier to mock/test and to avoid module-import side
effects.

In `@apps/worker/src/tasks/url-metadata/data-providers/vimeo.ts`:
- Around line 3-25: The VimeoOEmbedResponse interface uses literal string types
for external API fields (e.g., provider_url: "https://vimeo.com/") which is
brittle; update the interface by changing those literal types to more permissive
types (for example provider_url: string and similarly relax provider_name,
version, and type if they are not critical) so the shape accepts minor API
variations while keeping strongly typed fields like title, width, height,
duration, thumbnail_url, video_id, and uri unchanged.

In `@apps/worker/src/tasks/url-metadata/processor.ts`:
- Line 10: The UrlMetadataEmbedType union should include an "html" variant and
the fallback assignment should set embedType = "html" instead of leaving it
undefined; update the type alias UrlMetadataEmbedType to include "html" and
modify the fallback branch in the URL metadata parsing logic (the code that
currently leaves embedType undefined) to explicitly assign "html" so downstream
filtering/queries can rely on a concrete embedType.

In `@pnpm-workspace.yaml`:
- Line 6: Add a preventive CI drift check to ensure the legacy package name
'@sinclair/typebox' is never reintroduced: create a CI job (or an npm script
referenced by CI) named something like "check-typebox" that searches the
repository for the string '@sinclair/typebox' (excluding node_modules/build
artifacts) and fails the job if any matches are found; this should be triggered
on PRs and main branch updates and references the package names "typebox" and
"@sinclair/typebox" so reviewers can quickly locate the check and its intent.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1f71585 and 389e182.

⛔ Files ignored due to path filters (1)
  • pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (39)
  • apps/api/package.json
  • apps/api/src/routes/tasks/post-images.ts
  • apps/api/src/routes/tasks/url-metadata.ts
  • apps/worker/package.json
  • apps/worker/src/tasks/sync-author/processor.ts
  • apps/worker/src/tasks/sync-author/types.ts
  • apps/worker/src/tasks/sync-collection/processor.ts
  • apps/worker/src/tasks/sync-collection/types.ts
  • apps/worker/src/tasks/sync-post/processor.ts
  • apps/worker/src/tasks/sync-post/types.ts
  • apps/worker/src/tasks/url-metadata/common.ts
  • apps/worker/src/tasks/url-metadata/data-providers/fx-embed/twitter.ts
  • apps/worker/src/tasks/url-metadata/data-providers/fx-embed/types.ts
  • apps/worker/src/tasks/url-metadata/data-providers/twitch.ts
  • apps/worker/src/tasks/url-metadata/data-providers/video.ts
  • apps/worker/src/tasks/url-metadata/data-providers/vimeo.ts
  • apps/worker/src/tasks/url-metadata/data-providers/x.ts
  • apps/worker/src/tasks/url-metadata/data-providers/youtube.ts
  • apps/worker/src/tasks/url-metadata/getEmbedDataFromGist.test.ts
  • apps/worker/src/tasks/url-metadata/getEmbedDataFromGist.ts
  • apps/worker/src/tasks/url-metadata/getEmbedDataFromHtml.ts
  • apps/worker/src/tasks/url-metadata/getEmbedDataFromPost.test.ts
  • apps/worker/src/tasks/url-metadata/getEmbedDataFromPost.ts
  • apps/worker/src/tasks/url-metadata/getEmbedDataFromVideo.ts
  • apps/worker/src/tasks/url-metadata/processor.ts
  • apps/worker/src/tasks/url-metadata/utils/processImage.ts
  • apps/worker/test-utils/setup.ts
  • packages/bullmq/package.json
  • packages/bullmq/src/tasks/post-image.ts
  • packages/bullmq/src/tasks/url-metadata.ts
  • packages/common/package.json
  • packages/common/src/env.ts
  • packages/db/src/schema/url-metadata.ts
  • packages/github-api/src/getGistById.ts
  • packages/github-api/src/index.ts
  • packages/post-images/package.json
  • packages/post-images/src/fetch-post-data.ts
  • packages/s3/src/utils.ts
  • pnpm-workspace.yaml

Comment on lines 107 to +109
width: width || undefined,
height: height || undefined,
altText: altText || undefined,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Use nullish coalescing for optional numeric fields.

Line [107], Line [108], Line [163], and Line [164] use ||, which drops valid 0 values. Use ?? so only null/undefined are omitted.

Suggested fix
-    width: width || undefined,
-    height: height || undefined,
+    width: width ?? undefined,
+    height: height ?? undefined,
@@
-    width: result.embedWidth || undefined,
-    height: result.embedHeight || undefined,
+    width: result.embedWidth ?? undefined,
+    height: result.embedHeight ?? undefined,

Also applies to: 163-164

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/routes/tasks/url-metadata.ts` around lines 107 - 109, Replace
the use of logical OR with nullish coalescing for optional numeric fields so
valid 0 values are preserved: change occurrences of "width: width || undefined"
and "height: height || undefined" (and the analogous occurrences at the later
lines) to use "width: width ?? undefined" and "height: height ?? undefined";
keep altText as "altText || undefined" only if empty string should be treated as
missing, otherwise change altText to "altText ?? undefined". Update the
assignments where these symbols (width, height, altText) are used in the URL
metadata construction and at the later lines mentioned to use ?? accordingly.

Comment on lines +113 to +115
async function mapGist(gistId: string): Promise<Static<typeof GistSchema>> {
const gist = await db.query.urlMetadataGist.findFirst({ where: { gistId } });
if (!gist) throw new Error(`Gist information for ${gistId} not found!`);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Don’t hard-fail cached responses when gist/post side rows are missing.

On Line [115] and Line [134], throwing makes a cache hit fail with 500 when side-table data is absent/inconsistent. Return undefined and requeue refresh instead of failing the request path.

Suggested resilience fix
-async function mapGist(gistId: string): Promise<Static<typeof GistSchema>> {
+async function mapGist(
+  gistId: string,
+): Promise<Static<typeof GistSchema> | undefined> {
   const gist = await db.query.urlMetadataGist.findFirst({ where: { gistId } });
-  if (!gist) throw new Error(`Gist information for ${gistId} not found!`);
+  if (!gist) return undefined;
@@
-async function mapPost(postId: string): Promise<Static<typeof PostSchema>> {
+async function mapPost(
+  postId: string,
+): Promise<Static<typeof PostSchema> | undefined> {
   const post = await db.query.urlMetadataPost.findFirst({ where: { postId } });
-  if (!post) throw new Error(`Post information for ${postId} not found!`);
+  if (!post) return undefined;
@@
-        const response = {
+        const gist = result.gistId ? await mapGist(result.gistId) : undefined;
+        const post = result.postId ? await mapPost(result.postId) : undefined;
+        if ((result.gistId && !gist) || (result.postId && !post)) {
+          shouldSubmitJob = true;
+        }
+
+        const response = {
           ...mapUrlMetadata(result),
-          gist: result.gistId ? await mapGist(result.gistId) : undefined,
-          post: result.postId ? await mapPost(result.postId) : undefined,
+          gist,
+          post,

Also applies to: 132-134, 233-236

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/routes/tasks/url-metadata.ts` around lines 113 - 115, The
mapGist function currently throws when the side-table row is missing; instead,
change the behavior in mapGist (and the analogous functions that throw for
missing side rows—e.g., the mapPost path and the other URL metadata mappers
around the later throw sites) to return undefined when
db.query.urlMetadataGist.findFirst(...) (or the equivalent db query) returns no
row, and call the existing requeue/refresh helper (use the module's
enqueue/refresh function used elsewhere for metadata refresh) to schedule an
async refresh rather than throwing; ensure callers of mapGist handle undefined
as a cache-hit fallback so the request path does not 500.

...mapUrlMetadata(result),
gist: result.gistId ? await mapGist(result.gistId) : undefined,
post: result.postId ? await mapPost(result.postId) : undefined,
embedType: result.embedType as "post" | "gist" | "video" | undefined,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Avoid unsafe embedType assertion from DB data.

Line [237] force-casts DB data into the API enum; invalid persisted values can silently violate the response contract. Prefer a guarded mapper.

Suggested contract-safe mapping
+function mapEmbedType(
+  value: string | null,
+): "post" | "gist" | "video" | undefined {
+  return value === "post" || value === "gist" || value === "video"
+    ? value
+    : undefined;
+}
@@
-          embedType: result.embedType as "post" | "gist" | "video" | undefined,
+          embedType: mapEmbedType(result.embedType),
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
embedType: result.embedType as "post" | "gist" | "video" | undefined,
function mapEmbedType(
value: string | null,
): "post" | "gist" | "video" | undefined {
return value === "post" || value === "gist" || value === "video"
? value
: undefined;
}
// ... other code ...
embedType: mapEmbedType(result.embedType),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/api/src/routes/tasks/url-metadata.ts` at line 237, The code currently
force-casts result.embedType using "embedType: result.embedType as 'post' |
'gist' | 'video' | undefined" which is unsafe; replace this with a guarded
mapper: add a small type‑guard (e.g., isValidEmbedType(value): value is
'post'|'gist'|'video') and set embedType to isValidEmbedType(result.embedType) ?
result.embedType : undefined (or handle invalid values explicitly), updating the
assignment site where embedType is set from result.embedType so the response
contract is enforced without unsafe casting.

Comment on lines +29 to +38
if (!embedUrl) {
embedUrl = url;
}

// Set the "parent" property for embedding - https://dev.twitch.tv/docs/embed/everything/#usage
embedUrl.searchParams.set("parent", parent);

return {
html: `<iframe src="${embedUrl.toString()}" height="300"></iframe>`,
title: "Twitch Embed",
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n apps/worker/src/tasks/url-metadata/data-providers/twitch.ts

Repository: playfulprogramming/hoof

Length of output: 1678


🌐 Web query:

Twitch embed iframe src requirements player.twitch.tv channel video

💡 Result:

For a non-interactive Twitch player iframe, the src must point to https://player.twitch.tv/ and include one content identifier plus at least one parent domain:

src formats

  • Live channel
    <iframe src="https://player.twitch.tv/?channel=<channel>&parent=<yourdomain.com>"></iframe>
  • VOD / past broadcast
    <iframe src="https://player.twitch.tv/?video=<videoID>&parent=<yourdomain.com>"></iframe>

Key requirements

  • HTTPS/SSL required for domains using Twitch embeds. [1]
  • parent is required: add the full domain (include subdomain like www if used). If the embed can appear on multiple domains, include multiple parent= entries. Missing/mismatched parent triggers a playback/config error. [1][2]
    src="https://player.twitch.tv/?channel=twitchpresents&parent=example.com&parent=www.example.com"
  • Minimum size (per Twitch embed requirements): width ≥ 400, height ≥ 300. [3]

Sources: Twitch embed requirements and examples from Twitch Developers docs and Twitch Developer Forums. [1][2][3]


Non-clip Twitch URLs produce invalid embed endpoints.

The code falls back to embedding the page URL directly (line 30: embedUrl = url). This results in iframe sources like https://twitch.tv/somechannel?parent=..., which Twitch does not recognize as valid embed endpoints. Twitch requires embeds to use https://player.twitch.tv/ with channel or video query parameters.

Additionally, the iframe has height="300" but no width attribute. Twitch embeds require a minimum width of 400 pixels.

The TODO comment on line 25 explicitly flags this limitation for video and collection URLs, which also need proper handling.

💡 Suggested fix
if (!embedUrl) {
	const [first, second] = url.pathname.split("/").filter(Boolean);
	embedUrl = new URL("https://player.twitch.tv/");
	if (first === "videos" && second) {
		embedUrl.searchParams.set("video", `v${second}`);
	} else if (first) {
		embedUrl.searchParams.set("channel", first);
	} else {
		return { html: "", title: "Twitch Embed" };
	}
}

Also add width to the iframe: width="400" minimum.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/worker/src/tasks/url-metadata/data-providers/twitch.ts` around lines 29
- 38, The current fallback uses the page URL (embedUrl = url) which produces
invalid Twitch iframe src; instead detect non-clip Twitch URLs and build a
proper player endpoint (use "https://player.twitch.tv/" with either
?channel=<channelName> for channel pages or ?video=v<videoId> for video pages /
collections), preserve/add the parent query param on that player URL (use
embedUrl.searchParams.set("parent", parent)), and update the returned html to
use width="400" (minimum) and height="300"; change references to embedUrl and
url where the iframe src is constructed and handle the TODO for video/collection
paths by mapping path segments to the correct channel or video query parameter
before returning html.

};
}

export const twitchHosts = ["clips.twitch.tv", "twitch.tv"];
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Add www.twitch.tv to supported hosts.

Common Twitch URLs use www.twitch.tv; excluding it can skip valid embeds.

🛠️ Proposed fix
-export const twitchHosts = ["clips.twitch.tv", "twitch.tv"];
+export const twitchHosts = ["clips.twitch.tv", "twitch.tv", "www.twitch.tv"];
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
export const twitchHosts = ["clips.twitch.tv", "twitch.tv"];
export const twitchHosts = ["clips.twitch.tv", "twitch.tv", "www.twitch.tv"];
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/worker/src/tasks/url-metadata/data-providers/twitch.ts` at line 42,
twitchHosts currently lists ["clips.twitch.tv", "twitch.tv"] but misses
"www.twitch.tv", causing valid Twitch URLs to be ignored; update the twitchHosts
export (the twitchHosts constant in
apps/worker/src/tasks/url-metadata/data-providers/twitch.ts) to include
"www.twitch.tv" in the array alongside the existing entries so URL
matching/embedding recognizes both forms.

Comment on lines +33 to +37
const avatarUrl = post.author.avatar_url;
let avatarResult: ProcessImageResult | undefined = undefined;
if (avatarUrl) {
avatarResult = await processImages(
[new URL(avatarUrl)],
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Guard provider URL parsing before image processing.

new URL(avatarUrl) (Line 37) and new URL(image.url) (Line 54) can throw before the downstream .catch(...) blocks run. A malformed upstream value can abort the whole task.

💡 Proposed fix
+	const safeParseUrl = (value: string, label: string): URL | undefined => {
+		try {
+			return new URL(value);
+		} catch (e) {
+			console.error(`Invalid ${label} URL for '${inputUrl}'`, e);
+			error = true;
+			return undefined;
+		}
+	};
+
 	const avatarUrl = post.author.avatar_url;
 	let avatarResult: ProcessImageResult | undefined = undefined;
 	if (avatarUrl) {
+		const parsedAvatarUrl = safeParseUrl(avatarUrl, "avatar");
+		if (!parsedAvatarUrl) {
+			avatarResult = undefined;
+		} else {
 		avatarResult = await processImages(
-			[new URL(avatarUrl)],
+			[parsedAvatarUrl],
 			72,
 			BUCKET,
 			"remote-posts",
 			jobId,
 			signal,
 		).catch((e) => {
 			console.error(`Error processing post image for '${inputUrl}'`, e);
 			error = true;
 			return undefined;
 		});
+		}
 	}
 ...
 	if (image) {
+		const parsedImageUrl = safeParseUrl(image.url, "post image");
+		if (!parsedImageUrl) {
+			imageResult = undefined;
+		} else {
 		imageResult = await processImages(
-			[new URL(image.url)],
+			[parsedImageUrl],
 			image.width,
 			BUCKET,
 			"remote-posts",
 			jobId,
 			signal,
 		).catch((e) => {
 			console.error(`Error processing post image for '${inputUrl}'`, e);
 			error = true;
 			return undefined;
 		});
+		}
 	}

Also applies to: 53-55

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/worker/src/tasks/url-metadata/getEmbedDataFromPost.ts` around lines 33 -
37, The code in getEmbedDataFromPost uses new URL(avatarUrl) and new
URL(image.url) directly (variables avatarUrl and image.url) which can throw
synchronously; wrap the URL construction in a try/catch or pre-validate the
string before calling processImages so malformed URLs are skipped or converted
to undefined instead of throwing, and ensure any skipped images are filtered out
before calling processImages([ ... ]) and before mapping into new URL for images
so the downstream .catch handlers are not bypassed.

Comment on lines +37 to +45
const banner = videoData?.thumbnail_url
? await processImages(
[new URL(videoData.thumbnail_url)],
expectedThumbnailWidth,
BUCKET,
"remote-video-thumbnail",
jobId,
signal,
).catch((e) => {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Protect thumbnail URL parsing before calling processImages.

new URL(videoData.thumbnail_url) (Line 39) can throw before the current .catch(...) handler executes. This can fail the task unexpectedly on malformed provider data.

💡 Proposed fix
-	const banner = videoData?.thumbnail_url
-		? await processImages(
-				[new URL(videoData.thumbnail_url)],
-				expectedThumbnailWidth,
-				BUCKET,
-				"remote-video-thumbnail",
-				jobId,
-				signal,
-			).catch((e) => {
-				console.error(`Unable upload thumbnail image for ${inputUrl}`, e);
-				error = true;
-				return undefined;
-			})
-		: undefined;
+	let banner = undefined;
+	if (videoData?.thumbnail_url) {
+		try {
+			const thumbnailUrl = new URL(videoData.thumbnail_url);
+			banner = await processImages(
+				[thumbnailUrl],
+				expectedThumbnailWidth,
+				BUCKET,
+				"remote-video-thumbnail",
+				jobId,
+				signal,
+			).catch((e) => {
+				console.error(`Unable upload thumbnail image for ${inputUrl}`, e);
+				error = true;
+				return undefined;
+			});
+		} catch (e) {
+			console.error(`Invalid thumbnail URL for ${inputUrl}`, e);
+			error = true;
+		}
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/worker/src/tasks/url-metadata/getEmbedDataFromVideo.ts` around lines 37
- 45, In getEmbedDataFromVideo, protect parsing of videoData.thumbnail_url
before calling processImages: the expression new URL(videoData.thumbnail_url)
can throw synchronously and bypass the existing .catch on the processImages
promise; wrap the URL construction/validation in a try/catch (or guard with a
safe URL-parse helper) and only call processImages when the URL was successfully
created, and ensure any thrown error is caught and logged/handled using the
existing jobId/signal flow so banner assignment doesn't crash the task.

Comment on lines +38 to +41
const deleteMockResponse = () => {
const returning = vi.fn();
return { where: vi.fn(() => ({ returning })) };
};
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Make mocked .returning(...) default to an array.

Current default vi.fn() returns undefined. Callers that iterate over await ...returning(...) can crash in tests.

💡 Proposed fix
 	const deleteMockResponse = () => {
-		const returning = vi.fn();
+		const returning = vi.fn(async () => []);
 		return { where: vi.fn(() => ({ returning })) };
 	};
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const deleteMockResponse = () => {
const returning = vi.fn();
return { where: vi.fn(() => ({ returning })) };
};
const deleteMockResponse = () => {
const returning = vi.fn(async () => []);
return { where: vi.fn(() => ({ returning })) };
};
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@apps/worker/test-utils/setup.ts` around lines 38 - 41, The mock
deleteMockResponse currently sets returning = vi.fn() which yields undefined
when awaited; change returning to be a function that returns an empty array by
default (e.g., make returning an async/mock function that returns []), so
callers that await the .returning(...) can safely iterate; update the returned
object in deleteMockResponse (the where mock that returns { returning })
accordingly so .where(...).returning(...) resolves to an empty array by default.

.references(() => urlMetadataGist.gistId, { onDelete: "cascade" }),
filename: varchar("filename", { length: 256 }).notNull(),
contentKey: varchar("content_key", { length: 256 }).notNull(),
language: varchar("language", { length: 16 }).notNull(),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

GitHub REST API GET gists gist_id files language field null omitted

💡 Result:

In the REST API “Get a gist” response, each entry under files may include a language property (the docs show it in examples), but it’s not something you can rely on always being a non-empty string. [1]

In practice, files[...].language is best treated as optional/nullable:

  • Many client SDKs model it as an optional field (e.g., *string with omitempty), meaning it can be missing or null in JSON. [2]
  • GitHub determines language via Linguist (used to detect languages and ignore/handle binary content), so if Linguist can’t confidently classify a file (unknown/unsupported extension, binary-ish content, etc.), the API may return null or omit it. [3]

If you need a stable value, fall back to your own detection (e.g., by filename/extension or running Linguist yourself) when language is null/absent. [3]

Sources: GitHub REST Gists docs [1], go-github GistFile model [2], GitHub Linguist README [3].


🏁 Script executed:

# Find and inspect the url-metadata.ts schema file
find . -name "url-metadata.ts" -type f

Repository: playfulprogramming/hoof

Length of output: 195


🏁 Script executed:

# Inspect the url-metadata.ts schema file to see the actual implementation
cat -n ./packages/db/src/schema/url-metadata.ts | head -50

Repository: playfulprogramming/hoof

Length of output: 2244


Make language optional in url_metadata_gist_file schema.

The language field is marked as notNull() but GitHub's REST API can omit or return null for files[...].language when Linguist cannot confidently classify the file. This causes insert failures when ingesting gist data.

Schema adjustment
-		language: varchar("language", { length: 16 }).notNull(),
+		language: varchar("language", { length: 16 }),
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
language: varchar("language", { length: 16 }).notNull(),
language: varchar("language", { length: 16 }),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/db/src/schema/url-metadata.ts` at line 24, The
url_metadata_gist_file schema currently forces language as notNull
(varchar("language", { length: 16 }).notNull()), which fails when GitHub returns
null/omits files[...].language; change the column definition in
packages/db/src/schema/url-metadata.ts for the url_metadata_gist_file entry to
allow nulls (remove .notNull() or use .nullable() depending on the ORM) so
inserts succeed when language is absent, and run migrations/regen types if
applicable.

Comment on lines +26 to +29
if (typeof data === "undefined" || response.error) {
throw new Error(
`GitHub API (getCommits) returned ${response.response.status} ${response.error}`,
);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Fix incorrect operation name in error message.

Line 28 references getCommits, but this function is getGistById. This makes runtime errors misleading.

🛠️ Proposed fix
-			`GitHub API (getCommits) returned ${response.response.status} ${response.error}`,
+			`GitHub API (getGistById) returned ${response.response.status} ${response.error}`,
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if (typeof data === "undefined" || response.error) {
throw new Error(
`GitHub API (getCommits) returned ${response.response.status} ${response.error}`,
);
if (typeof data === "undefined" || response.error) {
throw new Error(
`GitHub API (getGistById) returned ${response.response.status} ${response.error}`,
);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/github-api/src/getGistById.ts` around lines 26 - 29, The error
message in getGistById incorrectly references getCommits; update the throw in
the block that checks if data is undefined or response.error (in function
getGistById) to mention getGistById instead of getCommits and include the same
response.response.status and response.error details so the runtime message
correctly identifies the failing operation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Process OEmbed data with url-metadata

1 participant