feat(core): Add Amazon URL parsing and metadata extraction #27455
shaurya-cd wants to merge 4 commits into
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up-to-date status, view the checks section at the bottom of the pull request.
🛑 Action Required: Evaluation Approval
Steering changes have been detected in this PR. To prevent regressions, a maintainer must approve the evaluation run before this PR can be merged.
Once approved, the evaluation results will be posted here automatically.
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces native support for parsing and extracting structured metadata from Amazon product pages within the web-fetch tool. By enabling the automatic resolution of shortened Amazon links and the extraction of key product details, this change improves the quality of information available to the LLM for comparison and analysis tasks without requiring heavy browser automation.
Code Review
This pull request adds Amazon product metadata extraction to the WebFetchTool, introducing a new utility to parse Amazon URLs, expand shortened links, extract product details (such as title, price, brand, model, and bullets), and format them into LLM-friendly context. The reviewer raised critical security concerns regarding Server-Side Request Forgery (SSRF). Specifically, they recommended validating that the expanded canonical URL does not point to a private IP before fetching, and tightening the isAmazonUrl hostname validation to prevent domain spoofing and DNS rebinding bypasses.
    export async function extractAmazonMetadata(
      url: string,
    ): Promise<AmazonProductMetadata> {
      const canonicalUrl = await expandAmazonUrl(url);

      const html = await fetchAmazonHtml(canonicalUrl);

      return {
        canonicalUrl,
        title: extractTitle(html),
        price: extractPrice(html),
        bullets: extractBullets(html),
        brand: extractBrand(html),
        model: extractModel(html),
      };
    }
Security Vulnerability: Redirect to Private IP (SSRF)
When expanding shortened Amazon URLs, the redirected canonicalUrl could point to a private IP or localhost. We must validate that the expanded URL is not a private IP before fetching its HTML content.
Suggested change:

    export async function extractAmazonMetadata(
      url: string,
    ): Promise<AmazonProductMetadata> {
      const canonicalUrl = await expandAmazonUrl(url);
      if (isPrivateIp(canonicalUrl)) {
        throw new PrivateIpError(`Access to private network is blocked: ${canonicalUrl}`);
      }
      const html = await fetchAmazonHtml(canonicalUrl);
      return {
        canonicalUrl,
        title: extractTitle(html),
        price: extractPrice(html),
        bullets: extractBullets(html),
        brand: extractBrand(html),
        model: extractModel(html),
      };
    }
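The kind of check the reviewer asks for can be illustrated without any project code. The sketch below is a hypothetical `isPrivateHost` helper, not the repository's `isPrivateIp` (whose implementation is not shown in this thread). It flags `localhost` and literal IPs in loopback, private, and link-local ranges. It inspects only the URL string, so a DNS name that resolves to a private address would still pass, and IPv4-mapped IPv6 literals are not handled; treat it as a starting point rather than a complete defense.

```typescript
// Hypothetical helper for illustration; NOT the project's isPrivateIp.
// Returns true when the URL's host is localhost or a literal IP in a
// loopback, private, or link-local range. Not exhaustive.
function isPrivateHost(rawUrl: string): boolean {
  let hostname: string;
  try {
    hostname = new URL(rawUrl).hostname.toLowerCase();
  } catch {
    return true; // Treat unparseable input as unsafe.
  }

  if (hostname === 'localhost' || hostname.endsWith('.localhost')) {
    return true;
  }

  // The WHATWG URL parser keeps the brackets around IPv6 literals.
  if (hostname.startsWith('[')) {
    const v6 = hostname.slice(1, -1);
    return (
      v6 === '::1' || // loopback
      v6 === '::' || // unspecified
      /^f[cd][0-9a-f]{2}:/.test(v6) || // fc00::/7 unique local
      /^fe[89ab][0-9a-f]:/.test(v6) // fe80::/10 link-local
    );
  }

  const m = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(hostname);
  if (!m) {
    return false; // A DNS name: cannot be classified without resolving it.
  }
  const a = Number(m[1]);
  const b = Number(m[2]);
  return (
    a === 0 ||
    a === 10 ||
    a === 127 ||
    (a === 169 && b === 254) ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168)
  );
}
```

Running a check like this on every redirect hop, not only on the final URL, would be the more conservative design.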
    export function isAmazonUrl(url: string): boolean {
      try {
        const parsed = new URL(url);

        const host = parsed.hostname.toLowerCase();

        return AMAZON_HOST_PATTERNS.some((pattern) => host.includes(pattern));
      } catch {
        return false;
      }
    }
The isAmazonUrl function uses host.includes(pattern) for validation, which is insecure. This allows attacker-controlled domains like amazon.attacker.com or amazon.127.0.0.1.nip.io to be incorrectly identified as Amazon URLs, leading to Server-Side Request Forgery (SSRF) and DNS rebinding bypasses. A stricter validation is required, ensuring the hostname accurately matches Amazon domains (e.g., ending with .amazon.<tld> or amzn.to/amzn.in).
Suggested change:

    export function isAmazonUrl(url: string): boolean {
      try {
        const parsed = new URL(url);
        const host = parsed.hostname.toLowerCase();
        return /^(.*\.)?(amazon\.[a-z]{2,3}(\.[a-z]{2})?|amzn\.(in|to))$/i.test(host);
      } catch {
        return false;
      }
    }
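The suggested regex anchors the match to the end of the hostname, which rejects `amazon.attacker.com`, but it still accepts any two- or three-letter TLD after `amazon.`, including ones Amazon may not control. A stricter alternative is an explicit allowlist with exact or dot-suffix matching. The sketch below is one possible shape; the function name and the domain list are illustrative assumptions, not taken from the PR.

```typescript
// Illustrative allowlist: the domains listed here are an assumption and
// would need to be maintained by the project.
const AMAZON_DOMAINS: readonly string[] = [
  'amazon.com',
  'amazon.in',
  'amazon.co.uk',
  'amazon.de',
  'amazon.co.jp',
];
const AMAZON_SHORT_DOMAINS: readonly string[] = ['amzn.to', 'amzn.in'];

// Hypothetical stricter variant of isAmazonUrl.
function isAllowlistedAmazonUrl(rawUrl: string): boolean {
  let parsed: URL;
  try {
    parsed = new URL(rawUrl);
  } catch {
    return false;
  }
  if (parsed.protocol !== 'https:' && parsed.protocol !== 'http:') {
    return false;
  }
  const host = parsed.hostname.toLowerCase();
  // Accept the domain itself or any subdomain of it, never a mere substring.
  const matches = (domain: string): boolean =>
    host === domain || host.endsWith(`.${domain}`);
  return AMAZON_DOMAINS.some(matches) || AMAZON_SHORT_DOMAINS.some(matches);
}
```

With this approach, the two hostnames named in the review comment (`amazon.attacker.com` and `amazon.127.0.0.1.nip.io`) are rejected because neither equals nor ends with `.` plus an allowlisted domain.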
     * SPDX-License-Identifier: Apache-2.0
     */

    import { fetchWithTimeout } from './fetch.js';
Summary
Adds Amazon URL parsing and product metadata extraction support to web-fetch.
This enables the CLI to automatically resolve Amazon short URLs (amzn.in, amzn.to) and extract structured product information for comparison and analysis workflows.
Details
Features Added
Detects Amazon and Amazon short URLs
Expands shortened Amazon URLs to canonical product URLs
Extracts structured metadata from Amazon product pages: title, price, bullets, brand, and model
Injects extracted metadata into LLM context through web-fetch
Gracefully falls back to standard fetch behavior if extraction fails
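To make the "injects extracted metadata into LLM context" step concrete, here is a minimal sketch of how such metadata might be rendered as plain text. The field names mirror the object returned by `extractAmazonMetadata` in the review thread; the optional types, the `formatAmazonMetadata` name, and the output layout are assumptions for illustration, not the PR's actual formatter.

```typescript
// Field names follow the object built in extractAmazonMetadata;
// which fields are optional is an assumption.
interface AmazonProductMetadata {
  canonicalUrl: string;
  title?: string;
  price?: string;
  brand?: string;
  model?: string;
  bullets?: string[];
}

// Hypothetical formatter: emits one labelled line per field that was
// actually extracted, so missing fields add no noise to the context.
function formatAmazonMetadata(meta: AmazonProductMetadata): string {
  const lines: string[] = [`URL: ${meta.canonicalUrl}`];
  if (meta.title) lines.push(`Title: ${meta.title}`);
  if (meta.price) lines.push(`Price: ${meta.price}`);
  if (meta.brand) lines.push(`Brand: ${meta.brand}`);
  if (meta.model) lines.push(`Model: ${meta.model}`);
  if (meta.bullets && meta.bullets.length > 0) {
    lines.push('Features:');
    for (const bullet of meta.bullets) {
      lines.push(`- ${bullet}`);
    }
  }
  return lines.join('\n');
}
```

Skipping absent fields keeps the output usable when only part of the page could be parsed, which fits the graceful-fallback goal listed above.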
Implementation Notes
Added utility parser: packages/core/src/utils/amazon-url-parser.ts
Added unit tests: packages/core/src/utils/amazon-url-parser.test.ts
Integrated metadata extraction into: packages/core/src/tools/web-fetch.ts
This implementation intentionally keeps scope lightweight and avoids browser automation or anti-bot bypassing systems to remain maintainable and focused.
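As an illustration of what "lightweight" extraction without a browser can look like, the sketch below pulls a product title out of raw HTML with a regular expression. It is not the PR's `extractTitle`; the `extractTitleSketch` name is hypothetical, and the assumption that the title sits in a `<span id="productTitle">` element may not hold for every page layout.

```typescript
// Decode the few HTML entities that commonly show up in product titles.
// &amp; is decoded last so that "&amp;lt;" does not become "<".
function decodeBasicEntities(text: string): string {
  return text
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&amp;/g, '&');
}

// Hypothetical extractor: assumes the title is in <span id="productTitle">.
function extractTitleSketch(html: string): string | undefined {
  const match =
    /<span[^>]*\bid=["']productTitle["'][^>]*>([\s\S]*?)<\/span>/i.exec(html);
  const raw = match?.[1];
  if (raw === undefined) {
    return undefined;
  }
  const text = decodeBasicEntities(raw.replace(/<[^>]+>/g, ''))
    .replace(/\s+/g, ' ')
    .trim();
  return text.length > 0 ? text : undefined;
}
```

Returning `undefined` when the element is missing lets the caller fall back to the standard fetch path instead of reporting an empty title.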
Related Issues
Fixes #27448
How to Validate
Build
Typecheck
Run Tests
Manual Validation
Use an Amazon product URL such as:
Expected behavior:
Pre-Merge Checklist
Added/updated tests
Validated on Windows