feat(core): Add Amazon URL parsing and metadata extraction #27455

Open
shaurya-cd wants to merge 4 commits into google-gemini:main from shaurya-cd:feature/amazon-url-unfurling

@shaurya-cd

Summary

Adds Amazon URL parsing and product metadata extraction support to web-fetch.

This enables the CLI to automatically resolve Amazon short URLs (amzn.in, amzn.to) and extract structured product information for comparison and analysis workflows.

Details

Features Added

  • Detects Amazon and Amazon short URLs

  • Expands shortened Amazon URLs to canonical product URLs

  • Extracts structured metadata from Amazon product pages:

    • Product title
    • Price
    • Brand
    • Model
    • Key feature bullets/specifications
  • Injects extracted metadata into LLM context through web-fetch

  • Gracefully falls back to standard fetch behavior if extraction fails
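
The fallback behavior in the last bullet could be sketched roughly as follows. This is a minimal illustration, not the PR's code: `ProductMetadata`, `formatMetadata`, and `fetchWithAmazonFallback` are hypothetical names, the field types are assumed from the list above, and both the extractor and the standard fetch path are injected as callbacks so the sketch stays self-contained.

```typescript
// Hypothetical shape based on the fields listed above; the real
// AmazonProductMetadata type in the PR may differ.
interface ProductMetadata {
  canonicalUrl: string;
  title?: string;
  price?: string;
  brand?: string;
  model?: string;
  bullets: string[];
}

// Render metadata as plain text for LLM context, skipping missing fields.
function formatMetadata(meta: ProductMetadata): string {
  const lines: string[] = [`URL: ${meta.canonicalUrl}`];
  if (meta.title) lines.push(`Title: ${meta.title}`);
  if (meta.price) lines.push(`Price: ${meta.price}`);
  if (meta.brand) lines.push(`Brand: ${meta.brand}`);
  if (meta.model) lines.push(`Model: ${meta.model}`);
  for (const bullet of meta.bullets) lines.push(`- ${bullet}`);
  return lines.join('\n');
}

// Try the Amazon-specific path first; on any failure, use the standard
// fetch path instead. Both callbacks are assumptions of this sketch.
async function fetchWithAmazonFallback(
  url: string,
  extract: (url: string) => Promise<ProductMetadata>,
  standardFetch: (url: string) => Promise<string>,
): Promise<string> {
  try {
    const meta = await extract(url);
    // Assumption: a result without a title counts as a failed extraction.
    if (!meta.title) throw new Error('no product title found');
    return formatMetadata(meta);
  } catch {
    return standardFetch(url);
  }
}
```

Injecting the two callbacks keeps the fallback decision in one place and makes it straightforward to exercise without network access.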

Implementation Notes

  • Added utility parser:

    • packages/core/src/utils/amazon-url-parser.ts
  • Added unit tests:

    • packages/core/src/utils/amazon-url-parser.test.ts
  • Integrated metadata extraction into:

    • packages/core/src/tools/web-fetch.ts

This implementation intentionally keeps its scope lightweight and avoids browser automation and anti-bot bypass techniques so that it stays maintainable and focused.
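
As an illustration of what extraction without browser automation can look like, here is a small regex-based sketch that pulls the text of an element by its `id` from raw HTML. It is not the parser added in this PR: `extractTextById` and `decodeEntities` are hypothetical helpers, and the `productTitle` id used in the usage note is an assumption about the page markup.

```typescript
// Decode the handful of HTML entities that commonly appear in titles.
// "&amp;" is decoded last so that "&amp;lt;" is not decoded twice.
function decodeEntities(text: string): string {
  return text
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&amp;/g, '&');
}

// Return the text of the first element with the given id, or undefined.
// Limitation: assumes the element contains no nested element of the
// same tag name, which a regex cannot balance.
function extractTextById(html: string, id: string): string | undefined {
  const safeId = id.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const pattern = new RegExp(
    `<([a-z0-9]+)[^>]*\\bid=["']${safeId}["'][^>]*>([\\s\\S]*?)</\\1>`,
    'i',
  );
  const match = pattern.exec(html);
  if (!match) return undefined;
  const inner = match[2] ?? '';
  // Drop nested tags, decode entities, and collapse whitespace.
  const text = decodeEntities(inner.replace(/<[^>]+>/g, ' '))
    .replace(/\s+/g, ' ')
    .trim();
  return text.length > 0 ? text : undefined;
}
```

A call such as `extractTextById(html, 'productTitle')` returns `undefined` when the element is missing, which lets a caller treat absent fields as optional rather than as errors.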

Related Issues

Fixes #27448

How to Validate

Build

npm run build --workspace=@google/gemini-cli-core

Typecheck

npm run typecheck --workspace=@google/gemini-cli-core

Run Tests

npx vitest src/utils/amazon-url-parser.test.ts

Manual Validation

Use an Amazon product URL such as:

https://amzn.in/d/00geRr5g

Expected behavior:

  • URL resolves successfully
  • Product metadata is extracted
  • Structured product details are returned instead of raw HTML

Pre-Merge Checklist

  • Added/updated tests

  • Validated on Windows

    • npm run
    • npx

shaurya-cd requested review from a team as code owners May 26, 2026 13:36
@google-cla

google-cla Bot commented May 26, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@github-actions

github-actions Bot commented May 26, 2026

🛑 Action Required: Evaluation Approval

Steering changes have been detected in this PR. To prevent regressions, a maintainer must approve the evaluation run before this PR can be merged.

Maintainers:

  1. Go to the Workflow Run Summary.
  2. Click the yellow 'Review deployments' button.
  3. Select the 'eval-gate' environment and click 'Approve'.

Once approved, the evaluation results will be posted here automatically.

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces native support for parsing and extracting structured metadata from Amazon product pages within the web-fetch tool. By enabling the automatic resolution of shortened Amazon links and the extraction of key product details, this change improves the quality of information available to the LLM for comparison and analysis tasks without requiring heavy browser automation.

Highlights

  • Amazon URL Support: Added logic to detect Amazon and Amazon short URLs (amzn.in, amzn.to) and expand them to canonical product URLs.
  • Metadata Extraction: Implemented scraping utilities to extract product titles, prices, brands, models, and feature bullets from Amazon product pages.
  • LLM Integration: Integrated the new parsing logic into the web-fetch tool to inject structured product metadata directly into the LLM context.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page. Here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature | Command | Description
Code Review | /gemini review | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state.
Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help | /gemini help | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

gemini-code-assist Bot left a comment

Code Review

This pull request adds Amazon product metadata extraction to the WebFetchTool, introducing a new utility to parse Amazon URLs, expand shortened links, extract product details (such as title, price, brand, model, and bullets), and format them into LLM-friendly context. The reviewer raised critical security concerns regarding Server-Side Request Forgery (SSRF). Specifically, they recommended validating that the expanded canonical URL does not point to a private IP before fetching, and tightening the isAmazonUrl hostname validation to prevent domain spoofing and DNS rebinding bypasses.

Comment on lines +178 to +193
export async function extractAmazonMetadata(
  url: string,
): Promise<AmazonProductMetadata> {
  const canonicalUrl = await expandAmazonUrl(url);

  const html = await fetchAmazonHtml(canonicalUrl);

  return {
    canonicalUrl,
    title: extractTitle(html),
    price: extractPrice(html),
    bullets: extractBullets(html),
    brand: extractBrand(html),
    model: extractModel(html),
  };
}

critical

Security Vulnerability: Redirect to Private IP (SSRF)

When a shortened Amazon URL is expanded, the redirected canonicalUrl could point to a private IP address or localhost. We must validate that the expanded URL does not target a private IP address before fetching its HTML content.

Suggested change
- export async function extractAmazonMetadata(
-   url: string,
- ): Promise<AmazonProductMetadata> {
-   const canonicalUrl = await expandAmazonUrl(url);
-   const html = await fetchAmazonHtml(canonicalUrl);
-   return {
-     canonicalUrl,
-     title: extractTitle(html),
-     price: extractPrice(html),
-     bullets: extractBullets(html),
-     brand: extractBrand(html),
-     model: extractModel(html),
-   };
- }
+ export async function extractAmazonMetadata(
+   url: string,
+ ): Promise<AmazonProductMetadata> {
+   const canonicalUrl = await expandAmazonUrl(url);
+   if (isPrivateIp(canonicalUrl)) {
+     throw new PrivateIpError(`Access to private network is blocked: ${canonicalUrl}`);
+   }
+   const html = await fetchAmazonHtml(canonicalUrl);
+   return {
+     canonicalUrl,
+     title: extractTitle(html),
+     price: extractPrice(html),
+     bullets: extractBullets(html),
+     brand: extractBrand(html),
+     model: extractModel(html),
+   };
+ }
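
To make the kind of check requested above concrete, here is a minimal sketch of a private-host test on a URL string. It is not the repository's `isPrivateIp`: `pointsToPrivateHost` is a hypothetical stand-in, it only inspects IP literals and `localhost`, and it does not resolve DNS, so a public-looking hostname that resolves to a private address would still need a resolution-time check.

```typescript
// Return true when the URL's host is loopback, link-local, or in a
// private range. Hypothetical helper for illustration only.
function pointsToPrivateHost(rawUrl: string): boolean {
  let host: string;
  try {
    host = new URL(rawUrl).hostname.toLowerCase();
  } catch {
    return true; // Unparseable input is treated as unsafe.
  }
  if (host === 'localhost' || host.endsWith('.localhost')) return true;

  // URL.hostname keeps the square brackets around IPv6 literals.
  // Limitation: IPv4-mapped IPv6 addresses are not handled here.
  if (host.startsWith('[')) {
    return (
      host === '[::1]' || // loopback
      /^\[f[cd][0-9a-f]{2}:/.test(host) || // unique local, fc00::/7
      /^\[fe[89ab][0-9a-f]:/.test(host) // link-local, fe80::/10
    );
  }

  const parts = host.split('.');
  if (parts.length !== 4 || !parts.every((p) => /^\d{1,3}$/.test(p))) {
    return false; // Not an IPv4 literal; DNS is not resolved here.
  }
  const a = Number(parts[0]);
  const b = Number(parts[1]);
  return (
    a === 0 ||
    a === 10 ||
    a === 127 ||
    (a === 169 && b === 254) ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168)
  );
}
```

If redirects are followed automatically during expansion, a check on the final URL alone may come too late, so validating each hop before following it is worth considering.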

Comment on lines +27 to +37
export function isAmazonUrl(url: string): boolean {
  try {
    const parsed = new URL(url);

    const host = parsed.hostname.toLowerCase();

    return AMAZON_HOST_PATTERNS.some((pattern) => host.includes(pattern));
  } catch {
    return false;
  }
}

security-high high

The isAmazonUrl function uses host.includes(pattern) for validation, which is insecure. This allows attacker-controlled domains like amazon.attacker.com or amazon.127.0.0.1.nip.io to be incorrectly identified as Amazon URLs, leading to Server-Side Request Forgery (SSRF) and DNS rebinding bypasses. A stricter validation is required, ensuring the hostname accurately matches Amazon domains (e.g., ending with .amazon.<tld> or amzn.to/amzn.in).

Suggested change
- export function isAmazonUrl(url: string): boolean {
-   try {
-     const parsed = new URL(url);
-     const host = parsed.hostname.toLowerCase();
-     return AMAZON_HOST_PATTERNS.some((pattern) => host.includes(pattern));
-   } catch {
-     return false;
-   }
- }
+ export function isAmazonUrl(url: string): boolean {
+   try {
+     const parsed = new URL(url);
+     const host = parsed.hostname.toLowerCase();
+     return /^(.*\.)?(amazon\.[a-z]{2,3}(\.[a-z]{2})?|amzn\.(in|to))$/i.test(host);
+   } catch {
+     return false;
+   }
+ }
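
An alternative to a regex is an explicit allowlist with exact or dot-suffix matching, sketched below. Note that the regex suggested above accepts any two- or three-letter suffix after `amazon.`, so an allowlist may be the stricter option. The names `ALLOWED_AMAZON_DOMAINS` and `isAllowedAmazonHost` are hypothetical, and the domain list is illustrative rather than complete.

```typescript
// Illustrative allowlist (assumption); a real list of storefront and
// short-link domains would need to be curated and maintained.
const ALLOWED_AMAZON_DOMAINS: readonly string[] = [
  'amazon.com',
  'amazon.in',
  'amazon.co.uk',
  'amazon.de',
  'amzn.to',
  'amzn.in',
];

// Accept only http(s) URLs whose hostname equals an allowed domain or is
// a subdomain of one. Substring matches such as "amazon.attacker.com"
// fail because the comparison is anchored at the end of the hostname.
function isAllowedAmazonHost(rawUrl: string): boolean {
  let parsed: URL;
  try {
    parsed = new URL(rawUrl);
  } catch {
    return false;
  }
  if (parsed.protocol !== 'https:' && parsed.protocol !== 'http:') {
    return false;
  }
  const host = parsed.hostname.toLowerCase();
  return ALLOWED_AMAZON_DOMAINS.some(
    (domain) => host === domain || host.endsWith(`.${domain}`),
  );
}
```

The leading dot in the suffix comparison matters: without it, a hostname like `notamazon.com` would pass.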

* SPDX-License-Identifier: Apache-2.0
*/

import { fetchWithTimeout } from './fetch.js';

high

Import Required Utilities for SSRF Prevention

Import isPrivateIp and PrivateIpError to validate the expanded canonical URL before fetching its HTML content.

Suggested change
- import { fetchWithTimeout } from './fetch.js';
+ import { fetchWithTimeout, isPrivateIp, PrivateIpError } from './fetch.js';

gemini-cli Bot added the priority/p3 and area/agent labels May 26, 2026
shaurya-cd force-pushed the feature/amazon-url-unfurling branch from a62d8e0 to da92498 May 27, 2026 10:32
shaurya-cd requested a review from a team as a code owner May 27, 2026 10:32
Labels

area/agent: Issues related to Core Agent, Tools, Memory, Sub-Agents, Hooks, Agent Quality
priority/p3: Backlog - a good idea but not currently a priority.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add Amazon URL parsing and metadata extraction for product comparisons

1 participant