zhihu download: author extraction returns "unknown" for answer pages #40

@yrom

Bug Report

Description

The zhihu download command always returns author: "unknown" for answer pages (e.g. www.zhihu.com/question/xxx/answer/yyy). Column articles (zhuanlan.zhihu.com/p/xxx) appear to have the same issue.

Steps to Reproduce

autocli zhihu download "https://www.zhihu.com/question/351504112/answer/2027391723035275294" --output /tmp/zhihu-test

Expected Output

author: "NGINX洪志道"

Actual Output

author: "unknown"

Root Cause

The author selector in adapters/zhihu/download.yaml is:

const author = document.querySelector('.AuthorInfo-name, .UserLink-link')?.textContent?.trim() || 'unknown';

These CSS selectors (.AuthorInfo-name, .UserLink-link) no longer match the current Zhihu DOM structure; the classes appear to have been renamed or removed.

Evidence

I ran autocli explore on the same URL and the Zhihu API (/api/v4/questions/{id}/feeds) correctly returns the author in its JSON response:

{
  "target": {
    "author": {
      "name": "NGINX洪志道"
    }
  }
}

This confirms the author data is available to the page through the API, so the failure lies in the outdated DOM selectors rather than in missing data.
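Since the API payload already carries the name, one possible fallback is to read it from there. A minimal sketch, assuming each feed item has the `target.author.name` shape shown in the excerpt above; `authorFromFeedItem` is a hypothetical helper name, not something that exists in autocli:

```javascript
// Hypothetical helper: reads the author name from one feed item shaped like
// the JSON excerpt above. Returns 'unknown' when the path is missing or empty.
function authorFromFeedItem(item) {
  const name = item?.target?.author?.name;
  return typeof name === 'string' && name.trim() ? name.trim() : 'unknown';
}

const sample = { target: { author: { name: 'NGINX洪志道' } } };
console.log(authorFromFeedItem(sample));
console.log(authorFromFeedItem({}));
```

Optional chaining keeps the lookup safe when the response omits `target` or `author`, which matters if the feed mixes item types.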

Suggested Fix

Update the author extraction to try multiple selectors with fallbacks:

const author =
  // Try the current selectors first
  document.querySelector('.AuthorInfo-name, .UserLink-link')?.textContent?.trim() ||
  // Try meta tags (more stable across redesigns)
  document.querySelector('meta[itemprop="author"]')?.getAttribute('content')?.trim() ||
  document.querySelector('meta[name="author"]')?.getAttribute('content')?.trim() ||
  // Try microdata, scoped to the author block so that an unrelated
  // itemprop="name" (e.g. the question title) is not picked up
  document.querySelector('[itemprop="author"] [itemprop="name"]')?.getAttribute('content')?.trim() ||
  document.querySelector('.Post-Author .AuthorInfo-name')?.textContent?.trim() ||
  // Try extracting from the initialData JSON embedded in the page
  (() => {
    try {
      const scripts = document.querySelectorAll('script[data-testid="initial-data"], script#js-initialData');
      for (const s of scripts) {
        const data = JSON.parse(s.textContent);
        const users = data?.initialState?.entities?.users;
        // Caveat: the first user entry is not guaranteed to be the answer's author
        if (users) return Object.values(users)[0]?.name || '';
      }
    } catch (e) {
      // Missing or malformed JSON: fall through to 'unknown'
    }
    return '';
  })() ||
  'unknown';
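The fallback chain above could also be expressed as an ordered list of extractors, which makes it easier to add or reorder sources later. A rough sketch, assuming nothing about autocli internals; `firstNonEmpty` is a hypothetical helper, and the plain functions below stand in for the DOM lookups:

```javascript
// Hypothetical helper: runs each extractor in order and returns the first
// non-empty, trimmed string. An extractor that throws is treated as a miss.
function firstNonEmpty(extractors, fallback = 'unknown') {
  for (const extract of extractors) {
    try {
      const value = extract();
      if (typeof value === 'string' && value.trim()) return value.trim();
    } catch (e) {
      // Ignore the failure and try the next extractor.
    }
  }
  return fallback;
}

// Usage with plain functions standing in for DOM lookups.
const author = firstNonEmpty([
  () => undefined,                          // selector matched nothing
  () => { throw new Error('bad JSON'); },   // embedded JSON failed to parse
  () => '  NGINX洪志道 ',                     // a later source succeeds
]);
console.log(author);
```

Wrapping each extractor in its own try/catch means one failing source (for example, an unparsable script tag) does not prevent the remaining sources from being tried.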

Environment

  • autocli version: latest (installed via install script)
  • OS: macOS Darwin 24.6.0
  • Chrome: logged into zhihu.com
  • Browser commands working: yes (zhihu hot, zhihu search, zhihu download all functional except author extraction)

Additional Note

Separately, the zhida (直答, Zhihu's AI search) links in the output Markdown are internal links that add noise:

[nginx](https://zhida.zhihu.com/search?content_id=777422445&content_type=Answer&match_order=1&q=nginx&zhida_source=entity)

It would be nice if the adapter stripped these zhida.zhihu.com links and kept only the display text.
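One way this could be done is a regex pass over the generated Markdown. A minimal sketch, assuming the links always take the inline `[text](https://zhida.zhihu.com/...)` form shown above; `stripZhidaLinks` is a hypothetical helper name:

```javascript
// Hypothetical helper: replaces inline Markdown links that point at
// zhida.zhihu.com with their display text. Other links are left untouched.
function stripZhidaLinks(markdown) {
  return markdown.replace(
    /\[([^\]]*)\]\(https?:\/\/zhida\.zhihu\.com\/[^)\s]*\)/g,
    '$1'
  );
}

const input =
  'Use [nginx](https://zhida.zhihu.com/search?content_id=777422445&content_type=Answer&match_order=1&q=nginx&zhida_source=entity) ' +
  'as described in the [docs](https://nginx.org/en/docs/).';
console.log(stripZhidaLinks(input));
```

The pattern only matches URLs on the zhida.zhihu.com host, so ordinary links in the answer body survive. It does not handle display text that itself contains `]`.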
