Critical Bug Report: Latest ZhiZi b40c768nbt Model Exhibits Severe Evaluation Inconsistencies in KataGo Analysis (Fan Xiping vs. Huang Yougong, 3-Stone Handicap) #1183

@GuangHui68

Description

While re-analyzing the historic game "Fan Xiping vs. Huang Yougong (3-Stone Handicap)" with the latest ZhiZi b40c768nbt model, I found that severe bugs are still present.
I want to be clear: my goal is not to "defeat" KataGo by exploiting its bugs. I rely on it to provide clear and trustworthy analysis. When the analysis data exhibits obvious, visually apparent bugs, it fundamentally damages KataGo's credibility.
Before detailing the specific issues in this game, I would like to outline what I consider to be clear indicators of a bug:

1. Massive Point Swing Between Consecutive Moves: If two consecutive moves each show a gain of 10 points or more (sometimes tens of points), this cannot be attributed to both players playing "divine moves." It is almost certainly a blind spot in KataGo's calculation.
2. Universal Catastrophic Evaluation: If, on a given turn, all candidate moves are evaluated at -10 points or worse (sometimes -30 points or worse), this cannot mean the previous move was a "divine move." It can only indicate a significant oversight or miscalculation.
3. Extreme Loss Followed by Immediate Massive Gain: If a move incurs a loss of around 100 points, and then a few moves later (or even immediately after) there is a gain of about 50 points, this is very likely a calculation error.
4. Inconsistent Evaluation of the Same Move: The evaluation (and color indicator) for a single move changes drastically depending on whether it is viewed before it is played, after it is played, or after the next move is played. Furthermore, undoing the move does not revert the evaluation. What is the cause of this?

I propose that KataGo should detect these anomalous scenarios and provide a rollback mechanism that recalculates the evaluations of the preceding few moves to correct such obvious errors. However complex the position, even a chaotic multi-group fight, if the evaluation becomes this unstable, can we truly trust a program that is supposed to give a 3-stone handicap to top professional players?
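The first three indicators above are mechanical enough to check in a script. The following is a minimal Python sketch, not KataGo or KaTrain code. It assumes the per-move point changes have already been exported into a plain list (positive means the mover gained points, negative means a loss). The function names, the `window` parameter, and the sample numbers in the comments are hypothetical; the thresholds are the ones quoted in the list.

```python
from typing import List, Sequence


def consecutive_swings(deltas: Sequence[float], threshold: float = 10.0) -> List[int]:
    """Indicator 1: indices i where moves i and i+1 both gain >= threshold points."""
    return [
        i
        for i in range(len(deltas) - 1)
        if deltas[i] >= threshold and deltas[i + 1] >= threshold
    ]


def all_candidates_bad(candidate_deltas: Sequence[float], threshold: float = -10.0) -> bool:
    """Indicator 2: every candidate on this turn is evaluated at threshold or worse."""
    return len(candidate_deltas) > 0 and all(d <= threshold for d in candidate_deltas)


def loss_then_gain(
    deltas: Sequence[float],
    loss: float = -100.0,
    gain: float = 50.0,
    window: int = 3,
) -> List[int]:
    """Indicator 3: indices of a huge loss followed by a big gain within `window` moves."""
    flagged = []
    for i, d in enumerate(deltas):
        following = deltas[i + 1 : i + 1 + window]
        if d <= loss and any(x >= gain for x in following):
            flagged.append(i)
    return flagged
```

A caller would run these over the exported deltas and re-analyze only the flagged turns, which is the kind of targeted recalculation proposed above. Whether a flagged turn is really a bug still has to be judged on the board.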
Specific Bugs Found in Moves 131-150
Within the short span of just 20 moves (131-150), the following clear bugs are present:

A. Move 140: This is a textbook example of Bug Type #2 (see Figure 2). The preceding move, 139, is shown in Figure 1.
B. Move 142: Bug Type #2 appears again (see Figure 4). The preceding move, 141, is shown in Figure 3.
C. Move 139: Before being played, its evaluation was -8.0 points (Figure 1). After it was played, its color changed to dark purple, indicating a loss of more than 12 points (Figure 2). Critically, undoing the move does not revert the evaluation to -8.0.
D. Move 140: The same issue recurs. In Figure 2, move 140 is evaluated at -14.2 points. But in Figure 3 (after the next move is played), it changes to yellow, meaning a loss of less than 3 points. Why does the evaluation of the same move fluctuate so wildly over just one ply?
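Items C and D come down to a spread check on repeated readings of the same move. The sketch below uses the figures quoted in this report. Note that -12 and -3 are only the color-band bounds visible in the screenshots, not exact readings, so the computed spreads are lower bounds. The function names and the 3-point tolerance are assumptions made for illustration.

```python
from typing import Sequence


def evaluation_spread(readings: Sequence[float]) -> float:
    """Largest difference between any two readings of the same move, in points."""
    return max(readings) - min(readings)


def is_inconsistent(readings: Sequence[float], tolerance: float = 3.0) -> bool:
    """True when readings of one move disagree by more than `tolerance` points."""
    return evaluation_spread(readings) > tolerance


# Move 139: -8.0 before it was played, "worse than -12" afterwards (bound only).
move_139 = [-8.0, -12.0]
# Move 140: -14.2 in Figure 2, "better than -3" in Figure 3 (bound only).
move_140 = [-14.2, -3.0]

for name, readings in (("139", move_139), ("140", move_140)):
    spread = evaluation_spread(readings)
    print(f"move {name}: inconsistent={is_inconsistent(readings)}, spread >= {spread:.1f}")
```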

That this bug occurs on two consecutive moves suggests a severe systemic error in the whole calculation, which should be redone from scratch. This is the result even at 200,000 visits, so a great deal of computation time was wasted (the electricity cost is the lesser concern).
On a positive note, unlike the two previous b28c512 networks, this run analyzed White's move 145 without error.

Attached Files:
Zip package:
Original Game Record: O1.sgf
Analyzed Game Record: A3.sgf
Reference Figures: 01.png, 02.png, 03.png, 04.png

Environment

UI: Latest KaTrain version
Engine: Latest KataGo TRT version 1.16.4

Models:
kata1-zhizi-b40c768nbt-fdx6c.bin.gz (current strongest)

Analysis settings:
Single move analysis: 200,000 visits
Quick play: 8,000 visits per move
Wide Root Noise: 0.045

numAnalysisThreads = 5
numSearchThreadsPerAnalysisThread = 24
nnMaxBatchSize = 128

Backend: TensorRT
OS: Windows 11
GPU: RTX 3080
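To reproduce the analysis outside KaTrain, the settings above can be put into a query for KataGo's JSON analysis engine, which reads one JSON object per line on stdin. The sketch below only builds that line. The ruleset, komi, query id, and the helper name `build_query` are assumptions, since the report does not state them, and the real moves and handicap stones would have to be taken from O1.sgf.

```python
import json
from typing import Sequence, Tuple

Move = Tuple[str, str]  # (player, vertex), e.g. ("W", "R14")


def build_query(
    moves: Sequence[Move],
    initial_stones: Sequence[Move],
    turns: Sequence[int],
    max_visits: int = 200_000,  # "Single move analysis: 200,000 visits"
) -> str:
    """Return one JSON line suitable for the KataGo analysis engine."""
    query = {
        "id": "fan-vs-huang",  # arbitrary label chosen for this sketch
        "initialStones": [list(s) for s in initial_stones],
        "moves": [list(m) for m in moves],
        "rules": "chinese",  # assumption: ruleset is not stated in the report
        "komi": 0.0,  # assumption: komi is not stated in the report
        "boardXSize": 19,
        "boardYSize": 19,
        "analyzeTurns": list(turns),  # turn n = position after n moves
        "maxVisits": max_visits,
        "overrideSettings": {"wideRootNoise": 0.045},  # from the settings above
    }
    return json.dumps(query)
```

With the full move list loaded from the SGF, passing `turns=range(131, 151)` would cover the span discussed in this report. Comparing the `scoreLead` values in the responses for adjacent turns would show whether the jumps also appear without the UI.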

