From e8980f25c1e121447ea26976fe3e14ed638c4684 Mon Sep 17 00:00:00 2001 From: "weilong.cwl" Date: Sun, 14 Sep 2025 00:10:04 +0800 Subject: [PATCH 1/4] add seg blob --- content/blog/ming-lite-omni-1_5-seg/index.md | 200 ++++++++++++++++++ .../blog/ming-lite-omni-1_5-seg/index.zh.md | 146 +++++++++++++ 2 files changed, 346 insertions(+) create mode 100644 content/blog/ming-lite-omni-1_5-seg/index.md create mode 100644 content/blog/ming-lite-omni-1_5-seg/index.zh.md diff --git a/content/blog/ming-lite-omni-1_5-seg/index.md b/content/blog/ming-lite-omni-1_5-seg/index.md new file mode 100644 index 0000000..3ba5081 --- /dev/null +++ b/content/blog/ming-lite-omni-1_5-seg/index.md @@ -0,0 +1,200 @@ +--- +title: "Segmentation-as-Editing for Unified Multimodal AI" +date: 2025-09-13T00:00:03+08:00 +weight: 1 +math: true +# draft: true +show_reading_time: true +show_bread_crumbs: true +show_post_nav_links: false # the prev/next after the content +show_code_copy_buttons: true +show_word_count: true +--- + +{{< button href="https://github.com/inclusionAI/Ming" label="GITHUB" external=true >}} 🤗 Hugging Face| 🤖 ModelScope + +# Ming-lite-omni 1.5: Segmentation-as-Editing for Unified Multimodal AI + +### The Hype and the Hidden Question + +The multimodal AI world has been thriving. + +From the debut of Qwen-Image to the interactive editing hype sparked by Nano Banana, image editing has rapidly become the next battlefield for generative AI. + +Editing fundamentally requires two distinct skill sets: +- **Know *where*, *what*, and *how* to change** (understanding the image) +- **Produce the change with high visual quality** (generating the image) + +Its rich gameplay and strong interactivity have pulled in users, developers, and creators alike. + +But behind the noise, few are asking: + +> **Beneath this prosperity, how close are we to a truly unified “understanding + generation” AI?** + +### Understanding and Generation: Two Hands, Often Out of Sync + +For years, we’ve chased an ambitious goal: + +Build a unified multimodal model that understands the world like a scientist (e.g., image segmentation) while creating it like an artist (e.g., image editing). + +In theory, these abilities should be mutually reinforcing: + +> *“The deeper the understanding, the better the creation; the more the creation, the deeper the understanding.”* + +Reality is messier. + +In AI today: +- **Understanding = the left hand:** precise abstractions, semantic reasoning, boundaries. +- **Generation = the right hand:** coherent pixels, style, aesthetics. + +But training a model to recognize 10,000 cat photos doesn’t magically make it capable of painting cats, and painting cats repeatedly doesn’t make it understand cats better. + +Worse, in multitask training, the two often compete for resources — optimizations for understanding can hurt generation, and vice versa. + +**We’re missing a catalyst: a task that forces the left and right hands to evolve together.** + +--- + +### The Struggle: 16% Segmentation and Out-of-Control Generation + +Before finding our solution, our unified model was struggling with generative segmentation: + +Given an instruction like “*segment the banana in the upper-right corner*”, we wanted the model to output a segmentation mask directly. + +The results were painful. + +![Struggling with Segmentation](占位符:请在这里替换为您的图示链接) + +On RefCOCO-val, our cIoU plateaued at **~16%**. + +The root cause is the **distribution gap**. + +Generative models thrive on natural, continuous image distributions. 
Segmentation masks, however, are synthetic, abstract, binary maps — as unnatural as it gets for an image generator. + +It was like asking a painter to draw an X-ray: doable, but far from their artistic instincts. + +Here, generation wasn’t helping segmentation — it was tripping it up. + +We needed a new task that: +1. Met the precision demands of **understanding**. +2. Played to the strengths of **generation**. + +### The “Aha” Moment: Dressing Segmentation in Color + +Here’s the analogy that unlocked it for us: + +> *If you want a child to mark an object, is it easier to have them draw a tight outline with a pencil, or fill it in with bright colors?* + +Obviously, the latter. + +Instead of forcing our model to output abstract black-and-white masks, we **turned the segmentation task into a color-editing task**. + +**Example:** +- **Instruction:** “*segment the banana in the upper-right*” +- **Old way:** Output a mask ❌ +- **New way:** Directly edit the image: “*paint the banana purple*”, “*make the banana red*”, etc. ✅ + +![Segmentation as Editing](占位符:请在这里替换为您的图示链接) + +This brought the task’s data distribution back to the realm of natural images — where generative models shine. + +### Why This Works: The Hidden Catalyst + +That small twist turned out to be exactly the catalyst we’d been searching for. + +- **Boosting Understanding:** +To color the banana without bleeding outside the boundary, the model must internally nail pixel-perfect segmentation. The segmentation step became an **implicit prerequisite** to editing. + +- **Unleashing Generation:** +No more awkward synthetic masks — the model is doing what it knows best: image-to-image editing. All its strengths in shading, texture, and edge blending go into making the change look natural. + +For the first time, the left hand and right hand weren’t fighting — **they were helping each other**. + +--- + +### The Numbers: From 16% to 72.4% — and Beyond + +#### 1. SOTA-level Segmentation + +The cIoU score didn’t just improve — it soared from 16% to **72.4%** on RefCOCO-val, a relative gain of over **350%**. + +Qualitatively, the model outperformed competitors in pinpointing and segmenting targets, even in reasoning-heavy cases. + +Against Qwen-Image and Nano Banana, our model: +- Located small or occluded targets more reliably. +- Produced boundaries that were visually and semantically aligned with instructions. + +![Segmentation Comparison 1](占位符:请在这里替换为您的图示链接) +*Our model (right) accurately locates and segments the target subject. Qwen-Image (second from left) fails to locate the correct target, while Nano-banana (third from left) fails to accurately segment the man's head and has loose boundary lines.* + +![Segmentation Comparison 2](占位符:请在这里替换为您的图示链接) +*For the prompt "please segment the girl with red mask," our model (right) is precise. Qwen-Image (second from left) misses the feet, and Nano-banana (third from left) alters the subject's proportions.* + +During evaluation, thanks to the high consistency of non-edited regions in our model, we can directly derive the segmentation mask by calculating the difference between the edited result and the original image. The results show that our model's performance on segmentation is now on par with specialized vision models. 
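+In code, that differencing step can be as simple as thresholding the per-pixel change. The snippet below is a minimal illustrative sketch, not our actual evaluation pipeline: the function names, the fixed RGB-change threshold, and the assumption that the edited output comes back at the original resolution are all choices made for the example.
+
+```python
+import numpy as np
+from PIL import Image
+
+
+def mask_from_color_edit(original_path: str, edited_path: str, threshold: float = 30.0) -> np.ndarray:
+    """Recover a binary mask from a color edit by thresholding per-pixel change."""
+    original = np.asarray(Image.open(original_path).convert("RGB"), dtype=np.float32)
+    edited = np.asarray(Image.open(edited_path).convert("RGB"), dtype=np.float32)
+
+    # Largest absolute RGB change per pixel; non-edited regions should stay near zero
+    # when the model keeps unedited areas consistent.
+    per_pixel_change = np.abs(edited - original).max(axis=-1)
+
+    # Pixels whose color changed by more than the threshold form the predicted mask.
+    return (per_pixel_change > threshold).astype(np.uint8)
+
+
+def cumulative_iou(pred_masks, gt_masks) -> float:
+    """cIoU-style score: accumulate intersection and union over the whole set, then divide."""
+    intersection = sum(np.logical_and(p, g).sum() for p, g in zip(pred_masks, gt_masks))
+    union = sum(np.logical_or(p, g).sum() for p, g in zip(pred_masks, gt_masks))
+    return float(intersection) / float(union) if union > 0 else 0.0
+```
+
+A small morphological clean-up (for example, dropping isolated changed pixels) may further help when an edit introduces slight off-target noise. The table below lists the resulting cIoU scores alongside specialist baselines.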
+ +| Model Category | Model Name | RefCOCO (val) | RefCOCO+ (val) | RefCOCOg (val) | +| :--- | :--- | :---: | :---: | :---: | +| **Vision Specialist Models** | VLT | 67.5 | 56.3 | 55.0 | +| | CRIS | 70.5 | 62.3 | 59.9 | +| | LAVT | 72.7 | 62.1 | 61.2 | +| | PolyFormer-B | 74.8 | 67.6 | 67.8 | +| **MLLM + Specialist (SAM)** | LISA-7B | 74.1 | 62.4 | 66.4 | +| | PixelLM-7B | 73.0 | 66.3 | 69.3 | +| **Generative Models** | Qwen-Image-Edit* | 30.3 | 28.8 | 34.0 | +| | **Ming-Lite-Omni1.5 (Ours)** | **72.4** | **62.8** | **64.3** | +*Due to its lower metrics, Qwen-Image-Edit was evaluated on a random sample of 500 images per test subset.* + +#### 2. Sharper, More Controllable Editing + +The beauty of this method is that it not only fixed the segmentation weakness but also dramatically enhanced the model's general editing capabilities. + +Because the model has learned an unprecedented "respect for boundaries" through thousands of "precise coloring" exercises, this "muscle memory" for fine-grained control has transferred to all editing tasks. Our edit controllability score saw a significant jump from **7.69 to 8.12** across sub-tasks like background, color, and material changes. + +![Editing Controllability Comparison](占位符:请在这里替换为您的图示链接) +*Prompt: "remove the bow tie of the man on the far right." Our model (right) precisely removes only the target bow tie while maintaining background consistency. Qwen (second from left) incorrectly removes multiple bow ties and introduces inconsistencies. Nano-banana (third from left) also struggles with consistency.* + +#### 3. Stronger ID Consistency + +A core challenge in portrait editing is maintaining identity. Our model excels here as well. Whether changing a hairstyle or adjusting an expression, the model skillfully preserves the person's core features. + +![ID Consistency Comparison](占位符:请在这里替换为您的图示链接) +*Top Row (Turn head): Our model (right) maintains ID and background consistency, unlike competitors. Middle Row (Smile): Our model (right) correctly follows the prompt while preserving ID, avoiding distortions seen in others. Bottom Row (Change background): Our model (right) excels at preserving the subject's ID and appearance during a background swap.* + +**See More Editing Consistency in Action:** +![More Consistency Examples](占位符:请在这里替换为您的图示链接) + +--- + +### An Honest Look: Where We Can Still Improve + +Despite the leap forward, challenges remain: +- **Large pose changes** (e.g., standing → running) need more reliability. +- **Multi-step or compound instructions** require better parsing and execution. +- **Instruction diversity support** needs expansion. + +These are our next milestones. + +### Takeaway: The Next Catalysts Are Out There + +From 16% to 72.4% — this wasn’t driven by a massive architecture overhaul or billion-image datasets. + +It came from **one change in task design**. + +The lesson: Instead of gluing capabilities together after the fact, **find naturally cooperative tasks** — where solving the problem requires multiple abilities to mesh seamlessly. + +“Segmentation-as-editing” is just the first example. + +We suspect 3D understanding, video generation, and other domains have their own hidden catalysts, waiting to be discovered. + +**At last, AI’s left and right hands have learned to high-five.** + +**And this is only the overture.** + +--- + +Try out our open-source model **Ming-lite-omni 1.5** on our [**GitHub Page / Demo Page**](占位符:你的GitHub/Demo链接). Please star our repo if you like it! 
+ +To cite our work: +``` + +``` \ No newline at end of file diff --git a/content/blog/ming-lite-omni-1_5-seg/index.zh.md b/content/blog/ming-lite-omni-1_5-seg/index.zh.md new file mode 100644 index 0000000..79beac4 --- /dev/null +++ b/content/blog/ming-lite-omni-1_5-seg/index.zh.md @@ -0,0 +1,146 @@ +--- +title: "编辑式图像分割:Ming-lite-omni 1.5 破解AI“左右互搏”的隐藏催化剂" +date: 2025-09-13T00:00:03+08:00 +weight: 1 +math: true +# draft: true +show_reading_time: true +show_bread_crumbs: true +show_post_nav_links: false # the prev/next after the content +show_code_copy_buttons: true +show_word_count: true +--- + +# 编辑式图像分割:Ming-lite-omni 1.5 破解AI“左右互搏”的隐藏催化剂 + +最近,多模态AI领域风起云涌。从 Qwen-Image 的亮相到 Nano Banana 引发的交互式编辑热潮,图像编辑俨然已是下一个“兵家必争之地”。编辑既要明白“在哪里、是什么、怎么变”(理解图像),又要高质量地创造出结果(生成图像),其丰富的玩法和强交互性,吸引了大量用户和开发者参与讨论。然而,图像编辑除了好玩之外,还有被行业忽略的重要基础价值。 + +长久以来,我们追求着一个宏大目标:构建一个**统一的多模态模型**,它既能像科学家一样深刻理解世界(理解能力,如图像分割),又能像艺术家一样自由创造世界(生成能力,如图像编辑)。理想中,这两种能力应相辅相成,形成“理解越深,创造越好;创造越多,理解越透”的良性循环。 + +但现实却不尽人意。**理解与生成,如同AI体内的“左手”和“右手”,往往无法协同工作。** 训练模型识别一万张猫的图片,并不会直接提升它画猫的能力,反之亦然。更糟糕的是,在统一模型的训练中,两种任务常常因优化目标不同而陷入“左右互搏”的零和博弈:一次针对理解能力的优化,可能无意中损害了模型的生成质量。 + +这意味着,我们缺少一个关键的“催化剂”——一种能够促进“左手”与“右手”协同进化的任务机制。 + +今天,我们想分享一个令人兴奋的发现。**我们找到了这样一种催化剂**,一个简单而极其有效的任务转换,它不仅打破了僵局,还使模型的两项核心能力均实现了质的飞跃。这个秘诀就是:在统一模型的训练框架中,**将经典的分割任务,重新定义为一次图像编辑**,不仅让生成式分割能力达到 SOTA,还使编辑一致性实现了飞跃。 + +--- + +## 困局:16%的分割得分与失控的生成 + +在找到这个方法之前,我们的统一模型在一个关键任务上举步维艰:**生成式分割**。我们希望模型能根据指令(如“分割出右上角那只香蕉”),直接“画”出分割掩码图。 + +结果是,模型在 RefCOCO-val 上的推理分割指标(cIoU)顽固地停留在 **16%** 上下。 + +我们分析,根本原因在于**数据分布的巨大鸿沟**。生成模型习惯了处理自然、连续的图像数据。而分割任务的目标(黑白掩码图)是一种极度抽象、非自然的数据分布。强迫一个“画家”去画黑白掩码图,无异于缘木求鱼。 + +我们意识到,必须找到一个任务,它既能满足“理解之手”对边界精度的要求,又能让“创造之手”在自己熟悉的领域内大展拳脚。 + +--- + +## 灵感迸发:让分割“穿上色彩的外衣” + +我们的“Ah‑ha moment”来源于一个简单的类比:如果想让孩子准确地圈出一个物体,是让他用铅笔画一个生硬的轮廓更容易,还是让他用彩笔把那个物体涂满颜色更容易? + +答案显然是后者。 + +我们将这个想法应用到AI训练中。我们不再让模型生成抽象的黑白掩码,而是**将分割任务转换成一个色彩编辑任务**。 + +例如,对于“分割右上角的香蕉”这个指令,我们不再要求模型输出掩码,而是要求它直接在原图上执行一个新的指令:“把右上角的香蕉涂成紫色”、“把右上角的香蕉涂成红色”等等。 + +![图示说明:左侧为原图香蕉,从生成抽象的黑白掩-码(中),到直接在原图上进行色彩编辑(右三)。这个转换让任务的数据分布回归到了自然图像领域。](https://mdn.alipayobjects.com/huamei_wp0xz6/afts/img/A*acrPSp-7qM8AAAAAgCAAAAgAevzJAQ/original) + +这个看似微小的改动,却是那个我们梦寐以求的“催化剂”。 + +- **对“理解”的促进**:为了准确地只给目标香蕉上色而不溢出,模型必须在内部先完成一次完美的、像素级的分割。分割能力从最终目标,变成了完成任务的必要前提。 +- **对“创造”的释放**:模型不再处理奇怪的掩码图,而是在做它最擅长的事——图像到图像的编辑。它所有的生成能力,如光影、纹理、边缘融合,都能用来把颜色“涂”得更逼真、更准确。 + +“左手”和“右手”终于有了一个共同的目标,它们的每一次努力都在互相加强。 + +--- + +## 效果惊人:从16%到72.4%,以及更可控的编辑能力 + +当我们用这种新方法重新训练模型后,结果超出了所有人的预期。 + +### 1. SoTA级别的分割能力 + +首先,最直观的变化来自于分割指标。它从之前惨淡的16%,一跃飙升至 **72.4%**!这是一个超过 **350%** 的相对提升。 + +指标的背后,是肉眼可见的质变。在处理复杂的推理分割任务时,我们的模型展现出超越竞品的准确性和场景理解力。 + +![图示说明:我们的模型(右)精准定位并分割了目标主体,Qwen-Image(左二)未能准确定位要分割的目标,Nano-banana(左三)则未能准确分割男士的头部,以及分割的边缘线不够贴合。](占位符:请在这里替换为您的图示链接) + +![图示说明:这个case的指令“please segment the girl with red mask”, 我们的模型(右)精准定位并分割了目标主体,Qwen-Image(左二)未能分割脚部,Nano-banana(左三)则改变了主体尺寸。](占位符:请在这里替换为您的图示链接) + +在“分割女孩”的案例中,Qwen没有包含脚部,而Nano-banana改变了主体尺寸。在“分割拿雨伞的女人”这类需要推理的案例中,我们的模型能准确找到目标,而竞品则出现了主体识别错误或指令理解偏差。这证明,通过“上色”训练,模型的语义理解与视觉定位能力被深度绑定并共同强化了。 + +在推理分割指标评估过程中,依托于我们模型在非编辑区域的高度一致性,我们直接通过将涂色编辑结果与原图进行差分计算,获得分割掩码,示例如下: + +![Ming-Lite-Omni1.5 vs Qwen-Image-Edit 差分对比](占位符:请在这里替换为您的图示链接) + +评估结果显示,我们的模型在分割任务中的表现已达到与专为分割设计的专业模型相当的水平。其中,Qwen-Image-Edit因评估指标明显较低,仅在每个测试子集上随机采样500个样本进行评估。 + +| 模型类别 | 模型名称 | RefCOCO (val) | RefCOCO+ (val) | RefCOCOg (val) | +| :--- | :--- | :---: | :---: | :---: | +| **Vision Specialist**
(专用视觉分割模型) | VLT | 67.5 | 56.3 | 55.0 | +| | CRIS | 70.5 | 62.3 | 59.9 | +| | LAVT | 72.7 | 62.1 | 61.2 | +| | PolyFormer-B | 74.8 | 67.6 | 67.8 | +| **MLLM + SAM**
(专用的分割模型) | LISA-7B | 74.1 | 62.4 | 66.4 | +| | PixelLM-7B | 73.0 | 66.3 | 69.3 | +| **MLLM + DiT**
(生成式模型做分割) | Qwen-Image-Edit* | 30.3 | 28.8 | 34.0 | +| | **Ming-Lite-Omni1.5** | **72.4** | **62.8** | **64.3** | + +### 2. 更精准、更可控的编辑能力 + +这个方法的魅力在于,它不仅治好了分割的“短板”,还反过来极大地增强了模型的通用编辑能力。 + +因为模型在成千上万次“精确上色”的练习中,学会了对边界前所未有的尊重。这种对细粒度控制的“肌肉记忆”迁移到了所有编辑任务中。我们的编辑精度可控性指标,在背景改变、颜色修改和材质修改等子项上,均分从7.69提升到8.12。 + +![图示说明:指令为“消除图中最右侧男士的领结”。我们的模型(右)精准地移除了目标领结,同时保持了背景马匹等元素的一致性。Qwen(左二)错误地移除了多个领结,且马匹和老虎出现了不一致。Nano-banana(左三)同样在衣服款式的一致性和老虎斑纹的一致性上表现不佳。](占位符:请在这里替换为您的图示链接) + +### 3. 身份的一致性保持 + +在人像编辑中,一个核心痛点是身份(ID)一致性。我们的模型在这方面也表现出色。无论是改变发型,还是调整表情,模型都能很好地保持人物的核心特征。 + +**指令:“头转向左侧”** +- Qwen(左)的ID、肤色存在不一致现象。 +- Nano-banana(中)人物额头与背景处的行人均发生了改变。 +- 我们的模型(右)在转动头部的同时,很好地保持了主体和背景的一致性。 + +**指令:“微笑”** +- Qwen(左)表情变化的同时人物ID也发生了改变。 +- Nano-banana(中)在换表情的同时手部动作出先畸变。 +- 我们的模型(右)很好地遵循了指令,同时保持了ID一致性。 + +**指令:“变换背景”** +- Qwen(左)的ID一致性明显下降,看起来像换了一个人。 +- Nano-banana(中)人物ID保持的不错,但画面结构产生了较大差异。 +- 我们的模型(右)在准确地更换背景的同时,很好地保持了ID、外表的一致性。 + +![ID一致性对比图](占位符:请在这里替换为您的图示链接) + +**更多一致性 Case:** +![更多一致性案例](占位符:请在这里替换为您的图示链接) + +--- + +## 诚实的审视:我们的不足与未来方向 + +尽管取得了令人鼓舞的进展,但我们深知模型仍有很大的提升空间。特别是在以下几个方面: + +- **大幅度的动作改变**:实现从站立到奔跑这样的大姿态变换,仍然是一个巨大的挑战。 +- **复杂指令的跟随能力**:对于包含多个步骤或条件的复杂指令,模型的理解和执行能力还有待加强。 +- **指令多样性的支持**:扩展模型能理解和执行的指令类型,是我们下一步的重点工作。 + +--- + +## 结语:寻找下一个“催化剂” + +从16%到72.4%,这个故事的核心并非某个复杂的网络结构或海量的新数据,而是一个关于**“任务设计”**的尝试。 + +我们证明了,与其试图用“胶水”把AI的各种能力勉强粘合在一起,不如去寻找或设计那些本身就是“一体两面”的协同任务。这些任务就像催化剂,能让不同的能力在解决同一个问题的过程中,自然而然地相互促进、共同进化。 + +“分割即编辑”只是第一个成功的尝试。我们相信,在3D理解、视频生成等更广阔的领域,还隐藏着更多这样的“催化剂”等待我们去发现。 + +**AI的“左手”与“右手”,终于学会了如何优雅地击掌。而这,仅仅是交响乐的序章。** \ No newline at end of file From e6843ef3d682443f46a1b4c4cd6601cbc79a0a31 Mon Sep 17 00:00:00 2001 From: "weilong.cwl" Date: Sun, 14 Sep 2025 00:32:32 +0800 Subject: [PATCH 2/4] add seg blob --- content/blog/ming-lite-omni-1_5-seg/index.md | 20 +++++++++--------- .../blog/ming-lite-omni-1_5-seg/index.zh.md | 21 ++++++++++++------- 2 files changed, 23 insertions(+), 18 deletions(-) diff --git a/content/blog/ming-lite-omni-1_5-seg/index.md b/content/blog/ming-lite-omni-1_5-seg/index.md index 3ba5081..70be6b9 100644 --- a/content/blog/ming-lite-omni-1_5-seg/index.md +++ b/content/blog/ming-lite-omni-1_5-seg/index.md @@ -63,7 +63,7 @@ Given an instruction like “*segment the banana in the upper-right corner*”, The results were painful. -![Struggling with Segmentation](占位符:请在这里替换为您的图示链接) +![Struggling with Segmentation](https://mdn.alipayobjects.com/huamei_wp0xz6/afts/img/A*acrPSp-7qM8AAAAAgCAAAAgAevzJAQ/original) On RefCOCO-val, our cIoU plateaued at **~16%**. @@ -94,7 +94,7 @@ Instead of forcing our model to output abstract black-and-white masks, we **turn - **Old way:** Output a mask ❌ - **New way:** Directly edit the image: “*paint the banana purple*”, “*make the banana red*”, etc. ✅ -![Segmentation as Editing](占位符:请在这里替换为您的图示链接) +![Segmentation as Editing](https://mdn.alipayobjects.com/huamei_wp0xz6/afts/img/A*-_O6RLOxXKcAAAAAgBAAAAgAevzJAQ/original) This brought the task’s data distribution back to the realm of natural images — where generative models shine. @@ -124,10 +124,10 @@ Against Qwen-Image and Nano Banana, our model: - Located small or occluded targets more reliably. - Produced boundaries that were visually and semantically aligned with instructions. -![Segmentation Comparison 1](占位符:请在这里替换为您的图示链接) +![Segmentation Comparison 1](https://mdn.alipayobjects.com/huamei_wp0xz6/afts/img/A*koynTZD5vO8AAAAAgDAAAAgAevzJAQ/original) *Our model (right) accurately locates and segments the target subject. 
Qwen-Image (second from left) fails to locate the correct target, while Nano-banana (third from left) fails to accurately segment the man's head and has loose boundary lines.* -![Segmentation Comparison 2](占位符:请在这里替换为您的图示链接) +![Segmentation Comparison 2](https://mdn.alipayobjects.com/huamei_wp0xz6/afts/img/A*5C7KTbk2WZ0AAAAAgBAAAAgAevzJAQ/original) *For the prompt "please segment the girl with red mask," our model (right) is precise. Qwen-Image (second from left) misses the feet, and Nano-banana (third from left) alters the subject's proportions.* During evaluation, thanks to the high consistency of non-edited regions in our model, we can directly derive the segmentation mask by calculating the difference between the edited result and the original image. The results show that our model's performance on segmentation is now on par with specialized vision models. @@ -150,18 +150,18 @@ The beauty of this method is that it not only fixed the segmentation weakness bu Because the model has learned an unprecedented "respect for boundaries" through thousands of "precise coloring" exercises, this "muscle memory" for fine-grained control has transferred to all editing tasks. Our edit controllability score saw a significant jump from **7.69 to 8.12** across sub-tasks like background, color, and material changes. -![Editing Controllability Comparison](占位符:请在这里替换为您的图示链接) +![Editing Controllability Comparison](https://mdn.alipayobjects.com/huamei_wp0xz6/afts/img/A*7PgiRpiJyScAAAAAgCAAAAgAevzJAQ/original) *Prompt: "remove the bow tie of the man on the far right." Our model (right) precisely removes only the target bow tie while maintaining background consistency. Qwen (second from left) incorrectly removes multiple bow ties and introduces inconsistencies. Nano-banana (third from left) also struggles with consistency.* #### 3. Stronger ID Consistency A core challenge in portrait editing is maintaining identity. Our model excels here as well. Whether changing a hairstyle or adjusting an expression, the model skillfully preserves the person's core features. -![ID Consistency Comparison](占位符:请在这里替换为您的图示链接) +![ID Consistency Comparison](https://mdn.alipayobjects.com/huamei_wp0xz6/afts/img/A*19ULQZrBWIAAAAAAd5AAAAgAevzJAQ/original) *Top Row (Turn head): Our model (right) maintains ID and background consistency, unlike competitors. Middle Row (Smile): Our model (right) correctly follows the prompt while preserving ID, avoiding distortions seen in others. 
Bottom Row (Change background): Our model (right) excels at preserving the subject's ID and appearance during a background swap.* -**See More Editing Consistency in Action:** -![More Consistency Examples](占位符:请在这里替换为您的图示链接) + --- @@ -190,11 +190,11 @@ We suspect 3D understanding, video generation, and other domains have their own **And this is only the overture.** ---- + \ No newline at end of file diff --git a/content/blog/ming-lite-omni-1_5-seg/index.zh.md b/content/blog/ming-lite-omni-1_5-seg/index.zh.md index 79beac4..9249f72 100644 --- a/content/blog/ming-lite-omni-1_5-seg/index.zh.md +++ b/content/blog/ming-lite-omni-1_5-seg/index.zh.md @@ -29,6 +29,8 @@ show_word_count: true 在找到这个方法之前,我们的统一模型在一个关键任务上举步维艰:**生成式分割**。我们希望模型能根据指令(如“分割出右上角那只香蕉”),直接“画”出分割掩码图。 +![图示说明:根据指令进行分割。](https://mdn.alipayobjects.com/huamei_wp0xz6/afts/img/A*acrPSp-7qM8AAAAAgCAAAAgAevzJAQ/original) + 结果是,模型在 RefCOCO-val 上的推理分割指标(cIoU)顽固地停留在 **16%** 上下。 我们分析,根本原因在于**数据分布的巨大鸿沟**。生成模型习惯了处理自然、连续的图像数据。而分割任务的目标(黑白掩码图)是一种极度抽象、非自然的数据分布。强迫一个“画家”去画黑白掩码图,无异于缘木求鱼。 @@ -47,7 +49,7 @@ show_word_count: true 例如,对于“分割右上角的香蕉”这个指令,我们不再要求模型输出掩码,而是要求它直接在原图上执行一个新的指令:“把右上角的香蕉涂成紫色”、“把右上角的香蕉涂成红色”等等。 -![图示说明:左侧为原图香蕉,从生成抽象的黑白掩-码(中),到直接在原图上进行色彩编辑(右三)。这个转换让任务的数据分布回归到了自然图像领域。](https://mdn.alipayobjects.com/huamei_wp0xz6/afts/img/A*acrPSp-7qM8AAAAAgCAAAAgAevzJAQ/original) +![图示说明:左侧为原图香蕉,从生成抽象的黑白掩-码(中),到直接在原图上进行色彩编辑(右三)。这个转换让任务的数据分布回归到了自然图像领域。](https://mdn.alipayobjects.com/huamei_wp0xz6/afts/img/A*-_O6RLOxXKcAAAAAgBAAAAgAevzJAQ/original) 这个看似微小的改动,却是那个我们梦寐以求的“催化剂”。 @@ -68,15 +70,15 @@ show_word_count: true 指标的背后,是肉眼可见的质变。在处理复杂的推理分割任务时,我们的模型展现出超越竞品的准确性和场景理解力。 -![图示说明:我们的模型(右)精准定位并分割了目标主体,Qwen-Image(左二)未能准确定位要分割的目标,Nano-banana(左三)则未能准确分割男士的头部,以及分割的边缘线不够贴合。](占位符:请在这里替换为您的图示链接) +![图示说明:我们的模型(右)精准定位并分割了目标主体,Qwen-Image(左二)未能准确定位要分割的目标,Nano-banana(左三)则未能准确分割男士的头部,以及分割的边缘线不够贴合。](https://mdn.alipayobjects.com/huamei_wp0xz6/afts/img/A*koynTZD5vO8AAAAAgDAAAAgAevzJAQ/original) -![图示说明:这个case的指令“please segment the girl with red mask”, 我们的模型(右)精准定位并分割了目标主体,Qwen-Image(左二)未能分割脚部,Nano-banana(左三)则改变了主体尺寸。](占位符:请在这里替换为您的图示链接) +![图示说明:这个case的指令“please segment the girl with red mask”, 我们的模型(右)精准定位并分割了目标主体,Qwen-Image(左二)未能分割脚部,Nano-banana(左三)则改变了主体尺寸。](https://mdn.alipayobjects.com/huamei_wp0xz6/afts/img/A*5C7KTbk2WZ0AAAAAgBAAAAgAevzJAQ/original) 在“分割女孩”的案例中,Qwen没有包含脚部,而Nano-banana改变了主体尺寸。在“分割拿雨伞的女人”这类需要推理的案例中,我们的模型能准确找到目标,而竞品则出现了主体识别错误或指令理解偏差。这证明,通过“上色”训练,模型的语义理解与视觉定位能力被深度绑定并共同强化了。 在推理分割指标评估过程中,依托于我们模型在非编辑区域的高度一致性,我们直接通过将涂色编辑结果与原图进行差分计算,获得分割掩码,示例如下: -![Ming-Lite-Omni1.5 vs Qwen-Image-Edit 差分对比](占位符:请在这里替换为您的图示链接) + 评估结果显示,我们的模型在分割任务中的表现已达到与专为分割设计的专业模型相当的水平。其中,Qwen-Image-Edit因评估指标明显较低,仅在每个测试子集上随机采样500个样本进行评估。 @@ -97,7 +99,7 @@ show_word_count: true 因为模型在成千上万次“精确上色”的练习中,学会了对边界前所未有的尊重。这种对细粒度控制的“肌肉记忆”迁移到了所有编辑任务中。我们的编辑精度可控性指标,在背景改变、颜色修改和材质修改等子项上,均分从7.69提升到8.12。 -![图示说明:指令为“消除图中最右侧男士的领结”。我们的模型(右)精准地移除了目标领结,同时保持了背景马匹等元素的一致性。Qwen(左二)错误地移除了多个领结,且马匹和老虎出现了不一致。Nano-banana(左三)同样在衣服款式的一致性和老虎斑纹的一致性上表现不佳。](占位符:请在这里替换为您的图示链接) +![图示说明:指令为“消除图中最右侧男士的领结”。我们的模型(右)精准地移除了目标领结,同时保持了背景马匹等元素的一致性。Qwen(左二)错误地移除了多个领结,且马匹和老虎出现了不一致。Nano-banana(左三)同样在衣服款式的一致性和老虎斑纹的一致性上表现不佳。](https://mdn.alipayobjects.com/huamei_wp0xz6/afts/img/A*7PgiRpiJyScAAAAAgCAAAAgAevzJAQ/original) ### 3. 
身份的一致性保持 @@ -118,10 +120,13 @@ show_word_count: true - Nano-banana(中)人物ID保持的不错,但画面结构产生了较大差异。 - 我们的模型(右)在准确地更换背景的同时,很好地保持了ID、外表的一致性。 -![ID一致性对比图](占位符:请在这里替换为您的图示链接) +![ID一致性对比图](https://mdn.alipayobjects.com/huamei_wp0xz6/afts/img/A*19ULQZrBWIAAAAAAd5AAAAgAevzJAQ/original) + + -**更多一致性 Case:** -![更多一致性案例](占位符:请在这里替换为您的图示链接) + --- From 36e2bb1844bee3f787799ffa0e147889a1aa3108 Mon Sep 17 00:00:00 2001 From: "weilong.cwl" Date: Sun, 14 Sep 2025 13:54:12 +0800 Subject: [PATCH 3/4] add video for seg blob --- content/blog/ming-lite-omni-1_5-seg/index.md | 17 ++++++++++------- content/blog/ming-lite-omni-1_5-seg/index.zh.md | 11 +++++------ 2 files changed, 15 insertions(+), 13 deletions(-) diff --git a/content/blog/ming-lite-omni-1_5-seg/index.md b/content/blog/ming-lite-omni-1_5-seg/index.md index 70be6b9..e0ddd95 100644 --- a/content/blog/ming-lite-omni-1_5-seg/index.md +++ b/content/blog/ming-lite-omni-1_5-seg/index.md @@ -63,7 +63,7 @@ Given an instruction like “*segment the banana in the upper-right corner*”, The results were painful. -![Struggling with Segmentation](https://mdn.alipayobjects.com/huamei_wp0xz6/afts/img/A*acrPSp-7qM8AAAAAgCAAAAgAevzJAQ/original) +![Struggling with Segmentation](https://mdn.alipayobjects.com/huamei_wp0xz6/afts/img/A*2BAkRZ9WGTcAAAAAgCAAAAgAevzJAQ/original) On RefCOCO-val, our cIoU plateaued at **~16%**. @@ -124,10 +124,10 @@ Against Qwen-Image and Nano Banana, our model: - Located small or occluded targets more reliably. - Produced boundaries that were visually and semantically aligned with instructions. -![Segmentation Comparison 1](https://mdn.alipayobjects.com/huamei_wp0xz6/afts/img/A*koynTZD5vO8AAAAAgDAAAAgAevzJAQ/original) +![Segmentation Comparison 1](https://mdn.alipayobjects.com/huamei_wp0xz6/afts/img/A*DwJpSZyoW-YAAAAAgJAAAAgAevzJAQ/original) *Our model (right) accurately locates and segments the target subject. Qwen-Image (second from left) fails to locate the correct target, while Nano-banana (third from left) fails to accurately segment the man's head and has loose boundary lines.* -![Segmentation Comparison 2](https://mdn.alipayobjects.com/huamei_wp0xz6/afts/img/A*5C7KTbk2WZ0AAAAAgBAAAAgAevzJAQ/original) +![Segmentation Comparison 2](https://mdn.alipayobjects.com/huamei_wp0xz6/afts/img/A*yL2MR7vLQdEAAAAAgEAAAAgAevzJAQ/original) *For the prompt "please segment the girl with red mask," our model (right) is precise. Qwen-Image (second from left) misses the feet, and Nano-banana (third from left) alters the subject's proportions.* During evaluation, thanks to the high consistency of non-edited regions in our model, we can directly derive the segmentation mask by calculating the difference between the edited result and the original image. The results show that our model's performance on segmentation is now on par with specialized vision models. @@ -150,18 +150,18 @@ The beauty of this method is that it not only fixed the segmentation weakness bu Because the model has learned an unprecedented "respect for boundaries" through thousands of "precise coloring" exercises, this "muscle memory" for fine-grained control has transferred to all editing tasks. Our edit controllability score saw a significant jump from **7.69 to 8.12** across sub-tasks like background, color, and material changes. 
-![Editing Controllability Comparison](https://mdn.alipayobjects.com/huamei_wp0xz6/afts/img/A*7PgiRpiJyScAAAAAgCAAAAgAevzJAQ/original) +![Editing Controllability Comparison](https://mdn.alipayobjects.com/huamei_wp0xz6/afts/img/A*szjcQqQkC80AAAAAgIAAAAgAevzJAQ/original) *Prompt: "remove the bow tie of the man on the far right." Our model (right) precisely removes only the target bow tie while maintaining background consistency. Qwen (second from left) incorrectly removes multiple bow ties and introduces inconsistencies. Nano-banana (third from left) also struggles with consistency.* #### 3. Stronger ID Consistency A core challenge in portrait editing is maintaining identity. Our model excels here as well. Whether changing a hairstyle or adjusting an expression, the model skillfully preserves the person's core features. -![ID Consistency Comparison](https://mdn.alipayobjects.com/huamei_wp0xz6/afts/img/A*19ULQZrBWIAAAAAAd5AAAAgAevzJAQ/original) +![ID Consistency Comparison](https://mdn.alipayobjects.com/huamei_wp0xz6/afts/img/A*Tc2-RoAHys8AAAAAd9AAAAgAevzJAQ/original) *Top Row (Turn head): Our model (right) maintains ID and background consistency, unlike competitors. Middle Row (Smile): Our model (right) correctly follows the prompt while preserving ID, avoiding distortions seen in others. Bottom Row (Change background): Our model (right) excels at preserving the subject's ID and appearance during a background swap.* - +**See More Editing Consistency in Action:** + --- @@ -190,6 +190,9 @@ We suspect 3D understanding, video generation, and other domains have their own **And this is only the overture.** +Try out our open-source model **Ming-lite-omni 1.5** on our [**GitHub Page / Demo Page**](https://github.com/inclusionAI/Ming/blob/main/cookbook.ipynb). Please star our repo if you like it! + + - - +**更多一致性 Case:** + --- @@ -148,4 +145,6 @@ show_word_count: true “分割即编辑”只是第一个成功的尝试。我们相信,在3D理解、视频生成等更广阔的领域,还隐藏着更多这样的“催化剂”等待我们去发现。 -**AI的“左手”与“右手”,终于学会了如何优雅地击掌。而这,仅仅是交响乐的序章。** \ No newline at end of file +**AI的“左手”与“右手”,终于学会了如何优雅地击掌。而这,仅仅是交响乐的序章。** + +欢迎使用开源的 **Ming-lite-omni 1.5** [**GitHub Page / Demo Page**](https://github.com/inclusionAI/Ming/blob/main/cookbook.ipynb)。 From 2a81f9997baf8792f7d3ed3e6199f22918f69995 Mon Sep 17 00:00:00 2001 From: "weilong.cwl" Date: Mon, 15 Sep 2025 11:33:50 +0800 Subject: [PATCH 4/4] update segment difference image --- content/blog/ming-lite-omni-1_5-seg/index.md | 15 +++++++++++---- content/blog/ming-lite-omni-1_5-seg/index.zh.md | 9 ++++++--- 2 files changed, 17 insertions(+), 7 deletions(-) diff --git a/content/blog/ming-lite-omni-1_5-seg/index.md b/content/blog/ming-lite-omni-1_5-seg/index.md index e0ddd95..2cacf92 100644 --- a/content/blog/ming-lite-omni-1_5-seg/index.md +++ b/content/blog/ming-lite-omni-1_5-seg/index.md @@ -130,7 +130,12 @@ Against Qwen-Image and Nano Banana, our model: ![Segmentation Comparison 2](https://mdn.alipayobjects.com/huamei_wp0xz6/afts/img/A*yL2MR7vLQdEAAAAAgEAAAAgAevzJAQ/original) *For the prompt "please segment the girl with red mask," our model (right) is precise. Qwen-Image (second from left) misses the feet, and Nano-banana (third from left) alters the subject's proportions.* -During evaluation, thanks to the high consistency of non-edited regions in our model, we can directly derive the segmentation mask by calculating the difference between the edited result and the original image. The results show that our model's performance on segmentation is now on par with specialized vision models. 
+During evaluation, thanks to the high consistency of non-edited regions in our model, we can directly derive the segmentation mask by calculating the difference between the edited result and the original image. + +![Calculating difference on Ming-Lite-Omni1.5, Qwen-Image-Edit, Nano-banana](https://mdn.alipayobjects.com/huamei_wp0xz6/afts/img/A*UJX1RJJpu3cAAAAASyAAAAgAevzJAQ/original) + + +The results show that our model's performance on segmentation is now on par with specialized vision models. | Model Category | Model Name | RefCOCO (val) | RefCOCO+ (val) | RefCOCOg (val) | | :--- | :--- | :---: | :---: | :---: | @@ -140,9 +145,11 @@ During evaluation, thanks to the high consistency of non-edited regions in our m | | PolyFormer-B | 74.8 | 67.6 | 67.8 | | **MLLM + Specialist (SAM)** | LISA-7B | 74.1 | 62.4 | 66.4 | | | PixelLM-7B | 73.0 | 66.3 | 69.3 | -| **Generative Models** | Qwen-Image-Edit* | 30.3 | 28.8 | 34.0 | -| | **Ming-Lite-Omni1.5 (Ours)** | **72.4** | **62.8** | **64.3** | -*Due to its lower metrics, Qwen-Image-Edit was evaluated on a random sample of 500 images per test subset.* +| **Generative Models** | Nano-banana* | 15.7 | 13.9 | 14.9 | +| | Qwen-Image-Edit* | 30.3 | 28.8 | 34.0 | +| | **Ming-Lite-Omni1.5** | **72.4** | **62.8** | **64.3** | + +*For each test set, Nano-banana and Qwen-Image-Edit was evaluated on a randomly sampled subset of 500 images, to reduce computational cost while preserving the key statistical trends. We observed that Nano-banana frequently fails to accurately grasp the image segmentation intent during inference, leading to its comparatively lower evaluation metrics. This may be attributed to differences in training objectives and data emphasis.* #### 2. Sharper, More Controllable Editing diff --git a/content/blog/ming-lite-omni-1_5-seg/index.zh.md b/content/blog/ming-lite-omni-1_5-seg/index.zh.md index ac4879e..ed19bb5 100644 --- a/content/blog/ming-lite-omni-1_5-seg/index.zh.md +++ b/content/blog/ming-lite-omni-1_5-seg/index.zh.md @@ -78,9 +78,9 @@ show_word_count: true 在推理分割指标评估过程中,依托于我们模型在非编辑区域的高度一致性,我们直接通过将涂色编辑结果与原图进行差分计算,获得分割掩码,示例如下: - +![Ming-Lite-Omni1.5, Qwen-Image-Edit, Nano-banana 差分对比](https://mdn.alipayobjects.com/huamei_wp0xz6/afts/img/A*UJX1RJJpu3cAAAAASyAAAAgAevzJAQ/original) + -评估结果显示,我们的模型在分割任务中的表现已达到与专为分割设计的专业模型相当的水平。其中,Qwen-Image-Edit因评估指标明显较低,仅在每个测试子集上随机采样500个样本进行评估。 | 模型类别 | 模型名称 | RefCOCO (val) | RefCOCO+ (val) | RefCOCOg (val) | | :--- | :--- | :---: | :---: | :---: | @@ -90,9 +90,12 @@ show_word_count: true | | PolyFormer-B | 74.8 | 67.6 | 67.8 | | **MLLM + SAM**
(专用的分割模型) | LISA-7B | 74.1 | 62.4 | 66.4 | | | PixelLM-7B | 73.0 | 66.3 | 69.3 | -| **MLLM + DiT**
(生成式模型做分割) | Qwen-Image-Edit* | 30.3 | 28.8 | 34.0 | +| **MLLM + DiT**
(生成式模型做分割) | Nano-banana* | 15.7 | 13.9 | 14.9 | +| | Qwen-Image-Edit* | 30.3 | 28.8 | 34.0 | | | **Ming-Lite-Omni1.5** | **72.4** | **62.8** | **64.3** | +评估结果表明,我们的模型在分割任务中的表现已接近专为分割设计的专业模型。在评估过程中,Qwen-Image-Edit 和Nano-banana 在每个测试子集上随机采样 500 个样本进行测试,以降低计算开销,同时保证结果的统计趋势稳定。评估过程中我们发现,Nano-banana 在推理中经常无法准确把握图像分割的意图,因此评价指标相对较低,这可能与训练目标和数据侧重差异有关。 + ### 2. 更精准、更可控的编辑能力 这个方法的魅力在于,它不仅治好了分割的“短板”,还反过来极大地增强了模型的通用编辑能力。