
</div>

## Instantly Run Colossal-AI on Enterprise-Grade GPUs

Skip the setup. Access a powerful, pre-configured Colossal-AI environment on [**HPC-AI Cloud**](https://hpc-ai.com/?utm_source=github&utm_medium=social&utm_campaign=promotion-colossalai).

Train your models and scale your AI workloads in one click!

* **NVIDIA Blackwell B200s**: Experience the next generation of AI performance ([See Benchmarks](https://hpc-ai.com/blog/b200)). Now available in the cloud from **$2.47/hr**.
* **Cost-Effective H200 Cluster**: Get premier performance with on-demand rental from just **$1.99/hr**.

[**Get Started Now & Claim Your Free Credits →**](https://hpc-ai.com/?utm_source=github&utm_medium=social&utm_campaign=promotion-colossalai)

<div align="center">
<a href="https://hpc-ai.com/?utm_source=github&utm_medium=social&utm_campaign=promotion-colossalai">
<img src="https://github.com/hpcaitech/public_assets/blob/main/colossalai/img/2-2.gif" width="850" />
<img src="https://github.com/hpcaitech/public_assets/blob/main/colossalai/img/2-3.png" width="850" />
</a>
</div>

### Colossal-AI Benchmark

To see how these performance gains translate to real-world applications, we benchmarked large language model training with Colossal-AI on Llama-like models, running the 7B model on 8 GPUs and the 70B model on 16 GPUs.

| GPU | # GPUs | Model Size | Parallelism | Batch Size per DP | Seq Len | Throughput (samples/s) | TFLOPS/GPU | Peak Mem (MiB) |
| :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: | :--: |
| H200 | 8 | 7B | zero2(dp8) | 36 | 4096 | 17.13 | 534.18 | 119040.02 |
| H200 | 16 | 70B | zero2 | 48 | 4096 | 3.27 | 469.10 | 150032.23 |
| B200 | 8 | 7B | zero1(dp2)+tp2+pp4 | 128 | 4096 | 25.83 | 805.69 | 100119.77 |
| B200 | 16 | 70B | zero1(dp2)+tp2+pp4 | 128 | 4096 | 5.66 | 811.79 | 100072.02 |

These Colossal-AI benchmark results give the most practical view of real-world training performance. For the 7B model on 8 GPUs, the **B200 achieved roughly 50% higher throughput** (25.83 vs. 17.13 samples/s) along with a significant increase in TFLOPS per GPU. For the 70B model on 16 GPUs, the B200 again demonstrated a clear advantage, with **over 70% higher throughput and TFLOPS per GPU** (5.66 vs. 3.27 samples/s; 811.79 vs. 469.10 TFLOPS). These gains translate directly into shorter training times for large-scale models.
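The parallelism column uses Colossal-AI's shorthand: `zero1(dp2)+tp2+pp4`, for example, denotes ZeRO stage 1 applied across a 2-way data-parallel group, combined with 2-way tensor parallelism and 4-way pipeline parallelism. Below is a minimal sketch of how such a layout could be configured with Colossal-AI's booster plugins; it is not the benchmark script, and the model, microbatch count, and optimizer settings are illustrative placeholders.

```python
# Illustrative sketch only (not the benchmark script): expressing the
# "zero1(dp2)+tp2+pp4" layout from the table with Colossal-AI's booster API.
import colossalai
import torch
from colossalai.booster import Booster
from colossalai.booster.plugin import HybridParallelPlugin
from transformers import LlamaConfig, LlamaForCausalLM

colossalai.launch_from_torch()  # assumes the script is started with torchrun across 16 GPUs

plugin = HybridParallelPlugin(
    tp_size=2,            # 2-way tensor parallelism
    pp_size=4,            # 4-way pipeline parallelism
    zero_stage=1,         # ZeRO-1 over the remaining data-parallel dimension (dp2)
    num_microbatches=4,   # placeholder; a microbatching setting is needed when pp_size > 1
    precision="bf16",
)
booster = Booster(plugin=plugin)

model = LlamaForCausalLM(LlamaConfig())  # stand-in for the Llama-like benchmark model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model, optimizer, *_ = booster.boost(model, optimizer)

# For the pure data-parallel "zero2(dp8)" rows, LowLevelZeroPlugin(stage=2)
# would play the same role as HybridParallelPlugin here.
```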

## Latest News
* [2025/02] [DeepSeek 671B Fine-Tuning Guide Revealed—Unlock the Upgraded DeepSeek Suite with One Click, AI Players Ecstatic!](https://company.hpc-ai.com/blog/shocking-release-deepseek-671b-fine-tuning-guide-revealed-unlock-the-upgraded-deepseek-suite-with-one-click-ai-players-ecstatic)