Gonka's support of new modalities besides text #944
Replies: 2 comments 6 replies
Hello! I'm exploring the possibility of supporting the Wan2.2-T2V-A14B text-to-video model on the Gonka network, and I'd like to discuss a few obstacles that came up.

Why this model in particular: it's arguably the best open-source text-to-video model to date with a permissive license (Apache 2.0). LTX-2.3 is comparable in terms of performance, but has a much more restrictive custom license, which might present unnecessary additional hurdles.

Inference validation

Current SOTA video generation models are predominantly based on diffusion transformers (DiTs). Unlike the autoregressive transformers used in LLMs, DiTs don't build the final result token by token. Instead, they start with random noise and repeatedly de-noise it until it resembles the final result. While autoregressive transformers predict the next token, DiTs predict the difference/distance between the noisy input and the slightly less noisy output.

Currently, to validate inference in Gonka, executors store logprobs for each generated token, and validators then "replay" the inference and compare the logprobs. This wouldn't work very well for DiTs because the predictions are much heavier (we can't just sample the DiT's prediction). More importantly, it doesn't even make sense to compare these predictions, because in SOTA models the output of the DiT is not the final result. Specifically, in Wan2.2 the DiT operates on a latent (compressed) representation of the video, and to get the actual video frames we need to pass that latent representation through a VAE decoder (which is essentially a convolutional neural network). Additionally, producing the final video file requires some post-processing and encoding the frames into a video container.

To validate inference honestly and protect against tampering, we need to compare the final result (the actual video file whose hash we store on-chain). Unfortunately, given the above, this is not straightforward, but here's what I propose:
I'm not sure yet what the best way to answer this question is, but here are some ideas:
It's also worth mentioning that if the TEE proposal is implemented, another option would be to limit text-to-video models to trusted environments.

Proof-of-Compute

PoC should be specific to the model, because the idea is to prove the computational capacity to run that model. Thanks to multi-model PoC, it's now possible to have different models with their own PoCs, but the problem is that PoC is still based on the LLM architecture and is tightly integrated into vLLM (which totally makes sense, by the way). First of all, we would need to adapt the transformer-based PoC to DiTs (I suspect it'd be very similar to how it's currently done in Gonka's vLLM fork, but this needs to be checked). However, as we saw above, video generation models have a more complex pipeline, so simply modeling PoC on the DiT part of the video model architecture may not be enough. Unlike LLMs, which basically have a single computationally significant step (repeated forward passes through the transformer-based network), video generation models also have additional encoders (in the case of Wan2.2, a small text encoder for the text prompt and the VAE) and post-processing steps. According to the Wan2.1 paper (see

Another issue, of course, is that ML nodes run on vLLM, which simply doesn't support other modalities. The best option seems to be the vLLM-Omni project, which is built on top of vLLM and supports Wan2.2. But it still looks like a huge undertaking, and I can't realistically estimate how much effort it would take to migrate ML nodes to it.

Pricing policy

When it comes to LLMs, pricing is straightforward: we simply charge per token. This works well because the models are autoregressive and, thanks to KV caching, the computational effort grows more or less linearly with the sequence length. With video generation, we don't have such a metric.
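To illustrate why a per-token-style metric does not carry over directly, here is a rough sketch of how a DiT's compute might scale with the request parameters. It assumes the DiT runs full self-attention over all latent tokens, so each de-noising step has one term that is linear in the token count and one that is quadratic. The stride values and the `linear_weight` / `attn_weight` constants are illustrative placeholders, not measured numbers for Wan2.2 and not anything Gonka currently uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoRequest:
    width: int
    height: int
    frames: int
    steps: int  # number of de-noising steps


def latent_tokens(req, spatial_stride=16, temporal_stride=4):
    """Approximate number of tokens the DiT attends over.

    spatial_stride and temporal_stride are hypothetical combined
    VAE-compression and patchify factors, chosen only for illustration.
    """
    t = (req.frames - 1) // temporal_stride + 1
    h = req.height // spatial_stride
    w = req.width // spatial_stride
    return t * h * w


def relative_cost(req, linear_weight=1.0, attn_weight=1e-4):
    """Unitless cost: (linear term + quadratic attention term) per step, times steps.

    The weights are placeholders; real values would have to be benchmarked.
    """
    n = latent_tokens(req)
    per_step = linear_weight * n + attn_weight * n * n
    return req.steps * per_step


if __name__ == "__main__":
    base = VideoRequest(width=832, height=480, frames=81, steps=40)
    hires = VideoRequest(width=1280, height=720, frames=81, steps=40)
    # The prompt never appears in the formula: cost depends only on
    # resolution, frame count, and step count.
    print(latent_tokens(base), relative_cost(base))
    print(latent_tokens(hires), relative_cost(hires))
```

Under these assumptions the cost is proportional to the step count, but it grows faster than the token count when resolution or frame count increases, so a flat "price per frame" or "price per pixel" would undercharge larger requests.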
The good news is that the final price of a single inference should be pretty much the same whenever the requested resolution, frame count, and number of de-noising steps are the same (no matter the prompt). But unlike with autoregressive models, the relationship between these parameters and the required computational effort is not linear.

TL;DR:
I'd very much like to hear your opinions on this.
Our team recently conducted research into the feasibility of inference and validation for image2text models; the results can be found here. At the current stage, we have implemented a baseline validation approach similar to the inference validation method used in Gonka. The next logical step would be to adapt and extend this approach specifically for multimodal image-based workflows.

As a continuation of this research, our team could explore inference and validation strategies for the Qwen 3 family of speech models, including the TTS model from the Qwen3-TTS Collection and the ASR model from the Qwen3-ASR Collection.

For ASR models, the existing validation strategy based on Top-N log-probability comparison could likely be reused with minimal adaptation, since the output remains token-based and deterministic enough for confidence estimation and reproducibility checks.

TTS validation, however, would require separate research because the output is continuous audio rather than discrete tokens. Several validation directions could be investigated:

Spectrogram-based validation:
Round-trip validation: (additional inference step is needed, more expensive)
Audio quality metrics:
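As a rough illustration of the round-trip direction, one possible reading is: transcribe the generated audio with an ASR model, then compare the transcript with the original input text using word error rate (WER). The sketch below covers only the comparison step. The transcript is passed in as a plain string, the function names are made up for this example, and the `max_wer` threshold is a hypothetical value that would need tuning. None of this is taken from an existing Gonka implementation.

```python
def _normalize(text):
    """Lowercase and replace punctuation with spaces so formatting differences don't count."""
    kept = "".join(c.lower() if c.isalnum() or c.isspace() else " " for c in text)
    return kept.split()


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = _normalize(reference), _normalize(hypothesis)
    if not ref:
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of an extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def round_trip_ok(input_text, asr_transcript, max_wer=0.1):
    """Accept the TTS output if the ASR transcript stays close to the input text.

    max_wer is a hypothetical threshold, not a recommended value.
    """
    return word_error_rate(input_text, asr_transcript) <= max_wer
```

One limitation to keep in mind: a low WER only indicates that the audio is intelligible and matches the text. On its own it would not show that the claimed TTS model produced it, so this check would probably need to be combined with one of the other directions.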
I'd like to start a discussion on how we develop Gonka to support new modalities on the network, such as speech-to-text, image understanding, video generation, etc.
The core tasks for any new modality are:
I'd suggest using this discussion as an open chat on the topic, a place for ideas, and an aggregator of proposals.