
[Android] initial UpdateLogitsOrProbOnCPUSync freezes OS #1401

@tobrun

Description


🐛 Bug

👋 I've been digging further into the freezing issue noted in #1379; I'm opening a new issue with better details and reproduction steps.

When I load a llama-2-7b model (any of the default configured quantizations) and prompt the model with a large input:

  • Android OS freezes and becomes unresponsive
  • System application crashes and restarts itself
  • The output of the LLM starts appearing in the MLC-Chat app after the OS becomes responsive again.

Additional information: the prompt is ~2700 characters, which gets truncated (The prompt tokens are more than max_window_size, the input will be truncated.). The truncation is not the root cause, but the large prompt makes the freeze easier to reproduce.


When timing the execution of all the different functions in llm_chat.cc, I can see that the underlying issue comes from:

SampleTokenFromLogits executed in 65189.915079 ms
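For reference, the timing above came from simple wall-clock instrumentation around each function. A minimal sketch of that approach (the `TimeMs` helper is illustrative, not the actual instrumentation code in llm_chat.cc):

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <iostream>
#include <thread>

// Hypothetical helper: run `fn` once and return its wall-clock
// duration in milliseconds.
double TimeMs(const std::function<void()>& fn) {
  auto start = std::chrono::steady_clock::now();
  fn();
  auto end = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(end - start).count();
}
```

Wrapping a call such as `SampleTokenFromLogits` in `TimeMs` and printing the result is how a per-function breakdown like the one above can be produced.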

Digging deeper into where it's being held up, the time is spent in:

  void UpdateLogitsOrProbOnCPUSync(NDArray logits_or_prob) {    
    if (!logits_on_cpu_.defined()) {
      logits_on_cpu_ = logits_or_prob.CopyTo(DLDevice{kDLCPU, 0});
    } else {
      ICHECK_EQ(logits_on_cpu_->shape[0], logits_or_prob->shape[0])
          << "Expect size of logits remain unchanged";
      logits_on_cpu_.CopyFrom(logits_or_prob);
    }
    TVMSynchronize(device_.device_type, device_.device_id, nullptr);
  }

When logits_on_cpu_ is not yet defined, the following line is the last one hit before the freeze:

logits_on_cpu_ = logits_or_prob.CopyTo(DLDevice{kDLCPU, 0});

I will continue to look into what the above line does, but any 👀 are highly appreciated!
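One plausible explanation (an assumption on my part, not verified against the TVM runtime): device queues are asynchronous, so earlier kernel launches return immediately, and the first operation that has to block, here the device-to-host copy, pays for all of the queued work at once. The analogy below uses std::async instead of the TVM runtime to illustrate how a "launch, then block" pattern concentrates all the waiting at a single call site:

```cpp
#include <chrono>
#include <future>
#include <thread>

// Simulated async kernel launch: returns immediately, the actual work
// (here: a sleep standing in for GPU compute) runs in the background.
std::future<void> LaunchKernel(std::chrono::milliseconds work_time) {
  return std::async(std::launch::async,
                    [work_time] { std::this_thread::sleep_for(work_time); });
}

// Simulated blocking device-to-host copy: it cannot complete until the
// pending device work has finished, so it inherits that work's cost.
void BlockingCopyToHost(std::future<void>& pending_work) {
  pending_work.wait();
}
```

If this model is right, the copy line itself is cheap; the 65 seconds would be the cost of everything queued before it becoming due at that point.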
