🐛 Bug
👋 I've been diving deeper into the freezing issue noted in #1379 and am creating a new issue with better details and steps to reproduce.
When I load a llama-2-7b model (any of the default configured quantizations) and prompt the model with a large input:
- Android OS freezes and becomes unresponsive
- System application crashes and restarts itself
- The output of the LLM starts appearing in the MLC-Chat app after the OS becomes responsive again.
Additional information: the prompt is ~2700 characters, which gets truncated (`The prompt tokens are more than max_window_size, the input will be truncated.`). The truncation is not the root cause, but the large prompt makes the freeze easier to reproduce.
When timing the execution of the different functions in llm_chat.cc, I can see that the underlying issue comes from one call:

```
SampleTokenFromLogits executed in 65189.915079 ms
```
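For context, here is a minimal sketch of the kind of scope-based stopwatch that produces timings like the one above (hypothetical, not the exact instrumentation used):

```cpp
// Hypothetical instrumentation sketch: a scope-based stopwatch that prints
// the wall-clock time of the enclosing scope when it is destroyed.
#include <chrono>
#include <iostream>

struct ScopedTimer {
  explicit ScopedTimer(const char* name)
      : name_(name), start_(std::chrono::steady_clock::now()) {}
  ~ScopedTimer() {
    auto end = std::chrono::steady_clock::now();
    double ms = std::chrono::duration<double, std::milli>(end - start_).count();
    std::cout << name_ << " executed in " << ms << " ms\n";
  }
  const char* name_;
  std::chrono::steady_clock::time_point start_;
};

int main() {
  // Placed at the top of a function such as SampleTokenFromLogits, this
  // prints the elapsed time when the function returns.
  ScopedTimer timer("SampleTokenFromLogits");
  // ... body of the function being timed ...
  return 0;
}
```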
Digging deeper into where it is being held up, the time is spent in:
```cpp
void UpdateLogitsOrProbOnCPUSync(NDArray logits_or_prob) {
  if (!logits_on_cpu_.defined()) {
    logits_on_cpu_ = logits_or_prob.CopyTo(DLDevice{kDLCPU, 0});
  } else {
    ICHECK_EQ(logits_on_cpu_->shape[0], logits_or_prob->shape[0])
        << "Expect size of logits remain unchanged";
    logits_on_cpu_.CopyFrom(logits_or_prob);
  }
  TVMSynchronize(device_.device_type, device_.device_id, nullptr);
}
```

When the logits are not yet on the CPU, the following line is the last one reached before everything halts:
```cpp
logits_on_cpu_ = logits_or_prob.CopyTo(DLDevice{kDLCPU, 0});
```
Will continue to look more into what the above line does, but any 👀 are highly appreciated!
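In case it helps with debugging: one hypothetical diagnostic (not part of llm_chat.cc) would be to time the device-to-host copy and the explicit synchronize separately. Since GPU kernel launches are typically asynchronous, the copy may simply be the first call that has to wait for all previously queued GPU work (e.g. the whole prefill) to finish, which would explain why this particular line appears to halt.

```cpp
// Hypothetical diagnostic: time the copy and the synchronize separately to
// see which call actually blocks. TimedCopyToCPU is not an existing function.
#include <chrono>
#include <iostream>
#include <tvm/runtime/c_runtime_api.h>
#include <tvm/runtime/ndarray.h>

using tvm::runtime::NDArray;

void TimedCopyToCPU(NDArray logits_or_prob, DLDevice device) {
  auto now = [] { return std::chrono::steady_clock::now(); };
  auto ms = [](auto a, auto b) {
    return std::chrono::duration<double, std::milli>(b - a).count();
  };

  auto t0 = now();
  // Same call as in UpdateLogitsOrProbOnCPUSync.
  NDArray logits_on_cpu = logits_or_prob.CopyTo(DLDevice{kDLCPU, 0});
  auto t1 = now();
  TVMSynchronize(device.device_type, device.device_id, nullptr);
  auto t2 = now();

  std::cout << "CopyTo:         " << ms(t0, t1) << " ms\n"
            << "TVMSynchronize: " << ms(t1, t2) << " ms\n";
}
```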