Hi MLCommons Tiny folks,
I wanted to share a small but unusual MCU language-runtime experiment and ask whether systems like this suggest a benchmark gap in the current Tiny landscape.
We built a public demo line called Engram and deployed it on a commodity ESP32-C3.
Current public numbers:
- Host-side benchmark capability
  - LogiQA = 0.392523
  - IFEval = 0.780037
- Published board proof
  - LogiQA 642: 249 / 642 = 0.3878504672897196
  - host_full_match = 642 / 642
  - runtime artifact size = 1,380,771 bytes
Important scope note:
This is not presented as unrestricted, open-input, native LLM generation on an MCU.
The board-side path is closer to a flash-resident, table-driven runtime with:
- packed token weights
- hashed lookup structures
- fixed compiled probe batches
- streaming fold / checksum style execution over precompiled structures
So this is not a standard vision/KWS/anomaly micro model. It is closer to a task-specialized language runtime whose behavior has been pushed into a very compact executable form.
Repo:
https://github.com/Alpha-Guardian/Engram
What I’m genuinely curious about is whether systems like this point to a missing benchmark category in the TinyML / MCU benchmark ecosystem.
Would something like the following make sense as a future benchmark direction?
- constrained language-task execution
- auditable board-measured language behavior
- fixed-memory / fixed-artifact board deployment
- explicit separation between host benchmark capability and board execution mode
If people here think this is out of scope for MLCommons Tiny, that would also be useful to know.