YouTube Full Story - "ThumbeLLama - pocket-size LLM": a web application to test TinyLlama
TinyLlama is an ambitious initiative aimed at pretraining a 1.1B-parameter language model on a dataset of 3 trillion tokens. What sets this project apart is not just the size of the data but the efficiency and speed of its processing. Using 16 A100-40G GPUs, the training of TinyLlama started in September 2023 and is planned to span just 90 days.
For convenience, I've prepared a Docker image containing all the necessary tools, including CUDA, mlc-llm, and Emscripten, which are crucial for preparing the model for WebAssembly.
Dockerfile:
# Stage 1: fetch the mlc-llm sources (with submodules)
FROM alpine/git:2.36.2 AS download
RUN git clone https://github.com/mlc-ai/mlc-llm.git --recursive /mlc-llm

# Stage 2: CUDA base image with the full build toolchain
FROM nvidia/cuda:12.2.2-cudnn8-devel-ubuntu22.04
RUN apt update && \
    apt install -yq curl git cmake ack tmux \
    python3-dev vim python3-venv python3-pip \
    protobuf-compiler build-essential
RUN python3 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
RUN python3 -m pip install --pre -U -f https://mlc.ai/wheels mlc-chat-nightly-cu122 mlc-ai-nightly-cu122
RUN apt install -yq gcc
COPY --from=download /mlc-llm /opt/mlc-llm
RUN cd /opt/mlc-llm && pip3 install .
RUN apt-get install git-lfs -yq
ENV TVM_HOME="/opt/venv/lib/python3.10/site-packages/tvm/"
# Emscripten toolchain, needed to compile the runtime to WebAssembly
RUN git clone https://github.com/emscripten-core/emsdk.git --branch 3.1.50 --single-branch /opt/emsdk
RUN cd /opt/emsdk && ./emsdk install latest
ENV PATH="/opt/emsdk:/opt/emsdk/upstream/emscripten:/opt/emsdk/node/16.20.0_64bit/bin:/opt/venv/bin:$PATH"
RUN cd /opt/emsdk/ && ./emsdk activate latest
# Switch TVM_HOME to the mlc-llm submodule and build the TVM web runtime
ENV TVM_HOME=/opt/mlc-llm/3rdparty/tvm
RUN cd /opt/mlc-llm/3rdparty/tvm \
    && git checkout 5828f1e9e \
    && git submodule init \
    && git submodule update --recursive \
    && make webclean \
    && make web
# Quote the version specifier so the shell does not treat ">=" as a redirection
RUN python3 -m pip install "auto_gptq>=0.2.0" transformers
CMD /bin/bash
To build the Docker image, run:
docker build -t onceuponai/mlc-llm .
To run the container:
docker run --rm -it --name mlc-llm -v $(pwd)/data:/data --gpus all onceuponai/mlc-llm
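Once inside the container, a quick sanity check of the toolchain can save debugging time later. These checks are my own suggestion, not part of the original setup:
# verify the Emscripten compiler, the CUDA toolkit, and the TVM Python package
emcc --version
nvcc --version
python3 -c "import tvm; print(tvm.__file__)"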
To convert the model with mlc-llm, run:
python3 -m mlc_llm.build --hf-path TinyLlama/TinyLlama-1.1B-Chat-v0.6 --target webgpu --quantization q4f32_0 --use-safetensors
where (see the documentation):
hf-path - the Hugging Face model name, in this case TinyLlama/TinyLlama-1.1B-Chat-v0.6
target - the platform for which we prepare the model. Available options:
- auto (will detect from cuda, metal, vulkan and opencl)
- metal (for M1/M2)
- metal_x86_64 (for Intel CPU)
- iphone
- vulkan
- cuda
- webgpu
- android
- opencl
quantization - the quantization mode, named with the pattern qAfB(_0), where A is the number of bits for weights and B the number of bits for activations. Available options: autogptq_llama_q4f16_0, autogptq_llama_q4f16_1, q0f16, q0f32, q3f16_0, q3f16_1, q4f16_0, q4f16_1, q4f16_2, q4f16_ft, q4f32_0, q4f32_1, q8f16_ft, q8f16_1. For example, q4f32_0 uses 4-bit weights with 32-bit float activations.
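To illustrate how these flags combine, the same model could be built for Apple Silicon instead; this variant is shown only as an example of the documented options, and I have not tested it:
python3 -m mlc_llm.build --hf-path TinyLlama/TinyLlama-1.1B-Chat-v0.6 --target metal --quantization q4f16_1 --use-safetensors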
In our case we will use the webgpu target and q4f32_0 quantization to obtain the wasm file and the converted model. I have shared several converted models on HuggingFace and GitHub.
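To give a feel for how the artifacts are consumed, here is a sketch of copying them into a web app and serving it locally. The dist/ layout below assumes mlc_llm.build's default output naming, and the webapp/ paths are hypothetical:
# the build writes its artifacts under dist/<model>-<quantization>/ (assumed default layout)
ls dist/TinyLlama-1.1B-Chat-v0.6-q4f32_0/
# copy the WebGPU wasm and the quantized weights into the web app's static directory (hypothetical paths)
cp dist/TinyLlama-1.1B-Chat-v0.6-q4f32_0/*-webgpu.wasm webapp/dist/
cp -r dist/TinyLlama-1.1B-Chat-v0.6-q4f32_0/params webapp/dist/
# serve locally; WebGPU requires a recent Chromium-based browser
python3 -m http.server --directory webapp 8000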
Read more on: https://blog.qooba.net/2023/12/13/tinyllama-compact-llm-with-webassembly/