22 changes: 20 additions & 2 deletions README.md
@@ -4,6 +4,24 @@ TinyLLM? Yes, the name is a bit of a contradiction, but it means well. It's all

This project helps you build a small locally hosted LLM with a ChatGPT-like web interface using consumer grade hardware. To read more about my research with llama.cpp and LLMs, see [research.md](research.md).

## Table of Contents

- [Key Features](#key-features)
- [Hardware Requirements](#hardware-requirements)
- [Manual Setup](#manual-setup)
- [Run a Local LLM](#run-a-local-llm)
- [Ollama Server (Option 1)](#ollama-server-option-1)
- [vLLM Server (Option 2)](#vllm-server-option-2)
- [Llama-cpp-python Server (Option 3)](#llama-cpp-python-server-option-3)
- [Run a Chatbot](#run-a-chatbot)
- [Example Session](#example-session)
- [Read URLs](#read-urls)
- [Current News](#current-news)
- [Manual Setup](#manual-setup-1)
- [LLM Models](#llm-models)
- [LLM Tools](#llm-tools)
- [References](#references)

## Key Features

* Supports multiple LLMs (see list below)
@@ -69,7 +87,7 @@ If you use the TinyLLM Chatbot (see below) with Ollama, make sure you specify th

### vLLM Server (Option 2)

- vLLM offers a robust OpenAI API compatible web server that supports multiple simultaneous inference threads (sessions). It automatically downloads the models you specifdy from HuggingFace and runs extremely well in containers. vLLM requires GPUs with more VRAM since it uses non-quantized models. AWQ models are also available and more optimizations are underway in the project to reduce the memory footprint. Note, for GPUs with a compute capability of 6 or less, Pascal architecture (see [GPU table](https://github.com/jasonacox/TinyLLM/tree/main/vllm#nvidia-gpu-and-torch-architecture)), follow details [here](./vllm/) instead.
+ vLLM offers a robust OpenAI API compatible web server that supports multiple simultaneous inference threads (sessions). It automatically downloads the models you specify from HuggingFace and runs extremely well in containers. vLLM requires GPUs with more VRAM since it uses non-quantized models. AWQ models are also available and more optimizations are underway in the project to reduce the memory footprint. Note, for GPUs with a compute capability of 6 or less, Pascal architecture (see [GPU table](https://github.com/jasonacox/TinyLLM/tree/main/vllm#nvidia-gpu-and-torch-architecture)), follow details [here](./vllm/) instead.
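
Once the container below is built and running, any OpenAI-compatible client can talk to the server. As a minimal sketch, a quick smoke test with curl could look like the following, assuming the API is exposed on localhost port 8000 (vLLM's default) and that the server is hosting a placeholder model ID; substitute your actual host, port, and model name.

```bash
# Sketch of a quick check against the OpenAI-compatible endpoint.
# Assumes localhost:8000 and a placeholder model ID -- adjust to your deployment.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistralai/Mistral-7B-Instruct-v0.1",
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
        "max_tokens": 64
      }'
```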

```bash
# Build Container
@@ -215,7 +233,7 @@ Here are some suggested models that work well with vLLM.
| Yi-1.5 9B | None | [01-ai/Yi-1.5-9B-Chat-16K](https://huggingface.co/01-ai/Yi-1.5-9B-Chat-16K) | 16k | Apache 2 |
| Phi-3 Small 7B | None | [microsoft/Phi-3-small-8k-instruct](https://huggingface.co/microsoft/Phi-3-small-8k-instruct) | 8k | MIT |
| Phi-3 Medium 14B | None | [microsoft/Phi-3-medium-4k-instruct](https://huggingface.co/microsoft/Phi-3-medium-4k-instruct) | 4k | MIT |
- | Phi-3.5 Vision 4B | None | [microsoft/Phi-3.5-vision-instruct](https://huggingface.co/micrmicrosoft/Phi-3.5-vision-instruct) | 128k | MIT |
+ | Phi-3.5 Vision 4B | None | [microsoft/Phi-3.5-vision-instruct](https://huggingface.co/microsoft/Phi-3.5-vision-instruct) | 128k | MIT |
| Phi-4 14B | None | [microsoft/phi-4](https://huggingface.co/microsoft/phi-4) | 16k | MIT |
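
The HuggingFace IDs in the table above are what you pass to vLLM as the model name. As a rough sketch only (this project normally launches vLLM through its container scripts; the command below assumes a bare-metal vLLM install and enough VRAM for the chosen model), serving one of the listed models could look like this:

```bash
# Illustrative only -- the repo's container scripts are the supported path.
# Assumes vLLM is installed locally and the GPU can hold microsoft/phi-4.
python -m vllm.entrypoints.openai.api_server \
    --model microsoft/phi-4 \
    --max-model-len 16384 \
    --port 8000
```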

## LLM Tools