Run an LLM locally with llamafile
-
Download llava-v1.5-7b-q4.llamafile (4.29 GB)
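Download links for this and other llamafiles are listed in the llamafile project README (https://github.com/Mozilla-Ocho/llamafile). A typical fetch looks like the following, with the URL copied from that page (the URL below is only a placeholder):
wget -O llava-v1.5-7b-q4.llamafile <download-url-from-llamafile-README>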
-
Grant permission for your computer to execute this new file (if you're on Windows, skip this step and instead rename the file by adding ".exe" to the end):
chmod +x llava-v1.5-7b-q4.llamafile
-
Run the llamafile (a chat UI is served in your browser at http://localhost:8080/):
./llava-v1.5-7b-q4.llamafile
./llava-v1.5-7b-q4.llamafile --server --nobrowser (server mode, no browser tab)
CUDA_VISIBLE_DEVICES=0 ./Meta-Llama-3-8B-Instruct.Q4_1.llamafile --gpu nvidia (run on a specific NVIDIA GPU)
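To check from a second shell that the server is actually listening (the root path serves the built-in web UI, so any successful response means it is up; adjust the port if you changed it):
curl -sf http://localhost:8080/ > /dev/null && echo "server is up"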
-
Kill process
sudo kill <PID>
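If you don't know the PID, either of these standard tools will show it (the lsof variant assumes the default port 8080):
pgrep -f llamafile
sudo lsof -i :8080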
-
Curl API Client Example:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
    "model": "LLaMA_CPP",
    "messages": [
      {
        "role": "system",
        "content": "You are LLAMAfile, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
      },
      {
        "role": "user",
        "content": "Write a limerick about python exceptions"
      }
    ]
  }' | python3 -c '
import json
import sys
json.dump(json.load(sys.stdin), sys.stdout, indent=2)
print()
'
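The same endpoint should also accept the standard OpenAI "stream": true flag, in which case the reply arrives as server-sent-event chunks instead of a single JSON object (a sketch; -N tells curl not to buffer the output):
curl -N http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
    "model": "LLaMA_CPP",
    "stream": true,
    "messages": [
      { "role": "user", "content": "Write a limerick about python exceptions" }
    ]
  }'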
-
Python API Client example:
#!/usr/bin/env python3
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # "http://<Your api-server IP>:port"
    api_key="sk-no-key-required"          # the local server ignores the key by default
)
completion = client.chat.completions.create(
    model="LLaMA_CPP",
    messages=[
        {"role": "system", "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."},
        {"role": "user", "content": "Write a limerick about python exceptions"}
    ]
)
# prints the whole message object; use .message.content for just the reply text
print(completion.choices[0].message)
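The client above needs the openai Python package and a running llamafile server; a minimal way to try it (the script name llamafile_client.py is just an example):
pip install openai
python3 llamafile_client.py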