<a href="https://colab.research.google.com/github/gregmeldrum/localLLM/blob/main/free-tier-replacement-for-openai/GPT3_5_Replacement_Server_using_Intel_NeuralChat_7B_v3_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# @title
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

## Create a drop-in replacement for GPT-3.5

This Colab starts an instance of the [Intel Neural Chat 7B v3.1](https://huggingface.co/Intel/neural-chat-7b-v3-1) model and makes it publicly available through ngrok. You can call the model with the OpenAI API by setting `openai.base_url` to your ngrok public URL, which appears at the end of the output for step 2. If you don't already have an ngrok account, you can sign up at [ngrok](https://ngrok.com/) for free.

The first inference takes about a minute because the model has to be loaded onto the GPU. After that, each inference takes a few seconds.

At the time of writing this Colab, the Neural Chat 7B v3.1 model tops the Open LLM Leaderboard for its model size.

This Colab uses [ollama](https://ollama.ai/) to run the model, [litellm](https://litellm.ai/) to wrap it in an OpenAI-compatible API, and [ngrok](https://ngrok.com/) to give it a public URL.
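
Once the server is running, any OpenAI-style client can talk to it. The snippet below is a minimal sketch of a chat request sent with only the Python standard library. It assumes the proxy accepts OpenAI-style requests at `/chat/completions` under the ngrok URL and that the model is addressed as `ollama/neuralchat` (the name passed to litellm in step 2). `NGROK_URL`, `build_chat_request` and `ask` are hypothetical names used for illustration only.

```python
import json
import urllib.request

# Hypothetical placeholder: paste the public URL that ngrok prints in step 2.
NGROK_URL = "https://YOUR-SUBDOMAIN.ngrok-free.app"


def build_chat_request(base_url, prompt, model="ollama/neuralchat"):
    """Build an OpenAI-style chat completion request for the proxy.

    The model name is an assumption based on the litellm command in step 2.
    """
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(base_url, prompt, timeout=120):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # OpenAI-style responses carry the reply in choices[0].message.content
    return body["choices"][0]["message"]["content"]


# Example (needs the server from step 2 to be running):
# print(ask(NGROK_URL, "Explain quantization in one sentence."))
```

The generous default timeout leaves room for the first request, which has to wait for the model to load. The `openai` Python package should behave the same way once its base URL points at the ngrok address.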

In [None]:
#@title <b>Step 1 - Enter your ngrok authtoken from https://dashboard.ngrok.com/get-started/your-authtoken</b> and <b>run this cell</b>
authtoken = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX' # @param {type:"string"}

## **Step 2: Run the model**
Look for the ngrok URL at the end of the output.
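
The cell below starts `ollama serve` in the background and then immediately talks to it. If a later command fails because the server is not up yet, a small polling helper can bridge the gap. This is a sketch only: `wait_for_server` is a hypothetical helper, and the default URL assumes ollama is listening on its default port, 11434.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url="http://127.0.0.1:11434", timeout=60.0, interval=1.0):
    """Poll `url` until it answers an HTTP request or `timeout` seconds pass.

    Returns True as soon as the server responds, False if the deadline expires.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval):
                return True
        except urllib.error.HTTPError:
            # The server answered, even if with an error status code
            return True
        except (urllib.error.URLError, OSError):
            # Not reachable yet: wait a little and try again
            time.sleep(interval)
    return False
```

Calling `wait_for_server()` right after the `ollama serve` line, and continuing only when it returns `True`, would make the cell less sensitive to startup timing.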

In [None]:
# Set LD_LIBRARY_PATH so the system NVIDIA library is preferred
# over the built-in library. This is particularly important on
# Google Colab, which installs older drivers.
import os
os.environ.update({'LD_LIBRARY_PATH': '/usr/lib64-nvidia'})

# Expose the ngrok authtoken from step 1 to the shell commands below
os.environ["AUTHTOKEN"] = authtoken

!pip install pyngrok==7.0.1 fastapi==0.104.1 uvicorn==0.24.0.post1

# Install ollama (-fsSL: fail on HTTP errors, stay quiet, follow redirects)
!curl -fsSL https://ollama.ai/install.sh | sh

# Start the ollama server (in the background)
!nohup bash -c "ollama serve" &

# Download the model from Hugging Face (6-bit quantized version; by default ollama uses 4-bit)
# The URL is quoted so the shell does not treat the "?" as a glob character
!wget "https://huggingface.co/TheBloke/neural-chat-7B-v3-1-GGUF/resolve/main/neural-chat-7b-v3-1.Q6_K.gguf?download=true" -O /content/neural-chat-7b-v3-1.Q6_K.gguf

# Create a model definition file where we tell Ollama the prompt template, temperature,
# system prompt and anything else we need for our use case
# (https://github.com/jmorganca/ollama/blob/main/docs/modelfile.md)
neuralchat_model_definition = '''
FROM ./neural-chat-7b-v3-1.Q6_K.gguf

TEMPLATE """
### System:
{{- if .First }}
{{ .System }}
{{- end }}

### User:
{{ .Prompt }}

### Assistant:
"""

SYSTEM """You are a helpful and friendly assistant. If you don't know the answer to the user's question, just say you don't know."""

PARAMETER temperature 0.3
'''

# Write the modelfile next to the downloaded weights so the relative FROM path resolves
with open("/content/modelfile.neuralchat", "w") as text_file:
    text_file.write(neuralchat_model_definition)

# Create a model in ollama for the 6-bit quantized Neural Chat
!ollama create neuralchat -f /content/modelfile.neuralchat

# Install and start litellm (in the background)
# The port is set explicitly so it matches the ngrok tunnel below
!pip install litellm
!nohup bash -c "litellm --host 127.0.0.1 --port 8000 --model ollama/neuralchat" &

# Start ngrok
!ngrok config add-authtoken $AUTHTOKEN
!ngrok http --log stderr 8000