
Note: This sample notebook gives hands-on experience with some of the features of LLM Gateway. It is meant to be interactive, and some designated code cells require input from the user.

#### Basic usage
This codespace was started with an open source LLM (phi3) served by Ollama. The proxy service is configured to use this model, and the code throughout this notebook references it by passing 'phi3' as the 'model' parameter to openai client completion calls. The proxy service runs on localhost port 4000 by default, which is used as the 'base_url' for the openai client objects. The master key for LLM Gateway is set to "sk-password" and can be used to call registered models as well as to access the admin panel at "https://<CODESPACE_NAME>-4000.app.github.dev/ui". The admin panel can also be opened through port 4000 on the PORTS tab of the terminal panel.
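
The forwarded address depends on the name of the running codespace. The cell below is a minimal sketch of how that address could be assembled, assuming the standard Codespaces environment variables `CODESPACE_NAME` and `GITHUB_CODESPACES_PORT_FORWARDING_DOMAIN` are set. `forwarded_url` is a hypothetical helper name, and the fallback to localhost is an assumption for runs outside a codespace.

```python
import os


def forwarded_url(port, path="", codespace_name=None, domain=None):
    """Build the browser URL for a forwarded port (hypothetical helper)."""
    # Explicit arguments take precedence over the environment.
    name = codespace_name or os.environ.get("CODESPACE_NAME")
    domain = domain or os.environ.get(
        "GITHUB_CODESPACES_PORT_FORWARDING_DOMAIN", "app.github.dev"
    )
    if not name:
        # Assumption: outside a codespace, the service is reached locally.
        return f"http://localhost:{port}{path}"
    return f"https://{name}-{port}.{domain}{path}"


# Admin panel of the gateway (port 4000, path /ui).
print(forwarded_url(4000, "/ui"))
```

The same helper can be called with port 3000 to get the address of any other service forwarded from this codespace.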

In [None]:
import openai
client = openai.OpenAI(
    api_key="sk-password",
    base_url="http://0.0.0.0:4000"
)

##### LLM
Enter a prompt in the code cell below, then run the next cell to call the 'phi3' model and generate a response.

In [None]:
query = "<your_prompt_here>" 

In [None]:
response = client.chat.completions.create(model="phi3", messages = [
    {
        "role": "user",
        "content": query
    }
])

print(response)
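
The cell above prints the whole response object. The following is a minimal sketch for pulling out only the generated text, assuming the response follows the OpenAI chat completion shape (`choices[0].message.content`). `reply_text` is a hypothetical helper name, and the stand-in object only mimics that shape so the cell runs without the gateway.

```python
from types import SimpleNamespace


def reply_text(response):
    """Return the generated text of the first choice in a chat completion."""
    return response.choices[0].message.content


# Stand-in object with the same attribute layout as a chat completion.
fake_response = SimpleNamespace(
    choices=[
        SimpleNamespace(
            message=SimpleNamespace(role="assistant", content="Hello!")
        )
    ]
)

print(reply_text(fake_response))
```

With the live gateway, `print(reply_text(response))` would show only the model's reply instead of the full object.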

##### Embedding 
This codespace also includes an open source embedding model, "nomic-embed-text", from Ollama. Run the code below to call the embedding model.

In [None]:
client.embeddings.create(
  model="nomic-embed-text",
  input=query
)
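
The embedding call returns one vector per input. As a rough sketch of how such vectors are typically compared, the helper below computes cosine similarity in plain Python. `cosine_similarity` is a hypothetical helper name, and the short vectors are made-up stand-ins rather than real model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up vectors: the first two point in similar directions.
v1 = [0.9, 0.1, 0.0]
v2 = [0.8, 0.2, 0.1]
v3 = [0.0, 0.1, 0.9]

print(cosine_similarity(v1, v2) > cosine_similarity(v1, v3))
```

With the live gateway, the vector for an input is available as `data[0].embedding` on the object returned by `client.embeddings.create`.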

#### Caching
LLM Gateway provides a caching mechanism for LLM responses. This codespace is configured to use Redis as the cache. To see caching in action, write a prompt in the code cell below (the more complex the prompt, the better), run the cells that follow, and compare the response times.

In [None]:
import openai
client = openai.OpenAI(
    api_key="sk-password",
    base_url="http://0.0.0.0:4000"
)

In [None]:
query = "<your_prompt_here>" 

In [None]:
response = client.chat.completions.create(model="phi3", messages = [
    {
        "role": "user",
        "content": query
    }
])

print(response)

In [None]:
response = client.chat.completions.create(model="phi3", messages = [
    {
        "role": "user",
        "content": query
    }
])

print(response)

Check the traces logged to Langfuse at "https://<CODESPACE_NAME>-3000.app.github.dev", or open port 3000 from the PORTS tab. When a response is returned from the cache, the Langfuse trace marks the cache hit as "True".
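
The difference in response time can also be measured directly in the notebook. The cell below is a minimal sketch of a timing wrapper. `timed` and `slow_double` are hypothetical names, and the slow function stands in for a gateway call so the cell runs offline.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def slow_double(x):
    # Stand-in for a slow, uncached call.
    time.sleep(0.01)
    return x * 2


value, seconds = timed(slow_double, 21)
print(f"result={value}, elapsed={seconds:.3f}s")
```

To compare the two identical requests, each one could be wrapped as `timed(client.chat.completions.create, model="phi3", messages=[{"role": "user", "content": query}])`. If the second request is served from the cache, its elapsed time should be noticeably lower.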

#### Virtual Keys
For demonstration, the LLM Gateway in this codespace has been set up with a virtual key that has access to the configured model. Try out the code sample below.

In [None]:
import openai
client = openai.OpenAI(
    api_key="sk-TE5BPNfSh4IOCNpW3I5EDQ",
    base_url="http://0.0.0.0:4000"
)

In [None]:
query = "<your_prompt_here>" 

In [None]:
response = client.chat.completions.create(model="phi3", messages = [
    {
        "role": "user",
        "content": query
    }
])

print(response)

#### PII Masking
PII masking with the Presidio service is enabled in this codespace. Try out the code below to see it in action.

In [None]:
query = "<Enter_prompt_with_PII_here>" # eg "My name is John Doe and my social security number is 123-45-6789"

In [None]:
import openai
client = openai.OpenAI(
    api_key="sk-password",
    base_url="http://0.0.0.0:4000"
)

response = client.chat.completions.create(model="phi3", messages = [
    {
        "role": "user",
        "content": query
    }
])

print(response)

Note that the PII in the response is masked. The Langfuse traces for the LLM call also show that the gateway proxy service masks the data before passing the user input to the LLM.
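
To make the idea of masking concrete, the cell below is a deliberately simplified stand-in that replaces US social security numbers with a placeholder using a regular expression. It is not how Presidio works internally. `mask_ssn` is a hypothetical helper, and the `<US_SSN>` placeholder format is an assumption chosen for illustration.

```python
import re

# Matches the common NNN-NN-NNNN layout only; real detectors do much more.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def mask_ssn(text, placeholder="<US_SSN>"):
    """Replace anything that looks like a US SSN with a placeholder."""
    return SSN_PATTERN.sub(placeholder, text)


print(mask_ssn("My name is John Doe and my social security number is 123-45-6789"))
```

This sketch leaves the name untouched, because recognizing names takes more than a regular expression. That gap is the reason a dedicated service is used in the gateway.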

#### RAG with Langchain
To demonstrate how models proxied through LLM Gateway may be used in RAG applications with minimal code changes, a sample RAG pipeline is provided below. The pipeline uses the open source model "phi3", registered through LLM Gateway, to generate responses, and the open source embedding model "nomic-embed-text" to generate embeddings for the documents.
For this example, let's use the well-known state_of_the_union.txt as the document for the RAG pipeline. Run the code below to see the pipeline in action.

In [None]:
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import Annoy
# ChatOpenAI is imported from langchain_openai, the same package as
# OpenAIEmbeddings, instead of the deprecated langchain.chat_models path.
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
from langchain.chains import RetrievalQA

Split document

In [None]:
loader = TextLoader("state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
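
The `chunk_size` and `chunk_overlap` settings control how long each chunk is and how much text neighbouring chunks share. The cell below is a simplified sketch of that idea using fixed character windows. It is not the splitter's actual algorithm, which also takes a separator into account, and `split_fixed` is a hypothetical helper name.

```python
def split_fixed(text, chunk_size, chunk_overlap=0):
    """Cut text into windows of chunk_size characters that share
    chunk_overlap characters with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(split_fixed("abcdefghij", 4))
print(split_fixed("abcdefghij", 4, chunk_overlap=1))
```

With an overlap of 0, as used in this notebook, every character belongs to exactly one chunk. A non-zero overlap repeats the tail of one chunk at the head of the next, which can help when a relevant sentence straddles a chunk boundary.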

In [None]:

embeddings = OpenAIEmbeddings(
    api_key="sk-password",
    base_url="http://0.0.0.0:4000",
    model="nomic-embed-text",
)

llm = ChatOpenAI(
    model_name="phi3",
    temperature=0,
    api_key="sk-password",
    base_url="http://0.0.0.0:4000",
)

Load split documents into vector store

In [None]:
vector_store = Annoy.from_documents(
    docs,
    embeddings
)

Test RAG by asking a question about the state of the union address

In [None]:
prompt = "<Enter_prompt_here>"

In [None]:
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vector_store.as_retriever(),
    chain_type = "map_reduce",
)

In [None]:
qa_chain.run(prompt)
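
The chain above is created with `chain_type = "map_reduce"`. As a rough sketch of the general map-reduce idea, the cell below asks the question once per retrieved chunk and then makes one more call to combine the partial answers. This is not the library's implementation: `map_reduce_answer`, the prompt wording, and the `fake_ask` stand-in for the model are all assumptions made for illustration.

```python
from typing import Callable, List


def map_reduce_answer(question: str, chunks: List[str], ask: Callable[[str], str]) -> str:
    """Answer a question over several chunks using one call per chunk
    (map) and one final call that merges the partial answers (reduce)."""
    # Map step: each chunk is queried independently.
    partial_answers = [
        ask(f"Context: {chunk}\nQuestion: {question}") for chunk in chunks
    ]
    # Reduce step: the partial answers are combined into a single answer.
    combined = "\n".join(partial_answers)
    return ask(f"Partial answers:\n{combined}\nQuestion: {question}")


prompts_seen = []


def fake_ask(prompt_text: str) -> str:
    # Stand-in for a model call; records the prompt and returns a label.
    prompts_seen.append(prompt_text)
    return f"answer{len(prompts_seen)}"


final = map_reduce_answer("What was said?", ["chunk A", "chunk B", "chunk C"], fake_ask)
print(len(prompts_seen), final)
```

One consequence of this pattern is that the number of model calls grows with the number of retrieved chunks, so a question over many chunks can take noticeably longer than a single completion.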