
Note: This sample notebook gives hands-on experience with some of the features of LLM Gateway. It is meant to be interactive and may require you to provide inputs in designated code cells.

#### Basic usage
This codespace was started with an open source LLM (phi3) served by Ollama. The proxy service is configured to use that model, which this notebook references by passing 'phi3' as the 'model' parameter to openai client completion calls. The proxy service runs on localhost port 4000 by default, and that address is used as the 'base_url' for openai client objects. The master key for LLM Gateway is set to "sk-password"; it can be used to call registered models and to access the admin panel at "https://<CODESPACE_NAME>-4000.app.github.dev/ui". The admin panel can also be reached through port 4000 on the PORTS tab of the terminal.

In [None]:
import openai
client = openai.OpenAI(
    api_key="sk-password",
    base_url="http://0.0.0.0:4000"
)

##### LLM
Enter a prompt in the code cell below, then run the next cell to call the 'phi3' model and generate a response.

In [None]:
query = "Write a short story in less than 100 words" # Change this to your query/prompt

In [None]:
response = client.chat.completions.create(model="phi3", messages = [
    {
        "role": "user",
        "content": query
    }
])

print(response)
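
The printed object is the full completion, including metadata. As a minimal sketch, the generated text can be read from the first choice of the response. The helper `reply_text` and the stand-in `fake_response` are hypothetical names, used here only so the snippet runs without calling the gateway.

```python
from types import SimpleNamespace


def reply_text(response):
    """Return the assistant message text from a chat completion response."""
    return response.choices[0].message.content


# Stand-in with the same attribute layout as a chat completion (illustration only).
fake_response = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(content="Once upon a time..."))]
)

print(reply_text(fake_response))
```

In the notebook, `print(reply_text(response))` would show only the model's reply for the cell above.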

##### Embedding 
This codespace is also equipped with the open source embedding model "nomic-embed-text" from Ollama. Try the embedding call in the code below.

In [None]:
client.embeddings.create(
  model="nomic-embed-text",
  input=query
)
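
The call returns one vector per input, which the openai client exposes as `response.data[0].embedding`. A common way to compare two such vectors is cosine similarity. The sketch below is illustrative only: `cosine_similarity` is a hypothetical helper and the toy vectors stand in for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors: the first pair points in the same direction, the second pair is orthogonal.
same = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same, orthogonal)
```

With real embeddings, values closer to 1 suggest that the two inputs are semantically closer.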

#### Tracing
All calls through LLM Gateway are traced with Langfuse. A sample Langfuse project and its associated API keys have been preconfigured for this codespace. The traces can be viewed in the Langfuse dashboard at "https://<CODESPACE_NAME>-3001.app.github.dev", or through port 3001 on the PORTS tab in the terminal. (If asked to log in, use "admin@dep.com" as the email and "password" as the password.)

#### Caching
LLM Gateway provides a caching mechanism for LLM responses. This codespace is configured to use Redis as the cache. To see caching in action, write a prompt in the code cell below, run the cells that follow, and compare the response times.

In [None]:
import openai
client = openai.OpenAI(
    api_key="sk-password",
    base_url="http://0.0.0.0:4000"
)

In [None]:
query = "Hi, how are you?" # Change this to your query/prompt

In [None]:
response = client.chat.completions.create(model="phi3", messages = [
    {
        "role": "user",
        "content": query
    }
])

print(response)

In [None]:
response = client.chat.completions.create(model="phi3", messages = [
    {
        "role": "user",
        "content": query
    }
])

print(response)

Check the traces logged to Langfuse at "https://<CODESPACE_NAME>-3001.app.github.dev", or head over to the PORTS tab to access port 3001. When a response is returned from the cache, the Langfuse trace marks the cache hit as "True".
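
The difference in response time can also be measured directly in the notebook. The sketch below assumes nothing about the gateway itself: `timed` is a hypothetical helper that wraps any zero-argument callable and reports how long it took.

```python
import time


def timed(call):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = call()
    return result, time.perf_counter() - start


# Offline illustration with a trivial callable in place of a gateway request.
result, elapsed = timed(lambda: sum(range(1000)))
print(f"result={result}, elapsed={elapsed:.6f}s")
```

In this notebook, the callable could be `lambda: client.chat.completions.create(model="phi3", messages=[{"role": "user", "content": query}])`. Timing it twice with the same prompt should show a shorter second call if the response came from the cache.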

#### Virtual Keys
For demonstration, the LLM Gateway in this codespace has been set up with a virtual key that has access to the configured model. Try out the code sample below.

In [None]:
import openai
client = openai.OpenAI(
    api_key="sk-TE5BPNfSh4IOCNpW3I5EDQ",
    base_url="http://0.0.0.0:4000"
)

In [None]:
query = "Hi, what is an LLM?" # Change this to your query/prompt

In [None]:
response = client.chat.completions.create(model="phi3", messages = [
    {
        "role": "user",
        "content": query
    }
])

print(response)

Head over to the LiteLLM UI at "https://<CODESPACE_NAME>-4000.app.github.dev/ui" to add more virtual keys and try them out. Log in with the master key "sk-password" to access the admin panel (Username: admin).\
Note: Virtual keys can be used to limit access to models, track usage, rate-limit requests, and more. Refer to the section on virtual keys in the README for more details.
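
Virtual keys can also be created programmatically. The sketch below assumes that the gateway exposes LiteLLM's `/key/generate` endpoint and accepts the master key as a bearer token; `build_key_request` is a hypothetical helper. It only builds the request, so it runs without the gateway.

```python
import json
import urllib.request


def build_key_request(base_url, master_key, models, duration="30m"):
    """Build (but do not send) a POST request asking the proxy for a new virtual key."""
    # Assumed request body: the models the key may call and how long the key stays valid.
    body = json.dumps({"models": models, "duration": duration}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/key/generate",
        data=body,
        headers={
            "Authorization": f"Bearer {master_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_key_request("http://0.0.0.0:4000", "sk-password", ["phi3"])
print(request.get_method(), request.full_url)
```

Inside the codespace, `urllib.request.urlopen(request).read()` would send the request. Check the README for the response format before relying on it.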

#### RAG with Langchain
To demonstrate how models proxied through LLM Gateway may be used in RAG applications with minimal code changes, a sample RAG pipeline is provided below. The pipeline uses the open source model "phi3", registered through LLM Gateway, to generate responses and the open source embedding model "nomic-embed-text" to generate embeddings for the documents.
For this example, the well-known state_of_the_union.txt is the document used in the RAG pipeline. Run the code below to see the pipeline in action.

In [None]:
from langchain_community.document_loaders import TextLoader
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_text_splitters import CharacterTextSplitter
from langchain.chains import RetrievalQA
from langchain_community.vectorstores import Annoy

Split document

In [None]:
loader = TextLoader("data/state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
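
To make the effect of `chunk_size` concrete, here is a simplified sketch of separator-based chunking: split the text on blank lines, then greedily merge pieces until the size limit is reached. It is not LangChain's implementation, and `split_paragraphs` is a hypothetical name.

```python
def split_paragraphs(text, chunk_size=1000, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size characters.

    A single piece longer than chunk_size is kept whole rather than cut.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\n\nbbbb\n\ncccc"
print(split_paragraphs(sample, chunk_size=10))
```

With a limit of 10 characters, the first two paragraphs fit in one chunk and the third starts a new one.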

In [None]:

embeddings = OpenAIEmbeddings(
    api_key="sk-password",
    base_url="http://0.0.0.0:4000",
    model="nomic-embed-text",
)

llm = ChatOpenAI(
    model_name="phi3",
    temperature=0,
    api_key="sk-password",
    base_url="http://0.0.0.0:4000",
)

Load split documents into vector store

In [None]:
vector_store = Annoy.from_documents(
    docs,
    embeddings
)

Test RAG by asking a question about the state of the union address

In [None]:
prompt = "What did the president say about the economy?" # Change this to your query/prompt

In [None]:
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vector_store.as_retriever(),
    chain_type = "stuff",
)

In [None]:
qa_chain.invoke(prompt)
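
The "stuff" chain type places all retrieved chunks into a single prompt alongside the question. The sketch below illustrates that idea only; the prompt wording and the `build_stuff_prompt` helper are assumptions, not the template LangChain actually uses.

```python
def build_stuff_prompt(question, chunks):
    """Combine every retrieved chunk and the question into one prompt string."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder chunks stand in for passages returned by the retriever.
example_prompt = build_stuff_prompt(
    "What did the president say about the economy?",
    ["First retrieved passage.", "Second retrieved passage."],
)
print(example_prompt)
```

Because every chunk is sent in one request, this approach suits a small number of short chunks that fit within the model's context window.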

##### Integration with Langfuse
Langfuse tracing integrates with Langchain through Langchain callbacks. The SDK automatically creates a nested trace for every run of your Langchain application.

In [None]:
import os
os.environ["LANGFUSE_PUBLIC_KEY"] = "pk-lf-f2a3d62b-8ee4-4736-a823-5751a40f1bba"
os.environ["LANGFUSE_SECRET_KEY"] = "sk-lf-7b4fd420-8dfc-4996-abaf-79ce05c8b7ba"
os.environ["LANGFUSE_HOST"] = "http://localhost:3001"
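
Before creating the callback handler, it can help to confirm that all three settings are present. This is a small sketch under the assumption that the handler reads exactly these environment variables; `missing_settings` and `example_env` are hypothetical names.

```python
REQUIRED_SETTINGS = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY", "LANGFUSE_HOST")


def missing_settings(env):
    """Return the names of required settings that are absent or empty in a mapping."""
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]


# Example mapping with the secret key left out (values are placeholders).
example_env = {
    "LANGFUSE_PUBLIC_KEY": "pk-lf-example",
    "LANGFUSE_HOST": "http://localhost:3001",
}
print(missing_settings(example_env))
```

In the notebook, `missing_settings(os.environ)` should return an empty list once the cell above has run.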

In [None]:
from langfuse.callback import CallbackHandler
 
langfuse_handler = CallbackHandler()

In [None]:
prompt = "What is the view on tax cuts?" # Change this to your query/prompt

In [None]:
qa_chain.invoke(prompt, config={"callbacks": [langfuse_handler]})

#### PII Masking [Optional]
When the Presidio service is enabled in the codespace, try out the code below to see PII masking in action.

In [None]:
query = "Hi, How are you? My name is John Doe and my phone number is 1800224541" # Change this prompt to test different PII inputs

In [None]:
import openai
client = openai.OpenAI(
    api_key="sk-password",
    base_url="http://0.0.0.0:4000"
)

response = client.chat.completions.create(model="phi3", messages = [
    {
        "role": "user",
        "content": query
    }
])

print(response)

Note that the PII in the response is masked. The Langfuse traces for the LLM call also show that the gateway masks the data before passing the user input to the LLM.
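
To illustrate what masking means, here is a deliberately simple regex-based sketch that replaces 10-digit numbers with a placeholder. It is not how Presidio works, since Presidio uses dedicated recognizers; the pattern, the placeholder text, and `mask_phone_numbers` are all assumptions made for illustration.

```python
import re

# Assumed pattern: any standalone run of exactly 10 digits is treated as a phone number.
PHONE_PATTERN = re.compile(r"\b\d{10}\b")


def mask_phone_numbers(text, placeholder="<PHONE_NUMBER>"):
    """Replace every 10-digit number in the text with a placeholder."""
    return PHONE_PATTERN.sub(placeholder, text)


masked = mask_phone_numbers("My name is John Doe and my phone number is 1800224541")
print(masked)
```

A pattern like this misses names and other formats, which is why a dedicated service handles PII detection in the gateway.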