# 2/ Advanced chatbot with message history and filter using LangChain

<img src="https://github.com/prasadkona/databricks_demos/blob/main/images/llm-rag-full-pinecone-0i.png?raw=true" style="float: right; margin-left: 10px"  width="900px;">

The data is now available in the Pinecone vector database!

Let's now create a more advanced LangChain model to perform RAG.

We will improve our LangChain model with the following:

- Build a complete chain that supports chat history, using the Llama 2 input style
- Add a filter so the chain only answers Databricks-related questions
- Compute the embeddings with Databricks BGE models within our chain to query the Pinecone vector database
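
To make the chat-history bullet concrete, here is a minimal sketch of how a chain might separate the latest question from the earlier turns of a `messages` payload. The helper names `extract_question` and `extract_history` are illustrative assumptions; the actual chain lives in the companion chain notebook and may be structured differently.

```python
# Illustrative helpers (hypothetical names, not taken from the chain notebook).

def extract_question(messages):
    """Return the content of the last message, which is the question to answer."""
    return messages[-1]["content"]


def extract_history(messages):
    """Return every message except the last one as the chat history."""
    return messages[:-1]


messages = [
    {"role": "user", "content": "What is Apache Spark?"},
    {"role": "assistant", "content": "Apache Spark is an open-source data processing engine."},
    {"role": "user", "content": "Does it support streaming?"},
]

question = extract_question(messages)
history = extract_history(messages)
print(question)
print(len(history))
```

Keeping the question and the history apart lets the chain use the history to rewrite a follow-up such as "Does it support streaming?" into a standalone query before retrieval.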


In [0]:
%pip install -U mlflow==2.15.1 pinecone-client==5.0.1 langchain-pinecone==0.1.3 langchain==0.2.0 databricks-sdk==0.30.0 langchain-community==0.2.0
dbutils.library.restartPython()

In [0]:
import os
import requests
import mlflow
import langchain

# Workspace URL used to send requests to your model serving endpoint
host = "https://" + spark.conf.get("spark.databricks.workspaceUrl")

pinecone_index_name = "dbdemo-index"
pinecone_namespace = 'dbdemo-namespace'
pinecone_api_key = dbutils.secrets.get("pinecone_secrets_scope", "PINECONE_API_KEY")
# Expose the key through the environment so the Pinecone client can pick it up
os.environ["PINECONE_API_KEY"] = pinecone_api_key
#os.environ['DATABRICKS_TOKEN'] = dbutils.secrets.get("pinecone_secrets_scope", "DATABRICKS_TOKEN")

catalog = "prasad_kona_dev"
db = "rag_chatbot_prasad_kona"

# Set a debug flag
debug_flag = True


## Register the chatbot model to Unity Catalog

In [0]:


# Specify the full path to the chain notebook
chain_notebook_file = "2.1 - Advanced-Chatbot-Chain - Using Pinecone"
chain_notebook_path = os.path.join(os.getcwd(), chain_notebook_file)

print(f"Chain notebook path: {chain_notebook_path}")

Chain notebook path: /Workspace/Users/prasad.kona@databricks.com/dbdemos/isv-chatbot-rag-llm-v20240108/rag_with_pinecone/2.1 - Advanced-Chatbot-Chain - Using Pinecone


In [0]:
from mlflow.models import infer_signature
# Provide an example of the input schema that is used to set the MLflow model's signature

#print(f'Testing with relevant history and question...')
dialog = {
    "messages": [
        {"role": "user", "content": "What is Apache Spark?"}, 
        {"role": "assistant", "content": "Apache Spark is an open-source data processing engine that is widely used in big data analytics."}, 
        {"role": "user", "content": "Does it support streaming?"},
        {"role": "assistant", "content": "Yes."},
        {"role": "user", "content": "Tell me more about its capabilities."},
    ]
}

dialog_output = {'result': "Delta Lake is an open-source storage layer that brings ACID transactions, scalable metadata handling, and unifies batch and streaming processing. It's built on top of Apache Parquet format, providing features like schema enforcement, data versioning (Time Travel), and scalable metadata handling. Delta Lake supports both batch and streaming workloads, making it a unified solution for big data processing. It also offers features like ACID transactions and schema evolution, which are crucial for maintaining data integrity and handling continuously changing data. Delta Lake is designed to handle petabyte-scale tables with billions of partitions and files, making it suitable for large-scale data processing.", 'sources': ['dbfs:/Volumes/prasad_kona_dev/rag_chatbot_prasad_kona/volume_databricks_documentation/databricks-pdf/building-reliable-data-lakes-at-scale-with-delta-lake.pdf', 'dbfs:/Volumes/prasad_kona_dev/rag_chatbot_prasad_kona/volume_databricks_documentation/databricks-pdf/building-reliable-data-lakes-at-scale-with-delta-lake.pdf', 'dbfs:/Volumes/prasad_kona_dev/rag_chatbot_prasad_kona/volume_databricks_documentation/databricks-pdf/The-Delta-Lake-Series-Lakehouse-012921.pdf', 'dbfs:/Volumes/prasad_kona_dev/rag_chatbot_prasad_kona/volume_databricks_documentation/databricks-pdf/big-book-of-data-engineering-2nd-edition-final.pdf']}

input_example = {
   "messages": [
       {
           "role": "user",
           "content": "How does billing work on Databricks?",
       }
   ]
}

output_example = {'result': "Databricks operates on a pay-as-you-go model, where you are billed based on the usage of cloud resources. The cost depends on the type and duration of cloud resources you use, such as compute instances and storage. You can monitor your usage and costs through the Databricks platform. For more specific billing details, I would recommend checking Databricks' official documentation or contacting their support.", 'sources': ['dbfs:/Volumes/prasad_kona_dev/rag_chatbot_prasad_kona/volume_databricks_documentation/databricks-pdf/EB-Ingesting-Data-FINAL.pdf', 'dbfs:/Volumes/prasad_kona_dev/rag_chatbot_prasad_kona/volume_databricks_documentation/databricks-pdf/big-book-of-data-and-ai-use-cases-for-the-public-sector.pdf', 'dbfs:/Volumes/prasad_kona_dev/rag_chatbot_prasad_kona/volume_databricks_documentation/databricks-pdf/big-book-of-data-and-ai-use-cases-for-the-public-sector.pdf', 'dbfs:/Volumes/prasad_kona_dev/rag_chatbot_prasad_kona/volume_databricks_documentation/databricks-pdf/Databricks-Customer-360-ebook-Final.pdf']}

signature = infer_signature(input_example, output_example)
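
Because the signature is inferred from these examples, requests sent to the model should follow the same shape. The following is a minimal sketch of a pre-flight check; `validate_messages` is a hypothetical helper, not part of MLflow, and the rule that the last message must come from the user is an assumption based on the examples above.

```python
def validate_messages(payload):
    """Raise ValueError unless payload looks like {"messages": [{"role", "content"}, ...]}."""
    messages = payload.get("messages")
    if not isinstance(messages, list) or not messages:
        raise ValueError("'messages' must be a non-empty list")
    for i, message in enumerate(messages):
        # Each turn carries exactly a role and its text content.
        if set(message) != {"role", "content"}:
            raise ValueError(f"message {i} must have exactly 'role' and 'content'")
        if message["role"] not in ("user", "assistant"):
            raise ValueError(f"message {i} has unexpected role {message['role']!r}")
    # Assumption: the chain answers the final user turn.
    if messages[-1]["role"] != "user":
        raise ValueError("the last message must come from the user")
    return True


example = {"messages": [{"role": "user", "content": "How does billing work on Databricks?"}]}
print(validate_messages(example))
```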

In [0]:

mlflow.set_registry_uri("databricks-uc")
model_name = f"{catalog}.{db}.rag_with_pinecone_model"

with mlflow.start_run():
    signature = infer_signature(input_example, output_example)
    logged_chain_info = mlflow.langchain.log_model(
        lc_model=chain_notebook_path,
        artifact_path="chain",
        registered_model_name=model_name,
        input_example=input_example,
        signature=signature,
        example_no_conversion=True, # required to allow the schema to work
        extra_pip_requirements=[ 
          "mlflow==" + mlflow.__version__,
          "langchain==0.2.0" ,
          "pinecone-client==5.0.1",
          "langchain-pinecone==0.1.3",
          "langchain-community==0.2.0"
        ]
    )

In [0]:
logged_chain_info.model_uri

'runs:/ef72ffcec9cf466ba2efe9ec83804abb/chain'

Let's try loading our model

In [0]:
model = mlflow.langchain.load_model(logged_chain_info.model_uri)
model.invoke(dialog)

{'result': 'Apache Spark is a powerful data processing engine that supports various capabilities, making it suitable for big data analytics. Some of its key features include:\n\n1. **Streaming support**: Spark allows for streaming data processing, enabling real-time data analysis and decision-making.\n2. **Scalable metadata handling**: Delta Lake, a key component of Databricks, stores metadata information in a transaction log instead of a metastore. This allows for efficient listing of files in large directories and reading data.\n3. **Data versioning and time travel**: Delta Lake enables users to read previous snapshots of a table or directory. This feature is useful for reproducing experiments, reports, and reverting a table to its older versions if needed.\n4. **Unified batch and streaming sink**: Apart from batch writes, Delta Lake can also be used as an efficient streaming sink with Apache Spark’s structured streaming. This enables near real-time analytics use cases without mainta

In [0]:
model.invoke(input_example)

{'result': 'Billing on Databricks is based on usage and is typically charged to the cloud service provider account where the Databricks workspace is hosted. The billing is usually usage-based, meaning you only pay for the resources you use. This can lead to lower total cost of ownership compared to legacy Hadoop systems and can help reduce premiums for customers and lower loss ratios in insurance use cases. The serverless data plane network infrastructure is managed by Databricks in a Databricks cloud service provider account and shared among customers, with additional network boundaries between workspaces and between clusters.',
 'sources': ['dbfs:/Volumes/prasad_kona_dev/rag_chatbot_prasad_kona/volume_databricks_documentation/databricks-pdf/technical_guide_solving_common-data-challenges-for-startups-and-digital-native-businesses.pdf',
  'dbfs:/Volumes/prasad_kona_dev/rag_chatbot_prasad_kona/volume_databricks_documentation/databricks-pdf/The-Data-Teams-Guide-to-the-DB-Lakehouse-Platfo

### Deploying our Chat Model as a Serverless Model Endpoint 

Our model is saved in Unity Catalog. The last step is to deploy it as a Model Serving endpoint.

We'll then be able to send requests from our assistant frontend.

In [0]:
def get_latest_model_version(model_name):
    from mlflow import MlflowClient
    mlflow_client = MlflowClient()
    latest_version = 1
    for mv in mlflow_client.search_model_versions(f"name='{model_name}'"):
        version_int = int(mv.version)
        if version_int > latest_version:
            latest_version = version_int
    return latest_version
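
The `int(mv.version)` conversion in the function above is worth a note. Assuming versions arrive as strings, as the conversion suggests, comparing them as text would rank "9" above "10". The sketch below illustrates the difference using `SimpleNamespace` stand-ins rather than real MLflow objects.

```python
from types import SimpleNamespace

# Stand-ins for the objects returned by search_model_versions
# (assumed here to carry a string `version` attribute).
versions = [SimpleNamespace(version=v) for v in ["1", "2", "9", "10"]]

latest_as_text = max(mv.version for mv in versions)      # lexicographic comparison
latest_as_int = max(int(mv.version) for mv in versions)  # numeric comparison

print(latest_as_text)
print(latest_as_int)
```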

In [0]:
model = mlflow.langchain.load_model("models:/"+model_name+"/"+str(get_latest_model_version(model_name)))
model.invoke(dialog)

{'result': "Apache Spark is a powerful data processing engine that supports various capabilities, including:\n\n1. SQL Queries: Spark SQL allows relational processing with improved performance, and it can be used with SQL or through APIs in Python, Scala, and Java.\n2. Streaming Data: Spark Streaming enables scalable and fault-tolerant processing of live data streams, which can be integrated with a wide range of sources.\n3. Machine Learning: MLlib is Spark's distributed machine learning library, which provides various machine learning algorithms, including classification, regression, clustering, and collaborative filtering.\n4. Graph Processing: GraphX is Spark's API for graph-parallel computation, which provides a set of fundamental operators for manipulating graphs and a library of common graph algorithms.\n5. SparkR: SparkR is an R package that provides a light-weight frontend to use Spark from R, enabling data scientists to analyze large datasets and interact with data stored in v

In [0]:

latest_model_version = get_latest_model_version(model_name)
print(latest_model_version)

3


In [0]:
# Create or update serving endpoint
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import EndpointCoreConfigInput, ServedModelInput, ServedModelInputWorkloadSize
import requests

serving_endpoint_name = "pinecone_rag_chain"
latest_model_version = get_latest_model_version(model_name)

databricks_api_token = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()


w = WorkspaceClient()
endpoint_config = EndpointCoreConfigInput(
    name=serving_endpoint_name,
    served_models=[
        ServedModelInput(
            model_name=model_name,
            model_version=latest_model_version,
            workload_size=ServedModelInputWorkloadSize.SMALL,
            scale_to_zero_enabled=True,
            environment_vars={
                "PINECONE_API_KEY": "{{secrets/prasad_kona/PINECONE_API_KEY}}",
                "DATABRICKS_TOKEN": "{{secrets/dbdemos/rag_sp_token}}",
            }
        )
    ]
)

existing_endpoint = next(
    (e for e in w.serving_endpoints.list() if e.name == serving_endpoint_name), None
)
serving_endpoint_url = f"{host}/ml/endpoints/{serving_endpoint_name}"
if existing_endpoint is None:
    print(f"Creating the endpoint {serving_endpoint_url}, this will take a few minutes to package and deploy the endpoint...")
    w.serving_endpoints.create_and_wait(name=serving_endpoint_name, config=endpoint_config)
else:
    print(f"Updating the endpoint {serving_endpoint_url} to version {latest_model_version}, this will take a few minutes to package and deploy the endpoint...")
    w.serving_endpoints.update_config_and_wait(served_models=endpoint_config.served_models, name=serving_endpoint_name)
    
displayHTML(f'Your Model Serving endpoint is now available. Open the <a href="/ml/endpoints/{serving_endpoint_name}">Model Serving Endpoint page</a> for more details.')

Our endpoint is now deployed! You can search for the endpoint name in the [Serving Endpoint UI](#/mlflow/endpoints) and view its performance.

Let's send it a query from Python using the Databricks SDK.

In [0]:
from databricks.sdk import WorkspaceClient
w = WorkspaceClient()
serving_endpoint_name = "pinecone_rag_chain"
latest_model_version = get_latest_model_version(model_name)
print("latest_model_version="+str(latest_model_version))

latest_model_version=3


In [0]:
from databricks.sdk.service.serving import DataframeSplitInput

test_dialog = DataframeSplitInput(
    columns=["messages"],
    data=[
        
            {
                "messages": [
                    {"role": "user", "content": "What is Apache Spark?"},
                    {
                        "role": "assistant",
                        "content": "Apache Spark is an open-source data processing engine that is widely used in big data analytics.",
                    },
                    {"role": "user", "content": "Does it support streaming?"},
                ]
            }
        
    ],
)
answer = w.serving_endpoints.query(serving_endpoint_name, dataframe_split=test_dialog)
print(answer.predictions[0])

{'result': 'Yes, Apache Spark supports streaming through Spark Structured Streaming, which is a scalable and fault-tolerant stream processing engine. It provides an easy-to-use API for creating continuous, real-time data pipelines.', 'sources': ['dbfs:/Volumes/prasad_kona_dev/rag_chatbot_prasad_kona/volume_databricks_documentation/databricks-pdf/big-book-of-data-engineering-2nd-edition-final.pdf', 'dbfs:/Volumes/prasad_kona_dev/rag_chatbot_prasad_kona/volume_databricks_documentation/databricks-pdf/big-book-of-data-engineering-2nd-edition-final.pdf', 'dbfs:/Volumes/prasad_kona_dev/rag_chatbot_prasad_kona/volume_databricks_documentation/databricks-pdf/big-book-of-data-engineering-2nd-edition-final.pdf', 'dbfs:/Volumes/prasad_kona_dev/rag_chatbot_prasad_kona/volume_databricks_documentation/databricks-pdf/big-book-of-data-engineering-2nd-edition-final.pdf']}
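
The SDK call above wraps a REST request. As a rough sketch of the equivalent HTTP call, the snippet below builds (but does not send) a POST to the endpoint's `/serving-endpoints/<name>/invocations` path. The host and token values are placeholders, `build_invocation_request` is a hypothetical helper, and the payload mirrors the `dataframe_split` structure used in the cell above; adjust it if your endpoint expects a different input format.

```python
import json
import urllib.request


def build_invocation_request(host, endpoint_name, token, messages):
    """Build (without sending) a POST request for a model serving endpoint."""
    url = f"{host}/serving-endpoints/{endpoint_name}/invocations"
    payload = {
        "dataframe_split": {
            "columns": ["messages"],
            "data": [{"messages": messages}],
        }
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_invocation_request(
    host="https://example.cloud.databricks.com",  # placeholder workspace URL
    endpoint_name="pinecone_rag_chain",
    token="<your-token>",  # placeholder; read real tokens from a secret scope
    messages=[{"role": "user", "content": "Does Apache Spark support streaming?"}],
)
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request), then parse the JSON response body.
```

Building the request separately from sending it makes the URL, headers, and body easy to inspect before any network traffic happens.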


In [0]:
from databricks.sdk.service.serving import DataframeSplitInput

test_dialog = DataframeSplitInput(
    columns=["messages"],
    data=[
        
            {
                "messages": [
                    {"role": "user", "content": "How does billing work on Databricks?"},
                    
                ]
            }
        
    ],
)
answer = w.serving_endpoints.query(serving_endpoint_name, dataframe_split=test_dialog)
print(answer.predictions[0])

{'result': 'Billing on Databricks is usage-based. Customers are charged according to the number of Databricks Units (DBUs) consumed. A DBU is a unit of measure for the processing power used in Databricks, which includes the use of compute resources and managed services. The cost per DBU depends on the type of Databricks Runtime and the cloud service provider. Databricks provides detailed usage reports to help customers monitor and manage their costs.', 'sources': ['dbfs:/Volumes/prasad_kona_dev/rag_chatbot_prasad_kona/volume_databricks_documentation/databricks-pdf/technical_guide_solving_common-data-challenges-for-startups-and-digital-native-businesses.pdf', 'dbfs:/Volumes/prasad_kona_dev/rag_chatbot_prasad_kona/volume_databricks_documentation/databricks-pdf/The-Data-Teams-Guide-to-the-DB-Lakehouse-Platform.pdf', 'dbfs:/Volumes/prasad_kona_dev/rag_chatbot_prasad_kona/volume_databricks_documentation/databricks-pdf/databricks_ebook_insurance_v10.pdf', 'dbfs:/Volumes/prasad_kona_dev/rag_c
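
The responses above can list the same source document several times, presumably because several retrieved chunks come from one PDF. If the frontend should show each source only once, an order-preserving de-duplication is enough. This is a small sketch with shortened, made-up file paths, not something the chain does itself.

```python
def unique_sources(sources):
    """Drop repeated sources while keeping first-seen order."""
    return list(dict.fromkeys(sources))


example_answer = {
    "result": "Yes, Apache Spark supports streaming.",
    "sources": [  # hypothetical, shortened paths
        "docs/data-engineering.pdf",
        "docs/data-engineering.pdf",
        "docs/delta-lake.pdf",
        "docs/data-engineering.pdf",
    ],
}
print(unique_sources(example_answer["sources"]))
```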

## Congratulations! You have deployed your RAG application with Pinecone!