# Model API Integration with Cloudera AI

This notebook demonstrates the flexibility of Cloudera AI inferencing services by showing different ways to interact with deployed models. We'll progress from basic API usage to more complex implementations, showing how easy it is to switch between different models and frameworks.

## Requirements
- Python 3.10 or later
- Access to Cloudera AI console
- Two deployed models: test-model-llama-8b-v2 and deepseek-r1-distill-llama-8b

## Model Setup

Before proceeding, you'll need to gather information from your deployed models in the Cloudera AI console:

1. Go to Cloudera AI console > Model Endpoints
2. Find the models: 
   - test-model-llama-8b-v2
   - deepseek-r1-distill-llama-8b
3. For each model:
   - Copy the endpoint URL and remove everything after `/v1`. For example,
     `https://ai-inference.ainf-cdp.vayb-xokg.cloudera.site/...../modelxyz/openai/v1/chat/completions`
     becomes
     `https://ai-inference.ainf-cdp.vayb-xokg.cloudera.site/...../modelxyz/openai/v1`
   - Copy the Model ID

The first model's values go into the `base_url` and `model_name` variables. The second model's values go into the `ds_base_url` and `ds_model_name` variables.
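
The URL trimming step described above can also be done in code rather than by hand. The following is a minimal sketch, not part of the original notebook: `trim_to_v1` is a hypothetical helper name, and the example URL is a made-up placeholder rather than a real endpoint.

```python
def trim_to_v1(endpoint_url: str) -> str:
    """Return the URL truncated just after its last '/v1' path segment."""
    marker = "/v1"
    # Case 1: something follows '/v1', e.g. '/v1/chat/completions'
    idx = endpoint_url.rfind(marker + "/")
    if idx != -1:
        return endpoint_url[: idx + len(marker)]
    # Case 2: the URL already ends in '/v1'
    stripped = endpoint_url.rstrip("/")
    if stripped.endswith(marker):
        return stripped
    raise ValueError(f"No '/v1' segment found in {endpoint_url!r}")


# Placeholder URL for illustration only (assumption, not a real endpoint)
example = "https://example.invalid/endpoints/modelxyz/openai/v1/chat/completions"
print(trim_to_v1(example))
```

The result can then be assigned to `base_url` or `ds_base_url`.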

In [None]:
from openai import OpenAI
import os
import httpx
import json
from typing import List, Dict, Generator
# For LangChain:
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
import asyncio
import time
import re

#### Model endpoint collection
1. Collect the Llama 3.1 model endpoint details
2. Collect the Deepseek model endpoint details

Go to Cloudera AI and get the following parameters. Cut off the tail end of each URL after `/v1`.

**Llama 3.1 8b**

In [None]:
#base_url = "enter-url here."
#model_name = "enter model name here"

**Deepseek R1**

In [None]:
#ds_base_url = "enter-url here."
#ds_model_name = "enter model name here"  # e.g. "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"

#### Auth setup

The auth token you'll use to connect to your models is read from the `access_token` field of the JSON file at `/tmp/jwt`.
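
If the token file is missing or malformed, the one-line loads in the cells that follow fail with a fairly generic error. A slightly more defensive variant might look like the sketch below. The helper name `load_access_token` is hypothetical. The only details taken from this notebook are the `/tmp/jwt` path and the `access_token` key.

```python
import json
from pathlib import Path


def load_access_token(path: str = "/tmp/jwt") -> str:
    """Read the JSON file at `path` and return its 'access_token' field."""
    # The context manager closes the file handle once the JSON is parsed.
    with Path(path).open() as f:
        payload = json.load(f)
    token = payload.get("access_token")
    if not token:
        raise KeyError(f"'access_token' is missing or empty in {path}")
    return token
```

Usage would be `OPENAI_API_KEY = load_access_token()`, which yields the same value as the direct load when the file is well formed.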

In [None]:
json.load(open("/tmp/jwt"))["access_token"]

In [None]:
# Load the access token and use it as the API key
with open("/tmp/jwt") as f:
    OPENAI_API_KEY = json.load(f)["access_token"]

In [None]:
client = OpenAI(
    base_url=base_url,
    api_key=OPENAI_API_KEY,
)

## Basic Model Interaction

This section demonstrates the simplest way to interact with our deployed model through the OpenAI package. We'll:
1. Create a client with our model's endpoint and authentication
2. Send a simple message to test the connection
3. Display the model's streaming response

This represents the most straightforward way to interact with the model, similar to how you might use OpenAI's API. The key difference is that we're using our own deployed model through Cloudera AI's infrastructure.

Note: We're using `stream=True` in our completion request, which means the response arrives in small chunks as it is generated and is printed incrementally, providing a more interactive experience.
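
To show what consuming a streamed response involves, here is a small sketch that prints each piece as it arrives and also assembles the full text. It runs on stand-in chunk objects built with `SimpleNamespace` rather than a live endpoint. The helper names `collect_stream` and `fake_chunk` are hypothetical. The guard for an empty `choices` list is a precaution, on the assumption that some OpenAI-compatible servers may emit such chunks.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Print streamed pieces as they arrive and return the full text."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # skip chunks that carry no choices
            continue
        piece = chunk.choices[0].delta.content
        if piece is not None:  # some chunks carry no text at all
            parts.append(piece)
            print(piece, end="", flush=True)
    print()  # finish the line once the stream ends
    return "".join(parts)


def fake_chunk(text):
    """Stand-in with the same attribute path as a streamed chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo_chunks = [fake_chunk("Gen"), fake_chunk(None), fake_chunk("AI"),
               SimpleNamespace(choices=[])]
full_text = collect_stream(demo_chunks)
```

With a real client, the iterable returned by `client.chat.completions.create(..., stream=True)` would take the place of `demo_chunks`.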

In [None]:
message = "Write a one-sentence definition of GenAI."

In [None]:
completion = client.chat.completions.create(
  model=model_name,
  messages=[{"role":"user","content":message}],
  temperature=0.2,
  top_p=0.7,
  max_tokens=1024,
  stream=True
)

for chunk in completion:
  if chunk.choices[0].delta.content is not None:
    print(chunk.choices[0].delta.content, end="")

### Using the LangChain Framework

Now we'll demonstrate how to use the same model through LangChain, a popular framework for building LLM applications. This shows how Cloudera AI's models can integrate seamlessly with different frameworks while maintaining the same functionality.

In [None]:
lc_chat = ChatOpenAI(
    model_name=model_name,
    openai_api_key=OPENAI_API_KEY,
    base_url=base_url,
    temperature=0.2,
    streaming=True
)

In [None]:
# Create the message
message = "Write a one-sentence definition of GenAI."
messages = [HumanMessage(content=message)]

# Stream the response
for chunk in lc_chat.stream(messages):
    if chunk.content:
        print(chunk.content, end="")

## Enhanced Chat Client Implementation

This section implements a stateful chat client that maintains conversation history and can handle streaming responses. It demonstrates how to build more complex applications while maintaining the simple interface of the basic client.

Key features:
- Conversation history tracking
- Streaming response support
- Configurable parameters
- Error handling

In [None]:
class ChatClient:
    def __init__(self, model_name: str, base_url: str, deepseek_clean: bool = False):
        self.model_name = model_name
        self.deepseek_clean = deepseek_clean
        
        # Set up HTTP client
        if "CUSTOM_CA_STORE" not in os.environ:
            http_client = httpx.Client()
        else:
            http_client = httpx.Client(verify=os.environ["CUSTOM_CA_STORE"])
            
        # Load API key
        OPENAI_API_KEY = json.load(open("/tmp/jwt"))["access_token"]
        
        # Initialize OpenAI client
        self.client = OpenAI(
            base_url=base_url,
            api_key=OPENAI_API_KEY,
            http_client=http_client,
        )
        
        self.conversation_history: List[Dict[str, str]] = []

    def _clean_response(self, response: str) -> str:
        """
        Remove thinking tags and extract only the actual question/guess.
        """
        # Handle empty or None responses
        if not response:
            return ""
            
        # First clean up any think blocks and explanatory text
        response = re.sub(r'<think>.*?</think>', '', response, flags=re.DOTALL)
        response = re.sub(r'.*\*\*Question:\*\*\s*', '', response)
        response = re.sub(r'.*Question:\s*', '', response)
        response = re.sub(r'Step-by-Step.*', '', response)
        response = re.sub(r'\*\*.*?\*\*', '', response)
        
        # Get just the question or guess, taking the last non-empty line
        lines = [line.strip() for line in response.split('\n') if line.strip()]
        if lines:
            actual_response = lines[-1]  # Take last non-empty line
            if actual_response.startswith('FINAL GUESS:'):
                return actual_response
            elif '?' in actual_response:
                # Extract just the question
                return actual_response.split('?')[0].strip() + '?'
                
        return response.strip()

    def chat(self, message: str, stream: bool = True) -> str:
        """
        Send a message to the chat model and get the response.
        """
        # Add user message to history
        self.conversation_history.append({"role": "user", "content": message})
        
        try:
            if stream:
                partial_message = ""
                response = self.client.chat.completions.create(
                    model=self.model_name,
                    messages=self.conversation_history + [{"role": "system", "content": "After your thinking, always provide a clear, structured answer."}],
                    temperature=0.6,
                    top_p=0.7,
                    max_tokens=1024,  # Upper bound on generated tokens (streaming)
                    stream=True,
                )
                
                for chunk in response:
                    if chunk.choices[0].delta.content is not None:
                        content = chunk.choices[0].delta.content
                        partial_message += content
                        if not self.deepseek_clean:
                            print(content, end='', flush=True)
                
                final_message = partial_message
                if self.deepseek_clean:
                    final_message = self._clean_response(partial_message)
                    print(repr(final_message))
                    
            else:
                response = self.client.chat.completions.create(
                    model=self.model_name,
                    messages=self.conversation_history,
                    temperature=0.6,
                    top_p=0.7,
                    max_tokens=512,  # Upper bound on generated tokens (non-streaming)
                    stream=False,
                )
                
                complete_response = response.choices[0].message.content
                #print("\nNon-streaming response:", repr(complete_response))
                
                final_message = complete_response
                if self.deepseek_clean:
                    final_message = self._clean_response(complete_response)
                    #print("\nAfter cleaning:", repr(final_message))
                    print(repr(final_message))

            # Only add to history if we got a valid response
            if final_message:
                self.conversation_history.append({"role": "assistant", "content": final_message})
            
            return final_message
            
        except Exception as e:
            print(f"Error in chat method: {str(e)}")
            raise
    
    def get_history(self) -> List[Dict[str, str]]:
        """Get the conversation history."""
        return self.conversation_history
    
    def clear_history(self):
        """Clear the conversation history."""
        self.conversation_history = []

In [None]:
# Initialize the chat client
chat_client = ChatClient(model_name, base_url)

In [None]:
message = """In 6 sentences or less, explain how weights get updated during the training process of a neural network. Explain this to a 6 year old."""

In [None]:
# For streaming responses (will print as it receives chunks):
response = chat_client.chat(message, stream=True)

In [None]:
message2 = "Now follow it up with the learning rate. 5 sentences or less."

In [None]:
# For streaming responses (will print as it receives chunks):
response = chat_client.chat(message2, stream=True)

## Model Switching Demonstration

One of the key benefits of Cloudera AI is the ability to easily switch between different models. Here we'll demonstrate this by changing to the Deepseek model while using the same code structure.

For this section, we'll use our second model's information:
- URL goes into `ds_base_url` (remember to clip after /v1)
- Model ID goes into `ds_model_name`
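
One way to keep the two sets of values together is a small lookup table keyed by a short label, so switching models becomes a one-word change. This is a sketch under stated assumptions: `ModelEndpoint`, `ENDPOINTS`, and `client_kwargs` are hypothetical names, and the URLs and model IDs are placeholders to be replaced with the values copied from the console.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelEndpoint:
    base_url: str    # endpoint URL, trimmed after /v1
    model_name: str  # Model ID from the console


# Placeholder values for illustration only (assumptions, not real endpoints)
ENDPOINTS = {
    "llama": ModelEndpoint("https://example.invalid/llama/openai/v1",
                           "placeholder-llama-model-id"),
    "deepseek": ModelEndpoint("https://example.invalid/deepseek/openai/v1",
                              "placeholder-deepseek-model-id"),
}


def client_kwargs(key: str) -> dict:
    """Return the keyword arguments that identify one deployed model."""
    endpoint = ENDPOINTS[key]
    return {"model_name": endpoint.model_name, "base_url": endpoint.base_url}


print(client_kwargs("deepseek"))
```

Because the keys match the constructor parameters of the `ChatClient` class defined earlier, a call such as `ChatClient(**client_kwargs("deepseek"), deepseek_clean=True)` would select the second model.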

**Primer on Deepseek**
There is a lot of information available about Deepseek, including how it was trained at a fraction of the cost of traditional massive-scale LLMs. Today we'll narrow the scope to usage. You'll notice that Deepseek R1 "thinks" as it responds. This chain of thought lets the user see how the model breaks the problem down into sub-steps to arrive at an answer.

For this lab, we've configured a class that lets Deepseek respond in its natural way, but can also suppress the thinking output and return only the final response. This is controlled by the `deepseek_clean` parameter (`False` or `True`).
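
The core of the suppression is a regular expression that removes everything between `<think>` and `</think>`. The sketch below isolates that single step using the same pattern as `_clean_response`. The helper name `strip_think` and the sample text are made up for illustration, and the class method applies further filtering beyond this step.

```python
import re

# Non-greedy match; DOTALL lets '.' span the newlines inside the block
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_think(text: str) -> str:
    """Remove <think>...</think> blocks and surrounding whitespace."""
    return THINK_BLOCK.sub("", text).strip()


# Made-up sample resembling a reasoning-model reply
raw = ("<think>\nThe user wants a short answer.\n</think>\n"
       "A learning rate sets the step size.")
print(strip_think(raw))
```

Note that the pattern only matches when the closing tag is present, so a reply cut off by `max_tokens` before `</think>` would pass through unchanged.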

In [None]:
# Initialize the chat client
deep_seek_chat_client = ChatClient(ds_model_name, ds_base_url, deepseek_clean=False)

In [None]:
message3 = "in 5 sentences or less what is a learning rate in neural networks?"
response_ds = deep_seek_chat_client.chat(message3, stream=True)
#print(response_ds)

In [None]:
deep_seek_chat_client = ChatClient(ds_model_name, ds_base_url, deepseek_clean=True)
message3 = "in 5 sentences or less what is a learning rate in neural networks?"
response_ds = deep_seek_chat_client.chat(message3, stream=True)
#print(response_ds)