# Model API Integration with Cloudera AI

This notebook demonstrates the flexibility of Cloudera AI inferencing services by showing different ways to interact with deployed models. We'll progress from basic API usage to more complex implementations, showing how easy it is to switch between different models and frameworks.

## Requirements
- Python 3.10 or later
- Access to Cloudera AI console
- Two deployed models: test-model-llama-8b-v2 and deepseek-r1-distill-llama-8b

## Model Setup

Before proceeding, you'll need to gather information from your deployed models in the Cloudera AI console:

1. Go to Cloudera AI console > Model Endpoints
2. Find the models: 
   - test-model-llama-8b-v2
   - deepseek-r1-distill-llama-8b
3. For each model:
   - Copy the endpoint URL and remove everything after `/v1`. For example, `https://ai-inference.ainf-cdp.vayb-xokg.cloudera.site/...../modelxyz/openai/v1/chat/completions` becomes `https://ai-inference.ainf-cdp.vayb-xokg.cloudera.site/...../modelxyz/openai/v1`.
   - Copy the Model ID

The first model's values go into the `base_url` and `model_name` variables. The second model's values go into the `ds_base_url` and `ds_model_name` variables.
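
The URL trimming described in step 3 can also be done in code. The following is a minimal sketch: `trim_to_v1` is a hypothetical helper name, and the example URL uses a placeholder host rather than a real endpoint.

```python
import re


def trim_to_v1(endpoint_url: str) -> str:
    """Cut an endpoint URL off right after its last '/v1' path segment."""
    # Greedy match up to the last '/v1' that is followed by '/' or the end of the string.
    match = re.match(r"(.*/v1)(?=/|$)", endpoint_url)
    if match is None:
        raise ValueError(f"No '/v1' path segment found in: {endpoint_url}")
    return match.group(1)


# Placeholder URL for illustration only (not a real endpoint).
full_url = "https://example.invalid/endpoints/modelxyz/openai/v1/chat/completions"
print(trim_to_v1(full_url))
```

A URL that already ends in `/v1` is returned unchanged, so the helper is safe to apply to values that were trimmed by hand.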

In [119]:
from openai import OpenAI
import os
import httpx
import json
from typing import List, Dict, Generator
# For LangChain:
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage
import asyncio
import time
import re

#### Model endpoint collection
1. Collect the Llama 3.1 model endpoint details
2. Collect the DeepSeek model endpoint details

Go to the Cloudera AI console and copy the following parameters. Cut off the tail end of each URL after `/v1`.

**Llama 3.1 8b**

In [None]:
#base_url = "enter-url here."
#model_name = "enter model name here"

In [2]:
#base_url = "enter-url here."
base_url = "https://ai-inference.ainf-cdp.vayb-xokg.cloudera.site/namespaces/serving-default/endpoints/test-model-llama-8b-v2/v1"
#model_name = "enter model name here"
model_name = "meta/llama-3.1-8b-instruct"

**DeepSeek R1**

In [None]:
#ds_base_url = "enter-url here."
#ds_model_name = "enter model name here"

In [None]:
#ds_base_url = "enter-url here."
ds_base_url = "https://ai-inference.ainf-cdp.vayb-xokg.cloudera.site/namespaces/serving-default/endpoints/deepseek-r1-distill-llama-8b/openai/v1"
#ds_model_name = "enter model name here"
ds_model_name = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"

#### Auth setup

This is the auth token you'll use to connect to your models. It is read from the `access_token` field of the JSON file at `/tmp/jwt`. Treat it as a secret.

In [3]:
json.load(open("/tmp/jwt"))["access_token"]

'eyJqa3UiOiJodHRwczov...'  (token truncated)

In [4]:
# Load the API key (access token), closing the file afterwards
with open("/tmp/jwt") as f:
    OPENAI_API_KEY = json.load(f)["access_token"]
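
Access tokens of this kind usually expire, so it can help to check the expiry before sending requests. The sketch below assumes the token is a standard three-part JWT carrying an `exp` claim. `load_access_token`, `token_expiry`, and `is_expired` are hypothetical helper names, and the payload is decoded without verifying the signature, so this is only a convenience check.

```python
import base64
import json
import time
from typing import Optional


def load_access_token(path: str = "/tmp/jwt") -> str:
    """Read the access token from a JSON file, closing the file afterwards."""
    with open(path) as f:
        return json.load(f)["access_token"]


def token_expiry(token: str) -> Optional[int]:
    """Return the 'exp' claim (Unix seconds) from a JWT payload, or None if absent."""
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url-encoded without padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return payload.get("exp")


def is_expired(token: str, now: Optional[float] = None) -> bool:
    """True if the token has an 'exp' claim that lies in the past."""
    exp = token_expiry(token)
    current = time.time() if now is None else now
    return exp is not None and exp <= current
```

In the notebook this could be used as `is_expired(load_access_token())` before creating a client. A `True` result would suggest refreshing the token file first.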

## Basic Model Interaction

This section demonstrates the simplest way to interact with our deployed model through the OpenAI package. We'll:
1. Create a client with our model's endpoint and authentication
2. Send a simple message to test the connection
3. Display the model's streaming response

This represents the most straightforward way to interact with the model, similar to how you might use OpenAI's API. The key difference is that we're using our own deployed model through Cloudera AI's infrastructure.

Note: We pass `stream=True` in the completion request, so the response arrives piece by piece as it is generated, which provides a more interactive experience.

In [5]:
client = OpenAI(
	base_url=base_url,
	api_key=OPENAI_API_KEY,
)

In [6]:
message = "Write a one-sentence definition of GenAI."

In [7]:
completion = client.chat.completions.create(
  model=model_name,
  messages=[{"role":"user","content":message}],
  temperature=0.2,
  top_p=0.7,
  max_tokens=1024,
  stream=True
)

for chunk in completion:
  if chunk.choices[0].delta.content is not None:
    print(chunk.choices[0].delta.content, end="")

GenAI, short for General Artificial Intelligence, refers to a hypothetical AI system that possesses the ability to understand, learn, and apply knowledge across a wide range of tasks, similar to human intelligence, without being limited to a specific domain or narrow function.
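
The loop above prints each piece of text as it arrives but does not keep it. A minimal sketch of collecting the streamed text into a single string follows. To stay runnable without an endpoint, it uses `SimpleNamespace` stand-ins that mimic only the `chunk.choices[0].delta.content` shape used above. `collect_stream` and `fake_chunk` are hypothetical names, and the guard for an empty `choices` list is a defensive assumption rather than something observed in this notebook.

```python
from types import SimpleNamespace
from typing import Iterable, Optional


def collect_stream(chunks: Iterable) -> str:
    """Print streamed text as it arrives and return the full concatenated text."""
    parts = []
    for chunk in chunks:
        # Skip chunks that carry no choices at all.
        if not chunk.choices:
            continue
        content = chunk.choices[0].delta.content
        if content is not None:
            print(content, end="", flush=True)
            parts.append(content)
    print()
    return "".join(parts)


def fake_chunk(text: Optional[str]) -> SimpleNamespace:
    """Build a stand-in object shaped like a streaming chunk (illustration only)."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


stream = [fake_chunk("Gen"), fake_chunk("AI "), fake_chunk(None), SimpleNamespace(choices=[])]
full_text = collect_stream(stream)
```

With a real client, the `completion` object returned by `client.chat.completions.create(..., stream=True)` would be passed in place of `stream`.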

### Using the LangChain Framework

Now we'll demonstrate how to use the same model through LangChain, a popular framework for building LLM applications. This shows how Cloudera AI's models can integrate seamlessly with different frameworks while maintaining the same functionality.

In [8]:
lc_chat = ChatOpenAI(
    model_name=model_name,
    openai_api_key=OPENAI_API_KEY,
    base_url=base_url,
    temperature=0.2,
    streaming=True
)

In [9]:
# Create the message
message = "Write a one-sentence definition of GenAI."
messages = [HumanMessage(content=message)]

# Stream the response
for chunk in lc_chat.stream(messages):
    if chunk.content:
        print(chunk.content, end="")

GenAI, short for General Artificial Intelligence, refers to a hypothetical AI system that possesses the ability to understand, learn, and apply knowledge across a wide range of tasks, similar to human intelligence, without being limited to a specific domain or narrow function.

## Enhanced Chat Client Implementation

This section implements a stateful chat client that maintains conversation history and can handle streaming responses. It demonstrates how to build more complex applications while maintaining the simple interface of the basic client.

Key features:
- Conversation history tracking
- Streaming response support
- Configurable parameters
- Error handling
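
Conversation history tracking amounts to keeping a list of role/content messages and resending the whole list with every request. The sketch below shows that bookkeeping in isolation. `chat_turn` is a hypothetical helper, and `echo_model` is a hypothetical stand-in for the real model call so the example runs offline.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def chat_turn(history: List[Message], user_text: str,
              model: Callable[[List[Message]], str]) -> str:
    """Append the user message, call the model with the full history, store the reply."""
    history.append({"role": "user", "content": user_text})
    reply = model(history)
    # Only record non-empty replies.
    if reply:
        history.append({"role": "assistant", "content": reply})
    return reply


def echo_model(messages: List[Message]) -> str:
    """Hypothetical stand-in for a model: reports how many messages it received."""
    return f"I have seen {len(messages)} message(s)."


history: List[Message] = []
chat_turn(history, "Hello", echo_model)
chat_turn(history, "And again", echo_model)
for msg in history:
    print(f"{msg['role']}: {msg['content']}")
```

On the second turn the stand-in model receives three messages, which is what lets a real model answer follow-up questions in context.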

In [127]:
import re
import os
import httpx
import json
from typing import List, Dict
from openai import OpenAI

class ChatClient:
    def __init__(self, model_name: str, base_url: str, deepseek_clean: bool = False):
        self.model_name = model_name
        self.deepseek_clean = deepseek_clean
        
        # Set up HTTP client
        if "CUSTOM_CA_STORE" not in os.environ:
            http_client = httpx.Client()
        else:
            http_client = httpx.Client(verify=os.environ["CUSTOM_CA_STORE"])
            
        # Load API key, closing the file afterwards
        with open("/tmp/jwt") as f:
            OPENAI_API_KEY = json.load(f)["access_token"]
        
        # Initialize OpenAI client
        self.client = OpenAI(
            base_url=base_url,
            api_key=OPENAI_API_KEY,
            http_client=http_client,
        )
        
        self.conversation_history: List[Dict[str, str]] = []

    def _clean_response(self, response: str) -> str:
        """
        Remove thinking tags and extract only the actual question/guess.
        """
        # Handle empty or None responses
        if not response:
            return ""
            
        # First clean up any think blocks and explanatory text
        response = re.sub(r'<think>.*?</think>', '', response, flags=re.DOTALL)
        response = re.sub(r'.*\*\*Question:\*\*\s*', '', response)
        response = re.sub(r'.*Question:\s*', '', response)
        response = re.sub(r'Step-by-Step.*', '', response)
        response = re.sub(r'\*\*.*?\*\*', '', response)
        
        # Get just the question or guess, taking the last non-empty line
        lines = [line.strip() for line in response.split('\n') if line.strip()]
        if lines:
            actual_response = lines[-1]  # Take last non-empty line
            if actual_response.startswith('FINAL GUESS:'):
                return actual_response
            elif '?' in actual_response:
                # Extract just the question
                return actual_response.split('?')[0].strip() + '?'
                
        return response.strip()

    def chat(self, message: str, stream: bool = True) -> str:
        """
        Send a message to the chat model and get the response.
        """
        # Add user message to history
        self.conversation_history.append({"role": "user", "content": message})
        
        try:
            if stream:
                partial_message = ""
                response = self.client.chat.completions.create(
                    model=self.model_name,
                    messages=self.conversation_history + [{"role": "system", "content": "After your thinking, always provide a clear, structured answer."}],
                    temperature=0.6,
                    top_p=0.7,
                    max_tokens=1024,
                    stream=True,
                )
                
                for chunk in response:
                    if chunk.choices[0].delta.content is not None:
                        content = chunk.choices[0].delta.content
                        partial_message += content
                        if not self.deepseek_clean:
                            print(content, end='', flush=True)
                
                final_message = partial_message
                if self.deepseek_clean:
                    final_message = self._clean_response(partial_message)
                    print(repr(final_message))
                    
            else:
                response = self.client.chat.completions.create(
                    model=self.model_name,
                    messages=self.conversation_history,
                    temperature=0.6,
                    top_p=0.7,
                    max_tokens=512,
                    stream=False,
                )
                
                complete_response = response.choices[0].message.content

                final_message = complete_response
                if self.deepseek_clean:
                    final_message = self._clean_response(complete_response)
                    print(repr(final_message))

            # Only add to history if we got a valid response
            if final_message:
                self.conversation_history.append({"role": "assistant", "content": final_message})
            
            return final_message
            
        except Exception as e:
            print(f"Error in chat method: {str(e)}")
            raise
    
    def get_history(self) -> List[Dict[str, str]]:
        """Get the conversation history."""
        return self.conversation_history
    
    def clear_history(self):
        """Clear the conversation history."""
        self.conversation_history = []

In [11]:
# Initialize the chat client
chat_client = ChatClient(model_name, base_url)

In [12]:
message = """In 6 sentences or less, explain how weights get updated during the training of a neural network. Explain this to a 6-year-old."""

In [13]:
# For streaming responses (will print as it receives chunks):
response = chat_client.chat(message, stream=True)

Here's an explanation of how weights get updated during model training:

Imagine you have a robot that can draw a picture, but it's not very good at it. You show the robot lots of pictures and tell it which parts of the picture are correct and which parts are wrong. The robot then tries to make the drawing better by changing the way it uses its crayons (these are like the weights in the neural network). When the robot makes a mistake, it says "oops, I made a mistake!" and changes the way it uses its crayons a little bit. It keeps trying and changing its crayons until it gets the picture right. This is kind of like how a neural network updates its weights during training.


In [14]:
message2 = "Now follow up by explaining the learning rate, in 5 sentences or less."

In [15]:
# For streaming responses (will print as it receives chunks):
response = chat_client.chat(message2, stream=True)

Here's an explanation of how learning rate affects weight updates:

So, when the robot makes a mistake, it changes its crayons a little bit. But, sometimes it might change them too much, and that's not good. That's where the learning rate comes in - it's like a special button that says "change your crayons by this much". If the learning rate is high, the robot changes its crayons a lot, but if it's low, the robot changes them just a little bit. This helps the robot learn more smoothly and not make too many mistakes at once.


## Model Switching Demonstration

One of the key benefits of Cloudera AI is the ability to easily switch between different models. Here we'll demonstrate this by changing to the DeepSeek model while using the same code structure.

For this section, we'll use our second model's information:
- URL goes into `ds_base_url` (remember to trim everything after `/v1`)
- Model ID goes into `ds_model_name`
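
Because only the URL and Model ID differ between models, switching can be reduced to a lookup. A minimal sketch, assuming a hypothetical `ModelEndpoint` dataclass and `ENDPOINTS` registry. The URLs are placeholders to be replaced with the values copied from the console.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ModelEndpoint:
    base_url: str    # endpoint URL, trimmed after /v1
    model_name: str  # Model ID


# Placeholder URLs; substitute the values of base_url and ds_base_url.
ENDPOINTS: Dict[str, ModelEndpoint] = {
    "llama": ModelEndpoint("https://example.invalid/llama/v1",
                           "meta/llama-3.1-8b-instruct"),
    "deepseek": ModelEndpoint("https://example.invalid/deepseek/openai/v1",
                              "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"),
}


def select_endpoint(key: str) -> ModelEndpoint:
    """Look up an endpoint by its short key, with a readable error for unknown keys."""
    try:
        return ENDPOINTS[key]
    except KeyError:
        raise ValueError(f"Unknown model key {key!r}; choose from {sorted(ENDPOINTS)}") from None


endpoint = select_endpoint("deepseek")
print(endpoint.model_name)
```

The selected values could then be passed along as `ChatClient(endpoint.model_name, endpoint.base_url)`.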

In [55]:
# Initialize the chat client
deep_seek_chat_client = ChatClient(ds_model_name, ds_base_url, deepseek_clean=False)

In [107]:
deep_seek_chat_client = ChatClient(ds_model_name, ds_base_url, deepseek_clean=True)
message3 = "in 5 sentences or less what is a learning rate in neural networks?"
response_ds = deep_seek_chat_client.chat(message3, stream=True)
#print(response_ds)

"The learning rate is a crucial hyperparameter in neural networks that determines how quickly and effectively the model learns from data during training. It is integral to optimization algorithms like gradient descent, influencing the size of weight adjustments. A high learning rate can lead to instability and overshooting, while a low rate slows training but may improve accuracy. Balancing the learning rate is essential for stable convergence, often adjusted during training through techniques like learning rate scheduling. Ultimately, it's a key factor in achieving an optimal model."
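
With `deepseek_clean=True`, the client removes the `<think>...</think>` reasoning block that the model emits before its answer. The core of that step is shown here in isolation. `strip_think_blocks` is a hypothetical name, the sample text is made up for illustration, and the full `_clean_response` method applies further filters on top of this.

```python
import re


def strip_think_blocks(text: str) -> str:
    """Remove <think>...</think> blocks, including multi-line ones, and trim whitespace."""
    # re.DOTALL lets '.' match newlines; '.*?' is non-greedy so each block is removed separately.
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()


sample = "<think>\nThe user wants a short answer.\n</think>\n\nThe learning rate sets the step size."
print(strip_think_blocks(sample))
```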


In [130]:
class TwentyQuestionsGame:
    def __init__(
        self,
        answerer_client: ChatClient,  # Use existing ChatClient for answerer
        guesser_client: ChatClient,   # Use existing ChatClient for guesser
        max_questions=10,
        delay_seconds=2
    ):
        """Initialize game with two ChatClient instances"""
        self.answerer_client = answerer_client
        self.guesser_client = guesser_client
        self.secret_item = None
        self.questions_asked = 0
        self.max_questions = max_questions
        self.delay_seconds = delay_seconds
        self.game_history = []

    def format_history(self):
        if not self.game_history:
            return "None"
        return "\n".join([f"Q{h['question_number']}: {h['question']} -> {h['answer']}" 
                     for h in self.game_history])
    
    def play_game(self):
        try:
            # Clear any existing conversation history
            self.answerer_client.clear_history()
            self.guesser_client.clear_history()
            
            # Get the secret item from the answerer
            answerer_prompt = """
            You are playing a game of 20 questions. You need to think of an item (it can be an object, 
            person, place, or concept) and keep it secret. Only share the item in your response, 
            nothing else. The other AI will try to guess it through yes/no questions.
            """
            
            self.secret_item = self.answerer_client.chat(answerer_prompt, stream=False)
            print(f"The item to guess is: {self.secret_item.strip()}")
            
            while self.questions_asked < self.max_questions:
                guesser_prompt = f"""
You are playing 20 questions. Questions asked: {self.questions_asked}
Previous questions and answers:
{self.format_history()}
Questions remaining: {self.max_questions - self.questions_asked}

RULES:
1. Ask only ONE yes/no question
2. Never repeat a previous question
3. If you know it's a transportation tool, ask about specific types (car, bike, train, etc.)
4. When confident, make your guess with 'FINAL GUESS: [item]'
5. No explanations - just the question

Remember: This is 20 questions - use each question to narrow down possibilities!
"""
                
                raw_response = self.guesser_client.chat(guesser_prompt, stream=False)
                
                # Extract just the actual question, removing any explanatory text
                question = raw_response.split('**Question:**')[-1].strip() if '**Question:**' in raw_response else raw_response.strip()
                
                if "FINAL GUESS:" in question:
                    final_guess = question.split("FINAL GUESS:")[1].strip()
                    print(f"\nFinal guess made: {final_guess}")
                    print(f"The actual item was: {self.secret_item}")
                    print(f"Game ended after {self.questions_asked} questions")
                    return
                else:
                    self.questions_asked += 1
                    time.sleep(self.delay_seconds)
                    
                    answerer_prompt = f"""
                    You are playing a game of 20 questions. The item you chose is: {self.secret_item}
                    The question asked is: {question}
                    Please answer only with 'Yes' or 'No'.
                    """
                    
                    answer = self.answerer_client.chat(answerer_prompt, stream=False)
                    
                    # Store clean interaction
                    self.game_history.append({
                        'question': question,
                        'answer': answer.strip(),
                        'question_number': self.questions_asked
                    })
                    
                    # Display clean interaction
                    print(f"\nQuestion {self.questions_asked}: {question}")
                    print(f"Answer: {answer.strip()}")
                
                time.sleep(self.delay_seconds)
                
            print(f"\nGame Over! Maximum questions ({self.max_questions}) reached.")
            print(f"The item was: {self.secret_item}")
                
        except Exception as e:
            print(f"An error occurred during the game: {str(e)}")
            raise

In [131]:
# Example usage
def main():
    # Create ChatClient instances
    answerer_client = ChatClient(
        model_name=model_name,
        base_url=base_url
    )
    guesser_client = ChatClient(
        model_name=ds_model_name,
        base_url=ds_base_url,
        deepseek_clean=True
    )
    # Create and run the game
    game = TwentyQuestionsGame(
        answerer_client=answerer_client,
        guesser_client=guesser_client,
        max_questions=15,
        delay_seconds=1
    )
    game.play_game()

if __name__ == "__main__":
    main()

The item to guess is: A car
'Is it a car?'

Question 1: Is it a car?
Answer: Yes
'Is it an SUV?'

Question 2: Is it an SUV?
Answer: Yes
'Is it a Honda CR-V?'

Question 3: Is it a Honda CR-V?
Answer: Yes
'FINAL GUESS: Honda CR-V'

Final guess made: Honda CR-V
The actual item was: A car
Game ended after 3 questions
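
The game loop tells a question apart from a final guess by looking for the `FINAL GUESS:` marker in the guesser's reply. A standalone sketch of that parsing step follows. `parse_guesser_reply` is a hypothetical helper, not part of the class above.

```python
from typing import Tuple


def parse_guesser_reply(raw: str) -> Tuple[str, str]:
    """Classify a guesser reply as ('guess', item) or ('question', text)."""
    text = raw.strip()
    marker = "FINAL GUESS:"
    if marker in text:
        # Everything after the first marker is treated as the guessed item.
        return "guess", text.split(marker, 1)[1].strip()
    return "question", text


print(parse_guesser_reply("Is it an SUV?"))
print(parse_guesser_reply("FINAL GUESS: Honda CR-V"))
```

Returning a tagged tuple keeps the branching in the game loop explicit and makes the parsing easy to test without calling either model.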
