In [1]:
%%capture output
%pip install python-dotenv openai azure-identity instructor semantic-kernel

# Agenda

1. Chat Completions
   * Message Types
   * Request Parameters
   * Structured Output
   * Function Calling
   * Agents with Semantic Kernel
2. Retrieval Augmented Generation
   * Vector Databases
   * Visualizing Semantic Similarity
   * Hybrid Search and Reranking
3. Promptflow
   * Anatomy of a flow
   * Evaluations + Benchmarking


# Chat Completions

Chat completions are the responses that language models such as GPT-4 generate during a conversation with a user. Each response is conditioned on the input received, the preceding context, and any predefined instructions (system messages).

The output is probabilistic: during inference the model assigns weights to different parts of the input, uses them to estimate the likelihood of each possible next token given its surrounding context, and then picks from that distribution.
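To make "probabilistic output" concrete, here is a minimal sketch that turns a handful of made-up next-token scores into a probability distribution with softmax and samples from it. The three-word vocabulary and the scores are invented for illustration; a real model does this over its entire vocabulary.

```python
import math
import random

def softmax(logits: dict[str, float], temperature: float = 1.0) -> dict[str, float]:
    """Convert raw scores into probabilities that sum to 1."""
    scaled = {tok: score / temperature for tok, score in logits.items()}
    max_score = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - max_score) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def sample_next_token(probs: dict[str, float], rng: random.Random) -> str:
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical scores for the token following "How do I make an"
logits = {"omelette": 4.0, "egg": 2.5, "appointment": 0.5}
probs = softmax(logits)

print({tok: round(p, 3) for tok, p in probs.items()})
print(sample_next_token(probs, random.Random(0)))
```

Lowering `temperature` in this sketch concentrates probability on the top-scoring token, which is why low-temperature responses tend to be more repeatable.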

## Message Types

* User Messages
* System Messages
* Assistant Messages

### User Messages

* Messages sent by the user to the AI.
* Usually in the form of questions, commands, or conversational input.
* Example: "How do I make an omelette?"

### System Messages

* Instructions provided to guide the AI's behavior
* Models are generally trained to give instructions here more weight than other message types
* Example: "You are a British chef and restaurateur with a short and fiery temper"

### Assistant Messages

* Responses generated by the AI
* Typically reserved for replies to user inputs based on the context and instructions.
* Example: "Make the bloody omelette you donkey"
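The three message types map directly onto the `messages` list sent with each request. A minimal sketch, using a hypothetical `build_messages` helper (not part of any SDK), shows how earlier turns are replayed so the model sees the whole conversation:

```python
VALID_ROLES = {"system", "user", "assistant"}

def build_messages(system_prompt, history, user_input):
    """Assemble a chat payload: system first, then prior turns, then the new user message."""
    messages = [{"role": "system", "content": system_prompt}]
    for role, content in history:
        if role not in VALID_ROLES:
            raise ValueError(f"unknown role: {role}")
        messages.append({"role": role, "content": content})
    messages.append({"role": "user", "content": user_input})
    return messages

messages = build_messages(
    system_prompt="You are a British chef and restaurateur with a short and fiery temper",
    history=[
        ("user", "How do I make an omelette?"),
        ("assistant", "Make the bloody omelette you donkey"),
    ],
    user_input="What pan should I use?",
)

for message in messages:
    print(f"{message['role']:>9}: {message['content']}")
```

The model itself is stateless between requests, so the application is responsible for carrying this history forward.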

In [2]:
from openai import OpenAI

ollama_client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = ollama_client.chat.completions.create(
    model = "gemma2:9b",
    messages = [
        {
            "role": "system",
            "name": "instruction",
            "content": "You are a British chef and restaurateur with a short and fiery temper. Keep your responses rude and curt."
        },
        {
            "role": "user",
            "name": "beginner_chef",
            "content": "How do I make an omelette?",
        }
    ],
    stream=True
)

for chunk in response:
    if (len(chunk.choices) > 0 and chunk.choices[0].delta.content):
        print(chunk.choices[0].delta.content, end='', flush=True)

Right, listen up. You crack two eggs, whisk 'em good. Chuck in a knob of butter in the pan - hot, not bloody scorching -  pour in the eggs. Now scramble it about like you're possessed till it's almost set, then fold it over and get it outta there. 

You want fancy? Add cheese or somethin'.  Don't muck it up.  


## Request Parameters

* Token Probabilities
* Limiting Parameters

### Token Probabilities

* `logprobs`: Boolean; if true, the response includes the log probabilities of the output tokens.
* `top_logprobs`: Number of most likely tokens to return at each token position (requires `logprobs` to be true).
  


**Use Cases:**

* Classification: Set confidence thresholds based on probabilities
* Retrieval Evaluation: Self-evaluation with confidence scores
* Autocomplete: Assist in word suggestion as a user types
* Calculating Perplexity: Compare confidence of results across different prompts
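The classification and perplexity use cases reduce to a little arithmetic on the returned log probabilities. A sketch, assuming the per-token logprobs have already been extracted from the response as plain floats; the sample values and the 0.9 threshold are made up for illustration:

```python
import math

def sequence_probability(token_logprobs):
    """Probability of the whole sequence = exp(sum of the token log probabilities)."""
    return math.exp(sum(token_logprobs))

def perplexity(token_logprobs):
    """exp of the negative mean log probability; lower means the model was more confident."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def classify_with_threshold(label, token_logprobs, threshold=0.9):
    """Accept the predicted label only if its confidence clears the threshold."""
    confidence = sequence_probability(token_logprobs)
    return (label if confidence >= threshold else "unsure"), confidence

# Hypothetical logprobs for a two-token answer such as "s" + "alty"
confident = [-0.0006, -0.00001]
hesitant = [-0.9, -0.4]

print(classify_with_threshold("salty", confident))
print(classify_with_threshold("salty", hesitant))
print(round(perplexity(hesitant), 3))
```

Multiplying token probabilities (summing logprobs) matters whenever a label spans several tokens; thresholding on the first token alone would overstate confidence.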

In [3]:
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from dotenv import load_dotenv
from math import exp
import numpy as np

load_dotenv(override=True)

scope = "https://cognitiveservices.azure.com/.default"
token_provider = get_bearer_token_provider(DefaultAzureCredential(), scope)
client = AzureOpenAI(
    azure_ad_token_provider=token_provider, 
    api_version="2024-03-01-preview",
    azure_endpoint="https://oai-vena-copilot-npr-canadaeast-01.openai.azure.com"
)

In [4]:
instruction = """
Your task is to determine if that food is ONE of these classifications: sweet, salty, sour, bitter or umami.

### Examples
strawberry = sweet
bacon = salty
lemon = sour
beer = bitter
mushroom = umami

### Expected Output
One of sweet, salty, sour, bitter or umami. NOTHING ELSE.
"""
response = client.chat.completions.create(
    model="gpt-35-turbo",
    messages=[
        {"role": "system", "content": instruction},
        {"role": "user", "content": "baby back ribs = "}
    ],
    logprobs=True,
    top_logprobs=3
)

print("Prediction:", response.choices[0].message.content)
for i, content in enumerate(response.choices[0].logprobs.content):
    print(f"token {i + 1}:")
    for j, logprob in enumerate(content.top_logprobs):
        probability = np.round(np.exp(logprob.logprob) * 100, 2)
        print(f"\ttop_logprobs: {logprob.token}, probability: {probability}")


Prediction: salty
token 1:
	top_logprobs: s, probability: 99.94
	top_logprobs: S, probability: 0.05
	top_logprobs: sweet, probability: 0.0
token 2:
	top_logprobs: alty, probability: 100.0
	top_logprobs: we, probability: 0.0
	top_logprobs: our, probability: 0.0


### Limiting Parameters

* `max_tokens`: Max number of tokens to generate
* `n`: Number of chat completion choices to generate
* `stop`: Sequence(s) at which the API stops generating further tokens
  

**Use Cases:**
* `max_tokens`: Fixed response length / cost management
* `n`: Generating multiple options to gauge how consistent the responses are
* `stop`: Strategic sequences (e.g. a newline) for cutting a response short
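One rough way to gauge how well the `n` choices agree with each other is to score their pairwise overlap. The sketch below uses word-level Jaccard similarity, which is an assumption made for simplicity; comparing embeddings would capture meaning rather than surface wording. The sample lines are illustrative.

```python
import re
from itertools import combinations

def tokenize(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"[a-z']+", text.lower()))

def jaccard(a, b):
    """Shared words divided by total distinct words across both texts."""
    words_a, words_b = tokenize(a), tokenize(b)
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)

def cohesiveness(choices):
    """Mean pairwise Jaccard similarity across all generated choices."""
    pairs = list(combinations(choices, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

choices = [
    "Steaming bowl of pho",
    "Warm comfort in a bowl",
    "Broth simmers gently",
]
print(round(cohesiveness(choices), 3))
```

A score near 1 means the choices are nearly interchangeable; a score near 0 suggests the prompt leaves the model a lot of freedom.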
  

In [5]:
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from dotenv import load_dotenv
from math import exp
import numpy as np

load_dotenv(override=True)

scope = "https://cognitiveservices.azure.com/.default"
token_provider = get_bearer_token_provider(DefaultAzureCredential(), scope)
client = AzureOpenAI(
    azure_ad_token_provider=token_provider, 
    api_version="2024-03-01-preview",
    azure_endpoint="https://oai-vena-copilot-npr-canadaeast-01.openai.azure.com"
)

In [6]:
response = client.chat.completions.create(
    model="gpt-35-turbo",
    messages=[
        {"role": "system", "content": "Generate a haiku about the provided food"},
        {"role": "user", "content": "Pho"}
    ],
    # max_tokens=100,
    n=3
    # stop="\n"
)

for choice in response.choices:
    print(choice.message.content + "\n")

Steaming bowl of pho
Savoury broth and thin noodles
Warm comfort in spoon

Broth simmers gently
Rice noodles and herbs enhance
Pho warms the soul deep

Savory broth and noodles
Tender meats and fresh herbs float
Warm comfort in a bowl



## Structured Output

* JSON Output
* Types

### JSON Output

* The model is constrained to generate only strings that parse into a valid JSON object
* You must instruct the model to produce JSON somewhere in the messages, typically the system message
* Formatting/examples are important here
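JSON mode constrains the syntax, not the shape: the object may still lack the keys you asked for, and a response cut off by a token limit may not parse at all. A defensive parsing sketch, with a hypothetical `parse_dishes` helper and hard-coded strings standing in for model output:

```python
import json

def parse_dishes(raw):
    """Return the list of dishes, or raise ValueError with a readable reason."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as err:
        raise ValueError(f"not valid JSON: {err}") from err

    dishes = payload.get("dishes") if isinstance(payload, dict) else None
    if not isinstance(dishes, list):
        raise ValueError("missing 'dishes' list")

    for dish in dishes:
        if not isinstance(dish, dict) or not {"name", "style"}.issubset(dish):
            raise ValueError(f"malformed dish: {dish!r}")
    return dishes

good = '{"dishes": [{"name": "Pancakes", "style": "American"}]}'
bad = '{"meals": ["Pancakes"]}'

print(parse_dishes(good))
try:
    parse_dishes(bad)
except ValueError as err:
    print("rejected:", err)
```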

In [7]:
import json
from openai import OpenAI

ollama_client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = ollama_client.chat.completions.create(
    model="gemma2:9b",
    messages=[
        {
            "role": "system",
            "content": """
            # Task
            You will be given a meal (breakfast/lunch/dinner). Generate a JSON list of 5 dishes for that meal.
            
            # Expected Format:
            { "dishes": [ {"name": "<dish>", "style": "<dish style>"}, ... ] }
            """,
        },
        {
            "role": "user",
            "content": "breakfast",
        }
    ],
    response_format={ "type": "json_object" }
)

for dish in json.loads(resp.choices[0].message.content)["dishes"]:
    print(dish)

{'name': 'Pancakes', 'style': 'American'}
{'name': 'Avocado Toast', 'style': 'Modern'}
{'name': 'Oatmeal with Berries', 'style': 'Healthy'}
{'name': 'Scrambled Eggs with Bacon', 'style': 'Classic'}
{'name': 'Breakfast Burrito', 'style': 'Tex-Mex'}


### Types

Types are important for bringing some semblance of order to an otherwise chaotic system

[Pydantic](https://docs.pydantic.dev/latest/) is the de facto type validation library for Python, powered by type annotations
```python
from typing import List
from pydantic import BaseModel, Field

# Dish must be defined before DishList references it
class Dish(BaseModel):
    name: str
    style: str = Field(..., description="The dish type i.e. Mexican, Japanese etc.")

class DishList(BaseModel):
    dishes: List[Dish] = Field(..., description="Contains a list of dish objects containing name and style")
```

[Instructor](https://python.useinstructor.com/why/) is a library that shims various LLM providers with the ability to validate and return Pydantic types
```python
client.chat.completions.create(
    ...,
    response_model=DishList
)
```

In [8]:
from openai import OpenAI
from pydantic import BaseModel, Field
from typing import List
import instructor

class Dish(BaseModel):
    name: str
    style: str = Field(..., description="The dish type i.e. Mexican, Japanese etc.")

class DishList(BaseModel):
    dishes: List[Dish] = Field(..., description="Contains a list of dish objects containing name and style")

ollama_client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
instructor_client = instructor.from_openai(ollama_client, mode=instructor.Mode.JSON)

response = instructor_client.chat.completions.create(
    model="gemma2:9b",
    messages=[
        {
            "role": "system",
            "content": """
            # Task
            You will be given a meal (breakfast/lunch/dinner). Generate a JSON list of 5 dishes for that meal.
            
            # Expected Format:
            { "dishes": [ {"name": "<dish>", "style": "<dish style>"}, ... ] }
            """,
        },
        {
            "role": "user",
            "content": "breakfast",
        }
    ],
    response_model=DishList
)

for dish in response.dishes:
    print(f"Dish: {dish.name}, Style: {dish.style}")

Dish: Pancakes with Maple Syrup, Style: American
Dish: Breakfast Burrito, Style: Mexican
Dish: Avocado Toast, Style: Modern/Healthy
Dish: Scrambled Eggs with Bacon, Style: Classic American
Dish: Yogurt Parfait, Style: Greek/Breakfast Bowls


## Function Calling
* OpenAI Tools
* Semantic Kernel

### OpenAI Tools

The model is constrained to predict which function(s) should be called, and with which argument(s), to achieve a task. It does not execute anything itself.

Assuming we have functions like this in our codebase:
```python
def get_weather(location: str, unit: str):
    return ...

def google_search(query: str):
    return ...
```

How can we get an LLM to intelligently choose which tools to call?

### Tool Schema

In [9]:
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA",
                },
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location"],
        },
    },
}

google_tool = {
    "type": "function",
    "function": {
        "name": "google_search",
        "description": "Queries google for a search query",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The search query",
                }
            },
            "required": ["query"],
        },
    },
}

In [10]:
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from dotenv import load_dotenv

load_dotenv(override=True)

scope = "https://cognitiveservices.azure.com/.default"
token_provider = get_bearer_token_provider(DefaultAzureCredential(), scope)
aoai_client = AzureOpenAI(
    azure_ad_token_provider=token_provider, 
    api_version="2024-03-01-preview",
    azure_endpoint="https://oai-vena-copilot-npr-canadaeast-01.openai.azure.com"
)

response = aoai_client.chat.completions.create(
    model="gpt-35-turbo",
    messages=[
        {
            "role": "system", 
            "content": "You must always use tools"
        },
        {
            "role": "user",
            "content": "What is the weather in toronto and dallas and who won the super bowl?",
        }
    ],
    tools=[weather_tool, google_tool]
)

for tool in response.choices[0].message.tool_calls:
    print(tool.function)

Function(arguments='{"location": "Toronto, ON", "unit": "celsius"}', name='get_weather')
Function(arguments='{"location": "Dallas, TX", "unit": "celsius"}', name='get_weather')
Function(arguments='{"query": "super bowl winner"}', name='google_search')
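The tool calls printed above are only predictions; the application still has to run them. A minimal dispatch sketch, assuming stubbed versions of `get_weather` and `google_search` and a hypothetical `TOOL_REGISTRY` lookup table:

```python
import json

def get_weather(location, unit="celsius"):
    # Stub: a real implementation would call a weather API.
    return {"location": location, "unit": unit, "temperature": 21}

def google_search(query):
    # Stub: a real implementation would call a search API.
    return {"query": query, "results": []}

# Maps the function names in the tool schema to local callables
TOOL_REGISTRY = {"get_weather": get_weather, "google_search": google_search}

def dispatch(name, arguments_json):
    """Look up the predicted function and call it with the decoded arguments."""
    if name not in TOOL_REGISTRY:
        raise KeyError(f"model requested unknown tool: {name}")
    kwargs = json.loads(arguments_json)
    return TOOL_REGISTRY[name](**kwargs)

# (name, arguments) pairs copied from the output above
predicted_calls = [
    ("get_weather", '{"location": "Toronto, ON", "unit": "celsius"}'),
    ("get_weather", '{"location": "Dallas, TX", "unit": "celsius"}'),
    ("google_search", '{"query": "super bowl winner"}'),
]

for name, arguments in predicted_calls:
    print(dispatch(name, arguments))
```

Note that `arguments` arrives as a JSON string, so it has to be decoded before the call, and an unknown name should be rejected rather than looked up dynamically.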


### Example with Instructor

In [11]:
from openai import OpenAI
from pydantic import BaseModel, Field
from typing import Iterable, Literal
import instructor

class Weather(BaseModel):
    location: str = Field(..., description = "The city and state, e.g. San Francisco, CA")
    units: Literal["celsius", "fahrenheit"]

class GoogleSearch(BaseModel):
    query: str = Field(..., description = "The search query")

instructor_client = instructor.from_openai(aoai_client, mode=instructor.Mode.PARALLEL_TOOLS)

response = instructor_client.chat.completions.create(
    model="gpt-35-turbo",
    messages=[
        {
            "role": "system", 
            "content": "You must always use tools"
        },
        {
            "role": "user",
            "content": "What is the weather in toronto and dallas and who won the super bowl?",
        }
    ],
    response_model=Iterable[Weather | GoogleSearch]
)

for function in response:
    print(type(function), function)

<class '__main__.Weather'> location='Toronto' units='celsius'
<class '__main__.Weather'> location='Dallas' units='celsius'
<class '__main__.GoogleSearch'> query='Super Bowl winner'


## Agents with Semantic Kernel

Semantic Kernel is an [open-source library from Microsoft](https://learn.microsoft.com/en-us/semantic-kernel/overview/) that lets you build AI agents and integrate the latest AI models, with bindings in C#, Python, and Java.

An "agent" in Semantic Kernel consists of:
* `Persona`: Instruction / meta prompt setting the overall purpose of the agent
* `Kernel`: Central dependency injection (DI) container that orchestrates LLM logic
  * `AI Service(s)`: LLMs the `Kernel` has access to
  * `Plugin(s)`: Functions/tools the `Kernel` can use to complete tasks
  * `Planner(s)`: Workflows the `Kernel` generates using a combination of `AI Service(s)` and `Plugin(s)` for task planning

**Initializing a Kernel**

In [12]:
from semantic_kernel import Kernel
from semantic_kernel.connectors.ai.open_ai import AzureChatCompletion, AzureChatPromptExecutionSettings
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from dotenv import load_dotenv

load_dotenv(override=True)

scope = "https://cognitiveservices.azure.com/.default"
token_provider = get_bearer_token_provider(DefaultAzureCredential(), scope)

kernel = Kernel()

service_id = "default"
kernel.add_service(
    AzureChatCompletion(
        service_id=service_id,
        endpoint="https://oai-vena-copilot-npr-canadaeast-01.openai.azure.com",
        deployment_name="gpt-35-turbo",
        ad_token_provider=token_provider
    ),
)

**Initializing Native Plugins**

Native plugins define the types and functions the LLM has access to:

In [13]:
from typing import List, Optional, TypedDict, Annotated
from semantic_kernel.functions import kernel_function
import random

class Weather(TypedDict):
    location: str
    unit: str
    temperature: float

class WeatherPlugin:
   @kernel_function(
      name="get_weather",
      description="Get the current weather in a given location",
   )
   def get_weather(self, 
                   location: Annotated[str, "Location to retrieve weather for"], 
                   unit: Annotated[str, "celsius or fahrenheit"]) -> Annotated[Weather, "The weather for a given location"]:
      # Stubbed value: a real plugin would call a weather API here
      return Weather(location=location, unit=unit, temperature=float(random.randint(10, 100)))

**Executing a Standalone Kernel**

* SK bridges the gap by taking the functions the LLM predicted and calling them automatically in our code.
* Under the hood, it uses parallel function calling as the default planner implementation
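Conceptually, that bridging is a loop: call the model, run whatever functions it asks for, append the results to the history, and call the model again until it answers in plain text. The sketch below is not Semantic Kernel's implementation; `fake_model` is a scripted stand-in for the LLM and the message shapes are simplified assumptions.

```python
import json

def get_weather(location, unit):
    return {"location": location, "unit": unit, "temperature": 72}  # stubbed value

def fake_model(history):
    """Stand-in for the LLM: request a tool first, then answer from the tool result."""
    tool_results = [m for m in history if m["role"] == "tool"]
    if not tool_results:
        return {"role": "assistant", "tool_calls": [
            {"name": "get_weather", "arguments": {"location": "Toronto", "unit": "fahrenheit"}}
        ]}
    weather = json.loads(tool_results[-1]["content"])
    answer = f"It is {weather['temperature']} degrees {weather['unit']} in {weather['location']}."
    return {"role": "assistant", "content": answer}

def run_with_auto_invoke(history, tools, max_rounds=5):
    """Call the model, execute any requested tools, feed results back, repeat."""
    for _ in range(max_rounds):
        reply = fake_model(history)
        history.append(reply)
        if not reply.get("tool_calls"):
            return reply["content"]  # plain-text answer: we are done
        for call in reply["tool_calls"]:
            result = tools[call["name"]](**call["arguments"])
            history.append({"role": "tool", "content": json.dumps(result)})
    raise RuntimeError("model kept requesting tools")

history = [{"role": "user", "content": "What's the weather in Toronto?"}]
print(run_with_auto_invoke(history, {"get_weather": get_weather}))
```

The `max_rounds` cap is a design choice worth keeping in any real loop, since a model can otherwise keep requesting tools indefinitely.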

In [14]:
from semantic_kernel.connectors.ai.open_ai import AzureChatCompletion, AzureChatPromptExecutionSettings
from semantic_kernel.connectors.ai.function_choice_behavior import FunctionChoiceBehavior
from semantic_kernel.connectors.ai.chat_completion_client_base import ChatCompletionClientBase
from semantic_kernel.contents.chat_history import ChatHistory
from semantic_kernel.functions.kernel_arguments import KernelArguments

# Add the plugin to the kernel
kernel.add_plugin(WeatherPlugin(), plugin_name="Weather",)

chat_completion : AzureChatCompletion = kernel.get_service(type=ChatCompletionClientBase)

# Enable auto-function calling
execution_settings = AzureChatPromptExecutionSettings(tool_choice="auto")
execution_settings.function_choice_behavior = FunctionChoiceBehavior.Auto()

# Create a history of the conversation
history = ChatHistory()
history.add_message({"role": "user", "content": "What's the weather in Toronto and San Antonio in freedom units?"})

# Get the response from the AI
result = (await chat_completion.get_chat_message_contents(
  chat_history=history,
  settings=execution_settings,
  kernel=kernel,
  arguments=KernelArguments(),
))[0]

# Print the results
print("Assistant > " + str(result))

Assistant > The current weather in Toronto is 36°F and in San Antonio is 45°F.


# Retrieval Augmented Generation

## Vector Databases

In [15]:
print("hello world!")

hello world!


## Visualizing Semantic Similarity

In [16]:
print("hello world!")

hello world!


## Hybrid Search and Reranking

In [17]:
print("hello world!")

hello world!


# Promptflow

Prompt flow is a suite of development tools designed to streamline the end-to-end development cycle of LLM-based AI applications: ideation, prototyping, testing, and evaluation, through to production deployment and monitoring.

* Allows developers to create DAG-based workflows with LLMs, prompts, and code
* Built-in development tools for experimentation and observability
* First-class benchmarking/evaluation support


## Anatomy of a flow

* Flows are essentially functions with expected inputs/outputs
* Each `node` in a `flow` represents an intermediate step; nodes that do not depend on each other can run in parallel
* An invocation of a `flow` is called a `run`, and Prompt flow (PF) automatically captures metadata about that `run`
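To illustrate how a DAG determines what can run in parallel, the sketch below groups nodes into "waves" using the standard library's `graphlib`. The node names are hypothetical and are not taken from the example flow; Prompt flow's own scheduler may work differently.

```python
from graphlib import TopologicalSorter

# Hypothetical flow: node name -> set of nodes it depends on
flow = {
    "embed_topic": set(),
    "load_template": set(),
    "search_docs": {"embed_topic"},
    "write_tweet": {"search_docs", "load_template"},
}

def execution_waves(graph):
    """Group nodes into waves; every node in a wave has all its dependencies satisfied."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # nodes that could run concurrently
        waves.append(ready)
        sorter.done(*ready)
    return waves

for i, wave in enumerate(execution_waves(flow), start=1):
    print(f"wave {i}: {wave}")
```

Here `embed_topic` and `load_template` share a wave because neither depends on the other, while `write_tweet` has to wait for both of its inputs.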

![image](images/pf-dag-screenshot.png)

## Example Flow Overview

Sample that performs RAG on Qdrant documentation and writes a tweet about a given topic.

* Uses a dataset that is already chunked containing bits from Qdrant's documentation
* Leverages prompt templates

## Experimentation + Tracing

* Built-in UIs greatly help with inspecting each phase of a test
* Tracking token usage
* Batch runs for testing a flow against a dataset

Testing a flow with a single input:
```
pf flow test --flow promptflow/qdrant_tweet_flow --inputs topic=Microsoft
```

Batch-running a flow against a dataset:
```
pf run create --flow promptflow/qdrant_tweet_flow --data promptflow/qdrant_tweet_flow/data.jsonl
```

## Evaluations + Benchmarking

* You can't improve what you don't measure
* Evaluations are also flows and can run against output of previous runs

Batch-running an evaluation flow against an existing run:
```
pf run create --flow promptflow/evaluation_flow \
--data promptflow/evaluation_flow/data.jsonl \
--column-mapping sentiment='${data.sentiment}' tweet='${run.outputs.tweet}' \
--run <RUN_NAME>
```
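The column mapping above pairs an expected `sentiment` from the dataset with the `tweet` produced by the earlier run. Conceptually, an evaluation is a function applied to each such row plus an aggregate over the scores. The sketch below uses a toy keyword grader and invented rows purely for illustration; the actual evaluation flow in the repository may grade differently.

```python
def grade(expected_sentiment, tweet):
    """Toy grader: 1 if a naive keyword check agrees with the expected label, else 0."""
    positive_words = {"love", "great", "fast"}
    words = {w.strip(".,!?").lower() for w in tweet.split()}
    predicted = "positive" if words & positive_words else "negative"
    return int(predicted == expected_sentiment)

def evaluate(rows):
    """Score every row, then aggregate into a single accuracy metric."""
    scores = [grade(row["sentiment"], row["tweet"]) for row in rows]
    return {"count": len(scores), "accuracy": sum(scores) / len(scores)}

# Hypothetical rows: `sentiment` from the dataset, `tweet` from a previous run's output
rows = [
    {"sentiment": "positive", "tweet": "Love how fast vector search is!"},
    {"sentiment": "negative", "tweet": "Great docs, shame about the setup."},
]
print(evaluate(rows))
```

Keeping the per-row grader separate from the aggregate makes it easy to swap in a stronger grader later without changing how results are reported.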

# Thank You!

Code + Slides can be found here: https://github.com/khchan/building-blocks-ai