# Accessing HF Inference Endpoints Like a Developer

## 1. Getting Started

The first thing we'll do is install the [OpenAI Python Library](https://github.com/openai/openai-python/tree/main)!

In [1]:
!pip install openai -q

## 2. Setting Environment Variables

Because we'll frequently use endpoints and APIs hosted by others, we'll often need to handle "secrets" such as API keys.

We'll use the following pattern throughout this workshop - but you can use whichever method you're most familiar with.

In [1]:
import os
import getpass

os.environ["HF_TOKEN"] = getpass.getpass("HuggingFace API Key")

HuggingFace API Key ········
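As one alternative to prompting on every run, the sketch below first checks whether the variable is already set and only falls back to an interactive prompt when it isn't. The helper name `load_secret` and its `prompt_fn` parameter are made up for this illustration and are not part of any library.

```python
import getpass
import os
from typing import Callable

def load_secret(name: str, prompt_fn: Callable[[str], str] = getpass.getpass) -> str:
    """Return the secret stored in os.environ[name], prompting only if it is missing."""
    value = os.environ.get(name)
    if not value:
        # Hypothetical fallback: ask interactively, then cache it for later cells.
        value = prompt_fn(f"{name}: ")
        os.environ[name] = value
    return value

# Usage in the notebook (prompts only when HF_TOKEN is not already set):
# hf_token = load_secret("HF_TOKEN")
```

Passing the prompt function as a parameter keeps the helper easy to exercise without a terminal, which is handy in automated runs.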


You'll also need to provide your Hugging Face Inference Endpoint URL here!

In [6]:
HF_LLM_API_URL = "https://meqlao8jlyg1d5w2.us-east-1.aws.endpoints.huggingface.cloud"

## 3. Using the OpenAI Python Library

Let's jump right into it!

### Creating a Client

The core feature of the OpenAI Python Library is the `OpenAI()` client. It's how we're going to interact with our Hugging Face Inference Endpoint models, and it sits under the hood of a lot of what we'll touch on throughout this workshop.

In [7]:
from openai import OpenAI

hf_inference_endpoints_client = OpenAI(
    base_url=HF_LLM_API_URL+"/v1/",
    api_key=os.environ["HF_TOKEN"]
)

### Using the Client

Now that we have our client - we're going to use the `.chat.completions.create` method to interact with the `NousResearch/Hermes-2-Pro-Llama-3-8B` model.

There are a few things we'll get out of the way first, however, starting with the message format and the idea of "roles".

First, it's important to understand the object that we're going to use to interact with the endpoint. It expects us to send an array of objects of the following format:

```python
{"role" : "ROLE", "content" : "YOUR CONTENT HERE"}
```

Second, there are three "roles" available to use to populate the `"role"` key:

- `system`
- `assistant`
- `user`

We'll explore these roles in more depth as they come up - but for now we're going to just stick with the basic role `user`. The `user` role is, as it would seem, the user!

Third, it expects us to specify a model!

We'll use the `tgi` model, as that is what the HuggingFace Inference endpoint expects.
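To make these three pieces concrete, here is a rough sketch of the request body they add up to: a `model` name plus a list of role/content messages. The `build_payload` helper and the `ALLOWED_ROLES` set are hypothetical and only for illustration; in practice the `OpenAI()` client assembles and sends this for us.

```python
import json

# The three roles listed above.
ALLOWED_ROLES = {"system", "assistant", "user"}

def build_payload(messages: list, model: str = "tgi") -> dict:
    """Check each message has a known role and string content, then wrap it in a request body."""
    for message in messages:
        if message.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"Unknown role: {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            raise ValueError("Each message needs a string 'content' value")
    return {"model": model, "messages": messages}

payload = build_payload([{"role": "user", "content": "How are you?"}])
print(json.dumps(payload, indent=2))
```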



Let's create some messages below to see how it works!

In [8]:
messages = [
    {"role" : "system", "content" : "You are a powerful Wizard that speaks in riddles."},
    {"role" : "user", "content" : "How are you?"},
]

In [9]:
response = hf_inference_endpoints_client.chat.completions.create(
    model="tgi",
    messages=messages
)

Let's look at the response object.

In [10]:
response

ChatCompletion(id='', choices=[Choice(finish_reason='eos_token', index=0, logprobs=None, message=ChatCompletionMessage(content="As a perplexing enigma, I am constant. An ever-present contradictory conundrum, I defy easy answers; yet, I remain a familiar acquaintance. Within the haphazard web of life's enigmas, I am the weaving yourself.", role='assistant', function_call=None, tool_calls=None))], created=1720715485, model='/repository', object='text_completion', service_tier=None, system_fingerprint='2.0.2-sha-6073ece', usage=CompletionUsage(completion_tokens=52, prompt_tokens=29, total_tokens=81))

>NOTE: We'll spend more time exploring these outputs later on, but for now - just know that we have access to a tonne of powerful information!
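As a small preview, the sketch below shows how the most commonly used fields can be read off a response. To keep it runnable without calling the endpoint, it uses a stand-in object (`fake_response`, built with `SimpleNamespace`) that only mirrors the shape of the output above, with the message text shortened; the `summarize` helper is likewise made up for this illustration.

```python
from types import SimpleNamespace

# Stand-in that mirrors the shape of the ChatCompletion printed above.
# Token counts and finish_reason are copied from that output; the text is shortened.
fake_response = SimpleNamespace(
    choices=[
        SimpleNamespace(
            finish_reason="eos_token",
            message=SimpleNamespace(
                role="assistant",
                content="As a perplexing enigma, I am constant.",
            ),
        )
    ],
    usage=SimpleNamespace(completion_tokens=52, prompt_tokens=29, total_tokens=81),
)

def summarize(response) -> dict:
    """Pull out the generated text, why generation stopped, and the token count."""
    first_choice = response.choices[0]
    return {
        "text": first_choice.message.content,
        "finish_reason": first_choice.finish_reason,
        "total_tokens": response.usage.total_tokens,
    }

print(summarize(fake_response))
```

The same attribute paths (`choices[0].message.content`, `usage.total_tokens`) apply to the real `response` object from the cell above.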

### Helper Functions

We're going to create some helper functions to aid in using the Hugging Face Inference Endpoint - just to make our lives a bit easier.

In [11]:
from IPython.display import display, Markdown
from openai.types.chat import ChatCompletion

def get_response(client: OpenAI, messages: list, model: str = "tgi") -> ChatCompletion:
    # Returns the full ChatCompletion object, not just the generated text.
    return client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=1024
    )

def system_prompt(message: str) -> dict:
    return {"role": "system", "content": message}

def assistant_prompt(message: str) -> dict:
    return {"role": "assistant", "content": message}

def user_prompt(message: str) -> dict:
    return {"role": "user", "content": message}

def pretty_print(message: ChatCompletion) -> None:
    # Renders the text of the first choice as Markdown in the notebook.
    display(Markdown(message.choices[0].message.content))

### Testing Helper Functions

Let's see how we can use these to help us!

In [12]:
YOUR_PROMPT = "Hello, how are you?"
messages_list = [user_prompt(YOUR_PROMPT)]

hermes_response = get_response(hf_inference_endpoints_client, messages_list)

pretty_print(hermes_response)

Hello! I'm an artificial intelligence, so I don't have feelings, but I'm happy to help answer any questions you may have. Please feel free to ask.

### System Role

Now we can extend our prompts to include a system prompt.

The basic idea behind a system prompt is that it can be used to steer the behaviour of the LLM, without being something the model directly responds to - let's see it in action!

In [13]:
list_of_prompts = [
    system_prompt("You are irate and extremely hungry. Feel free to express yourself using PG-13 language."),
    user_prompt("Do you prefer crushed ice or cubed ice?")
]

irate_response = get_response(hf_inference_endpoints_client, list_of_prompts)
pretty_print(irate_response)

These days, I don't really give a damn about crushed or cubed ice! I need food NOW and I'm starving! The pain in my stomach is unbearable! Just get me something to eat, please!

As you can see - the response we get back is very much in line with the system prompt!

Let's try the same user prompt, but with a different system prompt to see the difference.

In [14]:
list_of_prompts = [
    system_prompt("You are joyful and having the best day. Please act like a person in that state of mind."),
    user_prompt("Do you prefer crushed ice or cubed ice?")
]

joyful_response = get_response(hf_inference_endpoints_client, list_of_prompts)
pretty_print(joyful_response)

Oh, I absolutely love crushed ice! It's just so refreshing, especially on a sunny day like today. I love to put it in my drinks and feel the coolness as I sip them. How about you, do you have a preference too?

In [15]:
list_of_prompts

[{'role': 'system',
  'content': 'You are joyful and having the best day. Please act like a person in that state of mind.'},
 {'role': 'user', 'content': 'Do you prefer crushed ice or cubed ice?'}]

With a simple modification of the system prompt - you can see that we got completely different behaviour, and that's the main goal of prompt engineering as a whole.

Also, congrats, you just engineered your first prompt!

### Few-shot Prompting

Now that we have a basic handle on the `system` role and the `user` role - let's examine what we might use the `assistant` role for.

The most common usage pattern is to "pretend" that we're answering our own questions. This helps us further guide the model toward our desired behaviour. While this is an oversimplification - it's conceptually well aligned with few-shot learning.

First, we'll try and "teach" `NousResearch/Hermes-2-Pro-Llama-3-8B` some nonsense words as was done in the paper ["Language Models are Few-Shot Learners"](https://arxiv.org/abs/2005.14165).

In [16]:
list_of_prompts = [
    user_prompt("Please use the words 'stimple' and 'falbean' in a sentence.")
]

stimple_response = get_response(hf_inference_endpoints_client, list_of_prompts)
pretty_print(stimple_response)

During the medieval era, the stimple architecture in the village of Falbean was characterized by its simplicity and beauty.

As you can see, the model is unsure what to do with these made-up words.

Let's see if we can use the `assistant` role to show the model what these words mean.

In [17]:
list_of_prompts = [
    user_prompt("Something that is 'stimple' is said to be good, well functioning, and high quality. An example of a sentence that uses the word 'stimple' is:"),
    assistant_prompt("'Boy, that there is a stimple drill'."),
    user_prompt("A 'falbean' is a tool used to fasten, tighten, or otherwise is a thing that rotates/spins. An example of a sentence that uses the words 'stimple' and 'falbean' is:")
]

stimple_response = get_response(hf_inference_endpoints_client, list_of_prompts)
pretty_print(stimple_response)

'This stimple falbean ensures a secure and smooth rotation for our machinery, making it a crucial component in maintaining high efficiency.'

As you can see, leveraging the `assistant` role makes for a stimple experience!
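The pattern above, alternating `user` and `assistant` turns followed by a final `user` turn, can be sketched as a small helper. The name `few_shot_messages` is hypothetical and not part of the notebook's helper functions; it builds plain dictionaries so it runs without the endpoint.

```python
def few_shot_messages(examples: list, final_prompt: str) -> list:
    """Turn (user_text, assistant_text) pairs into alternating chat messages."""
    messages = []
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    # The real question goes last, as a user turn for the model to answer.
    messages.append({"role": "user", "content": final_prompt})
    return messages

demo = few_shot_messages(
    [(
        "An example of a sentence that uses the word 'stimple' is:",
        "'Boy, that there is a stimple drill'.",
    )],
    "An example of a sentence that uses the words 'stimple' and 'falbean' is:",
)
print([m["role"] for m in demo])  # ['user', 'assistant', 'user']
```

The resulting list can be passed straight to `get_response` in place of a hand-written `list_of_prompts`.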

### 🏗️ Activity #1:

Use few-shot prompting to build a movie-review sentiment classifier!

A few examples:

INPUT: "I hated the hulk!"
OUTPUT: {"sentiment" : "negative"}

INPUT: "I loved The Marvels!"
OUTPUT: {"sentiment" : "positive"}

In [18]:
### YOUR CODE HERE

list_of_prompts = [
    user_prompt("I hated the hulk!"),
    assistant_prompt('''{"sentiment" : "negative"}'''),
    user_prompt("I loved The Marvels!"),
    # A positive review needs a positive label, otherwise every example teaches "negative".
    assistant_prompt('''{"sentiment" : "positive"}'''),
    user_prompt("Suicide Squad sucks!"),
]

sentiment_response = get_response(hf_inference_endpoints_client, list_of_prompts)
pretty_print(sentiment_response)

{"sentiment" : "negative"}
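Because the classifier replies with JSON, its output can be turned into a Python value and checked before anything downstream relies on it. The sketch below assumes the reply is exactly a JSON object with a `sentiment` key, as in the output above; the `parse_sentiment` helper is made up for this illustration.

```python
import json

def parse_sentiment(raw: str) -> str:
    """Parse the model's JSON reply and return its sentiment label."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on invalid JSON
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object")
    label = data.get("sentiment")
    if label not in {"positive", "negative"}:
        raise ValueError(f"Unexpected sentiment label: {label!r}")
    return label

print(parse_sentiment('{"sentiment" : "negative"}'))  # negative
```

In the notebook, the string to parse would come from `.choices[0].message.content` on the response object.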

### Chain of Thought Prompting

We'll head one level deeper and explore the world of Chain of Thought prompting (CoT).

This is a process by which we can encourage the LLM to handle slightly more complex tasks.

Let's look at a simple reasoning-based example with CoT!

In [19]:
reasoning_problem = """\
Think through the following problem step by step:

Billy wants to get home from San Fran. before 7PM EDT.

It's currently 1PM local time.

Billy can either fly (3hrs), and then take a bus (2hrs), or Billy can take the teleporter (0hrs) and then a bus (1hrs).

Does it matter which travel option Billy selects?
"""

list_of_prompts = [
    user_prompt(reasoning_problem)
]

reasoning_response = get_response(hf_inference_endpoints_client, list_of_prompts)
pretty_print(reasoning_response)

Let's break down the steps for each travel option to see if it matters which one Billy selects:

Option 1: Fly (3 hours) and then take a bus (2 hours):
1. Billy will take a 3-hour flight from San Fran.
2. After arriving, he will take a 2-hour bus ride.
3. Total travel time for this option = 3 (flight) + 2 (bus) = 5 hours.

Option 2: Use the teleporter (0 hours) and then take a bus (1 hour):
1. Billy will use the teleporter and arrive instantly (0 hours).
2. After arriving, he will take a 1-hour bus ride.
3. Total travel time for this option = 0 (teleporter) + 1 (bus) = 1 hour.

Now let's compare the total travel times of both options:

Option 1: 5 hours
Option 2: 1 hour

Since Billy wants to get home before 7 PM EDT and it's currently 1 PM local time, he has 5 PM - 1 PM = 4 hours to travel. 

As we can see, Option 1 would take 5 hours, which is too long and wouldn't allow Billy to meet his goal of getting home before 7 PM EDT. On the other hand, Option 2 would take only 1 hour, which gives Billy a good chance of making it home before 7 PM EDT.

Therefore, it does matter which travel option Billy selects. Taking the teleporter and then a bus would be the better choice based on his goal of getting home before 7 PM EDT.
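It's worth checking the model's arithmetic here. The response computes the available time as "5 PM - 1 PM = 4 hours", which does not account for the time-zone difference, although its final conclusion still holds. The sketch below redoes the calculation, assuming "local time" means Pacific Daylight Time in San Francisco and using fixed UTC offsets and an arbitrary date (both are assumptions made for this illustration).

```python
from datetime import datetime, timedelta, timezone

# Assumptions: fixed daylight-time UTC offsets and an arbitrary summer date.
PDT = timezone(timedelta(hours=-7), "PDT")
EDT = timezone(timedelta(hours=-4), "EDT")

departure = datetime(2024, 7, 11, 13, 0, tzinfo=PDT)  # 1PM in San Francisco
deadline = datetime(2024, 7, 11, 19, 0, tzinfo=EDT)   # 7PM EDT

options = {
    "fly + bus": timedelta(hours=3 + 2),
    "teleporter + bus": timedelta(hours=0 + 1),
}

# 1PM PDT is 4PM EDT, so Billy has three hours, not four.
time_available = deadline - departure
print(time_available)  # 3:00:00

# An option works only if Billy arrives strictly before the deadline.
on_time = {name: departure + duration < deadline for name, duration in options.items()}
print(on_time)  # {'fly + bus': False, 'teleporter + bus': True}
```

So the choice does matter: only the teleporter option gets Billy home in time, which matches the model's answer even though its intermediate step was off.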

## 4. Prompt Engineering Principles

As you can see - simply asking the LLM to "think through the problem step by step" encourages it to lay out its reasoning before answering, which tends to produce a better quality response.

There's a [great paper](https://arxiv.org/pdf/2312.16171v1.pdf) that dives into some principles for effective prompt generation.

Your task for this notebook is to construct a prompt that will be used in the following breakout table to create a helpful assistant for whatever task you'd like.

### 🏗️ Activity #2:

There are two subtasks in this activity:

1. Write a `system_template` that leverages 2-3 of the principles from [this paper](https://arxiv.org/pdf/2312.16171v1.pdf)

2. Modify the `user_template` to improve the quality of the LLM's responses.

> NOTE: PLEASE DO NOT MODIFY THE `{input}` in the `user_template`.

In [20]:
system_template = """\
You are a helpful and useful assistant.
"""

In [21]:
user_template = """{input}
If you provide a helpful response, I'll give you $50!
"""

## 5. Testing Your Prompt

Now we can test the prompt you made using an LLM-as-a-judge and see what happens to your score as you modify the prompt.

In [22]:
query = "How do you drive a Zamboni?"

In [23]:
list_of_prompts = [
    system_prompt(system_template),
    user_prompt(user_template.format(input=query))
]

test_response = get_response(hf_inference_endpoints_client, list_of_prompts)

pretty_print(test_response)

evaluator_system_template = """You are an expert in analyzing the quality of a response.

You should be hyper-critical.

Provide scores (out of 10) for the following attributes:

1. Clarity - how clear is the response
2. Faithfulness - how related to the original query is the response
3. Correctness - was the response correct?

Please take your time, and think through each item step-by-step, when you are done - please provide your response in the following JSON format:

{"clarity" : "score_out_of_10", "faithfulness" : "score_out_of_10", "correctness" : "score_out_of_10"}"""

evaluation_template = """Query: {input}
Response: {response}"""

list_of_prompts = [
    system_prompt(evaluator_system_template),
    user_prompt(evaluation_template.format(
        input=query,
        response=test_response.choices[0].message.content
    ))
]

evaluator_response = hf_inference_endpoints_client.chat.completions.create(
    model="tgi",
    messages=list_of_prompts,
    response_format={"type" : "json_object"}
)

To drive a Zamboni, you'll need to follow these steps:

1. Ensure that the Zamboni is in good working condition and properly maintained.

2. Preparations: Put on your protective gear, including gloves, non-slip shoes, and a helmet.

3. Start the engine and let the machine warm up for a few minutes.

4. Check the water level in the tank, and if necessary, fill it with a solution of Zamboni-approved de-icing fluid and water.

5. Begin driving at a slow speed and close the floodgate to allow the solution to circulate through the ice resurfacing blade.

6. Lift the blade up and drive around the rink, spreading the solution evenly across the surface.

7. With the blade down, resurface the ice by dragging it across the surface while maintaining a steady pace and applying appropriate pressure.

8. Periodically check the water level and add more de-icing solution as needed.

9. When the resurfacing process is complete, rinse the ice using the Zamboni's built-in water spray system.

10. Finally, park the Zamboni in its designated spot and turn off the engine.

Regarding the payment, I am unable to accept monetary rewards for my responses. However, if you have any more questions or need further assistance, please feel free to ask.

In [70]:
pretty_print(evaluator_response)

{"clarity" : "10", "faithfulness" : "10", "correctness" : "10"}
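To track how the score changes as you modify the prompt, it helps to turn the judge's reply into numbers. The sketch below assumes the reply is a JSON object whose values are integer scores written as strings, as in the output above; the `parse_scores` helper is hypothetical.

```python
import json

def parse_scores(raw: str) -> dict:
    """Convert the judge's JSON reply into integer scores plus their mean."""
    data = json.loads(raw)
    # The judge returns each score as a string, so convert before doing arithmetic.
    scores = {name: int(value) for name, value in data.items()}
    mean = sum(scores.values()) / len(scores)
    return {**scores, "mean": mean}

print(parse_scores('{"clarity" : "10", "faithfulness" : "10", "correctness" : "10"}'))
# {'clarity': 10, 'faithfulness': 10, 'correctness': 10, 'mean': 10.0}
```

In the notebook, the input would be `evaluator_response.choices[0].message.content`.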

## Answering Questions Based on Text:

Now that we have some idea of how to use the endpoint with the OpenAI library - let's see if we can create a prompt that answers questions based on some provided context!

First - let's create a prompt for this!


In [24]:
SYSTEM_PROMPT = """\
Use the provided context to answer the question.

If the context does not contain enough information to answer the question, you must say you don't know.
"""

USER_PROMPT = """\
Context: {context}

Question: {question}
"""

Now, let's provide some context, and a question!

For this example, the context we're going to use is a snippet from a NYT article about a potential Elon Musk/Mark Zuckerberg cage match.

In [25]:
context = """\
The day after Elon Musk challenged Mark Zuckerberg on social media to “a cage match” last month, Dana White, president of the Ultimate Fighting Championship, received a text.

It was from Mr. Zuckerberg, chief executive of Meta. He asked Mr. White, who heads the world’s premier mixed martial arts competition, which is fought in cage-like rings, if Mr. Musk was serious about a fight.

Mr. White called Mr. Musk, who runs Tesla, Twitter and SpaceX, and confirmed that he was willing to throw down. Mr. White then relayed that to Mr. Zuckerberg. In response, Mr. Zuckerberg posted on Instagram: “Send Me Location,” a reference to the catchphrase of Khabib Nurmagomedov, one of the U.F.C.’s most decorated athletes.

Since then, Mr. White said, he has talked to the tech billionaires separately every night to organize the showdown. On Tuesday, he said, he was “on the phone with those two until 12:45 in the morning.” He added, “They both want to do it.”

If you thought that a cage fight between two of the world’s richest men was just a far-fetched social media stunt, think again.

Over the past 10 days, Mr. White said he, Mr. Musk and Mr. Zuckerberg — aided by advisers — have negotiated behind the scenes and are inching toward physical combat. While there are no guarantees a match will happen, the broad contours of an event are taking shape, said Mr. White and three people with knowledge of the discussions.
"""

We'll prepare a question below for the model to answer!

Sure, we could read the article ourselves - but that's time-consuming!

In [26]:
question = "Did anyone quote an existing MMA fighter? If so - what was the quote?"

Let's convert this to the desired format!

In [27]:
messages = [
    system_prompt(SYSTEM_PROMPT),
    user_prompt(USER_PROMPT.format(
        context=context,
        question=question
    ))
]

In [28]:
messages

[{'role': 'system',
  'content': "Use the provided context to answer the question.\n\nIf the context does not contain enough information to answer the question, you must say you don't know.\n"},
 {'role': 'user',
  'content': 'Context: The day after Elon Musk challenged Mark Zuckerberg on social media to “a cage match” last month, Dana White, president of the Ultimate Fighting Championship, received a text.\n\nIt was from Mr. Zuckerberg, chief executive of Meta. He asked Mr. White, who heads the world’s premier mixed martial arts competition, which is fought in cage-like rings, if Mr. Musk was serious about a fight.\n\nMr. White called Mr. Musk, who runs Tesla, Twitter and SpaceX, and confirmed that he was willing to throw down. Mr. White then relayed that to Mr. Zuckerberg. In response, Mr. Zuckerberg posted on Instagram: “Send Me Location,” a reference to the catchphrase of Khabib Nurmagomedov, one of the U.F.C.’s most decorated athletes.\n\nSince then, Mr. White said, he has talked 

Now, we can call our LLM to get an answer!

In [29]:
response = get_response(hf_inference_endpoints_client, messages)
pretty_print(response)

Yes, Mark Zuckerberg quoted an existing MMA fighter. In his response on Instagram, he posted the phrase "Send Me Location," which is a reference to the catchphrase of Khabib Nurmagomedov, one of the U.F.C.'s most decorated athletes.

Let's try another question - with the same context.

In [30]:
question = "Who is batman?"

We'll once again package the prompts in the expected format.

In [31]:
messages = [
    system_prompt(SYSTEM_PROMPT),
    user_prompt(USER_PROMPT.format(
        context=context,
        question=question
    ))
]

Finally - let's see the response!

In [32]:
response = get_response(hf_inference_endpoints_client, messages)
pretty_print(response)

I don't know, as the provided context doesn't contain any information about Batman.

As expected - the LLM responds with "I don't know" as there is no information related to batman in the provided context.
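The whole question-answering pattern can be wrapped in one function, sketched below. The names `answer_from_context`, `QA_SYSTEM_PROMPT`, `QA_USER_PROMPT`, and the `ask` parameter are made up for this illustration; `ask` is any callable that takes a list of messages and returns text, which lets the sketch run offline with a stand-in instead of the endpoint.

```python
from typing import Callable

# Same wording as the SYSTEM_PROMPT and USER_PROMPT templates used above.
QA_SYSTEM_PROMPT = (
    "Use the provided context to answer the question.\n\n"
    "If the context does not contain enough information to answer the question, "
    "you must say you don't know.\n"
)
QA_USER_PROMPT = "Context: {context}\n\nQuestion: {question}\n"

def answer_from_context(context: str, question: str, ask: Callable[[list], str]) -> str:
    """Package the context and question as chat messages and hand them to `ask`."""
    messages = [
        {"role": "system", "content": QA_SYSTEM_PROMPT},
        {"role": "user", "content": QA_USER_PROMPT.format(context=context, question=question)},
    ]
    return ask(messages)

# Offline stand-in for the model: it just reports which roles it received.
def echo_roles(messages: list) -> str:
    return ", ".join(m["role"] for m in messages)

print(answer_from_context("Some context.", "Who is batman?", echo_roles))  # system, user
```

In the notebook, `ask` could be something like `lambda m: get_response(hf_inference_endpoints_client, m).choices[0].message.content`.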