# Prompt-Engineering Lab: Customer-Support Replies with **Mistral-7B-Instruct**

In this lab you will experiment with six classic **prompt patterns**:

1. **Zero-shot**: no examples, just the task.  
2. **Few-shot**: show 1-2 examples first.  
3. **Chain-of-Thought**: ask the model to reason step by step.  
4. **Role / Persona**: tell the model "You are a ...".  
5. **Structured output**: force JSON or bullet format.  
6. **System / Policy instructions**: prepend internal guidelines.  

By the end you’ll understand how tiny changes in wording can dramatically change an LLM’s answer.

## What is Prompt-Engineering?
Because large language models don’t have traditional "APIs", you steer them with natural-language prompts. Refining spacing, order, examples, and style (zero-shot, few-shot, chain-of-thought, role instructions, JSON schemas, etc.) is called prompt engineering. Good prompts boost accuracy and consistency without touching the model’s weights.

## 0  Install Hugging Face client library

We only need **`huggingface_hub`** because we call the hosted Inference API. No heavyweight Transformers install is required.

### What are Hugging Face and the Hugging Face Hub?

Hugging Face is an open-source-first AI company that curates Transformers, Diffusers, and other widely used libraries.
The Hub (https://huggingface.co/) is a **GitHub for models & datasets**:
- Free hosting for models, datasets, Spaces (demo apps).
- Built-in versioning, README rendering, and model cards.
- REST API and Python SDK (`huggingface_hub`) for push/pull, inference endpoints, and gated access control.

In practice a Hub repo behaves like a Git repo, but you usually pull files with `hf_hub_download()` and publish with `model.push_to_hub()`.

In [None]:
!pip install -q huggingface_hub

## 1  Authenticate and create an `InferenceClient`

You need an **access token** to use the Mistral-7B endpoint hosted by provider **"novita"** on Hugging Face.

### What are providers on HF Inference Client and what is novita?

The `InferenceClient` can route a request to different providers (back-ends):
- `hf-inference`: the serverless Inference API hosted by HF itself. Which provider is used by default depends on your `huggingface_hub` version, so pass `provider` explicitly.
- `novita` (a third-party back-end): offers chat-optimized latency/pricing. Its API surface matches the OpenAI-style `/chat/completions`.

You switch providers by changing a single argument:

```python
client = InferenceClient(provider="novita", api_key=HF_TOKEN)
```
so the same client code can talk to multiple vendors.

In [None]:
import os, getpass
from huggingface_hub import InferenceClient

# Paste your personal HF token when prompted
os.environ["HF_TOKEN"] = getpass.getpass("Enter your HF token: ")

# Create the client (model is chosen later in each call)
client = InferenceClient(provider="novita", api_key=os.environ["HF_TOKEN"])
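The cell above prompts for the token on every run. As a minimal sketch, the helper below reuses a token that is already in the environment and only prompts when it is missing. The name `get_hf_token` is hypothetical and not part of `huggingface_hub`.

```python
import getpass
import os


def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the token from the environment, prompting only if it is missing.

    `get_hf_token` is an illustrative helper, not a library function.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        # Fall back to an interactive prompt and cache the result for later cells
        token = getpass.getpass("Enter your HF token: ").strip()
        os.environ[env_var] = token
    if not token:
        raise ValueError("No Hugging Face token provided.")
    return token
```

You could then pass `api_key=get_hf_token()` when constructing the client.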

## Model spotlight - mistralai/Mistral-7B-Instruct-v0.3
| Aspect             | Pros                                                                             | Cons                                                                        |
| ------------------ | -------------------------------------------------------------------------------- | --------------------------------------------------------------------------- |
| **Quality**        | Very strong chat and code for a 7 B model. Outperforms LLaMA-7B/13B in many evals. | Still below GPT-4 / Claude-Opus tier on reasoning and long-context.           |
| **Licence**        | Apache-2.0, commercial use allowed.                            | Weight access is sometimes gated. You need an HF token or a local copy.     |
| **Speed / Memory** | Fits in about 14 GB of GPU memory in fp16 (roughly half that in 8-bit), so it is cheap to host. Responds \~50-70 tok/s on T4. | Context window is limited (32 K tokens for v0.3). Cannot process very long docs like GPT-4o-128K. |
| **LoRA**           | Adapter \~120 MB. Can be fine-tuned on a Colab T4 in <30 min.                            | PEFT on Mistral needs to target correct linear layers (`q_proj`, `v_proj`). |


## 2  Zero-shot prompt

Ask the model to perform a task without examples: “Translate the sentence to Italian: 'I love coffee'”.

Zero-shot works because instruction-tuned LMs have seen countless instructions during training. It's the fastest way to test a capability.

In [None]:
prompt = "A customer is requesting a refund because a product was delivered late. Write a professional reply."

completion = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[
        {"role": "user", "content": prompt}
    ],
)

print(completion.choices[0].message["content"])
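The same call pattern recurs in every section of this lab. As a rough sketch, it can be wrapped in one helper. `ask` and `EchoClient` are hypothetical names introduced here; `EchoClient` is an offline stand-in that only mimics the shape of `client.chat.completions.create`, so the helper can be tried without a token.

```python
from types import SimpleNamespace

MODEL = "mistralai/Mistral-7B-Instruct-v0.3"


def ask(client, prompt, model=MODEL, **gen_kwargs):
    """Send one user message and return the reply text (illustrative helper)."""
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        **gen_kwargs,  # e.g. max_tokens or temperature
    )
    return completion.choices[0].message.content


class EchoClient:
    """Hypothetical offline stand-in: echoes the prompt instead of calling a model."""

    def __init__(self):
        self.chat = SimpleNamespace(
            completions=SimpleNamespace(create=self._create)
        )

    def _create(self, model, messages, **kwargs):
        text = f"[{model}] echo: {messages[-1]['content']}"
        message = SimpleNamespace(content=text)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


print(ask(EchoClient(), "Write a professional reply."))
```

With the real client created earlier, the call would be `ask(client, prompt)`.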

## 3  Few-shot prompt

Provide 1-3 examples so the model infers pattern, tone, structure.

Example:  
Q1: Formal -> Casual  
A1: Please contact me. -> Hit me up!  

Given such examples, the model's next output tends to follow their pattern, tone, and length more closely than a zero-shot request would.

In [None]:
prompt = '''Example 1:
Customer: I received my order late. I'd like a refund.
Reply: We're sorry for the delay. Your refund will be processed within 3 business days.

Example 2:
Customer: My package arrived after the estimated delivery date. I want a refund.
Reply: We apologize for the inconvenience. The refund has been approved and will be issued shortly.

Now write a reply to:
A customer is requesting a refund because a product was delivered late.
'''

completion = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": prompt}],
)

print(completion.choices[0].message["content"])
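Hand-written few-shot prompts get unwieldy once the examples change often. The sketch below builds the same kind of prompt from a list of `(customer, reply)` pairs, and also shows an alternative encoding where each example becomes a user/assistant turn. Both function names are hypothetical, and whether the multi-turn form works better for this model is an assumption to test.

```python
def few_shot_prompt(examples, query):
    """Format (customer, reply) pairs followed by the new query as one string."""
    parts = []
    for i, (customer, reply) in enumerate(examples, start=1):
        parts.append(f"Example {i}:\nCustomer: {customer}\nReply: {reply}\n")
    parts.append(f"Now write a reply to:\nCustomer: {query}\nReply:")
    return "\n".join(parts)


def few_shot_messages(examples, query):
    """Alternative: encode each example as a user/assistant turn."""
    messages = []
    for customer, reply in examples:
        messages.append({"role": "user", "content": customer})
        messages.append({"role": "assistant", "content": reply})
    messages.append({"role": "user", "content": query})
    return messages


EXAMPLES = [
    ("I received my order late. I'd like a refund.",
     "We're sorry for the delay. Your refund will be processed within 3 business days."),
    ("My package arrived after the estimated delivery date. I want a refund.",
     "We apologize for the inconvenience. The refund has been approved and will be issued shortly."),
]

print(few_shot_prompt(EXAMPLES, "My order came a week late. Please refund me."))
```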

## 4  Chain-of-Thought prompt

Add “Think step by step” (or a more explicit reasoning instruction).  
The model then writes out its intermediate reasoning, which often improves accuracy on multi-step problems and lets you inspect and correct the chain.

In [None]:
prompt = "A customer is requesting a refund because a product was delivered late. Think step by step and write a professional reply."

completion = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": prompt}],
)

print(completion.choices[0].message["content"])
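A chain-of-thought reply mixes reasoning with the text you actually want to send to the customer. One possible way to separate them, sketched below, is to ask for a marker line and split on it. The marker `FINAL REPLY:` and the helper `split_reasoning` are assumptions made for this illustration, and the model is not guaranteed to emit the marker.

```python
COT_SUFFIX = (
    "Think step by step about what the customer needs. "
    "Then write the reply after a line containing only 'FINAL REPLY:'."
)


def split_reasoning(text, marker="FINAL REPLY:"):
    """Return (reasoning, reply); if the marker is absent, treat all text as the reply."""
    before, found, after = text.rpartition(marker)
    if not found:
        return "", text.strip()
    return before.strip(), after.strip()


sample = "1. The delivery was late.\n2. A refund is reasonable.\nFINAL REPLY:\nDear customer, we are sorry."
reasoning, reply = split_reasoning(sample)
print(reply)
```

Appending `COT_SUFFIX` to the task prompt keeps the reasoning available for inspection while only the reply is forwarded.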

## 5  Role / Persona prompt

Pre-frame the assistant with a persona, for example:
"You are an empathetic customer-support agent…".  
A persona sets vocabulary, tone, and even policy constraints without touching the model weights.

In [None]:
prompt = "You are a customer support agent. A customer is asking for a refund due to a late delivery. Write a helpful and empathetic response."

completion = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": prompt}],
)

print(completion.choices[0].message["content"])
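To compare personas systematically, it can help to assemble the framing from parts instead of retyping it. The following is a small sketch; `persona_prompt` is a hypothetical helper and the sentence template is just one plausible wording.

```python
def persona_prompt(role, traits, task):
    """Compose a 'You are ...' framing from a role, tone traits, and the task."""
    article = "an" if role[0].lower() in "aeiou" else "a"
    tone = " and ".join(traits)
    return f"You are {article} {role}. Be {tone}. {task}"


TASK = "A customer is asking for a refund due to a late delivery. Write a response."

# Try several personas against the same task to see how tone shifts
for role, traits in [
    ("customer support agent", ["helpful", "empathetic"]),
    ("escalations manager", ["formal", "concise"]),
]:
    print(persona_prompt(role, traits, TASK))
```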

## 6  Structured output

Ask for an explicit output structure, and wrap user-supplied context in clear markers (`"""` or `###`), e.g.

```txt
"""CONTEXT
long text …
"""
Summarize the context in one bullet.
```

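Delimiters keep the model from mixing instructions with payload and boost reliability in multi-part prompts. The cell below applies the structure half of the pattern: it requests a reply with four numbered parts.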

In [None]:
prompt = '''Write a professional reply with the following structure:
1. Greeting
2. Apology for the delay
3. Refund details
4. Closing remark

Customer: I received the product late and would like a refund.'''

completion = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": prompt}],
)

print(completion.choices[0].message["content"])
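The delimiter pattern described at the start of this section can be sketched as a small helper. `delimited_prompt` is a hypothetical name, and replacing stray delimiters inside the payload is one simple precaution, not a complete defence against prompt injection.

```python
def delimited_prompt(context, instruction, label="CONTEXT"):
    """Wrap untrusted context in triple-quote markers, then append the instruction."""
    fence = '"""'
    # Neutralise the delimiter if it happens to occur inside the payload
    safe = context.replace(fence, "'''")
    return f"{fence}{label}\n{safe}\n{fence}\n{instruction}"


print(delimited_prompt(
    "I received the product late and would like a refund.",
    "Summarize the context in one bullet.",
))
```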

## 7  System / Policy instructions (simulating LoRA-style policies)

Simulation: bake policy or domain hints into the prompt (`[INTERNAL POLICY: ...]`). Notice how the reply changes when the policy changes.

Real LoRA: train small low-rank adapter matrices (well under 1 % of the parameters) and attach them to the attention projection layers, typically the query and value projections and optionally key and output. Training often needs only minutes on a single GPU, and the adapter file is typically on the order of 100 MB.
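To make the "well under 1 %" figure concrete, here is a back-of-the-envelope count of LoRA parameters. The layer dimensions are assumptions taken from the published Mistral-7B configuration (hidden size 4096, 32 layers, key/value width 1024), and the total parameter count is approximate. The actual adapter size depends on rank, target layers, and storage precision.

```python
HIDDEN = 4096          # model width (assumed from the published config)
KV_DIM = 1024          # 8 key/value heads x 128 dims (grouped-query attention)
LAYERS = 32
BASE_PARAMS = 7.25e9   # approximate total parameter count

# Output width of each attention projection; every projection takes HIDDEN inputs
OUT_DIM = {"q_proj": HIDDEN, "k_proj": KV_DIM, "v_proj": KV_DIM, "o_proj": HIDDEN}


def lora_params(rank, targets=("q_proj", "v_proj")):
    """Count trainable LoRA parameters for the given rank and target layers."""
    # Each adapted matrix gains A (rank x d_in) and B (d_out x rank)
    per_layer = sum(rank * (HIDDEN + OUT_DIM[t]) for t in targets)
    return per_layer * LAYERS


n = lora_params(rank=16)
print(f"{n:,} trainable parameters")
print(f"{n * 2 / 1e6:.1f} MB in fp16")
print(f"{100 * n / BASE_PARAMS:.3f}% of the base model")
```

Higher ranks or more target layers scale the count linearly, which is why reported adapter sizes vary.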

In [None]:
# Standard prompt
prompt = "A customer is requesting a refund because a product was delivered late. Write a professional reply."

completion = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": prompt}],
)

print("Standard Prompt:\n", completion.choices[0].message["content"])

# Policy A: full refund, no return required
prompt_custom_1 = '''[INTERNAL POLICY: Respond with empathy, offer refund without requiring customer to return the item, and apologize sincerely. Use concise and positive tone.]

A customer is requesting a refund because a product was delivered late. Write a professional reply.'''

completion_custom_1 = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": prompt_custom_1}],
)
print("Policy 1:\n", completion_custom_1.choices[0].message["content"])

# Policy B: no refund, offer discount code
prompt_custom_2 = '''[INTERNAL POLICY: Do not offer refunds, instead offer 15% discount code.]

A customer is requesting a refund because a product was delivered late. Write a professional reply.'''

completion_custom_2 = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": prompt_custom_2}],
)
print("Policy 2:\n", completion_custom_2.choices[0].message["content"])
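The cells above place the policy inside the user message. Chat APIs also accept a separate `system` role, which keeps policy and customer request apart. The sketch below builds such message lists for both policies. Whether the system role is honoured depends on the model's chat template and the provider, so treat that as an assumption to verify; `policy_messages` is a hypothetical helper.

```python
REQUEST = (
    "A customer is requesting a refund because a product was delivered late. "
    "Write a professional reply."
)

POLICIES = {
    "A": "Respond with empathy, offer refund without requiring customer to return "
         "the item, and apologize sincerely. Use concise and positive tone.",
    "B": "Do not offer refunds, instead offer 15% discount code.",
}


def policy_messages(policy, request):
    """Put the policy in a system message and the request in a user message."""
    return [
        {"role": "system", "content": f"INTERNAL POLICY: {policy}"},
        {"role": "user", "content": request},
    ]


# One message list per policy, ready to pass as `messages=` to the chat call
batches = {name: policy_messages(text, REQUEST) for name, text in POLICIES.items()}
for name, messages in batches.items():
    print(name, "->", messages[0]["content"])
```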

## 8  Advanced: JSON-formatted reply

Explicitly combine role, task, and format, and give the model a skeleton to fill in:

```
Return ONLY valid JSON:
{"greeting":"", "apology":"", "resolution":"", "action_type":"", "needs_follow_up":false}
```

The model tends to comply when the request is precise and the examples are consistent.

In [None]:
prompt = '''
You are a customer support assistant. A customer is requesting a refund because a product arrived late.

Return the response in JSON format with the following fields:
- greeting
- apology
- refund_policy
- closing

Example format:
{
  "greeting": "...",
  "apology": "...",
  "refund_policy": "...",
  "closing": "..."
}
'''

completion = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": prompt}],
)
print(completion.choices[0].message["content"])
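Even with a clear format request, the model may surround the JSON with extra prose, so it is worth validating the reply before using it. A minimal sketch, assuming the four field names from the prompt above; `parse_reply` is a hypothetical helper and the sample text is invented for illustration.

```python
import json

REQUIRED = ("greeting", "apology", "refund_policy", "closing")


def parse_reply(text, required=REQUIRED):
    """Extract the outermost JSON object from a reply and check its fields."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    data = json.loads(text[start:end + 1])  # raises on malformed JSON
    missing = [key for key in required if key not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return data


# Invented sample reply with chatter around the JSON
sample = (
    "Sure! Here is the JSON:\n"
    '{"greeting": "Hello", "apology": "Sorry for the delay", '
    '"refund_policy": "Full refund", "closing": "Best regards"}\n'
    "Let me know if you need anything else."
)
print(parse_reply(sample)["refund_policy"])
```

On a `ValueError` you could retry the request or fall back to a human-written template.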

## 9  Take-aways

- **Prompt engineering for rapid iteration**: you can prototype different behaviours in minutes, no fine-tuning needed.  
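- **System vs. User prompts**: internal policies, whether sent as a system message or prepended to the user prompt as in this lab, can override the tone and policy of the final answer.  
- **Structured output**: JSON or numbered lists make post-processing much simpler, provided you validate what comes back.  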

### Next experiments

1. Add `temperature` / `top_p` arguments to explore the trade-off between creative and predictable output.  
2. Test the same prompts on a different model (e.g., **Llama-3-8B-Instruct**) to compare style and latency.  
3. Combine with *LoRA-style adapters* (see Notebook 2) for policy adherence without needing bulky prompts.  
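For the first experiment, a small grid of sampling settings makes the comparison repeatable. This sketch only builds the parameter sets. The specific values are arbitrary starting points, and `sampling_grid` is a hypothetical helper.

```python
from itertools import product


def sampling_grid(temperatures=(0.2, 0.7, 1.0), top_ps=(0.9, 1.0)):
    """Return every temperature/top_p combination as a dict of keyword arguments."""
    return [{"temperature": t, "top_p": p} for t, p in product(temperatures, top_ps)]


for params in sampling_grid():
    # With a real client, pass these to the chat call as extra keyword arguments:
    # client.chat.completions.create(model=MODEL, messages=messages, **params)
    print(params)
```

Running each setting several times on the same prompt shows how much the replies vary.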