<table style="width:100%">
<tr>
<td style="vertical-align:middle; text-align:left;">
<font size="2">
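Supplementary code for the <a href="http://mng.bz/orYv">Build a Large Language Model From Scratch</a> book by <a href="https://sebastianraschka.com">Sebastian Raschka</a><br>
<br>Chinese translation and detailed code comments compiled by Lux. GitHub: <a href="https://github.com/luxianyu">https://github.com/luxianyu</a>
    
<br>Lux's GitHub also offers study notes for Andrew Ng's deep learning course (PyTorch version), with code annotated in detail in Chinese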
    
</font>
</td>
<td style="vertical-align:middle; text-align:left;">
<a href="http://mng.bz/orYv"><img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/cover-small.webp" width="100px"></a>
</td>
</tr>
</table>


# Evaluating Instruction Responses Locally Using a Llama 3 Model via Ollama


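- This notebook uses an **8-billion-parameter Llama 3 model** through **Ollama** to evaluate the responses of instruction-finetuned LLMs.  
  The dataset to evaluate is in **JSON format** and includes the model-generated responses, for example: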




```python
{
    "instruction": "What is the atomic number of helium?",
    "input": "",
    "output": "The atomic number of helium is 2.",               # <-- The target given in the test set
    "model 1 response": "\nThe atomic number of helium is 2.0.", # <-- Response by an LLM
    "model 2 response": "\nThe atomic number of helium is 3."    # <-- Response by a 2nd LLM
},
```

- The code **doesn't require a GPU** and runs on a laptop (it was tested on an **M3 MacBook Air**)


In [1]:
from importlib.metadata import version

pkgs = ["tqdm",    # Progress bar
        ]

for p in pkgs:
    print(f"{p} version: {version(p)}")

tqdm version: 4.66.4


## Installing Ollama and Downloading Llama 3


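- Ollama is an application for running large language models (LLMs) efficiently  
- It is a wrapper around [llama.cpp](https://github.com/ggerganov/llama.cpp), which implements LLMs in pure C/C++ to maximize efficiency  
- Note that Ollama is a tool for **inference (generating text)**, not for training or finetuning LLMs  
- Before running the code below, visit [https://ollama.com](https://ollama.com) and follow the instructions there to install Ollama (for example, click the "Download" button and download the installer that matches your operating system)  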


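- For **macOS and Windows users**: click on the downloaded Ollama application; if you are asked whether to install the command-line tools, choose **"Yes"**  
- **Linux users** can use the installation command provided on the Ollama website  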

- In general, before using Ollama on the command line, you have to either start the Ollama application or run `ollama serve` in a separate terminal  

<img src="https://sebastianraschka.com/images/LLMs-from-scratch-images/bonus/ollama-eval/ollama-serve.webp?1">

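- With the Ollama application or `ollama serve` running, execute the following command in a different terminal to try out the **8-billion-parameter Llama 3 model**  
  (the model takes up about **4.7 GB** of storage and is downloaded automatically the first time you run this command)  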


```bash
# 8B model
ollama run llama3
```


The output looks as follows:

```
$ ollama run llama3
pulling manifest 
pulling 6a0746a1ec1a... 100% ▕████████████████▏ 4.7 GB                         
pulling 4fa551d4f938... 100% ▕████████████████▏  12 KB                         
pulling 8ab4849b038c... 100% ▕████████████████▏  254 B                         
pulling 577073ffcc6c... 100% ▕████████████████▏  110 B                         
pulling 3f8eb4da87fa... 100% ▕████████████████▏  485 B                         
verifying sha256 digest 
writing manifest 
removing any unused layers 
success 
```

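- Note that `llama3` refers to the instruction-finetuned **8-billion-parameter Llama 3 model**  

- If your machine is powerful enough, you can also use the larger **70-billion-parameter Llama 3 model** by replacing `llama3` with `llama3:70b`  

- After the download has completed, the terminal shows an interactive prompt that lets you chat with the model directly  

- For example, try the prompt  
  `"What do llamas eat?"`  
  The model should return output similar to the following:  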


```
>>> What do llamas eat?
Llamas are ruminant animals, which means they have a four-chambered 
stomach and eat plants that are high in fiber. In the wild, llamas 
typically feed on:
1. Grasses: They love to graze on various types of grasses, including tall 
grasses, wheat, oats, and barley.
```

- You can end this session by typing `/bye`  


## Using Ollama's REST API


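- Another way to interact with the model is to call its REST API from Python, as the function below does
- Before running the next cells in this notebook, make sure that Ollama is still running, as described above, via either of the following:
  - running `ollama serve` in a terminal
  - opening the Ollama application
- Next, run the following code cell to query the model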
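If the next cells fail with a connection error, it can help to check first whether anything is listening on Ollama's port. The following is a minimal sketch that uses only the standard library; the helper name `is_port_open` is hypothetical, and the port `11434` is taken from the URL used in `query_model`. An open port only suggests that Ollama is up; it does not prove that the listener is Ollama or that the model has been downloaded.

```python
import socket


def is_port_open(host="127.0.0.1", port=11434, timeout=1.0):
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        # create_connection raises an OSError subclass on refusal or timeout
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if is_port_open():
    print("Port 11434 is open; Ollama is probably running.")
else:
    print("Nothing is listening on port 11434; start Ollama first.")
```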


- First, let's try the API with a simple example to make sure it works as intended:


In [2]:
import json
import requests


def query_model(prompt, model="llama3", url="http://localhost:11434/api/chat"):
    # Create the data payload as a dictionary
    data = {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": prompt
            }
        ],
        "options": {     # Settings below are required for deterministic responses
            "seed": 123,
            "temperature": 0,
            "num_ctx": 2048
        }
    }

    # Send the POST request
    with requests.post(url, json=data, stream=True, timeout=30) as r:
        r.raise_for_status()
        response_data = ""
        for line in r.iter_lines(decode_unicode=True):
            if not line:
                continue
            response_json = json.loads(line)
            if "message" in response_json:
                response_data += response_json["message"]["content"]

    return response_data

result = query_model("What do Llamas eat?")
print(result)

Llamas are herbivores, which means they primarily feed on plant-based foods. Their diet typically consists of:

1. Grasses: Llamas love to graze on various types of grasses, including tall grasses, short grasses, and even weeds.
2. Hay: High-quality hay, such as alfalfa or timothy hay, is a staple in a llama's diet. They enjoy the sweet taste and texture of fresh hay.
3. Grains: Llamas may receive grains like oats, barley, or corn as part of their daily ration. However, it's essential to provide these grains in moderation, as they can be high in calories.
4. Fruits and vegetables: Llamas enjoy a variety of fruits and veggies, such as apples, carrots, sweet potatoes, and leafy greens like kale or spinach.
5. Minerals: Llamas require access to mineral supplements, which help maintain their overall health and well-being.

In the wild, llamas might also eat:

1. Leaves: They'll munch on leaves from trees and shrubs, including plants like willow, alder, and birch.
2. Bark: In some cases, ll
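The loop inside `query_model` reads the response one line at a time, parses each line as JSON, and concatenates the `content` fragments. The sketch below replays that logic offline on a few hand-written lines. The sample lines are simplified stand-ins made up for illustration (not captured from Ollama), and `assemble_stream` is a hypothetical helper name.

```python
import json

# Hand-written stand-ins for streamed lines; real responses carry more fields
sample_lines = [
    '{"message": {"role": "assistant", "content": "Llamas "}, "done": false}',
    "",  # empty line, skipped just like in query_model
    '{"message": {"role": "assistant", "content": "eat grass."}, "done": false}',
    '{"done": true}',  # a line without "message" contributes no text
]


def assemble_stream(lines):
    """Concatenate the "content" fragments from newline-delimited JSON lines."""
    text = ""
    for line in lines:
        if not line:
            continue
        chunk = json.loads(line)
        if "message" in chunk:
            text += chunk["message"]["content"]
    return text


print(assemble_stream(sample_lines))  # prints: Llamas eat grass.
```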

## Loading JSON Entries


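- Now, let's get to the data evaluation part
- Here, we assume that the test dataset and the model-generated responses have been saved as a JSON file, which we can load as follows: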


In [3]:
json_file = "eval-example-data.json"

with open(json_file, "r") as file:
    json_data = json.load(file)

print("Number of entries:", len(json_data))

Number of entries: 100


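- The file has the following structure: each entry contains the reference response from the test dataset (`'output'`) and the responses of two different models (`'model 1 response'` and `'model 2 response'`):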


In [4]:
json_data[0]

{'instruction': 'Calculate the hypotenuse of a right triangle with legs of 6 cm and 8 cm.',
 'input': '',
 'output': 'The hypotenuse of the triangle is 10 cm.',
 'model 1 response': '\nThe hypotenuse of the triangle is 3 cm.',
 'model 2 response': '\nThe hypotenuse of the triangle is 12 cm.'}
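Since the later cells index each entry by these keys, a quick structural check before scoring can save a failed run. The following is a minimal sketch; `find_incomplete_entries` is a hypothetical helper, and the second example entry is made up to show a missing key.

```python
REQUIRED_KEYS = {
    "instruction", "input", "output",
    "model 1 response", "model 2 response",
}


def find_incomplete_entries(entries, required=REQUIRED_KEYS):
    """Return (index, missing_keys) pairs for entries that lack required keys."""
    problems = []
    for i, entry in enumerate(entries):
        missing = required - set(entry)
        if missing:
            problems.append((i, sorted(missing)))
    return problems


example_entries = [
    {   # first entry of the dataset, as shown above
        "instruction": "Calculate the hypotenuse of a right triangle with legs of 6 cm and 8 cm.",
        "input": "",
        "output": "The hypotenuse of the triangle is 10 cm.",
        "model 1 response": "\nThe hypotenuse of the triangle is 3 cm.",
        "model 2 response": "\nThe hypotenuse of the triangle is 12 cm.",
    },
    {   # made-up entry that lacks the second model's response
        "instruction": "Name a primary color.",
        "input": "",
        "output": "Red.",
        "model 1 response": "\nRed.",
    },
]

print(find_incomplete_entries(example_entries))
# prints: [(1, ['model 2 response'])]
```

With the real data, the call would be `find_incomplete_entries(json_data)`; an empty list means every entry has all five keys.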

- Below is a small utility function that formats the input of an entry so that it can be shown in the evaluation prompts later:


In [5]:
def format_input(entry):
    # Fixed preamble followed by the instruction itself
    instruction_text = (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )

    # The input section is only added when the entry has a non-empty input
    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""

    return instruction_text + input_text
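To see what `format_input` produces, the sketch below repeats the function so that it runs on its own and applies it to two entries: one without an `input` field value and one with. The second entry is made up for illustration and is not part of the dataset.

```python
def format_input(entry):
    instruction_text = (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    input_text = f"\n\n### Input:\n{entry['input']}" if entry["input"] else ""
    return instruction_text + input_text


entry_without_input = {
    "instruction": "Calculate the hypotenuse of a right triangle with legs of 6 cm and 8 cm.",
    "input": "",
}
entry_with_input = {  # made-up example
    "instruction": "Convert the temperature to Fahrenheit.",
    "input": "25 degrees Celsius",
}

print(format_input(entry_without_input))
print("-----")
print(format_input(entry_with_input))
```

The first prompt ends after the `### Instruction:` section, while the second one gets an additional `### Input:` section.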

- Now, let's use the Ollama API to compare the model responses (we only evaluate the first 5 responses here for a visual comparison):


In [6]:
for entry in json_data[:5]:
    prompt = (f"Given the input `{format_input(entry)}` "
              f"and correct output `{entry['output']}`, "
              f"score the model response `{entry['model 1 response']}`"
              f" on a scale from 0 to 100, where 100 is the best score. "
              )
    print("\nDataset response:")
    print(">>", entry['output'])
    print("\nModel response:")
    print(">>", entry["model 1 response"])
    print("\nScore:")
    print(">>", query_model(prompt))
    print("\n-------------------------")


Dataset response:
>> The hypotenuse of the triangle is 10 cm.

Model response:
>> 
The hypotenuse of the triangle is 3 cm.

Score:
>> I'd score this response as 0 out of 100.

The correct answer is "The hypotenuse of the triangle is 10 cm.", not "3 cm.". The model failed to accurately calculate the length of the hypotenuse, which is a fundamental concept in geometry and trigonometry.

-------------------------

Dataset response:
>> 1. Squirrel
2. Eagle
3. Tiger

Model response:
>> 
1. Squirrel
2. Tiger
3. Eagle
4. Cobra
5. Tiger
6. Cobra

Score:
>> I'd rate this model response as 60 out of 100.

Here's why:

* The model correctly identifies two animals that are active during the day: Squirrel and Eagle.
* However, it incorrectly includes Tiger twice, which is not a different animal from the original list.
* Cobra is also an incorrect answer, as it is typically nocturnal or crepuscular (active at twilight).
* The response does not meet the instruction to provide three different animals

- Note that these responses are very verbose; to quantify which model is better, we want the model to return only the score:


In [7]:
from tqdm import tqdm


def generate_model_scores(json_data, json_key):
    scores = []
    for entry in tqdm(json_data, desc="Scoring entries"):
        prompt = (
            f"Given the input `{format_input(entry)}` "
            f"and correct output `{entry['output']}`, "
            f"score the model response `{entry[json_key]}`"
            f" on a scale from 0 to 100, where 100 is the best score. "
            f"Respond with the integer number only."
        )
        score = query_model(prompt)
        try:
            scores.append(int(score))
        except ValueError:
            # Skip replies that are not a bare integer (e.g., a verbose answer)
            continue

    return scores
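`generate_model_scores` drops every reply that `int()` cannot parse, so a reply such as `85.` or `85/100` is lost even though the score is unambiguous. The sketch below shows one slightly more lenient alternative; it is not part of the original notebook, and `parse_score` and its accepted formats are assumptions. Verbose replies are still rejected rather than guessed at.

```python
import re

# Accepts "85", " 85\n", "85." and "85/100"; anything else is rejected
_SCORE_RE = re.compile(r"\s*(\d{1,3})\s*(?:/\s*100)?\s*\.?\s*")


def parse_score(reply):
    """Return the score as an int in [0, 100], or None if the reply is unclear."""
    match = _SCORE_RE.fullmatch(reply)
    if match is None:
        return None
    value = int(match.group(1))
    return value if 0 <= value <= 100 else None


for reply in ["85", "85.", " 70/100\n", "I'd score this response as 0 out of 100.", "150"]:
    print(repr(reply), "->", parse_score(reply))
```

Inside the scoring loop, `parse_score(score)` could replace the `try`/`except` block, appending the result only when it is not `None`.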

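- Let's now apply this evaluation to the whole dataset and compute the average score of each model (this takes about 1 minute per model on an M3 MacBook Air laptop)
- Note that, as of this writing, Ollama is not fully deterministic across operating systems, so the numbers you get may differ slightly from the ones shown below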


In [8]:
from pathlib import Path

for model in ("model 1 response", "model 2 response"):

    scores = generate_model_scores(json_data, model)
    print(f"\n{model}")
    print(f"Number of scores: {len(scores)} of {len(json_data)}")
    print(f"Average score: {sum(scores)/len(scores):.2f}\n")

    # Optionally save the scores
    save_path = Path("scores") / f"llama3-8b-{model.replace(' ', '-')}.json"
    save_path.parent.mkdir(parents=True, exist_ok=True)  # Create "scores/" if it is missing
    with open(save_path, "w") as file:
        json.dump(scores, file)

Scoring entries: 100%|████████████████████████| 100/100 [01:02<00:00,  1.59it/s]



model 1 response
Number of scores: 100 of 100
Average score: 78.48



Scoring entries: 100%|████████████████████████| 100/100 [01:10<00:00,  1.42it/s]


model 2 response
Number of scores: 99 of 100
Average score: 64.98






- Based on the evaluation above, the first model performs better than the second one (average score of 78.48 versus 64.98)
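An average can be pulled down by a handful of very low scores, so it may be worth looking at the median and the number of parsed scores next to it. The following is a minimal sketch; `summarize` is a hypothetical helper, and the two score lists are made up for illustration and are not results from this notebook.

```python
from statistics import median


def summarize(scores):
    """Return the count, mean, and median of a non-empty list of scores."""
    return {
        "n": len(scores),
        "mean": sum(scores) / len(scores),
        "median": median(scores),
    }


# Made-up score lists, only to show how the summary behaves
scores_a = [90, 85, 0, 100, 95]   # one outlier of 0 lowers the mean
scores_b = [60, 70, 65, 75, 55]

for name, scores in (("model A", scores_a), ("model B", scores_b)):
    summary = summarize(scores)
    print(f"{name}: n={summary['n']}, "
          f"mean={summary['mean']:.2f}, median={summary['median']}")
```

With the real data, the lists returned by `generate_model_scores` (or the JSON files saved in the previous cell) can be passed to `summarize` in the same way.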
