# Open Source vs OpenAI

Wondering how much better Llama 2 is compared to Llama?

In this notebook, we'll use auto-evaluation by GPT-4 to measure outputs from both Llama and Llama 2 on a few prompts. To make this example easy to run, we'll use 7B GGML variants of the Llama models, which should run on a typical laptop.

## Installations

You can set up prompttools either by installing it via `pip` or by running `python setup.py develop` in the root of this repo. Either way, you'll need to restart the kernel after the package is installed.

In [1]:
# !pip install --quiet --force-reinstall prompttools

## Setup imports and API keys

Next, we'll need to set our API keys. Since we want to use GPT-4 for auto-evaluation, we need to set `OPENAI_API_KEY`. We won't be using the Hegel AI API key for this example, so it can stay empty.

In [2]:
import os
os.environ['HEGELAI_API_KEY'] = ""
os.environ['OPENAI_API_KEY'] = ""
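
Hardcoding keys in a notebook makes them easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming the key was already exported in the shell that launched Jupyter. `require_env` is a hypothetical helper for illustration, not part of `prompttools`:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name, "")
    if not value:
        raise RuntimeError(f"{name} is not set; export it before starting the notebook")
    return value


# Example (uncomment once the key is exported in your shell):
# openai_key = require_env("OPENAI_API_KEY")
```

Failing early with a clear message is usually easier to debug than an authentication error raised in the middle of an evaluation run.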

Then we'll import the relevant `prompttools` modules to set up our experiment.

In [3]:
from typing import Dict, List, Tuple
from prompttools.experiment import LlamaCppExperiment
from prompttools.harness.multi_experiment_harness import MultiExperimentHarness

## Run an experiment

Next, we create our test inputs. We can iterate over models, inputs, and configurations like temperature.

In [4]:
model_paths = ['/Users/stevenkrawczyk/Downloads/llama-7b.ggmlv3.q2_K.bin',    # Download from https://huggingface.co/TheBloke/LLaMa-7B-GGML/tree/main
               '/Users/stevenkrawczyk/Downloads/llama-2-7b.ggmlv3.q2_K.bin']  # Download from https://huggingface.co/TheBloke/Llama-2-7B-GGML/tree/main
prompts = [
    """
    OBJECTIVE:
    You are a sales development representative for a startup called Hegel AI.
    Your startup builds developer tools for large language models.
    Draft a short sales email, 50 words or less, asking a prospect for 15 minutes
    of their time to chat about how they're using large language models.
    
    RESPONSE:
    """,
    """
    OBJECTIVE:
    You are a customer support representative for a startup called Hegel AI.
    Answer the following customer question:
    Do you offer refunds?
    
    RESPONSE:
    """
]
temperatures = [0.0, 1.0]

call_params = dict(temperature=temperatures)

experiment = LlamaCppExperiment(model_paths, prompts, call_params=call_params)
harness = MultiExperimentHarness([experiment])
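
The experiment is given lists of models, prompts, and temperatures. Assuming it runs every combination of these (a Cartesian product, which is an assumption about `LlamaCppExperiment` rather than something this notebook states), the number of runs can be sketched with the standard library alone. The model and prompt labels below are placeholders standing in for the real paths and prompt strings:

```python
from itertools import product

model_names = ["llama-7b", "llama-2-7b"]             # placeholder labels for the two model files
prompt_labels = ["sales email", "refund question"]   # placeholder labels for the two prompts
temperatures = [0.0, 1.0]

# Every (model, prompt, temperature) triple, in the order product() yields them.
combinations = list(product(model_names, prompt_labels, temperatures))
for model, prompt, temperature in combinations:
    print(f"{model:12} | {prompt:16} | temperature={temperature}")

print(len(combinations), "runs in total")
```

Under that assumption, adding a third temperature would raise the total from 8 runs to 12, so it is worth keeping each list short when running on a laptop.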

In [5]:
harness.prepare()
harness.run()

llama.cpp: loading model from /Users/stevenkrawczyk/Downloads/llama-7b.ggmlv3.q2_K.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 10 (mostly Q2_K)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: mem required  = 4464.12 MB (+ 1026.00 MB per state)
llama_new_context_with_model: kv self size  =  256.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 

llama_print_timings:        loa

In [6]:
from prompttools.utils import autoeval


def extract_responses(output) -> List[str]:
    # Collect the generated text from every choice in the completion result.
    return [choice["text"] for choice in output["choices"]]


def use_gpt4(
    prompt: str, results: Dict, metadata: Dict
) -> float:
    """
    Placeholder evaluation function passed to the harness.
    The GPT-4 auto-evaluation call is commented out below, so every
    response currently scores 0.0.
    """
    return 0.0
#     distances = [
#         autoeval.compute(prompt, response)
#         for response in extract_responses(results)
#     ]
#     return min(distances)
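
To see what `extract_responses` operates on, here is a small sketch with a hand-written stand-in for a completion result. The `sample_output` dictionary is made up for illustration and only mimics the `choices`/`text` layout the function reads. `word_count_ok` is a hypothetical evaluation function that follows the same `(prompt, results, metadata)` signature as `use_gpt4` and checks the 50-word limit from the sales email prompt without calling GPT-4:

```python
from typing import Dict, List


def extract_responses(output: Dict) -> List[str]:
    """Collect the generated text from every choice in a completion result."""
    return [choice["text"] for choice in output["choices"]]


def word_count_ok(prompt: str, results: Dict, metadata: Dict, limit: int = 50) -> float:
    """Return 1.0 if every response has at most `limit` words, else 0.0."""
    responses = extract_responses(results)
    return 1.0 if all(len(r.split()) <= limit for r in responses) else 0.0


# Made-up result for illustration only; this is not real model output.
sample_output = {"choices": [{"text": "Hi, do you have 15 minutes to chat about LLMs?"}]}

print(extract_responses(sample_output))
print(word_count_ok("unused prompt", sample_output, metadata={}))
```

A cheap deterministic check like this can be a useful sanity test of the evaluation plumbing before spending API calls on GPT-4 scoring.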


Finally, we can evaluate and visualize the results.

In [7]:
harness.evaluate("auto-evaluation", use_gpt4)


In [10]:
harness.visualize("response(s)")

[{'model_path': '/Users/stevenkrawczyk/Downloads/llama-7b.ggmlv3.q2_K.bin', 'lora_path': None, 'lora_base': None, 'n_ctx': 512, 'n_parts': -1, 'seed': 1337, 'f16_kv': True, 'logits_all': False, 'vocab_only': False, 'use_mlock': False, 'n_threads': None, 'n_batch': 512, 'use_mmap': True, 'last_n_tokens_size': 64, 'verbose': True, 'temperature': 0.0, 'prompt': "\n    OBJECTIVE:\n    You are a sales development representative for a startup called Hegel AI.\n    Your startup builds developer tools for large language models.\n    Draft a short sales email, 50 words or less, asking a prospect for 15 minutes\n    of their time to chat about how they're using large language models.\n    \n    RESPONSE:\n    ", 'suffix': None, 'max_tokens': 128, 'top_p': 0.95, 'logprobs': None, 'echo': False, 'stop': None, 'repeat_penalty': 1.1, 'top_k': 40, 'client': <llama_cpp.llama.Llama object at 0x1269c4650>}, {'model_path': '/Users/stevenkrawczyk/Downloads/llama-7b.ggmlv3.q2_K.bin', 'lora_path': None, 'lo

KeyError: 'models'