# Open Source vs OpenAI

Wondering how Llama 2 compares with GPT-4?

In this notebook, we'll run a Llama 2 model and several GPT-4 versions in a single harness, and set up a hook for auto-evaluation by GPT-4. To make this example easy to run, we'll use a 7B GGML variant of Llama 2, which should run on a typical laptop.

## Installations

You can set up prompttools either by installing it via `pip` or by running `python setup.py develop` in the root of this repo. Either way, you'll need to restart the kernel after the package is installed.

In [1]:
# !pip install --quiet --force-reinstall prompttools

## Setup imports and API keys

Next, we'll need to set our API keys. Since we want to use GPT-4 for auto-eval, we need to set the OpenAI key. We won't be using the Hegel AI API key for this example.

In [2]:
import os
os.environ['DEBUG'] = "1"
os.environ['HEGELAI_API_KEY'] = ""
os.environ['OPENAI_API_KEY'] = ""
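
Hard-coding keys in a notebook makes them easy to leak. As a minimal sketch (the helper name `require_env` is hypothetical and not part of `prompttools`), you could instead export the key in your shell and fail early if it is missing:

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Example usage (uncomment once the key is exported in your shell):
# openai_key = require_env("OPENAI_API_KEY")
```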

Then we'll import the relevant `prompttools` modules to set up our experiment.

In [3]:
from typing import Dict, List, Tuple
from prompttools.experiment import LlamaCppExperiment
from prompttools.experiment import OpenAIChatExperiment
from prompttools.harness.multi_experiment_harness import MultiExperimentHarness

## Run an experiment

Next, we create our test inputs. We can iterate over models, inputs, and configurations like temperature.

In [4]:
model_paths = ['/Users/stevenkrawczyk/Downloads/llama-2-7b-chat.ggmlv3.q2_K.bin']  # Download from https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/tree/main
prompts = [
    """
    OBJECTIVE:
    You are a sales development representative for a startup called Hegel AI.
    Your startup builds developer tools for large language models.
    Draft a short sales email, 50 words or less, asking a prospect for 15 minutes
    of their time to chat about how they're using large language models.
    
    RESPONSE:
    """,
    """
    OBJECTIVE:
    You are a customer support representative for a startup called Hegel AI.
    Answer the following customer question:
    Do you offer refunds?
    
    RESPONSE:
    """
]
temperatures = [1.0]
call_params = dict(temperature=temperatures)
llama_experiment = LlamaCppExperiment(model_paths, prompts, call_params=call_params)
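
The list-valued arguments in the cell above describe a grid of runs. As an illustration only, assuming the experiment takes the Cartesian product of its argument lists, the combinations can be enumerated as follows. This sketch uses plain `itertools` and hypothetical names (`expand_grid`, the placeholder prompt strings); it is not the `prompttools` implementation.

```python
from itertools import product
from typing import Dict, List


def expand_grid(**arg_lists: List) -> List[Dict]:
    """Return one dict per combination of the supplied argument lists."""
    names = list(arg_lists)
    return [dict(zip(names, combo)) for combo in product(*arg_lists.values())]


# Hypothetical stand-ins for the values defined in the cell above.
grid = expand_grid(
    model_path=["llama-2-7b-chat.ggmlv3.q2_K.bin"],
    prompt=["sales email prompt", "refund prompt"],
    temperature=[1.0],
)
print(len(grid))  # 1 model x 2 prompts x 1 temperature = 2 runs
```

Adding a second temperature to the list would double the number of runs, so grids grow quickly as more values are swept.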

In [5]:
models = ['gpt-4-0314', 'gpt-4-0613', 'gpt-4']
messages = [[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who was the first president?"},
]]
temperatures = [0.0]

openai_experiment = OpenAIChatExperiment(models, messages, temperature=temperatures)

In [6]:
harness = MultiExperimentHarness([openai_experiment, llama_experiment])

In [7]:
harness.prepare()
harness.run()

llama.cpp: loading model from /Users/stevenkrawczyk/Downloads/llama-2-7b-chat.ggmlv3.q2_K.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 10 (mostly Q2_K)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: mem required  = 4525.65 MB (+ 1026.00 MB per state)
llama_new_context_with_model: kv self size  =  256.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 

llama_print_timings:    

In [8]:
from prompttools.utils import autoeval


def extract_responses(output: Dict) -> List[str]:
    return [choice["text"] for choice in output["choices"]]


def use_gpt4(
    prompt: str, results: Dict, metadata: Dict
) -> float:
    """
    A placeholder evaluation that always returns 0.0.
    The commented-out code below shows how each response could be
    scored against the prompt with `autoeval.compute`.
    """
    return 0.0
#     distances = [
#         autoeval.compute(prompt, response)
#         for response in extract_responses(results)
#     ]
#     return min(distances)
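
While the GPT-4 call is commented out, any function with the same `(prompt, results, metadata)` signature can stand in as the evaluator. As a hedged sketch, the hypothetical `within_word_limit` below checks the 50-word limit from the sales-email prompt. It assumes `results` has the completion-style `{"choices": [{"text": ...}]}` shape that `extract_responses` expects, and would need adapting for chat-style OpenAI responses.

```python
from typing import Dict, List


def response_texts(output: Dict) -> List[str]:
    """Collect the text of every choice in a completion-style result."""
    return [choice["text"] for choice in output["choices"]]


def within_word_limit(prompt: str, results: Dict, metadata: Dict, limit: int = 50) -> float:
    """Return 1.0 if every response has at most `limit` words, else 0.0."""
    texts = response_texts(results)
    return float(all(len(text.split()) <= limit for text in texts))


# Made-up result in the assumed shape, for illustration only.
sample = {"choices": [{"text": "Hi, do you have 15 minutes to chat about LLMs?"}]}
print(within_word_limit("", sample, {}))
```

A deterministic check like this costs nothing to run, which makes it a useful sanity test before spending API calls on model-graded evaluation.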


Finally, we can evaluate and visualize the results.

In [9]:
harness.evaluate("auto-evaluation", use_gpt4)


In [10]:
harness.visualize("response(s)")

defaultdict(<class 'list'>, {'latency': [5.138979759067297e-06, 3.362016286700964e-06, 2.558052074164152e-06, 29.22151685395511, 18.70563395199133], 'auto-evaluation': [0.0, 0.0, 0.0, 0.0, 0.0]})


| prompt | gpt-4 | gpt-4-0314 | gpt-4-0613 | llama-2-7b-chat.ggmlv3.q2_K.bin |
| --- | --- | --- | --- | --- |
| `\n OBJECTIVE:\n You are a customer support representative for a startup called Hegel AI.\n Answer the following customer question:\n Do you offer refunds?\n \n RESPONSE:\n` | | | | `[ Yes, at Hegel AI we understand that sometimes our customers may need to return or cancel their orders. We do indeed offer refunds on certain products. The policy for refunds is available on our website under the "Returns and Refunds" section. ]` |
| `\n OBJECTIVE:\n You are a sales development representative for a startup called Hegel AI.\n Your startup builds developer tools for large language models.\n Draft a short sales email, 50 words or less, asking a prospect for 15 minutes\n of their time to chat about how they're using large language models.\n \n RESPONSE:\n` | | | | `[\n Subject: Quick Chat on Large Language Models?\n \n Hi [Prospect Name],\n Hope this finds you well! I'm [Your Name], a sales development \n representative from Hegel AI, an innovative startup developing \n developer tools for large language models. \n We have developed cutting-edge technologies to aid in the \n improvement and optimization of these models.\n \n Our products can enhance the efficiency and quality of your work.\n I would love to arrange a brief discussion with you to discuss how ]` |
| Who was the first president? | [George Washington] | [George Washington] | [George Washington] | |
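
To make the latency figures in the printed dictionary easier to compare, here is a small sketch that averages them per backend. It assumes the five values follow the order in which the experiments were passed to the harness (three OpenAI runs, then two Llama runs) and that they are in seconds; both are inferences from the output, not something the notebook states.

```python
from statistics import mean

# Latencies copied from the printed output above (assumed to be seconds).
latencies = [
    5.138979759067297e-06,
    3.362016286700964e-06,
    2.558052074164152e-06,
    29.22151685395511,
    18.70563395199133,
]

# Assumed split: the first three entries are OpenAI runs, the last two are Llama runs.
openai_latencies, llama_latencies = latencies[:3], latencies[3:]

print(f"OpenAI mean latency: {mean(openai_latencies):.2e} s")
print(f"Llama mean latency:  {mean(llama_latencies):.2f} s")
```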
