# Experimenting with Prompt Engineering

In this notebook we will explore prompt engineering as a way to improve the accuracy of Large Language Models (LLMs).

This notebook uses an active use case: an auto manufacturer labeling Google review data for its dealerships as pertaining to "Sales", "Service", "Both", or "Overall".

First you will load an LLM within the notebook environment and have it return a label for each review, based on the system prompt you provide to the model.

For each review, the notebook loads the text content, gets an LLM-generated label for it, and compares that label against the control label to produce an accuracy score.
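
As a minimal sketch of the scoring idea described above, the snippet below compares a list of generated labels against control labels and reports overall and per-label accuracy. The function name `score_labels` and the sample labels are hypothetical and are not part of the notebook's data or code.

```python
from collections import Counter


def score_labels(predicted, control):
    """Return overall accuracy and per-label accuracy for two equal-length label lists."""
    if len(predicted) != len(control):
        raise ValueError("predicted and control must have the same length")
    totals = Counter(control)  # how often each control label occurs
    hits = Counter(c for p, c in zip(predicted, control) if p == c)  # correct predictions per label
    overall = sum(hits.values()) / len(control) if control else 0.0
    per_label = {label: hits[label] / n for label, n in totals.items()}
    return overall, per_label


# Hypothetical labels, for illustration only (not taken from the workshop data)
control = ["Sales", "Service", "Both", "Overall", "Service"]
predicted = ["Sales", "Service", "Sales", "Overall", "Overall"]

overall, per_label = score_labels(predicted, control)
print(f"Overall accuracy: {overall:.0%}")
print(per_label)
```

A per-label breakdown like this can show which category a prompt struggles with, which a single overall percentage hides.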

### Working Environment 

[![Open In Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/build-on-aws/generative-ai-prompt-engineering/blob/main/prompt-engineering-chatbot/prompt-engineering-chatbot.ipynb)


This notebook was designed, written, and tested to run on machines with at least 16GB of RAM (32GB preferred). If you don't have access to one, sign up for a free account on [Amazon SageMaker Studio Lab](https://studiolab.sagemaker.aws/). Studio Lab is a free machine learning (ML) development environment that provides compute and storage (up to 15GB) at no cost, with no credit card required.

You can sign up for Amazon SageMaker Studio Lab here: <https://studiolab.sagemaker.aws/>

# Let's Jump in!

### Libraries
First, if needed, install `llama-cpp-python`, a Python binding for `llama.cpp`, a great open source project for working and experimenting with the underlying technology of generative AI.

`llama-cpp-python` is especially useful because it lets you run LLMs from system memory (RAM) on the CPU instead of the VRAM found in GPUs.

In [None]:
# Installed inline because --extra-index-url did not work from a requirements.txt
%pip install pandas
%pip install openpyxl
%pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu

In [None]:
from llama_cpp import Llama
import pandas as pd

## Llama3 8b Model

The following cells load the pretrained model. The model has been quantized to roughly 5 bits per weight (the `Q5_K_M` GGUF file) so that it fits in the RAM of a standard computer.

In [None]:
MODEL_PATH = "../Meta-Llama-3-8B-Instruct.Q5_K_M.gguf"
DATA_FILE_PATH = "llm_testing_data.xlsx"

In [None]:
LLM = Llama(
    model_path=MODEL_PATH,
    chat_format="llama-3",
    n_gpu_layers=200,  # layers to offload to a GPU; remove this line or set it to 0 on CPU-only machines
    verbose=False,
    n_ctx=8000
)
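
To see roughly why quantization matters for fitting the model in RAM, here is a back-of-the-envelope estimate of weight storage at different precisions. It assumes about 8 billion parameters for an "8B" model and counts only the weights; the helper `approx_weight_gb` is hypothetical and not part of `llama-cpp-python`.

```python
def approx_weight_gb(n_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 8e9  # assumption: roughly 8 billion parameters for an "8B" model

for bits in (16, 8, 5):
    print(f"{bits:>2}-bit weights: ~{approx_weight_gb(N_PARAMS, bits):.0f} GB")
```

Treat these numbers as a lower bound: the actual file on disk is typically somewhat larger than the 5-bit figure, and the context window (`n_ctx`) needs additional memory at run time.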

The following cell defines the functions we will use to query and evaluate our model.

In [None]:
def query_model(
        system_message,
        review_content
):
    review_message = "Review: " + review_content + " Classification:"
    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": review_message},
    ]

    results = LLM.create_chat_completion(
        messages
    )

    answer = results['choices'][0]['message']['content']

    return str(answer).strip()


def run_model(system_message, limit=None):
    df = pd.read_excel(DATA_FILE_PATH, sheet_name="Data")
    successes = 0
    for index, row in df.iterrows():
        if limit and index >= limit:
            break
        response = query_model(
            system_message,
            review_content=row['Review Text']
        )
        # Map the label names used in the prompt onto the control labels in the data
        response = "Service" if response == "Repairs" else response
        response = "Overall" if response == "Unsure" else response
        if response == row['Control Label']:
            successes += 1
        else:
            print("MISMATCH")
            print("Label: " + row['Control Label'])
            print(row['Review Text'])
            print("LLM: " + response)

        print('Row: ' + str(index + 1) + ' | ' + str((successes / (index + 1)) * 100) + "%")
    
    # Number of rows actually evaluated (the limit may exceed the rows available)
    count = min(limit, len(df)) if limit else len(df)
    print()
    print("Final Results: " + str((successes / count) * 100) + "%")
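
Because `run_model` compares the reply to the control label with an exact string match, a reply such as `"Repairs."` or `repairs` counts as a mismatch even when the model chose the right category. The sketch below shows one possible way to normalize a reply before comparing it. The helper `normalize_label` and the `ALIASES` table are hypothetical additions, not part of the original notebook.

```python
import string

# Same remapping that run_model applies to the prompt's label names
ALIASES = {"Repairs": "Service", "Unsure": "Overall"}


def normalize_label(raw):
    """Strip surrounding whitespace, quotes, and punctuation, fix casing, then apply aliases."""
    cleaned = str(raw).strip(string.punctuation + string.whitespace)
    cleaned = cleaned.capitalize()  # "BOTH" -> "Both", "repairs" -> "Repairs"
    return ALIASES.get(cleaned, cleaned)


print(normalize_label('"Repairs."'))
print(normalize_label(" unsure\n"))
print(normalize_label("BOTH"))
```

Replies that are not a single known label (for example a full sentence) pass through unchanged apart from casing, so they still show up as mismatches rather than being silently counted as correct.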

## Prompt Engineering

This final cell is where the magic happens. It contains our prompt to the LLM, which instructs it to label reviews and lists the conditions it should use to label them.

Adjust the system message here and run the cell to see whether accuracy goes up or down.

Can you break 90% accuracy?

In [None]:
system_message = """
You are an AI assistant tasked with categorizing reviews into one of the following topics: "Repairs", "Sales", "Both", or "Unsure".
Reviews should default to "Unsure".
"Repairs" refers to vehicle repairs, warranty claims, parts, appointments, updates, maintenance, and labor. Reviews listing a "service advisor", "service department", "service center", "service technician", "warranty", "recall", "technician", or any plural form of these words should fall under "Repairs" where a sale is also not mentioned. Reviews listing a pre-existing vehicle should fall under "Repairs".
"Sales" refers to vehicle sales, purchasing, test drives, inventory, buyers, shopping, payments, new cars, inventory, financing, and leasing.
"Unsure" applies to reviews that are generic and do not explicitly mention a sale or repair.
Reviews should only qualify for "Sales" or "Repairs" if a condition is met and only qualify for "Both" if a condition from both categories is met.
Please respond with just one word.
"""

# Adjust the "limit" parameter to run through only a set number of rows.
# e.g. limit=10 will run only the first 10 rows of data
# Use limit=None to run against all 1000 rows
run_model(system_message, limit=100)
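
Beyond editing the system message, another technique you could experiment with is few-shot prompting: showing the model a few labeled examples before the review it should classify. The sketch below builds such a message list in the same shape `query_model` uses. The function `build_few_shot_messages` and the example reviews are hypothetical, invented for illustration, and not taken from the workshop data.

```python
def build_few_shot_messages(system_message, examples, review_content):
    """Build a chat message list with worked examples placed before the real review."""
    messages = [{"role": "system", "content": system_message}]
    for example_review, example_label in examples:
        # Each example is a user turn followed by the assistant's expected one-word answer
        messages.append({"role": "user", "content": "Review: " + example_review + " Classification:"})
        messages.append({"role": "assistant", "content": example_label})
    messages.append({"role": "user", "content": "Review: " + review_content + " Classification:"})
    return messages


# Hypothetical example reviews, invented for illustration
EXAMPLES = [
    ("The service advisor kept me updated during my oil change.", "Repairs"),
    ("Bought a new truck and the financing was painless.", "Sales"),
    ("Friendly staff, great place.", "Unsure"),
]

messages = build_few_shot_messages(
    "Classify the review.", EXAMPLES, "They fixed my recall quickly."
)
print(len(messages))
```

To try it, the resulting list could be passed to `LLM.create_chat_completion(messages)` in place of the two-message list built inside `query_model`. Keep in mind that the examples take up context and add processing time per row.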