#**Lab 12: Prompt Engineering**

In this lab we are going to gain some practice with using LLM assistants and with prompt engineering.

We will do this with Llama2, an open source LLM, and the llama.cpp interface. You will have to write prompts to carry out several NLP tasks studied in the module.

It is recommended you try to connect to a T4 GPU on Colab as this will speed up things considerably.

The part of the lab dedicated to setting up the interface is based on the HuggingFace lab https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML/discussions/3.

#Llama 2

Llama 2 is a collection of pretrained and fine-tuned generative text models, ranging from 7 billion to 70 billion parameters, designed for dialogue use cases. In this lab we will be using [Llama 2 13B-chat](https://huggingface.co/meta-llama/Llama-2-13b-chat).

#llama.cpp

`llama.cpp` can be used to run the LLaMA model with 4-bit integer quantization on a MacBook. It is a plain C/C++ implementation optimized for Apple silicon and x86 architectures, supporting various integer quantization formats and BLAS libraries.

Task 0: Please read the description of the library at https://llama-cpp-python.readthedocs.io/en/latest/

# Setting up llama cpp python

The library works the same with a CPU, but the inference can take about three times longer compared to using it on a GPU.

If you want to use only the CPU, you can replace the content of the cell below with the following lines.
```
# CPU llama-cpp-python
!pip install llama-cpp-python==0.1.78
```
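If you are unsure which of the two installs applies to your runtime, a small check along the following lines may help. This is only a sketch: `has_nvidia_gpu` and `pip_command` are hypothetical helper names introduced here, and the check assumes that a working `nvidia-smi` command means a CUDA GPU is attached.

```python
import shutil
import subprocess


def has_nvidia_gpu() -> bool:
    """Return True if `nvidia-smi` is on PATH and exits successfully."""
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        result = subprocess.run(["nvidia-smi"], capture_output=True, timeout=10)
    except (OSError, subprocess.SubprocessError):
        return False
    return result.returncode == 0


def pip_command(use_gpu: bool, version: str = "0.1.78") -> str:
    """Build the install command used in this lab for CPU or GPU."""
    base = f"pip install llama-cpp-python=={version}"
    if use_gpu:
        # Same build flags as the GPU cell in this notebook.
        return (
            'CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 '
            f"{base} --force-reinstall --upgrade --no-cache-dir"
        )
    return base


print(pip_command(has_nvidia_gpu()))
```

The printed command can then be pasted into a cell with a leading `!`.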

In [1]:
# GPU llama-cpp-python
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.78 --force-reinstall --upgrade --no-cache-dir --verbose

Using pip 23.1.2 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)
Collecting llama-cpp-python==0.1.78
  Downloading llama_cpp_python-0.1.78.tar.gz (1.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 8.6 MB/s eta 0:00:00
  Running command pip subprocess to install build dependencies
  Using pip 23.1.2 from /usr/local/lib/python3.10/dist-packages/pip (python 3.10)
  Collecting setuptools>=42
    Using cached setuptools-69.2.0-py3-none-any.whl (821 kB)
  Collecting scikit-build>=0.13
    Downloading scikit_build-0.17.6-py3-none-any.whl (84 kB)
       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84.3/84.3 kB 1.1 MB/s eta 0:00:00
  Collecting cmake>=3.18
    Downloading cmake-3.29.2-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (26.7 MB)
       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 26.7/26.7 MB 13.9 MB/s eta 0:00:00
  Collecting ninja
    Downloading ninja-1.11.1.1-py2.py3-none-manylinux1_x86_64.manylinux_2_5

In [2]:
# To download the models
!pip install huggingface_hub



# Choosing the Llama2 version


Next, we need to specify which version of Llama2 to use. On Colab with a T4 GPU, we can run models of up to about 20B parameters once all optimizations (such as quantization) are applied, although these optimizations may degrade the quality of the model's output. The library can also run GGML models on a CPU.



In this lab, we will use [Llama 2 13B-chat](https://huggingface.co/meta-llama/Llama-2-13b-chat).

![asd](https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc24dac6d-6b5e-4b5f-938c-05951c938a9e_1085x543.png)






#  Quantized Models from the Hugging Face Community

The Hugging Face community provides quantized models, which allow us to efficiently and effectively utilize the model on the T4 GPU. It is important to consult reliable sources before using any model.

There are several variations available, but the ones that interest us are based on the GGML library.

We can see the different variations that Llama-2-13B-GGML has [here](https://huggingface.co/models?search=llama%202%20ggml).



In this case, we will use the model called [Llama-2-13B-chat-GGML](https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML).

The 'q5_1' suffix in the file name identifies the quantization method used. As a rule of thumb, 'q8' variants yield better responses at the cost of higher memory usage and slower inference, whereas 'q2' variants require less RAM and run faster but may generate poorer responses.

There are other quantization methods available, and you can read about them in the [model card](https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML)
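To get a feel for the memory trade-off, the file size can be approximated as parameters × bits per weight / 8. The sketch below is a back-of-the-envelope estimate only: the bits-per-weight figures are assumptions (quantized values plus per-block scale data) and are not taken from the model card, and `approx_model_size_gb` is a hypothetical helper name.

```python
def approx_model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough model file size in GB (1 GB = 1e9 bytes), ignoring metadata."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumed effective bits per weight; illustrative values, not official figures.
ASSUMED_BITS = {"q4_0": 4.5, "q5_1": 6.0, "q8_0": 8.5}

for name, bits in ASSUMED_BITS.items():
    print(f"{name}: ~{approx_model_size_gb(13e9, bits):.2f} GB")
```

Under these assumptions a 13B model at about 6 bits per weight comes to roughly 9.75 GB, which is in line with the 9.76G reported by the download progress bar for the q5_1 file.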

In [3]:
model_name_or_path = "TheBloke/Llama-2-13B-chat-GGML"
model_basename = "llama-2-13b-chat.ggmlv3.q5_1.bin" # the model is in bin format

Next, we download the model file from the Hugging Face Hub.

In [4]:
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(repo_id=model_name_or_path, filename=model_basename)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


llama-2-13b-chat.ggmlv3.q5_1.bin:   0%|          | 0.00/9.76G [00:00<?, ?B/s]

# Inference with llama-cpp-python

Setting up the interface

In [5]:
# GPU
from llama_cpp import Llama
lcpp_llm = None
lcpp_llm = Llama(
    model_path=model_path,
    n_threads=2, # CPU cores
    n_batch=512, # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.
    n_gpu_layers=43, # Change this value based on your model and your GPU VRAM pool.
    n_ctx=4096, # Context window
)

AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 



To run on the CPU instead, use:
```
# CPU
from llama_cpp import Llama

lcpp_llm = Llama(
    model_path=model_path,
    n_threads=2, # CPU cores
    )
```
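The two configurations can also be kept in a single place. The following is a minimal sketch, with `llama_kwargs` as a hypothetical helper name. Note one deliberate difference from the CPU cell above: the sketch passes `n_ctx=4096` in both modes, on the assumption that the long prompts later in the lab need the larger context window on the CPU as well.

```python
def llama_kwargs(model_path: str, use_gpu: bool) -> dict:
    """Collect the constructor arguments used in this lab for Llama(...)."""
    kwargs = {
        "model_path": model_path,
        "n_threads": 2,   # CPU cores
        "n_ctx": 4096,    # Context window (assumption: also set on CPU)
    }
    if use_gpu:
        kwargs["n_batch"] = 512      # Between 1 and n_ctx
        kwargs["n_gpu_layers"] = 43  # Depends on the model and GPU VRAM
    return kwargs


# With the library installed and the model downloaded:
# lcpp_llm = Llama(**llama_kwargs(model_path, use_gpu=True))
print(llama_kwargs("model.bin", use_gpu=False))
```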



In [6]:
# See the number of layers in GPU
lcpp_llm.params.n_gpu_layers

43

# First example of prompt use: generating code

A zero-shot prompt asking Llama2 to write linear regression code in Python:

In [7]:
prompt = "Write a linear regression in python"
prompt_template=f'''SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully.

USER: {prompt}

ASSISTANT:
'''
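Since the same SYSTEM / USER / ASSISTANT layout is reused for every task in this lab, it can be convenient to wrap it in a function. The sketch below reproduces the template above; `build_prompt` and `DEFAULT_SYSTEM` are hypothetical names introduced for illustration.

```python
from typing import Optional

DEFAULT_SYSTEM = (
    "You are a helpful, respectful and honest assistant. "
    "Always answer as helpfully."
)


def build_prompt(user_message: str,
                 system_message: Optional[str] = DEFAULT_SYSTEM) -> str:
    """Fill the SYSTEM / USER / ASSISTANT template used in this lab."""
    parts = []
    if system_message:
        parts.append(f"SYSTEM: {system_message}\n")
    parts.append(f"USER: {user_message}\n")
    parts.append("ASSISTANT:\n")
    # Joining with a newline leaves one blank line between the sections.
    return "\n".join(parts)


print(build_prompt("Write a linear regression in python"))
```

Passing `system_message=None` gives a template without the SYSTEM line, similar to the one used in the tasks later on.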

Generating response

If you only use CPU, the response can take a long time. You can reduce the max_tokens to get a faster response.

In [8]:
response = lcpp_llm(
    prompt=prompt_template,
    max_tokens=256,
    temperature=0.5,
    top_p=0.95,
    repeat_penalty=1.2,
    top_k=50,
    stop = ['USER:'], # Dynamic stopping when such token is detected.
    echo=True # return the prompt
)

print(response["choices"][0]["text"])

SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully.

USER: Write a linear regression in python

ASSISTANT:

To write a linear regression in Python, you can use the scikit-learn library. Here is an example of how to do this:
```
from sklearn.linear_model import LinearRegression
import pandas as pd

# Load your dataset into a Pandas DataFrame
df = pd.read_csv('your_data.csv')

# Create a linear regression object and fit the data
reg = LinearRegression()
reg.fit(df[['x1', 'x2']], df['y'])

# Print the coefficients
print(reg.coef_)
```
This code will load your dataset into a Pandas DataFrame, create a linear regression object, and fit the data to the model using the `fit()` method. The `coef_` attribute of the `LinearRegression` object contains the estimated coefficients of the linear regression.

You can also use the `predict()` method to make predictions on new data:
```
# Create a new DataFrame with predicted values
preds = reg.predict(df[['x1', 'x2']

# Second example (also zero-shot): a natural language generation task

A zero-shot prompt asking Llama2 to write a story:




In [9]:
prompt_nlg = "Write a story about a bear called Paddington"
prompt_template_nlg=f'''SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully.

USER: {prompt_nlg}

ASSISTANT:
'''

In [10]:
response_nlg = lcpp_llm(
    prompt=prompt_template_nlg,
    max_tokens=256,
    temperature=0.5,
    top_p=0.95,
    repeat_penalty=1.2,
    top_k=50,
    stop = ['USER:'], # Dynamic stopping when such token is detected.
    echo=True # return the prompt
)

print(response_nlg["choices"][0]["text"])

Llama.generate: prefix-match hit


SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully.

USER: Write a story about a bear called Paddington

ASSISTANT:

Once upon a time in Peru, there was a little bear named Paddington. He lived with his Aunt Lucy in the heart of the forest. Paddington loved to explore and play in the trees, but he always made sure to be back home for tea time. One day, a kind old man named Mr. Brown found Paddington lost in London. He took him home to his family and they all fell in love with the little bear's charming ways. From then on, Paddington lived with the Browns and had many exciting adventures with them. Despite being far from his forest home, Paddington always remained a curious and loving bear.

Would you like me to add anything else?
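Both examples above call the model with identical sampling settings, so one option is to keep those settings in a single dictionary and override only what changes. This is a sketch rather than part of the library: `generate` and `DEFAULT_SAMPLING` are hypothetical names, and `fake_llm` is a stand-in so the snippet runs without a model loaded.

```python
from typing import Any, Callable, Dict

# The sampling settings used in the cells of this lab.
DEFAULT_SAMPLING: Dict[str, Any] = {
    "max_tokens": 256,
    "temperature": 0.5,
    "top_p": 0.95,
    "repeat_penalty": 1.2,
    "top_k": 50,
    "stop": ["USER:"],
    "echo": True,
}


def generate(llm: Callable[..., Dict[str, Any]], prompt: str,
             **overrides: Any) -> str:
    """Call the model with the lab's defaults, allowing per-call overrides."""
    params = {**DEFAULT_SAMPLING, **overrides}
    response = llm(prompt=prompt, **params)
    return response["choices"][0]["text"]


def fake_llm(prompt: str, **params: Any) -> Dict[str, Any]:
    """Stand-in that mimics the response shape; not a real model."""
    return {"choices": [{"text": prompt + "(generated text)"}]}


print(generate(fake_llm, "USER: hello\n\nASSISTANT:\n", max_tokens=64))
```

With the real model, `generate(lcpp_llm, prompt_template_nlg, temperature=0.9)` would be one way to compare outputs at different temperatures.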


# Task 1: natural language generation with zero-shot prompting  



Task 1: Write a prompt to get Llama2 to generate a recipe for spaghetti al pomodoro. Experiment with different prompts.

Write your prompt below, and call it ```prompt_task1```

In [11]:
prompt_task1 ="Write a recipe for spaghetti al pomodoro" # YOUR PROMPT HERE
prompt_template_task1 = f'''

USER: {prompt_task1}

ASSISTANT:
'''

In [12]:
response_task1 = lcpp_llm(
    prompt=prompt_template_task1,
    max_tokens=256,
    temperature=0.5,
    top_p=0.95,
    repeat_penalty=1.2,
    top_k=50,
    stop = ['USER:'], # Dynamic stopping when such token is detected.
    echo=True # return the prompt
)

print(response_task1["choices"][0]["text"])

Llama.generate: prefix-match hit




USER: Write a recipe for spaghetti al pomodoro

ASSISTANT:

Spaghetti al Pomodoro is a classic Italian dish that originated in the Tuscany region. It's made with fresh tomatoes, garlic, olive oil, and of course, spaghetti! Here's my recipe for this delicious pasta dish:

Ingredients:

* 12 oz (340g) spaghetti
* 2 large ripe tomatoes, peeled and chopped
* 3 cloves garlic, minced
* 6 tbsp (84g) extra-virgin olive oil
* Salt and freshly ground black pepper to taste
* Fresh basil leaves for garnish (optional)

Instructions:

1. Bring a large pot of salted water to a boil. Cook the spaghetti according to package instructions until al dente. Reserve 1 cup of pasta cooking water before draining the spaghetti.
2. In a blender or food processor, combine the tomatoes, garlic, and olive oil. Blend until smooth, adding some reserved pasta cooking water if needed to achieve


# Task 2: summarization with zero-shot prompting  



Task 2: Write a prompt to get Llama2 to produce a summary of the following Wikipedia article on LLaMA.

In [13]:
task2_article = f'''
LLaMA

LLaMA (Large Language Model Meta AI) is a family of autoregressive large language models (LLMs), released by Meta AI starting in February 2023.

For the first version of LLaMA, four model sizes were trained: 7, 13, 33, and 65 billion parameters.
LLaMA's developers reported that the 13B parameter model's performance on most NLP benchmarks exceeded that of the much larger GPT-3 (with 175B parameters) and that the largest model was competitive with state of the art models such as PaLM and Chinchilla.[1] Whereas the most powerful LLMs have generally been accessible only through limited APIs (if at all), Meta released LLaMA's model weights to the research community under a noncommercial license.[2] Within a week of LLaMA's release, its weights were leaked to the public on 4chan via BitTorrent.[3]

In July 2023, Meta released several models as Llama 2, using 7, 13 and 70 billion parameters.

LLaMA-2

On July 18, 2023, in partnership with Microsoft, Meta announced LLaMA-2, the next generation of LLaMA.
Meta trained and released LLaMA-2 in three model sizes: 7, 13, and 70 billion parameters.[4]
The model architecture remains largely unchanged from that of LLaMA-1 models, but 40% more data was used to train the foundational models.[5]
The accompanying preprint[5] also mentions a model with 34B parameters that might be released in the future upon satisfying safety targets.

LLaMA-2 includes both foundational models and models fine-tuned for dialog, called LLaMA-2 Chat.
In further departure from LLaMA-1, all models are released with weights, and are free for many commercial use cases.
However, due to some remaining restrictions, the description of LLaMA as open source has been disputed by the Open Source Initiative
(known for maintaining the Open Source Definition).[6]

Architecture

LLaMA uses the transformer architecture, the standard architecture for language modeling since 2018.

There are minor architectural differences. Compared to GPT-3, LLaMA

- uses SwiGLU[7] activation function instead of GeLU;
- uses rotary positional embeddings[8] instead of absolute positional embedding;
- uses root-mean-squared layer-normalization[9] instead of standard layer-normalization.[10]
- increases context length from 2K tokens (Llama 1) to 4K tokens (Llama 2).

Training datasets

LLaMA's developers focused their effort on scaling the model's performance by increasing the volume of training data, rather than the number of parameters, reasoning that the dominating cost for LLMs is from doing inference on the trained model rather than the computational cost of the training process.

LLaMA 1 foundational models were trained on a data set with 1.4 trillion tokens, drawn from publicly available data sources, including:[1]

-     Webpages scraped by CommonCrawl
-     Open source repositories of source code from GitHub
-     Wikipedia in 20 different languages
-     Public domain books from Project Gutenberg
-     The LaTeX source code for scientific papers uploaded to ArXiv
-     Questions and answers from Stack Exchange websites

Llama 2 foundational models were trained on a data set with 2 trillion tokens. This data set was curated to remove Web sites that often disclose personal data of people. It also upsamples sources considered trustworthy.[5] Llama 2 - Chat was additionally fine-tuned on 27,540 prompt-response pairs created for this project, which performed better than larger but lower-quality third-party datasets. For AI alignment, reinforcement learning with human feedback (RLHF) was used with a combination of 1,418,091 Meta examples and seven smaller datasets. The average dialog depth was 3.9 in the Meta examples, 3.0 for Anthropic Helpful and Anthropic Harmless sets, and 1.0 for five other sets, including OpenAI Summarize, StackExchange, etc.
'''
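The article is fairly long, so it may be worth checking that the prompt plus the requested `max_tokens` fits inside the 4096-token context window (`n_ctx`) set earlier. The sketch below uses a crude characters-per-token rule of thumb, which is an assumption and not how the tokenizer actually counts; both function names are hypothetical.

```python
def rough_token_estimate(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token count, assuming about 4 characters per token."""
    return int(len(text) / chars_per_token) + 1


def fits_in_context(n_prompt_tokens: int, max_tokens: int = 256,
                    n_ctx: int = 4096) -> bool:
    """True if the prompt and the generated tokens fit in the window."""
    return n_prompt_tokens + max_tokens <= n_ctx


sample_text = "word " * 1000  # placeholder text, 5000 characters
n_tokens = rough_token_estimate(sample_text)
print(n_tokens, fits_in_context(n_tokens))
```

With the model loaded, `len(lcpp_llm.tokenize(task2_article.encode("utf-8")))` should give an exact count in place of the estimate.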

Write your prompt below, and call it ```prompt_task2```

In [14]:
prompt_task2 = f'''Summarize the Wikipedia article you find in the following lines, beginning with ARTICLE-BEGIN and ending with ARTICLE-END
ARTICLE-BEGIN

{task2_article}

ARTICLE-END
'''# YOUR PROMPT HERE

prompt_template_task2 = f'''

USER: {prompt_task2}

ASSISTANT:
'''

In [15]:
response_task2 = lcpp_llm(
    prompt=prompt_template_task2,
    max_tokens=256,
    temperature=0.5,
    top_p=0.95,
    repeat_penalty=1.2,
    top_k=50,
    stop = ['USER:'], # Dynamic stopping when such token is detected.
    echo=True # return the prompt
)

print(response_task2["choices"][0]["text"])

Llama.generate: prefix-match hit




USER: Summarize the Wikipedia article you find in the following lines, beginning with ARTICLE-BEGIN and ending with ARTICLE-END
ARTICLE-BEGIN


LLaMA

LLaMA (Large Language Model Meta AI) is a family of autoregressive large language models (LLMs), released by Meta AI starting in February 2023.

For the first version of LLaMA, four model sizes were trained: 7, 13, 33, and 65 billion parameters.
LLaMA's developers reported that the 13B parameter model's performance on most NLP benchmarks exceeded that of the much larger GPT-3 (with 175B parameters) and that the largest model was competitive with state of the art models such as PaLM and Chinchilla.[1] Whereas the most powerful LLMs have generally been accessible only through limited APIs (if at all), Meta released LLaMA's model weights to the research community under a noncommercial license.[2] Within a week of LLaMA's release, its weights were leaked to the public on 4chan via BitTorrent.[3]

In July 2023, Meta released several models 
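Because the call uses `echo=True`, the returned text starts with the whole prompt, and for a prompt this long the printed output is dominated by the echoed article. One option is to set `echo=False`; another is to strip the prompt afterwards, as in the sketch below (`completion_only` is a hypothetical helper name).

```python
def completion_only(full_text: str, prompt: str) -> str:
    """Return only the text generated after the echoed prompt."""
    if full_text.startswith(prompt):
        return full_text[len(prompt):].strip()
    # Fallback: keep whatever follows the last ASSISTANT: marker.
    _, found, tail = full_text.rpartition("ASSISTANT:")
    return tail.strip() if found else full_text.strip()


demo_prompt = "USER: Summarize the article.\n\nASSISTANT:\n"
demo_output = demo_prompt + "LLaMA is a family of language models."
print(completion_only(demo_output, demo_prompt))
```

With the real response, `completion_only(response_task2["choices"][0]["text"], prompt_template_task2)` would show just the summary.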

# Task 3: machine translation with zero-shot prompting  



Task 3: Write a prompt to get Llama2 to translate the Wikipedia article above into French if you are a native English speaker, or into your native language otherwise.

Write your prompt below, and call it ```prompt_task3```

In [16]:
prompt_task3 = f'''Translate to Italian the Wikipedia article you find in the following lines, beginning with ARTICLE-BEGIN and ending with ARTICLE-END
ARTICLE-BEGIN

{task2_article}

ARTICLE-END
'''# YOUR PROMPT HERE
prompt_template_task3 = f'''

USER: {prompt_task3}

ASSISTANT:
'''

In [17]:
response_task3 = lcpp_llm(
    prompt=prompt_template_task3,
    max_tokens=256,
    temperature=0.5,
    top_p=0.95,
    repeat_penalty=1.2,
    top_k=50,
    stop = ['USER:'], # Dynamic stopping when such token is detected.
    echo=True # return the prompt
)

print(response_task3["choices"][0]["text"])

Llama.generate: prefix-match hit




USER: Translate to Italian the Wikipedia article you find in the following lines, beginning with ARTICLE-BEGIN and ending with ARTICLE-END
ARTICLE-BEGIN


LLaMA

LLaMA (Large Language Model Meta AI) is a family of autoregressive large language models (LLMs), released by Meta AI starting in February 2023.

For the first version of LLaMA, four model sizes were trained: 7, 13, 33, and 65 billion parameters.
LLaMA's developers reported that the 13B parameter model's performance on most NLP benchmarks exceeded that of the much larger GPT-3 (with 175B parameters) and that the largest model was competitive with state of the art models such as PaLM and Chinchilla.[1] Whereas the most powerful LLMs have generally been accessible only through limited APIs (if at all), Meta released LLaMA's model weights to the research community under a noncommercial license.[2] Within a week of LLaMA's release, its weights were leaked to the public on 4chan via BitTorrent.[3]

In July 2023, Meta released seve

# Task 4: named entity recognition, one and few-shot prompting, JSON output



Task 4: Write a prompt to get Llama2 to tag named entities in the following sentence as 'ORG' if organization, 'DATE' if date, 'NUM' if number, and 'MODEL' if an AI model, and to output the result in JSON format. Use a few examples to show Llama2 what you want.

In [18]:
task4_sentence = "On July 18, 2023, in partnership with Microsoft, Meta announced LLaMA-2, the next generation of LLaMA. Meta trained and released LLaMA-2 in three model sizes: 7, 13, and 70 billion parameters."

Write your prompt below, and call it ```prompt_task4```

In [19]:
prompt_task4 = f'''Identify the named entities in the sentence in the following lines (between with SENTENCE-BEGIN and SENTENCE-END), and tag them as
ORG if organization (for instance, Apple, IBM)
DATE if date (for instance, 8/4/2024)
NUM if number (for instance, 65)
MODEL if an AI model (for instance, ChatGPT)
Output the result in JSON format.

SENTENCE-BEGIN
{task4_sentence}
SENTENCE-END
'''# YOUR PROMPT HERE
prompt_template_task4 = f'''

USER: {prompt_task4}

ASSISTANT:
'''

In [20]:
response_task4 = lcpp_llm(
    prompt=prompt_template_task4,
    max_tokens=256,
    temperature=0.5,
    top_p=0.95,
    repeat_penalty=1.2,
    top_k=50,
    stop = ['USER:'], # Dynamic stopping when such token is detected.
    echo=True # return the prompt
)

print(response_task4["choices"][0]["text"])

Llama.generate: prefix-match hit




USER: Identify the named entities in the sentence in the following lines (between with SENTENCE-BEGIN and SENTENCE-END), and tag them as
ORG if organization (for instance, Apple, IBM)
DATE if date (for instance, 8/4/2024)
NUM if number (for instance, 65)
MODEL if an AI model (for instance, ChatGPT)
Output the result in JSON format.

SENTENCE-BEGIN
On July 18, 2023, in partnership with Microsoft, Meta announced LLaMA-2, the next generation of LLaMA. Meta trained and released LLaMA-2 in three model sizes: 7, 13, and 70 billion parameters.
SENTENCE-END


ASSISTANT:
{  
"entities": [
{"label": "ORG", "value": "Meta"},
{"label": "DATE", "value": "July 18, 2023"},
{"label": "NUM", "value": "7, 13, and 70 billion parameters"}
]
}


How did Llama2 do at the task? Try to experiment with both one-shot and few-shot prompts.
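For the one-shot and few-shot variants, the prompt can be assembled from a list of worked examples so that adding or removing an example is a one-line change. The sketch below is one possible layout, not a prescribed format: `few_shot_prompt` is a hypothetical helper, and the example sentence and its annotation are invented for illustration.

```python
import json
from typing import List, Tuple


def few_shot_prompt(instruction: str,
                    examples: List[Tuple[str, dict]],
                    query: str) -> str:
    """Build a prompt with worked SENTENCE/JSON pairs before the query."""
    lines = [instruction, ""]
    for sentence, entities in examples:
        lines.append(f"SENTENCE: {sentence}")
        lines.append(f"JSON: {json.dumps(entities)}")
        lines.append("")
    lines.append(f"SENTENCE: {query}")
    lines.append("JSON:")
    return "\n".join(lines)


instruction = (
    "Tag the named entities in each sentence as ORG, DATE, NUM or MODEL "
    "and output the result in JSON format."
)
# Invented example sentence and annotation, for illustration only.
examples = [
    ("Apple and IBM met on 8/4/2024 to discuss ChatGPT.",
     {"entities": [
         {"label": "ORG", "value": "Apple"},
         {"label": "ORG", "value": "IBM"},
         {"label": "DATE", "value": "8/4/2024"},
         {"label": "MODEL", "value": "ChatGPT"},
     ]}),
]
prompt_few_shot = few_shot_prompt(instruction, examples, "Meta announced LLaMA-2.")
print(prompt_few_shot)
```

A list with a single pair gives a one-shot prompt; adding more pairs gives a few-shot prompt. The result can be placed in the USER slot of the template in the same way as `prompt_task4`.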

# Task 5: dialogue act tagging, one and few-shot prompting



Task 5: Write a prompt to get Llama2 to tag dialogue acts in the following conversation, using your favourite dialogue act tagset, and to output the result in JSON format. Use a few examples to show Llama2 what you want.

In [21]:
task5_conversation = f'''
A: . . . I need to travel in May.
B: And, what day in May did you want to travel?
A: OK uh I need to be there for a meeting that’s from the 12th to the 15th.
B: And you’re flying into what city?
A: Seattle.
B: And what time would you like to leave Pittsburgh?
A: Uh hmm I don’t think there’s many options for non-stop.
B: Right. There’s three non-stops today.
'''

Write your prompt below, and call it ```prompt_task5```

In [22]:
prompt_task5 =f'''Identify the dialogue acts in the conversation below between A and B (starting from CONVERSATION-BEGIN and ending with CONVERSATION-END), using the following tags:
QW if the utterance is a wh-question (for instance, where are you going?)
SD if the utterance is a statement (for instance, I am going to London)
ACK if the utterance is an acknowledgment (for instance, OK)
Output the result in JSON format.

CONVERSATION-BEGIN

{task5_conversation}

CONVERSATION-END
'''# YOUR PROMPT HERE
prompt_template_task5 = f'''

USER: {prompt_task5}

ASSISTANT:
'''

In [23]:
response_task5 = lcpp_llm(
    prompt=prompt_template_task5,
    max_tokens=256,
    temperature=0.5,
    top_p=0.95,
    repeat_penalty=1.2,
    top_k=50,
    stop = ['USER:'], # Dynamic stopping when such token is detected.
    echo=True # return the prompt
)

print(response_task5["choices"][0]["text"])

Llama.generate: prefix-match hit




USER: Identify the dialogue acts in the conversation below between A and B (starting from CONVERSATION-BEGIN and ending with CONVERSATION-END), using the following tags:
QW if the utterance is a wh-question (for instance, where are you going?)
SD if the utterance is a statement (for instance, I am going to London)
ACK if the utterance is an acknowledgment (for instance, OK)
Output the result in JSON format.

CONVERSATION-BEGIN


A: . . . I need to travel in May.
B: And, what day in May did you want to travel?
A: OK uh I need to be there for a meeting that’s from the 12th to the 15th.
B: And you’re flying into what city?
A: Seattle.
B: And what time would you like to leave Pittsburgh?
A: Uh hmm I don’t think there’s many options for non-stop.
B: Right. There’s three non-stops today.


CONVERSATION-END


ASSISTANT:

{
"acts": [
{
"type": "QW",
"utterance": "And, what day in May did you want to travel?"
},
{
"type": "SD",
"utterance": "OK uh I need to be there for a meeting that’s from 

How did llama2 do?
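One way to check the answer programmatically is to try to parse the JSON in the model's reply. The sketch below is a simple heuristic under stated assumptions: it takes the span from the first `{` to the last `}` and should be applied to the generated text only (not the echoed prompt, which could contain braces). `extract_json` is a hypothetical helper name.

```python
import json
from typing import Any, Optional


def extract_json(text: str) -> Optional[Any]:
    """Parse the first {...} span in text; return None if it is not valid JSON."""
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None


complete = 'ASSISTANT:\n{"entities": [{"label": "ORG", "value": "Meta"}]}'
truncated = '{"acts": [{"type": "QW", "utterance": "And, what day'
print(extract_json(complete))
print(extract_json(truncated))
```

A `None` result is a hint that the output was cut off, for instance because `max_tokens` was reached before the JSON was closed, or that the model did not follow the requested format.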