# Llama 2 Embeddings Using llama.cpp

This notebook generates embeddings using [llama.cpp](https://github.com/ggerganov/llama.cpp/tree/master/examples/embedding) with a GGUF model. The result is the hidden-state embedding produced by a single pass of a prompt through the model. These embeddings can be used for classical downstream tasks such as classification and clustering.

This notebook uses the following model: [TheBloke/Llama-2-7B-Chat-GGUF](https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF).

Please follow the instructions on [llama.cpp](https://github.com/ggerganov/llama.cpp) to compile it for your system.

In [1]:
from pathlib import Path
import os
# Directory where the model weights are stored (read from the MODEL_DIRECTORY environment variable)
path_base_model = Path(os.environ['MODEL_DIRECTORY'])
model = 'Llama-2-7B-Chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf'
path_model = str(path_base_model / model)

# Project base directory (parent of the current working directory)
path_project = Path.cwd().parent

# Path to the compiled llama.cpp embedding executable
path_cpp_embedding = path_project / 'llama.cpp/embedding'

# llama.cpp Embedding Module

Refer to the [llama.cpp embedding example](https://github.com/ggerganov/llama.cpp/tree/master/examples/embedding) for how to call the embedding program.

In [2]:
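# Build the shell command: run the embedding executable on a prompt
# and redirect its standard output (the embedding values) to a text file
cmd = (
    f"{path_cpp_embedding} "
    f"--model {path_model} "
    f"--prompt 'Once upon a time.' "
    f"> llama2-hidden-state.txt"
)

# Execute the command in a shell; os.system returns 0 on success
response = os.system(cmd)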

main: build = 1606 (fbbc428)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1701652801
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from /nvme4tb/Projects/llm_models/Llama-2-7B-Chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: - tensor    0:                token_embd.weight q4_K     [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:           blk.0.attn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    2:            blk.0.ffn_down.weight q6_K     [ 11008,  4096,     1,     1 ]
llama_model_loader: - tensor    3:            blk.0.ffn_gate.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    4:              blk.0.ffn_up.weight q4_K     [  4096, 11008,     1,     1 ]
llama_model_loader: - tensor    5:            blk.0.ffn_norm.weight f32      [  4096,     1,     1,     1 ]
llama_model_loader: - tensor    6:       
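
The same call can also be made without a shell. The following is a minimal sketch, assuming the same `--model` and `--prompt` flags used above; the helper names `build_embedding_command` and `run_embedding` are hypothetical and not part of llama.cpp. Passing the arguments as a list avoids quoting problems when the prompt itself contains quotes, and `check=True` raises an error if the executable fails instead of silently leaving an empty output file.

```python
import subprocess
from pathlib import Path


def build_embedding_command(binary, model, prompt):
    """Return the embedding command as an argument list (no shell quoting needed)."""
    # Hypothetical helper; the flags mirror the shell command used in this notebook.
    return [str(binary), "--model", str(model), "--prompt", prompt]


def run_embedding(binary, model, prompt, output_file):
    """Run the embedding executable and write its standard output to output_file.

    Raises subprocess.CalledProcessError if the executable exits with a
    non-zero status.
    """
    command = build_embedding_command(binary, model, prompt)
    with open(output_file, "w") as out:
        # Only stdout is captured in the file; stderr is left untouched.
        subprocess.run(command, stdout=out, check=True)
    return Path(output_file)
```

With the variables defined earlier, a call might look like `run_embedding(path_cpp_embedding, path_model, "Once upon a time.", "llama2-hidden-state.txt")`.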

# Load Embeddings

In [3]:
# Read the text file written by the embedding command
with open("llama2-hidden-state.txt", "r") as file:
    contents = file.read()

# Convert the whitespace-separated values into floats
embeddings = [float(i) for i in contents.split()]
print(f'# of Embeddings: {len(embeddings):,}')
print(embeddings[0:5])

# of Embeddings: 4,096
[-0.042521, 0.204012, 1.051836, 0.020852, 0.195843]
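
When this loading step is reused across many prompts, it may help to check the vector length while parsing. The sketch below is one possible way to do that; `parse_embedding` and its `expected_dim` parameter are hypothetical names introduced for illustration, and the check assumes the output file contains only whitespace-separated numbers, as it does above.

```python
def parse_embedding(text, expected_dim=None):
    """Parse whitespace-separated floats, optionally checking the vector length."""
    values = [float(token) for token in text.split()]
    # Optional sanity check, e.g. expected_dim=4096 for the output shown above
    if expected_dim is not None and len(values) != expected_dim:
        raise ValueError(f"expected {expected_dim} values, found {len(values)}")
    return values
```

For the file produced above, `parse_embedding(contents, expected_dim=4096)` would return the same list as `embeddings`, and would raise a `ValueError` if the file were empty or truncated.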


The above shows how to generate embeddings for Llama 2, or any other GGUF model, using llama.cpp on CPU hardware alone and with low latency. For this prompt, the result is a single vector of 4,096 values.
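
The introduction mentions downstream tasks such as classification and clustering. Many of these start by comparing embedding vectors, and cosine similarity is one common choice for that. The following is a minimal, dependency-free sketch; `cosine_similarity` is a hypothetical helper written for illustration, and the notebook itself does not prescribe a particular similarity measure.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Given two embedding lists produced as above for two different prompts, `cosine_similarity(embeddings_a, embeddings_b)` returns a value between -1 and 1, where values closer to 1 indicate vectors pointing in more similar directions.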