### Prerequisites
1. [Download the Llama 2 7B base model files from Meta](https://github.com/ggerganov/llama.cpp/tree/master?tab=readme-ov-file#obtaining-and-using-the-facebook-llama-2-model) into `$YOUR_PATH_TO_LLAMA_MODELS_DIR`
2. [Download llama.cpp and build the binaries locally](https://github.com/ggerganov/llama.cpp/tree/master?tab=readme-ov-file#description) in `$YOUR_PATH_TO_LLAMA_CPP_REPO`
3. Install the necessary Python requirements:
```
$ pip install -r requirements.txt
```
4. Prepare and quantize base model:
```
# Copy the downloaded Llama models into the llama.cpp project dir
cd $YOUR_PATH_TO_LLAMA_MODELS_DIR
cp tokenizer.model tokenizer_checklist.chk $YOUR_PATH_TO_LLAMA_CPP_REPO/models/
cp -r llama-2-7b/ $YOUR_PATH_TO_LLAMA_CPP_REPO/models/llama-2-7b/

# Run conversion script to convert models
cd $YOUR_PATH_TO_LLAMA_CPP_REPO
python3 convert.py models/llama-2-7b/

# Quantize the converted model
./quantize ./models/llama-2-7b/ggml-model-f16.bin models/llama-2-7b/ggml-model-q4_0.bin q4_0
```
5. Run the quantized model from the command line with the `./main` binary:
```
$ ./main -m models/llama-2-7b/ggml-model-q4_0.bin -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e
```
6. [Install the Python module that works with these quantized models](https://python.langchain.com/docs/integrations/text_embedding/llamacpp)
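
Before moving on, it can help to confirm that the steps above left the expected files in place. The following is a minimal sketch, assuming the file names used in the commands above (your llama.cpp version may write different names, such as `.gguf` files). `missing_artifacts` and `EXPECTED_ARTIFACTS` are hypothetical names used for illustration and are not part of llama.cpp.

```python
from pathlib import Path

# Files the steps above are expected to produce, relative to the llama.cpp
# checkout. These names mirror the commands in this README and are an
# assumption; adjust them if your llama.cpp version uses different names.
EXPECTED_ARTIFACTS = (
    "models/tokenizer.model",
    "models/llama-2-7b/ggml-model-f16.bin",
    "models/llama-2-7b/ggml-model-q4_0.bin",
)


def missing_artifacts(repo_dir):
    """Return the expected files that do not exist under repo_dir."""
    root = Path(repo_dir)
    return [rel for rel in EXPECTED_ARTIFACTS if not (root / rel).is_file()]
```

Calling `missing_artifacts` with the path stored in `$YOUR_PATH_TO_LLAMA_CPP_REPO` returns an empty list when every file is present, and otherwise lists the relative paths that still need to be created.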

### [Using Llama 2 models for text embedding with LangChain](https://python.langchain.com/docs/integrations/text_embedding/llamacpp)

In [1]:
from langchain_community.embeddings import LlamaCppEmbeddings

In [2]:
llama_model_path = '/Users/shivramamurthi/src/llama.cpp/models/llama-2-7b/ggml-model-q4_0.bin'

In [3]:
embeddings = LlamaCppEmbeddings(model_path=llama_model_path)

llama_model_loader: loaded meta data with 16 key-value pairs and 291 tensors from /Users/shivramamurthi/src/llama.cpp/models/llama-2-7b/ggml-model-q4_0.bin (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:        
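
The loader output above reports the file as `GGUF V3` even though its name ends in `.bin`, so the extension alone does not identify the format. As a rough sketch, the header can be inspected directly. This assumes the usual little-endian GGUF layout (four magic bytes `GGUF` followed by a 32-bit version number). `gguf_version` is a hypothetical helper, not part of llama.cpp or LangChain.

```python
import struct

GGUF_MAGIC = b"GGUF"  # first four bytes of a GGUF file


def gguf_version(path):
    """Return the GGUF version from the file header, or None if the file
    does not start with the GGUF magic bytes."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != GGUF_MAGIC:
        return None
    # Assumption: the version is a little-endian unsigned 32-bit integer
    # stored immediately after the magic bytes.
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

For the model loaded above, `gguf_version(llama_model_path)` would be expected to return `3`, matching the version reported in the log. A return value of `None` suggests the file is in an older format and may need to be converted again.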

In [4]:
text = "This is a test document."

In [7]:
query_result = embeddings.embed_query(text)
len(query_result)


llama_print_timings:        load time =    5363.54 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =     325.50 ms /     7 tokens (   46.50 ms per token,    21.51 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =     325.65 ms /     8 tokens


4096
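
The per-token figures in the `prompt eval time` line above follow directly from the total time and the token count. The snippet below is a small worked example of that arithmetic using the values from the log. `per_token_stats` is a hypothetical helper name introduced here.

```python
def per_token_stats(total_ms, n_tokens):
    """Return (milliseconds per token, tokens per second)."""
    ms_per_token = total_ms / n_tokens
    tokens_per_second = 1000.0 / ms_per_token
    return ms_per_token, tokens_per_second


# Values taken from the "prompt eval time" line above: 325.50 ms for 7 tokens.
ms_per_token, tokens_per_second = per_token_stats(325.50, 7)
print(f"{ms_per_token:.2f} ms per token, {tokens_per_second:.2f} tokens per second")
# prints "46.50 ms per token, 21.51 tokens per second"
```

These match the `46.50 ms per token` and `21.51 tokens per second` reported by llama.cpp.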

In [10]:
query_result

[-0.0010078781130065527,
 0.00024214071419637848,
 0.0005993970823485392,
 0.002725355812429061,
 0.002868373110648672,
 -0.00048120095573838073,
 -0.0012058562559610821,
 0.0021092211957514655,
 8.724051103464703e-06,
 0.001148868367779727,
 0.0007651822603766045,
 0.00023024828374956907,
 0.0014882935439384443,
 0.0005548892881825636,
 -0.001636383336677103,
 -0.00026641548493594024,
 0.0008033114190758866,
 0.001321978976542018,
 0.0004481664742708326,
 -0.0014962820309335548,
 0.0014700469432032423,
 -0.0004074473947065711,
 -0.0009581217178339948,
 -0.00415128340858302,
 0.0012120279003274006,
 0.000427277348970952,
 0.0007590536975487515,
 0.0007938861913037756,
 0.0001666076145831566,
 0.002269093075989904,
 0.0014509528344547192,
 -0.0012219624745666991,
 0.000746047495209331,
 0.0011655357162911506,
 0.0024685998289623997,
 0.0023946935923791247,
 -0.000515753276839279,
 0.0008707709028592987,
 0.0011244219484932828,
 -0.000728876863030007,
 0.0012007509509988326,
 -0.00098501

In [8]:
doc_result = embeddings.embed_documents([text])
len(doc_result)


llama_print_timings:        load time =    5363.54 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =     335.87 ms /     7 tokens (   47.98 ms per token,    20.84 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =     336.27 ms /     8 tokens


1
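
Embeddings are usually compared with cosine similarity. Since `embed_query` and `embed_documents` were both given the same text here, one would expect the two vectors to be nearly identical, though this is not verified here. Below is a minimal dependency-free sketch of that comparison. The stand-in vectors are made up for illustration, and `cosine_similarity` is a hypothetical helper rather than a LangChain function.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors of floats."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Stand-in vectors for illustration only; in the notebook you would pass
# query_result and doc_result[0] instead.
example_query = [0.1, -0.2, 0.3]
example_doc = [0.1, -0.2, 0.3]
similarity = cosine_similarity(example_query, example_doc)
```

In the notebook, `cosine_similarity(query_result, doc_result[0])` should come out very close to 1.0 if the two calls produce the same embedding for the same text.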

In [9]:
doc_result

[[-0.0010078781130065527,
  0.00024214071419637848,
  0.0005993970823485392,
  0.002725355812429061,
  0.002868373110648672,
  -0.00048120095573838073,
  -0.0012058562559610821,
  0.0021092211957514655,
  8.724051103464703e-06,
  0.001148868367779727,
  0.0007651822603766045,
  0.00023024828374956907,
  0.0014882935439384443,
  0.0005548892881825636,
  -0.001636383336677103,
  -0.00026641548493594024,
  0.0008033114190758866,
  0.001321978976542018,
  0.0004481664742708326,
  -0.0014962820309335548,
  0.0014700469432032423,
  -0.0004074473947065711,
  -0.0009581217178339948,
  -0.00415128340858302,
  0.0012120279003274006,
  0.000427277348970952,
  0.0007590536975487515,
  0.0007938861913037756,
  0.0001666076145831566,
  0.002269093075989904,
  0.0014509528344547192,
  -0.0012219624745666991,
  0.000746047495209331,
  0.0011655357162911506,
  0.0024685998289623997,
  0.0023946935923791247,
  -0.000515753276839279,
  0.0008707709028592987,
  0.0011244219484932828,
  -0.0007288768630300