# Large Language Model - Solutions

In this tutorial we work with the Llama large language model (LLM): we prompt the model with questions and read back its answers.

Now, it's time to set up your own Large Language Model. 

To download a model, see: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF
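
The download can also be scripted. The following is a minimal sketch using only the standard library. It assumes Hugging Face's `resolve/main` direct-download URL pattern and the file name `llama-2-7b-chat.Q8_0.gguf`; `gguf_url` and `download_model` are hypothetical helper names, and the actual download is left commented out because the file is several gigabytes.

```python
from pathlib import Path
from urllib.request import urlretrieve

REPO_ID = "TheBloke/Llama-2-7B-Chat-GGUF"


def gguf_url(filename, repo_id=REPO_ID):
    """Build the direct download URL for a file in a Hugging Face model repo."""
    return f"https://huggingface.co/{repo_id}/resolve/main/{filename}"


def download_model(filename, target_dir="Models"):
    """Download the file into target_dir unless it is already there."""
    target = Path(target_dir) / filename
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(gguf_url(filename), target)
    return target


# Example (commented out because it starts a multi-gigabyte download):
# llm_path = download_model("llama-2-7b-chat.Q8_0.gguf")
```

Skipping files that already exist keeps repeated notebook runs from downloading the model again.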

For this task, make sure your C compiler works. If you use Windows, install Visual Studio Community with the “Desktop development with C++” workload. If you use macOS, install the Xcode Command Line Tools with `xcode-select --install`. If you use a Debian- or Ubuntu-based Linux, use `sudo apt install build-essential python3-dev`.
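
A quick way to check whether a compiler is reachable from Python is to look for it on the `PATH`. This is only a sketch: the list of executable names is an assumption, `find_c_compilers` is a hypothetical helper, and on Windows `cl` is usually visible only from a Developer Command Prompt.

```python
import shutil

# Common compiler executables; cl is MSVC, the others are typical on macOS/Linux.
CANDIDATES = ("cc", "gcc", "clang", "cl")


def find_c_compilers(candidates=CANDIDATES):
    """Return a dict mapping each compiler name found on PATH to its location."""
    found = {}
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            found[name] = path
    return found


compilers = find_c_compilers()
if compilers:
    print("Found:", ", ".join(f"{n} ({p})" for n, p in compilers.items()))
else:
    print("No C compiler found on PATH; install one before running pip install.")
```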

## Packages

In [1]:
InstallPackages = False
if InstallPackages:
    !pip install llama-cpp-python

In [2]:
from llama_cpp import Llama

## Seed

In [3]:
seed = 42

## Exercise 1 - Load the Large Language Model

First, set the path to your model file. The Q8_0 quantization is used here:

In [4]:
llm_path = 'Models/llama-2-7b-chat.Q8_0.gguf'

Second, load the Llama model from that path:

In [5]:
llm = Llama(
    model_path=llm_path,
    # n_gpu_layers=-1,  # Uncomment to use GPU acceleration
    # seed=seed,  # Uncomment to set a specific seed
    # n_ctx=2048,  # Uncomment to increase the context window
)

llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from Models/llama-2-7b-chat.Q8_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32           
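
If loading fails instead of printing metadata like the lines above, a wrong or incomplete model file is a frequent cause. The sketch below checks the file before it is handed to `Llama`. It relies on GGUF files beginning with the four ASCII bytes `GGUF`; `check_model_file` is a hypothetical helper, not part of llama-cpp-python.

```python
from pathlib import Path


def check_model_file(path):
    """Return the path if it points to a GGUF file, else raise a clear error."""
    p = Path(path)
    if not p.is_file():
        raise FileNotFoundError(f"Model file not found: {p.resolve()}")
    with p.open("rb") as f:
        magic = f.read(4)  # GGUF files begin with the ASCII bytes "GGUF"
    if magic != b"GGUF":
        raise ValueError(f"{p} does not look like a GGUF file (magic={magic!r})")
    return str(p)


# Usage, assuming the model path variable from the loading step:
# llm = Llama(model_path=check_model_file(llm_path))
```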

## Exercise 2 - Write a function to query with your Large Language Model

In [6]:
def query(model, question):
    """
    Ask the large language model a question and return its answer.

    model: a loaded Llama instance.
    question: the question as a plain string.
    Returns the generated answer text without the echoed prompt.
    """
    prompt = f"Q: {question} A:"

    output = model(
        prompt,  # Prompt
        max_tokens=32,  # Generate up to 32 tokens, set to None to generate up to the end of the context window
        stop=["Q:", "\n"],  # Stop at a newline or just before the model would generate a new question
        echo=True,  # Echo the prompt back in the output
    )  # Generate a completion, can also call create_completion

    # The echoed text is "Q: <question> A: <answer>"; keep only what follows "A: "
    answer = output["choices"][0]["text"].partition('A: ')[2]

    return answer
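
To see what `query` does without loading a multi-gigabyte model, the sketch below runs the same prompt-building and answer-extraction steps against a stand-in. `fake_model` is hypothetical: it only mimics the shape of the completion dictionary (`choices[0]["text"]`) that the function reads, with the prompt echoed back, and its canned answer is invented for illustration.

```python
def fake_model(prompt, max_tokens=32, stop=None, echo=True):
    """Stand-in for the Llama object: returns a canned completion dict."""
    completion = " Paris is the capital of France."
    text = prompt + completion if echo else completion
    return {"choices": [{"text": text}]}


def query(model, question):
    """Build the prompt, call the model, and keep only the answer part."""
    prompt = f"Q: {question} A:"
    output = model(prompt, max_tokens=32, stop=["Q:", "\n"], echo=True)
    # The echoed text is "Q: ... A: <answer>"; keep what follows "A: "
    return output["choices"][0]["text"].partition("A: ")[2]


print(query(fake_model, "What is the capital of France?"))
```

Because the prompt is echoed, the returned text always contains the `A:` marker, and splitting on it drops the question from the result.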
    

## Exercise 3 - Ask your Large Language Model 'Who is the chancellor of Germany?'

In [8]:
query(model=llm, question='Who is the chancellor of Germany?')

Llama.generate: prefix-match hit

llama_print_timings:        load time =     996.53 ms
llama_print_timings:      sample time =       0.94 ms /    32 runs   (    0.03 ms per token, 34078.81 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (-nan(ind) ms per token, -nan(ind) tokens per second)
llama_print_timings:        eval time =    8323.92 ms /    32 runs   (  260.12 ms per token,     3.84 tokens per second)
llama_print_timings:       total time =    8337.50 ms /    32 tokens


'Angela Merkel has been the Chancellor of Germany since 2005, making her the longest-serving holder of the office in German'
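
The answer above stops mid-sentence because generation reached the `max_tokens=32` limit (the timings report 32 runs). The sketch below shows one way to detect this. It assumes the completion dictionary carries an OpenAI-style `finish_reason` field per choice, `"length"` when the token limit was reached and `"stop"` otherwise; the dictionaries are hand-written stand-ins, not real model output, and `was_truncated` is a hypothetical helper.

```python
def was_truncated(output):
    """Return True if the first choice ended because the token limit was hit."""
    return output["choices"][0].get("finish_reason") == "length"


# Hand-written stand-ins shaped like a completion response
cut_off = {"choices": [{"text": "Q: ... A: Angela Merkel has been", "finish_reason": "length"}]}
complete = {"choices": [{"text": "Q: ... A: Berlin.", "finish_reason": "stop"}]}

print(was_truncated(cut_off))   # True
print(was_truncated(complete))  # False
```

When the check returns `True`, raising `max_tokens` in the completion call gives the model room to finish its sentence.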