# Parameterizing the Llama 2 Model

We've talked about some of the fundamentals of large language models -- Llama 2 specifically -- and how they are
nondeterministic probabilistic machines which predict token sequences based on their pretrained weights. If we want to
(and I don't think this is done in deployment very often) we can actually observe some of the inner workings of this
token selection in llama.cpp by configuring the model when we load it. This is going to be helpful in getting a better
understanding of how we can change the behavior of the model, so let's take a look.


In [2]:
# Import the llama.cpp python bindings and load the model. This time we are
# going to indicate to the model that we want to capture the log probabilities
# of tokens for analysis as well; we do this by setting logits_all=True.
# I'll also include the llama types just for code cleanliness and completeness
import os
from llama_cpp import Llama
from llama_cpp.llama_types import *

model: Llama = Llama(
    model_path=os.environ["LLAMA_13B"], verbose=False, logits_all=True
)

In [3]:
# When we send a prompt to the model we can pass in the logprobs parameter, which
# determines how many of the top candidate tokens we will see at each position.
# I'll look at the top three choices. We'll also set the temperature to 0, which
# means the model decodes greedily and always takes its single best choice
result: Completion = model.create_completion(
    prompt="The capital of Michigan is ", logprobs=3, max_tokens=16, temperature=0
)

# Take out the response content from the Completion object
item = result["choices"][0]

# Print out the response text
print(item["text"])

# The individual token probabilities are in the CompletionLogprobs object
details: CompletionLogprobs = item["logprobs"]

# Here we can look at the list of tokens, their log probabilities, and their
# offset in the prompt+response
print(details["tokens"])  # these will actually be decoded (text) copies of tokens!
print(
    details["token_logprobs"]
)  # the log probability of each chosen token
print(details["text_offset"])  # the character offset in the prompt+response

14th on the list of most dangerous cities in America.
The city
['1', '4', 'th', ' on', ' the', ' list', ' of', ' most', ' dangerous', ' cities', ' in', ' America', '.', '\n', 'The', ' city']
[-1.2581345, -1.9064457, -2.0102007, -1.4877319, -0.69384456, -0.30852577, -0.12820004, -1.8777747, -1.5640584, -0.2770181, -0.23983036, -1.0902431, -0.76887864, -1.1293881, -1.9391508, -2.3217533]
[27, 28, 29, 31, 34, 38, 43, 46, 51, 61, 68, 71, 79, 80, 81, 84]


So, this is interesting, even if it's not something you're likely to do in production very often. It lets you see not
only what the tokens were, but how likely they were to fit into the response. Each value is a log probability, so it is
never greater than zero, and the closer it is to zero the more likely the token. Underneath, the goal of the model is to
choose the token whose log probability is closest to zero as often as possible, but to do so while respecting the
`temperature` parameter. Now, that's a little bit of an oversimplification; the sampler considers another parameter
called `top_p` as well. This parameter limits which tokens are in the candidate set for us to consider. `temperature`
and `top_p` work differently, but together, to determine which token is chosen. For both parameters a higher number
gives more options, and thus less deterministic output.
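
To make those numbers a little easier to read, here is a minimal sketch that turns log probabilities back into ordinary
probabilities. The values are copied from the first few entries of the `token_logprobs` output above. I'm assuming the
logprobs are natural logarithms, so `math.exp` undoes them.

```python
import math

# Chosen tokens and their logprobs, copied from the first few entries printed above
tokens = ["1", "4", "th", " on", " the", " list", " of"]
token_logprobs = [
    -1.2581345, -1.9064457, -2.0102007, -1.4877319,
    -0.69384456, -0.30852577, -0.12820004,
]

# Assumption: logprobs are natural logs, so exp() recovers the probability
token_probs = [math.exp(lp) for lp in token_logprobs]

for token, lp, p in zip(tokens, token_logprobs, token_probs):
    print(f"{token!r:>8}  logprob={lp:8.4f}  probability={p:.3f}")
```

Read this way, the first token `'1'` was chosen with a probability of roughly 0.28, while `' of'` was close to 0.88, so
the model was far more certain about the latter.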

Let's dig in just a bit more.


In [4]:
# We can actually see for each token what the alternatives were and what their
# logprob was. We can iterate through the set of tokens and, since I set logprobs=3,
# we will see the top three options for each token
for token_choice in details["top_logprobs"]:
    print(token_choice)

{'1': -1.2581345, '2': -1.3801266, '3': -2.0985427}
{'4': -1.9064457, '2': -2.1594117, '0': -2.2457373}
{'th': -2.0102007, '0': -2.3570406, '4': -2.3871949}
{' on': -1.4877319, ' in': -1.7202435, ' largest': -2.0042315}
{' the': -0.69384456, ' a': -2.0116336, ' our': -2.0541122}
{' list': -0.30852577, ' U': -3.4331338, ' For': -3.4778345}
{' of': -0.12820004, '.': -3.4785008, ',': -3.5628467}
{' the': -1.2324786, ' most': -1.8777747, ' cities': -2.7289329}
{' dangerous': -1.5640584, ' popul': -1.8241092, ' expensive': -2.2319298}
{' cities': -0.2770181, ' U': -2.9521394, ' places': -3.1585898}
{' in': -0.23983036, ' for': -3.0415556, '.': -3.055794}
{' the': -0.45761937, ' America': -1.0902431, ' Michigan': -4.70593}
{'.': -0.76887864, ',': -1.3212144, '\n': -2.5920274}
{'\n': -1.1293881, ' The': -2.3599255, ' It': -3.0682452}
{'The': -1.9391508, 'Mich': -2.707284, 'L': -2.7858906}
{' capital': -2.0821187, ' city': -2.3217533, ' F': -2.7388074}


## Temperature

Now, there are several parameters we can use to constrain or guide what the language model might choose for the next
token. Going into them in detail is outside of the scope of this short course, but I want to give you a bit of an
intuitive understanding of how they work. The first one, which we've already talked a bit about, is the `temperature`,
and we can use our newfound ability to find probabilities of tokens to see better how this is working.


In [5]:
# As the temperature was set to 0, decoding was greedy: at each position the
# model took the token it scored as most likely (the logprob closest to zero).
# We can experiment a bit with this -- let's turn up the heat and see what happens
result: Completion = model.create_completion(
    prompt="The capital of Michigan is ", logprobs=3, max_tokens=16, temperature=0.7
)

details: CompletionLogprobs = result["choices"][0]["logprobs"]

# Here we can look at the list of tokens in the response
print(details["tokens"])
print()  # little whitespace to see better

# iterate through results
for token_choice in details["top_logprobs"]:
    print(token_choice)

['1', '5', '4', ' miles', ' from', ' To', 'led', 'o', ',', ' Ohio', '.', '\n', 'A', ':', ' Detroit', ' is']

{'1': -1.2581345, '2': -1.3801266, '3': -2.0985427}
{'4': -1.9064457, '2': -2.1594117, '0': -2.2457373, '5': -2.257005}
{'0': -1.6116359, 'th': -2.5370834, '5': -2.5731857, '4': -2.7806041}
{' miles': -0.71913254, '.': -1.5969259, ' years': -2.4036627}
{' from': -1.1768222, ' away': -2.1144385, ' (': -2.2434993}
{' the': -1.715092, ' Chicago': -1.907944, ' Detroit': -2.7452128, ' To': -3.3888438}
{'led': -0.009333666, 'pe': -4.863393, 'ber': -7.8219204}
{'o': -0.000281056, 'a': -10.461307, '...': -10.747232}
{',': -0.75099295, '.': -1.2286122, ' and': -2.2649662}
{' Ohio': -0.5086052, ' and': -2.5078747, ' O': -2.7846692}
{',': -0.76004976, '.': -0.89862627, ' and': -3.0121918}
{'\n': -1.7718374, ' The': -1.9226686, ' It': -2.1530676}
{'The': -1.5949445, 'How': -2.5521302, 'What': -2.6860414, 'A': -4.042893}
{':': -2.264958, ' ': -3.4803605, '.': -3.5039792}
{' The': -1.0294231,

Ok! So here we can see that the sequence returned isn't made up of the most likely token from each candidate set. In
fact, depending on a little bit of randomness, you might even see that several of the candidate sets have more than
3 tokens in them -- that means the higher temperature resulted in the model choosing a token that wasn't in the top
three options, and the chosen token is listed alongside them. This is important to note too -- even though we set
`logprobs` to three, that's not the complete candidate set of tokens, but just the ones which are being returned to us
programmatically.
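
If you'd like to see the effect of temperature in isolation, here is a small sketch, not llama.cpp's actual sampler. It
divides each log probability by the temperature and renormalizes with a softmax. The function name
`apply_temperature` is my own, and I'm only using the three alternatives returned for the first token rather than the
full vocabulary, so treat the numbers as illustrative.

```python
import math


def apply_temperature(logprobs, temperature):
    """Divide log probabilities by the temperature, then renormalize with a softmax."""
    if temperature <= 0:
        # A temperature of 0 is normally handled as greedy decoding instead
        raise ValueError("temperature must be greater than 0 for sampling")
    scaled = {tok: lp / temperature for tok, lp in logprobs.items()}
    highest = max(scaled.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(s - highest) for tok, s in scaled.items()}
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}


# Top three alternatives for the first token, copied from the output above
candidates = {"1": -1.2581345, "2": -1.3801266, "3": -2.0985427}

for temperature in (0.2, 0.7, 1.5):
    probs = apply_temperature(candidates, temperature)
    print(temperature, {tok: round(p, 3) for tok, p in probs.items()})
```

A low temperature piles most of the probability onto the top token, while a high temperature flattens the distribution
so that lower-ranked tokens get picked more often.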


## `top_k`

Another parameter we have in our toolbox is called `top_k`, and this more forcefully constrains the language model,
telling it _only_ to choose from a fixed number of the best possible matches. This parameter is an integer value, and if
we set it to one, for instance, the model will have only one option to choose from and the temperature won't matter.
Let's take a look.


In [6]:
# Let's now constrain the set of choices the model can consider by setting top_k
# to two. In this world the model only has a choice between the two best fitting
# tokens for a position in the output sequence.
result: Completion = model.create_completion(
    prompt="The capital of Michigan is ",
    logprobs=3,
    max_tokens=16,
    temperature=1.0,
    top_k=2,
)

details: CompletionLogprobs = result["choices"][0]["logprobs"]

# Here we can look at the list of tokens in the response
print(details["tokens"])
print()  # little whitespace to see better

# iterate through results
for token_choice in details["top_logprobs"]:
    print(token_choice)

['1', '2', '5', ' miles', ' west', ' of', ' Detroit', ' and', ' ', '3', '0', ' miles', ' north', ' of', ' the', ' Indiana']

{'1': -1.2581345, '2': -1.3801266, '3': -2.0985427}
{'4': -1.9064457, '2': -2.1594117, '0': -2.2457373}
{'th': -1.9253953, '5': -2.0877736, '0': -2.1950066}
{' miles': -0.30566153, ' years': -2.3266575, '.': -2.7989914}
{' from': -1.3975685, ' west': -1.4879501, ' north': -2.098111}
{' of': -0.06492585, '-': -3.5104234, ' from': -4.4091716}
{' Detroit': -0.14634058, ' the': -3.1374495, ' Lake': -4.1514215}
{'.': -1.0419552, ' and': -1.2214749, ',': -1.5008957}
{' ': -1.2758332, ' is': -1.3793737, ' has': -2.916048}
{'1': -1.2426615, '2': -1.8018103, '3': -2.332631}
{'0': -0.9343345, '5': -1.8219879, '7': -2.7554862}
{' miles': -0.31157643, '0': -1.5994582, ' minutes': -3.880217}
{' east': -0.9051611, ' north': -1.9485438, ' south': -2.0717795}
{' of': -0.22633612, 'west': -1.7161185, '-': -4.6258225}
{' L': -1.0396819, ' the': -1.3250108, ' Ann': -2.4154887}
{' I

You see here that because `logprobs` is set to three I'm still being shown three potential tokens, but the third is
pruned from the candidate set before the model samples from the list, as the `top_k` value is only two. By default the
`top_k` value is 40, and you can see how it works with the temperature to determine which tokens are placed in the
output sequence.
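
Here is a rough sketch of that pruning step, assuming we only have the returned alternatives to work with. The helper
`top_k_filter` is a name I made up for illustration and is not part of llama.cpp. It keeps the `k` best tokens and
rescales their probabilities so they sum to one.

```python
import math


def top_k_filter(logprobs, k):
    """Keep the k highest-scoring tokens and renormalize their probabilities."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Sort tokens from most to least likely and keep the first k
    kept = sorted(logprobs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(math.exp(lp) for _, lp in kept)
    return {tok: math.exp(lp) / total for tok, lp in kept}


# Alternatives at the fifth position, copied from the output above
candidates = {" from": -1.3975685, " west": -1.4879501, " north": -2.098111}

print(top_k_filter(candidates, 2))  # ' north' is pruned before sampling
print(top_k_filter(candidates, 1))  # only one option left, so temperature can't matter
```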


## `top_p`

We also have another variable we can use for tuning, called `top_p`. This variable is a bit more complex: it constrains
the candidate set of tokens based upon their cumulative probability. It's a floating point number from 0
through 1, where 1 would be all possible tokens from our vocabulary. You can think of this as follows:

1. Each candidate token has a probability of being chosen and put in the output sequence.
2. We want to only choose from a set of really good options, so we take the highest probability items first and put
   them in the candidate set.
3. We don't want to constrain the set to a fixed size such as ten or fifteen items. Instead, once the items in that set
   have a combined probability of at least `top_p`, we stop adding items to the set.

So, it's a lot like `top_k` as the goal is to constrain the size of the candidate set, but for each position in the
output sequence the number of items we choose from will vary depending on how high of a probability the items have.
Let's take a look.


In [7]:
# Let's now constrain the set of choices the model can consider by setting top_p
# to a very low number. By default it's 0.95, but if we constrain it to 0.01
# we should be able to get a small set of high quality options.
result: Completion = model.create_completion(
    prompt="The capital of Michigan is ",
    logprobs=3,
    max_tokens=16,
    temperature=0.7,
    top_p=0.01,
)

details: CompletionLogprobs = result["choices"][0]["logprobs"]

# Here we can look at the list of tokens in the response
print(details["tokens"])
print()  # little whitespace to see better

# iterate through results
for token_choice in details["top_logprobs"]:
    print(token_choice)

['1', '4', 'th', ' on', ' the', ' list', ' of', ' most', ' dangerous', ' cities', ' in', ' America', '.', '\n', 'The', ' city']

{'1': -1.2581345, '2': -1.3801266, '3': -2.0985427}
{'4': -1.9064457, '2': -2.1594117, '0': -2.2457373}
{'th': -2.0102007, '0': -2.3570406, '4': -2.3871949}
{' on': -1.4877319, ' in': -1.7202435, ' largest': -2.0042315}
{' the': -0.69384456, ' a': -2.0116336, ' our': -2.0541122}
{' list': -0.30852577, ' U': -3.4331338, ' For': -3.4778345}
{' of': -0.12820004, '.': -3.4785008, ',': -3.5628467}
{' the': -1.2324786, ' most': -1.8777747, ' cities': -2.7289329}
{' dangerous': -1.5640584, ' popul': -1.8241092, ' expensive': -2.2319298}
{' cities': -0.2770181, ' U': -2.9521394, ' places': -3.1585898}
{' in': -0.23983036, ' for': -3.0415556, '.': -3.055794}
{' the': -0.45761937, ' America': -1.0902431, ' Michigan': -4.70593}
{'.': -0.76887864, ',': -1.3212144, '\n': -2.5920274}
{'\n': -1.1293881, ' The': -2.3599255, ' It': -3.0682452}
{'The': -1.9391508, 'Mich': -2.7

The result of such a small `top_p` is that the candidate set of choices is very small, though not all of the choices are
good ones. Since this is a smaller set of choices, the probability of any given choice in the set can vary wildly,
depending on the pretraining of the model. However, the tokens in the candidate set are reweighted as they are the only
options, and the temperature then determines how likely a given candidate is to be chosen.
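
The three steps described earlier, plus the reweighting, can be sketched as follows. This is a simplified illustration
under my own assumptions, not llama.cpp's implementation: `top_p_filter` is a hypothetical helper, and the four-token
distribution is invented so that the probabilities cover a whole (tiny) vocabulary.

```python
import math


def top_p_filter(logprobs, top_p):
    """Keep the most likely tokens until their combined probability reaches top_p."""
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in the range (0, 1]")
    ranked = sorted(logprobs.items(), key=lambda item: item[1], reverse=True)
    kept = []
    cumulative = 0.0
    for tok, lp in ranked:
        p = math.exp(lp)
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break  # the set is big enough, stop adding tokens
    # Reweight the survivors so their probabilities sum to one
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


# Hypothetical next-token distribution, invented for illustration only
toy_probs = {" Lansing": 0.5, " Detroit": 0.3, " Ann": 0.15, " the": 0.05}
toy_logprobs = {tok: math.log(p) for tok, p in toy_probs.items()}

print(top_p_filter(toy_logprobs, 0.9))   # keeps three tokens
print(top_p_filter(toy_logprobs, 0.01))  # keeps only the single best token
```

With `top_p=0.01` the very first token already exceeds the threshold, so the set collapses to one option. That fits
what we saw above, where the output matched the `temperature=0` run token for token.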

Now, all of this is a bit of an oversimplification, and there are several more parameters, some of which you can tune,
such as `typical_p`, `min_p`, and so forth. If you're more interested in how these generative models choose a given
response, from a statistics and deep learning lens, I'd encourage you to check out the paper by Holtzman et al. that
I've put in the optional readings for this week, or read more on topics such as nucleus sampling, top_k sampling, and
beam search.
