## What Exactly is a Large Language Model?

A Large Language Model (LLM) is a model that receives text as input and generates words and sentences by repeatedly predicting what the next word might be. Each candidate word is assigned a probability based on the context of the input (the phrase, sentence, paragraph, and so on), and the model then continues the text with "high-ranked words" (Wolfram).

*insert "hi what's up" prompt

These models are built from stacked blocks of encoders, decoders, or both (GPT-2, used in the example below, is a decoder-only stack), and models differ in how many blocks they stack. One could go much deeper into the technical details of this structure, but the point here is this: these models work by producing a series of tokens. Generation starts with input from a human, say a question. The model outputs one token, that token is appended to the input, and the cycle repeats until a complete answer is more or less reached.
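The token-by-token cycle can be sketched in a few lines of Python. This is only a toy illustration: `TOY_MODEL` is an invented lookup table, not a real language model, and it looks only at the last token, whereas an actual LLM conditions on the whole sequence so far. The names `next_token` and `generate` are hypothetical helpers for this sketch.

```python
# Toy stand-in for a language model: maps the last token to candidate
# next tokens and their probabilities. All values are invented.
TOY_MODEL = {
    "the": {"man": 0.6, "cat": 0.4},
    "man": {"walked": 0.9, "<end>": 0.1},
    "walked": {"his": 0.7, "<end>": 0.3},
    "his": {"cat": 0.8, "dog": 0.2},
    "cat": {"<end>": 1.0},
    "dog": {"<end>": 1.0},
}


def next_token(tokens):
    """Return the most probable next token given the sequence so far."""
    candidates = TOY_MODEL[tokens[-1]]
    return max(candidates, key=candidates.get)


def generate(prompt_tokens, max_new_tokens=10):
    """Greedy loop: predict one token, append it, and feed the result back in."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        token = next_token(tokens)
        if token == "<end>":  # stop once the toy model signals the end
            break
        tokens.append(token)
    return tokens


generated = generate(["the"])
print(" ".join(generated))
```

Always picking the single most probable token is the simplest possible strategy; it is used here only to keep the loop easy to follow.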

**insert pic here

## Here's an Example of How This Works:

We might define the input with the following sentence:

In [1]:
input_text = 'The man walked the cat'

In [2]:
import os

In [3]:
from transformers import AutoTokenizer, AutoModel, utils, AutoModelForCausalLM
from bertviz import model_view, head_view
import torch
import transformers

In [4]:

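cache_dir = '/Commjhub/HF_cache'
utils.logging.set_verbosity_error()  # Suppress standard warnings

model_name = 'gpt2'

# Causal language-model head: returns next-token logits (used for the probabilities later on)
gpt2 = AutoModelForCausalLM.from_pretrained(model_name, cache_dir=cache_dir,
                                            return_dict_in_generate=True)

# Base model with attention outputs enabled (used for the attention visualization)
model = AutoModel.from_pretrained(model_name,
                                  cache_dir=cache_dir,
                                  output_attentions=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)
# GPT-2 has no padding token, so reuse the end-of-sequence token
tokenizer.pad_token_id = tokenizer.eos_token_id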



The next step is to `tokenize` the text. Why? Because the model operates on *tokens*, the units in its vocabulary, which are often pieces of words rather than whole words.

In [5]:
inputs = tokenizer.encode(input_text, return_tensors='pt') 
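To make the idea of sub-word tokens concrete, here is a minimal sketch that splits text by greedy longest-match lookup against a tiny made-up vocabulary. This is not GPT-2's actual byte-pair encoding; `TOY_VOCAB` and `toy_tokenize` are hypothetical names invented for this illustration. It only shows how a word like "walked" can end up as more than one token, each mapped to an integer id.

```python
# Hypothetical vocabulary: token string -> integer id (invented for illustration).
TOY_VOCAB = {"walk": 0, "ed": 1, "ing": 2, "cat": 3, "s": 4, "the": 5, " ": 6}


def toy_tokenize(text):
    """Split text into the longest matching vocabulary entries, left to right."""
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest possible substring first, then shrink it.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in TOY_VOCAB:
                tokens.append(piece)
                pos = end
                break
        else:
            raise ValueError(f"no vocabulary entry matches text at position {pos}")
    return tokens


pieces = toy_tokenize("the cats walked")
ids = [TOY_VOCAB[p] for p in pieces]
print(pieces)
print(ids)
```

The list of ids plays the same role as `inputs` in the cell above: the model never sees the raw characters, only the integers.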

With that, the model processes the tokenized text and computes attention weights, which indicate how strongly each token attends to itself and to the tokens before it.

In [6]:
# Run model
outputs = model(inputs) 

# Retrieve attention from model outputs
attention = outputs[-1]  

# Convert input ids to token strings
tokens = tokenizer.convert_ids_to_tokens(inputs[0])  

In [7]:
head_view(attention, tokens)
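For intuition about what a single attention head in that view contains, the following is a rough sketch of causal scaled dot-product attention weights in plain Python. The query and key vectors are made-up numbers; in the real model they come from learned projections, and there are many heads and layers. `attention_softmax` and `causal_attention_weights` are hypothetical helper names for this sketch.

```python
import math


def attention_softmax(scores):
    """Turn a list of scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention_weights(queries, keys):
    """Scaled dot-product weights where token i only sees tokens 0..i."""
    dim = len(queries[0])
    weights = []
    for i, q in enumerate(queries):
        # Score this token's query against its own key and all earlier keys.
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(dim)
            for j in range(i + 1)
        ]
        row = attention_softmax(scores)
        # Later positions are masked out, so they receive weight 0.
        weights.append(row + [0.0] * (len(keys) - i - 1))
    return weights


# Made-up 2-dimensional vectors for three tokens (illustrative values only).
toy_queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

weights = causal_attention_weights(toy_queries, toy_keys)
for row in weights:
    print([round(w, 2) for w in row])
```

Each row is one token's distribution of attention over the sequence, which is what the line thicknesses in a visualization like `head_view` represent.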


The probability the model assigns to each candidate next token can be computed with the following code:

In [9]:
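with torch.inference_mode():  # No gradients needed for inference
    outputs2 = gpt2(inputs)

# Logits for the token that would follow the last input position
next_token_logits = outputs2.logits[0, -1, :]

# Convert logits to a probability distribution over the vocabulary
next_token_probs = torch.softmax(next_token_logits, -1)

# Keep the five most probable next tokens
topk_next_tokens = torch.topk(next_token_probs, 5)

for idx, prob in zip(topk_next_tokens.indices, topk_next_tokens.values):
    print(f"{tokenizer.decode(idx): <20}{prob:.1%}")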

walk                18.7%
 out                12.7%
,                   6.3%
 to                 4.9%
 and                4.4%
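The step from raw model scores (logits) to percentages like these can be reproduced by hand. The sketch below applies softmax to a handful of invented logits and ranks the results; the vocabulary and numbers are hypothetical and are not GPT-2's actual values, and `softmax_probs` and `top_k` are names chosen just for this example.

```python
import math


def softmax_probs(logits):
    """Exponentiate and normalize logits so they form a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs, vocab, k):
    """Return the k (token, probability) pairs with the highest probability."""
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Invented vocabulary and logits, for illustration only.
toy_vocab = ["walk", " out", ",", " to", " and", " zebra"]
toy_logits = [2.0, 1.6, 0.9, 0.7, 0.6, -3.0]

toy_probs = softmax_probs(toy_logits)
best = top_k(toy_probs, toy_vocab, 3)
for token, prob in best:
    print(f"{token: <10}{prob:.1%}")
```

Softmax preserves the ordering of the logits, so the highest-scoring token is always the most probable one; what it adds is a scale on which the scores sum to 100%.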


This simple experiment demonstrates how prompting and generation work in LLMs! While it can look like magic, beneath the surface is a carefully designed algorithm that relies on statistical concepts and text data processing, feeding its outputs back in as inputs to produce smart-sounding text!