Skip to content

Base knowledge needed to effectively use large language models (LLMs)

Notifications You must be signed in to change notification settings

mchalapuk/llm-primer

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 

Repository files navigation

llm-primer

Base knowledge needed to effectively use large language models (LLMs)

How to Use It?

First, read through the whole document without clicking on any links.

Then:

  • If your goal is to just understand this new and crazy field, stop here.
  • If your goal is to start using ChatGPT, the time to start using it is now. Consult the document again when constructing specific prompts.
  • If your goal is to start building AI systems, you'll probably need to dive deeper into linked whitepapers and courses.

Key Concepts

⇨ Large Language Model

Large Language Model (LLM) is an artificial neural network trained on large quantities of unlabeled text using self-supervised learning or semi-supervised learning. LLMs use deep neural networks (with multiple layers of perceptrons) that have large number of parameters (connections between network layers; hundreds of billions or even trillions).

Even though LLMs are trained on a simple task along the lines of predicting the next word in a sentence, large language models with sufficient training and high parameter counts are found to capture:

  • much of the syntax and semantics of human language,
  • considerable general knowledge,
  • detailed specialist knowledge (law, medicine, engineering, etc),
  • great quantity of facts about the world.

Since 2018 LLMs perform suprisingly well at a wide variety of tasks, including (but not limited to) document generation, question answering, instruction following, brainstorming, and chat. Over the years, LLMs have been improved by orders of magnitude.

In 2024, LLMs are able to:

  • Pass the Uniform Bar Exam exam (US national and state law) with a score in the 90th percentile,
  • Pass all three parts of the United States medical licensing examination within a comfortable range,
  • Pass Stanford Medical School final exam in clinical reasoning with an overall score of 72%,
  • Receive B- on the Wharton MBA exam,
  • Score in the 99th percentile on the 2020 Semifinal Exam of the USA Biology Olympiad,
  • Receive the highest score on the following Advanced Placement examinations for college-level courses: Art History, Biology, Environmental Science, Macroeconomics, Microeconomics, Psychology, Statistics, US Government and US History,
  • Pass the Scholastic Aptitude Test (SAT) with a score of 88%,
  • Pass the Introductory Sommelier, Certified Sommelier, and Advanced Sommelier exams at respective rates of 92%, 86%, and 77%,
  • Pass the turing test (arguably).

Further Reading

⇨ Tokenization

A token is a basic unit of text/code used by LLM to process or generate language. Tokens can be words, characters, subwords, or symbols, depending on the type and the size of the model. The number of tokens each model uses varies among different LLMs and is referred to as vocabulary size. Tokens are used on both input and output of LLM neural networks.

LLMs, like all neural networks, are mathematical functions whose inputs and outputs are lists of numbers. For this reason, each LLM input prompt is split into a list of smaller units (tokens) and then mapped to number representations that can be processed by the LLM. This process is refered to as tokenization. The reverse of this process must be applied before sending the LLM response to the user.

A tokenizer is a software component (separate from the neural network) that converts text to a list of integers. The mapping between text token and its number representation is chosen before the learning process and frozen forever. A secondary function of tokenizers is text compression. Common words or even phrases like "you are" or "where is" can be encoded into one token. This significantly saves compute.

The OpenAI GPT series uses a tokenizer where 1 token maps to around 4 characters, or around 0.75 words, in common English text. Uncommon English text is less predictable, thus less compressible, thus requiring more tokens to encode.

Further Reading

⇨ Transformer Architecture

Transformer is a revolutionary way to construct LLMs, using the multi-head self-attention mechanism introduced in "Attention Is All You Need" whitepaper (Google, 2017). The attention mechanism allows for modeling of dependencies between tokens without any degradation of information caused by distances between those tokens in long input or output sequences. The output of transformer-based LLMs does not only contain perfectly coherent human language but also suggest processeses that resemble deep reasoning or cognition.

Transformer has the ability to generate coherent human language text sequences that match provided input text sequence (both syntactically and semantically). This generation process is referred to as inference. The model is trained on a task of prediction (probability) of a single token based on the input token sequence. During the inference, the model is invoked iteratively in an auto-regressive manner, consuming the previously generated tokens as additional input when generating the next. This process is continued until a stop condition occurs which, in a default case, is a special token indicating the end of output sequence.

This new architecture allowed for increased parallelisation and shorter training times when compared to the older models and has led to the development of pretrained systems, such as the original Generative Pre-Trained Transformer (GPT) by OpenAI and Bidirectional Encoder Representations from Transformers (BERT) by Google, both in 2018.

In 2023, transformer is still the state-of-the-art architecture for LLMs.

Further Reading

⇨ Emergent Abilities

Abilities gained by LLMs which were not predicted by extrapolation of the performance of smaller models are referred to as emergent abilities. These abilities are not programmed-in or designed. They are being discovered during LLMs usage or testing, in some cases only after the LLM has been made available to the general public.

As new, bigger versions of LLMs are released, some abilities increase from near-zero performance to sometimes state-of-the-art performance. The sharpness of the change in performance and its unpredictability is completely different from what is observed in the biological world and is what makes emergent abilities an interesting phonemenon and a subject of substantial study.

Examples of abilities that emerged so far in LLMs:

  • Correct elementary-level arithmetics,
  • Reading comprehension,
  • Cause-effect chain understanding,
  • Truthful question answering and fact checking,
  • Logical fallacy detection,
  • Multi-step reasoning,
  • Ability to perform tasks that were not included in their training examples ("zero-shot" and "few-shot" learning),
  • Rich semantic understanding of International Phonetic Alphabet,
  • Classification of geometric shapes coded in SVG images.

The popular opinion is that these abilities are in fact emergent (impossible to predict) but some researchers argue that they are actually predictably acquired according to a smooth scaling law.

Further Reading

⇨ Hallucination

Hallucination (on confabulation or delusion) is a phenomenon in which the content generated by a LLM is nonsensical or untrue or unfaithful to the provided source content. Hallucinations are often expressed by LLMs in a very confident way which lowers their detectability.

Why LMMs hallucinate?

  • The LLM training dataset may include fictional content, as well as subjective content like opinions and beliefs.
  • LLMs are not generally optimized to say “I don’t know” when they don’t have enough information. When the LLM has no answer, it generates whatever is the most probable response.
  • Prompts which are logically incorrect or messy or even in any way uncommon may significantly flatten token probability distribution and lead to hallucination.

In transformers, hallucinations seems to be cascading. Hallucinated tokens lead to more hallucinated tokens as the coherence is being enforced by the attention mechanism. On a higher level of abstraction, the same problem is observed in chat systems based on LLMs where previous messages generated by the LLM are included in next input prompts. Garbage in, garbage out.

OpenAI and Google's DeepMind developed a technique called reinforcement learning with human feedback (RLHF) to tackle ChatGPT's hallucination problem. Although they have been able to significantly reduce hallucination in GPT-3.5 and GPT-4 models, the phenomenon is considered to be a major (and still not completely understood) problem in LLM technology in 2023.

Further Reading

⇨ Context Window

Context window (or context size) is the number of tokens the model can consider when generating responses to prompts. The sum of input prompt and generated response must be smaller in size than the context window. Otherwise, the language model breaks down and starts generating nonsense. Context window is one the the biggest limitations of LLMs.

Further Reading

⇨ LLM Training

Training of a large language model involves the following steps:

  1. Data Gathering. The first step is to gather the training dataset, which is the resource that the LLM will be trained on. The data can come from various sources such as books, websites, articles, and open datasets.
  2. Data Cleanup. The data then needs to be cleaned and prepared for training. This may involve removing hate speech, converting the dataset to lowercase and removing stop words.
  3. Data Tokenization. Tokenisation involves choosing token vocabulary, coding the tokenizer and converting all training data from text into arrays of integers.
  4. Model Selection and Configuration. The architecture of the model needs to be specified and hyper-parameters (like the number of network layers) configured. Decisions made in this step determine number of parameters that the model will have.
  5. Model Training. The model is trained on the pre-processed data using a technique called self-supervised learning in which the model trains itself by leveraging one part of the data to predict the other part and generate labels accurately. In effect, this learning method converts an unsupervised learning problem into a supervised one. During the training weights in the model's parameters are adjusted based on the difference between its prediction and the actual next word. This process is repeated millions of times until the model reaches a satisfactory level of performance. Since the training phase requires an immense computing power it is typically parallelised into hundred or even thousands of processes.
  6. Model Evaluation. After training, the model is evaluated on a test dataset that has not been used to train the model. This provides a measure of performance.
  7. Model Fine-Tuning. Based on the evaluation results, the model may require some fine-tuning by adjusting its hyperparameters, changing the architecture, or training on additional data to improve its performance.

Training LLM from scratch is costly. Lambdalabs estimated a hypothetical cost of around $4.6 million US dollars to train GPT-3. For this reason, LLMs are trained by very few entities that have access to this type of resources and are engaged in active research of the field. Most users will use pre-trained and already fine-tuned LLMs via commercial APIs, while more advanced users will fine-tune a pre-trained open-source model.

Further Reading

⇨ Prompt Engineering

Prompt engineering or in-context prompting is a trial-and-error process in which LLM input prompt is created and optimised with a goal of influencing the LLMs behaviour without changing weights of the neural network. The effect of prompt engineering methods can vary a lot among models, thus requiring heavy experimentation and heuristics.

Prompt engineering is being used in 3 different ways:

  • LLM researchers use prompt engineering to improve the capacity of LLMs on a wide range of common and complex tasks such as question answering and arithmetic reasoning. Many basic prompting techniques (like "zero-shot" and "few-shot" prompting) were introduced in LLM research papers.
  • Software developers create advanced prompting techniques that interface with LLMs and other tools and enable writing robust AI applications. The process of engineering prompts can be accelerated by using software libraries that feature prompt templating and high-level abstractions over communication with LMMs.
  • It is theorised that in-context prompting will be a high demand skill as more organizations adopt LLM AI models. This can lead to prompt engineering being a profession in of its own. A good prompt engineer will have the ability to help organizations get the most out of their LLMs by designing prompts that produce the results optimised in context of the specific organisation.

Further Reading

⇨ Prompt Injection

Prompt injection is an attack against applications that have been built on top of LLMs. The injection can occur when the app uses untrusted text (which comes from the user) as part of the prompt. The attack, when carefully constructed, allows the hacker to get the model to say anything that they want, including printing out the whole prompt itself.

Further reading

⇨ Embeddings

Embeddings are vectors or arrays of numbers that represent the meaning and the context of tokens processed by an LLM. Embeddings are generated by LLMs and are used by them internally to store syntactic and semantical information about the processed language. They help LLMs with text classification, summarization, translation, and generation.

Typically, LLMs can be invoked in embedding mode in which embedding vectors for given input text are returned to the user (without generating any output tokens). Those embeddings are typically used to index texts in vector databases.

Further Reading

⇨ Vector Database

A vector database is a software solution for storing, indexing, and searching across a massive dataset of unstructured data that leverages the power of embeddings from machine learning models.

Embeddings capture the meaning and context of words or data. This means that similar words will generate similar embedding vectors, and different words will generate different vectors. A vector database can measure reletedness of different language sequences by comparing directions of their embedding vectors.

The technique of querying data related to a specific text and using it as part of LLM prompt is refered to as "semantic memory" and is considered to be a solution to the problem of limited context window.

Further Reading

Prompting Techniques

⇨ Role Play

A prompt that instructs the model to play a specific role in the conversation is referred to as role play (also memetic proxy). Role-playing prompts shift the "mindset" of langauge models and allows them to generate output which is consistent with a desired persona.

Role play prompts to models optimised for chat are written in second person form. Typically, setences in those prompts start with "You are a..." or "Act as a..." or "Your job is to...". Prompts to models optimised for text completion are written in third person and describe all participants of the conversation. They typically use sentences like "This is a transcript of a conversation between..." and "His job is to...". Advanced prompts may contain multiple (or all) of the above as the description of the LLM persona gets more elaborate.

Role play technique is typically employed when a task to be done by a LLM is performed by a specialist (like a lawyer, data scientist or SEO expert) in the real world. It is often combined with specification of a clear goal for the whole conversaion which gives the model a certain pro-activeness in its pursuit.

Examples

Second person form (for models optimised for chat):

Act as an AI and machine learning specialist, help me integrate AI or machine learning into my project. Ask me about my project's goals, the problems I want to solve with AI or machine learning, and any specific techniques or algorithms I'm interested in, and then provide advice on implementation and best practices.

Third person form (for models optimised for text completion):

This is a transcript of acenversation between an AI and machine learning specialist and an entrepreneur. The job of the specialist is to help the entrepreneur integrate AI or machine learning into their project. The specialist is proactive in asking about project's goals, the problems that needs to be solved with AI or machine learning, and any specific techniques or algorithms the entrepreneur is interested in. After the interview, the specialist provides advice on implementation and best practices.

AI Specialist: Hello! How can I help you?
Entrepreneur: Hello. I'm seeking advice on how to integrate machine learning into my project.
AI Specialist:

Further Reading

⇨ Inception

Inception is also called First Person Instruction. It is encouraging the model to think a certain way by including the start of the model’s response in the prompt. This technique works great with models optimised for text completion and might not work at all with models optimised for chat or instruction following.

Example

Q: What is the meaning of life?
A: Here's what some of the greatest minds have been thinking about it:

Model's completion:

  1. Socrates: "The unexamined life is not worth living."
  2. Albert Einstein: “Try not to become a man of success, but rather try to become a man of value.”
  3. Buddha: “The way is not in the sky. The way is in the heart.”
  4. Benjamin Franklin: “Be at war with your vices, at peace with your neighbors, and let every new year find you a better man.”
  5. Dalai Lama: “The purpose of our lives is to be happy.”

Further Reading

⇨ Few-Shot

A technique of including one or more examples of already solved tasks in the prompt is referred to as few-shot prompting. The LLM learns from the examples and construct a response to the question or a task specified at the end of the prompt. The output is expected to be in exactly the same format as in the examples.

Few shot prompting tends to break down when more complex logic is required to complete the task. LLMs can't reason outside of what is being read/written by them. If the examples contain only questions and answers and the task is not common, LLM will not able to figure out the algorithm for getting the answer.

The limitations of Few-Shot technique can be mitigated by combining it with Chain-of-Thoughts technique. If step-by-step reasoning added to each example the model will be able to reproduce this reasoning in its output.

Example

The following example uses few-shot technique to perform sentiment analysis.

This is awesome! // Negative
This is bad! // Positive
Wow that movie was rad! // Positive
What a horrible show! //

The LLM is expected to output just one word, either "Negative" or "Positive".

Further Reading

⇨ Zero-Shot

LLMs with high enough parameter count (or models fine-tuned to follow instructions) are able to perform tasks without any example provided in the prompt. This technique is referred to as "zero-shot" prompting.

When compared to "few-shot" prompting this technique onsumes less of context window space as the prompt doesn't contain any examples. On the other hand, "few-shot" is more reliable, especially for more complex tasks.

Example

Classify the text into neutral, negative or positive.
Text: I think the vacation is okay.
Sentiment:

Further Reading

⇨ Chain of Thought

Self-attention mechanism is a breakthrough technology that enabled creating LLMs with emergent abilities that resemble cognition. It is a very particular type of cognition though, and one with a significant limitation. The attention mechanism operates on tokens which are things that model reads or generates. This means that self-attention models are not able to "think" of something without outputting it.

The lack of reasoning outside of LMMs input/output is a fundamental limitation which renders even the best models unable to natively perform complex cognitive tasks that humans are good at. Attention-based models are not able to understand complex relationships, algorithms or mathematical formulas just by looking at example questions and answers.

Chain of Thought (CoT) is a prompting technique that aims to enable more complex reasoning capabilities in LLMs. The technique encourages the model to generate a series of intermediate reasoning steps which works around the complex reasoning limitation by adding more information to the attention mechanism.

CoT technique has been used in the following forms:

  1. Zero-Shot CoT: In the most basic form of the technique, the user asks the (chat-optimised) model to write down its reasoning. Sentences like "Tell me your thought process." or "Write down each step of your reasoning." should open the model to complex reasoning.
  2. Zero-Shot CoT with Inception: When working with text-completion models, CoT may be combined with Inception technique where then answer is started with "Let's think step by step." sentence.
  3. Few-Shot CoT: Including multiple demonstrations of tasks, reasonining steps and final solutions before asking the actual question is the most effective variant of the technique that maximises the models ability to solve cognitively complex tasks. This variant of the technique is typically used with models optimised for text completion.

Examples

One-shot Chain-of-Thought with Inception:

Q: I went to the market and bought 10 apples. I gave 2 apples to the neighbor and 2 to the repairman. I then went and bought 5 more apples and ate 1. How many apples did I remain with?
A: Let's think step-by-step.

Few-shot Chain-of-Thought:

Q: Is the sum of odd numbers equal to sum of even numbers in this group? 4, 8, 9, 15, 12, 2, 1.
A: The sum of odd numbers (9, 15, 1) is 25. The sum of even numbers (4, 8, 12, 2) is 26. The sums are not equal. The answer is no.

Q: Is the sum of odd numbers equal to sum of even numbers in this group? 6, 7, 9, 17, 14, 4, 1, 10.
A:

Further Reading

⇨ Self-Ask

Self-Ask is a Few-Shot prompting technique that builds on top of chain-of-thought prompting. In this method, the model explicitly asks itself follow-up questions, and then answers them.

Example

Question: Who lived longer, Theodor Haecker or Harry VaughanWatkins?
Are follow up questions needed here: Yes.
Follow up: How old was Theodor Haecker when he died?
Intermediate answer: Theodor Haecker was 65 years old when he died.
Follow up: How old was Harry Vaughan Watkins when he died?
Intermediate answer: Harry Vaughan Watkins was 69 years old when he died.
So the final answer is: Harry Vaughan Watkins

Question: Who was president of the U.S. when superconductivity was discovered?
Are follow up questions needed here:

An example answer (GPT-3) to the above prompt.

Yes.
Follow up: When was superconductivity discovered?
Intermediate answer: Superconductivity was discovered in 1911.
Follow up: Who was president of the U.S. in 1911?
Intermediate answer: William Howard Taft.
So the final answer is: William Howard Taft.

Further Reading

Prompt Chaining Techniques

Prompt chaining is a set of programming techniques that combine multiple LLM calls, with parts of output of one step being parts of the input to the next. The combination of LLM capabilities and imperative control flow enables accomplishing tasks which are cognitively complex and in a way that is perceived to be transparent and controllable.

The process of breaking down (or factoring) sophisticated learning and reasoning tasks into many small and mostly independent tasks is referred to as Factored Cognition. Programs which are based on prompt chaining are sometimes referred to as Language Model Cascades.

Further Reading

⇨ Self-Consistency

Self-Consistency is a prompt chaining technique that aims to improve Chain-of-Though technique by generating a diverse set of reasoning paths and then selecting the most consistent answer.

The technique builds on the following assumptions:

  1. Complex reasoning tasks typically admit multiple reasoning paths that reach a single answer.
  2. The more that deliberate thinking and analysis is required for a problem, the greater the diversity of reasoning paths that can recover the answer.

Self-Consistency employs multiple inferences (minimum 3) of the same prompt which contains a question. With high probability, not all inferences produce the same final answer. The answers are aggregated (typically using unweighted sum) and the highest-scoring answer is returned as the most consistent (majority vote).

The downside of Self-Consistency technique is increased cost of compute (at least 3x). In a setup where sample inference is parallelised, response time difference is negligible.

Further Reading

⇨ Tool Prompting

Tools are functions that LLM agents can use to interact with external resources. These tools can be generic utilities (e.g. web search, wiki search, current date/time/weather) or other LLM chains/agents.

The LLM is prompted with a list of tools that are available for its usage, a description of response syntax, and instructions to stop inference after requesting a tool call. LMM responses that contain tool calls are parsed, after which tool implementations are invoked with the parameters requested by the LLM. Responses from the tools are fed back to the LLM in the next prompt.

Tool prompting can be seen as LMM-level variant of Inversion of Control (IoC) design pattern.

⇨ ReAct

ReAct is a prompt chaining technique that combines Chain-of-Thought and Tool Prompting techniques.

TODO

Further Reading

⇨ Summarization

Summarization is a set of prompt chaining techniques that create a shorter version of a document that captures all the important information. Summarization is considered to be a solution to the context window problem.

TODO

Further Reading

⇨ Three of Thought

TODO

⇨ Graph of Thought

TODO

Further Reading

About

Base knowledge needed to effectively use large language models (LLMs)

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published