# 07. LLMs

- Intro
- Prompting
- PEFT
- Alignment
- LLMs for Code
- Exercise
- References

# 1. Introduction

## Transformers

[Source: [lena-voita.github.io](https://lena-voita.github.io/nlp_course/seq2seq_and_attention.html)]

![](https://lena-voita.github.io/resources/lectures/seq2seq/transformer/model-min.png)

## Mixture-of-Experts (MoE)

[Source: [github.com/huggingface/MoE](https://github.com/huggingface/blog/blob/main/moe.md)]

In the context of transformer models, a MoE consists of two main elements:
- Sparse MoE layers are used instead of dense feed-forward network (FFN) layers.
    - MoE layers have a certain number of “experts” (e.g. 8), where each expert is a neural network.
    - In practice, the experts are FFNs, but they can also be more complex networks or even a MoE itself, leading to hierarchical MoEs!
- A gate network or router, that determines which tokens are sent to which expert.
    - For example, in the image below, the token “More” is sent to the second expert, and the token “Parameters” is sent to the first expert.
    - As we’ll explore later, we can send a token to more than one expert.
    - How to route a token to an expert is one of the big decisions when working with MoEs. The router is composed of learned parameters and is pretrained at the same time as the rest of the network.

![](./res/07_moe_switch.png)


The operation of a router in a standard MoE layer typically follows this sequence:

1.  _Input Reception:_ The router receives the representation (embedding) of an input token after it has passed through the self-attention mechanism in a transformer block.
2.  _Score Calculation:_ A trainable network within the router (often a simple linear layer) calculates a score (logit) for each expert, indicating how well-suited that expert is to process the given token.
3.  _Expert Selection (Routing):_ Based on the calculated scores, a subset of experts is selected. The most common strategy is _Top-$K$ Gating_, where the $K$ experts with the highest scores are chosen. For example, in models like Mixtral 8x7B and Grok-1, $K=2$ is used with a total of 8 experts.
4.  _Output Aggregation:_ The selected experts independently process the token. Their outputs are then combined into a weighted sum. The weights are determined by the router's scores, normalized (typically using a softmax function over the top-K experts).

It's crucial to understand that this process happens _independently_ in every MoE layer of the model. A token can be routed to different experts at different stages (layers) of its processing.
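The router steps listed above can be sketched for a single token in plain Python. This is a minimal illustration, not how any particular model implements routing: the experts are toy linear maps, and names such as `moe_forward`, `router_weights`, and the sizes `D_MODEL` and `N_EXPERTS` are assumptions made for the example.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def moe_forward(token, router_weights, experts, k=2):
    # Step 2: a linear router produces one logit per expert.
    logits = matvec(router_weights, token)
    # Step 3: keep the indices of the k highest-scoring experts (top-k gating).
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Step 4: softmax over the selected logits only, then a weighted sum.
    gates = softmax([logits[i] for i in top])
    output = [0.0] * len(token)
    for gate, idx in zip(gates, top):
        expert_out = experts[idx](token)
        output = [o + gate * y for o, y in zip(output, expert_out)]
    return output, top, gates

random.seed(0)
D_MODEL, N_EXPERTS = 4, 8  # hypothetical toy sizes

def random_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# Toy experts: each one is a single linear map instead of a full FFN.
expert_weights = [random_matrix(D_MODEL, D_MODEL) for _ in range(N_EXPERTS)]
experts = [lambda x, w=w: matvec(w, x) for w in expert_weights]
router_weights = random_matrix(N_EXPERTS, D_MODEL)

token = [0.5, -1.0, 0.25, 2.0]  # Step 1: token representation after attention
output, chosen, gates = moe_forward(token, router_weights, experts, k=2)
print("experts chosen:", chosen, "gate weights:", gates)
```

Only the `k` selected experts are evaluated, which is why the number of active parameters per token is much smaller than the total parameter count.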

## Scaling Laws

[Source: [Kaplan et al - Scaling Laws for Neural Language Models 2020](https://arxiv.org/abs/2001.08361)]

Model performance depends most strongly on scale, which consists of three factors:
- the amount of compute used for training,
- the size of the dataset, and
- the number of model parameters.

Within reasonable limits, performance depends only weakly on other architectural hyperparameters, such as depth versus width.

![](res/05_scaling_laws.png)
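The dependence on parameter count can be made concrete with the power-law form from Kaplan et al., $L(N) = (N_c / N)^{\alpha_N}$. The sketch below uses the constants reported in that paper ($\alpha_N \approx 0.076$, $N_c \approx 8.8 \times 10^{13}$ non-embedding parameters); it assumes data and compute are not the bottleneck, and the function name `loss_from_params` is made up for the example.

```python
ALPHA_N = 0.076  # exponent reported by Kaplan et al. for parameter scaling
N_C = 8.8e13     # reported constant, in non-embedding parameters

def loss_from_params(n_params):
    """Predicted test loss when only the parameter count limits performance."""
    return (N_C / n_params) ** ALPHA_N

for n in (1e8, 1e9, 1e10):
    print(f"{n:.0e} params -> predicted loss {loss_from_params(n):.3f}")

# A 10x larger model multiplies the predicted loss by 10 ** -ALPHA_N.
ratio_10x = loss_from_params(1e10) / loss_from_params(1e9)
print(f"loss ratio for 10x more parameters: {ratio_10x:.3f}")
```

The small exponent is the key point: each tenfold increase in parameters reduces the predicted loss by a fixed, modest factor.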

[Source: [LLMs](https://explodingtopics.com/blog/list-of-llms)]

| LLM Name | Developer | Release Date | Access | Parameters |
| :--- | :--- | :--- | :--- | :--- |
| GPT-5 | OpenAI | August 7, 2025 | API | Unknown |
| Claude 4.1 | Anthropic | August 5, 2025 | API | Unknown |
| Grok 4 | xAI | July 9, 2025 | API | Unknown |
| Qwen 3 | Alibaba | April 29, 2025 | API, Open Source | 235B |
| GPT-o4-mini | OpenAI | April 16, 2025 | API | Unknown |
| GPT-o3 | OpenAI | April 16, 2025 | API | Unknown |
| GPT-4.1 | OpenAI | April 14, 2025 | API | Unknown |
| Llama 4 Scout | Meta AI | April 5, 2025 | API, Open Source | 109B (17B active) |
| Gemini 2.5 Pro | Google DeepMind | Mar 25, 2025 | API | Unknown |
| GPT-4.5 | OpenAI | Feb 27, 2025 | API | Unknown |
| Claude 3.7 Sonnet | Anthropic | Feb 24, 2025 | API | Unknown (est. 200B+) |
| Grok-3 | xAI | Feb 17, 2025 | API | Unknown |
| Gemini 2.0 Flash-Lite | Google DeepMind | Feb 5, 2025 | API | Unknown |
| Gemini 2.0 Pro | Google DeepMind | Feb 5, 2025 | API | Unknown |
| GPT-o3-mini | OpenAI | Jan 31, 2025 | API | Unknown |
| Qwen 2.5-Max | Alibaba | Jan 29, 2025 | API | Unknown |
| DeepSeek R1 | DeepSeek | Jan 20, 2025 | API, Open Source | 671B (37B active) |
| DeepSeek-V3 | DeepSeek | Dec 26, 2024 | API, Open Source | 671B (37B active) |
| Gemini 2.0 Flash | Google DeepMind | Dec 11, 2024 | API | Unknown |
| Sora | OpenAI | Dec 9, 2024 | API | Unknown |
| Nova | Amazon | Dec 3, 2024 | API | Unknown |
| Claude 3.5 Sonnet (New) | Anthropic | Oct 22, 2024 | API | Unknown |
| GPT-o1 | OpenAI | Sept 12, 2024 | API | Unknown (o1-mini est. ~100B) |
| DeepSeek-V2.5 | DeepSeek | Sept 5, 2024 | API, Open Source | Unknown |
| Grok-2 | xAI | Aug 13, 2024 | API | Unknown |
| Mistral Large 2 | Mistral AI | July 24, 2024 | API | 123B |
| Llama 3.1 | Meta AI | July 23, 2024 | Open Source | 405B |
| GPT-4o mini | OpenAI | July 18, 2024 | API | ~8B (est.) |
| Nemotron-4 | Nvidia | July 14, 2024 | Open Source | 340B |
| Claude 3.5 Sonnet | Anthropic | June 20, 2024 | API | ~175-200B (est.) |
| GPT-4o | OpenAI | May 13, 2024 | API | ~1.8T (est.) |
| DeepSeek-V2 | DeepSeek | May 6, 2024 | API, Open Source | Unknown |
| Phi-3 | Microsoft | April 23, 2024 | API, Open Source | Mini 3B, Small 7B, Medium 14B |
| Mixtral 8x22B | Mistral AI | April 10, 2024 | Open Source | 141B (39B active) |
| Jamba | AI21 Labs | Mar 29, 2024 | Open Source | 52B (12B active) |
| DBRX | Databricks' Mosaic ML | Mar 27, 2024 | Open Source | 132B |
| Command R | Cohere | Mar 11, 2024 | API, Open Source | 35B |
| Inflection-2.5 | Inflection AI | Mar 7, 2024 | Proprietary | Unknown (predecessor ~400B) |
| Gemma | Google DeepMind | Feb 21, 2024 | API, Open Source | 2B, 7B |
| Gemini 1.5 | Google DeepMind | Feb 15, 2024 | API | ~1.5T Pro, ~8B Flash (est.) |
| Stable LM 2 | Stability AI | Jan 19, 2024 | Open Source | 1.6B, 12B |
| Grok-1 | xAI | Nov 4, 2023 | API, Open Source | 314B |
| Mistral 7B | Mistral AI | Sept 27, 2023 | Open Source | 7.3B |
| Falcon 180B | Technology Innovation Institute | Sept 6, 2023 | Open Source | 180B |
| XGen-7B | Salesforce | July 3, 2023 | Open Source | 7B |
| PaLM 2 | Google | May 10, 2023 | API | 340B |
| Alpaca 7B | Stanford CRFM | Mar 13, 2023 | Open Source | 7B |
| Pythia | EleutherAI | Mar 13, 2023 | Open Source | 70M to 12B |

[Source: [Timeline of major LLM releases](https://www.researchgate.net/publication/393983430_How_Well_Do_LLMs_Predict_Prerequisite_Skills_Zero-Shot_Comparison_to_Expert-Defined_Concepts)]

![](./res/07_llm_timeline.png)

[Source: [menlovc.com](https://menlovc.com/perspective/2025-mid-year-llm-market-update/)]

> Open-source models offer clear enterprise advantages: greater customization, potential cost savings, and the ability to deploy within private cloud or on-premises environments. But despite these benefits and recent improvements, open-source has continued to trail frontier, closed-source models in performance by nine to 12 months.

![](./res/07_closed_vs_open.png)

### The size of datasets used to train language models doubles approximately every six months

[Source: [epoch.ai/data-insights/dataset-size-trend](https://epoch.ai/data-insights/dataset-size-trend)]

> Across all domains of ML, models are using more and more training data. In language modeling, datasets are growing at a rate of 3.7x per year. The largest models currently use datasets with tens of trillions of words. The largest public datasets are about ten times larger than this, for example Common Crawl contains hundreds of trillions of words before filtering.

![](./res/07_data_trend_llm.png)
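The heading's "doubles approximately every six months" follows from the quoted growth rate of 3.7x per year. A short check, assuming steady exponential growth (the helper name `doubling_time_months` is hypothetical):

```python
import math

def doubling_time_months(growth_per_year):
    """Months needed to double, given a constant yearly growth factor."""
    return 12 * math.log(2) / math.log(growth_per_year)

dataset_doubling = doubling_time_months(3.7)
print(f"dataset size doubles every {dataset_doubling:.1f} months")
```

The result is a little over six months, which matches the heading.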

### Training open-weight models is becoming more data intensive

[Source: [epoch.ai/data-insights/training-tokens-per-parameter](https://epoch.ai/data-insights/training-tokens-per-parameter)]

> The ratio of training data to active parameters in open-weight LLMs has grown 3.1x per year since 2022. Recent models have been trained with 20 times more data per parameter than the optimal ratio suggested by the 2022 Chinchilla scaling laws. Our analysis focuses on open-weights models, where information on training tokens and parameters is more available.

![](./res/07_training_tokens_per_parameter.png)
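To see what "more data per parameter" means numerically, the ratio can be computed directly. The sketch assumes the common rule-of-thumb reading of the Chinchilla result, roughly 20 training tokens per parameter; the model in the example (8B active parameters, 15T tokens) is hypothetical and only illustrates the arithmetic.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20  # rule-of-thumb reading of the 2022 Chinchilla result

def tokens_per_param(training_tokens, active_params):
    """Ratio of training tokens to active parameters."""
    return training_tokens / active_params

def overtraining_factor(training_tokens, active_params):
    """How many times the ratio exceeds the assumed Chinchilla-optimal ratio."""
    return tokens_per_param(training_tokens, active_params) / CHINCHILLA_TOKENS_PER_PARAM

# Hypothetical model: 8B active parameters trained on 15T tokens.
ratio = tokens_per_param(15e12, 8e9)
factor = overtraining_factor(15e12, 8e9)
print(f"{ratio:.0f} tokens per parameter, {factor:.1f}x the assumed optimum")
```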

### Context Size

[Source: [epoch.ai](https://epoch.ai/data-insights/context-windows)]

> Since mid-2023, the longest LLM context windows have grown by about 30x per year. Their ability to use that input effectively is improving even faster: on two long-context benchmarks, the input length where top models reach 80% accuracy has risen by over 250x in the past 9 months.

![](./res/07_context_windows.png)


[Source: [lifearchitect.ai/gpt-4](https://lifearchitect.ai/gpt-4/)]

200K context tokens:
- about 150K words
- hundreds of pages of text
- a couple of books (The Great Gatsby about 72K tokens)
- text that would take about 10 hours to read
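The figures in the list above follow from simple conversions. The sketch assumes about 0.75 English words per token and a reading speed of 250 words per minute; both are rough assumptions, and the Gatsby token count is the one quoted in the list.

```python
WORDS_PER_TOKEN = 0.75   # assumed average for English text
READING_WPM = 250        # assumed reading speed, words per minute
GATSBY_TOKENS = 72_000   # figure quoted in the list above

def context_budget(context_tokens):
    """Convert a context size in tokens to words, reading hours, and book count."""
    words = context_tokens * WORDS_PER_TOKEN
    hours = words / READING_WPM / 60
    books = context_tokens / GATSBY_TOKENS
    return words, hours, books

words, hours, books = context_budget(200_000)
print(f"{words:,.0f} words, {hours:.0f} hours of reading, {books:.1f} Gatsby-sized books")
```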

## Hard Tasks for LLMs

There are examples of complex problems that some LLMs solve well.

![](res/07_hard_task_gpt4.png)

There are also simple problems that LLMs handle poorly.

![](./res/07_llm_hard_gpt4o_calculator.png)

UPDATE 2025.11.05

```deepseek
The product of 124454 and 54776 is 6817092304.

This is calculated as follows:  
124454 × 54776 = 124454 × (54000 + 776) = 124454 × 54000 + 124454 × 776  
124454 × 54000 = 6720516000  
124454 × 776 = 96576304  
Adding these gives 6720516000 + 96576304 = 6817092304.

Alternatively, using direct multiplication:  
124454 × 54776 = 6817092304.

\boxed{6817092304}
```
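A model's arithmetic is easy to verify with exact integer arithmetic. The check below recomputes the decomposition from the transcript above; Python integers have arbitrary precision, so no rounding is involved.

```python
a, b = 124454, 54776

# Same split as in the transcript: 54776 = 54000 + 776.
partial_thousands = a * 54000
partial_rest = a * 776
product = a * b

assert partial_thousands + partial_rest == product
print(product)
```

The result agrees with the value in the transcript, including both partial products.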

![](./res/07_llm_hard_gpt4o_alice.png)

UPDATE 2025.11.05

```
Alice has N brothers and M sisters. This means that, including Alice, there are M + 1 girls in the family. When considering Alice's brother, all the girls in the family are his sisters. Therefore, Alice's brother has M + 1 sisters.

\boxed{M+1}
```

Q:
```
How many "r"s in "strarrtrabbbery"?

```

A:
```
The string "strarrtrabbbery" contains **5** instances of the letter "r". Here's a breakdown for clarity:

- The string is: s, t, **r**, a, **r**, **r**, t, **r**, a, b, b, b, e, **r**, y
- Positions of "r": 3rd, 5th, 6th, 8th, and 14th characters.

Thus, the total count is 5.
```
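The ground truth for the letter-counting question takes two lines to compute, which makes this kind of task a convenient sanity check when comparing models:

```python
word = "strarrtrabbbery"

count = word.count("r")
# 1-based positions, to match the numbering used in the answer above.
positions = [i for i, ch in enumerate(word, start=1) if ch == "r"]

print(count, positions)
```

Here the model's answer matches the computed count and positions.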

More:

- [Dziri et al - Faith and Fate: Limits of Transformers on Compositionality](https://arxiv.org/abs/2305.18654)
- [Bubeck et al - Sparks of Artificial General Intelligence: Early experiments with GPT-4](https://arxiv.org/abs/2303.12712)
- [Nezhurina et al - Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models](https://arxiv.org/abs/2406.02061)
- [Williams and Huckle - Easy Problems That LLMs Get Wrong](https://arxiv.org/abs/2405.19616v2)

# 2. Prompting

## In-Context Learning

In-context learning (ICL) is a technique where task demonstrations are integrated into the prompt in a natural language format.

![](http://ai.stanford.edu/blog/assets/img/posts/2022-08-01-understanding-incontext/images/image13.gif)

- 0-shot
- 1-shot
- few-shot
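The three settings differ only in how many demonstrations are placed in the prompt. A minimal sketch of a prompt builder follows; the `Input:`/`Output:` layout, the sentiment task, and the function name `build_prompt` are illustrative assumptions, not a required format.

```python
def build_prompt(task, examples, query):
    """Assemble an ICL prompt: task description, k demonstrations, then the query."""
    lines = [task, ""]
    for text, label in examples:
        lines += [f"Input: {text}", f"Output: {label}", ""]
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)

TASK = "Classify the sentiment of the input as positive or negative."
DEMOS = [
    ("I loved this film.", "positive"),
    ("The plot was a mess.", "negative"),
    ("Great acting and a sharp script.", "positive"),
]
QUERY = "The ending felt rushed."

zero_shot = build_prompt(TASK, [], QUERY)        # no demonstrations
one_shot = build_prompt(TASK, DEMOS[:1], QUERY)  # one demonstration
few_shot = build_prompt(TASK, DEMOS, QUERY)      # several demonstrations

print(few_shot)
```

The model's weights are not updated in any of the three cases; only the prompt changes.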

## Instructions

- A base LM only predicts the next token given the previous tokens.
- Following natural-language instructions is a core capability of LLMs; models acquire it through instruction tuning on top of next-token prediction.

## Chain-of-Thought

![](https://www.promptingguide.ai/_next/image?url=%2F_next%2Fstatic%2Fmedia%2Fcot.1933d9fe.png&w=1920&q=75)
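A chain-of-thought prompt differs from a standard few-shot prompt in that each demonstration includes the intermediate reasoning, not only the final answer. The sketch below builds a one-shot prompt using the arithmetic example from Wei et al. and parses the final answer from a completion. The completion string is hand-written for illustration, and the helper names and the `The answer is N` convention are assumptions of this example.

```python
import re

# Demonstration with worked reasoning (the example used by Wei et al.).
DEMO_QUESTION = (
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?"
)
DEMO_REASONING = (
    "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. "
    "5 + 6 = 11. The answer is 11."
)

def cot_prompt(question):
    """One-shot chain-of-thought prompt: the demo answer shows its reasoning."""
    return f"Q: {DEMO_QUESTION}\nA: {DEMO_REASONING}\n\nQ: {question}\nA:"

def extract_final_answer(completion):
    """Pull the number that follows 'The answer is'; None if the pattern is absent."""
    match = re.search(r"The answer is\s*(-?\d+(?:\.\d+)?)", completion)
    return match.group(1) if match else None

prompt = cot_prompt(
    "The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, "
    "how many apples do they have?"
)

# Hand-written stand-in for a model completion.
completion = "The cafeteria had 23 apples. 23 - 20 = 3. 3 + 6 = 9. The answer is 9."
print(extract_final_answer(completion))
```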

## Tool Calling

Tool calling, also known as function calling, is a technique that allows LLMs to interact with external tools, APIs, or functions.
This capability transforms LLMs from conversational chatbots into active agents that can perform actions like fetching real-time data or performing calculations.

LLMs are specifically taught to use tools. The process looks like this:

1.  Creating a Specialized Dataset: Researchers create thousands or millions of example dialogues that demonstrate _when_ and _how_ to call tools.
2.  Training Data Structure:
    - User Query: "What's the weather in London?"
    - Correct Model Response (Internal Representation): Not just "It's sunny, 20°C", but a structured output:
        ```json
        {
          "tool_calls": [{
            "name": "get_current_weather",
            "arguments": {"location": "London"}
          }]
        }
        ```
    - The dataset also includes multi-turn examples where the model calls a tool, receives a result from the "system," and then generates a final, natural language answer for the user.
3.  The Learning Process: The model (already knowledgeable from pretraining) is fine-tuned on these specific examples. It learns to recognize the pattern: `If the user asks about X, and answering requires tool Y, then I must generate a structured JSON object instead of plain text.`
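On the application side, the structured output has to be parsed and executed before the result is handed back to the model. The sketch below shows that dispatch step for the JSON shape used in the example above. The weather function is a stub, and `TOOLS` and `run_tool_calls` are hypothetical names; real APIs differ in their exact message formats.

```python
import json

def get_current_weather(location):
    """Stub standing in for a real weather API call."""
    return {"location": location, "condition": "sunny", "temperature_c": 20}

# Registry mapping tool names the model may emit to Python callables.
TOOLS = {"get_current_weather": get_current_weather}

def run_tool_calls(model_output):
    """Parse the model's JSON output and execute each requested tool."""
    message = json.loads(model_output)
    results = []
    for call in message.get("tool_calls", []):
        name = call["name"]
        if name not in TOOLS:
            # Never execute names that are not explicitly registered.
            results.append({"name": name, "error": "unknown tool"})
            continue
        results.append({"name": name, "result": TOOLS[name](**call["arguments"])})
    return results

model_output = (
    '{"tool_calls": [{"name": "get_current_weather", '
    '"arguments": {"location": "London"}}]}'
)
tool_results = run_tool_calls(model_output)
print(tool_results)
```

In a multi-turn exchange, `tool_results` would be appended to the conversation so that the model can write the final natural language answer.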

## Reasoning



"Reasoning" refers to the model's ability to process information in a multi-step, structured way to arrive at an answer or conclusion that is not directly present in its training data or the immediate prompt.

# 3. PEFT

## PEFT Taxonomy

![](res/07_peft_taxonomy.png)

[Source: [Lialin et al 2023](https://arxiv.org/abs/2303.15647)]

## LoRA

![](./res/07_lora.png)

[Source: [sebastianraschka.com](https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms)]
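The idea in the figure can be written out in a few lines: the pretrained weight $W$ stays frozen, and a trainable low-rank product $AB$ is added, scaled by $\alpha / r$. The sketch below is a plain-Python illustration with row-vector inputs and toy sizes; the names and dimensions are assumptions, and real implementations use a tensor library.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, w_frozen, lora_a, lora_b, alpha):
    """Compute x @ W + (alpha / r) * x @ A @ B for row-vector inputs."""
    rank = len(lora_b)                           # B has shape (r, d_out)
    base = matmul(x, w_frozen)                   # frozen pretrained path
    update = matmul(matmul(x, lora_a), lora_b)   # trainable low-rank path
    scale = alpha / rank
    return [[p + scale * q for p, q in zip(b_row, u_row)]
            for b_row, u_row in zip(base, update)]

random.seed(0)
D_IN, D_OUT, RANK, ALPHA = 64, 64, 4, 8  # hypothetical toy sizes

w_frozen = [[random.gauss(0.0, 0.02) for _ in range(D_OUT)] for _ in range(D_IN)]
lora_a = [[random.gauss(0.0, 0.02) for _ in range(RANK)] for _ in range(D_IN)]
lora_b = [[0.0] * D_OUT for _ in range(RANK)]  # zero init: the adapter starts as a no-op

x = [[random.gauss(0.0, 1.0) for _ in range(D_IN)]]
y = lora_forward(x, w_frozen, lora_a, lora_b, ALPHA)

full_params = D_IN * D_OUT
lora_params = RANK * (D_IN + D_OUT)
print(f"trainable parameters: {lora_params} instead of {full_params}")
```

Because `lora_b` starts at zero, the adapted layer initially reproduces the pretrained layer exactly; training then updates only `lora_a` and `lora_b`.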

# 4. Alignment

![](res/07_alignment.png)

_AI Alignment_ (or LLM Alignment) is the field of research and techniques aimed at ensuring that an Artificial Intelligence system's goals and behaviors are in accordance with human intentions.

For LLMs, this means shaping the model to be:
1.  Helpful: It proactively tries to understand and fulfill the user's request.
2.  Harmless: It refuses to generate dangerous, unethical, or illegal content.
3.  Honest: It strives to provide accurate information and indicates when it is uncertain, rather than "hallucinating" confidently.

## RLHF

![](2022-2023/res_nlp/rlhf_overview.png)
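One step of the RLHF pipeline is training a reward model on human preference pairs. A common objective, used for example by Stiennon et al., is the pairwise loss $-\log \sigma(r_{\text{chosen}} - r_{\text{rejected}})$. The sketch below computes it for scalar rewards; the function name is hypothetical, and the reward values would normally come from a neural network.

```python
import math

def pairwise_reward_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), computed in a numerically stable way."""
    diff = reward_chosen - reward_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    # For negative diff, rewrite to avoid overflow in exp().
    return -diff + math.log1p(math.exp(diff))

# The loss is small when the preferred answer gets the higher reward.
print(pairwise_reward_loss(2.0, 0.0))
print(pairwise_reward_loss(0.0, 2.0))
```

Minimizing this loss pushes the reward of the human-preferred response above the reward of the rejected one.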

# 5. LLMs for Code

| Model Name | Release Date | Open Source or Closed | Parameters |
| :--- | :--- | :--- | :--- |
| **GPT-5** | August 2025 | Closed | Undisclosed |
| **GPT-5-mini / nano** | 2025 (specific date not given) | Closed | Undisclosed |
| **GPT-5-Codex** | September 2025 | Closed | Undisclosed |
| **Claude Opus 4.1** | August 2025 | Closed | Undisclosed |
| **Claude Sonnet 4.5** | September 2025 | Closed | Undisclosed |
| **Claude Haiku 4.5** | October 2025 | Closed | Undisclosed |
| **Gemini 2.5 Pro** | March 2025 | Closed | Undisclosed |
| **Gemini 2.5 Flash / Flash-Lite** | 2025 (specific date not given) | Closed | Undisclosed |
| **Mistral Large 2** | July 2024 | Closed | 123B |
| **Codestral** | May 2024 | Closed | 22B |
| **DeepSeek R1** | January 2025 | Open Source | 671B (37B active) |
| **Qwen 3** | April 2025 | Open Source | 235B |
| **Falcon 3** | December 2024 | Open Source | Up to 10B |
| **Granite 3.2** | February 2025 | Open Source | 2B and 8B |
| **Llama 4 Scout** | April 2025 | Open Source | 109B (17B active) |

# Exercise


Conduct a small study to determine whether LLMs are good at clone detection.

> The clone detection task: two code snippets (which can be treated as two functions) are given as input.
> Determine whether they implement the same functionality, that is, whether they produce the same output for the same input.

The study needs a dataset; the [BigCloneBench](https://github.com/clonebench/BigCloneBench) dataset can be used.

The work will be evaluated on how well justified the conclusions are. Document any assumptions you make. Use LLMs as much as possible in this task.

Possible plan:
1. Choose any open model (Code Llama, CodeGemma, etc.).
2. Choose any clone dataset, or part of one, or write a small number of examples yourself.
3. Select a prompt.
4. Get the model's responses.
5. Transform the model's responses into labels (in any way: structured output, classification, regular expressions, manually...).
6. Calculate metrics and draw conclusions.
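Steps 5 and 6 of the plan can be sketched as follows. The example assumes the prompt asked the model to answer "yes" or "no"; the responses and gold labels are invented for illustration, and the regular expression is deliberately simple (it takes the first "yes" or "no" it finds, so it can mislabel long free-form answers).

```python
import re

def response_to_label(response):
    """Map a free-text response to 1 (clone), 0 (not a clone), or None (unparseable)."""
    match = re.search(r"\b(yes|no)\b", response.lower())
    if match is None:
        return None
    return 1 if match.group(1) == "yes" else 0

def precision_recall_f1(gold, predicted):
    """Metrics for the positive class; unparseable predictions count as misses."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p != 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Invented model responses and gold labels, for illustration only.
responses = [
    "Yes, both functions copy a file.",
    "No. The second one also sorts the result.",
    "yes",
    "I am not sure.",
    "No, they differ.",
]
gold = [1, 0, 0, 1, 0]

predicted = [response_to_label(r) for r in responses]
precision, recall, f1 = precision_recall_f1(gold, predicted)
print(predicted, precision, recall, f1)
```

How unparseable responses are treated is itself an assumption that should be documented in the report, since it changes the metrics.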

# References

- [Machine Learning Trends](https://epoch.ai/trends)
- [Lenovo LLM Sizing Guide](https://lenovopress.lenovo.com/lp2130-lenovo-llm-sizing-guide)
- [Mixture of Experts Explained](https://huggingface.co/blog/moe)
- [What is MoE 2.0? Update Your Knowledge about Mixture-of-experts](https://huggingface.co/blog/Kseniase/moe2)
- [How do mixture-of-experts models compare to dense models in inference?](https://epoch.ai/gradient-updates/moe-vs-dense-models-inference)
- [Kaplan et al - Scaling Laws for Neural Language Models](https://arxiv.org/abs/2001.08361v1)
- [Wei et al - Emergent Abilities of Large Language Models](https://arxiv.org/abs/2206.07682)
- [Wei et al - Chain-of-Thought Prompting Elicits Reasoning in Large Language Models](https://arxiv.org/abs/2201.11903)
- [Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers](https://arxiv.org/abs/2212.10559)
- [Microsoft: The power of prompting](https://www.microsoft.com/en-us/research/blog/the-power-of-prompting/)
- [Yandex: Notebook of cheat prompts (in Russian)](https://ya.ru/project/cheat-prompts/index)
- [Anthropic: Prompt engineering](https://docs.anthropic.com/claude/docs/prompt-engineering)
- [Lialin et al - Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning](https://arxiv.org/abs/2303.15647)
- [LoRA](https://huggingface.co/docs/peft/conceptual_guides/lora)
- [Practical tips](https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms)
- [Zhou et al - LIMA: Less Is More for Alignment](https://arxiv.org/abs/2305.11206): SFT on carefully selected examples (1000), without using RL
- [Rafailov et al - Direct Preference Optimization: Your Language Model is Secretly a Reward Model](https://arxiv.org/abs/2305.18290)
- [Constitutional AI](https://www.anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback)
- [Understanding Reinforcement Learning from Human Feedback (RLHF): Part 1](https://wandb.ai/ayush-thakur/RLHF/reports/Understanding-Reinforcement-Learning-from-Human-Feedback-RLHF-Part-1--VmlldzoyODk5MTIx)
- [Illustrating Reinforcement Learning from Human Feedback (RLHF)](https://huggingface.co/blog/rlhf)
- [opendilab/awesome-RLHF](https://github.com/opendilab/awesome-RLHF)
- [Berkeley CS 285: Deep Reinforcement Learning](https://rail.eecs.berkeley.edu/deeprlcourse/)
- [Stiennon et al - Learning to summarize from human feedback](https://arxiv.org/abs/2009.01325)
- [huybery/Awesome-Code-LLM](https://github.com/huybery/Awesome-Code-LLM)
- [Code Llama: Open Foundation Models for Code](https://arxiv.org/abs/2308.12950)
- [DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence](https://arxiv.org/abs/2401.14196)
- [CodeGemma: Open Code Models Based on Gemma](https://goo.gle/codegemma)