---
title: "Day 1: Introduction to Large Language Models"
author: "Juan F. Imbet"
date: "2025-05-07"
format:
  revealjs:
    theme: default
    slide-number: true
    preview-links: auto
    css: ../../styles.css
    logo: ../../images/logo_header.svg
jupyter: python3
---


# Introduction to Large Language Models in Finance

## Overview of Today's Lecture

- Evolution of NLP applications in Finance
- Word Embeddings
- Tokenizers
- The Transformers Architecture
- Softmax Probabilities and Token Generation
- Classification and Scalability of LLMs


# Evolution of NLP Applications in Finance

## Historical Development

:::: {.columns}

::: {.column width="100%"}
- **Traditional rule-based approaches (1960s – 1990s)** – keyword spotting and hand-written grammars for parsing bank statements and wire stories. Access to computing power was limited to large institutions.
- **Statistical methods (1990s – 2010s)** – TF-IDF, n-grams, naïve Bayes sentiment. 
- **Machine-learning era (2010s)** – supervised classifiers & finance-specific dictionaries.  
- **Deep-learning revolution (≈2015 +)** – word-embeddings, CNN/RNN sentiment on earnings calls, topic models.
- **Large Language Models (2018 +)** – GPT-style chatbots & summarizers now embedded in research and advisory workflows (see Morgan Stanley GPT-4 Assistant 2023).
:::

::::

---

## Selected Academic Milestones (2004 – 2024)

:::: {.columns}

::: {.column width="60%"}
- **Antweiler & Frank 2004, *JF*** – Internet-forum sentiment & volatility ([link](https://doi.org/10.1111/j.1540-6261.2004.00662.x))  
- **Tetlock 2007, *JF*** – WSJ pessimism predicts next-day returns ([link](https://doi.org/10.1111/j.1540-6261.2007.01232.x))  
- **Dougal *et al.* 2012, *RFS*** – Columnist-specific tone moves the market ([link](https://academic.oup.com/rfs/article-abstract/25/3/639/1617372))  
- **Loughran & McDonald 2011, *JF*** – Finance-specific tone dictionaries for 10-Ks ([link](https://doi.org/10.1111/j.1540-6261.2010.01625.x))  
- **Jegadeesh & Wu 2013, *JFE*** – Market-reaction-weighted tone metric ([link](https://www.sciencedirect.com/science/article/pii/S0304405X13002328))  
- **Chen *et al.* 2014, *RFS*** – Seeking Alpha opinions predict returns & EPS surprises ([link](https://academic.oup.com/rfs/article/27/5/1367/1581938))  
- **Manela & Moreira 2017, *JFE*** – NVIX text-based disaster-risk index ([link](https://doi.org/10.1016/j.jfineco.2016.08.013))  
- **Buehlmaier & Whited 2018, *RFS*** – Text-identified financial-constraints premium ([link](https://academic.oup.com/rfs/article-abstract/31/7/2693/4824924))  
- **Hassan *et al.* 2019, *QJE*** – Firm-level political-risk from earnings-call text ([link](https://academic.oup.com/qje/article/134/4/2135/5531768))  
- **Bybee *et al.* 2024, *JF*** – Topic-model news-attention indices improve macro forecasts ([link](https://doi.org/10.1111/jofi.13377))  
:::

::: {.column width="40%"}
**Why it matters**

- Predict **returns**, **volatility**, and **risk premia**  
- Reveal intangible firm traits (constraints, political risk)  
- Enhance macro forecasting with text-derived factors  
- Methodology shifted: word-counts → ML classification → embeddings & topic models
:::

::::

---

## Real-World Industry Applications of NLP

:::: {.columns}

::: {.column width="60%"}
- **Thomson Reuters / Bloomberg News Analytics** – Real-time machine-readable sentiment feeds power quant desks and HFT strategies ([Forbes](https://www.forbes.com/sites/tomgroenfeldt/2011/11/28/trading-on-sentiment-analysis-a-public-relations-tool-goes-to-wall-street/))  
- **RavenPack** – >70 % of top quant funds ingest its news-sentiment data for alpha & risk ([RavenPack](https://www.ravenpack.com/products/edge/data/news-analytics))  
- **MarketPsych Capital (2008-10)** – Hedge fund trading on media sentiment, +28 % during the 2008 crisis ([MarketPsych](https://www.marketpsych.com/))  
- **Derwent Capital "Twitter Fund" 2011** – $40 m portfolio guided by Twitter mood ([Atlantic](https://www.theatlantic.com/business/archive/2011/05/the-worlds-first-twitter-based-hedge-fund-is-finally-open-for-business/239097/))  
- **J.P. Morgan COIN 2017** – NLP reviews loan contracts in seconds, saving 360 k lawyer-hours ([ABA Journal](https://www.abajournal.com/news/article/jpmorgan_chase_uses_tech_to_save_360000_hours_of_annual_work_by_lawyers_and))  
- **Morgan Stanley GPT-4 Assistant 2023-24** – Chatbot for 16 k advisors, instant Q&A on 100 k research docs ([Press release](https://www.morganstanley.com/press-releases/morgan-stanley-research-announces-askresearchgpt))  
- **Kensho (acq. S&P Global 2018)** – NLP Q&A platform "Warren" enhances S&P analytics ([S&P Global](https://investor.spglobal.com/news-releases/news-details/2018/SP-Global-to-Acquire-Kensho-Bolsters-Core-Capabilities-in-Artificial-Intelligence-Natural-Language-Processing-and-Data-Analytics-2018-3-6/default.aspx))  
- **SEC "RoboCop" 2013** – Accounting Quality Model flags anomalous filings for enforcement ([Harvard Law Blog](https://corpgov.law.harvard.edu/2014/01/27/the-secs-refocus-on-accounting-irregularities/))  
- **Lloyd's / FRISS Fraud Detection** – Text-mining claims boosts fraud-catch rate ≈30 % ([Lloyd's Lab report](https://assets.lloyds.com/media/dc22cd29-1c4e-441c-a872-e1bf5ce9142a/Lloyds%20Lab_impact%20report_FINAL.pdf))  
:::

::: {.column width="40%"}
**Key take-aways**

- **Alpha & Risk** – sentiment + event extraction  
- **Efficiency** – contract analysis, research Q&A  
- **Governance** – regulators spot fraud & anomalies  
- **Generative AI** – LLM-powered advisory tools
:::

::::


# Word Embeddings in Finance

## What Are Word Embeddings?

### Technical Definition
- **Word embeddings**: Dense vector representations of words in a continuous vector space
- Words are mapped to real-valued vectors in an n-dimensional space (typically 100-300 dimensions)
- Semantically similar words are positioned closer together in this vector space
- The relative positions and distances between word vectors encode meaningful relationships

### Intuition
- Think of embeddings as "translating" words into a language that computers understand (vectors)
- Each dimension represents a latent feature of the word's meaning
- Instead of treating words as isolated symbols, embeddings capture their context and relationships
- Example: In a 2D simplification, "profit" and "earnings" would be close together, while "loss" would be farther away

## Why Embeddings Matter in Finance

- Transform unstructured textual data (news, reports, filings) into structured numerical data
- Enable quantitative analysis of qualitative information
- Allow algorithms to understand semantic relationships between financial concepts
- Support tasks like sentiment analysis, document classification, and information retrieval
- Bridge the gap between natural language and mathematical models
- Embeddings can be computed for individual words as well as for entire sentences, paragraphs, or documents

## Traditional Methods

### One-Hot Encoding
- Represents each word as a unique vector with a single 1 and all other elements 0.
- High-dimensional and sparse representation
- Example: "dog" = [0, 0, ..., 1, 0, ..., 0] (1 at index for "dog")
- Pros: Simple and fast to compute
- Cons: Inefficient, high-dimensional, and does not capture relationships between words

### TF-IDF (Term Frequency-Inverse Document Frequency)
- Weighs words based on their frequency in a document relative to their frequency across a corpus. 
- Helps identify important terms in documents
- Formula:

$$
\begin{aligned}
\text{TF-IDF}(w, d) &= \text{TF}(w, d) \times \text{IDF}(w) \\
\text{TF}(w, d) &= \frac{f(w, d)}{|d|} \\
\text{IDF}(w) &= \log\frac{N}{n(w)}
\end{aligned}
$$

Where:

- $f(w, d)$: Frequency of word $w$ in document $d$
- $|d|$: Total number of words in document $d$
- $N$: Total number of documents in the corpus
- $n(w)$: Number of documents containing word $w$
- Example:
```
Doc 1: "The quick brown fox jumps over the lazy dog."
Doc 2: "The lazy dog sleeps all day."
```

- TF-IDF for "lazy"
```math
TF("lazy", Doc 1) = 1/9
IDF("lazy") = log(2/2) = 0
TF-IDF("lazy", Doc 1) = (1/9) * 0 = 0
TF-IDF("lazy", Doc 2) = (1/6) * log(2/2) = 0
```

- TF-IDF for "quick"
```math
TF("quick", Doc 1) = 1/9
IDF("quick") = log(2/1) = log(2)
TF-IDF("quick", Doc 1) = (1/9) * log(2) 
TF-IDF("quick", Doc 2) = 0
```
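The worked example can be checked with a few lines of Python. This is a minimal sketch under stated assumptions: the `tokenize` helper is a deliberately naive, hypothetical tokenizer (lowercase, drop the period, split on spaces), and the natural logarithm is used for IDF.

```python
import math

docs = {
    "Doc 1": "The quick brown fox jumps over the lazy dog.",
    "Doc 2": "The lazy dog sleeps all day.",
}

def tokenize(text):
    # Naive tokenizer (assumption): lowercase, drop periods, split on spaces.
    return text.lower().replace(".", "").split()

tokens = {name: tokenize(text) for name, text in docs.items()}

def tf(word, doc_tokens):
    # Term frequency: count of the word divided by document length.
    return doc_tokens.count(word) / len(doc_tokens)

def idf(word):
    # Inverse document frequency; assumes the word occurs in at least one document.
    n_docs_with_word = sum(word in toks for toks in tokens.values())
    return math.log(len(tokens) / n_docs_with_word)

def tf_idf(word, name):
    return tf(word, tokens[name]) * idf(word)

for word in ("lazy", "quick"):
    for name in tokens:
        print(word, name, round(tf_idf(word, name), 4))
```

A word that appears in every document ("lazy") gets a score of zero, while a word unique to one document ("quick") gets a positive score there.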

### Word2Vec
- Learns the meaning of a word from its surrounding words.
- Trained with large corpora to learn word relationships
- Words which appear in similar contexts are mapped to vectors which are nearby as measured by cosine similarity. 

### GloVe (Global Vectors)
- Developed at Stanford in 2014.
- Trained on aggregated global word-word co-occurrence statistics from a corpus (e.g., how often words appear together).

### Bag of Words (BoW)
- Represents text as occurrence counts of words
- Simple but loses word order and context

### FastText
- Extension of Word2Vec that uses character n-grams
- Better handles rare words and morphologically rich languages
- Can represent out-of-vocabulary words
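
To see why character n-grams help with rare and unseen words, here is a small sketch of the n-gram extraction step only. The function name `char_ngrams` and its defaults are illustrative choices, not the FastText library API; the `<` and `>` boundary markers follow the convention described in the FastText paper.

```python
def char_ngrams(word, n_min=3, n_max=6):
    # Wrap the word in boundary markers so prefixes and suffixes are distinguishable.
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

print(char_ngrams("where", 3, 3))
```

An out-of-vocabulary word such as a new ticker or product name still shares many n-grams with known words, so its vector can be built by combining the vectors of those n-grams.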

## Properties of Word Vector Spaces

### Semantic Clustering
- Words with similar meanings cluster together
- Example: "equity," "stock," "share" form a cluster

### Vector Arithmetic
- Word vectors can be added and subtracted with meaningful results
- Classical example: "king" - "man" + "woman" ≈ "queen"
- Financial examples:
  - "Bull" - "Market" + "Housing" ≈ "Bubble"
  - "Fed" + "Increase" ≈ "Rates"
  - "Bond" - "Price" + "Increase" ≈ "Yield"

### Analogical Reasoning
- Vector relationships encode semantic relationships
- Enables solving analogies: (A is to B as C is to D)
- Financial example: "Stock:Equity :: Bond:Debt"
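
Vector arithmetic and analogies can be illustrated with hand-made toy vectors. The three-dimensional vectors below are invented for illustration only (real embeddings are learned and have hundreds of dimensions), and `cosine` and `analogy` are hypothetical helper names.

```python
import math

# Toy vectors (assumption): dimensions loosely mean [royalty, male, female].
vectors = {
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man":   [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "stock": [0.1, 0.1, 0.1],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c):
    # Compute a - b + c and return the closest remaining word by cosine similarity.
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("king", "man", "woman"))
```

With trained embeddings the same nearest-neighbour search is run over the full vocabulary, and the results depend heavily on the training corpus.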

## Finance-Specific Word Embeddings

### Domain Adaptation
- Generic embeddings often miss nuances in financial language
- Domain-specific embeddings are trained on financial corpora:
  - Earnings call transcripts
  - Financial news
  - SEC filings
  - Analyst reports

### Benefits in Financial Applications
- More accurate representation of financial terminology
- Better capture of relationships between market concepts
- Improved performance in financial NLP tasks:
  - Sentiment analysis of market news
  - Classification of financial documents
  - Information extraction from reports

## Limitations of Traditional Embeddings

### Context Insensitivity
- Each word has only one vector regardless of context
- Polysemy problem: "bank" (financial institution vs. river edge)
- "Interest" (monetary vs. attentional) has different meanings in finance

### Static Nature
- Cannot adapt to evolving language and new terminology
- Market-specific terms change meaning during different economic cycles

### Rare Terms Challenge
- Financial jargon and specialized terminology often lack quality embeddings
- Numerical values and symbols are not well represented

### Phrase Handling
- Important financial phrases ("interest rate," "balance sheet") need special treatment
- Individual word embeddings may not capture phrase meanings


## Limitations of traditional embeddings


- Do you recall how the word ***bank*** can refer to a financial institution or the side of a river? Traditional word embeddings struggle with such polysemy, as they assign a single vector representation to each word, regardless of context.
- The only way to address this is to use context-sensitive embeddings, which means that words need to talk to each other. The word embedding for a word depends on the words around it.
- Letting words **talk to each other** was first explored in the context of machine translation using RNNs (Recurrent Neural Networks).
- However, it was not until the introduction of the **Transformer architecture** that we could effectively let words talk to each other in a scalable way. We will explore this in the next section.

## Some Definitions
- **Recurrent Neural Networks (RNNs)**: A type of neural network designed for sequential data, where the output from previous steps is fed as input to the current step.
- **LSTM (Long Short-Term Memory)**: A type of RNN that can learn long-term dependencies, making it suitable for tasks like language modeling and translation.
- Prior to transformers, RNN architectures were the state of the art. They contain a **feedback** loop in the network connections that allows information to propagate from one step to the next, making them well suited to sequential data like text.
- A crucial feature of these networks is that the **input** and **output** do not have to be the same length.

## Basic RNNs

- Consider a sequence of observations of arbitrary length and a prediction of the next observation in the sequence (e.g., bond quotes in TRACE).
- A basic RNN takes the previous observations as input, processes them through a hidden layer, and outputs a prediction for the next observation.
- An RNN cell is a simple unit that takes an input and the previous hidden state, processes them, and outputs a new hidden state and an output.

![Basic RNN Cell](../../images/rnn1.png)

## Unfolding the RNN

- The RNN can be unfolded over time, where each time step corresponds to a new observation in the sequence.

![Unfolded RNN](../../images/rnn2.png)
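
The unfolding can be written as a plain loop that reuses the same cell at every time step. This is a minimal NumPy sketch with random, untrained weights; the names `W_xh`, `W_hh`, `W_hy`, and `rnn_cell`, the `tanh` activation, and the toy input values are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 1, 4

# Randomly initialised parameters (untrained, for illustration only).
W_xh = rng.normal(scale=0.5, size=(input_size, hidden_size))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)
W_hy = rng.normal(scale=0.5, size=(hidden_size, 1))

def rnn_cell(x_t, h_prev):
    # New hidden state combines the current input with the previous state.
    h_t = np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)
    y_t = h_t @ W_hy  # prediction for the next observation
    return h_t, y_t

# A toy sequence of four scalar observations.
sequence = np.array([[0.1], [-0.2], [0.05], [0.3]])

h = np.zeros(hidden_size)
for x_t in sequence:
    # The same weights are applied at every time step.
    h, y = rnn_cell(x_t, h)

print(h, y)
```

The loop makes the sequential dependency explicit: step `t` cannot start until step `t-1` has produced its hidden state.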


## More on RNNs

- Recurrent Neural Networks (RNNs) extend traditional neural networks by allowing them to process **sequences of variable length**, unlike vanilla or convolutional networks which operate on fixed-size inputs and outputs.

- RNNs can handle diverse tasks such as **sequence-to-sequence** (e.g., machine translation), **sequence-to-one** (e.g., sentiment analysis), **one-to-sequence** (e.g., image captioning), and **synced input/output sequences** (e.g., video frame labeling).

- The **core mechanism** of RNNs is the state vector, which evolves through a **fixed, learned transformation** that combines past information (state) with new input at each time step.

- RNNs are more **computationally expressive** than feedforward networks: they can be seen as running a learned program, and are theoretically **Turing-complete**.

- Even when inputs and outputs are fixed-size vectors, RNNs can still be used to process them **sequentially** — for example, by learning to attend over parts of an image or generating images step by step.

## Different types of RNNs

![Different RNNs](https://karpathy.github.io/assets/rnn/diags.jpeg)

If you are interested in all that RNNs can do, I recommend reading [Andrej Karpathy's blog post](https://karpathy.github.io/2015/05/21/rnn-effectiveness/) on the effectiveness of RNNs.

## The Encoder-Decoder Framework

- For most applications we will focus on mapping a sequence of inputs to a sequence of outputs. In an RNN, the *encoder* compresses the information from the input sequence into a numerical representation, usually the last hidden state.
- In the ***The capital of France is*** example, the encoded representation is the last hidden state.
- The *decoder* then takes this representation and generates the output sequence, one token at a time.

## Encoder-Decoder blocks for machine translation, e.g. English to Catalan

![RNN for machine translation](../../images/rnn3.png)

## Limitations of the traditional Encoder-Decoder framework

- Although elegant in its simplicity, one weakness is that the final hidden state of the encoder creates an **information bottleneck**. A single state needs to be able to represent the meaning of the whole input sequence. 
- This is particularly challenging for long sequences. 
- What if we give access to the decoder to all the hidden states of the encoder?
- This is the idea behind the **attention mechanism**, let RNN cells in the decoder **pay attention** to all the hidden states of the encoder, not just the last one.

## Attention Mechanisms 

![Attention Mechanism](../../images/encoder-decoder-attention.png)

- The idea behind attention is to give the decoder access to the hidden states of the encoder. 
- However, feeding every state to the decoder with equal importance would produce a very large input, so the decoder instead takes a **weighted** combination of the encoder hidden states.
- These weights are learned during training and allow the decoder to focus on the most relevant parts of the input sequence.
- This general attention mechanism is also referred to as **cross-attention**.
- A big limitation remains: the underlying RNN is still sequential, so tokens must be processed one at a time and the computation cannot be parallelized across the sequence.
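
The weighting step can be sketched in a few lines of NumPy. Dot-product scoring between the decoder state and each encoder state is an assumption here (one of several scoring functions used in the literature), and the random states stand in for trained ones.

```python
import numpy as np

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(6, 4))  # 6 source tokens, hidden size 4
decoder_state = rng.normal(size=4)        # current decoder hidden state

# Score each encoder state against the decoder state (dot product, an assumption).
scores = encoder_states @ decoder_state

# Softmax turns the scores into weights that are non-negative and sum to 1.
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# The context vector is the weighted average of the encoder states.
context = weights @ encoder_states

print(weights.round(3), context.round(3))
```

The decoder receives one context vector per output step, so its input size does not grow with the length of the source sentence.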


# Understanding Transformers: Step by Step

## What is a Transformer?

- Revolutionary neural network architecture introduced in "Attention Is All You Need" (2017)
- **Key innovation**: Replaces recurrence and convolutions entirely with attention mechanisms
- Enables **parallel processing** of sequences (unlike RNNs)
- Foundation for modern LLMs including GPT-2, GPT-3/4, BERT

![Attention Paper](../../images/attention_paper.png)

## Transformer Evolution

![Attention Mechanism](../../images/attention_abstract.png)

## A Visual Representation

![Transformer Baby](../../images/transformer_baby.png)


## The Big Picture: Transformer Architecture

![Transformer Architecture](../../images/transformers_model.png)

- **Encoder-Decoder Structure**: Input → Encoder → Decoder → Output
- **Self-Attention**: Each position can attend to all positions in the previous layer
- **Parallelizable**: Within a layer, all positions are processed at once, with no step-by-step recurrence as in RNNs



## The recipe: Part 1


- **Tokenise** the *source* sentence and add start/end markers  
- **Embed** each token **+** add positional encodings

- **Encoder (× N layers)**  
  - Multi-head **self-attention**  
  - Position-wise **feed-forward network**  
  - **Residual connection + LayerNorm** after each sub-layer  

- Cache the resulting **encoder hidden states** (the "memory")



## The recipe: Part 2


- **Decoder (run autoregressively)**  
  1. Embed the generated prefix tokens **+** positional encodings.  
  2. **Masked self-attention** (each token sees only positions up to and including its own).  
  3. **Cross-attention** over the encoder memory (lets the decoder “look back” at the source).  
  4. Feed-forward → Residual → LayerNorm.  
  5. **Linear projection** (tied to embeddings) → **softmax** → probability distribution.  
  6. **Select** the next token (greedy, top-k, nucleus, beam, …), append it, and repeat until ⟨end of sentence EOS⟩ or a maximum length.


## New vocabulary?

- **Self-Attention**: Mechanism allowing each token to attend to all other tokens in the sequence, capturing dependencies regardless of distance.
- **Multi-Head Attention**: Multiple self-attention mechanisms running in parallel, allowing the model to capture different types of relationships.
- **Feed-Forward Network (FFN)**: A fully connected neural network applied to each position independently, typically with a ReLU (Rectified Linear Unit) activation.
- **Positional Encoding**: Adds information about the position of each token in the sequence, since transformers do not have a built-in notion of order.
- **Residual Connection**: A shortcut connection that adds the input of a layer to its output, helping to prevent vanishing gradients in deep networks.
- **Layer Normalization**: A technique to stabilize and accelerate training by normalizing the inputs to each layer, applied after residual connections.


## Step 1: Input Embeddings

### Token Embeddings
- When you feed a sequence of tokens into a transformer‐based LLM, each discrete token (an integer index) is turned into a dense vector of lower dimensionality than the vocabulary size

- Let $V$ be the vocabulary size, and $N$ the sequence length 
- A token is identified by an integer index $j \in \{0, 1, \ldots, V-1\}$
- A brute-force one-hot encoding represents token $j$ as a vector $\mathbf{x}_j \in \{0,1\}^V$ with components
$$
(\mathbf{x}_j)_k = \begin{cases}
1 & \text{if } k = j \\
0 & \text{otherwise}
\end{cases}
$$

- This is inefficient, especially for large vocabularies, as it results in high-dimensional sparse vectors
- Instead, we use a **learnable embedding matrix** $\mathbf{E} \in \mathbb{R}^{V \times d_{\text{model}}}$, where $d_{\text{model}}$ is typically much smaller than $V$

## Step 1: Positional Encoding

- If token $t$ is represented by index $j$, its embedding is given by the $j$-th row of $\mathbf{E}$:
$$
\mathbf{e}_j = \mathbf{x}_j^T \mathbf{E} = \mathbf{E}_j
$$
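
A quick numerical check that multiplying a one-hot vector by the embedding matrix is the same as picking a row. The matrix values and sizes below are arbitrary placeholders chosen for readability, not learned embeddings.

```python
import numpy as np

V, d_model = 5, 3  # toy vocabulary size and embedding dimension
E = np.arange(V * d_model, dtype=float).reshape(V, d_model)

j = 2                    # token index
x_j = np.zeros(V)
x_j[j] = 1.0             # one-hot vector for token j

via_matmul = x_j @ E     # one-hot vector times embedding matrix
via_lookup = E[j]        # direct row lookup

print(via_matmul, via_lookup)
```

This is why implementations store tokens as integers and index into the matrix directly instead of materialising one-hot vectors.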

### Positional Encoding
Since transformers have no inherent notion of position, we add positional information:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

**Final input**: $\text{input}_i = \text{embedding}_i + PE_i$
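
The sinusoidal formulas translate directly into code. This is a plain-Python sketch; `positional_encoding` is an illustrative name, and the loop variable `i` steps by two so that it plays the role of $2i$ in the formulas.

```python
import math

def positional_encoding(seq_len, d_model):
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # i is the even dimension index, i.e. 2i in the formula.
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(seq_len=3, d_model=4)
for row in pe:
    print([round(v, 4) for v in row])
```

Each position gets a distinct pattern, and low dimensions oscillate quickly while high dimensions change slowly with position.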



## Step 2: Self-Attention Mechanism - The Core

### Queries, Keys, and Values
For each token embedding $\mathbf{x}_i$, we create three vectors:

$$\mathbf{q}_i = \mathbf{x}_i \mathbf{W}^Q \quad \text{(Query)}$$
$$\mathbf{k}_i = \mathbf{x}_i \mathbf{W}^K \quad \text{(Key)}$$  
$$\mathbf{v}_i = \mathbf{x}_i \mathbf{W}^V \quad \text{(Value)}$$

Where $\mathbf{W}^Q, \mathbf{W}^K, \mathbf{W}^V \in \mathbb{R}^{d_{\text{model}} \times d_k}$ are learned parameter matrices.



## Step 3: Computing Attention - The Intuition

### Attention Formula
$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}$$

### Step-by-step breakdown:

1. **Dot product**: $\mathbf{Q}\mathbf{K}^T$ gives similarity scores between all pairs
2. **Scale**: Divide by $\sqrt{d_k}$ to prevent softmax saturation
3. **Normalize**: Apply softmax to get attention weights
4. **Weighted sum**: Multiply by values $\mathbf{V}$
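
The four steps map onto four lines of NumPy. This is a minimal single-head sketch with random, untrained projection matrices; the sizes and the names `softmax` and `attention` are chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # steps 1 and 2: dot product, then scale
    weights = softmax(scores, axis=-1)   # step 3: normalize each row
    return weights @ V, weights          # step 4: weighted sum of values

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 4, 8, 3
X = rng.normal(size=(n_tokens, d_model))  # token inputs (embedding + position)
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = attention(X @ W_Q, X @ W_K, X @ W_V)
print(weights.round(2))
```

Row `i` of `weights` shows how much token `i` draws from every token in the sequence, and each row sums to 1.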

## Step 3: Attention Visualization

![Attention Visualization](../../images/transformers1.png)

- Each token attends to all other tokens
- Attention weights determine how much information flows
- Self-attention is the key to capturing dependencies regardless of distance
- This replaces the need for recurrence in traditional RNNs

## Step 4: Multi-Head Attention

- Single attention mechanism provides limited representational power
- **Multi-head attention** runs multiple attention computations in parallel
- Each "head" learns different relationship patterns:
  - Some heads focus on nearby words
  - Others capture long-range dependencies
  - Some track syntactic relationships

$$\text{MultiHead}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{Concat}(\text{head}_1, \text{head}_2, ..., \text{head}_h)\mathbf{W}^O$$

Where each head is computed as:
$$\text{head}_i = \text{Attention}(\mathbf{Q}\mathbf{W}_i^Q, \mathbf{K}\mathbf{W}_i^K, \mathbf{V}\mathbf{W}_i^V)$$
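
A compact sketch of the two formulas for self-attention, where queries, keys, and values all come from the same input `X`. The weights are random and untrained, and setting $d_k = d_{\text{model}} / h$ is an assumption that follows common practice.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, heads, W_O):
    outputs = []
    for W_Q, W_K, W_V in heads:
        # Each head has its own projections and attends independently.
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        outputs.append(weights @ V)
    # Concatenate the heads and mix them with the output projection.
    return np.concatenate(outputs, axis=-1) @ W_O

rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 5, 8, 2
d_k = d_model // n_heads
X = rng.normal(size=(n_tokens, d_model))
heads = [
    tuple(rng.normal(size=(d_model, d_k)) for _ in range(3))
    for _ in range(n_heads)
]
W_O = rng.normal(size=(n_heads * d_k, d_model))

out = multi_head_self_attention(X, heads, W_O)
print(out.shape)
```

The output has the same shape as the input, which is what allows layers to be stacked.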

## Step 5: Feed-Forward Networks

- After attention, each position passes through the same feed-forward network
- The network is applied to each position separately, with identical weights at every position
- Consists of two linear transformations with a ReLU activation in between:

$$\text{FFN}(x) = \max(0, x\mathbf{W}_1 + \mathbf{b}_1)\mathbf{W}_2 + \mathbf{b}_2$$

- This introduces non-linearity and allows the model to transform the representation
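
The formula is a one-liner in NumPy. In this sketch the weights are random and untrained, and the hidden width of four times $d_{\text{model}}$ is an assumption that mirrors a common choice.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # Linear layer, ReLU, linear layer, matching the FFN formula.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
n_tokens, d_model = 5, 8
d_ff = 4 * d_model  # hidden width (assumption)

X = rng.normal(size=(n_tokens, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

out = ffn(X, W1, b1, W2, b2)
# Applying the network to one position alone gives the same result for that row.
single = ffn(X[2], W1, b1, W2, b2)
print(np.allclose(out[2], single))
```

Because no information moves between positions here, all mixing across tokens happens in the attention sub-layer.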

## Step 6: Residual Connections & Layer Normalization

- **Residual connections** help with training deep networks:
  - Add the input of each sub-layer to its output: $x + \text{Sublayer}(x)$
  - Allows gradients to flow through the network more easily

- **Layer normalization** stabilizes the learning process:
  - Normalizes the inputs across the features
  - Applied after each residual connection

$$\text{LayerNorm}(x + \text{Sublayer}(x))$$
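
A minimal sketch of the "add and norm" step. It omits the learned gain and bias that layer normalization usually carries, and the `sublayer` passed in is a stand-in function rather than a real attention or feed-forward block.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position across its features (last axis).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    # Residual connection followed by layer normalization.
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))

# Stand-in sub-layer (assumption): any function that preserves the shape works.
out = add_and_norm(X, lambda x: 0.5 * x)
print(out.mean(axis=-1).round(6), out.std(axis=-1).round(3))
```

After the step, every position has features with mean close to 0 and standard deviation close to 1.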

## Transformer Model Applications

- **Machine Translation**: Original use case in "Attention Is All You Need"
- **Text Generation**: Foundation for GPT models
- **Document Understanding**: BERT and its variants
- **Multimodal Applications**: Vision transformers, audio transformers
- **Financial Applications**: Market prediction, sentiment analysis, report generation

## GPT-2: A Landmark Decoder-Only Model

### GPT-2 Architecture Basics

- Released by OpenAI in 2019
- Decoder-only transformer architecture
- Trained on 40GB of internet text
- Available in different sizes:
  - Small: 117M parameters
  - Medium: 345M parameters
  - Large: 762M parameters
  - XL: 1.5B parameters

## GPT-2: Model Architecture Details

- Uses masked self-attention (can only attend to previous tokens)
- No encoder-decoder structure, just the decoder component
- Trained on a simple next-token prediction objective
- Layer configuration (for 1.5B model):
  - 48 layers
  - 1600 dimensional embeddings
  - 25 attention heads
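
Masked self-attention can be sketched by setting the scores of future positions to minus infinity before the softmax. The random scores and the function name `causal_attention_weights` are illustrative; this is not GPT-2's actual implementation.

```python
import numpy as np

def causal_attention_weights(scores):
    n = scores.shape[0]
    # True above the diagonal: positions that lie in the future.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    # Row-wise softmax; exp(-inf) = 0, so future tokens get zero weight.
    masked = masked - masked.max(axis=-1, keepdims=True)
    exp = np.exp(masked)
    return exp / exp.sum(axis=-1, keepdims=True)

scores = np.random.default_rng(0).normal(size=(4, 4))
weights = causal_attention_weights(scores)
print(weights.round(2))
```

The result is lower-triangular: the first token can only attend to itself, the second to the first two, and so on.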

## GPT-2: Key Innovations

- Demonstrated impressive zero-shot capabilities
- Introduced **unsupervised pre-training** at scale
- Showed that scaling model size and data substantially improves performance
- Established the foundation for subsequent models like GPT-3 and GPT-4
- Helped popularize truncated sampling (top-k) for text generation

## Sampling Strategies: Introduction

- After the model computes the probability distribution for the next token, how do we select it?
- Different sampling methods produce different text qualities and characteristics
- Trade-off between:
  - **Determinism**: Consistent, predictable outputs
  - **Creativity**: Novel, diverse text generation
  - **Coherence**: Staying on topic without degrading

## Sampling Strategy: Greedy Decoding

- **Approach**: Always select the most probable next token
- **Formula**: $y_t = \arg\max_w P(w|y_{<t})$

### Advantages:
- Simple to implement
- Often produces coherent text for short sequences
- Deterministic results

### Disadvantages:
- Lacks diversity
- Can get stuck in repetition loops
- May produce suboptimal overall sequences

## Sampling Strategy: Temperature Sampling

- **Approach**: Sample from softmax distribution with temperature adjustment
- **Formula**: $P(w|y_{<t}) = \frac{\exp(z_w/T)}{\sum_{w'} \exp(z_{w'}/T)}$

### Temperature effects:
- $T < 1$: Makes distribution more peaked (less random)
- $T > 1$: Makes distribution more uniform (more random)
- $T = 1$: Standard softmax, no adjustment
- $T \to 0$: Approaches greedy decoding
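
The effect of temperature is easy to see numerically. This is a plain-Python sketch; the three logits are made-up values and `softmax_with_temperature` is an illustrative name.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # toy logits for three candidate tokens
for T in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    print(T, [round(p, 3) for p in probs])
```

Lower temperatures push more probability onto the top token, and higher temperatures spread it out.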

## Sampling Strategy: Top-K Sampling

- **Approach**: Limit sampling to the K most likely next tokens
- **Procedure**:
  1. Sort the vocabulary by probability
  2. Keep only the top K tokens
  3. Renormalize probabilities
  4. Sample from this smaller distribution

### Advantages:
- Reduces chance of selecting low-probability (potentially nonsensical) tokens
- Maintains some randomness
- Often produces more coherent text than pure sampling

### Disadvantages:
- K is a fixed hyperparameter, regardless of how peaked or flat the distribution is
- May be too restrictive for some contexts, too permissive for others
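
The four-step procedure in plain Python. The probabilities are made-up values, `top_k_filter` is an illustrative name, and the seeded `random.Random` is used only to make the draw repeatable.

```python
import random

def top_k_filter(probs, k):
    # Steps 1 and 2: rank tokens by probability and keep the top k.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Step 3: renormalize so the kept probabilities sum to 1.
    total = sum(probs[i] for i in top)
    return [probs[i] / total if i in top else 0.0 for i in range(len(probs))]

probs = [0.5, 0.3, 0.1, 0.07, 0.03]  # toy next-token distribution
filtered = top_k_filter(probs, k=2)

# Step 4: sample a token index from the truncated distribution.
rng = random.Random(0)
token = rng.choices(range(len(filtered)), weights=filtered, k=1)[0]
print(filtered, token)
```

With `k=2`, only the two most likely tokens can ever be selected, whatever the shape of the original distribution.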

## Sampling Strategy: Nucleus (Top-p) Sampling

- **Approach**: Sample from the smallest set of tokens whose cumulative probability reaches the threshold p
- **Procedure**:
  1. Sort tokens by probability
  2. Keep adding tokens until cumulative probability ≥ p
  3. Renormalize and sample from this dynamic set

### Advantages:
- Adapts to the confidence of the model
- More flexible than Top-K
- Widely used default for open-ended text generation

### Disadvantages:
- Slightly more complex to implement
- Still requires tuning the p parameter (typically 0.9-0.95)
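
A plain-Python sketch of the nucleus filter, showing how the number of kept tokens adapts to the distribution. Both distributions are made-up values and `top_p_filter` is an illustrative name.

```python
def top_p_filter(probs, p):
    # Rank tokens from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:  # stop once the nucleus covers probability p
            break
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

flat = [0.5, 0.3, 0.1, 0.07, 0.03]   # uncertain model
peaked = [0.97, 0.01, 0.01, 0.01]    # confident model

flat_filtered = top_p_filter(flat, p=0.85)
peaked_filtered = top_p_filter(peaked, p=0.85)
print(flat_filtered, peaked_filtered)
```

The same threshold keeps three tokens for the flat distribution and only one for the peaked one, which is the adaptivity a fixed K lacks.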

## Sampling in Financial Applications

### Conservative Approach (Low Temperature/High Precision):
- Regulatory reporting
- Earnings statement generation
- Financial advice

### Creative Approach (Higher Temperature/More Exploration):
- Market scenario generation
- Stress testing
- Alternative investment thesis formulation

## Putting It All Together: The Transformer Revolution

- Transformers dramatically improved NLP capabilities through:
  - **Parallelization**: Training efficiency
  - **Attention Mechanism**: Better at capturing relationships
  - **Scalability**: Performance continues to improve with size

- Led to a new paradigm of foundation models
- Enabled financial applications previously considered impossible
- Continues to evolve with each new model generation




# Softmax Probabilities in Token Generation

## From Encoder-Decoder to Next Token Prediction

:::: {.columns}

::: {.column width="60%"}
- **The transformer pipeline overview**:
  - Input tokens → Embeddings → Encoder layers
  - Decoder layers → Linear projection → Softmax
  - Probability distribution over vocabulary
  - Sampling/selection of next token
  
- **Key transformation stages**:
  - Multi-head attention creates contextual representations
  - Feed-forward networks refine token representations  
  - Layer normalization ensures stable training
  - Final linear layer maps to vocabulary size
  - Softmax converts logits to probabilities
:::


::: {.column width="40%"}
- **Mathematical flow**:
  - Hidden states: $h_i \in \mathbb{R}^{d_{model}}$
  - Linear projection: $W_{out} \in \mathbb{R}^{d_{model} \times |V|}$
  - Logits: $z_i = h_i W_{out}$
  - Probabilities: $p_i = \text{softmax}(z_i)$
:::

::::

## The Softmax Function in Detail

:::: {.columns}

::: {.column width="60%"}
- **Mathematical definition** (component $i$ of the softmax of a logit vector $z$):
  $$\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{|V|} e^{z_j}}$$
  
- **Key properties**:
  - Outputs sum to 1 (valid probability distribution)
  - Emphasizes the largest logits (the exponential amplifies differences between them)
  - Differentiable for backpropagation
  - Maps each logit to a value in the (0,1) range
  
- **Temperature parameter**:
  $$\text{softmax}(z/T)_i = \frac{e^{z_i/T}}{\sum_{j=1}^{|V|} e^{z_j/T}}$$
:::

::: {.column width="40%"}
- **Temperature effects**:
  - $T = 1$: Standard softmax
  - $T > 1$: More uniform distribution (creative)
  - $T < 1$: More peaked distribution (conservative)
  - $T \to 0$: Approaches argmax (deterministic)
  - $T \to \infty$: Approaches uniform distribution

- **Financial text implications**:
  - Low temperature: Precise financial terminology
  - High temperature: Creative financial analysis
  - Balanced approach for professional content
:::

::::

## Vocabulary Mapping and Logits

:::: {.columns}

::: {.column width="60%"}
- **Vocabulary construction**:
  - Tokenizer creates mapping: token ↔ integer ID
  - Vocabulary size typically 30K-100K+ tokens
  - Special tokens, e.g. [PAD], [UNK], [CLS], [SEP] in BERT-style models
  - Subword tokenization (BPE, WordPiece, SentencePiece)
  
- **Logit computation**:
  - Final hidden state: $h_{final} \in \mathbb{R}^{d_{model}}$
  - Output projection: $W_{out} \in \mathbb{R}^{d_{model} \times |V|}$
  - Bias term (optional, omitted in many LLMs): $b_{out} \in \mathbb{R}^{|V|}$
  - Logits: $z = h_{final} W_{out} + b_{out}$
:::

::: {.column width="40%"}
- **Financial vocabulary considerations**:
  - Specialized financial terms (EBITDA, derivatives)
  - Numerical representations (\$, %, basis points)
  - Company names and ticker symbols
  - Regulatory terminology (SEC, GAAP, IFRS)
  - Domain-specific abbreviations (P/E, ROE, NPV)
  
- **Example logit interpretation**:
  - High logit for "earnings" after "quarterly"
  - High logit for "%" after numerical values
  - Context-dependent probabilities
:::

::::

## From Probabilities to Token Selection

:::: {.columns}

::: {.column width="60%"}
- **Deterministic selection (Greedy)**:
  - Choose token with highest probability
  - $\text{token} = \arg\max_i p_i$
  - Consistent but potentially repetitive
  - Risk of getting stuck in loops
  
- **Probabilistic sampling**:
  - Sample from probability distribution
  - $\text{token} \sim \text{Multinomial}(p_1, p_2, ..., p_{|V|})$
  - Introduces randomness and creativity
  - Multiple runs produce different outputs
:::

::: {.column width="40%"}
- **Advanced sampling strategies**:
  
  **Top-k sampling**:
  - Consider only k most probable tokens
  - Removes probability mass from unlikely tokens
  - Balances quality and diversity
  
  **Top-p (nucleus) sampling**:
  - Select smallest set with cumulative probability ≥ p
  - Adaptive vocabulary size based on confidence
  - More dynamic than fixed top-k
  
  **Beam search** (for deterministic quality):
  - Maintain multiple candidate sequences
  - Prunes the exponentially large search space by keeping only the best-scoring candidates at each step
:::

::::

## Temperature and Sampling in Financial Context

:::: {.columns}

::: {.column width="60%"}
- **Conservative financial writing** (T = 0.3-0.7):
  - Precise terminology and standard phrases
  - Regulatory compliance language
  - Technical analysis descriptions
  - Risk disclosures and disclaimers
  
- **Creative financial analysis** (T = 0.8-1.2):
  - Novel insights and interpretations
  - Alternative scenario descriptions
  - Innovative investment strategies
  - Market commentary and opinions
:::

::: {.column width="40%"}
- **Practical temperature settings**:

  ```
  Financial reports: T = 0.2
  "The company reported strong 
   quarterly earnings..."
  
  Market analysis: T = 0.7  
  "Given current market dynamics,
   we anticipate..."
  
  Creative insights: T = 1.0
  "An unconventional perspective 
   suggests..."
  ```
  
- **Risk considerations**:
  - High temperature may generate inaccurate numbers
  - Low temperature may lack analytical depth
  - Context-dependent optimization needed
:::

::::
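
To see what these settings do numerically, the sketch below applies the standard temperature-scaled softmax, $p_i = \exp(z_i/T) / \sum_j \exp(z_j/T)$, to one set of logits. The logit values are hypothetical.

```python
import math

def softmax_with_temperature(logits, T):
    """Divide logits by T before the softmax; T < 1 sharpens, T > 1 flattens."""
    scaled = [z / T for z in logits]
    m = max(scaled)                              # shift for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                         # hypothetical logits
for T in (0.2, 0.7, 1.0):
    print(T, [round(p, 3) for p in softmax_with_temperature(logits, T)])
```

Lower temperatures concentrate probability on the top token, which is why they suit text where numbers and terminology must be reproduced exactly.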

## Mathematical Properties of Financial Token Generation

:::: {.columns}

::: {.column width="60%"}
- **Probability distribution constraints**:
  - $\sum_{i=1}^{|V|} p_i = 1$ (normalization)
  - $p_i \geq 0$ for all $i$ (non-negativity)
  - $\max_i p_i$ indicates model confidence
  - Entropy $H = -\sum_i p_i \log p_i$ measures uncertainty
  
- **Information theory perspective**:
  - Low entropy: Model is confident (peaked distribution)
  - High entropy: Model is uncertain (flat distribution)
  - Cross-entropy loss drives training optimization
  - Perplexity measures model surprise: $2^H$ with $H$ in bits ($\log_2$), or equivalently $e^H$ with $H$ in nats ($\ln$)
:::

::: {.column width="40%"}
- **Financial implications**:
  
  **High confidence predictions**:
  - Standard financial formulas
  - Well-established market terminology
  - Common financial ratios and metrics
  
  **High uncertainty predictions**:
  - Novel market conditions
  - Emerging financial instruments
  - Ambiguous regulatory interpretations
  
- **Quality metrics**:
  - Perplexity on financial test sets
  - Domain-specific evaluation benchmarks
  - Human expert evaluation scores
:::

::::
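
A short sketch of the entropy and perplexity definitions above, comparing a peaked and a flat distribution. The two distributions are illustrative values, not model outputs.

```python
import math

def entropy_bits(probs):
    """Shannon entropy with log base 2 (bits)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def entropy_nats(probs):
    """Shannon entropy with the natural log (nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

peaked = [0.97, 0.01, 0.01, 0.01]    # confident model
flat = [0.25, 0.25, 0.25, 0.25]      # maximally uncertain over 4 tokens

# Perplexity is the same number whichever base is used, as long as
# the exponent base matches the log base.
ppl_from_bits = 2 ** entropy_bits(flat)
ppl_from_nats = math.exp(entropy_nats(flat))
```

For the uniform distribution over four tokens the entropy is 2 bits and the perplexity is 4: the model is as uncertain as a fair four-way choice.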

## Attention Influence on Token Probabilities

:::: {.columns}

::: {.column width="60%"}
- **Attention-weighted context**:
  - Each token attends to relevant previous tokens
  - Attention weights influence final representations
  - Salient tokens, such as financial entities, can receive higher attention weights
  - Temporal relationships in financial time series
  
- **Multi-head attention aggregation**:
  - Different heads capture different relationships
  - Some heads focus on syntax, others on semantics
  - Financial domain heads for numerical relationships
  - Entity-relation heads for company connections
:::

::: {.column width="40%"}
- **Financial attention patterns**:
  
  **Entity-focused attention**:
  - Company names → Financial metrics
  - Dates → Performance periods
  - Currency symbols → Numerical values
  
  **Causal attention**:
  - Market events → Price movements
  - Economic indicators → Sector performance
  - Regulatory changes → Compliance requirements
  
  **Temporal attention**:
  - Historical performance → Future projections
  - Seasonal patterns in financial data
  - Business cycle phase relationships
:::

::::

## Practical Implementation Considerations

:::: {.columns}

::: {.column width="60%"}
- **Computational efficiency**:
  - Softmax computation over large vocabularies
  - Memory requirements for probability distributions
  - Hierarchical softmax for very large vocabularies
  - Approximate methods for real-time applications
  
- **Numerical stability**:
  - LogSumExp trick: $\log(\sum e^{x_i}) = \max(x) + \log(\sum e^{x_i - \max(x)})$
  - Prevents overflow for large logits
  - Critical for stable training and inference
  - Especially important for financial applications
:::

::: {.column width="40%"}
- **Implementation details**:

  **Efficient softmax computation**:
  ```python
  import numpy as np

  logits = np.array([2.0, 1.0, 0.1])  # example values
  # Subtract the max so the largest exponent is exp(0) = 1
  logits_shifted = logits - np.max(logits)
  exp_logits = np.exp(logits_shifted)
  probabilities = exp_logits / np.sum(exp_logits)
  ```
  
  **Memory optimization**:
  - Sparse attention for long sequences
  - Gradient checkpointing
  - Mixed precision training
  - Model parallelism for large vocabularies
:::

::::
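
The LogSumExp identity above is what makes log-probabilities computable for very large logits. The sketch below uses only the standard library; the logit values are chosen to be large enough that a naive `math.exp` call would overflow.

```python
import math

def logsumexp(xs):
    """log(sum(exp(x))) computed as max(x) + log(sum(exp(x - max(x))))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def log_softmax(xs):
    """Log-probabilities: x_i - logsumexp(x)."""
    lse = logsumexp(xs)
    return [x - lse for x in xs]

big_logits = [1000.0, 999.0, 998.0]
# math.exp(1000.0) raises OverflowError; the shifted form never
# exponentiates anything larger than 0.
log_probs = log_softmax(big_logits)
```

Working in log space also avoids underflow when many small probabilities are multiplied, as in sequence likelihoods.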

## Advanced Token Generation Strategies

:::: {.columns}

::: {.column width="60%"}
- **Repetition penalties**:
  - Reduce probability of recently generated tokens
  - Prevent repetitive financial phrases
  - Encourage diverse vocabulary usage
  - Balance between coherence and variety
  
- **Length penalties**:
  - Bias toward longer or shorter sequences
  - Important for financial document structure
  - Executive summary vs. detailed analysis
  - Regulatory compliance requirements
:::

::: {.column width="40%"}
- **Constrained generation**:
  
  **Financial format constraints**:
  - Currency formatting ($1,234.56)
  - Percentage notation (12.34%)
  - Date standardization (Q1 2024)
  - Ticker symbol validation (AAPL, MSFT)
  
  **Content constraints**:
  - Regulatory compliance checking
  - Factual accuracy verification
  - Risk disclosure requirements
  - Professional tone maintenance
  
- **Quality control mechanisms**:
  - Post-processing validation
  - Rule-based filtering
  - Confidence thresholding
  - Human-in-the-loop verification
:::

::::
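
Two of the ideas on this slide can be sketched briefly. The repetition penalty below follows one common formulation (positive logits are divided by the penalty, negative ones multiplied), which is an assumption here rather than something the slide specifies. The regular expressions are hypothetical post-processing checks for the formats listed above.

```python
import re

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Lower the logits of tokens that were already generated."""
    out = list(logits)
    for i in set(generated_ids):
        # Dividing a positive logit or multiplying a negative one
        # both push the token's probability down.
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out

# Hypothetical format validators for post-processing generated text
CURRENCY = re.compile(r"^\$\d{1,3}(,\d{3})*\.\d{2}$")   # e.g. $1,234.56
PERCENT = re.compile(r"^\d+(\.\d+)?%$")                 # e.g. 12.34%
QUARTER = re.compile(r"^Q[1-4] \d{4}$")                 # e.g. Q1 2024

penalized = apply_repetition_penalty([2.0, -1.0, 0.5], generated_ids=[0, 1], penalty=2.0)
```

Checks like these only validate form; they cannot confirm that a well-formatted number is factually correct.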

## Evaluation Metrics for Financial Token Generation

:::: {.columns}

::: {.column width="60%"}
- **Intrinsic metrics**:
  - Perplexity on financial corpora
  - BLEU scores for reference comparisons
  - ROUGE scores for summarization tasks
  - BERTScore for semantic similarity
  
- **Extrinsic metrics**:
  - Financial accuracy of generated numbers
  - Compliance with regulatory requirements
  - Professional tone and style consistency
  - Domain expert evaluation scores
:::

::: {.column width="40%"}
- **Financial-specific evaluation**:
  
  **Numerical accuracy**:
  - Correct calculation of financial ratios
  - Consistent units and formatting
  - Reasonable value ranges
  - Mathematical relationship preservation
  
  **Domain coherence**:
  - Appropriate financial terminology
  - Logical sequence of financial concepts
  - Compliance with industry standards
  - Factual consistency with known data
  
- **Human evaluation criteria**:
  - Professional appropriateness
  - Analytical insight quality
  - Recommendation soundness
  - Risk assessment accuracy
:::

::::

## Debugging and Interpretability

:::: {.columns}

::: {.column width="60%"}
- **Probability analysis techniques**:
  - Ranking top-k predictions at each step
  - Analyzing probability mass distribution
  - Identifying confident vs. uncertain predictions
  - Tracking probability changes across layers
  
- **Attention visualization**:
  - Which input tokens influenced final prediction
  - Head-specific attention patterns
  - Layer-wise attention evolution
  - Financial entity relationship mapping
:::

::: {.column width="40%"}
- **Diagnostic tools**:
  
  **Probability debugging**:
  ```
  Top 5 predictions:
  1. "earnings" (p=0.34)
  2. "revenue" (p=0.22) 
  3. "profit" (p=0.18)
  4. "income" (p=0.12)
  5. "performance" (p=0.08)
  ```
  
  **Attention analysis**:
  - Input: "Q3 financial results show"
  - High attention: "Q3" → time context
  - High attention: "financial" → domain context
  - Output prediction: "strong" (financial qualifier)
  
- **Model introspection**:
  - Layer-wise representation analysis
  - Neuron activation patterns
  - Concept emergence tracking
:::

::::
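
A listing like the "Top 5 predictions" example can be produced from any probability vector with a few lines. The vocabulary and probabilities below reuse the slide's numbers; the sixth token, which absorbs the remaining 0.06 of probability mass, is a made-up placeholder.

```python
def top_k_predictions(probs, vocab, k=5):
    """Return the k (token, probability) pairs with the highest probability."""
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

vocab = ["earnings", "revenue", "profit", "income", "performance", "losses"]
probs = [0.34, 0.22, 0.18, 0.12, 0.08, 0.06]   # "losses" is a hypothetical filler

top5 = top_k_predictions(probs, vocab)
for rank, (token, p) in enumerate(top5, start=1):
    print(f'{rank}. "{token}" (p={p:.2f})')
```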

## Future Directions and Research

:::: {.columns}

::: {.column width="60%"}
- **Improved sampling methods**:
  - Contrastive search for coherent generation
  - Typical sampling for natural distributions
  - Mirostat for consistent text quality
  - Adaptive temperature scheduling
  
- **Financial domain adaptations**:
  - Specialized vocabularies for financial subdomains
  - Multi-modal token generation (text + numbers)
  - Structured output generation (tables, reports)
  - Fact-grounded generation techniques
:::

::: {.column width="40%"}
- **Emerging research areas**:
  
  **Controllable generation**:
  - Style and tone control
  - Risk level adjustment
  - Audience-specific adaptation
  - Compliance-aware generation
  
  **Multimodal integration**:
  - Chart and graph generation
  - Table structure prediction
  - Visual financial data interpretation
  - Cross-modal attention mechanisms
  
- **Evaluation advances**:
  - Automated fact-checking
  - Real-time accuracy assessment
  - Bias detection and mitigation
  - Professional quality metrics
:::

::::


# Tokenizers: Theory, Algorithms, and Practical Considerations

## 1. Motivation
Natural-language text is a sequence of Unicode code points that is **too sparse and high-entropy** for efficient statistical modeling.  A *tokenizer* transforms this raw character stream into a shorter sequence of discrete symbols drawn from a bounded *vocabulary* $V$, enabling language models to learn meaningful patterns.

- Unicode code points are just a way to represent characters in a computer. Each character is assigned a unique number, which allows computers to handle text in different languages and scripts. E.g. `A` is represented by the code point `U+0041`, and `€` by `U+20AC`.

- The complete set of Unicode code points is denoted $\Sigma$, and the set of all finite-length strings over $\Sigma$ is $\Sigma^{*}$.  A tokenizer maps these strings to a sequence of tokens from a finite vocabulary $V$.

**Definition.**  A tokenizer is a deterministic (or stochastic) mapping  
$\mathcal T : \Sigma^{*} \;\longrightarrow\; V^{*},$  
where $\Sigma$ is the character alphabet and $V$ is a finite set of tokens.

- $V^*$ is the set of all finite-length sequences of tokens, including the empty sequence.
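
The code points mentioned above can be inspected directly in Python with the built-in `ord` and `chr`. The helper name `code_point_label` is just an illustrative choice.

```python
def code_point_label(ch):
    """Format a single character's Unicode code point as U+XXXX."""
    return f"U+{ord(ch):04X}"

for ch in ["A", "€"]:
    # UTF-8 is one way to store a code point as bytes
    print(ch, code_point_label(ch), list(ch.encode("utf-8")))
```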

## Example 

- Imagine you are an alien from a civilization with only 3 symbols (letters): `A`, `B`, and `C`. 

- $\Sigma = \{A, B, C\}$ and $\Sigma^{*} = \{\epsilon, A, B, C, AA, AB, AC, BA, BB, BC, CA, CB, CC, AAA, AAB, \ldots\}$, where $\epsilon$ is the empty string.
- You are creating your own LLM and you define a vocabulary $V = \{A, B, C, AB, AC, BA, BB, BC\}$. Recall that $|\Sigma^*| = \infty$.
- Your tokenizer is a function $\mathcal T : \Sigma^{*} \longrightarrow V^{*}$ that maps strings from $\Sigma^{*}$ to sequences of tokens in $V^{*}$. For example
$$
\mathcal T(AA) = (A, A), \quad \mathcal T(AB) = (AB), \quad \mathcal T(ACB) = (AC, B), \quad \mathcal T(ABC) = (AB, C).
$$
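
The slide does not say which rule $\mathcal T$ uses to split a string. One simple rule that reproduces all four examples is greedy longest-match: at each position, take the longest vocabulary entry that fits. The sketch below assumes that rule.

```python
V = {"A", "B", "C", "AB", "AC", "BA", "BB", "BC"}

def tokenize(text, vocab=V):
    """Greedy longest-match tokenization (an assumed rule for this example)."""
    max_len = max(len(t) for t in vocab)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            raise ValueError(f"cannot tokenize {text[i]!r}")
    return tuple(tokens)

def detokenize(tokens):
    """Inverse mapping: concatenate the tokens."""
    return "".join(tokens)

print(tokenize("ACB"), tokenize("ABC"))
```

Because every single letter of $\Sigma$ is in $V$, this tokenizer never fails on a string over $\{A, B, C\}$.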


## 2. What would you like from a tokenizer?
1. **Coverage**: every input string should be tokenisable without `UNK` (unknown) tokens, i.e. any input can be expressed using only tokens from the vocabulary.

2. **Compression**: minimise the expected token sequence length $\mathbb E_{x\sim\mathcal D}\big[|\mathcal T(x)|\big]$ over texts $x$ drawn from the target corpus distribution $\mathcal D$.
3. **Consistency**: identical substrings map to identical token sequences.  
4. **Latency**: $\mathcal T$ should run in $O(|x|)$ time (linear time).
5. **Reversibility**: decoding $\mathcal T^{-1}$ must recover the original text (modulo normalisation). E.g. $\mathcal T^{-1}(AB, A, C) = ABAC$.

Balancing these criteria leads to different tokenization families.

## 3. Taxonomy of Tokenizers
| Family | Vocabulary Size $|V|$ | Sequence Length | OOV Risk | Typical Use |
|--------|----------------------|-----------------|----------|-------------|
| **Character** | $|\Sigma|\approx 10^{2}$–$10^{5}$ (script-dependent) | High | None | OCR, robust systems |
| **Word**      | $\sim 10^{5}$          | Low  | High | Early NLP, controlled domains |
| **Sub-word**  | $3\times10^{4}$–$2.5\times10^{5}$ | Medium | Very low | Modern LLMs |

## 4. Training a Tokenizer
- Tokenizers can be trained on a corpus of text to learn the most effective way to split the text into tokens.
- The training process involves analyzing the frequency of character sequences in the corpus and selecting the most common sequences as tokens.
- The goal is to create a vocabulary that balances coverage, compression, and consistency.
- The most common algorithms for training tokenizers are:
  - Byte-Pair Encoding (BPE): 
    - Iteratively merges the most frequent pairs of characters or tokens until a desired vocabulary size is reached.
  - Unigram Language Model: 
    - Starts from a large candidate vocabulary and iteratively prunes the tokens whose removal least reduces the corpus likelihood under a unigram model.
  - WordPiece:
    - Similar to BPE, but merges the pair that most increases the likelihood of the training corpus rather than simply the most frequent pair.
- Most practitioners do not train their own tokenizers; they reuse the tokenizer shipped with the pre-trained model they build on.
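
The BPE merge loop described above can be sketched as follows. This is a minimal character-level version on a toy word list; production tokenizers typically work on bytes, handle word boundaries, and use far more efficient data structures.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Each word is a tuple of symbols, weighted by how often it occurs
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the corpus
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)         # most frequent pair
        merges.append(best)
        # Rewrite every word with the chosen pair merged into one symbol
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges

words = ["bank"] * 3 + ["banks"] * 2 + ["bond"]   # toy corpus
merges = train_bpe(words, num_merges=3)
```

On this toy corpus three merges are enough to turn the frequent word `bank` into a single token, while the rarer `bond` stays split into characters.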


# Classification and Scalability of LLMs

## LLM Classification Framework

:::: {.columns}

::: {.column width="60%"}
**By Architecture:**

- **Autoregressive (Decoder-only)**: GPT family, Claude, Llama
- **Autoencoding (Encoder-only)**: BERT, RoBERTa, DistilBERT
- **Encoder-Decoder**: T5, BART, FLAN-T5

**By Training Approach:**

- **Pre-training**: Self-supervised learning on massive corpora
- **Fine-tuning**: Task-specific supervised learning
- **Instruction-following**: Trained to follow human instructions
- **Reinforcement Learning from Human Feedback (RLHF)**
:::

::: {.column width="40%"}
![LLM Classification](../../images/llm_classification_placeholder.png)
:::

::::

---

## Model Architecture Types

### Autoregressive Models (Decoder-only)
- **Examples**: GPT-3/4, Claude, Llama, Mistral
- **Training**: Next-token prediction
- **Use cases**: Text generation, completion, conversation
- **Advantages**: Excellent for generative tasks
- **Disadvantages**: Left-to-right context only; less efficient for understanding tasks such as classification

### Autoencoding Models (Encoder-only)
- **Examples**: BERT, RoBERTa, DistilBERT
- **Training**: Masked language modeling
- **Use cases**: Classification, sentiment analysis, NER
- **Advantages**: Bidirectional context, efficient for understanding
- **Disadvantages**: Cannot generate text naturally

### Encoder-Decoder Models
- **Examples**: T5, BART, FLAN-T5
- **Training**: Various objectives (span corruption, denoising)
- **Use cases**: Translation, summarization, question answering
- **Advantages**: Flexible for both understanding and generation
- **Disadvantages**: More complex architecture

---

## Training Paradigms

### Pre-training Objectives

**Autoregressive (AR):**
$$P(\text{sequence}) = \prod_{i=1}^{n} P(x_i | x_1, \ldots, x_{i-1})$$
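
As a small numerical illustration of this factorization, the sketch below multiplies a set of made-up conditional probabilities in log space and derives the per-token loss and perplexity that autoregressive training minimizes.

```python
import math

# Hypothetical values of P(x_i | x_1, ..., x_{i-1}) for a 4-token sequence
cond_probs = [0.20, 0.50, 0.90, 0.40]

log_prob = sum(math.log(p) for p in cond_probs)   # log P(sequence)
seq_prob = math.exp(log_prob)                     # product of the conditionals
nll_per_token = -log_prob / len(cond_probs)       # average cross-entropy (nats)
perplexity = math.exp(nll_per_token)
```

Summing logs instead of multiplying probabilities keeps long sequences from underflowing to zero.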

**Masked Language Modeling (MLM):**

- Randomly mask tokens and predict them
- Bidirectional context understanding

**Span Corruption (T5-style):**

- Mask contiguous spans of text
- Predict the masked spans
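
Span corruption can be sketched as a function that replaces each chosen span with a sentinel token and builds the matching target sequence. The `<extra_id_k>` sentinel naming follows the T5 convention; the sentence and the span positions are illustrative, and real pre-training picks the spans at random.

```python
def span_corrupt(tokens, spans):
    """Mask the given spans. `spans` is a sorted list of non-overlapping
    (start, end) index pairs, with `end` exclusive."""
    inputs, targets, cursor = [], [], 0
    for k, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"
        inputs += tokens[cursor:start] + [sentinel]   # span replaced by sentinel
        targets += [sentinel] + tokens[start:end]     # sentinel followed by the span
        cursor = end
    inputs += tokens[cursor:]
    targets.append(f"<extra_id_{len(spans)}>")        # closing sentinel
    return inputs, targets

tokens = "The company reported strong quarterly earnings growth".split()
inputs, targets = span_corrupt(tokens, [(1, 3), (5, 6)])
print(" ".join(inputs))
print(" ".join(targets))
```

The encoder sees the corrupted input and the decoder is trained to emit the target, so the model only has to generate the missing spans rather than the whole sentence.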

---

## Model Scale Categories

| Category | Parameters | Examples | Characteristics |
|----------|-----------|----------|----------------|
| **Small** | <1B | DistilBERT, ALBERT, BERT-Large | Fast inference, limited capabilities |
| **Base** | 1B-10B | GPT-2 XL (1.5B), Llama 2 7B, Mistral 7B | Good balance of performance/efficiency |
| **Large** | 10B-100B | T5-XXL (11B), Llama 2 70B | Strong performance, higher compute |
| **Extra Large** | 100B+ | GPT-3 (175B), PaLM (540B); GPT-4 and Claude (sizes undisclosed) | State-of-the-art, very high compute |

---

## Specialized LLM Categories

### Code Models
- **GitHub Copilot** (based on Codex)
- **CodeT5**, **InCoder**, **CodeGen**
- **StarCoder**, **WizardCoder**

### Financial Domain Models
- **BloombergGPT**: 50B parameters, trained on financial data
- **FinBERT**: BERT fine-tuned for financial sentiment
- **FinGPT**: Open-source framework for fine-tuning LLMs on financial data

### Multimodal Models
- **GPT-4V**: Vision capabilities
- **Claude 3**: Image understanding
- **DALL-E 3**: Text-to-image generation

---

# LLM Providers

## Frontier Labs (Global Leaders)

| Company | Flagship or Latest LLM(s) (2024 – 25) | Brief description |
|---------|---------------------------------------|-------------------|
| **OpenAI** | GPT-4o, GPT-4.1 mini | Original ChatGPT maker; continues to set benchmark accuracy and multimodality |
| **Anthropic** | Claude 4 (Opus & Sonnet) | Safety-first research lab spun out of OpenAI; emphasises "constitutional AI" |
| **Google DeepMind** | Gemini 2.5 Pro | Multimodal model powering Google Search, Workspace & the Gemini app |
| **Microsoft (Azure AI)** | Phi-3 / Phi-4 SLM family | Compact open-source "small language models"; also resells OpenAI models via Azure |
| **Meta** | Llama 4 (Scout & Maverick) | Open-weight, natively-multimodal successor to Llama 2/3 |

---

## Big Tech & Hardware Companies

| Company | Flagship or Latest LLM(s) (2024 – 25) | Brief description |
|---------|---------------------------------------|-------------------|
| **Amazon AWS** | Titan Text G1 (Premier / Express) | Proprietary Bedrock-hosted models for enterprise workloads |
| **Apple** | 3 B "Apple Intelligence" model | First fully on-device LLM for iPhone, iPad & Mac |
| **xAI** | Grok 3 (Think / Fast) | Elon Musk-backed lab focused on real-time reasoning and openness |
| **NVIDIA** | Nemotron-4 340B | Open models optimised for synthetic-data generation and self-training |

---

## Enterprise & Specialized Players (Part 1)

| Company | Flagship or Latest LLM(s) (2024 – 25) | Brief description |
|---------|---------------------------------------|-------------------|
| **Mistral AI** | Mistral Medium 3 | High-performance, permissively licensed models at low latency & cost |
| **Cohere** | Command R & Command A | Retrieval-augmented, long-context LLMs built for private data |
| **AI21 Labs** | Jurassic-2, Jamba | Early entrant offering controllable text generation APIs |
| **Databricks** | DBRX | Open-weight Mixture-of-Experts tuned for data-engineering use cases |
| **Snowflake** | Arctic | Apache-licensed Mixture-of-Experts model (128 experts) for cost-efficient enterprise AI |
| **IBM** | Granite 4.0 family | Trustworthy, business-oriented models aligned with EU AI Act |

---

## Enterprise & Specialized Players (Part 2)

| Company | Flagship or Latest LLM(s) (2024 – 25) | Brief description |
|---------|---------------------------------------|-------------------|
| **Salesforce AI** | xGen (small / code) | Long-context, domain-tuned models powering Einstein Copilot |
| **Stability AI** | Stable LM 2 (1.6 B → 12 B) | Lightweight multilingual open-source models for consumer GPUs |
| **Adept AI** | Fuyu-Heavy & Fuyu-8B | Multimodal transformers designed for agentic tasks and UI control |
| **Reka AI** | Reka Flash 21 B | Efficient multilingual reasoning model for real-time & edge |
| **Aleph Alpha** | Luminous / Pharia | European sovereign stack with explainability APIs |
| **Together AI** | RedPajama-v2 & training-as-a-service | Open datasets + cloud for fine-tuning and hosting OSS models |

---

## China Ecosystem 

China's LLM ecosystem stands out for its rapid development, large-scale models, and focus on domestic applications, backed by government support for AI initiatives.

US export controls restrict sales of advanced AI chips to China, which has pushed Chinese firms to develop indigenous AI capabilities. Chinese companies also prioritise models that handle the Chinese language and cultural context well.




| Company | Flagship or Latest LLM(s) (2024 – 25) | Brief description |
|---------|---------------------------------------|-------------------|
| **Baidu** | Ernie 4.0 Turbo | Search-integrated LLM; 300 M+ users |
| **Alibaba Cloud** | Qwen 3 family | Hybrid-reasoning models matching frontier benchmarks |
| **Tencent** | Hunyuan Turbo S | Fast Transformer-Mamba MoE model for Chinese & maths tasks |
| **Huawei Cloud** | PanGu 5 / Ultra MoE | Large-scale models optimised for Ascend NPUs & on-prem deployment |
| **SenseTime** | SenseNova 5.5 | China's first real-time multimodal model series |
| **Zhipu AI** | GLM-4 (Air / Flash) | Open-source bilingual models with free API tier |
| **DeepSeek** | DeepSeek-V3 (671 B MoE) | Open-weight MoE excelling at maths & code |
| **01.AI** | Yi-1.5 (6 B → 34 B) | Apache-licensed zh-en models for community fine-tuning |

---

## Open-Source & Community

| Company | Flagship or Latest LLM(s) (2024 – 25) | Brief description |
|---------|---------------------------------------|-------------------|
| **BigScience + Hugging Face** | BLOOM 176 B | First 100 B+ multilingual model released with full weights |
| **Eleuther AI** | GPT-NeoX-20B | Volunteer collective behind "The Pile"; open 20 B model |
| **Cerebras Systems** | Cerebras-GPT family | 111 M → 13 B models trained on wafer-scale CS-2 hardware |
| **Ollama** | Ollama (model runner, no in-house LLMs) | Open-source tool for downloading and running open-weight models locally, with a focus on simplicity |

- [A Survey of LLM Surveys](https://github.com/NiuTrans/ABigSurveyOfLLMs)
