# <center> MSBA 6461: Advanced AI for Natural Language Processing </center>
<center> Summer 2025, Mochen Yang </center>

## <center> Large Language Models </center>

# Table of Contents
1. [Large Language Models](#llm)
    - [General Process of LLM Training](#llm_train)
    - [Common LLM Architecture](#llm_arch)
    - [LLM Use Cases](#llm_use)
1. [Application Case: BERT](#bert)
    - [What is BERT?](#bert_intro)
    - [Use BERT](#bert_example)
1. [Additional Resources](#resource)

# Large Language Models <a name="llm"></a>

Large Language Models (LLMs) are generative AI models that can process, "understand", and generate text. They are "large" because they are typically trained on enormous amounts of textual data and have a huge number of parameters. They learn representations of language and can be further fine-tuned for a variety of language tasks. Current major players in the LLM arena include:
- [OpenAI (GPT Models)](https://openai.com/)
- [Meta (Llama Models)](https://www.llama.com/)
- [Anthropic (Claude Models)](https://www.claude.ai/)
- [Mistral](https://mistral.ai/en)
- [DeepSeek](https://www.deepseek.com/)

LLMs are perhaps the most exciting major advancement in NLP currently. This lecture is designed to provide a brief exposition of LLMs.

## General Process of LLM Training <a name="llm_train"></a>

Broadly speaking, LLM training consists of two distinct stages: **pre-training** and **post-training**. The two stages use different data / techniques and serve different purposes.
- **Pre-training**: Pre-training is the process of representation learning from huge quantities of raw data, typically in an unsupervised manner (this stage is also referred to as unsupervised pre-training, or self-supervised pre-training, because the prediction targets come from the text itself rather than from human labels). The training task is similar to what we have discussed in sequence-to-sequence modeling (Notebook 3) -- autoregressively predicting the next token based on previous tokens. The goal of pre-training is to obtain high-quality token representations;
- **Post-training**: Post-training is the process of "fine-tuning" the LLM to perform certain specific tasks. It can be done via both **supervised learning** and **reinforcement learning**, with some differences in objectives:
    - **Supervised post-training**: fine-tuning the LLM to _perform certain task via supervised learning_. The BERT demo in the second half of this notebook is a very simple example of this. In practice, LLM providers will fine-tune their pre-trained LLMs on a wide variety of different tasks;
    - **Reinforcement post-training**: fine-tuning the LLM to _better align with human preferences via reinforcement learning_. For many tasks that are not completely objective, humans may have a preference for one response over another (even if both responses are technically correct, e.g., humans may prefer a more "polite" LLM than a more "blunt" one). Carrying out reinforcement learning with such human feedback signals can further adjust the LLMs to generate responses that are more human-acceptable.
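
The pre-training objective described above (predict each next token from the tokens before it) can be sketched in a few lines. This is a minimal illustration, not how any particular LLM is implemented: the function name `next_token_loss`, the toy vocabulary size, and the random logits standing in for model outputs are all assumptions made for the example.

```python
import numpy as np

def next_token_loss(logits, token_ids):
    """Average cross-entropy for next-token prediction.

    logits:    (T, V) array; row t holds (hypothetical) model scores for the token at position t+1
    token_ids: (T,) array of observed token ids
    """
    # Predictions come from positions 0..T-2; targets are the same sequence shifted left by one
    pred_logits = logits[:-1]
    targets = token_ids[1:]
    # Numerically stable log-softmax over the vocabulary
    shifted = pred_logits - pred_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Negative log-likelihood of each observed next token, averaged over positions
    return -log_probs[np.arange(len(targets)), targets].mean()

# Toy example: vocabulary of 5 tokens, sequence of 4 tokens
rng = np.random.default_rng(0)
token_ids = np.array([2, 0, 3, 1])
random_logits = rng.normal(size=(4, 5))   # stand-in for model outputs
uniform_logits = np.zeros((4, 5))         # a model that guesses uniformly

print(next_token_loss(random_logits, token_ids))
print(next_token_loss(uniform_logits, token_ids))  # equals log(5) for a uniform guess
```

Pre-training amounts to adjusting model parameters so that this loss goes down over a very large corpus; no human-provided labels are needed because the targets are simply the next tokens.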

## Common LLM Architecture <a name="llm_arch"></a>

Even though LLMs are mostly based on transformers, they can follow different architectures, including **encoder-decoder**, **encoder-only** (e.g., BERT), and **decoder-only** (e.g., Mistral). 

1. The **encoder-decoder** architecture is what we discussed in the last lecture. It is sequence-to-sequence.
2. The **encoder-only** architecture uses just the encoder part, and it is inherently an encoding model. That means it takes an input sequence and produces its encoded representation. Those representations can then be used to carry out task-specific fine-tuning (e.g., act as inputs to a classifier). It is _not_ sequence-to-sequence.
3. The **decoder-only** architecture uses just the decoder part, but it can perform sequence-to-sequence tasks. The trick is to prepend the input sequence (the "prompt") ahead of the output sequence and give the entire thing to the decoder. It will learn to predict the next token in the autoregressive manner.
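
One way to see the difference between the encoder-style and decoder-style architectures is through their attention masks, along with how a decoder-only model turns a prompt and a response into a single sequence. The sketch below is illustrative only; the helper names `full_mask` and `causal_mask` and the toy tokens are assumptions, not part of any library.

```python
import numpy as np

def full_mask(T):
    # Encoder-style (bidirectional) attention: every position can attend to every position
    return np.ones((T, T), dtype=int)

def causal_mask(T):
    # Decoder-style attention: position i can attend to positions 0..i only
    return np.tril(np.ones((T, T), dtype=int))

# Decoder-only trick: the prompt is prepended to the output to form one sequence
prompt = ["translate", ":", "hello"]
response = ["bonjour", "<eos>"]
sequence = prompt + response

print(full_mask(3))
print(causal_mask(len(sequence)))

# At each step the decoder predicts the next token from the prefix seen so far
for t in range(len(prompt), len(sequence)):
    print(sequence[:t], "->", sequence[t])
```

Row `i` of each mask marks which positions token `i` may attend to. The lower-triangular mask is what makes next-token prediction meaningful, since a token cannot peek at the tokens it is supposed to predict.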

Furthermore, because of the large scale of LLMs (transformers with billions of parameters), using them to generate responses (a.k.a. "LLM inference") can be very costly and slow. A powerful architectural innovation to address this issue is called **mixture-of-experts** (MoE). The idea of MoE is not to use the entire transformer network to process each input token, but instead to use different parts of the network (each called an "expert") to deal with tokens of different types. To achieve effective MoE, there needs to be a separate "routing" model (often a gated neural network) that decides which "expert" to invoke for each token. Because each expert is only a (sparse) subset of the entire network, only a fraction of the parameters is active for any given token, so MoE can significantly speed up LLM inference with little or no loss in response quality. For more technical details of MoE, see [this paper](https://www.jmlr.org/papers/volume23/21-0998/21-0998.pdf).
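
The routing idea can be sketched with a toy layer. This is a simplified illustration under stated assumptions: each "expert" is a single linear map (real experts are usually feed-forward sub-networks), the router picks exactly one expert per token (top-1 routing), and all weights are random rather than trained. The names `moe_layer`, `router_weights`, and `expert_weights` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, n_experts = 6, 8, 4

tokens = rng.normal(size=(n_tokens, d_model))
router_weights = rng.normal(size=(d_model, n_experts))           # the "routing" model
expert_weights = rng.normal(size=(n_experts, d_model, d_model))  # one linear map per expert

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def moe_layer(x):
    gate_probs = softmax(x @ router_weights)   # (n_tokens, n_experts)
    chosen = gate_probs.argmax(axis=1)         # top-1 expert for each token
    out = np.zeros_like(x)
    for e in range(n_experts):
        idx = np.where(chosen == e)[0]
        if idx.size > 0:
            # Only the chosen expert's parameters are used for these tokens,
            # scaled by the router's confidence in that expert
            out[idx] = (x[idx] @ expert_weights[e]) * gate_probs[idx, e][:, None]
    return out, chosen

output, chosen = moe_layer(tokens)
print(chosen)        # which expert handled each token
print(output.shape)
```

Each token touches only one of the four expert matrices, which is where the inference savings come from: the total parameter count grows with the number of experts, but the computation per token does not.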

## LLM Use Cases <a name="llm_use"></a>

Given LLMs' wide range of capabilities, it is constructive to think of them not as a single tool, but as a [general purpose technology](https://en.wikipedia.org/wiki/General-purpose_technology). Their use cases include at least the following (and are quickly expanding over time):
- **Completion**: this is the baseline use of a foundation LLM. Users send question prompts, and the LLM returns answers (sometimes using tools to do so).
- **Retrieval Augmented Generation** (RAG): enhancing LLM responses with information retrieved from user-supplied files / databases. This allows the LLM to refer to (potentially private) user-owned information when composing its answers.
- **Fine-Tuning**: further modifying a foundation LLM with task-specific data. This is conceptually identical to supervised post-training, except that it is done by LLM users. It allows users to customize a foundation LLM to their own use cases.
- **Agents**: in the above use cases, the model usually assumes a "passive" role, acting only when asked to do something. An agentic LLM is more "active" (i.e., has some degree of "agency", hence the name) and can decide to do something on its own. As an example, OpenAI's [Operator](https://openai.com/index/introducing-operator/) can perform various tasks using a built-in browser (e.g., book a flight for you or manage your calendar).
- **Reasoning**: using LLMs for reasoning tasks. This is at the current frontier of LLM research and development. OpenAI's o-series models and DeepSeek's R-series models achieve reasoning by carrying out (sometimes implicitly) chain-of-thought processes before generating actual answers.
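
To make the RAG use case more concrete, here is a minimal sketch of its retrieval step: find the most relevant document for a question, then place it in the prompt. Everything in it is an assumption made for illustration: the toy documents, the bag-of-words cosine similarity (real systems typically use learned embeddings and a vector database), and the helper names `retrieve` and `build_prompt`. The resulting prompt would then be sent to an LLM, which is not shown here.

```python
import math
from collections import Counter

documents = [
    "The refund policy allows returns within 30 days of purchase.",
    "Our office is closed on public holidays.",
    "Shipping to Canada takes five to seven business days.",
]

def bag_of_words(text):
    # Lowercase, drop simple punctuation, and count words
    return Counter(text.lower().replace(".", "").replace("?", "").split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, docs, k=1):
    # Rank documents by similarity to the question and keep the top k
    q = bag_of_words(question)
    ranked = sorted(docs, key=lambda d: cosine(q, bag_of_words(d)), reverse=True)
    return ranked[:k]

def build_prompt(question, docs):
    # Put the retrieved text in front of the question
    context = "\n".join(retrieve(question, docs))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

print(build_prompt("How long does shipping to Canada take?", documents))
```

The LLM itself is unchanged in this setup; only the prompt is enriched, which is what distinguishes RAG from fine-tuning.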

# Application Case: BERT <a name="bert"></a>

## What is BERT? <a name="bert_intro"></a>

BERT stands for _**B**idirectional **E**ncoder **R**epresentations from **T**ransformers_. It is a **language representation model**, which means it takes raw text and generates a meaningful representation (e.g., embedding) of it. It was developed by Google in 2018. With everything we have discussed so far, you are ready to make sense of all the key components of BERT:

1. **B**idirectional means that the encoder makes use of full self-attention where every position can attend to every other position;
2. **E**ncoder **R**epresentations means that the model is aiming to generate representation of the input sequence, i.e., it is an encoder-only architecture;
3. **T**ransformers means that BERT uses a transformer architecture with self-attention.

## Use BERT <a name="bert_example"></a>

Google has released a number of different BERT models, trained with different hyperparameters. [Here is a directory of all those models](https://www.tensorflow.org/tutorials/text/classify_text_with_bert#choose_a_bert_model_to_fine-tune). You will see that each model is identified by three parameters:
- $L$: this is the number of transformer blocks. You can think of it as the number of "layers";
- $H$: this is the dimension of the embedding. We called this $D$ in our discussion of transformers;
- $A$: this is the number of heads in multi-head self-attention. This means cutting the embedding into $A$ pieces and applying self-attention to each piece.
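
As a quick check on how these parameters fit together, the sketch below uses the BERT-base configuration ($L=12$, $H=768$, $A=12$) and splits a random array into heads. It is a simplification: in an actual transformer the split is applied to the projected queries, keys, and values rather than to the raw embedding, and the array `hidden` here is random data standing in for real activations.

```python
import numpy as np

L, H, A = 12, 768, 12   # BERT-base configuration
head_dim = H // A       # each head works on a 64-dimensional slice

seq_len = 9
hidden = np.random.default_rng(0).normal(size=(seq_len, H))

# Split the embedding dimension into A heads: (seq_len, H) -> (A, seq_len, head_dim)
heads = hidden.reshape(seq_len, A, head_dim).transpose(1, 0, 2)

print(head_dim)      # 64
print(heads.shape)   # (12, 9, 64)
```

Self-attention is then computed separately within each of the 12 slices, and the results are concatenated back into a 768-dimensional vector per token.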

You can access pre-trained BERT models and potentially fine-tune them for your own ML tasks via [Hugging Face](https://huggingface.co/), an online platform that hosts many commonly used pre-trained models. In the following example, we access a basic BERT model and use it to encode some text. See this [page](https://huggingface.co/bert-base-uncased) for detailed documentation.

In [None]:
# install the transformers package from Hugging Face
#!pip install transformers

Successfully installed filelock-3.7.0 huggingface-hub-0.6.0 packaging-21.3 pyyaml-6.0 tokenizers-0.12.1 transformers-4.19.1


In [None]:
from transformers import BertTokenizer, TFBertModel

# fetch the pre-trained model (it will download a model file ~500M)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = TFBertModel.from_pretrained("bert-base-uncased")

Some layers from the model checkpoint at bert-base-uncased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-uncased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


In [None]:
# input text and encode
text = "We are using the BERT model!"
encoded_input = tokenizer(text, return_tensors='tf')
output = bert_model(encoded_input)

In [None]:
# Look at the tokenized input
# Question: what are tokens 101 and 102?
encoded_input

{'input_ids': <tf.Tensor: shape=(1, 9), dtype=int32, numpy=array([[  101,  2057,  2024,  2478,  1996, 14324,  2944,   999,   102]])>, 'token_type_ids': <tf.Tensor: shape=(1, 9), dtype=int32, numpy=array([[0, 0, 0, 0, 0, 0, 0, 0, 0]])>, 'attention_mask': <tf.Tensor: shape=(1, 9), dtype=int32, numpy=array([[1, 1, 1, 1, 1, 1, 1, 1, 1]])>}

In [None]:
# Look at the encoded input
# Question: what is the dimension of encoding?
# Question: why are there two encoding outputs? What are they?
output

TFBaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=<tf.Tensor: shape=(1, 9, 768), dtype=float32, numpy=
array([[[ 0.10261209,  0.18043919, -0.00554929, ..., -0.166134  ,
          0.26679957,  0.35773745],
        [ 0.263622  , -0.21110201, -0.57594675, ..., -0.20186077,
          1.308478  , -0.14822024],
        [ 0.12224663, -0.15183868, -0.36246365, ..., -0.56034166,
          0.18197185,  0.45692527],
        ...,
        [ 0.487611  ,  0.05848615, -0.26846886, ..., -0.64023006,
         -0.01316616, -0.00961822],
        [-0.16868652, -0.17555293, -0.15778571, ...,  0.54957277,
          0.45626837, -0.39924195],
        [ 0.52467674,  0.37009996, -0.21517405, ...,  0.00148578,
         -0.5219994 , -0.30393368]]], dtype=float32)>, pooler_output=<tf.Tensor: shape=(1, 768), dtype=float32, numpy=
array([[-8.78734827e-01, -3.25698197e-01, -3.28317106e-01,
         6.70523882e-01,  6.76294491e-02, -4.97857258e-02,
         8.80656004e-01,  2.76587784e-01, -1.80702090e-0

The next example shows how BERT can be used as an embedding layer to build a classification model. For illustration, the [sentiment classification dataset](https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences) is used. 

In [None]:
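# read and process data
import numpy as np

text = []
label = []
# each line holds a sentence and its integer sentiment label, separated by a tab
with open("datasets/sentiment.txt") as f:
    for line in f:
        line = line.rstrip('\n').split('\t')
        text.append(line[0])
        label.append(int(line[1]))
text = np.array(text)
label = np.array(label)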

In [None]:
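# Use BERT to encode texts
import tensorflow as tf
vectorized_text = tokenizer(text.tolist(), return_tensors='tf', padding=True)
bert_embeddings = bert_model(vectorized_text)['last_hidden_state']
# Zero out padded positions (BERT's outputs there are non-zero) so that a Masking(mask_value=0) layer can detect them
bert_embeddings *= tf.cast(vectorized_text['attention_mask'], tf.float32)[:, :, tf.newaxis]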

In [None]:
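# For illustration, build an LSTM model with BERT embeddings
import tensorflow as tf
from tensorflow import keras

embeddings = keras.layers.Input(shape = (bert_embeddings.shape[1], bert_embeddings.shape[2]))
# skip timesteps whose embedding is all zeros
masked_embeddings = keras.layers.Masking(mask_value=0)(embeddings)
# with return_state = True, the LSTM returns its last output plus the final hidden and cell states
h_last, h_final, c_final = keras.layers.LSTM(units = 128,
                                             return_state = True)(masked_embeddings)
pred = keras.layers.Dense(units = 1,
                          activation='sigmoid')(h_final)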

In [None]:
# Assemble model
model_bert_lstm = keras.Model(inputs = embeddings,
                              outputs = pred)

In [None]:
# configure training / optimization
model_bert_lstm.compile(loss = keras.losses.BinaryCrossentropy(),
                        optimizer='adam',
                        metrics=['accuracy'])

In [None]:
model_bert_lstm.summary()

Model: "functional_10"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_10 (InputLayer)        [(None, 100, 768)]        0         
_________________________________________________________________
masking_4 (Masking)          (None, 100, 768)          0         
_________________________________________________________________
lstm_8 (LSTM)                [(None, 128), (None, 128) 459264    
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 129       
Total params: 459,393
Trainable params: 459,393
Non-trainable params: 0
_________________________________________________________________


In [None]:
# training with 20% validation and 10 epochs.
model_bert_lstm.fit(x = bert_embeddings,
                    y = label,
                    batch_size = 32,
                    epochs = 10,
                    validation_split = 0.2)


# Additional Resources <a name="resource"></a>

- LLMs:
    - [Hyung Won Chung (OpenAI) Stanford lecture](https://www.youtube.com/watch?v=orDKvo8h71o)
    - [Paper that proposes the MoE architecture for LLM inference](https://www.jmlr.org/papers/volume23/21-0998/21-0998.pdf)

- BERT:
    - Original research paper that proposed BERT: [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805). In particular, Section 3 talks about BERT model architecture;
    - [Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing](https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html);
    - [Text Classification with BERT](https://www.tensorflow.org/tutorials/text/classify_text_with_bert).

<font color="blue">Some of my personal opinions: </font> A general trend in the development of language models is to _build extremely large models_, i.e., take the state-of-the-art architecture and train it with more and more parameters and on larger and larger datasets. However, looking back on what we have learned so far, you really need _fundamentally new ideas_ (e.g., from bag-of-words to embeddings, from simple RNNs to LSTMs, from RNNs + attention to transformers) to achieve significant (non-incremental) improvement. Therefore, although the transformer architecture is the current state-of-the-art, it is fundamentally unclear what else we need for the next breakthrough.