# Understanding large language models

Traditional ML/NLP methods excelled at categorization tasks (e.g., email spam classification) and pattern recognition, but <span style="color:red">underperformed in language tasks that demanded complex understanding and generation abilities</span>.

When we say language models "understand", we mean they can process and generate text in ways that appear coherent, <span style="color:red">not that they possess human-like consciousness or comprehension</span>.

While earlier NLP models were designed for specific tasks, <span style="color:green">LLMs demonstrate a broader proficiency</span>.

The success behind LLMs can be attributed to
1. <span style="color:green">The transformer architecture that underpins many LLMs</span>, and
2. <span style="color:green">The vast amounts of data they are trained on</span>.

# 1.1 What's an LLM?

<span style="color:#4ea9fb">A deep neural network designed to understand, generate, and respond to human-like text.</span>

**What does the "large" in "large language model" refer to?**
1. The <span style="color:#4ea9fb">**model's size**</span> in terms of parameters (tens or hundreds of billions), and
2. The <span style="color:#4ea9fb">**immense dataset**</span> it is trained on (up to trillions of tokens).

**What does an LLM do?**
- Essentially, it predicts the next word in a sequence.

**Why next word prediction?**
- It <span style="color:#4ea9fb">harnesses the inherent sequential nature of language</span> to train models on understanding context, structure, and relationships within text.

**What architecture do LLMs utilize?**
- **Transformers**
  - <span style="color:green">This architecture allows LLMs to pay selective attention to different parts of the input when making predictions.</span>

LLMs are often referred to as a form of generative artificial intelligence (*generative AI*, *GenAI*, or *GAI*), as they can *generate text*.

<br>
<img src="images/hierarchical-depiction-of-AI-ML-DL-LLM-GAI.png" width="600px">
<br>

# 1.2 Applications of LLMs

LLMs are invaluable for automating almost any task that involves parsing and generating text, and their applications are virtually endless.

**The topics that interest me the most are**
- <span style="color:green">Effective knowledge retrieval from vast volumes of text in specialized areas like medicine</span> or law.
- Multi-class Classification with detailed reasoning and explainability.
  

# 1.3 Stages of building and using LLMs

<p style="color:black; background-color:#F5C780; padding:15px">💡 Coding an LLM from the ground up is the best way to understand its mechanics and limitations, which in turn equips us with the knowledge needed to pretrain or fine-tune existing open-source LLMs on our own domain-specific datasets or tasks.</p>

**Why custom build LLMs?**
- LLMs tailored for specific tasks (e.g., BloombergGPT specialised for finance, <span style="color:green">LLMs tailored for medical Q&A</span>) can generally outperform general-purpose LLMs (e.g., ChatGPT).
- Offers several advantages
  - Data privacy.
  - Custom deployment (e.g., running locally on customer devices reduces server-related costs and latency).
  - Complete developer autonomy/control.

**Two key stages in creating an LLM?**
1. **Pre-training**
   - <span style="color:#4ea9fb">Initial phase where the model is trained on a large, diverse dataset (aka *raw* text) to <i>develop a broad understanding of the language</i> and predict the next word in the text</span>.
   - Why "raw" text?
     - <span style="color:red">Traditional ML/DL models require labelled data when trained via conventional supervised learning</span>.
     - On the other hand, <span style="color:#4ea9fb">LLMs use **self-supervised learning**, where the model generates its own label (**self-labelling**) from the input data (a.k.a **pseudo labels**)</span>.
   - These pre-trained models are also known as *base* or *foundation* models.
   - Capabilities
     - Text completion
     - Few-shot capabilities
   - Example
     - GPT-3 model (a precursor of ChatGPT).
2. **Fine-tuning**
   - <span style="color:#4ea9fb">A process where the pre-trained model (a.k.a. foundation model) is further trained on a narrower dataset that's more specific to a particular task or domain.</span>
   - Two most popular categories of fine-tuning
     - Classification fine-tuning
     - Instruction fine-tuning
  
| | Instruction fine-tuning | Classification fine-tuning |
| - | - | - |
| Labelled data | Instruction and answer pairs. | Texts and associated class labels. |
| Example | Query to translate a text (in English) accompanied by the correctly translated text (in German). | Emails associated with "spam" and "not spam" labels. |

<img src="images/two-stages-in-llm-creation.png" width="600px">
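
As a rough illustration of the two labelled-data formats in the table above, the sketch below builds one hypothetical record of each kind. The field names (`instruction`, `input`, `output`, `text`, `label`), the prompt template, and the helper functions are assumptions made for illustration, not a format prescribed by these notes.

```python
# Hypothetical records; field names and the prompt template are illustrative assumptions.
instruction_example = {
    "instruction": "Translate the following text into German.",
    "input": "Good morning",
    "output": "Guten Morgen",
}

classification_example = {
    "text": "You won a free prize! Click here.",
    "label": "spam",
}

# Class labels are usually mapped to integer ids before training.
LABEL_TO_ID = {"not spam": 0, "spam": 1}


def format_instruction(record):
    """Join instruction and input into a prompt; the output is the training target."""
    prompt = f"{record['instruction']}\n\n{record['input']}"
    return prompt, record["output"]


def format_classification(record):
    """Return the raw text and its integer class id."""
    return record["text"], LABEL_TO_ID[record["label"]]


prompt, target = format_instruction(instruction_example)
text, class_id = format_classification(classification_example)
```

In both cases the label comes from a human-curated dataset, which is what distinguishes fine-tuning data from the raw text used in pre-training.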

# 1.4 Introduction to transformer architecture

**Why "Attention Is All You Need"?**
- <span style="color:#4ea9fb"><b>Most modern LLMs rely on the *transformer* architecture</b></span>, which is a deep neural network (DNN) architecture introduced in the 2017 paper "Attention Is All You Need".
- <span style="color:#4ea9fb"><b>To understand LLMs, we must first understand the original transformer</b></span>, which was developed for machine translation (English to German and French).

**Two submodules of transformer:**
1. Encoder
2. Decoder

| | Encoder | Decoder |
| - | - | - |
| Description | Processes the input text and encodes it into numerical representations (vectors) that capture its contextual information. | Takes the encoded vectors and generates the output text. |
| Example: Translation | Encode the input text from source language into vectors. | Decode the vectors to generate output text in target language. | 

<img src="images/simplified-depiction-of-original-transformer-architecture.png" width="600px">

**Key component of transformers and LLMs**
- Both encoders and decoders consist of many layers connected by a self-attention mechanism.
- The <span style="color:#4ea9fb"><b>self-attention mechanism</b> allows the model to weigh the importance of different words or tokens in a sequence relative to each other, thus allowing the model to capture long-range dependencies and contextual relationships</span>.
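
The weighting idea described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the transformer's actual implementation: the token embeddings are made-up numbers, and the same vectors serve as queries, keys, and values, whereas a real transformer first applies learned projection matrices. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Convert raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def simple_self_attention(embeddings):
    """Return one context vector per token, plus the attention weights.

    Simplification (assumption): embeddings act as queries, keys, and values.
    """
    dim = len(embeddings[0])
    contexts, all_weights = [], []
    for query in embeddings:
        # Score the current token against every token in the sequence.
        scores = [dot(query, key) / math.sqrt(dim) for key in embeddings]
        weights = softmax(scores)
        # The context vector is the weighted sum of all token vectors.
        context = [
            sum(w * vec[i] for w, vec in zip(weights, embeddings))
            for i in range(dim)
        ]
        contexts.append(context)
        all_weights.append(weights)
    return contexts, all_weights


# Made-up 3-dimensional embeddings for a 3-token sequence.
tokens = ["the", "cat", "sat"]
embeddings = [[0.1, 0.9, 0.2], [0.8, 0.1, 0.5], [0.7, 0.2, 0.6]]
contexts, weights = simple_self_attention(embeddings)
```

Each row of `weights` says how much one token attends to every other token. Because every token is scored against the whole sequence, distance between tokens does not limit which relationships can be captured.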

**Later variants of transformer architecture**

| BERT | GPT |
| --- | --- |
| Bidirectional encoder representations from transformers | Generative pretrained transformers |
| Built upon the original transformer's encoder submodule. | Built upon the original transformer's decoder submodule. |
| Specialize in masked word prediction. | Trained to perform text completion tasks. |
| Strengths in text classification tasks such as<br>- Sentiment prediction<br>- Document categorization | Strengths in text generation tasks such as <br>- Machine translation<br>- Text summarization <br>- Fiction writing<br> - Writing computer code<br>... <br><br>In addition to text completion, <span style="color:green">GPT-like LLMs show remarkable versatility in their capabilities <b>without needing retraining, fine-tuning, or model architecture changes</b></span>, such as<br>- <span style="color:#4ea9fb">Zero-shot learning tasks</span><br>- <span style="color:#4ea9fb">Few-shot learning tasks</span>|
| Used in Twitter/X to detect toxic content.| ChatGPT, etc. |

<img src="images//visual-representation-of-transformer-encoder-and-decoder-modules.png" width="600px">
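
To make the difference between the two training objectives concrete, the sketch below builds one training example of each kind from the same sentence. It is a simplification: words stand in for tokens, and the helper names are hypothetical.

```python
def masked_word_example(words, mask_index, mask_token="[MASK]"):
    """BERT-style: hide one word; the model sees context on both sides of it."""
    masked = list(words)
    target = masked[mask_index]
    masked[mask_index] = mask_token
    return masked, target


def next_word_example(words, context_length):
    """GPT-style: the model sees only the words to the left of the target."""
    return list(words[:context_length]), words[context_length]


words = ["the", "model", "learns", "to", "predict", "one", "word", "at", "a", "time"]

bert_input, bert_target = masked_word_example(words, mask_index=2)
gpt_input, gpt_target = next_word_example(words, context_length=2)
```

Both examples have the same target word, `"learns"`, but the BERT-style input keeps the words after the gap while the GPT-style input stops just before it.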

<img src="images/gpt-like-llms-are-good-at-zero-shot-and-few-shot.png" width="600px">
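
Zero-shot and few-shot use differ only in what goes into the prompt; the model's weights stay unchanged. The sketch below shows one way such prompts could be assembled. The `build_prompt` helper, the `=>` separator, and the example tasks are illustrative assumptions.

```python
def build_prompt(task, query, examples=()):
    """Build a text prompt: zero-shot when no examples are given, few-shot otherwise."""
    lines = [task]
    for source, target in examples:
        lines.append(f"{source} => {target}")
    # The final line is left open for the model to complete.
    lines.append(f"{query} =>")
    return "\n".join(lines)


# Zero-shot: only the task description and the query.
zero_shot = build_prompt("Translate English to German:", "breakfast")

# Few-shot: a handful of solved examples precede the query.
few_shot = build_prompt(
    "Unscramble the letters:",
    "pohne",
    examples=[("gaot", "goat"), ("sheo", "shoe")],
)
```

Either string would then be passed to the model as ordinary input for text completion.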

<p style="color:black; background-color:#F5C780; padding:15px">💡 <b>Transformers vs. LLMs</b><br> Not all transformers are LLMs. <br>Not all LLMs are transformers.</p>

# 1.5 Utilizing large datasets

- <span style="color:red">Pre-training LLMs requires access to significant resources and is very expensive</span>.
- <span style="color:red">The GPT-3 pretraining cost is estimated at ~$4.6 million in terms of cloud computing credits.</span>

<img src="images/pretraining-dataset-of-popular-gpt-3-llm.png" width="600px">

<p style="color:black; background-color:#F5C780; padding:15px">💡 <b>GPT-3 dataset details</b><br>- The "Number of tokens" column totals ~499 billion, but the model was trained on only 300 billion tokens (⇒ ~60% of the original data).<br>- The 410 billion tokens from the CommonCrawl dataset require ~570 GB of storage (<i>perhaps, not a lot </i>🤔).<br>- Later iterations of GPT-3-like models, such as Meta's Llama, expanded the training scope to include<br>&nbsp;&nbsp;&nbsp;- Arxiv research papers (92 GB).<br>&nbsp;&nbsp;&nbsp;- StackExchange's code-related Q&As (78 GB).<br>- The GPT-3 paper authors did not share the training dataset.</p>
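
The figures in the note above can be sanity-checked with quick arithmetic. The sketch uses only the numbers quoted there; treating 1 GB as 10^9 bytes is an assumption.

```python
total_tokens = 499e9       # sum of the "Number of tokens" column
trained_tokens = 300e9     # tokens actually used for training
fraction_used = trained_tokens / total_tokens

commoncrawl_tokens = 410e9
commoncrawl_bytes = 570e9  # 570 GB, assuming 1 GB = 10**9 bytes
bytes_per_token = commoncrawl_bytes / commoncrawl_tokens

print(f"Fraction of tokens used: {fraction_used:.1%}")
print(f"Approx. bytes per token: {bytes_per_token:.2f}")
```

This gives roughly 60% of the available tokens used for training, and about 1.4 bytes of storage per CommonCrawl token under the stated assumption.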

# 1.6 A closer look at the GPT architecture

- GPT was originally introduced by OpenAI in "Improving Language Understanding by Generative Pre-Training" ([paper](https://mng.bz/x2qg)).
- ChatGPT was created by fine-tuning GPT-3 on a large instruction dataset using a method from OpenAI's InstructGPT [paper](https://arxiv.org/abs/2203.02155).

<img src="images/next-word-prediction-model-of-gpt-modules.png" width="700px">

**Perks of next word prediction tasks in training LLMs**
- <span style="color:#4ea9fb">**Next-word prediction** is a form of **self-supervised learning**, which is a form of **self-labeling**</span>.
- <span style="color:#4ea9fb">We can **create labels "on the fly"**, thus allowing us to **use massive unlabelled text datasets** for training LLMs</span>.
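
Creating labels "on the fly" can be sketched as a sliding window over the text, where each target is simply the input shifted one position to the right. This is a minimal illustration: whitespace splitting stands in for a real tokenizer, and `make_training_pairs` is a hypothetical helper name.

```python
def make_training_pairs(tokens, context_size):
    """Slide a window over the tokens; the target is the input shifted by one."""
    pairs = []
    for start in range(len(tokens) - context_size):
        inputs = tokens[start:start + context_size]
        targets = tokens[start + 1:start + context_size + 1]
        pairs.append((inputs, targets))
    return pairs


text = "LLMs learn to predict one word at a time"
tokens = text.split()  # assumption: whitespace split instead of a real tokenizer
pairs = make_training_pairs(tokens, context_size=4)

for inputs, targets in pairs:
    print(inputs, "->", targets)
```

No human annotation is involved: the text itself supplies every label, which is why raw, unlabelled corpora are sufficient for pre-training.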

<b>What is an <i>auto-regressive</i> model</b>?
- <span style="color:#4ea9fb">Models that incorporate previous outputs as inputs for future predictions</span>.

<b>Why are GPT-like models considered <i>auto-regressive</i> models?</b>
- GPT uses just the decoder part of the transformer architecture, without the encoder.
  - <span style="color:#4ea9fb">It's designed for <b>unidirectional, left-to-right processing</b></span>.
- Since <span style="color:#4ea9fb">decoder-only models like GPT generate text by predicting one word at a time in an iterative fashion</span>, they are considered <i>autoregressive</i> models.
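
The iterative, one-word-at-a-time generation described above can be sketched as a loop in which each prediction is appended to the input for the next step. The lookup table below is a hypothetical stand-in for a trained model; it only looks at the most recent word, whereas a real GPT-like model conditions on the whole context.

```python
# Hypothetical stand-in for a trained model: maps the latest word to the next word.
TOY_NEXT_WORD = {
    "the": "cat",
    "cat": "sat",
    "sat": "on",
    "on": "a",
    "a": "mat",
}


def predict_next(context):
    """Return the predicted next word, or None when the toy model has no prediction."""
    return TOY_NEXT_WORD.get(context[-1])


def generate(prompt, max_new_words):
    """Autoregressive loop: every output becomes part of the next input."""
    context = list(prompt)
    for _ in range(max_new_words):
        next_word = predict_next(context)
        if next_word is None:
            break
        context.append(next_word)
    return context


generated = generate(["the"], max_new_words=10)
print(" ".join(generated))
```

The loop never revisits earlier words, which mirrors the unidirectional, left-to-right processing of decoder-only models.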

<b>What's an <i>emergent behaviour</i> or <i>emergent abilities</i>?</b>
- <span style="color:#4ea9fb"><b>Ability to perform tasks the model was not explicitly trained to perform.</b></span>

**Benefits and capabilities of large scale, generative models**
<p style="color:black; background-color:#F5C780; padding:15px">&nbsp;&nbsp;&nbsp;💡Although GPT models were not specifically trained to perform translation (unlike the original transformer) but primarily on next-word prediction, the ability to translate emerges as a natural consequence of the model's exposure to vast quantities of multilingual data in diverse contexts.</p>

# 1.7 Building a large language model

<b>3 main stages of coding/building an LLM</b>
1. **Building an LLM:** Learn the fundamentals of data preprocessing steps + code the attention mechanism + LLM architecture.
2. **Foundation model**: Code & pre-train a GPT-like LLM + Model evaluation + loading openly available model weights.
3. **Fine-tuned model**: Fine-tuning to follow instructions: Classifier + Personal assistant.

# Summary

- LLMs have **transformed the field of NLP**, leading to advancements in understanding, generating, and translating human language.
- Modern LLMs are trained in two main steps:
  1. **Pre-training**: On a large corpus of unlabelled text in a self-supervised fashion, using the next word in a sentence as the label (pseudo-label or self-label).
  2. **Fine-tuning**: The base/foundation/pre-trained model is trained further on a smaller, narrower labelled dataset to follow instructions or perform classification tasks.
- **LLMs are based on the transformer architecture**. The key idea in the transformer architecture is the **"attention mechanism"**, which allows the model to weigh the importance of each word in a sequence relative to other words, thus allowing it to capture long-range dependencies and contextual relationships.
- Original transformer architecture consists of two submodules
    1. **Encoder**: For encoding input text into numerical vector representations that capture its contextual information.
    2. **Decoder**: For decoding the encoded vectors into text in the target language.
- LLMs trained for generative text tasks (like GPT-3 and ChatGPT) use only the decoder submodule of the transformer architecture.
- LLMs are pre-trained on **massive unlabelled raw datasets** (e.g., trillions of tokens).
- LLMs exhibit **emergent abilities**, such as capabilities to classify, translate, and summarize text.
  - 💡Although GPT models were not specifically trained to perform translation (unlike the original transformer) but primarily on next-word prediction, the ability to translate emerges as a natural consequence of the model's exposure to vast quantities of multilingual data in diverse contexts.
- **LLMs fine-tuned on custom datasets can outperform general LLMs on specific tasks**.