<left><img width=25% src="img/cornell_tech2.svg"></left>

# Lecture 19: Introduction to LLMs

### Applied Machine Learning

__Brandon Amos__<br>Cornell Tech

In [1]:
import torch
import numpy as np; np.set_printoptions(precision=2)
import matplotlib.pyplot as plt; plt.rcParams['figure.figsize'] = [12, 4]
import warnings; warnings.filterwarnings('ignore')
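# Prefer a GPU backend when one is available (MPS on Apple silicon), else fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else ('mps' if torch.backends.mps.is_available() else 'cpu')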

import os
os.environ['TOKENIZERS_PARALLELISM'] = 'True'

# Preface and disclaimer ⚠

+ Language, NLP, and LLMs make up a huge space, with many great resources out there!
+ **This lecture**
    1. Tour through my favorite introductory parts from them
    2. Some code examples to show how to apply and use <br/>
       a) **basic tokenization** and **autoregressive generation**, <br/>
       b) **chat templates**, and <br/>
       c) **code completion** (fill-in-the-middle)

# Transformers and language models: ubiquitous

<center>
<img width='55%' src="https://www.comet.com/site/wp-content/uploads/2023/07/Screen-Shot-2023-07-11-at-9.48.50-PM-1536x1153.png"/><br/>
<a href="https://www.comet.com/site/blog/explainable-ai-for-transformers/">Image source: Explainable AI: Visualizing Attention in Transformers</a>
</center>

# Review on classification

$$ \underbrace{\text{Dataset}}_\text{Features, Attributes, Targets} + \underbrace{\text{Learning Algorithm}}_\text{Model Class + Objective + Optimizer } \to \text{Predictive Model} $$

-----

1. Training dataset $\mathcal{D} = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(N)}, y^{(N)})\}$.
2. The target space is discrete: $\mathcal{Y} = \{y_1, y_2, \ldots, y_K\}$. <br>
   <span style='color: gray'>Each of the $K$ discrete values corresponds to a <em>class</em> that we want to predict</span>
3. Optimize the conditional likelihood
    $$\max_\theta \ell(\theta) = \max_{\theta} \frac{1}{N}\sum_{i=1}^N \log P_\theta(y^{(i)} | {x}^{(i)}).$$
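
To make the objective concrete, here is a minimal sketch that evaluates the average conditional log-likelihood for a handful of predictions. The probabilities and labels are made-up illustrative values, and `avg_log_likelihood` is a hypothetical helper rather than a library function.

```python
import numpy as np

def avg_log_likelihood(probs, labels):
    """Average of log P(y_i | x_i), given class probabilities (N x K) and integer labels (N)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Pick out the probability the model assigned to the true class of each example.
    picked = probs[np.arange(len(labels)), labels]
    return float(np.mean(np.log(picked)))

# Made-up predictions for N=3 examples and K=3 classes.
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1],
         [0.3, 0.3, 0.4]]
labels = [0, 1, 2]
ll = avg_log_likelihood(probs, labels)
print(ll)
```

Training searches for the parameters $\theta$ that make this number as large as possible, i.e. as close to zero as possible.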

# LLMs (for generation) are "just" doing next-token classification

+ Represent language as a sequence of **discrete tokens**
+ Given the past sequence of text $x^{(i)}$, classify the next token $y^{(i)}$.
+ Parameterize $P_\theta$ with a sequence architecture (e.g., a transformer)
+ (Pre)train with maximum likelihood
  $$\max_\theta \ell(\theta) = \max_{\theta} \frac{1}{N}\sum_{i=1}^N \log P_\theta(y^{(i)} | {x}^{(i)}).$$
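
One common way to obtain the $(x^{(i)}, y^{(i)})$ pairs is to slide over a tokenized text and treat every prefix as a context for the token that follows it. The sketch below assumes the text is already a list of integer token ids; the ids are arbitrary and `next_token_pairs` is a hypothetical helper.

```python
def next_token_pairs(tokens):
    """Split one token sequence into (context, next token) classification examples."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

tokens = [11, 42, 7, 99]  # arbitrary illustrative token ids
pairs = next_token_pairs(tokens)
for context, target in pairs:
    print(context, "->", target)
```

A sequence of $n$ tokens therefore yields $n-1$ classification examples, each with $K$ possible classes.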

# Tokenization and representing language

+ **Tokenization** is how the string is represented <span style='color: grey'>(what the $K$ values correspond to)</span>
+ Dataset has token strings $x^{(i)}\in\{1, \ldots, K\}^{n_i}$
  and next tokens $y^{(i)}\in\{1, \ldots, K\}$:
  $$\mathcal{D} = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \ldots, (x^{(N)}, y^{(N)})\}$$
+ Many options for how to tokenize a sequence, e.g.:

<center>
<img width='25%' src='https://njoroge.tomorrow.co.ke/static/images/AI/tokenization.jpg'/><br/>

Image source: 
<a href="https://njoroge.tomorrow.co.ke/blog/ai/word_vs_character_level_tokenization">Character vs. Word Tokenization in NLP</a>
</center>
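
As a rough illustration of the trade-off between character-level and word-level tokenization, the sketch below builds both vocabularies for one sentence. The sentence and the `build_vocab` helper are illustrative assumptions, not part of any library.

```python
def build_vocab(units):
    """Map each distinct unit to an integer id (sorted for reproducibility)."""
    return {u: i for i, u in enumerate(sorted(set(units)))}

text = "the cat sat on the mat"

char_units = list(text)     # character-level: one token per character
word_units = text.split()   # word-level: one token per whitespace-separated word

char_vocab = build_vocab(char_units)
word_vocab = build_vocab(word_units)

char_ids = [char_vocab[c] for c in char_units]
word_ids = [word_vocab[w] for w in word_units]

print(len(char_vocab), len(char_ids))  # small vocabulary, long sequence
print(len(word_vocab), len(word_ids))  # vocabulary grows with the corpus, short sequence
```

Character-level tokens keep $K$ small but make sequences long, while word-level tokens shorten sequences at the cost of a vocabulary that cannot represent unseen words.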

# Tokenization in practice

+ A large topic and very important choice
+ Tokens often learned via Byte-Pair Encoding, SentencePiece, or WordPiece
+ Many other great resources:
  + [HuggingFace Tokenizer Summary](https://huggingface.co/docs/transformers/en/tokenizer_summary)
  + [Let's build the GPT Tokenizer](https://www.youtube.com/watch?v=zduSFxRajkE) ([code](https://github.com/karpathy/minbpe))
  + [Llama tokenizer visualizer](https://belladoreai.github.io/llama-tokenizer-js/example-demo/build/)
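
To give a flavor of how Byte-Pair Encoding learns tokens, here is a minimal sketch of its training loop: repeatedly find the most frequent adjacent pair and replace it with a new token id. It operates on raw UTF-8 bytes, uses hypothetical helper names, and leaves out the pre-splitting rules and special tokens that production tokenizers add.

```python
from collections import Counter

def most_frequent_pair(ids):
    """Return the adjacent pair that occurs most often (first one wins ties)."""
    counts = Counter(zip(ids, ids[1:]))
    return max(counts, key=counts.get)

def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

ids = list("aaabdaaabac".encode("utf-8"))  # start from the 256 byte values
merges = {}
for step in range(3):                      # learn three new tokens
    pair = most_frequent_pair(ids)
    new_id = 256 + step
    ids = merge_pair(ids, pair, new_id)
    merges[pair] = new_id

print(merges)
print(ids)
```

Each merge adds one entry to the vocabulary and shortens the encoded sequence, so the number of merges sets the final vocabulary size.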

# Applications of transformers

<img src="https://www.comet.com/site/wp-content/uploads/2023/07/Screen-Shot-2023-07-13-at-6.37.03-PM.png"/>
<center><a href="https://www.comet.com/site/blog/explainable-ai-for-transformers/">Image source: Explainable AI: Visualizing Attention in Transformers</a></center>

# What attention looks like

<center>
<img width='55%' src="https://sebastianraschka.com/images/blog/2023/self-attention-from-scratch/summary.png"> <br/>

<small><a href="https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html">Source: Understanding and Coding the Self-Attention Mechanism of LLMs From Scratch</a></small>
</center>
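
The computation in the figure can be sketched in a few lines of NumPy. This is a single-head, scaled dot-product self-attention with random weights. The shapes, the names `X`, `W_q`, `W_k`, `W_v`, and the optional causal mask are illustrative assumptions, not the configuration of any particular model.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v, causal=False):
    """Single-head scaled dot-product self-attention over a sequence X of shape (n, d)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    if causal:
        # Hide future positions so that token i only attends to tokens 0..i.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d, d_k = 4, 8, 5  # sequence length, embedding size, head size (all arbitrary)
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d_k)) for _ in range(3))
out, weights = self_attention(X, W_q, W_k, W_v, causal=True)
print(out.shape)
print(weights.round(2))
```

Each row of `weights` sums to one and says how much that position draws from every other position when forming its output.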

# Putting everything together

<center>
<img width='70%' src="https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd8bc0b7-e7b7-4a96-9e00-627ad2ecda20_2232x1362.png"/> <br>
(Source: the Llama 2 paper)
</center>

# Full architectures

Combine many components we've covered: embeddings, attention, and residual connections

<center>
<img width='60%' src="https://media.licdn.com/dms/image/v2/D5612AQGzmd6t0QZpcw/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1710740975319?e=1736985600&v=beta&t=a-hzk4nmEQCMeKYfX7miID0veRX8AwM4Hd6dxF9qPOo"/> <br/>
Image source: <a href="https://www.youtube.com/@umarjamilai">Umar Jamil</a>
</center>

# Going deeper into the architecture

+ Many other interesting design choices we won't cover, especially positional embeddings, masking, KV caching, flash attention
+ Some further reading:
    + [Attention is all you need](https://arxiv.org/abs/1706.03762)
    + [Understanding and Coding the Self-Attention Mechanism of Large Language Models From Scratch](https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html)
    + [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/)
    + [Explained: Multi-head Attention](https://storrs.io/attention/)

# Going even deeper into the training setup 

Maximum likelihood (pre)training is just the beginning...
+ Alignment, supervised fine-tuning, preference optimization, RLHF, tool use

<center>
<img width='75%' src="https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb9d0144-3952-42db-8382-8e2eb37d917e_1670x640.png">
</center>

<center>Image source: <a href="https://arxiv.org/abs/2203.02155">InstructGPT</a> (and <a href="https://cameronrwolfe.substack.com/p/understanding-and-using-supervised">here</a>)</center>

# Part 2: Running some code!

# Loading a model and tokenizer

+ [HuggingFace](https://huggingface.co/) hosts many models, tokenizers, datasets, and benchmarks
and provides Python/PyTorch libraries for downloading and using them
+ Let's load a "small" (with 1B parameters) Llama 3.2 model. (It runs on my laptop)

In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, low_cpu_mem_usage=True, device_map="auto")

Let's start with the tokenizer. What does it look like?

In [3]:
tokenizer

PreTrainedTokenizerFast(name_or_path='meta-llama/Llama-3.2-1B-Instruct', vocab_size=128000, model_max_length=131072, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|begin_of_text|>', 'eos_token': '<|eot_id|>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	128000: AddedToken("<|begin_of_text|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128001: AddedToken("<|end_of_text|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128002: AddedToken("<|reserved_special_token_0|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128003: AddedToken("<|reserved_special_token_1|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128004: AddedToken("<|finetune_right_pad_id|>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	128005: AddedToken("<|reserved_special_token_2|>",

Let's check the vocabulary
<span style='color: grey'>(Ġ is special and indicates a space before the word)</span>

In [4]:
tokenizer.vocab_size

128000

In [5]:
tokenizer.vocab

{'bilt': 70824,
 'ĠãĤŃ': 109949,
 'ucus': 38601,
 '_depth': 19601,
 'ç·Ĵ': 114262,
 'ĠSimilarly': 35339,
 'Ġtess': 80930,
 'Ġspecials': 60874,
 'ĠOT': 8775,
 '=j': 46712,
 'ylation': 79933,
 'ĠNose': 93223,
 'ĠRolls': 70710,
 '.Power': 55186,
 '.Wh': 18951,
 'åĪ©çĶ¨': 107740,
 'ĠATT': 42385,
 'Ġraj': 92528,
 'OLLOW': 31289,
 'ĠBeer': 34484,
 '--------------------------------': 1434,
 'ĠØ§ÙĦØªÙĪ': 124487,
 'Almost': 39782,
 'è§Ĵèī²': 125499,
 'ĠÙĨØ²Ø¯ÛĮÚ©': 121045,
 '-f': 2269,
 'è¨ĺäºĭ': 77219,
 'Visibility': 11686,
 'ëĮĢë¡ľ': 106687,
 'ĠRather': 26848,
 'ÂłT': 115414,
 'Ġdescribes': 16964,
 'Ġpouvoir': 68226,
 'dimensions': 60339,
 'umbling': 42732,
 'Ġsimult': 20731,
 '[],': 13292,
 "('');Ċ": 21011,
 'Ġnaturally': 18182,
 'ĠÐŁÐ°Ð²': 114087,
 'Ġ:|:': 114168,
 'Ġtreatments': 22972,
 'hait': 98808,
 'Ð²ÐµÐ´': 36750,
 'ĠFold': 61573,
 'á»ĥn': 87982,
 'Ġspis': 123099,
 'Ġmales': 25000,
 '(Q': 6386,
 'orient': 15226,
 'Ġimgs': 57265,
 '.executeUpdate': 40012,
 'Ġbooked': 34070,
 '>::': 697

The vocabulary maps subwords to integers <span style='color: grey'>(here, out of 128k possibilities)</span>

In [6]:
tokenizer.vocab['hi'] # happens to be a token here, although this is not guaranteed

6151

**Encoding** is the process of obtaining the sequence of tokens<br/>
<span style='color:grey'>(<a href="https://belladoreai.github.io/llama-tokenizer-js/example-demo/build/">llama-tokenizer.js</a> is great for visualizing this)</span>

In [7]:
encoded_string = tokenizer.encode('tokenization example string', add_special_tokens=False)
encoded_string

[5963, 2065, 3187, 925]

**Decoding** is the process of obtaining the string from tokens<br/>

In [8]:
print(tokenizer.convert_ids_to_tokens(encoded_string))
print(tokenizer.decode(encoded_string))

['token', 'ization', 'Ġexample', 'Ġstring']
tokenization example string
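
For plain ASCII text, the relationship between the token strings and the decoded string can be approximated by concatenating the pieces and turning each `Ġ` back into a space. The helper below is a hypothetical sketch of that idea only. The real decoder maps every byte back, which also covers non-ASCII characters.

```python
def join_tokens(tokens):
    """Concatenate token strings and undo the 'Ġ' space marker (ASCII-only sketch)."""
    return "".join(tokens).replace("Ġ", " ")

tokens = ['token', 'ization', 'Ġexample', 'Ġstring']
text = join_tokens(tokens)
print(text)
```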


Next, let's look at the model. It mostly has components we've seen before:

In [9]:
model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm):
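
The layer shapes printed above are enough to estimate the parameter count by hand. The arithmetic below makes two assumptions about the part of the printout that is cut off: the final `norm` has 2048 weights like the other RMSNorm layers, and the output head shares its weights with `embed_tokens`, so it adds no parameters.

```python
vocab, d_model, d_kv, d_ff, n_layers = 128256, 2048, 512, 8192, 16

embed = vocab * d_model                              # embed_tokens
attn = 2 * d_model * d_model + 2 * d_model * d_kv    # q_proj and o_proj, plus k_proj and v_proj
mlp = 3 * d_model * d_ff                             # gate_proj, up_proj, down_proj
norms = 2 * d_model                                  # input_layernorm, post_attention_layernorm
per_layer = attn + mlp + norms

final_norm = d_model                                 # assumption: final RMSNorm over 2048 features
total = embed + n_layers * per_layer + final_norm    # assumption: output head tied to embed_tokens

print(f"{total:,}")  # 1,235,814,400
```

Under these assumptions the total is about 1.24 billion, in line with the "1B" in the model name, and roughly a fifth of it sits in the embedding table.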

# Querying for generations

+ Given the tokenizer, the model, and an input token sequence $x_{1:n}=[x_1, \ldots, x_n]$,
we can ask the model to generate the next tokens one at a time, each drawn from $P_\theta(x_{n+j}|x_{1:n+j-1})$.
+ The model can sample from many possible generations <br/>
  <span style='color: grey'>(often controlled by `temperature`, as well as top-$p$ and top-$k$ parameters)</span>

In [10]:
prompt_str = 'Once upon a time'
prompt_tokens = tokenizer.encode(prompt_str, add_special_tokens=True, return_tensors='pt').to(device)

num_samples = 10
outputs = []
for _ in range(num_samples):
    model_output_tokens = model.generate(
        prompt_tokens, do_sample=True, temperature=1.0, max_new_tokens=10,
        pad_token_id=tokenizer.eos_token_id, 
        attention_mask = torch.ones_like(prompt_tokens),
    ).squeeze(0)
    model_output_str = tokenizer.decode(model_output_tokens.tolist())
    outputs.append(model_output_str)

for output in outputs:
    print(output)

<|begin_of_text|>Once upon a time, there was a small town in Africa where the
<|begin_of_text|>Once upon a time, there was a young man named Max. Max
<|begin_of_text|>Once upon a time, in the land of Aethoria, there
<|begin_of_text|>Once upon a time, a beautiful and mysterious woman named Lena wandered into
<|begin_of_text|>Once upon a time, in a small village nestled in the rolling hills
<|begin_of_text|>Once upon a time, in a bustling city, a young and ambitious
<|begin_of_text|>Once upon a time, there lived a little girl named Lily who had
<|begin_of_text|>Once upon a time, there was a boy named Max who lived in
<|begin_of_text|>Once upon a time, in a small village nestled in the rolling hills
<|begin_of_text|>Once upon a time, there lived a young girl named Sophia. Sophia
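
The variety in these samples comes from how the next token is drawn from the model's output distribution. The sketch below shows one simplified way that `temperature`, top-$k$, and top-$p$ can reshape a vector of logits before sampling. It is a NumPy illustration with made-up logits and a hypothetical `sample_next_token` helper, not the exact procedure inside `model.generate`.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Sample a token id after temperature scaling and optional top-k / top-p filtering."""
    rng = np.random.default_rng() if rng is None else rng
    logits = np.asarray(logits, dtype=float) / temperature  # <1 sharpens, >1 flattens
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    order = np.argsort(-probs)                # token ids from most to least likely
    keep = np.ones(len(probs), dtype=bool)
    if top_k is not None:
        keep[order[top_k:]] = False           # keep only the k most likely tokens
    if top_p is not None:
        cumulative = np.cumsum(probs[order])
        # Keep the smallest set of top tokens whose total probability reaches top_p.
        cutoff = int(np.searchsorted(cumulative, top_p)) + 1
        keep[order[cutoff:]] = False

    filtered = np.where(keep, probs, 0.0)
    filtered /= filtered.sum()                # renormalize over the surviving tokens
    return int(rng.choice(len(probs), p=filtered)), filtered

rng = np.random.default_rng(0)
logits = [2.0, 1.0, 0.5, -1.0]                # made-up scores over a 4-token vocabulary
token, filtered = sample_next_token(logits, top_k=2, rng=rng)
print(token, filtered.round(3))
```

Lowering the temperature toward zero concentrates the probability on the most likely token, which approaches the greedy decoding obtained with `do_sample=False`.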


# Generations for answering questions

With the ability to predict next tokens, we can query the model to answer questions.
This is an example from the standard [MMLU benchmark](https://huggingface.co/datasets/cais/mmlu)
along with a basic prompt style:

In [11]:
prompt_str = """Answer the following multiple choice question by giving the most appropriate response. Answer should be one among [A, B, C, D].

Question: The famous statement “An unexamined life is not worth living” is attributed to _____.

A: Aristotle
B: John Locke
C: Socrates
D: Plato

Answer:"""

In [12]:
prompt_tokens = tokenizer.encode(prompt_str, add_special_tokens=True, return_tensors='pt').to(device)

model_output_tokens = model.generate(
    prompt_tokens, do_sample=False, temperature=None, max_new_tokens=1,
    pad_token_id=tokenizer.eos_token_id, attention_mask=torch.ones_like(prompt_tokens)
).squeeze(0)
model_output_str = tokenizer.decode(model_output_tokens.tolist())
print(model_output_str)

<|begin_of_text|>Answer the following multiple choice question by giving the most appropriate response. Answer should be one among [A, B, C, D].

Question: The famous statement “An unexamined life is not worth living” is attributed to _____.

A: Aristotle
B: John Locke
C: Socrates
D: Plato

Answer: C


# Extracting more information

+ The model's generation was correct, but doesn't tell us much
+ **Chain-of-thought** prompting is a way of extracting more information,
  as done in <a href="https://arxiv.org/abs/2205.11916">Large Language Models are Zero-Shot Reasoners</a>.
+ Many variations on this:

<center>
<img width='70%' src="https://miro.medium.com/v2/resize:fit:1400/format:webp/1*0EFaLY_NIIDkDn3vP-FBmQ.png"> <br/>
<a href="https://arxiv.org/abs/2308.09687">Source: Graph of Thoughts</a>
</center>

In [13]:
prompt_str = """Answer the following multiple choice question by giving the most appropriate response. Answer should be one among [A, B, C, D].

Question: The famous statement “An unexamined life is not worth living” is attributed to _____.

A: Aristotle
B: John Locke
C: Socrates
D: Plato

Answer: Let's think step-by-step. """

In [14]:
prompt_tokens = tokenizer.encode(prompt_str, add_special_tokens=True, return_tensors='pt').to(device)

model_output_tokens = model.generate(
    prompt_tokens, do_sample=False, temperature=None, max_new_tokens=100,
    pad_token_id=tokenizer.eos_token_id, attention_mask=torch.ones_like(prompt_tokens)
).squeeze(0)
model_output_str = tokenizer.decode(model_output_tokens.tolist())
print(model_output_str)

<|begin_of_text|>Answer the following multiple choice question by giving the most appropriate response. Answer should be one among [A, B, C, D].

Question: The famous statement “An unexamined life is not worth living” is attributed to _____.

A: Aristotle
B: John Locke
C: Socrates
D: Plato

Answer: Let's think step-by-step.  The quote is attributed to Socrates.  Socrates was a Greek philosopher who lived in ancient Athens.  He is known for his method of questioning, which is now called the Socratic method.  Socrates believed that the unexamined life is not worth living, and this quote reflects his belief that one must examine their own life and values in order to live a meaningful and fulfilling life.  Therefore, the correct answer is C.  The other options are incorrect because Aristotle was a philosopher
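
With a free-form explanation like this, the chosen option has to be parsed back out of the text before it can be scored. A minimal sketch, assuming the model states its verdict in the form "answer is X". The `extract_choice` helper and its regular expression are illustrative, and real evaluation harnesses use more robust parsing.

```python
import re

def extract_choice(generation, choices="ABCD"):
    """Return the last 'answer is X' letter found in the text, or None if there is none."""
    matches = re.findall(rf"answer is\s*\(?([{choices}])\)?\b", generation)
    return matches[-1] if matches else None

generation = ("Therefore, the correct answer is C.  The other options are "
              "incorrect because Aristotle was a philosopher")
choice = extract_choice(generation)
print(choice)
```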


# From next-token predictions to a chatbot

+ Now that we can generate continuations of sequences, what if we want to chat with the LLM as an assistant or chatbot?
+ We can't just query it as we were doing before ❌

In [15]:
prompt_str = 'What food do you recommend me?'
prompt_tokens = tokenizer.encode(prompt_str, add_special_tokens=True, return_tensors='pt').to(device)
model_output_tokens = model.generate(
    prompt_tokens, do_sample=False, temperature=None, max_new_tokens=10,
    pad_token_id=tokenizer.eos_token_id, attention_mask=torch.ones_like(prompt_tokens)
).squeeze(0)
model_output_str = tokenizer.decode(model_output_tokens.tolist())
model_output_str

"<|begin_of_text|>What food do you recommend me? I'm looking for something that's easy to make"

We need to use the special **chat prompt template** that the model was trained with on chat data. These templates usually separate the text into `system`, `user`, and `assistant` roles and format them back into a standardized sequence of tokens:

In [16]:
prompt = [
  {"role": "system", "content": "You are a helpful assistant."},
  {"role": "user", "content": "What food do you recommend me?"},
]
inputs = tokenizer.apply_chat_template(prompt, tokenize=True, add_generation_prompt=True, return_tensors="pt", return_dict=True,).to(device)
print(inputs['input_ids'].tolist()); print()
print(tokenizer.decode(inputs['input_ids'][0]))

[[128000, 128006, 9125, 128007, 271, 38766, 1303, 33025, 2696, 25, 6790, 220, 2366, 18, 198, 15724, 2696, 25, 220, 868, 4723, 220, 2366, 19, 271, 2675, 527, 264, 11190, 18328, 13, 128009, 128006, 882, 128007, 271, 3923, 3691, 656, 499, 7079, 757, 30, 128009, 128006, 78191, 128007, 271]]

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 15 Nov 2024

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

What food do you recommend me?<|eot_id|><|start_header_id|>assistant<|end_header_id|>




Let's run the generation on this input and look at the result:

In [17]:
outputs = model.generate(
    **inputs, do_sample=False, temperature=None,
    max_new_tokens=60, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0]))

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 15 Nov 2024

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

What food do you recommend me?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

I'd be happy to recommend some delicious food options for you. Since I don't know your personal preferences, I'll provide a variety of suggestions.

Here are some popular and tasty food ideas:

**Breakfast Options:**

1. Avocado toast with scrambled eggs and cherry tomatoes
2. Greek
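
The structure of the decoded template can be reproduced with plain string formatting, which makes clear that a chat is still a single token sequence. The `render_llama3_chat` helper below is a hypothetical, simplified sketch: it follows the layout visible in the decoded prompt above but omits the date lines that the real template adds to the system message. In practice `tokenizer.apply_chat_template` should be used.

```python
def render_llama3_chat(messages, add_generation_prompt=True):
    """Format role/content messages in the layout seen in the decoded prompt (simplified)."""
    text = "<|begin_of_text|>"
    for m in messages:
        text += f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
    if add_generation_prompt:
        # An open assistant header asks the model to continue with the assistant's reply.
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What food do you recommend me?"},
]
rendered = render_llama3_chat(messages)
print(rendered)
```

Generation stops naturally when the model emits `<|eot_id|>`, the same token that closes each earlier turn.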


# Code generation

Lastly, let's switch to some basic code generation with Code Llama. We will load a quantized version for speed:

In [18]:
from ctransformers import AutoModelForCausalLM
del tokenizer # no longer needed: the ctransformers model exposes its own tokenize/detokenize methods
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/CodeLlama-7B-GGUF", model_file="codellama-7b.Q2_K.gguf", model_type="llama", gpu_layers=0,
    max_new_tokens=50, temperature=0.1)

In [19]:
model.config

Config(top_k=40, top_p=0.95, temperature=0.1, repetition_penalty=1.1, last_n_tokens=64, seed=-1, batch_size=8, threads=-1, max_new_tokens=50, stop=None, stream=False, reset=True, context_length=-1, gpu_layers=0, mmap=True, mlock=False)

For simple completions, the prompt works as before: it is just the start of a block of code.
This is an example from the standard [humanevalplus](https://huggingface.co/datasets/evalplus/humanevalplus)
dataset and benchmark. We can look at the tokenization in the same way as before:

In [20]:
prompt_str = '''
def fib(n: int):
    """Return n-th Fibonacci number.
    >>> fib(10)
    55
    >>> fib(1)
    1
    >>> fib(8)
    21
    """
'''
print(model.tokenize(prompt_str))

[1, 29871, 13, 1753, 18755, 29898, 29876, 29901, 938, 1125, 13, 1678, 9995, 11609, 302, 29899, 386, 383, 747, 265, 21566, 1353, 29889, 13, 1678, 8653, 18755, 29898, 29896, 29900, 29897, 13, 268, 29945, 29945, 13, 1678, 8653, 18755, 29898, 29896, 29897, 13, 268, 29896, 13, 1678, 8653, 18755, 29898, 29947, 29897, 13, 268, 29906, 29896, 13, 1678, 9995, 13]


And query the model with:

In [21]:
print(model(prompt_str))

    if n < 2:
        return n
    else:
        return fib(n-1) + fib(n-2)

def test_fib():
    assert fib(1) == 1



# From code completion to infilling

+ Completion works great and is powerful, but what if we want to generate suggestions in the middle of a larger file?
+ **How do we prompt the model in the middle of the code??**

---

```Python
def fib(n: int):
    """Return n-th Fibonacci number.
    >>> fib(10)
    55
    >>> fib(1)
    1
    >>> fib(8)
    21
    """
    if n < 2:
        ## user's cursor is here ##
    else:
        return fib(n-1) + fib(n-2)
```

# Infilling and fill-in-the-middle (FIM) prompting

+ Think about the file consisting of `PREFIX`, `MIDDLE`, and `SUFFIX` portions
+ **Key idea:** reformulate the prompt so the middle comes at the end,
  so a file is represented as `<PRE> prefix <SUF>suffix <MID>middle`
  + Uses special tokens `<PRE>` `<MID>` and `<SUF>` to separate them
  + ⚠ **Need to be very careful with the spaces**
+ More details in [Efficient Training of Language Models to Fill in the Middle](https://arxiv.org/abs/2207.14255)
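
Building the prompt from the two sides of the cursor is plain string formatting. A minimal sketch, assuming the `<PRE> prefix <SUF>suffix <MID>` layout described above. `build_fim_prompt` is a hypothetical helper, and the exact spacing should be checked against the model being used.

```python
def build_fim_prompt(prefix, suffix):
    """Arrange the text before and after the cursor so that the middle is generated last."""
    # Spacing matters: a space after <PRE>, a space before <SUF>, none after <SUF>, a space before <MID>.
    return f"<PRE> {prefix} <SUF>{suffix} <MID>"

prompt = build_fim_prompt("code", "code")
print(prompt)
```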

Here's what the basic FIM prompt tokenizes to:

In [22]:
toks = model.tokenize('<PRE> code <SUF>code <MID>')
toks

[1, 32007, 775, 32008, 401, 32009]

In [23]:
[model.detokenize([tok]) for tok in toks]

['', ' <PRE>', ' code', ' <SUF>', 'code', ' <MID>']

Now we can put the Fibonacci prompt into this format and ask for the middle:

In [24]:
prompt_str = '''<PRE> def fib(n: int):
    """Return n-th Fibonacci number.
    >>> fib(10)
    55
    >>> fib(1)
    1
    >>> fib(8)
    21
    """
    if n < 2:
         <SUF>
    else:
        return fib(n-1) + fib(n-2) <MID>'''
print(model(prompt_str))

return n <EOT>
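
To use the suggestion, the generated middle still has to be cleaned up and placed back between the prefix and the suffix. A minimal sketch, assuming the completion ends with the `<EOT>` marker seen in the output above. `splice_middle` is a hypothetical helper, and stripping trailing whitespace is a simplification that an editor integration may not want.

```python
def splice_middle(prefix, completion, suffix, eot="<EOT>"):
    """Drop the end-of-infill marker and insert the generated middle at the cursor."""
    middle = completion.split(eot)[0].rstrip()
    return prefix + middle + suffix

prefix = "    if n < 2:\n        "
suffix = "\n    else:\n        return fib(n-1) + fib(n-2)"
filled = splice_middle(prefix, "return n <EOT>", suffix)
print(filled)
```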


# Summary
1. A tour through my favorite introductory parts of existing LLM resources
2. Some code examples to show how to apply and use <br/>
    a) **basic tokenization** and **autoregressive generation**, <br/>
    b) **chat templates**, and <br/>
    c) **code completion** (fill-in-the-middle)