<a href="https://colab.research.google.com/github/mpfoley73/hands-on-llm/blob/main/chapter01/Chapter%201%20-%20Introduction%20to%20Language%20Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>Chapter 1 - Introduction to Language Models</h1>

This notebook is for Chapter 1 of the [Hands-On Large Language Models](https://www.amazon.com/Hands-Large-Language-Models-Understanding/dp/1098150961) book by [Jay Alammar](https://www.linkedin.com/in/jalammar) and [Maarten Grootendorst](https://www.linkedin.com/in/mgrootendorst/).

You should use a GPU to run the examples in this notebook. In Google Colab, go to

**Runtime > Change runtime type > Hardware accelerator > GPU > T4 GPU.**

Run the following code block to install the dependencies.

In [1]:
%%capture
!pip install "transformers>=4.40.1" "accelerate>=0.27.2"
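
After installing, it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the standard library; the helper name `installed_version` is made up for this example and is not part of either package.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the two dependencies this notebook installs.
for name in ("transformers", "accelerate"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If either line prints `not installed`, rerun the install cell before continuing.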

# Phi-3 Mini

Load the `Phi-3-mini-4k-instruct` model and tokenizer from [Hugging Face](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct). Phi-3 is a "small language model", tiny compared to GPT-4, Gemini, and Llama 3. It is designed to be small enough to run on its own on a phone or other smart device. Phi-3 comes in three sizes: Mini (3.8B parameters), Small (7B), and Medium (14B). Mini has two variants that differ in context length: 4K and 128K tokens.

In `from_pretrained()` below, `device_map="cuda"` loads the model onto a CUDA-enabled GPU. CUDA (Compute Unified Device Architecture) is a parallel computing platform and API created by NVIDIA that lets developers run general-purpose code on the GPU.

`torch_dtype="auto"` automatically selects the appropriate data type for the model's tensors. `trust_remote_code=True` allows the model to execute custom code from the model repository (evidently necessary for Phi-3).
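
The data type matters mostly because of memory. As a rough back-of-the-envelope sketch (not a measurement), the weights alone need about the parameter count times the bytes per parameter. The 3.8B figure comes from the text above; the helper name `weight_memory_gb` is made up for this example, and the estimate ignores activations and the generation cache.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


PHI3_MINI_PARAMS = 3.8e9  # parameter count quoted in the text above

# Compare 32-bit weights with 16-bit weights.
for dtype_name, nbytes in [("float32", 4), ("float16/bfloat16", 2)]:
    print(f"{dtype_name}: ~{weight_memory_gb(PHI3_MINI_PARAMS, nbytes):.1f} GB")
```

With 16-bit weights the model needs roughly half the memory of 32-bit weights, which is what lets it fit comfortably on a T4 with about 16 GB of memory.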

The last statement loads the associated tokenizer. LLMs usually come with a tokenizer, which splits the input text into tokens before it is fed to the generative model.
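
To make the idea of tokenization concrete, here is a deliberately simplified toy sketch. It splits on words and punctuation and assigns each distinct token an integer ID. This is an illustration only: real tokenizers, including the one loaded for Phi-3, use a learned subword vocabulary, and the function names `toy_tokenize` and `build_vocab` are made up for this example.

```python
import re
from typing import Dict, List


def toy_tokenize(text: str) -> List[str]:
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def build_vocab(tokens: List[str]) -> Dict[str, int]:
    """Assign each distinct token an integer ID in order of first appearance."""
    vocab: Dict[str, int] = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab


text = "Create a funny joke about chickens."
tokens = toy_tokenize(text)
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]

print(tokens)  # ['Create', 'a', 'funny', 'joke', 'about', 'chickens', '.']
print(ids)     # [0, 1, 2, 3, 4, 5, 6]
```

The model never sees the raw string, only a sequence of integer IDs like `ids`.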

> Authentication is recommended but still optional to access public models or datasets. To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as a secret named `HF_TOKEN` in Google Colab, and restart the session. The secret will persist for other notebooks.

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- configuration_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- modeling_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


You could work with the model and tokenizer directly, but `transformers.pipeline` simplifies the process by encapsulating the `model`, `tokenizer`, and text-generation steps in a single pipeline object. `return_full_text=False` excludes the prompt from the model output. `max_new_tokens=500` caps the length of the response. `do_sample=False` always picks the highest-probability next token rather than sampling, which makes the output deterministic and much less creative.

In [3]:
from transformers import pipeline

# Create a pipeline
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    max_new_tokens=500,
    do_sample=False
)
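
To see what `do_sample=False` means in practice, the following toy sketch contrasts greedy selection with sampling over a single next-token distribution. The probabilities and the helper names `greedy_pick` and `sample_pick` are made up for illustration; a real model produces a distribution over its whole vocabulary at every step.

```python
import random
from typing import Dict


def greedy_pick(probs: Dict[str, float]) -> str:
    """Always return the highest-probability token (like do_sample=False)."""
    return max(probs, key=probs.get)


def sample_pick(probs: Dict[str, float], rng: random.Random) -> str:
    """Draw one token at random, weighted by probability (like do_sample=True)."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token probabilities, invented for this example.
next_token_probs = {"drumsticks": 0.55, "eggs": 0.30, "feathers": 0.15}

rng = random.Random(0)  # fixed seed so the sampled run is repeatable
print([greedy_pick(next_token_probs) for _ in range(5)])
print([sample_pick(next_token_probs, rng) for _ in range(5)])
```

The greedy list repeats the same token every time, while the sampled list can vary, which is why greedy decoding gives repeatable but less varied text.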

We're ready to test the model. Without a GPU, this statement might take 4-5 minutes.

In [4]:
# The prompt (user input / query)
messages = [
    {"role": "user", "content": "Create a funny joke about chickens."}
]

# Generate output
output = generator(messages)
print(output[0]["generated_text"])

The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.


 Why did the chicken join the band? Because it had the drumsticks!
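
The pipeline accepts a list of role/content messages rather than a plain string, and it converts that list into a single prompt using the model's chat template. The sketch below imitates that step in plain Python. The marker strings (`<|user|>`, `<|end|>`, `<|assistant|>`) are an assumption about the Phi-3 format, and the function name `render_phi3_style_prompt` is made up for this example. The authoritative rendering comes from `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)`.

```python
from typing import Dict, List


def render_phi3_style_prompt(messages: List[Dict[str, str]]) -> str:
    """Join chat messages into one prompt string, ending with an assistant cue.

    The special markers are assumed, not read from the real tokenizer.
    """
    parts = [f"<|{m['role']}|>\n{m['content']}<|end|>\n" for m in messages]
    parts.append("<|assistant|>\n")  # cue the model to answer next
    return "".join(parts)


messages = [
    {"role": "user", "content": "Create a funny joke about chickens."}
]

print(render_phi3_style_prompt(messages))
```

Comparing this string with the output of `apply_chat_template` is a quick way to check what the model actually receives.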
