# 45-Minute Hands-On: LLMs with Hugging Face (Colab/Jupyter)

**Last updated:** 2025-09-01 05:29

## Goals
- Run a small **instruction-tuned LLM** with Transformers
- Use the **pipeline** API
- Tune decoding (temperature, top-p, top-k)
- Build a tiny **chat loop**
- Batch prompts → CSV

# First Model - TinyLlama


In [None]:
# 1) Install dependencies
#!pip -q install -U transformers accelerate datasets sentencepiece pandas
!pip -q install -U transformers accelerate datasets sentencepiece

In [None]:
# 2) Imports & device
import torch, time
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Device:", device)


Device: cpu


## Model choice - TinyLlama
We try **TinyLlama/TinyLlama-1.1B-Chat-v1.0** and fall back to **distilgpt2** if needed.

In [None]:
# 3) Load model
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
fallback_model_id = "distilgpt2"

def load_model(model_name):
    try:
        tok = AutoTokenizer.from_pretrained(model_name, use_fast=True)
        mdl = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16 if device == "cuda" else torch.float32,
            device_map="auto" if device == "cuda" else None
        )
        return tok, mdl, model_name
    except Exception as e:
        print("Primary failed:", e, "\nFalling back to", fallback_model_id)
        tok = AutoTokenizer.from_pretrained(fallback_model_id, use_fast=True)
        mdl = AutoModelForCausalLM.from_pretrained(
            fallback_model_id,
            torch_dtype=torch.float16 if device == "cuda" else torch.float32,
            device_map="auto" if device == "cuda" else None
        )
        return tok, mdl, fallback_model_id

tokenizer, model, active_model_id = load_model(model_id)
print("Loaded:", active_model_id)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


`torch_dtype` is deprecated! Use `dtype` instead!


Loaded: TinyLlama/TinyLlama-1.1B-Chat-v1.0


## Quickstart with `pipeline`

In [None]:
# 4) Text generation quickstart

#gen = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0 if device=="cuda" else -1)
gen = pipeline("text-generation", model=model, tokenizer=tokenizer)

prompt = "Explain what a Knowledge Graph is in healthcare, in 3 concise sentences."
#prompt = "Explain the difference between structured, semi-structured, and unstructured datasets in 3 concise sentences."

out = gen(prompt, max_new_tokens=120, do_sample=True, temperature=0.7, top_p=0.9, pad_token_id=tokenizer.eos_token_id)[0]["generated_text"]
print(out)


Device set to use cpu


Explain what a Knowledge Graph is in healthcare, in 3 concise sentences. A Knowledge Graph is a semantic web-based representation of medical knowledge that helps users find the most relevant information based on their medical query.


## Tokenization peek

In [None]:
# 5) Tokenization
text = "Large Language Models can draft emails and summarize clinical notes."
ids = tokenizer(text).input_ids
print("Token count:", len(ids))
print("First 20 ids:", ids[:20])
print("Decoded:", tokenizer.decode(ids))


Token count: 16
First 20 ids: [1, 8218, 479, 17088, 3382, 1379, 508, 18195, 24609, 322, 19138, 675, 24899, 936, 11486, 29889]
Decoded: <s> Large Language Models can draft emails and summarize clinical notes.


## Decoding controls (temperature/top-p/top-k)

In [None]:
# 6) Compare decoding
base_prompt = "Give 3 short tips for writing reproducible data science code:"
settings = [
    {"temperature": 0.2, "top_p": 0.95, "top_k": 50},
    {"temperature": 0.8, "top_p": 0.9, "top_k": 50},
    {"temperature": 1.1, "top_p": 0.85, "top_k": 50},
]
for i, s in enumerate(settings, 1):
    t0 = time.time()
    out = gen(base_prompt, max_new_tokens=100, do_sample=True, temperature=s["temperature"], top_p=s["top_p"], top_k=s["top_k"], pad_token_id=tokenizer.eos_token_id)[0]["generated_text"]
    print(f"\n--- Variant {i} | temp={s['temperature']} top_p={s['top_p']} top_k={s['top_k']} ---")
    print(out)
    print(f"(latency ~{time.time()-t0:.2f}s)")



--- Variant 1 | temp=0.2 top_p=0.95 top_k=50 ---
Give 3 short tips for writing reproducible data science code: 1. Use a clear and concise variable name 2. Use descriptive variable names 3. Use comments to explain your code 4. Use functions to encapsulate your code 5. Use a consistent coding style 6. Use version control to track changes to your code 7. Use a version control system like Git to collaborate with others 8. Use version control to keep track of changes to your code 9. Use version control to revert to previous versions of
(latency ~41.59s)

--- Variant 2 | temp=0.8 top_p=0.9 top_k=50 ---
Give 3 short tips for writing reproducible data science code:

1. Use functions to encapsulate your code: This will make your code easier to read, understand and modify in the future.
2. Be concise: Use descriptive variable names and avoid using long, ambiguous names that are difficult to understand.
3. Avoid repetition: Use common functions, libraries, and methods wherever possible. This wil

## Minimal chat loop

In [None]:
# 7) Simple chat helper
def build_prompt(history, user_msg, system="You are a helpful data science assistant."):
    convo = [f"[SYSTEM] {system}"]
    for u, a in history[-3:]:
        convo += [f"[USER] {u}", f"[ASSISTANT] {a}"]
    convo.append(f"[USER] {user_msg}\n[ASSISTANT]")
    return "\n".join(convo)

history = []

def chat_once(user_msg, max_new_tokens=128, temperature=0.7, top_p=0.9):
    prompt = build_prompt(history, user_msg)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        tokens = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=True,
            temperature=temperature, top_p=top_p,
            pad_token_id=tokenizer.eos_token_id, eos_token_id=tokenizer.eos_token_id,
        )
    # Decode only the newly generated tokens, not the echoed prompt
    new_tokens = tokens[0][inputs["input_ids"].shape[1]:]
    text = tokenizer.decode(new_tokens, skip_special_tokens=True)
    # Stop at the next turn marker in case the model keeps writing the dialogue
    reply = text.split("[USER]")[0].strip()
    history.append((user_msg, reply))
    print(reply)

chat_once("In one sentence, what is transfer learning?")
chat_once("Name two risks when fine-tuning small LLMs on tiny datasets.")
chat_once("Suggest one mitigation for each risk.")



In transfer learning, you train a pre-trained model on a large dataset like ImageNet or CIFAR-10, and then use its pre-trained weights to fine-tune a smaller dataset like a medical image or a language model. The pre-trained model learns from the large dataset and can transfer the knowledge to the smaller

1. Overfitting: Fine-tuning a small LLM on a small dataset can lead to overfitting, where the model learns only a few features of the dataset, resulting in poor generalization performance. To mitigate overfitting, the model can be fine-tuned on a larger dataset, which can help improve the generalization performance. 2. Inconsistency: Fine-tuning small LLMs on tiny datasets can lead to inconsistency, where the pre-trained model's weights are not uniformly distributed over the small dataset. To mitigate inconsist


In [None]:
chat_once("Explain 2 risks of improper evaluation for LLMs")
chat_once("Provide 3 examples where an AI model can overfit")
chat_once("Suggest one mitigation for model overfitting")


1. Underestimation of model performance: The evaluation of LLMs on tiny datasets can lead to an underestimation of the model's performance, which can result in misleading predictions. To mitigate this risk, the evaluation should be performed on a larger dataset, which can provide more accurate predictions. 2. Overfitting: The evaluation of LLMs on tiny datasets can lead to overfitting, where the model learns only a few features of the dataset, resulting in poor generalization performance. To mitigate overfitting, the evaluation should be performed on a larger dataset, which can
1. Climate change prediction: AI models are being used to predict climate change, which involves large amounts of data on temperature, precipitation, and other environmental variables. However, the AI models have overfitted to these data, leading to poor predictions for future climate change. 2. Medical diagnosis: AI models are being used to diagnose diseases, such as cancer, which involves a vast amount of data

## Batch prompts → CSV

In [None]:
# 8) Batch prompts and save
import pandas as pd, time
prompts = [
    "Write a tweet (<=200 char) about reproducible ML.",
    "One sentence: why eval metrics matter beyond accuracy.",
    "List 3 checks before deploying a model to production.",
    "Explain temperature vs. top-p to a project manager."
]
rows = []
for p in prompts:
    t0 = time.time()
    out = gen(p, max_new_tokens=100, do_sample=True, temperature=0.7, top_p=0.9, pad_token_id=tokenizer.eos_token_id)[0]["generated_text"]
    rows.append({"prompt": p, "output": out, "latency_s": round(time.time()-t0, 2)})
df = pd.DataFrame(rows)
df


| | prompt | output | latency_s |
|---|---|---|---|
| 0 | Write a tweet (<=200 char) about reproducible ML. | Write a tweet (<=200 char) about reproducible ... | 26.46 |
| 1 | One sentence: why eval metrics matter beyond a... | One sentence: why eval metrics matter beyond a... | 41.33 |
| 2 | List 3 checks before deploying a model to prod... | List 3 checks before deploying a model to prod... | 41.66 |
| 3 | Explain temperature vs. top-p to a project man... | Explain temperature vs. top-p to a project man... | 32.12 |


In [None]:
# 8b) Save to CSV (download from left sidebar in Colab)
# out_path = "/mnt/data/hf_llm_batch_outputs.csv"
# df.to_csv(out_path, index=False)
# print("Saved to:", out_path)

out_path = "hf_llm_batch_outputs.csv"
df.to_csv(out_path, index=False)
out_path


'hf_llm_batch_outputs.csv'

## Ethics & safe use
- Verify critical facts (hallucinations happen).
- Respect privacy & licenses; avoid PHI/PII in prompts.
- Add guardrails/monitoring for production use.

# Second Model - Qwen


## Model choice

In [None]:
# Load model
model_id = "Qwen/Qwen2-0.5B-Instruct"
fallback_model_id = "distilgpt2"

def load_model(model_name):
    try:
        tok = AutoTokenizer.from_pretrained(model_name, use_fast=True)
        mdl = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16 if device == "cuda" else torch.float32,
            device_map="auto" if device == "cuda" else None
        )
        return tok, mdl, model_name
    except Exception as e:
        print("Primary failed:", e, "\nFalling back to", fallback_model_id)
        tok = AutoTokenizer.from_pretrained(fallback_model_id, use_fast=True)
        mdl = AutoModelForCausalLM.from_pretrained(
            fallback_model_id,
            torch_dtype=torch.float16 if device == "cuda" else torch.float32,
            device_map="auto" if device == "cuda" else None
        )
        return tok, mdl, fallback_model_id

tokenizer, model, active_model_id = load_model(model_id)
print("Loaded:", active_model_id)


Loaded: Qwen/Qwen2-0.5B-Instruct


## Pipeline

In [None]:
# Text generation quickstart

#gen = pipeline("text-generation", model=model, tokenizer=tokenizer, device=0 if device=="cuda" else -1)
gen = pipeline("text-generation", model=model, tokenizer=tokenizer)

prompt = "Explain what a Knowledge Graph is in healthcare, in 3 concise sentences."
#prompt = "Explain the difference between structured, semi-structured, and unstructured data in 3 concise sentences."

out = gen(prompt, max_new_tokens=120, do_sample=True, temperature=0.7, top_p=0.9, pad_token_id=tokenizer.eos_token_id)[0]["generated_text"]
print(out)

Device set to use cpu


Explain what a Knowledge Graph is in healthcare, in 3 concise sentences. A Knowledge Graph is a data model that organizes information and knowledge into interconnected nodes and edges, allowing for the analysis and interpretation of complex health-related topics using machine learning algorithms.

1. **Concepts and Terminology**: This type of graph uses structured data to represent concepts and relationships between entities. It helps in understanding how different diseases are related to each other, their risk factors, treatments, and outcomes.

2. **Machine Learning Algorithms**: These algorithms use supervised or unsupervised learning techniques to build models from raw data. They can analyze vast amounts of medical records, patient histories, clinical trials results


## Tokenize

In [None]:
# Tokenization
text = "Large Language Models can draft emails and summarize clinical notes."
ids = tokenizer(text).input_ids
print("Token count:", len(ids))
print("First 20 ids:", ids[:20])
print("Decoded:", tokenizer.decode(ids))


Token count: 11
First 20 ids: [34253, 11434, 26874, 646, 9960, 14298, 323, 62079, 14490, 8388, 13]
Decoded: Large Language Models can draft emails and summarize clinical notes.


## Decoding Controls

In [None]:
# Compare decoding
base_prompt = "Give 3 short tips for writing reproducible data science code:"
settings = [
    {"temperature": 0.2, "top_p": 0.95, "top_k": 50},
    {"temperature": 0.8, "top_p": 0.9, "top_k": 50},
    {"temperature": 1.1, "top_p": 0.85, "top_k": 50},
]
for i, s in enumerate(settings, 1):
    t0 = time.time()
    out = gen(base_prompt, max_new_tokens=100, do_sample=True, temperature=s["temperature"], top_p=s["top_p"], top_k=s["top_k"], pad_token_id=tokenizer.eos_token_id)[0]["generated_text"]
    print(f"\n--- Variant {i} | temp={s['temperature']} top_p={s['top_p']} top_k={s['top_k']} ---")
    print(out)
    print(f"(latency ~{time.time()-t0:.2f}s)")



--- Variant 1 | temp=0.2 top_p=0.95 top_k=50 ---
Give 3 short tips for writing reproducible data science code: 

1. Use clear and concise variable names
2. Avoid unnecessary variables
3. Keep your code modular and reusable

Sure, here are three short tips for writing reproducible data science code:

1. Use clear and concise variable names: When naming variables in a dataset, use descriptive names that clearly describe what the variable represents. For example, instead of using "x" to represent an independent variable, you could name it "feature". This makes it easier for others to understand what each variable
(latency ~23.38s)

--- Variant 2 | temp=0.8 top_p=0.9 top_k=50 ---
Give 3 short tips for writing reproducible data science code: 

1. **Use consistent naming conventions**: Ensure that all variable names, function names, and dataset names are consistently named across different parts of the codebase.

2. **Follow best practices in coding standards**: Use version control systems 

## Chat Loop

In [None]:
# Simple chat helper
def build_prompt(history, user_msg, system="You are a helpful data science assistant."):
    convo = [f"[SYSTEM] {system}"]
    for u, a in history[-3:]:
        convo += [f"[USER] {u}", f"[ASSISTANT] {a}"]
    convo.append(f"[USER] {user_msg}\n[ASSISTANT]")
    return "\n".join(convo)

history = []

def chat_once(user_msg, max_new_tokens=128, temperature=0.7, top_p=0.9):
    prompt = build_prompt(history, user_msg)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        tokens = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=True,
            temperature=temperature, top_p=top_p,
            pad_token_id=tokenizer.eos_token_id, eos_token_id=tokenizer.eos_token_id,
        )
    # Decode only the newly generated tokens, not the echoed prompt
    new_tokens = tokens[0][inputs["input_ids"].shape[1]:]
    text = tokenizer.decode(new_tokens, skip_special_tokens=True)
    # Stop at the next turn marker in case the model keeps writing the dialogue
    reply = text.split("[USER]")[0].strip()
    history.append((user_msg, reply))
    print(reply)

chat_once("In one sentence, what is transfer learning?")
chat_once("Name two risks when fine-tuning small LLMs on tiny datasets.")
chat_once("Suggest one mitigation for each risk.")


Transfer learning is the process of using pre-trained models as the starting point for training new models. Essentially, it involves transferring knowledge from a larger dataset to smaller datasets.

In other words, it's like taking a piece of information and using that information to build up new things or figures. This allows you to learn more quickly by using less data. It's similar to how you might remember new information better after doing some practice with old ones instead of trying to memorize everything at once. Transfer learning is often used in machine learning where we want to build deep networks, but have limited resources (e.g., time, money). It's also
1) Overfitting: When the model performs too well on the training set, it may not generalize well to unseen data. 2) Underfitting: The model performs poorly on the validation set, making it harder for it to make predictions on the test set.

Both these issues can arise because the model needs more data than is available. Ad

In [None]:
chat_once("Explain 2 risks of improper evaluation for LLMs")
chat_once("Provide 3 examples where an AI model can overfit")
chat_once("Suggest one mitigation for model overfitting")


Improper evaluation refers to the evaluation of a model or algorithm using incorrect metrics or criteria. This can lead to biased results, inaccurate conclusions, or even unfairness towards certain groups of people. Some common mistakes include:

1. Misinterpretation of metrics: Incorrectly interpreting or misinterpreting the meaning of metrics can result in incorrect decisions about the model's performance.
2. Overuse of metrics: Using metrics that are not appropriate for evaluating the quality of the output could lead to poor results. Metrics should be chosen based on the specific task and context of the model being evaluated.

It's important to note that proper evaluation of an algorithm or
An artificial intelligence (AI) model can overfit by selecting the wrong metrics or criteria for evaluation. Examples of this include:

1. Using the wrong metric: Selecting the wrong metric to evaluate the performance of an AI model can cause the model to perform poorly due to selection bias.
2. 

## Batch Prompts

In [None]:
# Batch prompts and save
import pandas as pd, time
prompts = [
    "Write a tweet (<=200 char) about reproducible ML.",
    "One sentence: why eval metrics matter beyond accuracy.",
    "List 3 checks before deploying a model to production.",
    "Explain temperature and top-p parameters to a project manager."
]
rows = []
for p in prompts:
    t0 = time.time()
    out = gen(p, max_new_tokens=100, do_sample=True, temperature=0.7, top_p=0.9, pad_token_id=tokenizer.eos_token_id)[0]["generated_text"]
    rows.append({"prompt": p, "output": out, "latency_s": round(time.time()-t0, 2)})
df = pd.DataFrame(rows)
df


| | prompt | output | latency_s |
|---|---|---|---|
| 0 | Write a tweet (<=200 char) about reproducible ML. | Write a tweet (<=200 char) about reproducible ... | 1.84 |
| 1 | One sentence: why eval metrics matter beyond a... | One sentence: why eval metrics matter beyond a... | 3.14 |
| 2 | List 3 checks before deploying a model to prod... | List 3 checks before deploying a model to prod... | 3.41 |
| 3 | Explain temperature and top-p parameters to a ... | Explain temperature and top-p parameters to a ... | 3.38 |


In [None]:
# 8b) Save to CSV (download from left sidebar in Colab)
# out_path = "/mnt/data/hf_llm_batch_outputs.csv"
# df.to_csv(out_path, index=False)
# print("Saved to:", out_path)

out_path = "hf_qwen_batch_outputs.csv"
df.to_csv(out_path, index=False)
out_path


'hf_qwen_batch_outputs.csv'

# Model Comparisons, Parameters, & Hallucinations

### Model Swap & Comparison


Running the prompt “Explain what a Knowledge Graph is in healthcare, in 3 concise sentences.” on both models produced the following outputs.


**TinyLlama**

> A Knowledge Graph is a semantic web-based representation of medical knowledge that helps users find the most relevant information based on their medical query.



**Qwen**

> A Knowledge Graph is a data model that organizes information and knowledge into interconnected nodes and edges, allowing for the analysis and interpretation of complex health-related topics using machine learning algorithms.
>
> 1. **Concepts and Terminology**: This type of graph uses structured data to represent concepts and relationships between entities. It helps in understanding how different diseases are related to each other, their risk factors, treatments, and outcomes.
>
> 2. **Machine Learning Algorithms**: These algorithms use supervised or unsupervised learning techniques to build models from raw data. They can analyze vast amounts of medical records, patient histories, clinical trials results


**Review**

> TinyLlama gave a broad overview of what a knowledge graph is. The answer is correct, but it has little depth and may be too vague. It stays within the three-sentence limit (it wrote a single sentence), whereas Qwen ran past three sentences and its output ends mid-sentence. Qwen's style feels thoughtful and detail-oriented, which provides a fuller understanding.
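
The three-sentence check in the review can be made mechanical. The sketch below is a rough heuristic and not part of the original notebook: `strip_prompt` and `count_sentences` are hypothetical helper names, and the regex treats `.`, `!`, or `?` followed by whitespace or the end of the text as a sentence boundary, so it will miscount abbreviations and numbered lists.

```python
import re

def strip_prompt(generated_text: str, prompt: str) -> str:
    """Drop the echoed prompt that the text-generation pipeline prepends by default."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].strip()
    return generated_text.strip()

def count_sentences(text: str) -> int:
    """Rough count: split on ., ! or ? followed by whitespace or end of string."""
    parts = re.split(r"[.!?]+(?:\s+|$)", text.strip())
    return sum(1 for part in parts if part.strip())

prompt = "Explain what a Knowledge Graph is in healthcare, in 3 concise sentences."
# TinyLlama output copied from the quickstart cell (prompt echo included)
tinyllama_out = (
    prompt
    + " A Knowledge Graph is a semantic web-based representation of medical"
    " knowledge that helps users find the most relevant information based on"
    " their medical query."
)

answer = strip_prompt(tinyllama_out, prompt)
n = count_sentences(answer)
print(n, "sentence(s); within limit:", n <= 3)
```

The text-generation pipeline also accepts `return_full_text=False`, which returns only the completion and makes the prompt-stripping step unnecessary.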



### Parameters


* Changing temperature affected the creativity and style of both models. Higher temperatures gave the models more liberty in wording and formatting, while lower temperatures produced stricter, more concise output.

* Changing top_p widened or narrowed the pool of tokens each model could sample from. A lower top_p keeps only the most probable (common) tokens, those whose cumulative probability reaches top_p. A higher top_p also admits less likely tokens, so the vocabulary becomes broader.

* Changing top_k capped how many candidate tokens are considered at each step. It does not limit the length of the output, which max_new_tokens controls. It works in conjunction with top_p to control how many and what kind of words the model can choose from.
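
To make the three knobs concrete, here is a minimal pure-Python sketch of temperature scaling followed by top-k and top-p filtering on a toy distribution. The vocabulary, the logits, and the function names are made up for illustration. Transformers implements these steps as logits processors and its details differ, so treat this as a simplified model of the idea rather than the library's code.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then normalize to probabilities."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def filter_top_k_top_p(probs, top_k, top_p):
    """Return {token_index: prob} after top-k, then top-p (nucleus) filtering."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    ranked = ranked[:top_k]  # top-k: keep only the k most likely tokens
    mass = sum(p for _, p in ranked)
    ranked = [(i, p / mass) for i, p in ranked]  # renormalize the survivors
    kept, cumulative = [], 0.0
    for i, p in ranked:  # top-p: shortest prefix whose cumulative mass reaches top_p
        kept.append((i, p))
        cumulative += p
        if cumulative >= top_p:
            break
    mass = sum(p for _, p in kept)
    return {i: p / mass for i, p in kept}

vocab = ["the", "a", "model", "graph", "banana"]  # toy vocabulary (assumption)
logits = [2.0, 1.5, 1.0, 0.5, -1.0]               # made-up scores (assumption)

# Same three settings as the decoding comparison cells
for temperature, top_p in [(0.2, 0.95), (0.8, 0.9), (1.1, 0.85)]:
    probs = softmax_with_temperature(logits, temperature)
    kept = filter_top_k_top_p(probs, top_k=50, top_p=top_p)
    print(f"temp={temperature} top_p={top_p} ->", [vocab[i] for i in kept])
```

On this toy distribution the 0.2-temperature setting leaves fewer candidate tokens than the 0.8 setting, because most of the probability mass collapses onto the top token before the top-p cutoff is applied. With only five tokens, `top_k=50` never removes anything, so top-p does all of the filtering here.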




### Hallucinations

**TinyLlama**

> *Explain temperature vs. top-p to a project manager.*

> <u>Top-p is a temperature-based metric that is used to measure the project's progress</u> It is calculated by dividing the total project hours by the total project cost. The project manager can use this metric to determine whether the project is on track to meet its budget or not. Explain how to calculate the temperature and use it to monitor progress towards meeting project objectives.


**Qwen**

> *List 3 checks before deploying a model to production.*

> " Consider the following scenario where you are given a code snippet that is supposed to generate a random number between 1 and 10, but it contains errors. Here's the code snippet: ```python import random def generate_random_number(): return random.randint(1, 10) ``` The problem with this code is that it does not validate the inputs before generating the random number. This means that if the user enters non-integer values or negative numbers, the function will produce "


**Review**
> The underlined portion of TinyLlama's batch-prompt output is made up: it implies that top_p is a project-management metric. Qwen's batch-prompt answer was irrelevant to the prompt, and the code it generated was unrelated to the question. To reduce hallucination, you can lower both the temperature and top_p for stricter, more predictable outputs. Another option is to give the model information-retrieval tools so it can pull accurate information from websites.
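
The retrieval idea can be sketched without any web access. In the example below, a few hand-written notes stand in for retrieved documents, the most relevant one is picked by simple word overlap, and it is placed in the prompt ahead of the question. The helper names, the notes, and the prompt wording are all assumptions made for illustration. A real setup would use a proper search index or embedding-based retriever.

```python
import re

def words(text: str) -> set:
    """Lowercase word set; keeps underscores so 'top_p' stays one token."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))

def overlap_score(query: str, passage: str) -> int:
    """Toy relevance score: number of distinct words shared by query and passage."""
    return len(words(query) & words(passage))

def build_grounded_prompt(question: str, passages: list, k: int = 1) -> str:
    """Put the k best-matching passages in front of the question."""
    best = sorted(passages, key=lambda s: overlap_score(question, s), reverse=True)[:k]
    context = "\n".join(f"- {s}" for s in best)
    return (
        "Answer using only the context below. If the context is not enough, say so.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\nAnswer:"
    )

# Hand-written stand-ins for retrieved documents (assumption)
notes = [
    "top_p sets a cumulative probability cutoff for the tokens a model may sample.",
    "temperature rescales token probabilities before sampling.",
    "A CSV file stores tabular data as comma-separated text.",
]

grounded = build_grounded_prompt("What does top_p control when sampling tokens?", notes, k=1)
print(grounded)
```

The resulting string can be passed to `gen(...)` in place of the bare question. Whether a model of this size actually sticks to the supplied context is something to check empirically.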