## SmolLM3-3B For Abstractive Summarization

We'll use SmolLM3, a recent model from Hugging Face.  It stands out because it is relatively small at 3 billion parameters but has a 128K context window.  Let's look at [the model card](https://huggingface.co/HuggingFaceTB/SmolLM3-3B) from Hugging Face to get more background on what distinguishes it from other models.  Note that it is optimized for common sense, language understanding, math, code, long context and logical reasoning.  Hugging Face also provides [an excellent and comprehensive description of how it was trained](https://huggingface.co/blog/smollm3).

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/datasci-w266/2025-fall-main/blob/master/materials/lesson_notebooks/lesson_7_summarization_LLM.ipynb)

In [None]:
!pip install hf_transfer
# `!export` only affects a throwaway subshell, so set the variable in the notebook process instead
%env HF_HUB_ENABLE_HF_TRANSFER=1



In [None]:
!pip install -q -U bitsandbytes
!pip install -q -U flash_attn
!pip install -q -U transformers
!pip install -q -U accelerate



In [None]:
import torch
from transformers import pipeline
from pprint import pprint

Here's some text from the introduction to [The Prompt Report: A Systematic Survey of Prompting Techniques](https://arxiv.org/pdf/2406.06608).  Let's have the model summarize it.

In [None]:
ARTICLE = "Scope of Study We create a broad directory of prompting techniques, which can be quickly understood and easily implemented for rapid experimentation by developers and researchers. To this end, we limit our study to focus on discrete prefix prompts (Shin et al., 2020a) rather than cloze prompts (Petroni et al., 2019; Cui et al., 2021), because modern LLM architectures (especially decoder-only models), which use prefix prompts, are widely used and have robust support for both consumers and researchers. Additionally, we refined our focus to hard (discrete) prompts rather than soft (continuous) prompts and leave out papers that make use of techniques using gradient-based updates (i.e. fine-tuning). Finally, we only study task-agnostic techniques. These decisions keep the work approachable to less technical readers and maintain a manageable scope. "

ARTICLE += "Sections Overview We conducted a machine-assisted systematic review grounded in the PRISMA process (Page et al., 2021) (Section 2.1) to identify 58 different text-based prompting techniques, from which we create a taxonomy with a robust terminology of prompting terms (Section 1.2) While much literature on prompting focuses on English-only settings, we also discuss multilingual techniques (Section 3.1). Given the rapid growth in multimodal prompting, where prompts may include media such as images, we also expand our scope to multimodal techniques (Section 3.2). Many multilingual and multimodal prompting techniques are direct extensions of English text-only prompting techniques. "

ARTICLE += "As prompting techniques grow more complex, they have begun to incorporate external tools, such as Internet browsing and calculators. We use the term ‘agents‘ to describe these types of prompting techniques (Section 4.1). It is important to understand how to evaluate the outputs of agents and prompting techniques to ensure accuracy and avoid hallucinations."

len(ARTICLE)

1899

In [None]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"

We'll use `pprint` to wrap long output across several lines so that we don't need a large horizontal scroll bar to read it.

In [None]:
from pprint import pprint

Now, let's load some Hugging Face abstractions -- AutoModelForCausalLM, AutoTokenizer, and the pipeline.  These make it very easy to just try a model and see how it performs.

In [None]:

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

torch.random.manual_seed(0)

<torch._C.Generator at 0x7b6ab77d7390>

We're going to quantize our model, which shrinks its memory footprint, usually without reducing its performance in any significant way.  We'll discuss quantization in a later session.

In [None]:
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,

)
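
As a rough back-of-the-envelope illustration of why 4-bit loading shrinks the footprint, the sketch below estimates weight storage only. The helper name `weight_memory_gb` is hypothetical, and the parameter count simply takes "3B" at face value. The real footprint will be somewhat larger, since this ignores quantization constants, any layers kept in higher precision, activations and the KV cache.

```python
# Weight-only memory estimate. Assumption: exactly 3 billion parameters.
NUM_PARAMS = 3_000_000_000


def weight_memory_gb(num_params: int, bits_per_param: float) -> float:
    """Return approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


bf16_gb = weight_memory_gb(NUM_PARAMS, 16)  # 16-bit weights
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)    # 4-bit quantized weights

print(f"16-bit weights: {bf16_gb:.1f} GB")
print(f"4-bit weights:  {nf4_gb:.1f} GB")
```

Under these assumptions the 4-bit weights need about a quarter of the memory of the 16-bit weights.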

In [None]:
# Use a pipeline as a high-level helper

model_id = "HuggingFaceTB/SmolLM3-3B"

messages = [
    {"role": "system", "content": "/no_think"},    # turn off reasoning; use "/think" or nothing to turn it on
    {"role": "user", "content": "Explain and contrast extractive and abstractive summarization."},
]

# This loads the full 3B-parameter model with quantization; it uses ~3.5 of 15.0 GB and takes about 3 minutes to respond
smol_pipe = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"dtype": torch.bfloat16, "quantization_config": quantization_config},
    device_map="auto",
)

outputs = smol_pipe(messages, max_new_tokens=2048)       # a shorter token limit here will hurt reasoning


pprint(outputs[0]["generated_text"][-1], compact=True)

Device set to use cuda:0


{'content': 'Extractive and abstractive summarization are two types of text '
            'summarization techniques that aim to condense the main ideas of a '
            'document into a shorter summary. The primary difference between '
            'them lies in how they extract and represent the information.\n'
            '\n'
            '**Extractive Summarization:**\n'
            '\n'
            'Extractive summarization involves extracting specific phrases or '
            'sentences directly from the original text to create a summary. '
            'This approach is often referred to as "sentence-based" '
            'summarization. The algorithm identifies key sentences or phrases '
            'in the original text and selects them to include in the summary. '
            'The goal is to preserve the original meaning and structure of the '
            'text.\n'
            '\n'
            'Key characteristics of extractive summarization:\n'
            '\n'
            '1.

## Short vs. Long Thought Models

We'll be using SmolLM3, a hybrid model that can give either quick answers or longer, reasoned ones.  Depending on your problem, the short-thought mode is sometimes the best choice.  Other times, such as for math, logic, or puzzles, the longer-thought reasoning mode is more appropriate.  You can switch modes by putting a flag, `/think` or `/no_think`, in the system prompt; the default is the longer-thought mode.  After you've run all seven prompts in the short-thought mode, you can change the flag to `/think` and try the reasoning mode to see how it performs.
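
To make switching between the two modes less error-prone, one option is a small helper that appends the flag to the system prompt. This is only a sketch: `build_messages` is a hypothetical name, not part of any library, and it assumes the flag is read from the end of the system message in the same way as the `/no_think` examples in this notebook.

```python
def build_messages(user_text: str, instructions: str = "", think: bool = False) -> list[dict]:
    """Build a chat message list with the reasoning flag appended to the system prompt."""
    flag = "/think" if think else "/no_think"
    # strip() removes the leading space when there are no extra instructions
    system = f"{instructions} {flag}".strip()
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_text},
    ]


quick = build_messages("What is 17 * 23?", think=False)
slow = build_messages("What is 17 * 23?", think=True)

print(quick[0]["content"])
print(slow[0]["content"])
```

Either list can then be passed to the pipeline in place of a hand-written `messages` list.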

What does the string `HuggingFaceTB/SmolLM3-3B` mean?  The `HuggingFaceTB` portion identifies the publishing organization, here Hugging Face. `SmolLM3` is the name of the model.  `3B` refers to the variant of the model, usually indicating the number of parameters. Finally, there is another model called `HuggingFaceTB/SmolLM3-3B-Base`; the `Base` suffix means the model has only been pre-trained and not post-trained, so it should not be good at following our instructions.

We'll continue to run the fully post-trained model.  We'll construct our prompt and put it in the `messages` list.  Note that the model is trained for dialog, so we can alternate between the 'user' and 'assistant' roles.  We can also supply just the initial 'user' message if we only want one prompt.
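
As an illustration of alternating roles, here is a sketch of a multi-turn `messages` list. The conversation content is invented for the example, and `roles_alternate` is a hypothetical sanity check, not a library function.

```python
# Hypothetical multi-turn conversation; the assistant turn is made up for illustration.
conversation = [
    {"role": "system", "content": "/no_think"},
    {"role": "user", "content": "What is abstractive summarization?"},
    {"role": "assistant", "content": "It rewrites the source in new words instead of copying sentences."},
    {"role": "user", "content": "How does that differ from extractive summarization?"},
]


def roles_alternate(messages: list[dict]) -> bool:
    """Check that, ignoring system messages, roles go user, assistant, user, ..."""
    turns = [m["role"] for m in messages if m["role"] != "system"]
    expected = ["user", "assistant"] * (len(turns) // 2 + 1)
    return turns == expected[: len(turns)]


print(roles_alternate(conversation))
```

A list like `conversation` ends on a 'user' turn, so the model's next output is the reply to that last question.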


Now let's try it for abstractive summarization.  Note that generating an answer takes a while because this model has 3 billion parameters.  The next cell can take up to 1 minute to complete.

How good is the output from SmolLM3?  How can we measure its performance?  What are all of the elements we would need to run, say, ROUGE?
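
One way to think about the ROUGE question: we need a candidate summary from the model, at least one human-written reference summary, and a way to tokenize both. The sketch below computes ROUGE-1 (unigram overlap) in plain Python, assuming lowercase whitespace tokenization and no stemming. The `rouge1` helper and both example strings are hypothetical; a real evaluation would more likely use a maintained ROUGE package.

```python
from collections import Counter


def rouge1(candidate: str, reference: str) -> dict:
    """Compute ROUGE-1 precision, recall and F1 from clipped unigram overlap."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


reference = "the paper surveys 58 prompting techniques"  # hypothetical human reference
candidate = "the paper looks at 58 techniques"           # hypothetical model output
scores = rouge1(candidate, reference)
print(scores)
```

Recall is the share of reference words the candidate recovers, and precision is the share of candidate words that appear in the reference.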

In [None]:
messages = [
            {"role": "system", "content": "You are an expert on natural language processing.  Please summarize the following content for a fifth grader. Your summary should be no longer than five sentences. /no_think"},
            {"role": "user", "content": ARTICLE},
]

prompt = smol_pipe.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
)

# Let's set some sampling values to have more control over the output
outputs = smol_pipe(
    prompt,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
pprint(outputs[0]["generated_text"][len(prompt):], compact=True)

('We studied different ways to ask computers to do things, called "prompting '
 'techniques." We focused on a type of prompt called "discrete prefix '
 'prompts," which are easy to understand and use. We looked at 58 different '
 'techniques and organized them into a helpful system so people can quickly '
 'find what they need. Some of these techniques help with tasks like math or '
 'answering questions. We also considered how to use these prompts with '
 'different languages and how to use more than just text, like pictures. Some '
 'prompts can use tools like the internet or calculators, which we call '
 '"agents." It\'s important to know how to check if the answers are right to '
 'make sure the prompts are working well.')
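
To build some intuition for the `temperature` and `top_p` values used in the call above, here is a simplified sketch of temperature scaling followed by nucleus (top-p) filtering on a toy distribution. The function `top_p_filter` and the toy logits are hypothetical, and this illustrates the general idea only, not the library's actual implementation.

```python
import math


def top_p_filter(logits: dict[str, float], temperature: float, top_p: float) -> dict[str, float]:
    """Scale logits by temperature, then keep the smallest token set whose mass reaches top_p."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    max_logit = max(scaled.values())  # subtract the max for numerical stability
    exp = {tok: math.exp(v - max_logit) for tok, v in scaled.items()}
    total = sum(exp.values())
    ranked = sorted(((tok, e / total) for tok, e in exp.items()), key=lambda kv: kv[1], reverse=True)

    kept, mass = {}, 0.0
    for tok, p in ranked:
        kept[tok] = p
        mass += p
        if mass >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1 before sampling
    norm = sum(kept.values())
    return {tok: p / norm for tok, p in kept.items()}


toy_logits = {"cat": 2.0, "dog": 1.0, "fish": 0.0, "rock": -3.0}  # made-up next-token scores
filtered = top_p_filter(toy_logits, temperature=0.6, top_p=0.9)
print(filtered)
```

A temperature below 1 sharpens the distribution toward the most likely tokens, and `top_p` then discards the unlikely tail before sampling.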


Try it yourself.  You can fill in the system and the user portions of the prompt.  See what kinds of questions it can answer and how well it summarizes content.  What happens when you let the model "/think" as opposed to "/no_think"?

In [None]:
messages = [
            {"role": "system", "content": "Your Value Here"},
            {"role": "user", "content": "Your Value Here"},
]

prompt = smol_pipe.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
)


# Let's set some sampling values to have more control over the output
outputs = smol_pipe(
    prompt,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
pprint(outputs[0]["generated_text"][len(prompt):], compact=True)

Setting `pad_token_id` to `eos_token_id`:128009 for open-end generation.


('It seems you just repeated the phrase. Could you provide more context or '
 'clarify what you mean by "Your Value Here"?')
