# [Hands-On] Large Language Model for Text Generation

Author: Your Name or Organization

> Educational Purpose

In this tutorial, we'll explore how to load and run a **4-bit quantized** Falcon-7B Instruct model using Hugging Face `transformers` and `bitsandbytes`. Storing the weights in 4 bits instead of 16 cuts the memory they need to roughly a quarter, while still delivering relatively strong performance for text generation tasks.

We'll walk through:
1. **Installing** necessary libraries  
2. **Loading** the Falcon-7B Instruct model in 4-bit precision  
3. **Generating** text responses from sample prompts  

## 1) Install Required Packages

In [12]:
!pip install --upgrade transformers accelerate bitsandbytes



- **transformers**: Core library for state-of-the-art NLP models.
- **accelerate**: Handles device placement and multi-GPU/distributed training or inference; the `device_map="auto"` option used when loading the model relies on it.
- **bitsandbytes**: Enables 8-bit or 4-bit quantization for model weights, saving memory.
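
Before moving on, it can help to confirm that the three packages are actually visible to the current Python environment. The snippet below is a minimal sketch using only the standard library; the helper name `installed_versions` is our own and is not part of any of these libraries.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package_name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Report the packages this tutorial depends on.
for name, version in installed_versions(
    ["transformers", "accelerate", "bitsandbytes"]
).items():
    print(f"{name}: {version or 'not installed'}")
```

If any package is reported as not installed, re-run the `pip install` cell above and restart the kernel.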

---

## 2) Imports & Basic Setup

We import PyTorch, along with classes from `transformers` that help us load the tokenizer, model, and configuration settings for 4-bit quantization.


In [14]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

- `AutoTokenizer`: Automatically loads the correct tokenizer for our specified model.
- `AutoModelForCausalLM`: Loads a causal language model (decoder-only) for text generation.
- `BitsAndBytesConfig`: Configuration class that specifies 4-bit quantization settings.

---

## 3) Load the Falcon-7B-Instruct Model in 4-bit Precision

Below is a helper function that:
1. Defines the model name (`tiiuae/falcon-7b-instruct`).
2. Sets `BitsAndBytesConfig` to load the model in 4-bit precision.
3. Instantiates the tokenizer and model from the Hugging Face Hub.
4. Returns both the tokenizer and model for further use.

In [16]:
def load_falcon_7b_instruct_4bit():
    """
    Loads the Falcon-7B-Instruct model using 4-bit quantization and returns
    the tokenizer and model objects.
    """
    model_name = "tiiuae/falcon-7b-instruct"
    
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4"
    )

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quant_config,
        device_map="auto",      # Automatically places model layers on available GPUs
        trust_remote_code=True  # Trust custom code from the model repo
    )
    model.eval()  # Set to evaluation mode
    return tokenizer, model


**Note**:  
- `device_map="auto"` ensures the model is loaded onto your available GPUs/CPU.  
- 4-bit quantization helps reduce memory usage but may slightly affect generation quality.
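
To get a feel for the memory saving, here is a rough back-of-the-envelope estimate. It is only a sketch under stated assumptions: we take "7B" to mean roughly 7 billion parameters, and we count only the raw weight storage. Quantization constants, any layers kept in higher precision, activations, and the generation cache are ignored, so real usage will be higher.

```python
def estimate_weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the raw weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


APPROX_PARAMS = 7e9  # assumption: "7B" read as roughly 7 billion parameters

fp16_gb = estimate_weight_memory_gb(APPROX_PARAMS, 16)
nf4_gb = estimate_weight_memory_gb(APPROX_PARAMS, 4)

print(f"16-bit weights: ~{fp16_gb:.1f} GB")  # ~14.0 GB
print(f" 4-bit weights: ~{nf4_gb:.1f} GB")   # ~3.5 GB
```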

---
## 4) Define a Text Generation Function

We create a function `generate_text` which:
1. Tokenizes the prompt into `input_ids`.
2. Uses `model.generate()` to produce an output sequence.
3. Decodes the generated tokens back to a string.

In [18]:
def generate_text(tokenizer, model, prompt, max_new_tokens=80, temperature=0.7, top_p=0.9):
    """
    Generates text from a given prompt using the provided tokenizer/model.
    """
    input_ids = tokenizer.encode(prompt, return_tensors='pt').to(model.device)
    
    with torch.no_grad():
        output_ids = model.generate(
            input_ids,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_p=top_p,
            do_sample=True,
            repetition_penalty=1.1
        )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


**Parameters**:
- `max_new_tokens`: Maximum number of tokens to generate.
- `temperature`: Controls randomness; higher means more diverse but potentially less coherent text.
- `top_p`: Used for nucleus sampling; at each step the model samples only from the smallest set of most-likely tokens whose cumulative probability reaches `top_p`.
- `repetition_penalty`: Penalizes repeated tokens to reduce looping outputs.
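
To make these parameters more concrete, the following pure-Python sketch shows the idea behind each one on a toy list of logits. It is a conceptual illustration only, not the code `transformers` runs internally, and the function names are our own.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_indices(probs, top_p=0.9):
    """Indices of the smallest set of most-likely tokens whose mass reaches top_p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


def penalize_repeats(logits, seen_token_ids, penalty=1.1):
    """Make already-generated tokens less likely by shrinking their logits."""
    out = list(logits)
    for i in set(seen_token_ids):
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out


logits = [2.0, 1.0, 0.5, -1.0]
sharp = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=1.5)
print("top token probability at T=0.5:", round(sharp[0], 3))
print("top token probability at T=1.5:", round(flat[0], 3))
print("kept by top_p=0.9:", top_p_indices([0.5, 0.3, 0.15, 0.05], top_p=0.9))
```

A lower temperature concentrates probability on the top token, while `top_p` trims the unlikely tail before sampling.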

---

## 5) Model Preparation
- Load the 4-bit quantized Falcon-7B-Instruct model.


In [37]:
# Load the tokenizer and model
tokenizer, model = load_falcon_7b_instruct_4bit()

Explicitly passing a `revision` is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 2/2 [00:04<00:00,  2.35s/it]
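
The log messages above suggest passing an explicit `revision` when `trust_remote_code=True` is used. Below is a minimal sketch of how that could look. The helper `pinned_from_pretrained_kwargs` is hypothetical, and the revision string is a placeholder: the commit hash to pin must be looked up in the model repository, and none is given in this tutorial.

```python
def pinned_from_pretrained_kwargs(revision, trust_remote_code=True):
    """Build keyword arguments that pin a Hub repository to one revision."""
    if not revision:
        raise ValueError("revision must be a non-empty branch, tag, or commit hash")
    return {"revision": revision, "trust_remote_code": trust_remote_code}


# Placeholder value: replace it with a commit hash you have reviewed.
kwargs = pinned_from_pretrained_kwargs("<commit-hash-you-reviewed>")
print(kwargs)

# Hypothetical usage inside the loader defined earlier:
# tokenizer = AutoTokenizer.from_pretrained(model_name, **kwargs)
# model = AutoModelForCausalLM.from_pretrained(
#     model_name, quantization_config=quant_config, device_map="auto", **kwargs
# )
```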


## 6) Demo: Generate Text for Individual Prompts

### Case A: Poem

In [20]:
# Case A: Poem
# Prompt:
prompt_case_a = """### Instruction:
Write a short poem about sunrise on a beach.
### Response:
"""

output_a = generate_text(tokenizer, model, prompt_case_a, max_new_tokens=60, temperature=0.7)
print("Generated Text:\n", output_a)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


Generated Text:
 ### Instruction:
Write a short poem about sunrise on a beach.
### Response:
The sun rises on the horizon,
Splashing its golden hues across the sky.
The waves of the sea shimmer in the light,
As the horizon glows with a heavenly sight.
The air is filled with a new-born energy,
As the day slowly starts to bring serenity


- The model successfully follows the poetic prompt, describing the sunrise visually.  
- Even in 4-bit quantization, the model retains its ability to generate coherent and creative text.  
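- It repeats some words ("horizon" appears twice), and the last line ends without punctuation, most likely because generation stopped at the 60-token limit, but overall it captures the intended imagery.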
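
Two details of the output above are worth noting: the warning about the missing attention mask and pad token, and the fact that the prompt is echoed back before the response. The variant below is a sketch of one way to address both. The function name `generate_response_only` is our own, and setting `pad_token_id` to the EOS token simply mirrors what the log says the library does by default.

```python
def generate_response_only(tokenizer, model, prompt,
                           max_new_tokens=80, temperature=0.7, top_p=0.9):
    """Generate text and return only the newly generated part (no prompt echo)."""
    # Calling the tokenizer directly returns input_ids and an attention_mask.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    prompt_len = len(inputs["input_ids"][0])

    output_ids = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],  # silences the mask warning
        max_new_tokens=max_new_tokens,
        temperature=temperature,
        top_p=top_p,
        do_sample=True,
        repetition_penalty=1.1,
        pad_token_id=tokenizer.eos_token_id,      # explicit instead of implicit
    )
    # Drop the prompt tokens so only the response is decoded.
    new_tokens = output_ids[0][prompt_len:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

It can be called in the same way as `generate_text`, for example `generate_response_only(tokenizer, model, prompt_case_a, max_new_tokens=60)`.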

### Case B: Broad Knowledge

In [22]:
# Prompt:
prompt_case_b = """### Instruction:
Explain who Albert Einstein was and why he is famous.
### Response:
"""

output_b = generate_text(tokenizer, model, prompt_case_b, max_new_tokens=60, temperature=0.7)
print("Generated Text:\n", output_b)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


Generated Text:
 ### Instruction:
Explain who Albert Einstein was and why he is famous.
### Response:
Albert Einstein (14 March 1879 – 18 April 1955) was a German-born theoretical physicist who developed the theory of relativity, one of the two pillars of modern physics. His work is also recognized in the field of quantum theory, cosmology, and astrophysics


- Here, the model draws on knowledge learned during pretraining to describe Einstein as the physicist who developed the theory of relativity, and it also mentions quantum theory, cosmology, and astrophysics.
- The generated text is succinct and factually accurate in broad strokes.  
- 4-bit quantization does not seem to hinder its factual output for well-known topics.

### Case C: Simple Reasoning

In [33]:
# Prompt:
prompt_case_c = """### Instruction:
You have 5 apples. You give 2 apples to your friend and then buy 3 more apples. How many apples do you have now?
### Response:
"""

output_c = generate_text(tokenizer, model, prompt_case_c, max_new_tokens=30, temperature=0.01)
print("Generated Text:\n", output_c)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


Generated Text:
 ### Instruction:
You have 5 apples. You give 2 apples to your friend and then buy 3 more apples. How many apples do you have now?
### Response:
You have 8 apples now.


- The model incorrectly calculated 5 - 2 + 3 = 8, when the correct answer is 6 apples.
- This happens because LLMs are pattern-matching text predictors, not calculators: they predict the next likely tokens rather than perform actual math.
- Even with a very low temperature (0.01), the answer is wrong, because reducing randomness doesn't improve the model's underlying ability to calculate.
- For reliable mathematical calculations, use ordinary program code instead of an LLM.

This example shows why LLMs, while powerful for language tasks, are not always suitable for precise arithmetic operations.
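
For comparison, the same word problem is trivial to compute directly. The short sketch below uses a helper name of our own choosing:

```python
def apples_remaining(start, given_away, bought):
    """Apples left after giving some away and then buying more."""
    return start - given_away + bought


answer = apples_remaining(start=5, given_away=2, bought=3)
print(f"You have {answer} apples now.")  # You have 6 apples now.
```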

### Case D: Summarize Pride and Prejudice in Two Sentences

In [36]:
# Prompt:
prompt_case_d = """### Instruction:
Please summarize the main events of 'Pride and Prejudice' in two sentences. Focus on the relationships between the characters.
### Response:
"""

output_d = generate_text(tokenizer, model, prompt_case_d, max_new_tokens=60, temperature=0.7)
print("Generated Text:\n", output_d)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:11 for open-end generation.


Generated Text:
 ### Instruction:
Please summarize the main events of 'Pride and Prejudice' in two sentences. Focus on the relationships between the characters.
### Response:
Elizabeth Bennet, a witty and intelligent young woman, is introduced to wealthy and proud Mr. Darcy. They overcome their pride and prejudices, eventually realizing they are perfect for one another.


- The model summarizes the novel in exactly two sentences, focusing on the relationship between Elizabeth Bennet and Mr. Darcy.
- It captures the central arc (both characters overcoming their pride and prejudices) but leaves out the other characters and the novel's social setting.
- Sampling makes the output length vary from run to run, so for strict two-sentence outputs you may need to tighten the prompt, limit `max_new_tokens`, or post-process the result.
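
As an illustration of post-processing, the sketch below keeps only the first `n` sentences of a generated response. It is a deliberately simple splitter of our own, not a library function: it treats `.`, `!`, and `?` at the end of a word as sentence boundaries and skips a small hand-picked list of abbreviations such as "Mr." so that "Mr. Darcy" is not cut in half.

```python
# Assumption: a tiny, hand-picked abbreviation list; extend it as needed.
ABBREVIATIONS = {"mr.", "mrs.", "ms.", "dr.", "st."}


def first_n_sentences(text, n=2):
    """Return the first n sentences of text using a naive word-based splitter."""
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        ends_sentence = word[-1] in ".!?" and word.lower() not in ABBREVIATIONS
        if ends_sentence:
            sentences.append(" ".join(current))
            current = []
            if len(sentences) == n:
                break
    else:
        # Fewer than n complete sentences: keep any trailing fragment.
        if current:
            sentences.append(" ".join(current))
    return " ".join(sentences)


response = (
    "Elizabeth Bennet, a witty and intelligent young woman, is introduced to "
    "wealthy and proud Mr. Darcy. They overcome their pride and prejudices, "
    "eventually realizing they are perfect for one another. A third sentence."
)
print(first_n_sentences(response, n=2))
```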

## Conclusion

In this hands-on tutorial, we demonstrated how to:
1. **Install** the necessary libraries (`transformers`, `accelerate`, `bitsandbytes`).
2. **Load** a Falcon-7B-Instruct model with **4-bit quantization** to save memory.
3. **Generate** text responses from sample instructions or prompts.

With this setup:
- You can handle larger models on limited hardware, although some quality trade-offs might appear.
- By tweaking generation parameters (`temperature`, `top_p`, `repetition_penalty`, etc.) or fine-tuning the model further with parameter-efficient fine-tuning (PEFT) methods such as LoRA, you can adapt the system to more specialized tasks or improve output quality.

Feel free to experiment with additional prompts or integrate this approach into larger applications!