<a href="https://colab.research.google.com/github/calmrocks/master-machine-learning-engineer/blob/main/GenAI/PromptEngineering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Advanced Prompt Engineering with Falcon-7B
This notebook demonstrates various prompt engineering techniques using the Falcon-7B model. We'll explore different prompting strategies and parameter tuning.

## Best Practices
Key takeaways:
1. Zero-shot works best for simple, straightforward tasks
2. Chain of Thought is excellent for complex reasoning
3. Few-shot is ideal when you have specific examples
4. Tree of Thoughts helps with decision-making tasks

Best practices:
- Start simple and increase complexity as needed
- Match the technique to the task type
- Consider computational efficiency
- Test different temperature values for optimal results

In [20]:
!pip install -q transformers accelerate
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import time
import gc

# Memory management utilities
def clear_gpu_memory():
    gc.collect()
    torch.cuda.empty_cache()
    print("GPU memory cleared")

def check_gpu_memory():
    if torch.cuda.is_available():
        print(f"Used: {torch.cuda.memory_allocated()/1e9:.2f}GB")
        print(f"Cached: {torch.cuda.memory_reserved()/1e9:.2f}GB")

## Model Setup
We'll use Falcon-7B-Instruct, a powerful 7B parameter model fine-tuned for instruction following. To run it on Colab's T4 GPU (16GB VRAM), we'll use 4-bit quantization to reduce memory usage.

In [None]:
def setup_model(temperature=0.7, top_p=0.9):
    model_id = "tiiuae/falcon-7b-instruct"

    print("Loading Falcon-7B model...")
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        trust_remote_code=True,
        torch_dtype=torch.float16  # Using float16 instead of 4-bit quantization
    )

    generator = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=256,
        do_sample=True,
        temperature=temperature,
        top_p=top_p,
        num_return_sequences=1
    )

    return generator

# Initialize model once
try:
    print("Checking GPU...")
    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("GPU:", torch.cuda.get_device_name(0))

    global_generator = setup_model()
    check_gpu_memory()
except Exception as e:
    print(f"Error during setup: {e}")

Checking GPU...
CUDA available: True
GPU: Tesla T4
Loading Falcon-7B model...


model.safetensors.index.json:   0%|          | 0.00/17.7k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.95G [00:00<?, ?B/s]

## Generation Parameters
Let's explore how different parameters affect the model's output:
- Temperature: Controls randomness (higher = more creative, lower = more focused)
- Top-p: Controls diversity via nucleus sampling
- Max tokens: Controls response length

In [16]:
def generate_text(prompt, temperature=0.7, top_p=0.9, max_tokens=256):
    try:
        response = global_generator(
            prompt,
            max_new_tokens=max_tokens,
            do_sample=True,
            temperature=temperature,
            top_p=top_p
        )
        return response[0]['generated_text']
    except Exception as e:
        return f"Error generating text: {e}"

# Test generation with different temperatures
prompt = "Write a short story about a robot learning to paint:"

print("Testing different temperatures:")
temperatures = [0.3, 0.7, 1.2]

for temp in temperatures:
    print(f"\nTemperature = {temp}:")
    print("-" * 50)
    response = generate_text(prompt, temperature=temp)
    print(response)
    time.sleep(1)  # Add small delay between generations

Conservative (Temperature = 0.3):
Loading Falcon-7B model...


The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


ImportError: Using `bitsandbytes` 4-bit quantization requires the latest version of bitsandbytes: `pip install -U bitsandbytes`

## 1. Zero-shot Prompting
Zero-shot prompting is the simplest form of prompting where we directly ask the model to perform a task without providing any examples. This technique tests the model's ability to understand and perform tasks based solely on instructions.

Key characteristics:
- No examples provided
- Relies on model's pre-trained knowledge
- Simplest to implement but may not always give optimal results

In [3]:
print("Zero-shot Prompting Example")
print("-" * 50)

zero_shot_prompt = """Classify the sentiment of this text as positive, negative, or neutral:
Text: "I absolutely love this new phone! It's amazing!"
Sentiment:"""

response = generate_response(generator, zero_shot_prompt)
print(response)

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Zero-shot Prompting Example
--------------------------------------------------




Classify the sentiment of this text as positive, negative, or neutral:
Text: "I absolutely love this new phone! It's amazing!"
Sentiment: "I absolutely love this new phone! It's amazing!"

Sentiment: "I absolutely love this new phone! It's amazing!"
Sentiment: "I absolutely love this new phone! It's amazing!"
Sentiment: "I absolutely love this new phone! It's amazing!"
Sentiment: "I absolutely love this new phone! It's amazing!"
Sentiment: "I absolutely love this new phone! It's amazing!"
Sentiment: "I absolutely love this new phone! It's amazing!"
Sentiment: "I absolutely love this new phone! It's amazing!"
Sentiment: "I absolutely love this new phone! It's amazing!"
Sentiment: "I absolutely love this new phone! It's amazing!"
Sentiment: "I absolutely love


## 2. One-shot Prompting
One-shot prompting provides the model with a single example of the desired task before asking it to perform a similar task. This can help the model better understand the expected format and type of response.

Benefits:
- Provides context through a single example
- Better performance than zero-shot
- Still maintains simplicity

In [4]:
print("One-shot Prompting Example")
print("-" * 50)

one_shot_prompt = """Classify the sentiment of text as positive, negative, or neutral:

Text: "This movie was terrible, I hated it."
Sentiment: negative

Text: "I absolutely love this new phone! It's amazing!"
Sentiment:"""

response = generate_response(generator, one_shot_prompt)
print(response)

One-shot Prompting Example
--------------------------------------------------
Classify the sentiment of text as positive, negative, or neutral:

Text: "This movie was terrible, I hated it."
Sentiment: negative

Text: "I absolutely love this new phone! It's amazing!"
Sentiment: positive

Text: "I love this movie, it's so funny!"
Sentiment: neutral

Text: "I love this movie, it's so funny!"
Sentiment: positive

Text: "I love this movie, it's so funny!"
Sentiment: neutral

Text: "I love this movie, it's so funny!"
Sentiment: positive

Text: "I love this movie, it's so funny!"
Sentiment: neutral

Text: "I love this movie, it's so funny!"
Sentiment: neutral

Text: "I love this movie, it's so funny!"
Sentiment: neutral

Text: "


## 3. Few-shot Prompting
Few-shot prompting extends one-shot prompting by providing multiple examples. This technique helps the model better understand patterns and expectations through multiple demonstrations.

Advantages:
- More robust performance
- Better pattern recognition
- Clearer context for complex tasks

In [None]:
# Cell 5: Few-shot Prompting
print("Few-shot Prompting Example")
print("-" * 50)

few_shot_prompt = """Solve the following math problems:

Problem: What is 15% of 200?
Solution: Let's solve this step by step:
1. To find 15% of 200, multiply 200 by 15/100
2. 200 × 15/100 = 200 × 0.15 = 30
Answer: 30

Problem: What is 25% of 80?
Solution: Let's solve this step by step:"""

response = generate_response(generator, few_shot_prompt)
print(response)

In [None]:
# Cell 6: Chain of Thought (CoT) Prompting
print("Chain of Thought Example")
print("-" * 50)

cot_prompt = """Let's solve this word problem step by step:

Problem: If a store has 150 apples and sells 30% of them on Monday, then sells 40% of the remaining apples on Tuesday, how many apples are left?

Let's think about this step by step:
1. First, let's calculate how many apples are sold on Monday
   * 30% of 150 = 150 × 0.30 = 45 apples sold
   * Remaining after Monday = 150 - 45 = 105 apples

2. Next, let's calculate Tuesday's sales
   * 40% of 105 = 105 × 0.40 = 42 apples sold
   * Remaining after Tuesday = 105 - 42 = 63 apples

Therefore, there are 63 apples left.

Problem: A restaurant has 200 customers per day and 15% are breakfast customers, 45% are lunch customers, and the rest are dinner customers. How many dinner customers are there?

Let's think about this step by step:"""

response = generate_response(generator, cot_prompt, max_length=400)
print(response)

In [None]:
# Cell 7: RAG (Retrieval-Augmented Generation) Example
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.utils import embedding_functions

# Setup knowledge base
knowledge_base = """
Python is a high-level, interpreted programming language created by Guido van Rossum in 1991.
Python supports multiple programming paradigms, including procedural, object-oriented, and functional programming.
Python's design philosophy emphasizes code readability with its notable use of significant whitespace.
"""

# Initialize ChromaDB and embed knowledge
chroma_client = chromadb.Client()
sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction()

# Create collection
collection = chroma_client.create_collection(
    name="python_facts",
    embedding_function=sentence_transformer_ef
)

# Add documents
collection.add(
    documents=[knowledge_base],
    metadatas=[{"source": "python_docs"}],
    ids=["doc1"]
)

def rag_response(query, collection, generator):
    # Retrieve relevant information
    results = collection.query(
        query_texts=[query],
        n_results=1
    )

    # Construct prompt with retrieved information
    context = results['documents'][0][0]
    prompt = f"""Using the following information, answer the question.

Context: {context}

Question: {query}
Answer:"""

    return generate_response(generator, prompt)

# Example usage
query = "What is Python and who created it?"
response = rag_response(query, collection, generator)
print("RAG Example")
print("-" * 50)
print(response)

In [None]:
# Cell 8: Instruction-Oriented Prompting (IoT)
print("Instruction-Oriented Prompting Example")
print("-" * 50)

iot_prompt = """Instructions: Generate a professional email to reschedule a meeting. Use these guidelines:
- Be polite and professional
- Provide a reason for rescheduling
- Suggest two alternative times
- Ask for confirmation

Email:"""

response = generate_response(generator, iot_prompt)
print(response)

In [None]:
# Cell 9: Tree of Thoughts (ToT) Prompting
print("Tree of Thoughts Example")
print("-" * 50)

tot_prompt = """Problem: Plan a birthday party for a 10-year-old child.

Let's explore multiple thought paths:

Path 1 - Indoor Party:
1. Venue options:
   * Home party
   * Indoor playground
   * Party center
2. Activities:
   * Games
   * Crafts
   * Entertainment

Path 2 - Outdoor Party:
1. Venue options:
   * Park
   * Backyard
   * Sports facility
2. Activities:
   * Sports
   * Outdoor games
   * Nature activities

Let's evaluate each path and choose the best option considering:
1. Weather risks
2. Cost
3. Entertainment value
4. Practicality

Analysis:"""

response = generate_response(generator, tot_prompt, max_length=500)
print(response)