# Hands-On Prompt Engineering with Gemma

**DOST-ITDI AI Training Workshop**  
**Day 2 - Bonus Session: Using Open-Source LLMs for Research**

---

## Learning Objectives
1. Load and use Gemma (Google's open-source LLM)
2. Apply prompt engineering templates to real chemistry problems
3. Understand when to use open-source vs commercial LLMs
4. Practice crafting effective prompts for research tasks

## Why Gemma?

**Advantages:**
- ✅ Free and open-source
- ✅ Runs on Colab free tier (with T4 GPU)
- ✅ No paid API keys needed (a free Hugging Face account and license acceptance are required to download the weights)
- ✅ Good for chemistry/science tasks
- ✅ Can run offline (once downloaded)

**When to use Gemma:**
- Quick calculations and explanations
- Learning and experimenting with LLMs
- Budget-constrained projects
- Privacy-sensitive tasks (can run locally)

**When to use commercial LLMs (GPT-4, Claude):**
- Critical research decisions
- Complex reasoning tasks
- Need for latest knowledge
- Production applications

## Part 1: Setup and Load Gemma

In [None]:
# Install required libraries
!pip install -q transformers accelerate

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import warnings
warnings.filterwarnings('ignore')

print("✓ Libraries installed!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

# Check device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"\nUsing device: {device}")

if device == "cpu":
    print("\n[NOTE] Running on CPU - generation will be slower but will work!")
    print("[TIP] For faster results in Colab: Runtime → Change runtime type → T4 GPU")
else:
    print("\n[OK] GPU detected! Generation will be fast.")

In [None]:
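# Load Gemma model (using 2B instruction-tuned version)
# Note: this is a gated model on Hugging Face. Accept the license on the model page
# and authenticate first, e.g. `from huggingface_hub import notebook_login; notebook_login()`.
model_name = "google/gemma-2b-it"  # "it" = instruction-tuned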

print(f"Loading {model_name}...")
print("This may take 1-2 minutes on first run (downloading ~5GB model)")
print("Subsequent runs will be much faster!\n")

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32
)

print("✓ Gemma loaded successfully!")
print(f"Model size: {model.num_parameters() / 1e9:.1f}B parameters")
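
The float16/float32 choice in the loading cell largely determines how much memory the weights need. A back-of-the-envelope sketch, assuming roughly 2.5 billion parameters (an approximate figure; the loading cell prints the actual count) and ignoring activations and the generation cache:

```python
# Rough memory estimate for model weights only (activations and cache are ignored).
BYTES_PER_PARAM = {"float32": 4, "float16": 2}

def weight_memory_gb(num_params, dtype):
    """Return the approximate size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

# 2.5e9 is an assumed, approximate parameter count;
# use model.num_parameters() for the real value.
approx_params = 2.5e9
for dtype in ("float16", "float32"):
    print(f"{dtype}: ~{weight_memory_gb(approx_params, dtype):.1f} GB")
```

Halving the bytes per parameter is why float16 is preferred on the GPU, where memory is the tighter constraint.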

In [None]:
# Helper function for text generation
def generate_with_gemma(prompt, max_new_tokens=300, temperature=0.7):
    """
    Generate text using Gemma model

    Args:
        prompt: Input prompt/question
        max_new_tokens: Maximum tokens to generate
        temperature: Creativity (0.1=focused, 1.0=creative)

    Returns:
        Generated text
    """
    # Format for instruction-tuned model
    formatted_prompt = f"<start_of_turn>user\n{prompt}<end_of_turn>\n<start_of_turn>model\n"

    # Tokenize
    inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)

    # Generate
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        temperature=temperature,
        do_sample=True,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id
    )

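    # Decode only the newly generated tokens (everything after the prompt).
    # Splitting on "<start_of_turn>model" is unreliable here because
    # skip_special_tokens=True strips the turn markers from the decoded text.
    prompt_length = inputs["input_ids"].shape[1]
    response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()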

    return response

print("✓ Helper function ready!")
print("\nTest it with: generate_with_gemma('What is aspirin?')")
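
The turn markers used in `generate_with_gemma` can also be assembled for multi-turn conversations. The following is a minimal sketch, assuming the same `<start_of_turn>`/`<end_of_turn>` format as the helper; `build_gemma_prompt` is a hypothetical function written for illustration, not part of any library. In practice, the tokenizer's `apply_chat_template` method in `transformers` is meant to handle this formatting.

```python
def build_gemma_prompt(turns):
    """Build a Gemma-style prompt from (role, text) pairs; role is 'user' or 'model'."""
    parts = []
    for role, text in turns:
        if role not in ("user", "model"):
            raise ValueError(f"Unknown role: {role}")
        parts.append(f"<start_of_turn>{role}\n{text}<end_of_turn>\n")
    # Leave an open model turn so generation continues as the model's reply
    parts.append("<start_of_turn>model\n")
    return "".join(parts)

example = build_gemma_prompt([("user", "What is aspirin?")])
print(example)
```

Passing earlier `("model", ...)` replies in the list lets a follow-up question keep the context of the conversation.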

## Part 2: Chain-of-Thought Prompting

**Best for:** Calculations, step-by-step reasoning, problem-solving

**Temperature:** Low (0.2-0.4) for factual accuracy

In [None]:
# Example 1: Molarity Calculation
prompt_molarity = """Let's think step by step to calculate the molarity of a solution.

Given:
- Mass of NaCl: 5.85 g
- Volume of solution: 500 mL
- Molecular weight of NaCl: 58.5 g/mol

Step 1: Calculate moles of NaCl
Step 2: Convert volume to liters
Step 3: Calculate molarity

Show your work for each step."""

print("Prompt Template: Chain-of-Thought")
print("="*80)
print(prompt_molarity)
print("\n" + "="*80)
print("Gemma's Response:")
print("="*80)

response = generate_with_gemma(prompt_molarity, max_new_tokens=400, temperature=0.3)
print(response)
print("="*80)
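
Because small models can slip on arithmetic, it helps to check the numbers independently. A short sketch that recomputes the molarity from the values given in the prompt (the function name `molarity` is just for illustration):

```python
def molarity(mass_g, molar_mass_g_per_mol, volume_ml):
    """Molarity (mol/L) = (mass / molar mass) / volume in liters."""
    moles = mass_g / molar_mass_g_per_mol   # Step 1: grams -> moles
    liters = volume_ml / 1000               # Step 2: mL -> L
    return moles / liters                   # Step 3: mol / L

result = molarity(mass_g=5.85, molar_mass_g_per_mol=58.5, volume_ml=500)
print(f"Molarity: {result:.3f} mol/L")  # prints "Molarity: 0.200 mol/L"
```

If Gemma's answer differs from 0.2 mol/L, inspect which of the three steps went wrong.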

## Part 3: Data Interpretation

**Best for:** Analyzing spectroscopy data, chromatography, experimental results

**Temperature:** Medium (0.4-0.6) for balanced creativity and accuracy

In [None]:
# Example 2: IR Spectroscopy Interpretation
prompt_ir = """Interpret the following IR spectroscopy results:

Context:
- Compound: Unknown organic compound
- Analysis method: Infrared spectroscopy
- Research objective: Identify functional groups

Data:
- Strong, broad peak at 3300 cm⁻¹
- Strong peak at 1710 cm⁻¹
- Medium peaks at 2950 and 2850 cm⁻¹

Please provide:
1. Identification of key peaks
2. Functional groups present
3. Possible compound type"""

print("Prompt Template: Data Interpretation")
print("="*80)
print(prompt_ir)
print("\n" + "="*80)
print("Gemma's Response:")
print("="*80)

response = generate_with_gemma(prompt_ir, max_new_tokens=350, temperature=0.5)
print(response)
print("="*80)
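
One way to sanity-check the model's peak assignments is a small lookup of characteristic absorption ranges. The sketch below uses approximate textbook ranges that are assumptions for illustration only; real assignments depend on the compound and measurement conditions, so consult a proper correlation table for actual work.

```python
# Approximate textbook ranges in cm^-1 (illustrative assumptions, not a reference table).
IR_RANGES = [
    ("O-H stretch (alcohol, broad)", 3200, 3600),
    ("N-H stretch (amine/amide)", 3300, 3500),
    ("O-H stretch (carboxylic acid, very broad)", 2500, 3300),
    ("C-H stretch (sp3)", 2850, 3000),
    ("C=O stretch (carbonyl)", 1650, 1800),
]

def candidate_assignments(wavenumber):
    """Return every listed functional group whose range contains the wavenumber."""
    return [label for label, low, high in IR_RANGES if low <= wavenumber <= high]

for peak in (3300, 1710, 2950, 2850):
    print(peak, "->", candidate_assignments(peak))
```

Several ranges overlap, which is exactly why peak shape and intensity (included in the prompt) matter for the final interpretation.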

## Part 4: Role-Playing Expert

**Best for:** Getting domain-specific advice, design recommendations, evaluations

**Temperature:** Medium-High (0.5-0.7) for more diverse suggestions

In [None]:
# Example 3: Medicinal Chemistry Expert
prompt_expert = """You are an expert medicinal chemist with 20 years of experience in drug design.

Review this proposed molecule for potential drug development:
SMILES: CC(=O)Oc1ccccc1C(=O)O (Aspirin)

Evaluate:
1. Drug-likeness (Lipinski's Rule of Five)
2. Potential bioactivity
3. Synthesis feasibility
4. Recommendations for optimization"""

print("Prompt Template: Role-Playing Expert")
print("="*80)
print(prompt_expert)
print("\n" + "="*80)
print("Gemma's Response:")
print("="*80)

response = generate_with_gemma(prompt_expert, max_new_tokens=400, temperature=0.6)
print(response)
print("="*80)
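
The Rule of Five part of the evaluation can be checked deterministically. A minimal sketch, assuming the descriptor values are already known: the aspirin numbers below are approximate literature values entered by hand, not computed from the SMILES string (a cheminformatics toolkit such as RDKit could compute them instead).

```python
def lipinski_violations(mol_weight, logp, h_donors, h_acceptors):
    """Return a list of Lipinski Rule of Five criteria that the molecule violates."""
    violations = []
    if mol_weight > 500:
        violations.append("molecular weight > 500 Da")
    if logp > 5:
        violations.append("logP > 5")
    if h_donors > 5:
        violations.append("H-bond donors > 5")
    if h_acceptors > 10:
        violations.append("H-bond acceptors > 10")
    return violations

# Approximate, hand-entered values for aspirin (assumed for illustration)
aspirin = dict(mol_weight=180.16, logp=1.2, h_donors=1, h_acceptors=4)
print(lipinski_violations(**aspirin))  # prints []
```

An empty list means no violations, which gives a concrete baseline for judging the model's drug-likeness commentary.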

## Part 5: Literature Summarization

**Best for:** Condensing research papers, extracting key findings, quick reviews

**Temperature:** Low-Medium (0.3-0.5) to maintain factual accuracy

In [None]:
# Example 4: Research Paper Summary
prompt_summary = """Summarize the following research paper about nanoparticle synthesis.

Focus on:
1. Main research question
2. Methodology used
3. Key findings

Format: Bullet points, concise

Paper abstract:
Silver nanoparticles were synthesized using Azadirachta indica leaf extract as a reducing agent.
UV-Vis spectroscopy confirmed formation with a peak at 420 nm. TEM analysis showed spherical
particles averaging 25 nm. Antimicrobial testing against E. coli and S. aureus demonstrated
MIC values of 10 and 15 µg/mL respectively, indicating strong antibacterial properties suitable
for biomedical applications."""

print("Prompt Template: Literature Summarization")
print("="*80)
print(prompt_summary)
print("\n" + "="*80)
print("Gemma's Response:")
print("="*80)

response = generate_with_gemma(prompt_summary, max_new_tokens=250, temperature=0.4)
print(response)
print("="*80)
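
Summaries are a common place for hallucinated figures, so a quick automated check can flag numbers in the summary that never appear in the source. This is a rough sketch; the `summary` string is a hypothetical model output with a deliberate error, and the helper names are illustrative.

```python
import re

def numbers_in(text):
    """Return the set of numeric strings (integers or decimals) found in text."""
    return set(re.findall(r"\d+(?:\.\d+)?", text))

def unsupported_numbers(summary, source):
    """Numbers that appear in the summary but not in the source text."""
    return numbers_in(summary) - numbers_in(source)

abstract = "UV-Vis peak at 420 nm. Particles averaging 25 nm. MIC values of 10 and 15 µg/mL."
# Hypothetical model output containing one wrong value (52 instead of 25)
summary = "- Peak at 420 nm\n- Particle size about 52 nm\n- MIC: 10 and 15 µg/mL"
print(unsupported_numbers(summary, abstract))  # prints {'52'}
```

The check is deliberately crude: it will also flag list numbering such as "1." and cannot tell whether a correct number is attached to the wrong quantity, so it complements manual review without replacing it.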

## Part 6: Experimental Design

**Best for:** Planning experiments, methodology suggestions, troubleshooting

**Temperature:** Medium (0.5-0.6) for creative but grounded suggestions

In [None]:
# Example 5: Experimental Protocol
prompt_experiment = """Design a simple experimental protocol for:

Research Goal: Test the antimicrobial activity of plant extracts

Available:
- Guava leaf extract
- E. coli bacterial culture
- Basic microbiology lab equipment
- Petri dishes and agar

Provide:
1. Step-by-step procedure (5-7 steps)
2. Expected observations
3. Safety considerations"""

print("Prompt Template: Experimental Design")
print("="*80)
print(prompt_experiment)
print("\n" + "="*80)
print("Gemma's Response:")
print("="*80)

response = generate_with_gemma(prompt_experiment, max_new_tokens=400, temperature=0.5)
print(response)
print("="*80)

## Part 7: Your Turn - Practice Prompting!

### Exercise 1: Modify the prompts above for your own research

Try these variations:
1. Change the compound in the IR spectroscopy example
2. Ask about YOUR research molecule using the expert template
3. Summarize a paper from your field
4. Design an experiment for your current project

### Exercise 2: Experiment with temperature

Run the same prompt with different temperatures:
- Temperature = 0.1 (very focused)
- Temperature = 0.5 (balanced)
- Temperature = 0.9 (very creative)

Compare the responses!
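
To see why temperature changes the output, recall that sampling conventionally divides the model's raw scores (logits) by the temperature before converting them to probabilities with softmax. The sketch below uses made-up logits for three candidate tokens; it illustrates the general mechanism and is not Gemma's internal code.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.1, 0.5, 0.9):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

At low temperature almost all probability goes to the top-scoring token, while higher temperatures spread it across alternatives, which is what makes responses more varied.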

In [None]:
# YOUR PRACTICE AREA - Modify this cell!

my_prompt = """[YOUR PROMPT HERE]

Replace this with your own chemistry question or task!
"""

print("Your Custom Prompt:")
print("="*80)
print(my_prompt)
print("\n" + "="*80)
print("Gemma's Response:")
print("="*80)

# Uncomment to run:
# response = generate_with_gemma(my_prompt, max_new_tokens=300, temperature=0.5)
# print(response)
# print("="*80)

## Summary: Best Practices

### ✅ DO:

1. **Be Specific**
   - Provide context and background
   - Specify desired output format
   - Give examples when possible

2. **Choose Right Temperature**
   - Low (0.2-0.4): Calculations, facts
   - Medium (0.5-0.6): Balanced tasks
   - High (0.7-0.9): Creative ideation

3. **Iterate and Refine**
   - Start simple, add details
   - Compare different phrasings
   - Build a library of effective prompts

4. **Verify Outputs**
   - Always fact-check chemistry data
   - Cross-reference with literature
   - Use as starting point, not final answer
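
A prompt library can start as a plain dictionary of reusable templates. This is a minimal sketch; the template wording and the names `PROMPT_LIBRARY` and `render_prompt` are illustrative assumptions, loosely modeled on the examples in this notebook.

```python
# Reusable templates with named placeholders filled in via str.format
PROMPT_LIBRARY = {
    "chain_of_thought": (
        "Let's think step by step to {task}.\n\n"
        "Given:\n{givens}\n\n"
        "Show your work for each step."
    ),
    "expert_review": (
        "You are an expert {role}.\n\n"
        "Review the following:\n{subject}\n\n"
        "Evaluate:\n{criteria}"
    ),
}

def render_prompt(name, **fields):
    """Fill in the named template; raises KeyError if a template or field is missing."""
    return PROMPT_LIBRARY[name].format(**fields)

prompt = render_prompt(
    "chain_of_thought",
    task="calculate the molarity of a solution",
    givens="- Mass of NaCl: 5.85 g\n- Volume of solution: 500 mL",
)
print(prompt)
```

The rendered string can be passed straight to `generate_with_gemma`, and new templates can be added as they prove useful.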

### ❌ DON'T:

1. **Don't Trust Blindly**
   - LLMs can hallucinate facts
   - May generate plausible but wrong chemistry
   - Always validate critical information

2. **Don't Use for Critical Decisions**
   - Safety calculations
   - Drug dosing
   - Publication claims

3. **Don't Share Sensitive Data**
   - Unpublished research
   - Proprietary formulations
   - Confidential results

### When to Use Gemma vs Other Tools:

| Task | Gemma | GPT-4/Claude | Other Tools |
|------|-------|-------------|-------------|
| Quick explanations | ✅ Good | ⭐ Better | - |
| Calculations | ✅ Good | ⭐ Better | Python/Excel |
| Literature search | ❌ No | ❌ No | Elicit, Semantic Scholar |
| Code generation | ✅ Good | ⭐ Better | GitHub Copilot |
| Data interpretation | ✅ Good | ⭐ Better | Domain tools |
| Learning/practice | ⭐ Best | 💰 Costly | - |

### Connection to Other Notebooks:

- **02_Regression, 03_Classification**: Use LLMs to explain model predictions
- **04_PyTorch**: Generate code snippets and explanations
- **05_LLMs**: Main concepts and other AI tools
- **05b_Model_Interpretability**: Combine SHAP + LLM explanations

---

**Great job! You now know how to use open-source LLMs for research tasks!**

**Next steps:**
1. Try Gemma with your own research questions
2. Build a prompt library for common tasks
3. Explore other open-source models (Llama, Mistral)
4. Integrate into your research workflow