# Introduction to HuggingFace Transformers

**DOST-ITDI AI Training Workshop**

Day 1, Session 4: Deep Learning with PyTorch and HuggingFace

---

## What is a Transformer?

### The Problem: Understanding Context

Imagine reading this sentence:

> "The **bank** was flooded after the heavy rain."

Is "bank" a:
- Financial institution (BDO, BPI)?
- River bank?

You know it's a **river bank** because of the words "flooded" and "rain".

**Transformers do something similar**: they look at all the words in a sentence to work out the meaning of each word.

---

### The Key Idea: Attention

Think of it like asking questions:

```
Word: "bank"
Question: "What other words help me understand 'bank'?"

Answer:
  - "flooded" -> Very relevant! (high attention)
  - "rain"    -> Relevant (high attention)
  - "The"     -> Not helpful (low attention)
  - "was"     -> Not helpful (low attention)
```

The model learns to **pay attention** to the right words.
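
As a rough sketch of how "high" and "low" attention can be expressed as numbers, the snippet below turns relevance scores into weights with a softmax. The scores are made up for illustration; in a real model they are computed from learned parameters.

```python
import math

def softmax(scores):
    """Turn raw relevance scores into positive weights that sum to 1."""
    highest = max(scores)
    # Subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical relevance of each word to "bank" (invented values)
words = ["The", "was", "flooded", "rain"]
scores = [0.1, 0.1, 3.0, 2.5]

weights = softmax(scores)
for word, weight in zip(words, weights):
    print(f"{word:8} {weight:.2f}")
```

The words with the largest scores receive most of the weight, and all weights together always add up to 1.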

---

### Simple Analogy: Group Discussion

Imagine a classroom discussion:

1. **Old way (RNN, recurrent neural network)**: Students pass notes one by one down a line
   - Slow, because each student must wait for the previous one
   - Information gets lost: Student 10 barely remembers what Student 1 said

2. **Transformer way**: Everyone can talk to everyone directly
   - Fast, because all words are processed in parallel
   - Each student decides who to listen to (this is attention)

---

### Why "Transformer"?

It **transforms** each word's representation by mixing in information from other relevant words.

```
Input:  [The] [bank] [was] [flooded]
           |     |      |      |
        (each word looks at all others)
           |     |      |      |
Output: [The] [river bank] [was] [flooded]
              (now enriched with context!)
```
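
The "mixing" in the diagram can be sketched in a few lines of plain Python. This is a simplified illustration, not how a real transformer is implemented: it skips the learned query/key/value projections and multiple heads, and the 2-D word vectors are invented.

```python
import math

def softmax(scores):
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(vectors):
    """Simplified self-attention: each output is a weighted average of all inputs."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Score this word against every word (including itself)
        scores = [dot(query, key) / math.sqrt(dim) for key in vectors]
        weights = softmax(scores)
        # Mix all word vectors according to the attention weights
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        outputs.append(mixed)
    return outputs

# Made-up 2-D vectors, for illustration only
tokens = ["The", "bank", "was", "flooded"]
vectors = [[0.1, 0.0], [1.0, 1.0], [0.0, 0.1], [0.2, 2.0]]

contextual = self_attention(vectors)
for token, vec in zip(tokens, contextual):
    print(token, [round(x, 2) for x in vec])
```

Each output vector has the same size as its input, but it now contains a blend of every word in the sentence.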

## Famous Transformer Models

| Model | What it does | Analogy |
|-------|--------------|--------|
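| **BERT** | Encodes text for understanding tasks (classification, NER, question answering) | Reading comprehension expert |
| **GPT** | Generates text by predicting the next word | Creative writer |
| **T5** | Converts input text to output text (translation, summarization) | Translator/Summarizer |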

### For Science

| Model | Domain |
|-------|--------|
| SciBERT | Scientific papers |
| ChemBERTa | Chemistry/Molecules |
| BioBERT | Biomedical text |

---

## Let's Use HuggingFace!

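The HuggingFace `transformers` library makes these models easy to use. Its `pipeline` function downloads a pretrained model and runs it on your text in a few lines of code.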

In [None]:
# Install
!pip install transformers torch scikit-learn -q
print("Ready!")

## 1. Sentiment Analysis

Is this text positive or negative?

In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

texts = [
    "The experiment was a complete success!",
    "The reaction failed and we lost all samples.",
    "Results are still being analyzed."
]

for text in texts:
    result = classifier(text)[0]
    # score is the model's confidence in the predicted label (0 to 1)
    print(f"{result['label']} ({result['score']:.2f}): {text}")

## 2. Named Entity Recognition (NER)

Find names, places, organizations in text.

In [None]:
from transformers import pipeline

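# aggregation_strategy="simple" merges word pieces into whole entities
# (it replaces the deprecated grouped_entities=True)
ner = pipeline("ner", aggregation_strategy="simple")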

text = "Dr. Santos from DOST-ITDI published a study on paracetamol synthesis in Manila."

print(f"Text: {text}\n")
print("Entities found:")

for entity in ner(text):
    print(f"  {entity['word']:20} -> {entity['entity_group']}")
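
Each entity returned by the aggregated NER pipeline is a dict that includes `entity_group`, `score`, and `word`. A small sketch of how such results could be organized by type is shown below. The helper `group_entities`, the cutoff of 0.5, and the example entities (including their labels and scores) are hand-written assumptions, not real model output.

```python
from collections import defaultdict

def group_entities(entities, min_score=0.5):
    """Collect entity words by type, skipping low-confidence predictions."""
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] >= min_score:
            grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)

# Hand-written dicts in the pipeline's output shape; labels and scores are invented
example_entities = [
    {"entity_group": "PER", "score": 0.99, "word": "Santos"},
    {"entity_group": "ORG", "score": 0.95, "word": "DOST-ITDI"},
    {"entity_group": "LOC", "score": 0.99, "word": "Manila"},
    {"entity_group": "MISC", "score": 0.31, "word": "paracetamol"},
]

print(group_entities(example_entities))
```

With real output you would call `group_entities(ner(text))`. General-purpose NER models are usually trained on news text, so chemical names may be missed or mislabeled.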

## 3. Question Answering

Ask questions about a paragraph.

In [None]:
from transformers import pipeline

qa = pipeline("question-answering")

context = """
Paracetamol, also known as acetaminophen, is a medication used to treat pain and fever.
It was first synthesized in 1877. The molecular weight is 151.16 g/mol.
It is one of the most commonly used medications worldwide.
"""

questions = [
    "What is paracetamol used for?",
    "When was it first synthesized?",
    "What is the molecular weight?"
]

for q in questions:
    answer = qa(question=q, context=context)
    print(f"Q: {q}")
    print(f"A: {answer['answer']}\n")
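
This kind of question answering is extractive: the model copies its answer from the context rather than writing a new one. Besides `answer`, the result dict also carries `score`, `start`, and `end` (character positions in the context). The sketch below uses those positions to show the surrounding evidence. The helper `answer_with_evidence` and the example dict are hypothetical, and the score is invented.

```python
def answer_with_evidence(context, answer, window=30):
    """Return the answer together with the surrounding text it was taken from."""
    start, end = answer["start"], answer["end"]
    # The answer is copied from the context, so the span should match exactly
    if context[start:end] != answer["answer"]:
        raise ValueError("start/end do not match the answer text")
    left = max(0, start - window)
    right = min(len(context), end + window)
    return context[left:right].replace("\n", " ").strip()

context = "Paracetamol was first synthesized in 1877. The molecular weight is 151.16 g/mol."

# Hand-built dict in the pipeline's output shape; the score is invented
text = "1877"
start = context.find(text)
example_answer = {"answer": text, "score": 0.97, "start": start, "end": start + len(text)}

print(answer_with_evidence(context, example_answer))
```

Reading the evidence next to the answer is a quick way to catch cases where the model picked a plausible-looking but wrong span.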

## 4. Zero-Shot Classification

Classify text into labels you choose, without any task-specific training data!

In [None]:
from transformers import pipeline

classifier = pipeline("zero-shot-classification")

texts = [
    "The solution turned blue after adding copper sulfate",
    "Patient showed improvement after 3 days of treatment",
    "The algorithm converged after 100 iterations"
]

labels = ["chemistry", "medicine", "computer science"]

for text in texts:
    result = classifier(text, labels)
    print(f"{result['labels'][0]:20} <- {text}")
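
The zero-shot result is a dict with `sequence`, `labels`, and `scores`, where the labels are sorted from most to least likely. Looking only at `labels[0]` hides how close the runner-up was. The sketch below reports the gap between the top two scores; the helper `top_label_with_margin` and the example scores are invented for illustration.

```python
def top_label_with_margin(result):
    """Return the best label and how far ahead it is of the second-best label."""
    labels, scores = result["labels"], result["scores"]
    margin = scores[0] - scores[1] if len(scores) > 1 else scores[0]
    return labels[0], margin

# Hand-written dict in the pipeline's output shape; the scores are invented
example_result = {
    "sequence": "The solution turned blue after adding copper sulfate",
    "labels": ["chemistry", "medicine", "computer science"],
    "scores": [0.91, 0.06, 0.03],
}

label, margin = top_label_with_margin(example_result)
print(f"{label} (margin over runner-up: {margin:.2f})")
```

A small margin suggests the text fits several labels, or none of them well.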

## 5. Text Generation

Continue a prompt with text written by the model.

In [None]:
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2", max_length=40)

prompt = "In chemistry, a catalyst is"

result = generator(prompt, num_return_sequences=1)
print(result[0]['generated_text'])

## 6. ChemBERTa: Understanding Molecules

ChemBERTa is trained on SMILES strings (a text notation for molecules), so it can turn a molecule into a numeric embedding that reflects its structure!

In [None]:
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np

# Load ChemBERTa
tokenizer = AutoTokenizer.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")
model = AutoModel.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")

molecules = {
    "Ethanol": "CCO",
    "Methanol": "CO",
    "Benzene": "c1ccccc1",
    "Toluene": "Cc1ccccc1"
}

# Get one embedding vector per molecule
embeddings = {}
for name, smiles in molecules.items():
    tokens = tokenizer(smiles, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        output = model(**tokens)
    # Average over the token dimension to get a single fixed-size vector
    embeddings[name] = output.last_hidden_state.mean(dim=1).numpy()[0]
    print(f"{name:10} ({smiles:12}) -> embedding created")
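
Averaging with `.mean(dim=1)` works here because each molecule is processed on its own. If several SMILES strings are tokenized together with padding, the padding positions should be left out of the average. The plain-Python sketch below shows the idea; `mean_pool` is a hypothetical helper, and the vectors are made up.

```python
def mean_pool(token_vectors, attention_mask):
    """Average token vectors, ignoring positions where the mask is 0 (padding)."""
    dim = len(token_vectors[0])
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask marks no real tokens")
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Two real tokens followed by one padding position (made-up values)
token_vectors = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
attention_mask = [1, 1, 0]

print(mean_pool(token_vectors, attention_mask))  # [2.0, 3.0]
```

A plain average over all three rows would be pulled toward zero by the padding vector.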

In [None]:
# Compare molecular similarity
from sklearn.metrics.pairwise import cosine_similarity

names = list(embeddings.keys())
emb_matrix = np.array([embeddings[n] for n in names])
sim = cosine_similarity(emb_matrix)

print("Molecular Similarity:\n")
print(f"{'':12}", end="")
for n in names:
    print(f"{n:12}", end="")
print()

for i, n1 in enumerate(names):
    print(f"{n1:12}", end="")
    for j in range(len(names)):
        print(f"{sim[i,j]:12.2f}", end="")
    print()

print("\nCheck: is Ethanol-Methanol (both alcohols) higher than Ethanol-Benzene?")
print("       is Benzene-Toluene (both aromatics) higher than Methanol-Toluene?")
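
For reference, cosine similarity measures the angle between two vectors: 1 means they point in the same direction, 0 means they are perpendicular. A minimal plain-Python version, written here only to illustrate the formula that `cosine_similarity` applies to each pair of rows, could look like this:

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine([1.0, 0.0], [2.0, 0.0]))  # same direction -> 1.0
print(cosine([1.0, 0.0], [0.0, 3.0]))  # perpendicular -> 0.0
```

Because only direction matters, embeddings of different magnitudes can still be compared fairly.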

## Summary

### What is Attention/Transformer?
- A way for models to understand context by looking at ALL words
- Each word "pays attention" to relevant words
- Processes all words in parallel and keeps long-range context better than older sequential models (RNNs)

### HuggingFace Pipelines

```python
from transformers import pipeline

# Just pick your task!
pipe = pipeline("sentiment-analysis")  
pipe = pipeline("ner")
pipe = pipeline("question-answering")
pipe = pipeline("zero-shot-classification")
pipe = pipeline("text-generation")
```

### Scientific Models
- **SciBERT**: Scientific text
- **ChemBERTa**: SMILES/molecules  
- **BioBERT**: Biomedical

### Resources
- https://huggingface.co/models
- https://huggingface.co/learn/nlp-course

## Exercise

1. Try sentiment analysis on your own research abstracts
2. Use NER to extract entities from a paper
3. Compare similarity of molecules in your research

In [None]:
# Your code here
