# 🚀 Lesson 1.2: Your First Model

**Duration:** 1 hour  
**Difficulty:** Beginner  
**Prerequisites:** Lesson 1.1 completed

---

## 🎯 Learning Objectives

By the end of this lesson, you will:
1. Load your first pre-trained model
2. Understand what tokenizers do
3. Run inference (predictions) on text
4. Understand model inputs and outputs

**This is a hands-on lesson - you'll write and run real code!**

---

## 🛠️ Setup

First, let's make sure everything is installed. Run the cell below:

In [None]:
# Install required libraries (if not already installed)
# Uncomment the line below if running in Google Colab or fresh environment
# !pip install transformers torch

# Import libraries
import torch
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
import warnings
warnings.filterwarnings('ignore')

print("✅ Libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

## 🎈 The Easiest Way: Using Pipelines

Hugging Face's `transformers` library provides `pipeline`, a one-line way to use pre-trained models.

Let's try sentiment analysis (positive/negative classification):

In [None]:
# Create a sentiment analysis pipeline
# This automatically downloads a pre-trained model
print("Loading model... (this might take a minute the first time)")
classifier = pipeline("sentiment-analysis")

print("\n✅ Model loaded!\n")

# Now let's use it!
result = classifier("I love this tutorial! It's so helpful!")
print(f"Text: 'I love this tutorial! It's so helpful!'")
print(f"Result: {result}")

# Try a negative example
result2 = classifier("This is terrible and frustrating.")
print(f"\nText: 'This is terrible and frustrating.'")
print(f"Result: {result2}")

### 🎉 Congratulations!

You just:
1. Loaded a pre-trained model (DistilBERT)
2. Made predictions on real text
3. Got confidence scores

**That was easy, right?** But what's happening under the hood? Let's find out!
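A first thing to look at is the shape of the result itself. The pipeline returns a list with one dictionary per input text, each holding a `label` and a `score`. The sketch below uses made-up values in that shape (the scores are hypothetical, not real model output, and `describe` is an illustrative helper) to show how to read and format them.

```python
# Hypothetical results in the same shape the sentiment pipeline returns:
# a list with one {"label": ..., "score": ...} dict per input text.
# The scores below are invented for illustration, not real model output.
example_results = [
    {"label": "POSITIVE", "score": 0.9998},
    {"label": "NEGATIVE", "score": 0.9991},
]

def describe(result):
    """Format one result dict as a readable string."""
    return f"{result['label']} ({result['score']:.2%} confidence)"

summaries = [describe(r) for r in example_results]
for line in summaries:
    print(line)
```

Because the result is always a list, a single input text still needs `result[0]` to reach its dictionary.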

In [None]:
# 🧪 YOUR TURN: Try your own texts!
# Replace the text below with anything you want

my_texts = [
    "The weather is beautiful today!",
    "I'm disappointed with the service.",
    "Just another day at work.",
    # Add your own texts here!
]

results = classifier(my_texts)

for text, result in zip(my_texts, results):
    print(f"\nText: {text}")
    print(f"→ {result['label']}: {result['score']:.2%} confidence")

## 🔤 Understanding Tokenizers

Models can't read text directly - they need numbers. **Tokenizers** convert text to numbers.

### What's a Token?

A token is a piece of text (could be a word, part of a word, or punctuation).

Example (with an "uncased" tokenizer, which lowercases the text first):
```
"I love fine-tuning"
    ↓ (tokenization)
["i", "love", "fine", "-", "tuning"]
    ↓ (to numbers)
[1045, 2293, 2986, 1011, 17372]
```
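The two steps in the diagram (split the text, then look up each piece) can be sketched with a toy tokenizer in plain Python. Everything here is an assumption made for illustration: `toy_vocab` is a tiny hypothetical vocabulary, and `toy_tokenize` and `toy_encode` are not part of any library. Real tokenizers such as DistilBERT's use much larger vocabularies and split rare words into subword pieces instead of mapping them straight to an unknown token.

```python
import re

# Tiny hypothetical vocabulary, just for illustration.
# Real tokenizers ship vocabularies with tens of thousands of entries.
toy_vocab = {"[UNK]": 0, "i": 1, "love": 2, "fine": 3, "-": 4, "tuning": 5}

def toy_tokenize(text):
    """Lowercase the text, then split it into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def toy_encode(tokens):
    """Look up each token's ID, falling back to [UNK] for unknown tokens."""
    return [toy_vocab.get(token, toy_vocab["[UNK]"]) for token in tokens]

tokens = toy_tokenize("I love fine-tuning")
ids = toy_encode(tokens)
print(tokens)  # ['i', 'love', 'fine', '-', 'tuning']
print(ids)     # [1, 2, 3, 4, 5]

# "pizza" is not in the toy vocabulary, so it maps to [UNK] (ID 0).
print(toy_encode(toy_tokenize("I love pizza")))  # [1, 2, 0]
```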

Let's see this in action:

In [None]:
# Load a tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# Example text
text = "I love fine-tuning transformers!"

# Tokenize
tokens = tokenizer.tokenize(text)
print(f"Original text: {text}")
print(f"Tokens: {tokens}")

# Convert to IDs (numbers)
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(f"Token IDs: {token_ids}")

# The easy way (does both at once)
encoded = tokenizer(text)
print(f"\nEncoded (automatic): {encoded}")

### 🎯 Special Tokens

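Notice the two extra IDs at the start and end of `input_ids` in the encoded output (101 and 102)? They stand for `[CLS]` and `[SEP]`, special tokens that the tokenizer adds automatically: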

- **[CLS]** - "Classification" token - goes at the start
- **[SEP]** - "Separator" token - marks the end
- **[PAD]** - "Padding" token - used to make sequences the same length
- **[UNK]** - "Unknown" token - for words not in vocabulary
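How these tokens fit together is easiest to see on a small batch. The sketch below is a simplified, plain-Python imitation of what a tokenizer does when padding is enabled, not the library's implementation. The function name `add_special_and_pad` is hypothetical, the special-token IDs are assumed from the BERT uncased vocabulary, and the input sequences are made-up token IDs.

```python
# IDs assumed from the BERT uncased vocabulary; attributes such as
# tokenizer.cls_token_id give the real values for a loaded tokenizer.
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

def add_special_and_pad(batch_of_ids):
    """Wrap each sequence in [CLS] ... [SEP], then pad to the longest length."""
    wrapped = [[CLS_ID] + ids + [SEP_ID] for ids in batch_of_ids]
    max_len = max(len(ids) for ids in wrapped)
    input_ids = [ids + [PAD_ID] * (max_len - len(ids)) for ids in wrapped]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in wrapped]
    return input_ids, attention_mask

# Two hypothetical sequences of token IDs with different lengths.
input_ids, attention_mask = add_special_and_pad([[7592, 2088], [7592]])
print(input_ids)       # [[101, 7592, 2088, 102], [101, 7592, 102, 0]]
print(attention_mask)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

The attention mask is what lets the model tell real tokens from padding once every sequence has the same length.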

Let's explore:

In [None]:
# See all special tokens
print("Special tokens:")
print(f"CLS token: {tokenizer.cls_token} (ID: {tokenizer.cls_token_id})")
print(f"SEP token: {tokenizer.sep_token} (ID: {tokenizer.sep_token_id})")
print(f"PAD token: {tokenizer.pad_token} (ID: {tokenizer.pad_token_id})")
print(f"UNK token: {tokenizer.unk_token} (ID: {tokenizer.unk_token_id})")

# Let's see a complete tokenization with special tokens
text = "Hello world!"
# encode() adds [CLS] and [SEP]; convert the IDs back to tokens to see them
ids_with_special = tokenizer.encode(text)
tokens_with_special = tokenizer.convert_ids_to_tokens(ids_with_special)
print(f"\nText: {text}")
print(f"Tokens with special tokens: {tokens_with_special}")

In [None]:
# 🧪 YOUR TURN: Tokenize your own text

your_text = "Write anything here and see how it's tokenized!"

tokens = tokenizer.tokenize(your_text)
encoded = tokenizer(your_text)

print(f"Your text: {your_text}")
print(f"\nTokens: {tokens}")
print(f"Number of tokens: {len(tokens)}")
print(f"\nToken IDs: {encoded['input_ids']}")

## 🧠 Loading Models (The Manual Way)

Pipelines are easy, but let's understand what's happening step-by-step:

In [None]:
# Load model and tokenizer separately
model_name = "distilbert-base-uncased-finetuned-sst-2-english"

print(f"Loading {model_name}...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

print("✅ Model and tokenizer loaded!")
print(f"\nModel has {model.num_labels} labels (classes)")
print(f"Label mapping: {model.config.id2label}")

### Running Inference Step-by-Step

Now let's make predictions manually to understand the full process:

In [None]:
# Step 1: Prepare text
text = "This is amazing!"

# Step 2: Tokenize
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
print("Step 2 - Tokenized inputs:")
print(f"Input IDs shape: {inputs['input_ids'].shape}")
print(f"Input IDs: {inputs['input_ids']}")

# Step 3: Run through model
with torch.no_grad():  # Don't compute gradients (we're not training)
    outputs = model(**inputs)

print("\nStep 3 - Model outputs:")
print(f"Logits shape: {outputs.logits.shape}")
print(f"Logits (raw scores): {outputs.logits}")

# Step 4: Convert to probabilities
import torch.nn.functional as F
probabilities = F.softmax(outputs.logits, dim=-1)
print("\nStep 4 - Probabilities:")
print(f"Probabilities: {probabilities}")

# Step 5: Get prediction
predicted_class = torch.argmax(probabilities, dim=-1).item()
confidence = probabilities[0][predicted_class].item()

print("\nStep 5 - Final prediction:")
print(f"Text: '{text}'")
print(f"Predicted class: {model.config.id2label[predicted_class]}")
print(f"Confidence: {confidence:.2%}")

### 📊 What Just Happened?

```
Text → Tokenizer → Numbers → Model → Logits → Softmax → Probabilities → Prediction

"Amazing!" → [101, 6429, 999, 102] → Model → [-2.1, 3.4] → Softmax → [0.004, 0.996] → POSITIVE (99.6%)
```

**Key Terms:**
- **Logits**: Raw scores from the model (can be any number)
- **Softmax**: Converts logits to probabilities (0-1, sum to 1)
- **Argmax**: Picks the class with the highest probability
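These three terms can be worked through in plain Python. This is a minimal sketch with hypothetical logits and hand-written `softmax` and `argmax` helpers; in the lesson's code, `F.softmax` and `torch.argmax` do the same job on tensors. The label mapping mirrors the two-class NEGATIVE/POSITIVE setup used by the sentiment model.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    # Subtracting the max first keeps exp() from overflowing on large logits.
    shifted = [x - max(logits) for x in logits]
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Return the index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical logits for the classes [NEGATIVE, POSITIVE].
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
logits = [1.0, 3.0]

probs = softmax(logits)
predicted = argmax(probs)
print([round(p, 3) for p in probs])  # [0.119, 0.881]
print(id2label[predicted])           # POSITIVE
```

Softmax never changes the ranking of the classes, so taking the argmax of the logits gives the same prediction. The probabilities matter when a confidence value is needed.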

In [None]:
# 🚀 Processing Multiple Texts at Once (Batching)

texts = [
    "I love this!",
    "This is terrible.",
    "It's okay, I guess.",
    "Absolutely fantastic!",
    "Worst experience ever."
]

# Tokenize all at once
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)

# Run inference
with torch.no_grad():
    outputs = model(**inputs)
    probabilities = F.softmax(outputs.logits, dim=-1)

# Display results
print("Batch predictions:\n")
for text, probs in zip(texts, probabilities):
    pred_class = torch.argmax(probs).item()
    confidence = probs[pred_class].item()
    label = model.config.id2label[pred_class]
    
    print(f"Text: '{text}'")
    print(f"→ {label} ({confidence:.2%} confidence)\n")

## 🌟 Exploring Other Pre-trained Models

Let's try different models for different tasks!

In [None]:
# 1. Question Answering
qa_pipeline = pipeline("question-answering")

context = "Fine-tuning is the process of adapting a pre-trained model to a specific task. It requires less data and time than training from scratch."
question = "What is fine-tuning?"

result = qa_pipeline(question=question, context=context)
print("Question Answering:")
print(f"Q: {question}")
print(f"A: {result['answer']} (confidence: {result['score']:.2%})\n")

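# 2. Named Entity Recognition
# aggregation_strategy="simple" merges word pieces into whole entities
# (it replaces the deprecated grouped_entities=True)
ner_pipeline = pipeline("ner", aggregation_strategy="simple")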

text = "Apple Inc. was founded by Steve Jobs in Cupertino, California."
entities = ner_pipeline(text)

print("\nNamed Entity Recognition:")
print(f"Text: {text}")
print("Entities found:")
for entity in entities:
    print(f"  - {entity['word']}: {entity['entity_group']} ({entity['score']:.2%})")

In [None]:
# 3. Text Generation (bonus - this is fun!)
generator = pipeline("text-generation", model="distilgpt2")

prompt = "Fine-tuning machine learning models is"
result = generator(prompt, max_length=50, num_return_sequences=1)

print("Text Generation:")
print(f"Prompt: '{prompt}'")
print(f"Generated: {result[0]['generated_text']}")

## 🎯 Practice Exercise

**Challenge:** Create a function that:
1. Takes a list of texts
2. Classifies each as positive/negative
3. Returns only the positive ones with their scores

Try it yourself first, then check the solution below!

In [None]:
# YOUR CODE HERE
def filter_positive_texts(texts):
    """
    Filter and return only positive texts with their confidence scores.
    
    Args:
        texts: List of strings
    
    Returns:
        List of tuples (text, score)
    """
    # TODO: Implement this function
    pass

# Test your function
test_texts = [
    "I love learning about AI!",
    "This is frustrating and difficult.",
    "Great tutorial, very helpful!",
    "I don't understand anything.",
    "Amazing progress today!"
]

# positive_texts = filter_positive_texts(test_texts)
# print(positive_texts)

In [None]:
# SOLUTION (Run this after trying yourself!)
def filter_positive_texts(texts):
    classifier = pipeline("sentiment-analysis")
    results = classifier(texts)
    
    positive_texts = []
    for text, result in zip(texts, results):
        if result['label'] == 'POSITIVE':
            positive_texts.append((text, result['score']))
    
    return positive_texts

# Test
positive_texts = filter_positive_texts(test_texts)

print("Positive texts found:\n")
for text, score in positive_texts:
    print(f"✅ '{text}' (confidence: {score:.2%})")

## 🎓 Summary

Congratulations! You now know how to:
- ✅ Load pre-trained models using pipelines
- ✅ Understand tokenization (text → numbers)
- ✅ Run inference manually (step-by-step)
- ✅ Process multiple texts in batches
- ✅ Use different model types (sentiment, QA, NER, generation)

---

## 🔑 Key Takeaways

1. **Pipelines = Easy Mode** (one line of code)
2. **Tokenizers convert text to numbers** that models understand
3. **Models output logits** → softmax → probabilities → predictions
4. **Batching is more efficient** than processing one at a time
5. **Different models for different tasks** (classification, QA, NER, etc.)
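On batching: when there are more texts than can comfortably go through the model in one call, a common pattern is to split the list into fixed-size batches and process them one after another. The helper below is a small sketch of that idea. `make_batches` and the batch size of 2 are illustrative choices, not part of the lesson's code.

```python
def make_batches(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

texts = [
    "I love this!",
    "This is terrible.",
    "It's okay, I guess.",
    "Absolutely fantastic!",
    "Worst experience ever.",
]

batches = list(make_batches(texts, batch_size=2))
print([len(batch) for batch in batches])  # [2, 2, 1]
```

Each batch can then be passed to the tokenizer and model exactly as the full list was in the batching cell.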

---

## 📝 Self-Check

Can you answer these?
1. What does a tokenizer do?
2. What are "special tokens" and why do we need them?
3. What's the difference between logits and probabilities?
4. What's the advantage of using batches?

---

## ➡️ Next Lesson

**Lesson 1.3: Understanding Your Data**
- Dataset formats and structure
- Data preparation and cleaning
- Train/validation/test splits
- Data quality checks

**Ready to learn about data? Let's go! 🚀**

---

**Progress:** 🟢🟢🔘🔘🔘 (Lesson 2 of 15)