# 🤗 Introduction to Hugging Face Transformers API

This notebook introduces the core components of the Hugging Face Transformers library: **tokenizers**, **models**, and **pipelines**. It provides hands-on examples to help you understand how to work with pre-trained models before fine-tuning them.

## 1. Import Required Libraries

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch

## 2. Load a Pre-trained Tokenizer

The tokenizer breaks text into tokens and converts them into input IDs that the model can understand.

- The tokenizer and model must be compatible, which in practice means loading both from the same checkpoint.
    * The tokenizer acts like a lookup dictionary: each token maps to a fixed numerical ID in the model's vocabulary.
    * Each model is trained with a specific tokenizer, so pairing it with a different one feeds the model IDs that mean something else. The result can be out-of-range index errors or, more often, silently poor output.
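
To make the lookup-dictionary idea concrete, here is a minimal sketch in plain Python. The two vocabularies, the whitespace splitting, and the `encode`/`decode` helpers are all hypothetical and exist only for illustration; real tokenizers such as GPT-2's use subword units rather than whole words. The point is only that IDs produced with one vocabulary mean something different under another.

```python
# Two hypothetical vocabularies: same tokens, different ID assignments.
vocab_a = {"to": 0, "be": 1, "or": 2, "not": 3, "<unk>": 4}
vocab_b = {"not": 0, "or": 1, "be": 2, "to": 3, "<unk>": 4}


def encode(text, vocab):
    """Map whitespace-separated words to IDs, falling back to <unk>."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]


def decode(ids, vocab):
    """Map IDs back to words using the inverse of the vocabulary."""
    id_to_token = {idx: token for token, idx in vocab.items()}
    return " ".join(id_to_token[idx] for idx in ids)


toy_text = "to be or not to be"
toy_ids = encode(toy_text, vocab_a)

print(toy_ids)                    # [0, 1, 2, 3, 0, 1]
print(decode(toy_ids, vocab_a))   # to be or not to be
print(decode(toy_ids, vocab_b))   # not or be to not or
```

Decoding with the matching vocabulary round-trips the text, while the mismatched one silently yields a different sentence. A model fed mismatched IDs faces the same problem on the input side.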

In [None]:
model_id = "openai-community/gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Example text
text = "To be, or not to be, that is the question."

# Tokenize
tokens = tokenizer(text)
print("Token IDs:", tokens["input_ids"])
print("Decoded back:", tokenizer.decode(tokens["input_ids"]))

## 3. Load a Pre-trained Model

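We now load the GPT-2 language model and run a single forward pass over the tokenized input. The result is a tensor of logits (one score per vocabulary entry at each position), not generated text.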

In [None]:
model = AutoModelForCausalLM.from_pretrained(model_id)

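# Wrap the token IDs in a batch dimension: shape (1, sequence_length)
input_ids = torch.tensor([tokens["input_ids"]])

# Run a forward pass without tracking gradients (inference only)
with torch.no_grad():
    outputs = model(input_ids)

# Logits have shape (batch_size, sequence_length, vocab_size)
print("Model output shape:", outputs.logits.shape)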
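
The logits from a forward pass can be turned into a next-token prediction by looking only at the last position. The sketch below assumes a tiny, randomly initialized GPT-2 built from a `GPT2Config`, so nothing is downloaded and the predicted token is meaningless. The config sizes, the input IDs, and the `tiny_model`/`toy_*` names are arbitrary illustrative choices, not part of this notebook.

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Assumption: a tiny, randomly initialized GPT-2 (illustrative sizes, no download).
config = GPT2Config(vocab_size=100, n_positions=32, n_embd=16, n_layer=1, n_head=2)
tiny_model = GPT2LMHeadModel(config)
tiny_model.eval()  # disable dropout for inference

# Arbitrary token IDs with a batch dimension: shape (1, 4)
toy_input_ids = torch.tensor([[5, 17, 42, 8]])

with torch.no_grad():
    toy_outputs = tiny_model(toy_input_ids)

# One score per vocabulary entry at every input position
print(toy_outputs.logits.shape)  # torch.Size([1, 4, 100])

# Scores for the token that would follow the last input position
last_logits = toy_outputs.logits[0, -1]
probs = torch.softmax(last_logits, dim=-1)  # normalize scores to probabilities
next_token_id = int(torch.argmax(probs))    # greedy choice: most likely token
print(next_token_id, float(probs[next_token_id]))
```

With the pre-trained `model` and `outputs` from the cell above, the same indexing (`outputs.logits[0, -1]`) would give the scores for the token following the prompt, and `tokenizer.decode` could turn the chosen ID back into text.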

## 4. Use the Pipeline for Text Generation

Hugging Face's `pipeline` abstraction simplifies using models for common tasks.

In [None]:
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Generate text
prompt = "To be, or not to be, that is the question."
# max_length caps the total length in tokens, prompt included
generated = generator(prompt, max_length=50, num_return_sequences=1)

print("Generated text:\n")
print(generated[0]["generated_text"])

## Additional Concepts to Know

Here are some key ideas and best practices when working with Hugging Face Transformers:

---

### 1. Always Check the Model Card on Hugging Face Hub
- Models on [https://huggingface.co/models](https://huggingface.co/models) are documented by a **Model Card** (the repository's `README.md`).
- A well-maintained card tells you:
  - What the model was trained on
  - Supported tasks
  - Limitations and licenses
  - Example usage

---

### 2. `AutoModel` vs `AutoModelFor...`
- `AutoModel` gives you the **base model** that outputs raw hidden states (useful for feature extraction, embeddings).
- `AutoModelForCausalLM`, `AutoModelForSequenceClassification`, etc., include **task-specific heads**.
  - ✅ Example: `AutoModelForCausalLM` adds a **language modeling** head on top of GPT-2.
  
```python
from transformers import AutoModel
model = AutoModel.from_pretrained("bert-base-uncased")  # base model
```
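
The difference between the base model and a model with a head shows up directly in the output shapes. The following sketch assumes a tiny GPT-2 config with random weights (nothing is downloaded), so only the shapes are meaningful. The config sizes and variable names are illustrative assumptions.

```python
import torch
from transformers import AutoModel, AutoModelForCausalLM, GPT2Config

# Assumption: tiny illustrative config; weights are random, nothing is downloaded.
config = GPT2Config(vocab_size=100, n_positions=32, n_embd=16, n_layer=1, n_head=2)

base_model = AutoModel.from_config(config)           # resolves to GPT2Model
lm_model = AutoModelForCausalLM.from_config(config)  # resolves to GPT2LMHeadModel
base_model.eval()
lm_model.eval()

toy_input_ids = torch.tensor([[5, 17, 42, 8]])  # shape (1, 4)

with torch.no_grad():
    hidden = base_model(toy_input_ids).last_hidden_state  # raw hidden states
    logits = lm_model(toy_input_ids).logits               # scores over the vocabulary

print(type(base_model).__name__, tuple(hidden.shape))  # GPT2Model (1, 4, 16)
print(type(lm_model).__name__, tuple(logits.shape))    # GPT2LMHeadModel (1, 4, 100)
```

The base model returns one hidden vector of size `n_embd` per token, while the language modeling head projects each of those vectors onto the vocabulary.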

---

### 3. `AutoTokenizer`
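- Automatically loads the right tokenizer class and vocabulary for a given checkpoint.
- Handles:
  - Lowercasing (for uncased models such as `bert-base-uncased`)
  - The subword algorithm the model was trained with: Byte Pair Encoding (BPE), WordPiece, SentencePiece, etc.
  - Padding, truncation, and special tokens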
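
Padding and truncation are easiest to see on a small batch. The sketch below builds a toy word-level tokenizer offline with the `tokenizers` library and wraps it in `PreTrainedTokenizerFast`; the six-entry vocabulary and the `[PAD]`/`[UNK]` token names are hypothetical. A tokenizer loaded with `AutoTokenizer.from_pretrained` accepts the same `padding`, `truncation`, and `max_length` arguments.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast

# Hypothetical toy vocabulary, for illustration only.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "to": 2, "be": 3, "or": 4, "not": 5}

backend = Tokenizer(WordLevel(vocab=toy_vocab, unk_token="[UNK]"))
backend.pre_tokenizer = Whitespace()  # split on whitespace and punctuation

toy_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=backend,
    pad_token="[PAD]",
    unk_token="[UNK]",
)

batch = toy_tokenizer(
    ["to be or not to be", "to be"],
    padding=True,      # pad shorter sequences to the longest in the batch
    truncation=True,   # cut sequences longer than max_length
    max_length=4,
)

print(batch["input_ids"])       # [[2, 3, 4, 5], [2, 3, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

The attention mask marks padded positions with `0` so the model can ignore them. Note that GPT-2's own tokenizer defines no padding token by default; a common workaround is `tokenizer.pad_token = tokenizer.eos_token` before batching.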

---

### 4. `pipeline`: Easy Inference for Common Tasks
- Abstracts away model/tokenizer logic.
- Great for:
  - `text-generation`
  - `sentiment-analysis`
  - `translation`
  - `question-answering`
  - `summarization`
  
```python
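from transformers import pipeline

# With no model argument, pipeline falls back to a default checkpoint for the task;
# pass model="..." to pin a specific one.
summarizer = pipeline("summarization")
print(summarizer("Long text goes here..."))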
```

---

These tools help you work quickly, but you can also drop into low-level APIs for full control.


## ✅ Summary

- **Tokenizer**: Converts raw text into tokens and input IDs.
- **Model**: Processes input IDs to produce predictions (for a causal language model, logits over the vocabulary at each position).
- **Pipeline**: A high-level interface for common tasks like text generation.

These components are the foundation for working with Hugging Face models, and you'll use them again when fine-tuning on your own datasets (e.g., Shakespeare).