# Introduction to Hugging Face 🤗

Welcome to the world of Hugging Face! This notebook will introduce you to the basics of the Hugging Face ecosystem and show you how to get started with pre-trained models.

## What is Hugging Face?

Hugging Face is a platform and library that provides:
- **Pre-trained models** for NLP, computer vision, and audio
- **Easy-to-use APIs** for common ML tasks
- **Community hub** for sharing models and datasets
- **Tools** for training and deploying models

## Learning Objectives

By the end of this notebook, you'll know how to:
1. Install and import Hugging Face libraries
2. Use pre-trained models with pipelines
3. Find and load models from the Hugging Face Hub
4. Perform basic NLP tasks

Let's get started!

## 1. Installation and Setup

First, let's make sure we have the necessary libraries installed:

In [None]:
# Install required packages (uncomment if needed)
# !pip install transformers torch datasets tokenizers

# Import essential libraries
import transformers
import torch
print(f"Transformers version: {transformers.__version__}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

## 2. Your First Hugging Face Pipeline

Pipelines are the easiest way to use pre-trained models. They handle tokenization, model inference, and post-processing automatically.

In [None]:
from transformers import pipeline

# Create a sentiment analysis pipeline
# (no model is specified, so transformers picks a default checkpoint and logs a warning)
sentiment_pipeline = pipeline("sentiment-analysis")

# Test it with some examples
texts = [
    "I love learning about AI and machine learning!",
    "This movie was terrible and boring.",
    "The weather is okay today."
]

results = sentiment_pipeline(texts)

for text, result in zip(texts, results):
    print(f"Text: '{text}'")
    print(f"Sentiment: {result['label']} (confidence: {result['score']:.3f})")
    print("-" * 50)
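
To make the three stages concrete, here is a toy, dependency-free sketch of what a text-classification pipeline does conceptually. Everything in it is made up for illustration: the `toy_*` functions, the word lists, and the scoring rule are hypothetical stand-ins, not how the real model or the `transformers` internals work.

```python
import math

# Hypothetical word lists standing in for what a trained model has learned.
POSITIVE_WORDS = {"love", "great", "good"}
NEGATIVE_WORDS = {"terrible", "boring", "bad"}


def toy_tokenize(text):
    """Stage 1: split raw text into lowercase word tokens."""
    return [word.strip(".,!?'\"").lower() for word in text.split()]


def toy_model(tokens):
    """Stage 2: produce one raw score (logit) per class: [negative, positive]."""
    negative = sum(token in NEGATIVE_WORDS for token in tokens)
    positive = sum(token in POSITIVE_WORDS for token in tokens)
    return [float(negative), float(positive)]


def toy_postprocess(logits):
    """Stage 3: softmax the logits and report the best label with its score."""
    exps = [math.exp(x - max(logits)) for x in logits]
    probs = [e / sum(exps) for e in exps]
    labels = ["NEGATIVE", "POSITIVE"]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": labels[best], "score": probs[best]}


def toy_pipeline(text):
    """Chain the three stages, mirroring the shape of a pipeline call."""
    return toy_postprocess(toy_model(toy_tokenize(text)))


print(toy_pipeline("I love learning about AI and machine learning!"))
print(toy_pipeline("This movie was terrible and boring."))
```

The returned dictionary has the same `label` and `score` keys as the real pipeline output, which is why the printing loop above would work on either.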

## 3. Different Types of Pipelines

Hugging Face supports many different tasks through pipelines:

In [None]:
# Text Generation
print("=== TEXT GENERATION ===")
generator = pipeline("text-generation", model="gpt2")
result = generator("Artificial intelligence is", max_length=50, num_return_sequences=1)
print(result[0]['generated_text'])
print()
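
One detail worth knowing about the call above: `max_length` counts the prompt tokens plus the newly generated ones. The toy loop below sketches that behaviour without any model. The `NEXT_TOKEN` table and the `toy_generate` function are hypothetical stand-ins for a language model, chosen only to show how generation stops.

```python
# Hypothetical lookup table standing in for a language model's next-token choice.
NEXT_TOKEN = {
    "artificial": "intelligence",
    "intelligence": "is",
    "is": "everywhere",
    "everywhere": ".",
}


def toy_generate(prompt_tokens, max_length):
    """Append one token at a time until max_length (prompt included) is reached."""
    if not prompt_tokens:
        raise ValueError("prompt_tokens must not be empty")
    tokens = list(prompt_tokens)
    while len(tokens) < max_length:
        next_token = NEXT_TOKEN.get(tokens[-1])
        if next_token is None:  # nothing left to predict, stop early
            break
        tokens.append(next_token)
    return tokens


prompt = ["artificial", "intelligence", "is"]
print(toy_generate(prompt, max_length=4))   # room for one new token
print(toy_generate(prompt, max_length=10))  # stops early when the table runs out
```

If you would rather limit only the generated part, the real `generate` API also accepts `max_new_tokens`.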

In [None]:
# Question Answering
print("=== QUESTION ANSWERING ===")
qa_pipeline = pipeline("question-answering")

context = """
Hugging Face is a company that develops tools for building applications using machine learning.
The company was founded in 2016 by Clément Delangue, Julien Chaumond, and Thomas Wolf.
Their main product is the Transformers library, which provides thousands of pre-trained models.
"""

questions = [
    "When was Hugging Face founded?",
    "What is the main product of Hugging Face?",
    "Who are the founders?"
]

for question in questions:
    answer = qa_pipeline(question=question, context=context)
    print(f"Q: {question}")
    print(f"A: {answer['answer']} (confidence: {answer['score']:.3f})")
    print()
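
The question-answering pipeline is extractive: it does not write a new answer, it picks a span of the context. Besides `answer` and `score`, each result also carries `start` and `end` character offsets. The sketch below uses a hand-written dictionary (`fake_answer`, with a made-up score) in place of a real pipeline result to show how those offsets typically map back to the context.

```python
context = (
    "Hugging Face is a company that develops tools for building applications "
    "using machine learning. The company was founded in 2016."
)

# Hand-written stand-in for one pipeline result; the score is invented.
answer_text = "2016"
start = context.find(answer_text)
fake_answer = {
    "answer": answer_text,
    "start": start,
    "end": start + len(answer_text),
    "score": 0.99,
}

# The character offsets point back into the original context string.
span = context[fake_answer["start"]:fake_answer["end"]]
print(f"Span recovered from offsets: {span!r}")
```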

In [None]:
# Named Entity Recognition (NER)
print("=== NAMED ENTITY RECOGNITION ===")
ner_pipeline = pipeline("ner", aggregation_strategy="simple")

text = "Apple Inc. was founded by Steve Jobs in Cupertino, California in 1976."
entities = ner_pipeline(text)

print(f"Text: {text}\n")
print("Entities found:")
for entity in entities:
    print(f"- {entity['word']}: {entity['entity_group']} (confidence: {entity['score']:.3f})")
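
The `aggregation_strategy="simple"` argument is what turns per-token predictions into whole entities such as "Steve Jobs". As a rough, simplified sketch of that idea (not the library's actual implementation), the hypothetical `group_entities` helper below merges consecutive tokens of the same type and averages their scores. The tags and scores are hand-written for illustration.

```python
def group_entities(tagged_tokens):
    """Merge consecutive tokens that share an entity type into one span.

    Simplified rules: a "B-" tag starts a new group, an "I-" tag extends the
    current group when the type matches, and "O" tokens are skipped.
    """
    groups = []
    current = None
    for word, tag, score in tagged_tokens:
        if tag == "O":
            current = None
            continue
        prefix, entity_type = tag.split("-", 1)
        if current and prefix == "I" and current["entity_group"] == entity_type:
            current["words"].append(word)
            current["scores"].append(score)
        else:
            current = {"entity_group": entity_type, "words": [word], "scores": [score]}
            groups.append(current)
    return [
        {
            "entity_group": group["entity_group"],
            "word": " ".join(group["words"]),
            "score": sum(group["scores"]) / len(group["scores"]),
        }
        for group in groups
    ]


# Hand-written token tags and scores, for illustration only.
tagged = [
    ("Steve", "B-PER", 0.99),
    ("Jobs", "I-PER", 0.97),
    ("in", "O", 0.99),
    ("Cupertino", "B-LOC", 0.98),
]
for entity in group_entities(tagged):
    print(f"- {entity['word']}: {entity['entity_group']} ({entity['score']:.3f})")
```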

## 4. Exploring the Hugging Face Hub

The Hugging Face Hub hosts thousands of models. Let's see how to explore and use different models:

In [None]:
# List available pipeline tasks
from transformers.pipelines import SUPPORTED_TASKS

print("Available pipeline tasks:")
for i, task in enumerate(SUPPORTED_TASKS.keys(), 1):
    print(f"{i:2d}. {task}")

In [None]:
# Using a specific model from the Hub
print("=== USING A SPECIFIC MODEL ===")

# Use a different sentiment analysis model
specific_sentiment = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest"
)

tweet_texts = [
    "Just finished my first machine learning project! 🚀",
    "Traffic is so bad today 😤",
    "Reading a good book by the fireplace"
]

for text in tweet_texts:
    result = specific_sentiment(text)
    print(f"Text: '{text}'")
    print(f"Sentiment: {result[0]['label']} (score: {result[0]['score']:.3f})")
    print()

## 5. Understanding Model Information

Loading a model and its tokenizer directly lets you inspect details such as vocabulary size, maximum sequence length, and parameter count:

In [None]:
from transformers import AutoTokenizer, AutoModel

# Load a model and tokenizer
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

print(f"Model: {model_name}")
print(f"Vocabulary size: {tokenizer.vocab_size:,}")
print(f"Max sequence length: {tokenizer.model_max_length:,}")
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"Model size: {sum(p.numel() * p.element_size() for p in model.parameters()) / 1024**2:.1f} MB")
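
The size calculation above is simply the number of parameters times the bytes per parameter. Here is a small sketch of that arithmetic for a few common data types. The `estimate_size_mb` helper is hypothetical, and the parameter count is a round illustrative number, not a measured value for any specific model.

```python
# Bytes used to store one parameter in each data type.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def estimate_size_mb(num_params, dtype="float32"):
    """Estimate the in-memory size of the weights in MB (1 MB = 1024**2 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**2


num_params = 66_000_000  # hypothetical round number, not a measured count
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {estimate_size_mb(num_params, dtype):.1f} MB")
```

This counts weights only. Activations, optimizer state, and framework overhead all come on top of it.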

## 6. Working with Tokenizers

Tokenizers split text into tokens and map each token to an integer ID that the model takes as input:

In [None]:
# Example text
text = "Hello, Hugging Face! Let's tokenize this text."

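# Split the text into subword tokens (no special tokens are added here)
tokens = tokenizer.tokenize(text)
# encode() maps tokens to IDs and adds the special [CLS] and [SEP] tokens
token_ids = tokenizer.encode(text)
# decode() maps IDs back to text, including the special tokens
decoded_text = tokenizer.decode(token_ids)

print(f"Original text: {text}")
print(f"Tokens: {tokens}")
print(f"Token IDs: {token_ids}")
print(f"Decoded text: {decoded_text}")
print(f"Number of tokens (excluding special tokens): {len(tokens)}")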
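
If you are wondering where pieces starting with `##` come from: DistilBERT uses a WordPiece tokenizer, which splits unknown words into known subword pieces. Below is a minimal sketch of the greedy longest-match-first idea behind it. The `wordpiece` function and the tiny `toy_vocab` are hypothetical; the real tokenizer also handles normalization, punctuation, and a vocabulary of tens of thousands of entries.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:  # no piece fits, so the whole word is unknown
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Tiny made-up vocabulary, for illustration only.
toy_vocab = {"token", "##ize", "##ization", "hug", "##ging", "face"}
for word in ["tokenize", "hugging", "xyz"]:
    print(f"{word} -> {wordpiece(word, toy_vocab)}")
```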

## 7. Batch Processing

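Pipelines accept a list of texts, so many inputs can be processed in a single call (the optional `batch_size` argument controls how many go through the model at once):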

In [None]:
# Batch processing with sentiment analysis
batch_texts = [
    "I love this tutorial!",
    "This is confusing.",
    "Machine learning is fascinating.",
    "I'm not sure about this.",
    "Great explanation, thank you!"
]

print("=== BATCH SENTIMENT ANALYSIS ===")
batch_results = sentiment_pipeline(batch_texts)

for text, result in zip(batch_texts, batch_results):
    label = result['label']
    score = result['score']
    emoji = "😊" if label == "POSITIVE" else "😞"
    print(f"{emoji} '{text}' → {label} ({score:.3f})")
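
Under the hood, running several texts through a model together means stacking them into one rectangular tensor, so shorter sequences have to be padded. The sketch below shows that idea in plain Python. The `pad_batch` helper and the token IDs are made up for illustration; with a real tokenizer you would pass `padding=True` instead.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token ID lists to equal length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token IDs for two texts of different lengths.
batch = pad_batch([[101, 7592, 102], [101, 7592, 2088, 999, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```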

## 8. Error Handling and Best Practices

A common issue is input text that is longer than the model's maximum sequence length. Truncating at tokenization time keeps the input within bounds:

In [None]:
# Handling long texts
long_text = "This is a very long text. " * 100  # Repeat to make it long

print(f"Text length: {len(long_text)} characters")
print(f"Token count: {len(tokenizer.tokenize(long_text))} tokens")
print(f"Max model length: {tokenizer.model_max_length} tokens")

# Truncate if necessary
truncated_encoding = tokenizer(
    long_text,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)

print(f"Truncated token count: {truncated_encoding['input_ids'].shape[1]} tokens")
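
Truncation throws away everything past the limit. When the tail of the text matters, an alternative is to split the token IDs into overlapping windows and process each window separately. Here is a minimal sketch of that idea. The `chunk_token_ids` helper is hypothetical and ignores special tokens; the real tokenizers expose similar behaviour through the `return_overflowing_tokens` and `stride` arguments.

```python
def chunk_token_ids(token_ids, max_length=512, stride=64):
    """Split token IDs into windows of max_length that overlap by stride tokens."""
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    step = max_length - stride
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):  # this window reached the end
            break
    return chunks


ids = list(range(1200))  # stand-in for the token IDs of a long text
chunks = chunk_token_ids(ids, max_length=512, stride=64)
print(f"Number of chunks: {len(chunks)}")
print(f"Chunk lengths: {[len(chunk) for chunk in chunks]}")
```

The overlap means a sentence cut at the end of one window still appears whole near the start of the next.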

## 9. Saving and Loading Models

Learn how to save models locally for offline use:

In [None]:
import os

# Save model and tokenizer locally
local_model_path = "./local_distilbert"

# Create directory if it doesn't exist
os.makedirs(local_model_path, exist_ok=True)

# Save the model and tokenizer
model.save_pretrained(local_model_path)
tokenizer.save_pretrained(local_model_path)

print(f"Model saved to: {local_model_path}")
print(f"Files in directory: {os.listdir(local_model_path)}")

# Load from local path
local_tokenizer = AutoTokenizer.from_pretrained(local_model_path)
local_model = AutoModel.from_pretrained(local_model_path)

print("\nSuccessfully loaded model from local directory!")

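## 10. Performance Tips and Device Management

Pipelines run on the CPU by default. Pass the `device` argument to place the model on a GPU when one is available:
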
In [None]:
# Check available devices
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"CUDA Version: {torch.version.cuda}")

# Create pipeline with specific device
sentiment_gpu = pipeline("sentiment-analysis", device=0 if torch.cuda.is_available() else -1)
print(f"\nPipeline created on device: {'GPU' if torch.cuda.is_available() else 'CPU'}")

## 🎉 Congratulations!

You've completed your first Hugging Face tutorial! Here's what you've learned:

✅ **Installation and setup** of Hugging Face libraries  
✅ **Using pipelines** for quick model inference  
✅ **Different NLP tasks** (sentiment analysis, QA, NER, text generation)  
✅ **Working with tokenizers** and understanding tokenization  
✅ **Batch processing** for efficiency  
✅ **Model management** (saving and loading)  
✅ **Best practices** for handling long texts and devices  

## 🚀 Next Steps

- **Explore more models** on the [Hugging Face Hub](https://huggingface.co/models)
- **Try different tasks** with various pipeline types in the blank cell below
- **Move to the next notebook**: `02_tokenizers.ipynb`

Happy learning! 🤗

In [None]:
# Dream here!

