# Introduction to Hugging Face 🤗

Welcome to the world of Hugging Face! This notebook will introduce you to the basics of the Hugging Face ecosystem and show you how to get started with pre-trained models.

## What is Hugging Face?

Hugging Face is a platform and a set of open-source libraries that provide:
- **Pre-trained models** for NLP, computer vision, and audio
- **Easy-to-use APIs** for common ML tasks
- **Community hub** for sharing models and datasets
- **Tools** for training and deploying models

## Learning Objectives

By the end of this notebook, you'll know how to:
1. Install and import Hugging Face libraries
2. Use pre-trained models with pipelines
3. Understand the Hugging Face Hub
4. Perform basic NLP tasks

Let's get started!

## 1. Installation and Setup

First, let's make sure we have the necessary libraries installed:

In [1]:
# Install required packages (uncomment if needed)
# !pip install transformers torch datasets tokenizers

# Import essential libraries
import transformers
import torch
print(f"Transformers version: {transformers.__version__}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

Transformers version: 4.55.4
PyTorch version: 2.8.0+cpu
CUDA available: False


## 2. Your First Hugging Face Pipeline

Pipelines are the easiest way to use pre-trained models. They handle tokenization, model inference, and post-processing automatically.

In [2]:
from transformers import pipeline

# Create a sentiment analysis pipeline
sentiment_pipeline = pipeline("sentiment-analysis")

# Test it with some examples
texts = [
    "I love learning about AI and machine learning!",
    "This movie was terrible and boring.",
    "The weather is okay today."
]

results = sentiment_pipeline(texts)

for text, result in zip(texts, results):
    print(f"Text: '{text}'")
    print(f"Sentiment: {result['label']} (confidence: {result['score']:.3f})")
    print("-" * 50)

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Device set to use cpu


Text: 'I love learning about AI and machine learning!'
Sentiment: POSITIVE (confidence: 1.000)
--------------------------------------------------
Text: 'This movie was terrible and boring.'
Sentiment: NEGATIVE (confidence: 1.000)
--------------------------------------------------
Text: 'The weather is okay today.'
Sentiment: POSITIVE (confidence: 1.000)
--------------------------------------------------
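
The confidence values above come from the pipeline's post-processing step: a classification model outputs one raw score (logit) per label, and those scores are turned into probabilities. Here is a minimal sketch of that step in plain Python, assuming a softmax over two labels. The logit values and the `postprocess` helper are made up for illustration; real logits come from the model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def postprocess(logits, labels):
    """Return the most probable label in the same shape the pipeline prints."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": labels[best], "score": probs[best]}

# Hypothetical logits for a single input sentence.
example = postprocess([-3.2, 3.5], ["NEGATIVE", "POSITIVE"])
print(example["label"], round(example["score"], 3))
```

A score that rounds to 1.000 only means the two logits were far apart. It does not by itself show the label is right, as the "The weather is okay today." example suggests.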


## 3. Different Types of Pipelines

Pipelines cover many tasks beyond sentiment analysis. The cells below try text generation, question answering, and named entity recognition:

In [3]:
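# Text Generation
print("=== TEXT GENERATION ===")
generator = pipeline("text-generation", model="gpt2")
# Note: max_length counts the prompt plus the generated tokens. In this run the
# pipeline's default max_new_tokens (256) took precedence over it (see the
# warning in the output), so pass max_new_tokens=50 to cap the output instead.
result = generator("Artificial intelligence is", max_length=50, num_return_sequences=1)
print(result[0]['generated_text'])
print()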

=== TEXT GENERATION ===


Device set to use cpu


Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Artificial intelligence is a very exciting field.

What's next for AI?

A lot of the work we do for AI is still going on. But we are getting there.

Why are we doing this?

We are trying to understand the basic processes we need to do to detect and control things. We are getting more and more sophisticated.

Why is this important?

We are still much smaller than we were before the start of this.

Why are there so many more advances in AI?

We are still in development. It's just a matter of time before we get to that point.

AI is a very important subject to study. And I think that there is a lot more that could be done.

What are your priorities?

We are very focused on making sure that the future of AI is as good as it could be.

What challenges have you faced in the past?

I have been involved in many other fields where there has been much more opportunity for AI and machine learning.

Do you think machines will be able to solve all of the problems that AI currently faces?

I would s

In [4]:
# Question Answering
print("=== QUESTION ANSWERING ===")
qa_pipeline = pipeline("question-answering")

context = """
Hugging Face is a company that develops tools for building applications using machine learning.
The company was founded in 2016 by Clément Delangue, Julien Chaumond, and Thomas Wolf.
Their main product is the Transformers library, which provides thousands of pre-trained models.
"""

questions = [
    "When was Hugging Face founded?",
    "What is the main product of Hugging Face?",
    "Who are the founders?"
]

for question in questions:
    answer = qa_pipeline(question=question, context=context)
    print(f"Q: {question}")
    print(f"A: {answer['answer']} (confidence: {answer['score']:.3f})")
    print()

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


=== QUESTION ANSWERING ===


Device set to use cpu


Q: When was Hugging Face founded?
A: 2016 (confidence: 0.983)

Q: What is the main product of Hugging Face?
A: Transformers library (confidence: 0.790)

Q: Who are the founders?
A: Clément Delangue, Julien Chaumond, and Thomas Wolf (confidence: 0.955)



In [5]:
# Named Entity Recognition (NER)
print("=== NAMED ENTITY RECOGNITION ===")
ner_pipeline = pipeline("ner", aggregation_strategy="simple")

text = "Apple Inc. was founded by Steve Jobs in Cupertino, California in 1976."
entities = ner_pipeline(text)

print(f"Text: {text}\n")
print("Entities found:")
for entity in entities:
    print(f"- {entity['word']}: {entity['entity_group']} (confidence: {entity['score']:.3f})")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


=== NAMED ENTITY RECOGNITION ===


Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cpu


Text: Apple Inc. was founded by Steve Jobs in Cupertino, California in 1976.

Entities found:
- Apple Inc: ORG (confidence: 1.000)
- Steve Jobs: PER (confidence: 0.989)
- Cupertino: LOC (confidence: 0.973)
- California: LOC (confidence: 0.999)
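
`aggregation_strategy="simple"` is what turns per-token predictions into whole entities such as `Steve Jobs`. The following is a simplified sketch of that grouping idea, not the library's implementation. The `group_entities` helper, the B-/I- tagged input, and the scores are illustrative assumptions.

```python
def group_entities(token_preds):
    """Merge adjacent token predictions that share an entity type.

    token_preds: list of (word, tag, score) tuples with tags such as
    "B-PER", "I-PER", or "O" (outside any entity).
    """
    groups = []
    current = None
    for word, tag, score in token_preds:
        if tag == "O":
            current = None  # a non-entity token ends the current group
            continue
        prefix, entity_type = tag.split("-", 1)
        same_type = current is not None and current["entity_group"] == entity_type
        if same_type and prefix != "B":
            current["words"].append(word)
            current["scores"].append(score)
        else:
            # A "B-" tag or a change of type starts a new group.
            current = {"entity_group": entity_type, "words": [word], "scores": [score]}
            groups.append(current)
    return [
        {
            "entity_group": g["entity_group"],
            "word": " ".join(g["words"]),
            "score": sum(g["scores"]) / len(g["scores"]),  # mean token score
        }
        for g in groups
    ]

# Hypothetical token-level predictions.
preds = [
    ("Steve", "B-PER", 0.99), ("Jobs", "I-PER", 0.98), ("founded", "O", 0.99),
    ("Apple", "B-ORG", 0.99), ("in", "O", 0.99), ("Cupertino", "B-LOC", 0.97),
]
grouped = group_entities(preds)
for g in grouped:
    print(f"- {g['word']}: {g['entity_group']} ({g['score']:.3f})")
```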


## 4. Exploring the Hugging Face Hub

The Hugging Face Hub hosts thousands of models. The first cell below lists the tasks that `pipeline` supports, and the second loads a specific Hub model by its ID:

In [6]:
# List available pipeline tasks
from transformers.pipelines import SUPPORTED_TASKS

print("Available pipeline tasks:")
for i, task in enumerate(SUPPORTED_TASKS.keys(), 1):
    print(f"{i:2d}. {task}")

Available pipeline tasks:
 1. audio-classification
 2. automatic-speech-recognition
 3. text-to-audio
 4. feature-extraction
 5. text-classification
 6. token-classification
 7. question-answering
 8. table-question-answering
 9. visual-question-answering
10. document-question-answering
11. fill-mask
12. summarization
13. translation
14. text2text-generation
15. text-generation
16. zero-shot-classification
17. zero-shot-image-classification
18. zero-shot-audio-classification
19. image-classification
20. image-feature-extraction
21. image-segmentation
22. image-to-text
23. image-text-to-text
24. object-detection
25. zero-shot-object-detection
26. depth-estimation
27. video-classification
28. mask-generation
29. image-to-image


In [7]:
# Using a specific model from the Hub
print("=== USING A SPECIFIC MODEL ===")

# Use a different sentiment analysis model
specific_sentiment = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest"
)

tweet_texts = [
    "Just finished my first machine learning project! 🚀",
    "Traffic is so bad today 😤",
    "Reading a good book by the fireplace"
]

for text in tweet_texts:
    result = specific_sentiment(text)
    print(f"Text: '{text}'")
    print(f"Sentiment: {result[0]['label']} (score: {result[0]['score']:.3f})")
    print()

=== USING A SPECIFIC MODEL ===


Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cpu


Text: 'Just finished my first machine learning project! 🚀'
Sentiment: positive (score: 0.975)

Text: 'Traffic is so bad today 😤'
Sentiment: negative (score: 0.948)

Text: 'Reading a good book by the fireplace'
Sentiment: positive (score: 0.884)
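
The pipelines created without a `model` argument earlier in this notebook logged a warning that relying on a task's default model is not recommended in production. Here is a small sketch of pinning both the model ID and the revision, using the values reported in the sentiment-analysis log. The `build_pinned_pipeline` helper name is hypothetical, and the import sits inside the function so nothing is downloaded until it is called.

```python
# Model ID and revision copied from the sentiment-analysis log output above.
MODEL_ID = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
REVISION = "714eb0f"

def build_pinned_pipeline():
    """Create a sentiment pipeline tied to one model and one revision."""
    from transformers import pipeline  # imported lazily; needs transformers installed

    return pipeline("sentiment-analysis", model=MODEL_ID, revision=REVISION)

print(f"Pinned model: {MODEL_ID} @ {REVISION}")
```

Calling `build_pinned_pipeline()` should behave like `pipeline("sentiment-analysis")` in this run, but the result no longer depends on which default the installed library version ships.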



## 5. Understanding Model Information

`AutoTokenizer` and `AutoModel` load a checkpoint by name, which lets us inspect its vocabulary size, maximum sequence length, and parameter count:

In [8]:
from transformers import AutoTokenizer, AutoModel

# Load a model and tokenizer
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

print(f"Model: {model_name}")
print(f"Vocabulary size: {tokenizer.vocab_size:,}")
print(f"Max sequence length: {tokenizer.model_max_length:,}")
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"Model size in MB: {sum(p.numel() * p.element_size() for p in model.parameters()) / 1024**2:.1f} MB")

Model: distilbert-base-uncased
Vocabulary size: 30,522
Max sequence length: 512
Model parameters: 66,362,880
Model size in MB: 253.2 MB
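
The reported size follows directly from the parameter count. This is a quick check of the arithmetic, assuming every weight is stored as 32-bit float (4 bytes), which matches the 253.2 MB above. The float16 line is a hypothetical comparison, not something measured in this notebook.

```python
def model_size_mb(num_params, bytes_per_param):
    """Memory taken by the weights alone, in MiB (1024**2 bytes)."""
    return num_params * bytes_per_param / 1024**2

NUM_PARAMS = 66_362_880  # parameter count printed above

fp32_mb = model_size_mb(NUM_PARAMS, 4)  # float32: 4 bytes per weight
fp16_mb = model_size_mb(NUM_PARAMS, 2)  # float16: 2 bytes per weight (hypothetical)

print(f"float32: {fp32_mb:.1f} MB")
print(f"float16: {fp16_mb:.1f} MB")
```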


## 6. Working with Tokenizers

Tokenizers split text into tokens and map each token to an integer ID that the model takes as input:

In [9]:
# Example text
text = "Hello, Hugging Face! Let's tokenize this text."

# Tokenize the text
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.encode(text)
decoded_text = tokenizer.decode(token_ids)

print(f"Original text: {text}")
print(f"Tokens: {tokens}")
print(f"Token IDs: {token_ids}")
print(f"Decoded text: {decoded_text}")
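# tokenize() omits the [CLS] and [SEP] special tokens that encode() adds,
# so len(token_ids) is len(tokens) + 2
print(f"Number of tokens: {len(tokens)}")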

Original text: Hello, Hugging Face! Let's tokenize this text.
Tokens: ['hello', ',', 'hugging', 'face', '!', 'let', "'", 's', 'token', '##ize', 'this', 'text', '.']
Token IDs: [101, 7592, 1010, 17662, 2227, 999, 2292, 1005, 1055, 19204, 4697, 2023, 3793, 1012, 102]
Decoded text: [CLS] hello, hugging face! let ' s tokenize this text. [SEP]
Number of tokens: 13
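
The `'token', '##ize'` split above is typical of WordPiece, the subword scheme DistilBERT uses. Here is a toy sketch of its greedy longest-match-first idea. The `wordpiece` helper and the five-entry vocabulary are made up for illustration (the real vocabulary has 30,522 entries), and the real tokenizer also handles lowercasing and punctuation.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into subword pieces by greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1  # try a shorter piece
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"token", "##ize", "##s", "text", "hello"}
print(wordpiece("tokenize", toy_vocab))
print(wordpiece("tokens", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```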


## 7. Batch Processing

Pass a list of texts to a pipeline to process them in a single call:

In [10]:
# Batch processing with sentiment analysis
batch_texts = [
    "I love this tutorial!",
    "This is confusing.",
    "Machine learning is fascinating.",
    "I'm not sure about this.",
    "Great explanation, thank you!"
]

print("=== BATCH SENTIMENT ANALYSIS ===")
batch_results = sentiment_pipeline(batch_texts)

for text, result in zip(batch_texts, batch_results):
    label = result['label']
    score = result['score']
    emoji = "😊" if label == "POSITIVE" else "😞"
    print(f"{emoji} '{text}' → {label} ({score:.3f})")

=== BATCH SENTIMENT ANALYSIS ===
😊 'I love this tutorial!' → POSITIVE (1.000)
😞 'This is confusing.' → NEGATIVE (0.999)
😊 'Machine learning is fascinating.' → POSITIVE (1.000)
😞 'I'm not sure about this.' → NEGATIVE (1.000)
😊 'Great explanation, thank you!' → POSITIVE (1.000)
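
Pipelines also accept a `batch_size` argument when called, for example `sentiment_pipeline(batch_texts, batch_size=2)`. This sketch shows how a list of inputs is split into fixed-size batches. The `chunked` helper is hypothetical and only illustrates the grouping, not the library's internals.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

batch_texts = [
    "I love this tutorial!",
    "This is confusing.",
    "Machine learning is fascinating.",
    "I'm not sure about this.",
    "Great explanation, thank you!",
]

batches = list(chunked(batch_texts, 2))
for i, batch in enumerate(batches, 1):
    print(f"Batch {i}: {len(batch)} texts")
```

Larger batches are not automatically faster, especially on CPU, so it is worth timing a few values on your own data.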


## 8. Error Handling and Best Practices

A common issue is input that exceeds the model's maximum sequence length. The cell below shows how to detect it and truncate the input:

In [11]:
# Handling long texts
long_text = "This is a very long text. " * 100  # Repeat to make it long

print(f"Text length: {len(long_text)} characters")
print(f"Token count: {len(tokenizer.tokenize(long_text))} tokens")
print(f"Max model length: {tokenizer.model_max_length} tokens")

# Truncate if necessary
truncated_encoding = tokenizer(
    long_text,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)

print(f"Truncated token count: {truncated_encoding['input_ids'].shape[1]} tokens")

Token indices sequence length is longer than the specified maximum sequence length for this model (700 > 512). Running this sequence through the model will result in indexing errors


Text length: 2600 characters
Token count: 700 tokens
Max model length: 512 tokens
Truncated token count: 512 tokens
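
To make the effect of `truncation=True` concrete, here is a toy sketch that trims a list of token IDs to a length budget while keeping the `[CLS]` and `[SEP]` markers. The `truncate_ids` helper is hypothetical, and the IDs 101, 102, and 2023 are taken from the tokenizer output earlier in this notebook.

```python
CLS_ID, SEP_ID = 101, 102  # special-token IDs seen in the tokenizer output above

def truncate_ids(token_ids, max_length):
    """Keep at most max_length IDs, preserving the leading [CLS] and trailing [SEP]."""
    if max_length < 2:
        raise ValueError("max_length must leave room for [CLS] and [SEP]")
    if len(token_ids) <= max_length:
        return list(token_ids)
    body = token_ids[1:-1]  # everything between the special tokens
    return [token_ids[0]] + body[: max_length - 2] + [token_ids[-1]]

# 700 content tokens plus the two special tokens.
long_ids = [CLS_ID] + [2023] * 700 + [SEP_ID]
short_ids = truncate_ids(long_ids, 512)
print(len(long_ids), "->", len(short_ids))
```

The special tokens count toward the limit, so only 510 content tokens survive. Anything past that point is dropped, which matters if the important part of a document comes late.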


## 9. Saving and Loading Models

Save a model and tokenizer to a local directory so they can be reloaded later without downloading from the Hub:

In [12]:
import os

# Save model and tokenizer locally
local_model_path = "./local_distilbert"

# Create directory if it doesn't exist
os.makedirs(local_model_path, exist_ok=True)

# Save the model and tokenizer
model.save_pretrained(local_model_path)
tokenizer.save_pretrained(local_model_path)

print(f"Model saved to: {local_model_path}")
print(f"Files in directory: {os.listdir(local_model_path)}")

# Load from local path
local_tokenizer = AutoTokenizer.from_pretrained(local_model_path)
local_model = AutoModel.from_pretrained(local_model_path)

print("\nSuccessfully loaded model from local directory!")

Model saved to: ./local_distilbert
Files in directory: ['config.json', 'model.safetensors', 'special_tokens_map.json', 'tokenizer.json', 'tokenizer_config.json', 'vocab.txt']

Successfully loaded model from local directory!
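
Before relying on a saved directory offline, it can help to confirm the expected files are present. This is a minimal sketch using only the standard library. The `missing_files` helper and the `REQUIRED_FILES` set are assumptions based on the directory listing above, not an official requirement list.

```python
import os
import tempfile

# Assumed minimum for this DistilBERT example, taken from the listing above.
REQUIRED_FILES = {"config.json", "model.safetensors", "tokenizer_config.json", "vocab.txt"}

def missing_files(model_dir, required=REQUIRED_FILES):
    """Return the required file names that are absent from model_dir, sorted."""
    present = set(os.listdir(model_dir)) if os.path.isdir(model_dir) else set()
    return sorted(set(required) - present)

# Demonstrate on a throwaway directory that contains only config.json.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "config.json"), "w").close()
    demo_missing = missing_files(tmp)

print(f"Missing: {demo_missing}")
```

Running `missing_files("./local_distilbert")` after the cell above should return an empty list.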


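## 10. Performance Tips and Device Management

The cell below checks whether a GPU is available and passes the matching `device` index to the pipeline (`0` for the first GPU, `-1` for the CPU):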

In [13]:
# Check available devices
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"CUDA Version: {torch.version.cuda}")

# Create pipeline with specific device
sentiment_gpu = pipeline("sentiment-analysis", device=0 if torch.cuda.is_available() else -1)
print(f"\nPipeline created on device: {'GPU' if torch.cuda.is_available() else 'CPU'}")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Using device: cpu


Device set to use cpu



Pipeline created on device: CPU


## 🎉 Congratulations!

You've completed your first Hugging Face tutorial! Here's what you've learned:

✅ **Installation and setup** of Hugging Face libraries  
✅ **Using pipelines** for quick model inference  
✅ **Different NLP tasks** (sentiment analysis, QA, NER, text generation)  
✅ **Working with tokenizers** and understanding tokenization  
✅ **Batch processing** for efficiency  
✅ **Model management** (saving and loading)  
✅ **Best practices** for handling long texts and devices  

## 🚀 Next Steps

- **Explore more models** on the [Hugging Face Hub](https://huggingface.co/models)
- **Try different tasks** with various pipeline types in the blank cell below
- **Move to the next notebook**: `02_tokenizers.ipynb`

Happy learning! 🤗

In [14]:
# Dream here!

