![Alt Text](https://raw.githubusercontent.com/msfasha/307304-Data-Mining/main/20242/images/header.png)

<div style="display: flex; justify-content: flex-start; align-items: center;">
   <a href="https://colab.research.google.com/github/msfasha/307307-BI-Methods/blob/main/20242-NLP-LLM/Part%203%20-%20Introduction%20to%20DL%20and%20LLMs/2-huggingface_tutorial.ipynb" target="_parent"><img 
   src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
</div>

# Hugging Face Pipelines Tutorial
Hugging Face Pipelines provide a simple, high-level interface for using pre-trained models for various NLP, computer vision, and audio tasks. They abstract away much of the complexity involved in preprocessing, model inference, and postprocessing.

## Installation
First, install the transformers library:

In [1]:
! pip install transformers torch

# For additional features:
! pip install transformers[torch] datasets evaluate


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchaudio 2.7.0+cpu requires torch==2.7.0, but you have torch 2.1.2+cpu which is incompatible.


## Basic Pipeline Usage

### 1. Text Classification (Sentiment Analysis)

In [2]:
from transformers import pipeline

# Create a sentiment analysis pipeline
classifier = pipeline("sentiment-analysis")

# Analyze single text
result = classifier("I love using Hugging Face!")
print(result)
# Example output: [{'label': 'POSITIVE', 'score': 0.9997}]

# Analyze multiple texts
texts = [
    "I hate this product",
    "This is amazing!",
    "It's okay, nothing special"
]
results = classifier(texts)
for text, result in zip(texts, results):
    print(f"Text: {text}")
    print(f"Sentiment: {result['label']}, Score: {result['score']:.4f}\n")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


[{'label': 'POSITIVE', 'score': 0.9997085928916931}]
Text: I hate this product
Sentiment: NEGATIVE, Score: 0.9998

Text: This is amazing!
Sentiment: POSITIVE, Score: 0.9999

Text: It's okay, nothing special
Sentiment: NEGATIVE, Score: 0.8076
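
The default SST-2 model only knows the labels POSITIVE and NEGATIVE, so a lukewarm sentence such as "It's okay, nothing special" is still forced into one of them, just with a lower score. A minimal sketch of one way to handle this, assuming a hypothetical `label_with_threshold` helper and an arbitrary cut-off of 0.9 (neither is part of transformers). The sample dicts copy the scores printed above, so no model is loaded:

```python
# Hypothetical helper: not part of the transformers library.
def label_with_threshold(result, threshold=0.9):
    """Return the predicted label, or 'UNCERTAIN' when the score is below threshold."""
    return result["label"] if result["score"] >= threshold else "UNCERTAIN"

# Results copied from the output above (scores rounded to four decimals).
sample_results = [
    {"label": "NEGATIVE", "score": 0.9998},
    {"label": "POSITIVE", "score": 0.9999},
    {"label": "NEGATIVE", "score": 0.8076},
]

labels = [label_with_threshold(r) for r in sample_results]
print(labels)
```

The threshold is a judgment call; it should be tuned on labeled examples from your own data rather than taken from this sketch.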



### 2. Named Entity Recognition (NER)

In [3]:
# NER pipeline
ner = pipeline("ner", aggregation_strategy="simple")

text = "My name is John and I live in New York. I work at Google."
entities = ner(text)

for entity in entities:
    print(f"Entity: {entity['word']}")
    print(f"Label: {entity['entity_group']}")
    print(f"Score: {entity['score']:.4f}")
    print(f"Start: {entity['start']}, End: {entity['end']}\n")

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']

Entity: John
Label: PER
Score: 0.9985
Start: 11, End: 15

Entity: New York
Label: LOC
Score: 0.9990
Start: 30, End: 38

Entity: Google
Label: ORG
Score: 0.9986
Start: 50, End: 56
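
The `start` and `end` values are character offsets into the input string, so they can be used to slice or rewrite the original text. The sketch below assumes a hypothetical `redact` helper and reuses the offsets printed above as literal data, so it runs without loading a model:

```python
text = "My name is John and I live in New York. I work at Google."

# Entities copied from the output above (scores omitted).
entities = [
    {"word": "John", "entity_group": "PER", "start": 11, "end": 15},
    {"word": "New York", "entity_group": "LOC", "start": 30, "end": 38},
    {"word": "Google", "entity_group": "ORG", "start": 50, "end": 56},
]

def redact(text, entities):
    """Replace each entity span with its label.

    Spans are processed right to left so that earlier offsets stay valid
    after each replacement changes the string length.
    """
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text

# Slicing with the offsets recovers the entity text exactly.
spans = [text[e["start"]:e["end"]] for e in entities]
redacted = redact(text, entities)
print(spans)
print(redacted)
```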



### 3. Question Answering

In [None]:
# Question answering pipeline
qa = pipeline("question-answering")

context = """
Hugging Face is a company that develops tools for building applications using machine learning. 
They are especially known for their work in natural language processing. The company was founded in 2016 
and is headquartered in New York.
"""

questions = [
    "When was Hugging Face founded?",
    "Where is Hugging Face headquartered?",
    "What is Hugging Face known for?"
]

for question in questions:
    result = qa(question=question, context=context)
    print(f"Question: {question}")
    print(f"Answer: {result['answer']}")
    print(f"Score: {result['score']:.4f}\n")

### 4. Text Generation

In [None]:
# Text generation pipeline
generator = pipeline("text-generation", model="gpt2")

# Generate text with custom parameters
prompts = [
    "The future of artificial intelligence is",
    "In a world where robots exist,"
]

for prompt in prompts:
    generated = generator(
        prompt,
        max_length=50,
        num_return_sequences=2,
        temperature=0.7,
        do_sample=True,
        pad_token_id=generator.tokenizer.eos_token_id
    )
    
    print(f"Prompt: {prompt}")
    for i, gen in enumerate(generated):
        print(f"Generation {i+1}: {gen['generated_text']}\n")
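
By default, `generated_text` contains the prompt followed by the continuation. If only the new text is wanted, the text-generation pipeline accepts `return_full_text=False`; the same effect can be sketched by hand, assuming a hypothetical `strip_prompt` helper and a made-up sample output:

```python
# Hypothetical helper: not part of the transformers library.
def strip_prompt(prompt, generated_text):
    """Return only the continuation if generated_text starts with the prompt."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    return generated_text

prompt = "The future of artificial intelligence is"

# Made-up output for illustration; real GPT-2 samples will differ on every run.
sample = {"generated_text": prompt + " still being written."}

continuation = strip_prompt(prompt, sample["generated_text"])
print(continuation)
```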

### 5. Text Summarization

In [None]:
# Summarization pipeline
summarizer = pipeline("summarization")

article = """
Machine learning is a subset of artificial intelligence that enables computers to learn and improve 
from experience without being explicitly programmed. It focuses on the development of computer programs 
that can access data and use it to learn for themselves. The process of learning begins with observations 
or data, such as examples, direct experience, or instruction, in order to look for patterns in data and 
make better decisions in the future based on the examples that we provide. The primary aim is to allow 
the computers to learn automatically without human intervention or assistance and adjust actions accordingly.
"""

summary = summarizer(article, max_length=50, min_length=25, do_sample=False)
print("Original length:", len(article.split()))
print("Summary:", summary[0]['summary_text'])
print("Summary length:", len(summary[0]['summary_text'].split()))

### 6. Translation

In [None]:
# Translation pipeline
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")

texts = [
    "Hello, how are you today?",
    "Machine learning is fascinating.",
    "I would like to order a coffee."
]

for text in texts:
    translated = translator(text)
    print(f"English: {text}")
    print(f"French: {translated[0]['translation_text']}\n")

## Advanced Pipeline Usage

### Specifying Custom Models

In [None]:
# Use a specific model
classifier = pipeline(
    "sentiment-analysis",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
    tokenizer="cardiffnlp/twitter-roberta-base-sentiment-latest"
)

# Test with social media text
tweet = "Just had the best coffee ever! ☕ #perfect"
result = classifier(tweet)
print(f"Tweet: {tweet}")
print(f"Sentiment: {result[0]['label']}, Score: {result[0]['score']:.4f}")

### Batch Processing

In [None]:
# Efficient batch processing
texts = [
    "I love this product!",
    "This is terrible.",
    "It's okay.",
    "Absolutely amazing!",
    "Could be better."
]

# Process in batches for efficiency
classifier = pipeline("sentiment-analysis")
results = classifier(texts, batch_size=2)

for text, result in zip(texts, results):
    print(f"{text} -> {result['label']} ({result['score']:.3f})")
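
Conceptually, `batch_size=2` means the five inputs are grouped into chunks of at most two before each forward pass. The sketch below only illustrates that grouping with a hypothetical `chunk` helper; the pipeline handles batching and padding internally, so this is not how you would call it in practice:

```python
# Hypothetical helper that mimics how inputs are grouped into batches.
def chunk(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

texts = [
    "I love this product!",
    "This is terrible.",
    "It's okay.",
    "Absolutely amazing!",
    "Could be better.",
]

batches = list(chunk(texts, batch_size=2))
for i, batch in enumerate(batches, start=1):
    print(f"Batch {i}: {batch}")
```

With five texts and a batch size of two, the last batch holds a single item.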

### Device Configuration

In [None]:
import torch

# Check if CUDA is available
device = 0 if torch.cuda.is_available() else -1

# Create pipeline with specific device
classifier = pipeline(
    "sentiment-analysis",
    device=device  # 0 for GPU, -1 for CPU
)

# The pipeline will now run on GPU if available
result = classifier("This will run faster on GPU!")
print(result)
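
The `device` argument also accepts strings such as `"cuda:0"`, `"mps"`, or `"cpu"`. A minimal sketch of a selection helper, assuming a hypothetical `pick_device` function that falls back to the CPU when PyTorch is missing or no accelerator is reported:

```python
# Hypothetical helper: not part of transformers or PyTorch.
def pick_device():
    """Return 'cuda:0', 'mps', or 'cpu' depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed; fall back to CPU
    if torch.cuda.is_available():
        return "cuda:0"
    # torch.backends.mps only exists in newer PyTorch builds, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"

device = pick_device()
print(device)

# Usage with a pipeline (commented out so the sketch runs without downloading a model):
# classifier = pipeline("sentiment-analysis", device=device)
```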

## Computer Vision Pipelines

### Image Classification

In [None]:
from PIL import Image
import requests

# Image classification pipeline
classifier = pipeline("image-classification")

# Load image from URL
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)

# Classify image
results = classifier(image)
print("Top predictions:")
for result in results[:3]:
    print(f"Label: {result['label']}, Score: {result['score']:.4f}")

### Object Detection

In [None]:
# Object detection pipeline
detector = pipeline("object-detection")

# Detect objects in image
results = detector(image)

print(f"Found {len(results)} objects:")
for result in results:
    box = result['box']
    print(f"Label: {result['label']}")
    print(f"Score: {result['score']:.4f}")
    print(f"Box: {box}\n")
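
Each detection's `box` is a dict with `xmin`, `ymin`, `xmax`, and `ymax` pixel coordinates. The sketch below assumes two hypothetical helpers, `box_area` and `filter_detections`, and uses made-up detections rather than real model output:

```python
# Hypothetical helpers: not part of the transformers library.
def box_area(box):
    """Area of a bounding box in pixels; degenerate boxes count as zero."""
    width = max(0, box["xmax"] - box["xmin"])
    height = max(0, box["ymax"] - box["ymin"])
    return width * height

def filter_detections(results, min_score=0.9):
    """Keep only detections whose score is at least min_score."""
    return [r for r in results if r["score"] >= min_score]

# Made-up detections for illustration; a real run will return different values.
sample = [
    {"label": "cat", "score": 0.98,
     "box": {"xmin": 10, "ymin": 20, "xmax": 110, "ymax": 220}},
    {"label": "remote", "score": 0.55,
     "box": {"xmin": 0, "ymin": 0, "xmax": 30, "ymax": 10}},
]

kept = filter_detections(sample)
areas = [box_area(r["box"]) for r in kept]
for r, area in zip(kept, areas):
    print(f"{r['label']}: score={r['score']:.2f}, area={area} px")
```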

## Audio Pipelines

### Automatic Speech Recognition

In [None]:
# Note: You'll need to install additional dependencies
# pip install librosa soundfile

# ASR pipeline
transcriber = pipeline("automatic-speech-recognition")

# Transcribe audio file
# audio_file = "path/to/your/audio.wav"
# result = transcriber(audio_file)
# print(f"Transcription: {result['text']}")

## Custom Preprocessing and Postprocessing

In [None]:
# Create a custom pipeline with preprocessing
def preprocess_text(text):
    # Remove extra whitespace and convert to lowercase
    return text.strip().lower()

def postprocess_result(result):
    # Add custom formatting
    label = result['label']
    score = result['score']
    
    if score > 0.9:
        confidence = "Very confident"
    elif score > 0.7:
        confidence = "Confident"
    else:
        confidence = "Less confident"
    
    return {
        'prediction': label,
        'confidence_level': confidence,
        'raw_score': score
    }

# Usage example
classifier = pipeline("sentiment-analysis")

text = "   THIS IS AMAZING!!!   "
preprocessed = preprocess_text(text)
result = classifier(preprocessed)[0]
final_result = postprocess_result(result)

print(f"Original: {text}")
print(f"Preprocessed: {preprocessed}")
print(f"Final result: {final_result}")
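
The three steps above (preprocess, classify, postprocess) can be bundled into one callable so they are always applied together. This is a sketch under stated assumptions: `make_processor` is a hypothetical helper, and `fake_classifier` is a stand-in that returns a fixed result so the example runs without loading a model:

```python
# Hypothetical helper: composes preprocess -> classifier -> postprocess.
def make_processor(classifier, preprocess, postprocess):
    """Return a function that runs the three steps on a single text."""
    def run(text):
        cleaned = preprocess(text)
        raw = classifier(cleaned)[0]  # pipelines return a list with one dict per input
        return postprocess(raw)
    return run

# Stand-in for a real pipeline; always returns the same made-up prediction.
def fake_classifier(text):
    return [{"label": "POSITIVE", "score": 0.95}]

process = make_processor(
    fake_classifier,
    preprocess=lambda t: t.strip().lower(),
    postprocess=lambda r: {"prediction": r["label"], "raw_score": r["score"]},
)

output = process("   THIS IS AMAZING!!!   ")
print(output)
```

In real use, pass the object returned by `pipeline("sentiment-analysis")` in place of `fake_classifier`.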

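## Returning Scores for All Labels

In [None]:
# Return the score for every label, not just the top one.
# top_k=None replaces the deprecated return_all_scores=True.
classifier = pipeline("sentiment-analysis", top_k=None)

text = "I'm not sure how I feel about this."
results = classifier(text)

# Depending on the transformers version, a single input yields either a
# list of dicts or a list containing one such list; handle both.
scores = results[0] if isinstance(results[0], list) else results

print("All scores:")
for result in scores:
    print(f"{result['label']}: {result['score']:.4f}")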
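
For a single-label classifier such as the default SST-2 model, the per-label scores are typically a softmax over the model's raw logits, which is why they sum to 1. A minimal sketch of that computation in plain Python, using made-up logits rather than values from a real model run:

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for (NEGATIVE, POSITIVE); not taken from a real model.
logits = [1.2, -0.4]
probs = softmax(logits)

for label, p in zip(["NEGATIVE", "POSITIVE"], probs):
    print(f"{label}: {p:.4f}")
```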

## Error Handling and Best Practices

In [None]:
from transformers import pipeline
import logging

# Set up logging
logging.basicConfig(level=logging.INFO)

try:
    # Create pipeline with error handling
    classifier = pipeline("sentiment-analysis")
    
    # Handle various input types
    inputs = [
        "Normal text",
        "",  # Empty string
        "A" * 1000,  # Very long text
        None,  # None value
    ]
    
    for inp in inputs:
        try:
            if inp is None:
                print(f"Input: {inp} -> Skipped (None value)")
                continue
                
            if len(inp) == 0:
                print(f"Input: '{inp}' -> Skipped (empty string)")
                continue
                
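            # truncation=True keeps very long inputs within the model's maximum length
            result = classifier(inp, truncation=True)
            print(f"Input: {inp[:50]}... -> {result[0]['label']}")
            
        except Exception as e:
            # repr() also works for non-string inputs that cannot be sliced
            print(f"Error processing {repr(inp)[:50]}...: {e}")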
            
except Exception as e:
    print(f"Error creating pipeline: {str(e)}")
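
The checks for `None` and empty strings can also be done once, before anything reaches the pipeline. The sketch below assumes a hypothetical `clean_inputs` helper and an arbitrary character cap; it reports the positions of skipped items so they can be logged:

```python
# Hypothetical helper: not part of the transformers library.
def clean_inputs(inputs, max_chars=2000):
    """Keep non-empty strings, trimmed and capped at max_chars.

    Returns (kept, skipped), where skipped holds the indexes of rejected items.
    """
    kept, skipped = [], []
    for i, item in enumerate(inputs):
        if not isinstance(item, str) or not item.strip():
            skipped.append(i)
            continue
        kept.append(item.strip()[:max_chars])
    return kept, skipped

kept, skipped = clean_inputs(["Normal text", "", "A" * 1000, None], max_chars=100)
print(f"Kept {len(kept)} inputs, skipped indexes {skipped}")
```

A character cap is only a rough guard; the model's real limit is counted in tokens, so `truncation=True` is still needed when calling the pipeline.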

## Performance Tips

In [None]:
# 1. Reuse pipelines instead of creating new ones
classifier = pipeline("sentiment-analysis")

# 2. Pass a list (optionally with batch_size) instead of calling once per text
texts = ["Text 1", "Text 2", "Text 3"]
results = classifier(texts, batch_size=8)  # batch_size defaults to 1; larger values mainly help on GPU

# 3. Adjust max_length for your use case
classifier = pipeline("sentiment-analysis", max_length=128, truncation=True)

# 4. Use appropriate model size for your needs
# smaller models: distilbert-base-uncased-finetuned-sst-2-english
# larger models: roberta-large-mnli (an NLI model; better suited to zero-shot classification than to sentiment analysis)