# Getting Started with Hugging Face Models (L40S Compatible)

This notebook will guide you through:
1. Setting up your environment for Hugging Face
2. Pulling a model from the Hugging Face Hub
3. Loading and using the model for inference
4. Fine-tuning a model for your specific needs (basic example)
5. Saving and sharing your model

All examples use models small enough to run comfortably on a single NVIDIA L40S GPU (48 GB of memory).

## 1. Setting Up Your Environment

First, we need to install the necessary packages. We'll install the exact versions needed to avoid compatibility issues.

In [None]:
# Install the required packages
!pip install transformers==4.36.2 \
          datasets==2.15.0 \
          torch==2.1.2 \
          accelerate==0.25.0 \
          evaluate==0.4.1 \
          pillow==10.1.0 \
          huggingface-hub==0.19.4 \
          sentencepiece==0.1.99 \
          protobuf==4.25.1 \
          safetensors==0.4.1 \
          hf_xet  # Adding this to address the warnings

# Check the version to make sure installation was successful
import transformers
print(f"Transformers version: {transformers.__version__}")
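Printing one version only confirms that `transformers` imports. As a minimal sketch, the check below compares several installed packages against the pins from the install cell using the standard library's `importlib.metadata`. The `PINNED` dictionary and the `find_mismatches` helper are illustrative names, not part of any Hugging Face API.

```python
from importlib import metadata

# Pins copied from the install cell above; extend the dict as needed.
PINNED = {
    "transformers": "4.36.2",
    "datasets": "2.15.0",
    "torch": "2.1.2",
    "accelerate": "0.25.0",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Return {package: (expected, found)} for every pin that is not satisfied."""
    mismatches = {}
    for package, expected in pins.items():
        found = lookup(package)
        # Ignore local build suffixes such as "2.1.2+cu121" reported by some torch wheels.
        if found is None or found.split("+")[0] != expected:
            mismatches[package] = (expected, found)
    return mismatches


for package, (expected, found) in find_mismatches(PINNED).items():
    print(f"{package}: expected {expected}, found {found}")
```

If the loop prints nothing, every listed package matches its pin. The `lookup` parameter exists only so the comparison can be exercised without touching the real environment.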

### Setting up Hugging Face Authentication

You'll need to authenticate with Hugging Face to access certain models. Let's set this up properly.

In [None]:
from huggingface_hub import login

# IMPORTANT: How to create a token with the right scope
print("To create a Hugging Face token with the appropriate scope:")
print("1. Visit https://huggingface.co/settings/tokens")
print("2. Click 'New token'")
print("3. Give it a name (e.g., 'Notebook Access')")
print("4. For Role, select 'Read' if you only need to download models")
print("5. Select 'Write' if you plan to upload your models to the Hub")
print("6. Click 'Generate a token'")
print("7. Copy the token and paste it below when prompted")
print("\nNote: If the login cell takes too long, you can set the token directly:")
print("import os")
print("os.environ['HF_TOKEN'] = 'your_token_here'")

# Optionally set the token directly to avoid the interactive prompt.
# huggingface_hub reads HF_TOKEN; it does not recognize HUGGINGFACE_TOKEN.
# Uncomment the lines below and replace the placeholder with your token.
# import os
# os.environ['HF_TOKEN'] = 'your_token_here'  # Replace with your actual token
# print("Token set via environment variable")

# Or use the interactive login
# Comment this out if you used the environment variable approach above
login()
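For scripted or non-interactive runs, the token can be read from the environment and handed to `login(token=...)` instead of typing it at a prompt. The sketch below assumes the token is stored in `HF_TOKEN` (or the older `HUGGING_FACE_HUB_TOKEN`); `token_from_env` and `login_noninteractive` are hypothetical helper names introduced here for illustration.

```python
import os


def token_from_env(env=None):
    """Return the first non-empty token found in the environment, else None."""
    env = os.environ if env is None else env
    # HF_TOKEN is the current variable name; HUGGING_FACE_HUB_TOKEN is the older one.
    for name in ("HF_TOKEN", "HUGGING_FACE_HUB_TOKEN"):
        value = env.get(name, "").strip()
        if value:
            return value
    return None


def login_noninteractive():
    """Log in with a token from the environment; raise if none is set."""
    # Imported lazily so token_from_env() has no third-party dependency.
    from huggingface_hub import login

    token = token_from_env()
    if token is None:
        raise RuntimeError("Set HF_TOKEN before calling login_noninteractive().")
    login(token=token)
```

Calling `login_noninteractive()` in place of `login()` skips the prompt. Keeping the token in the environment also avoids pasting it into a notebook that may later be shared.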

## 2. Pulling Models from Hugging Face Hub

Let's start with a small, efficient model that runs easily on an L40S GPU.

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# First check if GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# This is a lightweight sentiment analysis model
model_name = "distilbert-base-uncased-finetuned-sst-2-english"

# Load the tokenizer and model (truncation is set later, when the tokenizer is called)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Move model to GPU if available
model = model.to(device)

print(f"Model loaded: {model_name}")
print(f"Model size: {model.num_parameters():,} parameters")

### Understanding Model and Tokenizer

Let's inspect the model's configuration, which records its architecture, hidden sizes, and label mapping.

In [None]:
# Check the model's configuration
print("Model configuration:")
print(model.config)

## 3. Using the Model for Inference

Now let's use our model to analyze some text.

In [None]:
import torch

# Define a function to get sentiment predictions
def analyze_sentiment(text):
    # Tokenize the input text with explicit truncation setting
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)
    
    # Move inputs to the same device as the model
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    # Get predictions from the model
    with torch.no_grad():  # Disable gradient calculation for inference
        outputs = model(**inputs)
    
    # Convert to probabilities with softmax
    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
    
    # Get the most likely class
    predicted_class = torch.argmax(probabilities, dim=-1).item()
    
    # Map class index to label
    sentiment = "positive" if predicted_class == 1 else "negative"
    confidence = probabilities[0][predicted_class].item()
    
    return sentiment, confidence

In [None]:
# Try it with some example texts
examples = [
    "I love this product! It's amazing and works perfectly.",
    "The service was terrible and the staff was rude.",
    "The movie was okay, not great but not terrible either."
]

for text in examples:
    sentiment, confidence = analyze_sentiment(text)
    print(f"Text: {text}")
    print(f"Sentiment: {sentiment} (Confidence: {confidence:.4f})")
    print("---")
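The confidence printed above is the softmax probability of the winning class, so even a neutral sentence is forced into "positive" or "negative". A dependency-free sketch of that step follows. The logit values are made up for illustration and are not real model output, and `softmax` and `top_label` are illustrative helpers rather than library functions.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    # Subtracting the max keeps exp() from overflowing without changing the result.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_label(logits, id2label):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Label order used in analyze_sentiment: index 0 is negative, index 1 is positive.
ID2LABEL = {0: "negative", 1: "positive"}

# Made-up logits: one clear-cut case and one near-tie.
for logits in ([-3.0, 3.0], [0.2, 0.1]):
    label, prob = top_label(logits, ID2LABEL)
    print(f"logits={logits} -> {label} ({prob:.4f})")
```

With two classes, only the gap between the logits matters: a gap of 6 gives a probability of about 0.9975, while a gap of 0.1 gives about 0.525. A confidence near 0.5 is therefore a hint that the text may not fit either label well.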

### Try a Different Model: Text Generation

Let's also try a small text generation model that fits easily on an L40S GPU.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Using a small GPT-2 model for text generation that fits comfortably on an L40S
gen_model_name = "distilgpt2"

# Create a text generation pipeline with explicit truncation
generator = pipeline(
    'text-generation', 
    model=gen_model_name, 
    device=0 if torch.cuda.is_available() else -1,  # Use GPU if available
    truncation=True  # Explicitly set truncation
)

# Generate text
prompt = "In a world where AI and humans work together, "
result = generator(
    prompt, 
    max_length=50,  # Total length in tokens, prompt included
    num_return_sequences=1,
    pad_token_id=generator.tokenizer.eos_token_id  # GPT-2 has no pad token; reuse EOS to avoid the warning
)

print("Generated text:")
print(result[0]['generated_text'])

## 4. Fine-tuning a Model for Your Specific Task

Let's see how to fine-tune a model on a small slice of a dataset, which keeps both training time and GPU memory use low.

In [None]:
from datasets import load_dataset
from transformers import Trainer, TrainingArguments

# Load a small subset of the dataset to fit in memory
dataset = load_dataset("tweet_eval", "sentiment", split={'train': 'train[:2000]', 'validation': 'validation[:500]'})
print(dataset)

In [None]:
# Examine some examples from the dataset
for i in range(3):
    print(f"Text: {dataset['train'][i]['text']}")
    print(f"Label: {dataset['train'][i]['label']}")
    print("---")

In [None]:
# Prepare the dataset for fine-tuning

# Load a small model and its matching tokenizer for fine-tuning
model_name_ft = "distilbert-base-uncased"
tokenizer_ft = AutoTokenizer.from_pretrained(model_name_ft)
# tweet_eval sentiment has three labels: negative, neutral, positive
model_ft = AutoModelForSequenceClassification.from_pretrained(model_name_ft, num_labels=3)

def tokenize_function(examples):
    # Use the fine-tuning tokenizer, not the one loaded earlier for the SST-2 model
    return tokenizer_ft(examples["text"], padding="max_length", truncation=True, max_length=128)

# Tokenize the dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Prepare train and validation datasets
train_dataset = tokenized_datasets["train"]
eval_dataset = tokenized_datasets["validation"]
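The tokenization step above uses `padding="max_length"`, which pads every tweet to 128 tokens even when it is much shorter. The sketch below estimates how many token positions that costs compared with padding each batch only to its longest member. The token counts are hypothetical, not measured from tweet_eval, and both helper functions are illustrative.

```python
def padded_tokens_fixed(lengths, max_length):
    """Total positions when every sequence is truncated or padded to max_length."""
    return len(lengths) * max_length


def padded_tokens_dynamic(lengths, batch_size, max_length):
    """Total positions when each batch is padded only to its own longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        # Truncate first, exactly as the tokenizer would with truncation=True.
        batch = [min(n, max_length) for n in lengths[start:start + batch_size]]
        total += len(batch) * max(batch)
    return total


# Hypothetical token counts for eight short tweets.
lengths = [12, 25, 31, 18, 40, 22, 15, 27]

fixed = padded_tokens_fixed(lengths, max_length=128)
dynamic = padded_tokens_dynamic(lengths, batch_size=4, max_length=128)
print(f"fixed padding:   {fixed} positions")
print(f"dynamic padding: {dynamic} positions")
```

In transformers, the dynamic behavior comes from dropping `padding="max_length"` from the tokenizer call and passing `data_collator=DataCollatorWithPadding(tokenizer=tokenizer_ft)` to `Trainer`. Fixed padding is simpler to reason about, which is a fair choice for a first example.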

In [None]:
# Define training arguments
# Note: transformers 4.36.2 (pinned above) uses `evaluation_strategy`;
# releases from 4.41 onward rename it to `eval_strategy`.
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    save_strategy="epoch",          # Must match the evaluation strategy for load_best_model_at_end
    learning_rate=2e-5,
    per_device_train_batch_size=8,  # Small batch size keeps memory use low
    per_device_eval_batch_size=8,   # Small batch size keeps memory use low
    num_train_epochs=2,             # Reduced epochs for demonstration
    weight_decay=0.01,
    load_best_model_at_end=True,
    push_to_hub=False,
    fp16=torch.cuda.is_available()  # Mixed precision for faster training; needs a GPU
)

# Initialize Trainer
trainer = Trainer(
    model=model_ft,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

# Fine-tune the model
# Uncomment to run the training
# trainer.train()
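As configured above, the `Trainer` reports only the evaluation loss. Passing a `compute_metrics` callback adds task metrics such as accuracy. The sketch below is deliberately dependency-free and works on nested lists as well as the NumPy arrays that `Trainer` supplies. The sample logits and labels are hand-made for illustration.

```python
def argmax(row):
    """Index of the largest value in a sequence of scores."""
    return max(range(len(row)), key=lambda i: row[i])


def compute_metrics(eval_pred):
    """Accuracy from a (logits, labels) pair, the shape Trainer hands to this callback."""
    logits, labels = eval_pred
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(predictions)}


# Tiny hand-made batch: three classes (negative, neutral, positive), four examples.
fake_logits = [[2.0, 0.1, -1.0], [0.0, 1.5, 0.3], [-0.5, 0.2, 2.2], [1.0, 0.9, 0.8]]
fake_labels = [0, 1, 2, 2]
print(compute_metrics((fake_logits, fake_labels)))
```

To use it, add `compute_metrics=compute_metrics` to the `Trainer(...)` call. The accuracy then appears next to the loss at each evaluation.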

## 5. Saving Your Fine-tuned Model Locally

After fine-tuning, you can save your model locally.

In [None]:
# If you've trained the model, you can save it
# Uncomment these lines after running training
"""
# Save the model and tokenizer locally
model_ft.save_pretrained("./my_finetuned_model")
tokenizer_ft.save_pretrained("./my_finetuned_model")
print("Model saved to ./my_finetuned_model")
"""

## 6. Sharing Your Model on Hugging Face Hub (Optional)

If you want to share your model with the community, you can push it to Hugging Face Hub.

In [None]:
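# Code to upload your model to Hugging Face Hub
# Remove the surrounding triple quotes and modify for your needs
"""
from huggingface_hub import HfFolder

# Define your repository ID (the name must be unique within your account)
repo_id = "your-username/tweet-sentiment-model"

# Push the model and tokenizer to the Hub (requires a token with Write access)
if HfFolder.get_token() is not None:  # Check if logged in
    # Trainer.push_to_hub() takes a commit message, not a repository name,
    # so push the model and tokenizer objects directly instead.
    model_ft.push_to_hub(repo_id, private=True)  # Set private=False to make it public
    tokenizer_ft.push_to_hub(repo_id, private=True)
    print(f"Model pushed to Hugging Face Hub: {repo_id}")
else:
    print("You need to log in first using the login() function")
"""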

## 7. Using Pre-trained Models for Other Tasks

Let's explore some other task-specific models that work well on the L40S GPU.

### Named Entity Recognition

Let's use a BERT-base NER model, which is lighter than the pipeline's default.

In [None]:
from transformers import pipeline

# Create a named entity recognition pipeline with a compact model
ner_pipeline = pipeline(
    "ner", 
    model="dslim/bert-base-NER",  # BERT-base, smaller than the pipeline's default BERT-large NER model
    device=0 if torch.cuda.is_available() else -1  # Use GPU if available
)

# Try it out
text = "My name is Sarah and I work at Google in New York City."
entities = ner_pipeline(text)

print("Named Entities:")
for entity in entities:
    print(f"{entity['word']}: {entity['entity']} (Score: {entity['score']:.4f})")
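The pipeline above returns one entry per token, so "New York City" arrives as three separate rows tagged `B-LOC`, `I-LOC`, `I-LOC`. The sketch below merges such rows into whole entities. The `sample` list imitates the pipeline's output format with made-up scores, and `group_entities` is an illustrative helper rather than a library function.

```python
def group_entities(tokens):
    """Merge B-/I- tagged tokens into whole entities with an averaged score."""
    groups = []
    for token in tokens:
        prefix, _, label = token["entity"].partition("-")
        word = token["word"]
        previous = groups[-1] if groups else None
        # A token continues the previous entity only if it is adjacent, has the
        # same label, and is either an I- tag or a WordPiece continuation.
        continues = (
            previous is not None
            and previous["label"] == label
            and token["index"] == previous["last_index"] + 1
            and (prefix == "I" or word.startswith("##"))
        )
        if continues:
            # WordPiece continuations ("##gging") attach without a space.
            previous["text"] += word[2:] if word.startswith("##") else " " + word
            previous["scores"].append(token["score"])
            previous["last_index"] = token["index"]
        else:
            groups.append({
                "label": label,
                "text": word,
                "scores": [token["score"]],
                "last_index": token["index"],
            })
    return [
        {"label": g["label"], "text": g["text"], "score": sum(g["scores"]) / len(g["scores"])}
        for g in groups
    ]


# Imitation of token-level pipeline output; the scores are invented.
sample = [
    {"entity": "B-PER", "word": "Sarah", "score": 0.99, "index": 4},
    {"entity": "B-ORG", "word": "Google", "score": 0.98, "index": 9},
    {"entity": "B-LOC", "word": "New", "score": 0.99, "index": 11},
    {"entity": "I-LOC", "word": "York", "score": 0.97, "index": 12},
    {"entity": "I-LOC", "word": "City", "score": 0.95, "index": 13},
]

for entity in group_entities(sample):
    print(f"{entity['text']}: {entity['label']} (Score: {entity['score']:.4f})")
```

The pipeline can also do this grouping itself: passing `aggregation_strategy="simple"` to `pipeline("ner", ...)` returns merged entities, with the label under the `entity_group` key instead of `entity`.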

### Image Classification

Let's use a compact, memory-efficient image classifier.

In [None]:
from transformers import AutoImageProcessor, AutoModelForImageClassification
from PIL import Image  # Pillow was already installed in the setup cell
import requests

# Download an example image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Display the image in Jupyter notebook
display(image)  # Works better in Jupyter than image.show()

# Load a small image classification model
# Separate variable names keep the sentiment `model` from section 2 intact
image_model_name = "microsoft/resnet-18"  # Far fewer parameters than ViT-base
# AutoImageProcessor replaces the deprecated AutoFeatureExtractor
image_processor = AutoImageProcessor.from_pretrained(image_model_name)
image_model = AutoModelForImageClassification.from_pretrained(image_model_name)
image_model = image_model.to(device)  # Move to GPU

# Preprocess the image and get predictions
inputs = image_processor(images=image, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}  # Move to same device as model

with torch.no_grad():
    outputs = image_model(**inputs)

logits = outputs.logits

# Get the predicted class
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", image_model.config.id2label[predicted_class_idx])

## 8. Deploying a Model for Production

Here's a simple Flask API example for deploying your model.

In [None]:
# Example code for a Flask API - save to a file called app.py
"""
from flask import Flask, request, jsonify
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

app = Flask(__name__)

# Set device - use GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model (do this outside of route to load only once)
model_path = "./my_finetuned_model"  # Or use a model from Hugging Face
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
model = model.to(device)  # Move model to GPU if available

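@app.route('/predict', methods=['POST'])
def predict():
    # Reject requests that are not JSON or that lack the 'text' field
    data = request.get_json(silent=True)
    if not data or 'text' not in data:
        return jsonify({'error': "Request body must be JSON with a 'text' field"}), 400
    text = data['text']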
    
    # Tokenize and predict
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}  # Move to same device as model
    
    with torch.no_grad():
        outputs = model(**inputs)
    
    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(probabilities, dim=-1).item()
    confidence = probabilities[0][predicted_class].item()
    
    # For tweet_eval sentiment: 0 = negative, 1 = neutral, 2 = positive
    sentiment_map = {0: "negative", 1: "neutral", 2: "positive"}
    sentiment = sentiment_map[predicted_class]
    
    return jsonify({
        'sentiment': sentiment,
        'confidence': confidence
    })

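if __name__ == '__main__':
    # Flask's built-in server is for local testing only. Keep debug mode off
    # outside development because it exposes an interactive debugger.
    app.run(debug=False)
"""

# To run this locally:
# python app.py
# Then, make POST requests to http://localhost:5000/predict with JSON data: {"text": "your text here"}
# For production, serve the app with a WSGI server such as gunicorn instead of app.run().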
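A small client makes the endpoint easy to try. The sketch below uses only the standard library and assumes the server from `app.py` is listening on Flask's default address. `build_request` and `predict` are illustrative helper names.

```python
import json
import urllib.request

API_URL = "http://localhost:5000/predict"  # Default Flask address from the note above


def build_request(text, url=API_URL):
    """Create a JSON POST request for the /predict endpoint without sending it."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text, url=API_URL, timeout=10):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_request(text, url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# With the server running in another terminal:
# print(predict("I love this product!"))
```

Building the request separately from sending it keeps the payload easy to inspect when a call fails.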

## 9. Optimizing for L40S GPU

Here are some techniques to get the most out of your L40S GPU when working with Hugging Face models:

### Quantization

Quantization stores weights in fewer bits: 8-bit weights take roughly a quarter of the memory of 32-bit floats, usually at a small cost in accuracy.
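A back-of-the-envelope estimate shows where the saving comes from. The sketch below counts weight memory only and assumes a parameter count of about 67 million for the DistilBERT sentiment model. Substitute the number printed by `model.num_parameters()` for a real figure.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_mib(num_params, dtype):
    """Approximate memory for the weights alone, in MiB."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 2)


# Assumed parameter count; replace with the value from model.num_parameters().
num_params = 67_000_000

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_mib(num_params, dtype):.0f} MiB")
```

Measured numbers will differ: 8-bit loading typically keeps some layers, such as embeddings and layer norms, in higher precision, and activations and optimizer state come on top of the weights.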

In [None]:
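# Example of using 8-bit quantization
# Requires a CUDA GPU and the bitsandbytes package (not in the install cell): pip install bitsandbytes
"""
from transformers import AutoModelForSequenceClassification, BitsAndBytesConfig

# Load model with 8-bit weights
model_8bit = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)

# Report the measured footprint instead of assuming a saving
print(f"8-bit model loaded: {model_8bit.get_memory_footprint() / 1e6:.1f} MB")
"""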

### Gradient Checkpointing

For training larger models on the L40S, gradient checkpointing trades compute for memory.

In [None]:
# Example of using gradient checkpointing
"""
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
model.gradient_checkpointing_enable()  # Enable gradient checkpointing

print("Gradient checkpointing enabled. This will use less memory during training.")
"""

### Mixed Precision Training

Using mixed precision can significantly speed up training on the L40S GPU.

In [None]:
# Example of training arguments with mixed precision
"""
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
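    fp16=True,  # Enable mixed precision training (requires a CUDA GPU)
    # fp16_opt_level="O1" applies only to the NVIDIA Apex backend and is
    # ignored by the default PyTorch AMP backend, so it is omitted here.
    # The L40S also supports bfloat16: set bf16=True instead of fp16=True.
    # ... other arguments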
)
"""

## 10. Selecting the Right Models for L40S

Here are some of the most popular and efficient models that work well on a single L40S GPU:

### Text Classification
- distilbert-base-uncased
- roberta-base
- albert-base-v2

### Named Entity Recognition
- dslim/bert-base-NER
- dbmdz/bert-base-cased-finetuned-conll03-english

### Text Generation
- distilgpt2
- EleutherAI/gpt-neo-125M
- bigscience/bloom-560m

### Image Classification
- microsoft/resnet-18
- microsoft/resnet-50
- google/vit-base-patch16-224

### Translation
- Helsinki-NLP/opus-mt-en-fr (or other language pairs)
- facebook/mbart-large-50-one-to-many-mmt

### Summarization
- facebook/bart-base
- t5-small
- google/pegasus-xsum

These models strike a good balance between performance and efficiency, making them suitable for an L40S GPU.

## Conclusion

In this notebook, we've covered how to work with Hugging Face models on an NVIDIA L40S GPU, including:

- Setting up your environment with the right dependencies
- Properly authenticating with Hugging Face
- Loading and using pre-trained models for various tasks
- Fine-tuning models with memory-efficient settings
- Optimizing models for inference and deployment
- Selecting the right models that work well on a single L40S GPU

By following these best practices, you can effectively leverage Hugging Face models even with GPU memory constraints.

### Next Steps

1. Explore the [Model Hub](https://huggingface.co/models) for more L40S-compatible models
2. Try quantization techniques for using larger models on limited hardware
3. Experiment with parameter-efficient fine-tuning techniques like LoRA or Adapters
4. Join the Hugging Face community to share your experiences and learn from others

Happy modeling!