# Huggingface Ecosystem Overview

Huggingface is a company that maintains open-source libraries and a model-sharing platform that make state-of-the-art models easy to use for Natural Language Processing (NLP) and, increasingly, other machine learning tasks.

### Key Components of the Huggingface Ecosystem:

1. **Transformers**: A library that provides APIs and tools to easily download and fine-tune state-of-the-art pre-trained models.
2. **Datasets**: A library to access and process large datasets used for NLP and other machine learning tasks.
3. **PEFT (Parameter-Efficient Fine-Tuning)**: A library for efficient model fine-tuning using parameter-efficient techniques.
4. **Accelerate**: A library that lets the same PyTorch training code run on different hardware setups (CPU, single GPU, multiple GPUs, TPU) with minimal changes.
5. **Huggingface Hub**: A central repository for pre-trained models, datasets, and demos, allowing seamless sharing and collaboration.

Let's explore each component in detail with hands-on examples.

## Transformers Library

The `transformers` library by Huggingface is a powerful toolkit that provides state-of-the-art pre-trained models and easy-to-use APIs for NLP tasks such as text classification, named entity recognition, translation, text generation, and more.

### Key Features:
- Provides thousands of pre-trained models.
- Supports multiple frameworks: PyTorch, TensorFlow, and JAX.
- Easy integration with the Huggingface Hub.

### Example: Loading a Pre-trained Model and Running Inference
Let's load a pre-trained DistilBERT model and use it for a simple text classification task with `pipeline()`.
`pipeline()` is a convenient way to use a pre-trained model for inference. It works out of the box for many tasks across different modalities, as described in the [quick tour](https://huggingface.co/docs/transformers/quicktour).

In [1]:
from transformers import pipeline

# Load a pre-trained sentiment-analysis pipeline.
# classifier = pipeline("sentiment-analysis") would let the library pick a default model for us.
# To save disk space, we instead use a model that was downloaded in advance to a shared directory:
classifier = pipeline("sentiment-analysis", model="/leonardo_scratch/fast/EUHPC_D20_063/huggingface/models/distilbert--distilbert-base-uncased")

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at /leonardo_scratch/fast/EUHPC_D20_063/huggingface/models/distilbert--distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Device set to use cuda:0


In [2]:
print(f"Model used: {classifier.model.name_or_path}\n")
print(classifier.model.config)

Model used: /leonardo_scratch/fast/EUHPC_D20_063/huggingface/models/distilbert--distilbert-base-uncased

DistilBertConfig {
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.52.4",
  "vocab_size": 30522
}



In [3]:
# Test the pipeline with some example text
result = classifier("Huggingface is an amazing platform for NLP research!")
print(result)

[{'label': 'LABEL_0', 'score': 0.517898678779602}]


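The result isn't great: the label is the generic `LABEL_0` and the score is close to 0.5. This is because we used a model that hasn't been fine-tuned for the task, so its classification head was newly initialized (see the warning above).
Let's try this one and see how it works: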
<br>
`/leonardo_scratch/fast/EUHPC_D20_063/huggingface/models/distilbert--distilbert-base-uncased-finetuned-sst-2-english`

Feel free to try out different sentences, also negative ones.
Should you have more than one input, pass them as a list. You will then get a list of dictionaries as output.
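The `classifier` created earlier still points at the base checkpoint, so switching models means building a new pipeline. Here is a minimal sketch, assuming the fine-tuned checkpoint is available at the shared path quoted above. The variable name `finetuned_path` and the `os.path.isdir` guard are our additions, so that the cell prints a hint instead of failing on systems without that directory.

```python
import os
from transformers import pipeline

# Path quoted above; assumed to exist only on the course's shared file system
finetuned_path = (
    "/leonardo_scratch/fast/EUHPC_D20_063/huggingface/models/"
    "distilbert--distilbert-base-uncased-finetuned-sst-2-english"
)

if os.path.isdir(finetuned_path):
    # Rebuild the pipeline so that it uses the fine-tuned weights
    classifier = pipeline("sentiment-analysis", model=finetuned_path)
    print(classifier("Huggingface is an amazing platform for NLP research!"))
else:
    print(f"Checkpoint not found at {finetuned_path}; adjust the path for your system.")
```

Because this checkpoint was fine-tuned on SST-2, its labels should read as sentiment classes rather than the generic `LABEL_0`. The outputs recorded in this notebook were produced with the base checkpoint.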

In [4]:
result = classifier(["EuroCC courses are not bad at all. There is lots to gain from them.",
                    "I think Large Language Models are overrated."])
print(result)

[{'label': 'LABEL_0', 'score': 0.5010508894920349}, {'label': 'LABEL_0', 'score': 0.5165805816650391}]


As you can see, the model not only predicts a label but also outputs a score, a probability that indicates the model's confidence in its prediction.
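To make the score more concrete, the sketch below shows how a label and a score can be derived from a classifier's raw outputs (logits) with a softmax. To our understanding this mirrors what the pipeline does for a single-label classifier with two or more classes, but it is an illustration, not the library's code. The logits are made-up values and the helper names `softmax` and `to_prediction` are hypothetical.

```python
import math

def softmax(logits):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def to_prediction(logits, id2label):
    """Pick the most probable class and report its probability as the score."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

# Hypothetical logits, chosen for illustration only
id2label = {0: "LABEL_0", 1: "LABEL_1"}
print(to_prediction([0.04, -0.03], id2label))  # nearly equal logits -> score close to 0.5
print(to_prediction([-2.0, 3.0], id2label))    # well-separated logits -> score close to 1
```

A randomly initialized classification head tends to produce nearly equal logits, which is why the scores above hover around 0.5.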

## Datasets Library

The `datasets` library is a lightweight library for easily downloading and using datasets in NLP and other ML domains. It supports both in-memory and on-disk (memory-mapped) use, which makes it suitable for handling datasets larger than the available RAM.

### Key Features:
- Access to thousands of datasets in various domains.
- Built-in data processing tools such as caching, shuffling, and batching.
- Easy integration with the `transformers` library for model training.

### Example: Loading and Exploring a Dataset
Let's load a sample dataset and explore its content.
The **IMDB** dataset is a popular benchmark dataset used for sentiment analysis tasks in natural language processing. It consists of movie reviews from the Internet Movie Database (IMDB) and is specifically designed for binary sentiment classification: determining whether a given movie review expresses a positive or negative sentiment.


In [5]:
from datasets import load_dataset

# Load the IMDB dataset
dataset = load_dataset("/leonardo_scratch/fast/EUHPC_D20_063/huggingface/datasets/stanfordnlp--imdb")

In [6]:
# Display the 10th example in the training set
print(dataset['train'][9])

{'text': "This is said to be a personal film for Peter Bogdonavitch. He based it on his life but changed things around to fit the characters, who are detectives. These detectives date beautiful models and have no problem getting them. Sounds more like a millionaire playboy filmmaker than a detective, doesn't it? This entire movie was written by Peter, and it shows how out of touch with real people he was. You're supposed to write what you know, and he did that, indeed. And leaves the audience bored and confused, and jealous, for that matter. This is a curio for people who want to see Dorothy Stratten, who was murdered right after filming. But Patti Hanson, who would, in real life, marry Keith Richards, was also a model, like Stratten, but is a lot better and has a more ample part. In fact, Stratten's part seemed forced; added. She doesn't have a lot to do with the story, which is pretty convoluted to begin with. All in all, every character in this film is somebody that very few people 

In [7]:
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments, AutoTokenizer
from peft import LoraConfig, get_peft_model

In [8]:
# Load a pre-trained model for binary classification (num_labels=2)
model = AutoModelForSequenceClassification.from_pretrained("/leonardo_scratch/fast/EUHPC_D20_063/huggingface/models/google--bert-base-uncased", num_labels=2)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at /leonardo_scratch/fast/EUHPC_D20_063/huggingface/models/google--bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


The warning above is expected: the classification head we added has not been pre-trained, so its weights and biases are newly initialized.

The bert-base-uncased model is a general-purpose, pre-trained BERT model. It has been trained on a large corpus of text using self-supervised objectives (like masked language modeling) but not for specific tasks like sentiment analysis, classification, etc.

In [9]:
# Apply LoRA for efficient fine-tuning
config = LoraConfig(r=8)
peft_model = get_peft_model(model, config)

`get_peft_model` is a function from the PEFT library that takes a pre-trained model and a LoRA configuration (`LoraConfig`) and returns a model prepared for parameter-efficient fine-tuning.
The returned model (`peft_model`) wraps the original model (`model`): the original weights are frozen, and LoRA adds small low-rank matrices, whose size is set by the rank `r`, that are the only parameters updated during training in this configuration.
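A rough back-of-the-envelope calculation shows why this is efficient. The sketch below counts the weights LoRA adds for `r=8` on BERT-base (hidden size 768, 12 encoder layers). It assumes that the query and value projections of each layer are adapted, which we believe is PEFT's default for BERT but have not verified here. The helper `lora_param_count` is hypothetical.

```python
def lora_param_count(d_out, d_in, r):
    """Weights added by one LoRA adapter: B is (d_out x r), A is (r x d_in)."""
    return r * (d_out + d_in)

hidden = 768           # hidden size of BERT-base
r = 8                  # rank used in LoraConfig(r=8)
n_layers = 12          # encoder layers in BERT-base
adapted_per_layer = 2  # ASSUMPTION: query and value projections are adapted in each layer

full = hidden * hidden                             # weights in one dense 768 x 768 projection
per_adapter = lora_param_count(hidden, hidden, r)  # weights in one LoRA adapter
total_lora = per_adapter * adapted_per_layer * n_layers

print(f"One dense projection: {full:,} weights")
print(f"One LoRA adapter:     {per_adapter:,} weights ({per_adapter / full:.1%} of the projection)")
print(f"All adapters:         {total_lora:,} trainable weights")
```

For the actual numbers of a given model, `peft_model.print_trainable_parameters()` reports the trainable and total parameter counts.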

In [10]:
# Load the IMDB dataset
dataset = load_dataset("/leonardo_scratch/fast/EUHPC_D20_063/huggingface/datasets/stanfordnlp--imdb")

In [11]:
# Load a pre-trained tokenizer
tokenizer = AutoTokenizer.from_pretrained("/leonardo_scratch/fast/EUHPC_D20_063/huggingface/models/google--bert-base-uncased")

In [12]:
# Define a function to tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)
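The effect of `padding="max_length"`, `truncation=True`, and `max_length` can be pictured with a simplified sketch on plain lists of token ids. This is not the tokenizer's implementation (for instance, a real BERT tokenizer keeps the closing special token when it truncates). The function `pad_or_truncate` and the token ids are hypothetical.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]                   # truncation: drop tokens beyond max_length
    n_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad  # 1 = real token, 0 = padding
    input_ids = ids + [pad_id] * n_pad             # padding: fill up to max_length
    return input_ids, attention_mask

# Hypothetical token ids, for illustration only
short_ids = [101, 2023, 102]
long_ids = list(range(1, 11))

print(pad_or_truncate(short_ids, max_length=6))  # padded with zeros
print(pad_or_truncate(long_ids, max_length=6))   # cut to the first six ids
```

Fixed-length sequences let every batch be stacked into a tensor of the same shape; the attention mask tells the model which positions are padding.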

In [13]:
# Tokenize the dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Rename the 'label' column to 'labels', the argument name the model expects
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
# Return PyTorch tensors for input_ids, attention_mask, and labels
tokenized_datasets.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])

Map:   0%|          | 0/50000 [00:00<?, ? examples/s]

In [14]:
# Define training arguments
training_args = TrainingArguments(output_dir="./results",
                                  num_train_epochs=1,
                                  per_device_train_batch_size=32,
                                  report_to="none",
                                 )
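With these arguments, the number of optimizer steps follows from the dataset size and the batch size. The short calculation below assumes training on a single GPU with no gradient accumulation; IMDB's training split contains 25,000 reviews. The result matches the `global_step=782` reported in the training output further down.

```python
import math

n_train = 25_000  # reviews in the IMDB training split
batch_size = 32   # per_device_train_batch_size
n_devices = 1     # ASSUMPTION: training runs on a single GPU
n_epochs = 1      # num_train_epochs

# The last batch of an epoch may be smaller, hence the ceiling
steps_per_epoch = math.ceil(n_train / (batch_size * n_devices))
total_steps = steps_per_epoch * n_epochs
print(total_steps)
```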

In [15]:
# Initialize Trainer
trainer = Trainer(model=peft_model,
                  args=training_args,
                  train_dataset=tokenized_datasets['train'],
                  eval_dataset=tokenized_datasets['test'])

# Train the model
trainer.train()

Detected kernel version 4.18.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


[2025-09-08 10:16:49,867] [INFO] [real_accelerator.py:260:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2025-09-08 10:16:52,182] [INFO] [logging.py:107:log_dist] [Rank -1] [TorchCheckpointEngine] Initialized with serialization = False


No label_names provided for model class `PeftModel`. Since `PeftModel` hides base models input arguments, if label_names is not given, label_names can't be set automatically within `Trainer`. Note that empty label_names list will be used instead.


Step,Training Loss
500,0.6531


TrainOutput(global_step=782, training_loss=0.6041818799265205, metrics={'train_runtime': 76.3213, 'train_samples_per_second': 327.563, 'train_steps_per_second': 10.246, 'total_flos': 1650106406400000.0, 'train_loss': 0.6041818799265205, 'epoch': 1.0})

## Huggingface Hub

The Huggingface Hub is a platform for sharing models, datasets, and demos. It allows developers and researchers to collaborate, publish, and discover models and datasets, making the entire community's work more accessible.

### Key Features:
- A central repository for models, datasets, and demos.
- Tools for versioning, collaboration, and deployment.
- Integrated with Huggingface libraries for easy use.

### Example: Uploading a Model to the Huggingface Hub
Here's how you can upload a model to the Huggingface Hub.

``` python
from huggingface_hub import HfApi

# Initialize the API
api = HfApi()

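# Create a new repository (requires authentication, e.g. via huggingface_hub.login())
repo_name = "my-awesome-model"
repo_url = api.create_repo(repo_name, exist_ok=True)
repo_id = repo_url.repo_id  # full id including your namespace, e.g. "<username>/my-awesome-model"

# Save the model locally and push the folder to the Hub
model.save_pretrained(f"./{repo_name}")
api.upload_folder(repo_id=repo_id, folder_path=f"./{repo_name}")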
```

### Explore Available Models

We will use the Huggingface Hub API to list available models and inspect their metadata.


In [16]:
from huggingface_hub import HfApi

In [17]:
# Initialize the Huggingface API
api = HfApi()

In [18]:
# Fetch the first 10 models returned by the Huggingface Model Hub
models = list(api.list_models(limit=10))

In [19]:
# Display the fetched models
for model in models:
    print(model.id)

tencent/Hunyuan-MT-7B
tencent/HunyuanWorld-Voyager
google/embeddinggemma-300m
microsoft/VibeVoice-1.5B
moonshotai/Kimi-K2-Instruct-0905
swiss-ai/Apertus-8B-Instruct-2509
meituan-longcat/LongCat-Flash-Chat
openbmb/MiniCPM4.1-8B
apple/FastVLM-0.5B
Qwen/Qwen-Image-Edit


In [20]:
# Display the first 5 models for demonstration
models[:5]

[ModelInfo(id='tencent/Hunyuan-MT-7B', author=None, sha=None, created_at=datetime.datetime(2025, 8, 28, 9, 51, 39, tzinfo=datetime.timezone.utc), last_modified=None, private=False, disabled=None, downloads=4181, downloads_all_time=None, gated=None, gguf=None, inference=None, inference_provider_mapping=None, likes=549, library_name='transformers', tags=['transformers', 'safetensors', 'hunyuan_v1_dense', 'text-generation', 'translation', 'zh', 'en', 'fr', 'pt', 'es', 'ja', 'tr', 'ru', 'ar', 'ko', 'th', 'it', 'de', 'vi', 'ms', 'id', 'tl', 'hi', 'pl', 'cs', 'nl', 'km', 'my', 'fa', 'gu', 'ur', 'te', 'mr', 'he', 'bn', 'ta', 'uk', 'bo', 'kk', 'mn', 'ug', 'arxiv:2509.05209', 'autotrain_compatible', 'endpoints_compatible', 'region:us'], pipeline_tag='translation', mask_token=None, card_data=None, widget_data=None, model_index=None, config=None, transformers_info=None, trending_score=533, siblings=None, spaces=None, safetensors=None, security_repo_status=None, xet_enabled=None),
 ModelInfo(id='t

Should you wish to make this output easier to read, you can create a pandas DataFrame:

In [21]:
import pandas as pd

In [22]:
# Build the DataFrame from a list of dictionaries, one per model:
model_data = []

for model in models:
    model_info = {
        'Model ID': model.id,
        'Tags': ', '.join(model.tags) if model.tags else 'N/A',
        'Downloads': model.downloads,
        'Likes': model.likes,
        'Pipeline Tag': model.pipeline_tag if model.pipeline_tag else 'N/A',
        'Last Modified': model.last_modified.strftime('%Y-%m-%d') if model.last_modified else 'N/A'
    }
    model_data.append(model_info)


In [23]:
# Pass the list of dictionaries to the pandas DataFrame constructor:
df_models = pd.DataFrame(model_data)

# Display the first 5 entries of the DataFrame:
df_models.head()

| | Model ID | Tags | Downloads | Likes | Pipeline Tag | Last Modified |
|---|---|---|---|---|---|---|
| 0 | tencent/Hunyuan-MT-7B | transformers, safetensors, hunyuan_v1_dense, t... | 4181 | 549 | translation | |
| 1 | tencent/HunyuanWorld-Voyager | hunyuanworld-voyager, safetensors, hunyuan3d, ... | 3548 | 482 | image-to-video | |
| 2 | google/embeddinggemma-300m | sentence-transformers, safetensors, gemma3_tex... | 35543 | 407 | sentence-similarity | |
| 3 | microsoft/VibeVoice-1.5B | transformers, safetensors, vibevoice, text-gen... | 230570 | 1549 | text-to-speech | |
| 4 | moonshotai/Kimi-K2-Instruct-0905 | transformers, safetensors, kimi_k2, text-gener... | 3838 | 287 | text-generation | |
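Once the metadata is in a DataFrame, filtering and sorting by simple criteria takes one line each. The sketch below rebuilds a small DataFrame from the five rows shown above (tags omitted for brevity), so it runs without network access. The threshold of 500 likes is an arbitrary choice for illustration.

```python
import pandas as pd

# Rows copied from the table above (tags omitted for brevity)
df = pd.DataFrame({
    "Model ID": [
        "tencent/Hunyuan-MT-7B",
        "tencent/HunyuanWorld-Voyager",
        "google/embeddinggemma-300m",
        "microsoft/VibeVoice-1.5B",
        "moonshotai/Kimi-K2-Instruct-0905",
    ],
    "Downloads": [4181, 3548, 35543, 230570, 3838],
    "Likes": [549, 482, 407, 1549, 287],
    "Pipeline Tag": [
        "translation",
        "image-to-video",
        "sentence-similarity",
        "text-to-speech",
        "text-generation",
    ],
})

# Sort by number of downloads, most downloaded first
by_downloads = df.sort_values("Downloads", ascending=False)
print(by_downloads[["Model ID", "Downloads"]])

# Keep only models with more than 500 likes
popular = df[df["Likes"] > 500]
print(popular["Model ID"].tolist())
```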


#### Analyze Model Information

Inspect a single model to understand its details, such as the task it targets and the tags attached to it.

In [24]:
# Display information about a specific model
model_name = "bert-base-uncased"  # Example model
model_info = api.model_info(model_name)

print(f"Model: {model_info.id}")
card = model_info.card_data  # metadata from the model card; may be None
description = card.get('description', 'No description available') if card else 'No description available'
print(f"Description: {description}")
print(f"Pipeline task: {model_info.pipeline_tag}")
print(f"Tags: {model_info.tags}")

Model: google-bert/bert-base-uncased
Description: No description available
Pipeline task: fill-mask
Tags: ['transformers', 'pytorch', 'tf', 'jax', 'rust', 'coreml', 'onnx', 'safetensors', 'bert', 'fill-mask', 'exbert', 'en', 'dataset:bookcorpus', 'dataset:wikipedia', 'arxiv:1810.04805', 'license:apache-2.0', 'autotrain_compatible', 'endpoints_compatible', 'region:us']
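Several of these tags carry a `prefix:value` structure (`dataset:`, `arxiv:`, `license:`, `region:`). A small sketch of how they could be grouped for easier reading, using the tag list printed above; the helper `group_tags` and the `"other"` bucket are our own naming.

```python
from collections import defaultdict

# Tag list copied from the output above
tags = ['transformers', 'pytorch', 'tf', 'jax', 'rust', 'coreml', 'onnx', 'safetensors',
        'bert', 'fill-mask', 'exbert', 'en', 'dataset:bookcorpus', 'dataset:wikipedia',
        'arxiv:1810.04805', 'license:apache-2.0', 'autotrain_compatible',
        'endpoints_compatible', 'region:us']

def group_tags(tags):
    """Group 'prefix:value' tags by prefix; tags without a prefix go under 'other'."""
    grouped = defaultdict(list)
    for tag in tags:
        prefix, sep, value = tag.partition(":")
        if sep:
            grouped[prefix].append(value)
        else:
            grouped["other"].append(tag)
    return dict(grouped)

grouped = group_tags(tags)
print(grouped["dataset"])  # training datasets named in the tags
print(grouped["license"])  # license identifier
print(grouped["arxiv"])    # arXiv id of the associated paper
```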


## Conclusion

In this notebook, we explored the Huggingface ecosystem, including the `transformers` and `datasets` libraries and the Huggingface Hub, and took a first look at LoRA fine-tuning with `peft`.
We will get to know `PEFT` and `accelerate` in more detail in later notebooks.

In [25]:
# Shut down the kernel
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)

{'status': 'ok', 'restart': False}