# Huggingface Ecosystem Overview

Huggingface is a company that provides open-source libraries and tools that make state-of-the-art models easy to use for a wide range of Natural Language Processing (NLP) tasks.

### Key Components of the Huggingface Ecosystem:

1. **Transformers**: A library that provides APIs and tools to easily download and fine-tune state-of-the-art pre-trained models.
2. **Datasets**: A library to access and process large datasets used for NLP and other machine learning tasks.
3. **PEFT (Parameter-Efficient Fine-Tuning)**: A library for efficient model fine-tuning using parameter-efficient techniques.
4. **Accelerate**: A library that lets the same PyTorch training code run on different hardware setups (CPU, a single GPU, multiple GPUs, or multiple nodes) with minimal changes.
5. **Huggingface Hub**: A central repository for pre-trained models, datasets, and demos, allowing seamless sharing and collaboration.

Let's explore the main components with hands-on examples; PEFT and Accelerate are covered in more depth in later notebooks.

## Transformers Library

The `transformers` library by Huggingface is a powerful toolkit that provides state-of-the-art pre-trained models and easy-to-use APIs for NLP tasks such as text classification, named entity recognition, translation, text generation, and more.

### Key Features:
- Provides thousands of pre-trained models.
- Supports multiple frameworks: PyTorch, TensorFlow, and JAX.
- Easy integration with the Huggingface Hub.

### Example: Loading a Pre-trained Model and Running Inference
Let's load a pre-trained DistilBERT model and use it for a simple text classification task.
For that we use `pipeline()`, a convenient way to run inference with a pre-trained model.
It works out of the box for many tasks across different modalities, as can be seen [here](https://huggingface.co/docs/transformers/quicktour).

In [None]:
from transformers import pipeline

# Load a pre-trained sentiment-analysis pipeline.
# classifier = pipeline("sentiment-analysis") would let the library pick a default model for us.
# To save disk space, we instead use a model that was downloaded in advance to a shared directory:
classifier = pipeline("sentiment-analysis", model="/leonardo_scratch/fast/EUHPC_D20_063/huggingface/models/distilbert--distilbert-base-uncased")

In [None]:
print(f"Model used: {classifier.model.name_or_path}\n")
print(classifier.model.config)

In [None]:
# Test the pipeline with some example text
result = classifier("Huggingface is an amazing platform for NLP research!")
print(result)

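The result isn't great. This is because we used a model that hasn't been fine-tuned for the task, so its classification head is still randomly initialized.
Let's replace the model path in the `pipeline()` call above with this one and see how it works: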
<br>
`/leonardo_scratch/fast/EUHPC_D20_063/huggingface/models/distilbert--distilbert-base-uncased-finetuned-sst-2-english`

Feel free to try out different sentences, also negative ones.
Should you have more than one input, pass them as a list. You will then get a list of dictionaries as output.

In [None]:
result = classifier(["EuroCC courses are not bad at all. There is lots to gain from them.",
                    "I think Large Language Models are overrated."])
print(result)

As you can see, the model not only predicts the sentiment but also outputs a score: a probability that indicates the model's confidence in its prediction.
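
To make the score concrete: a classification model outputs one raw value (a logit) per label, and a softmax turns these values into probabilities. The reported score is the probability of the winning label. The following is a minimal sketch of that step in plain Python, not the pipeline's actual code; the logit values and the `softmax` helper are made up for illustration.

```python
import math

def softmax(logits):
    """Turn raw model outputs (logits) into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one sentence, in the order of the labels below
labels = ["NEGATIVE", "POSITIVE"]
logits = [-2.0, 3.0]

probs = softmax(logits)
best = max(range(len(labels)), key=lambda i: probs[i])

# Same shape as one entry of the pipeline output: a label and its score
prediction = {"label": labels[best], "score": probs[best]}
print(prediction)
```

A large gap between the logits gives a score close to 1, while nearly equal logits give a score close to 0.5.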

## Datasets Library

The `datasets` library is a lightweight library for easily downloading and using datasets in NLP and other ML domains. It is optimized for both in-memory and out-of-memory (on-disk) use, making it suitable for handling very large datasets.

### Key Features:
- Access to thousands of datasets in various domains.
- Built-in data processing tools such as caching, shuffling, and batching.
- Easy integration with the `transformers` library for model training.

### Example: Loading and Exploring a Dataset
Let's load a sample dataset and explore its content.
The **IMDB** dataset is a popular benchmark dataset used for sentiment analysis tasks in natural language processing. It consists of movie reviews from the Internet Movie Database (IMDB) and is specifically designed for binary sentiment classification: determining whether a given movie review expresses a positive or negative sentiment.


In [None]:
from datasets import load_dataset

# Load the IMDB dataset
dataset = load_dataset("/leonardo_scratch/fast/EUHPC_D20_063/huggingface/datasets/stanfordnlp--imdb")

In [None]:
# Display the 10th example in the training set
print(dataset['train'][9])

In [None]:
# Imports for fine-tuning a model on this dataset with LoRA (PEFT)
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments, AutoTokenizer
from peft import LoraConfig, get_peft_model

In [None]:
# Load a pre-trained model for binary classification (num_labels=2)
model = AutoModelForSequenceClassification.from_pretrained("/leonardo_scratch/fast/EUHPC_D20_063/huggingface/models/google--bert-base-uncased", num_labels=2)

The warning we got is expected: the classification head we added has not been pre-trained, so its weights and biases are newly initialized.

The bert-base-uncased model is a general-purpose, pre-trained BERT model. It has been trained on a large corpus of text using self-supervised objectives (like masked language modeling) but not for specific tasks like sentiment analysis, classification, etc.

In [None]:
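# Apply LoRA for efficient fine-tuning.
# task_type="SEQ_CLS" tells PEFT that this is a sequence classification model,
# so the newly initialized classification head stays trainable.
config = LoraConfig(task_type="SEQ_CLS", r=8)
peft_model = get_peft_model(model, config)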

`get_peft_model` is a function from the PEFT library that takes a pre-trained model and a LoRA configuration (`LoraConfig`) and returns a new model that has been adapted for parameter-efficient fine-tuning.
The new model (`peft_model`) keeps the architecture and the frozen weights of the original model (`model`) and adds a small number of trainable parameters introduced by LoRA.
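
A rough count shows why this is efficient. LoRA leaves a weight matrix of shape d × k frozen and trains two small matrices of shapes d × r and r × k instead, which adds r · (d + k) parameters. The sketch below assumes BERT-base dimensions (hidden size 768, 12 layers) and assumes that LoRA is attached to the query and value projections of every layer. Which modules are actually adapted depends on the PEFT configuration, so treat the numbers as an illustration.

```python
def lora_params(d, k, r):
    """Parameters added by LoRA for one d x k weight matrix (d x r plus r x k)."""
    return r * (d + k)

# Assumed BERT-base dimensions (not read from the model)
hidden_size = 768
num_layers = 12
adapted_per_layer = 2   # assumption: query and value projections
r = 8                   # LoRA rank

full = hidden_size * hidden_size                  # one frozen projection matrix
added = lora_params(hidden_size, hidden_size, r)  # its trainable LoRA update

total_added = added * adapted_per_layer * num_layers
print(f"One projection matrix: {full:,} frozen vs. {added:,} LoRA parameters")
print(f"LoRA parameters over all layers: {total_added:,}")
```

On the real model, `peft_model.print_trainable_parameters()` reports the actual counts.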

In [None]:
# Load the IMDB dataset
dataset = load_dataset("/leonardo_scratch/fast/EUHPC_D20_063/huggingface/datasets/stanfordnlp--imdb")

In [None]:
# Load a pre-trained tokenizer
tokenizer = AutoTokenizer.from_pretrained("/leonardo_scratch/fast/EUHPC_D20_063/huggingface/models/google--bert-base-uncased")

In [None]:
# Define a function to tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)

In [None]:
# Tokenize the dataset
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Rename the 'label' column to 'labels', the argument name the model's forward pass uses
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")

# Return PyTorch tensors for input_ids, attention_mask, and labels
tokenized_datasets.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
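
What `padding="max_length"` and `truncation=True` do to each review can be sketched without a real tokenizer. The helper below is hypothetical and works on lists of made-up token IDs. It assumes a padding ID of 0, which is what BERT's tokenizer uses, and it ignores special tokens such as `[CLS]` and `[SEP]`.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Bring a token ID list to exactly max_length and build its attention mask."""
    ids = token_ids[:max_length]                    # truncation: drop tokens beyond max_length
    n_pad = max_length - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad   # 1 = real token, 0 = padding
    input_ids = ids + [pad_id] * n_pad              # padding: fill up with pad_id
    return {"input_ids": input_ids, "attention_mask": attention_mask}

short_example = pad_or_truncate([7, 8, 9], max_length=5)
long_example = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_length=5)
print(short_example)
print(long_example)
```

Because every example ends up with the same length, the examples can be stacked into fixed-size batches.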

In [None]:
# Define training arguments
training_args = TrainingArguments(output_dir="./results",
                                  num_train_epochs=1,
                                  per_device_train_batch_size=32,
                                  report_to="none",
                                 )

In [None]:
# Initialize Trainer
trainer = Trainer(model=peft_model,
                  args=training_args,
                  train_dataset=tokenized_datasets['train'],
                  eval_dataset=tokenized_datasets['test'])

# Train the model
trainer.train()
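
How many optimizer steps one epoch takes follows from the dataset size and the batch size. A back-of-the-envelope sketch, assuming the 25,000 training reviews of IMDB, a single device, and no gradient accumulation (with several GPUs the effective batch size grows accordingly):

```python
import math

num_train_examples = 25_000        # size of the IMDB training split
per_device_train_batch_size = 32   # same value as in the training arguments
num_devices = 1                    # assumption: a single GPU
num_train_epochs = 1

effective_batch_size = per_device_train_batch_size * num_devices
steps_per_epoch = math.ceil(num_train_examples / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs
print(f"{steps_per_epoch} steps per epoch, {total_steps} steps in total")
```

The last batch of the epoch is smaller than the others, which is why the division is rounded up.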

## Huggingface Hub

The Huggingface Hub is a platform for sharing models, datasets, and demos. It allows developers and researchers to collaborate, publish, and discover models and datasets, making the entire community's work more accessible.

### Key Features:
- A central repository for models, datasets, and demos.
- Tools for versioning, collaboration, and deployment.
- Integrated with Huggingface libraries for easy use.

### Example: Uploading a Model to the Huggingface Hub
Here's how you can upload a model to the Huggingface Hub.

``` python
from huggingface_hub import HfApi

# Initialize the API
api = HfApi()

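# Create a new repository (requires authentication, e.g. via `huggingface-cli login`)
repo_name = "my-awesome-model"
repo_url = api.create_repo(repo_name, exist_ok=True)

# Save the model locally and push the folder to the Hub.
# repo_url.repo_id is the full repository ID, including the user namespace.
model.save_pretrained(f"./{repo_name}")
api.upload_folder(repo_id=repo_url.repo_id, folder_path=f"./{repo_name}")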
```

### Explore Available Models

We will use the Huggingface API to search for available models and filter them based on certain criteria.


In [None]:
from huggingface_hub import HfApi

In [None]:
# Initialize the Huggingface API
api = HfApi()

In [None]:
# Fetch the first 10 models listed on the Huggingface Hub
models = list(api.list_models(limit=10))

In [None]:
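# Display the IDs of the fetched models
# (the loop variable is not named `model`, so the model loaded earlier is not overwritten)
for hub_model in models:
    print(hub_model.modelId)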

In [None]:
# Display the first 5 models for demonstration
models[:5]

Should you wish to make that visually more pleasing, you can do so by creating a pandas DataFrame:

In [None]:
import pandas as pd

In [None]:
# Collect the fields of interest as a list of dictionaries, one per model:
model_data = []

for hub_model in models:
    model_info = {
        'Model ID': hub_model.modelId,
        'Tags': ', '.join(hub_model.tags) if hub_model.tags else 'N/A',
        'Downloads': hub_model.downloads,
        'Likes': hub_model.likes,
        'Pipeline Tag': hub_model.pipeline_tag if hub_model.pipeline_tag else 'N/A',
        'Last Modified': hub_model.lastModified.strftime('%Y-%m-%d') if hub_model.lastModified else 'N/A'
    }
    model_data.append(model_info)


In [None]:
# Pass the list of dictionaries to the pandas DataFrame constructor:
df_models = pd.DataFrame(model_data)

# Display the first 5 entries of the DataFrame:
df_models.head()
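
Once the model information is in a list of dictionaries, filtering by a criterion is a matter of ordinary Python. The records below are invented for illustration (they are not real Hub entries), and their keys mirror the ones used for `model_data`.

```python
# Hypothetical records in the same shape as the entries of model_data
example_data = [
    {"Model ID": "org-a/model-one", "Downloads": 1200, "Pipeline Tag": "text-classification"},
    {"Model ID": "org-b/model-two", "Downloads": 50, "Pipeline Tag": "text-generation"},
    {"Model ID": "org-c/model-three", "Downloads": 980, "Pipeline Tag": "text-classification"},
]

# Keep only text-classification models, most downloaded first
selected = sorted(
    (m for m in example_data if m["Pipeline Tag"] == "text-classification"),
    key=lambda m: m["Downloads"],
    reverse=True,
)

for m in selected:
    print(f"{m['Model ID']}: {m['Downloads']} downloads")
```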

#### Analyze Model Information

Inspect a single model to see details such as its description, task, and tags.

In [None]:
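# Display information about a specific model
model_name = "bert-base-uncased"  # Example model
model_info = api.model_info(model_name)

# cardData holds the model card metadata and can be missing
card_data = model_info.cardData
no_description = 'No description available'
description = card_data.get('description', no_description) if card_data is not None else no_description

print(f"Model: {model_info.modelId}")
print(f"Description: {description}")
print(f"Task: {model_info.pipeline_tag}")  # pipeline_tag names the task, not the framework
print(f"Tags: {model_info.tags}")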

## Conclusion

In this notebook, we explored the Huggingface ecosystem, including the `transformers` and `datasets` libraries and the Huggingface Hub, and had a first look at LoRA fine-tuning with `PEFT`.
We will get to know `PEFT` and `accelerate` in more detail in later notebooks.

In [None]:
# Shut down the kernel
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)