# **HUGGING FACE**

It is hard, if not impossible, to talk about transformers in general without talking about Hugging Face.



Hugging Face maintains one of the largest ecosystems of easy-to-use open source tools for NLP, vision, and beyond.

The central component of their ecosystem is the Transformers library, which allows you to easily download a pretrained model, including its corresponding tokenizer, and then fine-tune it on your own dataset, if needed. Plus, the library supports TensorFlow, PyTorch, and JAX (with the Flax
library).

The simplest way to use the Transformers library is to use the transformers.pipeline()
function: you just specify which task you want, such as sentiment analysis, and it downloads a
default pretrained model, ready to be used. It really couldn’t be any simpler:

In [1]:
import transformers
from transformers import pipeline

When we create the pipeline below, we do not specify which model to use. Therefore, the default sentiment-analysis pipeline downloads a ~250MB model (distilbert-base-uncased-finetuned-sst-2-english) from the Hugging Face Hub.

If this is your first time downloading the model, it is normal for that cell to take a while.

First-time download breakdown:

- Model size: ~250MB (distilbert-base-uncased-finetuned-sst-2-english)
- Tokenizer files: ~1-5MB
- Config files: <1MB
- Total download: ~255MB

The first run takes a while, but the downloaded files are then saved to a local cache, so later runs load much faster. In recent versions of the library the default cache location is ~/.cache/huggingface/hub/ (older versions used ~/.cache/huggingface/transformers/).
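As a rough illustration of where the cached files end up, the sketch below resolves the cache folder for a given model. It assumes the layout used by recent versions of the library: a hub folder under ~/.cache/huggingface, overridable through the HF_HOME and HF_HUB_CACHE environment variables, with one models--<org>--<name> folder per model. The helper names hub_cache_dir and model_cache_dir are made up for this example, and the logic is a simplification, not the library's own code:

```python
import os
from pathlib import Path


def hub_cache_dir(env=None) -> Path:
    """Return the assumed Hub cache folder (simplified resolution order)."""
    env = os.environ if env is None else env
    # 1. An explicit cache location wins.
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"]).expanduser()
    # 2. Otherwise use <HF_HOME>/hub, with HF_HOME defaulting to ~/.cache/huggingface.
    hf_home = env.get("HF_HOME") or os.path.join("~", ".cache", "huggingface")
    return (Path(hf_home) / "hub").expanduser()


def model_cache_dir(repo_id: str, env=None) -> Path:
    """Return the assumed per-model folder, e.g. models--<org>--<name>."""
    return hub_cache_dir(env) / ("models--" + repo_id.replace("/", "--"))


if __name__ == "__main__":
    print(model_cache_dir("distilbert/distilbert-base-uncased-finetuned-sst-2-english"))
```

Deleting that per-model folder should force a fresh download the next time the model is requested.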

In [None]:
classifier = pipeline("sentiment-analysis") # many other tasks are available

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

Device set to use mps:0
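The warning in the output above recommends naming the model and revision explicitly. A minimal sketch of doing so, reusing the model name and revision hash printed in that warning. The helper name load_pinned_classifier is hypothetical, and the import sits inside the function only so that defining the helper does not load anything by itself:

```python
# Both values are copied from the warning message printed by pipeline().
MODEL_ID = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
REVISION = "714eb0f"


def load_pinned_classifier(model_id: str = MODEL_ID, revision: str = REVISION):
    """Build a sentiment-analysis pipeline pinned to one model and revision."""
    # Lazy import: the library is only needed when the pipeline is actually built.
    from transformers import pipeline

    return pipeline("sentiment-analysis", model=model_id, revision=revision)
```

Calling classifier = load_pinned_classifier() should then behave like the default pipeline, without the warning and without silently changing if the task's default model changes in a later release.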


In [4]:
result = classifier("The actors were very convincing")

In [5]:
print(result)

[{'label': 'POSITIVE', 'score': 0.9998143315315247}]


In this example, the model correctly found that the sentence is positive, with around 99.98% confidence. 

Of course, you can also pass a batch of sentences to the model:

In [7]:
classifier(["I am from India.", "I am from Iraq."])

[{'label': 'POSITIVE', 'score': 0.9896161556243896},
 {'label': 'NEGATIVE', 'score': 0.9811071157455444}]

Clearly, bias is still a problem in AI. Before deploying a model to production, think twice about the harm it may do.
How can you try to reduce bias? It is still an active area of research, and the solution depends on the problem. Among other things, it may involve:
- rebalancing the dataset
- fine-tuning on a different dataset
- switching to another pretrained model
- tweaking the model's architecture or hyperparameters
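Before trying any fix, it helps to measure the problem. The sketch below generalizes the India/Iraq comparison above: it fills one template with several words and reports the label and score for each, so that predictions that differ only because of the swapped word stand out. The function names probe_template and labels_disagree are hypothetical, and the classifier argument is assumed to be any callable that behaves like the pipeline above (a list of sentences in, a list of label/score dictionaries out):

```python
def probe_template(classifier, template, fillers):
    """Classify one template filled with each word and collect the results."""
    sentences = [template.format(filler) for filler in fillers]
    results = classifier(sentences)
    return {
        filler: (result["label"], result["score"])
        for filler, result in zip(fillers, results)
    }


def labels_disagree(report):
    """Return True if the predicted label changes across the fillers."""
    return len({label for label, _score in report.values()}) > 1
```

For instance, probe_template(classifier, "I am from {}.", ["India", "Iraq"]) would rerun the comparison above. A probe like this only detects one narrow symptom; it does not show that a model is free of bias.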

## Choosing your model in the pipeline

The pipeline() function uses the default model for the given task. For example, for text
classification tasks such as sentiment analysis, at the time of writing, it defaults to
distilbert-base-uncased-finetuned-sst-2-english: a DistilBERT model with an uncased
tokenizer, trained on English Wikipedia and a corpus of English books, and fine-tuned on the
Stanford Sentiment Treebank v2 (SST-2) task. It’s also possible to manually specify a different
model. For example, you could use a DistilBERT model fine-tuned on the Multi-Genre Natural
Language Inference (MultiNLI) task, which classifies a pair of sentences into one of three classes:
contradiction, neutral, or entailment. Here is how:

In [6]:
model_name = "huggingface/distilbert-base-uncased-finetuned-mnli"
classifier_mnli = pipeline("text-classification", model=model_name)
classifier_mnli("She loves me. [SEP] She loves me not.")

config.json:   0%|          | 0.00/729 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/268M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/58.0 [00:00<?, ?B/s]

vocab.txt: 0.00B [00:00, ?B/s]

added_tokens.json:   0%|          | 0.00/2.00 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Device set to use mps:0


[{'label': 'contradiction', 'score': 0.9790191650390625}]
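Instead of typing the [SEP] token by hand, the text-classification pipeline also accepts a sentence pair as a dictionary with text and text_pair keys, and lets the tokenizer insert the separator itself. A small sketch, assuming a pipeline object like classifier_mnli above; as_text_pair and classify_pair are hypothetical helper names:

```python
def as_text_pair(premise: str, hypothesis: str) -> dict:
    """Package two sentences in the dictionary form the pipeline accepts."""
    return {"text": premise, "text_pair": hypothesis}


def classify_pair(classifier, premise: str, hypothesis: str):
    """Run a text-classification pipeline on a (premise, hypothesis) pair."""
    # No manual "[SEP]": the pipeline's tokenizer joins the two sentences.
    return classifier(as_text_pair(premise, hypothesis))
```

With the model above, classify_pair(classifier_mnli, "She loves me.", "She loves me not.") would be the equivalent call. This form avoids relying on the separator being spelled [SEP], which is not the case for every tokenizer.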

### **TIP**

You can find the available models at https://huggingface.co/models, and the list of tasks at https://huggingface.co/tasks.

## Customization: choosing the tokenizer, model, configuration, callbacks, and more

The pipeline API is very simple and convenient, but sometimes you will need more control. For such cases, the Transformers library provides many classes, including all sorts of tokenizers, models, configurations, callbacks, and much more. For example, let’s load the same DistilBERT model, along with its corresponding tokenizer, using the AutoModelForSequenceClassification and AutoTokenizer classes. These are the PyTorch variants; the TensorFlow equivalent of the model class is TFAutoModelForSequenceClassification.

Different models use different tokenizers; for more detail, see [Deepen: TOKENIZER](./05_02_deepen_Tokenizer.ipynb).

- Model-specific tokenizers
Each pretrained model was trained with a specific tokenizer, so you must load the matching one:

In [14]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [None]:

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

Next, let’s tokenize a couple of pairs of sentences. In this code, we activate padding and specify that
we want PyTorch tensors instead of Python lists:

In [13]:
# return_tensors="pt" requests PyTorch tensors; use "tf" for TensorFlow tensors
token_ids = tokenizer(["I like soccer. [SEP] We all love soccer!",
                      "Joe lived for a very long time. [SEP] Joe is old."],
                      padding=True, return_tensors="pt")

print("Token IDs shape:", token_ids['input_ids'].shape)
print("Attention mask shape:", token_ids['attention_mask'].shape)

Token IDs shape: torch.Size([2, 15])
Attention mask shape: torch.Size([2, 15])


The output is a dictionary-like instance of the BatchEncoding class, which contains the
sequences of token IDs, as well as a mask containing 0s for the padding tokens:

In [12]:
token_ids

{'input_ids': tensor([[ 101, 1045, 2066, 4715, 1012,  102, 2057, 2035, 2293, 4715,  999,  102,
            0,    0,    0],
        [ 101, 3533, 2973, 2005, 1037, 2200, 2146, 2051, 1012,  102, 3533, 2003,
         2214, 1012,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
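To make the padding and the mask concrete, here is a plain-Python sketch of what padding=True amounts to for this batch: shorter sequences are filled on the right with the padding ID up to the length of the longest one, and the mask marks real tokens with 1 and padding with 0. The token IDs are copied from the output above; pad_batch is a hypothetical helper, not the tokenizer's actual implementation, and right-side padding with ID 0 is assumed because that is what the output shows:

```python
PAD_ID = 0  # padding ID visible at the end of the first row above


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one and build the mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Unpadded token IDs of the two sentence pairs, taken from the output above.
first = [101, 1045, 2066, 4715, 1012, 102, 2057, 2035, 2293, 4715, 999, 102]
second = [101, 3533, 2973, 2005, 1037, 2200, 2146, 2051, 1012, 102,
          3533, 2003, 2214, 1012, 102]

batch = pad_batch([first, second])
```

The model uses the mask to ignore the padded positions, so the padding does not affect the prediction for the shorter sequence.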

If you set return_token_type_ids=True when calling the tokenizer, you will also get an
extra tensor that indicates which sentence each token belongs to. This is needed by some models,
but not DistilBERT.
Next, we can pass this BatchEncoding object to the model. With a PyTorch model, it must be
unpacked into keyword arguments (input_ids and attention_mask) using Python's double-star syntax;
passing the object positionally, as in model(token_ids), raises an error. The model returns a
SequenceClassifierOutput object containing its predicted class logits:

In [None]:
# (note to self: resuming from here, p. 808)

# Unpack the BatchEncoding into input_ids=... and attention_mask=...
outputs = model(**token_ids)
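The class logits mentioned above are unnormalized scores, one per class. Applying a softmax turns them into probabilities comparable to the score values the pipeline reported earlier. The sketch below shows that step in plain Python. The logits and the label order in it are invented for illustration and do not come from the model; softmax and logits_to_predictions are hypothetical helper names:

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    largest = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_to_predictions(batch_logits, id2label):
    """Return the most probable label and its probability for each row."""
    predictions = []
    for logits in batch_logits:
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        predictions.append({"label": id2label[best], "score": probs[best]})
    return predictions


# Hypothetical values: real logits come from outputs.logits, and the real
# label order comes from the model's configuration, not from this dictionary.
example_logits = [[-2.0, 0.5, 3.0], [1.0, 1.0, 1.0]]
example_id2label = {0: "contradiction", 1: "entailment", 2: "neutral"}

predictions = logits_to_predictions(example_logits, example_id2label)
```

With the model from this section, a call along the lines of logits_to_predictions(outputs.logits.detach().tolist(), model.config.id2label) should apply, since the configuration stores the mapping from class index to label name.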