# Huggingface Basics

# Setup

In [12]:
import tensorflow as tf
from transformers import (
    pipeline,
    TFBertForSequenceClassification
)


---
# Utility

In [4]:
def create_model(max_sequence, model_name, num_labels):
    bert_model = TFBertForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)
    
    # Input for the token ids (the dataset text after encoding):
    input_ids = tf.keras.layers.Input(shape=(max_sequence,), dtype=tf.int32, name='input_ids')

    # attention_mask is a binary mask that tells BERT which tokens to attend to.
    # The tokenizer pads sequences shorter than max_sequence with 0 tokens, and the
    # mask marks which positions hold original tokens (1) and which hold padding (0):
    attention_mask = tf.keras.layers.Input(shape=(max_sequence,), dtype=tf.int32, name='attention_mask')

    # Feed both inputs to BERT; element [0] of the output holds the classification logits:
    output = bert_model([input_ids, attention_mask])[0]

    # Optionally add dropout as a regularization technique:
    # output = tf.keras.layers.Dropout(rate=0.15)(output)

    # Final layer with one unit per class:
    output = tf.keras.layers.Dense(num_labels, activation='softmax')(output)

    # Final model:
    model = tf.keras.models.Model(inputs=[input_ids, attention_mask], outputs=output)
    return model
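
The padding behaviour described in the comments above can be illustrated without loading a model. The sketch below is a toy under stated assumptions: `pad_and_mask` is a hypothetical helper (not a Transformers function), the token ids are illustrative, and the truncation is naive compared with what a real tokenizer does.

```python
from typing import List, Tuple

PAD_TOKEN_ID = 0  # matches "pad_token_id": 0 in the bert-base-uncased config


def pad_and_mask(token_ids: List[int], max_sequence: int) -> Tuple[List[int], List[int]]:
    """Pad (or naively truncate) token ids to max_sequence and build the matching mask."""
    ids = token_ids[:max_sequence]  # naive cut; a real tokenizer keeps the closing [SEP]
    padding = max_sequence - len(ids)
    input_ids = ids + [PAD_TOKEN_ID] * padding
    # 1 marks a real token, 0 marks a padding position to be ignored.
    attention_mask = [1] * len(ids) + [0] * padding
    return input_ids, attention_mask


# Illustrative ids only; a real tokenizer produces these values.
input_ids, attention_mask = pad_and_mask([101, 7592, 2088, 102], max_sequence=8)
print(input_ids)       # [101, 7592, 2088, 102, 0, 0, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

Both lists have length `max_sequence`, which is what the two `Input` layers in `create_model` expect.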

---
# Resources

* [Quick Tour](https://huggingface.co/docs/transformers/main/en/quicktour)
* [Transformers Notebooks](https://github.com/huggingface/transformers/tree/main/notebooks)

> You can find here a list of the official notebooks provided by Hugging Face.

* [Fine tuning the model](https://huggingface.co/docs/transformers/main/en/training)

## Pipeline

* [Hugging Face course](https://huggingface.co/course/chapter0/1?fw=pt)

> Welcome to the Hugging Face course! This introduction will guide you through setting up a working environment. If you’re just starting the course, we recommend you first take a look at Chapter 1, then come back and set up your environment so you can try the code yourself.

* [Huggingface Github - pipeline source](https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/__init__.py#L494)


## Fine-tune a pretrained model  
> There are significant benefits to using a pretrained model. It reduces computation costs, your carbon footprint, and allows you to use state-of-the-art models without having to train one from scratch. 🤗 Transformers provides access to thousands of pretrained models for a wide range of tasks. When you use a pretrained model, you train it on a dataset specific to your task. This is known as fine-tuning, an incredibly powerful training technique. In this tutorial, you will fine-tune a pretrained model with a deep learning framework of your choice:
> 
> * Fine-tune a pretrained model with 🤗 Transformers Trainer.
> * Fine-tune a pretrained model in TensorFlow with Keras.
> * Fine-tune a pretrained model in native PyTorch.

See: 
* [Transformers Notebooks](https://github.com/huggingface/transformers/tree/main/notebooks)
* [Training TFBertForSequenceClassification with custom X and Y data](https://stackoverflow.com/a/63295240/4281353)


## HuggingFace on SageMaker

* [Hugging Face on Amazon SageMaker](https://huggingface.co/docs/sagemaker/index)
* [Deploy models to Amazon SageMaker](https://huggingface.co/docs/sagemaker/inference)


---
# Terminologies

## Pre-tokenized

* [Pre-tokenized](https://huggingface.co/transformers/preprocessing.html#pre-tokenized-inputs)

A pre-tokenized input is a ***list of words as strings***, e.g. ```["Hello", "I'm", "a", "single", "sentence"]```. The word **tokenized** in the documentation is misleading here.

> Pre-tokenized does not mean your inputs are already tokenized (you wouldn’t need to pass them through the tokenizer if that was the case) but just split into words.
> If you want to use pre-tokenized inputs, you **MUST** set **```is_split_into_words=True```** when passing your inputs to the tokenizer.

Passing ```["Hello", "I'm", "a", "single", "sentence"]``` instead of the sentence ```"Hello I'm a single sentence"``` is what the documentation calls **pre-tokenized inputs**.
```
encoded_input = tokenizer(["Hello", "I'm", "a", "single", "sentence"], is_split_into_words=True)
```
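
To make the distinction concrete, the sketch below shows why a list of words still has to go through the tokenizer: each word may be split further into subword pieces. It is a toy greedy longest-match splitter in the spirit of WordPiece with a made-up vocabulary; `TOY_VOCAB` and `toy_wordpiece` are hypothetical names, and the result is not what the real BERT tokenizer produces.

```python
from typing import List, Set

# Made-up vocabulary for illustration; "##" marks a continuation piece.
TOY_VOCAB: Set[str] = {"hello", "a", "single", "sent", "##ence"}


def toy_wordpiece(word: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match-first split of one word into subword pieces."""
    word = word.lower()
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits: treat the whole word as unknown
        pieces.append(match)
        start = end
    return pieces


words = ["Hello", "a", "single", "sentence"]  # "pre-tokenized": split into words only
tokens = [piece for w in words for piece in toy_wordpiece(w, TOY_VOCAB)]
print(tokens)  # ['hello', 'a', 'single', 'sent', '##ence']
```

Four words become five tokens, so the word list is an input to tokenization rather than its result.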


---
# Config file

When a model is instantiated, its initialization parameters come from a Transformers configuration. The base class for all configurations is `PretrainedConfig`.

* [PretrainedConfig](https://huggingface.co/transformers/main_classes/configuration.html#pretrainedconfig)

> Base class for all configuration classes. Handles a few parameters common to all models’ configurations as well as methods for loading/downloading/saving configurations.

Each subclass adds its own parameters. For instance, pretrained BERT models use `BertConfig`.

* [BertConfig](https://huggingface.co/transformers/model_doc/bert.html#transformers.BertConfig)

> This is the configuration class to store the configuration of a BertModel or a TFBertModel. It is used to instantiate a BERT model according to the specified arguments, defining the model architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of the BERT bert-base-uncased architecture.

The ```num_labels``` parameter is from the [PretrainedConfig](https://huggingface.co/transformers/main_classes/configuration.html#transformers.PretrainedConfig)

> num_labels (int, optional) – Number of labels to use in the last layer added to the model, typically for a classification task.

```
TFBertForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)
``` 

## Example Config File
The configuration file for the model ```bert-base-uncased``` is published at [Huggingface model - bert-base-uncased - config.json](https://huggingface.co/bert-base-uncased/blob/main/config.json).

```
{
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.6.0.dev0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
```
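
Since the configuration is plain JSON, its values can be inspected with the standard library. The sketch below parses a subset of the fields shown above and derives two quantities from them; the derived names (`head_size`, `word_embedding_params`) are illustrative and are not fields of `BertConfig`.

```python
import json

# Subset of the bert-base-uncased config.json shown above.
config_text = """
{
  "hidden_size": 768,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "max_position_embeddings": 512,
  "vocab_size": 30522
}
"""

config = json.loads(config_text)

# Each attention head works on an equal slice of the hidden vector.
head_size = config["hidden_size"] // config["num_attention_heads"]

# The word-embedding matrix has one hidden_size-wide row per vocabulary entry.
word_embedding_params = config["vocab_size"] * config["hidden_size"]

print(head_size)              # 64
print(word_embedding_params)  # 23440896
```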


## Creating model from config file

In [7]:
# Max length of the encoded string (including special tokens such as [CLS] and [SEP]):
MAX_SEQUENCE_LENGTH = 64

# Standard BERT model with lowercase chars only:
PRETRAINED_MODEL_NAME = 'bert-base-uncased'

# Batch size for fitting:
BATCH_SIZE = 16

# Number of epochs:
EPOCHS = 5

# Number of labels:
NUM_LABELS: int = 2

In [9]:
model = create_model(MAX_SEQUENCE_LENGTH, PRETRAINED_MODEL_NAME, NUM_LABELS)


---
# Pipeline

A convenience utility: `pipeline` is a factory function that returns a task-specific pipeline object.

* [pipeline](https://huggingface.co/transformers/main_classes/pipelines.html?highlight=pipeline#transformers.pipeline)

> These pipelines are objects that abstract most of the complex code from the library, offering a simple API dedicated to several tasks, including Named Entity Recognition, Masked Language Modeling, Sentiment Analysis, Feature Extraction and Question Answering. See the task summary for examples of use.


Based on the task specified, ```pipeline``` auto-loads the appropriate model pretrained for the task.

## Tutorial

See:
* [Summary of the tasks](https://github.com/huggingface/notebooks/blob/main/transformers_doc/en/task_summary.ipynb)
* [Quick Tour](https://github.com/huggingface/notebooks/blob/main/transformers_doc/en/quicktour.ipynb)

## Available Tasks

> * "feature-extraction" -  FeatureExtractionPipeline
> * "text-classification" -  TextClassificationPipeline
> * "sentiment-analysis" (alias of "text-classification") - TextClassificationPipeline
> * "token-classification" -  TokenClassificationPipeline
> * "ner" (alias of "token-classification") - TokenClassificationPipeline
> * "question-answering" -  QuestionAnsweringPipeline
> * "fill-mask" -  FillMaskPipeline
> * "summarization" -  SummarizationPipeline
> * "translation_xx_to_yy" -  TranslationPipeline
> * "text2text-generation" -  Text2TextGenerationPipeline
> * "text-generation" -  TextGenerationPipeline
> * "zero-shot-classification" -  ZeroShotClassificationPipeline
> * "conversational" -  ConversationalPipeline
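
The alias relationships in the list above can be pictured as a lookup table. The sketch below is a simplified illustration, not the library's actual registry: `TASK_ALIASES`, `TASK_TO_PIPELINE`, and `resolve_task` are hypothetical names, and only a few of the listed tasks are included.

```python
from typing import Dict, Tuple

# Alias -> canonical task name, taken from the list above.
TASK_ALIASES: Dict[str, str] = {
    "sentiment-analysis": "text-classification",
    "ner": "token-classification",
}

# Canonical task name -> pipeline class name (a few entries from the list above).
TASK_TO_PIPELINE: Dict[str, str] = {
    "text-classification": "TextClassificationPipeline",
    "token-classification": "TokenClassificationPipeline",
    "fill-mask": "FillMaskPipeline",
}


def resolve_task(task: str) -> Tuple[str, str]:
    """Return (canonical task, pipeline class name) for a task name or alias."""
    canonical = TASK_ALIASES.get(task, task)
    if canonical not in TASK_TO_PIPELINE:
        raise KeyError(f"Unknown task: {task!r}")
    return canonical, TASK_TO_PIPELINE[canonical]


print(resolve_task("sentiment-analysis"))
# ('text-classification', 'TextClassificationPipeline')
```

Resolving the alias first means both names end up at the same pipeline class, which is why `"sentiment-analysis"` and `"text-classification"` behave identically.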


In [14]:
classifier = pipeline("sentiment-analysis", framework='tf')

result = classifier("I hate you")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

result = classifier("I love you")[0]
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



All PyTorch model weights were used when initializing TFDistilBertForSequenceClassification.

All the weights of TFDistilBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFDistilBertForSequenceClassification for predictions without further training.



label: NEGATIVE, with score: 0.9991
label: POSITIVE, with score: 0.9999
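
The `score` printed above is a probability. For a single-label classification model it is typically obtained by applying softmax to the model's output logits and reporting the largest value together with its label. The sketch below reproduces that step in plain Python; the logits are made-up numbers, and `softmax` and `to_prediction` are hypothetical helpers rather than pipeline internals.

```python
import math
from typing import Dict, List, Union


def softmax(logits: List[float]) -> List[float]:
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_prediction(logits: List[float], labels: List[str]) -> Dict[str, Union[str, float]]:
    """Pick the most probable label and report its probability as the score."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": labels[best], "score": probs[best]}


# Made-up logits for a two-class sentiment model.
result = to_prediction([-3.5, 3.5], ["NEGATIVE", "POSITIVE"])
print(f"label: {result['label']}, with score: {round(result['score'], 4)}")
# label: POSITIVE, with score: 0.9991
```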


In [13]:
from transformers import Conversation

# Keep the pipeline under its own name so it does not shadow the Conversation class.
conversational_pipeline = pipeline("conversational")

conversation_1 = Conversation("Going to the movies tonight - any suggestions?")
conversation_2 = Conversation("What's the last book you have read?")

No model was supplied, defaulted to microsoft/DialoGPT-medium and revision 8bada3b (https://huggingface.co/microsoft/DialoGPT-medium).
Using a pipeline without specifying a model name and revision in production is not recommended.

In [None]:
conversation_1

In [None]:
conversational_pipeline([conversation_1, conversation_2])

conversation_1.add_user_input("Is it an action movie?")
conversation_2.add_user_input("What is the genre of this book?")

conversational_pipeline([conversation_1, conversation_2])
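
The calls above rely on each conversation object keeping its own history, so that a follow-up question is answered in context. The sketch below is a toy stand-in showing that bookkeeping; `ToyConversation` and `toy_reply` are hypothetical and far simpler than the library's `Conversation` class and a real dialogue model.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ToyConversation:
    """Minimal history holder: a list of (role, text) turns."""
    turns: List[Tuple[str, str]] = field(default_factory=list)

    def add_user_input(self, text: str) -> None:
        self.turns.append(("user", text))

    def add_reply(self, text: str) -> None:
        self.turns.append(("bot", text))


def toy_reply(conversation: ToyConversation) -> ToyConversation:
    """Placeholder for a dialogue model: it only reports how much history it sees."""
    user_turns = sum(1 for role, _ in conversation.turns if role == "user")
    conversation.add_reply(f"(reply to user turn {user_turns})")
    return conversation


chat = ToyConversation()
chat.add_user_input("Going to the movies tonight - any suggestions?")
toy_reply(chat)
chat.add_user_input("Is it an action movie?")
toy_reply(chat)

for role, text in chat.turns:
    print(f"{role}: {text}")
```

Because the history lives in the object, passing the same conversation back to the pipeline after `add_user_input` continues the dialogue instead of starting a new one.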

---
# Using specific Model

Use ```TFAutoModelForSequenceClassification``` and ```AutoTokenizer``` to load the pretrained model and its associated tokenizer.



```
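from transformers import AutoTokenizer, TFAutoModelForSequenceClassification, pipeline

# Assumed checkpoint: the multilingual sentiment model used in the Quick Tour.
# Replace it with the model you want to load.
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"

# Load the model and its matching tokenizer from the same checkpoint.
model = TFAutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
classifier("Nous sommes très heureux de vous présenter la bibliothèque 🤗 Transformers.")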
```

See for details:
* [Auto Model](https://huggingface.co/docs/transformers/v4.27.2/en/model_doc/auto)

---

# Fine-Tuning (Transfer Learning)

* [Training TFBertForSequenceClassification with custom X and Y data](https://stackoverflow.com/a/68171171/4281353)

For instance, use the [Sequence Classification](https://huggingface.co/transformers/task_summary.html#sequence-classification) capability of BERT for text classification by fine-tuning the pre-trained BERT model on the data provided.

* [Fine-tuning a pretrained model](https://huggingface.co/transformers/training.html)
> How to fine-tune a pretrained model from the Transformers library. In TensorFlow, models can be directly trained using Keras and the fit method. 

* [Fine-tuning with custom datasets](https://huggingface.co/transformers/custom_datasets.html)
> This tutorial will take you through several examples of using 🤗 Transformers models with your own datasets.<br>
> [Fine-tuning with native PyTorch/TensorFlow](https://huggingface.co/transformers/custom_datasets.html#fine-tuning-with-native-pytorch-tensorflow)
> ```
> from transformers import TFDistilBertForSequenceClassification
> 
> model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
> 
> optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
> model.compile(optimizer=optimizer, loss=model.compute_loss) # can also use any keras loss fn
> model.fit(train_dataset.shuffle(1000).batch(16), epochs=3, batch_size=16)
> ```

* [HuggingFace Text classification examples](https://github.com/huggingface/transformers/tree/master/examples/tensorflow/text-classification)
> This folder contains some scripts showing examples of text classification with the 🤗 Transformers library.

[run_text_classification.py](https://github.com/huggingface/transformers/blob/master/examples/tensorflow/text-classification/run_text_classification.py) is the TensorFlow example script for fine-tuning text classification (see [Fine-tuning with custom datasets](https://huggingface.co/transformers/custom_datasets.html)).
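
The `fit` call quoted earlier in this section trains for 3 epochs on batches of 16. A quick way to reason about training length is to count optimizer steps. The sketch below does that arithmetic, assuming a hypothetical dataset of 1,000 examples (the source does not state a dataset size); `training_steps` is an illustrative helper, not a Keras or Transformers function.

```python
import math


def training_steps(num_examples: int, batch_size: int, epochs: int) -> int:
    """Total optimizer steps when the last partial batch is kept."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


# 1,000 examples is an assumed figure for illustration.
print(training_steps(num_examples=1000, batch_size=16, epochs=3))  # 189
```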