<a href="https://colab.research.google.com/github/osanseviero/khipu_workshop/blob/main/notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Intro 

### Why Transformers?

Deep learning is currently undergoing a period of rapid progress across a wide variety of domains, including: 

* 📖 Natural language processing
* 👀 Computer vision
* 🔊 Audio
* 🧬 Biology
* and many more!

One of the main drivers of these breakthroughs is the **Transformer** -- a **neural network** architecture introduced by Google researchers in 2017.

Here are a few examples of what Transformers can do:

* 💻 They can **generate code** as in products like [GitHub Copilot](https://copilot.github.com/), which is based on OpenAI's family of [GPT models](https://huggingface.co/gpt2?text=My+name+is+Clara+and+I+am).
* ❓ They can be used to **improve search engines**, like [Google did](https://www.blog.google/products/search/search-language-understanding-bert/) with a Transformer called [BERT](https://huggingface.co/bert-base-uncased).
* 🗣️ They can **process speech in multiple languages** to perform speech recognition, speech translation, and language identification. For example, Facebook's [XLS-R model](https://huggingface.co/spaces/facebook/XLS-R-2B-22-16) can automatically transcribe audio in one language to another!

Training these models **from scratch** involves **a lot of resources**: you need large amounts of compute and data, and days of training time 😱.

Fortunately, you don't need to do this in most cases! Thanks to a technique known as **transfer learning**, it is possible to adapt a model that has been trained from scratch (usually called a **pretrained model**) to a variety of downstream tasks. This process is called **fine-tuning** and can typically be carried out with a single GPU and a dataset of the size that you're likely to find in your university or company.

The models that we'll be looking at in this section are all examples of existing fine-tuned models.

Now, Transformers are the coolest kids in town, but how can we use them? If only there was a library that could help us ... oh wait, there is! The [Hugging Face Transformers library](https://github.com/huggingface/transformers) provides a unified API across dozens of Transformer architectures, as well as the means to train models and run inference with them. So to get started, let's install the library with the following command:

In [None]:
%%capture
%pip install transformers[sentencepiece] datasets evaluate gradio

The fastest way to learn what Transformers can do is via the `pipeline()` function. This function loads a model from the Hugging Face Hub and takes care of all the preprocessing and postprocessing steps that are needed to convert inputs into predictions.
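
To make the three stages concrete, here is a toy sketch of the preprocess, model, and postprocess steps that a pipeline chains together. It is not the library's implementation: the vocabulary, weights, and label names below are made up for illustration, whereas a real pipeline loads a tokenizer and a trained model from the Hub.

```python
import math

# Hypothetical toy vocabulary and weights, invented for this sketch only.
TOY_VOCAB = {"triste": 0, "feliz": 1, "estoy": 2}
UNK_ID = 3
# One row per token id: (score for NEG, score for POS).
TOY_WEIGHTS = [(2.0, -1.0), (-1.0, 2.0), (0.0, 0.0), (0.0, 0.0)]
LABELS = ["NEG", "POS"]


def preprocess(text):
    """Preprocessing: turn raw text into a list of token ids."""
    return [TOY_VOCAB.get(word, UNK_ID) for word in text.lower().split()]


def forward(token_ids):
    """Model: sum the per-token scores into one logit per label."""
    logits = [0.0] * len(LABELS)
    for token_id in token_ids:
        for i, weight in enumerate(TOY_WEIGHTS[token_id]):
            logits[i] += weight
    return logits


def postprocess(logits):
    """Postprocessing: softmax the logits and report the best label."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": LABELS[best], "score": probs[best]}


def toy_pipeline(text):
    return postprocess(forward(preprocess(text)))


print(toy_pipeline("estoy feliz"))
print(toy_pipeline("estoy triste"))
```

The real `pipeline()` follows the same overall shape, but with a learned tokenizer and a Transformer in place of the lookup tables.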

Let's start with a basic sentiment analysis task, leveraging a pretrained model from the Hugging Face Hub to categorize the following snippet according to its sentiment (positive or negative):

In [None]:
text = """Estimado Amazon, la semana pasada pedí una figura de acción de \
Optimus Prime de su tienda online en Alemania. Desafortunadamente, cuando abrí \
el paquete descubrí con un gran horror que me habían enviado una figura de acción de \
Megatron en su lugar. Como enemigo de toda la vida de los Decepticons, espero \
que puedan entender mi dilema. Para resolver el problema, exijo un intercambio \
de Megatron por la figura de Optimus Prime que pedí. Adjunto copias de mis \
registros relacionados con esta compra. Espero tener noticias suyas pronto. \
Sinceramente, Bumblebee."""

In [None]:
from transformers import pipeline

sentiment_pipeline =  pipeline('text-classification', 
                              model="pysentimiento/robertuito-sentiment-analysis")

In [None]:
sentiment_pipeline(text)

In [None]:
sentiment_pipeline(["estoy triste", "estoy feliz", "gran workshop!"])

Let's now do something a little more sophisticated. Instead of just finding the overall sentiment, let's see if we can extract **entities** such as organizations, locations, or individuals from the text. This task is called named entity recognition, or NER for short. Instead of predicting a single class for the whole text, **a class is predicted for each token**, as shown in the example below:

In [None]:
ner_pipeline = pipeline('ner', model="mrm8488/bert-spanish-cased-finetuned-ner")

In [None]:
entities = ner_pipeline(text, aggregation_strategy="simple")
print(entities)

In [None]:
for entity in entities:
    print(f"{entity['word']}: {entity['entity_group']} ({entity['score']:.2f})")
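
Since the model predicts one class per token, something has to merge neighbouring tokens into whole entities. The sketch below shows roughly what a simple aggregation does, assuming the per-token tags follow the common `B-`/`I-` scheme; the token predictions and scores are made up, and `group_entities` is a hypothetical helper rather than a library function.

```python
def group_entities(token_predictions):
    """Merge consecutive B-/I- tagged tokens of the same type into one entity."""
    entities = []
    current = None
    for pred in token_predictions:
        tag = pred["entity"]
        if tag == "O":  # token is outside any entity
            current = None
            continue
        prefix, entity_type = tag.split("-", 1)
        if prefix == "I" and current is not None and current["entity_group"] == entity_type:
            # Continuation of the entity we are currently building.
            current["word"] += " " + pred["word"]
            current["scores"].append(pred["score"])
        else:
            # Start of a new entity.
            current = {"entity_group": entity_type, "word": pred["word"], "scores": [pred["score"]]}
            entities.append(current)
    # Report the mean token score as the score of each grouped entity.
    return [
        {
            "entity_group": e["entity_group"],
            "word": e["word"],
            "score": sum(e["scores"]) / len(e["scores"]),
        }
        for e in entities
    ]


# Made-up per-token predictions for illustration.
toy_predictions = [
    {"word": "Estimado", "entity": "O", "score": 0.99},
    {"word": "Amazon", "entity": "B-ORG", "score": 0.95},
    {"word": "Optimus", "entity": "B-MISC", "score": 0.90},
    {"word": "Prime", "entity": "I-MISC", "score": 0.80},
    {"word": "Alemania", "entity": "B-LOC", "score": 0.98},
]

for entity in group_entities(toy_predictions):
    print(f"{entity['word']}: {entity['entity_group']} ({entity['score']:.2f})")
```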

We can also leverage one of the existing machine translation models on the Hugging Face Hub to automatically translate the snippet from Spanish to English -- let's see how it does!

In [None]:
translator = pipeline("translation_es_to_en", model="Helsinki-NLP/opus-mt-es-en")
outputs = translator(text, clean_up_tokenization_spaces=True, min_length=100)
print(outputs[0]['translation_text'])

There are other useful models on the Hub as well, for instance the
[XLM-RoBERTa-large-XNLI-ANLI](https://huggingface.co/vicgalle/xlm-roberta-large-xnli-anli) model, which is a large language model (XLM-RoBERTa-large) that was fine-tuned on several natural language inference datasets and is intended to be used for zero-shot classification in multiple languages:

In [None]:
zero_shot_classifier = pipeline("zero-shot-classification",
                                model="vicgalle/xlm-roberta-large-xnli-anli")

In [None]:
text = "Algún día iré a ver el mundo"
classes = ['viaje', 'cocina', 'danza']

In [None]:
zero_shot_classifier(text, classes, multi_label=True)
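
It may help to see how a natural language inference (NLI) model can score arbitrary labels. A common approach, sketched below with made-up logits, is to turn each candidate label into a hypothesis (for example "This example is viaje.") and read off how strongly the text entails it. The numbers and the `nli_logits` dictionary are assumptions for illustration, not real model outputs.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    highest = max(values)
    exps = [math.exp(v - highest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up (contradiction, entailment) logits, one pair per candidate label.
nli_logits = {
    "viaje": (-2.0, 3.0),
    "cocina": (2.5, -1.5),
    "danza": (1.0, -1.0),
}

# With multi_label=True each label is scored on its own:
# entailment vs. contradiction, so the scores need not sum to 1.
multi_label_scores = {
    label: softmax([contradiction, entailment])[1]
    for label, (contradiction, entailment) in nli_logits.items()
}

# With multi_label=False the entailment logits compete across labels,
# so the scores sum to 1.
single_label_scores = dict(
    zip(nli_logits, softmax([entailment for _, entailment in nli_logits.values()]))
)

print(multi_label_scores)
print(single_label_scores)
```

This is why `multi_label=True` suits cases where several classes can apply to the same text at once.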

## Datasets

In this section we'll learn the basics of the `datasets` library.

We can load dataset builders from the Hub to show information about datasets without loading the full thing, such as the dataset description:

In [None]:
from datasets import load_dataset_builder
ds_builder = load_dataset_builder("amazon_reviews_multi")

ds_builder.info.description

We can also inspect the dataset features, which helps us decide whether this dataset is useful for us:

In [None]:
ds_builder.info.features

Once we've decided that this dataset works for our purpose, we can then load the whole thing:

In [None]:
from datasets import load_dataset

dataset = load_dataset("amazon_reviews_multi", "es")
dataset

In [None]:
dataset["train"][0]["review_body"]

In [None]:
dataset["train"][0]["stars"]

In [None]:
dataset.set_format("pandas")
df = dataset["train"][:]
df.head()

In [None]:
df["stars"].value_counts()

In [None]:
dataset.reset_format()

## Fine-tuning

In order to access all of the features of the Hub, you'll need to create an account at https://huggingface.co/ -- you can then create an Access Token and use it to log in directly from the notebook. This is a free service and will allow you to share your models.

In [None]:
from huggingface_hub import notebook_login

notebook_login()

As we want to train a binary classifier, we will map the number of stars of each review to a label of 0 or 1. We discard the neutral reviews (3 stars).

In [None]:
dataset = dataset.filter(lambda x : x["stars"] != 3)

def merge_star_ratings(examples):
    if examples["stars"] <= 2:
        label = 0
    else:
        label = 1
    return {"labels": label}

dataset = dataset.map(merge_star_ratings)
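
As a quick sanity check of the mapping, the same logic can be run on a handful of plain dictionaries. The reviews below are made up, and this sketch uses ordinary Python lists instead of the `datasets` API.

```python
# Made-up reviews covering every star rating.
toy_reviews = [{"stars": stars} for stars in (1, 2, 3, 4, 5)]

# Drop the neutral 3-star reviews, as in the filter step above.
kept = [review for review in toy_reviews if review["stars"] != 3]


def merge_star_ratings(example):
    """Map 1-2 stars to label 0 (negative) and 4-5 stars to label 1 (positive)."""
    return {"labels": 0 if example["stars"] <= 2 else 1}


for review in kept:
    print(review["stars"], "->", merge_star_ratings(review)["labels"])
```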

Next, we can load a pretrained model (RoBERTa base) from the Hub and fine-tune it. We first load its tokenizer, which we will use to tokenize the dataset we loaded above:

In [None]:
model_checkpoint = "BSC-TeMU/roberta-base-bne"

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

This is what it will look like once a text has been tokenized: 

In [None]:
text = "¡hola, me llamo Omar!"
tokenized_text = tokenizer.encode(text)

for token in tokenized_text:
  print(token, tokenizer.decode([token]))

In [None]:
encoded_text = tokenizer(text, return_tensors="pt")
encoded_text

We can then write a tokenization function and use it to tokenize the whole dataset:

In [None]:
def tokenize_reviews(examples):
  return tokenizer(examples["review_body"], truncation=True)

columns = dataset["train"].column_names
columns.remove("labels")
encoded_dataset = dataset.map(tokenize_reviews, batched=True, remove_columns=columns)

In [None]:
encoded_dataset

In [None]:
encoded_dataset["train"][0]
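
Note that the tokenization function truncates long reviews but does not pad short ones, so the examples have different lengths. Padding is usually applied later, one batch at a time, to the length of the longest example in that batch. The sketch below illustrates the idea in plain Python; `pad_batch`, the token ids, and the padding id are made up for illustration.

```python
def pad_batch(features, pad_token_id):
    """Pad every example in the batch to the length of the longest one."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # Padding positions get a 0 so the model ignores them.
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
        batch["labels"].append(f["labels"])
    return batch


# Made-up tokenized examples of different lengths.
toy_features = [
    {"input_ids": [0, 51, 72, 2], "attention_mask": [1, 1, 1, 1], "labels": 1},
    {"input_ids": [0, 88, 2], "attention_mask": [1, 1, 1], "labels": 0},
]

padded = pad_batch(toy_features, pad_token_id=1)  # hypothetical padding id
print(padded)
```

Padding per batch rather than to a global maximum keeps batches of short reviews small, which saves compute.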

We can then load the pretrained model itself and define how many labels we want it to have:

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=2)

In [None]:
outputs = model(**encoded_text)
outputs
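
The model returns raw scores (logits), one per label. A softmax turns them into probabilities. The logits below are made up; since the classification head has not been trained yet at this point, values close to 50/50 are what one would typically expect.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    highest = max(values)
    exps = [math.exp(v - highest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [0.12, -0.08]  # made-up logits for labels 0 and 1
probabilities = softmax(toy_logits)
print(probabilities)
```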

We can then define the training arguments, as well as the model ID for the Hub (the model will be pushed under your own username).

In [None]:
from transformers import TrainingArguments

model_name = model_checkpoint.split("/")[-1]

batch_size = 16
num_train_epochs=2
num_train_samples = 1000
train_dataset = encoded_dataset["train"].shuffle(seed=42).select(range(num_train_samples))
logging_steps = len(train_dataset) // (2 * batch_size * num_train_epochs)

training_args = TrainingArguments(
    output_dir="results",
    num_train_epochs=num_train_epochs,     
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch", 
    logging_steps=logging_steps,
    push_to_hub=True,
    hub_model_id="khipu-finetuned-amazon_reviews_multi",  # repository name on the Hub
)
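
To see what the `logging_steps` formula works out to, here is the arithmetic for the values above, assuming a single device and no gradient accumulation:

```python
import math

num_train_samples = 1000
batch_size = 16
num_train_epochs = 2

# Optimizer steps per epoch: the last, smaller batch still counts as a step.
steps_per_epoch = math.ceil(num_train_samples / batch_size)
total_steps = steps_per_epoch * num_train_epochs

# Same formula as in the cell above.
logging_steps = num_train_samples // (2 * batch_size * num_train_epochs)

print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps: {total_steps}")
print(f"training loss is logged every {logging_steps} steps")
```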

Before starting model training, we'll define the way to compute metrics with `evaluate`, a library for doing systematic and principled evaluation of ML models (which means that you can do all sorts of model evaluations without writing custom scripts!)

For this purpose, we will only be evaluating our model's accuracy, but keep in mind that there are tons of other metrics you can use (see the full list [here](https://huggingface.co/metrics)!)

In [None]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")
metric

In [None]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return metric.compute(predictions=predictions, references=labels)
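
The same computation can be traced by hand on a tiny batch. The sketch below uses made-up logits and labels and plain Python instead of `numpy` and `evaluate`, purely to show what the argmax and the accuracy amount to.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


# Made-up logits for four reviews (columns: label 0, label 1).
toy_logits = [[2.0, -1.0], [0.3, 0.9], [-0.5, 0.1], [1.2, 0.4]]
toy_labels = [0, 1, 0, 0]

# Pick the highest-scoring label per review, then count the matches.
toy_predictions = [argmax(row) for row in toy_logits]
correct = sum(p == l for p, l in zip(toy_predictions, toy_labels))
accuracy = correct / len(toy_labels)

print(toy_predictions)
print({"accuracy": accuracy})
```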

Now let's start training! 🚀

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model, 
    args=training_args, 
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=encoded_dataset["validation"],
    tokenizer=tokenizer
)

In [None]:
trainer.train()

We can then push both the trained model and the tokenizer to the Hub so we can use them later:

In [None]:
trainer.push_to_hub()

And... magic!✨ We can directly use our fine-tuned model in a `pipeline`! 😮

In [None]:
from transformers import pipeline 

pipe = pipeline("text-classification", model="osanseviero/khipu-finetuned-amazon_reviews_multi", return_all_scores=True)

We can then take a new review (that's not from the Amazon reviews dataset) and feed it through our fine-tuned model to get its sentiment:

In [None]:
text = """Estimado Amazon, la semana pasada pedí una figura de acción de \
Optimus Prime de su tienda online en Alemania. Desafortunadamente, cuando abrí \
el paquete descubrí con un gran horror que me habían enviado una figura de acción de \
Megatron en su lugar. Como enemigo de toda la vida de los Decepticons, espero \
que puedan entender mi dilema. Para resolver el problema, exijo un intercambio \
de Megatron por la figura de Optimus Prime que pedí. Adjunto copias de mis \
registros relacionados con esta compra. Espero tener noticias suyas pronto. \
Sinceramente, Bumblebee."""
pipe(text)

## Demo

Now let's look at creating easy, interactive ML demos using [Gradio](https://gradio.app/)!

You can write any custom function and then wrap it using `gr.Interface`, specifying what kind of elements it takes as input and produces as output -- e.g. text, image, sound, etc.
This automatically creates a simple, interactive interface, and launches it directly in your notebook:

In [None]:
import numpy as np
import gradio as gr

def greet(name):
    return f"Hello {name}"

gr.Interface(
    greet,
    "text",
    "text",
    title="Greet!",
    allow_flagging=False
).launch()

You can wrap the pipeline we fine-tuned above in an interface in the same way, which allows people to use it interactively by feeding it text and getting the labels out:

In [None]:
import gradio as gr

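def predict(review):
    # Convert the list of {"label", "score"} dicts into a {label: score} mapping,
    # which is the format the "label" output component expects.
    res = {}
    for pred in pipe(review)[0]:
        res[pred["label"]] = pred["score"]
    return res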

gr.Interface(
    predict,
    "text",
    "label",
    title="Classify!",
    examples=[[text]],
    allow_flagging=False
).launch(debug=True)
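
To see the reshaping step in isolation, the sketch below applies it to a made-up pipeline output. The nested-list shape and the `LABEL_0`/`LABEL_1` names are assumptions about what a two-label classifier without custom label names returns when all scores are requested; `scores_by_label` is a hypothetical helper.

```python
def scores_by_label(pipeline_output):
    """Turn [[{"label": ..., "score": ...}, ...]] into {label: score}."""
    return {pred["label"]: pred["score"] for pred in pipeline_output[0]}


# Made-up output for a single input text.
fake_output = [[
    {"label": "LABEL_0", "score": 0.97},
    {"label": "LABEL_1", "score": 0.03},
]]

result = scores_by_label(fake_output)
print(result)
```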