[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/juanhuguet/intro_to_nlp/blob/main/notebooks/05-transformers-text-classification.ipynb)

# Text classification using transformers

We have seen the power and effectiveness of dense word embeddings.

However, they have two weaknesses:

* the **word representations are contextless**, and
* they do **not capture long-term dependencies between words** due to the limited span of the sliding window.

To illustrate, consider the following two sentences:

> An apple a day keeps the doctor away!
> 

> My doctor buys Apple stocks every day
> 

As humans, we can quickly tell the two “apples” apart because we are aware of context.

The first sentence refers to a fruit, whereas the second refers to a well-known company. A contextless embedding, in contrast, assigns the same vector to both.
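The limitation can be made concrete with a minimal sketch of a static embedding lookup. The vocabulary, the random 4-dimensional vectors, and the `embed` helper are made up for illustration and do not come from any trained model.

```python
import numpy as np

# Toy static embedding table (random vectors, for illustration only).
rng = np.random.default_rng(0)
vocab = ["an", "apple", "a", "day", "keeps", "the", "doctor", "away",
         "my", "buys", "stocks", "every"]
embeddings = {word: rng.normal(size=4) for word in vocab}

def embed(sentence):
    """Look up one fixed vector per lowercased token, ignoring its neighbours."""
    tokens = sentence.lower().replace("!", "").split()
    return {token: embeddings[token] for token in tokens}

fruit = embed("An apple a day keeps the doctor away!")
company = embed("My doctor buys Apple stocks every day")

# The lookup cannot tell the fruit from the company: both vectors are identical.
same_vector = bool(np.array_equal(fruit["apple"], company["apple"]))
print(same_vector)
```

A contextual model such as a transformer would instead compute the vector for "apple" from the whole sentence, so the two occurrences would generally differ.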

# Technology landscape

[Transformers library](https://huggingface.co/docs/transformers/index)

<img src="https://file.notion.so/f/f/003df94c-172d-46b4-9c84-4a2f90ef0ed1/b6f0d5fb-306a-41fd-9df7-f623e0fed832/Screenshot_2023-05-04_at_03.23.55.png?id=f370c67d-f17e-4a86-ad54-b12dbe892c6b&table=block&spaceId=003df94c-172d-46b4-9c84-4a2f90ef0ed1&expirationTimestamp=1705687200000&signature=ZKtSsNyKJMIBn5W_QzSFu5p5NQhp2ncr6Epj_qkX48w&downloadName=Screenshot+2023-05-04+at+03.23.55.png" width="400" height="200">

Whenever you face a new NLP task, review what the community has already done, choose the best base model, and adapt it to your needs. Most of the time you will find an out-of-the-box solution that performs well enough or needs only a little fine-tuning.

## Transfer learning and fine tuning

**Pre-training**, **fine-tuning**, and **transfer learning**:

Large pre-trained transformer models have become the basis of many state-of-the-art NLP systems.

By pre-training a large transformer model on massive amounts of text data, and then fine-tuning it on smaller datasets for specific tasks, it is possible to achieve highly effective transfer learning. 

This approach has been used to achieve state-of-the-art results on a wide range of NLP tasks, including language modeling, sentiment analysis, machine translation, and more.

### Transfer learning

Transfer learning is the general idea of reusing the knowledge in a pre-trained model for a new task. Fine-tuning is one way of doing it: the pre-trained model's parameters are adjusted further on task-specific data. The alternative is to apply the pre-trained model as it is, keeping its weights frozen, either directly or as a feature extractor under a small new task-specific layer.
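In practice, the distinction largely comes down to which parameters are allowed to change. The following is a minimal NumPy sketch under toy assumptions: `W_enc` is a random stand-in for pre-trained weights, `w_head` is a new task-specific head, and the data is synthetic. It is not how the Transformers library implements training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 64 examples, 8 features, binary labels (illustration only).
X = rng.normal(size=(64, 8))
y = (X[:, 0] > 0).astype(float)

def train(update_encoder, steps=50, lr=0.1):
    """Train a tiny encoder + head; the encoder only moves if update_encoder is True."""
    W_enc = np.random.default_rng(1).normal(size=(8, 4))  # stand-in "pre-trained" weights
    w_head = np.zeros(4)                                   # new task-specific head
    W_start = W_enc.copy()
    for _ in range(steps):
        h = np.tanh(X @ W_enc)                     # encoder features
        p = 1.0 / (1.0 + np.exp(-(h @ w_head)))    # head prediction (sigmoid)
        err = (p - y) / len(y)                     # gradient of mean cross-entropy w.r.t. logits
        grad_head = h.T @ err
        grad_enc = X.T @ (np.outer(err, w_head) * (1.0 - h ** 2))
        w_head -= lr * grad_head
        if update_encoder:
            W_enc -= lr * grad_enc
    # Report how far the encoder moved from its starting weights.
    return float(np.abs(W_enc - W_start).max()), w_head

frozen_shift, frozen_head = train(update_encoder=False)  # frozen encoder, head only
tuned_shift, tuned_head = train(update_encoder=True)     # fine-tuning: everything moves
print(frozen_shift, tuned_shift)
```

With the encoder frozen, only the head learns and the "pre-trained" weights stay exactly where they started. With fine-tuning, both sets of weights are updated.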

<img src="https://file.notion.so/f/f/003df94c-172d-46b4-9c84-4a2f90ef0ed1/00ccfd76-c52d-45e8-9241-83523de6f02d/Screenshot_2023-05-04_at_03.20.21.png?id=ee4650de-c979-4395-b519-129e4e0d937f&table=block&spaceId=003df94c-172d-46b4-9c84-4a2f90ef0ed1&expirationTimestamp=1705687200000&signature=jrq6g5xV0wH8eLlpNCZAbLXMwDkGYMQlFkXYYKKMNjQ&downloadName=Screenshot+2023-05-04+at+03.20.21.png" width="400" height="200">

# Let's load a dataset

Browse the datasets available on the Hugging Face Hub:

https://huggingface.co/datasets

We will use the `yelp_review_full` dataset, in which each review is labelled with a rating from 1 to 5 stars.

In [None]:
from datasets import load_dataset

In [None]:
dataset = load_dataset("yelp_review_full")

In [None]:
dataset

In [None]:
dataset["train"][0]

In [None]:
dataset["test"][0]

# Out of the box classifier

* Transformers has a layered API that allows you to interact with the library at various levels of abstraction.

* `pipelines` abstract away all the steps needed to turn raw text into a set of predictions from a fine-tuned model.

* `pipelines` support many NLP tasks out of the box and can be used with a variety of models.

[model for sentiment classification](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)

[behind the pipeline](https://huggingface.co/learn/nlp-course/chapter2/2?fw=pt)
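Behind a pipeline there are roughly three stages: tokenization, a model forward pass, and post-processing of the logits. The toy sketch below mimics that flow with a made-up vocabulary and hand-written weights; `VOCAB`, `WEIGHTS`, and `toy_pipeline` are hypothetical names and not part of the Transformers API.

```python
import numpy as np

# Hypothetical toy vocabulary and weights; a real pipeline loads these from a checkpoint.
VOCAB = {"[UNK]": 0, "great": 1, "awful": 2, "food": 3, "service": 4}
LABELS = ["NEGATIVE", "POSITIVE"]
# One row of (negative, positive) scores per vocabulary entry.
WEIGHTS = np.array([[0.0, 0.0], [-2.0, 2.0], [2.0, -2.0], [0.0, 0.0], [0.0, 0.0]])

def tokenize(text):
    """Step 1: raw text -> input ids."""
    return [VOCAB.get(tok, VOCAB["[UNK]"]) for tok in text.lower().split()]

def model(input_ids):
    """Step 2: input ids -> logits (here just a sum of per-token scores)."""
    return WEIGHTS[input_ids].sum(axis=0)

def postprocess(logits):
    """Step 3: logits -> softmax probabilities -> best label and its score."""
    exp = np.exp(logits - logits.max())
    probs = exp / exp.sum()
    best = int(probs.argmax())
    return {"label": LABELS[best], "score": float(probs[best])}

def toy_pipeline(text):
    return postprocess(model(tokenize(text)))

positive = toy_pipeline("great food")
negative = toy_pipeline("awful service")
print(positive, negative)
```

The output mirrors the shape of a text classification pipeline result: a label plus a score between 0 and 1.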

<img src="https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/full_nlp_pipeline.svg" width="800" height="200">

In [None]:
from transformers import pipeline

#use a text classification pipeline
classifier = pipeline(...,
                      model="distilbert-base-uncased-finetuned-sst-2-english",
                     )

In [None]:
classifier(...)

In [None]:
classifier(...)

* This works well out of the box, but the model only predicts positive or negative sentiment, so it lacks the granularity of the five star ratings in our training data.

## Fine tuning your own model

In [None]:
model = "distilbert-base-cased"

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model)

In [None]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

In [None]:
tokenized_datasets = dataset.map(tokenize_function, batched=True)
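To see what `padding="max_length"` and `truncation=True` do to a single sequence, here is a simplified pure-Python sketch. The helper `pad_or_truncate` is hypothetical, the ids are made up, and a padding id of 0 is an assumption. A real tokenizer also keeps the closing special token when it truncates, which this sketch ignores.

```python
def pad_or_truncate(input_ids, max_length, pad_id=0):
    """Mimic padding="max_length" with truncation=True for one list of token ids."""
    ids = input_ids[:max_length]                   # truncation: drop ids past max_length
    attention_mask = [1] * len(ids)                # 1 marks a real token
    n_pad = max_length - len(ids)
    ids = ids + [pad_id] * n_pad                   # padding: fill up to max_length
    attention_mask = attention_mask + [0] * n_pad  # 0 marks padding to be ignored
    return {"input_ids": ids, "attention_mask": attention_mask}

short_seq = pad_or_truncate([101, 7, 8, 102], max_length=6)
long_seq = pad_or_truncate([101, 7, 8, 9, 10, 11, 12, 102], max_length=6)
print(short_seq)
print(long_seq)
```

Every example ends up with the same length, which is what allows the examples to be stacked into fixed-size batches during training.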

### Let's inspect what the tokenizer has done

In [None]:
dataset

In [None]:
tokenized_datasets

In [None]:
dataset["train"][0]

In [None]:
tokenized_datasets["train"][0]

In [None]:
tokenizer.convert_ids_to_tokens(1917)

In [None]:
len(tokenizer.vocab)

In [None]:
tokenizer.vocab["everything"]

## Now that our text is tokenized, let's train our model using the high-level wrapper

In [None]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model,
                                                           num_labels=5
                                                          )

The warning tells us that the pre-trained head of the DistilBERT model is discarded and replaced with a randomly initialized classification head.

You will fine-tune this new model head on your sequence classification task, transferring the knowledge of the pretrained model to it.

In [None]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")

In [None]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
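As a quick sanity check of what `compute_metrics` computes, the same argmax-then-compare logic can be run by hand on a few made-up logits. The numbers below are invented for illustration, and the accuracy is computed directly with NumPy rather than through `evaluate`.

```python
import numpy as np

# Made-up logits for 4 reviews and 5 classes.
logits = np.array([
    [2.0, 0.1, 0.0, -1.0, -2.0],
    [0.0, 0.2, 0.1, 3.0, 0.5],
    [-1.0, 0.0, 0.3, 0.2, 2.5],
    [0.1, 1.5, 0.2, 0.0, -0.3],
])
labels = np.array([0, 3, 3, 1])

# The predicted class is the index of the largest logit in each row.
predictions = np.argmax(logits, axis=-1)

# Accuracy is the fraction of predictions that match the reference labels.
accuracy = float((predictions == labels).mean())
print(predictions, accuracy)
```

Three of the four predictions match their labels, so the accuracy here is 0.75.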

In [None]:
from transformers import TrainingArguments, Trainer

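training_args = TrainingArguments(output_dir="test_trainer",
                                  # evaluate every `eval_steps` steps (defaults to `logging_steps`);
                                  # recent transformers releases rename this argument to `eval_strategy`
                                  evaluation_strategy="steps",
                                  num_train_epochs=1,
                                  logging_steps=30,
                                  # Apple Silicon (MPS) only; remove on Colab or CUDA machines
                                  use_mps_device=True, )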

In [None]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,)

In [None]:
trainer.train()

In [None]:
trainer.save_model("custom_model")
tokenizer.save_pretrained("custom_model")

## Now let's load the model into a pipeline and run it on some examples

In [None]:
clf = pipeline("text-classification", model="custom_model")

In [None]:
clf("The movie is great")

In [None]:
n = 89

In [None]:
tokenized_datasets["test"][n]["text"], tokenized_datasets["test"][n]["label"]

In [None]:
clf(tokenized_datasets["test"][n]["text"])
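Since no label names were set on the model, the pipeline is expected to report generic names such as `LABEL_3`. The small helper below turns them back into star ratings. It is a sketch that assumes class `k` corresponds to `k + 1` stars, as in the dataset labels. The `label_to_stars` name and the example output are hypothetical.

```python
def label_to_stars(prediction):
    """Convert {"label": "LABEL_3", ...} into a 1-5 star rating (assumes class k = k + 1 stars)."""
    class_id = int(prediction["label"].split("_")[-1])
    return class_id + 1

# Hypothetical pipeline output, written by hand for illustration.
example_output = [{"label": "LABEL_3", "score": 0.61}]
stars = [label_to_stars(p) for p in example_output]
print(stars)
```

The same helper can be applied to the list returned by `clf(...)` to compare the predicted rating with `label + 1` from the test set.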