[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/juanhuguet/intro_to_nlp/blob/main/notebooks/05-transformers-text-classification.ipynb)

# Text classification using transformers

# Let's load a dataset

Check the datasets available on the Hugging Face Hub:

https://huggingface.co/datasets

We will use the Yelp reviews dataset (`yelp_review_full`).

In [1]:
from datasets import load_dataset

In [2]:
dataset = load_dataset("yelp_review_full")

Found cached dataset yelp_review_full (/Users/jhuguet/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf)



In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})

In [4]:
dataset["train"][0]

{'label': 4,
 'text': "dr. goldberg offers everything i look for in a general practitioner.  he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first.  really, what more do you need?  i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank."}

In [5]:
dataset["test"][0]

{'label': 0,
 'text': 'I got \'new\' tires from them and within two weeks got a flat. I took my car to a local mechanic to see if i could get the hole patched, but they said the reason I had a flat was because the previous patch had blown - WAIT, WHAT? I just got the tire and never needed to have it patched? This was supposed to be a new tire. \\nI took the tire over to Flynn\'s and they told me that someone punctured my tire, then tried to patch it. So there are resentful tire slashers? I find that very unlikely. After arguing with the guy and telling him that his logic was far fetched he said he\'d give me a new tire \\"this time\\". \\nI will never go back to Flynn\'s b/c of the way this guy treated me and the simple fact that they gave me a used tire!'}

# Out of the box classifier

* Transformers has a layered API that allows you to interact with the library at various levels of abstraction.

* `pipelines` abstract away all the steps needed to convert raw text into a set of predictions from a fine-tuned model

* `pipelines` support many out-of-the-box NLP tasks and can be used with a variety of models
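As a rough illustration of the last of those steps, the sketch below turns raw model logits into the `label`/`score` dictionary a text-classification pipeline returns. It is not the library's implementation: `postprocess` and the example logits are made up for illustration, and only NumPy is assumed.

```python
import numpy as np


def postprocess(logits, id2label):
    """Convert one row of raw logits into a {'label', 'score'} dict."""
    logits = np.asarray(logits, dtype=float)
    # Numerically stable softmax: subtract the max before exponentiating.
    exp = np.exp(logits - logits.max())
    probs = exp / exp.sum()
    best = int(probs.argmax())
    return {"label": id2label[best], "score": float(probs[best])}


# Hypothetical logits for a binary sentiment model (index 0 = NEGATIVE).
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
result = postprocess([4.2, -3.9], id2label)
print(result["label"])
```

The `score` is simply the softmax probability of the winning class, which is why confident predictions land very close to 1.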

In [6]:
from transformers import pipeline

classifier = pipeline("text-classification",
                      model="distilbert-base-uncased-finetuned-sst-2-english",
                     )

In [7]:
classifier("this is a bad movie")

[{'label': 'NEGATIVE', 'score': 0.9997872710227966}]

In [8]:
classifier(dataset["test"][0]["text"])

[{'label': 'NEGATIVE', 'score': 0.9980941414833069}]

* The off-the-shelf classifier works well, but it only predicts two classes (`POSITIVE` / `NEGATIVE`), so it lacks the granularity of the five-level labels (0 to 4) in our training data
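One way to make the granularity gap concrete is to collapse the five Yelp labels into the two classes the off-the-shelf model can emit. The sketch below assumes labels 0 to 4 correspond to 1 to 5 stars and that a 3-star review has no clear binary sentiment; the helper name and the thresholds are illustrative choices, not part of the dataset or the library.

```python
def stars_to_sentiment(label):
    """Map a yelp_review_full label (0-4) to a coarse sentiment string.

    Assumption: label k means k + 1 stars; 3 stars is treated as neutral.
    """
    stars = label + 1
    if stars <= 2:
        return "NEGATIVE"
    if stars >= 4:
        return "POSITIVE"
    return None  # no binary equivalent for the middle rating


# The two examples inspected earlier had labels 4 (train) and 0 (test).
print(stars_to_sentiment(4), stars_to_sentiment(0), stars_to_sentiment(2))
```

Under this mapping, a binary classifier can at best tell 1-2 stars from 4-5 stars; it cannot separate, say, 4 stars from 5.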

## Fine-tuning your own model

In [9]:
model = "distilbert-base-cased"

In [10]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model)


In [11]:
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

In [12]:
tokenized_datasets = dataset.map(tokenize_function, batched=True)

Loading cached processed dataset at /Users/jhuguet/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf/cache-98e6c10f36dee458.arrow
Loading cached processed dataset at /Users/jhuguet/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf/cache-0da6a9754f8874b4.arrow


### Let's inspect what the tokenizer has done

In [13]:
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 50000
    })
})

In [14]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 650000
    })
    test: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 50000
    })
})
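The two new columns (`input_ids` and `attention_mask`) come from the tokenizer call. As a toy sketch of what `padding="max_length"` and `truncation=True` do to a sequence of ids, the function below pads or cuts a list to a fixed length and builds the matching attention mask. The function name, the pad id of 0, and `max_length=8` are assumptions for illustration; the real tokenizer uses the model's own maximum length and also keeps the special tokens in place when truncating.

```python
def pad_or_truncate(input_ids, max_length=8, pad_id=0):
    """Return fixed-length ids plus a mask marking real tokens with 1."""
    ids = list(input_ids)[:max_length]            # truncation
    attention_mask = [1] * len(ids)
    padding = max_length - len(ids)
    ids += [pad_id] * padding                     # padding up to max_length
    attention_mask += [0] * padding
    return {"input_ids": ids, "attention_mask": attention_mask}


# Example ids (the first five of the tokenized review in this notebook).
short = pad_or_truncate([101, 173, 1197, 119, 2284])
long = pad_or_truncate(range(100, 120))
print(short)
print(len(long["input_ids"]))
```

Every example ends up with the same length, which is what lets the `Trainer` stack them into batches; the mask tells the model which positions are padding and should be ignored.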

In [15]:
dataset["train"][0]

{'label': 4,
 'text': "dr. goldberg offers everything i look for in a general practitioner.  he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first.  really, what more do you need?  i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank."}

In [16]:
tokenized_datasets["train"][0]

{'label': 4,
 'text': "dr. goldberg offers everything i look for in a general practitioner.  he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first.  really, what more do you need?  i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank.",
 'input_ids': [101, 173, 1197, 119, 2284, 2953, 3272, 1917, 178, 1440, 1111, 1107, 170, 1704,
  22351, 119, 1119, 112, 188, 3505, 1105, 3123, 1106, 2037, 1106, 1443, 1217, 10063, 4404, 132,
  1119, 112, 188, 1579, 1113, 1159, 1107, 3195, 1117, 4420, 132, 1119, 112, 188, 6559, 1114, 170,
  1499, 118, 23555, 2704, 113, 183, 9379, 114,
  1

In [17]:
tokenizer.convert_ids_to_tokens(1917)

'everything'

In [18]:
len(tokenizer.vocab)

28996

In [19]:
tokenizer.vocab["everything"]

1917
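The lookup above works for a whole word that is in the vocabulary; words that are missing get split into sub-word pieces. BERT-style tokenizers do this with a greedy longest-match-first search (WordPiece), which the sketch below reproduces on a tiny made-up vocabulary. `toy_vocab` and `wordpiece` are hypothetical names, and the real tokenizer additionally handles casing, punctuation, and special tokens.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Tiny illustrative vocabulary, not the real 28,996-entry one.
toy_vocab = {"everything", "gold", "##berg", "d", "##r"}
print(wordpiece("everything", toy_vocab))
print(wordpiece("goldberg", toy_vocab))
print(wordpiece("dr", toy_vocab))
```

This is why a short review can produce more ids than it has words: each unfamiliar word contributes several pieces.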

## Now that our text is tokenized, let's train our model using the high-level `Trainer` wrapper

In [20]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(model,
                                                           num_labels=5
                                                          )

Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_transform.weight', 'vocab_projector.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'classifier.b

The warning tells us that the pretrained head of the DistilBERT model is discarded and replaced with a randomly initialized classification head.

You will fine-tune this new model head on your sequence classification task, transferring the knowledge of the pretrained model to it.
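To give a feel for what a randomly initialized classification head is, here is a NumPy sketch of a small head mapping one 768-dimensional sentence representation to 5 logits. The layer names echo the ones in the warning (`pre_classifier`, `classifier`), but the shapes, the ReLU, the initialization scale, and the random input vector are assumptions for illustration rather than the library's code.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, num_labels = 768, 5

# Randomly initialized head parameters (small random weights, zero biases).
pre_classifier_w = rng.normal(0.0, 0.02, size=(hidden_size, hidden_size))
pre_classifier_b = np.zeros(hidden_size)
classifier_w = rng.normal(0.0, 0.02, size=(hidden_size, num_labels))
classifier_b = np.zeros(num_labels)

# Stand-in for the pretrained encoder's sentence representation.
sentence_vector = rng.normal(size=hidden_size)

# Linear layer + ReLU, then the final linear layer producing one logit per label.
hidden = np.maximum(sentence_vector @ pre_classifier_w + pre_classifier_b, 0.0)
logits = hidden @ classifier_w + classifier_b

# Softmax over the 5 labels.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(logits.shape, probs.round(3))
```

Before fine-tuning, these probabilities carry no information about the review; training is what makes them meaningful.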

In [21]:
import numpy as np
import evaluate

metric = evaluate.load("accuracy")

In [22]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
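To see what `compute_metrics` calculates without loading the `evaluate` metric, the same accuracy can be written in plain NumPy. The logits and labels below are made-up values for four examples and five classes.

```python
import numpy as np

# Hypothetical logits: 4 examples x 5 classes.
logits = np.array([
    [0.1, 0.2, 0.3, 0.4, 2.0],   # argmax -> 4
    [1.5, 0.2, 0.1, 0.0, 0.3],   # argmax -> 0
    [0.0, 0.1, 0.9, 0.2, 0.1],   # argmax -> 2
    [0.3, 0.2, 0.1, 1.1, 0.4],   # argmax -> 3
])
labels = np.array([4, 0, 1, 3])

predictions = np.argmax(logits, axis=-1)       # most likely class per row
accuracy = float((predictions == labels).mean())
print({"accuracy": accuracy})  # 3 of 4 predictions match the labels
```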

In [23]:
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(output_dir="test_trainer",
                                  evaluation_strategy="steps",  # evaluate every `logging_steps` steps
                                  num_train_epochs=1,
                                  logging_steps=30,
                                  use_mps_device=True,  # Apple Silicon GPU; drop this on other hardware
                                  )

In [24]:
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

Loading cached shuffled indices for dataset at /Users/jhuguet/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf/cache-3d04d2361e5a1bdd.arrow
Loading cached shuffled indices for dataset at /Users/jhuguet/.cache/huggingface/datasets/yelp_review_full/yelp_review_full/1.0.0/e8e18e19d7be9e75642fc66b198abadb116f73599ec89a69ba5dd8d1e57ba0bf/cache-bd9a5dc44bcab3fb.arrow
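The `shuffle(seed=42).select(range(1000))` idiom amounts to drawing a reproducible random subset. A NumPy sketch of the same idea on row indices follows; it illustrates the technique only and is not expected to reproduce the exact indices that `datasets` picks. `sample_indices` is a hypothetical helper.

```python
import numpy as np


def sample_indices(num_rows, sample_size, seed=42):
    """Shuffle row indices with a fixed seed and keep the first few."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(num_rows)
    return shuffled[:sample_size]


train_idx = sample_indices(650_000, 1000)
again = sample_indices(650_000, 1000)
print(len(train_idx), bool((train_idx == again).all()))
```

Fixing the seed means that re-running the notebook trains and evaluates on the same 1,000 rows, so results stay comparable between runs.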


In [25]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,)

In [26]:
trainer.train()



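| Step | Training Loss | Validation Loss | Accuracy |
|------|---------------|-----------------|----------|
| 30   | 1.6113        | 1.613645        | 0.187    |
| 60   | 1.5818        | 1.521957        | 0.333    |
| 90   | 1.4637        | 1.393469        | 0.41     |
| 120  | 1.347         | 1.33172         | 0.422    |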


TrainOutput(global_step=125, training_loss=1.491864486694336, metrics={'train_runtime': 116.7003, 'train_samples_per_second': 8.569, 'train_steps_per_second': 1.071, 'total_flos': 132474485760000.0, 'train_loss': 1.491864486694336, 'epoch': 1.0})
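Two quick sanity checks help read these numbers. The sketch assumes the `Trainer` default batch size of 8 on a single device (the notebook does not set it) and roughly balanced classes, and it compares against what a model guessing uniformly over 5 classes would score.

```python
import math

train_examples = 1000
batch_size = 8          # assumed default; not set explicitly in this notebook
epochs = 1
num_labels = 5

# Optimizer steps for one pass over the 1000-example subset.
steps = math.ceil(train_examples / batch_size) * epochs

# Cross-entropy loss and accuracy of a uniform guess over 5 classes.
chance_loss = math.log(num_labels)
chance_accuracy = 1 / num_labels

print(steps, round(chance_loss, 3), chance_accuracy)
```

The step count matches `global_step=125`, and the first logged loss (1.6113) and accuracy (0.187) sit close to the chance values of about 1.609 and 0.2. The later evaluations therefore show the model moving away from random guessing, although 0.422 accuracy after one epoch on 1,000 examples is still far from a finished model.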

In [27]:
trainer.save_model("custom_model")
tokenizer.save_pretrained("custom_model")

('custom_model/tokenizer_config.json',
 'custom_model/special_tokens_map.json',
 'custom_model/vocab.txt',
 'custom_model/added_tokens.json',
 'custom_model/tokenizer.json')

## Now, let's get the model into a pipeline and run it over some examples

In [28]:
clf = pipeline("text-classification", model="custom_model")

In [29]:
clf("The movie is great")

[{'label': 'LABEL_4', 'score': 0.30081331729888916}]
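The generic `LABEL_4` name appears because no label names were attached to the model. A small helper can translate it after the fact; the sketch assumes the `LABEL_<k>` naming seen in the output and that label k means k + 1 stars. `to_stars` is a hypothetical helper, not a pipeline feature.

```python
def to_stars(prediction):
    """Turn {'label': 'LABEL_<k>', 'score': s} into a readable star rating."""
    index = int(prediction["label"].split("_")[-1])
    return {"stars": index + 1, "score": prediction["score"]}


# Prediction copied from the pipeline output above.
readable = to_stars({"label": "LABEL_4", "score": 0.30081331729888916})
print(readable)
```

Passing `id2label` and `label2id` dictionaries when loading the model with `from_pretrained` should be another way to get readable names straight from the pipeline.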

In [30]:
n = 89

In [31]:
tokenized_datasets["test"][n]["text"], tokenized_datasets["test"][n]["label"]

("I'm not a car girl and Lester couldn't have been more accommodating.  I needed an inspection and he was able to take me the same day I called.  He's open past 7pm so the evening hours make it convenient to pick up or drop off. \\n\\nVery honest and upfront about the work he did, too...typical Pittsburgher who knows his stuff. Labor was $70/hour and parts were cheaper than other places. Cost me $3 to have a break light replaced.  Couldn't have been happier with the quick, honest, convenient service.  Lester's the best!",
 4)

In [32]:
clf(tokenized_datasets["test"][n]["text"])

[{'label': 'LABEL_0', 'score': 0.486929714679718}]