
In [1]:
!pip install datasets
!pip install transformers


# 🤗 Datasets
- [Datasets](https://huggingface.co/docs/datasets/) is a library to easily share and access datasets in Python. It provides a unified API to access a growing list of datasets, with a focus on NLP datasets.
- It is maintained by the Hugging Face team and the community.
- It is backed by Apache Arrow, a fast and efficient columnar in-memory format, which it accesses through PyArrow.
- It is designed to work with NumPy, Pandas, PyTorch, TensorFlow, JAX, and many other libraries.


In [37]:
# Datasets imports
from datasets import list_datasets, load_dataset

In [38]:
# List some of the available datasets.
ds = list_datasets()
print(f"First 5 datasets available: {ds[:5]}")

# Load the emotion dataset.
emotions = load_dataset("emotion")
print(f"keys in the Dataset are {emotions.keys()}")

First 5 datasets available: ['acronym_identification', 'ade_corpus_v2', 'adversarial_qa', 'aeslc', 'afrikaans_ner_corpus']


No config specified, defaulting to: emotion/split
Found cached dataset emotion (C:/Users/e10115582/.cache/huggingface/datasets/emotion/split/1.0.0/cca5efe2dfeb58c1d098e0f9eeb200e9927d889b5a03c67097275dfb5fe463bd)



keys in the Dataset are dict_keys(['train', 'validation', 'test'])


In [39]:
# Keep only train Dataset for now.
ds_train = emotions["train"]
print(f"First element in the train set {ds_train[0]}")

First element in the train set {'text': 'i didnt feel humiliated', 'label': 0}


In [6]:
# Convert to pandas for easier inspection.
emotions.set_format(type="pandas")
df_train = emotions["train"][:]

# Add the label name.
def get_label_name(label):
    """Get the label name associated with an encoded label."""
    return emotions["train"].features["label"].names[label]

df_train["label_name"] = df_train["label"].apply(get_label_name)

# Restore the default output format so that later cells receive plain Python objects.
emotions.reset_format()
df_train.head()

# 🤗 Tokenizers
- [Tokenizers](https://huggingface.co/docs/tokenizers/) is a library of fast tokenizer implementations that turn raw text into the token IDs a model consumes.
- It is maintained by the Hugging Face team and the community.
- Its core is written in Rust, a fast and efficient language, and it is exposed to Python through bindings.
- In this notebook we reach it through `AutoTokenizer` from Transformers, which loads the tokenizer that matches a given model checkpoint.

In [40]:
# Tokenizer import
from transformers import AutoTokenizer

In [80]:
# AutoTokenizer automatically selects the tokenizer associated with the model checkpoint.
model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

# Tokenize a text example.
text = "This is a silly test for a silly tokenizer."
encoded_text = tokenizer(text)
print(encoded_text)
# Convert the IDs back to tokens to see how the text was split.
print(tokenizer.convert_ids_to_tokens(encoded_text.input_ids))

{'input_ids': [101, 2023, 2003, 1037, 10021, 3231, 2005, 1037, 10021, 19204, 17629, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
['[CLS]', 'this', 'is', 'a', 'silly', 'test', 'for', 'a', 'silly', 'token', '##izer', '.', '[SEP]']
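
The split of `tokenizer` into `token` and `##izer` comes from sub-word tokenization. The following is a minimal sketch of greedy longest-match-first splitting, the idea behind WordPiece. It is not the library's implementation: the `wordpiece` function and the tiny `TOY_VOCAB` are hypothetical and exist only for illustration.

```python
# Toy vocabulary, invented for illustration; the real DistilBERT vocabulary
# has roughly 30,000 entries.
TOY_VOCAB = {"token", "##izer", "##s", "silly", "test", "[UNK]"}


def wordpiece(word, vocab=TOY_VOCAB, unk="[UNK]"):
    """Split one lowercase word into sub-word pieces, longest match first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches, so the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


print(wordpiece("tokenizer"))  # ['token', '##izer']
print(wordpiece("tokens"))     # ['token', '##s']
print(wordpiece("xyz"))        # ['[UNK]']
```

The `##` prefix marks a piece that continues the previous one, which is why the tokens can be joined back into the original word.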


In [64]:
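# Tokenize every split (train, validation, test) so that later cells can index them by name.
# With batch_size=None each split is one batch, so padding=True pads to that split's longest example.
encoded_ds = emotions.map(lambda x: tokenizer(x["text"], padding=True, truncation=True), batched=True, batch_size=None)
print(f"First element in the encoded train set {encoded_ds['train'][0]}")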

Loading cached processed dataset at C:\Users\e10115582\.cache\huggingface\datasets\emotion\split\1.0.0\cca5efe2dfeb58c1d098e0f9eeb200e9927d889b5a03c67097275dfb5fe463bd\cache-85e02d861ad2b6f9.arrow


First element in the encoded train set {'text': 'i didnt feel humiliated', 'label': 0, 'input_ids': [101, 1045, 2134, 2102, 2514, 26608, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
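
The trailing zeros in `input_ids` are padding, and `attention_mask` tells the model which positions hold real tokens. Below is a minimal sketch of that bookkeeping in plain Python, assuming a padding ID of 0 as in the output above. The helper name `pad_batch` is hypothetical; in practice the tokenizer does this for you.

```python
def pad_batch(sequences, pad_id=0):
    """Pad every ID list to the longest one and build the matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 1045, 2134, 2102, 2514, 26608, 102], [101, 2023, 102]])
print(batch["input_ids"][1])       # [101, 2023, 102, 0, 0, 0, 0]
print(batch["attention_mask"][1])  # [1, 1, 1, 0, 0, 0, 0]
```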


# 🤗 Transformers
- [Transformers](https://huggingface.co/transformers/) is a library to easily share and access state-of-the-art NLP models in Python. It provides a unified API to access a growing list of pretrained models, with a focus on NLP models.
- We use a pretrained model from the [Hugging Face Model Hub](https://huggingface.co/models) to classify the text.
- We will use the [DistilBERT](https://huggingface.co/distilbert-base-uncased) model, a small, fast, cheap and light Transformer model trained by distilling BERT base.
- In the first approach, we keep the model's weights unchanged and use it only to extract features from the text.
- In the second approach, we will fine-tune the model to classify the text.

In [None]:
# Transformers import
from transformers import AutoModel
# General imports
import torch

In [54]:
model_ckpt = "distilbert-base-uncased"
device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModel.from_pretrained(model_ckpt).to(device)


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [85]:
# Return the model inputs and labels as PyTorch tensors so they can be passed to the model.
encoded_ds.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])

def extract_last_hidden_states(batch):
    """Extract the last hidden states from the model."""
    inputs = {k: v.to(device) for k, v in batch.items() if k in tokenizer.model_input_names}
    with torch.no_grad():
        outputs = model(**inputs).last_hidden_state
    # Take the last hidden state of the [CLS] token.
    return {"hidden_state": outputs[:, 0, :].cpu().numpy()}

# Run the extraction over the dataset in batches.
last_hidden_states = encoded_ds.map(extract_last_hidden_states, batched=True)
last_hidden_states

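
To make the `[:, 0, :]` indexing concrete, here is a small NumPy sketch that uses a stand-in array in place of the real model output. The sizes are made up for the example; only the slicing pattern mirrors the cell above.

```python
import numpy as np

# Stand-in for model(**inputs).last_hidden_state. The sizes are invented for
# this sketch (DistilBERT's real hidden size is 768).
batch_size, seq_len, hidden_size = 2, 5, 4
last_hidden_state = np.arange(
    batch_size * seq_len * hidden_size, dtype=np.float32
).reshape(batch_size, seq_len, hidden_size)

# Position 0 of every sequence is the [CLS] token, so this keeps one vector per example.
cls_features = last_hidden_state[:, 0, :]

print(last_hidden_state.shape)  # (2, 5, 4)
print(cls_features.shape)       # (2, 4)
```

Each row of `cls_features` is then a fixed-size feature vector for one text, whatever the original sequence length was.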

### Create a classifier on top of the last hidden state of the model.
- We train a random forest classifier on top of the last hidden state of the model.

In [None]:
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier

In [None]:
X_train = np.array(last_hidden_states["train"]["hidden_state"])
y_train = np.array(last_hidden_states["train"]["label"])
X_valid = np.array(last_hidden_states["validation"]["hidden_state"])
y_valid = np.array(last_hidden_states["validation"]["label"])

# Scale the data.
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_valid = scaler.transform(X_valid)

# Train a simple classifier.
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
print(f"Accuracy on validation set: {clf.score(X_valid, y_valid)}")

# Plot confusion matrix.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Compute confusion matrix.
cm = confusion_matrix(y_valid, clf.predict(X_valid))
# Normalize the confusion matrix.
cm = cm.astype("float") / cm.sum(axis=1)[:, np.newaxis]
# Plot the confusion matrix.
fig, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(cm, annot=True, fmt=".2f", xticklabels=emotions["train"].features["label"].names, yticklabels=emotions["train"].features["label"].names)
plt.ylabel("Actual")
plt.xlabel("Predicted")
plt.show()
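
As a small worked example of row normalization, the sketch below divides each row of a confusion matrix by its total, so each cell reads as the fraction of that actual class. The counts are hypothetical and are not results from this notebook.

```python
import numpy as np

# Hypothetical counts for three classes: rows are actual labels, columns are predictions.
cm = np.array([[8, 2, 0],
               [1, 6, 3],
               [0, 5, 5]])

# Divide each row by its total so that every row sums to 1.
cm_normalized = cm.astype("float") / cm.sum(axis=1)[:, np.newaxis]

print(cm_normalized[0].tolist())        # [0.8, 0.2, 0.0]
print(np.diag(cm_normalized).tolist())  # [0.8, 0.6, 0.5]
```

After normalization the diagonal holds the per-class recall, which makes classes of different sizes comparable in the heatmap.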