https://huggingface.co/models

In [1]:
import sys
import torch
from transformers import __version__ as transformers_version
import pandas as pd

print('Python version:', sys.version)
print('PyTorch version:', torch.__version__)
print('Transformers version:', transformers_version)

Python version: 3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)]
PyTorch version: 1.11.0
Transformers version: 4.21.2


### Do we have a GPU?

In [2]:
torch.cuda.is_available()

True

In [3]:
torch.cuda.get_device_name()

'NVIDIA GeForce MX330'

In [4]:
torch.cuda.get_device_properties(0)

_CudaDeviceProperties(name='NVIDIA GeForce MX330', major=6, minor=1, total_memory=2047MB, multi_processor_count=3)

Pick the device automatically: use the GPU for PyTorch (and transformers) if one is available, otherwise fall back to the CPU.

In [5]:
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
print('Will use:', device)

Will use: cuda:0


## PART I - Getting to know the HuggingFace syntax

### How to use BERT (and BERT-based) models?

Luckily, HuggingFace is the hub for transformer models, so the syntax is the same for any BERT model we want to use: same loading, training, predicting, saving, etc. Learn it once and you can apply it to everything.


Let's choose a smaller BERT to speed things up: DistilBERT, uncased. Uncased means the model works on lowercased text; the tokenizer lowercases the input for us, as the tokens printed further down show.

Every BERT model has two parts, and both need to be loaded:
1. Tokenizer (turns text into tokens to be fed into BERT)
2. Model (takes the tokenized input and creates the embeddings; it can also be fine-tuned by training additional layers for classification)

In [6]:
from transformers import DistilBertTokenizer, DistilBertModel # if using TensorFlow, the classes have a TF prefix (e.g. TFDistilBertModel)

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertModel.from_pretrained("distilbert-base-uncased")

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_transform.weight', 'vocab_projector.bias', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


### Check tokenizer

In [7]:
tokenizer

PreTrainedTokenizer(name_or_path='distilbert-base-uncased', vocab_size=30522, model_max_len=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'})

In [8]:
text = "I don't like running"

In [9]:
tokenizer.encode(text)

[101, 1045, 2123, 1005, 1056, 2066, 2770, 102]

In [10]:
tokenizer.encode(text, return_tensors='np')

array([[ 101, 1045, 2123, 1005, 1056, 2066, 2770,  102]])

In [11]:
tokenizer.encode(text, return_tensors='pt')
#tokenizer.encode(text, return_tensors='tf')

tensor([[ 101, 1045, 2123, 1005, 1056, 2066, 2770,  102]])

In [12]:
tokenizer.encode_plus(text, return_tensors='pt')

{'input_ids': tensor([[ 101, 1045, 2123, 1005, 1056, 2066, 2770,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])}

In [13]:
tokenizer.convert_ids_to_tokens(tokenizer.encode_plus(text, return_tensors='pt')['input_ids'][0])

['[CLS]', 'i', 'don', "'", 't', 'like', 'running', '[SEP]']

When to use padding? When there are multiple input sequences and their lengths differ, so that they can be stacked into one rectangular tensor.

In [14]:
texts = ["I don't like running", 
         "I enjoy delicious Napoletan pizza", 
         "Do you like swimming in salty water?"]

tokenized_texts = tokenizer.batch_encode_plus(texts, return_tensors='pt', padding = True)

for i in range(tokenized_texts['input_ids'].shape[0]):
    print(tokenizer.convert_ids_to_tokens(tokenized_texts['input_ids'][i])) 

['[CLS]', 'i', 'don', "'", 't', 'like', 'running', '[SEP]', '[PAD]', '[PAD]']
['[CLS]', 'i', 'enjoy', 'delicious', 'nap', '##ole', '##tan', 'pizza', '[SEP]', '[PAD]']
['[CLS]', 'do', 'you', 'like', 'swimming', 'in', 'salty', 'water', '?', '[SEP]']
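The `##` pieces in the second line (`nap`, `##ole`, `##tan`) come from WordPiece: a word that is not in the vocabulary is split greedily into the longest known pieces, and every piece after the first gets a `##` prefix. Below is a minimal sketch of that greedy longest-match idea. The function `wordpiece` and the four-entry `toy_vocab` are hypothetical stand-ins for illustration; the real tokenizer also handles punctuation splitting, lowercasing, and a 30522-entry vocabulary.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercased word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no known piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, just large enough for the two words below.
toy_vocab = {"pizza", "nap", "##ole", "##tan"}

print(wordpiece("pizza", toy_vocab))      # ['pizza']
print(wordpiece("napoletan", toy_vocab))  # ['nap', '##ole', '##tan']
```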


In [15]:
for i in range(tokenized_texts['input_ids'].shape[0]):
    print(tokenized_texts['attention_mask'][i])

tensor([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 0])
tensor([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
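To make the link between padding and the attention mask explicit, here is a small pure-Python sketch of what `padding = True` does conceptually. `pad_batch` is a hypothetical helper, not part of transformers; `pad_id=0` matches the `pad_token_id` of this model, and the second id sequence is made up only to force two padded positions.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad lists of token ids to a common length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks a padded position the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([
    [101, 1045, 2123, 1005, 1056, 2066, 2770, 102],  # ids of "I don't like running"
    [101, 1, 2, 3, 4, 5, 6, 7, 8, 102],              # made-up 10-id sequence
])

print(batch["input_ids"][0])       # [101, 1045, 2123, 1005, 1056, 2066, 2770, 102, 0, 0]
print(batch["attention_mask"][0])  # [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
```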


### Check BERT model

In [16]:
model.config

DistilBertConfig {
  "_name_or_path": "distilbert-base-uncased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForMaskedLM"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.21.2",
  "vocab_size": 30522
}

In [17]:
model.embeddings

Embeddings(
  (word_embeddings): Embedding(30522, 768, padding_idx=0)
  (position_embeddings): Embedding(512, 768)
  (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
  (dropout): Dropout(p=0.1, inplace=False)
)

In [18]:
model.num_parameters()

66362880
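The parameter count can be reproduced by hand from the config values above. The sketch assumes the usual DistilBERT layout, which the notebook does not print in full: each layer has four attention projections (query, key, value, output) with biases, a two-layer feed-forward block, and two LayerNorms. The embedding part follows the `model.embeddings` output.

```python
# Values copied from the model.config output.
vocab_size, max_positions = 30522, 512
dim, hidden_dim, n_layers = 768, 3072, 6


def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


layer_norm = 2 * dim  # one scale and one shift value per dimension

# Word embeddings + position embeddings + LayerNorm.
embeddings = vocab_size * dim + max_positions * dim + layer_norm

attention = 4 * linear(dim, dim)  # assumed: query, key, value, output projections
feed_forward = linear(dim, hidden_dim) + linear(hidden_dim, dim)
per_layer = attention + feed_forward + 2 * layer_norm

total = embeddings + n_layers * per_layer
print(total)  # 66362880
```

About a third of the parameters (23835648) sit in the embedding tables; the rest is spread evenly over the six transformer layers.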

How to use the model?

In [19]:
print(text)
print(tokenizer.encode(text))
print(tokenizer.encode(text, return_tensors='pt'))

I don't like running
[101, 1045, 2123, 1005, 1056, 2066, 2770, 102]
tensor([[ 101, 1045, 2123, 1005, 1056, 2066, 2770,  102]])


In [20]:
model(tokenizer.encode(text))

AttributeError: 'list' object has no attribute 'size'

The model expects a tensor with a batch dimension, not a plain Python list, which is why `return_tensors='pt'` is needed.

In [21]:
model(tokenizer.encode(text, return_tensors='pt'))

BaseModelOutput(last_hidden_state=tensor([[[ 0.0652,  0.0028, -0.0174,  ...,  0.0461,  0.1736,  0.3138],
         [ 0.4311,  0.1628, -0.1936,  ..., -0.0356,  0.2380,  0.5381],
         [-0.1097,  0.3686,  0.4179,  ..., -0.3509,  0.0401,  0.1781],
         ...,
         [ 0.4491,  0.1565,  0.6133,  ..., -0.1435, -0.2276, -0.1262],
         [ 0.4262, -0.4698, -0.5305,  ...,  0.1044, -0.5408, -0.0299],
         [ 0.3381,  0.1836,  0.2806,  ...,  0.3698, -0.0713,  0.0624]]],
       grad_fn=<NativeLayerNormBackward0>), hidden_states=None, attentions=None)

In [22]:
output = model(tokenizer.encode(text, return_tensors='pt'))
#output = model(**tokenizer.encode_plus(text, return_tensors='pt')) # here we return input_ids AND attention_masks so the ** lets the function map the dictionary's keys to its parameters

In [23]:
output.keys()

odict_keys(['last_hidden_state'])

In [24]:
output.last_hidden_state

tensor([[[ 0.0652,  0.0028, -0.0174,  ...,  0.0461,  0.1736,  0.3138],
         [ 0.4311,  0.1628, -0.1936,  ..., -0.0356,  0.2380,  0.5381],
         [-0.1097,  0.3686,  0.4179,  ..., -0.3509,  0.0401,  0.1781],
         ...,
         [ 0.4491,  0.1565,  0.6133,  ..., -0.1435, -0.2276, -0.1262],
         [ 0.4262, -0.4698, -0.5305,  ...,  0.1044, -0.5408, -0.0299],
         [ 0.3381,  0.1836,  0.2806,  ...,  0.3698, -0.0713,  0.0624]]],
       grad_fn=<NativeLayerNormBackward0>)

In [25]:
output.last_hidden_state.shape

torch.Size([1, 8, 768])

In [26]:
output.last_hidden_state[0].shape

torch.Size([8, 768])

In [27]:
output.last_hidden_state[0][0].shape

torch.Size([768])

In [28]:
output.last_hidden_state[0][0].detach().numpy().shape

(768,)
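For classification we need one fixed-size vector per text. A minimal sketch of two common ways to get it from a `(batch, tokens, 768)` array: take the vector at position 0 (the `[CLS]` token, as done above), or average over the real tokens while ignoring padding. The random `hidden` array and the `mask` are stand-ins for the real model output and attention mask, so the numbers themselves carry no meaning.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for output.last_hidden_state.detach().numpy(): (batch, tokens, dim).
hidden = rng.normal(size=(3, 10, 768))
# Stand-in attention mask with 2, 1 and 0 padded positions.
mask = np.array([[1] * 8 + [0] * 2,
                 [1] * 9 + [0] * 1,
                 [1] * 10])

# Option 1: the vector at position 0, i.e. the [CLS] token of each text.
cls_features = hidden[:, 0, :]

# Option 2: average over real tokens only, so [PAD] positions do not dilute the mean.
weights = mask[:, :, None]
mean_features = (hidden * weights).sum(axis=1) / weights.sum(axis=1)

print(cls_features.shape, mean_features.shape)  # (3, 768) (3, 768)
```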

### We haven't been using the GPU!

With PyTorch this is easy: move the model and the input tensors onto the GPU with `.to(device)`.

In [75]:
model = model.to(device)

In [76]:
encoded_texts = tokenizer.batch_encode_plus(texts*10, return_tensors='pt', padding = True).to(device)
output = model(**encoded_texts)

In [77]:
torch.cuda.empty_cache()

A GPU would speed things up a lot, but even this small BERT model takes up a large share of the 2 GB of GPU memory available here, so I'll use the CPU later. A strong GPU with plenty of VRAM does wonders.
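When memory is the limiting factor, one common workaround is to push the texts through the model in small batches instead of all at once. The sketch below shows only the batching part in plain Python; `iter_batches` and the batch size of 8 are hypothetical choices, not something the notebook prescribes.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements each."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


texts = ["I don't like running",
         "I enjoy delicious Napoletan pizza",
         "Do you like swimming in salty water?"] * 10

batch_sizes = [len(batch) for batch in iter_batches(texts, batch_size=8)]
print(batch_sizes)  # [8, 8, 8, 6]

# Each batch could then be tokenized and run on its own, for example:
#   encoded = tokenizer.batch_encode_plus(batch, return_tensors='pt', padding=True).to(device)
#   output = model(**encoded)
```

Only one batch of activations lives in memory at a time, at the cost of more forward passes.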


Can we simply access data that lives on the GPU?

In [81]:
encoded_texts['input_ids'][0].numpy()

TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

In [82]:
encoded_texts['input_ids'][0].cpu().numpy()

array([ 101, 1045, 2123, 1005, 1056, 2066, 2770,  102,    0,    0],
      dtype=int64)
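The same rule applies to model outputs, with one extra detail: outputs produced with autograd enabled also need `.detach()` before `.numpy()`, as used earlier. A small sketch of an inference pattern that avoids both problems, using a tiny `torch.nn.Linear` as a stand-in for the real model (an assumption made so the example runs without downloading anything):

```python
import torch

device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

# Stand-in for the real model: any torch.nn.Module behaves the same way here.
toy_model = torch.nn.Linear(4, 2).to(device)
inputs = torch.ones(3, 4).to(device)

with_grad = toy_model(inputs)
print(with_grad.requires_grad)  # True: autograd is recording this output

with torch.no_grad():  # inference only, no autograd bookkeeping
    no_grad = toy_model(inputs)
print(no_grad.requires_grad)  # False

# .cpu() copies from the GPU if needed and returns the tensor unchanged if it is already on the CPU.
features = no_grad.cpu().numpy()
print(features.shape)  # (3, 2)
```

Wrapping inference in `torch.no_grad()` also saves memory, because no intermediate results are kept for a backward pass.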

## PART II - Simple Text Classification

We will use a simple logistic regression in all scenarios.



<img src='https://camo.githubusercontent.com/fbead0577c812a46a779b83851f37e492658a7d19588605cef22e76ffc8a9747/68747470733a2f2f6a616c616d6d61722e6769746875622e696f2f696d616765732f64697374696c424552542f626572742d64697374696c626572742d73656e74656e63652d636c617373696669636174696f6e2d6578616d706c652e706e67' width=1000>


The goal is to fill this table's eval metric column:

| Applied Model | Text Preprocessing | AUC |
| --- | --- | --- |
| tf-idf | Extensive (lowercased; special characters, numbers, and stop words removed; lemmatized) | |
| tf-idf | Minimal (lowercased; special characters and numbers removed) | |
| tf-idf | Nothing (lowercased only) | |
| BERT | Extensive (lowercased; special characters, numbers, and stop words removed; lemmatized) | |
| BERT | Minimal (lowercased; special characters and numbers removed) | |
| BERT | Nothing (lowercased only) | |
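A rough sketch of what the preprocessing levels in the table could look like, using only the standard library. The function names, the regular expressions, and the tiny `STOP_WORDS` set are assumptions for illustration. Lemmatization is left out because it needs an external library, so the last function covers only part of the "Extensive" row. The sample sentence is made up in the style of the spam messages.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real run would use a full one.
STOP_WORDS = {"a", "in", "to", "the"}


def preprocess_lower_only(text):
    """'Nothing' level: only lowercase the text."""
    return text.lower()


def preprocess_minimal(text):
    """'Minimal' level: lowercase, then drop digits and special characters."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def preprocess_stop_removed(text):
    """Part of the 'Extensive' level: minimal cleaning plus stop-word removal."""
    return " ".join(w for w in preprocess_minimal(text).split() if w not in STOP_WORDS)


sample = "Free entry in 2 a wkly comp to win FA Cup!"
print(preprocess_lower_only(sample))    # free entry in 2 a wkly comp to win fa cup!
print(preprocess_minimal(sample))       # free entry in a wkly comp to win fa cup
print(preprocess_stop_removed(sample))  # free entry wkly comp win fa cup
```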

In [109]:
data = pd.read_csv('https://github.com/mohitgupta-omg/Kaggle-SMS-Spam-Collection-Dataset-/blob/master/spam.csv?raw=true', 
                    on_bad_lines='skip', 
                    encoding = "ISO-8859-1", 
                    usecols=[0, 1])

data.columns = ['is_spam', 'text']
data['is_spam'] = data['is_spam'].map({'ham' : 0, 'spam' : 1}) # assign back instead of in-place replace on a column, which newer pandas versions warn about

print('Shape of data:', data.shape)
print('Ratio of spam:', round(data.loc[data['is_spam'] == 1].shape[0] / data.shape[0] * 100, 2), '%')

data.head(3)

Shape of data: (5572, 2)
Ratio of spam: 13.41 %


| | is_spam | text |
| --- | --- | --- |
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
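With only 13.41 % spam, always predicting "ham" would already be about 86.6 % accurate, so a ranking metric such as AUC is more informative here. AUC can be read as the probability that a randomly chosen spam message gets a higher score than a randomly chosen ham message. The sketch below computes it directly from that definition. `roc_auc` is a hypothetical helper and the labels and scores are made-up toy values; in practice `sklearn.metrics.roc_auc_score` would do this far more efficiently.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]            # toy data: 1 = spam, 0 = ham
scores = [0.1, 0.4, 0.35, 0.8]   # toy predicted spam probabilities

print(roc_auc(labels, scores))                # 0.75
print(roc_auc(labels, [0.0, 0.0, 0.0, 0.0]))  # 0.5, a constant score ranks nothing
```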


In [None]:
model(**tokenizer.batch_encode_plus(texts*10000, return_tensors='pt', padding = True))