<a href="https://colab.research.google.com/github/lkarjun/fastai-huggingface-workouts/blob/main/notebook2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Package

In [3]:
!pip install -qq transformers[sentencepiece]


In [4]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)



In [5]:
classifier(["I'm waiting for you..."])

[{'label': 'POSITIVE', 'score': 0.9989559650421143}]

In [6]:
print(classifier.model.name_or_path)

distilbert-base-uncased-finetuned-sst-2-english


# Behind the pipeline (PyTorch)

## Step 1: Tokenize

In [7]:
checkpoint = 'distilbert-base-uncased-finetuned-sst-2-english'

In [8]:
from transformers import AutoTokenizer

In [9]:
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [10]:
raw_inputs = ["I've been waiting for you.", "I have you."]

In [11]:
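inputs = tokenizer(raw_inputs,
                   padding=True,         # pad shorter sentences to the longest one in the batch
                   truncation=True,      # cut sentences longer than the model's maximum length
                   return_tensors='pt')  # return PyTorch tensors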

In [12]:
print(inputs)

{'input_ids': tensor([[ 101, 1045, 1005, 2310, 2042, 3403, 2005, 2017, 1012,  102],
        [ 101, 1045, 2031, 2017, 1012,  102,    0,    0,    0,    0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]])}
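
The padded `input_ids` and the `attention_mask` printed above can be rebuilt by hand to see what `padding=True` does. This is a minimal sketch in plain Python, not the tokenizer's actual implementation: `pad_batch` is a hypothetical helper, and the pad id `0` is simply read off the output above.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the longest one and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token ids copied from the tokenizer output above, with the padding removed.
unpadded = [
    [101, 1045, 1005, 2310, 2042, 3403, 2005, 2017, 1012, 102],
    [101, 1045, 2031, 2017, 1012, 102],
]

batch = pad_batch(unpadded)
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```

The second sentence has six tokens, so it receives four pad ids and four zeros in its mask, matching the tensors printed by the tokenizer.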


## Step 2: Run inputs through model

In [13]:
from transformers import AutoModel

In [14]:
model = AutoModel.from_pretrained(checkpoint)

Some weights of the model checkpoint at distilbert-base-uncased-finetuned-sst-2-english were not used when initializing DistilBertModel: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [15]:
outputs = model(**inputs)

print(outputs.last_hidden_state.shape)

torch.Size([2, 10, 768])


In [16]:
from transformers import AutoModelForSequenceClassification

In [17]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

outputs = model(**inputs)

In [18]:
print(model)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
       

In [19]:
print(outputs.logits.shape)

print(outputs.logits)

torch.Size([2, 2])
tensor([[-2.9881,  3.0930],
        [-4.1106,  4.4167]], grad_fn=<AddmmBackward0>)


## Step 3: Process outputs

In [20]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

print(predictions)

tensor([[2.2805e-03, 9.9772e-01],
        [1.9794e-04, 9.9980e-01]], grad_fn=<SoftmaxBackward0>)
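
To check what `softmax` does to the logits, the same numbers can be recomputed with the standard library only. This is a sketch for illustration: the `softmax` function below is a hypothetical helper, and the logits are the rounded values printed in Step 2, so the results agree with the tensor above only up to rounding.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    shift = max(row)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied (rounded) from the model output in Step 2.
logits = [[-2.9881, 3.0930], [-4.1106, 4.4167]]

probs = [softmax(row) for row in logits]
for row in probs:
    print([f"{p:.4e}" for p in row])
```

Each row sums to 1, so the two columns can be read as probabilities for the two classes.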


In [21]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}
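
With `id2label` in hand, the probabilities can be turned into the same kind of result the `pipeline` returned at the start. The sketch below assumes the probabilities printed above; `to_labels` is a hypothetical helper, not part of `transformers`.

```python
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Probabilities copied from the softmax output above.
probabilities = [
    [2.2805e-03, 9.9772e-01],
    [1.9794e-04, 9.9980e-01],
]


def to_labels(rows, id2label):
    """Pick the most likely class per row and attach its label and score."""
    results = []
    for row in rows:
        best = max(range(len(row)), key=row.__getitem__)  # argmax
        results.append({"label": id2label[best], "score": row[best]})
    return results


results = to_labels(probabilities, id2label)
for result in results:
    print(result)
```

Both sentences come out as `POSITIVE`, since column 1 holds almost all of the probability mass in each row.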

# Behind the pipeline (Blurr)

## Step 1: Tokenize

In [22]:
!pip install -qq fastai
!pip install -qq ohmeow-blurr


In [23]:
from fastai.text.all import *
from blurr.utils import *
from blurr.data.core import *
from blurr.modeling.core import *

In [24]:
print(checkpoint)

distilbert-base-uncased-finetuned-sst-2-english


In [25]:
path = untar_data(URLs.IMDB_SAMPLE)

print(path.ls())

imdb_df = pd.read_csv(path/'texts.csv')

[Path('/root/.fastai/data/imdb_sample/texts.csv')]


In [26]:
imdb_df.head(2)

Unnamed: 0,label,text,is_valid
0,negative,"Un-bleeping-believable! Meg Ryan doesn't even look her usual pert lovable self in this, which normally makes me forgive her shallow ticky acting schtick. Hard to believe she was the producer on this dog. Plus Kevin Kline: what kind of suicide trip has his career been on? Whoosh... Banzai!!! Finally this was directed by the guy who did Big Chill? Must be a replay of Jonestown - hollywood style. Wooofff!",False
1,positive,"This is a extremely well-made film. The acting, script and camera-work are all first-rate. The music is good, too, though it is mostly early in the film, when things are still relatively cheery. There are no really superstars in the cast, though several faces will be familiar. The entire cast does an excellent job with the script.<br /><br />But it is hard to watch, because there is no good end to a situation like the one presented. It is now fashionable to blame the British for setting Hindus and Muslims against each other, and then cruelly separating them into two countries. There is som...",False


In [27]:
hf_arch, hf_config, hf_tokenizer, hf_model = BLURR.get_hf_objects(checkpoint, model_cls=AutoModelForSequenceClassification)

In [28]:
print(hf_arch)
print(type(hf_config))
print(type(hf_tokenizer))
print(type(hf_model))

distilbert
<class 'transformers.models.distilbert.configuration_distilbert.DistilBertConfig'>
<class 'transformers.models.distilbert.tokenization_distilbert_fast.DistilBertTokenizerFast'>
<class 'transformers.models.distilbert.modeling_distilbert.DistilBertForSequenceClassification'>


In [29]:
print(HF_TextBlock)

<class 'blurr.data.core.HF_TextBlock'>


In [30]:
blocks = (
    HF_TextBlock(hf_arch, hf_config, hf_tokenizer, hf_model,
                 max_length=None, padding=True, truncation=True),
    CategoryBlock,
)

dblock = DataBlock(blocks=blocks,
                   get_x=ColReader('text'),
                   get_y=ColReader('label'),
                   splitter=ColSplitter())

In [31]:
dls = dblock.dataloaders(imdb_df, bs=4)

In [32]:
dls.show_batch(dataloaders = dls, max_n = 2)

Unnamed: 0,text,target
0,"raising victor vargas : a review < br / > < br / > you know, raising victor vargas is like sticking your hands into a big, steaming bowl of oatmeal. it's warm and gooey, but you're not sure if it feels right. try as i might, no matter how warm and gooey raising victor vargas became i was always aware that something didn't quite feel right. victor vargas suffers from a certain overconfidence on the director's part. apparently, the director thought that the ethnic backdrop of a latino family on the lower east side, and an idyllic storyline would make the film critic proof. he was right, but it didn't fool me. raising victor vargas is the story about a seventeen - year old boy called, you guessed it, victor vargas ( victor rasuk ) who lives his teenage years chasing more skirt than the rolling stones could do in all the years they've toured. the movie starts off in ` ugly fat'donna's bedroom where victor is sure to seduce her, but a cry from outside disrupts his plans when his best - friend harold ( kevin rivera ) comes - a - looking for him. caught in the attempt by harold and his sister, victor vargas runs off for damage control. yet even with the embarrassing implication that he's been boffing the homeliest girl in the neighborhood, nothing dissuades young victor from going off on the hunt for more fresh meat. on a hot, new york city day they make way to the local public swimming pool where victor's eyes catch a glimpse of the lovely young nymph judy ( judy marte ), who's not just pretty, but a strong and independent too. the relationship that develops between victor and judy becomes the focus of the film. the story also focuses on victor's family that is comprised of his grandmother or abuelita ( altagracia guzman ), his brother nino ( also played by real life brother to victor, silvestre rasuk ) and his sister vicky ( krystal rodriguez ). the action follows victor between scenes with judy and scenes with his family. 
victor tries to cope with being an oversexed pimp - daddy, his feelings for judy and his grandmother's conservative catholic upbringing. < br / > < br / > the problems that arise from raising victor vargas are a few, but glaring errors. throughout the film you get to know certain characters like vicky, nino, grandma, judy and even",negative
1,"i watched grendel the other night and am compelled to put together a public service announcement. < br / > < br / > grendel is another version of beowulf, the thousand - year - old anglo - saxon epic poem. the scifi channel has a growing catalog of inoffensive and uninteresting movies, and the previews promised an inauthentic low - budget mini - epic, but this one refused to let me switch channels. it was staggeringly, overwhelmingly, bad. i watched in fascination and horror at the train wreck you couldn't tear your eyes away from. i reached for a notepad and managed to capture part of what i was seeing. the following may contain spoilers or might just save your sanity. you've been warned. < br / > < br / > - just to get it over with, beowulf's warriors wore horned helmets. trivial issue compared to what came after. it also appears that the helmets were in a bin and handed to whichever actor wandered by next. fit, appearance and function were apparently irrelevant. < br / > < br / > - marina sirtis had obviously been blackmailed into doing the movie by the ringling brothers, barnum and bailey circus. she managed to avoid a red rubber nose, but the clowns had already done the rest of her makeup. < br / > < br / > - ben cross pretended not to be embarrassed as the king. his character, hrothgar, must have become king of the danes only minutes before the film opened and hadn't had a chance to get the crown resized to fit him yet. < br / > < br / > - to facilitate the actors'return to their day jobs waiting tables, none were required to change their hairstyles at all. the variety of hair included cornrows, sideburns, buzz cuts and a mullet and at least served to distract from the dialog. to prove it was a multi - national cast, all were encouraged to retain whatever accent they chose. < br / > < br / > - as is typical with this type of movie ( at least since mad max ), leather armor was a requirement. in this case it was odd - shaped, ill - fitting and brand - new. 
< br / > < br / > - the female love interest, ingrid, played by alexis peters, followed a long - standing tradition of hotties who should be watched with the volume turned completely down.",negative


In [33]:
xb, yb = dls.one_batch()

In [34]:
xb

{'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1],
         [1, 1, 1,  ..., 1, 1, 1]], device='cuda:0'),
 'input_ids': tensor([[  101,  6274,  5125,  ...,  1998,  2130,   102],
         [  101,  1996,  4497,  ...,  2090,  1005,   102],
         [  101,  2116, 19046,  ...,  1010,  2004,   102],
         [  101,  1045,  3427,  ...,  2091,  1012,   102]], device='cuda:0')}

In [35]:
yb

TensorCategory([0, 1, 1, 0], device='cuda:0')

In [36]:
len(xb), xb['input_ids'].shape, xb['attention_mask'].shape, len(xb['input_ids']), yb.shape

(2, torch.Size([4, 512]), torch.Size([4, 512]), 4, torch.Size([4]))
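
Every row in the batch is 512 tokens long, which is the size of the `position_embeddings` table in the model printout: with `truncation = True` and `max_length = None`, the long reviews appear to be cut to the model's maximum length. The sketch below illustrates that idea in plain Python. It is an assumption about the behaviour, not the tokenizer's code: `truncate_with_special_tokens` is a hypothetical helper, the token ids are made up, and `101` / `102` are just the first and last ids seen in the batches above.

```python
CLS_ID, SEP_ID = 101, 102  # first and last ids in the batches printed above
MAX_LEN = 512              # sequence length seen in xb['input_ids'].shape


def truncate_with_special_tokens(token_ids, max_len=MAX_LEN):
    """Keep the leading tokens so that start id + tokens + end id fits in max_len."""
    body = token_ids[: max_len - 2]  # reserve two positions for the special tokens
    return [CLS_ID] + body + [SEP_ID]


long_review = list(range(1000, 1700))    # 700 made-up token ids
short_review = [1045, 2031, 2017, 1012]  # shorter than the limit, left intact

encoded_long = truncate_with_special_tokens(long_review)
encoded_short = truncate_with_special_tokens(short_review)

print(len(encoded_long), encoded_long[0], encoded_long[-1])
print(len(encoded_short))
```

This is consistent with each row of `xb['input_ids']` starting with `101` and ending with `102` even though the reviews themselves are cut off mid-text in `show_batch`.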

## Step 2: Run inputs through model

In [38]:
hf_model.cuda()

outputs = hf_model(**xb)

In [39]:
print(outputs.logits.shape)

torch.Size([4, 2])


In [40]:
print(outputs.logits)

tensor([[-1.0525,  1.2515],
        [ 0.9266, -0.6908],
        [ 0.3341, -0.1177],
        [ 3.9893, -3.3104]], device='cuda:0', grad_fn=<AddmmBackward0>)


## Step 3: Process outputs

In [44]:
predictions = nn.functional.softmax(outputs.logits, dim=-1)

In [47]:
print(predictions)

tensor([[9.0789e-02, 9.0921e-01],
        [8.3443e-01, 1.6557e-01],
        [6.1107e-01, 3.8893e-01],
        [9.9932e-01, 6.7533e-04]], device='cuda:0', grad_fn=<SoftmaxBackward0>)


In [48]:
hf_model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}
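
Because this batch comes with targets (`yb`), the predicted classes can be compared with them. The sketch below uses the probabilities and targets printed above and plain Python only. It assumes that the `CategoryBlock` indexes the labels so that `0` is `negative` and `1` is `positive`, the same order as `id2label`; that mapping is not shown in the outputs above.

```python
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Probabilities copied from the softmax output above (one row per review).
probabilities = [
    [9.0789e-02, 9.0921e-01],
    [8.3443e-01, 1.6557e-01],
    [6.1107e-01, 3.8893e-01],
    [9.9932e-01, 6.7533e-04],
]
# Targets copied from yb; assumed to use the same 0/1 order as id2label.
targets = [0, 1, 1, 0]

# Argmax per row gives the predicted class index.
predicted = [max(range(len(row)), key=row.__getitem__) for row in probabilities]
matches = [p == t for p, t in zip(predicted, targets)]
accuracy = sum(matches) / len(targets)

for p, t in zip(predicted, targets):
    print(f"predicted={id2label[p]:<8} target={id2label[t]}")
print(f"accuracy on this batch: {accuracy:.2f}")
```

This covers a single batch of four reviews, so the number says little about the model's quality overall.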