!pip install rich transformers torch sentence-transformers

Task: Sentiment Analysis using SequenceClassification Object
Models: distilbert-base-uncased-finetuned-sst-2-english, distilbert-base-uncased

Task: Text Classification
Text classification has several variants, including:

- Natural Language Inference (NLI): classify a sentence pair as Entailment, Contradiction, or Neutral

- Question Natural Language Inference (QNLI): decide whether the context contains the answer to a given question

In [2]:
from transformers import pipeline

# default_model = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
 
classifier = pipeline("sentiment-analysis")
classifier("I loved Star Wars so much!")
# Expected output: [{'label': 'POSITIVE', 'score': 0.99}]

  from .autonotebook import tqdm as notebook_tqdm
No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.999840259552002}]

In [69]:
classifier.framework

'pt'

In [70]:
classifier.model  # Displays the architecture (module tree) of the underlying model

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [96]:
# try another classification model
distilbert = 'distilbert-base-uncased'

distilpipe = pipeline('sentiment-analysis',
                      model=distilbert)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [97]:
distilpipe("This is a best way to get things done")

[{'label': 'LABEL_0', 'score': 0.5283176302909851}]

In [98]:
distilpipe("This is a worst way to explore the amazon rainforest")

[{'label': 'LABEL_0', 'score': 0.5297880172729492}]
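
Both inputs get LABEL_0 with a score just above 0.5, which fits the warning above: the classification head is newly initialized, so its two logits are almost equal. The sketch below, in plain Python, assumes the reported score is a softmax over two logits (the helper names are illustrative and not part of transformers) and recovers the logit gap implied by each score.

```python
import math


def implied_logit_gap(score):
    """For two classes, a winning softmax score p implies gap = ln(p / (1 - p))."""
    return math.log(score / (1.0 - score))


def score_from_gap(gap):
    """Inverse of implied_logit_gap: the two-class softmax reduces to a logistic."""
    return 1.0 / (1.0 + math.exp(-gap))


# Scores copied from the two distilpipe outputs above.
untrained_scores = [0.5283176302909851, 0.5297880172729492]
gaps = [implied_logit_gap(s) for s in untrained_scores]

for s, g in zip(untrained_scores, gaps):
    print(f"score={s:.4f} -> implied logit gap={g:.4f}")

# Sanity check: converting the gap back reproduces the original score.
round_trip = score_from_gap(gaps[0])
```

A gap this small means the head barely separates the two classes, so the label carries no sentiment information until the model is fine-tuned.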

In [71]:
# Let's look at how the pipeline works internally,
# then think about how a similar function could be implemented

In [1]:
from transformers import (
    AutoTokenizer, 
    AutoModelForSequenceClassification
)
import torch



In [74]:
default_model = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"

In [54]:
default_tokenizer = AutoTokenizer.from_pretrained(default_model)

In [55]:
text = ["This is a great day today", "There is so much trouble in that place"]
default_tokens = default_tokenizer(text, return_tensors='pt',
                                  padding=True)
default_tokens

{'input_ids': tensor([[ 101, 2023, 2003, 1037, 2307, 2154, 2651,  102,    0,    0],
        [ 101, 2045, 2003, 2061, 2172, 4390, 1999, 2008, 2173,  102]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
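
The output above shows what padding=True does: the shorter sentence is right-padded with 0 (the pad_token_id) and the attention mask marks real tokens with 1 and padding with 0. A minimal plain-Python sketch of that behaviour follows; pad_batch is a hypothetical helper for illustration, not a transformers function, and the token IDs are copied from the output above.

```python
def pad_batch(sequences, pad_token_id=0):
    """Right-pad token ID lists to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_token_id] * n_pad)
        # 1 = real token the model should attend to, 0 = padding to ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Unpadded token IDs of the two sentences (padding zeros removed).
unpadded = [
    [101, 2023, 2003, 1037, 2307, 2154, 2651, 102],
    [101, 2045, 2003, 2061, 2172, 4390, 1999, 2008, 2173, 102],
]

batch = pad_batch(unpadded)
print(batch["input_ids"])
print(batch["attention_mask"])
```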

In [57]:
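# Note: encode() does not batch a list of two strings; the output below is
# a single sequence, [CLS] text[0] [SEP] text[1] [SEP], i.e. a sentence pair.
default_tokenizer.encode(text, padding=True, return_tensors='pt')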

tensor([[ 101, 2023, 2003, 1037, 2307, 2154, 2651,  102, 2045, 2003, 2061, 2172,
         4390, 1999, 2008, 2173,  102]])
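
The encode() result has a single row of 17 IDs rather than two padded rows. One way to read it, sketched below in plain Python with IDs copied from the two outputs above, is as a sentence pair: the first sentence with its [CLS] (101) and [SEP] (102), followed by the second sentence without a leading [CLS]. The helper join_as_pair is hypothetical and only illustrates the layout.

```python
# Per-sentence IDs from the batched tokenizer call (padding removed).
sent_a = [101, 2023, 2003, 1037, 2307, 2154, 2651, 102]
sent_b = [101, 2045, 2003, 2061, 2172, 4390, 1999, 2008, 2173, 102]

# IDs returned by default_tokenizer.encode(text, ...) above.
encode_output = [101, 2023, 2003, 1037, 2307, 2154, 2651, 102,
                 2045, 2003, 2061, 2172, 4390, 1999, 2008, 2173, 102]


def join_as_pair(ids_a, ids_b):
    """Lay out two encoded sentences as [CLS] A [SEP] B [SEP]."""
    # Drop the second sentence's leading [CLS]; keep its trailing [SEP].
    return ids_a + ids_b[1:]


pair = join_as_pair(sent_a, sent_b)
print(pair == encode_output)
```

To get one row per sentence, call the tokenizer object itself on the list, as in the previous cell.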

In [75]:
# How to get the dimension of the embedding layer?
from transformers import AutoConfig

default_config = AutoConfig.from_pretrained(default_model)
default_config

DistilBertConfig {
  "_name_or_path": "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "finetuning_task": "sst-2",
  "hidden_dim": 3072,
  "id2label": {
    "0": "NEGATIVE",
    "1": "POSITIVE"
  },
  "initializer_range": 0.02,
  "label2id": {
    "NEGATIVE": 0,
    "POSITIVE": 1
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.39.3",
  "vocab_size": 30522
}
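
The embedding dimension asked about above is the "dim" field, 768, which matches Embedding(30522, 768) in the printed model. The sketch below copies a few values from the config into a plain dict and derives related sizes; on the real object the same values should be readable as attributes, for example default_config.dim.

```python
# Values copied from the DistilBertConfig output above.
config = {
    "dim": 768,
    "hidden_dim": 3072,
    "n_heads": 12,
    "n_layers": 6,
    "max_position_embeddings": 512,
    "vocab_size": 30522,
}

embedding_dim = config["dim"]

# Each attention head works on an equal slice of the hidden vector.
head_dim = config["dim"] // config["n_heads"]

# Parameter counts of the two embedding tables (rows x columns).
word_embedding_params = config["vocab_size"] * config["dim"]
position_embedding_params = config["max_position_embeddings"] * config["dim"]

print("embedding dim:", embedding_dim)
print("per-head dim:", head_dim)
print("word embedding parameters:", word_embedding_params)
print("position embedding parameters:", position_embedding_params)
```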

In [None]:
# Understand the difference between tokenizer and embedding model

In [9]:
# How to visualise the embedding?
# Sentence Transformers models create embedding vectors,
# while the tokenizer creates integer IDs for the tokens

from sentence_transformers import SentenceTransformer
model_tformers = "all-mpnet-base-v2"
embed = SentenceTransformer(model_tformers)

In [50]:
passage_embedding = embed.encode(text)

In [51]:
passage_embedding.shape

(2, 768)
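
Each row of passage_embedding is one 768-dimensional vector per sentence. A common way to compare such vectors is cosine similarity. The sketch below implements it in plain Python on small made-up 4-dimensional vectors (the values are hypothetical, chosen only for illustration); the same function could be applied to two rows of passage_embedding, and sentence_transformers also ships a util.cos_sim helper for this.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors standing in for sentence embeddings.
vec_a = [0.2, 0.1, -0.4, 0.7]
vec_b = [0.25, 0.05, -0.35, 0.6]   # points in nearly the same direction as vec_a
vec_c = [-0.6, 0.3, 0.5, -0.1]     # points in a very different direction

sim_ab = cosine_similarity(vec_a, vec_b)
sim_ac = cosine_similarity(vec_a, vec_c)
print(f"a vs b: {sim_ab:.3f}")
print(f"a vs c: {sim_ac:.3f}")
```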

In [76]:
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer

dber_orig_tokenizer = DistilBertTokenizer.from_pretrained(default_model)

In [77]:
distilbert_original = DistilBertForSequenceClassification.from_pretrained(
    pretrained_model_name_or_path=default_model
)

In [80]:
with torch.no_grad():  # inference only, so skip gradient tracking
    classifier_output = distilbert_original(**default_tokens)

In [86]:
out_logits = classifier_output.logits

In [88]:
out_logits[0]

tensor([-4.1843,  4.5188])
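
The logits are raw, unnormalized scores. The pipeline reports a probability instead, which for a two-class model like this one is presumably a softmax over the logits. A minimal plain-Python sketch, using the logits printed above:

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# out_logits[0] from the cell above: index 0 and index 1 of the class dimension.
logits = [-4.1843, 4.5188]
probs = softmax(logits)

print("probabilities:", probs)
print("highest-probability index:", probs.index(max(probs)))
```

With torch tensors the equivalent call would be torch.softmax(out_logits, dim=-1).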

In [83]:
distilbert_original.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

In [93]:
predid = classifier_output.logits[0].argmax().item()
distilbert_original.config.id2label[predid]

'POSITIVE'

In [94]:
predid = classifier_output.logits[1].argmax().item()
distilbert_original.config.id2label[predid]

'NEGATIVE'

In [None]:
def make_pipeline(arg1, arg2, *args, **kwargs):
    # Exercise: make a function that takes text, along with other
    # args, and returns a sentiment label
    pass
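
One possible way to fill in this exercise is sketched below. It is an assumption about what such a function could look like, not how transformers implements pipeline: the names make_sentiment_pipeline and to_predictions are hypothetical, and the steps simply chain the tokenizer, model, softmax, and id2label lookup used in the cells above.

```python
def to_predictions(prob_rows, id2label):
    """Turn per-class probabilities into pipeline-style label/score dicts."""
    predictions = []
    for probs in prob_rows:
        pred_id = max(range(len(probs)), key=lambda i: probs[i])
        predictions.append({"label": id2label[pred_id], "score": probs[pred_id]})
    return predictions


def make_sentiment_pipeline(
    model_name="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
):
    """Load tokenizer and model once; return a function mapping text to labels."""
    # Imported lazily so to_predictions stays usable without torch installed.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()  # switch off dropout for inference

    def predict(texts):
        if isinstance(texts, str):
            texts = [texts]  # accept a single string or a list of strings
        inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits
        probs = torch.softmax(logits, dim=-1).tolist()
        return to_predictions(probs, model.config.id2label)

    return predict
```

Usage would be predict = make_sentiment_pipeline() followed by predict("I loved Star Wars so much!"); the first call to make_sentiment_pipeline downloads the checkpoint if it is not cached.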

In [3]:
nli_classifier = pipeline("text-classification",
                          model="roberta-large-mnli")
nli_output = nli_classifier("A soccer game with multiple males playing. Some men are playing a sport.")
nli_output

Some weights of the model checkpoint at roberta-large-mnli were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


[{'label': 'ENTAILMENT', 'score': 0.9883741140365601}]

In [4]:
classifier = pipeline("text-classification",
                      model = "cross-encoder/qnli-electra-base")

classifier("Where is the capital of France?, Paris is the capital of France.")

[{'label': 'LABEL_0', 'score': 0.9978110194206238}]
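
Both calls above pass the two sentences as one concatenated string. NLI and QNLI models are trained on sentence pairs, and the text-classification pipeline also accepts a dict with "text" and "text_pair" keys so the tokenizer can insert the separator itself. The sketch below shows that input form; as_pair and classify_pair are hypothetical helper names, and which sentence goes first (premise or hypothesis, question or context) should be checked against each model card.

```python
def as_pair(text, text_pair):
    """Build the dict form the text-classification pipeline accepts for pairs."""
    return {"text": text, "text_pair": text_pair}


def classify_pair(model_name, text, text_pair):
    """Run a sentence-pair classifier; downloads the model on first use."""
    from transformers import pipeline  # lazy import: only needed when called

    clf = pipeline("text-classification", model=model_name)
    return clf(as_pair(text, text_pair))


# The same sentences as in the two cells above, split into explicit pairs.
nli_input = as_pair(
    "A soccer game with multiple males playing.",
    "Some men are playing a sport.",
)
qnli_input = as_pair(
    "Where is the capital of France?",
    "Paris is the capital of France.",
)
print(nli_input)
print(qnli_input)
```

For example, classify_pair("roberta-large-mnli", **nli_input) would run the NLI model on the explicit pair.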

https://huggingface.co/transformers/v3.0.2/model_doc/auto.html

This page introduces the Auto classes (such as AutoModel) used above.