In [1]:
pip install transformers

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


### `pipeline` loads a pretrained model together with its tokenizer. Both are downloaded on first use and cached.
### Passing "sentiment-analysis" selects a model that can perform sentiment analysis.

In [1]:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
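
The warning above recommends naming the model and revision explicitly. A minimal sketch of doing so, reusing the model name and revision hash printed in the warning. The helper name `build_classifier` is hypothetical, and the import sits inside it only so that nothing is downloaded until the function is called.

```python
# Model name and revision copied from the warning printed above.
MODEL_ID = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
REVISION = "af0f99b"


def build_classifier():
    """Return a sentiment-analysis pipeline pinned to a fixed model and revision."""
    # Imported here so the download only happens when the function is called.
    from transformers import pipeline

    return pipeline("sentiment-analysis", model=MODEL_ID, revision=REVISION)
```

Calling `build_classifier()` should give a classifier that behaves like the default one, but without depending on whatever default the library ships in the future.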


In [2]:
classifier("i am very happy today!")

[{'label': 'POSITIVE', 'score': 0.999875545501709}]

### We can also pass several texts at once as a list

In [7]:
results = classifier(["i am not feeling good today","i would love to go to the park"])

for result in results:
    print(f"Label: {result['label']} with score {round(result['score'], 4)}")

Label: NEGATIVE with score 0.9998
Label: POSITIVE with score 0.9995


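##### By default, the pipeline downloaded the model distilbert-base-uncased-finetuned-sst-2-english
##### distilbert-base : the base model (DistilBERT, a smaller distilled version of BERT)
##### uncased : the model was trained on lowercased text, so it does not distinguish upper and lower case
##### finetuned-sst-2-english : the model was fine-tuned on SST-2 (Stanford Sentiment Treebank), an English sentiment dataset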

### To use a different model, pass its name through the `model` argument of `pipeline`

In [8]:
classifier = pipeline("sentiment-analysis",model="nlptown/bert-base-multilingual-uncased-sentiment")



In [9]:
classifier("J'aime manger des pizzas")

[{'label': '5 stars', 'score': 0.39162155985832214}]

### What is actually happening behind the scenes

In [19]:
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

##### AutoTokenizer : converts text into numeric token IDs, which the model then maps to embeddings
##### TFAutoModelForSequenceClassification : the model that receives these token IDs and outputs a score for each class

In [20]:
# This checkpoint only provides PyTorch weights, so from_pt=True converts them for TensorFlow
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
model = TFAutoModelForSequenceClassification.from_pretrained(model_name, from_pt=True)
# The tokenizer is framework-independent, so it does not need from_pt
tokenizer = AutoTokenizer.from_pretrained(model_name)

All PyTorch model weights were used when initializing TFBertForSequenceClassification.

All the weights of TFBertForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForSequenceClassification for predictions without further training.


In [21]:
classifier = pipeline('sentiment-analysis',model=model,tokenizer=tokenizer)

In [22]:
classifier("I am good boy")

[{'label': '4 stars', 'score': 0.42175939679145813}]
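
After the tokenizer and model have run, the pipeline still has to turn the model's raw scores (logits) into a label and a score like the ones shown above. A rough sketch of that last step in plain Python follows. The logits are hypothetical values made up for illustration, and the label names are assumed to follow the "N stars" pattern seen in the outputs above.

```python
import math

# Label names assumed from the star-rating outputs shown above.
ID2LABEL = {0: "1 star", 1: "2 stars", 2: "3 stars", 3: "4 stars", 4: "5 stars"}


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(logits, id2label):
    """Pick the highest-probability class and report it like a pipeline result."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


# Hypothetical logits for one sentence; real values come from the model.
example_logits = [-1.2, -0.4, 0.6, 1.1, 0.3]
result = postprocess(example_logits, ID2LABEL)
print(result["label"], round(result["score"], 4))
```

Because the probability is spread over five classes, the winning score can be well below 1, which is consistent with the scores around 0.4 seen above.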

### How tokenizer works
##### Each model is trained with its own tokenizer, so always load the tokenizer that matches the model

In [23]:
inpt = tokenizer("This is how tokenizer works")
inpt

{'input_ids': [101, 10372, 10127, 12548, 16925, 13649, 14269, 13144, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

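##### input_ids : each token is mapped to its index in the tokenizer's vocabulary (the same token always gets the same ID). The model uses these indexes to look up embeddings
##### token_type_ids : identify which sequence each token belongs to, which matters when two sequences are passed together (e.g. a sentence pair). Here there is only one sequence, so all values are 0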
##### attention_mask :  indicates which tokens should be attended to (1) and which should be ignored (0), useful for handling padding tokens in batch processing. Here, since there is no padding, all values are 1.
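
To see when token_type_ids are not all 0, here is a small sketch of how a BERT-style sentence pair is laid out. It is plain Python, not the library's implementation. The function name `encode_pair` is hypothetical, and the IDs 101 and 102 for [CLS] and [SEP] are assumed from the first and last values in the output above.

```python
def encode_pair(ids_a, ids_b, cls_id=101, sep_id=102):
    """Join two already-tokenized sequences the way BERT-style models expect."""
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment 0 covers [CLS], sequence A and its [SEP]; segment 1 covers B and its [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    attention_mask = [1] * len(input_ids)  # no padding, so every token is attended to
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


# First sequence: the token IDs from the output above, without the special tokens.
first = [10372, 10127, 12548, 16925, 13649, 14269, 13144]
# Second sequence: another list of token IDs, used here only as example input.
second = [11153, 11221, 10127, 13222, 10601]
encoded = encode_pair(first, second)
print(encoded["token_type_ids"])
```

The printed list starts with 0s for the first sequence and switches to 1s for the second.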
##### Q) Why are there more input_ids than words in the sentence?
##### Special Tokens: BERT-like models add special tokens like [CLS] and [SEP] to the beginning and end of the sequence respectively. This adds 2 tokens.
##### Subword Tokenization: Tokenizers in models like BERT use WordPiece, Byte-Pair Encoding (BPE), or similar subword tokenization techniques. This means that words are often broken down into smaller subword units, especially for words that are not in the model's vocabulary.
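
Both effects can be illustrated with a toy WordPiece-style tokenizer. This is a simplified sketch, not the real tokenizer: the vocabulary `TOY_VOCAB` is hypothetical, and the actual model may split these words differently.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into the longest matching vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


def tokenize(text, vocab):
    """Lowercase, split on whitespace, apply WordPiece, and add special tokens."""
    tokens = ["[CLS]"]
    for word in text.lower().split():
        tokens.extend(wordpiece(word, vocab))
    tokens.append("[SEP]")
    return tokens


# Hypothetical toy vocabulary; a real BERT vocabulary has tens of thousands of entries.
TOY_VOCAB = {"this", "is", "how", "token", "##izer", "work", "##s"}
tokens = tokenize("This is how tokenizer works", TOY_VOCAB)
print(tokens)
print(len(tokens))
```

With this toy vocabulary, the 5 words become 9 tokens: 2 special tokens plus 7 word pieces, because "tokenizer" and "works" are each split in two.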

### A better way to tokenize

In [24]:
tokenizer(
    ["This is how tokenizer works","my name is hamza"],
    padding = True,
    truncation = True,
    max_length=512,
    return_tensors="tf")

{'input_ids': <tf.Tensor: shape=(2, 9), dtype=int32, numpy=
array([[  101, 10372, 10127, 12548, 16925, 13649, 14269, 13144,   102],
       [  101, 11153, 11221, 10127, 13222, 10601,   102,     0,     0]])>, 'token_type_ids': <tf.Tensor: shape=(2, 9), dtype=int32, numpy=
array([[0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0]])>, 'attention_mask': <tf.Tensor: shape=(2, 9), dtype=int32, numpy=
array([[1, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 1, 1, 1, 1, 1, 1, 0, 0]])>}

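##### padding : with padding=True, shorter sentences are padded (here with 0s) to the length of the longest sentence in the batch
##### truncation : with truncation=True, sentences longer than max_length are cut down to max_length tokens
##### max_length : the maximum number of tokens per sentence (512 here); it sets the truncation limit and is not the length sentences are padded to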
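
The effect of these three arguments can be sketched in plain Python. This is a simplification under stated assumptions: the function name `pad_and_truncate` is hypothetical, the padding ID is assumed to be 0 as in the output above, and truncation here is a plain slice, whereas the real tokenizer also keeps the closing [SEP] token.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one in the batch."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids, attention_mask = [], []
    for ids in truncated:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 0 marks padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Token IDs copied from the tokenizer output above, without the padding.
batch = [
    [101, 10372, 10127, 12548, 16925, 13649, 14269, 13144, 102],
    [101, 11153, 11221, 10127, 13222, 10601, 102],
]
padded = pad_and_truncate(batch, max_length=512)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Both sentences are shorter than 512 tokens, so nothing is truncated and the shorter one is padded to 9 tokens, not to 512.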