
**Install transformers**

In [None]:
!pip install transformers

**Start using transformers and pipeline.**

In [None]:
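import torch
from transformers import pipeline

# Use the first GPU when one is available; -1 keeps the pipeline on the CPU.
device = 0 if torch.cuda.is_available() else -1
classifier = pipeline("sentiment-analysis", device=device)
classifier("We are happy to buy a new book.")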

**Using Tokenizer and Model separately.**

In [None]:
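from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and the base BERT encoder (no task-specific head).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

txt = "Hello world!"

# Tokenize to PyTorch tensors, then run a forward pass.
inp = tokenizer(txt, return_tensors="pt")
out = model(**inp)
print(out)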

**Using TensorFlow as backend for transformers.**

In [None]:
from transformers import AutoTokenizer, TFAutoModel

# Same checkpoint as before, loaded as a TensorFlow model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = TFAutoModel.from_pretrained("bert-base-uncased")

txt = "Hello world!"

# Tokenize to TensorFlow tensors, then run a forward pass.
inp = tokenizer(txt, return_tensors="tf")
out = model(**inp)
print(out)

**Pass our own tokenizer and model to the pipeline.**

In [None]:
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
results = classifier(["We are very happy to show you the Transformers library.",
                      "We hope you don't hate it."])
for result in results:
    print(result)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


{'label': 'POSITIVE', 'score': 0.9997994303703308}
{'label': 'NEGATIVE', 'score': 0.5308628082275391}
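
The second score (about 0.53) is only slightly above the 0.5 that a two-class model gives when it is undecided, so the NEGATIVE label for that sentence is weak evidence. As a rough sketch, results like these could be split into confident and uncertain predictions. The `split_by_confidence` helper and the 0.75 threshold are illustrative assumptions, not part of the Transformers API.

```python
# Pipeline results copied from the output above.
results = [
    {"label": "POSITIVE", "score": 0.9997994303703308},
    {"label": "NEGATIVE", "score": 0.5308628082275391},
]


def split_by_confidence(results, threshold=0.75):
    """Hypothetical helper: separate confident predictions from uncertain ones."""
    confident = [r for r in results if r["score"] >= threshold]
    uncertain = [r for r in results if r["score"] < threshold]
    return confident, uncertain


confident, uncertain = split_by_confidence(results)
print("confident:", [r["label"] for r in confident])
print("uncertain:", [r["label"] for r in uncertain])
```

A suitable threshold depends on the application; 0.75 is only a placeholder.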


**Inspect what the tokenizer does with the data.**

In [None]:
from transformers import AutoTokenizer

# Only the tokenizer is needed here; the model is not used in this cell.
model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)

text = "We are very happy to show you the Transformers library."
tokens = tokenizer.tokenize(text)                    # split the text into tokens
token_ids = tokenizer.convert_tokens_to_ids(tokens)  # map tokens to vocabulary IDs
encoding = tokenizer(text)                           # full encoding, with special tokens
print(f' Tokens: {tokens}')
print("###" * 10)
print(f'Token IDs: {token_ids}')
print("###" * 10)
print(f'Input IDs: {encoding}')

# Encode a batch: pad to the longest sentence and return PyTorch tensors.
X_train = ["We are very happy to show you the Transformers library.",
           "We hope you don't hate it."]
batch = tokenizer(X_train, padding=True, truncation=True,
                  max_length=512, return_tensors="pt")
print("###" * 10)
print(batch)

 Tokens: ['we', 'are', 'very', 'happy', 'to', 'show', 'you', 'the', 'transformers', 'library', '.']
##############################
Token IDs: [2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 19081, 3075, 1012]
##############################
Input IDs: {'input_ids': [101, 2057, 2024, 2200, 3407, 2000, 2265, 2017, 1996, 19081, 3075, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
##############################
{'input_ids': tensor([[  101,  2057,  2024,  2200,  3407,  2000,  2265,  2017,  1996, 19081,
          3075,  1012,   102],
        [  101,  2057,  3246,  2017,  2123,  1005,  1056,  5223,  2009,  1012,
           102,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]])}
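
The output above suggests three steps: the text is lowercased and split into tokens, each token is mapped to an ID, and the sequence is wrapped in `[CLS]` (101) and `[SEP]` (102) and padded with `[PAD]` (0), with the attention mask marking real tokens as 1 and padding as 0. The toy sketch below reproduces those steps in plain Python. It is not the library's implementation: the vocabulary holds only the token/ID pairs read off the output above (their alignment is assumed from position), the regex split stands in for the real WordPiece tokenizer, and `toy_tokenize` and `toy_encode` are hypothetical names.

```python
import re

# Toy vocabulary: token -> ID pairs read off the output above (assumed alignment).
VOCAB = {
    "[PAD]": 0, "[CLS]": 101, "[SEP]": 102,
    "we": 2057, "are": 2024, "very": 2200, "happy": 3407, "to": 2000,
    "show": 2265, "you": 2017, "the": 1996, "transformers": 19081,
    "library": 3075, ".": 1012, "hope": 3246, "don": 2123, "'": 1005,
    "t": 1056, "hate": 5223, "it": 2009,
}


def toy_tokenize(text):
    """Lowercase the text, then split it into words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def toy_encode(texts):
    """Add [CLS]/[SEP], pad to the longest sequence, and build attention masks."""
    ids = [
        [VOCAB["[CLS]"]] + [VOCAB[t] for t in toy_tokenize(s)] + [VOCAB["[SEP]"]]
        for s in texts
    ]
    max_len = max(len(seq) for seq in ids)
    input_ids = [seq + [VOCAB["[PAD]"]] * (max_len - len(seq)) for seq in ids]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = toy_encode(["We are very happy to show you the Transformers library.",
                    "We hope you don't hate it."])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The toy only works for words in its small vocabulary; the real tokenizer handles unseen words by breaking them into subword pieces.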


**Using the tokenizer and model separately and running inference with PyTorch.**

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import torch.nn.functional as F

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

X_train = ["We are very happy to show you the Transformers library.",
           "We hope you don't hate it."]
batch = tokenizer(X_train, padding=True, truncation=True,
                  max_length=512, return_tensors="pt")
print(batch)
print("###" * 10)

# Inference only, so gradient tracking is disabled.
with torch.no_grad():
    outputs = model(**batch)
    print(outputs)
    print("###" * 10)
    # Turn the raw logits into class probabilities.
    predictions = F.softmax(outputs.logits, dim=1)
    print(predictions)
    print("###" * 10)
    # Pick the most probable class for each sentence.
    labels = torch.argmax(predictions, dim=1)
    print(labels)
    print("###" * 10)
    # Map class IDs to their human-readable names.
    labels = [model.config.id2label[label_id] for label_id in labels.tolist()]
    print(labels)


{'input_ids': tensor([[  101,  2057,  2024,  2200,  3407,  2000,  2265,  2017,  1996, 19081,
          3075,  1012,   102],
        [  101,  2057,  3246,  2017,  2123,  1005,  1056,  5223,  2009,  1012,
           102,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0]])}
##############################
SequenceClassifierOutput(loss=None, logits=tensor([[-4.1329,  4.3811],
        [ 0.0818, -0.0418]]), hidden_states=None, attentions=None)
##############################
tensor([[2.0060e-04, 9.9980e-01],
        [5.3086e-01, 4.6914e-01]])
##############################
tensor([1, 0])
##############################
['POSITIVE', 'NEGATIVE']
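
To make the post-processing concrete, the sketch below redoes the last three steps by hand in plain Python: softmax (each probability is exp(z_i) divided by the sum of exp(z_j) over the row), argmax, and the ID-to-label lookup. The logits are copied from the output above, and the `id2label` mapping is inferred from it (class 1 printed as POSITIVE, class 0 as NEGATIVE). This is an illustration of the arithmetic, not how PyTorch implements it.

```python
import math

# Logits copied from the model output above; label mapping inferred from it.
logits = [[-4.1329, 4.3811], [0.0818, -0.0418]]
id2label = {0: "NEGATIVE", 1: "POSITIVE"}


def softmax(row):
    """Convert one row of logits into probabilities that sum to 1."""
    shift = max(row)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


probs = [softmax(row) for row in logits]
label_ids = [row.index(max(row)) for row in probs]  # argmax per sentence
labels = [id2label[i] for i in label_ids]

print(probs)
print(label_ids)
print(labels)
```

The higher probability in each row is the value the pipeline reported earlier as `score`.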
