# Build a sentiment analysis pipeline with HuggingFace

In [20]:
# For Colab: install the Transformers library
!pip install transformers






In [21]:
from transformers import pipeline
import torch
from pprint import pprint

In [78]:
model_name = "SamLowe/roberta-base-go_emotions"
classifier = pipeline("text-classification", model=model_name, top_k=None)

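We start by creating a text-classification **classifier** with the `pipeline` function from the Hugging Face Transformers library. `pipeline` bundles a pre-trained model with its pre-processing and post-processing, so common natural language processing (NLP) tasks such as sentiment analysis take only a few lines. Passing `top_k=None` asks the pipeline to return a score for every label instead of only the best one.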

In [79]:
results = classifier("I am walking now")
results

[[{'label': 'neutral', 'score': 0.7125529646873474},
  {'label': 'approval', 'score': 0.13431048393249512},
  {'label': 'realization', 'score': 0.07332759350538254},
  {'label': 'joy', 'score': 0.05254011228680611},
  {'label': 'relief', 'score': 0.026701610535383224},
  {'label': 'optimism', 'score': 0.012795311398804188},
  {'label': 'excitement', 'score': 0.011233342811465263},
  {'label': 'pride', 'score': 0.009617138653993607},
  {'label': 'caring', 'score': 0.009234932251274586},
  {'label': 'sadness', 'score': 0.008024858310818672},
  {'label': 'annoyance', 'score': 0.005693409126251936},
  {'label': 'admiration', 'score': 0.005559689365327358},
  {'label': 'fear', 'score': 0.004826841875910759},
  {'label': 'nervousness', 'score': 0.004653627518564463},
  {'label': 'desire', 'score': 0.0039973268285393715},
  {'label': 'disappointment', 'score': 0.0038509401492774487},
  {'label': 'amusement', 'score': 0.0031087114475667477},
  {'label': 'embarrassment', 'score': 0.002306994516

The model takes the text as input and returns a score for each emotion label. Here `neutral` scores highest (about 0.71), followed by `approval` (about 0.13).
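To work with this nested output, it can help to keep only the labels above some score. The sketch below hard-codes the first five entries of the output shown above so it runs without downloading a model; the helper name `labels_above` and the 0.1 threshold are illustrative assumptions, not part of the Transformers API.

```python
# Hard-coded excerpt of the pipeline output shown above (first five labels only).
results = [[
    {"label": "neutral", "score": 0.7125529646873474},
    {"label": "approval", "score": 0.13431048393249512},
    {"label": "realization", "score": 0.07332759350538254},
    {"label": "joy", "score": 0.05254011228680611},
    {"label": "relief", "score": 0.026701610535383224},
]]


def labels_above(scores, threshold=0.1):
    """Return (label, score) pairs with score >= threshold, best first.

    `labels_above` is a hypothetical helper written for this example.
    """
    kept = [(item["label"], item["score"]) for item in scores if item["score"] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# The outer list holds one entry per input text, so results[0] is our single sentence.
selected = labels_above(results[0])
for label, score in selected:
    print(f"{label}: {score:.2f}")
```

With the real `results` variable from the cell above, the same helper can be called as `labels_above(results[0])`.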

pipeline doc: https://huggingface.co/docs/transformers/main_classes/pipelines
pipeline tasks: 

### More than one sentence

In [33]:
# We give a list to the classifier now
results = classifier(["My hovercraft is full of eels and that's it", "My borther won"])
results

[{'label': 'POSITIVE', 'score': 0.6145071983337402},
 {'label': 'POSITIVE', 'score': 0.999642014503479}]
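When a list goes in, the predictions come back in the same order, so they can be paired with their inputs using `zip`. This is a minimal sketch that copies the output printed above and assumes the flat format shown there (one dictionary per input); the variable name `paired` is only illustrative.

```python
texts = ["My hovercraft is full of eels and that's it", "My borther won"]

# Output copied from the cell above: one dictionary per input sentence.
results = [
    {"label": "POSITIVE", "score": 0.6145071983337402},
    {"label": "POSITIVE", "score": 0.999642014503479},
]

# Pair each input with its predicted label and a rounded score.
paired = [(text, res["label"], round(res["score"], 3)) for text, res in zip(texts, results)]

for text, label, score in paired:
    print(f"{label:<8} {score:.3f}  {text}")
```

Printing the score next to the text makes low-confidence predictions, such as the first sentence here, easier to spot.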

### Exercise:

Add different text inputs with varying sentiments, run it, check the model's sentiment predictions, and explore how it assigns labels.

## Now select a specific model into your pipeline

In [61]:
model_name = "SamLowe/roberta-base-go_emotions"

The `model_name` variable holds the Hub identifier of the pre-trained model. In this case, it's "SamLowe/roberta-base-go_emotions".

Let's have a look at the model card: https://huggingface.co/SamLowe/roberta-base-go_emotions

In [67]:
classifier = pipeline("sentiment-analysis", model=model_name)

## Tokenizer

- Tokenization is the process of breaking down text into smaller **units** called **tokens**.

- Tokens are the basic building blocks used by Transformers models to understand and process text.

- Tokens can represent **words, subwords, or even individual characters**, depending on the model's vocabulary.

![Pipeline](https://huggingface.co/datasets/huggingface-course/documentation-images/resolve/main/en/chapter2/full_nlp_pipeline.svg)

Source image: https://huggingface.co/learn/nlp-course/chapter2/2?fw=pt

In [68]:
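from transformers import AutoTokenizer, AutoModelForSequenceClassification

`AutoModelForSequenceClassification` reads the checkpoint's configuration and instantiates the matching model architecture with a sequence-classification head. `AutoTokenizer` does the same for the tokenizer.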

In [69]:
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

`from_pretrained` downloads the pre-trained model and tokenizer identified by `model_name`, or loads them from the local cache if they are already there.

In [70]:
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)

We create our sentiment analysis classifier again, this time passing the loaded model and tokenizer objects explicitly.

## Tokens to input IDs

In [51]:
tokens = tokenizer.tokenize("Another cool sentence to demonstrate something.")
token_ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = tokenizer("Another cool sentence to demonstrate something.")

In [52]:
print(f' Tokens:{tokens}')
print(f' Token IDs: {token_ids}')
print(f' input_ids:{input_ids}')

 Tokens:['another', 'cool', 'sentence', 'to', 'demonstrate', 'something', '.']
 Token IDs: [2178, 4658, 6251, 2000, 10580, 2242, 1012]
 input_ids:{'input_ids': [101, 2178, 4658, 6251, 2000, 10580, 2242, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
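Comparing the two printed lists, `input_ids` is the `Token IDs` list with one extra ID at each end. The sketch below reproduces that relationship with a toy vocabulary built from the output above. Treating 101 and 102 as the `[CLS]` and `[SEP]` special tokens is an assumption based on BERT-style vocabularies, and `encode` is a hypothetical helper, not the tokenizer's real implementation.

```python
# Toy vocabulary built from the tokens and IDs printed above.
# Assumption: 101 and 102 are the [CLS] and [SEP] special tokens (BERT-style).
vocab = {
    "[CLS]": 101, "[SEP]": 102,
    "another": 2178, "cool": 4658, "sentence": 6251, "to": 2000,
    "demonstrate": 10580, "something": 2242, ".": 1012,
}


def encode(tokens, vocab):
    """Map tokens to IDs, wrap them in special tokens, and build the attention mask."""
    token_ids = [vocab[token] for token in tokens]
    input_ids = [vocab["[CLS]"]] + token_ids + [vocab["[SEP]"]]
    # Without padding, every position is a real token, so the mask is all ones.
    return {"input_ids": input_ids, "attention_mask": [1] * len(input_ids)}


tokens = ["another", "cool", "sentence", "to", "demonstrate", "something", "."]
encoded = encode(tokens, vocab)
print(encoded)
```

This is why calling the tokenizer directly is usually preferable to `tokenize` plus `convert_tokens_to_ids`: the direct call also adds the special tokens the model expects.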


### Exercise:
Try different tokenizers by loading other models from the Hub.

Some options:

https://huggingface.co/SamLowe/roberta-base-go_emotions

https://huggingface.co/bert-base-uncased

Some more... 


In [87]:
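# Exercise answer: load a different tokenizer and compare its output.
# Note: this rebinds `tokenizer`; reload the tokenizer that matches `model` before sending batches to the model.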
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("Alex went to Gorge's house! And that is Cooool!")
token_ids = tokenizer.convert_tokens_to_ids(tokens)
input_ids = tokenizer("Another cool sentence to demonstrate something.")



In [88]:
print(f' Tokens:{tokens}')
print(f' Token IDs: {token_ids}')
print(f' input_ids:{input_ids}')

 Tokens:['alex', 'went', 'to', 'gorge', "'", 's', 'house', '!', 'and', 'that', 'is', 'co', '##oo', '##ol', '!']
 Token IDs: [4074, 2253, 2000, 14980, 1005, 1055, 2160, 999, 1998, 2008, 2003, 2522, 9541, 4747, 999]
 input_ids:{'input_ids': [101, 2178, 4658, 6251, 2000, 10580, 2242, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}
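In the tokens above, "Cooool" is split into `co`, `##oo`, `##ol`: the `##` prefix marks a piece that continues the previous one. A greedy longest-match split in the style of WordPiece can be sketched as follows. The tiny vocabulary and the function name `wordpiece` are assumptions made for illustration; the real tokenizer uses a vocabulary of many thousands of entries and additional normalization.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first subword split (WordPiece-style sketch)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word becomes unknown.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical mini-vocabulary, just large enough for the example.
toy_vocab = {"co", "##oo", "##ol", "cool", "house"}

print(wordpiece("cooool", toy_vocab))
print(wordpiece("house", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```

Splitting rare spellings into known pieces is what lets a fixed vocabulary cover words it has never seen as a whole.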


## Batching

In [89]:
sentences = ["Another cool sentence to demonstrate something.",
             "All I need is two sentences."]
batch = tokenizer(sentences, padding=True, truncation=True, max_length=512, return_tensors="pt")  # "pt" returns PyTorch tensors

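### Note:
Every sample in a batch must have the same length, because the batch is stacked into a single tensor and all rows of a tensor must share one shape. That is what these two arguments are for:
```
padding=True and truncation=True
```
`padding=True` pads shorter sequences to the length of the longest one in the batch, and `truncation=True` cuts sequences that are longer than `max_length`.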
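What padding and truncation do to a batch can be sketched in plain Python. This is a simplified illustration, not the tokenizer's implementation: `pad_batch` is a hypothetical helper, `pad_id=0` is an assumed padding ID, and the truncation here simply slices the list, whereas a real tokenizer also takes care of special tokens.

```python
def pad_batch(sequences, pad_id=0, max_length=None):
    """Truncate to max_length (if given), then pad every sequence to the longest one."""
    if max_length is not None:
        sequences = [seq[:max_length] for seq in sequences]
    longest = max(len(seq) for seq in sequences)

    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Two sequences of different lengths (IDs are placeholders).
toy_batch = pad_batch([[101, 2178, 4658, 102], [101, 2035, 102]])
print(toy_batch["input_ids"])
print(toy_batch["attention_mask"])
```

After padding, both rows have the same length, so they can be stacked into one rectangular tensor.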

In [90]:
pprint(batch)

{'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1]]),
 'input_ids': tensor([[  101,  2178,  4658,  6251,  2000, 10580,  2242,  1012,   102],
        [  101,  2035,  1045,  2342,  2003,  2048, 11746,  1012,   102]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0]])}


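The tokenizer returns a dictionary-like object with the keys 'input_ids', 'attention_mask' and, for this tokenizer, 'token_type_ids'. Each key maps to one tensor with one row per sentence.
Each entry of input_ids is the vocabulary index of a token; attention_mask marks real tokens with 1 and padding with 0.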

## Predictions

In [91]:
# Prevent gradient computation (no need to compute gradients during inference)

with torch.no_grad():
    outputs = model(**batch) 
    print(outputs)
    print('')
    predictions = torch.softmax(outputs.logits, dim=1)      # Apply softmax to convert model logits to probabilities
    pprint(predictions)
    print('')
    labels = torch.argmax(predictions, dim=1)              # Find the index of the class with the highest probability for each example
    pprint(labels)
    labels = [model.config.id2label[label_id] for label_id in labels.tolist()]
    pprint(labels)

SequenceClassifierOutput(loss=None, logits=tensor([[-5.7059, -6.0613, -5.9658, -5.0061, -4.3712, -6.9165, -5.9688, -6.5262,
         -6.8372, -5.5885, -5.4265, -6.1034, -7.0587, -6.1179, -6.4271, -6.6725,
         -7.5742, -6.2602, -6.7766, -7.8902, -6.2853, -7.8410, -5.0055, -7.8402,
         -7.6885, -5.9437, -6.8205,  3.4048],
        [-5.7238, -6.0326, -5.7438, -4.8020, -4.5821, -6.9556, -5.8810, -6.3498,
         -6.7756, -5.5713, -5.4902, -5.9506, -6.9905, -5.9530, -6.3645, -6.7789,
         -7.5721, -6.2770, -6.8283, -7.8356, -6.3383, -7.8621, -5.1717, -7.9430,
         -7.7602, -5.9722, -6.5982,  3.4492]]), hidden_states=None, attentions=None)

tensor([[1.1023e-04, 7.7266e-05, 8.5000e-05, 2.2194e-04, 4.1876e-04, 3.2850e-05,
         8.4748e-05, 4.8536e-05, 3.5563e-05, 1.2397e-04, 1.4576e-04, 7.4074e-05,
         2.8496e-05, 7.3010e-05, 5.3591e-05, 4.1929e-05, 1.7019e-05, 6.3329e-05,
         3.7786e-05, 1.2407e-05, 6.1756e-05, 1.3033e-05, 2.2208e-04, 1.3044e-05,
         1.5180
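The softmax and argmax steps in the cell above can be reproduced with the standard library on a handful of the printed logits. The sketch also computes a per-label sigmoid for comparison: the first six pipeline scores at the top of this notebook already sum to more than 1, which suggests (an inference from the output, not something the source states) that the pipeline scores each label independently rather than with softmax. Only four of the 28 logits are used, so the numbers will differ slightly from the full output.

```python
import math


def softmax(logits):
    """Convert logits to probabilities that sum to 1 (numerically stable form)."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Squash one logit to (0, 1) independently of the other labels."""
    return 1.0 / (1.0 + math.exp(-x))


# Four of the logits printed above for the first sentence, including the largest one.
logits = [-5.0061, -4.3712, -5.0055, 3.4048]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)  # same role as torch.argmax
independent = [sigmoid(x) for x in logits]

print("softmax:", [round(p, 4) for p in probs])
print("argmax index:", best)
print("sigmoid:", [round(p, 4) for p in independent])
```

Softmax forces the labels to compete for a total probability of 1, while sigmoid lets several labels score high at once, which is the usual choice for multi-label problems.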

In [92]:
# Define the number of decimal places to round to
decimal_places = 2
# Round the probabilities
rounded_probabilities = torch.round(predictions * 10**decimal_places) / (10**decimal_places)
# Print the rounded probabilities
print('')
pprint(rounded_probabilities)


tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]])


### Saving

In [51]:
save_directory = "your_dir"
tokenizer.save_pretrained(save_directory)
model.save_pretrained(save_directory)

tokenizer = AutoTokenizer.from_pretrained(save_directory)
model = AutoModelForSequenceClassification.from_pretrained(save_directory)
