# Sentiment analysis with Transformers

## Overview

The Transformer is a neural network architecture for sequence-to-sequence tasks that handles long-range dependencies well. It is a transduction model that relies entirely on self-attention to compute representations of its input and output, without sequence-aligned RNNs or convolutions.
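
To make the idea of self-attention concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention. It is an illustration only, not the implementation used by the model in this exercise: the function names, the random projection matrices `w_q`, `w_k`, `w_v`, and the toy dimensions are all assumptions chosen for the example.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_k).
    Returns the attended outputs (seq_len, d_k) and the
    attention weights (seq_len, seq_len).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Every position scores every other position, scaled by sqrt(d_k).
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


# Toy input: 4 token embeddings of size 8 (random, for illustration only).
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = self_attention(x, w_q, w_k, w_v)
print(output.shape, weights.shape)
```

Each row of `weights` sums to 1, so every output vector is a weighted average of the value vectors of all positions. This is how a token can draw on distant tokens in a single step.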


In this exercise, you will explore a pre-trained Transformer model fine-tuned for sentiment classification.

## Requirements

To install the transformers library, run the following commands:
```
conda install -c pytorch pytorch
pip install transformers[torch]
```

In [32]:
# Import required packages
from typing import Tuple

import pandas as pd
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer

# Create class for data preparation
class SimpleDataset:
    def __init__(self, tokenized_texts):
        self.tokenized_texts = tokenized_texts
    
    def __len__(self):
        return len(self.tokenized_texts["input_ids"])
    
    def __getitem__(self, idx):
        return {k: v[idx] for k, v in self.tokenized_texts.items()}
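
To see what this wrapper hands over, the sketch below feeds it a hand-written dictionary in the shape a tokenizer typically returns. The token ids are made up for illustration and do not come from a real tokenizer. The class is repeated so the snippet runs on its own.

```python
class SimpleDataset:
    """Minimal dataset wrapper around a dict of tokenized inputs."""

    def __init__(self, tokenized_texts):
        self.tokenized_texts = tokenized_texts

    def __len__(self):
        return len(self.tokenized_texts["input_ids"])

    def __getitem__(self, idx):
        # Pick the idx-th entry from every field (input_ids, attention_mask, ...).
        return {k: v[idx] for k, v in self.tokenized_texts.items()}


# Hypothetical tokenizer output for two short texts (ids are invented).
fake_tokenized = {
    "input_ids": [[0, 100, 2], [0, 200, 2]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}

dataset = SimpleDataset(fake_tokenized)
print(len(dataset))
print(dataset[1])
```

Each item is one example with the same keys as the tokenizer output, which is the per-example format the prediction step iterates over.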


In [None]:
# Load tokenizer and model, create trainer
model_name = "siebert/sentiment-roberta-large-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
trainer = Trainer(model=model)

In [None]:
# Create list of texts (can be imported from .csv, .xls etc.)
pred_texts = ["I like that", "That is annoying", "This is great!", "Wouldn't recommend it."]
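
If the texts live in a CSV file instead, a sketch along these lines could load them with pandas. The column name `text` and the in-memory CSV are assumptions for illustration. With a real file, pass its path to `pd.read_csv`.

```python
import io

import pandas as pd

# Stand-in for a file on disk so the example runs without external data.
csv_data = io.StringIO(
    "text\n"
    "I like that\n"
    "That is annoying\n"
    "This is great!\n"
)

reviews = pd.read_csv(csv_data)
# Drop missing rows and make sure every entry is a string before tokenizing.
csv_texts = reviews["text"].dropna().astype(str).tolist()
print(csv_texts)
```

The resulting list can be used in place of `pred_texts` in the following cells.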

In [None]:
# Tokenize texts and create prediction data set
tokenized_texts = tokenizer(pred_texts, truncation=True, padding=True)
pred_dataset = SimpleDataset(tokenized_texts)

In [None]:
# Run predictions
predictions = trainer.predict(pred_dataset)[0]

In [None]:
# Transform predictions to labels
preds = predictions.argmax(-1)
labels = pd.Series(preds).map(model.config.id2label)
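# Softmax over the logits; subtracting the row maximum avoids overflow in np.exp
exp_logits = np.exp(predictions - predictions.max(-1, keepdims=True))
scores = (exp_logits / exp_logits.sum(-1, keepdims=True)).max(-1)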

In [None]:
# Create DataFrame with texts, predictions, labels, and scores
df = pd.DataFrame(
    list(zip(pred_texts, preds, labels, scores)),
    columns=["text", "pred", "label", "score"],
)
df.head()
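
The `score` column is the largest softmax probability in each row of logits. The sketch below walks through that step on made-up logits. The label mapping is an assumption for illustration. In the notebook it comes from `model.config.id2label`.

```python
import numpy as np

# Made-up logits for three texts and two classes (not real model output).
logits = np.array([[-2.0, 3.0], [4.0, -1.0], [1000.0, 1001.0]])
# Assumed label mapping; the real one is read from model.config.id2label.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Subtracting the row-wise maximum leaves the softmax unchanged
# but keeps np.exp from overflowing on large logits (see the third row).
shifted = logits - logits.max(axis=-1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

preds = probs.argmax(axis=-1)                 # index of the most likely class
labels = [id2label[int(i)] for i in preds]    # human-readable label
scores = probs.max(axis=-1)                   # probability of that class
print(labels, scores.round(3))
```

A score close to 0.5 means the two classes are nearly tied, while a score close to 1.0 means the model strongly favors one class. A high score is not a guarantee that the label is correct.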

## Test the boundaries

Try to find some examples where the model fails.

In [None]:
def get_prediction(text: str) -> Tuple[str, float]:
    """Get label and score for a given text.
    
    Args:
        text: Text input to evaluate.
        
    Returns:
        Tuple of label and score for a given text.
    """
    tokenized_texts = tokenizer([text], truncation=True, padding=True)
    pred_dataset = SimpleDataset(tokenized_texts)
    predictions = trainer.predict(pred_dataset)[0]
    label = model.config.id2label[int(predictions.argmax(-1)[0])]
    # Softmax over the logits; subtracting the maximum avoids overflow in np.exp
    exp_logits = np.exp(predictions - predictions.max(-1, keepdims=True))
    score = float((exp_logits / exp_logits.sum(-1, keepdims=True)).max(-1)[0])
    return label, score

In [None]:
label, score = get_prediction(...)
print(f"\nLabel: {label}\nScore: {score}")
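
To probe the model more systematically, one option is to collect candidate texts together with the label you expect and compare them with the predictions. The sketch below is a hypothetical helper, not part of the original exercise. `run_probes` and `keyword_stub` are invented names, the label strings `POSITIVE` and `NEGATIVE` are assumed, and the stub predictor only exists so the example runs without downloading the model. In the notebook you would pass `get_prediction` instead.

```python
from typing import Callable, Iterable, List, Tuple


def run_probes(
    predict: Callable[[str], Tuple[str, float]],
    probes: Iterable[Tuple[str, str]],
) -> List[dict]:
    """Run predict on each (text, expected_label) pair and record the outcome."""
    results = []
    for text, expected in probes:
        label, score = predict(text)
        results.append(
            {
                "text": text,
                "expected": expected,
                "label": label,
                "score": float(score),
                "correct": label == expected,
            }
        )
    return results


def keyword_stub(text: str) -> Tuple[str, float]:
    """Toy stand-in for get_prediction: POSITIVE only if 'great' appears."""
    return ("POSITIVE" if "great" in text.lower() else "NEGATIVE", 0.5)


probes = [
    ("This is great!", "POSITIVE"),
    ("Oh great, it broke again.", "NEGATIVE"),  # sarcasm
    ("Not bad at all.", "POSITIVE"),            # negated negative word
]

for row in run_probes(keyword_stub, probes):
    print(row)
```

Rows with `correct` set to `False` are the failure cases worth a closer look. The stub fails on the sarcastic and negated examples by construction, and the real model may behave differently.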