## Imports

In [14]:
from transformers import pipeline

# model is from huggingface - https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest
model_path = "cardiffnlp/twitter-roberta-base-sentiment-latest"
sentiment_task = pipeline("sentiment-analysis", model=model_path, tokenizer=model_path)


Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## Preliminary Testing

In the following cell, I score each word of a sample query on its own to assess how the model can be used.

In [15]:
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer, AutoConfig
import numpy as np
from scipy.special import softmax

MODEL = "cardiffnlp/twitter-roberta-base-sentiment-latest"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
config = AutoConfig.from_pretrained(MODEL)
# PyTorch model
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
#model.save_pretrained(MODEL)
text = "hey lefty loser how about they take commercial together and save a bit!."
query = text.split(" ")

sentiment_output = {}
# evaluate sentiments for each word in the query
for word in query:
    encoded_input = tokenizer(word, return_tensors='pt') # includes both the input_ids and the attention_mask

    #print(encoded_input)

    output = model(**encoded_input)
    scores = output[0][0].detach().numpy()
    scores = softmax(scores)

    ranking = np.argsort(scores)
    ranking = ranking[::-1] # 0 stands for negative, 1 stands for neutral, 2 stands for positive

    temp = []
    for i in range(scores.shape[0]):
        l = config.id2label[ranking[i]]
        s = scores[ranking[i]]
        temp.append((l, s))
    
    sentiment_output[word] = temp

print(sentiment_output)




{'hey': [('neutral', 0.5666436), ('positive', 0.37361282), ('negative', 0.05974364)], 'lefty': [('neutral', 0.6846445), ('positive', 0.18011715), ('negative', 0.13523835)], 'loser': [('negative', 0.6023872), ('neutral', 0.287571), ('positive', 0.110041775)], 'how': [('neutral', 0.6041866), ('positive', 0.21852258), ('negative', 0.17729077)], 'about': [('neutral', 0.57544786), ('positive', 0.29396135), ('negative', 0.13059081)], 'they': [('neutral', 0.5719359), ('positive', 0.29150087), ('negative', 0.13656318)], 'take': [('neutral', 0.6289807), ('positive', 0.2486109), ('negative', 0.12240843)], 'commercial': [('neutral', 0.5754338), ('positive', 0.3216852), ('negative', 0.10288105)], 'together': [('neutral', 0.560652), ('positive', 0.34983358), ('negative', 0.08951441)], 'and': [('neutral', 0.57114667), ('positive', 0.27668178), ('negative', 0.15217155)], 'save': [('neutral', 0.6034487), ('positive', 0.2806874), ('negative', 0.11586387)], 'a': [('neutral', 0.4693253), ('positive', 0.3
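
The scoring step in the cell above (softmax over the logits, then sorting labels by probability) can be sketched without the model. This is a minimal sketch in plain Python: the label order is assumed from the comments in the cell (0 = negative, 1 = neutral, 2 = positive), and the logits are made up for illustration, not model output.

```python
import math

# Assumed label order, taken from the comments in the cell above.
# With the real model, use config.id2label instead.
ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}

def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rank_labels(logits, id2label=ID2LABEL):
    """Return (label, probability) pairs sorted from most to least likely."""
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in order]

# Made-up logits for illustration only.
example_ranking = rank_labels([2.0, 0.5, -1.0])
print(example_ranking)
```

The first pair in the returned list plays the role of `sentiment_output[word][0]` in the notebook.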

## Augmentation Example

In [11]:
# Augmentation Example

augmented_input = ""

for word in query:
    augmented_input += word + " " + f"[{sentiment_output[word][0][0]}]" + " "

print(augmented_input)

hey [neutral] lefty [neutral] loser [negative] how [neutral] about [neutral] they [neutral] take [neutral] commercial [neutral] together [neutral] and [neutral] save [neutral] a [neutral] bit!. [neutral] 
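
The same augmentation can be written as a small reusable helper. The sketch below assumes a plain dict that maps each word to its top label (the hypothetical `top_label_for` argument); the three labels are copied from the output above. Using `str.join` also avoids the trailing space that the `+=` loop leaves behind.

```python
def augment_with_sentiment(words, top_label_for):
    """Append "[label]" after each word and join everything with single spaces."""
    return " ".join(f"{word} [{top_label_for[word]}]" for word in words)

# Top labels for the first three words, copied from the notebook output above.
labels = {"hey": "neutral", "lefty": "neutral", "loser": "negative"}
augmented = augment_with_sentiment(["hey", "lefty", "loser"], labels)
print(augmented)  # hey [neutral] lefty [neutral] loser [negative]
```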


## Vibe Testing on Test Data

The following example shows that sentiment analysis by itself is not sufficient to classify the toxicity of a query.

Query: "I get the odd feeling Klastri  the head of the ACLU of Hawaii  will step in and defend this scum for freedom of speech."

Most Likely Sentiment (as ranked in the cell output below):
1) negative
2) neutral
3) positive

In [32]:
# Testing on actual test data to see vibe:

query = "I get the odd feeling Klastri  the head of the ACLU of Hawaii  will step in and defend this scum for freedom of speech."

encoded_input = tokenizer(query, return_tensors='pt') # includes both the input_ids and the attention_mask
output = model(**encoded_input)
scores = output[0][0].detach().numpy()
scores = softmax(scores)

ranking = np.argsort(scores)
ranking = ranking[::-1] # 0 stands for negative, 1 stands for neutral, 2 stands for positive

print(query) # for visualization
for i in range(scores.shape[0]):
    l = config.id2label[ranking[i]]
    s = scores[ranking[i]]
    print(f"{l}: {s}")


I get the odd feeling Klastri  the head of the ACLU of Hawaii  will step in and defend this scum for freedom of speech.
negative: 0.7978891134262085
neutral: 0.19384410977363586
positive: 0.008266775868833065


## Importance of Segmenting Sentences

The experiment below shows that we isolate the toxic parts of a sentence better when we chunk the query.

For the same query: "I get the odd feeling Klastri  the head of the ACLU of Hawaii  will step in and defend this scum for freedom of speech."

We observe the following:
- I get the odd feeling Klastri [neutral]
- the head of the ACLU [neutral]
- of Hawaii  will step in [neutral]
- and defend this scum for freedom [negative]
- of speech. [neutral]

This observation is crucial because it suggests two lessons:
1) Segmenting a sentence into smaller pieces helps localize the toxic content (and allows concurrent processing to speed up classification if needed).
2) It might be beneficial to include only [negative] or [positive] tags and omit [neutral] ones to avoid confusion. Neutral is the top label for most chunks, which dilutes the other sentiments.
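
Both lessons can be sketched together: classify the chunks concurrently, then tag only the non-neutral ones. In this sketch `toy_classifier` is a hypothetical stand-in for the real model call, and `tag_phrases`, `skip_label`, and `max_workers` are illustrative names, not part of the notebook. Whether threads actually speed up the real model has not been measured here.

```python
from concurrent.futures import ThreadPoolExecutor

def toy_classifier(phrase):
    """Stand-in for the real model call; flags a phrase only if it contains 'scum'."""
    return "negative" if "scum" in phrase else "neutral"

def tag_phrases(phrases, classify, skip_label="neutral", max_workers=4):
    """Classify phrases concurrently and tag only those not labeled skip_label."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        labels = list(pool.map(classify, phrases))  # map preserves input order
    tagged = [
        phrase if label == skip_label else f"{phrase} [{label}]"
        for phrase, label in zip(phrases, labels)
    ]
    return " ".join(tagged)

phrases = [
    "I get the odd feeling Klastri",
    "the head of the ACLU",
    "of Hawaii will step in",
    "and defend this scum for freedom",
    "of speech.",
]
tagged_query = tag_phrases(phrases, toy_classifier)
print(tagged_query)
```

With the real model, `classify` would wrap the tokenizer and model calls and return the top label for a phrase.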

In [33]:
query = "I get the odd feeling Klastri  the head of the ACLU of Hawaii  will step in and defend this scum for freedom of speech.".split(" ") # turn this into a list

def listostring(s):
    str1 = " "
    return (str1.join(s))


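divisor = 0 
split_index = max(1, len(query) // 4) # words per chunk; at least 1 so the loop below always advances

# divide the query into chunks of split_index words (a remainder becomes an extra, shorter chunk)
divided_query = []
while divisor < len(query):
    divided_query.append(query[divisor:divisor+split_index])
    divisor += split_index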

divided_query = [listostring(element) for element in divided_query] # convert the list of lists into a list of strings
print(divided_query)
print()
visual_augmented_input = ""
augmented_input = ""

for phrase in divided_query:
    encoded_input = tokenizer(phrase, return_tensors='pt') # includes both the input_ids and the attention_mask
    output = model(**encoded_input)
    scores = output[0][0].detach().numpy()
    scores = softmax(scores)

    ranking = np.argsort(scores)
    ranking = ranking[::-1] # 0 stands for negative, 1 stands for neutral, 2 stands for positive

    visual_augmented_input += phrase + " " + f"[{config.id2label[ranking[0]]}]" + "\n" # for visualization
    augmented_input += phrase + " " + f"[{config.id2label[ranking[0]]}]" + " " 

print(visual_augmented_input)
print(augmented_input)


['I get the odd feeling Klastri', ' the head of the ACLU', 'of Hawaii  will step in', 'and defend this scum for freedom', 'of speech.']

I get the odd feeling Klastri [neutral]
 the head of the ACLU [neutral]
of Hawaii  will step in [neutral]
and defend this scum for freedom [negative]
of speech. [neutral]

I get the odd feeling Klastri [neutral]  the head of the ACLU [neutral] of Hawaii  will step in [neutral] and defend this scum for freedom [negative] of speech. [neutral] 
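
The cell above yields five chunks rather than four, and one chunk starts with a space. `split(" ")` keeps an empty string for each doubled space, so the list has 26 items, the chunk size is 26 // 4 = 6, and the two leftover items form a fifth chunk. The sketch below is one possible alternative, not the notebook's method: `split_into_chunks` is a hypothetical helper that collapses whitespace with `split()` and spreads the remainder across the first chunks.

```python
def split_into_chunks(text, n_chunks=4):
    """Split text into at most n_chunks phrases of near-equal word count."""
    words = text.split()  # no argument: collapses repeated spaces, drops empty strings
    if not words:
        return []
    n_chunks = min(n_chunks, len(words))
    base, extra = divmod(len(words), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)  # first `extra` chunks get one more word
        chunks.append(" ".join(words[start:start + size]))
        start += size
    return chunks

query = "I get the odd feeling Klastri  the head of the ACLU of Hawaii  will step in and defend this scum for freedom of speech."
chunks = split_into_chunks(query, 4)
for chunk in chunks:
    print(chunk)
```

Chunk boundaries differ from the run above, so the per-chunk labels may differ as well once the model is applied.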
