<center>

# Sentiment Analysis

</center>

In [1]:
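# Installing the transformers and scipy libraries (uncomment and run once).
# The code below also needs PyTorch (torch), because the tokenizer is asked to return PyTorch tensors.

# pip install transformers --quiet
# pip install scipy --quiet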

* ***We use the "transformers" package to download the model from the Hugging Face website***
* ***We use the "scipy" package to convert the output of the model into probability scores***

**`pip`** - pip stands for "Pip Installs Packages". It is a package management system and command-line tool for installing,  
        upgrading, and managing Python packages and dependencies. Pip is the standard package installer for Python and is 
        widely used in the Python community. It simplifies the process of installing and managing external libraries and 
        packages, making it easier to work with third-party code in Python projects.


**`transformers`** - The "transformers" library simplifies the development of NLP (Natural Language Processing) models by 
        providing access to powerful pre-trained models, efficient tokenization methods, and a unified API for various NLP 
        tasks. It has become a go-to library for many NLP practitioners and researchers, enabling them to leverage the latest 
        advancements in transformer-based architectures for their specific NLP applications.
        
The main highlight of the "transformers" library is its extensive collection of pre-trained models, including 
        state-of-the-art models like BERT, GPT, RoBERTa, and many others. These models are pre-trained on large corpora and can 
        be fine-tuned or used directly for a wide range of NLP tasks.


**`scipy`** - The SciPy library is a comprehensive scientific computing library in Python that provides a wide range of tools 
        and functions for various areas of scientific and technical computing. Built on top of NumPy, it offers efficient 
        numerical operations, advanced mathematical functions, optimization algorithms, interpolation techniques, linear algebra
        operations, statistical analysis tools, signal and image processing functions, and more. With its extensive
        functionality, SciPy serves as a powerful tool for tasks such as numerical analysis, data processing, optimization, 
        statistics, signal processing, and beyond. It is widely used in scientific research, engineering, data analysis, 
        and other domains where complex mathematical and scientific computations are required.


**`--quiet`** - For pip, the `--quiet` option (short form `-q`) reduces the amount of output printed during installation, so 
        the notebook is not cluttered with download and progress messages. Warnings and errors are still reported. Other 
        command-line tools often offer a similar flag, but its exact behavior depends on the tool.


**`Hugging Face website`** - The website hosts a model called "twitter-roberta-base-sentiment" for tweet sentiment analysis. 
        The model is published by Cardiff NLP and is built on the RoBERTa architecture, which was originally developed by 
        Facebook AI. It was trained on about 58 million tweets and fine-tuned for sentiment analysis. 
        Website - https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment


In [2]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

In [3]:
from scipy.special import softmax


**`AutoTokenizer`** - AutoTokenizer is a class in the Hugging Face Transformers library that automatically selects and 
        instantiates the tokenizer matching a given pre-trained model. Calling `AutoTokenizer.from_pretrained` with a model 
        name loads the tokenizer files published with that model, so you do not need to know which tokenizer class the 
        model requires. The tokenizer splits input text into tokens, converts the tokens to input IDs, adds special tokens, 
        and can pad or truncate sequences. Using the tokenizer that belongs to the model ensures the text is encoded the 
        same way it was during training, whatever the downstream task (text classification, named entity recognition, 
        machine translation, and so on).


**`AutoModelForSequenceClassification`** - "AutoModelForSequenceClassification" is a class in the Hugging Face 
        Transformers library that automatically selects and instantiates the right model architecture based on the 
        configuration of the specified pre-trained checkpoint. It loads the model with a sequence classification head, 
        i.e. a final layer that outputs one score (logit) per class, which suits tasks such as sentiment analysis or text 
        categorization. The class provides a consistent interface for fine-tuning, inference, and evaluation, and it 
        supports many architectures, such as BERT, RoBERTa, and GPT, so switching between them usually means changing 
        only the model name passed to `from_pretrained`.


**`softmax`** - The softmax function is a mathematical function used in machine learning for multiclass classification tasks. 
        It takes a vector of real numbers as input and transforms them into a probability distribution. The function 
        exponentiates each element of the input vector and normalizes the results by dividing them by the sum of all 
        exponentiated values. This normalization ensures that the resulting probabilities sum up to 1. The softmax function 
        is commonly used to convert model outputs into probabilities for assigning an input to one of multiple classes. 
        By mapping the outputs to a probability distribution, the softmax function enables the selection of the most probable 
        class based on the highest probability value. 
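
To make the description of softmax concrete, here is a minimal pure-Python sketch of the function. It is only an illustration of the formula, not the SciPy implementation; the name `softmax` is reused for readability and the input values are example raw scores.

```python
import math

def softmax(values):
    """Return the softmax of a list of real numbers as a list of probabilities."""
    # Subtracting the maximum does not change the result but avoids overflow in exp().
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.3710, 0.3350, -1.7215]   # example raw scores, one per class
probs = softmax(logits)

print(probs)        # three values between 0 and 1
print(sum(probs))   # the probabilities add up to 1 (up to rounding)
```

The largest raw score always receives the largest probability, so the ranking of the classes is preserved.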


In [4]:

tweet = "@MehranShakarami today's cold @ home 😒 https://mehranshakarami.com"


In [5]:

# preprocess tweet                                 # Explanation:  

tweet_words = []

for word in tweet.split(' '):                     # splits the tweet word by word
    if word.startswith('@') and len(word) > 1:    # a username is '@' followed by at least one character; len(word) > 1 skips a lone '@' (used here to mean "at")
        word = '@user'                            # converts any word that starts with '@' and has at least 2 characters to the placeholder "@user"
    
    elif word.startswith('http'):
        word = "http"                             # converts any word that starts with 'http' (i.e. a link) to the placeholder 'http'
    tweet_words.append(word)

tweet_proc = " ".join(tweet_words)                # joins the words back into one string, with the usernames and links replaced

tweet_words


['@user', "today's", 'cold', '@', 'home', '😒', 'http']
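
The same preprocessing can be wrapped in a function so it is easy to reuse on other tweets. This is a sketch only; the function name `preprocess_tweet` is an assumption introduced here and does not appear elsewhere in the notebook.

```python
def preprocess_tweet(tweet):
    """Replace user mentions with '@user' and links with 'http'."""
    words = []
    for word in tweet.split(' '):
        if word.startswith('@') and len(word) > 1:
            word = '@user'      # mask the username
        elif word.startswith('http'):
            word = 'http'       # mask the link
        words.append(word)
    return ' '.join(words)

example = "@MehranShakarami today's cold @ home 😒 https://mehranshakarami.com"
print(preprocess_tweet(example))   # @user today's cold @ home 😒 http
```

Note that the lone "@" is left untouched, because it is not followed by a username.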

In [6]:

# load model and tokenizer

roberta = "cardiffnlp/twitter-roberta-base-sentiment"                # Name of the model copied from Hugging Face website

model = AutoModelForSequenceClassification.from_pretrained(roberta)  # Downloading and loading the "roBERTa" model 
tokenizer = AutoTokenizer.from_pretrained(roberta)                   # Loading the Tokenizer to convert tweet text into appropriate numbers

labels = ['Negative', 'Neutral', 'Positive']                         # Names for the model's three outputs, in output order (index 0, 1, 2)


# Note - The model is downloaded only on the first run. Later runs load it from the local cache and should be much faster.


In [7]:

# Sentiment Analysis

encoded_tweet = tokenizer(tweet_proc, return_tensors='pt')     # Tokenizing the processed tweet; return_tensors='pt' returns PyTorch tensors that can be passed to the model

print(encoded_tweet)                                           # Printing the encoded tweet


{'input_ids': tensor([[    0,  1039, 12105,   452,    18,  2569,   787,   184, 17841, 10659,
          2054,     2]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}


**`input_ids`** - The tokenizer returns a dictionary-like object with two keys. The first is 'input_ids', and its value is a PyTorch tensor.

In this specific case, the tensor is a 2D tensor with a shape of (1, 12): one sequence (our tweet) of 12 tokens. Each value is the ID of a token in the model's vocabulary.

In the example, the token IDs are [0, 1039, 12105, 452, 18, 2569, 787, 184, 17841, 10659, 2054, 2]. The first and last values, 0 and 2, are the special tokens `<s>` and `</s>` that the RoBERTa tokenizer adds to mark the start and end of the sequence. The remaining IDs come from splitting the text into subword units, which is why the 7 words of the processed tweet produce 10 IDs.

Overall, 'input_ids' is the numerical form of the text and is passed to the model as input.



**`attention_mask`** - The second key of the tokenizer output is 'attention_mask', and its value is also a PyTorch tensor.

In this specific case, the tensor is a 2D tensor with a shape of (1, 12), matching 'input_ids': one value for each of the 12 tokens. Attention masks are used by transformer-based models such as BERT and RoBERTa.

The attention mask indicates which tokens in the sequence should be attended to (considered) by the model and which should be ignored. Here every value is 1, so all tokens are attended to. A value of 0 marks a token to ignore, typically padding added so that sequences of different lengths can be processed together in one batch.

Overall, 'attention_mask' is an input to the model that guides its attention mechanism. With a single unpadded tweet it contains only 1s.
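
To show where 0s in an attention mask come from, here is a minimal pure-Python sketch of padding a batch of token-ID sequences. The helper `pad_batch`, the example ID lists, and the padding ID of 1 (the value RoBERTa tokenizers normally use) are assumptions for illustration; in practice the tokenizer does this itself when called with `padding=True`.

```python
def pad_batch(sequences, pad_id=1):
    """Pad lists of token IDs to equal length and build matching attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)              # fill the tail with padding IDs
        attention_mask.append([1] * len(seq) + [0] * n_pad)   # 1 = real token, 0 = padding
    return input_ids, attention_mask

# Two hypothetical sequences of different lengths
batch = [[0, 1039, 12105, 2], [0, 2569, 2]]
ids, mask = pad_batch(batch)

print(ids)    # [[0, 1039, 12105, 2], [0, 2569, 2, 1]]
print(mask)   # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

The shorter sequence gets one padding token, and the 0 in its mask tells the model to ignore that position.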

In [8]:

# Passing the encoded tweet to the model to do sentiment analysis

output = model(encoded_tweet['input_ids'], encoded_tweet['attention_mask'])   # passing the two tensors stored under the keys "input_ids" and "attention_mask"

print(output)                                                                 # prints the raw model output; the "logits" field holds one unnormalized score per label


SequenceClassifierOutput(loss=None, logits=tensor([[ 1.3710,  0.3350, -1.7215]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)


In [9]:

# Simplifying 

"""
Instead of passing each tensor to the model separately
(i.e. "output = model(encoded_tweet['input_ids'], encoded_tweet['attention_mask'])"),
we can unpack the encoded tweet with ** so that its keys are passed as keyword arguments
"""
output = model(**encoded_tweet)

print(output)


SequenceClassifierOutput(loss=None, logits=tensor([[ 1.3710,  0.3350, -1.7215]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
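
The `**` syntax is ordinary Python dictionary unpacking, not something specific to Transformers. A small sketch with a hypothetical stand-in function (`fake_model` is not a real model and is introduced only for illustration) shows that both call styles are equivalent:

```python
def fake_model(input_ids, attention_mask):
    """Hypothetical stand-in for a model call; it only summarizes its inputs."""
    return {"n_tokens": len(input_ids), "n_attended": sum(attention_mask)}

encoded = {"input_ids": [0, 1039, 12105, 2], "attention_mask": [1, 1, 1, 1]}

# Passing each value explicitly by keyword...
explicit = fake_model(input_ids=encoded["input_ids"],
                      attention_mask=encoded["attention_mask"])

# ...is the same as unpacking the dictionary: each key becomes a keyword argument.
unpacked = fake_model(**encoded)

print(explicit == unpacked)   # True
```

Unpacking also keeps the call unchanged if the tokenizer returns additional keys that the model accepts.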


In [10]:
print(type(output))

<class 'transformers.modeling_outputs.SequenceClassifierOutput'>


In [11]:

# Converting the result into probabilities 

scores = output[0][0].detach().numpy()     # output[0] is the logits tensor; the second [0] selects the row for our single tweet, i.e. [ 1.3710,  0.3350, -1.7215]
                                           # .detach() removes the tensor from the gradient graph and .numpy() converts it into a numpy array
    
print(scores)


[ 1.3709717   0.33496094 -1.7214578 ]


In [12]:
print(type(scores))         # confirmation that the tensor was converted into a numpy array 

<class 'numpy.ndarray'>


In [13]:

scores = softmax(scores)     # passing scores into the softmax function to convert them into probabilities

print(scores)               

"""
# Note 

1. You can multiply the result by 100 to read the scores as percentages (0 to 100) rather than probabilities (0 to 1). This is optional.
2. Do not run this cell more than once in a row: it overwrites "scores", so each rerun applies softmax to values that are already probabilities and the output changes.
3. If the cell has been rerun, restore the original output by running again from the cell that computes scores from the model output, two cells above.
""";


[0.7141536  0.2534299  0.03241653]


In [14]:

# Printing output with the corresponding labels

for i in range(len(scores)):
    
    l = labels[i]
    s = scores[i]
    print(l,s)


Negative 0.7141536
Neutral 0.2534299
Positive 0.032416526
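
Often only the most likely label is needed. The sketch below picks it from the probabilities and also prints all labels ranked from most to least likely. The score values are copied from the output above so the example runs on its own; the variable name `best_index` is introduced here for illustration.

```python
labels = ['Negative', 'Neutral', 'Positive']
scores = [0.7141536, 0.2534299, 0.03241653]   # probabilities copied from the output above

# Index of the highest probability
best_index = max(range(len(scores)), key=lambda i: scores[i])
print(f"{labels[best_index]} ({scores[best_index]:.1%})")   # Negative (71.4%)

# All labels, ranked from most to least likely
for label, score in sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True):
    print(f"{label:<8} {score:.2%}")
```

With a numpy array of scores, `scores.argmax()` gives the same index.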


### <span style="background-color: yellow;">Let's look at another example using the entire code in a single cell</span>

In [15]:

tweet = 'Great content! subscribed 😉'

# preprocess tweet
tweet_words = []

for word in tweet.split(' '):
    if word.startswith('@') and len(word) > 1:
        word = '@user'
    
    elif word.startswith('http'):
        word = "http"
    tweet_words.append(word)

tweet_proc = " ".join(tweet_words)

# load model and tokenizer
roberta = "cardiffnlp/twitter-roberta-base-sentiment"

model = AutoModelForSequenceClassification.from_pretrained(roberta)
tokenizer = AutoTokenizer.from_pretrained(roberta)

labels = ['Negative', 'Neutral', 'Positive']

# sentiment analysis
encoded_tweet = tokenizer(tweet_proc, return_tensors='pt')
# output = model(encoded_tweet['input_ids'], encoded_tweet['attention_mask'])
output = model(**encoded_tweet)

scores = output[0][0].detach().numpy()
scores = softmax(scores) * 100            # converting the probabilities to percentages by multiplying by 100

# print output with corresponding labels
for i in range(len(scores)):
    
    l = labels[i]
    s = scores[i]
    print(l,s)


Negative 0.20959757
Neutral 1.897672
Positive 97.89272
