# What is sentiment analysis?

 - The process of computationally identifying and categorizing opinions expressed in a piece of text or document.
 - Determine whether the writer's attitude towards a particular topic, product, etc. is positive, negative, or neutral.

[NLTK](https://www.nltk.org/)

 - Natural Language Toolkit (NLTK) is a Python library for NLP.

[SpaCy](https://spacy.io/)

 - SpaCy is an open-source software library for advanced Natural Language Processing, written in the programming languages Python and Cython.

[Gensim](https://radimrehurek.com/gensim/)

 - Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora.

## Steps of sentiment analysis

01. Data Collection
02. Data Preprocessing
    - Lowercasing
    - Tokenization
    - Stopwords removal
    - Remove punctuation
03. Text Vectorization
    - Bag of Words
    - TF-IDF (Term Frequency - Inverse Document Frequency)
    - Word Embeddings
       - Word2Vec
       - GloVe
       - BERT
04. Model Selection
    - Naive Bayes
    - Logistic Regression
    - Support Vector Machine
    - Random Forest
    - Neural Network
05. Model Evaluation
06. Model Deployment
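
The preprocessing and Bag of Words steps listed above can be sketched in plain Python. This is a minimal illustration, not a production pipeline: the stop-word list `STOP_WORDS` and the helper names `preprocess` and `bag_of_words` are assumptions made for this example, and tokenization is a simple whitespace split.

```python
import string
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real pipelines use a fuller list).
STOP_WORDS = {"i", "the", "a", "an", "is", "it", "this", "to", "and"}


def preprocess(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def bag_of_words(docs):
    """Return the sorted vocabulary and one count vector per document."""
    tokenized = [preprocess(doc) for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors


docs = ["I love the new phone.", "Great weather today!", "I hate waiting."]
vocab, vectors = bag_of_words(docs)
print(vocab)
for doc, vec in zip(docs, vectors):
    print(vec, doc)
```

Each document becomes a vector with one position per vocabulary word, which is the input format the classifiers in step 04 expect.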

## Naive Bayes Classifier

Training Data:
- Positive Tweet 1: "I love the new phone."
- Positive Tweet 2: "Great weather today!"
- Negative Tweet 1: "I hate waiting."

### Step 1: Tokenization
Tokenization involves breaking down each tweet into individual words:

- Positive Tweet 1: ["I", "love", "the", "new", "phone"]
- Positive Tweet 2: ["Great", "weather", "today"]
- Negative Tweet 1: ["I", "hate", "waiting"]

### Step 2: Calculate Prior Probabilities
Calculate the prior probabilities based on the training data:

- $P(Positive) = \frac{2}{3}$
- $P(Negative) = \frac{1}{3}$

### Step 3: Calculate Likelihood
Calculate the likelihood of each word given the sentiment. Each likelihood is the number of times the word occurs in tweets of that sentiment divided by the total number of tokens in those tweets. Matching is case-insensitive, so "Great" and "great" count as the same word. The two positive tweets contain 8 tokens in total and the negative tweet contains 3:

- For Positive sentiment:
  - $P(I|Positive) = \frac{1}{8}$
  - $P(love|Positive) = \frac{1}{8}$
  - $P(the|Positive) = \frac{1}{8}$
  - $P(new|Positive) = \frac{1}{8}$
  - $P(phone|Positive) = \frac{1}{8}$
  - $P(great|Positive) = \frac{1}{8}$
  - $P(weather|Positive) = \frac{1}{8}$
  - $P(today|Positive) = \frac{1}{8}$
- For Negative sentiment:
  - $P(I|Negative) = \frac{1}{3}$
  - $P(hate|Negative) = \frac{1}{3}$
  - $P(waiting|Negative) = \frac{1}{3}$

Any word that does not occur in a class has likelihood 0 for that class.

### Step 4: Calculate Posterior Probabilities
Now, suppose we have a new tweet: "I love the great weather today."

Calculate the posterior probabilities for both Positive and Negative sentiments:

- Likelihood of the tweet under each sentiment:
  - $P(D|Positive) = P(I|Positive) \times P(love|Positive) \times P(the|Positive) \times P(great|Positive) \times P(weather|Positive) \times P(today|Positive) = \left(\frac{1}{8}\right)^6 = \frac{1}{262144}$
  - $P(D|Negative) = P(I|Negative) \times P(love|Negative) \times P(the|Negative) \times P(great|Negative) \times P(weather|Negative) \times P(today|Negative) = \frac{1}{3} \times 0 \times 0 \times 0 \times 0 \times 0 = 0$
- Evidence:
  - $P(D) = P(D|Positive) \times P(Positive) + P(D|Negative) \times P(Negative) = \frac{1}{262144} \times \frac{2}{3} + 0 \times \frac{1}{3} = \frac{1}{393216}$
- Posterior probabilities:
  - $P(Positive|D) = \frac{P(D|Positive) \times P(Positive)}{P(D)} = \frac{\frac{1}{262144} \times \frac{2}{3}}{\frac{1}{393216}} = 1$
  - $P(Negative|D) = \frac{P(D|Negative) \times P(Negative)}{P(D)} = 0$

In this case, the model predicts that the tweet is positive because "love," "the," "great," "weather," and "today" appear only in the positive training tweets. The Negative posterior is exactly 0 because a single unseen word zeroes the whole product; in practice, Laplace (add-one) smoothing is applied so that unseen words get a small non-zero probability.
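
The arithmetic for this example can be checked mechanically. The sketch below assumes unsmoothed maximum-likelihood estimates (a word's count in a class divided by the total number of tokens in that class) and case-insensitive matching; the helper names `tokenize`, `prior`, and `likelihood` are made up for this illustration. It uses exact fractions so no rounding is involved.

```python
from collections import Counter
from fractions import Fraction

train = [
    ("I love the new phone.", "Positive"),
    ("Great weather today!", "Positive"),
    ("I hate waiting.", "Negative"),
]


def tokenize(text):
    """Lowercase, split on whitespace, strip surrounding punctuation."""
    stripped = (word.strip(".,!?") for word in text.lower().split())
    return [word for word in stripped if word]


# Per-class word counts and per-class tweet counts.
word_counts = {"Positive": Counter(), "Negative": Counter()}
doc_counts = Counter()
for text, label in train:
    word_counts[label].update(tokenize(text))
    doc_counts[label] += 1


def prior(label):
    """P(label): fraction of training tweets with this label."""
    return Fraction(doc_counts[label], sum(doc_counts.values()))


def likelihood(tokens, label):
    """P(D | label): product of per-word probabilities, no smoothing."""
    total = sum(word_counts[label].values())
    result = Fraction(1)
    for tok in tokens:
        result *= Fraction(word_counts[label][tok], total)
    return result


tokens = tokenize("I love the great weather today.")
joint = {label: likelihood(tokens, label) * prior(label) for label in word_counts}
evidence = sum(joint.values())
posterior = {label: joint[label] / evidence for label in joint}

for label in posterior:
    print(label, likelihood(tokens, label), posterior[label])
```

Under these assumptions the positive likelihood is $\left(\frac{1}{8}\right)^6$ and the negative likelihood is 0, so the entire posterior mass falls on Positive.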

----

01. Tokenization
   - Split the text into individual words (tokens), e.g. $w_1, w_2, w_3, \dots, w_n$
02. Calculate prior probabilities based on training data
   - $P(Positive) = \frac{Number\ of\ Positive\ Tweets}{Total\ Number\ of\ Tweets}$
   - $P(Negative) = \frac{Number\ of\ Negative\ Tweets}{Total\ Number\ of\ Tweets}$
03. Calculate likelihood based on training data
   - For each sentiment, estimate the probability of each word appearing in a tweet of that sentiment: $P(w_i|Sentiment) = \frac{Count\ of\ w_i\ in\ Sentiment\ Tweets}{Total\ Number\ of\ Words\ in\ Sentiment\ Tweets}$
   - Assuming the words are conditionally independent given the sentiment (the "naive" assumption), multiply the word probabilities:
      - $P(D|Positive) = P(w_1|Positive) \times P(w_2|Positive) \times P(w_3|Positive) \times \dots \times P(w_n|Positive)$
      - $P(D|Negative) = P(w_1|Negative) \times P(w_2|Negative) \times P(w_3|Negative) \times \dots \times P(w_n|Negative)$
04. Calculate posterior probabilities
   - $P(D) = P(D|Positive) \times P(Positive) + P(D|Negative) \times P(Negative)$
   - $P(Positive|D) = \frac{P(D|Positive) \times P(Positive)}{P(D)}$
   - $P(Negative|D) = \frac{P(D|Negative) \times P(Negative)}{P(D)}$
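
The four steps above can be sketched as a small classifier. This is a minimal sketch under stated assumptions, not scikit-learn's implementation: the class name `TinyNaiveBayes` and its methods are hypothetical, it adds Laplace (add-one) smoothing so that a word unseen in one class does not zero out that class, it ignores words that never appeared in training, and it sums log probabilities instead of multiplying raw ones to avoid numerical underflow.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial Naive Bayes over token lists, with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, docs, labels):
        # Step 2: count tweets per class (used for the priors).
        self.class_docs = Counter(labels)
        # Step 3: count words per class (used for the likelihoods).
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_joint(self, tokens, label):
        """log P(label) + sum of log P(w_i | label)."""
        n_docs = sum(self.class_docs.values())
        score = math.log(self.class_docs[label] / n_docs)
        total = sum(self.word_counts[label].values())
        denom = total + self.alpha * len(self.vocab)
        for tok in tokens:
            if tok in self.vocab:  # skip words never seen in training
                score += math.log((self.word_counts[label][tok] + self.alpha) / denom)
        return score

    def predict_proba(self, tokens):
        # Step 4: normalize the joint scores to get posteriors.
        logs = {c: self.log_joint(tokens, c) for c in self.class_docs}
        top = max(logs.values())
        exps = {c: math.exp(v - top) for c, v in logs.items()}
        norm = sum(exps.values())
        return {c: v / norm for c, v in exps.items()}

    def predict(self, tokens):
        proba = self.predict_proba(tokens)
        return max(proba, key=proba.get)


# Step 1 (tokenization) is assumed to have been done already.
train_docs = [
    ["i", "love", "the", "new", "phone"],
    ["great", "weather", "today"],
    ["i", "hate", "waiting"],
]
train_labels = ["Positive", "Positive", "Negative"]

nb = TinyNaiveBayes().fit(train_docs, train_labels)
new_tweet = ["i", "love", "the", "great", "weather", "today"]
print(nb.predict(new_tweet), nb.predict_proba(new_tweet))
```

With smoothing, the negative class keeps a small non-zero posterior instead of collapsing to exactly 0, which makes the classifier less brittle when a tweet contains words it has only seen in one class.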

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Training data
texts = ["I love this product!", "This is terrible.", "Neutral review."]
labels = ["positive", "negative", "neutral"]

# Create a Naive Bayes classifier pipeline
model = make_pipeline(CountVectorizer(), MultinomialNB())

# Train the model
model.fit(texts, labels)

# Test data
test_texts = ["This is a great experience!", "I don't like it.", "It's okay."]

# Predict sentiments
predictions = model.predict(test_texts)

# Display the results
for text, sentiment in zip(test_texts, predictions):
    print(f"Text: {text} | Predicted Sentiment: {sentiment}")
```

```
Text: This is a great experience! | Predicted Sentiment: negative
Text: I don't like it. | Predicted Sentiment: negative
Text: It's okay. | Predicted Sentiment: negative
```

All three predictions come out as negative because three training sentences are far too few. The first test sentence shares only "this" and "is" with the training vocabulary, and "is" occurs only in the negative example. The other two test sentences contain no known words at all, so the model falls back on the (equal) class priors and returns the first class in sorted order, which is "negative".


## Simple example using NLTK


```python
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.sentiment import SentimentIntensityAnalyzer

# Download the tokenizer models, stop-word list, and VADER lexicon
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('vader_lexicon')
```


```python
SentimentIntensityAnalyzer().polarity_scores("I love this product!")
```

```
{'neg': 0.0, 'neu': 0.308, 'pos': 0.692, 'compound': 0.6696}
```

```python
SentimentIntensityAnalyzer().polarity_scores("My worst day ever.")
```

```
{'neg': 0.577, 'neu': 0.423, 'pos': 0.0, 'compound': -0.6249}
```

```python
def preprocess_text(text):
    # Tokenize the lowercased text into words
    words = word_tokenize(text.lower())

    # Remove stop words and punctuation
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word.isalpha() and word not in stop_words]

    return ' '.join(words)
```

```python
def analyze_sentiment(text):
    # Use VADER's compound score to pick one of three labels
    sia = SentimentIntensityAnalyzer()
    sentiment_score = sia.polarity_scores(text)['compound']

    if sentiment_score >= 0.05:
        return 'Positive'
    elif sentiment_score <= -0.05:
        return 'Negative'
    else:
        return 'Neutral'
```

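```python
# Example text
example_text = "I love the product. It works really well!"

# Preprocess the text
# Note: VADER is designed for raw text. NLTK's English stop-word list includes
# negations such as "not" and "no", and the isalpha() filter drops "!", so this
# step can remove cues VADER relies on. It is shown here for illustration.
preprocessed_text = preprocess_text(example_text)

# Perform sentiment analysis
sentiment = analyze_sentiment(preprocessed_text)

# Display the result
print(f"Original Text: {example_text}")
print(f"Sentiment: {sentiment}")
```

```
Original Text: I love the product. It works really well!
Sentiment: Positive
```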


The `SentimentIntensityAnalyzer` in NLTK implements VADER (Valence Aware Dictionary and sEntiment Reasoner), a rule-based sentiment analysis tool. VADER is designed for social media text, reviews, and other short text snippets. Here's a brief overview of how `SentimentIntensityAnalyzer` works:

1. **Lexicon-based approach:**
   VADER uses a pre-built lexicon (a dictionary) that maps words and emoticons to sentiment (valence) scores. A set of rules then adjusts these scores for context such as capitalization, punctuation, and intensifiers.

2. **Polarity Scores:**
   The `polarity_scores` method looks up the valence of each token and aggregates the results into four scores for the whole text:
   - **Positive score (`pos`):** The proportion of the text that carries positive sentiment.
   - **Neutral score (`neu`):** The proportion of the text that is neutral.
   - **Negative score (`neg`):** The proportion of the text that carries negative sentiment.
   - **Compound score (`compound`):** The sum of the adjusted valence scores, normalized to fall between -1 (most negative) and +1 (most positive).

   The `pos`, `neu`, and `neg` scores sum to 1, up to rounding.

3. **Sentiment Classification:**
   The compound score is then thresholded to classify the text into three categories. The `analyze_sentiment` function above uses the following cutoffs:
   - Positive: compound score of 0.05 or higher
   - Negative: compound score of -0.05 or lower
   - Neutral: anything in between

4. **Handling Intensifiers and Negations:**
   VADER has rules for intensifiers (e.g., "very good") and negations (e.g., "not bad"), and adjusts the sentiment scores of the affected words accordingly.

The `polarity_scores` method returns these four scores as a dictionary, which you can interpret to determine the overall sentiment of the text.

Keep in mind that while VADER is efficient and easy to use, it may not be suitable for all types of text or domains. For more complex tasks or domain-specific sentiment analysis, you might consider using machine learning-based approaches with custom-trained models.
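
To make the lexicon-based scoring described above concrete, here is a toy scorer in plain Python. It is a simplified sketch, not VADER itself: the lexicon `TOY_LEXICON` and its valence values are hypothetical, the function names are made up for this example, and the negation, intensifier, capitalization, and punctuation rules are omitted. The normalization $x / \sqrt{x^2 + \alpha}$ with $\alpha = 15$ is assumed here to mirror how VADER squashes the summed valence into the range -1 to +1.

```python
import math

# Hypothetical valence lexicon for illustration only; VADER ships its own,
# much larger lexicon with empirically rated values.
TOY_LEXICON = {
    "love": 3.0,
    "great": 3.0,
    "good": 2.0,
    "bad": -2.5,
    "hate": -2.7,
    "worst": -3.0,
}


def compound_score(text, alpha=15.0):
    """Sum token valences and squash the total into the open interval (-1, 1)."""
    tokens = [tok.strip(".,!?").lower() for tok in text.split()]
    total = sum(TOY_LEXICON.get(tok, 0.0) for tok in tokens)
    return total / math.sqrt(total * total + alpha)


def classify(compound, threshold=0.05):
    """Apply the same +/-0.05 cutoffs used by analyze_sentiment."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"


for sentence in ["I love this product!", "My worst day ever.", "It is a table."]:
    score = compound_score(sentence)
    print(f"{sentence!r}: compound={score:.4f} -> {classify(score)}")
```

Because the toy version has no rules for context, a phrase like "not good" would still score as positive here, which is exactly the gap VADER's negation handling is meant to close.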

## RoBERTa (Robustly Optimized BERT Pretraining Approach)

```python
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification
from scipy.special import softmax
```

```python
# Load the pretrained tokenizer and sentiment classification model
MODEL = 'cardiffnlp/twitter-roberta-base-sentiment'
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
# model.save_pretrained(MODEL)
```

```python
# example_text = "I hated the movie. It was a disaster. Poor direction, bad acting."

example_text = "I felt energized within five minutes, but it lasted for only 30 minutes.I paid $50 for this product, I could have just drunk a cup of coffee and saved my money."

# Score the text with VADER first, for comparison
sia = SentimentIntensityAnalyzer()
sentiment_score = sia.polarity_scores(example_text)
sentiment_score
```

```
{'neg': 0.091, 'neu': 0.736, 'pos': 0.172, 'compound': 0.4118}
```

```python
# Score the same text with the RoBERTa model
encoded_text = tokenizer.encode(example_text, return_tensors='pt')
output = model(encoded_text)

# Convert the raw logits to probabilities
scores = output[0][0].detach().numpy()
scores = softmax(scores)

# Label order for this model: 0 = negative, 1 = neutral, 2 = positive
print(f"Negative score: {scores[0]}")
print(f"Neutral score: {scores[1]}")
print(f"Positive score: {scores[2]}")
```

```
Negative score: 0.05258140340447426
Neutral score: 0.23635177314281464
Positive score: 0.7110668420791626
```
