# Sentiment analysis

In [None]:
import pandas as pd
from IPython.display import Image
from IPython.core.display import HTML 

- the identification and classification of feelings, emotions, attitudes and opinions in text; a rapidly growing research topic in various fields, e.g. digital humanities, computer science, marketing, investment and business
- also known as opinion mining
- allows organizations to identify public sentiment towards certain words or topics to understand their users better
- under the umbrella of sentiment analysis, we understand both emotion and polarity analysis, depending on the categories classified


## Polarity analysis
- classifying texts into three categories - positive, neutral and negative polarities

In [None]:
Image(url= "../img/sentiment.jpg")
# source: https://www.expressanalytics.com/wp-content/uploads/2021/06/sentimentanalysishotelgeneric-2048x803-1.jpg

## Emotion detection 
- based on emotion models established in the field of psychology (such as Paul Ekman's, Robert Plutchik's and Russell's models), with categories like sadness, joy, disgust, surprise and anger

In [None]:
Image(url= "../img/emotion.png")
# source: https://clevertap.com/blog/how-emotions-incite-response/

## Sentiment analysis in the Digital Humanities

Numerous applications:
* the analysis of periodicals (Koncar et al 2022)
* novels (Stanković et al 2022)
* plays (Schmidt et al 2021)
* poems (Sprugnoli et al 2022)
* fairy tales (Zehe et al 2017)
* song lyrics (Hernández-Lorenzo et al 2022)
* Holocaust testimonies (Blanke et al 2020)

...

# Methods

## Lexicon-based

* A Sentiment Analysis Tool Chain for 18th Century Periodicals https://www.melusinapress.lu/read/ezpg-wk34/section/01ef33d1-8d9d-4391-a488-7d090f719858 or https://gitlab.uni.lu/melusina/vdhd/koncar_sentiment 
* SentText https://thomasschmidtur.pythonanywhere.com 
* Syuzhet https://github.com/mjockers/syuzhet 
* VADER https://github.com/cjhutto/vaderSentiment 
* NRC Emotion Lexicon (EmoLex) https://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm 
* SentiWordNet 3.0 https://github.com/aesuli/SentiWordNet 
* SentiWS http://wortschatz.uni-leipzig.de/de/download 

- in the DH, specialized sentiment lexicons have frequently been used, often created manually by domain experts
- sentiment lexicons or dictionaries are collections of words and phrases annotated with sentiment scores
- lexicon-based methods do not require any training data and can be a good starting point for sentiment analysis tasks
- however, their performance might not be as good as that of ML methods when it comes to understanding context and handling complex sentences
- due to the large variety in style and period of origin of DH texts, these resources often need to be tailor-made for a specific project, which makes the process expensive and time-consuming
- an alternative approach is to use machine learning and deep learning methods
- the choice of method depends on the specific requirements and the resources available at hand
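
As a minimal sketch of how a lexicon-based method scores a text, the snippet below sums word scores from a tiny, made-up lexicon. The `TOY_LEXICON` entries and the `lexicon_polarity` helper are illustrative assumptions and are not taken from any of the resources listed above.

```python
import re

# Hypothetical toy lexicon: word -> sentiment score (illustrative values only)
TOY_LEXICON = {"love": 2.0, "fantastic": 3.0, "hate": -2.0, "terrible": -3.0}

def lexicon_polarity(text, lexicon=TOY_LEXICON):
    """Sum the scores of all lexicon words in the text and map the total to a label."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(lexicon.get(token, 0.0) for token in tokens)
    if score > 0:
        label = "positive"
    elif score < 0:
        label = "negative"
    else:
        label = "neutral"
    return score, label

print(lexicon_polarity("I love this movie. It's fantastic."))
print(lexicon_polarity("I hate this movie. It's terrible."))
# The negation is ignored by this naive scorer, so the sentence still counts as positive
print(lexicon_polarity("I do not love this movie."))
```

The last call illustrates the context problem mentioned above: a plain word-count scorer has no notion of negation unless extra rules are added.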

## ML-based

- a supervised classification task
 
#### Suggested models:

### 1. Logistic Regression
- simple yet effective machine learning algorithm that can be used for sentiment analysis
- since it is a classification method, the dependent variable is discrete and categorical (e.g. predicting whether a customer is new or returning)
- this algorithm is most often used to choose between two discrete classes (e.g. pregnant vs. not pregnant); for our three categories (positive, negative, neutral) it is extended to the multiclass case
- works well with small and very clean datasets (no outliers and messy relationships, no missing values)
- using bag-of-words or TF-IDF representation of a text, it estimates the probabilities of a particular text belonging to a specific sentiment category
-  often performs well in text classification tasks
- faster and easier to interpret compared to more complex models.


In [None]:
Image(url= "../img/logistic.png",width=500, height=1500)
# source: Machine Learning for Absolute Beginners, Oliver Theobald

#### EXAMPLE


Imagine we have two sentences:

- "I love this movie. It's fantastic."
- "I hate this movie. It's terrible."

We first need to convert these sentences into numerical vectors. 
We could use a "Bag of Words" (BoW) model, Term Frequency-Inverse Document Frequency (TF-IDF), or even word embeddings.

Let's say we use a simple BoW model and have transformed our sentences into vectors based on the frequency of words, resulting in vectors in a 3D space where each dimension corresponds to the words "love", "fantastic", and "terrible".

Now our sentences might be represented as points in this 3-dimensional space.

Let's say our positive sentence ("I love this movie. It's fantastic.") is represented as the point (1, 1, 0) because it contains the words "love" and "fantastic" once each and does not contain the word "terrible". Similarly, our negative sentence ("I hate this movie. It's terrible.") is represented as the point (0, 0, 1).
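
The vectors above can be reproduced with a few lines of plain Python. This is only a sketch of the counting step, with the vocabulary fixed by hand to the three words of the example; the `bow_vector` helper is a hypothetical name, and a real project would more likely use something like scikit-learn's `CountVectorizer`.

```python
import re

# Vocabulary fixed to the three words used in the example
VOCABULARY = ["love", "fantastic", "terrible"]

def bow_vector(text, vocabulary=VOCABULARY):
    """Count how often each vocabulary word occurs in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [tokens.count(word) for word in vocabulary]

positive = bow_vector("I love this movie. It's fantastic.")
negative = bow_vector("I hate this movie. It's terrible.")
print(positive)  # [1, 1, 0]
print(negative)  # [0, 0, 1]
```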

The logistic regression model will find the "decision boundary", a hyperplane (in this 3D space, a plane) where points on one side belong to one category ("positive") and points on the other side belong to the other category ("negative").

The logistic regression function is trained to calculate the probability that a given point belongs to the "positive" category. If the probability is greater than a certain threshold (commonly 0.5), it assigns the point to the "positive" category; otherwise, it assigns it to the "negative" category.

Given a new sentence, we would convert it into the same 3D space. The logistic regression model will then calculate the probability that the new point belongs to the "positive" category, and assigns it a sentiment based on the threshold.
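
To make the probability and threshold step concrete, here is a small sketch of the prediction side of logistic regression. The weights and bias are hypothetical values chosen by hand for illustration; in practice they are learned from labelled training data.

```python
import math

# Hypothetical weights for the features ("love", "fantastic", "terrible") and a bias.
# A trained model would estimate these from data.
WEIGHTS = [1.5, 1.2, -2.0]
BIAS = 0.0

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_positive_probability(features, weights=WEIGHTS, bias=BIAS):
    """Probability of the 'positive' class: sigmoid of the weighted feature sum."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

def classify(features, threshold=0.5):
    """Assign 'positive' if the probability exceeds the threshold."""
    probability = predict_positive_probability(features)
    return "positive" if probability > threshold else "negative"

for point in ([1, 1, 0], [0, 0, 1]):
    print(point, round(predict_positive_probability(point), 3), classify(point))
```

With a threshold of 0.5, the decision boundary is exactly the set of points where the weighted sum plus the bias equals zero.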

### 2. Naive Bayes
- commonly used algorithm for sentiment analysis
- a probabilistic classifier that makes use of Bayes' Theorem, and it assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature
- this naive assumption of independence allows for simplicity and speed, which is a big reason why Naive Bayes is popular for text classification
- particularly good at dealing with large feature spaces, which are common in text data.


In [None]:
Image(url= "../img/bayes.png",width=500, height=1500)
# https://medium.com/analytics-vidhya/na%C3%AFve-bayes-algorithm-5bf31e9032a2

#### EXAMPLE

<U>FORMULA</U> 

Suppose we have a sentence "I love this movie". We want to classify whether the sentiment of this sentence is positive or negative.

We have two classes:

- A = Positive sentiment
- B = Negative sentiment


And we have four words in our sentence:

- W1 = "I"
- W2 = "love"
- W3 = "this"
- W4 = "movie"

We want to calculate the probability that the sentiment is positive given these four words. According to Bayes' theorem, this can be represented as:

```P(Positive | W1, W2, W3, W4)```



The Naive Bayes classifier simplifies this by assuming that the words are independent of each other. While this is a naive assumption in most natural language cases, hence the name "naive", it often works well in practice.

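Thus, up to a normalising constant that is the same for both classes (the probability of the words themselves), we can represent it as:

```P(Positive | W1, W2, W3, W4) ∝ P(W1 | Positive) * P(W2 | Positive) * P(W3 | Positive) * P(W4 | Positive) * P(Positive)```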

Here:

- ```P(W1 | Positive)``` is the probability of the word "I" appearing in a positive review
- ```P(W2 | Positive)``` is the probability of the word "love" appearing in a positive review, and so on
- ```P(Positive)``` is the prior probability of a review being positive
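
The calculation can be sketched numerically as follows. All priors and word likelihoods are made-up numbers for illustration only; a real classifier estimates them from word counts in labelled training data. The snippet multiplies each class prior by the per-word likelihoods and then normalises over both classes so that the results sum to 1.

```python
import math

# Hypothetical class priors and per-word likelihoods (illustrative values only)
PRIORS = {"positive": 0.5, "negative": 0.5}
LIKELIHOODS = {
    "positive": {"I": 0.10, "love": 0.05, "this": 0.10, "movie": 0.08},
    "negative": {"I": 0.10, "love": 0.005, "this": 0.10, "movie": 0.08},
}

def naive_bayes_posteriors(words):
    """Return P(class | words) for each class under the independence assumption."""
    # Unnormalised score per class: prior times the product of word likelihoods
    scores = {
        label: PRIORS[label] * math.prod(LIKELIHOODS[label][w] for w in words)
        for label in PRIORS
    }
    total = sum(scores.values())  # plays the role of P(W1, W2, W3, W4)
    return {label: score / total for label, score in scores.items()}

posteriors = naive_bayes_posteriors(["I", "love", "this", "movie"])
print(posteriors)
print(max(posteriors, key=posteriors.get))
```

Because "I", "this" and "movie" are equally likely in both classes here, the decision is driven entirely by "love".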


### 3. SVM
- a powerful machine learning algorithm used for classification tasks, including sentiment analysis
- works by finding a hyperplane in an N-dimensional space that distinctly classifies the data points
- it handles high dimensional data well and is effective when there is a clear margin of separation between classes 
- SVMs can be computationally expensive and may not perform as well when the classes are highly overlapping
- tend to be less interpretable than simpler models like Logistic Regression or Naive Bayes.

In [None]:
Image(url= "../img/svm_logistic.png",width=500, height=1500)
# source: Machine Learning for Absolute Beginners, Oliver Theobald

#### EXAMPLE
 
- The idea of SVMs is simple: the algorithm creates a line (or a hyperplane in higher dimensions) that separates the data into classes
- The goal of an SVM is to find the maximum marginal hyperplane (MMH) that best divides the dataset into classes.


Imagine we have two sentences:

- "I love this movie. It's fantastic."
- "I hate this movie. It's terrible."

We need to convert these sentences into a form an SVM can understand, i.e., numerical vectors.
One common way of doing this is by using a "Bag of Words" (BoW) model, which transforms the sentences into vectors based on the frequency of words.

Let's say after using a BoW model, we have transformed our sentences into vectors in a 3D space where each dimension corresponds to the words "love", "fantastic", and "terrible", and the value in each dimension is the frequency of that word in the sentence.

Now our sentences might be represented as points in this 3-dimensional space.

Let's say our positive sentence ("I love this movie. It's fantastic.") is represented as the point (1, 1, 0) because it contains the words "love" and "fantastic" once each and does not contain the word "terrible". Similarly, our negative sentence ("I hate this movie. It's terrible.") is represented as the point (0, 0, 1).

The SVM algorithm will find a hyperplane (in this 3D space, a plane) that best separates these points based on their class. This hyperplane aims to maximize the margin, which is the distance between the hyperplane and the nearest point from either class.

Now, given a new sentence, we would convert it to the same 3D space, and whichever side of the hyperplane it lands on is the class (positive or negative sentiment) that our SVM model predicts for the sentence.
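
For this two-point toy case the maximum-margin hyperplane can be worked out by hand: it is the perpendicular bisector of the segment joining the two points. The sketch below uses that shortcut purely for illustration; it is not how an SVM library solves the optimisation problem, and the new sentence vector is a made-up example.

```python
import math

positive_point = [1, 1, 0]  # "I love this movie. It's fantastic."
negative_point = [0, 0, 1]  # "I hate this movie. It's terrible."

# With only two training points, the maximum-margin hyperplane is the
# perpendicular bisector of the segment between them: w . x + b = 0
w = [p - n for p, n in zip(positive_point, negative_point)]
midpoint = [(p + n) / 2 for p, n in zip(positive_point, negative_point)]
b = -sum(wi * mi for wi, mi in zip(w, midpoint))

def decision_value(x):
    """Signed score: positive on one side of the hyperplane, negative on the other."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(x):
    return "positive" if decision_value(x) > 0 else "negative"

def distance_to_hyperplane(x):
    """Geometric distance of a point to the hyperplane."""
    return abs(decision_value(x)) / math.sqrt(sum(wi * wi for wi in w))

new_sentence = [1, 0, 0]  # hypothetical new sentence containing "love" once
print(predict(new_sentence))
print(distance_to_hyperplane(positive_point), distance_to_hyperplane(negative_point))
```

Both training points end up at the same distance from the hyperplane, which is the margin the SVM tries to maximize.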

### Deep learning methods

-  state of the art for sentiment analysis
- allow for more accurate classification and contextual understanding, but require large, well-annotated training corpora
- the lack of annotated data to train such algorithms and the need for domain adaptation are the two main challenges faced by the DH community in applying these methods

#### Models:
- Recurrent Neural Network (RNN) - LSTMs and GRU
- transformers  - BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pretrained Transformer)


- transformers are currently outperforming all other approaches, due to their ability to understand the context of words in all positions of the text
- however, they are very computationally intensive and require a lot of labeled data

### Comparison of methods

| Model | Pros | Cons |
| --- | --- | --- |
| Logistic Regression | Simple to understand and implement. Less prone to overfitting. | Assumption of linearity. Might not work well with a large number of features. |
| Naive Bayes | Fast and easy to implement. Performs well in text classification tasks. | Assumption of independent predictors. In real-life, it's almost impossible that we get a set of predictors which are completely independent. |
| SVM | Effective in high dimensional spaces. Works well with clear margin of separation. | Not suitable for large datasets. Does not perform well when classes are overlapped. |
| Lexicon-based | Does not require any training data. Takes into account the intensity of sentiment. | Can't understand the context. Fails to handle complex sentences. |
| Deep Learning (BERT, GPT etc.) | Can handle complex language tasks. Does not require feature engineering. | Requires a lot of computational resources. Black box model. |


## Let's get going - sentiment analysis with a lexicon + shallow ML

In [None]:
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder

In [None]:
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')

In [None]:
import pandas as pd

## 1. Vader

### Loading the data


In [None]:
df = pd.read_csv('../data/FinancialPhraseBank.csv', delimiter=',', header=None, encoding= 'latin-1')
df.columns = ['Sentiment', 'Text']

df.head(10)

In [None]:
df['Sentiment'].value_counts()


Let's apply the Vader sentiment analyzer on our dataset. For this, we'll use the NLTK library's Vader module.




In [None]:
def sentiment_scores(dataframe):
    sid_obj = SentimentIntensityAnalyzer()
    
    dataframe['compound'] = dataframe['Text'].apply(lambda x: sid_obj.polarity_scores(x)['compound'])
    
    def get_sentiment(compound):
        if compound >= 0.05:
            return "positive"
        elif compound <= -0.05:
            return "negative"
        else:
            return "neutral"
        
    dataframe['vader_sentiment'] = dataframe['compound'].apply(get_sentiment)
    return dataframe

df = sentiment_scores(df)

df

You might ask why we didn't do any cleaning. That's because VADER was developed for social media text and is meant to work on the raw input. It doesn't just rely on a list of individual sentiment-bearing words, but also applies rules for how the words interact in the context of the sentence (through things like punctuation, capitalization, intensifiers, negation, and word order).

- VADER is case-sensitive. This means it understands the difference between words written in different cases (like "GOOD" and "good") and assigns different sentiment intensities to them. So, converting all the text to lowercase, a common step in data cleaning for many NLP tasks, is not advisable when using VADER.

- Punctuation: Similarly, VADER also considers punctuation in determining the sentiment of the text. Exclamation points, for example, can intensify the sentiment of a statement. So, removing punctuation, another common data cleaning step, is also not recommended when using VADER.

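- Stop words: Words that are not in VADER's lexicon simply do not contribute to the score, so it is not necessary to remove stop words when using VADER. Removing them can even hurt, because typical stop word lists contain negations (like "not") and intensifiers (like "very") that VADER uses to adjust the sentiment of neighbouring words.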


In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


In [None]:

accuracy = accuracy_score(df['Sentiment'], df['vader_sentiment'])
precision = precision_score(df['Sentiment'], df['vader_sentiment'], average='macro')
recall = recall_score(df['Sentiment'], df['vader_sentiment'], average='macro')
f1 = f1_score(df['Sentiment'], df['vader_sentiment'], average='macro')

print(f"Vader accuracy: {accuracy}")
print(f"Vader precision: {precision}")
print(f"Vader recall: {recall}")
print(f"Vader F1 Score: {f1}")


- average='macro' argument calculates the metric independently for each class and then takes the average, which treats all classes equally
- if your classes have imbalances, you might want to consider using average='weighted', which weights the metric by the number of samples of each class.
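
The difference between the two averaging modes can be sketched with a small, made-up and imbalanced label set. The example uses recall and plain Python instead of scikit-learn so that each step is visible; the labels and the `per_class_recall` helper are illustrative assumptions.

```python
from collections import Counter

# Made-up, imbalanced example: 6 neutral, 3 positive, 1 negative
y_true = ["neutral"] * 6 + ["positive"] * 3 + ["negative"] * 1
y_pred = ["neutral"] * 6 + ["positive", "positive", "neutral"] + ["neutral"]

def per_class_recall(y_true, y_pred):
    """Recall per class: correctly predicted items divided by true items of that class."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = {label: hits[label] / support[label] for label in support}
    return recalls, support

recalls, support = per_class_recall(y_true, y_pred)

# Macro: every class counts the same, regardless of its size
macro = sum(recalls.values()) / len(recalls)
# Weighted: every class counts in proportion to its number of true samples
weighted = sum(recalls[label] * support[label] for label in recalls) / len(y_true)

print(recalls)
print(macro, weighted)
```

The rare "negative" class is missed completely, which pulls the macro average down much more than the weighted one.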

## 2. ML algorithms

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

In [None]:
# Initialize the CountVectorizer
vectorizer = CountVectorizer(stop_words='english')

# Vectorize the Text column
X = vectorizer.fit_transform(df['Text'])

# Encode the labels
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df['Sentiment'])

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


### Logistic Regression

In [None]:
clf_lr = LogisticRegression(max_iter=1000)
clf_lr.fit(X_train, y_train)
y_pred_lr = clf_lr.predict(X_test)

### Naive Bayes

In [None]:
clf_nb = MultinomialNB()
clf_nb.fit(X_train, y_train)
y_pred_nb = clf_nb.predict(X_test)

### SVM

In [None]:
clf_svm = LinearSVC()
clf_svm.fit(X_train, y_train)
y_pred_svm = clf_svm.predict(X_test)

### Evaluation

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [None]:

def print_metrics(y_test, y_pred):
    print('Accuracy: ', accuracy_score(y_test, y_pred))
    print('Precision: ', precision_score(y_test, y_pred, average='weighted'))
    print('Recall: ', recall_score(y_test, y_pred, average='weighted'))
    print('F1 Score: ', f1_score(y_test, y_pred, average='weighted'))

# Metrics for Logistic Regression
print('Metrics for Logistic Regression:')
print_metrics(y_test, y_pred_lr)

# Metrics for Naive Bayes
print('\nMetrics for Naive Bayes:')
print_metrics(y_test, y_pred_nb)

# Metrics for SVM
print('\nMetrics for SVM:')
print_metrics(y_test, y_pred_svm)


From all this we can conclude that logistic regression worked best for our dataset, followed by SVM, Naive Bayes and the lexicon.

### Why?
- logistic regression is designed for exactly this kind of classification problem: it models the probability that each input instance belongs to each class, which fits the nature of this task very well
- logistic regression does not require features (here: words) to be independent, unlike Naive Bayes. This can be a significant advantage when dealing with text data, as words in a sentence are often not independent of each other
- if we had tuned the parameters of the SVM better (e.g. C, the penalty for misclassification), we might have gotten a better result
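
One possible way to tune `C` is a cross-validated grid search. The sketch below is self-contained and therefore uses a tiny made-up dataset; the texts, labels and candidate values for `C` are assumptions for illustration. In the notebook, the same search could be run on the training split instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny made-up dataset, only there to make the sketch runnable
texts = [
    "profits rose sharply", "strong growth this quarter",
    "sales increased again", "record earnings reported",
    "losses widened sharply", "weak demand this quarter",
    "sales decreased again", "earnings fell short",
]
labels = ["positive"] * 4 + ["negative"] * 4

# Vectorizer and classifier in one pipeline, so the vocabulary is fitted
# only on the training part of each cross-validation fold
pipeline = make_pipeline(CountVectorizer(), LinearSVC())

# make_pipeline names each step after its lowercased class name
param_grid = {"linearsvc__C": [0.01, 0.1, 1, 10]}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1_macro")
search.fit(texts, labels)

print(search.best_params_, search.best_score_)
```

On a dataset this small the selected value is not meaningful; the point is only the mechanics of searching over `C`.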

#### Further Reading

<B>BOOKS</B>:
- Natural Language Processing with Python by Steven Bird, Ewan Klein, and Edward Loper
- Speech and Language Processing by Daniel Jurafsky and James H. Martin
- Deep Learning for Natural Language Processing by Palash Goyal, Sumit Pandey, and Karan Jain


<B>ARTICLES</B>
- Acheampong, Francisca Adoma; Wenyu, Chen; Nuuno-Mensah, Henry: Text‐based Emotion Detection: Advances, Challenges, and Opportunities. 2020. Engineering Reports, 2, 7. DOI: https://doi.org/10.1002/eng2.12189. [W6BG3BF5]
- Dang, Nhan Cach; Moreno-García, Maria N.; De la Prieta, Fernando. Sentiment Analysis Based on Deep Learning: A Comparative Study.2020. Electronics, 9, 483. DOI: https://doi.org/10.3390/electronics9030483 [9M88FBQ9]
- Blanke, Tobias; Brayant, Michel; Hedges, Mark:  Understanding Memories of the Holocaust—A New Approach to Neural Networks in the Digital Humanities. 2020. Digital Scholarship in the Humanities, 35, 1. Oxford University Press (OUP), 17–33. DOI: https://doi.org/10.1093/llc/fqy082 . [B4ETJDFD]
- Hartmann, Jochen; Heitmann, Mark; Siebert, Christian; Schamp, Christina: More than a Feeling: Accuracy and Application of Sentiment Analysis. 2023. International Journal of Research in Marketing, 40,1, 75–87. DOI: https://doi.org/10.1016/j.ijresmar.2022.05.005. [J5E77P3M]
- Hernández-Lorenzo, Laura; Diaz, Aitor; Perez, Alvaro; Ros, Salvador; Gonzalez-Bianco, Elena: Exploring Spanish contemporary song lyrics through Digital Humanities methods: Some thematic and structural properties. 2022. Digital Scholarship in the Humanities. 37, 3, 738–746. DOI: https://doi.org/10.1093/llc/fqab083 [4PRVFJNZ]
- Koncar, Philipp; Geiger, Bernhard C.; Glatz, Christina; Hobisch, Elisabeth; Sarić, Sanja; Scholger, Martina; Völkl, Yvonne; Helic, Denis: A Sentiment Analysis Tool Chain for 18th Century Periodicals. 2022. In: Manuel Burghardt, Lisa Dieckmann, Timo Steyer, Peer Trilcke, Niels-Oliver Walkowski, Joëlle Weis, Ulrike Wuttke (eds.): Fabrikation von Erkenntnis. Experimente in den Digital Humanities. Luxembourg. Zeitschrift für digitale Geisteswissenschaften und Melusina Press. DOI: https://doi.org/10.26298/ezpg-wk34. [ZE7V6L5N]
- Schmidt, Thomas; Dennerlein, Katrin; Wolff, Christian : Using Deep Learning for Emotion Analysis of 18th and 19th Century German Plays. 2021. In: Burghardt, Manuel and Dieckmann, Lisa and Steyer, Timo and Trilcke, Peer and Walkowski, Niels-Oliver and Weis, Joëlle and Wuttke, Ulrike, (eds.): Fabrikation von Erkenntnis: Experimente in den Digital Humanities. Teilband 1. Melusina Press, Esch-sur-Alzette, Luxembourg. ISBN:  978-2-919815-25-8. [RHVH9KPB]
- Sprugnoli, Rachele; Passarotti, Marco; Cecchini, Flavio Massimiliano; Pellegrini, Matteo: Overview of the EvaLatin 2020 Evaluation Campaign. 2020. Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages (105–110). Marseille: European Language Resources Association (ELRA). [9HI22FW8]
- Sprugnoli, Rachele;  Mambrini, Francesco; Passarotti, Marco; Moretti, Giovanni: Sentiment Analysis of Latin Poetry: First Experiments on the Odes of Horace. 2021. Proceedings of the Eighth Italian Conference on Computational Linguistics. Milan, Italy, 314–320. DOI:  https://doi.org/10.5281/zenodo.5773792. [7MNF9SMT]
- Stanković, Ranka; Miloš Košprdić: Sentiment Analysis of Serbian Old Novels. 2022. Proceedings of the 2nd Workshop on Sentiment Analysis and Linguistic Linked Data, European Language Resources Association, 31–38. [TPHTTER9]
- Suissa, Omri, et al. “Text Analysis Using Deep Neural Networks in Digital Humanities and Information Science.” Journal of the Association for Information Science and Technology, vol. 73, no. 2, 2022, pp. 268–87, https://doi.org/10.1002/asi.24544. [6TAVZY7M]
- Zehe, Albin; Becker, Martin; Jannidis, Fotis; Hotho, Andreas: Towards Sentiment Analysis on German Literature. 2017. In: Kern-Isberner, G., Fürnkranz, J., Thimm, M. (eds) KI 2017: Advances in Artificial Intelligence. KI 2017. Lecture Notes in Computer Science, 10505. Springer, Cham. DOI: https://doi.org/10.1007/978-3-319-67190-1_36. [3AHVI274]

<B>TUTORIALS</B>
- https://huggingface.co/blog/sentiment-analysis-python
- https://towardsdatascience.com/a-step-by-step-tutorial-for-conducting-sentiment-analysis-a7190a444366
- https://realpython.com/python-nltk-sentiment-analysis/