# Sentiment Analysis

**Sentiment analysis** is the process of understanding the opinion of an author about a subject. 

## Key Elements 

- **Opinion/Emotion**
    - Opinion (or polarity) can be positive, neutral or negative
    - Emotion can be qualitative (for example joy, anger or surprise) or quantitative (for example a rating on a scale)

- **Subject**

- **Opinion Holder**

Sentiment analysis is important in social media monitoring, brand monitoring, customer service, product analytics, market research...

In [None]:
import pandas as pd 
import numpy as np 
import seaborn as sns 
import matplotlib.pyplot as plt 

data = pd.read_csv('../data/sentiment_analysis/IMDB_sample.csv')

data['label'].value_counts()

In [None]:
data['label'].value_counts(normalize=True)

## Levels of granularity

- Document
- Sentence
- Aspect

## Types of sentiment analysis algorithms

### Rule/lexicon-based 

nice:+2, good:+1, terrible:-3... 
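
As a minimal sketch of the lexicon idea, the valence scores listed above can be stored in a dictionary and summed over the words of a sentence. The `lexicon_score` helper and the whitespace tokenization are assumptions made for illustration, not the method of any particular library.

```python
# Valence scores taken from the example above
lexicon = {"nice": 2, "good": 1, "terrible": -3}

def lexicon_score(text):
    """Sum the valence of every known word; unknown words count as 0."""
    tokens = text.lower().split()
    # Strip trailing punctuation so that "nice." still matches "nice"
    return sum(lexicon.get(token.strip(".,!?"), 0) for token in tokens)

print(lexicon_score("The food was good and the staff was nice."))
print(lexicon_score("Terrible service."))
```

A positive total suggests positive polarity and a negative total suggests negative polarity. Real lexicons are much larger and usually handle negation and intensifiers as well.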

### Automatic / Machine Learning 

### Which one to choose? 

The automated/ML approach depends on historical labelled data, takes longer to build and can be quite powerful. The rule/lexicon approach depends on manually crafted valence scores and can be fast, but the same word might deserve different scores depending on the context.

In [None]:
text = "Today was a good day."

from textblob import TextBlob 

my_valence = TextBlob(text) 
print(my_valence.sentiment)

## Wordclouds

In [None]:
from wordcloud import WordCloud 

cloud_two_cities = WordCloud().generate('En un pais multicolor nacion una abeja bajo el sol. La abeja se llamaba Maya y era azteca, que no maya. La abeja del pais guay')

plt.imshow(cloud_two_cities, interpolation='bilinear')
plt.axis('off')
plt.show()

# Bag of Words (BoW)

BoW transforms text into a numeric representation.

BoW describes the occurrence of words within a document or a collection of documents.

It builds a vocabulary of the words and a measure of their presence.

The drawback of BoW is that word order and grammar are lost once the representation is built.

In [None]:
revs = pd.read_csv('../data/sentiment_analysis/amazon_reviews_sample.csv')
revs.head()

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features=1000)
cv.fit(data.review)
X=cv.transform(data.review) 
X_df = pd.DataFrame(X.toarray(), columns=cv.get_feature_names_out())

In [None]:
X_df.head()

In [None]:
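import sys

# sys.getsizeof only measures the Python wrapper of a sparse matrix,
# so sum the underlying arrays to get the real memory footprint in bytes
X.data.nbytes + X.indices.nbytes + X.indptr.nbytes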


In [None]:
sys.getsizeof(X_df)

# N-grams

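BoW built on single tokens loses context such as negation: "not good" is split into "not" and "good". N-grams keep sequences of consecutive tokens together.

- Unigrams: single tokens
- Bigrams: pairs of consecutive tokens
- Trigrams: triples of consecutive tokens
- N-grams: sequences of n consecutive tokens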


`cv = CountVectorizer(ngram_range=(min_n, max_n))`

`ngram_range=(1, 2)`, for instance, uses unigrams and bigrams.

The larger the n:
- The more context each feature captures, which can make the ML model more precise
- The more features
- The higher the risk of overfitting

Look for the best n for the problem at hand. 

- The `max_features` parameter limits the number of features by keeping only the most frequent terms.
- `max_df` ignores terms whose document frequency is above its value, which can be an integer (number of documents) or a proportion. `min_df` works the same way for terms below its value.



In [None]:
#Import the vectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Build the vectorizer, specify max features and fit
vect = CountVectorizer(max_features=1000, ngram_range=(2, 2), max_df=500)
vect.fit(revs.review)

# Transform the review
X_review = vect.transform(revs.review)

# Create a DataFrame from the bow representation
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names_out())
print(X_df.head())

# Build features from text

In [None]:
from nltk import word_tokenize 

anna_k = 'Happy families are all alike, every unhappy family is unhappy in its own way.'

word_tokenize(anna_k)

In [None]:
word_tokens = [word_tokenize(review) for review in revs.review] 
word_tokens[0]

In [None]:
# Number of tokens in each review
revs['len_tokens'] = [len(tokens) for tokens in word_tokens]
revs.head()

In [None]:
from langdetect import detect_langs

spanish = 'Tu crees que el tema este será capaz de adivinar en qué idioma está escrita esta frase?'

detect_langs(spanish)

# Stopwords

Stop words are words that occur very often and add little information.

Depending on the context, we may want to add other words that we know will be very frequent too.

In [None]:
from wordcloud import WordCloud, STOPWORDS 
import matplotlib.pyplot as plt 

my_stopwords =set(STOPWORDS)
# When analysing movie reviews, the following words can be considered stop words too
my_stopwords.update(["movie", "movies", "film", "films", "watch", "br"])

In [None]:
my_cloud = WordCloud(background_color='white', stopwords=my_stopwords).generate(data.review[0])
plt.imshow(my_cloud, interpolation='bilinear')
plt.axis('off')
plt.show()

## Stopwords with BoW 

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

my_stopwords = ENGLISH_STOP_WORDS.union(["movie", "movies", "film", "films", "watch", "br"])

vect=CountVectorizer(stop_words=list(my_stopwords))
vect.fit(data.review)
X=vect.transform(data.review)


# Capturing tokens using patterns 

Sometimes there are tokens we would like to ignore, such as e-mail addresses or digits.

In [None]:
my_string = '123'
print('isalnum', my_string.isalnum())
print('isalpha', my_string.isalpha())
print('isdecimal', my_string.isdecimal())
print('isdigit', my_string.isdigit())
print('isnumeric', my_string.isnumeric())

In [None]:
import re 

my_string = '#cocotero'

x=re.search('#[A-Za-z]+', my_string)

print(x)

In [None]:
tweets = pd.read_csv('../data/sentiment_analysis/Tweets.csv')

# Build and fit the vectorizer
vect = CountVectorizer(token_pattern=r'\b[^\d\W][^\d\W]+\b').fit(tweets.text)
vect.transform(tweets.text)
print('Length of vectorizer: ', len(vect.get_feature_names_out()))
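
To see what this token pattern keeps, it can be tried directly with `re.findall` on a made-up tweet (the `sample_tweet` text is invented for illustration). `[^\d\W]` matches a word character that is not a digit, so the pattern keeps tokens of two or more letters or underscores.

```python
import re

token_pattern = r'\b[^\d\W][^\d\W]+\b'
sample_tweet = "Flight 123 delayed by 2h, thanks @united!"

# Pure numbers and tokens that contain digits are skipped
tokens_kept = re.findall(token_pattern, sample_tweet)
print(tokens_kept)
```

`123` and `2h` are dropped, and the handle loses its `@` because `@` is not a word character.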

In [None]:
# Build the first vectorizer
vect1 = CountVectorizer().fit(tweets.text)
vect1.transform(tweets.text)

# Build the second vectorizer
vect2 = CountVectorizer(token_pattern=r'\b[^\d\W][^\d\W]').fit(tweets.text)
vect2.transform(tweets.text)

# Print out the length of each vectorizer
print('Length of vectorizer 1: ', len(vect1.get_feature_names_out()))
print('Length of vectorizer 2: ', len(vect2.get_feature_names_out()))

In [None]:
# Import the word tokenizing package
from nltk import word_tokenize

# Tokenize the text column
word_tokens = [word_tokenize(review) for review in tweets.text]
print('Original tokens: ', word_tokens[0])

# Filter out non-letter characters
cleaned_tokens = [[word for word in item if word.isalpha()] for item in word_tokens]
print('Cleaned tokens: ', cleaned_tokens[0])

# Stemming and Lemmatization

- Stemming reduces the words of a text to their roots by stripping affixes. The result is not always a valid word. It is faster.
- Lemmatization is similar to stemming, but instead of finding roots it reduces each word to an actual valid word (its lemma). It is slower.



In [None]:
from nltk.stem import PorterStemmer 

porter = PorterStemmer()
porter.stem('wonderful')

In [None]:
from nltk.stem.snowball import SnowballStemmer 

SpanishStemmer = SnowballStemmer('spanish')
SpanishStemmer.stem('cuadranguloso')

In [None]:
sentence = 'Stem doesnt apply to sentences'
porter.stem(sentence)

In [None]:
from nltk import word_tokenize 

tokens=word_tokenize(sentence)
stemmed_tokens = [porter.stem(token) for token in tokens]

In [None]:
stemmed_tokens

In [None]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer() 

# pos is part of speech
lemmatizer.lemmatize('wonderful', pos='a')

# TFIDF

The tfidf score is defined as term frequency * inverse document frequency.

BoW does not account for the length of a document. TFIDF does (scikit-learn's `TfidfVectorizer` normalizes each document vector by default).

TFIDF gives low weights to words that are common across all the documents.

Because of this, TFIDF reduces the need to remove stop words explicitly, unlike raw counts.
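
The definition can be worked through by hand on three made-up documents. This sketch assumes the plain textbook form, tf = count / document length and idf = log(N / document frequency). scikit-learn's `TfidfVectorizer` uses a smoothed idf and normalizes each row, so its numbers will differ. The `tfidf` helper and the toy documents are illustrative.

```python
import math
from collections import Counter

toy_docs = [
    "the plot was good",
    "the acting was bad",
    "the plot was the problem",
]
tokenized = [doc.split() for doc in toy_docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(term, tokens):
    """Textbook tf-idf of a term within one tokenized document."""
    tf = tokens.count(term) / len(tokens)
    idf = math.log(n_docs / doc_freq[term])
    return tf * idf

# Scores for the first document
scores = {term: tfidf(term, tokenized[0]) for term in tokenized[0]}
print(scores)
```

"the" and "was" appear in every document, so their idf is log(1) = 0 and they score 0, while "good" appears in a single document and gets the highest score.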

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer 

vect = TfidfVectorizer(max_features=100)

In [None]:
vect.fit(tweets.text)
X=vect.transform(tweets.text)
X

In [None]:
X_df = pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out())
X_df.head()

In [None]:
# Import the required vectorizer package and stop words list
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

# Define the vectorizer and specify the arguments
my_pattern = r'\b[^\d\W][^\d\W]+\b'
vect = TfidfVectorizer(ngram_range=(1, 2), max_features=100, token_pattern=my_pattern, stop_words=list(ENGLISH_STOP_WORDS)).fit(tweets.text)

# Transform the text with the fitted vectorizer
X_txt = vect.transform(tweets.text)

# Transform to a data frame and specify the column names
X=pd.DataFrame(X_txt.toarray(), columns=vect.get_feature_names_out())
X.head()

# Predicting Sentiment 

Predicting sentiment is a classification problem with 2 classes (positive or negative) or 3 classes (positive, neutral and negative).


In [None]:
# Import the logistic regression
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

vect=CountVectorizer(max_features=100)

movies = pd.read_csv('../data/sentiment_analysis/IMDB_sample.csv')

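# Define the matrix of features and the vector of targets
X = vect.fit_transform(movies.review)
y = movies.label

# Hold out 20% of the data for testing; fix the seed so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)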

# Build a logistic regression model and calculate the accuracy
log_reg = LogisticRegression().fit(X_train, y_train)

print('Accuracy of logistic regression: ', log_reg.score(X_test, y_test))
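
In the cell above the vectorizer is fitted on the whole dataset before the split. One possible alternative, sketched here on a tiny made-up training set, is to wrap the vectorizer and the classifier in a scikit-learn pipeline so that the vocabulary is learned from the training texts only and new raw text can be scored directly. The texts, labels and the `sentiment_clf` name are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: 1 = positive, 0 = negative
train_texts = [
    "great movie loved it",
    "wonderful film great acting",
    "terrible movie hated it",
    "awful film terrible acting",
]
train_labels = [1, 1, 0, 0]

# The pipeline fits the vectorizer and the classifier in one step
sentiment_clf = make_pipeline(CountVectorizer(), LogisticRegression())
sentiment_clf.fit(train_texts, train_labels)

# Raw text goes straight into predict; words unseen in training are ignored
preds = sentiment_clf.predict(["great acting", "terrible plot"])
print(preds)
```

With real data, the same pipeline object can be passed to `train_test_split` output or to cross-validation helpers in place of a bare model.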