**Sentiment Analysis:**
Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.

### Playground

In [5]:
import os
import pandas as pd

Idea: Measure the sentiment score of the review

In [1]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [7]:
dataset_path = os.path.join("..", "datasets", "combined.csv")
dataset = pd.read_csv(dataset_path, delimiter=",")
dataset.head()

Unnamed: 0.1,Unnamed: 0,sentence,is_sarcastic
0,0,thirtysomething scientists unveil doomsday clo...,1.0
1,1,dem rep. totally nails why congress is falling...,0.0
2,2,eat your veggies: 9 deliciously different recipes,0.0
3,3,inclement weather prevents liar from getting t...,1.0
4,4,mother comes pretty close to using word 'strea...,1.0


In [10]:
analyzer = SentimentIntensityAnalyzer()
sentence = dataset["sentence"][0]
vs = analyzer.polarity_scores(sentence)
print(type(vs))
print("{:-<65} {}".format(sentence, str(vs)))

<class 'dict'>
thirtysomething scientists unveil doomsday clock of hair loss---- {'neg': 0.504, 'neu': 0.496, 'pos': 0.0, 'compound': -0.7269}


Idea: Count some punctuation symbols and normalize the counts with the size of the text

In [22]:
punctuations = ['.', '!', '?', ',']
punctuations_counts = list()
sentence = dataset["sentence"][1]
print(sentence)
for punctuation in punctuations:
    punctuations_counts.append(sentence.count(punctuation))
print(punctuations_counts)

dem rep. totally nails why congress is falling short on gender, racial equality
[1, 0, 0, 1]
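
The cell above only produces raw counts. A minimal sketch of the normalization step named in the idea, assuming "size of the text" means the number of characters (the helper name `punctuation_ratios` is hypothetical, not part of the notebook):

```python
def punctuation_ratios(text, punctuations=(".", "!", "?", ",")):
    """Return each punctuation count divided by the number of characters in text."""
    length = len(text)
    if length == 0:
        # Empty text: no characters to normalize by, so return zeros
        return [0.0 for _ in punctuations]
    return [text.count(p) / length for p in punctuations]


sentence = "dem rep. totally nails why congress is falling short on gender, racial equality"
ratios = punctuation_ratios(sentence)
print(ratios)
```

Dividing by the character count keeps long and short texts comparable. Dividing by the number of words would be an equally plausible reading of the idea.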


Idea: Count the number of nouns, verbs, etc. in a text and normalize the counts by the number of words

In [40]:
import stanza
constituency_parser = stanza.Pipeline(lang='en', processors='tokenize,pos')

2023-11-29 12:00:32 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.6.0.json: 367kB [00:00, 11.5MB/s]                    
2023-11-29 12:00:33 INFO: Loading these models for language: en (English):
| Processor | Package         |
-------------------------------
| tokenize  | combined        |
| pos       | combined_charlm |

2023-11-29 12:00:33 INFO: Using device: cpu
2023-11-29 12:00:33 INFO: Loading: tokenize
2023-11-29 12:00:33 INFO: Loading: pos
2023-11-29 12:00:33 INFO: Done loading processors!


In [44]:
text = dataset["sentence"][0]
doc = constituency_parser(text)
for sentence in doc.sentences:
    print(sentence.text)
    for word in sentence.words:
        print(word.text, word.upos)

thirtysomething scientists unveil doomsday clock of hair loss
thirtysomething ADJ
scientists NOUN
unveil VERB
doomsday ADJ
clock NOUN
of ADP
hair NOUN
loss NOUN
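
The normalization can be checked by hand on the tagger output above. The sketch below hard-codes those (word, UPOS) pairs so it runs without Stanza. The helper name `pos_ratios` is hypothetical.

```python
from collections import Counter

# (word, UPOS) pairs copied from the tagger output above
tagged = [
    ("thirtysomething", "ADJ"),
    ("scientists", "NOUN"),
    ("unveil", "VERB"),
    ("doomsday", "ADJ"),
    ("clock", "NOUN"),
    ("of", "ADP"),
    ("hair", "NOUN"),
    ("loss", "NOUN"),
]


def pos_ratios(tagged_words, tags=("NOUN", "VERB", "ADJ", "ADV")):
    """Return the share of words carrying each tag in tags."""
    counts = Counter(tag for _, tag in tagged_words)
    total = len(tagged_words)
    if total == 0:
        return [0.0] * len(tags)
    return [counts[tag] / total for tag in tags]


print(pos_ratios(tagged))  # [0.5, 0.125, 0.25, 0.0]
```

With 8 words, 4 nouns, 1 verb, 2 adjectives, and no adverbs, the ratios are 4/8, 1/8, 2/8, and 0/8.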


Idea: Create vocabulary for both sarcastic and non-sarcastic sets and measure the ratio of sarcastic and non-sarcastic words in the review

In [50]:
from sklearn.feature_extraction.text import CountVectorizer

In [55]:
bagOfWords = CountVectorizer(lowercase=True, stop_words='english', ngram_range=(1, 2), max_features=1000)
sarcastic_sentences = dataset[dataset["is_sarcastic"] == 1]["sentence"]
sarcastic_bow = bagOfWords.fit_transform(sarcastic_sentences)
vocabulary_sarcastic = bagOfWords.vocabulary_
regular_sentences = dataset[dataset["is_sarcastic"] == 0]["sentence"]
regular_bow = bagOfWords.fit_transform(regular_sentences)
vocabulary_regular = bagOfWords.vocabulary_

In [58]:
print(vocabulary_sarcastic)
print(vocabulary_regular)

{'scientists': 758, 'hair': 388, 'getting': 356, 'work': 975, 'mother': 579, 'comes': 175, 'pretty': 679, 'close': 170, 'using': 921, 'word': 973, 'nearly': 593, 'failed': 296, 'government': 375, 'large': 484, 'meet': 553, 'room': 743, 'new': 597, 'area': 52, 'boy': 111, 'man': 539, 'does': 239, 'area man': 53, 'video': 925, 'game': 347, 'secret': 762, 'service': 773, 'fan': 302, 'wearing': 952, 'shirt': 777, 'day': 215, 'york': 998, 'introduces': 454, 'program': 688, 'city': 162, 'new york': 599, 'obama': 609, 'state': 829, 'speech': 818, 'law': 489, 'history': 415, 'makes': 536, 'come': 174, 'students': 842, 'report': 723, 'probably': 684, 'god': 367, 'bring': 117, 'high': 410, 'mom': 571, 'keeps': 469, 'make': 535, 'stop': 834, 'team': 867, 'having': 400, 'men': 557, 'experience': 286, 'paul': 640, 'parents': 635, 'department': 226, 'warns': 942, 'americans': 41, 'avoid': 72, 'missing': 569, 'just': 466, 'hoping': 421, 'media': 551, 'gets': 355, 'needs': 596, 'releases': 719, 'romne
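
One possible reading of the ratio in this idea, sketched with plain Python sets instead of the fitted vectorizers. The toy vocabularies, the whitespace tokenization, and the helper name `vocabulary_ratio` are all assumptions made for illustration.

```python
def vocabulary_ratio(text, vocab_sarcastic, vocab_regular):
    """Share of vocabulary hits in text that come from the sarcastic vocabulary."""
    tokens = text.lower().split()
    n_sarcastic = sum(token in vocab_sarcastic for token in tokens)
    n_regular = sum(token in vocab_regular for token in tokens)
    total = n_sarcastic + n_regular
    if total == 0:
        # No token appears in either vocabulary: return a neutral value (assumed convention)
        return 0.5
    return n_sarcastic / total


# Toy vocabularies, not taken from the dataset
toy_sarcastic = {"area", "man", "report"}
toy_regular = {"says", "how", "report"}

ratio = vocabulary_ratio("area man says report is late", toy_sarcastic, toy_regular)
print(ratio)  # 0.6
```

Here three tokens hit the sarcastic vocabulary ("area", "man", "report") and two hit the regular one ("says", "report"), giving 3 / 5. A word present in both vocabularies counts toward both sides.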

Idea: Measure the average similarity between the keywords of the title and the keywords of the review

In [9]:
import spacy
import string
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
nlp = spacy.load("en_core_web_lg")

In [4]:
product_title = "Fresh Whole Rabbit (Misc.)"
product_title2 = "If Democrats Had Any Brains, They'd Be Republicans (Hardcover)"
review = "I'll keep this short and sweet.  We ordered one of these rabbits for our children this Easter and boy what a surprise.  It is NOT a living rabbit.  Someone has killed this rabbit and skinned it, I suppose for eating.  Anyway, our children were traumatized and Easter is not the same holiday that it used to be for us.  On the upside, we don't have to fill their Easter baskets anymore as we told them the Easter bunny was killed by Amazon. P.S.  The rabbit tasted very good."
review2 = "I did buy this book. Liberals will not buy this book, so their opinion is worthless. Besides most of them cant even read clearly enough to get the joke. Anger and frustration that Conservatives are clever and witty with a dash of smarta**, and they resort to reviewing a book that we know they wouldnt spend a penny on. Besides it would take too much of their welfare check to buy it! Quit trying to sabotage this book. Just go away."
review2_b = "Coulter writes at what she does best, finding witty or witty sounding excuses to support Republicans over Democrats in any circumstance.  Frankly, about the only thing she and Democrats probably agree on is that Bush Jr. did not do a good job with the war in Iraq, Afghanistan, and 9/11.  But of course, Coulter has some kind of excuse to blame Bush's responsibility on 'any old Democrat' even though Democrats may have nothing to do with Bush Jr.'s personal responsibilities or his decision for the country. I think she makes some good points, and I think some things are just put out there and that she doesn't listen. She just wants only her opinion to be right to an extreme.  She's good at analyzing situations, but she would not be good for a government position requiring much trust to keep stability, that is for sure.  On the other hand, you probably want her to be your Republican lobbyist. A 'friend' a 'Coulter Jr.' told me about how great this book is.  He acts just like Coulter, but just doesn't publish books and goes out and speaks like she does.  Otherwise, he would probably be doing at least okay- (Coulter created and kept her niche first.)  I am not particularly Democrat or Republican, but I try to give everything a chance.  This book, while giving some fresh perspectives I would not have thought of, is quite hit or miss, too opinionated, and not always reasoning things out enough."

In [3]:
token1 = nlp("dog")
token2 = nlp("cat")
token1.similarity(token2)

0.82208162391359

In [5]:
from rake_nltk import Rake
rake_nltk_var = Rake()
rake_nltk_var.extract_keywords_from_text(review2_b)
keyword_extracted_review = rake_nltk_var.get_ranked_phrases()
rake_nltk_var.extract_keywords_from_text(product_title2)
keyword_extracted_title = rake_nltk_var.get_ranked_phrases()
print(keyword_extracted_review)
print(keyword_extracted_title)

['government position requiring much trust', 'even though democrats may', "coulter jr .' told", 'niche first .)', "bush jr .'", 'witty sounding excuses', 'democrats probably agree', 'always reasoning things', 'bush jr', 'finding witty', 'probably want', 'blame bush', 'like coulter', 'coulter writes', 'coulter created', 'support republicans', 'speaks like', 'quite hit', 'publish books', 'personal responsibilities', 'particularly democrat', 'old democrat', 'least okay', 'keep stability', 'give everything', 'fresh perspectives', 'analyzing situations', 'would probably', 'republican lobbyist', 'good points', 'good job', 'democrats', 'coulter', 'things', 'republican', 'good', 'good', 'would', 'would', 'war', 'wants', 'try', 'thought', 'think', 'think', 'thing', 'sure', 'right', 'responsibility', 'put', 'otherwise', 'opinionated', 'opinion', 'nothing', 'miss', 'makes', 'listen', 'kind', 'kept', 'iraq', 'hand', 'great', 'goes', 'giving', 'friend', 'frankly', 'extreme', 'excuse', 'enough', 'de

In [6]:
from nltk import word_tokenize          
from nltk.stem import WordNetLemmatizer 
class LemmaTokenizer:
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, doc):
        return [self.wnl.lemmatize(t) for t in word_tokenize(doc)]

In [7]:
def get_dict_from_keyword_extracted(keyword_extracted):
    vectorizer = CountVectorizer(lowercase=True, tokenizer=LemmaTokenizer())
    matrix = vectorizer.fit_transform(keyword_extracted)
    # Total count of each term over all keyword phrases (computed once)
    totals = matrix.sum(axis=0).tolist()[0]
    counts_dict = dict(zip(vectorizer.get_feature_names_out(), totals))
    # Sort by count, most frequent first
    counts_dict = dict(sorted(counts_dict.items(), key=lambda item: item[1], reverse=True))
    # Remove single-character punctuation tokens
    for punctuation in string.punctuation:
        counts_dict.pop(punctuation, None)
    return counts_dict

In [10]:
dict_review = get_dict_from_keyword_extracted(keyword_extracted_review)
dict_title = get_dict_from_keyword_extracted(keyword_extracted_title)
print(dict_review)
print(dict_title)



{'coulter': 5, 'democrat': 5, 'good': 4, 'book': 3, 'bush': 3, 'jr': 3, 'probably': 3, 'republican': 3, 'thing': 3, 'would': 3, 'excuse': 2, 'like': 2, 'responsibility': 2, 'think': 2, 'want': 2, 'witty': 2, '11': 1, '9': 1, 'act': 1, 'afghanistan': 1, 'agree': 1, 'always': 1, 'analyzing': 1, 'best': 1, 'blame': 1, 'chance': 1, 'circumstance': 1, 'country': 1, 'course': 1, 'created': 1, 'decision': 1, 'enough': 1, 'even': 1, 'everything': 1, 'extreme': 1, 'finding': 1, 'first': 1, 'frankly': 1, 'fresh': 1, 'friend': 1, 'give': 1, 'giving': 1, 'go': 1, 'government': 1, 'great': 1, 'hand': 1, 'hit': 1, 'iraq': 1, 'job': 1, 'keep': 1, 'kept': 1, 'kind': 1, 'least': 1, 'listen': 1, 'lobbyist': 1, 'make': 1, 'may': 1, 'miss': 1, 'much': 1, 'niche': 1, 'nothing': 1, 'okay': 1, 'old': 1, 'opinion': 1, 'opinionated': 1, 'otherwise': 1, 'particularly': 1, 'personal': 1, 'perspective': 1, 'point': 1, 'position': 1, 'publish': 1, 'put': 1, 'quite': 1, 'reasoning': 1, 'requiring': 1, 'right': 1, '

In [15]:
# Similarities between keywords in title and review
similarities = list()
for word_title in dict_title.keys():
    for word_review in dict_review.keys():
        # Only compare against review words that occur more than once
        if dict_review[word_review] > 1:
            similarities.append(nlp(word_title).similarity(nlp(word_review)))
# Sort similarities
similarities.sort(reverse=True)
average = sum(similarities[:len(dict_title)]) / len(similarities[:len(dict_title)])
print(similarities[:len(dict_title)])
print(average)

[1.0, 1.0, 0.7707824197885758, 0.7707824197885758]
0.885391209894288
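
The averaging step above can be reproduced without spaCy. The sketch below assumes the similarity is the cosine similarity of two word vectors, which is the default for spaCy's `similarity` method. The helper names are hypothetical and the example vectors are made up.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # A zero vector has no direction, so treat the similarity as 0
        return 0.0
    return dot / (norm_u * norm_v)


def mean_top_k(similarities, k):
    """Average of the k largest similarity values."""
    top = sorted(similarities, reverse=True)[:k]
    return sum(top) / len(top) if top else 0.0


# Made-up 2-d vectors: identical directions give 1, orthogonal ones give 0
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))

# The four values printed by the cell above, plus two lower ones that get cut off
sims = [0.2, 1.0, 1.0, 0.7707824197885758, 0.7707824197885758, 0.1]
average = mean_top_k(sims, 4)
print(average)
```

Keeping only the top `k` values means a title word that matches one review word exactly contributes 1.0 no matter how unrelated the remaining review words are.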


### Feature extraction functions

In [48]:
def get_sentiment_score_feature(text):
    """!
    @brief Get the sentiment score feature of a text using VADER sentiment analysis tool.
    @param text (str): Text to be analyzed.
    @return (float): Compound sentiment score of the text, in the range [-1, 1].
    """
    analyzer = SentimentIntensityAnalyzer()
    return analyzer.polarity_scores(text)["compound"]

In [30]:
def get_punctuation_feature(text):
    """!
    @brief Get the punctuation feature of a text.
    @param text (str): Text to be analyzed.
    @return (list): List of punctuation counts normalized by the total count of punctuation in the text.
                    [count of '.', count of '!', count of '?', count of ',']
    """
    punctuations = ['.', '!', '?', ',']
    punctuations_counts = list()
    total_count = 0
    for punctuation in punctuations:
        count = text.count(punctuation)
        punctuations_counts.append(count)
        total_count += count
    # Normalize the counts (ratio of punctuation count to total count)
    if total_count != 0:
        punctuations_counts = [count / total_count for count in punctuations_counts]
    return punctuations_counts

In [46]:
def get_POS_feature(text, pipeline):
    """!
    @brief Get the POS feature of a text.
    @param text (str): Text to be analyzed.
    @param pipeline (stanza.Pipeline): The Stanza pipeline used for the POS tagging.
    @return (list): List of POS tag counts normalized by the total number of words in the text.
                    [Noun count, Verb count, Adjective count, Adverb count]
    """
    doc = pipeline(text)
    POS_tags = ['NOUN', 'VERB', 'ADJ', 'ADV']
    POS_counts = [0, 0, 0, 0]
    total_count = 0
    for sentence in doc.sentences:
        for word in sentence.words:
            total_count += 1
            if word.upos in POS_tags:
                POS_counts[POS_tags.index(word.upos)] += 1
    # Normalize the counts (ratio of POS tag count to total count)
    if total_count != 0:
        POS_counts = [count / total_count for count in POS_counts]

    return POS_counts

In [53]:
def get_word_unigram_bigram_feature(text, vocabulary_sarcastic, vocabulary_regular, top_range):
    """!
    @brief Get the word unigram and bigram feature of a text.
    @param text (str): Text to be analyzed.
    @param vocabulary_sarcastic (dict): Vocabulary of sarcastic words.
    @param vocabulary_regular (dict): Vocabulary of regular words.
    @param top_range (int): Number of top words to be considered for the feature.
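    @return (list): Four counts, each divided by the sum of the four counts.
                    [sarcastic terms (unigrams and bigrams), regular terms (unigrams and bigrams), sarcastic bigrams, regular bigrams]
    """
    # Keep the top_range entries with the highest vocabulary values.
    # Note: CountVectorizer.vocabulary_ maps each term to its column index, not to its frequency,
    # and text.count() matches substrings rather than whole tokens.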
    vocabulary_sarcastic = dict(sorted(vocabulary_sarcastic.items(), key=lambda item: item[1], reverse=True)[:top_range])
    vocabulary_regular = dict(sorted(vocabulary_regular.items(), key=lambda item: item[1], reverse=True)[:top_range])
    # Get the word unigram and bigram counts
    word_unigram_bigram_counts = [0, 0, 0, 0]
    word_unigram_bigram_counts[0] = sum([text.count(word) for word in vocabulary_sarcastic.keys()])
    word_unigram_bigram_counts[1] = sum([text.count(word) for word in vocabulary_regular.keys()])
    word_unigram_bigram_counts[2] = sum([text.count(word) for word in vocabulary_sarcastic.keys() if len(word.split()) == 2])
    word_unigram_bigram_counts[3] = sum([text.count(word) for word in vocabulary_regular.keys() if len(word.split()) == 2])
    # Normalize the counts (ratio of word unigram and bigram count to total count)
    total_count = sum(word_unigram_bigram_counts)
    if total_count != 0:
        word_unigram_bigram_counts = [count / total_count for count in word_unigram_bigram_counts]
    return word_unigram_bigram_counts
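
`CountVectorizer.vocabulary_` maps each term to its column index, so sorting it by value does not rank terms by how often they occur. A minimal sketch of one alternative, ranking unigrams and bigrams by corpus frequency with `collections.Counter`. The helper name `top_terms_by_frequency`, the whitespace tokenization, and the toy sentences are assumptions, and no stop words are removed.

```python
from collections import Counter


def top_terms_by_frequency(sentences, top_range):
    """Return the top_range most frequent unigrams and bigrams across sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        # Unigrams
        counts.update(tokens)
        # Bigrams built from adjacent token pairs
        counts.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return [term for term, _ in counts.most_common(top_range)]


# Toy sentences, not taken from the dataset
toy_sentences = ["area man wins", "area man loses", "area woman wins"]
top_terms = top_terms_by_frequency(toy_sentences, 4)
print(top_terms)
```

The resulting list could stand in for the keys of the truncated vocabulary dictionaries, provided the matching step also works on whole tokens rather than substrings.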


In [None]:
def get_contextual_feature(text, sentiment_score, review_stars):
    """!
    @brief Get the contextual feature of a text.
    A sarcastic text may have a sentiment score that contradicts the review stars.
    @param text (str): Text to be analyzed.
    @param sentiment_score (float): Sentiment score of the text.
    @param review_stars (float): Review stars of the text.
    @return (float): Absolute difference between the sentiment score (normalized) and review stars.
    """
    
    # Normalize sentiment_score on a 0 to 5 scale (scale of review_stars)
    # Sentiment score is in the range [-1, 1]
    sentiment_score = (sentiment_score + 1) * 2.5
    diff = abs(sentiment_score - review_stars)
    return diff
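
A short worked example of the rescaling used above, kept separate from the notebook function. The names `rescale_compound` and `contextual_gap` are hypothetical, and the star ratings are made-up inputs.

```python
def rescale_compound(compound):
    """Map a VADER compound score from [-1, 1] onto the [0, 5] scale."""
    return (compound + 1) * 2.5


def contextual_gap(compound, review_stars):
    """Absolute difference between the rescaled sentiment and the star rating."""
    return abs(rescale_compound(compound) - review_stars)


# The end points and the midpoint of the compound range
print(rescale_compound(-1.0), rescale_compound(0.0), rescale_compound(1.0))  # 0.0 2.5 5.0

# A strongly negative text paired with a 5-star rating gives a large gap,
# while the same text paired with a 1-star rating gives a small one
print(contextual_gap(-0.7269, 5.0))
print(contextual_gap(-0.7269, 1.0))
```

A large gap means the wording and the rating disagree, which is the contradiction the docstring describes as a possible sign of sarcasm.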


In [None]:
def get_similarity_feature(review, title, pipeline):
    """!
    @brief Get the similarity feature of a review and a title.
    @param review (str): Review to be analyzed.
    @param title (str): Title to be analyzed.
    @param pipeline (spacy.lang.en.English): The spaCy pipeline used for the similarity analysis.
    @return (float): Average pairwise token similarity between the review and the title.
    """
    doc_review = pipeline(review)
    doc_title = pipeline(title)
    similarity = 0
    for token_review in doc_review:
        for token_title in doc_title:
            similarity += token_review.similarity(token_title)
    return similarity / (len(doc_review) * len(doc_title))

In [59]:
print(get_sentiment_score_feature(dataset["sentence"][0]))
print(get_punctuation_feature(dataset["sentence"][0]))
print(get_POS_feature(dataset["sentence"][0], constituency_parser))
print(get_word_unigram_bigram_feature(dataset["sentence"][0], vocabulary_sarcastic, vocabulary_regular, 100))

-0.7269
[0, 0, 0, 0]
[0.5, 0.125, 0.25, 0.0]
[0.5, 0.5, 0.0, 0.0]
