# Emotion Representation

To capture the relationship between the emotional content of a review text and the probability of it being fake, the following 30 emotion indicators were derived from the lexicons listed below:


1. 4 polarity-based (counts of words found in the positive and negative word lists of each of the two lexicons):

	1. OpinionFinder:
		https://mpqa.cs.pitt.edu/opinionfinder/opinionfinder_2/

	2. Bing Liu’s opinion lexicon:
		https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon
		https://juliasilge.github.io/tidytext/reference/sentiments.html
		

---------------------------------------------------------------------------------------------------------------------------


2. 10 strength-based  (pos and neg score from each of the five lexicons)

	1. Afinn:
		!pip install afinn
		https://www.geeksforgeeks.org/python-sentiment-analysis-using-affin/
	
	2. S140:
		http://saifmohammad.com/WebPages/lexicons.html
		http://saifmohammad.com/Lexicons/Sentiment140-Lexicon-v0.1.zip  (for download)

	3. SentiWordNet 3.0:
		https://github.com/aesuli/SentiWordNet  (3.0 is here)
		https://www.nltk.org/api/nltk.corpus.reader.sentiwordnet.html?highlight=wordnet#:~:text=SentiWordNet%20is%20a%20lexical,.isti.cnr.it%2F  (version not stated; most likely 3.0)
	
	4. NRC Hashtag:	
		http://saifmohammad.com/WebDocs/NRC-Hashtag-Sentiment-Lexicon-v0.1.zip   (for download)
		http://saifmohammad.com/WebPages/lexicons.html    				
		
	5. Emoticon:  (the cited paper does not cover emoticons, so the NRC Emoticon Lexicon is used instead)
		http://saifmohammad.com/WebPages/lexicons.html
		http://saifmohammad.com/WebDocs/NRC-Emoticon-Lexicon-v1.0.zip  (for download; origin of this link not recorded)
	

---------------------------------------------------------------------------------------------------------------------------


3. 16 emotion-based   (8 from NRC emotion, 8 from NRC expanded)

	1. NRC emotion lexicon (emolex):
	https://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm
	http://saifmohammad.com/WebPages/AccessResource.htm
	!pip install NRCLex
	https://pypi.org/project/NRCLex/
		
	
	2. NRC expanded:
	https://ieeexplore.ieee.org/document/7817108
	https://www.cs.waikato.ac.nz/ml/sa/lex.html#


In [1]:
# Importing required modules

import pandas as pd
from nltk.tokenize import RegexpTokenizer
from nltk.util import ngrams   # used to build bigrams for the Sentiment140 and NRC Hashtag lexicons
from unidecode import unidecode
import csv

# Preprocessing

In [2]:
# The reviews are labelled as fake (Label1) or real (Label2)
# Dataset source: https://medium.com/@lievgarcia/deception-on-amazon-c1e30d977cfd

df = pd.read_csv("datasets/amazon_reviews.txt", sep = "\t")   
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21000 entries, 0 to 20999
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   DOC_ID             21000 non-null  int64 
 1   LABEL              21000 non-null  object
 2   RATING             21000 non-null  int64 
 3   VERIFIED_PURCHASE  21000 non-null  object
 4   PRODUCT_CATEGORY   21000 non-null  object
 5   PRODUCT_ID         21000 non-null  object
 6   PRODUCT_TITLE      21000 non-null  object
 7   REVIEW_TITLE       21000 non-null  object
 8   REVIEW_TEXT        21000 non-null  object
dtypes: int64(2), object(7)
memory usage: 1.4+ MB


In [3]:
# Mapping binary output label to numeric values 0 (fake review) and 1 (real review)
# Note: pd.factorize assigns codes in order of first appearance in the column

df['TARGET'] = pd.factorize(df['LABEL'])[0]
df['VERIFIED_PURCHASE'] = pd.factorize(df['VERIFIED_PURCHASE'])[0]   #Y -> 1, N -> 0
df.drop(["LABEL"], inplace = True, axis = 1)

df.head(30)

Unnamed: 0,DOC_ID,RATING,VERIFIED_PURCHASE,PRODUCT_CATEGORY,PRODUCT_ID,PRODUCT_TITLE,REVIEW_TITLE,REVIEW_TEXT,TARGET
0,1,4,0,PC,B00008NG7N,"Targus PAUK10U Ultra Mini USB Keypad, Black",useful,"When least you think so, this product will sav...",0
1,2,4,1,Wireless,B00LH0Y3NM,Note 3 Battery : Stalion Strength Replacement ...,New era for batteries,Lithium batteries are something new introduced...,0
2,3,3,0,Baby,B000I5UZ1Q,"Fisher-Price Papasan Cradle Swing, Starlight",doesn't swing very well.,I purchased this swing for my baby. She is 6 m...,0
3,4,4,0,Office Products,B003822IRA,Casio MS-80B Standard Function Desktop Calculator,Great computing!,I was looking for an inexpensive desk calcolat...,0
4,5,4,0,Beauty,B00PWSAXAM,Shine Whitening - Zero Peroxide Teeth Whitenin...,Only use twice a week,I only use it twice a week and the results are...,0
5,6,3,0,Health & Personal Care,B00686HNUK,Tobacco Pipe Stand - Fold-away Portable - Ligh...,not sure,I'm not sure what this is supposed to be but I...,0
6,7,4,0,Toys,B00NUG865W,ESPN 2-Piece Table Tennis,PING PONG TABLE GREAT FOR YOUTHS AND FAMILY,Pleased with ping pong table. 11 year old and ...,0
7,8,4,1,Beauty,B00QUL8VX6,Abundant Health 25% Vitamin C Serum with Vitam...,Great vitamin C serum,Great vitamin C serum... I really like the oil...,0
8,9,4,0,Health & Personal Care,B004YHKVCM,PODS Spring Meadow HE Turbo Laundry Detergent ...,wonderful detergent.,I've used tide pods laundry detergent for many...,0
9,10,1,0,Health & Personal Care,B00H4IBD0M,"Sheer TEST, Best Testosterone Booster Suppleme...",WARNING: do not waste your money on this,Everybody wants to fall for their promises. Bu...,0


In [4]:
num_fake = len(df[df['TARGET'] == 0])
num_real = len(df[df['TARGET'] == 1])

print(num_real, num_fake)

10500 10500


As seen above, the dataset is evenly balanced across both classes.

In [5]:
tokenizer = RegexpTokenizer(r'\w+')

# converting to lowercase and tokenizing
review_tokens = [tokenizer.tokenize(review.lower()) for review in df['REVIEW_TEXT']]

#removing special characters
review_tokens = [[unidecode(token) for token in review if token.isalnum()] for review in review_tokens]
" ".join(review_tokens[0])

'when least you think so this product will save the day just keep it around just in case you need it for something'
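
The preprocessing above can be approximated with the standard library alone for a quick sanity check. This is a minimal sketch, assuming ASCII-only input (so the `unidecode` transliteration would change nothing); `simple_tokenize` and the sample sentence are hypothetical and not part of the notebook.

```python
import re

def simple_tokenize(text):
    # Lowercase, then take runs of word characters (the same \w+ pattern
    # that is passed to RegexpTokenizer in the notebook)
    tokens = re.findall(r"\w+", text.lower())
    # Mirror the notebook's filter: keep purely alphanumeric tokens only,
    # which drops tokens containing underscores (\w matches '_')
    return [tok for tok in tokens if tok.isalnum()]

sample = "Great value, works as expected (so far)!"
tokens = simple_tokenize(sample)
print(" ".join(tokens))
```

Punctuation is discarded entirely by this pattern, so emoticons such as `:)` do not survive tokenization.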

# Emotion Representation

## Polarity: OpinionFinder 2.0

In [None]:
# OpinionFinder2.0: Tags words with polarity (pos/neg)
# Used to develop two features: OPI_FIN_POS and OPI_FIN_NEG
# Defined as the number of words per review that are tagged with each polarity

f_count = 1
count = 0
doclist = "emotion_lexicons/emotion_lexicons/opinionfinderv2.0/amazon_reviews_" + str(f_count) + ".doclist"
f2 = open(doclist, "a")

for i in range(len(review_tokens)):
    fname = "database/docs/amazon_reviews/rev_id_" + str(i + 1)
    fp = open("emotion_lexicons/emotion_lexicons/opinionfinderv2.0/" + fname, 'w')
    review_text = ' '.join(review_tokens[i])
    fp.write(review_text)
    fp.close()
    
    if count == 2100:
        f2.close()
        count = 0
        f_count += 1
        
        doclist = "emotion_lexicons/emotion_lexicons/opinionfinderv2.0/amazon_reviews_" + str(f_count) + ".doclist"
        f2 = open(doclist, "a")
        
    f2.write(fname+"\n")         
    count += 1
    
f2.close()

### TO USE OPINIONFINDER 2.0: 
#### Run these commands one after another in a terminal opened at path emotion_lexicons/emotion_lexicons/opinionfinderv2.0

java -Xmx1g -classpath lib\weka.jar;lib\stanford-postagger.jar;opinionfinder.jar opin.main.RunOpinionFinder amazon_reviews_1.doclist -d

java -Xmx1g -classpath lib\weka.jar;lib\stanford-postagger.jar;opinionfinder.jar opin.main.RunOpinionFinder amazon_reviews_2.doclist -d

java -Xmx1g -classpath lib\weka.jar;lib\stanford-postagger.jar;opinionfinder.jar opin.main.RunOpinionFinder amazon_reviews_3.doclist -d

java -Xmx1g -classpath lib\weka.jar;lib\stanford-postagger.jar;opinionfinder.jar opin.main.RunOpinionFinder amazon_reviews_4.doclist -d

java -Xmx1g -classpath lib\weka.jar;lib\stanford-postagger.jar;opinionfinder.jar opin.main.RunOpinionFinder amazon_reviews_5.doclist -d

java -Xmx1g -classpath lib\weka.jar;lib\stanford-postagger.jar;opinionfinder.jar opin.main.RunOpinionFinder amazon_reviews_6.doclist -d

java -Xmx1g -classpath lib\weka.jar;lib\stanford-postagger.jar;opinionfinder.jar opin.main.RunOpinionFinder amazon_reviews_7.doclist -d

java -Xmx1g -classpath lib\weka.jar;lib\stanford-postagger.jar;opinionfinder.jar opin.main.RunOpinionFinder amazon_reviews_8.doclist -d

java -Xmx1g -classpath lib\weka.jar;lib\stanford-postagger.jar;opinionfinder.jar opin.main.RunOpinionFinder amazon_reviews_9.doclist -d

java -Xmx1g -classpath lib\weka.jar;lib\stanford-postagger.jar;opinionfinder.jar opin.main.RunOpinionFinder amazon_reviews_10.doclist -d
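
The ten commands differ only in the doclist name, so they can also be generated in a loop. The sketch below only builds and prints the commands; it assumes the same jar layout as the commands above and a working directory of `emotion_lexicons/emotion_lexicons/opinionfinderv2.0`. `opinionfinder_command` is a hypothetical helper. The `;` classpath separator shown above is the Windows convention, and `os.pathsep` gives the separator for the current platform.

```python
import os

def opinionfinder_command(doclist, sep=os.pathsep):
    # Jar paths relative to the OpinionFinder directory, as in the commands above
    jars = ["lib/weka.jar", "lib/stanford-postagger.jar", "opinionfinder.jar"]
    classpath = sep.join(os.path.normpath(jar) for jar in jars)
    return ["java", "-Xmx1g", "-classpath", classpath,
            "opin.main.RunOpinionFinder", doclist, "-d"]

commands = [opinionfinder_command(f"amazon_reviews_{i}.doclist") for i in range(1, 11)]

for cmd in commands:
    print(" ".join(cmd))
    # To actually run a command, pass the list to subprocess.run(cmd, check=True)
```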

In [9]:
# extracting polarity labels from output file (exp_polarity.txt) generated by OpinionFinder2.0 and adding to dataset

opinion_finder_pos_count = []
opinion_finder_neg_count = []

parent_dir = "emotion_lexicons/emotion_lexicons/opinionfinderv2.0/database/docs/amazon_reviews/rev_id_"
suffix = "_auto_anns/exp_polarity.txt"

for i in range(len(review_tokens)):
    fpath = parent_dir + str(i + 1) + suffix
    f = open(fpath, "r")
    content = f.read()
    f.close()
    
    opinion_finder_pos_count.append(content.count("positive"))
    opinion_finder_neg_count.append(content.count("negative"))
    

df['OPI_FIN_POS'] = opinion_finder_pos_count
df['OPI_FIN_NEG'] = opinion_finder_neg_count

## Polarity: Bing Liu's Lexicon

In [10]:
# Bing Liu et al: Opinion Lexicon for positive and negative polarity tagging of words
# Used to develop 2 features: BL_POS and BL_NEG
# Defined as the number of words per review that appear in each polarity list

dir_name = "emotion_lexicons/emotion_lexicons/bing-liu-opinion-lexicon-English/"
pos_file = dir_name + "positive-words.txt"
neg_file = dir_name + "negative-words.txt"

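# Load each word list into a set so that `token in lexicon` tests whole-word membership
# (the same test against the raw file text would be a substring match).
# Lines starting with ';' are treated as header comments; latin-1 is assumed for decoding.
def load_word_list(path):
    with open(path, "r", encoding="latin-1") as f:
        return {line.strip() for line in f if line.strip() and not line.startswith(";")}

pos_lexicon = load_word_list(pos_file)
neg_lexicon = load_word_list(neg_file)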

bl_pos = []
bl_neg = []

for review in review_tokens:
    count_pos = 0
    count_neg = 0
    
    for token in review:
        if token in pos_lexicon:
            count_pos += 1
        if token in neg_lexicon:
            count_neg += 1
            
    bl_pos.append(count_pos)
    bl_neg.append(count_neg)
    
print(bl_pos[:15])
print(bl_neg[:15])

df['BL_POS'] = bl_pos
df['BL_NEG'] = bl_neg
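
The polarity counts depend on how the word lists are held in memory: `in` on a set tests for a whole-word entry, while `in` on a string tests for a substring. A minimal sketch with hypothetical miniature word lists (the real lexicon files are far larger) illustrates the difference; `count_polarity` is an illustrative helper, not notebook code.

```python
# Hypothetical miniature word lists standing in for the lexicon files
pos_words = {"good", "great", "love"}
neg_words = {"bad", "poor", "broken"}

def count_polarity(tokens, pos_words, neg_words):
    """Return (positive_count, negative_count) for one tokenized review."""
    pos = sum(1 for tok in tokens if tok in pos_words)
    neg = sum(1 for tok in tokens if tok in neg_words)
    return pos, neg

review = "great sound but the cable arrived broken".split()
counts = count_polarity(review, pos_words, neg_words)
print(counts)  # (1, 1)

# The same check against raw file text is a substring test
raw_text = "good\ngreat\nlove\n"
print("a" in raw_text)   # True: 'a' occurs inside 'great'
print("a" in pos_words)  # False: 'a' is not an entry
```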

## Strength Score: AFINN

In [11]:
# AFINN: Sentiment lexicon for measuring the positive and negative score of a review
# Used to develop 2 features: AFINN_POS and AFINN_NEG

!pip install afinn

from afinn import Afinn
afn = Afinn()

afinn_pos = []
afinn_neg = []

for review in review_tokens:
    review = " ".join(review)
    s = afn.score(review)
    
    if s > 0:
        afinn_pos.append(s)
        afinn_neg.append(0.0)
        
    else:
        afinn_pos.append(0.0)
        afinn_neg.append(-1 * s)
        
df['AFINN_POS'] = afinn_pos
df['AFINN_NEG'] = afinn_neg
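
The strength features in this notebook split one signed score into a pair of non-negative values. A small sketch of that step, with `split_signed_score` as a hypothetical helper name:

```python
def split_signed_score(score):
    """Map a signed sentiment score to (positive, negative) magnitudes."""
    if score > 0:
        return float(score), 0.0
    # Zero and negative scores go to the negative slot as a magnitude
    return 0.0, float(abs(score))

for s in (3.0, -2.0, 0.0):
    print(s, split_signed_score(s))
```

With this scheme at least one of the two features is always zero for a given review, so the pair carries the net score and its sign rather than separate totals of positive and negative words.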

## Strength Score: Sentiment140

In [13]:
# Sentiment140: Lexicon for measuring the positive and negative score of a unigram/bigram
# Used to develop 2 features: S140_POS and S140_NEG


dir_name = "emotion_lexicons/emotion_lexicons/Sentiment140-Lexicon/"
f1 = "unigrams-pmilexicon.txt"
f2 = "bigrams-pmilexicon.txt"


uni_lex = pd.read_csv(dir_name + f1, sep = "\t")
bi_lex = pd.read_csv(dir_name + f2, sep = "\t", quoting=csv.QUOTE_NONE, encoding='utf-8')

uni_dict = dict(zip(uni_lex["term"], uni_lex["score"]))
bi_dict = dict(zip(bi_lex["term"], bi_lex["score"]))

s140_pos = []
s140_neg = []

for review in review_tokens:
    score = 0
    uni_score = 0
    uni_c = 0
        
    for unigram in review:
        if unigram in uni_dict:
            uni_score += uni_dict[unigram]
            uni_c += 1
    
    if uni_c > 0:
        uni_score /= uni_c
    
    
    bi_score = 0
    bi_c = 0
    
    bigrams = list(ngrams(review, 2))
    for bigram in bigrams:
        text = " ".join(bigram)
        if text in bi_dict:
            bi_score += bi_dict[text]
            bi_c += 1
    
    if bi_c > 0:
        bi_score /= bi_c
    
    
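    # Average the unigram and bigram means over whichever of the two had matches;
    # fall back to 0 when neither matched, to avoid dividing by zero
    n_sources = int(uni_c > 0) + int(bi_c > 0)
    score = (bi_score + uni_score) / n_sources if n_sources > 0 else 0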
    
    if score > 0:
        s140_pos.append(round(score, 5))
        s140_neg.append(0.0)
        
    else:
        s140_pos.append(0.0)
        s140_neg.append(round(-1 * score, 5))
    
df['S140_POS'] = s140_pos
df['S140_NEG'] = s140_neg
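
The scoring scheme in the cell above (mean unigram score and mean bigram score, averaged over whichever of the two found matches) can be sketched on toy data. The dictionaries below are hypothetical stand-ins for the lexicon files, and `ngram_lexicon_score` is an illustrative name; bigrams are built with `zip` instead of `nltk.util.ngrams` to keep the sketch dependency-free.

```python
def ngram_lexicon_score(tokens, uni_dict, bi_dict):
    # Scores of all unigrams that appear in the unigram lexicon
    uni_hits = [uni_dict[tok] for tok in tokens if tok in uni_dict]
    # Adjacent token pairs joined by a space, looked up in the bigram lexicon
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    bi_hits = [bi_dict[bg] for bg in bigrams if bg in bi_dict]
    # Mean per source, skipping sources with no matches
    means = [sum(hits) / len(hits) for hits in (uni_hits, bi_hits) if hits]
    return sum(means) / len(means) if means else 0.0

uni = {"good": 1.5, "bad": -2.0}   # hypothetical unigram scores
bi = {"not good": -1.0}            # hypothetical bigram scores

score = ngram_lexicon_score("this is not good".split(), uni, bi)
print(score)  # 0.25
```

Returning 0.0 when nothing matches keeps reviews without any lexicon hits from raising a division error.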

## Strength Score: SentiWordNet3.0

In [15]:
# SentiWordNet3.0: Lexicon for measuring the positive and negative score of synsets in WordNet 3.0
# Used to develop 2 features: SWN_POS and SWN_NEG


# import nltk
# nltk.download('sentiwordnet')

from nltk.corpus import sentiwordnet as swn

swn_pos = []
swn_neg = []

for review in review_tokens:
    pos_score = 0
    neg_score = 0
    count = 0
    
    for term in review:
        # Use the first sense returned for the term; terms with no synsets are skipped
        synsets = list(swn.senti_synsets(term))
        if synsets:
            pos_score += synsets[0].pos_score()
            neg_score += synsets[0].neg_score()
            count += 1
    
    if count > 0:
        pos_score = pos_score / count
        neg_score = neg_score / count
        
    swn_pos.append(pos_score)
    swn_neg.append(neg_score)

df['SWN_POS'] = swn_pos
df['SWN_NEG'] = swn_neg
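
The cell above takes only the first sense returned for each term and averages over the terms that have at least one sense. The sketch below reproduces that logic on a hypothetical lookup table that stands in for `swn.senti_synsets`; the words and scores are made up for illustration.

```python
# Hypothetical sense table: word -> list of (pos_score, neg_score), one pair per sense
senses = {
    "good": [(0.75, 0.0), (0.5, 0.0)],
    "terrible": [(0.0, 0.625)],
}

def first_sense_average(tokens, senses):
    # Keep the first sense of every token that has at least one sense
    scored = [senses[tok][0] for tok in tokens if senses.get(tok)]
    if not scored:
        return 0.0, 0.0
    pos = sum(p for p, _ in scored) / len(scored)
    neg = sum(n for _, n in scored) / len(scored)
    return pos, neg

result = first_sense_average("good but terrible".split(), senses)
print(result)  # (0.375, 0.3125)
```

Taking the first sense ignores part of speech and context, so a word with several senses is always scored by the one that happens to be listed first.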

## Strength Score: NRC Hashtag

In [17]:
# NRC Hashtag: Lexicon for measuring the positive and negative score of a unigram/bigram based on twitter hashtags
# Used to develop 2 features: NRC_HASH_POS and NRC_HASH_NEG


dir_name = "emotion_lexicons/emotion_lexicons/NRC-Hashtag-Sentiment-Lexicon-v0.1/"
f1 = "unigrams-pmilexicon.txt"
f2 = "bigrams-pmilexicon.txt"

uni_lex = pd.read_csv(dir_name + f1, sep = "\t")
bi_lex = pd.read_csv(dir_name + f2, sep = "\t", quoting=csv.QUOTE_NONE, encoding='utf-8')

uni_dict = dict(zip(uni_lex["term"], uni_lex["score"]))
bi_dict = dict(zip(bi_lex["term"], bi_lex["score"]))

nrc_hash_pos = []
nrc_hash_neg = []

for review in review_tokens:
    score = 0
    uni_score = 0
    uni_c = 0
        
    for unigram in review:
        if unigram in uni_dict:
            uni_score += uni_dict[unigram]
            uni_c += 1
    
    if uni_c > 0:
        uni_score /= uni_c
    
    
    bi_score = 0
    bi_c = 0
    
    bigrams = list(ngrams(review, 2))
    for bigram in bigrams:
        text = " ".join(bigram)
        if text in bi_dict:
            bi_score += bi_dict[text]
            bi_c += 1
    
    if bi_c > 0:
        bi_score /= bi_c
    
    
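    # Average the unigram and bigram means over whichever of the two had matches;
    # fall back to 0 when neither matched, to avoid dividing by zero
    n_sources = int(uni_c > 0) + int(bi_c > 0)
    score = (bi_score + uni_score) / n_sources if n_sources > 0 else 0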
    
    if score > 0:
        nrc_hash_pos.append(round(score, 5))
        nrc_hash_neg.append(0.0)
        
    else:
        nrc_hash_pos.append(0.0)
        nrc_hash_neg.append(round(-1 * score, 5))
    
    
df['NRC_HASH_POS'] = nrc_hash_pos
df['NRC_HASH_NEG'] = nrc_hash_neg

## Strength Score: Emoticon - Based Lexicon

In [19]:
# Emoticon Based Lexicon: Lexicon for measuring the positive and negative score of a word based on co-occurrence with emoticons
# Used to develop 2 features: EMOTICON_POS and EMOTICON_NEG

dir_name = "emotion_lexicons/emotion_lexicons/references_and_lexicons/ijcai-kbs_emoticon/"
f1 = "STS_OR.csv"

lex = pd.read_csv(dir_name + f1, sep = "\t")

lex['word'] = lex['word'].str.split('-').str[1]

my_dict = dict([(i,(a,b)) for i, a, b in zip(lex['word'], lex['positive'], lex['negative'])])

emoticon_pos = []
emoticon_neg = []

for review in review_tokens:
    pos_score = 0
    neg_score = 0
    count = 0
        
    for word in review:
        if word in my_dict:
            pos_score += my_dict[word][0]
            neg_score += my_dict[word][1]
            count += 1    
      
    if count > 0:
        pos_score = round(pos_score / count, 5)
        neg_score = round(neg_score / count, 5)
        
    emoticon_pos.append(pos_score)
    emoticon_neg.append(neg_score)
    

df['EMOTICON_POS'] = emoticon_pos
df['EMOTICON_NEG'] = emoticon_neg

## NRCLex: Emotion - Based Lexicon

In [36]:
# The NRC Emotion Lexicon is a list of English words and their associations with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive).
# Used to develop 8 features: "NRC_ANGER", "NRC_ANTICIPATION", "NRC_DISGUST", "NRC_FEAR", "NRC_JOY", "NRC_SADNESS", "NRC_SURPRISE", "NRC_TRUST"

# !pip install NRCLex
from nrclex import NRCLex

my_dict = {"NRC_ANGER": [], 
           "NRC_ANTICIPATION": [],
           "NRC_DISGUST": [], 
           "NRC_FEAR": [], 
           "NRC_JOY": [], 
           "NRC_SADNESS": [], 
           "NRC_SURPRISE": [],
           "NRC_TRUST": []}



for review in review_tokens:
    text = " ".join(review)
    res = NRCLex(text)
    emotion_scores = res.raw_emotion_scores
        
    for key in my_dict:
        emotion =  key[4:].lower()
        if emotion in emotion_scores:
            my_dict[key].append(emotion_scores[emotion])
        else:
            my_dict[key].append(0)
            
df2 = pd.DataFrame.from_dict(my_dict)
df = pd.concat([df, df2], axis=1)


## NRCLex Expanded: Emotion - Based Lexicon

In [None]:
# Extension of the NRC Emotion Lexicon built from Twitter data
# Used to develop 8 features: "NRC_EXP_ANGER", "NRC_EXP_ANTICIPATION", "NRC_EXP_DISGUST", "NRC_EXP_FEAR", "NRC_EXP_JOY", "NRC_EXP_SADNESS", "NRC_EXP_SURPRISE", "NRC_EXP_TRUST"

my_dict = {"NRC_EXP_ANGER": [], 
           "NRC_EXP_ANTICIPATION": [],
           "NRC_EXP_DISGUST": [], 
           "NRC_EXP_FEAR": [], 
           "NRC_EXP_JOY": [], 
           "NRC_EXP_SADNESS": [], 
           "NRC_EXP_SURPRISE": [],
           "NRC_EXP_TRUST": []}


dir_name = "emotion_lexicons/emotion_lexicons/emo_lex_expanded/"
f1 = "w2v-dp-CC-Lex.csv"

lex = pd.read_csv(dir_name + f1, sep = "\t")

lex = dict([(i,(a,b,c,d,e,f,g,h)) for i, a, b, c, d, e, f, g, h in zip(lex['word'], lex['anger'], lex['anticipation'], lex['disgust'], lex['fear'], lex['joy'], lex['sadness'], lex['surprise'], lex['trust'])])


for review in review_tokens:
    
    for key in my_dict:
        my_dict[key].append(0)
    
    for word in review:
        
        if word in lex:
            i = 0
            for key in my_dict:
                my_dict[key][-1] += lex[word][i]
                i += 1
        
df2 = pd.DataFrame.from_dict(my_dict)
df = pd.concat([df, df2], axis=1)
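
The expanded-lexicon features are per-review sums of eight emotion weights over all matched words. A minimal sketch of that accumulation, using a hypothetical two-word lexicon with made-up weights; `emotion_totals` is an illustrative helper, not notebook code.

```python
EMOTIONS = ("anger", "anticipation", "disgust", "fear",
            "joy", "sadness", "surprise", "trust")

# Hypothetical lexicon: word -> eight weights in the order of EMOTIONS
lexicon = {
    "happy": (0.0, 0.25, 0.0, 0.0, 0.75, 0.0, 0.25, 0.5),
    "awful": (0.5, 0.0, 0.75, 0.5, 0.0, 0.5, 0.0, 0.0),
}

def emotion_totals(tokens, lexicon):
    totals = dict.fromkeys(EMOTIONS, 0.0)
    for tok in tokens:
        weights = lexicon.get(tok)
        if weights is None:
            continue  # word not in the lexicon
        for emotion, weight in zip(EMOTIONS, weights):
            totals[emotion] += weight
    return totals

totals = emotion_totals("happy happy awful".split(), lexicon)
print(totals["joy"], totals["anger"])  # 1.5 0.5
```

Because the weights are summed rather than averaged, longer reviews tend to receive larger values on every emotion.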


In [47]:
df.to_csv("datasets/amazon_reviews_with_emotion_features.txt", sep = "\t", index = False)