Copyright (c) <2022>, <Regina Nockerts>
All rights reserved.

This source code is licensed under the BSD-style license found in the
LICENSE file in the root directory of this source tree. 

__NOTE__ to the user: On first use, this notebook cannot be run top to bottom. It assumes that you already have several CSV files that are created at different points in the notebook.

In [3]:
import pandas as pd
import numpy as np
import os.path
from nlpUtils import aardvark as aa 
from sklearn.metrics import f1_score # auc if I get embeddings
import emoji  # https://pypi.org/project/emoji/

In [None]:
import importlib
importlib.reload(aa)

# Setup
Assumes that you have completed dataCleaningB and dataSplitBalance

In [19]:
# Import the files
x_test = pd.read_csv("dataBalancedSets/x_test.csv", header=0, index_col=0)
y_test = pd.read_csv("dataBalancedSets/y_test_sent.csv", header=0, index_col=0)
tweets_clean  = pd.read_csv("archiveData/cleanB_tweets_clean.csv", header=0, index_col=0) 
emoji_df_full = pd.read_csv("data/emoji_full.csv", header=0, index_col=0) 
print("TEST DATA")
print("x-TEST:", x_test.shape, "y-TEST:", y_test.shape)
emoji_df_full.head()


TEST DATA
x-TEST: (182, 3) y-TEST: (182, 5)


Unnamed: 0,emoji,demoji,VaderEmojiScore,emosentScore
0,🚨,:police_car_light:,0.0,0.673
1,🙏,:folded_hands:,0.0,0.418
2,🤷,:person_shrugging:,0.0,
3,🙄,:face_with_rolling_eyes:,0.0,
4,😂,:face_with_tears_of_joy:,0.4404,0.221


In [20]:
tweets_clean.shape

(1211, 10)

In [21]:
print(list(x_test.columns))
print(list(y_test.columns))

print(list(tweets_clean.columns))

['id_stable', 'Date', 'ContentClean']
['id_stable', 'label_sent', 'y_sent', 'label_stance', 'y_stance']
['id_stable', 'Date', 'Content', 'ContentClean', 'Labels', 'label_sent', 'y_sent', 'label_stance', 'y_stance', 'Flag']


In [22]:
#drop_cols = ['Date', 'Labels', 'label_sent', 'label_stance', 'y_stance', 'n_CapLetters', 'CapsRatio', 'AllCapWords', 'https', 'Mentions', 'Location', 'ReplyCount', 'RetweetCount', 'LikeCount', 'QuoteCount', 'Hashtags', 'Flag']
drop_cols = ['Date', 'Labels', 'label_sent', 'label_stance', 'y_stance', 'Flag']
tweets_clean.drop(drop_cols, inplace=True, axis=1 )
tweets_clean.head()

Unnamed: 0,id_stable,Content,ContentClean,y_sent
0,170314,Per a White House official: Biden and Harris m...,Per a White House official: Biden and Harris m...,1
1,192623,Afghan Refugee kid educated in Iran wins this ...,Afghan Refugee kid educated in Iran wins this ...,2
2,106982,@pfrpeppermint @CawthornforNC Not only did Tru...,Not only did Trump stop processing asylum & re...,0
3,31609,An Afghan refugee demands the US not forget he...,An Afghan refugee demands the US not forget he...,0
4,152666,@RepHerrell One moment you hate refugees and t...,One moment you hate refugees and the next you ...,2


_____________ FUNCTIONS ____________

In [None]:
# create the sentiment intensity dictionary object
# sid = SentimentIntensityAnalyzer()  #NOTE: this NEEDS to stay outside of the functions. I will be modifying it.

# FROM aardvark
# creates the sentiment intensity dictionary: aa.vader_sid(tweet)
# gets the compound score: aa.vader_sent_compound(tweet)
# gets the classification of the compound score using the authors' suggested cutoff points: aa.vader_pred(tweet, pos_cut, neg_cut)
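
A minimal sketch of the cutoff step, for illustration only. It is not the aardvark implementation: `classify_compound` is a hypothetical name, the 0/1/2 encoding (negative/neutral/positive) is inferred from the outputs shown in this notebook, and the default cutoffs of 0.05 and -0.05 are the ones suggested in the VADER documentation.

```python
def classify_compound(compound, pos_cut=0.05, neg_cut=-0.05):
    """Map a VADER compound score in [-1, 1] to a class label.

    Assumed encoding (inferred, not confirmed by aardvark):
    0 = negative, 1 = neutral, 2 = positive.
    """
    if compound >= pos_cut:
        return 2  # at or above the positive cutoff
    if compound <= neg_cut:
        return 0  # at or below the negative cutoff
    return 1      # strictly between the two cutoffs


# Compound scores taken from the first rows of the labeled data.
for score in (0.5859, -0.4184, 0.0):
    print(score, "->", classify_compound(score))
```

Widening the gap between `pos_cut` and `neg_cut` would grow the neutral class, which is the knob to turn if the neutral category turns out to be under-predicted.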


# Emoji Strategies
There are a few ways we could deal with this.
1. Take the emojized version and transform it so 👍 --> Thumbs up!  (or just "!", it's the same score)
2. Translate to keyboard emoji, so 👍 --> :)
3. Add the emoji to the dictionary and give them our own score.
4. Add the emoji to the dictionary, but give them emosent scores (https://pypi.org/project/emosent-py/)

Trying a few out on just the labeled dataset.

### First
Find scores for the emoji data

In [28]:
tweets_clean["VADERsid"] = tweets_clean["ContentClean"].apply(aa.vader_sid)
tweets_clean["VADERcompound"] = tweets_clean["ContentClean"].apply(aa.vader_sent_compound)
tweets_clean["VADERpred"] = tweets_clean["ContentClean"].apply(aa.vader_pred)

# Get the prediction and the ground truth as lists
demoji_pred = list(tweets_clean["VADERpred"])
true = list(tweets_clean["y_sent"])

# Find the micro- and macro-averaged F1 scores
base_microF1 = f1_score(y_true=true, y_pred=demoji_pred, average='micro', zero_division='warn')
base_macroF1 = f1_score(y_true=true, y_pred=demoji_pred, average='macro', zero_division='warn')

print("Micro and Macro-Average")
print('\tVADER F-score, micro average: {:04.3f}'.format(base_microF1))
print('\tVADER F-score, macro average: {:04.3f}'.format(base_macroF1))

Micro and Macro-Average
	VADER F-score, micro average: 0.542
	VADER F-score, macro average: 0.496
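
For reference, a small pure-Python sketch of how the two averages differ. Micro-averaging pools the true positives, false positives and false negatives over all classes (for single-label data this equals plain accuracy), while macro-averaging takes the unweighted mean of the per-class F1 scores, so a small class counts as much as a large one. The function name and the toy labels are made up for illustration; the notebook itself uses `sklearn.metrics.f1_score`.

```python
def f1_micro_macro(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multiclass data."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in labels}
    fp = {c: 0 for c in labels}
    fn = {c: 0 for c in labels}
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts first, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then average without weighting.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Toy labels (not from the dataset): the neutral class (1) is never predicted.
toy_true = [0, 0, 0, 0, 1, 2, 2, 2]
toy_pred = [0, 0, 0, 2, 2, 2, 2, 0]
micro, macro = f1_micro_macro(toy_true, toy_pred)
print("micro: {:.3f}  macro: {:.3f}".format(micro, macro))
```

In the toy data the missed neutral class contributes an F1 of zero to the macro average, which is why macro comes out lower than micro, the same direction as the gap reported above.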


Check whether the exclamation point version makes a difference or an improvement.

In [24]:
text = "I strongly support 👍your relocation from Afgh.👍"
print(text)
print(aa.vader_sent_compound(text))

text = aa.emojiToExcl(text)
print()
print(text)
print(aa.vader_sent_compound(text))

I strongly support 👍your relocation from Afgh.👍
0.5859

I strongly support !your relocation from Afgh.!
0.658
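
A hedged sketch of what an emoji-to-exclamation replacement could look like. This is not the code behind `aa.emojiToExcl`, which is not shown here; `emoji_to_excl` and its parameters are hypothetical, and the list of emoji is assumed to be supplied by the caller (for example, from the `emoji` column of `emoji_df_full`).

```python
import re


def emoji_to_excl(text, emoji_list, replacement="!"):
    """Replace every occurrence of the listed emoji with `replacement`."""
    # Longest first, so multi-codepoint emoji (skin tones, variation
    # selectors) are matched before any shorter emoji they start with.
    ordered = sorted(set(emoji_list), key=len, reverse=True)
    if not ordered:
        return text
    pattern = re.compile("|".join(re.escape(e) for e in ordered))
    return pattern.sub(replacement, text)


sample = "I strongly support 👍your relocation from Afgh.👍"
print(emoji_to_excl(sample, ["👍"]))
```

Because the replacement is driven by an explicit list, the same function could be restricted to emoji that VADER does not already score, instead of replacing all of them.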


In [25]:
tweets_clean['ContentCleanEx'] = tweets_clean['ContentClean'].apply(aa.emojiToExcl)

In [26]:
print(aa.term_check("❤️", tweets_clean, text_col="ContentClean"))
print(aa.term_check("❤️", tweets_clean, text_col="ContentCleanEx"))
print(aa.term_check("!", tweets_clean, text_col="ContentClean"))
print(aa.term_check("!", tweets_clean, text_col="ContentCleanEx"))

('❤️', 3)
('❤️', 0)
('!', 4)
('!', 38)


In [27]:
tweets_clean["VADERsidEx"] = tweets_clean["ContentCleanEx"].apply(aa.vader_sid)
tweets_clean["VADERcompoundEx"] = tweets_clean["ContentCleanEx"].apply(aa.vader_sent_compound)
tweets_clean["VADERpredEx"] = tweets_clean["ContentCleanEx"].apply(aa.vader_pred)

# Get the prediction and the ground truth as lists
demoji_pred_ex = list(tweets_clean["VADERpredEx"])
true = list(tweets_clean["y_sent"])

# Find the micro- and macro-averaged F1 scores
ex_microF1 = f1_score(y_true=true, y_pred=demoji_pred_ex, average='micro', zero_division='warn')
ex_macroF1 = f1_score(y_true=true, y_pred=demoji_pred_ex, average='macro', zero_division='warn')

print("Micro and Macro-Average")
print('\tVADER-excl F-score, micro average: {:04.3f}'.format(ex_microF1))
print('\tVADER-excl F-score, macro average: {:04.3f}'.format(ex_macroF1))

Micro and Macro-Average
	VADER-excl F-score, micro average: 0.540
	VADER-excl F-score, macro average: 0.495


Well, that actually made it a tiny bit worse.

Micro and Macro-Average

VADER-base, untuned
* VADER F-score, micro average: 0.542
* VADER F-score, macro average: 0.496

Emoji to Exclamation
* VADER-excl F-score, micro average: 0.540
* VADER-excl F-score, macro average: 0.495

I'm guessing that is because the way I have done it, the exclamation point just pushes the score a bit further in the direction it was going anyway. Since the neutral category is so small to begin with, this just doesn't do much. 

But at the same time, I did this for ALL emojis, so it lost the validated score on the few emojis that VADER did know.

This is not a good approach. 

In [29]:
tweets_clean.drop(["ContentCleanEx", 'VADERsidEx', "VADERcompoundEx", "VADERpredEx"], axis=1, inplace=True)
tweets_clean.head()

Unnamed: 0,id_stable,Content,ContentClean,y_sent,VADERsid,VADERcompound,VADERpred
0,170314,Per a White House official: Biden and Harris m...,Per a White House official: Biden and Harris m...,1,"{'neg': 0.0, 'neu': 0.888, 'pos': 0.112, 'comp...",0.5859,2
1,192623,Afghan Refugee kid educated in Iran wins this ...,Afghan Refugee kid educated in Iran wins this ...,2,"{'neg': 0.0, 'neu': 0.778, 'pos': 0.222, 'comp...",0.5719,2
2,106982,@pfrpeppermint @CawthornforNC Not only did Tru...,Not only did Trump stop processing asylum & re...,0,"{'neg': 0.064, 'neu': 0.936, 'pos': 0.0, 'comp...",-0.4184,0
3,31609,An Afghan refugee demands the US not forget he...,An Afghan refugee demands the US not forget he...,0,"{'neg': 0.0, 'neu': 0.923, 'pos': 0.077, 'comp...",0.1695,2
4,152666,@RepHerrell One moment you hate refugees and t...,One moment you hate refugees and the next you ...,2,"{'neg': 0.179, 'neu': 0.757, 'pos': 0.064, 'co...",-0.6167,0


### Emosent
Will the emosent package work for me?

In [30]:
emoji_df_full["emosentScore"] = emoji_df_full["emoji"].apply(aa.emosent_score)
emoji_df_full

# CITE: Sentiment of Emojis, Kralj Novak et al.

Unnamed: 0,emoji,demoji,VaderEmojiScore,emosentScore
0,🚨,:police_car_light:,0.0000,0.673
1,🙏,:folded_hands:,0.0000,0.418
2,🤷,:person_shrugging:,0.0000,
3,🙄,:face_with_rolling_eyes:,0.0000,
4,😂,:face_with_tears_of_joy:,0.4404,0.221
...,...,...,...,...
1101,🦾,:mechanical_arm:,0.0000,
1102,🏃🏾‍♂️,:man_running_medium-dark_skin_tone:,0.0000,
1103,🚑,:ambulance:,0.0000,0.091
1104,🎃,:jack-o-lantern:,0.0000,0.617


In [31]:
print(emoji_df_full["emosentScore"].value_counts())

          638
0.0        21
1.0        18
0.333      16
0.5         9
         ... 
0.063       1
0.179       1
0.581       1
-0.314      1
0.617       1
Name: emosentScore, Length: 283, dtype: int64


Kinda. It has about half (638 are missing). But it seems to miss some of the important ones that I need.
* 🤷, 🤮, etc.

And for the symbols where they overlap, the VADER and emosent scores do not necessarily agree and are sometimes very far apart:
* 💔 (broken_heart): 0.2732 v. -0.122
* 😭 (loudly_crying_face): -0.4767 v. -0.093
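
One way such disagreements could be flagged systematically is sketched below. The function name and the `gap` threshold of 0.3 are arbitrary choices for illustration, and the sketch assumes both scores live on a comparable -1 to 1 scale, which has not been verified here. The first three rows use scores quoted in this notebook.

```python
def flag_disagreements(rows, gap=0.3):
    """rows: iterable of (name, vader_score, emosent_score).

    Returns (name, reason) for emoji where both scores exist and either
    point in opposite directions or differ by at least `gap`.
    """
    flagged = []
    for name, vader, emosent in rows:
        if vader is None or emosent is None:
            continue  # nothing to compare
        if vader * emosent < 0:
            flagged.append((name, "opposite sign"))
        elif abs(vader - emosent) >= gap:
            flagged.append((name, "large gap"))
    return flagged


rows = [
    ("broken_heart", 0.2732, -0.122),
    ("loudly_crying_face", -0.4767, -0.093),
    ("face_with_tears_of_joy", 0.4404, 0.221),
    ("person_shrugging", 0.0, None),  # no emosent score available
]
for name, reason in flag_disagreements(rows):
    print(name, "->", reason)
```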

And some of the values are just off for __this__ dataset. For example, the stack of dollars (💵) has an emosent score of 0.423 - very high. Which makes sense normally: money is good. But in this dataset, it shows up when people are stressing the overly high cost of refugee or military operations, or are talking about corruption.

As this tool has been validated, I'll consider the values they have. But I'll still have to assign my own values to the remaining half. So: first VADER; if not, then emosent; if not, then my ranking; and my own ranking for emojis that are used differently than normal in my dataset.
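
The priority order described above could be sketched roughly as follows. Everything here is illustrative: `pick_emoji_score` is a hypothetical helper, missing scores are represented by absent keys or `None`, and the manual and override values are placeholders, not rankings that were actually assigned. The VADER and emosent numbers are the ones shown in the tables in this notebook.

```python
def pick_emoji_score(emoji_char, vader, emosent, manual, overrides):
    """Return (score, source) using the fallback order:
    dataset-specific override, then VADER, then emosent, then manual."""
    if emoji_char in overrides:
        return overrides[emoji_char], "override"
    for source, table in (("vader", vader), ("emosent", emosent), ("manual", manual)):
        score = table.get(emoji_char)
        if score is not None:
            return score, source
    return None, "unscored"


vader_scores = {"😂": 0.4404}
emosent_scores = {"😂": 0.221, "🚨": 0.673, "💵": 0.423}
manual_scores = {"🤷": -0.1}   # placeholder value, not an actual ranking
overrides = {"💵": -0.3}       # placeholder: used differently in this dataset

for ch in ("😂", "🚨", "🤷", "💵", "🎃"):
    print(ch, pick_emoji_score(ch, vader_scores, emosent_scores, manual_scores, overrides))
```

Returning the source alongside the score makes it easy to count afterwards how many emoji were resolved at each step.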

NOTE: I will have to add the emosent and my emojis to the dictionary. 
* For more insight on ranking: http://kt.ijs.si/data/Emoji_sentiment_ranking/

NOTE: There is a LOT more that could be done with emojis in terms of: 
* setting sentiment scores for all emoji that appear in the dataset, not just in my labeled subset.
* identifying news articles and other irrelevant rows in the data.
