Copyright (c) <2022>, <Regina Nockerts>
All rights reserved.

This source code is licensed under the BSD-style license found in the
LICENSE file in the root directory of this source tree. 

Sources
- Sentiment of Emojis, Kralj Novak et al.

In [8]:
import pandas as pd
import numpy as np
import os.path
from nlpUtils import aardvark as aa 
import emoji  # https://pypi.org/project/emoji/
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer


In [23]:
import importlib
importlib.reload(aa)

<module 'nlpUtils.aardvark' from 'c:\\Users\\rnocker\\Desktop\\python\\thesisAgain\\nlpUtils\\aardvark.py'>

# _____________ FUNCTIONS ____________
### NOTE: change the sentiment intensity dictionary in aardvark.py

In [5]:
# FROM aardvark
# create the sentiment intensity dictionary object: sid = SentimentIntensityAnalyzer()  
    # NOTE: this NEEDS to stay outside of the functions. I will be modifying it.
# creates the sentiment intensity dictionary: aa.vader_sid(tweet)
# gets the compound score: aa.vader_sent_compound(tweet)
# gets the classification of the compound score using the authors' suggested cutoff points: aa.vader_pred(tweet, pos_cut, neg_cut)
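
For reference, the cutoff step can be sketched on its own. This is a minimal, hypothetical stand-in for the classification that aa.vader_pred performs, not the code in aardvark.py: the function name classify_compound and the label strings are assumptions, and the default cutoffs are the +/-0.05 thresholds that the VADER authors suggest.

```python
def classify_compound(compound, pos_cut=0.05, neg_cut=-0.05):
    """Map a VADER compound score in [-1, 1] to a sentiment label.

    Hypothetical helper: the label strings are illustrative assumptions.
    """
    if compound >= pos_cut:
        return "positive"
    if compound <= neg_cut:
        return "negative"
    return "neutral"


# Example compound scores: one above, one below, one between the cutoffs
for score in (0.4404, -0.2263, 0.0):
    print(score, classify_compound(score))
```

Passing the cutoffs as arguments keeps them easy to tune later without touching the scoring itself.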


# Setup
Assumes that you are coming from dataCleaningB

In [9]:
# Import the file that results from dataCleaningB
tweets_clean = pd.read_csv(os.path.join('archiveData', "cleanB_tweets_clean.csv"), header=0, index_col=0)
tweets_unlabeled = pd.read_csv(os.path.join('archiveData', "cleanB_tweets_unlabeled.csv"), header=0, index_col=0)

print(tweets_clean.shape)
print(tweets_unlabeled.shape)
print()
print(list(tweets_clean.columns))

(1211, 10)
(200084, 5)

['id_stable', 'Date', 'Content', 'ContentClean', 'Labels', 'label_sent', 'y_sent', 'label_stance', 'y_stance', 'Flag']


# How does the emoji package handle this?

In [10]:
print(emoji.emojize('Python is :thumbs_up:'))
print(emoji.demojize('Python is 👍'))
print(emoji.demojize('Python is👍'))
print(emoji.is_emoji("👍"))

Python is 👍
Python is :thumbs_up:
Python is:thumbs_up:
True


# Emojis

First, what emojis do we have?

In [11]:
# Find all emoji in both the labeled and unlabeled sets
a = aa.emoji_df(tweets_clean)
b = aa.emoji_df(tweets_unlabeled)
emoji_list = a+b
emoji_list = list(dict.fromkeys(emoji_list))  # gets rid of duplicates

# Translate them into text using the emoji package
demoji_list = []
for i in emoji_list: 
    demoji_list.append(emoji.demojize(i))

# Turn the two lists into a dataframe
emoji_df_full = pd.DataFrame(zip(emoji_list, demoji_list), columns=["emoji", "demoji"])
emoji_df_full

Then, what score does VADER give the emoji versus the demoji?

In [5]:
print(aa.vader_sent_compound("😀"))
print(aa.vader_sent_compound("grinning face"))
print(aa.vader_sent_compound("grinning"))
print(aa.vader_sent_compound("face"))

0.3612
0.3612
0.3612
0.0


In [6]:
emoji_df_full["VaderEmojiScore"] = emoji_df_full["emoji"].apply(aa.vader_sent_compound)
emoji_df_full["VaderDEmojiScore"] = emoji_df_full["demoji"].apply(aa.vader_sent_compound)
emoji_df_full

Unnamed: 0,emoji,demoji,VaderEmojiScore,VaderDEmojiScore
0,🚨,:police_car_light:,0.0000,0.0
1,🙏,:folded_hands:,0.0000,0.0
2,🤷,:person_shrugging:,0.0000,0.0
3,🙄,:face_with_rolling_eyes:,0.0000,0.0
4,😂,:face_with_tears_of_joy:,0.4404,0.0
...,...,...,...,...
1101,🦾,:mechanical_arm:,0.0000,0.0
1102,🏃🏾‍♂️,:man_running_medium-dark_skin_tone:,0.0000,0.0
1103,🚑,:ambulance:,0.0000,0.0
1104,🎃,:jack-o-lantern:,0.0000,0.0


In [7]:
for i, score in enumerate(emoji_df_full["VaderDEmojiScore"]):
    if score != 0:
        print(emoji_df_full["emoji"].iloc[i], emoji_df_full["demoji"].iloc[i], emoji_df_full["VaderDEmojiScore"].iloc[i])

🔥 :fire: -0.34
💥 :collision: -0.3612
✨ :sparkles: 0.3182
💤 :zzz: -0.296
🃏 :joker: 0.128
💫 :dizzy: -0.2263
💣 :bomb: -0.4939
👻 :ghost: -0.3182
❇️ :sparkle: 0.4215


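So, that only worked for a very small number of emoji: nine of the 1,105. All nine are single-word codes (:fire:, :bomb:, and so on), so presumably VADER is just scoring the ordinary word between the colons. The demojized codes must not match the text that VADER uses for its own emoji translation. Sigh.

VADER does a basic swap of emoji for text using a dictionary. Let's see how VADER translates them.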

In [8]:
e_df = pd.read_csv("data/VaderEmojiTranslate.txt", encoding="utf-8", header=0, sep="\t")
e_df.head()

Unnamed: 0,emoji,translation
0,😀,grinning face
1,😁,beaming face with smiling eyes
2,😂,face with tears of joy
3,🤣,rolling on the floor laughing
4,😃,grinning face with big eyes


In [9]:
print(aa.vader_sent_compound("😀"))
print(aa.vader_sent_compound("grinning face"))
print(aa.vader_sent_compound("grinning"))
print(aa.vader_sent_compound("face"))
print()
print(aa.vader_sent_compound("😂"))
print(aa.vader_sent_compound("face with tears of joy"))
print(aa.vader_sent_compound("face"))
print(aa.vader_sent_compound("with"))
print(aa.vader_sent_compound("tears"))
print(aa.vader_sent_compound("of"))
print(aa.vader_sent_compound("joy"))
print(aa.vader_sent_compound("face") + aa.vader_sent_compound("with") + aa.vader_sent_compound("tears") + aa.vader_sent_compound("of") + aa.vader_sent_compound("joy"))

0.3612
0.3612
0.3612
0.0

0.4404
0.4404
0.0
0.0
-0.2263
0.0
0.5859
0.3596


Unfortunately, VADER is not recognizing the emoji translations as special set phrases. Instead, it does a straightforward swap of emoji for text and then treats that text the same as any other text: the score for 😂 (0.4404) is exactly the score for the phrase "face with tears of joy", and the only words in that phrase that carry any score are "tears" and "joy". So you can't update the emoji phrase as a whole in the lexicon; you have to update words within it. And if you do that, those words' sentiment scores will change for ALL text, not just the emoji swap-text. So that's not ideal.
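
The gap between the phrase score (0.4404) and the sum of the word scores (0.3596) comes from how VADER builds the compound score: it sums the raw word valences first and normalizes once, as x / sqrt(x^2 + alpha) with alpha = 15, then rounds to four decimals. A minimal sketch of that step follows. The raw valences for "joy" (2.8) and "tears" (-0.9) are assumptions back-solved from the compound scores printed above, not values read from the lexicon file, and the sketch ignores VADER's other rules (negation, intensifiers, punctuation).

```python
import math


def compound(raw_sum, alpha=15):
    """Squash a summed valence into (-1, 1) and round to 4 decimals,
    mirroring the normalization VADER applies to its compound score."""
    return round(raw_sum / math.sqrt(raw_sum * raw_sum + alpha), 4)


# Assumed raw valences, back-solved from the compound scores printed above
joy, tears = 2.8, -0.9

print(compound(joy))                              # "joy" alone
print(compound(tears))                            # "tears" alone
print(compound(joy + tears))                      # phrase: sum first, then normalize
print(round(compound(joy) + compound(tears), 4))  # sum of the per-word compounds
```

Under these assumptions the four lines reproduce 0.5859, -0.2263, 0.4404 and 0.3596 from the cell above, which is consistent with the emoji being scored as plain text.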

So, we can either just adopt this approach, which seems OK: it should hurt VADER more than BERT (BERT can probably learn to recognize them as set phrases).

Or we can build a set of special set terms (e.g., 😂 --> emoji-tears-joy).

First, we need to see if our emojis really aren't in the VADER dictionary, which seems weird.

In [10]:
# The loop variable is named `emo` so that it does not shadow the imported emoji module
for i, emo in enumerate(e_df["emoji"]):
    if emo == "🚨":  # demoji: :police_car_light:
        print(e_df["emoji"].iloc[i], ":", e_df["translation"].iloc[i])
    if emo == "🙄":  # demoji: :face_with_rolling_eyes:
        print(e_df["emoji"].iloc[i], ":", e_df["translation"].iloc[i])

🙄 : face with rolling eyes
🚨 : police car light


Ok, so they are in there. It's just that the words as words don't have a lot of emotional valence.

And the text is just the same as the demoji version, only without the colons and underscores, which would make option two (building set terms) easier.
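
Since the two differ only in punctuation, one can be derived from the other. A small sketch, assuming every demoji code has the :name_with_underscores: shape seen above (the helper name demoji_to_phrase is made up for illustration):

```python
def demoji_to_phrase(code):
    """Turn an emoji-package code such as ':police_car_light:' into
    the space-separated phrase seen in VADER's translation table."""
    return code.strip(":").replace("_", " ")


for code in (":police_car_light:", ":face_with_rolling_eyes:"):
    print(code, "->", demoji_to_phrase(code))
```

This only reproduces the pattern for the emoji checked above; whether every entry in VaderEmojiTranslate.txt matches its demojized code this way would still need to be verified.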

In [11]:
print(aa.vader_sent_compound("🙄"))
print(aa.vader_sent_compound("face with rolling eyes"))
print(aa.vader_sent_compound("face"))
print(aa.vader_sent_compound("with"))
print(aa.vader_sent_compound("rolling"))
print(aa.vader_sent_compound("eyes"))


0.0
0.0
0.0
0.0
0.0
0.0


Just swapping out the emoji for text is clearly not working, because the emoji has sentiment that is lost when its words are scored individually.

So, let's build a dataframe that we can use to swap the emoji for text codes. We can then assign sentiment to those codes in the VADER lexicon and let BERT learn them.

We already said that the VADER approach isn't ideal. So let's look at a dictionary that was specifically built to give sentiment scores to emoji: emosent.

In [12]:
emoji_df_full.drop("VaderDEmojiScore", axis=1, inplace=True)

In [13]:
emoji_df_full["emosentScore"] = emoji_df_full["emoji"].apply(aa.emosent_score)
emoji_df_full.head()

Unnamed: 0,emoji,demoji,VaderEmojiScore,emosentScore
0,🚨,:police_car_light:,0.0,0.673
1,🙏,:folded_hands:,0.0,0.418
2,🤷,:person_shrugging:,0.0,
3,🙄,:face_with_rolling_eyes:,0.0,
4,😂,:face_with_tears_of_joy:,0.4404,0.221


In [14]:
emoji_df_full["emosentScore"].value_counts()

          638
0.0        21
1.0        18
0.333      16
0.5         9
         ... 
0.063       1
0.179       1
0.581       1
-0.314      1
0.617       1
Name: emosentScore, Length: 283, dtype: int64

So emosent misses 638 of the 1,105 emoji and scores another 21 as 0. That's at least a lot better.

In [15]:
aa.term_check("❤️", tweets_clean)

('❤️', 3)

In [16]:
aa.emosent_score("❤️")

''

It's still missing some pretty basic ones, though. Fixing this is a job for later, I think. 
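
One possible reason for the miss, offered as an untested guess rather than a finding: the red heart is often encoded as two code points, U+2764 (heavy black heart) followed by the variation selector U+FE0F, and a lookup table keyed on the bare U+2764 would not match that pair. The sketch below shows how to inspect the code points and strip the selector before a lookup. strip_variation_selector is a hypothetical helper, and whether emosent then returns a score has not been checked here.

```python
VARIATION_SELECTOR_16 = "\ufe0f"  # requests the emoji presentation of the preceding character


def code_points(text):
    """List the Unicode code points in a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


def strip_variation_selector(text):
    """Remove U+FE0F so that only the base character is left."""
    return text.replace(VARIATION_SELECTOR_16, "")


heart = "\u2764\ufe0f"  # red heart written with the emoji variation selector
print(code_points(heart))
print(code_points(strip_variation_selector(heart)))
```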

For now, I think we can safely substitute the demoji text for the emoji in the tweets. Later we can refine this dictionary and use it to update the VADER dictionary.

In [17]:
emoji_df_full.to_csv(os.path.join('data', "emoji_full.csv"))

In [21]:
# Demojize the dataframes
tweets_clean['ContentClean'] = tweets_clean['ContentClean'].apply(emoji.demojize)
tweets_unlabeled['ContentClean'] = tweets_unlabeled['ContentClean'].apply(emoji.demojize)


In [22]:
# Check if there are any emoji left
print(aa.term_check("❤️", tweets_clean))
emoji_tweets_temp = []
for i, tweet in enumerate(tweets_clean["ContentClean"]):
    for e in emoji_list:
        if e in tweet:
            emoji_tweets_temp.append(i)
print(len(emoji_tweets_temp))

print(aa.term_check("❤️", tweets_unlabeled))
emoji_tweets_temp = []
for i, tweet in enumerate(tweets_unlabeled["ContentClean"]):
    for e in emoji_list:
        if e in tweet:
            emoji_tweets_temp.append(i)
print(len(emoji_tweets_temp))

('❤️', 0)
0
('❤️', 0)
0


So, the different models will need to handle these text codes differently.

For VADER, I will have to create a dictionary of these codes as "words" that can be added to the lexicon.
* Keep the scores from the emosent library as the priority
* Use the VADER score as a backup
* Manually check the results to make sure they are reasonable and identify ones to customize.
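
The priority rule in the list above can be sketched as follows. This is an illustration under assumptions, not the final lexicon code: pick_score and build_code_scores are hypothetical names, a missing emosent score is assumed to show up as an empty string (as in the aa.emosent_score output for the heart) or as NaN after a round trip through CSV, and the scores are left on their original scales. VADER's lexicon stores raw word valences rather than compound scores, so a rescaling step would still be needed before adding these to it.

```python
import math


def is_missing(value):
    """True for None, the empty string, or NaN."""
    if value is None or value == "":
        return True
    return isinstance(value, float) and math.isnan(value)


def pick_score(emosent_score, vader_score):
    """Prefer the emosent score; fall back to the VADER score."""
    if not is_missing(emosent_score):
        return float(emosent_score)
    if not is_missing(vader_score):
        return float(vader_score)
    return None


def build_code_scores(rows):
    """rows: iterable of (demoji_code, emosent_score, vader_score) tuples."""
    scores = {}
    for code, emosent_score, vader_score in rows:
        score = pick_score(emosent_score, vader_score)
        if score is not None:
            scores[code] = score
    return scores


# Three rows copied from the emoji_df_full output earlier in the notebook
rows = [
    (":police_car_light:", 0.673, 0.0),
    (":person_shrugging:", "", 0.0),
    (":face_with_tears_of_joy:", 0.221, 0.4404),
]
for code, score in build_code_scores(rows).items():
    print(code, score)
```

Codes that end up with the 0.0 fallback, like :person_shrugging: here, are the natural candidates for the manual check.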

For BERT, make sure that the codes do not get tokenized as [UNK]. After that, I think the model can take over.
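
A check along these lines could flag problem codes. It is a sketch that takes the tokenizer as a plain callable, so nothing here depends on a particular BERT library. find_unknown_codes and the toy tokenizer are hypothetical, and with a Hugging Face tokenizer the callable would presumably be its tokenize method.

```python
def find_unknown_codes(codes, tokenize, unk_token="[UNK]"):
    """Return the codes whose tokenization contains the unknown token."""
    return [code for code in codes if unk_token in tokenize(code)]


def toy_tokenize(text):
    """Stand-in tokenizer for the demo: splits on underscores and maps
    any piece outside a tiny vocabulary to [UNK]."""
    vocab = {"face", "with", "tears", "of", "joy", "police", "car", "light"}
    pieces = text.strip(":").split("_")
    return [piece if piece in vocab else "[UNK]" for piece in pieces]


codes = [":face_with_tears_of_joy:", ":police_car_light:", ":jack-o-lantern:"]
print(find_unknown_codes(codes, toy_tokenize))
```

The toy tokenizer is not how BERT splits text; it only exercises the check. The real result depends on the model's vocabulary and should be run against the actual tokenizer.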

## _____________ ##
# Save to go back to dataCleaningB

In [24]:
emoji_df_full.to_csv(os.path.join('data', "emoji_full.csv"))
tweets_clean.to_csv(os.path.join('archiveData', "demoji_tweets_clean.csv"))
tweets_unlabeled.to_csv(os.path.join('archiveData', "demoji_tweets_unlabeled.csv"))