Copyright (c) <2022>, <Regina Nockerts>
All rights reserved.

This source code is licensed under the BSD-style license found in the
LICENSE file in the root directory of this source tree. 

__NOTE__ to the user: On first use, this notebook cannot be run top to bottom. It assumes that you already have several csv files that are created at different points in the notebook.

In [1]:
import pandas as pd
import numpy as np
import os.path
from nlpUtils import aardvark as aa 

from sklearn.metrics import f1_score # auc if I get embeddings


#import emoji  # https://pypi.org/project/emoji/

import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')
import seaborn as sns
# sns.set(font_scale=1.5)

from wordcloud import WordCloud, STOPWORDS
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import nltk
from nltk.corpus import stopwords as nltkStop
from nltk import ngrams
#from nltk.tokenize import sent_tokenize  # Creates a list of sentences
#from nltk.tokenize import TweetTokenizer  # Preferred: tokenizes a text, with extra controls


In [None]:
import importlib
importlib.reload(aa)

# Setup
Assumes that you have completed dataCleaningB and dataSplitBalance

In [30]:
# Import the files that result from dataSplitBalance

unbal_x_train = pd.read_csv("dataBalancedSets/unbal_x_train.csv", header=0, index_col=0)
unbal_x_val = pd.read_csv("dataBalancedSets/unbal_x_val.csv", header=0, index_col=0)
unbal_y_train = pd.read_csv("dataBalancedSets/unbal_y_train.csv", header=0, index_col=0)
unbal_y_val = pd.read_csv("dataBalancedSets/unbal_y_val.csv", header=0, index_col=0)

# under_x_train = pd.read_csv("dataBalancedSets/under_x_train.csv", header=0, index_col=0)
# under_x_val = pd.read_csv("dataBalancedSets/under_x_val.csv", header=0, index_col=0)
# under_y_train = pd.read_csv("dataBalancedSets/under_y_train.csv", header=0, index_col=0)
# under_y_val = pd.read_csv("dataBalancedSets/under_y_val.csv", header=0, index_col=0)

# underOver_y_train = pd.read_csv("dataBalancedSets/underOver_y_train.csv", header=0, index_col=0)
# underOver_x_train = pd.read_csv("dataBalancedSets/underOver_x_train.csv", header=0, index_col=0)
# underOver_y_val = pd.read_csv("dataBalancedSets/underOver_y_val.csv", header=0, index_col=0)
# underOver_x_val = pd.read_csv("dataBalancedSets/underOver_x_val.csv", header=0, index_col=0)

# And the test dataset
x_test = pd.read_csv("dataBalancedSets/x_test.csv", header=0, index_col=0)
y_test = pd.read_csv("dataBalancedSets/y_test_sent.csv", header=0, index_col=0)

# And some odds and ends
tweets_clean  = pd.read_csv("archiveData/cleanB_tweets_clean.csv", header=0, index_col=0) 
emoji_df_full = pd.read_csv("data/emoji_full.csv", header=0, index_col=0)
all_unlabeled_tweets = pd.read_csv("data/all_unlabeled_tweets.csv", header=0, index_col=0)

print("Unbalanced:")
print("x-train:", unbal_x_train.shape, "x-val:", unbal_x_val.shape, "y-train:", unbal_y_train.shape, "y-val:", unbal_y_val.shape)
# print("Undersampled")
# print("x-train:", under_x_train.shape, "x-val:", under_x_val.shape, "y-train:", under_y_train.shape, "y-val:", under_y_val.shape)
# print("Under-Oversampled")
# print("x-train:", underOver_x_train.shape, "x-val:", underOver_x_val.shape, "y-train:", underOver_y_train.shape, "y-val:", underOver_y_val.shape)
print("TEST DATA")
print("x-TEST:", x_test.shape, "y-TEST:", y_test.shape)
emoji_df_full.head()


Unbalanced:
x-train: (823, 3) x-val: (206, 3) y-train: (823, 5) y-val: (206, 5)
TEST DATA
x-TEST: (182, 3) y-TEST: (182, 5)


Unnamed: 0,emoji,demoji,VaderEmojiScore,emosentScore
0,🚨,:police_car_light:,0.0,0.673
1,🙏,:folded_hands:,0.0,0.418
2,🤷,:person_shrugging:,0.0,
3,🙄,:face_with_rolling_eyes:,0.0,
4,😂,:face_with_tears_of_joy:,0.4404,0.221


In [31]:
drop_cols = ['Date', 'Labels', 'label_sent', 'label_stance', 'y_stance', 'Flag']
tweets_clean.drop(drop_cols, inplace=True, axis=1 )
print(tweets_clean.shape)
tweets_clean.head()


(1211, 4)


Unnamed: 0,id_stable,Content,ContentClean,y_sent
0,170314,Per a White House official: Biden and Harris m...,Per a White House official: Biden and Harris m...,1
1,192623,Afghan Refugee kid educated in Iran wins this ...,Afghan Refugee kid educated in Iran wins this ...,2
2,106982,@pfrpeppermint @CawthornforNC Not only did Tru...,Not only did Trump stop processing asylum & re...,0
3,31609,An Afghan refugee demands the US not forget he...,An Afghan refugee demands the US not forget he...,0
4,152666,@RepHerrell One moment you hate refugees and t...,One moment you hate refugees and the next you ...,2


# NOTES
* tweets_clean is the full, unsplit set - NOT for model development, only for finding emojis.
* the unbalanced and testing sets can be used for VADER model development

_____________ FUNCTIONS ____________

In [None]:
# create the sentiment intensity dictionary object
# sid = SentimentIntensityAnalyzer()  #NOTE: this NEEDS to stay outside of the functions. I will be modifying it.

# FROM aardvark
# creates the sentiment intensity dictionary: aa.vader_sid(tweet)
# gets the compound score: aa.vader_sent_compound(tweet)
# gets the classification of the compound score using the authors' suggested cutoff points: aa.vader_pred(tweet, pos_cut, neg_cut)
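The cutoff step can be sketched as follows. This is a minimal illustration, not the code in aardvark: `classify_compound` is a hypothetical name, the default cutoffs are the ±0.05 values suggested by the VADER authors, and the return values assume the y_sent coding used later in this notebook (0 = neutral, 1 = negative, 2 = positive).

```python
def classify_compound(compound, pos_cut=0.05, neg_cut=-0.05):
    """Map a VADER compound score to an assumed y_sent coding.

    Assumed coding: 0 = neutral, 1 = negative, 2 = positive.
    """
    if compound >= pos_cut:
        return 2  # positive
    if compound <= neg_cut:
        return 1  # negative
    return 0  # neutral: strictly between the two cutoffs


# Try a few compound scores, one per class
for score in (-0.296, 0.0, 0.4404):
    print(score, "->", classify_compound(score))
```

Keeping the cutoffs as parameters makes it possible to tune them on the training set later without touching the function.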


# Data Prep
VADER should do better if we get the input into better shape.

### What if we use the Content v. ContentClean column that we used for labeling?
Remember that VADER has its own way of dealing with punctuation, capitalization, modifiers, negations, stopwords, tokenization and lemmatization. Earlier cleaning was done so as not to interfere with that. I tested to make sure that was done correctly: the scores are the same for either column. This code has been moved to the graveyard.

(A nice tutorial explaining this: https://towardsdatascience.com/are-you-scared-vader-understanding-how-nlp-pre-processing-impacts-vader-scoring-4f4edadbc91d)


### What about the demoji?
For VADER, I will have to create a dictionary of these codes as "words" that can be added to the lexicon. We started this by finding all the emoji and saving them to a dataframe: emoji_df_full
* keep the scores from the emosent library as the priority
* Use the VADER score as a backup
* Manually check the results to make sure they are reasonable and identify ones to customize.

# Emoji
ref: vaderEmoji.ipynb

In [32]:
emoji_df_full

Unnamed: 0,emoji,demoji,VaderEmojiScore,emosentScore
0,🚨,:police_car_light:,0.0000,0.673
1,🙏,:folded_hands:,0.0000,0.418
2,🤷,:person_shrugging:,0.0000,
3,🙄,:face_with_rolling_eyes:,0.0000,
4,😂,:face_with_tears_of_joy:,0.4404,0.221
...,...,...,...,...
1101,🦾,:mechanical_arm:,0.0000,
1102,🏃🏾‍♂️,:man_running_medium-dark_skin_tone:,0.0000,
1103,🚑,:ambulance:,0.0000,0.091
1104,🎃,:jack-o-lantern:,0.0000,0.617


### Emosent
Will the emosent package work for me?

In [33]:
print(emoji_df_full["emosentScore"].value_counts())
print(emoji_df_full["emosentScore"].isnull().value_counts())

 0.000    21
 1.000    18
 0.333    16
 0.500     9
 0.400     7
          ..
 0.063     1
 0.179     1
 0.581     1
-0.314     1
 0.617     1
Name: emosentScore, Length: 282, dtype: int64
True     638
False    468
Name: emosentScore, dtype: int64


Kinda. It has scores for a bit under half of the emoji (468 of 1,106; 638 are missing). But it seems to miss some of the important ones that I need.
* 🤷, 🤮, etc.

And for the symbols where they overlap, the VADER and emosent scores do not necessarily agree and are sometimes very far off:
* 💔 (broken_heart): 0.2732 v. -0.122
* 😭 (loudly_crying_face): -0.4767 v. -0.093

And some of the values are just off for __this__ dataset. For example, the stack of dollars (💵) has an emosent score of 0.423 - very high. Which makes sense normally: money is good. But in this dataset, it shows up when people are stressing the overly high cost of refugee or military operations, or are talking about corruption.

As this tool has been validated, I'll consider the values they have. But I'll still have to assign my own values to the remaining half. So: first emosent; if not, then VADER; if not, then my ranking; and my own ranking for emojis that are used differently than normal in my dataset.

NOTE: I will have to add the emosent and my emojis to the dictionary. 
* For more insight on ranking: http://kt.ijs.si/data/Emoji_sentiment_ranking/
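The priority order described above can be sketched as a small function. This is an illustration only: `pick_emoji_score` and `is_missing` are hypothetical names, and the manual override value used for 💵 is made up for the example. Unlike the loop in the next cells, this sketch keeps an emosent score of exactly 0 instead of leaving it empty.

```python
import math


def is_missing(value):
    """True for None or NaN, the two ways a score can be absent."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def pick_emoji_score(emosent, vader, manual=None):
    """Return the first available score in priority order.

    A manual score wins when one is given (dataset-specific usage or no
    other score available); otherwise emosent, then VADER.
    """
    for candidate in (manual, emosent, vader):
        if not is_missing(candidate):
            return candidate
    return None  # nothing available: still needs a manual ranking


print(pick_emoji_score(0.673, 0.0))               # emosent available
print(pick_emoji_score(float("nan"), 0.0))        # falls back to VADER
print(pick_emoji_score(0.423, 0.0, manual=-0.2))  # hypothetical override
```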

In [89]:
emoji_df_full['emojiScore'] = np.NaN
emoji_df_full.head()

Unnamed: 0,emoji,demoji,VaderEmojiScore,emosentScore,emojiScore
0,🚨,:police_car_light:,0.0,0.673,
1,🙏,:folded_hands:,0.0,0.418,
2,🤷,:person_shrugging:,0.0,,
3,🙄,:face_with_rolling_eyes:,0.0,,
4,😂,:face_with_tears_of_joy:,0.4404,0.221,


In [90]:
# Fill emojiScore: prefer a non-zero emosent score; fall back to the VADER score
# only when emosent has no score. Emoji with an emosent score of exactly 0 stay NaN.
for i, v, e in zip(emoji_df_full.index, emoji_df_full["VaderEmojiScore"], emoji_df_full["emosentScore"]):
    if pd.isnull(e):
        if not pd.isnull(v):
            emoji_df_full.loc[i, "emojiScore"] = v  # .loc avoids chained assignment
    elif e != 0:
        emoji_df_full.loc[i, "emojiScore"] = e
print("NANs after filling: \n", emoji_df_full["emojiScore"].isnull().value_counts())
emoji_df_full.head()


NANs after filling: 
 False    1085
True       21
Name: emojiScore, dtype: int64


Unnamed: 0,emoji,demoji,VaderEmojiScore,emosentScore,emojiScore
0,🚨,:police_car_light:,0.0,0.673,0.673
1,🙏,:folded_hands:,0.0,0.418,0.418
2,🤷,:person_shrugging:,0.0,,0.0
3,🙄,:face_with_rolling_eyes:,0.0,,0.0
4,😂,:face_with_tears_of_joy:,0.4404,0.221,0.221


In [99]:
# WARNING: DO NOT RUN THIS: now that the dictionary has been modified  (there is a copy in archiveData)
# emoji_df_full.to_csv("data/emoji_full.csv")

### --> open the csv and edit
Only look at the entire database, not the labeled tweets, when deciding what to do with these.

In [279]:
# How often does the emoji appear?
print(aa.term_check("🔵", all_unlabeled_tweets))

('🔵', 10)
('🔵', 0)


In [None]:
# How is the emoji generally used?
for i in all_unlabeled_tweets["ContentClean"]:
    if "🙋" in i:
        print(i)

In [291]:
# What is the score of clearly analogous emoji or text?
term = "stop"
print("emonsent:", aa.emosent_score(term))  # works for emoji
print("VADER", aa.vader_sent_compound(term))  # works for text

emonsent: 
VADER -0.296


I changed values for emojis where:
* the emoji has a clear analog among other emojis - eg. different skin tones.
  * When this was done, I made a note of the analog in a new column, "analog"
* the most direct text translation is an emotion (eg. heart, thumbs-up) or action (eg. facepalming, dancing), and not a noun
* the emoji is a generally known sign or symbol, eg. biohazard sign, peace symbol

I changed the following categories to 0.0:
* means of communication (eg. microphone, television, telephone) - tend to be associated with news media or CTAs
* simple geometric forms, other than hearts - tend to be used as special bullet points
* government bodies - will have different meaning when talking about legalistic situations
* pointers and arrows - used to indicate a reference or emphasize
  
I did not attempt to find substitutes for all emojis. I skipped:
* occupations
* objects

### NOTE
Some interesting emojis to look at with the training data:  
* 🦍	:gorilla:
* ❄️	:snowflake:
* 🛃	:customs:
* 🛂	:passport_control:
* 🏳️‍⚧️	:transgender_flag:
* 🦠	:microbe:
* ⚖	:balance_scale:
* 🗳️	:ballot_box_with_ballot:
* ⌛	:hourglass_done:
* 👪	:family:	0.0	-0.018
  
### --> reload the new, modified emoji_df_full

In [297]:
emoji_df_full = pd.read_csv("data/emoji_full_mod1.csv", header=0, index_col=0)

# Update the VADER dictionary
Now that we have the new wordcodes and associated values, we need to put them in the VADER dictionary.

# Word Clouds
__NOTE: There is a lot of preprocessing below that is ONLY used to make the wordclouds. VADER has its own way of dealing with things like punctuation, capitalization, stopwords, tokenization, lemmatization. And we don't necessarily want to mess with / override that without validating.__

In order to refine the dictionaries, I need to know what words and bigrams/trigrams are common in which datasets. So let's make word clouds.

First I need to create a single text per pos, neg, and neutral. Then I can feed them into wordcloud. I think I'll use the full dataset to generate one, too.

I don't have a preexisting df for this, so I'm going to have to rebuild it. Sigh.

First just all the ContentClean

In [None]:
# SETUP

# changes to tokenizer and lemmatizer --> UPDATE in aardvark, too.
w_tokenizer = nltk.tokenize.TweetTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

#stopwords = list(STOPWORDS) + ["Afghan", "Afghans", "Afghanistan", "refugee", "refugees", "U", "S", "US", "people", "will", "now"]
stopwords = nltkStop.words('english') + ["Afghan", "Afghans", "Afghanistan", "refugee", "refugees", "U", "S", "US", "people", "will", "now", "UK", "wa", "say", "like", "via"]

In [None]:
# start from the two sets - unlabeled and labeled - that were created in dataCleaningB
all_l = pd.read_csv("all_labeled_tweets.csv", header=0, index_col=0)
all_u = pd.read_csv("all_unlabeled_tweets.csv", header=0, index_col=0)
print(list(all_l.columns))
print(list(all_u.columns))

In [None]:
l_list = ['Content','Labels', 'label_sent', 'y_sent', 'label_stance', 'y_stance', 'Flag', 'n_CapLetters', 'CapsRatio', 'AllCapWords', 'https', 'Mentions', 'Location', 'ReplyCount', 'RetweetCount', 'LikeCount', 'QuoteCount', 'Hashtags']
all_l.drop(l_list, axis=1, inplace=True)
print(list(all_l.columns))

u_list = ['Content', 'Flag']
all_u.drop(u_list, axis=1, inplace=True)
print(list(all_u.columns))

In [None]:
#NOTE that I'm ordering here by date. Not sure why...

frames = [all_l, all_u]
master_tweets = pd.concat(frames)
master_tweets.sort_values(by=['Date'], inplace=True)
master_tweets.reset_index(drop=True, inplace=True)
print(master_tweets.shape)
master_tweets.tail()

### PAUSE / UNPAUSE

In [None]:
# Pause
master_tweets.to_csv(os.path.join('dataVader', "master_tweets.csv"))

In [None]:
# Unpause
master_tweets = pd.read_csv(os.path.join('dataVader', "master_tweets.csv"), header=0, index_col=0)

### Full Data Wordcloud
from the full dataset, stopwords removed

In [None]:
# Thanks to: https://jackmckew.dev/sentiment-analysis-text-cleaning-in-python-with-vader.html
train_words = ' '.join(master_tweets["ContentClean"])
a = aa.lemmatize_text(train_words)
lem_words = ' '.join(a)

In [None]:
# CITE: https://amueller.github.io/word_cloud/generated/wordcloud.WordCloud.html
wc = WordCloud(stopwords=stopwords, width=900, height=500, max_words=30).generate(lem_words)
all_wc = wc.words_  # dictionary of the words used to make the cloud
plt.figure(figsize=(10,8))
plt.imshow(wc)
plt.axis("off")
plt.show()

# Word cloud by label
Now the labeled data, divided up by label.

In [None]:
all_xy = pd.read_csv(os.path.join('data', "all_labeled_tweets.csv"), header=0, index_col=0)
print(list(all_xy.columns))

In [None]:
d = ['Content', 'Labels', 'label_sent', 'label_stance', 'y_stance', 'n_CapLetters', 'CapsRatio', 'AllCapWords', 'https', 'Mentions', 'Location', 'ReplyCount', 'RetweetCount', 'LikeCount', 'QuoteCount', 'Hashtags', 'Flag']
master_labeled = all_xy.drop(d, axis=1)
master_labeled.sort_values(by=['Date'], inplace=True)
master_labeled.reset_index(drop=True, inplace=True)
print(master_labeled.shape)
master_labeled.tail()

In [None]:
labeled_zero = master_labeled[master_labeled["y_sent"]==0].copy()  # neu
labeled_one = master_labeled[master_labeled["y_sent"]==1].copy()  # neg
labeled_two = master_labeled[master_labeled["y_sent"]==2].copy()  # pos
print(labeled_zero.shape)
print(labeled_one.shape)
print(labeled_two.shape)

### PAUSE / UNPAUSE

In [None]:
# Pause
master_labeled.to_csv(os.path.join('dataVader', "master_labelel.csv"))
labeled_zero.to_csv(os.path.join('dataVader', "labeled_zero.csv"))
labeled_one.to_csv(os.path.join('dataVader', "labeled_one.csv"))
labeled_two.to_csv(os.path.join('dataVader', "labeled_two.csv"))

### Preprocess for wordcloud: Tokenize and Lemmatize 

In [None]:
zero_words = ' '.join(labeled_zero["ContentClean"])
zero_lem = aa.lemmatize_text(zero_words)
zero_lem = ' '.join(zero_lem)

one_words = ' '.join(labeled_one["ContentClean"])
one_lem = aa.lemmatize_text(one_words)
one_lem = ' '.join(one_lem)

two_words = ' '.join(labeled_two["ContentClean"])
two_lem = aa.lemmatize_text(two_words)
two_lem = ' '.join(two_lem)

In [None]:
# Neutral
wc = WordCloud(stopwords=stopwords, width=900, height=500, max_words=30).generate(zero_lem)
neu_wc = wc.words_  # dictionary of the words used to make the cloud
plt.figure(figsize=(10,8))
plt.imshow(wc)
plt.axis("off")
plt.show()

In [None]:
# Negative
wc = WordCloud(stopwords=stopwords, width=900, height=500, max_words=30).generate(one_lem)
neg_wc = wc.words_  # dictionary of the words used to make the cloud
plt.figure(figsize=(10,8))
plt.imshow(wc)
plt.axis("off")
plt.show()

In [None]:
# Positive
wc = WordCloud(stopwords=stopwords, width=900, height=500, max_words=30).generate(two_lem)
pos_wc = wc.words_  # dictionary of the words used to make the cloud
plt.figure(figsize=(10,8))
plt.imshow(wc)
plt.axis("off")
plt.show()

Significant in all three:
* asylum (this is surprising)
* "ha" is in both pos and neg. Does Vader catch this as sarcastic in the neg? This should be an intensifier, not a sentiment.

Add negative score to:
* Taliban
* unvetted
* left behind
* crisis

Add positive score to:
* community
* help/ing (and why are these not fixed by the lemmatizer?)
* family (this is in neutral too, but I think it might help w positive)
* resettlement (this is in neg, but really small... and in neu, maybe slightly positive)
* Thank
* welcome
* support
* opportunity

### Bigrams and Trigrams

In [None]:
# Neutral - up to trigrams allowed
wc = WordCloud(stopwords=stopwords, width=900, height=500, collocation_threshold = 3, max_words=30).generate(zero_lem)
neu_bi_wc = wc.words_  # dictionary of the words used to make the cloud
plt.figure(figsize=(10,8))
plt.imshow(wc)
plt.axis("off")
plt.show()


In [None]:
punct = '%&\\()*,./:;<=>[\\]^_`{|}~!?#0123456789'  # characters to strip from token edges

# starting from zero_lem, which was tokenized and lemmatized above, but stopwords were 
# not removed (used wordcloud's stopword remover). Clean it up for ngrams
ngram_text = zero_lem.replace("_", " ")
ngram_text = [word.strip(punct) for word in ngram_text.split()]
ngram_text = [word for word in ngram_text if word not in stopwords]  # remove stopwords
ngram_text = [s for s in ngram_text if len(s) != 0]  # remove empties
ngram_text_neu = ' '.join(ngram_text)

# find frequency of the bi/trigrams
bi_neu, tri_neu = aa.bi_tri_freq(ngram_text_neu)
print("Neutral nGrams")
print(bi_neu[:25])
print()
print(tri_neu[:25])
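`aa.bi_tri_freq` is not shown in this notebook. A minimal sketch of the same idea, assuming it returns lists of (n-gram, count) pairs sorted by frequency; `ngram_freq` and the sample text are hypothetical and use only the standard library:

```python
from collections import Counter


def ngram_freq(text, n):
    """Count the n-grams of a whitespace-tokenized text, most frequent first."""
    tokens = text.split()
    # zip over n shifted copies of the token list to form consecutive n-grams
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams).most_common()


sample = "left behind ally left behind family left behind ally"
bigrams = ngram_freq(sample, 2)
trigrams = ngram_freq(sample, 3)
print(bigrams[:3])
print(trigrams[:3])
```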


In [None]:
# Negative, trigrams allowed
wc = WordCloud(stopwords=stopwords, width=900, height=500, collocation_threshold = 3, max_words=30).generate(one_lem)
neg_bi_wc = wc.words_  # dictionary of the words used to make the cloud
plt.figure(figsize=(10,8))
plt.imshow(wc)
plt.axis("off")
plt.show()

In [None]:
punct = '%&\\()*,./:;<=>[\\]^_`{|}~!?#0123456789'  # characters to strip from token edges

# starting from one_lem, which was tokenized and lemmatized above, but stopwords were 
# not removed (used wordcloud's stopword remover). Clean it up for ngrams
ngram_text = one_lem.replace("_", " ")
ngram_text = [word.strip(punct) for word in ngram_text.split()]
ngram_text = [word for word in ngram_text if word not in stopwords]  # remove stopwords
ngram_text = [s for s in ngram_text if len(s) != 0]  # remove empties
ngram_text_neg = ' '.join(ngram_text)

# find frequency of the bi/trigrams
bi_neg, tri_neg = aa.bi_tri_freq(ngram_text_neg)
print("Negative nGrams")
print(bi_neg[:25])
print()
print(tri_neg[:25])


In [None]:
# Positive, trigrams allowed
wc = WordCloud(stopwords=stopwords, width=900, height=500, collocation_threshold = 3, max_words=30).generate(two_lem)
pos_bi_wc = wc.words_  # dictionary of the words used to make the cloud
plt.figure(figsize=(10,8))
plt.imshow(wc)
plt.axis("off")
plt.show()


In [None]:
punct = '%&\\()*,./:;<=>[\\]^_`{|}~!?#0123456789'  # characters to strip from token edges

# starting from two_lem, which was tokenized and lemmatized above, but stopwords were 
# not removed (used wordcloud's stopword remover). Clean it up for ngrams
ngram_text = two_lem.replace("_", " ")
ngram_text = [word.strip(punct) for word in ngram_text.split()]
ngram_text = [word for word in ngram_text if word not in stopwords]  # remove stopwords
ngram_text = [s for s in ngram_text if len(s) != 0]  # remove empties
ngram_text_pos = ' '.join(ngram_text)

# find frequency of the bi/trigrams
bi_pos, tri_pos = aa.bi_tri_freq(ngram_text_pos)
print("Positive nGrams")
print(bi_pos[:15])
print()
print(tri_pos[:15])

In [None]:
# Lists of the words (dictionary keys) used to make each cloud
neu_words = list(neu_wc)
neu_bigs = list(neu_bi_wc)
neg_words = list(neg_wc)
neg_bigs = list(neg_bi_wc)
pos_words = list(pos_wc)
pos_bigs = list(pos_bi_wc)

wc_words = pd.DataFrame({"neu words":neu_words, "neg words":neg_words, "pos words":pos_words, "neu nGrams":neu_bigs, "neg nGrams":neg_bigs, "pos nGrams":pos_bigs})
wc_words[:15]

In [None]:
print("Neutral nGrams")
print(bi_neu[:10])
print(tri_neu[:10])
print()
print("Negative nGrams")
print(bi_neg[:10])
print(tri_neg[:10])
print()
print("Positive nGrams")
print(bi_pos[:10])
print(tri_pos[:10])


In [None]:
# Should have included this in the aardvark function. Oh well.
from sklearn.feature_extraction.text import CountVectorizer

words_in_text = list(ngrams(ngram_text_neu.split(), 1))
words_in_text = [' '.join(i) for i in words_in_text]
vectorizer = CountVectorizer(ngram_range=(1, 1))
bag_of_words = vectorizer.fit_transform(words_in_text)
vectorizer.vocabulary_
sum_words = bag_of_words.sum(axis=0) 
words_freq = [(word, sum_words[0, idx]) for word, idx in vectorizer.vocabulary_.items()]
words_in_text_freq = sorted(words_freq, key = lambda x: x[1], reverse=True)
print("Frequent Words: Neutral Tweets")
print(words_in_text_freq[:25])
print()

words_in_text = list(ngrams(ngram_text_neg.split(), 1))
words_in_text = [' '.join(i) for i in words_in_text]
vectorizer = CountVectorizer(ngram_range=(1, 1))
bag_of_words = vectorizer.fit_transform(words_in_text)
vectorizer.vocabulary_
sum_words = bag_of_words.sum(axis=0) 
words_freq = [(word, sum_words[0, idx]) for word, idx in vectorizer.vocabulary_.items()]
words_in_text_freq = sorted(words_freq, key = lambda x: x[1], reverse=True)
print("Frequent Words: Negative Tweets")
print(words_in_text_freq[:25])
print()

words_in_text = list(ngrams(ngram_text_pos.split(), 1))
words_in_text = [' '.join(i) for i in words_in_text]
vectorizer = CountVectorizer(ngram_range=(1, 1))
bag_of_words = vectorizer.fit_transform(words_in_text)
vectorizer.vocabulary_
sum_words = bag_of_words.sum(axis=0) 
words_freq = [(word, sum_words[0, idx]) for word, idx in vectorizer.vocabulary_.items()]
words_in_text_freq = sorted(words_freq, key = lambda x: x[1], reverse=True)
print("Frequent Words: Positive Tweets")
print(words_in_text_freq[:25])

In [None]:
# Alternate approach: easier code, messier results.
from nltk.probability import FreqDist
pos_tok = ngram_text_pos.split(" ")
fdist = FreqDist(pos_tok)
fdist = fdist.most_common(5)
fdist

In [None]:
# check_tweets = pd.read_csv(os.path.join('dataVader', "master_labelel.csv"), index_col=0, header=0)
check_tweets = pd.read_csv(os.path.join('dataVader', "master_tweets.csv"), index_col=0, header=0)


### ALTERNATE term check

In [None]:
my_count = 0
for id, text in zip(check_tweets.id_stable, check_tweets["ContentClean"]):
    if "🔜" in text.lower():  #osamabin: 6; binladen: 11; ben laden: 1
        print(id, ":", text)
        my_count += 1
my_count
# Orig: left behind: 3992 instances in master_tweets

In [None]:
vader_sent_compound("embarrassing")

# Adding to VADER
### Stop Words 
VADER takes care of stop word removal. I would like to refine what is considered a stopword, but that's surprisingly complicated. So instead we are going to change the lexicon to make the relevant words score = 0.

Add: all the ones that were used for WordCloud: "Afghan", "Afghans", "Afghanistan", "refugee", "refugees", "U", "S", "US", "people", "will", "now", "UK", "wa", "ha", "say", "like", "via"

Also: "asylum", "seeker", "special", "family", "country"
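A minimal sketch of why a lexicon score of 0 neutralizes a word. It assumes a simplified scorer that only sums per-word valences and then applies VADER's normalization, x / sqrt(x^2 + 15); the real analyzer also applies heuristics for negation, boosters, punctuation and capitalization. `toy_lexicon`, its valences and `toy_compound` are hypothetical stand-ins, not the real lexicon.

```python
import math


def compound_from_valences(valences, alpha=15):
    """Normalize a summed valence into the (-1, 1) compound range."""
    total = sum(valences)
    return total / math.sqrt(total * total + alpha)


def toy_compound(text, lexicon):
    """Simplified scorer: look up each lowercase token, unknown words score 0."""
    valences = [lexicon.get(token, 0.0) for token in text.lower().split()]
    return compound_from_valences(valences)


toy_lexicon = {"special": 1.5, "hate": -2.5}  # hypothetical valences

before = toy_compound("special visa holder", toy_lexicon)
toy_lexicon.update({"special": 0.0})  # treat "special" like a stopword
after = toy_compound("special visa holder", toy_lexicon)
print(round(before, 3), round(after, 3))
```

With the score set to 0 the word still passes through the analyzer, but it no longer moves the compound score in either direction.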

In [None]:
sid = SentimentIntensityAnalyzer()
change_lex = {}

In [None]:
check_list = ["asylum", "seeker", "special", "family", "country", "Afghan", "Afghans", "Afghanistan", "refugee", "refugees", "U", "S", "US", "people", "will", "now", "UK", "wa", "ha", "say", "like", "via"]
for i in check_list:
    print(i, ":", vader_sent_compound(i))  #💩

In [None]:
change_lex["special"]=0
change_lex["like"] = 0
change_lex["ha"] = 0
sid.lexicon.update(change_lex)
# "ha" is in both pos and neg (sarcastic); this should be an intensifier, not a sentiment.

### New Dictionary terms
emojis

negative: left behind, unvetted, Taliban, Osama bin Laden, crisis

Positive: soccer, citizen, help, support, thank, welcome, hope

Neutral: resettle, resettlement

#thanks: https://stackoverflow.com/questions/40481348/is-it-possible-to-edit-nltks-vader-sentiment-lexicon

In [None]:
check_list = ["left behind", "unvetted", "Taliban", "Laden", "crisis", "soccer", "citizen", "help", "support", "thank", "welcome", "hope", "resettle", "resettlement"]

for i in check_list:
    print(i, ":", vader_sent_compound(i))

In [None]:
change_lex["left behind"] = -3
change_lex["unvetted"] = -2
change_lex["taliban"] = -1
change_lex["laden"] = -1  # In the full dataset, there are 123 " bin laden"s and 156 "laden"s; most of these are made up by just the most obvious misspellings (#osamabin: 6; binladen: 11; ben laden: 1). Good enough.
change_lex["soccer"] = 1
change_lex["citizen"] = 1
sid.lexicon.update(change_lex)

"left behind" needs to be changed into a single word for VADER to score it appropriately.

In [None]:
print(vader_sent_compound("left behind"))
change_lex["left_behind"] = -2  # negative, matching the "left behind" entry in the negative list above
sid.lexicon.update(change_lex)
vader_sent_compound("left_behind")
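Why the underscore matters can be sketched with a toy lookup. This assumes the lexicon is consulted one whitespace-separated token at a time, so a key that contains a space can never match; `toy_lexicon`, its scores and `matched_terms` are hypothetical.

```python
# Hypothetical scores; both spellings are present to show which one can match
toy_lexicon = {"left behind": -2.0, "left_behind": -2.0}


def matched_terms(text, lexicon):
    """Return the tokens of the text that are keys in the lexicon."""
    return [token for token in text.lower().split() if token in lexicon]


raw = "They were left behind"
joined = raw.lower().replace("left behind", "left_behind")

print(matched_terms(raw, toy_lexicon))     # no token equals "left behind"
print(matched_terms(joined, toy_lexicon))  # the joined token is found
```

The same replacement has to be applied to the text that is scored, otherwise the new lexicon entry is never reached.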

In [None]:
my_count = 0
for id, text in zip(check_tweets.id_stable, x_train["ContentClean"]):
    if "left behind" in text.lower():  #osamabin: 6; binladen: 11; ben laden: 1
        #print(id, ":", text)
        my_count += 1
print(my_count)

my_count = 0
for id, text in zip(check_tweets.id_stable, x_train["ContentClean"]):
    if "left_behind" in text.lower():  #osamabin: 6; binladen: 11; ben laden: 1
        #print(id, ":", text)
        my_count += 1
print(my_count)

In [None]:
for i, text in zip(x_train.index, x_train["ContentClean"]):
    x_train.loc[i, ["ContentClean"]] = text.lower().replace('left behind', 'left_behind')

### Now find the score

In [None]:
x_train.drop(columns = ["VADERsid", "VADERcompound", "VADERpred"], axis = 1, inplace = True)
x_train.head()

In [None]:

# Use the functions to stick it all into the df
x_train["VADERsid"] = x_train["Content"].apply(vader_sid)
x_train["VADERcompound"] = x_train["Content"].apply(vader_sent_compound)
x_train["VADERpred"] = x_train["Content"].apply(vader_pred)

print("Ground Truth:")
print(y_train_sent["y_sent"].value_counts())
print("Pred:")
print(x_train["VADERpred"].value_counts())


In [None]:
# Ground-truth labels (the same "true" used for the earlier scores)
true = y_train_sent["y_sent"].tolist()
# Get the prediction list
lex1_pred = list(x_train["VADERpred"])

# Find the microaverage of the F1 scores
base_microF1 = f1_score(y_true=true, y_pred=lex1_pred, average='micro', zero_division='warn')
base_macroF1 = f1_score(y_true=true, y_pred=lex1_pred, average='macro', zero_division='warn')

print("Micro and Macro-Average")
print('\tVADER_lex1 F-score, micro average: {:04.3f}'.format(base_microF1))
print('\tVADER_lex1 F-score, macro average: {:04.3f}'.format(base_macroF1))
print()

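This is a little better than untuned VADER (micro F1 0.563 v. 0.543; macro F1 0.521 v. 0.502), but really not much. It is still below the majority-class baseline on micro-averaged F1 (0.606), though well above it on macro-averaged F1 (0.252). Compared with the ground truth, the model is: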
* Underpredicting neutral
* Underpredicting negative
* Overpredicting positive

Baseline: Majority Class Prediction
* Majority class prediction F-score, micro average: 0.606
* Majority class prediction F-score, macro average: 0.252

VADER-base, untuned:
* VADER-base F-score, micro average: 0.543
* VADER-base F-score, macro average: 0.502

VADER-base, lexicon with updated sentiment terms
* VADER_lex_ F-score, micro average: 0.563
* VADER_lex_ F-score, macro average: 0.521
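The large gap between the micro and macro baseline figures follows from how the two averages are defined. A minimal standard-library sketch on made-up labels (not the notebook's data): predicting only the majority class gives a micro F1 equal to that class's share of the data, while the macro average is pulled down by an F1 of 0 for every other class. `f1_micro_macro` is a hypothetical helper written for this illustration.

```python
def f1_micro_macro(y_true, y_pred):
    """Return (micro F1, macro F1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    tp_total = fp_total = fn_total = 0
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)  # F1 of 0 if undefined
        tp_total += tp
        fp_total += fp
        fn_total += fn
    micro = 2 * tp_total / (2 * tp_total + fp_total + fn_total)
    macro = sum(per_class) / len(per_class)
    return micro, macro


# Toy data: 60% neutral (0), 30% negative (1), 10% positive (2)
y_true = [0] * 6 + [1] * 3 + [2] * 1
y_pred = [0] * 10  # always predict the majority class
micro, macro = f1_micro_macro(y_true, y_pred)
print(round(micro, 3), round(macro, 3))
```

For this reason the macro average is the more informative number when comparing VADER against the majority-class baseline.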


In [None]:
frames = [x_train, y_train_sent]
train_temp = pd.concat(frames, axis=1)
train_temp.drop(["Date", "Content", "ContentClean", "VADERsid"], axis=1, inplace=True)
train_temp = train_temp.iloc [:, [0, 1, 2, 4]]  
train_temp.columns = ["id_stable", "VADERcompound", "VADERpred", "GroundTruth"]
print(train_temp.shape)

In [None]:
train_temp.groupby('GroundTruth')['VADERcompound'].describe()

In [None]:
print("VADER Compound Score grouped by Ground Truth")
train_temp.boxplot(by='GroundTruth', column='VADERcompound', figsize=(10,8))

### PAUSE / UNPAUSE

In [None]:
# PAUSE
x_train.to_csv(os.path.join('dataVader', "x_train_VADERlex.csv"))
#x_val.to_csv(os.path.join('dataVader', "x_val_VADERlex.csv"))   NOT done yet 


In [None]:
# UNPAUSE
x_train = pd.read_csv(os.path.join('dataVader', "x_train_VADERlex.csv"), header=0, index_col=0)
x_val = pd.read_csv(os.path.join('dataVader', "x_val_VADERlex.csv"), header=0, index_col=0)
y_train_sent = pd.read_csv(os.path.join('data', "y_train_sent.csv"), header=0, index_col=0)
y_val_sent = pd.read_csv(os.path.join('data', "y_val_sent.csv"), header=0, index_col=0)

### Add new emojis to the lexicon

In [None]:
emoji_df = pd.read_csv(os.path.join('dataVader', "emoji_score.csv"), header=0, index_col=0)
emoji_lex = emoji_df[emoji_df["VADERscore"]==0].copy()
emoji_lex.drop("VADERscore", axis=1, inplace=True)
emoji_lex.head()

I'm going to do the scoring in Excel.

In [None]:
emoji_lex.to_csv("emoji_lex.csv")

In [None]:
emoji_lex2 = pd.read_csv(os.path.join('dataVader', "emoji_lex_myScore.csv"), header=0, index_col=0)
emoji_lex2.tail()

# emoji_lex2 = pd.read_csv("emoji_lex.csv", header=0, index_col=0)

So, apparently VADER transforms emoji to text before extracting sentiment. I'm not exactly sure how this works: since it seems to map emoji to some pretty common words, I'm not sure how updating these sentiments would work. So I'm going to do the word transformation myself, then add those words to the lexicon with my scores.
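A minimal sketch of that word transformation, using three of the word codes that appear in this notebook. `emoji_to_words` is a hypothetical helper, and unlike the notebook's loop it also collapses the extra spaces it introduces.

```python
# Emoji paired with the word codes used in this notebook
emoji_names = {
    "🤷": "person_shrugging",
    "💔": "my_broken_heart",
    "💀": "sk_ull",
}


def emoji_to_words(text, mapping):
    """Replace each mapped emoji with its word code, padded by spaces."""
    for emj, name in mapping.items():
        text = text.replace(emj, " " + name + " ")
    return " ".join(text.split())  # collapse the doubled spaces


print(emoji_to_words("no idea🤷 honestly", emoji_names))
print(emoji_to_words("💔💔", emoji_names))
```

Once the text contains the word codes, each code can be given its own score in the lexicon in the same way as any other new term.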

In [None]:
# del emoji_lex
# del emoji_lex2

In [None]:
frames = [emoji_lex, emoji_lex2]
emoji_lex = pd.concat(frames, axis=1)
# emoji_lex.drop(columns=["demoji", "emosentScore"], inplace=True)
emoji_lex.drop(columns=["emoji", "demoji", "emosentScore"], inplace=True)
emoji_lex = emoji_lex[emoji_lex["lexScore"] != 0]
emoji_lex = dict(zip(emoji_lex["name"], emoji_lex["lexScore"]))
emoji_lex

In [None]:
sid.lexicon.update(emoji_lex)

In [None]:
print(vader_sent_compound("person_shrugging"))
print(vader_sent_compound("clown_face"))
print(vader_sent_compound("my_broken_heart"))

So that works to update VADER.

Now to make the transformation in the data.

In [None]:
a = emoji_df[emoji_df["VADERscore"]==0].copy()
a.drop("VADERscore", axis=1, inplace=True)
a.tail()

In [None]:
frames = [a, emoji_lex2]
emoji_dataUpdate = pd.concat(frames, axis=1)
emoji_dataUpdate.tail()

In [None]:
emoji_dataUpdate.loc[70, "emoji"] = "💔"
emoji_dataUpdate.loc[71, "emoji"] = "😔"
emoji_dataUpdate.loc[72, "emoji"] = "😥"
emoji_dataUpdate.tail()

In [None]:
emoji_dataUpdate.drop(columns=["demoji", "emosentScore", "lexScore"], inplace=True)
emoji_dataUpdate.tail()

### PAUSE / UNPAUSE

In [None]:
x_train.to_csv(os.path.join('dataVader', "temp_x_train.csv"))

In [None]:
x_train = pd.read_csv(os.path.join('dataVader', "temp_x_train.csv"), index_col=0, header=0)
x_train.tail()

### Cont.

In [None]:
print(aa.term_check("💀", x_train))  #💀 sk_ull
print(aa.term_check("sk_ull", x_train))  #💀 sk_ull
print(aa.term_check("💔", x_train))
print(aa.term_check("my_broken_heart", x_train))

aa.term_check("person_shrugging", x_train)


In [None]:
for i, text in zip(x_train.index, x_train["ContentClean"]):
    # Build up the new text so that every replacement is kept,
    # even when a row contains more than one emoji
    new_text = text.replace("‼️", "!!").replace("👊", "!")
    for emj, nme in zip(emoji_dataUpdate["emoji"], emoji_dataUpdate["name"]):
        new_text = new_text.replace(emj, " " + nme + " ")
    if new_text != text:
        x_train.loc[i, "ContentClean"] = new_text
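
The same replacement can be sketched as a standalone function that carries each replacement forward, so a text with several emoji keeps all of them. The function name `replace_emoji` and the mapping `toy_map` are hypothetical; the real mapping would come from the "emoji" and "name" columns of emoji_dataUpdate:

```python
def replace_emoji(text, emoji_to_name):
    """Replace each emoji in text with its space-padded name token."""
    for emj, name in emoji_to_name.items():
        text = text.replace(emj, " " + name + " ")
    return text

# Hypothetical mapping for illustration only.
toy_map = {"💔": "my_broken_heart", "💀": "sk_ull"}
print(replace_emoji("so sad💔💀", toy_map))
```

On the data this could probably be applied with something like `x_train["ContentClean"].apply(replace_emoji, emoji_to_name=mapping)`, where `mapping` is a dict built from emoji_dataUpdate.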
    

## Find new scores

In [None]:
# sid.lexicon.update(change_lex)
print(vader_sent_compound("left_behind"))
print(vader_sent_compound("💀"))
print(vader_sent_compound("sk_ull"))
print(vader_sent_compound("💔"))
print(vader_sent_compound("my_broken_heart"))
print(vader_sent_compound("person_shrugging"))

In [None]:
# x_train.drop(columns=["VADERsid", "VADERcompound", "VADERpred"], axis=1, inplace=True)
# x_train.head()

In [None]:
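# Use the functions to stick it all into the df
# NOTE: these score the "Content" column; the emoji replacements above were written to "ContentClean"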
x_train["VADERsid"] = x_train["Content"].apply(vader_sid)
x_train["VADERcompound"] = x_train["Content"].apply(vader_sent_compound)
x_train["VADERpred"] = x_train["Content"].apply(vader_pred)

print("Ground Truth:")
print(y_train_sent["y_sent"].value_counts())
print("Pred:")
print(x_train["VADERpred"].value_counts())

In [None]:
# Use the same "true" as above
# Get the prediction list
lex2_pred = x_train["VADERpred"].tolist()

# Find the micro- and macro-averaged F1 scores
base_microF1 = f1_score(y_true=true, y_pred=lex2_pred, average='micro', zero_division='warn')
base_macroF1 = f1_score(y_true=true, y_pred=lex2_pred, average='macro', zero_division='warn')

print("Micro and Macro-Average")
print('\tVADER-base w emoji lexicon, F-score, micro average: {:04.3f}'.format(base_microF1))
print('\tVADER-base w emoji lexicon, F-score, macro average: {:04.3f}'.format(base_macroF1))
print()

In [None]:
frames = [x_train, y_train_sent]
train_temp = pd.concat(frames, axis=1)
train_temp.drop(["Date", "Content", "ContentClean", "VADERsid"], axis=1, inplace=True)
train_temp = train_temp.iloc[:, [0, 1, 2, 4]]
train_temp.columns = ["id_stable", "VADERcompound", "VADERpred", "GroundTruth"]
print(train_temp.shape)

In [None]:
train_temp.groupby('GroundTruth')['VADERcompound'].describe()

In [None]:
print("VADER Compound Score grouped by Ground Truth")
train_temp.boxplot(by='GroundTruth', column='VADERcompound', figsize=(10,8))

So, this changes NOTHING. I guess not many rows have emojis, and/or those rows were already predicted correctly.

* Underpredicting neutral
* Underpredicting negative
* Overpredicting positive

Baseline: Majority Class Prediction
* Majority class prediction F-score, micro average: 0.606
* Majority class prediction F-score, macro average: 0.252

VADER-base, untuned:
* VADER-base F-score, micro average: 0.543
* VADER-base F-score, macro average: 0.502

VADER-base, lexicon with updated sentiment terms
* VADER_lex_ F-score, micro average: 0.563
* VADER_lex_ F-score, macro average: 0.521

VADER-base w emoji lexicon
* VADER-base w emoji lexicon, F-score, micro average: 0.563
* VADER-base w emoji lexicon, F-score, macro average: 0.521
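
As a sanity check, the two majority-class baseline numbers are consistent with each other. This is a small worked calculation, assuming three classes and a majority class that covers 60.6% of the rows (the reported micro average):

```python
# Assumption: three classes, majority class = 60.6% of rows.
p_major = 0.606
n_classes = 3

# Predicting the majority class everywhere: recall is 1, precision is p_major.
f1_major = 2 * p_major * 1.0 / (p_major + 1.0)

# The other classes are never predicted, so their F1 counts as 0.
macro_f1 = f1_major / n_classes
micro_f1 = p_major  # micro F1 equals accuracy for single-label multiclass

print(f"micro: {micro_f1:.3f}, macro: {macro_f1:.3f}")
# -> micro: 0.606, macro: 0.252
```

So the high micro average of the baseline comes entirely from the majority class, which is why the macro average is the fairer comparison here.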


# Find and Remove more News
Ok, so we discovered when dealing with the emojis that several of them are reliable indicators of news articles or other irrelevant rows:
* 🎙️: audio broadcasts, esp. podcasts
* 🆕: new information
* 📡: on the radar
* 📊: data given
* 📅: event schedule (do this one later)


In [None]:
## IN THE FULL DATASET
print(check_tweets.shape)
my_count = 0
for tweet_id, text in zip(check_tweets.id_stable, check_tweets["ContentClean"]):
    if "🆕" in text:  # emoji are unaffected by lower(), so no lowercasing needed
        print(tweet_id, ":", text)
        my_count += 1
my_count

In [None]:
my_count = 0
print(x_train.shape)
for tweet_id, text in zip(x_train.id_stable, x_train["ContentClean"]):
    if "new_button" in text.lower():  # lowercase so that "NEW_button" matches
        my_count += 1
my_count


In [None]:
news_tokens = ["studio_microphone", "NEW_button", "satellite_antenna", "bar_chart"]
indx_list = []
# Collect the index of every row that contains any of the news-marker tokens
for i, text in zip(x_train.index, x_train["ContentClean"]):
    if any(tok in text for tok in news_tokens):
        indx_list.append(i)

indx_list


This is not in enough rows to be worth it.

# Reset the category thresholds
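I think the VADER people wanted to keep "neutral" as small as possible: the standard cutoffs are a compound score of at least 0.05 for positive and at most -0.05 for negative. That makes sense when you are talking about a simple subject. It's less useful when topics are complex (lots of pros and cons) and undecided or neutral is a viable stance to take on the subject.

So let's move the thresholds: a higher bar for positive, which is overpredicted, and a lower bar for negative, which is underpredicted.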

In [None]:
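def vader_pred_var(tweet):
    """Label a tweet from its compound score: 2 = positive, 1 = negative, 0 = neutral."""
    scores = sid.polarity_scores(tweet)
    comp = scores["compound"]
    if comp >= 0.1:       # positive cutoff
        return 2
    elif comp <= -0.01:   # negative cutoff
        return 1
    else:
        return 0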

In [None]:
# Use the functions to stick it all into the df
x_train["VADERsid"] = x_train["Content"].apply(vader_sid)
x_train["VADERcompound"] = x_train["Content"].apply(vader_sent_compound)
x_train["VADERpred"] = x_train["Content"].apply(vader_pred_var)

print("Ground Truth:")
print(y_train_sent["y_sent"].value_counts())
print("Pred:")
print(x_train["VADERpred"].value_counts())

In [None]:
# Use the same "true" as above
# Get the prediction list
threshold_pred = x_train["VADERpred"].tolist()

# Find the micro- and macro-averaged F1 scores
base_microF1 = f1_score(y_true=true, y_pred=threshold_pred, average='micro', zero_division='warn')
base_macroF1 = f1_score(y_true=true, y_pred=threshold_pred, average='macro', zero_division='warn')

print("Micro and Macro-Average")
print('\tVADER-threshold F-score, micro average: {:04.3f}'.format(base_microF1))
print('\tVADER-threshold F-score, macro average: {:04.3f}'.format(base_macroF1))
print()

So, disappointingly, this also does very little. Because the compound scores of the classes overlap so much, moving a threshold may help one class, but it will hurt another about equally. There has to be better separation between classes in order for this to work.
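
A toy illustration of that trade-off, assuming the same label coding as vader_pred_var (2 = positive, 1 = negative, 0 = neutral). The scores and labels in `toy` are invented to overlap; they are not taken from the tweets:

```python
def label(comp, pos_cut, neg_cut):
    """2 = positive, 1 = negative, 0 = neutral."""
    if comp >= pos_cut:
        return 2
    if comp <= neg_cut:
        return 1
    return 0

# Made-up (compound score, true label) pairs with overlapping classes.
toy = [(0.30, 2), (0.08, 2), (0.07, 0), (0.02, 0),
       (-0.03, 1), (-0.04, 0), (-0.20, 1)]

def correct_per_class(pos_cut, neg_cut):
    """Count correctly labelled rows per true class for the given cutoffs."""
    counts = {0: 0, 1: 0, 2: 0}
    for comp, truth in toy:
        if label(comp, pos_cut, neg_cut) == truth:
            counts[truth] += 1
    return counts

print(correct_per_class(0.05, -0.05))  # standard VADER cutoffs
print(correct_per_class(0.10, -0.01))  # cutoffs tried above
```

In this toy set the new cutoffs gain one correct negative and lose one correct positive, so the total number of correct rows does not move.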

NOTE: no need for boxplots, as the distribution hasn't changed.

Unfortunately, I think I have to leave the analysis there and move on to writeup. I don't have time to keep working on this now.

# Now make the selected transformations on the validate data and the full dataset.
* sid.lexicon.update(change_lex)
* replace "left behind" with "left_behind"
* replace new emojis with the text equivalent in 

# TO CONSIDER, esp. for BERT
* expand contractions
* remove links, URLs
* Replace emoji w standard text
* What to do with hashtags
* remove capitalization 
* What to do with punctuation: ! or ? should be useful for sentiment. Maybe " " for sarcasm?
* remove whitespaces
* Bertmoticon package, which is fine-tuned to the BERT model.
* remove stop words beyond the standard NLTK stop words
    * Create WordClouds to find prominent but useless words
    * Ex: days of the week and their abbreviations, month names, and the word “Twitter” 
* deal w negations
    * create a dictionary of negations so that negated words could be effectively handled
* tokenize 
* stemming (via PorterStemmer)
* REMEMBER:
    * Hashtags is filled with "No hashtags"
    * Location is filled with "English-speaking"
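
A rough sketch of a few of the simpler steps from this list (lowercasing, removing URLs, expanding contractions, collapsing whitespace). The function name `basic_clean` and the tiny `CONTRACTIONS` table are placeholders; a real table would be much longer, and whether each step actually helps is still an open question:

```python
import re

# Tiny illustrative contraction table (straight apostrophes only).
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}

def basic_clean(text):
    """Lowercase, drop URLs, expand a few contractions, squeeze whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    return re.sub(r"\s+", " ", text).strip()

print(basic_clean("It's  bad!! We can't leave https://example.com now"))
```

Punctuation is left alone on purpose, since ! and ? may carry sentiment.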