## Sentiment analysis of US and Singaporean text messages

The World Happiness Report for 2021 lists the United States as the 19th happiest country in the world, while Singapore rings in at 32nd*. Let's try some sentiment analysis to see if this reported discrepancy is borne out! The link below leads to a report on a corpus of text messages from students of different nationalities that were attending the National University of Singapore. Are the messages of Singaporean students rated as "positive" more than those of US students? Let's take a look.

*source: https://worldhappiness.report/ed/2021/

*corpus study: https://link.springer.com/article/10.1007/s10579-012-9197-9

Start by importing the necessary libraries:

In [1]:
#import the necessary libraries
import nltk
import pandas as pd
from collections import Counter
#nltk.download('twitter_samples')               be sure you have these! 
#nltk.download('punkt')                         
#nltk.download('averaged_perceptron_tagger')
#nltk.download('stopwords')
#nltk.download('wordnet')                       needed by WordNetLemmatizer
from nltk.corpus import twitter_samples, stopwords, wordnet
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk import punkt, FreqDist, classify, NaiveBayesClassifier
from nltk.tag import pos_tag
from nltk.stem.wordnet import WordNetLemmatizer
import re, string
import json
import random

### Data preparation

To do this, we're going to train a Naive Bayes model on a set of 10,000 tweets split into two equal groups: tweets with "positive" sentiment and tweets with "negative" sentiment*. We train on 7,000 of the tweets and hold out the remaining 3,000 to test the model, then apply the model to the text message data.

*Item 94, "Twitter Samples" from: https://www.nltk.org/nltk_data/

In [2]:
#Prepare the data for analysis!

#get dataframe of text messages
textdf = pd.read_csv("clean_nus_sms.csv")

#drop columns we won't need/that aren't important for this task
textdf = textdf.drop(['id', 'Unnamed: 0'], axis =1)

#filter rows to get only USA and Singapore data
textdf = textdf.loc[(textdf['country'] == 'United States') | (textdf['country'] == 'Singapore') ]

#make a dataframe for each country 
textdf_us = textdf.loc[textdf['country'] == 'United States']
textdf_sing = textdf.loc[textdf['country'] == 'Singapore']

# make a list of messages for each country, skipping missing values
us_tokens = textdf_us['Message'].dropna().tolist()
us_tokens_string = ' '.join(us_tokens)
us_cleaned = re.sub(r'\W+', ' ', us_tokens_string)

sing_tokens = textdf_sing['Message'].dropna().tolist()
sing_tokens_string = ' '.join(sing_tokens)
sing_cleaned = re.sub(r'\W+', ' ', sing_tokens_string)

#tokenize
pos_tokens = twitter_samples.tokenized('positive_tweets.json')
neg_tokens = twitter_samples.tokenized('negative_tweets.json')
us_tokenized = word_tokenize(us_cleaned)
sing_tokenized = word_tokenize(sing_cleaned)


#normalize and lemmatize the text messages
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

us_no_stop = [word for word in us_tokenized if not word.lower() in stop_words]
sing_no_stop = [word for word in sing_tokenized if not word.lower() in stop_words]

us_clean_token = [lemmatizer.lemmatize(token) for token in us_no_stop]
sing_clean_token = [lemmatizer.lemmatize(token) for token in sing_no_stop]


#define a function that handles stop words and lemmatization for tweets
def remove_noise(tokens, stop_words=()):
    cleaned_tokens = []
    lemmatizer = WordNetLemmatizer()
    for token, tag in pos_tag(tokens):
        #strip URLs and @-mentions
        token = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|'
                       r'(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', token)
        token = re.sub(r'(@[A-Za-z0-9_]+)', '', token)

        #map the Penn Treebank tag to a WordNet part of speech
        if tag.startswith('NN'):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'
        token = lemmatizer.lemmatize(token, pos)

        if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
            cleaned_tokens.append(token.lower())
    return cleaned_tokens

#normalize and lemmatize the tweets
pos_clean_token = []
neg_clean_token = []

for tokens in pos_tokens:
    pos_clean_token.append(remove_noise(tokens, stop_words))
for tokens in neg_tokens:
    neg_clean_token.append(remove_noise(tokens, stop_words))


#define a function to get all words in a list for checking word frequencies
def get_all_words(cleaned_tokens_list):
    for tokens in cleaned_tokens_list:
        for token in tokens:
            yield token

all_pos_words = get_all_words(pos_clean_token)
all_neg_words = get_all_words(neg_clean_token)
#the text message tokens are already flat lists of words, so no flattening is needed
all_us_words = iter(us_clean_token)
all_sing_words = iter(sing_clean_token)

#convert tokens to a dictionary for classification tasks
def get_tweets_for_model(cleaned_tokens_list):
    for tweet_tokens in cleaned_tokens_list:
        yield dict([token, True] for token in tweet_tokens)

#to do this, have to make the text messages into a list of
#tokenized sentences, rather than a list of strings        
us_split_tokens = []
sing_split_tokens = []
#(the loop variable is not called "string", which would shadow the string module)
for message in us_tokens:
    us_split_tokens.append(message.split())
for message in sing_tokens:
    sing_split_tokens.append(message.split())
    
      
pos_model_tokens = get_tweets_for_model(pos_clean_token)
neg_model_tokens = get_tweets_for_model(neg_clean_token)
us_model_tokens = get_tweets_for_model(us_split_tokens)
sing_model_tokens = get_tweets_for_model(sing_split_tokens)


#prepare tweet data for training/testing
positive_dataset = [(tweet_dict, "Positive")
                    for tweet_dict in pos_model_tokens]
negative_dataset = [(tweet_dict, "Negative")
                    for tweet_dict in neg_model_tokens]

#prepare text message data for testing (one feature dict per message)
us_dataset = list(us_model_tokens)
sing_dataset = list(sing_model_tokens)

#make one combined set of negative and positive tweets for training
dataset = positive_dataset + negative_dataset

#shuffle so all positive tweets aren't first
random.shuffle(dataset) 

#set apart 7000 for training, 3000 for testing
train_data = dataset[:7000]  
test_data = dataset[7000:]

#build the model!
classifier = NaiveBayesClassifier.train(train_data)
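
One thing to keep in mind about the cell above: the tweets go through `remove_noise` (lowercased, URLs and handles stripped, lemmatized), while the text messages are only split on whitespace. A message token like `GOOD` or `show!` will therefore not match the lowercase features the classifier learned. Below is a minimal sketch of a shared feature builder that uses only the standard library. `to_features` is a hypothetical helper, not part of the pipeline above, and it skips lemmatization and stop-word removal for brevity.

```python
import string

def to_features(text):
    """Turn a raw message into the {token: True} dict the classifier expects."""
    features = {}
    for raw in text.split():
        token = raw.lower()
        # Strip surrounding punctuation only if something is left afterwards,
        # so emoticons such as ":)" survive intact.
        stripped = token.strip(string.punctuation)
        if stripped:
            token = stripped
        features[token] = True
    return features

print(to_features("Have a GOOD show! :)"))
```

Feeding tweets and text messages through the same function would keep the two feature spaces aligned, at the cost of retraining the model on the new features.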

### Modeling

The model is built! How accurate is it, and what features does it find most informative? 

In [3]:
print('\nAccuracy is:', classify.accuracy(classifier, test_data))
#show_most_informative_features prints its own table and returns None
classifier.show_most_informative_features(10)


Accuracy is: 0.9953333333333333
Most Informative Features
                      :( = True           Negati : Positi =   2053.2 : 1.0
                      :) = True           Positi : Negati =    998.8 : 1.0
                     sad = True           Negati : Positi =     25.0 : 1.0
                     bam = True           Positi : Negati =     22.6 : 1.0
                follower = True           Positi : Negati =     22.5 : 1.0
              appreciate = True           Positi : Negati =     18.6 : 1.0
                    glad = True           Positi : Negati =     17.2 : 1.0
                     x15 = True           Negati : Positi =     16.8 : 1.0
                followed = True           Negati : Positi =     14.8 : 1.0
               community = True           Positi : Negati =     14.5 : 1.0


Not too bad! The model is over 99% accurate on the held-out tweets. The emoticons are by far the most informative features, ":(" for negativity and ":)" for positivity, and "sad" is the strongest word-level cue for negativity. There are some strange artifacts from training on Twitter data, such as "follower" and "followed" (as well as whatever "x15" is), but hopefully the model still does okay on the text messages.
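
The ratios in the "Most Informative Features" table compare how likely a feature is under one label versus the other. Here is a minimal sketch of that calculation, assuming add-0.5 smoothing over two outcomes (feature present or absent), which is my understanding of NLTK's default estimator. The helper names and the counts are hypothetical and are not taken from the actual training set.

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    """Add-gamma (Lidstone) estimate of P(feature present | label)."""
    return (count + gamma) / (total + bins * gamma)

def likelihood_ratio(count_a, total_a, count_b, total_b):
    """How much more likely the feature is under label A than under label B."""
    return smoothed_prob(count_a, total_a) / smoothed_prob(count_b, total_b)

# Hypothetical counts: ":(" appears in 2900 of 3500 negative training tweets
# and in 1 of 3500 positive training tweets.
ratio = likelihood_ratio(2900, 3500, 1, 3500)
print(f"Negative : Positive = {ratio:.1f} : 1.0")
```

A feature that almost never appears under one label produces a huge ratio, which is why the emoticons dominate the table.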


Let's see what these kids are talking about! What are the most frequent words in the sets?

In [4]:
freq_dist_sing = FreqDist(sing_clean_token)
print('Singapore 10 most frequent words: \n')
print(freq_dist_sing.most_common(10))

freq_dist_us = FreqDist(us_clean_token)
print('\nUS 10 most frequent words: \n') 
print(freq_dist_us.most_common(10))

Singapore 10 most frequent words: 

[('Haha', 4244), ('u', 3940), ('haha', 1760), ('go', 1671), ('le', 1425), ('got', 1330), ('lol', 1119), ('Hahaha', 1101), ('Lol', 1077), ('time', 979)]

US 10 most frequent words: 

[('u', 271), ('get', 256), ('know', 245), ('Hi', 183), ('Thanks', 177), ('like', 161), ('want', 154), ('time', 126), ('got', 124), ('Lol', 123)]


The Singaporeans come out of the gate in a fit of laughter! They might be hard to beat. The Americans are more focused on "get"-ting and "know"-ing, which is not much fun!
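
Note that "Haha" and "haha" (and "lol" and "Lol") are counted separately above because the text message tokens are never lowercased. Here is a small sketch of what case folding would do to the Singapore list, using only the ten counts printed above; words outside that top ten are not included, so the real totals would be somewhat higher.

```python
from collections import Counter

# Counts copied from the Singapore output above.
raw_counts = Counter({'Haha': 4244, 'u': 3940, 'haha': 1760, 'go': 1671,
                      'le': 1425, 'got': 1330, 'lol': 1119, 'Hahaha': 1101,
                      'Lol': 1077, 'time': 979})

# Fold case so that 'Haha' and 'haha' are counted as one word.
folded = Counter()
for word, count in raw_counts.items():
    folded[word.lower()] += count

print(folded.most_common(3))
```

In the full pipeline, the same effect could be had by lowercasing each token before building the `FreqDist`.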

Let's see how the model performs on a couple of individual messages:

In [5]:
#each entry of us_dataset / sing_dataset is already a feature dict
print('\na) Clearly negative example: ' +  us_tokens[65])
print('\n' + classifier.classify(us_dataset[65]))


print('\nb) Clearly positive example: ' + us_tokens[168])
print('\n' + classifier.classify(us_dataset[168]))

print('\nc) Positive, but not straightforward: ' + sing_tokens[625])
print('\n' + classifier.classify(sing_dataset[625]))

print('\nd) Neutral: ' + sing_tokens[8888])
print('\n' + classifier.classify(sing_dataset[8888]))


a) Clearly negative example: I'm gonna be home soon and i don't want to talk about this stuff anymore tonight, k? I've cried enough today.

Negative

b) Clearly positive example: Have a good show!

Positive

c) Positive, but not straightforward: Heee?? I cant wait too sweerty. Muacks u much much

Negative

d) Neutral: Okay i'll be at canteen 2.

Positive


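The model gets the two clear-cut examples right, but it misreads the affectionate message in (c) as negative. And because it can only answer "Positive" or "Negative", it has to force the neutral message in (d) into one of the two classes. There are ways to refine it that I discuss a bit below. For now, let's try to get a sense of the overall positivity/negativity. We keep a positive and a negative tally for each country, iterate through each country's messages, and add one to the appropriate tally for each message. The tallies will then give us a "happiness ratio", called HR.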

In [6]:
us_neg_tally = 0
us_pos_tally = 0
sing_neg_tally = 0
sing_pos_tally = 0


#each sent is already a feature dict, so it can be classified directly
for sent in us_dataset:
    if classifier.classify(sent) == "Positive":
        us_pos_tally += 1
    else:
        us_neg_tally += 1

for sent in sing_dataset:
    if classifier.classify(sent) == "Positive":
        sing_pos_tally += 1
    else:
        sing_neg_tally += 1

Now, we divide the positive scores by the negative scores to get the happiness ratio! Since we divide by the negative tally, a higher HR means higher happiness. If HR < 1, this means there were more negative texts than positive ones. Let's see what happened:

In [7]:
sing_happy_ratio = sing_pos_tally / sing_neg_tally
print('The happiness ratio for Singaporean text messages is: ' + str(sing_happy_ratio))


us_happy_ratio = us_pos_tally / us_neg_tally
print('\nThe happiness ratio for US text messages is: ' + str(us_happy_ratio))

The happiness ratio for Singaporean text messages is: 0.5832134637514385

The happiness ratio for US text messages is: 0.6457418788410887


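It's pretty close! The US messages have a slight edge. Note that both ratios are below 1, so the model labels more messages negative than positive in both groups. The ordering mirrors the World Happiness Report rankings too, where the two countries were both relatively highly ranked and not super far apart.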
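
Because HR is a ratio of two counts, it has no upper bound and is a little hard to read at a glance. Since HR = pos / neg, the share of messages classified as positive is pos / (pos + neg) = HR / (1 + HR), which always lies between 0 and 1. A quick sketch using the two ratios printed above (`positive_share` is a hypothetical helper name):

```python
def positive_share(happy_ratio):
    """Convert HR = pos / neg into pos / (pos + neg), a value in [0, 1)."""
    return happy_ratio / (1 + happy_ratio)

# Ratios copied from the output above.
sing_share = positive_share(0.5832134637514385)
us_share = positive_share(0.6457418788410887)

print(f"Singapore: {sing_share:.1%} of messages classified Positive")
print(f"US: {us_share:.1%} of messages classified Positive")
```

That works out to roughly 37% positive for the Singaporean messages and 39% for the US messages.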

### Areas for further consideration

There are many ways the analysis could be extended/refined in the future. Here are just a few ideas:

* *Normalize the happiness ratio*:

    * The happiness ratio is not super informative in a vacuum - maybe it could be normalized somehow? Put the tallies in a data frame and MinMax normalize to values between 0 (least happy) and 1 (most happy)? 
    
* *Cultural context*:

    * Lots of the slang terms in the texts escaped the stopword filter. Those could be added pretty simply, but it might affect the model's predictions if it keys in on these slang terms to make the positive/negative distinction. There are also lots of non-English terms in the Singaporean data that require cultural context. "Le" for example turns up a bunch, and is apparently a discourse marker denoting uncertainty. This kind of context is really important!

* *Coding*:

    * I ended up using slightly different code to process tweets and texts because of the different structure of each data set. Surely there must be a more parsimonious solution?
    
* *Bugs(?):*

    * The evaluation of individual messages seems to change each time the code is run. Is this just the variation of the model? Are the examples that are not clearly negative or positive causing some flip-flopping?
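
On the last point, one likely contributor is that `random.shuffle(dataset)` is called without a seed, so each run trains on a different 7,000 tweets and ends up with a slightly different classifier. Borderline messages can then flip between labels. Here is a minimal sketch of a reproducible split; `split_dataset` and the seed value are hypothetical, and the toy list stands in for the labelled tweets.

```python
import random

def split_dataset(dataset, n_train, seed=None):
    """Shuffle a copy of the dataset and split it into train and test parts."""
    rng = random.Random(seed)   # private generator; seed=None means unseeded
    shuffled = list(dataset)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

toy = list(range(10))  # stand-in for the 10,000 labelled tweets

# With a fixed seed, the split is identical on every run.
train_a, test_a = split_dataset(toy, 7, seed=42)
train_b, test_b = split_dataset(toy, 7, seed=42)
print(train_a == train_b)
```

If the individual predictions stop changing once the split is fixed, the shuffle was the source of the variation.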