### Module 10 Assignment 

Lyn Nguyen Nov. 2022

Design a sentiment analysis classifier using the **Sentiment 140** corpus and **NLTK**. Test the classifier using content from Twitter and Reddit. Describe any limitations of your sentiment analyzer. Turn in Python code as a Jupyter for the classifier.


http://help.sentiment140.com/for-students

- data: trainingandtestdata folder 
	
http://www.laurentluce.com/posts/twitter-sentiment-analysis-using-python-and-nltk/

- how to put together a sentiment analysis classifier

In [2]:
import pandas as pd
import nltk
import numpy as np
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize


## Tutorial 

In [3]:
# run what's in laurentluce.com 
pos_tweets = [('I love this car', 'positive'),
              ('This view is amazing', 'positive'),
              ('I feel great this morning', 'positive'),
              ('I am so excited about the concert', 'positive'),
              ('He is my best friend', 'positive')]
neg_tweets = [('I do not like this car', 'negative'),
              ('This view is horrible', 'negative'),
              ('I feel tired this morning', 'negative'),
              ('I am not looking forward to the concert', 'negative'),
              ('He is my enemy', 'negative'),
              ('@Kenichan I dived many times for the ball. applepie Man.', 'negative')] # added

tweets = []

for (words, sentiment) in pos_tweets + neg_tweets:
    words_filtered = [e.lower() for e in words.split() if len(e) >= 3] #throw out words with 1 or 2 characters
    tweets.append((words_filtered, sentiment))

print(tweets)

test_tweets = [
    (['feel', 'happy', 'this', 'morning'], 'positive'),
    (['larry', 'friend'], 'positive'),
    (['not', 'like', 'that', 'man'], 'negative'),
    (['house', 'not', 'great'], 'negative'),
    (['your', 'song', 'annoying'], 'negative')]

test_tweets

[(['love', 'this', 'car'], 'positive'), (['this', 'view', 'amazing'], 'positive'), (['feel', 'great', 'this', 'morning'], 'positive'), (['excited', 'about', 'the', 'concert'], 'positive'), (['best', 'friend'], 'positive'), (['not', 'like', 'this', 'car'], 'negative'), (['this', 'view', 'horrible'], 'negative'), (['feel', 'tired', 'this', 'morning'], 'negative'), (['not', 'looking', 'forward', 'the', 'concert'], 'negative'), (['enemy'], 'negative'), (['@kenichan', 'dived', 'many', 'times', 'for', 'the', 'ball.', 'applepie', 'man.'], 'negative')]


[(['feel', 'happy', 'this', 'morning'], 'positive'),
 (['larry', 'friend'], 'positive'),
 (['not', 'like', 'that', 'man'], 'negative'),
 (['house', 'not', 'great'], 'negative'),
 (['your', 'song', 'annoying'], 'negative')]

### CLASSIFIER
We get a list of features (words) and their frequencies next. 

In [63]:
# CLASSIFIER
import nltk

def get_words_in_tweets(tweets):  
    """smush all the words in the tweets into a single list"""
    all_words = []
    for (words, sentiment) in tweets:
      all_words.extend(words)
    return all_words


def get_word_features(wordlist):
    """ Outputs dictionary, although 
        no frequency count shows up (wordlist)"""
    wordlist = nltk.FreqDist(wordlist)  # FreqDist({'word1': 3, 'word2': 1, etc.}) ordered from most freq to least
    word_features = wordlist.keys()
    return word_features 

word_features = get_word_features(get_words_in_tweets(tweets))

In [5]:
tweets[1:3]
# type(tweets)

[(['this', 'view', 'amazing'], 'positive'),
 (['feel', 'great', 'this', 'morning'], 'positive')]

In [6]:
nltk.FreqDist(get_words_in_tweets(tweets))

FreqDist({'this': 6, 'the': 3, 'car': 2, 'view': 2, 'feel': 2, 'morning': 2, 'concert': 2, 'not': 2, 'love': 1, 'amazing': 1, ...})

In [7]:
word_features

dict_keys(['love', 'this', 'car', 'view', 'amazing', 'feel', 'great', 'morning', 'excited', 'about', 'the', 'concert', 'best', 'friend', 'not', 'like', 'horrible', 'tired', 'looking', 'forward', 'enemy', '@kenichan', 'dived', 'many', 'times', 'for', 'ball.', 'applepie', 'man.'])

Next, we need a feature extractor

In [8]:
def extract_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        """word_features is predefined above as list of 
           3+ letter tokens from all tweets combined"""
        features['contains(%s)' % word] = (word in document_words)
    return features

In [9]:
# example 
extract_features(['love', 'this', 'car'])

{'contains(love)': True,
 'contains(this)': True,
 'contains(car)': True,
 'contains(view)': False,
 'contains(amazing)': False,
 'contains(feel)': False,
 'contains(great)': False,
 'contains(morning)': False,
 'contains(excited)': False,
 'contains(about)': False,
 'contains(the)': False,
 'contains(concert)': False,
 'contains(best)': False,
 'contains(friend)': False,
 'contains(not)': False,
 'contains(like)': False,
 'contains(horrible)': False,
 'contains(tired)': False,
 'contains(looking)': False,
 'contains(forward)': False,
 'contains(enemy)': False,
 'contains(@kenichan)': False,
 'contains(dived)': False,
 'contains(many)': False,
 'contains(times)': False,
 'contains(for)': False,
 'contains(ball.)': False,
 'contains(applepie)': False,
 'contains(man.)': False}

In [10]:
# apply features to classifier with our feature_extract function 
# it outputs a list of tuple, each tuple holds the "feature dictionary"
training_set = nltk.classify.apply_features(extract_features, tweets)
training_set

[({'contains(love)': True, 'contains(this)': True, 'contains(car)': True, 'contains(view)': False, 'contains(amazing)': False, 'contains(feel)': False, 'contains(great)': False, 'contains(morning)': False, 'contains(excited)': False, 'contains(about)': False, 'contains(the)': False, 'contains(concert)': False, 'contains(best)': False, 'contains(friend)': False, 'contains(not)': False, 'contains(like)': False, 'contains(horrible)': False, 'contains(tired)': False, 'contains(looking)': False, 'contains(forward)': False, 'contains(enemy)': False, 'contains(@kenichan)': False, 'contains(dived)': False, 'contains(many)': False, 'contains(times)': False, 'contains(for)': False, 'contains(ball.)': False, 'contains(applepie)': False, 'contains(man.)': False}, 'positive'), ({'contains(love)': False, 'contains(this)': True, 'contains(car)': False, 'contains(view)': True, 'contains(amazing)': True, 'contains(feel)': False, 'contains(great)': False, 'contains(morning)': False, 'contains(excited)':

In [11]:
# train our classifier using our training data set
classifier = nltk.NaiveBayesClassifier.train(training_set)

In [12]:
# positive output because of word "friend"
tweet = 'Larry is my friend'
classifier.classify(extract_features(tweet.split()))

'positive'

## Apply

Now that we got the tutorial to work, let's call in Sentiment 140 data. These are their column names: 

0 - the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)

1 - the id of the tweet (2087)

2 - the date of the tweet (Sat May 16 23:58:44 UTC 2009)

3 - the query (lyx). If there is no query, then this value is NO_QUERY.

4 - the user that tweeted (robotickilldozr)

5 - the text of the tweet (Lyx is cool)

## FINAL PROJECT TRIAL 

In [37]:
# FINAL PROJECT (FB)
input_path = "data/master_annotated.csv"
fp = pd.read_csv(input_path, encoding='latin-1')

In [38]:
fp.columns

Index(['Unnamed: 0', 'experiment_id', 'experiment_group', 'text', 'tweet_id',
       'tweet_likes', 'retweets', 'tweet_created_at', 'user_id',
       'in_reply_to_status_id', 'in_reply_to_user_id',
       'in_reply_to_screen_name', 'dow', 'month_day', 'time', 'yr', 'ymd',
       'tweet_id_char', 'created_at', 'description', 'location',
       'followers_count', 'screen_name', 'statuses_count', 'favourites_count',
       'verified', 'user_id_char', 'text_length', 'text_word_count',
       'opinion_key', 'opinion_label', 'opinion_annotation_confidence',
       'ego_involvement_key', 'ego_involvement_label',
       'ego_involvement_annotation_confidence'],
      dtype='object')

In [99]:
# popularity--> opinion_label

# create a list of our conditions
conditions = [(fp['opinion_key'] == 0),
    (fp['opinion_key'] == 1) ,
    (fp['opinion_key'] == 2) ,
    (fp['opinion_key'] == 3)]

# create a list of the values we want to assign for each condition
# values = ['supports', 'neutral', 'against', 'NA']
values = ['FOR student loan forgiveness', 'NEUTRAL support', 'AGAINST student loan forgiveness', 'cannot judge support']


# create a new column and use np.select to assign values to it using our lists as arguments
fp['algorithm_opinion'] = np.select(conditions, values)

# display updated DataFrame
fp.head(3)

Unnamed: 0.1,Unnamed: 0,experiment_id,experiment_group,text,tweet_id,tweet_likes,retweets,tweet_created_at,user_id,in_reply_to_status_id,...,opinion_key,opinion_label,opinion_annotation_confidence,ego_involvement_key,ego_involvement_label,ego_involvement_annotation_confidence,algorithm_opinion,wordTokenize,tokenLength,msgLen
0,0,1,msnbc,@MSNBC @MaddowBlog âSimpletonâs defenseâ...,1.596988e+18,4,0,Sun Nov 27 22:01:59 +0000 2022,1.51875e+18,1.596987e+18,...,2,AGAINST student loan forgiveness,0.7,1,Somewhat important,0.95,AGAINST student loan forgiveness,"[@MSNBC, @MaddowBlog, â, , , Simpletonâ, , ...",47,195
1,1,2,msnbc,@MSNBC @MaddowBlog I feel sorry for the sucker...,1.596993e+18,0,0,Sun Nov 27 22:22:27 +0000 2022,3202809000.0,1.596987e+18,...,1,NEUTRAL support,0.62,3,cannot judge importance,0.65,NEUTRAL support,"[@MSNBC, @MaddowBlog, I, feel, sorry, for, the...",21,114
2,2,3,msnbc,@MSNBC @MaddowBlog Setting up a 2024 elections...,1.596997e+18,0,0,Sun Nov 27 22:39:00 +0000 2022,140915700.0,1.596987e+18,...,2,AGAINST student loan forgiveness,0.43,2,Not important at all,0.81,AGAINST student loan forgiveness,"[@MSNBC, @MaddowBlog, Setting, up, a, 2024, el...",26,148


In [100]:
# turn df['tweet'] into token variables 

def tokenize_column(df): 
    '''From hw 8'''
    # input data
    # stem = pd.DataFrame(df)

    # iterate each col's row, use a list to add it back to the dataframe
    tokenized_list = []
    tLenList = []
    msgLen = []
    for ind in df.index: 
        msg = df['text'][ind]           #tweet--> text
        # tokens = word_tokenize(msg)
        tokens = TweetTokenizer().tokenize(msg) # https://stackoverflow.com/questions/34714162/preventing-splitting-at-apostrophies-when-tokenizing-words-using-nltk
        # tknzr = TweetTokenizer()
        # tknzr.tokenize("@Kenichan I haven't dived many times for the ball. Man")


        tokenized_list.append(tokens)
        tLenList.append(len(tokens))
        msgLen.append(len(msg))

    df['wordTokenize'] = tokenized_list
    df['tokenLength'] = tLenList
    df['msgLen'] = msgLen

    return df

### Need to remove stop words in tokenize_column()

In [101]:
# add a tokenized list from Twitter text
df1 = tokenize_column(fp)
df1.tail(3)

Unnamed: 0.1,Unnamed: 0,experiment_id,experiment_group,text,tweet_id,tweet_likes,retweets,tweet_created_at,user_id,in_reply_to_status_id,...,opinion_key,opinion_label,opinion_annotation_confidence,ego_involvement_key,ego_involvement_label,ego_involvement_annotation_confidence,algorithm_opinion,wordTokenize,tokenLength,msgLen
465,465,466,usedgov,@usedgov why are my student loans not transfer...,1.599892e+18,0,0,Mon Dec 05 22:24:29 +0000 2022,7.925171e+17,,...,0,FOR student loan forgiveness,0.95,1,Somewhat important,0.79,FOR student loan forgiveness,"[@usedgov, why, are, my, student, loans, not, ...",39,183
466,466,467,foxnews,@FoxNews Just another way of screwing the taxp...,1.599894e+18,0,0,Mon Dec 05 22:32:26 +0000 2022,1.518825e+18,1.599351e+18,...,2,AGAINST student loan forgiveness,0.42,3,cannot judge importance,0.4,AGAINST student loan forgiveness,"[@FoxNews, Just, another, way, of, screwing, t...",44,244
467,467,468,foxnews,@FoxNews The Democrats donât seem to be tryi...,1.599904e+18,0,0,Mon Dec 05 23:09:08 +0000 2022,1.586128e+18,1.599901e+18,...,3,cannot judge support,0.66,0,Very important,0.69,cannot judge support,"[@FoxNews, The, Democrats, donâ, , , t, seem...",40,196


In [102]:
# check how balanced the data is. total count of negative, neutral, and positive sentiment. 
df1['algorithm_opinion'].value_counts() # sentiment --> algorithm_opinion

NEUTRAL support                     193
AGAINST student loan forgiveness    136
FOR student loan forgiveness        120
cannot judge support                 19
Name: algorithm_opinion, dtype: int64

In [103]:
df1 = df1[['wordTokenize', 'algorithm_opinion']]

In [104]:
df1.tail(2)
# fp_data = fb[['wordTokenize', 'algorithm_opinion']]


Unnamed: 0,wordTokenize,algorithm_opinion
466,"[@FoxNews, Just, another, way, of, screwing, t...",AGAINST student loan forgiveness
467,"[@FoxNews, The, Democrats, donâ, , , t, seem...",cannot judge support


In [105]:
def records(df): 
    # https://stackoverflow.com/questions/9758450/pandas-convert-dataframe-to-array-of-tuples
    return df.to_records(index=False).tolist()
df1 = records(df1)

In [82]:
# input need columns wordTokenize, sentiment 
word_features = get_word_features(get_words_in_tweets(sample_data))

In [107]:
def extract_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        """word_features is predefined above as list of 
           3+ letter tokens from all tweets combined"""
        features['contains(%s)' % word] = (word in document_words)
    return features

# apply features to classifier with our feature_extract function 
# it outputs a list of tuple, each tuple holds the "feature dictionary"
training_set = nltk.classify.apply_features(extract_features, df1)

In [108]:
# train our classifier using our training data set
classifier = nltk.NaiveBayesClassifier.train(training_set)


In [109]:
# test it out 
tweet = "@FoxNews He\'s having issues isn't he. He can't pass a ban on pew pews, he can't do student loan forgiveness (kind of intentional btw,) he can't pass gas because his heads in the way of natural progression in his bum. He just can't catch a break man. 😪"

classifier.classify(extract_features(tweet.split()))

'AGAINST student loan forgiveness'

### Apply classifier to Student Loan Twitter Data 

In [110]:
student_data = pd.read_csv('data/master_annotated.csv')
student_data.head(3)

Unnamed: 0.1,Unnamed: 0,experiment_id,experiment_group,text,tweet_id,tweet_likes,retweets,tweet_created_at,user_id,in_reply_to_status_id,...,verified,user_id_char,text_length,text_word_count,opinion_key,opinion_label,opinion_annotation_confidence,ego_involvement_key,ego_involvement_label,ego_involvement_annotation_confidence
0,0,1,msnbc,@MSNBC @MaddowBlog “Simpleton’s defense”? You...,1.596988e+18,4,0,Sun Nov 27 22:01:59 +0000 2022,1.51875e+18,1.596987e+18,...,False,1.51875e+18,183,30,2,AGAINST student loan forgiveness,0.7,1,Somewhat important,0.95
1,1,2,msnbc,@MSNBC @MaddowBlog I feel sorry for the sucker...,1.596993e+18,0,0,Sun Nov 27 22:22:27 +0000 2022,3202809000.0,1.596987e+18,...,False,3202809000.0,114,20,1,NEUTRAL support,0.62,3,cannot judge importance,0.65
2,2,3,msnbc,@MSNBC @MaddowBlog Setting up a 2024 elections...,1.596997e+18,0,0,Sun Nov 27 22:39:00 +0000 2022,140915700.0,1.596987e+18,...,False,140915700.0,148,20,2,AGAINST student loan forgiveness,0.43,2,Not important at all,0.81


In [111]:
test_data = student_data[['text', 'opinion_label']]
test_data.columns = ['tweet', 'sentiment'] # rename so we can use tokenize_column() if needed later 
test_data.head(3)

Unnamed: 0,tweet,sentiment
0,@MSNBC @MaddowBlog “Simpleton’s defense”? You...,AGAINST student loan forgiveness
1,@MSNBC @MaddowBlog I feel sorry for the sucker...,NEUTRAL support
2,@MSNBC @MaddowBlog Setting up a 2024 elections...,AGAINST student loan forgiveness


In [113]:
class_list = []
for row in test_data.index:
    msg = test_data['tweet'][row]
    msg_split = msg.split()
    result = classifier.classify(extract_features(msg_split))
    class_list.append(result)
test_data["algorithm_opinion"] = class_list

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_data["algorithm_opinion"] = class_list


In [114]:
# create column to show if predicted_sentiment is the same as sentiment
conditions = [(test_data['sentiment']==test_data['predicted_sentiment']),
(test_data['sentiment'] != test_data['predicted_sentiment'])]
values = ['yes', 'no']
test_data['match'] = np.select(conditions, values)
test_data

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_data['match'] = np.select(conditions, values)


Unnamed: 0,tweet,sentiment,predicted_sentiment,algorithm_opinion,match
0,@MSNBC @MaddowBlog “Simpleton’s defense”? You...,AGAINST student loan forgiveness,FOR student loan forgiveness,FOR student loan forgiveness,no
1,@MSNBC @MaddowBlog I feel sorry for the sucker...,NEUTRAL support,NEUTRAL support,NEUTRAL support,yes
2,@MSNBC @MaddowBlog Setting up a 2024 elections...,AGAINST student loan forgiveness,NEUTRAL support,NEUTRAL support,no
3,@MSNBC @MaddowBlog If you can't pay off studen...,NEUTRAL support,NEUTRAL support,NEUTRAL support,yes
4,@MSNBC @MaddowBlog The simple defense is why s...,FOR student loan forgiveness,FOR student loan forgiveness,FOR student loan forgiveness,no
...,...,...,...,...,...
463,@FoxNews I don't need any bias media to tell m...,AGAINST student loan forgiveness,AGAINST student loan forgiveness,AGAINST student loan forgiveness,yes
464,@FoxNews He still trying to get college studen...,FOR student loan forgiveness,NEUTRAL support,NEUTRAL support,no
465,@usedgov why are my student loans not transfer...,FOR student loan forgiveness,FOR student loan forgiveness,FOR student loan forgiveness,no
466,@FoxNews Just another way of screwing the taxp...,AGAINST student loan forgiveness,AGAINST student loan forgiveness,AGAINST student loan forgiveness,yes


In [115]:
# count up now many matches
test_data['match'].value_counts()

yes    288
no     180
Name: match, dtype: int64

----------------

In [None]:
# count up now many matches
test_data['match'].value_counts()

yes    96
no     67
Name: match, dtype: int64

There are 96 matches between `predicted_sentiment` and `sentiment` out of 163 test data points. That is 59% accuracy.

Our model is accurate more than 1/2 of the time. Given its constraints, 59% is acceptable. We believe that if future work address the limitations of this model, the result will improve. Below is a list of the model's limitation: 
- not able to use emoticons 
- not recognizing @username as an entity/subject
- no treatment for commas and periods
- treat lower/upper cases differenlty
- special characters and hashtags are still in test data, unaddressed
- needed to remove stop words from the training model
- A larger training data set might yield better result. We only used 0.125% of the provided Sentiment 140 dataset (2K out of 1.6 million rows). 

Finally, the pre-labeled test data could not be neatly categorize. For example, when we sense "hope" in the text, we would label it as positive, even though there are negative sentiment that prefaces the hope/resolution. 
ex: 
>@POTUS since your student loan forgiveness move is not going to pass muster with the courts, why not do something legitimate and fair. Lock all student loans at 1% interest for all existing and future loans. #StudentLoans2022 #loanforgiveness #studentloans #college

The manual we gave this tweet was 'positive' but our model categorizes it as 'negative'. 

In [None]:
contain_values = test_data[test_data['tweet'].str.contains('@POTUS since your student loan forgiveness move is not going to pass muster with the courts')]
contain_values

Unnamed: 0,tweet,sentiment,predicted_sentiment,match
17,@POTUS since your student loan forgiveness mov...,positive,negative,no


-------------


In [15]:
input_path = "prelim_data/training.1600000.processed.noemoticon.csv"
# '/Users/lnguyen/Library/CloudStorage/OneDrive-Personal/JHU/SocialMediaAnalytics_2022fall/Nguyen_Lyn_module8_hw/SocialMediaInsightsforMachineLearning.xlsm'

s140_training = pd.read_csv(input_path, encoding='latin-1')
s140_training.columns = ["polarity", "tweet_id", "date", "query", "user", "tweet"]


In [16]:
# create a list of our conditions
conditions = [(s140_training['polarity'] == 0),
    (s140_training['polarity'] == 2) ,
    (s140_training['polarity'] == 4)]

# create a list of the values we want to assign for each condition
values = ['negative', 'neutral', 'positive']

# create a new column and use np.select to assign values to it using our lists as arguments
s140_training['sentiment'] = np.select(conditions, values)

# display updated DataFrame
s140_training.head(3)

Unnamed: 0,polarity,tweet_id,date,query,user,tweet,sentiment
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...,negative
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...,negative
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire,negative


Looks like the data loads in nicely. Let's reformat the training data to that of the tutorial. We need a list of tuples, where each tuple is an individual tweet. 
ex: [('text here abc', 'positive'), ('the sun is hot', 'positive'), ...]

Then we need to split up the text into tokens. Differing from tutorial, we did not split words and filter those with fewer than 3 letters. Used TweetTokenizer().tokenize instead tokenize().word_tokenize to keep can't as one word and keep handle attached @.

In [17]:
s140_training.columns

Index(['polarity', 'tweet_id', 'date', 'query', 'user', 'tweet', 'sentiment'], dtype='object')

In [19]:
# turn s140_training['tweet'] into token variables 

def tokenize_column(df): 
    '''From hw 8'''
    # input data
    # stem = pd.DataFrame(df)

    # iterate each col's row, use a list to add it back to the dataframe
    tokenized_list = []
    tLenList = []
    msgLen = []
    for ind in df.index: 
        msg = df['tweet'][ind]
        # tokens = word_tokenize(msg)
        tokens = TweetTokenizer().tokenize(msg) # https://stackoverflow.com/questions/34714162/preventing-splitting-at-apostrophies-when-tokenizing-words-using-nltk
        # tknzr = TweetTokenizer()
        # tknzr.tokenize("@Kenichan I haven't dived many times for the ball. Man")


        tokenized_list.append(tokens)
        tLenList.append(len(tokens))
        msgLen.append(len(msg))

    df['wordTokenize'] = tokenized_list
    df['tokenLength'] = tLenList
    df['msgLen'] = msgLen

    return df

In [20]:
# add a tokenized list from Twitter text
df = tokenize_column(s140_training)
df.head(3)

Unnamed: 0,polarity,tweet_id,date,query,user,tweet,sentiment,wordTokenize,tokenLength,msgLen
0,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...,negative,"[is, upset, that, he, can't, update, his, Face...",24,111
1,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...,negative,"[@Kenichan, I, dived, many, times, for, the, b...",20,89
2,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire,negative,"[my, whole, body, feels, itchy, and, like, its...",10,47


In [21]:
# check how balanced the data is. total count of negative, neutral, and positive sentiment. 
df['sentiment'].value_counts()

positive    800000
negative    799999
Name: sentiment, dtype: int64

### Sample the corpus: 
With 1.6 million rows of data, to improve run time, we will sample 2k for now and proceed with the training. We will pick 1K with positive sentiment and 1K with negative sentiment. 

In [22]:
positive_df = df[df['sentiment'] == 'positive']
negative_df = df[df['sentiment'] == 'negative']

In [23]:
# get row count
positive_df.shape[0], negative_df.shape[0]

(800000, 799999)

In [24]:
import random
random.seed(10)

positive_sample = positive_df.sample(n = 1000)
negative_sample = negative_df.sample(n = 1000)

In [25]:
positive_sample.shape[0], negative_sample.shape[0]

(1000, 1000)

In [58]:
# combine positive_sample & negative_sample
sample_data = pd.concat([positive_sample, negative_sample]).shape[0]

2000

In [60]:
sample_data = pd.concat([positive_sample, negative_sample])[['wordTokenize', 'sentiment']]
sample_data

Unnamed: 0,wordTokenize,sentiment
913939,"[@enobytes, drank, a, 2003, I, guess, that, do...",positive
1316439,"[@HellenBach, that, is, good, ,, I, wish, I, c...",positive
1498361,"[@Ambee789, AGREED, &, AGREED, !]",positive
1458986,"[is, going, into, the, kitchen, as, the, smoke...",positive
1393378,"[Scotland, has, a, ', poonia, ', playing, for,...",positive
...,...,...
495237,"[my, toe, hurts, .]",negative
351190,"[Gotta, wait, for, 3, and, half, hours, for, m...",negative
519703,"[I, feel, horrible, ., Pato, is, taking, Kat, ...",negative
517241,"[sitting, sick, at, home, .]",negative


In [28]:
sample_data.tail(2)


Unnamed: 0,wordTokenize,sentiment
517241,"[sitting, sick, at, home, .]",negative
90271,"[Oh, Arsenal, NOT, again, fs, (, Could, Drogba...",negative


In [29]:
def records(df): 
    # https://stackoverflow.com/questions/9758450/pandas-convert-dataframe-to-array-of-tuples
    return df.to_records(index=False).tolist()
sample_data = records(sample_data)

In [61]:
sample_data

Unnamed: 0,wordTokenize,sentiment
913939,"[@enobytes, drank, a, 2003, I, guess, that, do...",positive
1316439,"[@HellenBach, that, is, good, ,, I, wish, I, c...",positive
1498361,"[@Ambee789, AGREED, &, AGREED, !]",positive
1458986,"[is, going, into, the, kitchen, as, the, smoke...",positive
1393378,"[Scotland, has, a, ', poonia, ', playing, for,...",positive
...,...,...
495237,"[my, toe, hurts, .]",negative
351190,"[Gotta, wait, for, 3, and, half, hours, for, m...",negative
519703,"[I, feel, horrible, ., Pato, is, taking, Kat, ...",negative
517241,"[sitting, sick, at, home, .]",negative


In [62]:
word_features = get_word_features(get_words_in_tweets(sample_data))
word_features

ValueError: too many values to unpack (expected 2)

In [31]:
def extract_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        """word_features is predefined above as list of 
           3+ letter tokens from all tweets combined"""
        features['contains(%s)' % word] = (word in document_words)
    return features

# apply features to classifier with our feature_extract function 
# it outputs a list of tuple, each tuple holds the "feature dictionary"
training_set = nltk.classify.apply_features(extract_features, sample_data)

In [32]:
# train our classifier using our training data set
classifier = nltk.NaiveBayesClassifier.train(training_set)


In [33]:
# test it out 
tweet = 'we are happy with the outcome'
classifier.classify(extract_features(tweet.split()))

'positive'

### Apply classifier to Student Loan Twitter Data 

In [35]:
student_data = pd.read_csv('data/master_annotated.csv')


In [None]:
student_data = student_data[['tweetFullText', 'sentiment']]
test_data.columns = ['tweet', 'sentiment'] # rename so we can use tokenize_column() if needed later 
test_data.head(3)

### Apply classifier to Twitter Data

In [108]:
test_data = pd.read_csv('student_sample_4hashtag_900_mod10use.csv')

In [112]:
test_data= test_data[['tweetFullText', 'sentiment']]
test_data.columns = ['tweet', 'sentiment'] # rename so we can use tokenize_column() if needed later 
test_data.head(3)

Unnamed: 0,tweet,sentiment
0,“Who in the hell do they think that they are?”...,negative
1,"""It makes me so angry. They just continue to s...",negative
2,@AdamParkhomenko BREAKING: I AM NOT GETTING $5...,negative


In [121]:
class_list = []
for row in test_data.index:
    msg = test_data['tweet'][row]
    msg_split = msg.split()
    result = classifier.classify(extract_features(msg_split))
    class_list.append(result)
test_data["predicted_sentiment"] = class_list
    

In [123]:
# create column to show if predicted_sentiment is the same as sentiment
conditions = [(test_data['sentiment']==test_data['predicted_sentiment']),
(test_data['sentiment'] != test_data['predicted_sentiment'])]
values = ['yes', 'no']
test_data['match'] = np.select(conditions, values)
test_data

Unnamed: 0,tweet,sentiment,predicted_sentiment,match
0,“Who in the hell do they think that they are?”...,negative,negative,yes
1,"""It makes me so angry. They just continue to s...",negative,positive,no
2,@AdamParkhomenko BREAKING: I AM NOT GETTING $5...,negative,positive,no
3,@cnnbrk As someone who will still owe tens of ...,negative,negative,yes
4,@DrMarkScience Why should I still be paying fo...,negative,negative,yes
...,...,...,...,...
158,Without offering any thing close to a suggesti...,negative,positive,no
159,Yay for those of us with student loans held by...,negative,negative,yes
160,You can and still should apply for One-Time St...,positive,positive,yes
161,🇺🇸🌍 #DemsAbroad writes to @usedgov @FAFSA abou...,negative,negative,yes


In [126]:
# count up now many matches
test_data['match'].value_counts()

yes    96
no     67
Name: match, dtype: int64

There are 96 matches between `predicted_sentiment` and `sentiment` out of 163 test data points. That is 59% accuracy.

Our model is accurate more than 1/2 of the time. Given its constraints, 59% is acceptable. We believe that if future work address the limitations of this model, the result will improve. Below is a list of the model's limitation: 
- not able to use emoticons 
- not recognizing @username as an entity/subject
- no treatment for commas and periods
- treat lower/upper cases differenlty
- special characters and hashtags are still in test data, unaddressed
- needed to remove stop words from the training model
- A larger training data set might yield better result. We only used 0.125% of the provided Sentiment 140 dataset (2K out of 1.6 million rows). 

Finally, the pre-labeled test data could not be neatly categorize. For example, when we sense "hope" in the text, we would label it as positive, even though there are negative sentiment that prefaces the hope/resolution. 
ex: 
>@POTUS since your student loan forgiveness move is not going to pass muster with the courts, why not do something legitimate and fair. Lock all student loans at 1% interest for all existing and future loans. #StudentLoans2022 #loanforgiveness #studentloans #college

The manual we gave this tweet was 'positive' but our model categorizes it as 'negative'. 

In [129]:
contain_values = test_data[test_data['tweet'].str.contains('@POTUS since your student loan forgiveness move is not going to pass muster with the courts')]
contain_values

Unnamed: 0,tweet,sentiment,predicted_sentiment,match
17,@POTUS since your student loan forgiveness mov...,positive,negative,no
