### Module 10 Assignment 

Lyn Nguyen Nov. 2022

Design a sentiment analysis classifier using the **Sentiment 140** corpus and **NLTK**. Test the classifier using content from Twitter and Reddit. Describe any limitations of your sentiment analyzer. Turn in Python code as a Jupyter for the classifier.


http://help.sentiment140.com/for-students

- data: trainingandtestdata folder 
	
http://www.laurentluce.com/posts/twitter-sentiment-analysis-using-python-and-nltk/

- how to put together a sentiment analysis classifier

In [None]:
import pandas as pd
import nltk
import numpy as np
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize


### CLASSIFIER
We get a list of features (words) and their frequencies next. 

In [185]:
# CLASSIFIER
import nltk

def get_words_in_tweets(tweets):  
    """smush all the words in the tweets into a single list"""
    all_words = []
    for (words, sentiment) in tweets:
      all_words.extend(words)
    return all_words


def get_word_features(wordlist):
    """ Outputs dictionary, although 
        no frequency count shows up (wordlist)"""
    wordlist = nltk.FreqDist(wordlist)  # FreqDist({'word1': 3, 'word2': 1, etc.}) ordered from most freq to least
    word_features = wordlist.keys()
    return word_features 

word_features = get_word_features(get_words_in_tweets(tweets))

In [186]:
def extract_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        """word_features is predefined above as list of 
           3+ letter tokens from all tweets combined"""
        features['contains(%s)' % word] = (word in document_words)
    return features

## FINAL PROJECT TRIAL 

In [188]:
# STOP WORDS - from topic_modeling_11
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
stop_words.update(['.',  ',', '"', "'", '?', '!', ':', ';', '(', ')', '[', ']', '{', '}','year', # remove it if you need punctuation
'#', '://', '/', 'www', '-', 'com', '=', '...', 'org', 'https', '@', '&', "'", '"', 'msnbc', 'foxnews', 'npr', 'nytimes', 'cnn', 'usedgov']) # added for this assignment


In [189]:
# turn df['tweet'] into token variables 

def tokenize_column(df): 
    '''From hw 8'''
    # input data
    # stem = pd.DataFrame(df)

    # iterate each col's row, use a list to add it back to the dataframe
    tokenized_list = []
    tLenList = []
    msgLen = []
    for ind in df.index: 
        msg = df['text'][ind]           #tweet--> text
        # tokens = word_tokenize(msg)
        tokens = TweetTokenizer().tokenize(msg) # https://stackoverflow.com/questions/34714162/preventing-splitting-at-apostrophies-when-tokenizing-words-using-nltk
        # tknzr = TweetTokenizer()
        # tknzr.tokenize("@Kenichan I haven't dived many times for the ball. Man")

        # tokenized_list.append(tokens) <-- uncomment for non-stopwords token
        # tLenList.append(len(tokens))
        # msgLen.append(len(msg))


        # remove stopwords 
        ts = [i.lower() for i in tokens if i.lower not in stop_words]
        tokenized_list.append(ts)
        tLenList.append(len(ts))
        msgLen.append(len(ts))
        

    df['wordTokenize'] = tokenized_list
    df['tokenLength'] = tLenList
    df['msgLen'] = msgLen

    return df

## Get training data 

In [197]:
input_path = "data/master_annotated.csv"
fp = pd.read_csv(input_path, encoding='latin-1')

In [198]:
# add a tokenized list from Twitter text
df1 = tokenize_column(fp)
df1.tail(3)

Unnamed: 0.1,Unnamed: 0,experiment_id,experiment_group,text,tweet_id,tweet_likes,retweets,tweet_created_at,user_id,in_reply_to_status_id,...,text_word_count,opinion_key,opinion_label,opinion_annotation_confidence,ego_involvement_key,ego_involvement_label,ego_involvement_annotation_confidence,wordTokenize,tokenLength,msgLen
465,465,466,usedgov,@usedgov why are my student loans not transfer...,1.599892e+18,0,0,Mon Dec 05 22:24:29 +0000 2022,7.925171e+17,,...,33,0,FOR student loan forgiveness,0.95,1,Somewhat important,0.79,"[@usedgov, why, are, my, student, loans, not, ...",39,39
466,466,467,foxnews,@FoxNews Just another way of screwing the taxp...,1.599894e+18,0,0,Mon Dec 05 22:32:26 +0000 2022,1.518825e+18,1.599351e+18,...,45,2,AGAINST student loan forgiveness,0.42,3,cannot judge importance,0.4,"[@foxnews, just, another, way, of, screwing, t...",44,44
467,467,468,foxnews,@FoxNews The Democrats donât seem to be tryi...,1.599904e+18,0,0,Mon Dec 05 23:09:08 +0000 2022,1.586128e+18,1.599901e+18,...,35,3,cannot judge support,0.66,0,Very important,0.69,"[@foxnews, the, democrats, donâ, , , t, seem...",40,40


In [199]:
# check how balanced the data is. total count of negative, neutral, and positive sentiment. 
df1['opinion_label'].value_counts() # sentiment --> opinion_label

NEUTRAL support                     193
AGAINST student loan forgiveness    136
FOR student loan forgiveness        120
cannot judge support                 19
Name: opinion_label, dtype: int64

In [200]:
df1 = df1[['wordTokenize', 'opinion_label']]

In [201]:
df1.tail(2)
# fp_data = fb[['wordTokenize', 'opinion_label']]


Unnamed: 0,wordTokenize,opinion_label
466,"[@foxnews, just, another, way, of, screwing, t...",AGAINST student loan forgiveness
467,"[@foxnews, the, democrats, donâ, , , t, seem...",cannot judge support


In [213]:
import random
random.seed(1033)
training_count = len(df1)*.8
# training = df1.sample(n = training_count)

# https://www.geeksforgeeks.org/how-to-do-train-test-split-using-sklearn-in-python/

from sklearn.model_selection import train_test_split

X = df1.get('wordTokenize')
y = df1.get('opinion_label')

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 1002, test_size = .2, shuffle = False)

training_data = pd.concat([X_train, y_train], axis = 1)

In [214]:
def records(df): 
    # https://stackoverflow.com/questions/9758450/pandas-convert-dataframe-to-array-of-tuples
    return df.to_records(index=False).tolist()
training_data = records(training_data)

In [215]:
# input need columns wordTokenize, sentiment 
word_features = get_word_features(get_words_in_tweets(training_data))

In [217]:
def extract_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        """word_features is predefined above as list of 
           3+ letter tokens from all tweets combined"""
        features['contains(%s)' % word] = (word in document_words)
    return features

# apply features to classifier with our feature_extract function 
# it outputs a list of tuple, each tuple holds the "feature dictionary"
training_set = nltk.classify.apply_features(extract_features, training_data)

In [218]:
# train our classifier using our training data set
classifier = nltk.NaiveBayesClassifier.train(training_set)


In [219]:
# test it out 
tweet = "@FoxNews He\'s having issues isn't he. He can't pass a ban on pew pews, he can't do student loan forgiveness (kind of intentional btw,) he can't pass gas because his heads in the way of natural progression in his bum. He just can't catch a break man. 😪"

classifier.classify(extract_features(tweet.split()))

'AGAINST student loan forgiveness'

### Apply classifier to Student Loan Twitter Data 

In [None]:
# student_data = pd.read_csv('data/master_annotated.csv')
# student_data.head(3)

In [None]:
# Test Data 
test_data = pd.concat(X_test, y_test)


In [None]:
test_data = student_data[['text', 'opinion_label']]
test_data.head(3)

In [220]:
class_list = []
for row in test_data.index:
    msg = test_data['text'][row]
    msg_split = msg.split()
    result = classifier.classify(extract_features(msg_split))
    class_list.append(result)
test_data["predicted_sentiment"] = class_list

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_data["predicted_sentiment"] = class_list


In [221]:
test_data.head(3)

Unnamed: 0,text,opinion_label,predicted_sentiment,match
0,@MSNBC @MaddowBlog “Simpleton’s defense”? You...,AGAINST student loan forgiveness,AGAINST student loan forgiveness,yes
1,@MSNBC @MaddowBlog I feel sorry for the sucker...,NEUTRAL support,NEUTRAL support,yes
2,@MSNBC @MaddowBlog Setting up a 2024 elections...,AGAINST student loan forgiveness,AGAINST student loan forgiveness,yes


In [222]:
# create column to show if predicted_sentiment is the same as sentiment
conditions = [(test_data['opinion_label']==test_data['predicted_sentiment']),
(test_data['opinion_label'] != test_data['predicted_sentiment'])]
values = ['yes', 'no']
test_data['match'] = np.select(conditions, values)
test_data

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_data['match'] = np.select(conditions, values)


Unnamed: 0,text,opinion_label,predicted_sentiment,match
0,@MSNBC @MaddowBlog “Simpleton’s defense”? You...,AGAINST student loan forgiveness,AGAINST student loan forgiveness,yes
1,@MSNBC @MaddowBlog I feel sorry for the sucker...,NEUTRAL support,NEUTRAL support,yes
2,@MSNBC @MaddowBlog Setting up a 2024 elections...,AGAINST student loan forgiveness,AGAINST student loan forgiveness,yes
3,@MSNBC @MaddowBlog If you can't pay off studen...,NEUTRAL support,NEUTRAL support,yes
4,@MSNBC @MaddowBlog The simple defense is why s...,FOR student loan forgiveness,NEUTRAL support,no
...,...,...,...,...
463,@FoxNews I don't need any bias media to tell m...,AGAINST student loan forgiveness,NEUTRAL support,no
464,@FoxNews He still trying to get college studen...,FOR student loan forgiveness,NEUTRAL support,no
465,@usedgov why are my student loans not transfer...,FOR student loan forgiveness,AGAINST student loan forgiveness,no
466,@FoxNews Just another way of screwing the taxp...,AGAINST student loan forgiveness,NEUTRAL support,no


In [223]:
# count up now many matches
test_data['match'].value_counts() # 342/468 = 73% correct  

yes    342
no     126
Name: match, dtype: int64

In [None]:
# ACCURACY RATE 

----------------

## EDIT AFTER MODEL COMPLETED

There are 96 matches between `predicted_sentiment` and `sentiment` out of 163 test data points. That is 59% accuracy.

Our model is accurate more than 1/2 of the time. Given its constraints, 59% is acceptable. We believe that if future work address the limitations of this model, the result will improve. Below is a list of the model's limitation: 
- not able to use emoticons 
- not recognizing @username as an entity/subject
- no treatment for commas and periods
- treat lower/upper cases differenlty
- special characters and hashtags are still in test data, unaddressed
- needed to remove stop words from the training model
- A larger training data set might yield better result. We only used 0.125% of the provided Sentiment 140 dataset (2K out of 1.6 million rows). 

Finally, the pre-labeled test data could not be neatly categorize. For example, when we sense "hope" in the text, we would label it as positive, even though there are negative sentiment that prefaces the hope/resolution. 
ex: 
>@POTUS since your student loan forgiveness move is not going to pass muster with the courts, why not do something legitimate and fair. Lock all student loans at 1% interest for all existing and future loans. #StudentLoans2022 #loanforgiveness #studentloans #college

The manual we gave this tweet was 'positive' but our model categorizes it as 'negative'. 

In [None]:
contain_values = test_data[test_data['tweet'].str.contains('@POTUS since your student loan forgiveness move is not going to pass muster with the courts')]
contain_values

-------------
