# Jonathan Bunch

26 September 2021

Bellevue University

DSC550-T301

---

# Week 4 Exercise: Sentiment Analysis

In [31]:
# Import libraries.
import nltk
import pandas as pd
from nltk.corpus import opinion_lexicon
from nltk.sentiment.vader import SentimentIntensityAnalyzer

## Load the data file DailyComments.csv from the Week 4 Data Files into a data frame.

In [32]:
df_daily = pd.read_csv("DailyComments.csv")
df_daily.head()

Unnamed: 0,Day of Week,comments
0,Monday,"Hello, how are you?"
1,Tuesday,Today is a good day!
2,Wednesday,It's my birthday so it's a really special day!
3,Thursday,Today is neither a good day or a bad day!
4,Friday,I'm having a bad day.


## Identify a scheme to categorize each comment as positive or negative. You can devise your own scheme or find a commonly used scheme to perform this sentiment analysis. However you decide to do this, make sure to explain the scheme you decide to use.

I decided to try a simple word count method for categorizing the comments. The nltk corpus includes pre-categorized
lists of positive and negative words (opinion_lexicon), which I compare to the tokenized words from each comment.
Words that match those in the opinion lexicon are counted, and the category with more matches is the predicted
sentiment. A tie (including a comment with no matches at all) is labeled neutral.

For comparison, I also tried the implementation of the VADER model in nltk.sentiment.vader. This method returns a
normalized "compound" sentiment score between -1 and 1 for the provided text. Positive values represent positive
sentiment, negative values represent negative sentiment, and the distance from zero indicates the intensity of the
sentiment.

In [33]:
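# Build the lexicon sets once, so each token is checked with a fast set lookup instead of
# scanning the full word lists again for every token.
# Requires the NLTK data packages for opinion_lexicon and the word tokenizer (see nltk.download).
POSITIVE_WORDS = set(opinion_lexicon.positive())
NEGATIVE_WORDS = set(opinion_lexicon.negative())


def sentiment_by_count(text):
    """
    This function tokenizes the provided string and compares the tokens to pre-sorted lists of words indicating
    positive or negative sentiment (courtesy of the nltk corpus). The words matching each list are counted and the
    counts are compared, and we return the name of the sentiment with the greater number of matches, or "neutral"
    when the counts are equal. Matching is case-sensitive.
    """
    word_list = nltk.word_tokenize(text)
    pos = sum(word in POSITIVE_WORDS for word in word_list)
    neg = sum(word in NEGATIVE_WORDS for word in word_list)
    if pos > neg:
        return "positive"
    elif pos < neg:
        return "negative"
    else:
        return "neutral"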


def sentiment_by_vader(text):
    """This function returns the normalized (compound) sentiment score for the provided text. It uses methods
    from the nltk.sentiment.vader module, provided by:
    Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for
    Sentiment Analysis of Social Media Text. Eighth International Conference on
    Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014."""
    sent_int = SentimentIntensityAnalyzer()
    compound_score = sent_int.polarity_scores(text)['compound']
    return compound_score
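
One limitation of the word count scheme is that the comparison is case-sensitive, so a capitalized word such as "BEST" will not match a lowercase lexicon entry. The sketch below illustrates a case-insensitive variant. It is self-contained and is not the function used for the results in this notebook: `TOY_POSITIVE` and `TOY_NEGATIVE` are small hypothetical stand-ins for the opinion_lexicon word lists, `sentiment_by_count_ci` is a name made up for illustration, and the regular-expression tokenizer is a simplification of nltk.word_tokenize.

```python
import re

# Hypothetical stand-ins for opinion_lexicon.positive() / opinion_lexicon.negative().
TOY_POSITIVE = {"good", "best", "nice"}
TOY_NEGATIVE = {"bad", "worst", "late"}


def sentiment_by_count_ci(text, positive=TOY_POSITIVE, negative=TOY_NEGATIVE):
    """Case-insensitive word count sentiment (illustrative sketch only)."""
    # Lowercase first, then pull out runs of letters and apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(token in positive for token in tokens)
    neg = sum(token in negative for token in tokens)
    if pos > neg:
        return "positive"
    if pos < neg:
        return "negative"
    return "neutral"


print(sentiment_by_count_ci("YOU ARE THE BEST AIRWAYS!!!"))  # positive
print(sentiment_by_count_ci("Today is neither a good day or a bad day!"))  # neutral
```

Lowercasing discards the emphasis that capitalization carries, which is information VADER is designed to use, so this variant trades one limitation for another.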


## Implement your sentiment analysis with code and display the results. Note: DailyComments.csv is a purposely small file, so you will be able to clearly see why the results are what they are.

In [34]:
# Apply the sentiment_by_count function, and store the results in a new dataframe column.
df_daily['sentiment_by_count'] = df_daily.comments.apply(sentiment_by_count)
df_daily

Unnamed: 0,Day of Week,comments,sentiment_by_count
0,Monday,"Hello, how are you?",neutral
1,Tuesday,Today is a good day!,positive
2,Wednesday,It's my birthday so it's a really special day!,neutral
3,Thursday,Today is neither a good day or a bad day!,neutral
4,Friday,I'm having a bad day.,negative
5,Saturday,There' s nothing special happening today.,neutral
6,Sunday,Today is a SUPER good day!,positive


In [35]:
# Apply the sentiment_by_vader function the same way.
df_daily['vader_scores'] = df_daily.comments.apply(sentiment_by_vader)
df_daily

Unnamed: 0,Day of Week,comments,sentiment_by_count,vader_scores
0,Monday,"Hello, how are you?",neutral,0.0
1,Tuesday,Today is a good day!,positive,0.4926
2,Wednesday,It's my birthday so it's a really special day!,neutral,0.5497
3,Thursday,Today is neither a good day or a bad day!,neutral,-0.735
4,Friday,I'm having a bad day.,negative,-0.5423
5,Saturday,There' s nothing special happening today.,neutral,-0.3089
6,Sunday,Today is a SUPER good day!,positive,0.8327
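
To compare the VADER scores with the word count labels, the compound score can be bucketed into the same three categories. The sketch below assumes cutoffs of +0.05 and -0.05, which are the thresholds commonly suggested for the VADER compound score; they are an assumption added here and were not part of the analysis above. The helper name `label_from_compound` is hypothetical, and the scores are copied from the table above.

```python
def label_from_compound(score, threshold=0.05):
    """Map a VADER compound score to a label (the threshold is an assumed convention)."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


# Compound scores copied from the vader_scores column above.
daily_scores = {
    "Monday": 0.0,
    "Tuesday": 0.4926,
    "Wednesday": 0.5497,
    "Thursday": -0.735,
    "Friday": -0.5423,
    "Saturday": -0.3089,
    "Sunday": 0.8327,
}

daily_labels = {day: label_from_compound(score) for day, score in daily_scores.items()}
for day, label in daily_labels.items():
    print(day, label)
```

Under these assumed cutoffs, Wednesday becomes positive and Thursday and Saturday become negative, whereas the word count method labeled all three neutral.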


## For up to 5% extra credit, find another set of comments, e.g., some tweets, and perform the same sentiment analysis.

I will try applying these functions to a Twitter dataset I found at:
https://www.kaggle.com/crowdflower/twitter-airline-sentiment

In [36]:
# Import dataset.
df_t_raw = pd.read_csv("Tweets.csv")
df_t_raw

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0000,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0000,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0000,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0000,Can't Tell,1.0000,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14635,569587686496825344,positive,0.3487,,0.0000,American,,KristenReenders,,0,@AmericanAir thank you we got on a different f...,,2015-02-22 12:01:01 -0800,,
14636,569587371693355008,negative,1.0000,Customer Service Issue,1.0000,American,,itsropes,,0,@AmericanAir leaving over 20 minutes Late Flig...,,2015-02-22 11:59:46 -0800,Texas,
14637,569587242672398336,neutral,1.0000,,,American,,sanyabun,,0,@AmericanAir Please bring American Airlines to...,,2015-02-22 11:59:15 -0800,"Nigeria,lagos",
14638,569587188687634433,negative,1.0000,Customer Service Issue,0.6659,American,,SraJackson,,0,"@AmericanAir you have my money, you change my ...",,2015-02-22 11:59:02 -0800,New Jersey,Eastern Time (US & Canada)


In [37]:
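# Pre-process data: keep only tweets whose sentiment label has a confidence of 1,
# then select and rename the two columns of interest.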
df_t_raw = df_t_raw[df_t_raw.airline_sentiment_confidence == 1]
df_tweets = df_t_raw.loc[:, ['text', 'airline_sentiment']]
df_tweets = df_tweets.rename(columns={'text': 'tweet', 'airline_sentiment': 'provided_sentiment'})

# I will take a small sample to work with for now.
df_tweets = df_tweets.sample(n=10, random_state=7).reset_index(drop=True)
df_tweets

Unnamed: 0,tweet,provided_sentiment
0,@USAirways please have your people hold flight...,neutral
1,@united second time flying into Houston and 45...,negative
2,@USAirways YOU ARE THE BEST AIRWAYS!!! FOLLOW ...,positive
3,@united finally got through and they were very...,positive
4,@USAirways thanks for making me miss my meetin...,negative
5,@USAirways what is going on with the computers...,negative
6,@JetBlue can't change it. True blue points and...,negative
7,@united WHERE IS MY RECEIPT! I upgraded return...,negative
8,@united flight 4567 have already been Cancelle...,negative
9,@JetBlue is gettin fancy! #Mint #LieFlat Nice ...,positive


In [38]:
# Apply the sentiment_by_count function.
df_tweets['sentiment_by_count'] = df_tweets.tweet.apply(sentiment_by_count)
df_tweets

Unnamed: 0,tweet,provided_sentiment,sentiment_by_count
0,@USAirways please have your people hold flight...,neutral,neutral
1,@united second time flying into Houston and 45...,negative,neutral
2,@USAirways YOU ARE THE BEST AIRWAYS!!! FOLLOW ...,positive,neutral
3,@united finally got through and they were very...,positive,positive
4,@USAirways thanks for making me miss my meetin...,negative,negative
5,@USAirways what is going on with the computers...,negative,negative
6,@JetBlue can't change it. True blue points and...,negative,neutral
7,@united WHERE IS MY RECEIPT! I upgraded return...,negative,positive
8,@united flight 4567 have already been Cancelle...,negative,negative
9,@JetBlue is gettin fancy! #Mint #LieFlat Nice ...,positive,positive


In [39]:
# Apply the sentiment_by_vader function.
df_tweets['vader_scores'] = df_tweets.tweet.apply(sentiment_by_vader)
df_tweets

Unnamed: 0,tweet,provided_sentiment,sentiment_by_count,vader_scores
0,@USAirways please have your people hold flight...,neutral,neutral,0.3802
1,@united second time flying into Houston and 45...,negative,neutral,0.0
2,@USAirways YOU ARE THE BEST AIRWAYS!!! FOLLOW ...,positive,neutral,0.7788
3,@united finally got through and they were very...,positive,positive,0.7479
4,@USAirways thanks for making me miss my meetin...,negative,negative,0.3182
5,@USAirways what is going on with the computers...,negative,negative,-0.533
6,@JetBlue can't change it. True blue points and...,negative,neutral,-0.3252
7,@united WHERE IS MY RECEIPT! I upgraded return...,negative,positive,-0.5449
8,@united flight 4567 have already been Cancelle...,negative,negative,-0.25
9,@JetBlue is gettin fancy! #Mint #LieFlat Nice ...,positive,positive,0.4753
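
Because this dataset includes a provided sentiment label, the two methods can be checked against it. The sketch below counts how often each method agrees with `provided_sentiment` for the ten sampled tweets, using the values copied from the table above. Converting VADER scores to labels requires cutoffs; the +0.05 and -0.05 thresholds used here are an assumed convention and not part of the original analysis, and `label_from_compound` is a hypothetical helper name.

```python
def label_from_compound(score, threshold=0.05):
    """Map a VADER compound score to a label (the threshold is an assumed convention)."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


# (provided_sentiment, sentiment_by_count, vader_scores) copied from the table above.
rows = [
    ("neutral", "neutral", 0.3802),
    ("negative", "neutral", 0.0),
    ("positive", "neutral", 0.7788),
    ("positive", "positive", 0.7479),
    ("negative", "negative", 0.3182),
    ("negative", "negative", -0.533),
    ("negative", "neutral", -0.3252),
    ("negative", "positive", -0.5449),
    ("negative", "negative", -0.25),
    ("positive", "positive", 0.4753),
]

# Count the rows where each method's label matches the provided label.
count_hits = sum(provided == counted for provided, counted, _ in rows)
vader_hits = sum(provided == label_from_compound(score) for provided, _, score in rows)

print(f"word count: {count_hits}/{len(rows)}")  # word count: 6/10
print(f"VADER: {vader_hits}/{len(rows)}")  # VADER: 7/10
```

With only ten tweets, this difference is too small to conclude that one method is better; a larger sample would be needed for a meaningful comparison.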
