## Source

[Zoë Wilkinson Saldaña, "Sentiment Analysis for Exploratory Data Analysis," Programming Historian 7 (2018), https://doi.org/10.46430/phen0079.](https://programminghistorian.org/en/lessons/sentiment-analysis)

## Reflection

Sentiment analysis uses natural language processing to extract subjective information from a textual corpus. It does so by attempting to quantify the emotional intensity of specific words and phrases in the text at hand. In other words, it uses quantitative methods to capture qualitative information. In this lesson from the Programming Historian, sentiment analysis serves as the basis for an exploratory data analysis of the Enron E-mail Corpus. Exploratory data analysis is a strategy of summarizing a dataset and pointing out features of interest that would otherwise likely be overlooked. The case study draws on the Enron scandal, in which the company was exposed for fraud; the Enron E-mail Corpus contains over 600,000 messages.

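To make the idea of quantifying emotional intensity concrete, here is a minimal sketch of lexicon-based scoring. The `TOY_LEXICON` values and the `toy_sentiment` function are invented for illustration; they are not VADER's actual lexicon or rules, which also account for things like negation, punctuation, and intensifiers.

```python
import re

# Hypothetical valence scores, invented for illustration only.
# Negative numbers mark negative words, positive numbers mark positive words.
TOY_LEXICON = {
    "frustrated": -2.0,
    "afraid": -1.5,
    "screwed": -2.5,
    "reasonable": 1.0,
    "faith": 1.5,
}

def toy_sentiment(text):
    """Sum the valence of every known word in the text; unknown words count as 0."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(TOY_LEXICON.get(word, 0.0) for word in words)

print(toy_sentiment("I am getting very frustrated with this process."))  # -2.0
print(toy_sentiment("I am genuinely trying to be as reasonable as possible."))  # 1.0
```

A real tool replaces the five-word dictionary with thousands of human-rated entries, but the basic move is the same: look words up, combine their scores, and report a number.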

TODO:
I was not able to follow the discussion of the math underlying the methods here, but the code was not too complicated to follow once I sat down and really read through it. One of the main things I learned was how complex it can be to get data into the right "shape" for analysis. All of the dictionaries used here were complex, but I see how it was necessary to put the data together in that way. Another thing I learned was that "stopwords," which are often unimportant for other kinds of analysis, are very important in stylometry because authors often use them in unique and identifiable ways.

## Code

## Preparing the Data for Analysis

```python
# first, we import the relevant modules from the NLTK library
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')  # one-time download of the lexicon VADER needs
```

```python
# next, we initialize VADER so we can use it within our Python script
sid = SentimentIntensityAnalyzer()
```

```python
# the variable 'message_text' now contains the text we will analyze.
message_text = '''Like you, I am getting very frustrated with this process. I am genuinely trying to be as reasonable as possible. I am not trying to "hold up" the deal at the last minute. I'm afraid that I am being asked to take a fairly large leap of faith after this company (I don't mean the two of you -- I mean Enron) has screwed me and the people who work for me.'''
```

```python
print(message_text)

# Calling the polarity_scores method on sid and passing in the message_text outputs a dictionary with negative, neutral, positive, and compound scores for the input text
scores = sid.polarity_scores(message_text)
```

```
Like you, I am getting very frustrated with this process.
I am genuinely trying to be as reasonable as possible.
I am not trying to "hold up" the deal at the last minute.
I'm afraid that I am being asked to take a fairly large leap
of faith after this company (I don't mean the two of you --
I mean Enron) has screwed me and the people who work for me.
```

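Because `polarity_scores` returns a plain dictionary, the same call can be repeated over many messages, which is the step that exploratory analysis of a whole corpus relies on. The sketch below is one possible way to do that; `rank_by_compound` is a hypothetical helper, not something from the lesson, and it takes the scoring function as an argument so it does not depend on any particular analyzer.

```python
def rank_by_compound(messages, score_fn):
    """Return (compound, message) pairs sorted from most negative to most positive.

    score_fn is any callable that returns a dict with a 'compound' key,
    for example the polarity_scores method of an initialized VADER analyzer.
    """
    scored = [(score_fn(text)["compound"], text) for text in messages]
    return sorted(scored, key=lambda pair: pair[0])

# Usage with VADER (assumes `sid` and `message_text` exist as defined in this notebook):
# ranked = rank_by_compound([message_text, "Thanks, this looks great!"], sid.polarity_scores)
# print(ranked[0])  # the most negative message and its compound score
```

Sorting by the compound score is only one possible summary; the point is that each message becomes a row of numbers that can be compared with the others.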

```python
# Here we loop through the keys contained in scores (pos, neu, neg, and compound scores) and print the key-value pairs on the screen
for key in sorted(scores):
    print('{0}: {1}, '.format(key, scores[key]), end='')
```

```
compound: -0.3804, neg: 0.093, neu: 0.836, pos: 0.071, 
```
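To read these numbers: `neg`, `neu`, and `pos` are the proportions of the text that fall into each category, and `compound` is a single normalized score between -1 (most negative) and +1 (most positive). The sketch below checks the proportions against the output above and turns the compound score into a coarse label. The `label_from_compound` helper is hypothetical, and the ±0.05 cutoffs are the convention suggested in the VADER project's documentation rather than anything this lesson prescribes.

```python
import math

# The scores printed above for the Enron message
printed_scores = {"compound": -0.3804, "neg": 0.093, "neu": 0.836, "pos": 0.071}

def label_from_compound(compound, threshold=0.05):
    """Map a compound score (-1 to 1) onto a coarse label.

    The 0.05 default is a convention, not a rule; adjust it for your corpus.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# neg, neu, and pos are proportions, so they should add up to about 1 (up to rounding)
total = printed_scores["neg"] + printed_scores["neu"] + printed_scores["pos"]
print(math.isclose(total, 1.0, abs_tol=0.01))  # True

print(label_from_compound(printed_scores["compound"]))  # negative
```

Note that the message is mostly neutral by proportion (0.836) even though its overall compound score is clearly negative, so the two kinds of numbers answer different questions.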