## Getting started with NLP and Sentiment Analysis ##

*Lauren F. Klein wrote version 1.0 of this notebook, based on [Sentiment Analysis for Exploratory Data Analysis](https://programminghistorian.org/en/lessons/sentiment-analysis) by Zoë Wilkinson Saldaña with additional info by [Parul Pandey](https://medium.com/analytics-vidhya/simplifying-social-media-sentiment-analysis-using-vader-in-python-f9e6ec6fc52f).*

Thus far, we have learned to access APIs, scrape, parse, and clean text. We're ready for NLP.

But what's NLP?

Natural Language Processing (NLP) names a broad range of techniques that involve applying computational analytical methods to text. ("Natural language" in this context just means human language, as opposed to a programming language like Python.) In this unit, we'll explore many popular NLP techniques, beginning with sentiment analysis.

Sentiment analysis is a method of quantifying the "sentiment," or emotional intensity, of words and phrases in a text. Some sentiment analysis tools, including the one we'll be working with today, also factor in the emotional weight of other features of language, such as punctuation marks or emojis. In general, sentiment analysis processes a unit of text (a sentence, a paragraph, a book, an email, a song, a tweet) and outputs scores or other classifications that indicate whether that unit of text conveys a positive or negative sentiment (according to the particular algorithm and dictionary employed). Some tools go so far as to quantify the *degree* of positivity or negativity within a text.

How might this be helpful? A researcher interested in attitudes toward a political event, for example, might use sentiment analysis to characterize how people describe that event on Twitter. Combined with geographic data, sentiment analysis can be used to make comparisons across regions. Combined with demographic data, sentiment analysis can be used to understand how different groups of people view any particular event (or issue, or individual). Sentiment analysis can be easily scaled up, which makes it possible to analyze hundreds of thousands or even millions of speech events.

Like any computational tool, sentiment analysis has limitations that must be taken into account. We'll explore some of those in our readings and Canvas discussion. But when wielded critically *and* creatively, sentiment analysis can lead to interesting results.

### NLTK and VADER ###

You will be using a few tools from Python's [NLTK](https://www.nltk.org/) (short for Natural Language Toolkit) to generate sentiment scores for the corpus that we created in Unit 1: the lyrics of the candidate playlists that we scraped from Genius.com. After completing today's other Jupyter notebook, you should have those lyrics in a directory on your computer.

NLTK is a collection of libraries and tools that help researchers apply computational methods to texts. It's been in development since 2001--almost as old as you (or older)!--and it's used widely in the field of NLP. The tools included in NLTK range from methods of breaking up text into smaller pieces, to identifying whether a word belongs in a given language, to providing sample corpora (that's the plural of corpus) that researchers can use for training and development purposes. We'll be using NLTK a lot in the coming weeks. As with the previous unit, I'll introduce you to its features as we require them for our specific goals.   

Today, we will be using one NLTK tool: [VADER](https://github.com/cjhutto/vaderSentiment) (short for Valence Aware Dictionary and sEntiment Reasoner), which generates positive, negative, and neutral sentiment scores for textual input.

VADER has a lot of advantages over traditional methods of sentiment analysis, including:
* It works well on social media text, yet readily generalizes to multiple domains;
* It doesn’t require training data but is constructed from a generalizable, valence-based, human-curated gold standard sentiment lexicon;
* It is fast enough to be used online with streaming data; and
* It does not severely suffer from a speed-performance tradeoff.

The source of the above is an easy-to-read paper published by the creators of the VADER library, one of whom used to work at Georgia Tech. You can read it [here](http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf).

### Sentiment Lexicons, Sentiment Intensity, and Context-Awareness ###

Traditionally, sentiment analysis worked by comparing text to a list of lexical features (i.e. words) that were determined by people to be either positive or negative. These are known as *sentiment lexicons.* (It's possible to create lexicons for other types of language as well; we'll talk about this more in Unit 3, when we discuss modeling.) 

More recently, tools have improved upon the positive/negative binary by offering more fine-tuned distinctions between varying degrees of positivity and negativity. This is known as *sentiment intensity*, and VADER does this well. For example, VADER scores “comfort” as moderately positive and “euphoria” as extremely positive. 
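To make the contrast between a binary lexicon and an intensity-aware one concrete, here is a minimal sketch in plain Python. The words and numbers below are invented for illustration and are not VADER's actual lexicon entries (VADER's human raters scored words on a scale from -4 to +4, which this toy example loosely imitates). The function name `score_text` is also our own.

```python
# Toy lexicons: these words and numbers are made up for illustration only.
BINARY_LEXICON = {"comfort": 1, "euphoria": 1, "sad": -1, "misery": -1}
INTENSITY_LEXICON = {"comfort": 1.5, "euphoria": 3.5, "sad": -2.0, "misery": -3.0}

def score_text(text, lexicon):
    """Sum the lexicon value of every known word in the text."""
    words = text.lower().split()
    # Strip common punctuation so that "comfort." still matches "comfort".
    return sum(lexicon.get(word.strip('.,!?"'), 0) for word in words)

for phrase in ["Pure comfort.", "Pure euphoria."]:
    print(phrase,
          "| binary:", score_text(phrase, BINARY_LEXICON),
          "| intensity:", score_text(phrase, INTENSITY_LEXICON))
```

The binary lexicon gives both phrases the same score, while the intensity lexicon separates them. That difference is the basic idea behind sentiment intensity.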

VADER also attempts to capture and score textual features common in informal online text such as capitalizations, exclamation points, and emoticons, as shown in the table below:

![VADER table](https://programminghistorian.org/images/sentiment-analysis/sentiment-analysis1.png)

### Caveat Emptor! ###

Like any text analysis tool, VADER should be evaluated critically and in the context of the assumptions it makes about communication. VADER was developed to analyze English language microblogging and social media sites (especially Twitter). This context is more informal than, for instance, political speeches, and more contemporary than, for instance, Shakespeare. But VADER was also developed as a general purpose sentiment analyzer, and the authors’ initial study shows it compares favorably against tools that have been trained for specific domains, use specialized lexicons, or rely on resource-heavy machine learning techniques. That said, sentiment analysis continues to struggle to capture complex sentiments like irony, sarcasm, and mockery, even when the average reader would be able to distinguish between the literal text and its intended meaning.

A few more caveats: while VADER is a good general purpose tool for English language texts, it provides only partial native support for non-English texts (it detects emojis, capitalization, etc., but not word choice). The developers encourage users to use automatic translation to pre-process non-English texts and then input the results into VADER. There might be better tools for non-English language texts.

### Some examples of hard-to-classify sentences ###

* “The premise of the film was great, but it could have been better.”
* “The best I can say about the movie is that it was interesting.”

* What words would you identify as being associated with either positive or negative sentiment?
    * ANSWER HERE
* Would you say that these sentences have a positive or negative sentiment?
    * ANSWER HERE
* What are some reasons that these sentences might be tricky for a sentiment analysis tool?
    * ANSWER HERE

### Enough Talk, Time for Action! ###

To use VADER, we need to import the nltk library and download and install the VADER lexicon. You do it like this:

In [None]:
import nltk
nltk.download('vader_lexicon')
# nltk.download('punkt') # took this out of today's lesson 

In order to get a sense of what VADER can do, let’s calculate the sentiment scores for one of the songs we scraped from Genius.com.

The main component of VADER is its SentimentIntensityAnalyzer, so let's import that too:

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

(If you get a warning about not having twython installed, you can ignore it.)

Technically, SentimentIntensityAnalyzer is a class, which we will use to build our own sentiment analyzer object.

To do so, we call SentimentIntensityAnalyzer() and assign the output - our brand-new sentiment analyzer - to a variable, which we will name ‘sid’.

In [None]:
sid = SentimentIntensityAnalyzer()

By doing this we have given "sid" all of the features of the VADER sentiment analysis object. It has become our sentiment analysis tool, but by a shorter name.

Now, let's open up one of the lyrics files we created. **Be sure that you have a folder titled 'lyrics' in the same folder with this notebook on your computer.**

In [None]:
with open("./lyrics/Aretha-franklin-respect.txt", "r") as file:
    lyrics = file.read()

# and just to be sure, print out what we've loaded in:
lyrics
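If the cell above raises a `FileNotFoundError`, it can help to check what Python actually sees in that folder. The following is a minimal sketch using only the standard library. The function name `list_lyrics_files` is our own, and it assumes the lyrics were saved as `.txt` files.

```python
from pathlib import Path

def list_lyrics_files(folder="lyrics"):
    """Return the sorted names of the .txt files in `folder` (empty list if the folder is missing)."""
    path = Path(folder)
    if not path.is_dir():
        return []
    return sorted(p.name for p in path.glob("*.txt"))

files = list_lyrics_files("lyrics")
if files:
    print(len(files), "lyrics files found, for example:", files[0])
else:
    print("No lyrics found: is the 'lyrics' folder next to this notebook?")
```

File names are case-sensitive on many systems, so compare the printed names carefully against the path you pass to `open`.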

We want what are called ‘polarity scores,’ which describe how positive, negative, or neutral a text is.

Calling the polarity_scores method on sid with our lyrics (or any string) outputs a dictionary with negative, neutral, positive, and compound scores for the input text. Let's do a test with some recent political slogans:

In [None]:
scores = sid.polarity_scores("Make America Great Again")

for key in sorted(scores):
    print('{0}: {1}, '.format(key, scores[key]), end='')

In [None]:
scores = sid.polarity_scores("Stronger Together")

for key in sorted(scores):
    print('{0}: {1}, '.format(key, scores[key]), end='')

In [None]:
scores = sid.polarity_scores("Hope and Change")

for key in sorted(scores):
    print('{0}: {1}, '.format(key, scores[key]), end='')

And now, the lyrics of "Respect":

In [None]:
scores = sid.polarity_scores(lyrics)

for key in sorted(scores):
    print('{0}: {1}, '.format(key, scores[key]), end='')

Amazing! We just performed our first text analysis! 

But how do we analyze the analysis?

VADER collects and scores negative, neutral, and positive words and features (and accounts for factors like negation along the way). The “neg”, “neu”, and “pos” values describe the fraction of weighted scores that fall into each category. In this case, VADER determined our song lyrics to consist of 3.5% negative words/features, 87.5% neutral words/features, and 9% positive words/features. 
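To connect the raw dictionary to the percentages just mentioned, here is a small sketch that converts the three proportions into percentages. The dictionary below simply copies the values described in the text for "Respect" (your own run may differ if your lyrics file differs), and the helper name `as_percentages` is our own.

```python
# Values as described in the text above; not re-computed here.
example_scores = {"neg": 0.035, "neu": 0.875, "pos": 0.09, "compound": 0.9342}

def as_percentages(scores):
    """Convert the neg/neu/pos proportions to percentages, rounded to one decimal."""
    return {key: round(scores[key] * 100, 1) for key in ("neg", "neu", "pos")}

percentages = as_percentages(example_scores)
print(percentages)
# The three proportions should add up to roughly 100%, allowing for rounding.
print("total:", round(sum(percentages.values()), 1))
```

In the notebook, you could pass the `scores` dictionary returned by `sid.polarity_scores(lyrics)` to the same function.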

VADER also sums all weighted scores to calculate a compound value normalized between -1 and 1; this value attempts to describe the overall sentiment of the entire chunk of text from strongly negative (-1) to strongly positive (1). In this case, the VADER analysis describes the song as strongly positive (.9342). We can think of this value as estimating the overall impression of an average listener when considering the song as a whole.

This [post](https://medium.com/analytics-vidhya/simplifying-social-media-sentiment-analysis-using-vader-in-python-f9e6ec6fc52f) has a bit more about how VADER calculates its scores.  
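As a rough sketch of the normalization step: the VADER source code squashes the summed valence `x` into the range -1 to 1 using `x / sqrt(x² + alpha)`, with `alpha` set to 15 by default. The function below re-implements that formula in plain Python for illustration only. It is not a substitute for calling `polarity_scores`, which also applies VADER's other rules before this step.

```python
import math

def normalize(total, alpha=15):
    """Squash a summed valence score into the range -1 to 1."""
    return total / math.sqrt(total * total + alpha)

# Larger sums move closer to -1 or 1 but never quite reach them.
for total in [-10, -1, 0, 1, 4, 10]:
    print(total, "->", round(normalize(total), 4))
```

One consequence of this design is that long texts with many scored words tend to land near -1 or 1, which is worth keeping in mind when comparing a whole song to a single line.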

Let's [listen to the song](https://www.youtube.com/watch?v=6FOUqQt3Kg0) and see if we agree. 

### A Quick Note on Thresholds ###

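It can be helpful to set a minimum threshold for positivity or negativity so that you can classify a text as either positive or negative. The official VADER documentation suggests a threshold of -0.5 and 0.5: to be counted as negative, a text's compound score should be below -0.5, and to be counted as positive, it should be above 0.5. (Recent versions of the VADER README list looser typical values of -0.05 and 0.05, so it is worth checking the documentation for the version you are using.) "Respect" easily meets this threshold.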
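A threshold like this can be written as a few lines of Python. The sketch below assumes the -0.5 and 0.5 cutoffs mentioned above and treats everything in between as "neutral". The function name `classify` and the `threshold` parameter are our own choices, not part of NLTK or VADER.

```python
def classify(compound, threshold=0.5):
    """Label a compound score as positive, negative, or neutral."""
    if compound > threshold:
        return "positive"
    if compound < -threshold:
        return "negative"
    return "neutral"

print(classify(0.9342))  # the compound score reported above for "Respect"
print(classify(0.2))     # a hypothetical score between the two cutoffs
print(classify(-0.7))    # a hypothetical strongly negative score
```

Because the cutoff is a parameter, you can experiment with stricter or looser values and see how many texts change category.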

To get a sense of how "Respect" compares to another song, let's try "Dis Generation" by A Tribe Called Quest.

In [None]:
with open("./lyrics/A-tribe-called-quest-dis-generation.txt", "r") as file:
    lyrics2 = file.read()
    
# and just to be sure, print out what we've loaded in:
lyrics2

In [None]:
scores2 = sid.polarity_scores(lyrics2)

for key in sorted(scores2):
    print('{0}: {1}, '.format(key, scores2[key]), end='')

Wow! VADER sees "Dis Generation" as very negative. Listen for yourself [here](https://www.youtube.com/watch?v=kQaSDJYwdh4). Do you agree?

ANSWER HERE

### Determining Appropriate Scope ###

There isn't much calibration, or pre-processing, required of sentiment analysis. But there is one important aspect of calibration to consider when employing sentiment analysis: the unit of the text being analyzed.

In the case of song lyrics, for example, we might want to analyze the entire song as a single unit, or we might want to analyze each line. 

* What are some research questions for which you might want to look at the entire song as a whole?
* What are some research questions for which you might want to look at one line at a time?

ANSWER HERE

Let's redo our sentiment analysis so that we look at each line of the song individually.

In [None]:
# re-initialize VADER
sid = SentimentIntensityAnalyzer()

# then split our song lyrics into lines broken up by newlines 
lines = lyrics.split('\n') # note handy built-in python string function! 

# let's take a look
print(lines)

In [None]:
# We add the additional step of iterating through the list of lines and 
# calculating and printing polarity scores for each one.

for line in lines:
    print(line)
    scores = sid.polarity_scores(line)
    for key in scores:
        print('{0}: {1}, '.format(key, scores[key]), end='')
    print()

Here you’ll note a much more detailed picture of the sentiment in the song. 
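A long printout of line-by-line scores can be hard to scan, so one option is to sort the lines by score. The sketch below takes the scoring function as a parameter so that it runs on its own. The names `rank_lines` and `toy_scorer` and the demo lines are invented for illustration. In the notebook, you could instead pass `lambda line: sid.polarity_scores(line)["compound"]` as the scorer.

```python
def rank_lines(lines, score_line):
    """Return (score, line) pairs for the non-blank lines, most negative first."""
    scored = [(score_line(line), line) for line in lines if line.strip()]
    return sorted(scored, key=lambda pair: pair[0])

# Stand-in scorer (an assumption, not VADER): +1 for "love", -1 for "hate".
def toy_scorer(line):
    text = line.lower()
    return text.count("love") - text.count("hate")

demo_lines = ["I love this", "", "I hate that", "It is Tuesday"]
ranked = rank_lines(demo_lines, toy_scorer)
print("most negative:", ranked[0])
print("most positive:", ranked[-1])
```

Skipping blank lines keeps the empty strings produced by `split('\n')` from cluttering the results with all-zero scores.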

* What seems interesting?
* Did you notice any errors?
* What are some research questions we could ask of our song lyrics corpus with sentiment analysis?

ANSWER HERE