## Getting started with NLP and Sentiment Analysis ##

Thus far, we have learned to access APIs, scrape, parse, and clean text. We're ready for NLP.

But what's NLP?

Natural Language Processing (NLP) names a broad range of techniques that apply computational analytical methods to text. ("Natural language" in this context just means human language, as opposed to a programming language like Python.) In this unit, we'll explore many popular NLP techniques, beginning with sentiment analysis.

Sentiment analysis is a method of quantifying the "sentiment," or emotional intensity, of words and phrases in a text. Some sentiment analysis tools, including the one we'll be working with today, also factor in the emotional weight of other features of language, such as punctuation marks or emojis. In general, sentiment analysis processes a unit of text (a sentence, a paragraph, a book, an email, a song, a tweet) and outputs scores or other classifications that indicate whether that unit of text conveys a positive or negative sentiment (according to the particular algorithm and dictionary employed). Some tools go as far as to quantify the *degree* of positivity or degree of negativity within a text. 

How might this be helpful? A researcher interested in attitudes toward a political event, for example, might use sentiment analysis to characterize how people describe that event on Twitter. Combined with geographic data, sentiment analysis can be used to make comparisons across regions. Combined with demographic data, sentiment analysis can be used to understand how different groups of people view any particular event (or issue, or individual). Sentiment analysis can be easily scaled up, which makes it possible to analyze hundreds of thousands or even millions of speech events.

Like any computational tool, sentiment analysis has limitations that must be taken into account. We'll explore some of these limitations via our readings and in class. (If you want a deep/fun/nerdy dive into these limitations, see [The Data-Sitters Club: Katia and the Sentiment Snobs](https://datasittersclub.github.io/site/dsc11.html).) In any case, when wielded critically, creatively, and appropriately, sentiment analysis can lead to interesting results.

### NLTK and VADER ###

You will be using a few tools from Python's [NLTK](https://www.nltk.org/) (short for Natural Language Toolkit) to generate sentiment scores for the corpus that we created in Unit 1: the lyrics of the candidate playlists that we scraped from Genius.com. After completing today's other Jupyter notebook, you should have those lyrics in a directory in your "my-work" folder on JupyterHub.

NLTK is a collection of libraries and tools that help researchers apply computational methods to texts. It's been in development since 2001--almost as old as you (or older)!--and it's used widely in the field of NLP. The tools included in NLTK range from methods of breaking up text into smaller pieces, to identifying whether a word belongs in a given language, to providing sample corpora (that's the plural of corpus) that researchers can use for training and development purposes. We'll be using NLTK a lot in the coming weeks. As with the previous unit, I'll introduce you to its features as we require them for our specific goals.   

Today, we will be using one NLTK tool: [VADER](https://github.com/cjhutto/vaderSentiment) (short for Valence Aware Dictionary and sEntiment Reasoner), which generates positive, negative, and neutral sentiment scores for textual input.

VADER has a lot of advantages over traditional methods of sentiment analysis, including:
* It works well on social media text, yet readily generalizes to multiple domains;
* It doesn’t require training data but is constructed from a generalizable, valence-based, human-curated gold standard sentiment lexicon;
* It is fast enough to be used online with streaming data, and;
* It does not severely suffer from a speed-performance tradeoff.

The source of the above is an easy-to-read paper published by the creators of the VADER library, one of whom used to be my colleague at Georgia Tech. You can read it [here](http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf).

### Sentiment Lexicons, Sentiment Intensity, and Context-Awareness ###

Traditionally, sentiment analysis worked by comparing text to a list of lexical features (i.e. words) that were determined by people to be either positive or negative. These are known as *sentiment lexicons.* (It's possible to create lexicons for other types of language as well; we'll talk about this more in the coming few days, as well as in Unit 3, when we discuss modeling in more detail.) 
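To make the idea of a binary sentiment lexicon concrete, here is a minimal sketch. The word lists and the `lexicon_sentiment` function are made up for illustration; real lexicons contain thousands of human-rated entries.

```python
import re

# Tiny, made-up lexicon for illustration only
POSITIVE_WORDS = {"great", "good", "love", "wonderful", "happy"}
NEGATIVE_WORDS = {"bad", "hurt", "terrible", "sad", "hate"}

def lexicon_sentiment(text):
    """Count positive and negative words, then label the text."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(1 for w in words if w in POSITIVE_WORDS)
    neg = sum(1 for w in words if w in NEGATIVE_WORDS)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

print(lexicon_sentiment("What a wonderful, happy day"))
print(lexicon_sentiment("That was a terrible idea"))
```

Notice that this approach treats every listed word as equally strong and ignores everything else about the sentence.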

More recently, tools have improved upon the positive/negative binary by offering more fine-tuned distinctions between varying degrees of positivity and negativity. This is known as *sentiment intensity*, and VADER does this well. For example, VADER scores “comfort” as moderately positive and “euphoria” as extremely positive. 

VADER also attempts to capture and score textual features common in informal online text such as capitalizations, exclamation points, and emoticons, as shown in the table below:

![VADER table](https://programminghistorian.org/images/sentiment-analysis/sentiment-analysis1.png)
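As a rough illustration of how intensity and emphasis could be combined, here is a toy sketch. It is not VADER's actual rule set: the valence values, the boost constants, and the `toy_intensity` function are all invented for this example.

```python
# Made-up valence scores and boosts; VADER's real lexicon is human-rated
VALENCE = {"good": 1.0, "great": 2.0, "bad": -1.0, "awful": -2.0}
CAPS_BOOST = 1.5      # multiplier for ALL-CAPS words (assumed value)
EXCLAIM_BOOST = 0.25  # added per exclamation mark (assumed value)

def toy_intensity(text):
    """Sum word valences, boosting ALL-CAPS words and exclamation marks."""
    score = 0.0
    for token in text.split():
        word = token.strip("!?.,")
        valence = VALENCE.get(word.lower(), 0.0)
        if valence and word.isupper() and len(word) > 1:
            valence *= CAPS_BOOST
        score += valence
    # each "!" pushes the score further from zero, capped at three marks
    emphasis = min(text.count("!"), 3) * EXCLAIM_BOOST
    if score > 0:
        score += emphasis
    elif score < 0:
        score -= emphasis
    return score

for example in ["good", "great", "great!!", "GREAT", "awful!"]:
    print(example, toy_intensity(example))
```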

### Caveat Emptor! ###

Like any text analysis tool, VADER should be evaluated critically and in the context of the assumptions it makes about communication. VADER was developed to analyze English language microblogging and social media sites (especially Twitter). This context is more informal than, for instance, political speeches; and more contemporary than, for instance, Shakespeare. But VADER was also developed as a general purpose sentiment analyzer, and the authors’ initial study shows it compares favorably against tools that have been trained for specific domains, use specialized lexicons, or rely on resource-heavy machine learning techniques. That said, sentiment analysis continues to struggle to capture complex sentiments like irony, sarcasm, and mockery, even when the average reader would be able to distinguish between the literal text and its intended meaning.

A few more caveats: while VADER is a good general purpose tool for English language texts, VADER only provides partial native support for non-English texts (it detects emojis/capitalization/etc., but not word choice). The developers encourage users to use automatic translation to pre-process non-English texts and then input the results into VADER. There might be better tools for non-English language texts. 

### Some examples of hard-to-classify sentences ###

“The premise of the film was great, but it could have been better.”

* What words would you identify as being associated with either positive or negative sentiment?
* Would you say that this sentence has a positive or negative sentiment?
* What are some reasons that this sentence might be tricky for a sentiment analysis tool?

“The best I can say about the movie is that it was interesting.”

* What words would you identify as being associated with either positive or negative sentiment?
* Would you say that this sentence has a positive or negative sentiment?
* What are some reasons that this sentence might be tricky for a sentiment analysis tool?

### Enough Talk, Time for Action! ###

To use VADER, we need to import the nltk library and download and install the VADER lexicon. You do it like this:

In [None]:
import nltk
nltk.download('vader_lexicon')

In order to get a sense of what VADER can do, let’s calculate the sentiment scores for one of the songs we scraped from Genius.com.

The main component of VADER is its SentimentIntensityAnalyzer, so let's import that too:

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

(You can ignore the warning, if you get it, about not having twython installed). 

Technically, SentimentIntensityAnalyzer is a class, which we will use to build our own sentiment analyzer object.

To do so, we call SentimentIntensityAnalyzer() and assign the output - our brand-new sentiment analyzer - to a variable, which we will name ‘sid’.

In [None]:
sid = SentimentIntensityAnalyzer()

By doing this we have given "sid" all of the features of the VADER sentiment analysis object. It has become our sentiment analysis tool, but by a shorter name.

The method of the VADER sentiment analysis object that we want to use is called `polarity_scores`.

Calling the `polarity_scores` method on sid with our lyrics (or any string) outputs a dictionary with negative, neutral, positive, and compound scores for the input text. 

Let's do a quick test of how this works with some Tweets from the current Georgia governor's race:

In [None]:
# https://twitter.com/staceyabrams/status/1572632059272695823 -- most recent tweet by Stacey Abrams as of 9/21/22
scores = sid.polarity_scores("""In Georgia, our differences are our superpower. The 
Latino community is diverse — that means their 
experiences and priorities in our state are also diverse 
whether it be through accessible health care or small businesses""") # remember multiline string formatting!  

for key in scores:
    print(key + ": " + str(scores[key]))

Amazing! We just performed our first text analysis! 

But how do we analyze the analysis?

VADER collects and scores negative, neutral, and positive words and features (and accounts for factors like negation along the way). The “neg”, “neu”, and “pos” values describe the fraction of weighted scores that fall into each category. So in this case, the tweet contains 0% negative features, 91.2% neutral features, and 8.8% positive features.
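One way to picture "the fraction of weighted scores" is the sketch below. It is a simplified illustration loosely modeled on the open-source VADER code, not the library's exact logic; the `proportions` function and the input valences are made up.

```python
def proportions(valences):
    """Split per-token valence scores into neg/neu/pos shares of the total."""
    # Neutral tokens each count as 1, so scored tokens get +1 on top of their
    # magnitude to stay on a comparable footing (an assumption of this sketch).
    pos_sum = sum(v + 1 for v in valences if v > 0)
    neg_sum = sum(abs(v) + 1 for v in valences if v < 0)
    neu_count = sum(1 for v in valences if v == 0)
    total = pos_sum + neg_sum + neu_count
    if total == 0:
        return {"neg": 0.0, "neu": 0.0, "pos": 0.0}
    return {
        "neg": round(neg_sum / total, 3),
        "neu": round(neu_count / total, 3),
        "pos": round(pos_sum / total, 3),
    }

# three neutral tokens and one mildly positive token
print(proportions([0, 0, 0, 1.0]))
```

The three shares describe how much of the text is doing positive, negative, or neutral work. They say nothing about how strong the overall sentiment is.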

But what does the "compound" score mean? Clearly, this is not just the sum of the other three. It's more accurately described as a holistic score of the sentiment of whatever text has been passed in. It ranges from -1 to 1, and if you're curious, [here is one detailed explanation of how it's calculated](https://stackoverflow.com/questions/40325980/how-is-the-vader-compound-polarity-score-calculated-in-python-nltk). 
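The linked explanation describes the compound score as the sum of the individual valence scores, normalized to fall between -1 and 1. Here is a minimal sketch of that normalization step, assuming the formula x / sqrt(x² + α) with α = 15 given there. The input sums are made up.

```python
import math

def normalize(score_sum, alpha=15):
    """Squash an unbounded valence sum into the range (-1, 1)."""
    return score_sum / math.sqrt(score_sum * score_sum + alpha)

for total in (-8.0, -1.0, 0.0, 1.0, 8.0):
    print(total, round(normalize(total), 4))
```

Because the formula flattens out as the sum grows, longer texts with many scored words tend to land near -1 or 1. Keep this in mind when comparing a whole song to a single line.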

Now that we (sort of) know what these scores mean, let's try a few more political tweets:

In [None]:
# https://twitter.com/GovKemp/status/1572637818614325254 -- most recent tweet by Brian Kemp as of 9/21/22
scores = sid.polarity_scores("""(1/3) Fentanyl overdose deaths in Georgia teens are up 
800%, according to @GaDPH's latest report. As these 
poisons continue to flood across the U.S. southern 
border, the consequences of President Biden’s 
inaction on border security are being felt nationwide.""") 

for key in scores:
    print(key + ": " + str(scores[key]))

What do these numbers mean?

In [None]:
# https://twitter.com/HerschelWalker/status/1572335168676831233 -- most recent tweet by Herschel Walker as of 9/21/22

scores = sid.polarity_scores("""Wonderful to be in Jonesboro today at Crane 
Hardware, a successful small business founded in 1972 
that has served the Jonesboro community ever since. 

@ReverendWarnock has hurt our small businesses. I’m 
going to Washington to change that. #gapol #gasen""")

for key in scores:
    print(key + ": " + str(scores[key]))

In [None]:
# https://twitter.com/ReverendWarnock/status/1572661444080013312 -- most recent tweet by Raphael Warnock as of 9/21/22

scores = sid.polarity_scores("""Do you remember 
The 21st night of September? 
Love was changin' the minds of pretenders 
While chasin' the clouds away 🎶 
Happy @EarthWindFire Day!""")

for key in scores:
    print(key + ": " + str(scores[key]))

So this is (hopefully) starting to make sense. Now let's get our Beyoncé lyrics again:

In [None]:
# get the text of some song lyrics
import requests 

resp = requests.get('https://raw.githubusercontent.com/laurenfklein/QTM340-Fall22/main/corpora/lyrics/Beyonce-break-my-soul-lyrics.html')
html_str = resp.text

In [None]:
# use BeautifulSoup to find the lyrics tags on the page
from bs4 import BeautifulSoup
document = BeautifulSoup(html_str, "html.parser")

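# find_all returns every matching tag (a page may split its lyrics across several containers)
lyrics_divs = document.find_all("div", attrs={"data-lyrics-container": "true"})

# convert the contents of these tags into text, preserving linebreaks
lyrics = "\n".join(div.get_text(separator="\n") for div in lyrics_divs)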

In [None]:
# remove the [CHORUS] and [VERSE] annotations

import re

# raw string plus a non-greedy match, so each bracketed annotation is removed on its own
cleaner_lyrics = re.sub(r"\[.*?\]", "", lyrics) # this does not get the few that cross newlines

# the re.S flag lets "." match newlines, so this pass catches annotations that span multiple lines
cleanest_lyrics = re.sub(r"\[.*?\]", "", cleaner_lyrics, flags=re.S)

print(cleanest_lyrics)
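The difference between greedy and non-greedy patterns, and the effect of the `re.S` flag, is easiest to see on a small made-up string. The sample text below is invented for this sketch.

```python
import re

line = "la la [ad lib] la [echo]"
print(repr(re.sub(r"\[.*\]", "", line)))   # greedy: removes from the first [ to the last ]
print(repr(re.sub(r"\[.*?\]", "", line)))  # non-greedy: removes each annotation separately

sample = "[Intro]\nla la la\n[Verse 1:\nGuest Artist]\n" + line

# Without re.S, "." stops at a newline, so the two-line annotation survives
single_line = re.sub(r"\[.*?\]", "", sample)
print(repr(single_line))

# With re.S, "." also matches newlines, so every annotation is removed
multi_line = re.sub(r"\[.*?\]", "", sample, flags=re.S)
print(repr(multi_line))
```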

**One thing I forgot to tell you about in the last lab:**

**This is how you save a file from Colab to Google Drive**

In [None]:
# this is how you save a file from Colab to Google Drive

from google.colab import drive
drive.mount('/content/gdrive')

with open("/content/gdrive/My Drive/beyonce-lyrics.txt", "w") as file:
    file.write(cleanest_lyrics)  # write() is the idiomatic call for a single string

In [None]:
# and this is how you read a file from your Google Drive into Colab 

with open("/content/gdrive/My Drive/beyonce-lyrics.txt", "r") as file: 
    breakmysoul = file.read()

print(breakmysoul)



And now let's see what its sentiment turns out to be:

In [None]:
scores = sid.polarity_scores(breakmysoul)

for key in scores:
    print(key + ": " + str(scores[key]))

What do these numbers tell us?

Let's [listen to the song](https://www.youtube.com/watch?v=yjki-9Pthh0) and see if we agree. 

### A Quick Note on Thresholds ###

It can be helpful to set a minimum threshold for positivity or negativity so that you can classify a string, document, or other unit of text as either positive or negative. The official VADER documentation suggests thresholds of -0.05 and 0.05 on the compound score: a text counts as negative at or below -0.05, as positive at or above 0.05, and as neutral in between. Check the compound score for "Break My Soul" above against these thresholds.
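A threshold check like this can be sketched as a small helper function. The `classify` name is hypothetical. The cutoff is a parameter so you can experiment with stricter values, and the default of 0.05 follows the VADER README.

```python
def classify(compound, threshold=0.05):
    """Label a compound score as positive, negative, or neutral."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

print(classify(0.6))
print(classify(-0.3))
print(classify(0.01))
print(classify(0.3, threshold=0.5))  # a stricter cutoff changes the label
```

With a real result, you would pass in `scores['compound']` from the dictionary that `polarity_scores` returns.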

To get a sense of how "Break My Soul" compares to another song, let's try "As It Was" by Harry Styles:

In [None]:
resp = requests.get('https://raw.githubusercontent.com/laurenfklein/QTM340-Fall22/main/corpora/lyrics/Harry-Styles-as-it-was.txt')
asitwas = resp.text

print(asitwas)

In [None]:
scores2 = sid.polarity_scores(asitwas)

for key in scores2:
    print(key + ": " + str(scores2[key]))

So, VADER sees "As It Was" as neutral-to-positive. Listen for yourself [here](https://www.youtube.com/watch?v=H5v3kku4y6Q). Do you agree?

### Determining Appropriate Scope ###

There isn't much calibration, or pre-processing, required of sentiment analysis. But there is one important aspect of calibration to consider when employing sentiment analysis: the unit of the text being analyzed.

In the case of song lyrics, for example, we might want to analyze the entire song as a single unit, or we might want to analyze each line. 

* What are some research questions for which you might want to look at the entire song as a whole?
* What are some research questions for which you might want to look at the song one line at a time?

Let's redo our sentiment analysis so that we look at each line of the song individually.

In [None]:
# re-initialize VADER
sid = SentimentIntensityAnalyzer()

# then split our song lyrics into lines broken up by newlines 
styleslines = asitwas.split('\n') # note handy built-in python string function! 

# let's take a look
styleslines

In [None]:
# We add the additional step of iterating through the list of lines and 
# calculating and printing polarity scores for each one.

for line in styleslines:
    scores = sid.polarity_scores(line)
    print("Comp. score: " + str(scores['compound']) + " " + line)

Here you’ll note a much more detailed picture of the sentiment in the song. 

* What seems interesting?
* Did you notice any errors?
* What are some research questions we could ask of our song lyrics corpus with sentiment analysis?
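If you want to go further with line-level scores, one option is to skip blank lines and sort the results so the most negative and most positive lines stand out. The sketch below uses a made-up `toy_scorer` so that it runs without NLTK. The `score_lines` helper is hypothetical.

```python
def score_lines(text, scorer):
    """Return (score, line) pairs for each non-blank line, lowest score first."""
    lines = [line.strip() for line in text.split("\n") if line.strip()]
    return sorted((scorer(line), line) for line in lines)

# Stand-in scorer for illustration; with VADER you could instead pass
# lambda line: sid.polarity_scores(line)["compound"]
def toy_scorer(line):
    lowered = line.lower()
    return lowered.count("good") - lowered.count("bad")

ranked = score_lines("good good day\n\nbad night\nplain line", toy_scorer)
print(ranked[0])   # most negative line
print(ranked[-1])  # most positive line
```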

*Lauren F. Klein wrote version 1.0 of this notebook, based on [Sentiment Analysis for Exploratory Data Analysis](https://programminghistorian.org/en/lessons/sentiment-analysis) by Zoë Wilkinson Saldaña with additional info by [Parul Pandey](https://medium.com/analytics-vidhya/simplifying-social-media-sentiment-analysis-using-vader-in-python-f9e6ec6fc52f). It was updated in 2020 by Dan Sinykin and again by Lauren Klein in 2021 and 2022.*

