## Sentiment Analysis — Workbook

In this notebook, we're going to learn how to use [VADER](https://github.com/cjhutto/vaderSentiment) (Valence Aware Dictionary and sEntiment Reasoner), a sentiment analysis tool designed for social media. (Read the VADER paper [here](https://www.aaai.org/ocs/index.php/ICWSM/ICWSM14/paper/view/8109/8122).)

We're going to see how well VADER works with our own sentences and with sentences from *The House on Mango Street*. Can we create an accurate plot arc of Sandra Cisneros's coming-of-age novel?

---

## Install and Import Libraries/Packages

Import Pandas and set Pandas display column width to 400 characters

In [None]:
import pandas as pd
pd.options.display.max_colwidth = 400

Install [vaderSentiment package](https://github.com/cjhutto/vaderSentiment) with pip

In [None]:
!pip install vaderSentiment

Import the `SentimentIntensityAnalyzer` and initialize it

In [None]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
sentimentAnalyser = SentimentIntensityAnalyzer()

## Calculate Sentiment Scores

To calculate sentiment scores for a sentence or paragraph, we can use the `.polarity_scores()` method.

In [None]:
sentimentAnalyser.polarity_scores("I like the Marvel movies")

In [None]:
sentimentAnalyser.polarity_scores("I don't like the Marvel movies")

In [None]:
sentimentAnalyser.polarity_scores("I don't *not* like the Marvel movies")
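
Each call to `.polarity_scores()` returns a dictionary with four keys: `neg`, `neu`, `pos`, and `compound`. The compound score is a normalized value between -1 and 1, and the VADER README suggests treating scores of 0.05 or higher as positive and scores of -0.05 or lower as negative. Below is a minimal sketch of that rule. The function name `label_sentiment` and the example dictionary are made up for illustration and are not actual VADER output.

```python
def label_sentiment(compound, threshold=0.05):
    """Turn a compound score into a rough label using the README's suggested cutoffs."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Hypothetical scores dictionary with the same four keys VADER returns
example_scores = {"neg": 0.0, "neu": 0.5, "pos": 0.5, "compound": 0.4}

print(label_sentiment(example_scores["compound"]))
```

You could pass `scores['compound']` from any of the cells above to this function to see which label each sentence would get.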

## Your Turn!

Try out the `sentimentAnalyser` on some sentences of your own!

Experiment with capitalization, punctuation, emojis, historical words, slangy language, poetry, or non-English words. How does VADER handle it? What does VADER seem to do well and not so well?

In [None]:
#Your code here

In [None]:
#Your code here

## Calculate Sentiment Scores for *The House on Mango Street*

To calculate sentiment scores for *The House on Mango Street*, we first need a quick-and-easy way to break the novel up into sentences.

### Install and Import NLTK

Install [NLTK](https://www.nltk.org/), a Python library for text analysis and natural language processing

In [None]:
!pip install nltk

Import nltk and download the model that will help us get sentences

In [None]:
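import nltk
# Download the Punkt sentence tokenizer model
nltk.download('punkt')
# If sent_tokenize later raises a LookupError asking for 'punkt_tab' (recent NLTK releases), run nltk.download('punkt_tab')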

### Load Text and Break Into Sentences

Read in the text file for "Hairs"

In [None]:
text_file = "../texts/literature/House-on-Mango-Street/02-Hairs.txt"
chapter = open(text_file, encoding="utf-8").read()

In [None]:
text_file = "../texts/literature/"
chapter = open(text_file, encoding="utf-8").read()

In [None]:
import math

# Split the chapter loaded above into roughly equal-sized chunks of characters
number_of_chunks = 12
chunk_size = math.ceil(len(chapter) / number_of_chunks)

text_chunks = []
for number in range(0, len(chapter), chunk_size):
    text_chunk = chapter[number:number+chunk_size]
    text_chunks.append(text_chunk)
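
To see how the chunking logic above behaves, here is a small self-contained sketch that runs the same slicing on a short string. The helper name `split_into_chunks` and the sample string are assumptions for illustration only.

```python
import math


def split_into_chunks(text, number_of_chunks):
    """Split text into consecutive character chunks of (almost) equal size."""
    if not text:
        return []
    # Round up so that no characters are left over at the end
    chunk_size = math.ceil(len(text) / number_of_chunks)
    return [text[start:start + chunk_size]
            for start in range(0, len(text), chunk_size)]


sample = "abcdefghij"
chunks = split_into_chunks(sample, 3)
print(chunks)
```

Because the chunk size is rounded up, the last chunk can be shorter than the others, and for some lengths you may get fewer chunks than you asked for. Note also that slicing by character count can cut a sentence or even a word in half.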

To break a string into individual sentences, we can use `nltk.sent_tokenize()`

In [None]:
nltk.sent_tokenize(chapter)

In [None]:
sentences = nltk.sent_tokenize(chapter)

### Calculate Scores for Each Sentence

We can loop through the sentences and calculate sentiment scores for every sentence.

*How would we print just the "compound" score for each sentence?*

In [None]:
for sentence in sentences:
    scores = sentimentAnalyser.polarity_scores(sentence)
    
    print(sentence, '\n', scores, '\n')

### Make DataFrame

A convenient way to make a DataFrame is to first make a list of dictionaries.

Below we loop through the sentences, calculate sentiment scores, and then create a mini-dictionary with the sentence and compound score, which we append to the list `sentence_scores`.

In [None]:
sentence_scores = []
for sentence in sentences:
    scores = sentimentAnalyser.polarity_scores(sentence)
    sentence_scores.append({'sentence': sentence, 'score': scores['compound']})

To make this list of dictionaries into a DataFrame, we can simply use `pd.DataFrame()`

In [None]:
pd.DataFrame(sentence_scores)

Let's examine the sentences from negative to positive sentiment scores.

In [None]:
hairs_df = pd.DataFrame(sentence_scores)
hairs_df.sort_values(by='score')

### Calculate Sentiment Scores By Chapter

To calculate sentiment scores for the sentences in each chapter of *The House on Mango Street*, we need to read in each file individually.

Below we will import `glob` and `Path`, which will allow us to get all the filenames for the chapters and extract the titles.

In [None]:
import glob
from pathlib import Path

Create a list of filenames for every `.txt` file in the directory

In [None]:
directory_path = "../texts/literature/House-on-Mango-Street/"
text_files = glob.glob(f"{directory_path}/*.txt")

Loop through each file in the "House on Mango Street" directory, calculate sentiment scores, and make a list of dictionaries

In [None]:
sentence_scores = []

# Loop through all the filenames
for text_file in text_files:
    
    #Read in the file
    chapter = open(text_file, encoding="utf-8").read()
    #Extract the filename without its extension to use as the chapter title
    title = Path(text_file).stem
    
    #Loop through each sentence in the chapter
    for sentence in nltk.sent_tokenize(chapter):
        #Calculate sentiment scores for sentence
        scores = sentimentAnalyser.polarity_scores(sentence)
        
        #Make mini-dictionary with chapter name, sentence, and sentiment score
        sentence_scores.append({'chapter': title,
                                'sentence': sentence,
                                'score': scores['compound']})
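
The loop above relies on `Path(...).stem` to turn each file path into a chapter title. As a quick sketch of what that attribute does, here it is applied to the "Hairs" path used earlier in this notebook (the variable name `example_path` is only for illustration):

```python
from pathlib import Path

# The same path used earlier to load the "Hairs" chapter
example_path = "../texts/literature/House-on-Mango-Street/02-Hairs.txt"

# .stem keeps the final path component and drops its file extension
title = Path(example_path).stem
print(title)  # prints 02-Hairs
```

Since the stem keeps the leading chapter number, sorting by the `chapter` column should put the chapters in reading order, assuming every filename starts with a zero-padded number like this one.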

Let's create a DataFrame from our list of dictionaries

In [None]:
chapter_df = pd.DataFrame(sentence_scores)
# Make the DataFrame alphabetical by chapter
chapter_df = chapter_df.sort_values(by='chapter')

How would we examine the 15 most negative sentences?

In [None]:
chapter_df...

How would we examine the 15 most positive sentences?

In [None]:
chapter_df...

### Make a Plot Arc

To create a data visualization of sentiment over the course of *The House on Mango Street*, we first need to calculate the average sentiment for each chapter.

In [None]:
chapter_df.groupby('chapter')['score'].mean()

In [None]:
chapter_means = chapter_df.groupby('chapter')['score'].mean().reset_index()
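
To make concrete what the `groupby('chapter')['score'].mean()` step computes, here is a plain-Python sketch of the same per-chapter averaging. The chapter names and scores are made up for illustration, and this is a hand-rolled equivalent, not how Pandas implements it.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows shaped like the dictionaries in sentence_scores
rows = [
    {"chapter": "01-Example", "score": 0.5},
    {"chapter": "01-Example", "score": -0.1},
    {"chapter": "02-Example", "score": 0.0},
    {"chapter": "02-Example", "score": 0.4},
]

# Group the scores by chapter
scores_by_chapter = defaultdict(list)
for row in rows:
    scores_by_chapter[row["chapter"]].append(row["score"])

# Average each chapter's scores
chapter_averages = {chapter: mean(scores)
                    for chapter, scores in scores_by_chapter.items()}

for chapter, average in sorted(chapter_averages.items()):
    print(chapter, round(average, 3))
```

Keep in mind that a chapter-level mean treats every sentence equally, so a few strongly scored sentences can be washed out by many neutral ones.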

#### Bar Chart

In [None]:
import matplotlib.pyplot as plt

In [None]:
ax = chapter_means.plot(x='chapter', y='score', kind='bar',
                        figsize=(13,10), rot=90, title='Sentiment in Mango Street')

# Plot a horizontal line at 0
plt.axhline(y=0, color='orange', linestyle='-')

#### Line Chart

In [None]:
ax = chapter_means.plot(x='chapter', y='score', kind='line',
                        figsize=(13,10), rot=90, title='Sentiment in Mango Street')

#Not all xtick labels will show up in a line plot by default, so we set one tick per chapter explicitly
ax.set_xticks(range(len(chapter_means)))
ax.set_xticklabels(chapter_means['chapter'].unique())

# Plot a horizontal line at 0
plt.axhline(y=0, color='orange', linestyle='-')
plt.show()

### Your Turn! 

How do these plot arcs align with your reading experience of *The House on Mango Street*? Examine some specific chapters and sentences below, and discuss how well VADER seems to be working or not working.

*Note: if you want to read the sentences in order, you can use the `.sort_index()` method*

In [None]:
chapter_df[chapter_df['chapter'].str.contains('Papa-Who')]

Examine another chapter or chapters

In [None]:
chapter_df[chapter_df['chapter'].str.contains('INSERT-PART-OF-CHAPTER-NAME')]

- How well do you think VADER sentiment analysis works with literary texts?
- How do social media posts and literary texts differ in the way they express sentiment? (What is "sentiment", anyway?)
- Could you imagine using sentiment analysis in a project? If so, how?