# Week 10: Sentiment Analysis

Our tasks this week are as follows:
* Learn about sentiment analysis, and learn how to use the sentiment analysis package in TextBlob
* Load a novel into a DataFrame, sentence by sentence
* Record the sentiment values for each sentence in that DataFrame
* Extract the sentences identified as the "happiest" and the "saddest" by the sentiment analysis system
* Plot the raw values for sentiment in the novel

# Sentiment Analysis demo

Before we get into how to use TextBlob, let's play around with sentiment analysis a little bit, shall we?

## TextBlob (Default, Lexicon-based)

The [documentation for TextBlob](https://textblob.readthedocs.io/en/dev/) isn't the best, but the default sentiment system is based on a tool called [pattern](https://github.com/clips/pattern), which employs a sentiment lexicon — a list of words with values, many of them hand-coded. 
- You can see the source code [here](https://github.com/sloria/TextBlob/blob/6396e24e85af7462cbed648fee21db5082a1f3fb/textblob/en/__init__.py#L8) (around line 80): it basically averages the sentiment scores for all the words in the span, and applies some rule-based heuristics to identify negations. 
- You can see the full lexicon [here](https://github.com/sloria/TextBlob/blob/6396e24e85af7462cbed648fee21db5082a1f3fb/textblob/en/en-sentiment.xml); it's mostly adjective-based. 
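
To get a feel for what "averaging the scores and handling negations" means, here is a minimal sketch of a lexicon-based scorer. It is not TextBlob's or pattern's actual code: the `toy_lexicon` values, the `negations` set, the `toy_polarity` function, and the rule that a negation flips and halves the next scored word are all assumptions made up for illustration.

```python
# Toy lexicon with made-up scores (NOT the values in TextBlob's real lexicon)
toy_lexicon = {"greatest": 1.0, "lovely": 0.5, "hate": -0.8, "stupid": -0.8}

# Hypothetical list of negation words for this sketch
negations = {"not", "never"}


def toy_polarity(text):
    """Average the scores of known words; a negation flips and halves the next scored word."""
    scores = []
    negate = False
    for word in text.lower().split():
        word = word.strip(".,!?")
        if word in negations:
            negate = True
            continue
        if word in toy_lexicon:
            score = toy_lexicon[word]
            if negate:
                score = -0.5 * score  # made-up negation rule for this sketch
                negate = False
            scores.append(score)
    if not scores:
        return 0.0  # no known words: treat the text as neutral
    return sum(scores) / len(scores)


print(toy_polarity("Neil Young"))
print(toy_polarity("I hate his stupid voice"))
print(toy_polarity("That is not the greatest"))
```

Notice that a text with no words from the lexicon scores as neutral, which is worth keeping in mind when you look at the results below.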

In [None]:
from textblob import TextBlob

In [None]:
TextBlob("Neil Young is the greatest artist to come out of this country").polarity

In [None]:
TextBlob("I hate Neil Young and his stupid, whiny voice").polarity

In [None]:
TextBlob("Sometimes I feel like Neil Young is the greatest singer of his generation").polarity

In [None]:
TextBlob("Neil Young isn’t the worst Canadian musician").polarity

In [None]:
TextBlob("Oh yeah, Neil Young’s voice is as lovely as Josh Groban’s").polarity

In [None]:
TextBlob("Hating on amazing music isn’t something I’m known for").polarity

In [None]:
TextBlob("Neil Young").polarity

In [None]:
TextBlob("That would be the greatest misfortune of all").polarity

## TextBlob (Naive Bayes Classifier)

TextBlob has a second sentiment system, which uses a machine learning approach: a naive Bayes classifier trained on a set of movie reviews. 

TextBlob actually allows us to make our OWN naive Bayes classifiers... so let's make one to get a sense of how they work. (This example is from Nick Montfort's book *Exploratory Programming for the Arts and Humanities*, and it follows TextBlob's [tutorial "Building a Text Classification system"](https://textblob.readthedocs.io/en/dev/classifiers.html).)

First, we create our "training data" — a list containing a bunch of *tuples*, which are like two-item mini lists, each of which here contains some text and a label, `pos` or `neg`. Then we run the classifier on this training data.

In [None]:
sentiments = [
    ('Wittgenstein wrote one of the greatest philosophical works ever, an incredible contribution.', 'pos'),
    ('The Oulipo is a radical, pioneering group that has shaped literary history.', 'pos'),
    ('What an awesome sunset.', 'pos'),
    ('I love it!', 'pos'),
    ('Very good plan.', 'pos'),
    ('The final season of Game of Thrones made my eyes bleed.', 'neg'),
    ('Movies based on DC comic books are extremely tiresome.', 'neg'),
    ('That is a horrible idea.', 'neg'),
    ('I hate that sort of thing.', 'neg'),
    ('You lack imagination.', 'neg')]

In [None]:
from textblob.classifiers import NaiveBayesClassifier
cl = NaiveBayesClassifier(sentiments)

TextBlob will tell us what it considers the "most informative features" (aka words) in our training data. What do you think they will be?

In [None]:
cl.show_informative_features(10)
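
As a rough intuition for why some words count as "informative", here is a sketch in plain Python that finds the words appearing in only one label's training texts. This is a simplification and an assumption on my part: a real naive Bayes classifier works with probabilities rather than a simple set difference, and `tiny_training_data` and the other variable names are made up for this example.

```python
from collections import Counter

# A tiny, made-up training set in the same (text, label) format as above
tiny_training_data = [
    ("I love it!", "pos"),
    ("Very good plan.", "pos"),
    ("That is a horrible idea.", "neg"),
    ("I hate that sort of thing.", "neg"),
]

# Count, for each label, how many training texts contain each word
texts_containing_word = {"pos": Counter(), "neg": Counter()}
for text, label in tiny_training_data:
    unique_words = {word.strip(".,!?").lower() for word in text.split()}
    texts_containing_word[label].update(unique_words)

# Words seen under only one label are the strongest hints in this toy setup
only_in_pos = set(texts_containing_word["pos"]) - set(texts_containing_word["neg"])
only_in_neg = set(texts_containing_word["neg"]) - set(texts_containing_word["pos"])

print(sorted(only_in_pos))
print(sorted(only_in_neg))
```

A word like "I" shows up under both labels here, so it tells us nothing about which label a new sentence should get.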

In [None]:
cl.classify("Neil Young is the greatest artist to come out of this country")

In [None]:
cl.classify("I hate Neil Young and his stupid, whiny voice.")

In [None]:
cl.classify("Hating on amazing music isn’t something I’m known for.")

To use TextBlob's sentiment system based on a naive Bayes classifier trained on a [movie reviews dataset](http://www.cs.cornell.edu/people/pabo/movie-review-data/), we need to import it and then pass the `analyzer=NaiveBayesAnalyzer()` parameter to the commands we used above.

In [None]:
from textblob.sentiments import NaiveBayesAnalyzer

In [None]:
TextBlob("Neil Young is the greatest artist to come out of this country", analyzer=NaiveBayesAnalyzer()).sentiment

In [None]:
TextBlob("I hate Neil Young and his stupid, whiny voice.", analyzer=NaiveBayesAnalyzer()).sentiment

In [None]:
TextBlob("Sometimes I feel like Neil Young is the greatest singer of his generation", analyzer=NaiveBayesAnalyzer()).sentiment

In [None]:
TextBlob("Neil Young isn’t the worst Canadian musician", analyzer=NaiveBayesAnalyzer()).sentiment

In [None]:
TextBlob("Oh yeah, Neil Young’s voice is as lovely as Josh Groban’s", analyzer=NaiveBayesAnalyzer()).sentiment

In [None]:
TextBlob("Hating on amazing music isn’t something I’m known for", analyzer=NaiveBayesAnalyzer()).sentiment

In [None]:
TextBlob("That would be the greatest misfortune of all", analyzer=NaiveBayesAnalyzer()).sentiment

# Meet TextBlob!

Okay, let's now properly meet TextBlob: a Python library specifically designed for working with text. As you'll see, it very easily does a lot of things that we've been doing the hard way. But we did need to learn how to program in Python!

Let's start by importing TextBlob, which we accomplish as follows:

In [None]:
from textblob import TextBlob

The way we work with TextBlob is first by "blobbing" a string of text (aka, turning it from a string to a TextBlob object). This is done by passing the string as an argument to the `TextBlob` function.

In [None]:
text = "It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife."

In [None]:
pride_blob = TextBlob(text)

In [None]:
type(pride_blob)

# Using TextBlob to Tokenize Strings and Split Them Into Sentences

Once a text is blobbed, we can start calling the special TextBlob methods on it. Note that the ones we'll use today don't take arguments, and indeed don't even have the usual method syntax of being followed by `()` (technically, they are *properties* rather than methods), which I personally find a bit ugly.

Let's look at two to start with:
- `blob.words`: This tokenizes the string, turning it into words. We've been accomplishing this with Python's built-in `string.split()` for many weeks now, then doing some extra stuff like removing punctuation with regular expressions. TextBlob does it all in one fell swoop, and does a good job with it, although we get less control over the process, and I personally prefer our previous method (can you see why??). The object it returns behaves like a `list`.
- `blob.sentences`: This returns all the sentences in a string. We've been accomplishing this with `string.split(".")`. TextBlob does a somewhat better job: it relies on NLTK's sentence tokenizer, so it also splits on `?` and `!`, though it can still be confused by abbreviations like `per cent.`. The object it returns again behaves like a `list`.
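
The limits of the plain `string.split(".")` approach mentioned above can be seen in a quick sketch. It uses only built-in Python, and the `sample` string is made up for illustration.

```python
# A made-up sample with a question, an exclamation, and an abbreviation
sample = "Who is there? It is I! He paid five per cent. more than that."

# Split on full stops only, then tidy up whitespace and drop empty pieces
naive_sentences = [piece.strip() for piece in sample.split(".") if piece.strip()]

for sentence in naive_sentences:
    print(sentence)
```

The question and the exclamation stay glued together in one "sentence", while the abbreviation `per cent.` causes a split in the wrong place.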

In [None]:
# Calling `words` with parentheses raises a TypeError: it is a property, not a method
pride_blob.words()

In [None]:
pride_blob.words

In [None]:
type(pride_blob.words)

In [None]:
pride_blob.words[0]

In [None]:
for word in pride_blob.words:
    print(word.upper())

In [None]:
sot4 = open("sign-of-four.txt", encoding="utf-8").read()

In [None]:
sot4_blob = TextBlob(sot4)

In [None]:
sot4_blob.words[255:269]

In [None]:
sot4_blob.sentences[9:20]

# TextBlob Word Counts... and Python Dictionaries

TextBlob has another useful method, `blob.word_counts`, which returns a list of all the words in a document, along with a count for each of those words.

In [None]:
pride_blob.word_counts

In [None]:
sot4_blob.word_counts

Above, I called that a "list" of all the words in the strings, along with word counts... But in terms of the **Python data type** returned by the `blob.word_counts` method, well, that's not a `list` at all, but rather a **dictionary (`dict`)**... a data type we've been skirting around for a few weeks now.

We used **dictionaries** in the Week 8 Supplemented Lecture to generate our `gender_signal` column, and you played around with them in the Week 8 lecture. But now it's time to properly meet them.

## Python Dictionaries

As [Melanie Walsh explains](https://melaniewalsh.github.io/Intro-Cultural-Analytics/02-Python/11-Dictionaries.html), dictionaries are mainly differentiated from `list`s by their use of **key-value pairs**. Whereas we access items in a list by their index position, we access the **values** of items in a dictionary by their **key**. 

Python dictionaries are always surrounded by curly brackets `{ }`. You can make a dictionary in this manner:

```
variable_name = {
   'key1': value1,
   'key2': value2,
   'key3': value3,
}
```

Note:
- Keys are usually `string`s (though other immutable types, such as numbers, also work); values can be of any data type.
- A `,` comes between each key-value pair you define.
- You don't need to arrange things like this typographically, with key-value pairs each on their own line, but it does make things look prettier.

In [None]:
carnivores = {
    'python': 'a large heavy-bodied nonvenomous snake that kills prey by constriction and asphyxiation',
    'panda': 'a large bearlike mammal that, while technically a carnivore, is in practice a vegetarian, eating only bamboo',
    'blob': 'amoeboidal alien that envelops living beings, asphyxiating them'
}

In [None]:
type(carnivores)

In [None]:
carnivores

In [None]:
carnivores = {'python': 'a large heavy-bodied nonvenomous snake that kills prey by constriction and asphyxiation','panda': 'a large bearlike mammal that, while technically a carnivore, is in practice a vegetarian, eating only bamboo','blob': 'amoeboidal alien that envelops living beings'}

In [None]:
carnivores

You can see all the keys in a dictionary by calling the `dict.keys()` method, all the values in a dictionary by calling the `dict.values()` method, and all the key-value pairs in a dictionary by calling `dict.items()`.

In [None]:
carnivores.keys()

In [None]:
carnivores.values()

In [None]:
carnivores.items()

## Accessing Items in a Dictionary

Similarly to the way that we access items in a list, we can access items in a dictionary with square brackets `[]` and the **key name** of the value we want to extract.

In [None]:
carnivores['python']

In [None]:
carnivores['panda']

In [None]:
carnivores['blob']

## Changing Values and Adding Key-Value Pairs

This is accomplished as follows:

In [None]:
carnivores['blob'] = "a third-party Python library that slowly kills you by sucking up all of your time, because the textual analysis it facilitates is so fascinating"

In [None]:
carnivores['blob']

In [None]:
carnivores['kitten'] = "a delightful, fuzzy creature whose natural prey is cat food (dry or wet) and, especially, treats"

In [None]:
carnivores['kitten']

In [None]:
carnivores.values()

## Nested Dictionaries

I said earlier that the value of a particular key could be any data type... and that includes a dictionary. Yes, you can have dictionaries within dictionaries. Indeed, that's how our name-gender count list works in the `gender_signal` task.

In [None]:
name_counts = {
    'Adam': {'F': 0, 'M': 1},
    'Marta': {'F': 1, 'M': 0},
    'Rosie': {'F': 1, 'M': 0},
    'Jazz': {'F': 1, 'M': 0}
}

In [None]:
name_counts['Jazz']

In [None]:
name_counts['Jazz']['F']

## Iterating Through Dictionaries

You can iterate through dictionaries. Looping over a dictionary directly gives you its keys, but it is clearer to specify, by calling the appropriate method, whether you want to iterate over keys, values, or key-value pairs.

In [None]:
for key in carnivores.keys():
    print(f"I am so afraid of {key.upper()}S!!!!")

In [None]:
for value in carnivores.values():
    print(f"Did you know there is a kind of carnivore that is {value}???")

In [None]:
for key, value in carnivores.items():
    print(f"A {key} is {value}")

## Back to `blob.word_counts`!

So... as I said, TextBlob's `word_counts` method produces a dictionary-like object, in which each key is a unique word in the string, and each value is a count of how many times that word occurs in the string.

In [None]:
sot4_counts = sot4_blob.word_counts

In [None]:
type(sot4_counts)

In [None]:
sot4_counts['cocaine']
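
To make the idea of a word-count dictionary concrete, here is a sketch of building one by hand in plain Python. It is only an illustration, not TextBlob's implementation; the `sample` string and the `counts` variable are made up for the example.

```python
# A made-up sample string for illustration
sample = "the cat saw the other cat"

counts = {}
for word in sample.split():
    # dict.get returns 0 if we have not seen this word yet
    counts[word] = counts.get(word, 0) + 1

print(counts["cat"])
print(counts)
```

Each key appears only once, however many times the word occurs in the string; the repeats are recorded in the value.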

By the way, since `blob.word_counts` produces a dictionary-like object in which each key is a unique word... can you tell me the one-line command we could use to calculate the TTR (type-token ratio) of any TextBlob object?

In [None]:
# We'll figure this one out together...

# Sentiment Analysis in TextBlob

Okay, it's finally time to get back to the thing we really want to do in TextBlob: use its sentiment analysis package! 

This is accessible with the `blob.sentiment`, `blob.polarity`, and `blob.subjectivity` methods.

In [None]:
pride_blob.sentiment

In [None]:
pride_blob.polarity

In [None]:
pride_blob.subjectivity

Today we are going to focus on sentiment polarity (how positive or negative, happy or sad, a particular span of text is).

In [None]:
TextBlob("My life is ruined and I am miserable").polarity

In [None]:
TextBlob("My life is amazing and I am overjoyed").polarity

In [None]:
TextBlob("My life is not ruined and I am not miserable").polarity

In [None]:
TextBlob("My life is not amazing and I am not overjoyed").polarity

In [None]:
TextBlob("It's kind of like a potato").polarity

# Creating a DataFrame of Polarity Values for *The Sign of the Four*

We now have pretty much all the pieces in place to accomplish our task: creating a DataFrame in which each row contains a sentence from *The Sign of the Four* and the TextBlob polarity and subjectivity score for that sentence. Let's go!

We will create three parallel lists:
- one containing the text of every sentence, in the form of a `string`
- one containing a polarity value for each sentence, in the form of a `float`
- one containing a subjectivity value for each sentence, also in the form of a `float`

How would we do this, using skills we learned back in the first half of the course?

## Using `blob.sentences`

Let's start by examining the output of TextBlob's `blob.sentences` method more closely, so we get a better sense of how we'll produce our three desired lists.

In [None]:
sot4_sentences_blob = sot4_blob.sentences

In [None]:
type(sot4_sentences_blob)

In [None]:
sot4_sentences_blob[22]

In [None]:
type(sot4_sentences_blob[22])

In [None]:
sot4_sentences_blob[22].polarity

In [None]:
sot4_polarities = []

for sentence in sot4_sentences_blob:
    sot4_polarities.append(sentence.polarity)

In [None]:
sot4_polarities[:10]

In [None]:
sot4_subjectivities = []

for sentence in sot4_sentences_blob:
    sot4_subjectivities.append(sentence.subjectivity)

In [None]:
sot4_subjectivities[:10]

In [None]:
sot4_sentences_blob[22]

In [None]:
sot4_sentences_blob[22].raw

In [None]:
type(sot4_sentences_blob[22].raw)

In [None]:
sot4_sentences_blob[0]

In [None]:
sot4_sentences_blob[0].raw

Since that output is a bit ugly, with all those `\n\n\n`s, let's create our `string` of each sentence in a slightly different way: by using Python's `string.join()` method, which we met wayyyyy back in Week 3 (go look if you don't believe me!). 

Here, we'll use `string.join()` to join together all the `blob.word`s with spaces, which gives us a pretty string to work with.

In [None]:
sot4_sentences_blob[0].words

In [None]:
" ".join(sot4_sentences_blob[0].words)

In [None]:
type(" ".join(sot4_sentences_blob[0].words))

In [None]:
sot4_sentences = []

for sentence in sot4_sentences_blob:
    sot4_sentences.append(" ".join(sentence.words))

In [None]:
sot4_sentences[:10]

# Creating a DataFrame from Three Parallel Lists

Okay, we have all the contents of our desired DataFrame.

- A list containing all the sentences of *The Sign of the Four*, in order
- A list containing the polarity values for each of those sentences, in order
- A list containing the subjectivity values for each of those sentences, in order

Our friend Pandas allows us to quite easily make a new DataFrame out of this kind of data, with its `pd.DataFrame()` method.

The `pd.DataFrame()` method takes as its argument... **a dictionary**! (See why we had to finally learn about dictionaries??). It expects this argument to be formatted as follows:

```
new_df = pd.DataFrame(
    {
        'column1': list1,
        'column2': list2,
        'column3': list3
    }
)
```

Of course, you could also write this same command without all the tabs and newlines as follows:

`new_df = pd.DataFrame({'column1': list1, 'column2': list2, 'column3': list3})`


In [None]:
import pandas as pd

In [None]:
sot4_sentence_sentiment_df = pd.DataFrame({
    'sentence': sot4_sentences,
    'polarity': sot4_polarities,
    'subjectivity': sot4_subjectivities
})

In [None]:
sot4_sentence_sentiment_df

Let's now have a look at the sentences that TextBlob considers the most positive, as well as the most negative ones...

In [None]:
sot4_sentence_sentiment_df.sort_values(by='polarity', ascending=False)[:15]

Pretty hard to read what's in the `sentence` column! We could export it to a CSV and explore it in Excel or Google Sheets... or we can set this Pandas parameter so that there is no maximum column width, and it will just show us everything!

In [None]:
# None removes the column width limit (older pandas versions used -1)
pd.set_option('display.max_colwidth', None)

In [None]:
sot4_sentence_sentiment_df.sort_values(by='polarity', ascending=False)[:15]

In [None]:
sot4_sentence_sentiment_df.sort_values(by='polarity', ascending=True)[:15]

## Preview of Next Time: Plotting Polarity Values

Let's quickly do a plot of the polarity values for all the sentences in *The Sign of the Four*. What does this plot tell you? Does it correspond to your sense of the "emotional trajectory" of the novel? What works about it, and what doesn't? How could it be improved? 

In [None]:
# The column name is lowercase, matching the DataFrame we created above
sot4_sentence_sentiment_df[['polarity']].plot(figsize=(20,8))