<a href="https://colab.research.google.com/github/lmrhody/femethodsS23/blob/main/week13notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 13 Notebook: Analyzing Sentiment 

Name:

Date:

Class:


## What is Sentiment Analysis? 
* Also called opinion mining.
* Measures the author's attitude toward the subject (tone).
* Measuring emotion raises further questions:
  - How much emotion is expressed?
  - Toward what is it directed?
  - Is it negated?

Opinion and emotion are treated as equivalent for the purposes of measurement in sentiment analysis. Linguists continue to work on harder problems, such as detecting sarcasm and refining how negation is measured.

## 3 Elements of Sentiment 
### Opinion / Emotion
#### Polarity
    - positive (+)
    - negative (-) 
    - sometimes, though not often, neutral (=)
    
#### Multi-class
    - joy 
    - surprise
    - anger 
    - love
    
#### Quantitative
    - Likert scale
    - rating
    - numerical grading, number of likes, endorsements, etc. 
    
### Subject
    - What is discussed? 
        - Book, movie, song, product, service, teacher... 
    - Can be mixed 
        - e.g., I like the main character of the novel, but the plot was slow and the dialogue was flat.
    - In other words, the level of granularity at which the sentiment is leveled becomes part of the test. 
        
### Opinion Holder or Entity
    - Who holds the opinion? 
    - What do we know about the opinion holder?
    - How much does it matter? 

_Sentiment analysis does not work well with null values._
    
## How is Sentiment Analysis used?
Social listening is a job category in which people are paid to use data such as Amazon reviews, Twitter hashtags, and Rate My Professor reviews to identify what is being discussed (including at what granularity), how it is being discussed (opinion equated to sentiment / emotion), and by whom. Sentiment analysis is performed on single words, single sentences, and collections of words, as well as on social media datasets, blog posts, online forums, and the news.

It is meant to _enrich_ an assessment of a brand and the opinions held about it. The same tools have also been used in social science research and literary studies: for example, Matthew Jockers's analysis of _forms of the novel_ using the Syuzhet package (an R package described here: https://cran.r-project.org/web/packages/syuzhet/vignettes/syuzhet-vignette.html).

## Question for reflection: How does "sentiment analysis" work in a rhetorical analysis? 
Consider how we approach questions of "sentiment" (arguably, tone) in a close-reading "rhetorical analysis," and what the slippages might be between that form of close reading and the "operationalized" reading that measures language's sentimental or emotional weight.


## Question for reflection: Consider what it means to bring a metric and method designed for consumer research into the humanities? What are the potential advantages or disadvantages of doing so? 
As a starting point, you might consider that many scholars of text, especially Marxist scholars, have pointed to the tensions between the market value of texts (consumer interest, purchasing, and production) and the _literary tastes_ of "high fiction." Or, one might consider the tensions between mass market production of books and reader reception theory in conversation with the marketing and reception or "literariness" of small press literary production.





In [None]:
import nltk
import pandas as pd
import numpy as np
import sklearn 

In [None]:
# import dataset (a sample from IMDB)
movies = 'IMDB_sample.csv'

In [None]:
# convert the IMDB comma separated values file into a pandas dataframe
movies_df = pd.read_csv(movies)

In [None]:
# Let's look at how the dataframe is organized by displaying the first 5 entries. 

movies_df.head()

| Unnamed: 0.1 | Unnamed: 0 | review | label |
|---|---|---|---|
| 0 | 18245 | This short spoof can be found on Elite's Mille... | 0 |
| 1 | 19801 | A singularly unfunny musical comedy that artif... | 0 |
| 2 | 3222 | An excellent series, masterfully acted and dir... | 1 |
| 3 | 6010 | The master of movie spectacle Cecil B. De Mill... | 1 |
| 4 | 16355 | I was gifted with this movie as it had such a ... | 0 |


So, when we look at this dataframe, what do we see? Each row has a unique identifier (the index), which begins at 0. Next comes an unnamed column. Its number is probably either the number of the review or the number of characters, but we don't know because it isn't labeled. Then, the text of each review can be found in the "review" column. Finally, the "label" column holds a series of 0s and 1s. These 0s and 1s represent _polarity_ data: a binary distinction between "positive" (1) and "negative" (0) reviews. The word "label" is confusing because "review" is, technically, also a label (a column name). Try not to get hung up on that. For now, just remember that the "label" column is where the "sentiment" measure is held.

In [None]:
# What if we wanted to know how many "positive" and "negative" reviews are in the dataset?
# We call the name of the dataframe, name the column we want to act on, and then say what
# action we'd like to perform. Here, value_counts counts how many times each value occurs.
# It also reports the data type (dtype) of the counts. If you eventually want to do a sentiment analysis,
# you would want a roughly even distribution of labels. So, this is a layer of exploratory work before you get started.

movies_df.label.value_counts()

0    3782
1    3719
Name: label, dtype: int64

In [None]:
# So, you know the numbers... but maybe it's hard to tell whether or not the two groups of data are proportionate. 
# Another way to look at your data would be to turn those label values into proportions. We do that by creating
# a mathematical equation. This says, take the label column from the movies dataframe and count the values for each
# unique label. Then, divide those numbers by the total number of rows in the movies_df dataframe. Here's how that looks:

movies_df.label.value_counts()/len(movies_df)

0    0.504199
1    0.495801
Name: label, dtype: float64
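
The same proportions can also be obtained by asking `value_counts` to do the division itself. The following is a minimal sketch on a small made-up dataframe; `toy_df` is a hypothetical stand-in for `movies_df`, not part of the IMDB sample.

```python
import pandas as pd

# Hypothetical stand-in for movies_df: three negative (0) reviews and one positive (1)
toy_df = pd.DataFrame({"label": [0, 0, 1, 0]})

# normalize=True returns each label's share of the rows instead of the raw counts
proportions = toy_df["label"].value_counts(normalize=True)
print(proportions)
```

Dividing by `len(...)` by hand, as in the cell above, gives the same result; `normalize=True` is simply a shorthand.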

In [None]:
# Look back at the dataframe. What is the degree of granularity at which each review is assigned a sentiment value?
# Do you have any thoughts about whether or not the length of a review relates to the amount of sentiment that might
# be found inside? Well, one thing you might want to do is figure out how long the longest review is.
# We do this by creating a pandas series. We take the review column from the movies dataframe, use the
# .str accessor to treat each entry as a string, and then calculate the length of each in characters.

length_reviews = movies_df.review.str.len()

In [None]:
# The output when we do this is called a "series."
type(length_reviews)

pandas.core.series.Series

In [None]:
# We can use the built-in "max" function on the series to find how many characters are in the longest review
max(length_reviews)

10321

In [None]:
# Same thing for the shortest review. 
min(length_reviews)

52

If we believe that the length of the review is a valuable feature that we want to continue working with, we could append this information to the dataframe. If you want to continue learning outside of class, this is a good challenge problem to work on. See if you can take this new data about each review and add it as an additional "feature" in your dataset. 
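
If you get stuck on the challenge, here is one possible approach, sketched on a toy dataframe rather than on the IMDB sample. The names `toy_df` and `review_length` are made up for illustration; try the problem yourself on `movies_df` first.

```python
import pandas as pd

# Hypothetical stand-in for movies_df with two short reviews
toy_df = pd.DataFrame({
    "review": ["Loved it.", "Dull and far too long."],
    "label": [1, 0],
})

# Assigning to a new column name adds the character counts as an extra feature
toy_df["review_length"] = toy_df["review"].str.len()
print(toy_df)
```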

## Exploring Data in Detail

### Levels of granularity
Measuring "sentiment" this way is highly dependent on how closely you want to look at the text.
* Document level?
* Sentence level? 
* Aspect level? (relating sentiment to a direct referent, even within a single statement)

### Types of Sentiment Analysis
In general, there are two kinds of sentiment analysis.
#### Rule or Lexicon-based
In this approach, an algorithm matches the words in a lexicon against the words in the dataset and either sums or averages their scores, depending on the function you choose. The result is a combined value. In other words, the reviews above are labeled 1 or 0 because they are a "net positive" or "net negative": the algorithm assigns the value based on where the text as a whole ends up on a scale.
* List of words
* Balance score --> nice: +2, good: +1, miserable: -4, happy: +3
* Relies on a hand-crafted set of valence scores stored as dictionaries / lexicons.
* Fails at some tasks because words have different valences in different contexts.
  - The polarity of a word may change with the topic.
  - These changes can't easily be reflected in the dictionary.
* Can work fast and is less computationally resource intensive.

##### Example: 
Today, was, a, good, day. 
0, 0, 0, 1, 0 --> 1 - positive
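
The word-by-word scoring in this example can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how any particular library implements it: the `LEXICON` dictionary reuses the sample valence scores listed above, and `score_text` is a hypothetical helper name.

```python
import re

# Hypothetical valence lexicon built from the illustrative scores above
LEXICON = {"nice": 2, "good": 1, "miserable": -4, "happy": 3}

def score_text(text, lexicon=LEXICON):
    """Return the words, their individual scores, and the summed score."""
    # Lowercase the text and pull out the words, dropping punctuation
    words = re.findall(r"[a-z']+", text.lower())
    # Words missing from the lexicon contribute 0
    scores = [lexicon.get(word, 0) for word in words]
    return words, scores, sum(scores)

words, scores, total = score_text("Today was a good day.")
label = "positive" if total > 0 else "negative" if total < 0 else "neutral"
print(words)
print(scores)
print(total, label)
```

Note that `score_text("This was not good.")` also sums to 1, because a simple lookup like this ignores negation. That is one concrete case of the context problem described in the list above.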

#### Automatic / Machine Learning
* Modeled as a classification problem
* Using a dataset with "known sentiment," we predict the sentiment of a dataset with unknown sentiment.
* Relies on labeled historical data
* Is resource intensive (uses lots of a computer's resources to train models)
* Can be "powerful" (i.e., runs fast and adapts flexibly, depending on how it is deployed)
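
To make the classification idea concrete, here is a minimal sketch using scikit-learn. The training sentences and labels are invented for illustration, and the choice of word counts plus logistic regression is an assumption; it is just one common way to set up such a model, and a real one would need far more labeled data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training data with known sentiment: 1 = positive, 0 = negative
train_texts = [
    "a wonderful film, I loved it",
    "great acting and a great story",
    "terrible plot and boring dialogue",
    "I hated this boring film",
]
train_labels = [1, 1, 0, 0]

# Turn each text into word counts, then fit a classifier on those counts
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

# Predict the sentiment of texts the model has not seen before
new_texts = ["a great and wonderful story", "boring and terrible"]
predictions = model.predict(new_texts)
print(predictions)
```

Unlike the lexicon approach, nobody assigns scores to words by hand here; the model infers which words signal which label from the labeled examples.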


In [None]:
from textblob import TextBlob
text = "Today was a good day."
my_valence = TextBlob(text)
my_valence.sentiment

Sentiment(polarity=0.7, subjectivity=0.6000000000000001)

The `sentiment` property in TextBlob returns a named tuple, which here means a pair of values that can each be read by name. It has two components: a _polarity_ value and a _subjectivity_ value. Polarity is measured on a scale from -1 (negative) to 1 (positive), with 0 as neutral. Subjectivity is measured on a scale from 0 (very objective) to 1 (very subjective); it estimates how much the text expresses personal opinion rather than factual statement.

In [None]:
# The datatype becomes a specific thing, a textblob object that has NLP processing performed on it. 
# Sentiment is one part of the processing that we're just calling from the new textblob object.
type(my_valence)

textblob.blob.TextBlob

In [None]:
twocities = "It was the best of times, it was the worst of times,it was the age of wisdom,it was the age of foolishness,it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way— in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only."

In [None]:
citiesblob = TextBlob(twocities)

In [None]:
citiesblob.sentiment

Sentiment(polarity=0.022916666666666658, subjectivity=0.5895833333333332)

In [None]:
# Create a text string. Turn it into a TextBlob object. Call out the sentiment measures. 
tomcabin = "Late in the afternoon of a chilly day in February, two gentlemen were sitting alone over their wine, in a well-furnished dining parlor, in the town of P----, in Kentucky. There were no servants present, and the gentlemen, with chairs closely approaching, seemed to be discussing some subject with great earnestness."
tomblob = TextBlob(tomcabin)
tomblob.sentiment

Sentiment(polarity=-0.04444444444444443, subjectivity=0.4305555555555555)

## Questions? 


## Tutorials and resources for future study: 
* https://pythonspot.com/python-sentiment-analysis/
* https://github.com/nltk/nltk/wiki/Sentiment-Analysis
        