# Lecture 11: Text Analysis

**Sentiment Analysis**

* Tokenization
* Stop words
* Stemming

**TF-IDF**

* Bag of Words
* Term frequency
* Inverse document frequency

Tools: `nltk`

In [None]:
# pandas and matplotlib setup
import pandas as pd

import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (17, 7)
plt.rcParams.update({'font.size': 14})
import seaborn as sns

#improve resolution
#comment this line if erroring on your machine/screen
%config InlineBackend.figure_format ='retina'

import warnings
warnings.filterwarnings('ignore')

# import natural language toolkit
# if you don't already have nltk installed, uncomment and run this
#import sys
#!{sys.executable} -m pip install nltk

import nltk

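# download stopwords & punkt
# (newer nltk releases look for 'punkt_tab' rather than 'punkt', so request both)
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')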

### Reminder: Natural Language Processing is a whole field of study.

Like most topics in this course, there are many courses solely focused on the appropriate analysis of text. We'll cover the general concepts in this course, but know you're missing lots of important details.


## Natural Language Toolkit (nltk)

For more details on using the functionality within this package, check out the NLTK Book.

0. Preface
1. Language Processing and Python
2. Accessing Text Corpora and Lexical Resources
3. Processing Raw Text
4. Writing Structured Programs
5. Categorizing and Tagging Words
6. Learning to Classify Text
7. Extracting Information from Text
8. Analyzing Sentence Structure
9. Building Feature Based Grammars
10. Analyzing the Meaning of Sentences
11. Managing Linguistic Data
12. Afterword: Facing the Language Challenge

[VADER](https://github.com/cjhutto/vaderSentiment) is a particularly helpful tool/lexicon when working with sentiments expressed in social media (tweets, online reviews, etc.)

Its functionality is available through nltk, so we'll download the vader lexicon for use later in this notebook.

In [None]:
# get lexicon we'll be working with today
nltk.download('vader_lexicon')

## The Data

In [None]:
df_wi21 = pd.read_csv('https://raw.githubusercontent.com/shanellis/datasets/master/COGS108_feedback_Wi21.csv')
df_wi21.head(6)

In [None]:
# read in feedback dataset - Fall 2020
df_fa20 = pd.read_csv('https://raw.githubusercontent.com/shanellis/datasets/master/COGS108_feedback_Fa20.csv')
df_fa20.head(6)

In [None]:
# read in feedback dataset - Spring 2020
df_sp20 = pd.read_csv('https://raw.githubusercontent.com/shanellis/datasets/master/COGS108_feedback_Sp20.csv')
df_sp20.head(6)

In [None]:
# read in feedback dataset - Winter 2020
df_wi20 = pd.read_csv('https://raw.githubusercontent.com/shanellis/datasets/master/COGS108_feedback_Wi20.csv')
df_wi20.head(6)

In [None]:
# read in feedback dataset - Spring 2019
df_sp19 = pd.read_csv('https://raw.githubusercontent.com/shanellis/datasets/master/COGS108_feedback_Sp19.csv')
df_sp19.head(6)

## Describe & Explore

We'll quickly describe and explore the data to see what information we have before moving on to Text Analysis.

### Data Considerations

* Duplicate responses?
* PIDs for individuals in the class (typos?)
* Missingness?
* Reflect reality?

### Missingness

In [None]:
# how many nonresponses
df_wi21.isnull().sum()

In [None]:
# how many nonresponses
df_fa20.isnull().sum()

In [None]:
# how many nonresponses
df_sp19.isnull().sum()

In [None]:
# how many nonresponses
df_wi20.isnull().sum()

In [None]:
# how many nonresponses
df_sp20.isnull().sum()

We see that there are more nonresponses in the `enjoyed_least` category than the `enjoyed_most` category. So, more people left what they enjoyed least blank than they did what they enjoyed most.

### Previous Quarters

Typically, a few students describe what they enjoyed least but leave what they enjoyed most blank. We don't have any from last quarter...but often these students' feedback is of particular interest to me.

In [None]:
# Fall 2020
check_least = df_fa20[df_fa20['enjoyed_most'].isnull() & df_fa20['enjoyed_least'].notnull()]
list(check_least['enjoyed_least'])

In [None]:
# Spring 2020
check_least = df_sp20[df_sp20['enjoyed_most'].isnull() & df_sp20['enjoyed_least'].notnull()]
list(check_least['enjoyed_least'])

In [None]:
# Winter 2020
check_least = df_wi20[df_wi20['enjoyed_most'].isnull() & df_wi20['enjoyed_least'].notnull()]
list(check_least['enjoyed_least'])

In [None]:
# Spring 2019
check_least = df_sp19[df_sp19['enjoyed_most'].isnull() & df_sp19['enjoyed_least'].notnull()]
list(check_least['enjoyed_least'])

Missing data causes a problem in `nltk`, so we either get rid of individuals who didn't respond to both, or we can replace their missing data with 'No response', knowing that this text will be included in the analysis now.

In [None]:
def fill_no_response(df):
    '''replace missing data in enjoyed_most/least series with string No response'''
    
    df['enjoyed_most'] = df['enjoyed_most'].fillna('No response')
    df['enjoyed_least'] = df['enjoyed_least'].fillna('No response')

In [None]:
# fill NAs with string 'No response'
fill_no_response(df_wi21)
fill_no_response(df_fa20)
fill_no_response(df_sp20)
fill_no_response(df_sp19)
fill_no_response(df_wi20)

### Exploratory Plots

These can give us a quick idea of students' thoughts on the course. (I didn't ask these this quarter because I added the open ended question about how you're doing.)

* Time Spent
* (Relative Difficulty)
* (Quiz responses)

In [None]:
df = df_fa20

plt.subplot(1, 3, 1)
ax = sns.histplot(df['a1'], bins=10, kde=True, stat='density')  # distplot is deprecated
ax.axvline(df['a1'].median(), color='#2e2e2e', linestyle='--')
plt.title('Approximately how long (hours) did you spend?', loc='left')
ax.text(x=df['a1'].median()+2, y=0.3, s=df['a1'].median(), fontsize=14, alpha=0.75, ha='center')
plt.xlabel('A1')

plt.subplot(1, 3, 2)
ax = sns.histplot(df['a2'], bins=10, kde=True, stat='density')  # distplot is deprecated
ax.axvline(df['a2'].median(), color='#2e2e2e', linestyle='--')
ax.text(x=df['a2'].median()+2, y=0.21, s=df['a2'].median(), fontsize=14, alpha=0.75, ha='center')
plt.xlabel('A2')

plt.subplot(1, 3, 3)
ax = sns.histplot(df['proposal'], bins=10, kde=True, stat='density')  # distplot is deprecated
ax.axvline(df['proposal'].median(), color='#2e2e2e', linestyle='--')
ax.text(x=df['proposal'].median()+2, y=0.22, s=df['proposal'].median(), fontsize=14, alpha=0.75, ha='center')
plt.xlabel('Project Proposal')

## Quick Checks: Words of Interest

In [None]:
df['enjoyed_least']

In [None]:
def check_word_freq(df, word):
    """checks for frequency of word specified in most and least enjoyed responses"""
    
    most = df['enjoyed_most'].str.lower().str.contains(word).sum()/df['enjoyed_most'].notnull().sum()
    least = df['enjoyed_least'].str.lower().str.contains(word).sum()/df['enjoyed_least'].notnull().sum()
    
    out = pd.DataFrame({'most_freq': [most], 'least_freq': [least]})
    return out

### Assignments

In [None]:
# check for assignment
df = df_wi21
check_word_freq(df, 'assignment')

### Projects

In [None]:
## check for project in free text
check_word_freq(df, 'project')

In [None]:
## check for group in free text
check_word_freq(df, 'group')

In [None]:
## check for individual in free text
check_word_freq(df, 'individual')

### Quizzes

In [None]:
check_word_freq(df_wi21, 'quiz')

In [None]:
check_word_freq(df_fa20, 'quiz')

In [None]:
check_word_freq(df_sp20, 'quiz')

In [None]:
check_word_freq(df_wi20, 'quiz')

In [None]:
check_word_freq(df_sp19, 'quiz')

### Labs

In [None]:
check_word_freq(df_wi21, 'lab')

In [None]:
check_word_freq(df_fa20, 'lab')
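
One caveat with quick checks like these: a substring match for 'lab' also counts responses that contain words such as 'collaborating' or 'available'. Here is a minimal sketch of a whole-word check using a regular expression with word boundaries. The `responses` list and the `contains_word` helper are made up for illustration and are not part of the course data.

```python
import re

# hypothetical responses, invented for illustration only
responses = [
    "The lab sections were helpful",
    "I liked collaborating with my group",
    "Labs took a long time",
]

def contains_word(text, word):
    """True if `word` (optionally followed by 's') appears as a whole word."""
    pattern = r"\b" + re.escape(word) + r"s?\b"
    return re.search(pattern, text.lower()) is not None

# plain substring matching also counts 'collaborating'
substring_hits = sum("lab" in r.lower() for r in responses)

# whole-word matching counts only 'lab' and 'labs'
whole_word_hits = sum(contains_word(r, "lab") for r in responses)

print(substring_hits, whole_word_hits)
```

Since `str.contains` in pandas treats its argument as a regular expression by default, a pattern like this one could likely be passed to `check_word_freq` directly.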

## Sentiment Analysis

We get a quick snapshot of what's going on in COGS 108, but we really want to understand the details. To do this, analyzing the sentiment of the text is a good next step.

### Step 1: Tokenization

Tokenization is the first step in analyzing text.

1. Acquire text of interest
2. Break text down (tokenize) into smaller chunks (i.e. words, bigrams, sentences, etc.)

A **token** is a single entity - think of it as a building block of language.
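
To make the idea of differently sized chunks concrete, here is a rough sketch that splits one made-up sentence into word tokens and then pairs adjacent tokens into bigrams. The regular expression is only a simplified stand-in for a real tokenizer (a full tokenizer treats contractions and other edge cases differently), and `simple_tokenize` and the example sentence are assumptions for illustration.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "I enjoyed the group project."
words = simple_tokenize(sentence)

# bigrams: every pair of adjacent tokens
bigrams = list(zip(words, words[1:]))

print(words)
print(bigrams)
```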

#### Tokenization Example

Here we demonstrate what a tokenized single response looks like.

In [None]:
# import word tokenizer
from nltk.tokenize import word_tokenize

In [None]:
# just focus on last quarter's responses
df = df_wi21

In [None]:
df.loc[25,'enjoyed_most']

In [None]:
tokenized_word = word_tokenize(df.loc[25,'enjoyed_most'])
print(tokenized_word)

In [None]:
df.loc[25,'enjoyed_most']

#### Tokenize COGS108 data

Using that concept, we'll tokenize the words in the `enjoyed_most` and `enjoyed_least` columns of our COGS108 data.

In [None]:
# tokenize most and least responses
df['most_token'] = df['enjoyed_most'].apply(word_tokenize) 
df['least_token'] = df['enjoyed_least'].apply(word_tokenize) 
df.head()

### Step 2: Stop Words

Stop words are words that are of less interest to your analysis.

For example, you wouldn't expect the following words to be important: is, am, are, this, a, an, the, etc.

By removing stopwords, you can lower the computational burden, focusing on only the words of interest.

To do so in nltk, you need to create a list of stopwords and filter them from your tokens.

In [None]:
# import stop words
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

# look at stop words
print(stop_words)

####  Stop Words Example

Here we compare a sentence after tokenization to one that has been tokenized and had stop words removed.

In [None]:
# example of removing stop words
filtered_sent=[]
for w in tokenized_word:
    if w not in stop_words:
        filtered_sent.append(w)
print("Tokenized Sentence:", tokenized_word)
print("Filtered Sentence:", filtered_sent)
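
One detail worth checking in the loop above: the membership test is case-sensitive, and the `nltk` English stop word list is all lowercase, so a capitalized token such as 'The' passes through unfiltered. A minimal sketch of a case-insensitive filter follows. The tiny `example_stop_words` set and the tokens are hypothetical stand-ins so the snippet runs on its own.

```python
# small stand-in for the nltk stop word set (illustrative only)
example_stop_words = {"the", "is", "a", "i", "and"}

tokens = ["The", "lectures", "and", "labs", "were", "great"]

# case-sensitive filter: 'The' survives because only 'the' is in the set
case_sensitive = [w for w in tokens if w not in example_stop_words]

# case-insensitive filter: compare the lowercased token instead
case_insensitive = [w for w in tokens if w.lower() not in example_stop_words]

print(case_sensitive)
print(case_insensitive)
```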

#### Remove Stop Words: COGS108 data

Using that idea, we can go ahead and remove stop words from our tokenized most and least liked responses.

In [None]:
df['most_token']

In [None]:
# remove stop words
df['most_stop'] = df['most_token'].apply(lambda x: [item for item in x if item not in stop_words])
df['least_stop'] = df['least_token'].apply(lambda x: [item for item in x if item not in stop_words])
df.head()


### Step 3: Lexicon Normalization (Stemming)

In language, many different words come from the same root word.

For example, "intersection", "intersecting", "intersects", and "intersected" are all related to the common root word - "intersect".

Stemming is one way to carry out this linguistic normalization. It reduces words to their root by chopping off endings like 'ing', so all of the above words would be reduced to their common stem, "intersect."
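
To see the suffix-chopping idea in isolation, here is a toy sketch. It is not the Porter algorithm, which applies a much longer, ordered set of rules. The `toy_stem` function and its suffix list are invented for illustration.

```python
def toy_stem(word, suffixes=("ing", "ion", "ed", "s")):
    """Strip the first matching suffix; a toy rule, not the Porter algorithm."""
    for suffix in suffixes:
        # only strip when a reasonably long stem would remain
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

words = ["intersection", "intersecting", "intersects", "intersected"]
stems = [toy_stem(w) for w in words]
print(stems)

# crude rules also produce stems that are not dictionary words
print(toy_stem("class"))
```

The second print is a reminder that a stem is just a shared prefix used for grouping, so it does not have to be a real word.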

#### Stemming Example

After tokenization and removing stop words, we can get the stem for all tokens (words) in our dataset.

In [None]:
# Stemming
from nltk.stem import PorterStemmer

ps = PorterStemmer()

stemmed_words=[]
for w in filtered_sent:
    stemmed_words.append(ps.stem(w))

print("Filtered Sentence:", filtered_sent)
print("Stemmed Sentence:", stemmed_words)

#### Stemming: COGS108 data

Here, we obtain the stem (root word) for all tokens in our dataset.

In [None]:
df['most_stem'] = df['most_stop'].apply(lambda x: [ps.stem(y) for y in x])
df['least_stem'] = df['least_stop'].apply(lambda x: [ps.stem(y) for y in x])
df.head()

### Step 4: Frequency Distribution

It can be helpful to get a sense of which words are most frequent in our dataset.

In [None]:
# get series of all most and least liked words after stemming
# note that "No Response" is still being included in the analysis
most = df['most_stem'].apply(pd.Series).stack()
least = df['least_stem'].apply(pd.Series).stack()

In [None]:
most

`FreqDist` calculates the frequency of each word in the text and we can plot the most frequent words.

In [None]:
from nltk.probability import FreqDist
import string

# calculate word frequency
fdist_most = FreqDist(most)
fdist_least = FreqDist(least)

# remove punctuation counts
for punc in string.punctuation:
    del fdist_most[punc]
    del fdist_least[punc]

In [None]:
# Frequency Distribution Plot - top 20
# for words in what students like most
fdist_most.plot(20, cumulative=False)
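
For intuition about what `FreqDist` is counting, the same tally can be sketched with `collections.Counter` from the standard library (`FreqDist` is built on top of `Counter`). The stemmed tokens below are invented for illustration.

```python
from collections import Counter

# hypothetical stemmed tokens pooled across responses
tokens = ["project", "lectur", "project", "group", "lectur", "project", "."]

counts = Counter(tokens)

# drop punctuation counts; deleting a missing key from a Counter is a no-op
for punc in [".", ","]:
    del counts[punc]

# the two most frequent remaining tokens
top_two = counts.most_common(2)
print(top_two)
```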

### Step 5: Sentiment Analysis!

**Sentiment Analysis** quantifies the content, ideas, beliefs, and opinions conveyed in text.

Two general approaches:

1. **Lexicon-based** - count number of words in a text belonging to each sentiment (positive, negative, happy, angry, etc.)
2. **Machine learning-based** - develop a classification model on pre-labeled data
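
The lexicon-based approach can be sketched in a few lines. This is only a toy: the `toy_lexicon` below is a hypothetical six-word dictionary, whereas real lexicons contain thousands of scored entries, and the example sentence is made up.

```python
# hypothetical mini-lexicon: word -> sentiment label
toy_lexicon = {
    "great": "positive", "helpful": "positive", "fun": "positive",
    "boring": "negative", "confusing": "negative", "long": "negative",
}

def count_sentiments(tokens, lexicon):
    """Count how many tokens fall under each sentiment label in the lexicon."""
    counts = {"positive": 0, "negative": 0}
    for token in tokens:
        label = lexicon.get(token.lower())
        if label is not None:
            counts[label] += 1
    return counts

tokens = "The labs were fun and helpful but the quizzes were confusing".split()
result = count_sentiments(tokens, toy_lexicon)
print(result)
```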

#### Sentiment Example

To get a measure of overall sentiment in our text, we'll compare our text to the VADER lexicon.

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer 
analyser = SentimentIntensityAnalyzer()

VADER handles:

* Capitalization (great vs GREAT) & punctuation (exclamation makes more positive!)
* Emojis and emoticons
* Degree modifiers (extremely good vs. marginally good)
* Contractions and conjunctions (but signals shift)

`pos` + `neg` + `neu` = 1

`compound` score - metric that calculates sum of all the lexicon ratings and normalizes between -1 (most extreme negative) and +1 (most extreme positive)

* positive: compound >= 0.05
* neutral: -0.05 < compound < 0.05
* negative: compound <= -0.05
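
To see how a summed score ends up between -1 and +1, here is a small sketch. To my understanding, VADER's reference implementation squashes the summed valence `x` as `x / sqrt(x^2 + alpha)` with `alpha = 15`. The raw sums below are made up, and the real analyzer applies its other rules (negation, intensifiers, punctuation) before this step. The `label` helper simply applies the thresholds listed above.

```python
import math

def normalize(score, alpha=15):
    """Squash an unbounded summed valence score into the range (-1, 1)."""
    return score / math.sqrt(score * score + alpha)

def label(compound):
    """Apply the positive / neutral / negative thresholds to a compound score."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

# hypothetical summed valence scores
for raw in [-4.0, 0.0, 0.1, 3.0]:
    compound = normalize(raw)
    print(raw, round(compound, 4), label(compound))
```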

In [None]:
analyser.polarity_scores("The class is super cool.")

In [None]:
analyser.polarity_scores("The class is not super cool.")

In [None]:
analyser.polarity_scores("The class is NOT super cool!")

#### Sentiment Analysis: COGS108 data

Here, we will calculate the sentiment of each most liked and least liked student response from the survey.

In [None]:
# get list of the 'sentences' (responses) from each individual
most_list = list(df['enjoyed_most'].values)
least_list = list(df['enjoyed_least'].values)
most_list

In [None]:
# create function that will output dataframe 
# that stores sentiment information
def get_sentiments(input_list):
    
    rows = []

    for sentence in input_list:
        ss = analyser.polarity_scores(sentence)
        ss['sentence'] = sentence
        rows.append(ss)

    # build the dataframe once; DataFrame.append was removed in pandas 2.0
    output = pd.DataFrame(rows)

    return output

In [None]:
# get sentiment measures
least_sentiments = get_sentiments(least_list)
most_sentiments = get_sentiments(most_list)

#### Sentiment Analysis: COGS108 data output

After calculating the sentiment of each response, we can look at the output of each.

In [None]:
# take a look at the output
least_sentiments.head(10)

In [None]:
# take a look at the output
most_sentiments.head(10)

Let's deal with those `No response` values

We've left them in there long enough. Let's remove the `No response` values before we look at any overall patterns.

In [None]:
# .copy() gives independent dataframes, so adding columns later is safe
most_sentiments = most_sentiments[most_sentiments['sentence'] != 'No response'].copy()
least_sentiments = least_sentiments[least_sentiments['sentence'] != 'No response'].copy()

#### Sentiment Analysis: COGS108 data - `describe`

To get an overall sense of the values stored in each of these dataframes, we can use describe.

In [None]:
most_sentiments.describe()

In [None]:
least_sentiments.describe()

#### Sentiment Analysis: COGS108 data - plotting

We can compare the distribution of the compound metric between the two analyses.

In [None]:
most_sentiments['compound'].plot.density(label='most')
least_sentiments['compound'].plot.density(label='least')
plt.legend()
plt.xlabel('Compound Sentiment Scores')
plt.xlim(-1,1)

In [None]:
# include label for boxplot
most_sentiments['which'] = 'most'
least_sentiments['which'] = 'least'
# concatenate data frames together
compound_out = pd.concat([most_sentiments, least_sentiments])
compound_out.head()

In [None]:
# plot compound by response type
sns.boxplot(data=compound_out, x='which', y='compound')
plt.xlabel('response')

Probably unsurprisingly, the overall sentiment of what students like most tends to be more positive than the sentiment of what they like least.

That follows from the data and the questions on the survey. But let's dig deeper into these data, moving beyond sentiment analysis...

## TF-IDF

Term Frequency - Inverse Document Frequency (**TF-IDF**) sets out to identify the tokens most unique to your document of interest (relative to all documents in your corpus).

**Term Frequency (TF)** - counts the number of times a given word (or token, term, etc) occurs in each document divided by the number of words in that document

**Inverse Document Frequency (IDF)** - down-weights a word according to how many documents in the corpus contain it, so words that show up in every document count for little.

Words with a high TF-IDF are those with high frequency in one document & relatively low frequency in other documents.
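
A worked sketch of the arithmetic, using the textbook formula TF × log(N / document frequency) on two tiny made-up documents. The document contents and variable names are assumptions for illustration. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will differ from this version.

```python
import math
from collections import Counter

# two hypothetical documents, already tokenized
docs = {
    "most": ["project", "lecture", "project", "group"],
    "least": ["quiz", "lecture", "quiz", "quiz"],
}

n_docs = len(docs)

# document frequency: number of documents containing each term
doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))

tfidf = {}
for name, tokens in docs.items():
    counts = Counter(tokens)
    tfidf[name] = {
        # TF = count / document length; IDF = log(N / document frequency)
        term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }

for name, scores in tfidf.items():
    print(name, scores)
```

'lecture' appears in both documents, so its IDF is log(2/2) = 0 and it drops out, while 'project' and 'quiz' each appear in only one document and score highest there.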

For our purposes, our **corpus** will be students' responses to what they like most and least about COGS108.

We'll treat this as **two separate documents**:

1. What students like most
2. What students like least

### Bag of Words (BoW) approach

Converts the text into a document-term matrix: one row per document in the corpus, one column per word, with each cell recording how often that word occurs in that document.
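
A bag-of-words table can be sketched with the standard library alone: one row per document, one column per vocabulary word, and word order within a document is discarded. The two documents below are invented for illustration.

```python
from collections import Counter

# two hypothetical documents
documents = ["the project was fun", "the quiz was long and the quiz was hard"]

# vocabulary: sorted set of every word seen in any document
vocabulary = sorted({word for doc in documents for word in doc.split()})

# one row per document, one column per vocabulary word, cells hold counts
matrix = []
for doc in documents:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```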

To do this, let's get our text ready.

We're going to make sure all our words are lower case, remove punctuation from each, and then provide the text (`corpus`) to `TfidfVectorizer`.

In [None]:
import string

# lowercase text and join each set of responses into a single document
least = ' '.join(map(str.lower, least_list))
most = ' '.join(map(str.lower, most_list))

# remove punctuation
remove_punct = str.maketrans('', '', string.punctuation)
least = least.translate(remove_punct)
most = most.translate(remove_punct)

# get list of two documents together
corpus = [least, most]

In [None]:
corpus

### Calculate TF-IDF

With our text ready for analysis, it's time to calculate TF-IDF.

To start our TF-IDF analysis, we'll first create a TfidfVectorizer object to transform our text data into vectors.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
TfidfVectorizer?

In [None]:
# create vectorizer
tfidf = TfidfVectorizer(sublinear_tf=True,
                        analyzer='word',
                        max_features=2000,
                        tokenizer=word_tokenize,
                        stop_words=list(stop_words))  # recent scikit-learn expects a list, not a set

#### TF-IDF: COGS108 data - calculation

Here, we use our vectorizer to calculate TF-IDF across the words in our word matrix.

In [None]:
# calculate TF-IDF
cogs_tfidf = pd.DataFrame(tfidf.fit_transform(corpus).toarray())
cogs_tfidf.columns = tfidf.get_feature_names_out()
cogs_tfidf = cogs_tfidf.rename(index={0:'least', 1:'most'})

In [None]:
cogs_tfidf.T.iloc[30:100]

#### TF-IDF: COGS108 data - output

If we just want to look at the word most unique to each document...

In [None]:
most_unique = cogs_tfidf.idxmax(axis=1) 
most_unique

Alternatively, we can sort by the set of words most unique to each document:

In [None]:
cogs_tfidf.sort_values(by='most', axis=1, ascending=False)

In [None]:
cogs_tfidf.sort_values(by='least', axis=1, ascending=False)


**Sentiment Analysis** and **TF-IDF** are really helpful when analyzing documents and corpora of text.

But, what if, from the text itself we wanted to predict whether or not the text was likely a 'most' liked or a 'least' liked comment? We'll discuss how to do this in the coming **machine learning** lectures!