<center>
<h1>Cultural Analytics</h1><br>
<h2>ENGL64.05 / QSS 30.16 22F</h2>
</center>

----

# Lab 5: Machine Learning and Sentiment Analysis

 <center><pre>Created: 05/13/2021; Updated 10/24/2022</pre></center>

## Section 1: Experiment with TextBlob's built-in Sentiment Tool

In [None]:
from textblob import TextBlob

In [None]:
# Montfort (pp. 201-204) explains how to use TextBlob's included sentiment analysis tool
# In the cell below, use this to evaluate the sentiment and subjectivity of several sample 
# sections of text. Search for text from news reports, tweets, web site text, 
# fictional texts, etc.

In [None]:
# Examine how the sentiment was calculated with the "sentiment_assessments" property. Try other 
# texts. What do you notice about the terms that contribute to the score?

## 1.1: Construct a Scaled Sentiment System

In [None]:
# Now try to convert the sentiment score from TextBlob into a four-star system for 
# determining the degree to which you've found a positive section of text. Create a new
# function using def to call TextBlob and return a positivity score rendered in the 
# four-star scale.
#
# HINT: You can print an emoji (!) with cut-and-paste. If you want to display multiple 
# characters you can use string multiplication: "hey" * 10 will print "hey" ten times.
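
One possible starting point for the star scale, offered as a minimal sketch rather than the expected answer: it assumes the polarity score lies between -1.0 and 1.0 and divides that range into four equal bins. The function name `polarity_to_stars` and the choice of one star as the minimum are assumptions made for illustration.

```python
def polarity_to_stars(polarity: float, max_stars: int = 4) -> str:
    """Map a polarity score in [-1.0, 1.0] to a string of 1..max_stars stars."""
    # Clamp in case a caller passes a value slightly outside the expected range.
    polarity = max(-1.0, min(1.0, polarity))
    # Rescale [-1, 1] to [0, 1], then split that range into max_stars equal bins.
    scaled = (polarity + 1.0) / 2.0
    stars = min(max_stars, int(scaled * max_stars) + 1)
    # String multiplication repeats the star character.
    return "★" * stars


for score in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(score, polarity_to_stars(score))
```

In the lab itself, the score passed in would come from `TextBlob(text).sentiment.polarity`. You may prefer unequal bins if most of your texts cluster near zero.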

## 1.2: Building Our Own Sentiment Classifier

In [None]:
# Python's native storage format for saving serialized
# data objects is called "pickle." We can "dump" and "load"
# objects from these files. This file contains a set of Goodreads
# reviews pulled from the Web.

import pickle
with open("shared/ENGL64.05-22F/data/book-reviews.pkl", "rb") as f:
    reviews = pickle.load(f)
print("loaded {0} book reviews".format(len(reviews)))
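
To see "dump" and "load" working together, here is a small round trip. It is a sketch only: the two reviews and the temporary file name are made up for illustration and are not part of the course data.

```python
import os
import pickle
import tempfile

# Made-up stand-in data; the real reviews come from the course pickle file.
sample_reviews = ["Loved it.", "Could not finish this one."]

path = os.path.join(tempfile.mkdtemp(), "sample-reviews.pkl")

# "Dump" the list to disk; pickle files are binary, hence the "wb" mode.
with open(path, "wb") as f:
    pickle.dump(sample_reviews, f)

# "Load" it back; the with block closes the file even if loading fails.
with open(path, "rb") as f:
    restored = pickle.load(f)

print("loaded {0} book reviews".format(len(restored)))
```

Only load pickle files from sources you trust, since unpickling can execute arbitrary code.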

In [None]:
# Preprocessing:
# The reviews contain "newline" characters, so let's replace each with a space
# (replacing with "" would glue together the words on either side of a line break):
reviews = [r.replace("\n"," ") for r in reviews]

# Remove quotation marks for clarity and ease of processing
reviews = [r.replace("'","") for r in reviews]
reviews = [r.replace('"',"") for r in reviews]
              
# Do you want to make everything lowercase? What else might you do?
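
If you do decide to lowercase and tidy the text further, one way to bundle the steps is a small helper like the one below. This is a sketch under assumptions: the name `clean_review` is hypothetical, and whether lowercasing helps depends on your data.

```python
import re


def clean_review(text: str) -> str:
    """Lowercase a review, drop quotation marks, and collapse whitespace."""
    text = text.lower()
    text = text.replace("'", "").replace('"', "")
    # \s+ matches runs of spaces, tabs, and newlines; replacing each run with a
    # single space keeps words on either side of a line break from merging.
    return re.sub(r"\s+", " ", text).strip()


print(clean_review('I "LOVED"\nthis   book!'))
```

It could then be applied to every review with a list comprehension, for example `[clean_review(r) for r in reviews]`.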

In [None]:
reviews[4]

In [None]:
# Let's split the data into training and testing datasets.
# The training set will contain the first 500 reviews
# and the testing set will be the remainder
training_reviews = reviews[:500]
testing_reviews = reviews[500:]

In [None]:
# Set NUM_SAMPLES (the k argument below) to the number of reviews you want to extract.
# Choose a number large enough to include both positive and negative reviews,
# but small enough that labeling your training set by hand stays manageable.
# NUM_SAMPLES must be defined before this cell will run.
#
# We're also displaying the index within the training dataset, which could
# potentially be useful if you are not into cut-and-paste in constructing 
# your labeled data.

from random import sample
sample(list(enumerate(training_reviews)), k=NUM_SAMPLES)

### Create Training Data

In [None]:
# Now we need to create a list of tuples in which the first item
# is the text of the review and the second a label of either "pos"
# for a positive review or "neg" for a negative review. Both items
# in the tuple are strings and need to be quoted. Separate the two items
# inside each tuple with a comma, and separate the tuples from one another with commas.
training_sentiments = [
]
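
If you would rather not cut and paste review text, the indices displayed by the sampling cell can be used instead. The sketch below assumes a hypothetical dictionary, `labels_by_index`, in which you record a label for each index you have read; the three reviews are made-up stand-ins.

```python
# Made-up stand-ins for training_reviews; the real list comes from the pickle file.
training_reviews = [
    "A wonderful, moving story.",
    "Dull characters and a plot that goes nowhere.",
    "I could not put it down.",
]

# Hypothetical hand-assigned labels, keyed by index into training_reviews.
labels_by_index = {0: "pos", 1: "neg", 2: "pos"}

# Build the (text, label) tuples that the classifier expects.
training_sentiments = [
    (training_reviews[i], label) for i, label in sorted(labels_by_index.items())
]

for text, label in training_sentiments:
    print(label, "|", text)
```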

### Classify, Show Features, and Evaluate

In [None]:
# Load TextBlob's wrapper for the Naive Bayes classifier
# and train on your training sentiment data
from textblob.classifiers import NaiveBayesClassifier
clf = NaiveBayesClassifier(training_sentiments)

In [None]:
# Now review the documentation with help(NaiveBayesClassifier) to learn
# how to display the most informative features for your trained
# classifier.
#
# Do you want to go back and change anything from above? 

In [None]:
# Now run through some selection of your testing data (testing_reviews) and 
# classify using prob_classify and display the label and optionally the 
# assigned probability that it is that label.

## Section 2: Classification with SVM

Now we'll use the more familiar CountVectorizer to create a bag-of-words model that we can use for classification with an algorithm called Support Vector Machines (SVM). 

The SGDClassifier model gives us access to a linear model that is quite flexible and yields interpretable results. With its default hinge loss, it fits a linear SVM.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn import metrics
import numpy as np
import json

In [None]:
# we are going to load saved course reviews from the LayupList
old_reviews = json.load(open('shared/ENGL64.05-22F/data/LayupList/old_reviews.json'))
new_reviews = json.load(open('shared/ENGL64.05-22F/data/LayupList/new_reviews.json'))

In [None]:
# display the number of reviews for each department
for d in set([r['department'] for r in new_reviews]):
    print(d,len([r for r in new_reviews if r['department'] == d]))

In [None]:
# This is a sample way to get comments for one department. Select *at least three*
# departments or programs from the above list and assemble lists with these
# narrative comments.
math_data = [r['comments'] for r in new_reviews if r['department'] == 'MATH']
math_data = [' '.join(list(r.values())) for r in math_data if len(r) != 0]
math_data[1]

### Create training data and labels

Review the above and decide if you need to do any additional preprocessing of the data. Create labels as a list of simple identifiers (I'd suggest department/program subject code) for each document that you will vectorize. The labels must match the data (item x in the labels should match item x in the data).

In [None]:
# Now you'll need to make training labels (a list that will identify each item vectorized)
train_labels = 
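
One way to keep the data and labels aligned is to build both lists in the same loop. This is a sketch under assumptions: `docs_by_department` and `train_data` are hypothetical names, and the comments are invented stand-ins for the lists you assembled from the reviews.

```python
# Made-up comments standing in for the per-department lists you assembled.
docs_by_department = {
    "MATH": ["Lots of proofs.", "Problem sets were long."],
    "ENGL": ["Great discussions of the novels."],
    "COSC": ["Fun projects.", "Fast paced.", "Helpful office hours."],
}

train_data = []
train_labels = []
for dept, docs in docs_by_department.items():
    train_data.extend(docs)
    # Repeat the department code once per document so that item x in the
    # labels lines up with item x in the data.
    train_labels.extend([dept] * len(docs))

print(len(train_data), len(train_labels))
print(train_labels)
```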

In [None]:
# Now we want to vectorize the data. What options/parameters should we use? 

# 1: create an instance of CountVectorizer with suitable parameters to vectorize 
#    content data (list of strings)
# 2: call this "vec"
# 3: create the document-term matrix by fit_transforming your data to your
#    instance of CountVectorizer.
# 4: call this "dtm"

In [None]:
# what is the shape? 
#
# documents x vocabulary
dtm.shape

In [None]:
# train the model using SVM
clf = SGDClassifier(tol=None,max_iter=1000).fit(dtm, train_labels)
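
The imported `metrics` module can be used to check how the trained model performs. The sketch below uses an invented toy corpus and, for brevity, scores the model on the same documents it was trained on; that only demonstrates the API, and a real evaluation would need held-out documents. Setting `random_state=0` is an assumption added here to make the run repeatable.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn import metrics

# Invented documents and labels standing in for the course review data.
train_data = [
    "lots of proofs and long problem sets",
    "proofs were hard but the problem sets helped",
    "great discussions of the novels",
    "we read novels and wrote essays",
    "fun programming projects",
    "the programming assignments were fast paced",
]
train_labels = ["MATH", "MATH", "ENGL", "ENGL", "COSC", "COSC"]

vec = CountVectorizer()
dtm = vec.fit_transform(train_data)

# loss="hinge" (the default) is what makes SGDClassifier behave as a linear SVM.
clf = SGDClassifier(loss="hinge", tol=None, max_iter=1000, random_state=0)
clf.fit(dtm, train_labels)

# Predict labels for the training documents and compare with the true labels.
predicted = clf.predict(dtm)
accuracy = metrics.accuracy_score(train_labels, predicted)
print(accuracy)
print(clf.coef_.shape)  # (number of classes, vocabulary size)
```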

In [None]:
# Within Scikit-learn, the weights are stored in the "coef_" attribute for the classifier.
# This is a numpy array of shape (number of classes, vocabulary size), where the vocabulary
# size is dtm.shape[1]. With only two classes, scikit-learn stores a single row.
clf.coef_.shape

In [None]:
# The most influential features can be found by sorting the weights.
# np.argsort sorts in ascending order, so the last ten indices are the ten
# highest-weighted features for the first class:
np.argsort(clf.coef_[0])[-10:]
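
To see what the slice is doing, here is the same idea on a small array. The weights are made up for illustration; only `np.argsort` and slicing are involved.

```python
import numpy as np

# Made-up coefficients for one class.
weights = np.array([0.2, -1.5, 3.0, 0.7, 1.1])

# argsort returns the indices that would sort the array in ascending order,
# so the last three indices point at the three largest weights.
top3 = np.argsort(weights)[-3:]
print(top3)  # the smallest of the three comes first

# Reverse the slice to list the largest weight first.
top3_desc = top3[::-1]
print(top3_desc)
print(weights[top3_desc])
```

Taking the first few indices instead (for example `np.argsort(weights)[:3]`) would give the most negative weights, which are the terms that count most strongly against the class.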

In [None]:
# These are the vocabulary indices for the terms (the same indices select the matching values
# in the coefficients/weights). We can look up the feature (term/token) from the vectorizer,
# which maintains the mapping in get_feature_names_out().

In [None]:
# display the highest-weighted features for each class (the terms that most strongly signal that class)
feat_number = 10
feature_names = vec.get_feature_names_out()
for i, class_label in enumerate(clf.classes_):
    terms = np.argsort(clf.coef_[i])[-feat_number:]
    print("%s: %s" % (class_label,
                      ", ".join(feature_names[j] for j in terms)))