# Sentiment Classification & How To "Frame Problems" for a Neural Network

by Andrew Trask

- **Twitter**: @iamtrask
- **Blog**: http://iamtrask.github.io

### What You Should Already Know

- neural networks, forward and back-propagation
- stochastic gradient descent
- mean squared error
- train/test splits

### Where to Get Help If You Need It
- Re-watch previous Udacity Lectures
- Leverage the recommended Course Reading Material - [Grokking Deep Learning](https://www.manning.com/books/grokking-deep-learning) (40% Off: **traskud17**)
- Shoot me a tweet @iamtrask


### Tutorial Outline:

- Intro: The Importance of "Framing a Problem"


- Curate a Dataset
- Developing a "Predictive Theory"
- **PROJECT 1**: Quick Theory Validation


- Transforming Text to Numbers
- **PROJECT 2**: Creating the Input/Output Data


- Putting it all together in a Neural Network
- **PROJECT 3**: Building our Neural Network


- Understanding Neural Noise
- **PROJECT 4**: Making Learning Faster by Reducing Noise


- Analyzing Inefficiencies in our Network
- **PROJECT 5**: Making our Network Train and Run Faster


- Further Noise Reduction
- **PROJECT 6**: Reducing Noise by Strategically Reducing the Vocabulary


- Analysis: What's going on in the weights?

# Lesson: Curate a Dataset

In [20]:
def pretty_print_review_and_label(reviews, labels, i):
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

# Read one review per line, dropping only the trailing newline
with open('reviews.txt', 'r') as g:  # What we know!
    reviews = [line.rstrip('\n') for line in g]

# Read one label per line and normalize it to upper case
with open('labels.txt', 'r') as g:  # What we WANT to know!
    labels = [line.rstrip('\n').upper() for line in g]

In [21]:
len(reviews)

25000

In [22]:
reviews[0]

'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   '

In [23]:
labels[0]

'POSITIVE'

# Lesson: Develop a Predictive Theory

In [24]:
print("labels.txt \t : \t reviews.txt\n")
pretty_print_review_and_label(reviews, labels, 2137)
pretty_print_review_and_label(reviews, labels, 12816)
pretty_print_review_and_label(reviews, labels, 6267)
pretty_print_review_and_label(reviews, labels, 21934)
pretty_print_review_and_label(reviews, labels, 5297)
pretty_print_review_and_label(reviews, labels, 4998)

labels.txt 	 : 	 reviews.txt

NEGATIVE	:	this movie is terrible but it has some good effects .  ...
POSITIVE	:	adrian pasdar is excellent is this film . he makes a fascinating woman .  ...
NEGATIVE	:	comment this movie is impossible . is terrible  very improbable  bad interpretat...
POSITIVE	:	excellent episode movie ala pulp fiction .  days   suicides . it doesnt get more...
NEGATIVE	:	if you haven  t seen this  it  s terrible . it is pure trash . i saw this about ...
POSITIVE	:	this schiffer guy is a real genius  the movie is of excellent quality and both e...


# Project 1: Quick Theory Validation

A quick first step might be to simply count the occurrences of all the words. To remove unnecessary filler, we'll define a `stop_words` list of common words, e.g., "and", "is", "a", "the", etc.

In [26]:
from collections import Counter

def print_word_freq_and_label(reviews, labels, i):
    stop_words = ["is", "this", "it", "has", ".", "but", "a", "and", "of"]
    counter = Counter(filter(lambda w: w not in stop_words, reviews[i].split()))
    print(labels[i], counter.most_common())

print("labels.txt \t : \t freq of reviews.txt\n")
print_word_freq_and_label(reviews, labels, 2137)
print_word_freq_and_label(reviews, labels, 12816)
print_word_freq_and_label(reviews, labels, 6267)
print_word_freq_and_label(reviews, labels, 21934)
print_word_freq_and_label(reviews, labels, 5297)
print_word_freq_and_label(reviews, labels, 4998)

labels.txt 	 : 	 freq of reviews.txt

NEGATIVE [('movie', 1), ('terrible', 1), ('effects', 1), ('some', 1), ('good', 1)]
POSITIVE [('woman', 1), ('fascinating', 1), ('adrian', 1), ('he', 1), ('excellent', 1), ('film', 1), ('pasdar', 1), ('makes', 1)]
NEGATIVE [('very', 1), ('terrible', 1), ('impossible', 1), ('direction', 1), ('look', 1), ('interpretation', 1), ('e', 1), ('bad', 1), ('not', 1), ('movie', 1), ('comment', 1), ('improbable', 1)]
POSITIVE [('rating', 2), ('movie', 2), ('suicides', 1), ('days', 1), ('fiction', 1), ('episode', 1), ('doesnt', 1), ('depressing', 1), ('than', 1), ('pulp', 1), ('ala', 1), ('more', 1), ('excellent', 1), ('get', 1), ('music', 1)]
NEGATIVE [('i', 2), ('pure', 1), ('terrible', 1), ('up', 1), ('m', 1), ('about', 1), ('saw', 1), ('screwed', 1), ('years', 1), ('you', 1), ('still', 1), ('seen', 1), ('t', 1), ('trash', 1), ('if', 1), ('ago', 1), ('haven', 1), ('from', 1), ('s', 1)]
POSITIVE [('br', 2), ('i', 2), ('guy', 1), ('both', 1), ('genius', 1), ('

The raw frequency counts above are too dense to read at a glance, so let's try an alternative approach. Skimming the samples, "NEGATIVE" reviews contain words like "trash", "bad", and "terrible", whereas "POSITIVE" reviews contain words like "excellent", "entertaining", "fascinating", and "good".

Let's count the occurrences of these "NEGATIVE" and "POSITIVE" words, scoring each occurrence of a negative word as -1 and each occurrence of a positive word as +1. The hypothesis is that the net sum should loosely correlate with the label.

In [32]:
def specific_words_theory(reviews, labels, i):
    pos_words = ["excellent", "good", "fascinating", "entertaining"]
    neg_words = ["terrible", "bad", "trash"]
    frequency = {w: reviews[i].count(w) for w in pos_words + neg_words}
    net = sum(frequency[w] if w in pos_words else -frequency[w] for w in pos_words + neg_words)
    print(labels[i], net, frequency)

print("labels.txt \t : \t freq of reviews.txt\n")
specific_words_theory(reviews, labels, 2137)
specific_words_theory(reviews, labels, 12816)
specific_words_theory(reviews, labels, 6267)
specific_words_theory(reviews, labels, 21934)
specific_words_theory(reviews, labels, 5297)
specific_words_theory(reviews, labels, 4998)

labels.txt 	 : 	 freq of reviews.txt

NEGATIVE 0 {'entertaining': 0, 'terrible': 1, 'bad': 0, 'good': 1, 'fascinating': 0, 'trash': 0, 'excellent': 0}
POSITIVE 2 {'entertaining': 0, 'terrible': 0, 'bad': 0, 'good': 0, 'fascinating': 1, 'trash': 0, 'excellent': 1}
NEGATIVE -2 {'entertaining': 0, 'terrible': 1, 'bad': 1, 'good': 0, 'fascinating': 0, 'trash': 0, 'excellent': 0}
POSITIVE 1 {'entertaining': 0, 'terrible': 0, 'bad': 0, 'good': 0, 'fascinating': 0, 'trash': 0, 'excellent': 1}
NEGATIVE -2 {'entertaining': 0, 'terrible': 1, 'bad': 0, 'good': 0, 'fascinating': 0, 'trash': 1, 'excellent': 0}
POSITIVE 2 {'entertaining': 1, 'terrible': 0, 'bad': 0, 'good': 0, 'fascinating': 0, 'trash': 0, 'excellent': 1}
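
One caveat about the cell above: `str.count` matches substrings, so "bad" would also be counted inside a word like "badly". A minimal sketch of a token-level variant follows, assuming whitespace tokenization is good enough for this dataset. The helper name `net_sentiment` and the second sample sentence are hypothetical and only serve as illustration.

```python
# Word lists taken from the cell above; stored as sets for fast membership tests.
POS_WORDS = {"excellent", "good", "fascinating", "entertaining"}
NEG_WORDS = {"terrible", "bad", "trash"}

def net_sentiment(review):
    """Return (# positive tokens) - (# negative tokens) for one review string."""
    tokens = review.split()  # whole-word tokens, so "badly" is not counted as "bad"
    pos = sum(1 for t in tokens if t in POS_WORDS)
    neg = sum(1 for t in tokens if t in NEG_WORDS)
    return pos - neg

# First sample is the opening of review 2137 as printed earlier.
print(net_sentiment("this movie is terrible but it has some good effects ."))
# Made-up sentence: the substring approach would score "badly" as a hit for "bad".
print(net_sentiment("badly lit but excellent"))
```

Whether whole-word matching changes the picture on the full dataset is something to check by running it; the sketch only shows where the two approaches can diverge.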


## Andrew Trask's Solution

In [34]:
from collections import Counter
import numpy as np

positive_counts = Counter()
negative_counts = Counter()
total_counts = Counter()

for label, review in zip(labels, reviews):
    if label == "POSITIVE":
        for word in review.split():
            positive_counts[word] += 1
    else:
        for word in review.split():
            negative_counts[word] += 1
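    # Note: this runs once per review, after the inner loop, so only the
    # review's final word is tallied in total_counts.
    total_counts[word] += 1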
    
positive_counts.most_common()

[('the', 173324),
 ('.', 159654),
 ('and', 89722),
 ('a', 83688),
 ('of', 76855),
 ('to', 66746),
 ('is', 57245),
 ('in', 50215),
 ('br', 49235),
 ('it', 48025),
 ('i', 40743),
 ('that', 35630),
 ('this', 35080),
 ('s', 33815),
 ('as', 26308),
 ('with', 23247),
 ('for', 22416),
 ('was', 21917),
 ('film', 20937),
 ('but', 20822),
 ('movie', 19074),
 ('his', 17227),
 ('on', 17008),
 ('you', 16681),
 ('he', 16282),
 ('are', 14807),
 ('not', 14272),
 ('t', 13720),
 ('one', 13655),
 ('have', 12587),
 ('be', 12416),
 ('by', 11997),
 ('all', 11942),
 ('who', 11464),
 ('an', 11294),
 ('at', 11234),
 ('from', 10767),
 ('her', 10474),
 ('they', 9895),
 ('has', 9186),
 ('so', 9154),
 ('like', 9038),
 ('about', 8313),
 ('very', 8305),
 ('out', 8134),
 ('there', 8057),
 ('she', 7779),
 ('what', 7737),
 ('or', 7732),
 ('good', 7720),
 ('more', 7521),
 ('when', 7456),
 ('some', 7441),
 ('if', 7285),
 ('just', 7152),
 ('can', 7001),
 ('story', 6780),
 ('time', 6515),
 ('my', 6488),
 ('great', 6419),
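
In the counting cell above, `total_counts[word] += 1` is indented at the level of the `if`/`else`, so it executes once per review rather than once per word. If the intent is to tally every word, a variant could look like the sketch below. The helper `count_words` and the toy data are hypothetical, and the outputs recorded in this notebook were not produced by it.

```python
from collections import Counter

def count_words(reviews, labels):
    """Tally word counts per label and overall (illustrative helper)."""
    positive_counts, negative_counts, total_counts = Counter(), Counter(), Counter()
    for label, review in zip(labels, reviews):
        words = review.split()
        # Choose the per-label counter once, then update both tallies
        # with every word in the review.
        target = positive_counts if label == "POSITIVE" else negative_counts
        target.update(words)
        total_counts.update(words)
    return positive_counts, negative_counts, total_counts

# Toy data, not taken from reviews.txt
toy_reviews = ["great great film", "bad film"]
toy_labels = ["POSITIVE", "NEGATIVE"]
pos, neg, total = count_words(toy_reviews, toy_labels)
print(total["film"], pos["great"], neg["bad"])
```

With every word counted, a threshold such as `cnt > 10` in the later ratio cell would admit far more terms, so it may need to be revisited.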
 

In [35]:
pos_neg_ratios = Counter()
for term, cnt in total_counts.most_common():
    if cnt > 10:
        pos_neg_ratio = positive_counts[term] / float(negative_counts[term] + 1)
        pos_neg_ratios[term] = pos_neg_ratio
    
for word, ratio in pos_neg_ratios.most_common():
    if ratio > 1:
        pos_neg_ratios[word] = np.log(ratio)
    else:
        pos_neg_ratios[word] = -np.log(1 / (ratio + 0.01))

pos_neg_ratios.most_common()

[('recommended', 1.2163953243244932),
 ('today', 1.1050431789984001),
 ('cox', 0.95343027882361187),
 ('harp', 0.8938178760220965),
 ('great', 0.88810470901464589),
 ('best', 0.72347034060446314),
 ('classic', 0.62504956428050518),
 ('life', 0.54695514434959924),
 ('enjoy', 0.54124285135906114),
 ('fun', 0.51439068993048687),
 ('com', 0.51326167856387173),
 ('story', 0.26358920318924967),
 ('dvd', 0.25889596573330381),
 ('films', 0.21413624545873913),
 ('entertainment', 0.16857992894165819),
 ('polarisdib', 0.15415067982725836),
 ('again', 0.15399622774570954),
 ('is', 0.13363870656907434),
 ('more', 0.11097538681569005),
 ('of', 0.10766910804438501),
 ('from', 0.10106647591230367),
 ('see', 0.099917524764184903),
 ('stars', 0.096659404205090227),
 ('film', 0.085618555650856729),
 ('now', 0.056463107132025653),
 ('a', 0.053580080386005403),
 ('back', 0.049084671607873435),
 ('time', 0.047946314631421037),
 ('good', 0.039096368552785317),
 ('one', 0.038825331788165052),
 ('it', 0.003723
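
The cell above turns raw counts into a score that sits near zero for neutral words, is positive for words seen more often in positive reviews, and is negative otherwise. The sketch below isolates that transform for a single word so the arithmetic is easy to follow. The function name `log_ratio` and the counts passed to it are made up for illustration, and `math.log` stands in for `np.log` to keep the example dependency-free.

```python
import math

def log_ratio(pos_count, neg_count):
    """Mirror the notebook's transform for one word (illustrative helper)."""
    # +1 in the denominator avoids division by zero for words
    # that never appear in negative reviews.
    ratio = pos_count / float(neg_count + 1)
    if ratio > 1:
        return math.log(ratio)
    # The small 0.01 offset keeps the log finite when ratio is 0.
    return -math.log(1 / (ratio + 0.01))

# Hypothetical counts: (positive occurrences, negative occurrences)
print(log_ratio(30, 9))   # ratio 3.0, score above zero
print(log_ratio(9, 29))   # ratio 0.3, score below zero
print(log_ratio(10, 9))   # ratio 1.0, score close to zero
```

Taking the log makes a word that is three times as common in positive reviews and a word that is three times as common in negative reviews land at roughly equal distances from zero, which a raw ratio would not do.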

In [37]:
list(reversed(pos_neg_ratios.most_common()))

[('merk', -2.111046881095167),
 ('crap', -1.7677639636718392),
 ('avoid', -1.5285820410306898),
 ('costs', -1.3990063967082527),
 ('bad', -1.3181383703873577),
 ('f', -0.90624205165116323),
 ('money', -0.79601605681206233),
 ('okay', -0.75476155582444937),
 ('warned', -0.72448043757149894),
 ('instead', -0.71048388522675199),
 ('disappointed', -0.47718726461716088),
 ('b', -0.38448832030297636),
 ('better', -0.33011381019528946),
 ('rating', -0.31784967005772319),
 ('d', -0.29035079124211877),
 ('stuff', -0.2832606053774161),
 ('movie', -0.25618464739237767),
 ('did', -0.21622782810556321),
 ('ever', -0.16531984956808601),
 ('ten', -0.16471254278698103),
 ('this', -0.14241467209332659),
 ('p', -0.13760162693540845),
 ('too', -0.12057885649135439),
 ('movies', -0.1179885823861343),
 ('c', -0.11023804143432583),
 ('away', -0.10710050172302327),
 ('out', -0.087967492963159935),
 ('up', -0.086913313176089316),
 ('end', -0.080828408246384278),
 ('them', -0.079751103530691161),
 ('me', -0.07