# Sentiment Classification & How To "Frame Problems" for a Neural Network

by Andrew Trask

- **Twitter**: @iamtrask
- **Blog**: http://iamtrask.github.io

### What You Should Already Know

- neural networks, forward and back-propagation
- stochastic gradient descent
- mean squared error
- train/test splits

### Where to Get Help if You Need it
- Re-watch previous Udacity Lectures
- Leverage the recommended Course Reading Material - [Grokking Deep Learning](https://www.manning.com/books/grokking-deep-learning) (40% Off: **traskud17**)
- Shoot me a tweet @iamtrask


### Tutorial Outline:

- Intro: The Importance of "Framing a Problem"


- Curate a Dataset
- Developing a "Predictive Theory"
- **PROJECT 1**: Quick Theory Validation


- Transforming Text to Numbers
- **PROJECT 2**: Creating the Input/Output Data


- Putting it all together in a Neural Network
- **PROJECT 3**: Building our Neural Network


- Understanding Neural Noise
- **PROJECT 4**: Making Learning Faster by Reducing Noise


- Analyzing Inefficiencies in our Network
- **PROJECT 5**: Making our Network Train and Run Faster


- Further Noise Reduction
- **PROJECT 6**: Reducing Noise by Strategically Reducing the Vocabulary


- Analysis: What's going on in the weights?

# Import and curate dataset

In [6]:
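def pretty_print_review_and_label(i):
    # Show the label and the first 80 characters of review i.
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

# One review per line. rstrip("\n") drops only the newline, so a final
# line without a trailing newline does not lose its last character.
with open('reviews.txt', 'r') as f:
    reviews = [line.rstrip("\n") for line in f]

# One label per line, upper-cased so it can be compared against 'POSITIVE'.
with open('labels.txt', 'r') as f:
    labels = [line.rstrip("\n").upper() for line in f]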
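
The helper defined above is never called in this cell, so here is a minimal sketch of how it can be used to eyeball the data. The two reviews and labels are made-up stand-ins for the contents of `reviews.txt` and `labels.txt`, and `format_review_and_label` is a hypothetical variant of the helper that takes the lists as arguments so the snippet runs on its own.

```python
from collections import Counter

# Made-up stand-ins for the contents of reviews.txt and labels.txt.
sample_reviews = ["a wonderful little film with great acting",
                  "boring and far too long"]
sample_labels = ["POSITIVE", "NEGATIVE"]

def format_review_and_label(reviews, labels, i, width=80):
    """Return the label and the first `width` characters of review i."""
    return labels[i] + "\t:\t" + reviews[i][:width] + "..."

# Every review needs exactly one label.
assert len(sample_reviews) == len(sample_labels)

# How many examples of each class are there?
label_counts = Counter(sample_labels)
print(label_counts)

for i in range(len(sample_reviews)):
    print(format_review_and_label(sample_reviews, sample_labels, i))
```

Running the same length check and label count on the real `reviews` and `labels` lists is a cheap way to catch a misaligned file before training anything.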

# Develop a predictive theory

Look at the data and come up with an initial theory of how it should be classified. In our dataset, the presence of positive and negative *words* might indicate whether a review is good or bad. A natural first test is to count, for each review, how many words from each category it contains and compare the two counts.

In [19]:
# Project 1: Quick validation of our theory
# Hand-picked sentiment words, defined once instead of on every iteration.
pos_words = ["excellent", "amazing", "impressive", "good", "great", "genius"]
neg_words = ["terrible", "awful", "boring", "bad", "worst", "dumb"]

correct_pred = 0
incorrect_pred = 0

for i in range(len(reviews)):
    pos = sum(reviews[i].count(word) for word in pos_words)
    neg = sum(reviews[i].count(word) for word in neg_words)
    # Ties, including reviews with no hits at all, are predicted NEGATIVE.
    predict = "POSITIVE" if pos > neg else "NEGATIVE"
    if predict == labels[i]:
        correct_pred += 1
    else:
        incorrect_pred += 1

print("Correct: ", correct_pred)
print("Incorrect: ", incorrect_pred)
# end my solution

# Trask's solution
# Count how often each word appears in positive reviews, in negative reviews, and overall.
# (I skipped this step in my own solution, but it's a good idea.)
from collections import Counter
import numpy as np

positive_counts, negative_counts, total_counts = Counter(), Counter(), Counter()
for review, label in zip(reviews, labels):
    # Any label other than 'POSITIVE' is counted as negative.
    label_counts = positive_counts if label == 'POSITIVE' else negative_counts
    for word in review.split(" "):
        label_counts[word] += 1
        total_counts[word] += 1

Correct:  16704
Incorrect:  8296
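
One caveat about my quick check: it uses `str.count`, which matches substrings, so "bad" is also counted inside "badly". The sketch below contrasts that rule with whole-token matching on a made-up sentence. Both function names are hypothetical, and the numbers reported above come from the substring rule.

```python
def substring_hits(text, word):
    # Same matching rule as my quick check: every occurrence counts,
    # even when it sits inside a longer word.
    return text.count(word)

def token_hits(text, word):
    # Alternative rule: split on whitespace and count exact matches only.
    return text.split().count(word)

sample = "not bad at all but badly paced"
print(substring_hits(sample, "bad"))  # 2: "bad" and the start of "badly"
print(token_hits(sample, "bad"))      # 1: only the standalone token
```

Neither rule handles negation ("not bad"), which is one reason a hand-picked word list only gets us so far.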


In [29]:
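pos_neg_ratios = Counter()

# Positive-to-negative ratio for every word seen more than 100 times.
# The +1 in the denominator avoids division by zero.
for term, cnt in total_counts.most_common():
    if cnt > 100:
        pos_neg_ratios[term] = positive_counts[term] / float(negative_counts[term] + 1)

# Move to a log scale so that mostly-positive words land above 0 and
# mostly-negative words below 0. most_common() returns a list, so
# updating the Counter inside the loop is safe.
for word, ratio in pos_neg_ratios.most_common():
    if ratio > 1:
        pos_neg_ratios[word] = np.log(ratio)
    else:
        # The 0.01 offset keeps the log finite when the ratio is 0.
        pos_neg_ratios[word] = -np.log(1 / (ratio + 0.01))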
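
To see why the log step helps, here is a small sketch that applies the same two-branch transform to a single word. The function name `log_ratio` and the counts are made up for illustration, and it uses `math.log` in place of `np.log`, which gives the same result for scalars.

```python
import math

def log_ratio(pos_count, neg_count):
    """Apply the ratio and log steps from the cell above to one word."""
    ratio = pos_count / float(neg_count + 1)
    if ratio > 1:
        return math.log(ratio)
    return -math.log(1 / (ratio + 0.01))

# Made-up counts, not taken from the dataset.
print(log_ratio(400, 99))  # ratio 4.0    -> about +1.39
print(log_ratio(99, 399))  # ratio 0.2475 -> about -1.36
print(log_ratio(100, 99))  # ratio 1.0    -> about +0.01
```

On the raw scale, a word four times more common in positive reviews scores 4.0 while its mirror image scores about 0.25, so the two are hard to compare. After the log they sit at roughly +1.39 and -1.36, and neutral words cluster near 0.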

In [31]:
list(reversed(pos_neg_ratios.most_common()))

[('boll', -4.0778152602708904),
 ('uwe', -3.9218753018711578),
 ('seagal', -3.3202501058581921),
 ('unwatchable', -3.0269848170580955),
 ('stinker', -2.9876839403711624),
 ('mst', -2.7753833211707968),
 ('incoherent', -2.7641396677532537),
 ('unfunny', -2.5545257844967644),
 ('waste', -2.4907515123361046),
 ('blah', -2.4475792789485005),
 ('horrid', -2.3715779644809971),
 ('pointless', -2.3451073877136341),
 ('atrocious', -2.3187369339642556),
 ('redeeming', -2.2667790015910296),
 ('prom', -2.2601040980178784),
 ('drivel', -2.2476029585766928),
 ('lousy', -2.2118080125207054),
 ('worst', -2.1930856334332267),
 ('laughable', -2.172468615469592),
 ('awful', -2.1385076866397488),
 ('poorly', -2.1326133844207011),
 ('wasting', -2.1178155545614512),
 ('remotely', -2.111046881095167),
 ('existent', -2.0024805005437076),
 ('boredom', -1.9241486572738005),
 ('miserably', -1.9216610938019989),
 ('sucks', -1.9166645809588516),
 ('uninspired', -1.9131499212248517),
 ('lame', -1.9117232884159072),