# Sentiment Classification & How To "Frame Problems" for a Neural Network

by Andrew Trask

- **Twitter**: @iamtrask
- **Blog**: http://iamtrask.github.io

### What You Should Already Know

- neural networks, forward and back-propagation
- stochastic gradient descent
- mean squared error
- train/test splits

### Where to Get Help If You Need It

- Re-watch previous Udacity lectures
- Use the recommended course reading material: [Grokking Deep Learning](https://www.manning.com/books/grokking-deep-learning) (40% off: **traskud17**)
- Shoot me a tweet @iamtrask


### Tutorial Outline:

- Intro: The Importance of "Framing a Problem"


- Curate a Dataset
- Developing a "Predictive Theory"
- **PROJECT 1**: Quick Theory Validation


- Transforming Text to Numbers
- **PROJECT 2**: Creating the Input/Output Data


- Putting it all together in a Neural Network
- **PROJECT 3**: Building our Neural Network


- Understanding Neural Noise
- **PROJECT 4**: Making Learning Faster by Reducing Noise


- Analyzing Inefficiencies in our Network
- **PROJECT 5**: Making our Network Train and Run Faster


- Further Noise Reduction
- **PROJECT 6**: Reducing Noise by Strategically Reducing the Vocabulary


- Analysis: What's going on in the weights?

# Lesson: Curate a Dataset

In [34]:
def pretty_print_review_and_label(i):
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

# What we know!
with open('reviews.txt', 'r') as g:
    reviews = [line.rstrip('\n') for line in g]

# What we WANT to know!
with open('labels.txt', 'r') as g:
    labels = [line.rstrip('\n').upper() for line in g]

In [35]:
len(reviews)

25000

In [36]:
reviews[0]

'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   '

In [37]:
labels[0]

'POSITIVE'

# Lesson: Develop a Predictive Theory

In [38]:
print("labels.txt \t : \t reviews.txt\n")
pretty_print_review_and_label(2137)
pretty_print_review_and_label(12816)
pretty_print_review_and_label(6267)
pretty_print_review_and_label(21934)
pretty_print_review_and_label(5297)
pretty_print_review_and_label(4998)

labels.txt 	 : 	 reviews.txt

NEGATIVE	:	this movie is terrible but it has some good effects .  ...
POSITIVE	:	adrian pasdar is excellent is this film . he makes a fascinating woman .  ...
NEGATIVE	:	comment this movie is impossible . is terrible  very improbable  bad interpretat...
POSITIVE	:	excellent episode movie ala pulp fiction .  days   suicides . it doesnt get more...
NEGATIVE	:	if you haven  t seen this  it  s terrible . it is pure trash . i saw this about ...
POSITIVE	:	this schiffer guy is a real genius  the movie is of excellent quality and both e...


In [39]:
words_positive = {}
words_negative = {}
words_total = {}

for i in range(len(reviews)):
    words = reviews[i].split()
    if labels[i] == "POSITIVE":
        for word in words:
            try:
                words_positive[word] = words_positive[word] + 1
            except KeyError:
                # first time we see this word: it has occurred once
                words_positive[word] = 1

            try:
                words_total[word] = words_total[word] + 1
            except KeyError:
                words_total[word] = 1

    elif labels[i] == "NEGATIVE":
        for word in words:
            try:
                words_negative[word] = words_negative[word] + 1
            except KeyError:
                words_negative[word] = 1

            try:
                words_total[word] = words_total[word] + 1
            except KeyError:
                words_total[word] = 1


In [40]:
words_negative

{'story': 5208,
 'of': 69009,
 'a': 79321,
 'man': 2516,
 'who': 9969,
 'has': 7604,
 'unnatural': 41,
 'feelings': 109,
 'for': 21927,
 'pig': 56,
 '.': 167538,
 'starts': 675,
 'out': 8979,
 'with': 20878,
 'opening': 501,
 'scene': 2816,
 'that': 37615,
 'is': 50083,
 'terrific': 78,
 'example': 744,
 'absurd': 225,
 'comedy': 1504,
 'formal': 6,
 'orchestra': 14,
 'audience': 1178,
 'turned': 518,
 'into': 4567,
 'an': 10266,
 'insane': 137,
 'violent': 231,
 'mob': 71,
 'by': 10549,
 'the': 163389,
 'crazy': 342,
 'chantings': 1,
 'it': 48327,
 's': 31546,
 'singers': 34,
 'unfortunately': 931,
 'stays': 57,
 'whole': 1768,
 'time': 6209,
 'no': 8143,
 'general': 381,
 'narrative': 181,
 'eventually': 327,
 'making': 1695,
 'just': 10619,
 'too': 4174,
 'off': 3567,
 'putting': 189,
 'even': 7687,
 'those': 2212,
 'from': 9731,
 'era': 210,
 'should': 2920,
 'be': 14541,
 'cryptic': 11,
 'dialogue': 942,
 'would': 7036,
 'make': 4722,
 'shakespeare': 112,
 'seem': 1147,
 'easy': 2

In [41]:
words_positive

{'bromwell': 8,
 'high': 1095,
 'is': 57245,
 'a': 83688,
 'cartoon': 249,
 'comedy': 1742,
 '.': 159654,
 'it': 48025,
 'ran': 122,
 'at': 11234,
 'the': 173324,
 'same': 1972,
 'time': 6515,
 'as': 26308,
 'some': 7441,
 'other': 4858,
 'programs': 39,
 'about': 8313,
 'school': 783,
 'life': 4199,
 'such': 2590,
 'teachers': 35,
 'my': 6488,
 'years': 2758,
 'in': 50215,
 'teaching': 40,
 'profession': 37,
 'lead': 630,
 'me': 5167,
 'to': 66746,
 'believe': 1071,
 'that': 35630,
 's': 33815,
 'satire': 136,
 'much': 4685,
 'closer': 126,
 'reality': 605,
 'than': 4736,
 'scramble': 2,
 'survive': 147,
 'financially': 20,
 'insightful': 41,
 'students': 161,
 'who': 11464,
 'can': 7001,
 'see': 6026,
 'right': 1688,
 'through': 2409,
 'their': 6107,
 'pathetic': 61,
 'pomp': 5,
 'pettiness': 2,
 'of': 76855,
 'whole': 1310,
 'situation': 386,
 'all': 11942,
 'remind': 84,
 'schools': 30,
 'i': 40743,
 'knew': 462,
 'and': 89722,
 'when': 7456,
 'saw': 1692,
 'episode': 1023,
 'which

In [42]:
from collections import Counter
import numpy as np
        

In [43]:
positive_counts = Counter()
negative_counts = Counter()
total_counts = Counter()

In [44]:
for i in range(len(reviews)):
    # split(" ") keeps the empty strings between consecutive spaces,
    # which is why '' shows up as a token in the counts
    if labels[i] == 'POSITIVE':
        for word in reviews[i].split(" "):
            positive_counts[word] += 1
            total_counts[word] += 1
    else:
        for word in reviews[i].split(" "):
            negative_counts[word] += 1
            total_counts[word] += 1
            

In [45]:
positive_counts.most_common()

[('', 550468),
 ('the', 173324),
 ('.', 159654),
 ('and', 89722),
 ('a', 83688),
 ('of', 76855),
 ('to', 66746),
 ('is', 57245),
 ('in', 50215),
 ('br', 49235),
 ('it', 48025),
 ('i', 40743),
 ('that', 35630),
 ('this', 35080),
 ('s', 33815),
 ('as', 26308),
 ('with', 23247),
 ('for', 22416),
 ('was', 21917),
 ('film', 20937),
 ('but', 20822),
 ('movie', 19074),
 ('his', 17227),
 ('on', 17008),
 ('you', 16681),
 ('he', 16282),
 ('are', 14807),
 ('not', 14272),
 ('t', 13720),
 ('one', 13655),
 ('have', 12587),
 ('be', 12416),
 ('by', 11997),
 ('all', 11942),
 ('who', 11464),
 ('an', 11294),
 ('at', 11234),
 ('from', 10767),
 ('her', 10474),
 ('they', 9895),
 ('has', 9186),
 ('so', 9154),
 ('like', 9038),
 ('about', 8313),
 ('very', 8305),
 ('out', 8134),
 ('there', 8057),
 ('she', 7779),
 ('what', 7737),
 ('or', 7732),
 ('good', 7720),
 ('more', 7521),
 ('when', 7456),
 ('some', 7441),
 ('if', 7285),
 ('just', 7152),
 ('can', 7001),
 ('story', 6780),
 ('time', 6515),
 ('my', 6488),
 ('g

In [46]:
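pos_neg_ratios = Counter()

# Ratio of positive to negative uses for every word seen more than 100 times;
# the +1 in the denominator avoids division by zero
for term, cnt in total_counts.most_common():
    if cnt > 100:
        pos_neg_ratio = positive_counts[term] / float(negative_counts[term] + 1)
        pos_neg_ratios[term] = pos_neg_ratio

# Take the log so mostly-positive words score above 0 and mostly-negative
# words score below 0; the 0.01 keeps the log finite when the ratio is 0
for word, ratio in pos_neg_ratios.most_common():
    if ratio > 1:
        pos_neg_ratios[word] = np.log(ratio)
    else:
        pos_neg_ratios[word] = -np.log(1 / (ratio + 0.01))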
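
To get a feel for what the log ratio does, here is a small sketch that applies the same transform to a made-up corpus. The names `toy_reviews`, `toy_labels`, `toy_pos`, `toy_neg` and `log_ratio` are invented for this illustration and the sentences are not taken from `reviews.txt`.

```python
from collections import Counter
import numpy as np

# Toy corpus, made up for illustration only
toy_reviews = ["great great film", "great acting", "awful film", "awful awful plot"]
toy_labels = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

toy_pos, toy_neg = Counter(), Counter()
for review, label in zip(toy_reviews, toy_labels):
    target = toy_pos if label == "POSITIVE" else toy_neg
    target.update(review.split())

def log_ratio(word):
    """Smoothed positive/negative ratio followed by a log, as in the cell above."""
    ratio = toy_pos[word] / float(toy_neg[word] + 1)
    if ratio > 1:
        return np.log(ratio)
    return -np.log(1 / (ratio + 0.01))

for word in ["great", "film", "awful"]:
    print(word, round(float(log_ratio(word)), 3))
```

"great" only shows up in positive reviews and scores above 0, while "awful" only shows up in negative reviews and scores well below 0. "film" appears once on each side, yet the +1 in the denominator pushes it below 0 with counts this tiny. With the `cnt > 100` filter used on the real data, that smoothing effect should be much smaller.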

In [47]:
pos_neg_ratios.most_common()

[('edie', 4.6913478822291435),
 ('paulie', 4.0775374439057197),
 ('felix', 3.1527360223636558),
 ('polanski', 2.8233610476132043),
 ('matthau', 2.8067217286092401),
 ('victoria', 2.6810215287142909),
 ('mildred', 2.6026896854443837),
 ('gandhi', 2.5389738710582761),
 ('flawless', 2.451005098112319),
 ('superbly', 2.2600254785752498),
 ('perfection', 2.1594842493533721),
 ('astaire', 2.1400661634962708),
 ('captures', 2.0386195471595809),
 ('voight', 2.0301704926730531),
 ('wonderfully', 2.0218960560332353),
 ('powell', 1.9783454248084671),
 ('brosnan', 1.9547990964725592),
 ('lily', 1.9203768470501485),
 ('bakshi', 1.9029851043382795),
 ('lincoln', 1.9014583864844796),
 ('refreshing', 1.8551812956655511),
 ('breathtaking', 1.8481124057791867),
 ('bourne', 1.8478489358790986),
 ('lemmon', 1.8458266904983307),
 ('delightful', 1.8002701588959635),
 ('flynn', 1.7996646487351682),
 ('andrews', 1.7764919970972666),
 ('homer', 1.7692866133759964),
 ('beautifully', 1.7626953362841438),
 ('socc

In [50]:
vocab = set(total_counts.keys())
vocab_size = len(vocab)
print(vocab_size)

74074


In [52]:
import numpy as np

layer_0 = np.zeros((1, len(vocab)))
layer_0

array([[ 0.,  0.,  0., ...,  0.,  0.,  0.]])

In [None]:
def update_input_layer(review):
    """ Modify the global layer_0 to represent the vector form of review.
    The element at a given index of layer_0 should represent
    how many times the given word occurs in the review.
    Args:
        review(string) - the string of the review
    Returns:
        None
    """
    global layer_0
    # clear out previous state, reset the layer to be all 0s
    layer_0 *= 0
    ## Your code here
    pass
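
If you get stuck, one possible way to fill in `update_input_layer` is sketched below. It assumes a `word2index` dictionary that maps each vocabulary word to a column of `layer_0`. That helper is not defined anywhere above, so treat it and the three-word `vocab` as stand-ins for illustration rather than the intended solution.

```python
import numpy as np

# Stand-in vocabulary: the real one has 74074 words, three are enough here
vocab = ["excellent", "movie", "terrible"]
# Assumed helper (not defined in the notebook so far): word -> column index
word2index = {word: i for i, word in enumerate(vocab)}
layer_0 = np.zeros((1, len(vocab)))

def update_input_layer(review):
    """Fill layer_0 with the number of times each vocab word occurs in review."""
    # Reset in place; because layer_0 is never rebound, this sketch
    # does not need a `global` declaration
    layer_0.fill(0)
    for word in review.split(" "):
        index = word2index.get(word)
        if index is not None:  # ignore tokens that are not in the vocabulary
            layer_0[0, index] += 1

update_input_layer("terrible movie terrible")
print(layer_0)
```

Skipping unknown tokens is a design choice made for this sketch. With the full vocabulary built from `total_counts`, every token in the training reviews would already have an index.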


In [None]:
def get_target_for_label(label):
    """Convert a label to `0` or `1`.
    Args:
        label(string) - Either "POSITIVE" or "NEGATIVE".
    Returns:
        `0` or `1`.
    """
    pass
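
A minimal sketch of `get_target_for_label` could look like the following. The docstring does not say which label maps to which number, so mapping "POSITIVE" to `1` and "NEGATIVE" to `0` is an assumption here, as is raising an error for anything else.

```python
def get_target_for_label(label):
    """Convert a label to 0 or 1 (assumed mapping: POSITIVE -> 1, NEGATIVE -> 0)."""
    if label == "POSITIVE":
        return 1
    if label == "NEGATIVE":
        return 0
    # Being strict about unexpected labels is a choice made for this sketch
    raise ValueError("unexpected label: " + repr(label))

print(get_target_for_label("POSITIVE"), get_target_for_label("NEGATIVE"))
```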