# Sentiment Classification & How To "Frame Problems" for a Neural Network

by Andrew Trask

- **Twitter**: @iamtrask
- **Blog**: http://iamtrask.github.io

### What You Should Already Know

- neural networks, forward and back-propagation
- stochastic gradient descent
- mean squared error
- train/test splits

### Where to Get Help if You Need It
- Re-watch previous Udacity lectures
- Consult the recommended course reading material - [Grokking Deep Learning](https://www.manning.com/books/grokking-deep-learning) (40% Off: **traskud17**)
- Shoot me a tweet @iamtrask


### Tutorial Outline:

- Intro: The Importance of "Framing a Problem"


- Curate a Dataset
- Developing a "Predictive Theory"
- **PROJECT 1**: Quick Theory Validation


- Transforming Text to Numbers
- **PROJECT 2**: Creating the Input/Output Data


- Putting it all together in a Neural Network
- **PROJECT 3**: Building our Neural Network


- Understanding Neural Noise
- **PROJECT 4**: Making Learning Faster by Reducing Noise


- Analyzing Inefficiencies in our Network
- **PROJECT 5**: Making our Network Train and Run Faster


- Further Noise Reduction
- **PROJECT 6**: Reducing Noise by Strategically Reducing the Vocabulary


- Analysis: What's going on in the weights?

# Lesson: Curate a Dataset

In [7]:
def pretty_print_review_and_label(i):
    # Show the label next to the first 80 characters of review i
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

# What we know: one review per line
with open('reviews.txt', 'r') as g:
    reviews = [line.rstrip('\n') for line in g]

# What we WANT to know: one label per line, upper-cased
with open('labels.txt', 'r') as g:
    labels = [line.rstrip('\n').upper() for line in g]

In [8]:
len(reviews)

25000

In [9]:
reviews[0]

'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   '

In [10]:
labels[0]

'POSITIVE'

# Lesson: Develop a Predictive Theory

In [11]:
print("labels.txt \t : \t reviews.txt\n")
pretty_print_review_and_label(2137)
pretty_print_review_and_label(12816)
pretty_print_review_and_label(6267)
pretty_print_review_and_label(21934)
pretty_print_review_and_label(5297)
pretty_print_review_and_label(4998)

labels.txt 	 : 	 reviews.txt

NEGATIVE	:	this movie is terrible but it has some good effects .  ...
POSITIVE	:	adrian pasdar is excellent is this film . he makes a fascinating woman .  ...
NEGATIVE	:	comment this movie is impossible . is terrible  very improbable  bad interpretat...
POSITIVE	:	excellent episode movie ala pulp fiction .  days   suicides . it doesnt get more...
NEGATIVE	:	if you haven  t seen this  it  s terrible . it is pure trash . i saw this about ...
POSITIVE	:	this schiffer guy is a real genius  the movie is of excellent quality and both e...


In [12]:
# Theory: counts of words like "terrible" and "excellent" should be good indicators of sentiment
from collections import Counter
import numpy as np

In [13]:
pos_counter = Counter()
neg_counter = Counter()
total_counter = Counter()

In [18]:
# Tally each word per label and overall
for i in range(len(reviews)):
    if labels[i] == "POSITIVE":
        for word in reviews[i].split(" "):
            pos_counter[word] += 1
            total_counter[word] += 1
    else:
        for word in reviews[i].split(" "):
            neg_counter[word] += 1
            total_counter[word] += 1
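
One detail worth checking in the counting loop above: the reviews contain runs of two or more spaces (see `reviews[0]`), and `split(" ")` returns an empty string for each extra space, so `''` ends up counted as a word. The sketch below illustrates this on a short fragment taken from the first review. The variable names are illustrative and not part of the original notebook.

```python
from collections import Counter

# Fragment of reviews[0]; note the double spaces left by removed punctuation
sample = "school life  such as  teachers  ."

tokens_space = sample.split(" ")  # splits on every single space
tokens_any = sample.split()       # splits on any run of whitespace

print(tokens_space)               # includes '' entries for the extra spaces
print(tokens_any)                 # no empty tokens

empty_count = Counter(tokens_space)['']
print(empty_count)                # 3
```

Switching to `split()` would drop the empty token from the counters. Whether that changes the ratios in any meaningful way is not something this notebook tests.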

In [41]:
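pos_neg_ratios = Counter()

# Ratio of positive to negative uses, for words seen more than 200 times
# (+1 in the denominator avoids division by zero)
for word, count in total_counter.most_common():
    if count > 200:
        pos_neg_ratios[word] = pos_counter[word] / float(neg_counter[word] + 1)

# Take logs so positive and negative words land on a comparable scale around 0
# (+0.01 keeps the log finite when the ratio is 0)
for word, ratio in pos_neg_ratios.most_common():
    if ratio > 1:
        pos_neg_ratios[word] = np.log(ratio)
    else:
        pos_neg_ratios[word] = -np.log(1 / (ratio + 0.01))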
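
To see what the two-step transform above does to a single word, here is a minimal sketch that applies the same arithmetic to hypothetical counts. The function name `log_ratio` and the counts are made up for illustration, and `math.log` stands in for `np.log` so the snippet runs without NumPy.

```python
import math

def log_ratio(pos_count, neg_count):
    """Apply the same ratio-then-log arithmetic used in the cell above."""
    ratio = pos_count / float(neg_count + 1)
    if ratio > 1:
        return math.log(ratio)
    return -math.log(1 / (ratio + 0.01))

# Hypothetical counts, chosen so the raw ratios are easy to read
twice_as_positive = log_ratio(200, 99)   # raw ratio 2.0 -> about +0.69
twice_as_negative = log_ratio(100, 199)  # raw ratio 0.5 -> about -0.67
neutral = log_ratio(100, 99)             # raw ratio 1.0 -> about +0.01
never_positive = log_ratio(0, 300)       # raw ratio 0.0 -> about -4.61

print(twice_as_positive, twice_as_negative, neutral, never_positive)
```

A word used twice as often in positive reviews and one used twice as often in negative reviews end up at roughly equal distances from zero, which the raw ratios (2.0 and 0.5) do not give you. Neutral words sit near zero.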

In [42]:
pos_neg_ratios.most_common()

[('edie', 6.0776422433490342),
 ('antwone', 5.8636311755980968),
 ('din', 5.7930136083841441),
 ('gunga', 5.575949103146316),
 ('goldsworthy', 5.5606816310155276),
 ('gypo', 5.4806389233419912),
 ('yokai', 5.4806389233419912),
 ('visconti', 5.3181199938442161),
 ('flavia', 5.3181199938442161),
 ('paulie', 4.5475410731514554),
 ('iturbi', 3.7471483622379123),
 ('gundam', 3.7108583608327113),
 ('felix', 3.3152549518614309),
 ('capote', 3.2999436479569826),
 ('excellently', 3.2666072276893905),
 ('fido', 3.2157941583307927),
 ('deanna', 3.2144214745188204),
 ('hayworth', 3.2144214745188204),
 ('vance', 3.1520783439446851),
 ('sammo', 3.1403135023650988),
 ('kolchak', 3.1388331171946642),
 ('laputa', 3.120895416507997),
 ('sox', 3.0408527088344606),
 ('alvin', 3.0260376230493202),
 ('biko', 2.9957322735539909),
 ('mathieu', 2.980228087018026),
 ('polanski', 2.956892440237727),
 ('gilliam', 2.932218867831665),
 ('matthau', 2.8937331055988698),
 ('harriet', 2.8466966943935033),
 ('chavez', 2

In [43]:
list(reversed(pos_neg_ratios.most_common()))[0:30]

[('boll', -4.0756718205168259),
 ('uwe', -3.9182149756760669),
 ('thunderbirds', -3.6371780797370459),
 ('beowulf', -3.6269376625580763),
 ('dahmer', -3.5480974157613048),
 ('wayans', -3.5098801595119595),
 ('seagal', -3.3166932502778712),
 ('gamera', -3.0747950693411785),
 ('dreck', -3.0295236627398889),
 ('unwatchable', -3.0212363533093325),
 ('stinker', -2.9815834776445578),
 ('mst', -2.7717131932184365),
 ('incoherent', -2.7593434954897607),
 ('flimsy', -2.7442167594087437),
 ('dillinger', -2.66353354771016),
 ('yawn', -2.6538836798963978),
 ('unfunny', -2.551917755714999),
 ('waste', -2.4902660992113286),
 ('ugh', -2.4801254711323799),
 ('turd', -2.4801254711323799),
 ('blah', -2.4439463932214442),
 ('slugs', -2.4298642943595183),
 ('tripe', -2.3746722178316588),
 ('horrid', -2.3653536197247611),
 ('wtf', -2.3603393656330929),
 ('pointless', -2.3436645987258586),
 ('atrocious', -2.3150058792909158),
 ('damme', -2.2964550528270831),
 ('redeeming', -2.2645095320211333),
 ('prom', -2