# Sentiment Classification & How To "Frame Problems" for a Neural Network

by Andrew Trask

- **Twitter**: @iamtrask
- **Blog**: http://iamtrask.github.io

### What You Should Already Know

- neural networks, forward and back-propagation
- stochastic gradient descent
- mean squared error
- and train/test splits

### Where to Get Help if You Need it
- Re-watch previous Udacity Lectures
- Leverage the recommended Course Reading Material - [Grokking Deep Learning](https://www.manning.com/books/grokking-deep-learning) (40% Off: **traskud17**)
- Shoot me a tweet @iamtrask


### Tutorial Outline:

- Intro: The Importance of "Framing a Problem"


- Curate a Dataset
- Developing a "Predictive Theory"
- **PROJECT 1**: Quick Theory Validation


- Transforming Text to Numbers
- **PROJECT 2**: Creating the Input/Output Data


- Putting it all together in a Neural Network
- **PROJECT 3**: Building our Neural Network


- Understanding Neural Noise
- **PROJECT 4**: Making Learning Faster by Reducing Noise


- Analyzing Inefficiencies in our Network
- **PROJECT 5**: Making our Network Train and Run Faster


- Further Noise Reduction
- **PROJECT 6**: Reducing Noise by Strategically Reducing the Vocabulary


- Analysis: What's going on in the weights?

# Lesson: Curate a Dataset

In [7]:
def pretty_print_review_and_label(i):
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

g = open('reviews.txt','r') # What we know!
reviews = list(map(lambda x:x[:-1],g.readlines()))
g.close()

g = open('labels.txt','r') # What we WANT to know!
labels = list(map(lambda x:x[:-1].upper(),g.readlines()))
g.close()

In [8]:
len(reviews)

25000

In [9]:
reviews[0]

'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   '

In [10]:
labels[0]

'POSITIVE'

# Lesson: Develop a Predictive Theory

In [11]:
print("labels.txt \t : \t reviews.txt\n")
pretty_print_review_and_label(2137)
pretty_print_review_and_label(12816)
pretty_print_review_and_label(6267)
pretty_print_review_and_label(21934)
pretty_print_review_and_label(5297)
pretty_print_review_and_label(4998)

labels.txt 	 : 	 reviews.txt

NEGATIVE	:	this movie is terrible but it has some good effects .  ...
POSITIVE	:	adrian pasdar is excellent is this film . he makes a fascinating woman .  ...
NEGATIVE	:	comment this movie is impossible . is terrible  very improbable  bad interpretat...
POSITIVE	:	excellent episode movie ala pulp fiction .  days   suicides . it doesnt get more...
NEGATIVE	:	if you haven  t seen this  it  s terrible . it is pure trash . i saw this about ...
POSITIVE	:	this schiffer guy is a real genius  the movie is of excellent quality and both e...


In [12]:
# Count of words like terrible and excellent should be a good indicative
from collections import Counter
import numpy as np

In [13]:
pos_counter= Counter()
neg_counter= Counter()
total_counter = Counter()

In [18]:
for i in range(len(reviews)):
    if(labels[i]== "POSITIVE"):
        for word in reviews[i].split(" "):
            pos_counter[word] +=1
            total_counter[word] +=1
    else:
         for word in reviews[i].split(" "):
            neg_counter[word] +=1
            total_counter[word] +=1

In [33]:
pos_neg_ratios = Counter()
for word,count in list(total_counter.most_common()):
    if (count>100):
        pos_neg_ratio = pos_counter[word]/float(neg_counter[word]+1)
        pos_neg_ratios[word] = pos_neg_ratio
for word,ratio in pos_neg_counter.most_common():
    if ratio>1 :
        pos_neg_ratios[word]=np.log(ratio)
    else:
        pos_neg_ratios[word]=-np.log((1/(ratio+0.01)))

In [34]:
pos_neg_ratios.most_common()

[('edie', 436.0),
 ('antwone', 352.0),
 ('din', 328.0),
 ('gunga', 264.0),
 ('goldsworthy', 260.0),
 ('gypo', 240.0),
 ('yokai', 240.0),
 ('visconti', 204.0),
 ('flavia', 204.0),
 ('blandings', 192.0),
 ('kells', 192.0),
 ('brashear', 188.0),
 ('gino', 184.0),
 ('deathtrap', 180.0),
 ('harilal', 164.0),
 ('panahi', 164.0),
 ('ossessione', 156.0),
 ('tsui', 152.0),
 ('caruso', 152.0),
 ('sabu', 148.0),
 ('ahmad', 148.0),
 ('khouri', 144.0),
 ('dominick', 144.0),
 ('aweigh', 140.0),
 ('mj', 140.0),
 ('mcintire', 136.0),
 ('kriemhild', 136.0),
 ('blackie', 132.0),
 ('daisies', 132.0),
 ('newcombe', 132.0),
 ('kei', 128.0),
 ('trelkovsky', 128.0),
 ('jaffar', 124.0),
 ('hilliard', 124.0),
 ('bathsheba', 120.0),
 ('pazu', 120.0),
 ('sheeta', 120.0),
 ('krell', 120.0),
 ('offside', 120.0),
 ('venoms', 116.0),
 ('fineman', 116.0),
 ('paine', 112.0),
 ('pimlico', 112.0),
 ('ranma', 112.0),
 ('ronny', 112.0),
 ('abhay', 108.0),
 ('kipling', 104.0),
 ('pym', 104.0),
 ('paulie', 94.4),
 ('iturbi'

In [35]:
list(reversed(pos_neg_ratios.most_common()))[0:30]

[('rosanna', 0.0),
 ('berkowitz', 0.0),
 ('orca', 0.0),
 ('mraovich', 0.0),
 ('kibbutz', 0.0),
 ('dunaway', 0.0),
 ('savini', 0.0),
 ('lordi', 0.0),
 ('hackenstein', 0.0),
 ('manos', 0.0),
 ('zenia', 0.0),
 ('shaq', 0.0),
 ('tremors', 0.0),
 ('darkman', 0.0),
 ('btk', 0.0),
 ('saif', 0.0),
 ('delia', 0.0),
 ('sarne', 0.0),
 ('gram', 0.0),
 ('kornbluth', 0.0),
 ('kareena', 0.0),
 ('hobgoblins', 0.0),
 ('tashan', 0.0),
 ('slater', 0.0),
 ('boll', 0.006980802792321117),
 ('uwe', 0.009876543209876543),
 ('thunderbirds', 0.0163265306122449),
 ('beowulf', 0.016597510373443983),
 ('dahmer', 0.018779342723004695),
 ('wayans', 0.01990049751243781)]