# Sentiment Classification & How To "Frame Problems" for a Neural Network

by Andrew Trask

- **Twitter**: @iamtrask
- **Blog**: http://iamtrask.github.io

### What You Should Already Know

- neural networks, forward and back-propagation
- stochastic gradient descent
- mean squared error
- train/test splits

### Where to Get Help if You Need It

- Re-watch previous Udacity Lectures
- Leverage the recommended Course Reading Material - [Grokking Deep Learning](https://www.manning.com/books/grokking-deep-learning) (40% Off: **traskud17**)
- Shoot me a tweet @iamtrask


### Tutorial Outline:

- Intro: The Importance of "Framing a Problem"


- Curate a Dataset
- Developing a "Predictive Theory"
- **PROJECT 1**: Quick Theory Validation


- Transforming Text to Numbers
- **PROJECT 2**: Creating the Input/Output Data


- Putting it all together in a Neural Network
- **PROJECT 3**: Building our Neural Network


- Understanding Neural Noise
- **PROJECT 4**: Making Learning Faster by Reducing Noise


- Analyzing Inefficiencies in our Network
- **PROJECT 5**: Making our Network Train and Run Faster


- Further Noise Reduction
- **PROJECT 6**: Reducing Noise by Strategically Reducing the Vocabulary


- Analysis: What's going on in the weights?

# Lesson: Curate a Dataset

In [18]:
# Peek at the raw first review; note the trailing '\n'
with open('reviews.txt', 'r') as g:
    first_line = g.readlines()[0]
first_line

'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   \n'

In [20]:
with open('reviews.txt', 'r') as g:
    first_line = g.readlines()[0]
print(first_line)       # still ends with '\n', so a blank line follows
print(first_line[:-1])  # drop the trailing newline character

bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   

bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead 

In [21]:
# Peek at the raw first label; it also ends with '\n'
with open('labels.txt', 'r') as g:
    first_line = g.readlines()[0]
first_line

'positive\n'

In [22]:
with open('labels.txt', 'r') as g:
    first_line = g.readlines()[0]
print(first_line)       # still ends with '\n', so a blank line follows
print(first_line[:-1])  # drop the trailing newline character

positive

positive
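Slicing with `[:-1]` removes the last character whatever it is, which works here because every line ends in `\n`. A minimal sketch of the difference from `rstrip("\n")`, using a hypothetical pair of lines (not taken from `labels.txt`) where the final line has no trailing newline:

```python
# Hypothetical input: the last line is missing its trailing newline
raw_lines = ["positive\n", "negative"]

# [:-1] always drops one character, even when it is not a newline
sliced = [line[:-1] for line in raw_lines]

# rstrip("\n") only removes newline characters at the end of the line
stripped = [line.rstrip("\n") for line in raw_lines]

print(sliced)
print(stripped)
```

If a file might lack a final newline, `rstrip("\n")` is the safer choice; for the files used here both approaches give the same result.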


### Begin Coding Now

In [23]:
def pretty_print_review_and_label(i):
    # Show the label next to the first 80 characters of the review
    print(labels[i] + "\t:\t" + reviews[i][:80] + "...")

# What we know: one review per line (drop the trailing newline)
with open('reviews.txt', 'r') as g:
    reviews = [line[:-1] for line in g.readlines()]

# What we WANT to know: one label per line, upper-cased
with open('labels.txt', 'r') as g:
    labels = [line[:-1].upper() for line in g.readlines()]

In [24]:
len(reviews)

25000

In [25]:
reviews[0]

'bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   '

In [26]:
labels[0]

'POSITIVE'
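Before building a theory on top of the data, it can help to confirm that the two lists line up and to see how the labels are distributed. The sketch below is one way to do that; `summarize_dataset` is a hypothetical helper, and the sample lists are made-up stand-ins rather than rows from the dataset.

```python
from collections import Counter

# Hypothetical stand-ins for the `reviews` and `labels` lists loaded above
sample_reviews = ["great fun", "pure trash", "an excellent film"]
sample_labels = ["POSITIVE", "NEGATIVE", "POSITIVE"]

def summarize_dataset(reviews, labels):
    """Check that every review has a label, then count each label value."""
    if len(reviews) != len(labels):
        raise ValueError("reviews and labels must have the same length")
    return Counter(labels)

label_counts = summarize_dataset(sample_reviews, sample_labels)
print(label_counts)
```

Calling `summarize_dataset(reviews, labels)` on the real lists would show whether the two classes are balanced.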

# Lesson: Develop a Predictive Theory

In [27]:
print("labels.txt \t : \t reviews.txt\n")
pretty_print_review_and_label(2137)
pretty_print_review_and_label(12816)
pretty_print_review_and_label(6267)
pretty_print_review_and_label(21934)
pretty_print_review_and_label(5297)
pretty_print_review_and_label(4998)

labels.txt 	 : 	 reviews.txt

NEGATIVE	:	this movie is terrible but it has some good effects .  ...
POSITIVE	:	adrian pasdar is excellent is this film . he makes a fascinating woman .  ...
NEGATIVE	:	comment this movie is impossible . is terrible  very improbable  bad interpretat...
POSITIVE	:	excellent episode movie ala pulp fiction .  days   suicides . it doesnt get more...
NEGATIVE	:	if you haven  t seen this  it  s terrible . it is pure trash . i saw this about ...
POSITIVE	:	this schiffer guy is a real genius  the movie is of excellent quality and both e...


# Project 1: Quick Theory Validation

In [29]:
# Formulate a count-based heuristic and think about how to use this dataset
# to validate the theory that words in reviews are predictive of labels

In [30]:
pos_words=['fascinating','excellent']
neg_words=['terrible', 'trash']

In [31]:
# My hypothesis is that reviews containing pos_words will tend to be POSITIVE
# and those containing neg_words will tend to be NEGATIVE

In [32]:
# Find the number of POSITIVE and NEGATIVE reviews that contain the word 'fascinating'

In [47]:
def count_heuristics(word):
    # Count how many POSITIVE and NEGATIVE reviews contain `word`.
    # This is a substring match, so 'trash' also matches 'trashy'.
    pos_count = 0
    neg_count = 0
    for review, label in zip(reviews, labels):
        if word in review:
            if label == 'POSITIVE':
                pos_count += 1
            else:
                neg_count += 1
    return pos_count, neg_count
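The function above matches substrings, so a query word is also counted when it appears inside a longer word. A minimal sketch of how that differs from whole-word matching, using a hypothetical toy list and two illustrative helpers (`contains_substring` and `contains_token` are not part of the notebook):

```python
# Hypothetical mini-dataset, used only to illustrate the difference
toy_reviews = ["a trashy but fun movie", "pure trash", "excellently paced"]

def contains_substring(review, word):
    # Same test as the counting function above: matches inside longer words
    return word in review

def contains_token(review, word):
    # Split on spaces, mirroring the space-separated review text
    return word in review.split(' ')

substring_hits = sum(contains_substring(r, "trash") for r in toy_reviews)
token_hits = sum(contains_token(r, "trash") for r in toy_reviews)
print(substring_hits, token_hits)
```

Either choice is defensible for a quick check; substring matching folds variants such as "trashy" into the count, while token matching counts only the exact word.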

In [48]:
word='fascinating'
pos_count, neg_count = count_heuristics(word)
print(word,pos_count,neg_count)

fascinating 270 98


In [49]:
word='excellent'
pos_count, neg_count = count_heuristics(word)
print(word,pos_count,neg_count)

excellent 1476 352


In [50]:
word='good'
pos_count, neg_count = count_heuristics(word)
print(word,pos_count,neg_count)

good 4920 4949


In [51]:
word='terrible'
pos_count, neg_count = count_heuristics(word)
print(word,pos_count,neg_count)

terrible 217 1122


In [52]:
word='trash'
pos_count, neg_count = count_heuristics(word)
print(word,pos_count,neg_count)

trash 160 414


In [53]:
word='horrible'
pos_count, neg_count = count_heuristics(word)
print(word,pos_count,neg_count)

horrible 153 853
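The raw counts above are easier to compare as a positive-to-negative ratio. The sketch below is one possible summary, not necessarily the approach the later lessons take: it copies the counts printed above and takes the logarithm of each ratio, so that words leaning positive come out above zero, words leaning negative come out below zero, and a neutral word such as "good" lands near zero. The name `log_ratio` is introduced here for illustration.

```python
import math

# (positive, negative) review counts copied from the outputs above
counts = {
    "fascinating": (270, 98),
    "excellent": (1476, 352),
    "good": (4920, 4949),
    "terrible": (217, 1122),
    "trash": (160, 414),
    "horrible": (153, 853),
}

def log_ratio(pos, neg):
    # Log of the positive-to-negative ratio; symmetric around 0
    return math.log(pos / neg)

for word, (pos, neg) in counts.items():
    print(f"{word:12s} pos/neg = {pos / neg:5.2f}   log ratio = {log_ratio(pos, neg):+.2f}")
```

The log scale puts "excellent" (ratio about 4.2) and "terrible" (ratio about 0.19) at similar distances on opposite sides of zero, which makes the strength of each word's signal directly comparable.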
